Thursday, October 4, 2018

Why I support Plan S


1. The world spends $10 billion annually on scientific publishing.

2. Most scientific papers are only accessible with subscription, which means they are only accessible to academia and large companies in countries with a large per capita GDP. The papers are not accessible to the tax payers who paid for the research and the university subscriptions, nor to small and medium-sized companies.

3. The price of scientific publishing has been increasing at an unsustainable rate

Image result for rising cost of scientific publishing

4. The increased price is not due to increased costs on the side of the publishers, but rather to their aggressive negotiation tactics, leading to profit margins unheard of in any other business.



5. Pushback during price-negotiations with publishers in the EU has resulted in the publishers denying access to their journals in several countries, as a negotiation tactic. This includes access to past issues. We can't do anything about it because they own the papers. They own the papers because scientists signed away their copyright.

I believe this is an untenable situation. Who's going to do anything about it?

6. Publishers are obviously not going to do anything about it on their own accord. Any company would fight to preserve these profit margins, and they do.

7. Scientific societies are not going to do anything about it. Scientific societies derive the bulk of their income from subscriptions and are every bit as ruthless in their negotiations on subscription price as commercial publishers. For all intends-and-purposes they are publishers first, societies second.

8. By-and-large (one notable exception is PLoS) scientists haven’t done anything about it either. In my experience, scientists are first on foremost focussed on career advancement and competition for research funds and don't think about the (rising) cost of publishing.

9. Now it seems some EU funders are finally trying to do something about it with Plan S. Plan S is designed to bring about change in the current system.

This is why I fully support Plan S

If Twitter is any indication most scientists are not happy with Plan S. From what I can tell, the worries center mostly on not being able to publish in "good" journals and can be classified into two main categories:

None of the "good" journals will change
Unless something changes within the next 2-3 years researchers would not be able to publish work funded by some EU-based funding agencies in most, but not all, "good" journals.

So your colleagues who are not funded by some EU-based funding agencies publish in "good" journals and get all the recognition. As a result you may not get promoted or get new grants.

It's an unlikely scenario but if this does happen remember that your colleagues, who decide on your promotion or that you are competing against for funds, are in exactly the same boat.

Another worry is that people won't want to collaborate with you because you can't publish together in "good" journals. In my experience, this is not how scientific collaborations work, but if you do happen to meet such a potential collaborator my advice would be to avoid them at all cost.

All of the "good" journals will change
If all the "good" journals change to Gold OA in response to Plan S, then people without funding who can't pay the APC won't be able to publish in "good" journals.

This is also an unlikely scenario, but if it does happen society would spend considerably less on subscription fees that could be used to pay APCs. Notice that Plan S calls for a APC-cap, meaning that Plan S-friendly journal should be affordable. Remember that the current APCs are designed to maintain a very large profit margin for the publishers, so there is plenty of "fat" to trim. Finally, APCs are tied to volume. If the number of submissions increase the cost of each individual paper decreases.

The most likely scenario in my opinion
Plan S will be (sadly) softened a little. Some "good" journals will change to comply with the final version of Plan S, some won't, and some new journals will spring up. Some researchers won't be able to publish some of their work in some of their favorit "good" journals and will have to find a new favorit for some of their papers. Your colleagues are in the same boat, so it won't have much effect on either career advancements nor funding success rate. The world of scientific publishing may become a little bit less ridiculous but not thanks to us scientists.

I view Plan S as a signal to the publishers for the next round of negotiations. We pay the bills and this is how we would like it. It's not an unreasonable  position at all. Something does not need to change. Publisher will fight this tooth and nail and their main argument will be that scientists say it will be bad for science. Sadly, many scientists are saying just that. However, the main worry of the publishers is their profit margin and the main worry of the scientists is, I believe, their careers. The only thing they have in common is a fear of change.  


This work is licensed under a Creative Commons Attribution 4.0




Tuesday, September 11, 2018

Reviews of Solvation Energy Predictions Using The SMD Solvation Method and Semiempirical Electronic Structure Methods

Really late posting this. The paper is already out at JCP. Here are the reviews for the record.

Reviewer #1 Evaluations:
Recommendation: Revision
New Potential Energy Surface: No

Reviewer #1 (Comments to the Author):

In this contribution, the authors report a set of systematic analyses of semi-empirical (NDDO and DFTB) methods combined with continuum solvation models (COSMO and SMD) for the description of solvation free energies of well-documented benchmark cases (the MNSOL dataset). They found that the performance of NDDO and DFTB continuum solvation models can be substantially improved when the atomic radii are optimized, and that the results are most sensitive to the radii of HCNO. Another interesting observation is that the optimized radii have a considerable degree of transferability to other solvents.

Since an efficient computation of solvation free energies and related quantities (e.g., pKa values) is valuable in many chemical and biological applications, the results of this study are of considerable interest to the computational chemistry community.

In the current form, the ms can benefit from further discussion of several points:

1. The authors chose to optimize the atomic radii based entirely on element type (e.g., HCNOS). In the literature, many solvation models either further consider atom types (e.g., UAKS) or atomic charge distribution (based on either atomic point charges or charge density); in many cases, a higher degree of accuracy appears to be obtained. It would be useful to further clarify the principle behind the current optimization and the expected level of accuracy; for example, to what degree should we expect the same set of radii to work well for both neutral and ionic (especially anionic) species?
2. Although it is well known - it is useful to explicitly point out that the experimental values for neutral and charged species have different magnitudes of errors.
3. It would be informative to further dissect/discuss the physical origins for the errors of NDDO/DFTB continuum solvation models. For example, are the larger errors (as compared to, for example, HF based calculations) due primarily to the less reliable description of the solute charge density (e.g., multipole moments) or solute polarizability? Discussion along this line might be relevant to the transferability of the optimized model to non-aqueous solvents.

4. Cases with very large errors deserve further analysis/discussion - for example, some neutral solutes apparently have very large errors at HF, NDDO and DFTB levels - as much as 20 or even 30 kcal/mol! What are these molecules? Are the same set of molecules problematic for all methods? What is the physical origin for these large errors?


Reviewer #2 Evaluations:
Recommendation: Revision
New Potential Energy Surface: No

Reviewer #2 (Comments to the Author):

In this paper, the authors make the case for efficient solvation models in
fast electronic structure methods (currently heavily utilized for high-throughput
screening approaches). They extend an implementation of PM6 in the Gamess
programm to account for d orbitals. The SMD and COSMO continuum models in combination with
various semi-empirical NDDO and also DFT tight-binding approaches is considered.
Their analysis clearly highlights deficiencies of the semi-empiricial approaches
compared to HF/DFT. The authors then proceed to propose a remedy (changing the
radii for H, C, O, N, and S). Although this change was driven by data on aqueous
solvation energies, the authors find that other polar solvents (DMSO, CH3CN, CH3OH)
are also improved, which is a sign of transferability of this simple fix.
The prediction of pKa data, as an important application field, concludes the
results section. The paper is clearly written, however, it raises questions that
should be addressed in a revision:
1) Table 2 shows very (too) small Coulomb radii for H and on page 6 this is commented on.
The authors note that for radii smaller than 0.6 A the proton moved into the solvent.
However, no further analysis if provided. I assume that this is due to an increased outlying
charge and this outlying charge shoud be quantified. Apparently, some error compensation
is in operation. This also relates to the statement 'error for the ions is considerably larger
than for neutral molecules' on page 5. Error compensation also raises a concern about
transferability that the authors must address.
2) The authors should also review their list of references (I assume that the first author's
surname in Ref. 1 is misspelled, the abbreviations of journals are not in JCP style, Ref. 34
appears to lack a journal title, Ref. 16 lacks author names ...).
3) Moreover, figures 1-3 lack a label for the y-axis, figure 4 lacks units
on the y-axis.
4) Few typos need also be removed (see, e.g., "mainly only" -> "mainly on"
on page 2).


This work is licensed under a Creative Commons Attribution 4.0

Sunday, September 2, 2018

Assigning bond orders



This last week I've been working off and on changing the way xyz2mol.py assigns bond orders. xyx2mol is an implementation of this paper and works by first assigning connectivity and then increasing the bond order between atoms with unsatisfied valences. As the article points out the outcome depends on the order you do it in.

For example, if you start by adding a double bond between atoms 6 and 7 in biphenyl (1), you don't get the right result. The authors propose to start by adding double bonds between atoms with the lowest connectivity, which works fine for biphenyl.

But Mads found an example (2) where that doesn't work. According to the rules you would add double bonds between 1-7, 3-4, and 8-9, which is not correct.

I solved this by finding the longest path of atoms unsatisfied valences (7-1-2-3-4-5-8-9) and then adding double bonds between alternate pairs. If there is an uneven number of such atoms you get different results depending on where you start, but that could probably be fixed by some ad-hoc rules. However, this approach fails for benzoquinone (3), where the longest path is 7-6-1-2-3-4-5. (Finding the longest path can also become intractable for many atoms)

The approach I have settled on so far, finds the maximum number of non-consecutive double bonds. So if you have 6 or 7 atoms with unsatisfied valence you can have a maximum of 3 non-consecutive double bonds. I find this combination by generating all combinations of three bonds and determining the one with the most different atoms (e.g. [(1,2), (3,4)] is better than  [(1,2), (2,3)]. This doesn't scale all that well when you're trying to place 10 or more double bonds, so if you have a better idea I'm all ears.

3.09.2018 update. Thanks to Geoff's suggestion the code now runs considerably faster


This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Thursday, August 2, 2018

Comparison of genetic algorithm and Monte Carlo tree search for logP optimisation




What have I done?
1. Inspired by this and this study I have implemented graph-based crossover and mutation algorithms and hooked then up to a basic genetic algorithm (GA). "Graph-based" means I'm modifying the molecules by changing/adding atoms and bonds (using RDKit) rather than changing SMILES strings.
2. I hooked up my probabilistic atom-adder to a rudimentary Monte Carlo tree search (MCTS) algorithm I found on GitHub. I modified it to have leaf parallelisation, i.e. 25 random leafs a made instead of one.
3. The score to be optimised is the penalised logP (plogP) value using here and here.

Size matters
All other things being equal, the plogP score increases linearly with molecular size so this must be considered when comparing to other studies. I want to compare to this study where the maximum SMILES length is capped at about 80 according to their code. The average size of the molecules in Figure 2 of that paper is 39.15 ± 3.50 atoms. So for GA I rejected a molecule if the number of atoms was larger than a number sampled from a normal distribution with mean 39.15 and a standard deviation of 3.50. Similarly for MCTS I stopped the rollout using the same criterion.

The results
The following results are the maximum plogP score averaged over 10 runs. In the MCTS case I ran each run for 40 iterations, while for the GA I my population size was 20 and I ran for 50 generations. In each case this works out to be 1000 plogP evaluations per run. The initial mating pool is 20 random molecules sampled from the first 1000 molecules in the ZINC data set used in this study by Yang et al.. Similarly, probabilities used in the atom-adder comes from an analysis of these 1000 molecules. The mean plogP score for this set is 0.2 and the maximum value is 3.6.

The average maximum plogP for GA is 6.8 ± 0.7 and 7.4 ± 0.9 using a 50% and 1% mutation rate. For comparison the best average maximum value in the Yang et al study is 5.6 ± 0.5, obtained using MCTS and close to 20,000 plogP evaluations. The latter took 8 hours, while the GA calculations took 30 seconds each on my laptop.

For MCTS the average maximum plogP value is 3.3 ± 0.7, which is significantly lower than the lowest value in the Yang et al. study, 4.9 ± 0.4. The most likely explanation is the atom adder makes relatively few benzene rings (as discussed previously), which, together with Cl atoms, is the defining structural feature of  the high scoring molecules found by Yang et al. Indeed, if you increase the probability of generating C=C-C containing rings (from 63% to 80%) then the average maximum plogP value increases to  4.4 ± 0.4. If you increase the plogP evaluations to 5000 (the value used by Yang et al. to obtain the 4.9 ± 0.4 value) then the value increases to 5.0 ± 0.4.  The latter run takes about 9 min, compared to 2 hrs in the Yang et al. study.

Conclusion
GA is very good at maximising plogP values and very efficient. For this particular property GA completely outperforms MCTS. 

You can find the code I used here


This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Thursday, July 19, 2018

Making random molecules one atom at a time


I'm trying to make my own version of ChemTS, which uses a recurrent neural network to generate molecules (SMILES string) and Monte Carlo tree search to optimise their properties. This post is about some very recent preliminary (i.e. possibly wrong) results on the molecule generation part. I decided to write my own code rather than use the RNN because I thought I could whip up some code using RDKit in no time. But, as usual, the problem turned out to be harder than expected so I ended spending quite a bit of time to get to this point.

The main idea
The main idea is to use RDKits reaction SMARTS to add one atom at a time and selecting the reactions SMARTS based on a probabilistic analysis of a representative set of molecules (here I use the first 1000 molecules in the ZINC data set that Tsuda and coworkers used for ChemTS). I use three kinds of reaction SMARTS (shown here for single bonds):

So, for example, 63% of the atoms in the data set are ring atoms so I choose ring creation (if there are no suitable rings to work with) or ring insertion 63% of the time. The most common 3-atom combination in rings is C=C-C, which accounts for 45% of all 3-atom combinations in rings, so I choose that 45% of the time, and similarly for the 34 other combinations I found in the data set. The site of insertion/creation (in this case a C-C bond) is chosen randomly. I stop adding to a particular ring if it is aromatic or is 6-membered.

Similarly, for addition the most common bond involving at least one non-ring atom is C-C, so I choose that more often. I played around with a more specific bonding analysis, e.g. there probabilities for C=C-C vs C-C-C, but if one gets too specific then the most probable bonding patters are often not found in the early stages of molecule growth and the growth process effectively stops.

The average molecular size in the data set is 23.2±4.4 atoms, so I stop growing the molecule when the size exceeds a random value sampled from a normal distribution with mean 23.2 and standard deviation 4.4.

An analysis of 1000 generated molecules
To get an idea of how well this works I generated 1000 molecules using ethane as the "seed" molecule and performed the same statistical analysis on the new set. It takes about a 90 seconds to generate the set, which is about 600 molecules a minute. For comparison the RNN can generate about 40 molecules a minute according to the ChemTS paper, because most of the suggested SMILES are either invalid or lead to "invalid" molecules as judged by RDKit.

The average molecular size in the new data set is 24.2±4.6 atoms, which is pretty close to the target. The discrepancy is probably due to the fact that if I expand some 3- or 4- membered rings to 5 membered rings after the growth stops. There are 2612 rings compared to 2753 in the reference set and 61% of the atoms are in rings, which also is close to 63% in the reference set. 41% of the 3-atom combinations in rings is C=C-C, which is reasonably close to the 45% in the reference set. I think the difference is due to the fact that at least one of the two carbons must be secondary to "accept" the double bond, and if such a bonding pattern is not always be present then no "=C-" atom will be added. It is thus a little bit more likely that a "-C-" atom will be "accepted", compared to a "=C-" atom.

A feature, not a bug?
Not surprisingly, there are bigger differences for the larger scale features not specifically encoded in the rules such as the type of ring.

Ring reference generated gen 80%
[*]1-[*]-[*]-1 57 92 49
[*]1-[*]=[*]-1 0 12 31
[*]1-[*]-[*]-[*]-1 17 37 8
[*]1=[*]-[*]-[*]-1 0 8 15
[*]1=[*]-[*]=[*]-1 0 0 0
[*]1-[*]-[*]-[*]-[*]-1 280 11 0
[*]1=[*]-[*]-[*]-[*]-1 120 345 394
[*]1=[*]-[*]=[*]-[*]-1 470 198 317
[*]1-[*]-[*]-[*]-[*]-[*]-1 409 45 7
[*]1=[*]-[*]-[*]-[*]-[*]-1 77 337 87
[*]1=[*]-[*]=[*]-[*]-[*]-1 100 627 447
[*]1=[*]-[*]-[*]=[*]-[*]-1 7 393 255
[*]1=[*]-[*]=[*]-[*]=[*]-1 1206 507 877
Total 2743 2612 2487


For example, there are many fewer benzene-type rings as well as cyclopentane and cyclohexane type rings, while there are of most of the other types. The reason, of course, it that the molecules in the ZINC data set are not made by randomly adding atoms, but by assembling larger functional groups such as aliphatic and aromatic rings. So the average ring composition does not reflect the most likely ring compositions.

It is possible to scale the probabilities to skew the results towards one or the other ring-type. For example in the last column I have scaled the probabilities such that the probability X=Z-Y is 80% rather than the 62% in the reference set.

But since I intend to use the method to explore chemical space and use a optimisation algorithm to guide the exploration it might be an advantage that the algorithm suggests a broad range of ring architectures.

Lots left to be done
The codes I wrote can be found here and here. It's messy proto-type code and not finished. For one thing it doesn't generate fused rings yet, which is quite a limitation. I'm sure I've also missed other pretty obvious things and any feedback is appreciated.



This work is licensed under a Creative Commons Attribution 3.0 Unported License.


Tuesday, July 10, 2018

Comparing a Monte Carlo tree search and a genetic algorithm for conformational search


I've been playing around with this Monte Carlo tree search (MCTS) code (if you need a short intro to MCTS click here). I want to learn how to use MCTS to optimise molecular properties so to get started I modified the code to minimise a very simple energy function that mimics a torsional potential.
$$E = \sum_i 1 + \cos(\phi_i) + \cos(3\phi_i)$$ It seems to work OK but in order to get some point of comparison for the efficiency I tested it against a genetic algorithm (GA). You can find the codes here and here.

More specifically, I tested for a function with 10-dihedral angles, each with three possible values (180.0, 60.0, and -60.0), meaning there are 59,049 possible combinations (i.e. "conformations"). Each 180 and ±60 angle contributes -1 and 0.5 to the energy so the global minimum is one in which all angles are 180 and has an energy of -10.

If you simply make 10 x 1000 random combinations of 10-dihedral angles and average the lowest energy found among each set of 1000 you get 0.35. This makes sense because the highest energy conformer has an energy of +5, but there more ways to make it, so the average should be around 0.

Using 1000 iterations (i.e. 1000 energy evaluation) the MCTS finds the global minimum 0 out of 10 times and gives an average energy of -7.6. In 4 cases the energy is -8.5 (i.e. it misses one dihedral) and -7 in the rest (i.e. it misses 2 dihedrals).

For comparison, the GA finds the global minimum 0 out of 10 times and gives an average energy of -6.85. In 2 cases the energy is -8.5, in 5 cases the energy is -7, and in the rest, -5.5. I use a population size of 50 and 20 generations, which also requires 1000 energy evaluations, and a mutation rate of 0.01. Using a population size of 20 and 50 generations results in an average energy of -6.7.

I'm actually a bit surprised that MCTS out-performs GA because I don't recall seeing MCTS being used for conformational search. One possible reason is that it's hard to see how to parallelise MCTS since energy evaluations are done sequentially, while they can be done in parallel for each generation for GA.

I did try running 10 100-iteration MCTSs, which could be done in parallel, and 1 of them managed to find a -8.5 conformer so maybe that's a possibility. Another way would to do a complete expansion of a terminal node and do the three energy evaluations in parallel. This might also be a better search strategy in any case. I am not quite sure how to implement this yet (recursion is not my strong suit) so any suggestions are welcome.

Also the reward function could probably be optimised. Currently it is simply 1 if the energy is lower than the current "global" minimum and 0 otherwise. Suggestions for other functions are most welcome.



This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Saturday, July 7, 2018

Planned papers for 2018 - six months in

In January I wrote about the papers I plan to publish in 2018 and made this list:

Accepted
1. Fast and accurate prediction of the regioselectivity of electrophilic aromatic substitution reactions

Probable
2. Random Versus Systematic Errors in Reaction Enthalpies Computed using Semi-empirical and Minimal Basis Set Methods
3. Improving Solvation Energy Predictions using the SMD Solvation Method and Semi-empirical Electronic Structure Methods

Maybe
4. Towards a barrier height benchmark set for biologically relevant systems - part 2
5. pKaSQM: Automated Prediction of pKa Values for Druglike Molecules Using Semiempirical Quantum Chemical Methods
6. Prediction of CH pKa values

The status is


Submitted

Probably submitted in 2018
4. Towards a barrier height benchmark set for biologically relevant systems - part 2

Paper 4: We've already gathered the data for 11 new systems to add to the data set and Jimmy's working on applying what we learned in paper 2 to most of these systems. We have to submit the paper in October as part of a special issue. I hope we'll make it.

I don't see papers 5 and 6 happen in the foreseeable future as we're getting further into high-throughput screening of various molecular properties. 



This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Friday, July 6, 2018

Why I just sent my paper to Journal of Chemical Physics

Image result for journal of chemical physics

The last few years I have published my papers (i.e. the ones where I'm corresponding author) open access as much as possible, but I just sent my latest paper to JCP. Here's why.

Why not OA?
The paper isn't "impactful" enough for free OA options such as Chemical Science and ACS Central Science.

When it comes to impact-neutral journals my main consideration is price. My go-to journal in the past has been PeerJ where I purchased a life-time membership years ago. My two co-authors also have memberships so we could have published the article for free. Unfortunately, the current paper is not "bio enough" for PeerJ.

It's also not "organic" enough for Beilstein Journal of Organic Chemistry (which is free) and it doesn't fall in to any of the categories for Living Journal of Computational Molecular Science ($100).

The next-most cheapest options are ACS Omega ($750) or RSC Advances (£500 ≈ $670). When I published in ACS Omega earlier this year I discovered that you still sign away copyright to the ACS who then graciously agrees to publish it under an CC license if you pay them $750! In my book that's not OA and I don't think I'll publish there again.

I was all set to send it to RSC Advances but a few months ago I discovered that the RSC is engaging in the sort of behaviour I would normally expect from Elsevier. I don't see why I should support the RSC with an additional £500.

That's the only journals I know of with APCs less than $1000 and I simply don't have that kind of money to spend on publications right now.

Why JCP?
In March 2016, AIP Publishing switched to a new publishing agreement where authors retain copyright but give the AIP an exclusive licence to publish. As a copyright owner you are allowed to, for example

1. Reprint portions of the Work (excerpts, figures, tables) in future works created by the
Author, in keeping with professional publication ethics.
2. Post the Accepted Manuscript (AM) to their personal web page or their employer’s
web page immediately after acceptance by AIP Publishing.
3. Deposit the AM in an institutional or funder-designated repository immediately after
acceptance by AIP Publishing.
5. Reprint the Version of Record (VOR) in print collections written by the Author, or in the
Author’s thesis or dissertation. It is understood and agreed that the thesis or
dissertation may be made available electronically on the university’s site or in its
repository and that copies may be offered for sale on demand.
10. Post the VOR to their personal web page or their employer’s web page 12 months
after publication by AIP Publishing.
11. Deposit the VOR in an institutional or funder-designated repository 12 months after
publication by AIP Publishing.
12. Update a prior posting with the VOR on a noncommercial server such as arXiv, 12
months after publication by AIP Publishing.

Just to be clear, most non-OA journals allow you to post the accepted manuscript in a repository 12 months after publication, while JCP allows you to do this immediately as well as posting the version of record (i.e. the typeset manuscript!) 12-months later.

So it's not OA, because reuse is restricted, but the version of record will be openly accessible after 12 months. And at least I feel like I am dealing with a publisher that is not looking for any opportunity to screw me, or the scientific community in general, over. But maybe I'm just getting sentimental in my old age ...



This work is licensed under a Creative Commons Attribution 4.0

Monday, July 2, 2018

Gaussian Processes for dumb dummies

It's quite disheartening when you don't understand something titled "xxx for dummies" so that's what I was after reading Katherine Bailey's blogpost "Gaussian Processes for Dummies". Luckily the post included Python code and a link to some lecture notes for which I also found the corresponding recorded lecture on YouTube.

Having looked at all that, the blogpost made more sense and I now feel I understand Gaussian Processes (GP) a little bit better, at least how it applies to regression, and here is my take on it.

What is GP good for?
GP is an algorithm that takes x and y coordinates as input and returns a numerical fit and the associated standard deviation at each point.  What makes GP special is that you don't have to choose a mathematical function for the fit and you get the uncertainty of the fit at each point

I refactored the code from Katherine Baily's blogpost here and show a plot of a GP fit ("mu", red line) to five points of a $sin$ function (blue squares, "Xtrain" and "ytrain"). "Xtest" is a vector of x-coordinates for which I want to evaluate the fit, "mu". "stdev" is a vector of standard deviations of the fit at each point in Xtest and the gray shading in the plot represents 2 standard deviations of uncertainty.  We'll get to "L" later and there is a hyperparameter ("param" in the code) that we also need to talk about.

mu, stdv, L = Gaussian_process(Xtest, Xtrain, ytrain)


What does the GP algorithm do?
1. Construct the kernel matrix $\mathbf{K_{**}}$, where $K_{**,ij} = e^{-(x_i-x_j)^2/2\lambda}$, for the test set. The kernel is a measure of how similar $x_i$ is to $x_j$ and $\lambda$ is a parameter, called "param" in the code (note to self: "lambda" is a bad choice for a variable name in Python).

2. Construct the kernel matrix $\mathbf{K}$ for the test set and Cholesky-decompose it to get $\mathbf{L}$., i.e. $\mathbf{K = LL^T}$

3. Construct the kernel matrix $\mathbf{K_*}$ connecting the test set to the training set and compute $\mathbf{L_k = L^{-1}K_*}$ (use the solve function for better numerical stability).

4. Compute  $\boldsymbol{\mu} = \mathbf{ L^T_kL^{-1}y}_{train}$

5. Compute the standard deviation $s_i = \sqrt{k_{**,ii} - \sum_j L^2_{k,ij}}$ where $\mathbf{k_{**}}$ is obtained by diagonalising $\mathbf{K_{**}}$

6. Compute $\mathbf{L}$ by Cholesky-decomposing $\mathbf{K_{** }- L_k^T L_k}$

What is the basic idea behind the GP algorithm?
How would you write an algorithm that would generate an random function $y=f(x)$. I would argue that the simplest way is simply to generate a random number $y \sim \mathcal{N}(0,1)$ (i.e. y = np.random.normal()) for each value of $x$. ($ \mathcal{N}(0,1)$ is standard notation for a Gaussian distribution with 0 mean and a standard deviation of 1.)

Here's a plot of three such functions.


If I generated 1000 of them the average y-value at each x-coordinate $\langle y_i \rangle$ would be 0. I can change this by $y \sim \mathcal{N}(\mu,\sigma^2) = \mu + \sigma \mathcal{N}(0,1)$.

You'll notice that these functions are much more jagged than the functions you usually work with. Another way of saying this is that the values of $y$ tend to be similar if the corresponding values of $x$ are similar, i.e. the $y$ values are correlated by distance ($|x_i - x_j|$).

This correlation is quantified by the kernel matrix $\mathbf{K}$ and can be used to generate smoother functions by $y_i \sim \sum_j L_{ij} \mathcal{N}(0,1)$. This works great as you can see from this plot


You can think of $\mathbf{K}$ and $\mathbf{L}$ a bit like the variance $\sigma^2$ and the standard deviation $\sigma$.  $\langle y_i \rangle$ = 0 as before but this can be changed in analogy with the uncorrelated case $y_i \sim \mu + \sum_j L_{ij} \mathcal{N}(0,1)$

The GP is a way of generalising this equation as $y_i \sim \mu_i + \sum_j L_{ij} \mathcal{N}(0,1)$ and using the training data to obtain values for $\mu_i$ and $L_{ij}$ such that $\mu_i$ matches the y-values in the training data with correspondingly small $L_{ij}$ values, i.e. greater certainty. Now if you generate a 1000 such random functions and average you will get  $\boldsymbol{\mu}$.



This work is licensed under a Creative Commons Attribution 4.0