Thursday, July 19, 2018

Making random molecules one atom at a time

I'm trying to make my own version of ChemTS, which uses a recurrent neural network to generate molecules (SMILES string) and Monte Carlo tree search to optimise their properties. This post is about some very recent preliminary (i.e. possibly wrong) results on the molecule generation part. I decided to write my own code rather than use the RNN because I thought I could whip up some code using RDKit in no time. But, as usual, the problem turned out to be harder than expected so I ended spending quite a bit of time to get to this point.

The main idea
The main idea is to use RDKits reaction SMARTS to add one atom at a time and selecting the reactions SMARTS based on a probabilistic analysis of a representative set of molecules (here I use the first 1000 molecules in the ZINC data set that Tsuda and coworkers used for ChemTS). I use three kinds of reaction SMARTS (shown here for single bonds):

So, for example, 63% of the atoms in the data set are ring atoms so I choose ring creation (if there are no suitable rings to work with) or ring insertion 63% of the time. The most common 3-atom combination in rings is C=C-C, which accounts for 45% of all 3-atom combinations in rings, so I choose that 45% of the time, and similarly for the 34 other combinations I found in the data set. The site of insertion/creation (in this case a C-C bond) is chosen randomly. I stop adding to a particular ring if it is aromatic or is 6-membered.

Similarly, for addition the most common bond involving at least one non-ring atom is C-C, so I choose that more often. I played around with a more specific bonding analysis, e.g. there probabilities for C=C-C vs C-C-C, but if one gets too specific then the most probable bonding patters are often not found in the early stages of molecule growth and the growth process effectively stops.

The average molecular size in the data set is 23.2±4.4 atoms, so I stop growing the molecule when the size exceeds a random value sampled from a normal distribution with mean 23.2 and standard deviation 4.4.

An analysis of 1000 generated molecules
To get an idea of how well this works I generated 1000 molecules using ethane as the "seed" molecule and performed the same statistical analysis on the new set. It takes about a 90 seconds to generate the set, which is about 600 molecules a minute. For comparison the RNN can generate about 40 molecules a minute according to the ChemTS paper, because most of the suggested SMILES are either invalid or lead to "invalid" molecules as judged by RDKit.

The average molecular size in the new data set is 24.2±4.6 atoms, which is pretty close to the target. The discrepancy is probably due to the fact that if I expand some 3- or 4- membered rings to 5 membered rings after the growth stops. There are 2612 rings compared to 2753 in the reference set and 61% of the atoms are in rings, which also is close to 63% in the reference set. 41% of the 3-atom combinations in rings is C=C-C, which is reasonably close to the 45% in the reference set. I think the difference is due to the fact that at least one of the two carbons must be secondary to "accept" the double bond, and if such a bonding pattern is not always be present then no "=C-" atom will be added. It is thus a little bit more likely that a "-C-" atom will be "accepted", compared to a "=C-" atom.

A feature, not a bug?
Not surprisingly, there are bigger differences for the larger scale features not specifically encoded in the rules such as the type of ring.

Ring reference generated gen 80%
[*]1-[*]-[*]-1 57 92 49
[*]1-[*]=[*]-1 0 12 31
[*]1-[*]-[*]-[*]-1 17 37 8
[*]1=[*]-[*]-[*]-1 0 8 15
[*]1=[*]-[*]=[*]-1 0 0 0
[*]1-[*]-[*]-[*]-[*]-1 280 11 0
[*]1=[*]-[*]-[*]-[*]-1 120 345 394
[*]1=[*]-[*]=[*]-[*]-1 470 198 317
[*]1-[*]-[*]-[*]-[*]-[*]-1 409 45 7
[*]1=[*]-[*]-[*]-[*]-[*]-1 77 337 87
[*]1=[*]-[*]=[*]-[*]-[*]-1 100 627 447
[*]1=[*]-[*]-[*]=[*]-[*]-1 7 393 255
[*]1=[*]-[*]=[*]-[*]=[*]-1 1206 507 877
Total 2743 2612 2487

For example, there are many fewer benzene-type rings as well as cyclopentane and cyclohexane type rings, while there are of most of the other types. The reason, of course, it that the molecules in the ZINC data set are not made by randomly adding atoms, but by assembling larger functional groups such as aliphatic and aromatic rings. So the average ring composition does not reflect the most likely ring compositions.

It is possible to scale the probabilities to skew the results towards one or the other ring-type. For example in the last column I have scaled the probabilities such that the probability X=Z-Y is 80% rather than the 62% in the reference set.

But since I intend to use the method to explore chemical space and use a optimisation algorithm to guide the exploration it might be an advantage that the algorithm suggests a broad range of ring architectures.

Lots left to be done
The codes I wrote can be found here and here. It's messy proto-type code and not finished. For one thing it doesn't generate fused rings yet, which is quite a limitation. I'm sure I've also missed other pretty obvious things and any feedback is appreciated.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Tuesday, July 10, 2018

Comparing a Monte Carlo tree search and a genetic algorithm for conformational search

I've been playing around with this Monte Carlo tree search (MCTS) code (if you need a short intro to MCTS click here). I want to learn how to use MCTS to optimise molecular properties so to get started I modified the code to minimise a very simple energy function that mimics a torsional potential.
$$E = \sum_i 1 + \cos(\phi_i) + \cos(3\phi_i)$$ It seems to work OK but in order to get some point of comparison for the efficiency I tested it against a genetic algorithm (GA). You can find the codes here and here.

More specifically, I tested for a function with 10-dihedral angles, each with three possible values (180.0, 60.0, and -60.0), meaning there are 59,049 possible combinations (i.e. "conformations"). Each 180 and ±60 angle contributes -1 and 0.5 to the energy so the global minimum is one in which all angles are 180 and has an energy of -10.

If you simply make 10 x 1000 random combinations of 10-dihedral angles and average the lowest energy found among each set of 1000 you get 0.35. This makes sense because the highest energy conformer has an energy of +5, but there more ways to make it, so the average should be around 0.

Using 1000 iterations (i.e. 1000 energy evaluation) the MCTS finds the global minimum 0 out of 10 times and gives an average energy of -7.6. In 4 cases the energy is -8.5 (i.e. it misses one dihedral) and -7 in the rest (i.e. it misses 2 dihedrals).

For comparison, the GA finds the global minimum 0 out of 10 times and gives an average energy of -6.85. In 2 cases the energy is -8.5, in 5 cases the energy is -7, and in the rest, -5.5. I use a population size of 50 and 20 generations, which also requires 1000 energy evaluations, and a mutation rate of 0.01. Using a population size of 20 and 50 generations results in an average energy of -6.7.

I'm actually a bit surprised that MCTS out-performs GA because I don't recall seeing MCTS being used for conformational search. One possible reason is that it's hard to see how to parallelise MCTS since energy evaluations are done sequentially, while they can be done in parallel for each generation for GA.

I did try running 10 100-iteration MCTSs, which could be done in parallel, and 1 of them managed to find a -8.5 conformer so maybe that's a possibility. Another way would to do a complete expansion of a terminal node and do the three energy evaluations in parallel. This might also be a better search strategy in any case. I am not quite sure how to implement this yet (recursion is not my strong suit) so any suggestions are welcome.

Also the reward function could probably be optimised. Currently it is simply 1 if the energy is lower than the current "global" minimum and 0 otherwise. Suggestions for other functions are most welcome.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Saturday, July 7, 2018

Planned papers for 2018 - six months in

In January I wrote about the papers I plan to publish in 2018 and made this list:

1. Fast and accurate prediction of the regioselectivity of electrophilic aromatic substitution reactions

2. Random Versus Systematic Errors in Reaction Enthalpies Computed using Semi-empirical and Minimal Basis Set Methods
3. Improving Solvation Energy Predictions using the SMD Solvation Method and Semi-empirical Electronic Structure Methods

4. Towards a barrier height benchmark set for biologically relevant systems - part 2
5. pKaSQM: Automated Prediction of pKa Values for Druglike Molecules Using Semiempirical Quantum Chemical Methods
6. Prediction of CH pKa values

The status is


Probably submitted in 2018
4. Towards a barrier height benchmark set for biologically relevant systems - part 2

Paper 4: We've already gathered the data for 11 new systems to add to the data set and Jimmy's working on applying what we learned in paper 2 to most of these systems. We have to submit the paper in October as part of a special issue. I hope we'll make it.

I don't see papers 5 and 6 happen in the foreseeable future as we're getting further into high-throughput screening of various molecular properties. 

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Friday, July 6, 2018

Why I just sent my paper to Journal of Chemical Physics

Image result for journal of chemical physics

The last few years I have published my papers (i.e. the ones where I'm corresponding author) open access as much as possible, but I just sent my latest paper to JCP. Here's why.

Why not OA?
The paper isn't "impactful" enough for free OA options such as Chemical Science and ACS Central Science.

When it comes to impact-neutral journals my main consideration is price. My go-to journal in the past has been PeerJ where I purchased a life-time membership years ago. My two co-authors also have memberships so we could have published the article for free. Unfortunately, the current paper is not "bio enough" for PeerJ.

It's also not "organic" enough for Beilstein Journal of Organic Chemistry (which is free) and it doesn't fall in to any of the categories for Living Journal of Computational Molecular Science ($100).

The next-most cheapest options are ACS Omega ($750) or RSC Advances (£500 ≈ $670). When I published in ACS Omega earlier this year I discovered that you still sign away copyright to the ACS who then graciously agrees to publish it under an CC license if you pay them $750! In my book that's not OA and I don't think I'll publish there again.

I was all set to send it to RSC Advances but a few months ago I discovered that the RSC is engaging in the sort of behaviour I would normally expect from Elsevier. I don't see why I should support the RSC with an additional £500.

That's the only journals I know of with APCs less than $1000 and I simply don't have that kind of money to spend on publications right now.

Why JCP?
In March 2016, AIP Publishing switched to a new publishing agreement where authors retain copyright but give the AIP an exclusive licence to publish. As a copyright owner you are allowed to, for example

1. Reprint portions of the Work (excerpts, figures, tables) in future works created by the
Author, in keeping with professional publication ethics.
2. Post the Accepted Manuscript (AM) to their personal web page or their employer’s
web page immediately after acceptance by AIP Publishing.
3. Deposit the AM in an institutional or funder-designated repository immediately after
acceptance by AIP Publishing.
5. Reprint the Version of Record (VOR) in print collections written by the Author, or in the
Author’s thesis or dissertation. It is understood and agreed that the thesis or
dissertation may be made available electronically on the university’s site or in its
repository and that copies may be offered for sale on demand.
10. Post the VOR to their personal web page or their employer’s web page 12 months
after publication by AIP Publishing.
11. Deposit the VOR in an institutional or funder-designated repository 12 months after
publication by AIP Publishing.
12. Update a prior posting with the VOR on a noncommercial server such as arXiv, 12
months after publication by AIP Publishing.

Just to be clear, most non-OA journals allow you to post the accepted manuscript in a repository 12 months after publication, while JCP allows you to do this immediately as well as posting the version of record (i.e. the typeset manuscript!) 12-months later.

So it's not OA, because reuse is restricted, but the version of record will be openly accessible after 12 months. And at least I feel like I am dealing with a publisher that is not looking for any opportunity to screw me, or the scientific community in general, over. But maybe I'm just getting sentimental in my old age ...

This work is licensed under a Creative Commons Attribution 4.0

Monday, July 2, 2018

Gaussian Processes for dumb dummies

It's quite disheartening when you don't understand something titled "xxx for dummies" so that's what I was after reading Katherine Bailey's blogpost "Gaussian Processes for Dummies". Luckily the post included Python code and a link to some lecture notes for which I also found the corresponding recorded lecture on YouTube.

Having looked at all that, the blogpost made more sense and I now feel I understand Gaussian Processes (GP) a little bit better, at least how it applies to regression, and here is my take on it.

What is GP good for?
GP is an algorithm that takes x and y coordinates as input and returns a numerical fit and the associated standard deviation at each point.  What makes GP special is that you don't have to choose a mathematical function for the fit and you get the uncertainty of the fit at each point

I refactored the code from Katherine Baily's blogpost here and show a plot of a GP fit ("mu", red line) to five points of a $sin$ function (blue squares, "Xtrain" and "ytrain"). "Xtest" is a vector of x-coordinates for which I want to evaluate the fit, "mu". "stdev" is a vector of standard deviations of the fit at each point in Xtest and the gray shading in the plot represents 2 standard deviations of uncertainty.  We'll get to "L" later and there is a hyperparameter ("param" in the code) that we also need to talk about.

mu, stdv, L = Gaussian_process(Xtest, Xtrain, ytrain)

What does the GP algorithm do?
1. Construct the kernel matrix $\mathbf{K_{**}}$, where $K_{**,ij} = e^{-(x_i-x_j)^2/2\lambda}$, for the test set. The kernel is a measure of how similar $x_i$ is to $x_j$ and $\lambda$ is a parameter, called "param" in the code (note to self: "lambda" is a bad choice for a variable name in Python).

2. Construct the kernel matrix $\mathbf{K}$ for the test set and Cholesky-decompose it to get $\mathbf{L}$., i.e. $\mathbf{K = LL^T}$

3. Construct the kernel matrix $\mathbf{K_*}$ connecting the test set to the training set and compute $\mathbf{L_k = L^{-1}K_*}$ (use the solve function for better numerical stability).

4. Compute  $\boldsymbol{\mu} = \mathbf{ L^T_kL^{-1}y}_{train}$

5. Compute the standard deviation $s_i = \sqrt{k_{**,ii} - \sum_j L^2_{k,ij}}$ where $\mathbf{k_{**}}$ is obtained by diagonalising $\mathbf{K_{**}}$

6. Compute $\mathbf{L}$ by Cholesky-decomposing $\mathbf{K_{** }- L_k^T L_k}$

What is the basic idea behind the GP algorithm?
How would you write an algorithm that would generate an random function $y=f(x)$. I would argue that the simplest way is simply to generate a random number $y \sim \mathcal{N}(0,1)$ (i.e. y = np.random.normal()) for each value of $x$. ($ \mathcal{N}(0,1)$ is standard notation for a Gaussian distribution with 0 mean and a standard deviation of 1.)

Here's a plot of three such functions.

If I generated 1000 of them the average y-value at each x-coordinate $\langle y_i \rangle$ would be 0. I can change this by $y \sim \mathcal{N}(\mu,\sigma^2) = \mu + \sigma \mathcal{N}(0,1)$.

You'll notice that these functions are much more jagged than the functions you usually work with. Another way of saying this is that the values of $y$ tend to be similar if the corresponding values of $x$ are similar, i.e. the $y$ values are correlated by distance ($|x_i - x_j|$).

This correlation is quantified by the kernel matrix $\mathbf{K}$ and can be used to generate smoother functions by $y_i \sim \sum_j L_{ij} \mathcal{N}(0,1)$. This works great as you can see from this plot

You can think of $\mathbf{K}$ and $\mathbf{L}$ a bit like the variance $\sigma^2$ and the standard deviation $\sigma$.  $\langle y_i \rangle$ = 0 as before but this can be changed in analogy with the uncorrelated case $y_i \sim \mu + \sum_j L_{ij} \mathcal{N}(0,1)$

The GP is a way of generalising this equation as $y_i \sim \mu_i + \sum_j L_{ij} \mathcal{N}(0,1)$ and using the training data to obtain values for $\mu_i$ and $L_{ij}$ such that $\mu_i$ matches the y-values in the training data with correspondingly small $L_{ij}$ values, i.e. greater certainty. Now if you generate a 1000 such random functions and average you will get  $\boldsymbol{\mu}$.

This work is licensed under a Creative Commons Attribution 4.0