I'm trying to make my own version of ChemTS, which uses a recurrent neural network to generate molecules (SMILES string) and Monte Carlo tree search to optimise their properties. This post is about some very recent preliminary (i.e. possibly wrong) results on the molecule generation part. I decided to write my own code rather than use the RNN because I thought I could whip up some code using RDKit in no time. But, as usual, the problem turned out to be harder than expected so I ended spending quite a bit of time to get to this point.
The main idea
The main idea is to use RDKits reaction SMARTS to add one atom at a time and selecting the reactions SMARTS based on a probabilistic analysis of a representative set of molecules (here I use the first 1000 molecules in the ZINC data set that Tsuda and coworkers used for ChemTS). I use three kinds of reaction SMARTS (shown here for single bonds):
So, for example, 63% of the atoms in the data set are ring atoms so I choose ring creation (if there are no suitable rings to work with) or ring insertion 63% of the time. The most common 3-atom combination in rings is C=C-C, which accounts for 45% of all 3-atom combinations in rings, so I choose that 45% of the time, and similarly for the 34 other combinations I found in the data set. The site of insertion/creation (in this case a C-C bond) is chosen randomly. I stop adding to a particular ring if it is aromatic or is 6-membered.
Similarly, for addition the most common bond involving at least one non-ring atom is C-C, so I choose that more often. I played around with a more specific bonding analysis, e.g. there probabilities for C=C-C vs C-C-C, but if one gets too specific then the most probable bonding patters are often not found in the early stages of molecule growth and the growth process effectively stops.
The average molecular size in the data set is 23.2±4.4 atoms, so I stop growing the molecule when the size exceeds a random value sampled from a normal distribution with mean 23.2 and standard deviation 4.4.
An analysis of 1000 generated molecules
To get an idea of how well this works I generated 1000 molecules using ethane as the "seed" molecule and performed the same statistical analysis on the new set. It takes about a 90 seconds to generate the set, which is about 600 molecules a minute. For comparison the RNN can generate about 40 molecules a minute according to the ChemTS paper, because most of the suggested SMILES are either invalid or lead to "invalid" molecules as judged by RDKit.
The average molecular size in the new data set is 24.2±4.6 atoms, which is pretty close to the target. The discrepancy is probably due to the fact that if I expand some 3- or 4- membered rings to 5 membered rings after the growth stops. There are 2612 rings compared to 2753 in the reference set and 61% of the atoms are in rings, which also is close to 63% in the reference set. 41% of the 3-atom combinations in rings is C=C-C, which is reasonably close to the 45% in the reference set. I think the difference is due to the fact that at least one of the two carbons must be secondary to "accept" the double bond, and if such a bonding pattern is not always be present then no "=C-" atom will be added. It is thus a little bit more likely that a "-C-" atom will be "accepted", compared to a "=C-" atom.
A feature, not a bug?
Not surprisingly, there are bigger differences for the larger scale features not specifically encoded in the rules such as the type of ring.
Ring | reference | generated | gen 80% |
[*]1-[*]-[*]-1 | 57 | 92 | 49 |
[*]1-[*]=[*]-1 | 0 | 12 | 31 |
[*]1-[*]-[*]-[*]-1 | 17 | 37 | 8 |
[*]1=[*]-[*]-[*]-1 | 0 | 8 | 15 |
[*]1=[*]-[*]=[*]-1 | 0 | 0 | 0 |
[*]1-[*]-[*]-[*]-[*]-1 | 280 | 11 | 0 |
[*]1=[*]-[*]-[*]-[*]-1 | 120 | 345 | 394 |
[*]1=[*]-[*]=[*]-[*]-1 | 470 | 198 | 317 |
[*]1-[*]-[*]-[*]-[*]-[*]-1 | 409 | 45 | 7 |
[*]1=[*]-[*]-[*]-[*]-[*]-1 | 77 | 337 | 87 |
[*]1=[*]-[*]=[*]-[*]-[*]-1 | 100 | 627 | 447 |
[*]1=[*]-[*]-[*]=[*]-[*]-1 | 7 | 393 | 255 |
[*]1=[*]-[*]=[*]-[*]=[*]-1 | 1206 | 507 | 877 |
Total | 2743 | 2612 | 2487 |
For example, there are many fewer benzene-type rings as well as cyclopentane and cyclohexane type rings, while there are of most of the other types. The reason, of course, it that the molecules in the ZINC data set are not made by randomly adding atoms, but by assembling larger functional groups such as aliphatic and aromatic rings. So the average ring composition does not reflect the most likely ring compositions.
It is possible to scale the probabilities to skew the results towards one or the other ring-type. For example in the last column I have scaled the probabilities such that the probability X=Z-Y is 80% rather than the 62% in the reference set.
But since I intend to use the method to explore chemical space and use a optimisation algorithm to guide the exploration it might be an advantage that the algorithm suggests a broad range of ring architectures.
Lots left to be done
The codes I wrote can be found here and here. It's messy proto-type code and not finished. For one thing it doesn't generate fused rings yet, which is quite a limitation. I'm sure I've also missed other pretty obvious things and any feedback is appreciated.
This work is licensed under a Creative Commons Attribution 3.0 Unported License.