Wednesday, June 5, 2019

Comparison of SMILES-, DeepSMILES- and graph-based genetic algorithms

Emilie is wrapping up her bachelor project and writing the report and here are some preliminary results (which are likely to change a bit).

I recently developed a graph-based genetic algorithm that seems to work pretty well. The crossover and mutation code is about 250 lines with a lot of hyperparameters that mainly specify the probabilities of performing different crossovers and mutations.

The question is whether all this was really necessary or could I have gotten away with about 25 lines of code that perform crossover and mutation operations on SMILES strings? For crossover you simply cut two strings at random places and recombine the fragments, e.g. OCC|C and CC|N (where "|" indicates the cut) yield OCCN and CCC and for mutation you simply change one character to another, e.g. CCC becomes C=C.

The potential problem with using SMILES is that one can imagine many scenarios where this wouldn't work, e.g. OC(|C)C and C1C|O1 would yield OC(O1 and C1CC)C, which are not valid SMILES string.  But can you still find molecules with the desired properties using this approach? If so, do the molecules look very different than the ones you find with the graph-based approach? Which approach is more efficient?

And what about the DeepSMILES representation developed by O'Boyle and Dalke? Here, OC(|C)C and C1C|O1 are written as OC|C)C and CC|O3, which would yield OCO3 and CCC)C - both valid DeepSMILES strings.

Finding molecules with specific penalised logP values
We start by looking at what Brown et al. call a "trivial optimisation objective": finding molecules with a particular modified logP values. We use the same Gaussian modifier approach with a standard deviation of 2 logP units and select the initial population from the first 1000 molecules in the ZINC data set (after removing molecules with logP values within 2 units of the target). The mating pool size and mutation rates are 20 and 10%, respectively. The table shows the average number of generations (based on 10 runs) needed to find a molecule with a logP values within 0.01 of the target.


It is clear that a SMILES-based GA has no problems meeting the objective, but that using DeepSMILES is more effective both in terms of number of required generations and CPU time. The latter, because the percentage of valid strings generated by crossover and mutation (the succes rate) is considerably higher for DeepSMILES as expected. For this target there appears to be no real advantage in using graph-based GA.

Rediscovering a molecule
The next target is considerably harder: generating molecules with a Tanimoto similarity of 1.0 with a target molecule (naphthalene, celecoxib, or tiotixene).  A Tanimoto similarity of 1.0 means that each atom has the same bonding pattern out to a certain radius (here 4 bonds), i.e. that the molecules are very, very similar. The population size is 100 and the mating pool size is 200. The initial populations is 100 molecules with the highest Tanimoto scores to the target (but with Tanimoto scores less than 0.323, following Brown et al.) found among the first 50,000 molecules in the ZINC data set.


Here it's clear that the graph-based approach offer an advantage over string-based methods, while DeepSMILES only offers an advantage over SMILES in terms of effciency. The Tanimoto score goes from 0 to 1, so the molecules found for the tiotexene search using graph-based GA look significantly more like tiotexene than those found with the string based methods. (When I run celecoxib search I succeed 5/10 times, and we are still trying to find the cause of this difference.)

Finding molecules that absorb at a particular wavelength
Inspired by the study of Tsuda and co-workers, we search for molecules that absorb at 400 and 800 nm. We use Grimme's semiempirical sTDA-xTB method to estimate the absorption wavelength and oscillator strength based on a low energy MMFF-optimised structure. We use a Gaussian(400/800,50) scoring function for the wavelength and a LinearThresholded(0.3) scoring function for the oscillator strength (0.3 is the oscillator strength computed for indigo dye). The population and mating pool sizes are 20 and 40, respectively and the mutation rate is 15%. We select the initial population from the first 1000 molecules in the ZINC data set (after removing molecules with absorption wavelengths within 300 nm of the target). The GA search is stopped if the top-scoring molecule has an absorption wave length within 5 nm of the target value.


It turns out that the find molecules that absorb relatively strongly at 400 and 800 nm is a relatively easy optimisation problem.

Conclusion
If there are very few ways to meet the target then graph-based GA performs better than string-based GA methods, but otherwise not. DeepSMILES-based GA is computationally more efficient than SMILES-based GA in many cases.  It would be interesting to test the newly introduced SELFIES representation.



This work is licensed under a Creative Commons Attribution 4.0

No comments: