Proteins and Wave Functions: August 2018

Thursday, August 2, 2018

Comparison of genetic algorithm and Monte Carlo tree search for logP optimisation

What have I done?
1. Inspired by this and this study I have implemented graph-based crossover and mutation algorithms and hooked then up to a basic genetic algorithm (GA). "Graph-based" means I'm modifying the molecules by changing/adding atoms and bonds (using RDKit) rather than changing SMILES strings.
2. I hooked up my probabilistic atom-adder to a rudimentary Monte Carlo tree search (MCTS) algorithm I found on GitHub. I modified it to have leaf parallelisation, i.e. 25 random leafs a made instead of one.
3. The score to be optimised is the penalised logP (plogP) value using here and here.

Size matters
All other things being equal, the plogP score increases linearly with molecular size so this must be considered when comparing to other studies. I want to compare to this study where the maximum SMILES length is capped at about 80 according to their code. The average size of the molecules in Figure 2 of that paper is 39.15 ± 3.50 atoms. So for GA I rejected a molecule if the number of atoms was larger than a number sampled from a normal distribution with mean 39.15 and a standard deviation of 3.50. Similarly for MCTS I stopped the rollout using the same criterion.

The results
The following results are the maximum plogP score averaged over 10 runs. In the MCTS case I ran each run for 40 iterations, while for the GA I my population size was 20 and I ran for 50 generations. In each case this works out to be 1000 plogP evaluations per run. The initial mating pool is 20 random molecules sampled from the first 1000 molecules in the ZINC data set used in this study by Yang et al.. Similarly, probabilities used in the atom-adder comes from an analysis of these 1000 molecules. The mean plogP score for this set is 0.2 and the maximum value is 3.6.

The average maximum plogP for GA is 6.8 ± 0.7 and 7.4 ± 0.9 using a 50% and 1% mutation rate. For comparison the best average maximum value in the Yang et al study is 5.6 ± 0.5, obtained using MCTS and close to 20,000 plogP evaluations. The latter took 8 hours, while the GA calculations took 30 seconds each on my laptop.

For MCTS the average maximum plogP value is 3.3 ± 0.7, which is significantly lower than the lowest value in the Yang et al. study, 4.9 ± 0.4. The most likely explanation is the atom adder makes relatively few benzene rings (as discussed previously), which, together with Cl atoms, is the defining structural feature of the high scoring molecules found by Yang et al. Indeed, if you increase the probability of generating C=C-C containing rings (from 63% to 80%) then the average maximum plogP value increases to 4.4 ± 0.4. If you increase the plogP evaluations to 5000 (the value used by Yang et al. to obtain the 4.9 ± 0.4 value) then the value increases to 5.0 ± 0.4. The latter run takes about 9 min, compared to 2 hrs in the Yang et al. study.

Conclusion

GA is very good at maximising plogP values and very efficient. For this particular property GA completely outperforms MCTS.

You can find the code I used here