Monday, June 17, 2019

Useful introductory books and blogposts on neural networks

Here's a list of books and blogposts on neural networks and related aspects that I have found particularly useful. In general, I like very simple examples - preferably with python code - to introduce me to a topic.

Books
This book is an excellent place to start. The book explains the basics of NNs and guides you through writing your own 3-layer NN code from scratch and applying it to the MNIST set. The book even introduces you to Python, so this is something virtually anyone can do. My only (minor) complaint is that the code uses classes, which can be quite difficult for beginners to grasp and it not really needed here.

Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More
This book offers brief and to-the-point descriptions of some of the major classes of NNs, such CNN and RNN in the first chapters and then walks you though many interesting applications using the DeepChem library. This book gets you started using NNs very quickly and is an excellent supplement to the more basic or more theoretical approaches in this list.

Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning
This is a more formal treatment of deep learning but I still found it (mostly) very readable and there are several useful pseudo-code examples with Python equivalents. The topics are discussed in roughly chronological order, so you also get a good feel for how the NN field developed including major milestones.

Blogposts
This is basically the equivalent of Make Your Own NN but for a RNN applied to a toy problem.

Both posts offer some very simple Python examples of what convolution actually means for images.

A very simple Python introduction to graph convolution, which works quite bit differently from image convolution.



This work is licensed under a Creative Commons Attribution 4.0

Friday, June 14, 2019

Comparison of SMILES-, DeepSMILES-, SELFIES-, and graph-based genetic algorithms Part 2

This post is a follow up to this post. There are two changes:

In that post I generated the data for the string based methods using my graph-based GA (GB-GA) code interfaced with new, string-based, crossover and mutation code. However, this involves going back and forth between graph and string-based representations which could potentially change the atom order. To make sure that doesn't happen I have now written a stand alone string-based GA code, where strings only are converted to graphs when computing the score and graphs are never converted back to strings.

I also had a another look at Brown et al.'s GA code and noticed that they remove duplicates from the population for each generation, which my code didn't. So implemented that as well for both the graph- and string-based methods. In the table below I list the best results, where the original implementation that does not remove duplicates are indicated by a "*".




For GB the removal of duplicates only improves results for celecoxib, where it is now rediscovered 8 times instead of 4. Tiotixene is not rediscovered and troglitazone is only found once with GB-GA, when duplicates are removed.

The new string-based implementation improves results for SMILES and DeepSMILES, with the exception of SMILES for troglitazone, which is discovered once using the old implementation. For SELFIES the new implementation is a little bit worse, but I would say the difference is within the statistical uncertainty. 

GB still tends to outperform string based methods, though they all perform much better than I had expected. Amazingly, DeepSMILES and SELFIES do not appear to offer a clear advantage over SMILES with the exception of troglitazone, where DeepSMILES performs significantly better.

Here are the high scoring molecules found with string based methods. Some of the molecules have radical centers (red boxes) due to misplaced chiral centers.



x



This work is licensed under a Creative Commons Attribution 4.0

Sunday, June 9, 2019

Comparison of SMILES-, DeepSMILES-, SELFIES-, and graph-based genetic algorithms

This post is a follow up to this post. There are three main changes: 1) I have included Emilie's code in my code, 2) I have extended the implementation to SELFIES, and 3) the initial pool of molecules is now constructed exactly as described by Brown et al. (i.e. we use the 100 highest scoring molecules from ChEMBL, but remove molecules with scores higher than 0.323).

As before, I run the 10 GA searches, each for 1000 generations, and record the overall highest score found and the average high score. If the score is 1.00 I also record the number of times I found it, in parentheses. I also record the CPU time on 8 cores (note that I stop the search once the score is 1.00, so the time is not necessarily for 10 x 1000 generations).


Here are the high scoring molecules found with string based methods



Bottom line, DeepSMILES and SELFIES perform about the same, and both tend to outperform SMILES for rediscovery using GA.


This work is licensed under a Creative Commons Attribution 4.0

Wednesday, June 5, 2019

Comparison of SMILES-, DeepSMILES- and graph-based genetic algorithms

Emilie is wrapping up her bachelor project and writing the report and here are some preliminary results (which are likely to change a bit).

I recently developed a graph-based genetic algorithm that seems to work pretty well. The crossover and mutation code is about 250 lines with a lot of hyperparameters that mainly specify the probabilities of performing different crossovers and mutations.

The question is whether all this was really necessary or could I have gotten away with about 25 lines of code that perform crossover and mutation operations on SMILES strings? For crossover you simply cut two strings at random places and recombine the fragments, e.g. OCC|C and CC|N (where "|" indicates the cut) yield OCCN and CCC and for mutation you simply change one character to another, e.g. CCC becomes C=C.

The potential problem with using SMILES is that one can imagine many scenarios where this wouldn't work, e.g. OC(|C)C and C1C|O1 would yield OC(O1 and C1CC)C, which are not valid SMILES string.  But can you still find molecules with the desired properties using this approach? If so, do the molecules look very different than the ones you find with the graph-based approach? Which approach is more efficient?

And what about the DeepSMILES representation developed by O'Boyle and Dalke? Here, OC(|C)C and C1C|O1 are written as OC|C)C and CC|O3, which would yield OCO3 and CCC)C - both valid DeepSMILES strings.

Finding molecules with specific penalised logP values
We start by looking at what Brown et al. call a "trivial optimisation objective": finding molecules with a particular modified logP values. We use the same Gaussian modifier approach with a standard deviation of 2 logP units and select the initial population from the first 1000 molecules in the ZINC data set (after removing molecules with logP values within 2 units of the target). The mating pool size and mutation rates are 20 and 10%, respectively. The table shows the average number of generations (based on 10 runs) needed to find a molecule with a logP values within 0.01 of the target.


It is clear that a SMILES-based GA has no problems meeting the objective, but that using DeepSMILES is more effective both in terms of number of required generations and CPU time. The latter, because the percentage of valid strings generated by crossover and mutation (the succes rate) is considerably higher for DeepSMILES as expected. For this target there appears to be no real advantage in using graph-based GA.

Rediscovering a molecule
The next target is considerably harder: generating molecules with a Tanimoto similarity of 1.0 with a target molecule (naphthalene, celecoxib, or tiotixene).  A Tanimoto similarity of 1.0 means that each atom has the same bonding pattern out to a certain radius (here 4 bonds), i.e. that the molecules are very, very similar. The population size is 100 and the mating pool size is 200. The initial populations is 100 molecules with the highest Tanimoto scores to the target (but with Tanimoto scores less than 0.323, following Brown et al.) found among the first 50,000 molecules in the ZINC data set.


Here it's clear that the graph-based approach offer an advantage over string-based methods, while DeepSMILES only offers an advantage over SMILES in terms of effciency. The Tanimoto score goes from 0 to 1, so the molecules found for the tiotexene search using graph-based GA look significantly more like tiotexene than those found with the string based methods. (When I run celecoxib search I succeed 5/10 times, and we are still trying to find the cause of this difference.)

Finding molecules that absorb at a particular wavelength
Inspired by the study of Tsuda and co-workers, we search for molecules that absorb at 400 and 800 nm. We use Grimme's semiempirical sTDA-xTB method to estimate the absorption wavelength and oscillator strength based on a low energy MMFF-optimised structure. We use a Gaussian(400/800,50) scoring function for the wavelength and a LinearThresholded(0.3) scoring function for the oscillator strength (0.3 is the oscillator strength computed for indigo dye). The population and mating pool sizes are 20 and 40, respectively and the mutation rate is 15%. We select the initial population from the first 1000 molecules in the ZINC data set (after removing molecules with absorption wavelengths within 300 nm of the target). The GA search is stopped if the top-scoring molecule has an absorption wave length within 5 nm of the target value.


It turns out that the find molecules that absorb relatively strongly at 400 and 800 nm is a relatively easy optimisation problem.

Conclusion
If there are very few ways to meet the target then graph-based GA performs better than string-based GA methods, but otherwise not. DeepSMILES-based GA is computationally more efficient than SMILES-based GA in many cases.  It would be interesting to test the newly introduced SELFIES representation.



This work is licensed under a Creative Commons Attribution 4.0

Thursday, February 21, 2019

Reviews of Graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space

Here are the reviews of my latest paper which just appeared in Chemical Sciences. I submitted the paper December 1, 2018 and got these reviews on January 13, 2019. I resubmitted January 20, and got the final decision on February 8. As usual with Chem. Sci. a very efficient and positive experience. Kudos to Geoff Hutchison for signing the review (and being cool with me sharing it here) and kudos to Chem. Sci. for passing it on to me.

REVIEWER REPORT(S):
Referee: 1

Recommendation: Revisions required

Comments:
The main question of this paper is whether GA-based algorithms perform better than deep-learning. This paper includes interesting comparisons and free-to-use software, which is potentially a good resource for the AI-chemistry community. Yet I have following reservations about this paper.

1) Essentially GA-based method is faster than ML, creating more molecules. The logP computation is extremely fast, allowing GA to create many molecules in a fixed period of time. When simulation takes a longer time (like DFT), it may be beneficial to use more time to design. It is necessary to give a comparison in terms of the number of simulations needed to obtain good molecules. In that case, ML would be better, because it might be creating "high-quality" molecules using more time. Please provide comparisons with this respect.

2) It seems like the author tuned GA parameters such that the molecules are "realistic looking". It would be beneficial to readers if you can elaborate on this aspect. What exactly did you mean by "realistic looking"? Can you quantify somehow?

3) It seems to me that GA crossover parameters are inspired by chemical reactions. Can you claim that GA-based molecules are more synthesizable than deep-learning-based ones?


Referee: 2

Recommendation: Accept

Comments:
The manuscript “Graph-based genetic algorithm and generative model Monte Carlo tree search for the exploration of chemical space” by Jan Jensen is an excellent addition to recent work on using computational methods to generate new molecular compounds for target properties.

I will admit up-front that I am a proponent of GA strategies, so the conclusions were not surprising. I think the work should be published but would like to make some minor suggestions that I think will strongly improve the work.

- On page 2, the number of non-H atoms is described coming “from a distribution with a mean 39.15 and a standard deviation of 3.50” - this is a number without a unit. Based on the article, I think it should read “a mean of 39.15 atoms, with standard deviation of 3.50 atoms”
- The last paragraph on page 3 is perhaps a bit technical for the Chemical Science audience, discussing “leaf parallelization” and “leaf nodes.” I think the whole paragraph needs to be written for a general audience (i.e., not someone implementing a MCTS code) or moved to the supporting information. The code is, after all, open source and available.
- The penalized logP score could be described better. From the text, the penalty for “unrealistically large rings” was not described.
- The J(m) scores in Table 2 could perhaps be a bit expanded. For example, my assumption is that the SA scores and/or ring penalities may be higher in some methods than others. I think it would be useful to add columns for the raw logP, SA, and penalty scores - if not in the text then in the supporting information.
- The results and discussion could benefit from a figure indicating the rate of improvement with generations for the GA methods and/or the GB-GM-MCTS methods. We have, for example, shown that GA methods show high rate of improvement in early generations, but finding beneficial mutations slows over time. This would likely explain why the lower mutation rate shows better performance in this work - and moreover suggest an “early stop” (e.g., are all 50 generations needed for this problem?)
- Similarly, I find it strange that the author didn’t try longer runs or attempt to find an optimal mutation rate for the GA, particularly if the CPU time is so short.
- The caption for Table 3 includes a typo - I believe “BG-GM” should read “GB-GM”
- The conclusions suggest that the GB-GA approach “can traverse a relatively large distance in chemical space” - the author should really use similarity scores (e.g., a Tanimoto coefficient using ECFP fingerprints or similar) to quantify this - again, the discussion could be expanded.

Overall, I think it’s a great addition to the discussion on optimization of molecular structures for properties.

-Geoff Hutchison, University of Pittsburgh


Wednesday, January 30, 2019

Screening for large energy storage capacity of meta-stable dihydroazulenes Part 3



This is a follow up to this post, which was a follow up to this post. Briefly, we (that is to say Mads) have computed $\Delta E_{rxn}$ and $\Delta E^\ddagger$ for about 32,500 molecules using xTB and PM3 respectively. We can afford to do a reasonably careful (DFT/TZV) study on at most 50 molecules, so the next question is how to identify the top 50 candidates. In other words to what extent can we trust the conformational search and the xTB and PM3 energies?

In my last post we saw that the xTB and PM3 energies can be trusted well enough and this post we'll see that the conformational search also can be trusted.


Conformational search
The xTB storage energy calculation is based on the lowest energy DHA and VHF structures found by optimisations of $5+5n_{rot}$ geometries generated using RDKit. The plot above shows a comparison of this approach to one where we generated all conformers by systematically rotating each rotateable bond by $\pm 120^\circ$ for a subset of 100 molecules. It is clear that the $5+5n_{rot}$-approach works really well for most molecules (including the 20 with high storage energy) and, if anything, overestimates the storage energy (i.e. at worst we will have som false positives).

From 35,588 to 41 to 6 to ?? candidates
The fact that we can trust the storage energies reasonably well means that we can proceed with the 41 molecules I identified in the previous post. As a first step we optimised the geometries at the DFT level and as expected most of the molecules have barriers that are too low. But 6 of them still look promising, so the next step is to perform a systematic conformer search using xTB (just to be safe) and then re-optimise all structures with energies close to the xTB minimum with DFT. Stay tuned ... with fingers crossed.




This work is licensed under a Creative Commons Attribution 4.0

Monday, January 21, 2019

Screening for large energy storage capacity of meta-stable dihydroazulene Part 2



This is a follow up to this post. Briefly, we have computed $\Delta E_{rxn}$ and $\Delta E^\ddagger$ for about 32,500 molecules using xTB and PM3 respectively. We can afford to do a reasonably careful (DFT/TZV) study on at most 50 molecules, so the next question is how to identify the top 50 candidates. In other words to what extent can we trust the conformational search and the xTB and PM3 energies?

To try to answer the latter question we (i.e. Mads) randomly chose 20, 60, and 20 molecules with high, medium, and low xTB-$\Delta E_{rxn}$ values and recomputed $\Delta E_{rxn}$ at the M06-2X/6-31G(d) level of theory using the lowest xTB-energy DHA and VHF structures as starting geometries. The results are shown below


The vertical and and horizontal lines denote $\Delta E_{rxn}$ and $\Delta E^\ddagger$, respectively, for an experimentally characterised "reference" compound that we want to improve upon, i.e. we are looking for compounds that have larger $\Delta E_{rxn}$ and $\Delta E^\ddagger$ values. $\Delta E_{rxn}$ should ideally be at least 5 kcal/mol higher than the reference (but in general as large as possible) and $\Delta E^\ddagger$ should be at least 2 kcal/mol higher than the reference, but anything higher than that is not necessarily better. The half life depends very sensitively on the barrier, and is hard to compute accurately, so we have to be careful about excluding molecules with high storage capacity in cases where the barrier is close to the reference.

Using these criteria I would select 5 points for further study indicated by the red points (let's call them Square, Star, Triangle, Dot, and Diamond). Square is clearly the most promising one with significantly higher $\Delta E_{rxn}$ and $\Delta E^\ddagger$ compared to the reference. The remaining 4 points are chosen because they either have reasonably high $\Delta E_{rxn}$ and $\Delta E^\ddagger$-values  that are only a few kcal/mol below the reference (Triangle and Star) or reasonably high $\Delta E^\ddagger$ and $\Delta E_{rxn}$ that are somewhat (ca 3 kcal/mol) higher than the reference. I'd call the last 4 points pseudo-positives.


The corresponding plot for SQM storage energies and barrier heights is shown above. The good news is that Square is still the clear winner and I would also have picked Star and Triangle for further investigation. However, there are many more points that I also would picked (false pseudo-positives) and Diamond and Dot would not have been picked (false pseudo-negatives).


The false positives are due to PM3 overestimating the barrier, so let's use B3LYP/6-31+G(d)//xTB barriers instead. Square is still the winner, Star would still be picked (but not Triangle), and the false positives are gone. However, the barriers for Dot and Diamond are now so low that they will not be picked (false pseudo-negatives).

Summary and outlook
SQM finds the only true positive (Square). PM3 tends to over estimate the barrier relative to the reference, which leads to false pseudo-positives. xTB tends to underestimate the storage capacity which leads to a few false pseudo-negatives. DFT/PM3 barriers removes the false pseudo-positives but also leads to some false pseudo-negatives.


The above plot shows the molecules with an xTB storage energy that is at least 4 kcal/mol higher than the reference (4 rather than 5 kcal/mol since xTB tends to underestimate the storage energy). The solid line shows the reference barrier and the dotted line lies 2 kcal/mol above that (since PM3 tends to underestimate the barrier). There are 41 points above that line, including good old Square, that we should investigate further and it's probably also good to look at a few more points with very high storage energy and reasonably high barrier.

But all this assumes that we can trust the conformational search so this has to be tested next.


This work is licensed under a Creative Commons Attribution 4.0