Thursday, February 21, 2019

Reviews of Graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space

Here are the reviews of my latest paper which just appeared in Chemical Sciences. I submitted the paper December 1, 2018 and got these reviews on January 13, 2019. I resubmitted January 20, and got the final decision on February 8. As usual with Chem. Sci. a very efficient and positive experience. Kudos to Geoff Hutchison for signing the review (and being cool with me sharing it here) and kudos to Chem. Sci. for passing it on to me.

Referee: 1

Recommendation: Revisions required

The main question of this paper is whether GA-based algorithms perform better than deep-learning. This paper includes interesting comparisons and free-to-use software, which is potentially a good resource for the AI-chemistry community. Yet I have following reservations about this paper.

1) Essentially GA-based method is faster than ML, creating more molecules. The logP computation is extremely fast, allowing GA to create many molecules in a fixed period of time. When simulation takes a longer time (like DFT), it may be beneficial to use more time to design. It is necessary to give a comparison in terms of the number of simulations needed to obtain good molecules. In that case, ML would be better, because it might be creating "high-quality" molecules using more time. Please provide comparisons with this respect.

2) It seems like the author tuned GA parameters such that the molecules are "realistic looking". It would be beneficial to readers if you can elaborate on this aspect. What exactly did you mean by "realistic looking"? Can you quantify somehow?

3) It seems to me that GA crossover parameters are inspired by chemical reactions. Can you claim that GA-based molecules are more synthesizable than deep-learning-based ones?

Referee: 2

Recommendation: Accept

The manuscript “Graph-based genetic algorithm and generative model Monte Carlo tree search for the exploration of chemical space” by Jan Jensen is an excellent addition to recent work on using computational methods to generate new molecular compounds for target properties.

I will admit up-front that I am a proponent of GA strategies, so the conclusions were not surprising. I think the work should be published but would like to make some minor suggestions that I think will strongly improve the work.

- On page 2, the number of non-H atoms is described coming “from a distribution with a mean 39.15 and a standard deviation of 3.50” - this is a number without a unit. Based on the article, I think it should read “a mean of 39.15 atoms, with standard deviation of 3.50 atoms”
- The last paragraph on page 3 is perhaps a bit technical for the Chemical Science audience, discussing “leaf parallelization” and “leaf nodes.” I think the whole paragraph needs to be written for a general audience (i.e., not someone implementing a MCTS code) or moved to the supporting information. The code is, after all, open source and available.
- The penalized logP score could be described better. From the text, the penalty for “unrealistically large rings” was not described.
- The J(m) scores in Table 2 could perhaps be a bit expanded. For example, my assumption is that the SA scores and/or ring penalities may be higher in some methods than others. I think it would be useful to add columns for the raw logP, SA, and penalty scores - if not in the text then in the supporting information.
- The results and discussion could benefit from a figure indicating the rate of improvement with generations for the GA methods and/or the GB-GM-MCTS methods. We have, for example, shown that GA methods show high rate of improvement in early generations, but finding beneficial mutations slows over time. This would likely explain why the lower mutation rate shows better performance in this work - and moreover suggest an “early stop” (e.g., are all 50 generations needed for this problem?)
- Similarly, I find it strange that the author didn’t try longer runs or attempt to find an optimal mutation rate for the GA, particularly if the CPU time is so short.
- The caption for Table 3 includes a typo - I believe “BG-GM” should read “GB-GM”
- The conclusions suggest that the GB-GA approach “can traverse a relatively large distance in chemical space” - the author should really use similarity scores (e.g., a Tanimoto coefficient using ECFP fingerprints or similar) to quantify this - again, the discussion could be expanded.

Overall, I think it’s a great addition to the discussion on optimization of molecular structures for properties.

-Geoff Hutchison, University of Pittsburgh

Wednesday, January 30, 2019

Screening for large energy storage capacity of meta-stable dihydroazulenes Part 3

This is a follow up to this post, which was a follow up to this post. Briefly, we (that is to say Mads) have computed $\Delta E_{rxn}$ and $\Delta E^\ddagger$ for about 32,500 molecules using xTB and PM3 respectively. We can afford to do a reasonably careful (DFT/TZV) study on at most 50 molecules, so the next question is how to identify the top 50 candidates. In other words to what extent can we trust the conformational search and the xTB and PM3 energies?

In my last post we saw that the xTB and PM3 energies can be trusted well enough and this post we'll see that the conformational search also can be trusted.

Conformational search
The xTB storage energy calculation is based on the lowest energy DHA and VHF structures found by optimisations of $5+5n_{rot}$ geometries generated using RDKit. The plot above shows a comparison of this approach to one where we generated all conformers by systematically rotating each rotateable bond by $\pm 120^\circ$ for a subset of 100 molecules. It is clear that the $5+5n_{rot}$-approach works really well for most molecules (including the 20 with high storage energy) and, if anything, overestimates the storage energy (i.e. at worst we will have som false positives).

From 35,588 to 41 to 6 to ?? candidates
The fact that we can trust the storage energies reasonably well means that we can proceed with the 41 molecules I identified in the previous post. As a first step we optimised the geometries at the DFT level and as expected most of the molecules have barriers that are too low. But 6 of them still look promising, so the next step is to perform a systematic conformer search using xTB (just to be safe) and then re-optimise all structures with energies close to the xTB minimum with DFT. Stay tuned ... with fingers crossed.

This work is licensed under a Creative Commons Attribution 4.0

Monday, January 21, 2019

Screening for large energy storage capacity of meta-stable dihydroazulene Part 2

This is a follow up to this post. Briefly, we have computed $\Delta E_{rxn}$ and $\Delta E^\ddagger$ for about 32,500 molecules using xTB and PM3 respectively. We can afford to do a reasonably careful (DFT/TZV) study on at most 50 molecules, so the next question is how to identify the top 50 candidates. In other words to what extent can we trust the conformational search and the xTB and PM3 energies?

To try to answer the latter question we (i.e. Mads) randomly chose 20, 60, and 20 molecules with high, medium, and low xTB-$\Delta E_{rxn}$ values and recomputed $\Delta E_{rxn}$ at the M06-2X/6-31G(d) level of theory using the lowest xTB-energy DHA and VHF structures as starting geometries. The results are shown below

The vertical and and horizontal lines denote $\Delta E_{rxn}$ and $\Delta E^\ddagger$, respectively, for an experimentally characterised "reference" compound that we want to improve upon, i.e. we are looking for compounds that have larger $\Delta E_{rxn}$ and $\Delta E^\ddagger$ values. $\Delta E_{rxn}$ should ideally be at least 5 kcal/mol higher than the reference (but in general as large as possible) and $\Delta E^\ddagger$ should be at least 2 kcal/mol higher than the reference, but anything higher than that is not necessarily better. The half life depends very sensitively on the barrier, and is hard to compute accurately, so we have to be careful about excluding molecules with high storage capacity in cases where the barrier is close to the reference.

Using these criteria I would select 5 points for further study indicated by the red points (let's call them Square, Star, Triangle, Dot, and Diamond). Square is clearly the most promising one with significantly higher $\Delta E_{rxn}$ and $\Delta E^\ddagger$ compared to the reference. The remaining 4 points are chosen because they either have reasonably high $\Delta E_{rxn}$ and $\Delta E^\ddagger$-values  that are only a few kcal/mol below the reference (Triangle and Star) or reasonably high $\Delta E^\ddagger$ and $\Delta E_{rxn}$ that are somewhat (ca 3 kcal/mol) higher than the reference. I'd call the last 4 points pseudo-positives.

The corresponding plot for SQM storage energies and barrier heights is shown above. The good news is that Square is still the clear winner and I would also have picked Star and Triangle for further investigation. However, there are many more points that I also would picked (false pseudo-positives) and Diamond and Dot would not have been picked (false pseudo-negatives).

The false positives are due to PM3 overestimating the barrier, so let's use B3LYP/6-31+G(d)//xTB barriers instead. Square is still the winner, Star would still be picked (but not Triangle), and the false positives are gone. However, the barriers for Dot and Diamond are now so low that they will not be picked (false pseudo-negatives).

Summary and outlook
SQM finds the only true positive (Square). PM3 tends to over estimate the barrier relative to the reference, which leads to false pseudo-positives. xTB tends to underestimate the storage capacity which leads to a few false pseudo-negatives. DFT/PM3 barriers removes the false pseudo-positives but also leads to some false pseudo-negatives.

The above plot shows the molecules with an xTB storage energy that is at least 4 kcal/mol higher than the reference (4 rather than 5 kcal/mol since xTB tends to underestimate the storage energy). The solid line shows the reference barrier and the dotted line lies 2 kcal/mol above that (since PM3 tends to underestimate the barrier). There are 41 points above that line, including good old Square, that we should investigate further and it's probably also good to look at a few more points with very high storage energy and reasonably high barrier.

But all this assumes that we can trust the conformational search so this has to be tested next.

This work is licensed under a Creative Commons Attribution 4.0

Wednesday, January 16, 2019

Open access chemistry publishing options in 2019

Here is an updated list of affordable impact neutral and other select OA publishing options for chemistry

Impact neutral journals
$0 (in 2019) PeerJ chemistry journals. Open peer review. (Disclaimer I am an editor on PeerJ Physical Chemistry)

$425 (normally $850) Results in Chemistry. Closed peer review

$750 ACS Omega (+ ACS membership $166/year). Closed peer review. WARNING: not real OA. You still sign away your copyright to the ACS.

$1000 F1000Research. Open peer review. Bio-related

$1095 PeerJ - Life and Environment. Open peer review. Bio-related. PeerJ also has a membership model, which may be cheaper than the APC.

(The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

$1350 Cogent Chemistry. Has a "pay what you can" policy. Closed peer review.

$1595 PLoS ONE. Closed peer review.

$1790 Scientific Reports. Closed peer review

Free or reasonably priced journals that judge perceived impact
$0 Chemical Science Closed peer review

$0 Beilstein Journal of Organic Chemistry. Closed peer review.

$0 Beilstein Journal of Nanotechnology. Closed peer review.

$0 ACS Central Science. Closed peer review. ($500-1000 for CC-BY, WARNING: not real OA. You still sign away your copyright to the ACS as far as I know) 

$100 Living Journal of Computational Molecular Science. Closed peer review

€500 Chemistry2. Closed peer review.

£750 RSC Advances. Closed peer review.

Let me know if I have missed anything.

This work is licensed under a Creative Commons Attribution 4.0

Thursday, January 10, 2019

Screening for large energy storage capacity of meta-stable dihydroazulenes

Here's a summary of where we are at with Mads project

The Challenge
Dihydroazulenes (DHAs) are promising candidates for storing solar energy as chemical energy, which can be released as thermal energy when needed. The ideal DHA derivative has a large $\Delta H_{rxn}$ and a $\Delta G^{\ddagger}$ that is large enough to give a half life of days to months but low enough so that the energy release can be reduced.  Of course any modification should not affect light adsorption adversely. This presents an interesting optimisation challenge!

We asked Mogens Brøndsted Nielsen to come up with a list of substituents and he suggested 40 different substituents and 7 positions, which results in 164 billion different molecules (we chose to interpret the right hand figure more generally). We decided to start by looking at all single and double substitutions, which amounts to 35,588 different molecules. The first question is what level of theory will allow us to screen this many molecules. 

The initial screen
At a minimum we need to compute $\Delta E_{rxn}$ which involves at least a rudimentary conformational search for both reactants and products. We used an approach similar to this study, ($5+5n_{rot}$ RDKit generated start geometries) which results in over 1 million SQM geometry optimisations, but used GFN-xTB instead of PM3 because the former is about 10 times faster.

To find the barriers, we did a 12-point scan along the breaking bond in DHA (out to 3.5Å) starting from the lowest energy DHA conformer. The highest energy structure was then used as a starting point for a TS search using Gaussian and "calcall". We used ORCA for the scan and Gaussian for the TS search, and used PM3 because it is implemented in both programs. We also optimised the lowest VHF structure with PM3 to compute the barrier. We verify the TS by checking whether the imaginary normal mode lies along the reacting bond. This worked in 32,623 cases. The whole thing took roughly 5 days using roughly 250 cores.

Note that this approach finds a TS to cis-VHF, which we assume is in thermal equilibrium with the lower energy trans-VHF form. For both barriers and reaction energies we use the electronic energy differences rather than free energies of activation and reaction enthalpies.

The next step
We can afford to do a reasonably careful (DFT/TZV) study on at most 50 molecules, so the next question is how to identify the top 50 candidates. In other words to what extent can we trust the conformational search and the xTB and PM3 energies? I plan to cover this in a future blog post.

A more efficient initial screen
We now have data with which to test more efficient ways of performing the initial screen:

1. (a) We could perform the conformational search using MMFF and only xTB-optimise the lowest energy DHA and VHF structures.
(b) We could only perform the TS search for molecules with large $ \Delta E_{rxn}$ values.
(c) We could perform the bond-scan with xTB rather than ORCA. 
(d) We could test whether the bond-scan barrier can be used instead of the TS-based barrier.

2. We could test the use ML-based energy functions such as ANI-1 instead of SQM.

3. We could test whether ML can be trained to predict $\Delta E_{rxn}$ and/or $\Delta E^{\ddagger}$ based on the DHA structure.

We'd be very happy to collaborate on this.

Beyond double substitution
No matter how efficient we make the initial screen, screening all 164 billion molecules is simply not feasible. Instead we'll need to use search algorithms such as genetic algorithms or random forest.

Other ideas/comments/questions on this or anything else related to this blogpost are very welcome.

This work is licensed under a Creative Commons Attribution 4.0

Friday, January 4, 2019

Planned papers for 2019

A year ago I thought I'd probably publish three papers in 2018:

1. Fast and accurate prediction of the regioselectivity of electrophilic aromatic substitution reactions

2. Random Versus Systematic Errors in Reaction Enthalpies Computed using Semi-empirical and Minimal Basis Set Methods
3. Improving Solvation Energy Predictions using the SMD Solvation Method and Semi-empirical Electronic Structure Methods

5. Empirical corrections and pair interaction energies in the fragment molecular orbital method

The end result was five papers. In addition I also published a proposal.

Here's the plan for 2019


2. Screening for energy storage capacity of meta-stable vinylheptafulvenes
3. Testing algorithms for finding the global minimum of drug-like compounds
4. Towards a barrier height benchmark set for biologically relevant systems - part 2
5. SMILES-based genetic algorithms for chemical space exploration

6. Further screening of bicyclo[2.2.2]octane-based molecular insulators
7. Screening for electronic properties using a graph-based genetic algorithm
8. Further screening for energy storage capacity of meta-stable vinylheptafulvenes

I haven't started a draft on any of papers 2-5 so I'm not exactly brimming with confidence that all 4 will make it into print in 2019. However, we have 80-90% of all the data needed to write each paper, and I've blogged a bit about paper 3.