Tuesday, March 6, 2018

Reviews of Random Versus Systematic Errors in Reaction Enthalpies Computed Using Semi-empirical and Minimal Basis Set Methods

We submitted this paper to ACS Omega January 31st and the reviews just came back

Reviewer: 1

Recommendation: Publish after minor revisions.

The authors explored the CBH approach proposed by Sengupta and Raghavachari to compute the reaction enthalpy of a series of organic reactions using semi-empirical and low-cost HF/DFT as the low-level method. They also discussed the origin of errors for several cases that exhibited very large errors. The results will be very useful to the computational chemistry community, in terms of identifying effective means to compute reaction energetics and better ways to improve low-cost methods.

I have only a few minor comments:
1. It appears that dispersion correction was not included for DFTB3 and some NDDO methods. For reactions that involve very large molecules, dispersion may make a non-negligible contribution, as found, for example, for Diels-Alder reactions in the recent benchmark analysis by Gruden et al. (J. Comp. Chem. 38, 2171-2185).

2. There are several typos: line 55 of pg 2, "corrections WERE not included"; line 48 of pg 6, there is one additional "is".

3. It might be useful to report and comment on the computational cost for the different approaches. For example, PBEh-3c is still rather expensive compared to the semi-empirical methods.

4. Is there a "simple" explanation for the difference between xTB and DFTB3? For example, does the improved description of frequencies by xTB make a major difference?

Reviewer: 2

Recommendation: Publish after minor revisions.

The authors have carried out an analysis of the performance of highly efficient computational methods for the computation of enthalpies of organic reactions using the connectivity-based hierarchy. The methods considered include DFT, HF, and a range of semi-empirical methods. The analysis is clear and some of the reported findings are indeed significant. While the good performance of DFT and HF is consistent with previous results, the lack of significant improvement with semi-empirical methods is particularly noteworthy. The paper is acceptable for publication after the authors address the following comment.

A more detailed analysis is reported for reaction 19 that is an outlier for some methods such as HF. In this system, the larger errors are attributed partly to the presence of the strained oxirane ring. Similarly, reaction 23 poses problems for some semi-empirical methods due to the presence of larger errors involving allene. In light of these observations, it may be useful to add a cautionary note to the range of problems that can be studied with such methods. I suggest a small paragraph to address the potential limitations of the inexpensive methods for such systems containing unusual bonding situations.

This work is licensed under a Creative Commons Attribution 4.0

Sunday, March 4, 2018

Improvements to take_elementary_step.py

First of all I have fixed a bug in the bond order energy function so the program now suggests significantly fewer structures. Second I have written some code that creates input files for a QM program (GFN-xTB) and analyses the output. I choose GFN-xTB because it's a super fast semiempirical QM method that is also reasonably accurate. I have tried to keep things fairly modular so that it's easy to interface it with other QM programs (more details in the code comments). 

Using the Diels-Alder reaction as an example, staring with ethene and butadiene as the reactant the program now suggests 28 other structures with energies no more than 100 kcal/mol higher than the reactants. I choose 100 kcal/mol because the energy function is very crude.

All structures are then optimized with GFN-xTB and I now select structures with energies no more than 20 kcal/mol higher than the reactants. I choose a lower value here because I trust the energy function a bit more. I use the electronic energy plus an estimate of the translational entropy, but the code can easily be modified to use free energies. This narrows it down to 21 structures (ranked by energy below) in addition to the reactants, so the bond order energy function is actually doing quite well.

You'll notice that some of the structures appears to have two H atoms, but this is in fact an H2 molecule. The reason for this is that the code that translates the Cartesian coordinates back to SMILES uses a distance criterion to assign bonds and that apparently needs some tweaking for H2.  It's important to recompute the SMILES because the bonds can rearrange during the geometry optimization.

I also need to fix the code for calculations on atoms where GFN-xTB doesn't print out coordinates.

The next step is to find TSs connecting all these structures to the reactants to find the lowest barrier.

This work is licensed under a Creative Commons Attribution 4.0

Saturday, March 3, 2018

Very simple way to animate in Jupyter notebooks

I want to use Google Colaboratory for my python course, but the module we use for animation doesn't work there because the necessary libraries are not installed. I had planned to spend part of the weekend to see if I could find a solution but by amazing coincidence a tweet in my Twitter stream this morning had the answer!

The answer is simply to display the images sequentially and clear the output each time with clear_output(wait=True) from IPython.display. You can see the demo code I wrote below.  This works very nicely on my off-line notebook (where you want to execute this line in a cell above the notebook itself: %matplotlib inline) while the animation is a bit choppy on Colaboratory, presumably due  to network latency. But it's certainly good enough.

Thanks again to Twitter for saving me tons of time!

This work is licensed under a Creative Commons Attribution 4.0

Saturday, February 17, 2018

get_conformations.py: a simple script to generate rotamers

Here's a simple script for generating rotamers. It assumes the input geometry is reasonable and perturbs the dihedral angles of $n$ rotateable bonds by $\pm 120^\circ$. It makes all $3^n$ combinations of dihedral angles but if that is larger than a user-defined cutoff it will pick a random subset. Rotateable bonds include alcohol, phenols, and primary amines, but also amide bonds.  It doesn't know about equivalent atoms, so it will happily rotate a tert-butyl or trifluoromethyl group.

This work is licensed under a Creative Commons Attribution 4.0

Wednesday, February 14, 2018

Predicting pKa values using PM3 - conformer search part 3

This is a follow up to this post, but now we (that is Mads) use up to 1000 structures. To summarise: we use RDKit to generate $n$ starting conformations of neutral ("0") and protonated ("+") acebutolol and minimise them in aqueous solution using PM3/COSMO (MOPAC) and PM3/SMD /GAMESS) for a maximum of 200 steps.

The plot shows the lowest energy structure relative to the lowest energy found among all values of $n$. For example, the lowest PM3/COSMO value for the neutral form is found when generating 400 starting geometries and using 1000 starting structure, the lowest energy found is 1.1 kcal/mol higher.

I was a bit worried about the numerical stability of the COSMO and SMD methods, i.e. that the $n$ = 400 structure was generated in the $n$ = 1000 run, but had a higher energy for some reason. The figure above shows the lowest energy structure found ($n$ = 400) in purple and the structure most closely resembling it in the $n$ = 1000 search in green. So the $n$ = 400 structure was not generated in the $n$ = 1000 run.

Acebutolol has 12 rotateable bonds and roughly $3^{12}$ = 500,000 possible rotamers and RDKit samples these randomly. So, on hindsight, it is not surprising that $n$ = 400 and $n$ = 1000 randomly chosen samples do not both contain the lowest energy structure.

Incredibly (at least to me), the green structure is 1.7 kcal/mol higher in energy than the green. But, further optimisation of the green structure does not change energy, nor does changing the optimiser to EF instead of the default. I even rotated the molecule and restarted to the optimisation to test the robustness of the COSMO grid. But, no, this energy difference due to what I think is a minor strutural difference appears to be real. This means that the conformational PES is quite rugged, i.e. small structural changes can result in relatively big energy changes.

As it stands, pKa-predictions using PM3/COMSO or SMD (where I use $n$ > 100) will have a ca 1 pKa unit random error due to the conformational search and increasing $n$ is not really going to decrease that very much.

I'm not sure what to do about that.  We might try combining several low-$n$ conformational searches, with different random seeds. Alternatively, we may have to implement some sort of directed search, e.g. a genetic algorithm but that is going to be quite expensive for an SQM based method.

Alternatively, we have to pick the structures we use based on an MM screen where we can do a more thorough search but then we really need a MM/continuum solvation method interface.  If you have other ideas I'd be happy to hear about it.

Let me end with a challenge. Can you make a robust method that can find a PM3/COSMO energy equal to or lower than -183.01 kcal/mol for neutral acebutolol? And that is converged with respect to starting geometry and number of conformers tested? The MOPAC keywords are

pm3 eps=78.4 charge=0 cycles=200

This work is licensed under a Creative Commons Attribution 4.0

Saturday, February 10, 2018

take_elementary_step.py: exploring chemical space one elementary step at a time

I have written a program called take_elementary_step.py. It is basically an implementation of the idea described in this paper by Zimmerman, but using the atom connectivity matrix approach described in this paper by Woo Youn Kim.

Zimmerman defines an elementary reaction step as a chemical rearrangement "with no more than two connections breaking and two connections forming simultaneously, while maintaining the upper and lower limits for coordination number of each atom." Two structures that are related in this way are very likely to be connected by a single transition state.

The code generates all such states for a given molecule and then selects the "best ones" based on a crude energy function based on bond dissociation energies. The idea is then to optimize the molecules using a QM method to refine the selection and then locate the TS connecting the selected structures to the input structures.  The atom ordering should be preserved (haven't checked yet) so they are suitable for interpolation-based methods such as NEB or GSM.

The default is to lists all structures with energies no higher than 200 kJ/mol compared to the input structure and to use homolytic bond cleavage, i.e. a C-H bonds is broken into a C and H radical, rather than, say, C- and H+, but this can be controlled by the "heterolytic" keyword.

Using the heterolytic option will result in incorrectly charged fragments in some cases, but these should be weeded out by the energy criterion. For example, if you input acetylene the code will generate HC3- + HC3- instead of HC3- + HC3+, but I didn't feel like writing complicated code to fix this since these possibilities will no be presented to the user because they are too high in energy.

The figure at the bottom shows the results using 1,3-butadiene + ethene as input. For some reason, RDKit has trouble displaying H2 molecules so they have been removed and acetylene looks very weird (e.g. for comp64).  There are 88 structures in all, but that number could be reduced significantly be removing molecules with unsaturated atoms and/or three- and four-membered rings.

In his paper, Zimmerman shows 5 selected structures obtained using his implementation:

Taken from P.M. Zimmerman, J. Comp. Chem. 2013, 34, 1385. Copyright John Wiley & Sons. 

The first three also appear below as comp8, comp21, and comp54, while the last two are seemingly missing. However, they actually don't correspond to elementary steps since more than 2 bonds are broken or formed, but result from the subsequent QM optimization of comp7 and comp52 where additional bonds are formed between unsaturated C atoms.

This work is licensed under a Creative Commons Attribution 4.0

Saturday, February 3, 2018

Simple transformation rules lead to complex and beautiful molecules

I'm messing around with some code for a new project. The code starts with the first molecule shown below (NCCN) and repeatedly applies two transformation: it inserts a C atom between 2 C atoms (CC>>CCC) and cyclizes 3 C atoms (CCC>>C1CC1). The image below shows the result for 7 cycles of this, i.e. cycle 1 creates NCCCN, cycle 2 creates  NC1CC1N and NCCCCN, etc.

After 7 cycles these two simples transformations lead to 121 molecules, some of which are quite beautiful in my opinion. I wonder how many of these have been made.

This work is licensed under a Creative Commons Attribution 4.0

Friday, January 26, 2018

xyz2mol: converting an xyz file to an RDKit mol object

I have written a program called xyzmol that converts an xyz file to an RDKit mol object, based on this paper.  For ions the molecular charge has to be specified in the xyz file.

In principle this could be accomplished by using OpenBabel to convert an xyz file to and sdf file and reading the sdf file with RDKit. However, in my experience OpenBabel occasionally mis-assigns bond orders and I believe this code does this less often.

The code works by constructing an atom connectivity (AC) matrix from the xyz file, converts the AC matrix to a bond order (BO) matrix, computes the formal atomic charges, and then constructs a molecule object using all this information, including the 3D coordinates.

The code that constructs the AC matrix from the xyz file (xyz2AC) uses scaled covalent radii to find bonds and the current scale factor is 1.25, but this may need adjusting.

The code currently works only for organic molecules, i.e. molecules containing the following elements: H, B, C-F, Si-Cl, Br, and I.

The current version assumes a close shell molecule, i.e. it does not work with radicals yet.

I hope you find it useful.

This work is licensed under a Creative Commons Attribution 4.0

Tuesday, January 23, 2018

Open access chemistry publishing options in 2018

I just noticed that my go-to journal increased its APC again.* Now there's a flat fee of $1095 so I am re-evaluating my options for impact neutral OA publishing. I don't think PeerJ is greedy, so I think the most likely explanation is be that their old model was not sustainable. I now feel I have been a bit to hard on some other OA publishers (e.g. here and here, but not here).

While price and impact-neutrality is the main consideration, open peer review is a nice bonus that I became accustomed to from PeerJ. In my experience it makes for much better reviews and keeps the tone civil.

Impact neutral journals
$750 ACS Omega (+ ACS membership $166/year). Closed peer review.

$1000 F1000Research. Open peer review. Bio-related

$1095 PeerJ. Open peer review. Bio-related. PeerJ also has a membership model, which may be cheaper than the APC.

(The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

$1350 Cogent Chemistry. Has a "pay what you can" policy. Closed peer review.

$1495 PLoS ONE. Closed peer review.

$1760 Scientific Reports. Closed peer review

Free or reasonably priced journals that judge perceived impact
$0 Chemical Science Closed peer review

$0 Beilstein Journal of Organic Chemistry. Closed peer review.

$0 Beilstein Journal of Nanotechnology. Closed peer review.

$0 ACS Central Science. Closed peer review.

$100 Living Journal of Computational Molecular Science. Closed peer review

€500 Chemistry2. Closed peer review.

£500 RSC Advances. Closed peer review. (Normally £750)

Let me know if I have missed anything.

This work is licensed under a Creative Commons Attribution 4.0

Wednesday, January 17, 2018

Drug design: My latest paper explained without the jargon

Our latest paper has appeared in the latest issue of Chemical Science. It's ultimately related to making better drugs so first some background.

Making complex drug candidates for testing is usually done in several steps where new chemical groups are added at each step. There are usually several reactive atoms and the challenge is to predict the most reactive one. This is currently done mostly by chemical intuition and related literature examples – an approach that often fails, which contributes significantly to the cost of making new drugs. So, there is a need for a fast, yet powerful and easy-to-use tool to predict the most reactive atom in a molecule. 

The New Study
For this study we collaborated with the pharmaceutical company Lundbeck A/S to develop a user-friendly tool to predict the most reactive atoms for one of the most often used chemical reaction within drug design.  We compared our predictions against published results for more than 500 different kinds of molecules and found that we made correct predictions in 96% of the cases.
We have made the software is available free to anyone and also made an easy-to-use web interface at regiosqm.org.  We are now working on extending the method to other types of reactions.

This work is licensed under a Creative Commons Attribution 4.0