Proteins and Wave Functions: Random versus systematic errors in computed reaction enthalpies

Friday, November 24, 2017

Random versus systematic errors in computed reaction enthalpies

Jimmy Kromann, Alexander Welford, Anders Christensen, and I have been testing the connectivity-based hierarchy (CBH) approach, developed by Raghavachari and co-workers, for semempirical methods. For most methods the improvement has been quite modest and here I discuss why (you'll need to read the paper to follow what I write here).

PM6
For example, for PM6 the RMSE relative to G4 for the 23 parent reactions is 9.4 kcal/mol, which increases to 16.4 kcal/mol for CBH-1 and decreases to 7.1 kcal/mol for CBH-2. For comparison, the corresponding values for HF/6-311G(d,p) are 15.2, 11.4, and 3.0 kcal/mol, respectively.

To see what's going on, let's look at the CBH-1 correction for the first reaction in the set. The PM6 error is -0.3 kcal/mol for the parent reaction and the CBH-1 correction is -12.7 kcal/mol. PM6 is doing great for the parent reaction so, ideally, the CBH-1 correction should be 0. Why is it so large in magnitude?

It is a Diels-Alder reaction, so two double bonds are turned into four single bonds and the CBH-1 reaction is (using SMILES notation) 4 C + 2 C=C -> 4 CC. The error in the heats of formation (HOF) relative to G4 are C: -5.6, C=C -3.3, and CC: -4.1 kcal/mol. So the CBH-1 correction is -[4(-4.1) - 2(-3.3) - 4(-5.6)] = -12.6 kcal/mol. So the CBH-1 correction is large because the error for C is somewhat larger than for CC and, especially, C=C.

These errors also shows that the 0 kcal/mol error for the parent reaction is clearly fortuitous. If the errors in the HOFs of the reactants and products are similar to those observed for C, C=C, and CC then one would expect an error in the reaction energy of about 3-6 kcal/mol since you go from 2 to 1 molecule.

HF/6-311G(d,p)
Let's compare the PM6 results to HF/6-311G(d,p). The error for the parent reaction is -12.2 kcal/mol, which decreases to -7.9 with the CBH-1 correction. The error in the HOFs relative to G4 are C: -95.4, C=C -140.6, and CC: -166.8 kcal/mol. So the CBH-1 correction is -[4(-166.8) - 2(-140.6) - 4(-95.4)] = 4.4 kcal/mol, which when added to -12.2 lowers the error in reaction energy to -7.9 kcal/mol. Clearly the errors in HOFs are much than larger for PM6, yet most of the error cancels. Why?

The answer is that the magnitude of the HOF error scales with the number and types of bonds in the molecule. For example, the HOF-error of C=CC is -213.2 kcal/mol, which is approximately the sum of the HOF-errors of CC and C=C, minus the error for C [-166.8 -140.6 + 95.4 = -212.0 kcal/mol]. This type of error scaling makes HF-results very amenable to improvement using the CBH approach.

Implications for optimisation of new semiempirical methods
Most NDDO-based semi-empirical methods are optimised by reducing the HOF-RMSE (or MUE). If the absolute errors for predictions are consistently below, say, 1 kcal/mol then the maximum possible reaction energy-error will be 2, 3, and 4 kcal/mol for a A -> B, A + B -> C, and A + B -> C + D type reactions respectively.

However, if the errors are larger (say 5 kcal/mol on average) then the best case scenario for this approach is one where the errors are as systematic as possible (i.e. that all are as close to 5 kcal/mol as possible). Then the reaction energy-error will be around 0 for A -> B and A + B -> C + D reactions and typically around 5 kcal/mol for A + B -> C reaction (which one could correct for since it is consistent).

As shown by the HF example, another approach could be to minimise the the error in a "bond-corrected HOF" for larger molecules, e.g. minimise the errors in HOF(C), HOF(C=C), HOF(CC), and [HOF(C=CC) - HOF(C=C) - HOF(CC) + HOF(C)] and then present a CBH-1-corrected HOF for C=CC to the user.

This approach might make it easier to find more generally applicable parameters that better and systematically minimise the error, since the underlying HF approach "naturally" gives larger errors for larger systems and we are no longer asking the parameters to "undo that" by searching for small errors independent of system size.

Can we tell from the fragment-HOF errors if CBH-corrections will improve results?
No, but perhaps we can identify cases where the correction makes it much, much worse. For example, the fifth reaction in the set is also a Diels-Alder reaction, so the CBH-1 correction is the same as for reaction 1, but the PM6 error for the parent reaction is 9.1 kcal/mol, so the CBH-1 correction lowers the error to -3.2 kcal/mol.

On the other hand, the PM6 CBH-2 correction for Reaction 23 is -44.2 kcal/mol, which is unusually large (and makes things much worse). The CBH-2 reaction 2 C#C + 3 CC + 5 C=C=C -> 6 C=C + 5 C#CC and the large error comes from the fact that the HOF error is 7.6 kcal/mol, much larger and of opposite sign than the other HOF-errors in the reaction, plus the fact that it is multiplied by 5.

However, it is not enough to simply look for outliers. The largest PM6 HOF error among the fragment we have studied so far is 11.7 kcal/mol for NN, and this enters in to the CBH-2 correction for R10, but this correction actually lowers the error relative to the parent reaction.

Perhaps the best one can do is to view corrections that are significantly larger than 20 kcal/mol with suspicion since such large errors are rarely observed for the parent reactions.