We've been working on conformational search for a while and are nearing the point were we have enough to write it up. This post is to get my head around the central point of the study.
Motivation
I'm interested in conformational search because I want to compute accurate reaction energies. Therefore I need to find the global energy minimum of both reactants and products (or something very close to them as explained below).
I need to make to make 2 choices: what conformational search method and what level of theory to use.
Establishing a benchmark set
The only way to make reasonably sure you find the global minimum is to do a systematic search with a relatively fine dihedral angle resolution. This can be painful, even with MMFF, but needs only to be done once. So what if it runs for 24 hours?
A dihedral scan doesn't sample ring conformations directly, so you may have to repeat the systematic search for several ring conformations.
The global energy minimum is obviously fully energy minimised so when I say "search" I mean that you are generating starting geometries that are then fully energy minimised. Also, if you are interested in reaction enthalpies, the global minimum is the structure with the lowest enthalpy and similarly for free energies.
Test your new conformational search method
Here's the only test that matters: run your new conformational search method N times and record how many of the N runs find the global minimum, i.e. the success rate. Do that for M different molecules and calculate an average success rate.
Let's say the average success rate is 90%. That tells me that if I use your method once there is 10% chance I that I won't find the global minimum, but if I use the method twice on the same molecule that chance drops to (0.1 x 0.1) 1% (assuming your method is stochastic).
The larger N and M are, the lower the uncertainty in the average success rates: 5 is no good, 10 is borderline, >15-20 is acceptable.
The success rate will depend on the number of rotatable bonds so it's important that the sizes of your M molecules are representative of the molecules you want to study.
Obviously, if your new method finds a lower energy value then there you need to go back and look the way you found your benchmark set.
Depending on the accuracy you can live with, you can loosen the success criteria to having found a structure within, say, 0.5 kcal/mol of the global minimum.
Benchmarking energy functions for conformational search
Most benchmark studies of this kind compare conformational energies and report RMSD or MAE values. The problem with this approach is that large errors for high energy conformers can lead to large RMSD values which are misleading for the purposes of finding global minima.
The only test of an energy function that really matters is whether the "true" global minimum structure is among the predicted structures and, if so, how close in energy it is to the "predicted" global minimum. Here the "true" global minimum is the global minimum predicted by your reference energy function that you trust.
Let's say you want to compare a semiempirical method (SQM) to a DFT method you trust and you can afford to do 100 geometry optimisations with DFT per molecule. The SQM is cheap enough that you can perform either a systematic search or a search using a conformational search method you have found reliable. Now you take the 100 (unique) lowest energy structures and use them as starting points for 100 DFT optimisations. The lowest energy structure found with DFT is your best guess for the "true" global minimum and the question is what is the minimum number of DFT optimisations you need to perform to find the "true" global minimum for each molecule? The lower the number, the better the SQM method for that molecule.
Let's say you do this for 5 molecules and the answer for SQM1 is 3, 1, 4, 10, and 3 while for SQM2 it is 4, 6, 4, 7, and 4. The I would argue that SQM2 is better because you need to perform 7 DFT optimisations to be 100% sure to find the "true" global minimum for all 5 molecules, compared to 10 DFT optimisations for SQM1.
Another metric would be to say that in practice I can only afford, say, 5 DFT optimisations and compute the energy relative to the "true" global minimum, e.g. 0.0, 0.0. 0.0, 0.5, and 0.0 kcal/mol for SQM1 and 0.0, 0.2, 0.0, 0.8, and 0.0 kcal/mol for SQM2. In this case you could argue that SQM1 is better since the maximum error is smaller. The best metric depends on your computational resources and what kind of error you can live with.
A need for a global minimum benchmark sets
Most, if not all, conformer benchmark sets that are currently available are made starting from semirandomly chosen starting geometries, with no guarantee that the true global minimum is among the structures. What is really needed is a diverse set of molecular structures and total energies, obtained using trustworthy methods, that one is reasonably sure correspond to global minima.
As I mentioned above, the only "sure" way is to perform a systematic search but for large molecules this may be practically impossible for energy functions that you trust.
One option is to perform the systematic search using a cheaper method and then re-optimise the P lowest energy structures with the more expensive method. The danger here is that the global minimum on the expensive energy surface is not a minimum on the cheap energy surface or, more precisely, that none of the P starting geometries leads to the global minimum on the expensive surface. One way to test this is to start the stochastic search, which hopefully is so efficient that you can afford to use a trustworthy energy function, from the global minimum candidate you found.
Additionally, it is useful to use two different stochastic conformational search algorithms, such as Monte Carlo and genetic algorithm. If both method locate the same global minimum, then there is a good chance it truly is a global minimum, since it is very unlikely to find the global minimum by random chance.
This work is licensed under a Creative Commons Attribution 4.0