Here is an idea for parameterizing semiempirical energies based on the connectivity-based hierarchy (CBH) approach I have written about earlier.

Most semiempirical methods are based on some ab initio ansatz and for the NDDO-based methods such as PM6 that ansatz is minimal basis set HF. These NDDO methods are parameterized (in part) by the minimizing the difference between computed and experimental heats of formation (HOF, although here I'll use G4 heats of formation since I have them handy).

The PM6 error for H$_2$, methane, ethane, and propane are 25.4, -5.6, -4.1, and -3.8 kcal/mol, where the error is defined as

$$ \text{Error} = \Delta H^\circ_{f,G4}-\Delta H^\circ_{f,SQM}$$

which is not too bad considering that the corresponding error for HF/STO-3G are 4.2, -10.8, -25.6, and -42.1 kcal/mol,

However, while the MUE for PM6 (9.7 kcal/mol) is considerably lower than that for HF/STO-3G (20.7 kcal/mol), the reaction enthalpy for 2 methane -> ethane + H$_2$ is considerably higher for PM6 (32.5 kcal/mol) than for HF/STO-3G (0.2 kcal/mol). Somehow, the error cancellation inherent in HF/STO-3G is lost when minimizing errors in heats of formation, which is perhaps not too surprising given that the parameterization has to decrease a considerable error for larger molecules (e.g. the HF/STO-3G error for neopentane is -79.8 kcal/mol).

This problem could potentially be avoided using the CBH approach. Here I'll use CBH-1 for simplicity, where the building blocks are mono-, and diatomic molecules (with added hydrogen atoms). In the CBH approach the computed enthalpies are corrected before presenting them to the user

$$\Delta H^\circ_{f,CBH} = \Delta H^\circ_{f,SQM} + \text{Corr} $$

For the building bloks such as methane and ethane (CC) the correction is simply the error and the HOF for, say, ethane is simply the reference value

$$\Delta H^\circ_{f,CBH}(\text{CC}) = \Delta H^\circ_{f,SQM}(\text{CC}) + \text{Error(CC)} = \Delta H^\circ_{f,G4}(\text{CC})$$

and the deviation of the corrected HOF from G4 is zero

$$ \text{Dev(CC)} = \Delta H^\circ_{f,G4}(\text{CC})-\Delta H^\circ_{f,CBH}(\text{CC}) = 0$$

For propane (CCC) the correction is

$$ \text{Corr(CCC)} = 2\text{Error(CC)}-\text{Error(C)} $$

(note that "C" in the last term represents methane, not the carbon atom). Using this approach the corrected HOF computed by HF/STO-3G

$$ \text{Dev(CCC)} = \Delta H^\circ_{f,G4}(\text{CCC})-\Delta H^\circ_{f,CBH}(\text{CCC}) $$

differs from G4 by only 1.7 kcal/mol.

So if I wanted to somehow tweak STO-3G to give better agreement with G4 using H$_2$, methane, ethane, and propane as a training set the, errors (Devs) I need to minimize are 0, 0, 0, and 1.7 kcal/mol rather than 4.2, -10.8, -25.6, and -42.1 kcal/mol. Achieving good results for the former seems much more doable as the starting MUE is only 0.4 kcal/mol and some error cancellation is already build in.

The caveat is that I have to have Errors values for all possible monoatomic and diatomic molecules. While that's doable it could present a problem when using larger building blocks. However, one can use smaller building blocks to partially correct the problem.

Let's take ethanol (CCO) as an example. Using bonds as building bloks

$$ \text{Corr(CCO)} = \text{Error(CC)}+\text{Error(CO)}-\text{Error(C)} $$

and the corresponding Dev is 2.1 kcal/mol. Let's say Error(CO) isn't availble, then the next best thing is

$$ \text{Corr(CCO)} \approx \text{Error(CC)}+\text{Error(O)}-\mathrm{Error(H_2)} $$

While the Dev value is considerably higher (11.9 kcal/mol) it is still considerably lower than the uncorrected HOF error for ethanol (108.7 kcal/mol) and a better ansatz.

The CBH scheme also lends itself well to dealing with outliers through custom fragments. For example, the Dev for oxirane is 53.3 kcal/mol - a clear outlier due to the poor description of ring strain by STO-3G. Rather than forcing the parameters to compensate or excluding oxiranes from the training set one can include oxirane as an additional building block. So, for example, the correction for for 2,2-dimethyloxirane is

$$ \text{Corr(CC1(C)OC1) } = \text{Error(C1OC1)}+2 \text{Error(CC)}-2 \text{Error(C)} $$

and the corresponding Dev is only 5.8 kcal/mol.

Getting gradients using this approach is a little tricky but doable. But a better approach is to parameterize a separate semiempirical method for geometries as proposed by Grimme.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Most semiempirical methods are based on some ab initio ansatz and for the NDDO-based methods such as PM6 that ansatz is minimal basis set HF. These NDDO methods are parameterized (in part) by the minimizing the difference between computed and experimental heats of formation (HOF, although here I'll use G4 heats of formation since I have them handy).

The PM6 error for H$_2$, methane, ethane, and propane are 25.4, -5.6, -4.1, and -3.8 kcal/mol, where the error is defined as

$$ \text{Error} = \Delta H^\circ_{f,G4}-\Delta H^\circ_{f,SQM}$$

which is not too bad considering that the corresponding error for HF/STO-3G are 4.2, -10.8, -25.6, and -42.1 kcal/mol,

However, while the MUE for PM6 (9.7 kcal/mol) is considerably lower than that for HF/STO-3G (20.7 kcal/mol), the reaction enthalpy for 2 methane -> ethane + H$_2$ is considerably higher for PM6 (32.5 kcal/mol) than for HF/STO-3G (0.2 kcal/mol). Somehow, the error cancellation inherent in HF/STO-3G is lost when minimizing errors in heats of formation, which is perhaps not too surprising given that the parameterization has to decrease a considerable error for larger molecules (e.g. the HF/STO-3G error for neopentane is -79.8 kcal/mol).

This problem could potentially be avoided using the CBH approach. Here I'll use CBH-1 for simplicity, where the building blocks are mono-, and diatomic molecules (with added hydrogen atoms). In the CBH approach the computed enthalpies are corrected before presenting them to the user

$$\Delta H^\circ_{f,CBH} = \Delta H^\circ_{f,SQM} + \text{Corr} $$

For the building bloks such as methane and ethane (CC) the correction is simply the error and the HOF for, say, ethane is simply the reference value

$$\Delta H^\circ_{f,CBH}(\text{CC}) = \Delta H^\circ_{f,SQM}(\text{CC}) + \text{Error(CC)} = \Delta H^\circ_{f,G4}(\text{CC})$$

and the deviation of the corrected HOF from G4 is zero

$$ \text{Dev(CC)} = \Delta H^\circ_{f,G4}(\text{CC})-\Delta H^\circ_{f,CBH}(\text{CC}) = 0$$

For propane (CCC) the correction is

$$ \text{Corr(CCC)} = 2\text{Error(CC)}-\text{Error(C)} $$

(note that "C" in the last term represents methane, not the carbon atom). Using this approach the corrected HOF computed by HF/STO-3G

$$ \text{Dev(CCC)} = \Delta H^\circ_{f,G4}(\text{CCC})-\Delta H^\circ_{f,CBH}(\text{CCC}) $$

differs from G4 by only 1.7 kcal/mol.

So if I wanted to somehow tweak STO-3G to give better agreement with G4 using H$_2$, methane, ethane, and propane as a training set the, errors (Devs) I need to minimize are 0, 0, 0, and 1.7 kcal/mol rather than 4.2, -10.8, -25.6, and -42.1 kcal/mol. Achieving good results for the former seems much more doable as the starting MUE is only 0.4 kcal/mol and some error cancellation is already build in.

**Dealing with missing error-parameters**The caveat is that I have to have Errors values for all possible monoatomic and diatomic molecules. While that's doable it could present a problem when using larger building blocks. However, one can use smaller building blocks to partially correct the problem.

Let's take ethanol (CCO) as an example. Using bonds as building bloks

$$ \text{Corr(CCO)} = \text{Error(CC)}+\text{Error(CO)}-\text{Error(C)} $$

and the corresponding Dev is 2.1 kcal/mol. Let's say Error(CO) isn't availble, then the next best thing is

$$ \text{Corr(CCO)} \approx \text{Error(CC)}+\text{Error(O)}-\mathrm{Error(H_2)} $$

While the Dev value is considerably higher (11.9 kcal/mol) it is still considerably lower than the uncorrected HOF error for ethanol (108.7 kcal/mol) and a better ansatz.

**Dealing with missing outliers**The CBH scheme also lends itself well to dealing with outliers through custom fragments. For example, the Dev for oxirane is 53.3 kcal/mol - a clear outlier due to the poor description of ring strain by STO-3G. Rather than forcing the parameters to compensate or excluding oxiranes from the training set one can include oxirane as an additional building block. So, for example, the correction for for 2,2-dimethyloxirane is

$$ \text{Corr(CC1(C)OC1) } = \text{Error(C1OC1)}+2 \text{Error(CC)}-2 \text{Error(C)} $$

and the corresponding Dev is only 5.8 kcal/mol.

**Geometries/gradients**Getting gradients using this approach is a little tricky but doable. But a better approach is to parameterize a separate semiempirical method for geometries as proposed by Grimme.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.