We (Anders Larsen, Lars Bratholm, and myself) are in the process of turning Anders Larsen's MS thesis into a paper. This is a progress report. Note that any numbers in here are likely to change somewhat.

**What is ProCS?**

ProCS takes a protein structure and computes isotropic chemical shielding values of all backbone atoms as well as CB in less than a second. ProCS is basically a look-up and interpolation function based on about 2.35 million PCM/OPBE/6-31G(d,p)//PM6 calculations on tripeptides. In addition, we correct select atoms for hydrogen bonding to H and HA and add a ring-current correction for H atoms.

**What is ProCS good for?**

Using a previous incarnation of ProCS, which only predicted H chemical shifts, we have shown that one can improve hydrogen bond geometries in protein structures by finding structures that match measured chemical shifts. We further showed that using empirical chemical shift predictors did not result in improved hydrogen bond geometries. We want to extend this to other aspects of the protein structure by including NMR chemical shifts for more nuclei. Note that this chemical shift-based structure refinement requires many thousands of chemical shift predictions, so doing QM calculations on the entire protein is not feasible.

**How well does ProCS reproduce QM calculations?**

As an example, we optimize ubiquitin (1UBQ) with PCM/PM6-D3H+ and use the structure to compute PCM/OPBE/6-31G(d,p)//PM6 chemical shielding values, which we then compare to corresponding ProCS values computed using the same structure.

For CA, CB, and C atoms we get RMSD values of 2.0, 2.5, and 2.3 ppm with R values of 0.94, 0.98, and 0.74. The RMSD for CA can be reduced to 1.6 ppm by subtracting an offset of 1.1 ppm, but the accuracy for the other two atom types is not greatly improved by an offset or scaling + offset. For H and HA the RMSDs are 0.9 and 0.7 ppm and the R values are 0.86 and 0.84. The result for H can be improved somewhat (to 0.6 ppm and R = 0.89) by removing a clear outlier and applying scaling + offset. Finally, the RMSD and R for N are 7.2 ppm and 0.85. The RMSD can be improved considerably by adding an offset (5.2 ppm and R = 0.88) or scaling + offset (4.4 ppm) after removing a clear outlier.
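The offset correction can be sketched in a few lines of Python. This is only a minimal illustration, not the analysis code behind the numbers above: the function names and the five shielding values are made up for the example.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation between two equal-length lists."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))

def pearson_r(x, y):
    """Pearson correlation coefficient R."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def remove_offset(pred, ref):
    """Subtract the mean signed error from pred; returns (corrected, offset)."""
    offset = sum(p - r for p, r in zip(pred, ref)) / len(pred)
    return [p - offset for p in pred], offset

# Hypothetical shieldings in ppm (NOT values from the 1UBQ calculations).
qm = [130.0, 132.5, 128.0, 135.0, 131.0]
procs = [131.0, 133.3, 129.4, 135.6, 132.2]

procs_shifted, offset = remove_offset(procs, qm)
print(f"offset = {offset:.2f} ppm")
print(f"RMSD raw = {rmsd(procs, qm):.2f}, after offset = {rmsd(procs_shifted, qm):.2f}")
print(f"R raw = {pearson_r(procs, qm):.3f}, after offset = {pearson_r(procs_shifted, qm):.3f}")
```

A constant offset cannot change R, so with a fixed set of points only the RMSD improves. A scaling + offset correction works the same way, with a least-squares line in place of the mean error.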

From additional analysis we know that, with the exception of HA, more than 50% of the error comes from the way we approximate the effect of the side chains on the residues immediately next to the residue of interest in the sequence.

Much of the variation in some of the chemical shifts comes from the nature of the side chain itself and the side chains before and after in the sequence, which can lead to inflated R values. To separate the contributions of the sequence and the structure we subtract the measured sequence-corrected random coil values from both the ProCS and QM results. The R values for CA, CB, C, HA, H, and N are now 0.76, 0.69, 0.71, 0.83, 0.89, and 0.75, respectively, while the RMSD values are much the same (after an offset correction). It is interesting to note that CB went from being most to least predictive (in terms of R values), that the amide proton is now most predictive, and that N and CA are roughly equally predictive even though the former has a larger RMSD. The reason is that the corrected carbon chemical shifts span a range of about 10 ppm, while the corrected N chemical shifts span a range of about 20 ppm. For comparison, H and HA corrected chemical shifts span a range of about 4 ppm.
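Here is a minimal sketch of the random coil subtraction, to show why it lowers R. The random coil table and all values below are made up, the predictions are assumed to already be on the same scale as the random coil values, and the neighbor (sequence) corrections to the random coil values are left out for brevity.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient R."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def subtract_random_coil(values, residues, random_coil):
    """Remove the residue-type contribution from each value."""
    return [v - random_coil[res] for v, res in zip(values, residues)]

# Hypothetical CA random coil values in ppm, one per residue type.
random_coil = {"ALA": 52.0, "GLY": 45.0, "SER": 58.0}
residues = ["ALA", "GLY", "SER", "ALA", "GLY", "SER"]

# Hypothetical ProCS and QM values: random coil value plus a small structural term.
procs = [53.0, 44.5, 58.5, 51.0, 45.5, 57.5]
qm = [52.5, 44.0, 59.0, 51.5, 44.5, 58.5]

r_raw = pearson_r(procs, qm)
r_corrected = pearson_r(
    subtract_random_coil(procs, residues, random_coil),
    subtract_random_coil(qm, residues, random_coil),
)
print(f"R before correction: {r_raw:.2f}")
print(f"R after correction:  {r_corrected:.2f}")
```

In this toy data the spread between residue types (13 ppm) dominates the raw correlation, so R is close to 1 before the correction and drops to 0.5 once only the structural term is left.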

**How well does ProCS reproduce experiment?**

Before I try to answer that, there are a few points to consider. One point is that we are comparing computed chemical shieldings to experimental chemical shifts, so it only makes sense to compare results that have been corrected by scaling + offset using linear regression (or possibly just an offset). Another point is that deviations from experiment result not only from deficiencies in the model, but also from deficiencies in the structure, including the fact that several conformations could contribute to the measured chemical shifts.
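A minimal sketch of the scaling + offset step, assuming an ordinary least-squares fit of experimental shifts against computed shieldings. The data and function names are hypothetical. Since a shift is roughly a reference shielding minus the shielding, the fitted slope should come out negative and close to -1.

```python
import math

def linear_fit(x, y):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson_r(x, y):
    """Pearson correlation coefficient R."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical CA shieldings and experimental shifts in ppm (not real data).
shieldings = [130.0, 126.5, 133.0, 128.0, 124.0]
exp_shifts = [55.2, 58.9, 52.1, 57.5, 61.0]

slope, intercept = linear_fit(shieldings, exp_shifts)
pred_shifts = [slope * s + intercept for s in shieldings]

r_shielding = pearson_r(shieldings, exp_shifts)  # negative: shielding up, shift down
r_shift = pearson_r(pred_shifts, exp_shifts)     # same magnitude, positive sign
print(f"slope = {slope:.2f}, intercept = {intercept:.1f} ppm")
print(f"R (shielding vs exp) = {r_shielding:.3f}, R (fitted shift vs exp) = {r_shift:.3f}")
```

The magnitude of R is the same before and after the fit, so R can be computed directly from shieldings, while an RMSD against experiment only makes sense after the conversion.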

The R (RMSD in ppm) values vs experiment for ProCS and for PCM/OPBE/6-31G(d,p)//PM6-D3H+ (QM) are:

| Method | CA | CB | C | HA | H | N |
|---|---|---|---|---|---|---|
| ProCS | 0.90 (2.0) | 0.99 (2.2) | 0.38 (1.8) | 0.66 (0.4) | 0.33 (0.6) | 0.75 (4.5) |
| QM | 0.90 (2.1) | 0.98 (3.0) | 0.48 (2.7) | 0.61 (0.8) | 0.38 (1.25) | 0.81 (5.5) |

According to these R values, ProCS performs roughly the same as QM for CA, CB, and HA and could be improved a bit for C, H, and N, but a greater limitation appears to be PCM/OPBE/6-31G(d,p)//PM6 itself for C and H and, to some extent, HA. Here it is worth noting that Zu, He, and Zhang have shown that R for H atoms can be greatly improved by including a shell of explicit water molecules in addition to the continuum.

The corresponding ProCS R values for the random coil-corrected chemical shifts are 0.56, 0.42, 0.29, 0.69, 0.32, and 0.41, which are quite comparable to the QM values: 0.55, 0.46, 0.39, 0.63, 0.36, and 0.49. Notice that the R value for H is no longer inordinately lower than for some of the other atoms. The ProCS R value for C is close to being statistically insignificant (which here means R < 0.23).
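A cutoff of this size is consistent with the usual two-sided 5% significance test for a correlation coefficient. As a rough sketch, assuming about 72 data points (an assumption; ubiquitin has 76 residues and not every residue contributes a shift) and using the Fisher z-transformation as an approximation:

```python
import math
from statistics import NormalDist

def critical_r(n, alpha=0.05):
    """Approximate smallest |R| that is significant at level alpha (two-sided)
    for n data points, using the Fisher z-transformation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z / math.sqrt(n - 3))

n_points = 72  # assumed number of shifts, not taken from the data set
print(f"critical R for n = {n_points}: {critical_r(n_points):.2f}")
```

The cutoff depends on the number of shifts available for each atom type, so it shifts slightly from nucleus to nucleus.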

**How sensitive is the agreement with experiment to the structure?**

A lot. Here are the R values for ProCS vs experiment, using random coil-corrected values, for different choices of energy function for the geometry optimization. The values are listed for CA, CB, C, HA, H, and N, with the corresponding QM values next to them:

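| Structure | CA | CB | C | HA | H | N | CA (QM) | CB (QM) | C (QM) | HA (QM) | H (QM) | N (QM) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PM6-D3H+ | 0.56 | 0.42 | 0.29 | 0.69 | 0.32 | 0.41 | 0.55 | 0.46 | 0.39 | 0.63 | 0.36 | 0.49 |
| PM6-DH+ | 0.61 | 0.40 | 0.37 | 0.71 | 0.33 | 0.44 | 0.62 | 0.39 | 0.42 | 0.64 | 0.42 | 0.51 |
| AMBER | 0.61 | 0.38 | 0.09 | 0.60 | 0.24 | 0.42 | 0.58 | 0.46 | 0.39 | 0.71 | 0.17 | 0.44 |
| CHARMM | 0.71 | 0.36 | 0.39 | 0.81 | 0.34 | 0.44 | 0.71 | 0.48 | 0.58 | 0.80 | 0.45 | 0.60 |
| AMOEBA | 0.50 | 0.36 | 0.34 | 0.56 | 0.44 | 0.36 | 0.48 | 0.39 | 0.54 | 0.65 | 0.51 | 0.47 |
| X-ray | 0.71 | 0.30 | 0.35 | 0.83 | 0.30 | 0.49 | 0.58 | 0.46 | 0.39 | 0.71 | 0.17 | 0.44 |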

The value for C for AMBER is not a typo! Not sure what is going on there. The corresponding QM value is 0.39, so it looks like a bug. Anyway, the structural dependence is a positive in the sense that experimental chemical shifts can potentially be used to improve the structure. In all cases the ProCS and QM calculations lead to quite similar R values, though the R values tend to be consistently higher for QM-predicted C chemical shifts.
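The last observation can be checked directly from the table. A small sketch, with the R values copied from the rows above (the variable and function names are just for illustration):

```python
ATOMS = ["CA", "CB", "C", "HA", "H", "N"]

# (ProCS, QM) R values per structure, copied from the table above.
R_VALUES = {
    "PM6-D3H+": ([0.56, 0.42, 0.29, 0.69, 0.32, 0.41], [0.55, 0.46, 0.39, 0.63, 0.36, 0.49]),
    "PM6-DH+": ([0.61, 0.40, 0.37, 0.71, 0.33, 0.44], [0.62, 0.39, 0.42, 0.64, 0.42, 0.51]),
    "AMBER": ([0.61, 0.38, 0.09, 0.60, 0.24, 0.42], [0.58, 0.46, 0.39, 0.71, 0.17, 0.44]),
    "CHARMM": ([0.71, 0.36, 0.39, 0.81, 0.34, 0.44], [0.71, 0.48, 0.58, 0.80, 0.45, 0.60]),
    "AMOEBA": ([0.50, 0.36, 0.34, 0.56, 0.44, 0.36], [0.48, 0.39, 0.54, 0.65, 0.51, 0.47]),
    "X-ray": ([0.71, 0.30, 0.35, 0.83, 0.30, 0.49], [0.58, 0.46, 0.39, 0.71, 0.17, 0.44]),
}

def mean_difference(atom):
    """Mean of (QM R - ProCS R) over all structures for one atom type."""
    i = ATOMS.index(atom)
    diffs = [qm[i] - procs[i] for procs, qm in R_VALUES.values()]
    return sum(diffs) / len(diffs)

for atom in ATOMS:
    print(f"{atom:>2}: mean QM - ProCS = {mean_difference(atom):+.2f}")
```

For C the QM value is higher in all six rows, by about 0.15 on average.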

**How does ProCS compare to other chemical shift predictors?**

I don't have all the data I need yet, so consider this a preview of a subsequent blog post. CheShift2, which is also a QM-based chemical shift predictor for CA and CB chemical shifts, gives the following R values:

| Structure | CA | CB |
|---|---|---|
| AMBER | 0.53 | 0.38 |
| CHARMM | 0.68 | 0.61 |
| AMOEBA | 0.33 | 0.44 |
| X-ray | 0.59 | 0.47 |

So based on this preliminary data it appears that ProCS tends to do better for CA, while CheShift2 does better for CB.

**Open questions**

How does ProCS compare to empirical shift predictors such as Camshift, SHIFTX, and SPARTA?

How do the numbers presented in this blog post compare to the corresponding numbers for other proteins?

What's the effect of averaging over different (side chain) conformers?

Is it possible to find a "universal" set of scaling and offset parameters for each atom type for a given choice of optimized structure? In this way, ProCS could predict chemical shifts instead of chemical shieldings.

What's the best way to use ProCS (or chemical shieldings in general) to judge the accuracy of a protein structure? If this turns out to be the R value instead of the RMSD, then we only need chemical shieldings.

Comments welcome as always.

This work is licensed under a Creative Commons Attribution 4.0