## Thursday, November 24, 2016

### Which method is more accurate? or Errors have error bars

This post is my attempt at distilling some of the information in two papers published by Anthony Nicholls (here and here). Anthony also very kindly provided some new equations, not found in the papers, in response to my questions.

Errors also have error bars
Say you have two methods, $A$ and $B$, for predicting some property and you want to determine which method is more accurate by computing the property using both methods for the same set of $N$ different molecules for which reference values are available. You evaluate the error (for example the RMSE) of each method relative to the reference values and compare. The point of this post is that these errors have uncertainties (error bars) that depend on the number of data points ($N$, more data less uncertainty) and you have to take these uncertainties into consideration when you compare errors.

The most common error bars reflect 95% confidence and that's what I'll use here.

The expression for the error bars assume a large $N$ where in practice "large" in this context means roughly 10 or more data points.  If you use fewer points or would like more accurate estimates please see the Nicholls papers for what to do.

Root-Mean-Square-Error (RMSE)
The error bars for the RMSE are asymmetric.  The lower and higher error bar on the RMSE for method $X$ $(RMSE_X)$ is
$$L_X = RMSE_X - \sqrt {RMSE_X^2 - \frac{{1.96\sqrt 2 RMSE_X^2}}{{\sqrt {N - 1} }}}$$
$$= RMSE_X \left( 1- \sqrt{ 1- \frac{1.96\sqrt{2}}{\sqrt{N-1}}} \right)$$

$$U_X = RMSE_X \left( \sqrt{ 1+ \frac{1.96\sqrt{2}}{\sqrt{N-1}}}-1 \right)$$

Mean Absolute Error (MAE)
The error bars for the MAE is also asymetric. The lower and higher error bar on the MAE for method $X$ $(MAE_X)$ is

$$L_X = MAE_X \left( 1- \sqrt{ 1- \frac{1.96\sqrt{2}}{\sqrt{N-1}}} \right)$$

$$U_X = MAE_X \left( \sqrt{ 1+ \frac{1.96\sqrt{2}}{\sqrt{N-1}}}-1 \right)$$

Mean Error (ME)
The error bars for the mean error are symmetric and given by
$$L_X = U_X = \frac{1.96 s_N}{\sqrt{N}}$$

where $s_N$ is the standard population deviation (e.g. STDEVP in Excel).

Pearson’s correlation coefficient, $\textbf{r}$
The first thing to check is whether your $r$ values themselves are statistically significant, i.e. $r_X > r_{significant}$ where

$$r_{significant} = \frac{1.96}{\sqrt{N-2+1.96^2}}$$

The error bars for the Pearson's $r$ value are asymmetric and given by
$$L_X = r_X - \frac{e^{2F_-}-1}{e^{2F_-}+1}$$
$$U_X = \frac{e^{2F_+}-1}{e^{2F_+}+1} - r_X$$

where

$$F_{\pm} = \frac{1}{2} \ln \frac{1+r_X}{1-r_X} \pm r_{significant}$$

Comparing two methods
If $error_X$ is some measure of the error, RMSE, MAE, etc, and $error_A > error_B$ then the difference is statistically significant only if

$$error_A - error_B > \sqrt {L_A^2 + U_B^2 - 2{r_{AB}}{L_A}{U_B}}$$

where $r_{AB}$ is the Pearson's $r$ value of method $A$ compared to $B$, not to be confused with $r_A$ which compares $A$ to the reference value.  Conversely, if this condition is not satisfied then you cannot say that method $B$ is not more accurate than method $A$ with 95% confidence because the error bars are too large.

Note also that if there is a high degree of correlation between the predictions ($r_{AB} \approx$ 1) and the error bars are similar in size $L_A \approx U_B$ then even small differences in error could be significant.

Usually one can assume that $r_{AB} > 0$ so if $error_A - error_B > \sqrt {L_A^2 + U_B^2}$ or $error_A - error_B > L_A + U_B$ then the difference is statistically significant, but it is better to evaluate $r_{AB}$ to be sure.

The meaning of 95% confidence
Say you compute errors for some property for 50 molecules using method $A$ ($error_A$) and $B$ ($error_B$) and observe that Eq 11 is true.

Assuming no prior knowledge on the performance of $A$ and $B$, if you repeat this process an additional 40 times using all new molecules each time then in 38 cases (38/40 = 0.95) the errors observed for method $A$ will likely be between $error_A - L_A$ and $error_A + U_A$ and similarly for method $B$. For one of the remaining two cases the error is expected to be larger than this range, while for the other remaining case it is expected to be smaller. Furthermore, in 39 of the 40 cases $error_A$ is likely larger than $error_B$, while $error_A$ is likely smaller than $error_B$ in the remaining case.

## Sunday, November 6, 2016

### Some useful PyMol commands

Here some PyMol commands I found useful while writing this paper.

Raytracing (pretty pictures)
ray

Selections
select br. all within 3 of 63/CA
select br. all within 3 of resi 63
select Ala63, br. all within 3 of resi 63
select br. all within 3 of 2kpp///82/
sele tail,  2kzn///142-147/
sele tail,  2KPP///1-7+91-114/

NMR ensembles set all_states, on
split_states your_object

Superimposing structures # superimpose protA residues 10-25 and 33-46 to protB residues 22-37 and 41-54:
pair_fit protA///10-25+33-46/CA, protB///22-37+41-54/CA

# superimpose ligA atoms C1, C2, and C4 to ligB atoms C8, C4, and C10, respectively:
pair_fit ligA////C1, ligB////C8, ligA////C2, ligB////C4, ligA////C3, ligB////C10

align cluster_lowe///13-25+36-105+111-141/CA,native///13-25+36-105+111-141/CA

Color using numbers in B-factor column
spectrum b, green_red, selection=n. CA,minimum=0.0, maximum=2
spectrum b, blue_white_red, selection=n. CA
spectrum b, blue_white_red, selection=n. CA,minimum=-1.37, maximum=1.37

spectrum b, blue_red, selection=test////CA
spectrum b, green_red, selection=n. CA,minimum=0.0, maximum=3.6

Label atoms
label n. CA and i. 44, "(%s%s, %s)" % (resn, resi, b)
label n. CA and i. 33+55, "(%s%s, %s)" % (resn, resi, b)
label n. CA and i. 2, "%s%s, %5.2f" % (resn, resi, b)

Get more digits on the distance measurement
set label_distance_digits, 2

Get the orientation data, which you can paste back in to restore orientation
get_view

Change cartoon representation (the first two commands go together)
alter 16:23/, ss='L'
rebuild

set ribbon_width, 8

create new_obj, cluster_lowe_all2
set cartoon_transparency, 0.5,new_obj

## Saturday, November 5, 2016

### Semi-automatic pKa prediction using QM

Here I outline how I automated the calculations for this paper.  The files and programs can be found in the smiles-code.zip file that's part of the supplementary material on Figshare.

Generating SMILES
I used PDF to Text to get a text version of Table 3 from this paper and extracted the first column with the names using Excel.  I removed some molecules and the superscripts on some of the names by hand and created a text file with a single name on each line.

I used a python script to convert the names to SMILES strings

python name2smiles.py table_3 > table_3.smiles

I used a python script to convert the SMILES string to 2D images

python smiles2png.py tabel_3.smiles

I used the Cover Flow option in Finder (Mac) to browse through the images looking for errors. Turned out that several SMILES strings included the HCl which I removed.  If I saw other tautomer possibilities I created those by hand (I need to automate this).  A handful of molecules had carboxyl group which I deprotonated by hand by changing the SMILES string (I need to automate this).

I used a python script to protonate the amines

python protam.py table_3.smiles > table_3+.smiles

This program creates all possible single protonation states of all nitrogen atoms (except amides) in the molecule.  I deleted some of the protonation states I didn't want.  For example, for histamine I want to compute the pKa of the amine group but not of the imidazole, so I deleted the line with the histamine SMILES string for the protonated imidazole.  For a few of the molecules I also needed SMILES for doubly protontaes molecules to I used protam.py and table_3+.smiles to create a  table_3++.smiles file also and extracted the molecules I needed.  These steps where repeated for the reference molecules as well.

File naming
The each line in the .smiles files is the name of the molecule followed by a SMILES strings, the name is used to construct the filename, and the subsequent workflow makes some assumptions about the filenames.  When I deprotonoate carboxyl group I add a "-" to the filename, e.g. "Phenylalanine-", in the table_3.smmiles and protoam adds a "+" to the filename, e.g.  "Phenylalanine-+".  Molecules in the table_3++.smiles file will have names like.e.g Procaine++. If more than one nitrogen is protonated protam.py adds "_1" and "_2", etc., e.g. "Histamine+_1".  When I make tautomers I add "_1" or "_A", etc., e.g. "Cimetidine+_A".  (So in principle you could have, e.g. "name_A_1", but that didn't happen in practise).  Some manual editing is required for everything to come out right.

Creating input files from SMILES
I use a python script to create sdf files from the SMILES strings, e.g.

python smiles2sdf.py table_3.smiles

The script creates 20 different conformations for each SMILES string. The sdf files are named xxx_y.sdf where xxx is the names in the .smiles file and y is an integer between 0 and 19.

I use OpenBabel to convert the .sdf files to input files.  For example for MOPAC

for x in *.sdf; do babel \$x \${x%.*}.mop -xf ../pm6-dh+.header; done

The header file contains the keyword and the charge is specfied as "charge=0". So I need to change the charge by,

sed -i 's/charge=0/charge=1/g' *+_*.mop
sed -i 's/charge=1/charge=2/g' *++*.mop
sed -i 's/charge=1/charge=0/g' *-+*.mop
sed -i 's/charge=0/charge=-1/g' *-_*.mop

I change the method by, for example,

sed -i 's/pm6-dh+/am1/g' *.mop

Computing the pKa values
Af all the jobs have run I extract energies from the output files. For MOPAC out files by

grep "FINAL HEAT" *.out > xx.energies
grep "CURRENT" *.out >> xx.energies

The last line is needed in case MOPAC doesn't converge, in which case I just extract the last energy.

For  GAMESS log files
grep "TOTAL FREE ENERGY IN SOLVENT" *.log > xx.energies

I use a phython script to compute the pKa values from the .energy files

MOPAC pKa values
python pka_morgan_list+sub.py xx.energies big_table_3.smiles > xx.pka

GAMESS pKa values
python pka_morgan_list+sub_gms.py xx.energies big_table_3.smiles > xx.pka

Here big_table_3.smiles all the .smiles files, including reference molecules, combined into one.

The python script finds the most lowest free energy for each protonation state and the appropriate reference molecule for each ionizable site.  The protonation state is defined by the "-" and "+"s so that the lowest free energy for protonated histamine is the lowest free energy found in the output files named "Histamine+_*".  The pKa is related to the free energy difference between e.g. "Histamine+" and "Histamine" or "Phenylalanine-+" and "Phenylalanine-" or  "Procaine++" and "Procaine+", i.e.

delta_G = energies[name+"+"] - energies[name]

The python script contains SMILES strings for all the reference molecules and the appropriate reference for a titration is the reference molecule with the largest substructure match around the titrating nitrogen.

Some future improvements
I need to automate the tautomer generation and I shouldn't hand-pick the protonation state.  For example, I should consider both possible single protonation states in histamine and have the program automatically use the lowest free energy.  This also means that for, e.g. phenylalanine I should consider both the neutral and zwitterionic protonation state, i.e. "Phenylalanine-+_A" and "Phenylalanine-+_B" and  pick the lowest energy.

### Generating coordinates for molecules with diverse set of elements

It looks like PM6 in GAMESS is working, but we need to test it for more elements.  Here's how I generated the MOPAC input files.  First I made a list of SMILES strings for for most of the elements, which I then converted to sdf files with smiles2coord.py.  The I used OpenBabel to convert to MOPAC input files.

The implementation is only done for RHF so the molecules need to be closed shell.  So, for example for Sc I used ScCl$_3$ For many of transition metals I couldn't really think of a closed shell molecule, so I didn't include those elements.  For the non-metals the Cactus server automatically adds hydrogen.  Cactus also interprets names like Sc as sulphur, so square brackets are needed.

### writing unicode csv files from Excel for Mac

update: ... or you can update Excel to the latest version (15.27) and then you get the option to save as CSV UTF-8 (thanks fxcoudert #twitterrules)

Special characters become corrupted when saving Excel files in the csv format.  There is apparently no easy fix for Mac, so here is what I ended up doing.

Save Excel file as UFT-16 Unicode Text.

Open the file in TextEdit.  Edit > Find > Find and Repace (OPTION-CMD-F)
In the search panel press OPTION-TAB and in the replace window type "," and press "all"
Save the extension to csv when you save the file.

You can also do it in vi
”:1,\$s/(Press CTRL+v) then (Press TAB)/,/g”

## Thursday, October 20, 2016

### Prediction of amine pKa values of drug-like molecules using semiempirical QM methods - take 2

I screwed up some of the calculations describes in this post so I am starting fresh.

In an earlier study we showed that pKa values could be computed fairly accurately using PM6-based methods using isodesmic reactions and appropriate reference compounds.  Amines were especially challenging.  The study used small molecules.  The next question, which I address here, is how this approach performs for actual drug molecules containing amine functional groups.  These are preliminary results and may contain errors.

The Molecules
I tool about 50 drug-like molecules with experimentally measured amine pKa values from Table 3 in the paper by Eckert and Klamt.  I moved some of the smaller molecules such as 2-methylbenzylamine, since they would differ very little from the corresponding reference molecules. The reference molecules are chosen to match the chemical environment of the nitrogen within a 2 bond radius as much as practically possible and I try match the ring-size if the nitrogen sits in a ring. I end up with 35 reference molecules.  I compute most of the reference pKa values using the ACE JChem pKa predictor.

Many of the molecules contain more than one ionizable group. Only the pKa values of the amine indicated in Eckert and Klamt's Table 3 are computed and the protonation states are prepared according standard pKa values.  For example, for phenylalanine the carboxyl group is deprotonated because the "standard" pKa values of a carboxyl group (e.g. in acetic acid) is lower than the standard pKa values of a primary amine (e.g. ethylamine). Notice that the cyanoguanidine group in cimetidine has a pKa value of about 0 and is therefore deprotonated when the imidazole group titrates.  Eckert and Klamt characterised the histamine pKa value of 9.7 as an amidine pKa and the thenyldiamine pKa as a pyridine pKa. This is corrected to a primary amine and tertiary amine, respectively.  Also, the experimental pKa values of morphine and niacin are changed to 8.2 and 4.2, respectively, while the remaining experimental pKa values are taken from Eckert and Klamt.

The Methods
I test the following methods: PM6-DH+, PM6, PM3 and AM1 with the COSMO solvation method using MOPAC, and PM3, AM1, and DFTB3 with the SMD solvation method in GAMESS.  Jimmy's getting quite close to finishing the PM6 implementation for heavy elements in GAMESS, but he still needs to interface it with PCM, so PM6/SMD calculations are not yet possible.  I use RDKit to generate 20 starting geometries for each protonation state and I optimise with the solvation method turned on.  I don't include RRHO free energy contributions and I use only the lowest free energy structure for the pKa calculations.

The Results
The results are shown in the figure above (based on "Table 1 soln").  The negative outlier seen for the COSMO-based method is cefradoxil and is due to proton transfer in the zwitterionic protonation state. Cefadroxil is also the negative outlier for DFTB3/SMD although the proton doesn't transfer. Proton transfer in zwitterions is also a common problem for DFT/continuum calculations and is due to deficiencies in the continuum solvent method, not the electronic structure method.  The good performance observed for PM3/SMD is thus due to fortuitous cancellation of error.  For the three other zwitterions among the molecules, niacin, phenylalanine, and tryptophan, no proton transfer is observed and the error is relatively small.

The AM1- and PM3-based methods perform best with RMSE values of 1.4-1.6 ± 0.3-0.4, that are statistically identical.* The null model has an RMSE of 1.8 ± 0.4 which, given the error bars, are statistically no worse than the AM1- and PM3-based method.

If the cefadroxil outlier is removed, the RMSE values for PM3/COSMO and AM1/COSMO drop to 1.0 ± 0.2 and 1.1 ± 0.3, while PM3/SMD and AM1/SMD are remain at 1.5 ± 0.4 and 1.6 ± 0.4.  So for this subset, the COSMO-based predictions can be said to outperform the SMD-based predictions, as well as the null model.

One of the main uses of pKa values is the prediction of the correct protonation state at physiological pH (7.4), i.e. is the predicted pKa above or below 7.4?  Here PM3/COSMO is performs best by getting it right 94% of the time, compared to 91%, 77%, and 91% for AM1/COSMO, PM3/SMD, and the null model.

Summary and Outlook
Overall the best method for pKa prediction of drug-like molecules is either PM3/SMD, AM1/SMD, or the null model with RMSE values of 1.5 ± 0.4, 1.6 ± 0.4, and 1.8 ± 0.4.  The corresponding RMSE values for PM3/COSMO and AM1/COSMO are very similar, but in general they can fail for certain types of zwitterions.

For a set of molecules that does not include such zwitterions then the best method is either PM3/COSMO and AM1/COSMO which deliver RMSE values of 1.0 ± 0.2 and 1.1 ± 0.3, respectively.  In this case, using gas phase geometries lead to slightly larger , but statistically identical, RMSE values of 1.4 ± 0.3 and 1.3 ± 0.3, respectively.

In this study I made sure that the suitable reference molecules where available for all molecules.  This will be difficult in the general case and it will be interesting to see how the accuracy is for these molecules.  Cases where no good reference molecule can be found can be flagged based on similarity scores.

I will now start working on a manuscript draft (you can follow along here if you're interested)

*The RMSE uncertainties are computed using equation 14 in this paper and the two methods are taken to be statistically different if their difference in RMSE is larger than the composite error described on page 105 of this paper.

## Monday, October 10, 2016

### Prediction of amine pKa values of drug-like molecules using semiempirical QM methods

2016.10.16 update: I used the wrong keyword in the MOPAC calculations so these numbers are not right.

In an earlier study we showed that pKa values could be computed fairly accurately using PM6-based methods using isodesmic reactions and appropriate reference compounds.  Amines were especially challenging.  The study used small molecules.  The next question, which I address here, is how this approach performs for actual drug molecules containing amine functional groups.  These are preliminary results and may contain errors.

The Molecules
I tool about 50 drug-like molecules with experimentally measured amine pKa values from Table 3 in the paper by Eckert and Klamt.  I moved some of the smaller molecules such as 2-methylbenzylamine, since they would differ very little from the corresponding reference molecules. The reference molecules are chosen to match the chemical environment of the nitrogen within a 2 bond radius as much as practically possible and I try match the ring-size if the nitrogen sits in a ring. I end up with 35 reference molecules.  I compute most of the reference pKa values using the ACE JChem pKa predictor.

Many of the molecules contain more than one ionizable group. Only the pKa values of the amine indicated in Eckert and Klamt's Table 3 are computed and the protonation states are prepared according standard pKa values.  For example, for Phenylalanine the carboxyl group is deprotonated because the "standard" pKa values of a carboxyl group (e.g. in acetic acid) is lower than the standard pKa values of a primary amine (e.g. ethylamine).  When preparing the protonation states I noticed that Eckert and Klamt mischaracterised the Histamine pKa value of 9.7 as an amidine pKa. This is corrected to a primary amine.

The Methods
I test the following methods: PM6-DH+, PM6, PM3 and AM1 with the COSMO solvation method using MOPAC, and PM3, AM1, and DFTB3 with the SMD solvation method in GAMESS.  Jimmy's getting quite close to finishing the PM6 implementation for heavy elements in GAMESS, but he still needs to interface it with PCM, so PM6/SMD calculations are not yet possible.  I use RDKit to generate 20 starting geometries for each protonation state and I optimise with the solvation method turned on.  I don't include RRHO free energy contributions and I use only the lowest free energy structure for the pKa calculations.

The Results
You can see the results here (Table 1). The COSMO-based predictions are very bad with MAD values of 13.1 - 15.4 pH units and the SMD-based predictions are OK with MAD values of 1.5 - 2.0 pH units. If I simply use the reference pKa values the MAD is 1.4 pH units.  The corresponding maximum ADs are 71.3 - 136.1, 7.3 - 9.9, and 6.5 pH units.

Inspection of the molecules with large AM1/SMD and PM3/SMD errors suggests that the two of the molecules (Cimetidine, Niacin) were prepared with incorrect protonation states.  The ACE JChem pKa predictor predicts that the pKa of the carboxyl group in Niacin is higher than that of the pyridine, while the pKa value of the nitrile-substituted guanidine group in Cimetidine is lower than at that of the imididazole. Similarly, the ACE JChem pKa predictor predicts that titrating amine in Thenyldiamine with pKa 8.9 is not the pyridine as indicated by Eckert and Klamt, but the tertiary amine.

After making these changes the MADs are 10.4 - 12.4, 1.3 - 1.7, and 1.4 pH values and the max ADs are 71.3 - 136.1, 3.3 - 9.9, and 6.5 pH units.

One of the main uses of pKa values is the prediction of the correct protonation state at physiological pH (7.4), i.e. is the predicted pKa above or below 7.4?  The COSMO-based predictions get this right 74 - 77% of the time, while the SMD-predictions get it right 75 - 87% of the time.  In this regard PM3/SMD is considerably worse than AM1/SMD and DFTB3/SMD despite the fact that the MAD for PM3/SMD is 0.1 pH units lower than AM1/SMD and 0.4 pH units lower than DFTB/SMD. Simply using the reference values gets the protonation state right 91% of the time.

Why are COSMO-based predictions so bad?
The largest error occurs for Cefadroxil where the proton transfers in the zwitterionic state for the COSMO method.  The remaining large errors involve molecules that have a +2 charge where the pKa is much too low.  It appears that the COSMO method severely underestimates the solvation energy of +2 ions.

Outlook
In this study I made sure that the suitable reference molecules where available for all molecules.  This will be difficult in the general case and it will be interesting to see how the accuracy is for these molecules.  Cases where no good reference molecule can be found can be flagged based on similarity scores.

I will now start working on a manuscript draft (you can follow along here if you're interested)