Proteins and Wave Functions: Prediction of amine pKa values of drug-like molecules using semiempirical QM methods

Monday, October 10, 2016

Prediction of amine pKa values of drug-like molecules using semiempirical QM methods

2016.10.20 update: please disregard this post and read this post instead.
2016.10.16 update: I used the wrong keyword in the MOPAC calculations so these numbers are not right.

In an earlier study we showed that pKa values could be computed fairly accurately using PM6-based methods using isodesmic reactions and appropriate reference compounds. Amines were especially challenging. The study used small molecules. The next question, which I address here, is how this approach performs for actual drug molecules containing amine functional groups. These are preliminary results and may contain errors.

The Molecules
I tool about 50 drug-like molecules with experimentally measured amine pKa values from Table 3 in the paper by Eckert and Klamt. I moved some of the smaller molecules such as 2-methylbenzylamine, since they would differ very little from the corresponding reference molecules. The reference molecules are chosen to match the chemical environment of the nitrogen within a 2 bond radius as much as practically possible and I try match the ring-size if the nitrogen sits in a ring. I end up with 35 reference molecules. I compute most of the reference pKa values using the ACE JChem pKa predictor.

Many of the molecules contain more than one ionizable group. Only the pKa values of the amine indicated in Eckert and Klamt's Table 3 are computed and the protonation states are prepared according standard pKa values. For example, for Phenylalanine the carboxyl group is deprotonated because the "standard" pKa values of a carboxyl group (e.g. in acetic acid) is lower than the standard pKa values of a primary amine (e.g. ethylamine). When preparing the protonation states I noticed that Eckert and Klamt mischaracterised the Histamine pKa value of 9.7 as an amidine pKa. This is corrected to a primary amine.

The Methods
I test the following methods: PM6-DH+, PM6, PM3 and AM1 with the COSMO solvation method using MOPAC, and PM3, AM1, and DFTB3 with the SMD solvation method in GAMESS. Jimmy's getting quite close to finishing the PM6 implementation for heavy elements in GAMESS, but he still needs to interface it with PCM, so PM6/SMD calculations are not yet possible. I use RDKit to generate 20 starting geometries for each protonation state and I optimise with the solvation method turned on. I don't include RRHO free energy contributions and I use only the lowest free energy structure for the pKa calculations.

The Results
You can see the results here (Table 1). ~~The COSMO-based predictions are very bad with MAD values of 13.1 - 15.4 pH units~~ and the SMD-based predictions are OK with MAD values of 1.5 - 2.0 pH units. If I simply use the reference pKa values the MAD is 1.4 pH units. The corresponding maximum ADs are ~~71.3 - 136.1~~, 7.3 - 9.9, and 6.5 pH units.

Inspection of the molecules with large AM1/SMD and PM3/SMD errors suggests that the two of the molecules (Cimetidine, Niacin) were prepared with incorrect protonation states. The ACE JChem pKa predictor predicts that the pKa of the carboxyl group in Niacin is higher than that of the pyridine, while the pKa value of the nitrile-substituted guanidine group in Cimetidine is lower than at that of the imididazole. Similarly, the ACE JChem pKa predictor predicts that titrating amine in Thenyldiamine with pKa 8.9 is not the pyridine as indicated by Eckert and Klamt, but the tertiary amine.

After making these changes the MADs are ~~10.4 - 12.4~~, 1.3 - 1.7, and 1.4 pH values and the max ADs are ~~71.3 - 136.1~~, 3.3 - 9.9, and 6.5 pH units.

One of the main uses of pKa values is the prediction of the correct protonation state at physiological pH (7.4), i.e. is the predicted pKa above or below 7.4? ~~The COSMO-based predictions get this right 74 - 77% of the time~~, while the SMD-predictions get it right 75 - 87% of the time. In this regard PM3/SMD is considerably worse than AM1/SMD and DFTB3/SMD despite the fact that the MAD for PM3/SMD is 0.1 pH units lower than AM1/SMD and 0.4 pH units lower than DFTB/SMD. Simply using the reference values gets the protonation state right 91% of the time.

~~Why are COSMO-based predictions so bad?~~
The largest error occurs for Cefadroxil where the proton transfers in the zwitterionic state for the COSMO method. The remaining large errors involve molecules that have a +2 charge where the pKa is much too low. It appears that the COSMO method severely underestimates the solvation energy of +2 ions.

Outlook

In this study I made sure that the suitable reference molecules where available for all molecules. This will be difficult in the general case and it will be interesting to see how the accuracy is for these molecules. Cases where no good reference molecule can be found can be flagged based on similarity scores.

I will now start working on a manuscript draft (you can follow along here if you're interested)