## Thursday, June 16, 2016

### Reviews for Prediction of pKa values using the PM6 semiempirical method

2016.06.21 Update: Here's our response

Reviews of our latest PeerJ submission is in after only 15 days.  This must be some kind of record!

Personal comments from the editor:

Table 4 shows dramatic differences between PM6-D3H+ and PM6 although the previous tables did not show very large differences between both semiempirical methods. Please discuss this.

How do the errors in PM6 or PM6-D3H+ gas-phase protonation energies (vs. experiment or high level computation) change when moving from primary to secondary and tertiary amines? I believe that the addition of a table with these data (with each tested amine treated separately) would be very helpful for the readers and future practitioners.

Reviewer 1 (Anonymous)

Basic reporting

This is an interesting manuscript, but a frustrating aspect is that the experimental pKa values used for comparison are not included for most compounds. These could easily be added to Table 1. In fact, the best solution would be to modify Table 2 to give the calculated pKa values from the various methods along with the experimental values. Also, the authors should refer somewhere to the very relevant PM6 pKa calculations by Jimmy Stewart given in http://openmopac.net/pKa_table.html.

Experimental design

In general this work is properly designed.

Validity of the findings

It would be very helpful if the authors would provide a figure comparing the calculated and experimental values, and include in the text the relevant equation with proper statistics (n, r2, s, F) along with the uncertainties for the slope & intercept. (See, e.g., the book by Shields & Seybold on this topic, or their WIRES article.)

Comments for the Author

After improvements, this manuscript will be of interest to many people attempting to calculate pKas, especially those dealing with high throughput applications.

Reviewer 2 (Anonymous)

Basic reporting

The paper is well-written and organised in a manner that was easy to read. I did find the background / literature research on the short side. Specifically, the isodesmic or proton exchange scheme was developed quite some time ago by various groups . See for example: (a) http://dx.doi.org/10.1063/1.1337862 (b) 10.1021/ct800335v and (c) 10.1021/jp107890p. These studies have laid out quite clearly the effectiveness of an isodesmic scheme for error cancellation, as well as its limitations (e.g. the need for a structurally similar reference with accurately known pKas). Another minor point is there should be a footnote to explain what "**" in Table 2 means.

Experimental design

The research question is well-defined, namely whether contemporary semi-empirical methods can provide cost-effective predictions of pKas. I do have a number of suggestions for improvement:

(1) Computational methods: It was not clear how the solvation free energies were computed - e.g. were these done on gas phase or solution phase optimised geometries? Strictly speaking, the gas and solution phase components of the solvation free energy should be computed on geometries optimised in the respective phases. How sensitive are the results to this choice?

(2) There is a lot of data condensed into the Tables which could actually be used to provide even deeper insights. For example, I would love to see a breakdown of the solution phase energies into the gas phase and solvation contributions as laid out in eqn (6). This would be useful for identifying the sources of errors especially for the outliers.

(3) The dataset molecules in Table 1 are structurally very similar (the substituents are mostly aliphatic groups). It would be interesting to see a more diverse selection of molecules (e.g. EWG and EDG) as the authors alluded to in their conclusion.

Validity of the findings

I think the conclusions are fair based on the results presented. However, I do recommend the authors consider my earlier suggestions to provide clearer insights as to why semi-empirical methods can sometimes fail badly even for isodesmic reactions. This will spur further research into improving these methods.

Reviewer 3 (Anonymous)

Basic reporting

- Line 132: change "can play and important role" to "can play an important role"

- Citations need to conform to the journal style thoughout: see, e.g., "taken from (Morgenthaler et al., 2007)" in Table 4 caption should be changed to "taken from Morgenthaler et al. (2007)"

- References in the bibliography need to be consistently formatted to journal guidelines

- Other groups have reported validation efforts for predicting pKa values using the PM6 method (see, e.g., Rayne et al. [2009], Juranić [2014], etc.). The authors should cite and incorporate the findings of all these prior PM6 pKa validation efforts into the current study to demonstrate an understanding of the prior literature in this field. Otherwise, it looks as though the authors are attempting to make their research appear more novel than it actually is.

Experimental design

- Experimental design is appropriate.

Validity of the findings

- The findings appear valid.

## Friday, May 6, 2016

### A cheminformatics problem: protonate with SMILES and InChI

I am teaching a python programming course and this is one of the projects I want to try this year.

The over all goal of the project is to write one or more programs that generates protonation states for a list of nitrogen containing molecules specified by name. The project uses SMILES, and maybe InChI, which you can read more about here.

Getting started
1. Write code that from this list molecules = ["CCN", "CNC", "CN(C)C"] generates this output

CCN
CC[NH3+]

C[N@H]C
C[N@@H]C
C[NH2+]C

C[N@](C)C
C[N@@](C)C
C[N@H+](C)C
C[N@@H+](C)C

2. Write code that from this list  molecules = ["C(C(=O)O)N"] generates this output (the order is not important)

C(C(=O)O)N
C(C(=O)O)[NH3+]
C(C(=O)[O-])N
C(C(=O)[O-])[NH3+]

3. C(C(=O)O)N is the amino acid glycine.  Extend this program to work for alanine, asparagine, aspartate, and lysine.  Use this site to get SMILES strings for these amino acids. Find a picture of asparagine and make sure you're treating the side-chain correctly.

The project
4. (optional) Figure out how to generate a file containing SMILES strings from a file containing names. The best way is probably bash.  Get inspiration here, here, and here.

5. Generate all possible protonation state SMILES for the molecules in Table 2 in this paper. If you completed step 4 you can use tools like https://pdftables.com/ to generate a file with the names.

6. Repeat for Table 1 and 3

7. (optional) The neutral form of the amino acids histidine and arginine side chain groups have tautomers. Generate SMILES for all tautomers (InChI might help you identify tautomers).

8. (optional) Do any of the molecules in step 5 and 6 have tautomers? If so generate SMILES for all tautomers.

Some code snippets to get you started

## Tuesday, May 3, 2016

### Enzyme design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal PeerJ.  It's ultimately related to making better enzymes so first some background.

Background
Enzymes are proteins that make certain chemical reactions go faster and nearly every complex molecule in your body is made by, or broken down by, enzymes.  But people have also started using enzymes in commercial products, for example in washing powder to break down oily stains at lower temperatures. This saves money on heating the water and, being proteins, the enzymes are biodegradable. So there is a lot of interest in designing new enzymes that build or break down new molecules efficiently. For example, there is a rather large company (Novozymes) near Copenhagen that does nothing but design, produce, and sell enzymes on an industrial scale.

Designing new enzymes currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new enzymes - a cost that is ultimately passed on to you and I as consumers. My long-term goal is to reduce the amount of trial-and-error by writing a computer program that can predict what changes you have to make to improve the enzyme before you ever make it in the lab.

I've had some modest success with a prototype program some years ago (you can find the papers here and here) for one enzyme.  But one of the many things we don't know is whether the method we base our approach on will work at all for other types of enzymes. The paper that just got published is a first small step in figuring this out.

The New Study
We've collected data for five other enzymes from other published papers that we trust reasonably well and tested two methods that are fast enough to design enzymes - one is the same method we used a few years ago and the other is a newer one that wasn't available to us before now.  The conclusion of our study is that the methods seem to work well enough for all but one system, and this system is different for the two methods.  This suggests that we can't just base future work on one method. We have to have both ready in case one of them fail.  We need to repeat the study for many other types of enzymes - I would say at least 15-20 more - and we need to improve the quality of the data so that we trust it completely, rather than "reasonably well".  In the paper we have extended an open invitation to other scientists to contribute to this effort.

## Saturday, April 16, 2016

### Computing pKa values for molecules with several ionizable groups

We're working on pKa prediction using semiempirical methods and need to compute pKa values  for molecules with several ionizable groups. Here are my current thoughts so far.

Background: one ionizable group
If there is only one tritrateable site
$$\mathrm{BH \rightleftharpoons B + H^+} \ \ \ K=\mathrm{\frac{[B][H^+]}{[BH]}}$$
then the fraction of $\mathrm{BH}$ molecules $f_{\mathrm{BH}}$ is $$\begin{split} f_{\mathrm{BH}} & =\mathrm{\frac{[BH]}{[B]+[BH]} } \\ & = \mathrm{\frac{[B]}{[B]}\frac{[BH]/[B]}{1+[BH]/[B]} } \\ & = \mathrm{\frac{[H^+]/K}{1+[H^+]/K} } \\ & = \mathrm{\frac{10^{p\textit{K}-pH}}{1+10^{p\textit{K}-pH}} } \end{split}$$ where $$\mathrm{pH = -log[H^+] \implies [H^+] = 10^{-pH}}$$
and similarly for $K$.

From this we can see that the pK value is the pH value for which $f_{\mathrm{BH}}$ = 1/2. So you compute the pK value from the standard free energy difference
$$\text{p}K =\left( \Delta G^\circ(\mathrm{B})+\Delta G^\circ(\mathrm{H^+})- \Delta G^\circ(\mathrm{BH})\right)/RT\ln(10)$$
and you're done.

Two ionizable groups
For a molecule with two titrateable groups ($\mathrm{HB_\alpha B_\beta H}$) and the following equilibrium constants $$\mathrm{HBBH \rightleftharpoons BBH + H^+} \ \ K_{\alpha1}$$ $$\mathrm{HBBH \rightleftharpoons HBB + H^+} \ \ K_{\beta1}$$ $$\mathrm{HBB \rightleftharpoons BB + H^+} \ \ K_{\alpha0}$$ $$\mathrm{BBH \rightleftharpoons BB + H^+} \ \ K_{\beta0}$$ The probability of, for example, $\mathrm{BBH}$ is $$f_{\mathrm{BBH}} =\mathrm{\frac{[BBH]}{[BB]+[BBH]+[HBB]+[HBBH]}= \frac{[BBH]}{\textit{P}}}$$ $f_{\mathrm{BBH}}$ can be rewritten in terms of pK values $$f_{\mathrm{BBH}} = \mathrm{\frac{[BBH]/[BB]}{\textit{P}/[BB]} = \frac{10^{p\textit{K}_{\beta0}-pH}}{\textit{P}/[BB]}}$$ where $$\mathrm{ \textit{P}/[BB] = 1+10^{p\textit{K}_{\alpha0}-pH}+10^{p\textit{K}_{\beta0}-pH}+ 10^{p\textit{K}_{\alpha0}+p\textit{K}_{\beta1}-2pH}}$$ Similarly, $$f_{\mathrm{HBB}} = \mathrm{\frac{10^{p\textit{K}_{\alpha0}-pH}}{\textit{P}/[BB]}}$$ and $$f_{\mathrm{HBBH}} = \mathrm{\frac{10^{p\textit{K}_{\alpha0}+p\textit{K}_{\beta1}-2pH}}{\textit{P}/[BB]}}$$  The apparent pK value of the $\alpha$ group ($\mathrm{p}K_{\alpha}$) is the pH at which its protonation probability $$f_\alpha =f_{\mathrm{HBB}} + f_{\mathrm{HBBH}}$$ is 1/2 and similarly for the $\beta$ group.  So compute the microscopic pK values (Eq 4-7), then $f_\alpha$ and $f_\beta$ as a function of pH, and then $\mathrm{p}K_{\alpha}$ and $\mathrm{p}K_{\beta}$

If one of the groups (say $\alpha$) titrates at a significantly lower pH than the other ($\mathrm{p}K_{\alpha1} << \mathrm{p}K_{\beta1}$) then $\mathrm{p}K_{\alpha}=\mathrm{p}K_{\alpha1}$ and $\mathrm{p}K_{\beta}=\mathrm{p}K_{\beta0}$ and it is not necessary to compute the free energy of $\mathrm{HBB}$, but it can be hard to determine this in advance.  Similarly, if there is no significant interaction between the sites then $\mathrm{p}K_{\alpha}=\mathrm{p}K_{\alpha1}=\mathrm{p}K_{\alpha0}$ and $\mathrm{p}K_{\beta}=\mathrm{p}K_{\beta1}=\mathrm{p}K_{\beta0}$ and one can skip one of the protonation states.

For $N$ ionizable groups one has to determine $2^N$ microscopic pKa values, which quickly gets out of hand if one has to do a conformational search for each protonation state and the molecule is large.

Related post
Generating protonation states and conformations

## Wednesday, April 13, 2016

### Reviewing for PeerJ: it's the little (and the not so little) things

I just did my first review for PeerJ and it was a real pleasure because there are a lot of "little things" that make your reviewing life easier:

1. Figures/tables are in the text and, get this, the captions are immediately above/below the corresponding figure/table.  Some other journals also do this, but not enough.

2. I annotate the mss in a pdf reader and usually this is a frustrating experience since the publisher generated pdf has all sorts of "quirks" that make highlighting and copying text hit and miss.  The previous pdf I reviewed turned every page with a figure into an image!  Annotating/copying in the PeerJ pdf worked flawlessly.

3. The pdf contained a 3 front pages with the due date, a summary of the review criteria, a link to the page with the supplementary material, and a link to the page where I should submit my review.  No hunting around for the email with the link! I teared up a little bit when I saw that.

Other "little things" include stuff like not having to rank the perceived importance or impact of the work on some bogus 1-10 scale, a strict policy on making the raw data available, and a button to click to make my review non-anonymous.

## Thursday, April 7, 2016

### Why I use Twitter: one scientist's perspective

Tomorrow I am giving a short talk on why and how I use Twitter to public relations people at the University.  I have 20 min + 10  min questions, but am aiming for a 10 min talk + 20 min questions.

## Tuesday, April 5, 2016

### ACS Omega is too expensive

Disclaimer: I applied to be a co-editor of this journal and was not selected.

ACS Omega has just announced its APCs: \$1500 (\$2000) under the ACS Authors Choice (CC-BY or CC-BY-NC-ND) license for members and an additional \$500 for non-members (since full membership costs$162, this is the real additional cost for one yearly publication).

The ACS Authors Choice license is not open access: under the ACS Authors Choice license you assign copyright to the ACS.  While
For non-commercial research and education purposes only, users may access, download, copy, display and redistribute articles as well as adapt, translate, text and data mine content contained in articles, ...
you still can't, for example, use a figure from such an article in a book chapter without the ACS permission.  Also, the ACS can, for example, sell your article or your figures.

So the cost for to publish OA is ACS Omega is $2000. That's more expensive than other impact neutral OA journals: PLoS One (\$1495), Scientific Reports (\$1495), F1000Research (\$1000), Rio Journal (\$850), PeerJ (\$100/author, Bio-only), and Royal Society Open Science (\$0, for now).

Since all journals are impact neutral and provide quick review (AFAIK) price is the main considerations and I see no reason to pay more for to publish in ACS Omega.