## Thursday, July 21, 2016

### Finding disordered residues in an NMR ensemble

Note to self: here's how you identified disordered residues in the NMR ensemble 2KCU.pdb

1. In Pymol: "fetch 2kcu"

2. Action > align > states (*/CA)

3. "save 2kcu_aligned.pdb, state=0"

4. In terminal: grep CA 2kcu_aligned.pdb > lis

5. python disorder.py

disorder.py (given below) calculates the standard deviation of the x, y, and z coordinate of each CA atom ($\sigma_{x,i}, \sigma_{y,i}, \sigma_{z,i})$. It then averages these three standard deviations for each CA atom $(\sigma_i)$.  To find outliers, it averages these values for the entire protein $(\langle \sigma_i \rangle)$ and computes the standard deviation of this average $(\sigma_{\langle \sigma_i \rangle})$. Any residues for which $\sigma_i > \langle \sigma_i \rangle + \sigma_{\langle \sigma_i \rangle}$ is identified as disordered.

Here I've colored the disordered residues red

Yes, I know: "the 1970's called and want their Fortran code back". How very droll.

## Tuesday, July 12, 2016

### Reproducing stats or verbose output from LINEST command in Excel or Google Sheet in Python

The python code above reproduces the output produced by the LINEST(y;x;true;true) command in Excel [LINEST(y,x,true,true) in Google Sheets] with a csv file as input.  In the csv file I have assumed that the x and y columns are labelled "x" and "y" respectively.  This page has a good explanation of the output (pdf).

## Monday, July 11, 2016

### 2nd Reviews for Prediction of pKa values using the PM6 semiempirical method

2016.07.15: Update: our rebuttal can be fund here.  Manuscript resubmitted.

The 2nd round of reviews on our latest PeerJ submission came in on July 7th.  The first round of round of reviews and a link to our response can be found here.

MINOR REVISIONS
Thank you for your efforts at addressing the reviewer's comments. In spite of that, the reviewers (and myself) still think that additional data should be moved from the Supporting Information to the main text as tables/graphs. Specifically:

-per reviewer 1's request, please include the ref. pKa data in table 1. The Supporting Material deposited in figshare is very complete, but it will enormously help the reader (and make your paper much more persusive at first reading) if the most salient pieces were included in the paper itself.

- contra reviewer 1's comment, I do acknowledge that p.799 of the quoted Stewart (2008) reference describes the pKa computation procedure which generated the data present in http://openmopac.net/pKa_table.html. This method (also used by Rayne) as well as the method by Juranic, however, do not computes pKa from the energy difference itself, but from an empirical fit of the O-H bond distances and approximate charges (or N and H charges, plus a dummy variable stating whether the amine is primary, secondary or terciary, for Juranic, 2014). These pKa computation approaches are therefore fundamentally different from the one used in your paper. Your references to this literature in the introduction, however, do not make this clear enough. Please improve this to clearly compare the competing methods for PM6-based pka computations to the your approach.

-Do include the statistical data regarding slope, R-squared and outliers. A motivated reader may easily graph the data you have computed (and which are present in the spreadsheet referred to in your figshare area), but your explanation and discussion would be much more readable, and certainly more persuasive, if you included those graphs, slopes and correlation coefficientes in the paper. That analysis shows more clearly than the aggregat tables exactly where PM6 affords better correlation/slope that even CBS-4B3 (pyridines), the identity of the outliers, how poorly all methods (even CBS-4B3) correlate to experimental pKa in amines (in spite of a seemingly low 0.2 MAD for CBS-4B3), etc.

-per reviewer 2's request (and also related to my previous request which I may not have worded clearly) please add data regarding the likely origin of the errors in the outliers: do they come from gas phase energies or the solvation? A simple comparison of the B3LYP gas-phase energy changes (on PM6-optimized geometries, to reduce computational effort) and solvation effects might be enough to tell whether the gas-phase acidities (and/or solvation) of PM6 generally track the DFT results.

Reviewer 1 (Anonymous)

The authors have not adequately responded to any of the concerns raised in my original review. My original comments are shown first. The authors’ responses are shown next; and my further responses to them are shown below.

(1) Basic reporting
This is an interesting manuscript, but a frustrating aspect is that the experimental pKa values used for comparison are not included for most compounds. These could easily be added to Table 1. In fact, the best solution would be to modify Table 2 to give the calculated pKa values from the various methods along with the experimental values.

Authors: The values are already provided in Supplementary Materials

The copy I received contained no reference at all to “Supplementary Materials”. If these materials are available directions for accessing them should be clearly presented in the normal position just before the References.

(2) Also, the authors should refer somewhere to the very relevant PM6 pKa calculations by Jimmy Stewart given in http://openmopac.net/pKa_table.html.

Authors: We already refer to this approach in the introduction (Stewart 2008).

The reference Stewart (2008) concerns proteins and has nothing at all to do with pKa estimates. As clearly indicated, the relevant Stewart study is not a formal publication, but has been made widely available to workers by Stewart on the web page as indicated. Apparently the authors didn’t even bother to look at it.

(3) Validity of the findings
It would be very helpful if the authors would provide a figure comparing the calculated and experimental values, and include in the text the relevant equation with proper statistics (n, r2, s, F) along with the uncertainties for the slope & intercept. (See, e.g., the book by Shields & Seybold on this topic, or their WIRES article.)

Authors: The statistical analysis the reviewer refers to is done in the context of a QSAR prediction of pKa from QM data, i.e. to gauge the accuracy a linear fit to be used in the prediction of unknown pKa values. The statistics used in this paper is just aimed at gauging the accuracy of the predicted values and, in our opinion, is more than adequate for the task. If the reviewer can explain how the requested statistics is to be used in the context of the current paper we will be happy to reconsider the request.

This is a standard way to compare not just QSAR results, but any studies in this field. It would be helpful, and I don’t understand the authors’ reluctance to include it.

Annotated manuscript
The reviewer has also provided an annotated manuscript as part of their review:

Reviewer 2 (Anonymous)

I thank the authors for the revised manuscript. I would still like the authors to address my second point as to what is the major source of error in these calculations, especially for the outliers. Is it the gas phase energies, or the solvation component?

## Sunday, June 26, 2016

### Planned papers for 2016 - six months in

Pedro's post reminded me that mine was due

In January I wrote about the papers I plan to publish and made this list:

Submitted
None

Probable
1. Benchmarking of PM6 and DFTB3 for barrier heights computed using enzyme active site models.
2. pKa prediction using PM6 - part 1
3. Protein structure refinement using ProCS15 - starting from x-ray structure

Maybe
4. PM6 for all elements in GAMESS, including PCM interface
5. Protein structure refinement using ProCS15 - starting from 5 Å Cα RMSD
6. Vibrational effects on N amide chemical shifts
7. pKa prediction using PM6 - amines
8. Predicting binding free energies for CB7
9. Linear scaling HF-3c calculations by interface to FMO2 in GAMESS
10. Side chain chemical shift prediction with ProCS15
11. Rienstra-like chemical shift assignment in PHAISTOS

The status is

Published

Submitted
Received "minor revision" verdict.  Sent back in last week.

Probable
3. Protein structure refinement using ProCS15 - starting from x-ray structure
Actively working on it. First draft about 2/3 done. It's a huge amount of work because I'm still new to the field and learning as I am writing. However, I am fairly confident that I'll get it published in 2016.

7. pKa prediction using PM6 - amines
The main issue here is whether we can automate all steps of the pKa calculation and based on what we have learned so far I am pretty sure we can. The main issue is the protonation. Once paper 3 is done I will start working on this.  The CPU requirements are not an issue so I am fairly confident that I'll get it published in 2016.

Maybe in 2016
*. Improved prediction of chemical shifts using machine learning
This wasn't even on the drawing board in January.  Lars spent a few months in Anatole von Lilienfelds lab working on increasing the accuracy of the ProCS15 data set  using machine learning. Calculations are still ongoing so I am a little hesitant to list it under "probable".
A companion paper in Scientific Data has also been discussed,

Probably not in 2016
4. PM6 for all elements in GAMESS, including PCM interface
Jimmy is working on the first part.  The interface is there but debugging is really tough.  I think we will get it working in 2016 but getting a paper published this year is unlikely.

5. Protein structure refinement using ProCS15 - starting from 5 Å Cα RMSD
I think we will get most of the calculations done in 2016.

6. Vibrational effects on N amide chemical shifts
I had a visiting student working this. Bottom line: still no bug-free, black-box approach for computing vibrational that just works.  Much harder problem than I had anticipated.

8. Predicting binding free energies for CB7
Have some data but a long way to go yet.

9. Linear scaling HF-3c calculations by interface to FMO2 in GAMESS
This is actually working and being incorporated into the official version of GAMESS.  Not sure when we'll get around to generating data for a paper.

10. Side chain chemical shift prediction with ProCS15
Susanne is working on this but I doubt we will publish a paper on it in 2016.

11. Rienstra-like chemical shift assignment in PHAISTOS
Still just an idea so far.

## Thursday, June 16, 2016

### Reviews for Prediction of pKa values using the PM6 semiempirical method

2016.06.21 Update: Here's our response

Reviews of our latest PeerJ submission is in after only 15 days.  This must be some kind of record!

Table 4 shows dramatic differences between PM6-D3H+ and PM6 although the previous tables did not show very large differences between both semiempirical methods. Please discuss this.

How do the errors in PM6 or PM6-D3H+ gas-phase protonation energies (vs. experiment or high level computation) change when moving from primary to secondary and tertiary amines? I believe that the addition of a table with these data (with each tested amine treated separately) would be very helpful for the readers and future practitioners.

Reviewer 1 (Anonymous)

Basic reporting

This is an interesting manuscript, but a frustrating aspect is that the experimental pKa values used for comparison are not included for most compounds. These could easily be added to Table 1. In fact, the best solution would be to modify Table 2 to give the calculated pKa values from the various methods along with the experimental values. Also, the authors should refer somewhere to the very relevant PM6 pKa calculations by Jimmy Stewart given in http://openmopac.net/pKa_table.html.

Experimental design

In general this work is properly designed.

Validity of the findings

It would be very helpful if the authors would provide a figure comparing the calculated and experimental values, and include in the text the relevant equation with proper statistics (n, r2, s, F) along with the uncertainties for the slope & intercept. (See, e.g., the book by Shields & Seybold on this topic, or their WIRES article.)

After improvements, this manuscript will be of interest to many people attempting to calculate pKas, especially those dealing with high throughput applications.

Reviewer 2 (Anonymous)

Basic reporting

The paper is well-written and organised in a manner that was easy to read. I did find the background / literature research on the short side. Specifically, the isodesmic or proton exchange scheme was developed quite some time ago by various groups . See for example: (a) http://dx.doi.org/10.1063/1.1337862 (b) 10.1021/ct800335v and (c) 10.1021/jp107890p. These studies have laid out quite clearly the effectiveness of an isodesmic scheme for error cancellation, as well as its limitations (e.g. the need for a structurally similar reference with accurately known pKas). Another minor point is there should be a footnote to explain what "**" in Table 2 means.

Experimental design

The research question is well-defined, namely whether contemporary semi-empirical methods can provide cost-effective predictions of pKas. I do have a number of suggestions for improvement:

(1) Computational methods: It was not clear how the solvation free energies were computed - e.g. were these done on gas phase or solution phase optimised geometries? Strictly speaking, the gas and solution phase components of the solvation free energy should be computed on geometries optimised in the respective phases. How sensitive are the results to this choice?

(2) There is a lot of data condensed into the Tables which could actually be used to provide even deeper insights. For example, I would love to see a breakdown of the solution phase energies into the gas phase and solvation contributions as laid out in eqn (6). This would be useful for identifying the sources of errors especially for the outliers.

(3) The dataset molecules in Table 1 are structurally very similar (the substituents are mostly aliphatic groups). It would be interesting to see a more diverse selection of molecules (e.g. EWG and EDG) as the authors alluded to in their conclusion.

Validity of the findings

I think the conclusions are fair based on the results presented. However, I do recommend the authors consider my earlier suggestions to provide clearer insights as to why semi-empirical methods can sometimes fail badly even for isodesmic reactions. This will spur further research into improving these methods.

Reviewer 3 (Anonymous)

Basic reporting

- Line 132: change "can play and important role" to "can play an important role"

- Citations need to conform to the journal style thoughout: see, e.g., "taken from (Morgenthaler et al., 2007)" in Table 4 caption should be changed to "taken from Morgenthaler et al. (2007)"

- References in the bibliography need to be consistently formatted to journal guidelines

- Other groups have reported validation efforts for predicting pKa values using the PM6 method (see, e.g., Rayne et al. [2009], Juranić [2014], etc.). The authors should cite and incorporate the findings of all these prior PM6 pKa validation efforts into the current study to demonstrate an understanding of the prior literature in this field. Otherwise, it looks as though the authors are attempting to make their research appear more novel than it actually is.

Experimental design

- Experimental design is appropriate.

Validity of the findings

- The findings appear valid.

## Friday, May 6, 2016

### A cheminformatics problem: protonate with SMILES and InChI

I am teaching a python programming course and this is one of the projects I want to try this year.

The over all goal of the project is to write one or more programs that generates protonation states for a list of nitrogen containing molecules specified by name. The project uses SMILES, and maybe InChI, which you can read more about here.

Getting started
1. Write code that from this list molecules = ["CCN", "CNC", "CN(C)C"] generates this output

CCN
CC[NH3+]

C[N@H]C
C[N@@H]C
C[NH2+]C

C[N@](C)C
C[N@@](C)C
C[N@H+](C)C
C[N@@H+](C)C

2. Write code that from this list  molecules = ["C(C(=O)O)N"] generates this output (the order is not important)

C(C(=O)O)N
C(C(=O)O)[NH3+]
C(C(=O)[O-])N
C(C(=O)[O-])[NH3+]

3. C(C(=O)O)N is the amino acid glycine.  Extend this program to work for alanine, asparagine, aspartate, and lysine.  Use this site to get SMILES strings for these amino acids. Find a picture of asparagine and make sure you're treating the side-chain correctly.

The project
4. (optional) Figure out how to generate a file containing SMILES strings from a file containing names. The best way is probably bash.  Get inspiration here, here, and here.

5. Generate all possible protonation state SMILES for the molecules in Table 2 in this paper. If you completed step 4 you can use tools like https://pdftables.com/ to generate a file with the names.

6. Repeat for Table 1 and 3

7. (optional) The neutral form of the amino acids histidine and arginine side chain groups have tautomers. Generate SMILES for all tautomers (InChI might help you identify tautomers).

8. (optional) Do any of the molecules in step 5 and 6 have tautomers? If so generate SMILES for all tautomers.

Some code snippets to get you started

## Tuesday, May 3, 2016

### Enzyme design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal PeerJ.  It's ultimately related to making better enzymes so first some background.

Background
Enzymes are proteins that make certain chemical reactions go faster and nearly every complex molecule in your body is made by, or broken down by, enzymes.  But people have also started using enzymes in commercial products, for example in washing powder to break down oily stains at lower temperatures. This saves money on heating the water and, being proteins, the enzymes are biodegradable. So there is a lot of interest in designing new enzymes that build or break down new molecules efficiently. For example, there is a rather large company (Novozymes) near Copenhagen that does nothing but design, produce, and sell enzymes on an industrial scale.

Designing new enzymes currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new enzymes - a cost that is ultimately passed on to you and I as consumers. My long-term goal is to reduce the amount of trial-and-error by writing a computer program that can predict what changes you have to make to improve the enzyme before you ever make it in the lab.

I've had some modest success with a prototype program some years ago (you can find the papers here and here) for one enzyme.  But one of the many things we don't know is whether the method we base our approach on will work at all for other types of enzymes. The paper that just got published is a first small step in figuring this out.

The New Study
We've collected data for five other enzymes from other published papers that we trust reasonably well and tested two methods that are fast enough to design enzymes - one is the same method we used a few years ago and the other is a newer one that wasn't available to us before now.  The conclusion of our study is that the methods seem to work well enough for all but one system, and this system is different for the two methods.  This suggests that we can't just base future work on one method. We have to have both ready in case one of them fail.  We need to repeat the study for many other types of enzymes - I would say at least 15-20 more - and we need to improve the quality of the data so that we trust it completely, rather than "reasonably well".  In the paper we have extended an open invitation to other scientists to contribute to this effort.