## Monday, August 22, 2016

### Finding the reference molecule for a pKa calculation using RDKit

This is prototype code related to this project.  I use Histamine as an example which has two ionizable sites: a primary amine and an imidazole ring.

The code figures out that imidazole is the best reference for the imidazole ring, while ethylamine is the best reference for the primary amine.  The code does this by figuring out which atom is being deprotonated, computes the Morgan fingerprint around this atom, and compares it to the Morgan fingerprints of imizadole and ethylamine.

## Thursday, August 11, 2016

### Drug design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal PeerJ. It's ultimately related to making better drugs so first some background.

Background
Designing new drugs currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new drugs - a cost that is ultimately passed on to you and I as consumers.  There are many, many reasons why drug design is so difficult. One of them is that we often don't know fundamental properties of drug-candidates such as the charge of the molecule at a given pH. Obviously, it is hard to figure out whether or how a drug-candidate interact with the body if you don't even know whether it is postive, negative or neutral.

It is not too difficult to measure the charge at a given pH, but modern day drug design involves the screening of hundreds of thousands of molecules and it is simply not feasible to measure them all. Besides, you have to make the molecules to do the measurement, which may be a waster of time if it turn out to have the wrong charge. There are several computer programs that can predict the charge at a given pH very quickly but they have been known to fail quite badly from time to time.  The main problem it that these programs rely on a database of experimental data and if the molecule of interest doesn't resemble anything in the database this approach will fail. The paper that just got published is a first step towards coming up with an alternative.

The New Study
We present a "new" method for predicting the charge of a molecule that relies less on experimental data but it fast enough to be of practical use in drug design. The paper shows that the basic approach works reasonably well for small prototypical molecules and we even test one drug-like molecule where one of the commercial programs fail and show that our new method performs better (but not great).  However, we have to test this new method for a lot more molecules and in order to do this we need to automate the prediction process, which currently requires some "manual" labor, so this is what we're working on now.

This work is licensed under a Creative Commons Attribution 4.0

## Sunday, August 7, 2016

### Conformer search with RDKit

I'm teaching myself how to use RDKit.  Here is code for conformer search using RDKit that also computes the energy of each conformer using the MMFF94 force field.

Comments welcome

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

## Monday, August 1, 2016

### Thoughts from the Gordon Research Conference on Computational Chemistry

Here are some of the things I took away from attenting the GRC on Computational Chemistry

Tweeting
The GRC had very strict "off-the-record" rules to encourage the presentation of unpublished results. However, most speakers devoted at least half their talks to published results and I and others - especially Marc van der Kamp - Tweetet some of these papers under the hashtag #compchemGRC.

Furthermore, I also explicitly waived my "off-the-record" rights at the beginning of my talk and encouraged Tweeting.  I also shared my slides on Twitter - before the conference and immediately before my talk.  Seeing these slides on Twitter, FX Coudert alerted me to the fact that PM6 is now fully implemented in CP2K, which is could be very useful for our work.

Open Access
I talked to a few people about my OA philosophy.  Here is what I put on my CV

"My publication policy since 2012:  If a paper has a shot at high impact journals such as JACS or PNAS then I will submit there. However, the majority of my papers are method development papers, which will be submitted to open access journals such as PLoS ONE or PeerJ as I fail to see a difference in impact between these journals and journals such as Journal of Chemical Theory and Computation and Journal of Computational Chemistry where I used to publish before."

However, it really doesn't have to be an all or nothing decision.  My best advice is one paper at a time.  Just try it once and see what happens.

For me "impact neutrality" has become just as important as OA.  It is so very liberating to just write down what I did and what I found rather than trying to put everything in the best possible light with elaborately constructed "technically-correct-but-not-really-telling-the-whole-story" paragraphs.

Reproduceability
Speakers usually show their "best" work at conferences and precious speaker time is generally not wasted on pitfalls and caveats. It is easy to get the impression that everything is going great for everyone else, while you are struggling with your own projects. Furthermore, when you see something potentially wonderful that you want to try but you just know from experience that it won't be so easy as the speaker makes it sound and, in fact, will be hard to reproduce from the published papers alone. (This is no reflection on any one particular speaker at the conference).

This general sentiment was shared by a number of people I talked to.  It's not a new problem but I do believe it is a  growing one in part because research projects are getting more complex making it nearly impossible to describe all steps in sufficient detail to make it reproduceable. The only solution is, in my opinion, to make everthing available as supplementary material. Tar the whole thing - input files, output files, submission and analysis scripts, spreadsheets, etc - and put it on a server such as Figshare.

Funding
The usual conference conversation starts with "Hey X, how are things going?", "Oh, fine, and you?", "Oh, fine."  But one person responded "Writing a lot of proposals and getting them rejected."  I really appreciated this honesty, and it makes me feel less bad about my own rejections. A few weeks ago I had a similar talk with another colleagues about the possibility of having no PhD students in the not-too-distant future and how this affects the choice of research projects one can take on.  I think a lot of scientists are going through the same type of thing and it is important to be open about it.

Co-vice chair election (This will only make sense to people who were at the meeting, and that's fine)
A few people asked me why I effectively withdrew from consideration just before the vote. The short answer is that it didn't know for sure who else was running, nor that the candidates would be split up in two group, until that morning. Had I thought a little faster, I probably could have gotten my name removed just in time. But I am not a fast thinker at the best of times and certainly not at 8:30 am.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

## Thursday, July 21, 2016

### Finding disordered residues in an NMR ensemble

Note to self: here's how you identified disordered residues in the NMR ensemble 2KCU.pdb

1. In Pymol: "fetch 2kcu"

2. Action > align > states (*/CA)
2016.08.07 update: the above command also aligns the tails.  Use "intra_fit (2kzn///6-158/CA)"

3. "save 2kcu_aligned.pdb, state=0"

4. In terminal: grep CA 2kcu_aligned.pdb > lis

5. python disorder.py

disorder.py (given below) calculates the standard deviation of the x, y, and z coordinate of each CA atom ($\sigma_{x,i}, \sigma_{y,i}, \sigma_{z,i})$. It then averages these three standard deviations for each CA atom $(\sigma_i)$.  To find outliers, it averages these values for the entire protein $(\langle \sigma_i \rangle)$ and computes the standard deviation of this average $(\sigma_{\langle \sigma_i \rangle})$. Any residues for which $\sigma_i > \langle \sigma_i \rangle + \sigma_{\langle \sigma_i \rangle}$ is identified as disordered.

Here I've colored the disordered residues red (haven't updated the picture based on Step 2-change yet)

Yes, I know: "the 1970's called and want their Fortran code back". How very droll.

This work is licensed under a Creative Commons Attribution 4.0

## Tuesday, July 12, 2016

### Reproducing stats or verbose output from LINEST command in Excel or Google Sheet in Python

The python code above reproduces the output produced by the LINEST(y;x;true;true) command in Excel [LINEST(y,x,true,true) in Google Sheets] with a csv file as input.  In the csv file I have assumed that the x and y columns are labelled "x" and "y" respectively.  This page has a good explanation of the output (pdf).

This work is licensed under a Creative Commons Attribution 4.0

## Monday, July 11, 2016

### 2nd Reviews for Prediction of pKa values using the PM6 semiempirical method

2016.07.15: Update: our rebuttal can be fund here.  Manuscript resubmitted.

The 2nd round of reviews on our latest PeerJ submission came in on July 7th.  The first round of round of reviews and a link to our response can be found here.

Editor's Comments
MINOR REVISIONS
Thank you for your efforts at addressing the reviewer's comments. In spite of that, the reviewers (and myself) still think that additional data should be moved from the Supporting Information to the main text as tables/graphs. Specifically:

-per reviewer 1's request, please include the ref. pKa data in table 1. The Supporting Material deposited in figshare is very complete, but it will enormously help the reader (and make your paper much more persusive at first reading) if the most salient pieces were included in the paper itself.

- contra reviewer 1's comment, I do acknowledge that p.799 of the quoted Stewart (2008) reference describes the pKa computation procedure which generated the data present in http://openmopac.net/pKa_table.html. This method (also used by Rayne) as well as the method by Juranic, however, do not computes pKa from the energy difference itself, but from an empirical fit of the O-H bond distances and approximate charges (or N and H charges, plus a dummy variable stating whether the amine is primary, secondary or terciary, for Juranic, 2014). These pKa computation approaches are therefore fundamentally different from the one used in your paper. Your references to this literature in the introduction, however, do not make this clear enough. Please improve this to clearly compare the competing methods for PM6-based pka computations to the your approach.

-Do include the statistical data regarding slope, R-squared and outliers. A motivated reader may easily graph the data you have computed (and which are present in the spreadsheet referred to in your figshare area), but your explanation and discussion would be much more readable, and certainly more persuasive, if you included those graphs, slopes and correlation coefficientes in the paper. That analysis shows more clearly than the aggregat tables exactly where PM6 affords better correlation/slope that even CBS-4B3 (pyridines), the identity of the outliers, how poorly all methods (even CBS-4B3) correlate to experimental pKa in amines (in spite of a seemingly low 0.2 MAD for CBS-4B3), etc.

-per reviewer 2's request (and also related to my previous request which I may not have worded clearly) please add data regarding the likely origin of the errors in the outliers: do they come from gas phase energies or the solvation? A simple comparison of the B3LYP gas-phase energy changes (on PM6-optimized geometries, to reduce computational effort) and solvation effects might be enough to tell whether the gas-phase acidities (and/or solvation) of PM6 generally track the DFT results.

Reviewer 1 (Anonymous)

Comments for the Author
Reviewer Comments – Reviewer 1
The authors have not adequately responded to any of the concerns raised in my original review. My original comments are shown first. The authors’ responses are shown next; and my further responses to them are shown below.

(1) Basic reporting
This is an interesting manuscript, but a frustrating aspect is that the experimental pKa values used for comparison are not included for most compounds. These could easily be added to Table 1. In fact, the best solution would be to modify Table 2 to give the calculated pKa values from the various methods along with the experimental values.

Authors: The values are already provided in Supplementary Materials

The copy I received contained no reference at all to “Supplementary Materials”. If these materials are available directions for accessing them should be clearly presented in the normal position just before the References.

(2) Also, the authors should refer somewhere to the very relevant PM6 pKa calculations by Jimmy Stewart given in http://openmopac.net/pKa_table.html.

Authors: We already refer to this approach in the introduction (Stewart 2008).

The reference Stewart (2008) concerns proteins and has nothing at all to do with pKa estimates. As clearly indicated, the relevant Stewart study is not a formal publication, but has been made widely available to workers by Stewart on the web page as indicated. Apparently the authors didn’t even bother to look at it.

(3) Validity of the findings
It would be very helpful if the authors would provide a figure comparing the calculated and experimental values, and include in the text the relevant equation with proper statistics (n, r2, s, F) along with the uncertainties for the slope & intercept. (See, e.g., the book by Shields & Seybold on this topic, or their WIRES article.)

Authors: The statistical analysis the reviewer refers to is done in the context of a QSAR prediction of pKa from QM data, i.e. to gauge the accuracy a linear fit to be used in the prediction of unknown pKa values. The statistics used in this paper is just aimed at gauging the accuracy of the predicted values and, in our opinion, is more than adequate for the task. If the reviewer can explain how the requested statistics is to be used in the context of the current paper we will be happy to reconsider the request.

This is a standard way to compare not just QSAR results, but any studies in this field. It would be helpful, and I don’t understand the authors’ reluctance to include it.

Annotated manuscript
The reviewer has also provided an annotated manuscript as part of their review:

Reviewer 2 (Anonymous)

Comments for the Author
I thank the authors for the revised manuscript. I would still like the authors to address my second point as to what is the major source of error in these calculations, especially for the outliers. Is it the gas phase energies, or the solvation component?