Saturday, May 2, 2015 Installing numpy and matplotlib and transferring files

note to self:

"kpm install pip"
"sudo apt-get install python-dev"
"sudo pip install numpy"  (takes a while)
"sudo apt-get install libfreetype6-dev libxft-dev "
"sudo easy_install matplotlib "

To transfer files from the VM to your computer:
move the file into the web directory (e.g. coordinates_end.png)
Go to the VM settings modal and find the assigned URL (e.g.
The file can be accessed at

If you want to transfer a .py file, change the extension to .txt

This work is licensed under a Creative Commons Attribution 4.0  

Saturday, April 25, 2015

The main reason I use OA? It makes my research better

When I started publishing OA the answer was "The people who paid for this research should be able to read about the results".  Now the answer is more complex and difficult to fit into 140 characters. Hence this blog post.

The OA movement has three important "side effects":
1. Pioneered by PLoS ONE, many OA journals have removed perceived "impact" as a review criterion
2. Pioneered by PLoS ONE, many OA journals are mega-journals where the appropriateness of the topic of the manuscript to the journal is not an acceptance criterion.
3. Most OA journals allow you to make your manuscript public prior to submission

I have found that points 1 and 2 have made my research much less risk-averse.  I can focus on truly challenging and long-term research questions without worrying whether or where I will be able to publish.  Before it was: "In order to do X I have to do Y and Z first, but where will I publish Y and Z?" or "If I manage to do X and Y this will sail into Journal Z". Now it is: "It's important to find out about X; let's try it and publish what we find".

I can share our work at any stage in any way I see fit.  We put all our manuscripts, MS and PhD theses on preprint servers such as arXiv and we get a lot of great feedback long before the "official reviews" arrive.

It still puzzles me when I see tweets like "Manuscript submitted to X!" and "Paper finally accepted in Y!!!" without a link to a preprint.  What's the information really communicated here?  "Hope to score a publication point soon!"? and "Another line on my CV!!"?

Of course "The people who payed for this research should be able to read about the results" is still a major factor but it has become so much more.


Wednesday, April 15, 2015

Megajournals: answers to six troubling questions

Jillian Buriak, editor-in-chief of the ACS journal Chemistry of Materials, recently asked six troubling questions related to the "explosive growth" of manuscripts published in megajournals that consider all technically sound papers irrespective of the originality or novelty of the proposed research.  Even more troubling, she did not provide any answers. No wonder: they are tough and often initially confusing questions. Either that, or rhetorical. Anyway, I have given them some thought and have come up with some answers.

"(i) How will the community of peer reviewers (all practicing scientists) handle the onslaught of review requests?"

I was initially confused by this question. In my experience, papers rejected at one journal are just resubmitted to another journal until they are published. My initial impression was that the high rejection rate of some journals actually leads to more reviewing overall.  And then it hit me! I should only review for select journals like Chemistry of Materials.  That should keep my review pile nice and low.  I know, I know, someone will eventually point out that in this way I only get to see papers deemed important by a single person at one of these journals.  But no, at Chemistry of Materials this important decision is made by two people. And because "importance" is a completely subjective decision this can be done in as little as 5-7 days!

"(ii) How do we as scientists manage, and sort through, the vast increase in number of published journal pages, let alone read them?"

Yes, this is a tricky one.  Computers and crowd-sourcing are no good at dealing with vast amounts of information. I think the only solution is to agree on a list of journals we, as chemists, will review for and submit to.  Also, there should be a "one-strike-and-you're-out" rule. If one of these journals on the list rejects your paper, that's it.  As long as the list is relatively short, this can be strictly enforced.  I mean, if two people say "this is not important" what are the odds that a third person will say any different?  This should keep the number of published papers down to a manageable number.  Well, or at least stop the increase.

"(iii) Some authors may be tempted by the apparent ease of publishing their work in a reports journal, but will these published reports make any lasting impact, or even be noticed?"

Of course it's tempting.  As an author I spend loads of time writing and re-writing the lasting impact parts of my manuscript.  It takes ages crafting sentences that make these speculations as spectacular as possible without being demonstrably untrue!  Pro-tip: a slap-dash literature search makes the novelty section so much easier to write, and if "caught" you can honestly say that these key previous studies had escaped your attention.  But remember, at this stage you have already gotten past the editor - sorry, editors - so it's well worth the "risk".  Anyway, it's worth it.  Everyone knows that anything of lasting impact is only published in the most selective journals.  And what self-respecting scientist discovers papers by search algorithms these days?  Especially now that one can easily peruse the tables of contents - now with delightfully quirky graphics - online!

"(iv) Is the goal of serving the public good, of doing high quality science with taxpayer funds, being diluted by the ambition of increasing one’s own publication metrics (i.e., sheer number of publications)?"

Yes, a paper is not primarily a way to share results, it's primarily a way to keep score.  Why would the taxpayer care about papers that suggest a promising idea doesn't work?  Or that an important published study can't be replicated? The taxpayer cares first and foremost about the individual careers of scientists, and this is best judged by the number of papers an individual has published - but only if a handful of people have said it's important.  This is why it is so important where it is published. So when we make the list I propose in (ii) we should be sure to rank the journals in order of importance! I suggest a combination of collective gut feeling + some kind of average based on select citations.

"(v) Can the scientific community, which is already over-burdened, maintain high scientific standards under these conditions?"

No! While some people include "reproducibility" under "scientific standards", it is hardly a "high" scientific standard like "importance" or "novelty".  So, a scientific community that tolerates the publication of numerous replication studies - positive as well as negative - cannot be said to maintain a "high" scientific standard.  Importance and novelty are key.  This can only be done by publishing in a few select journals with that as a focus. Remember, these are subjective standards best left to a few "pros".

"(vi) How will the explosion of citations, including self-citations, skew existing metrics?"

Again, I was initially confused by this question.  But some Googling revealed that the primary function of citations is not really to refer to previous studies, but rather another way to score the impact of a particular paper or researcher.  Citation counts are incredibly field-dependent, and chemists have worked long and hard to establish a gut feeling for citation count vs impact; it would be a shame to lose this as more people start to share un-important information.  This is why the rules I have outlined under (i) and (ii) must be adhered to by all and strictly enforced by a select few.  Also, with some clever statistics and moderate data massaging, citations are a wonderful tool for ranking journals!


Friday, April 3, 2015

ProCS: Quantum Mechanics-based chemical shift predictions of backbone and CB atoms in proteins

We (Anders Larsen and +Lars Bratholm and myself) are in the process of turning Anders Larsen's MS thesis into a paper. This is a progress report. Note that any numbers in here are likely to change somewhat.

What is ProCS?
ProCS takes a protein structure and computes isotropic chemical shielding values of all backbone atoms as well as CB in less than a second.  ProCS is basically a look-up and interpolation function based on about 2.35 million PCM/OPBE/6-31G(d,p)//PM6 calculations on tripeptides.  In addition to this we correct select atoms for hydrogen bonding to H and HA, plus a ring-current correction for H atoms.
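The look-up-and-interpolation idea can be illustrated with a deliberately toy one-dimensional sketch (made-up shielding numbers and a single phi dihedral; the real ProCS tables span many more tripeptide degrees of freedom):

```python
import numpy as np

# Toy stand-in for the ProCS idea: shieldings precomputed on a grid of one
# backbone dihedral (phi), then interpolated for an arbitrary phi value.
phi_grid = np.arange(-180.0, 181.0, 20.0)                 # degrees
sigma_grid = 120.0 + 5.0 * np.cos(np.radians(phi_grid))   # made-up shieldings (ppm)

def predict_shielding(phi):
    """Linear interpolation of the precomputed grid (toy model, not ProCS itself)."""
    return np.interp(phi, phi_grid, sigma_grid)
```

Because the heavy quantum chemistry is done once, up front, each prediction is just a table look-up, which is what makes sub-second predictions possible.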

What is ProCS good for?
Using a previous incarnation of ProCS which only predicted H chemical shifts we have shown that one can improve hydrogen bond geometries in protein structures by finding structures that match measured chemical shifts. We further showed that using empirical chemical shift predictors did not result in improved hydrogen bond geometries.  We want to extend this to other aspects of the protein structure by including NMR chemical shifts for more nuclei.  Note that this chemical shift-based structure refinement requires many thousands of chemical shift predictions, so doing QM calculations on the entire protein is not feasible.

How well does ProCS reproduce QM calculations?
As an example, we optimize ubiquitin (1UBQ) with PCM/PM6-D3H+ and use the structure to compute PCM/OPBE/6-31G(d,p)//PM6 chemical shielding values, which we then compare to corresponding ProCS values computed using the same structure.

For CA, CB, and C atoms we get RMSD values of 2.0, 2.5, and 2.3 ppm with R values of 0.94, 0.98, and 0.74.  The RMSD for CA can be reduced to 1.6 ppm by subtracting an offset of 1.1 ppm, but accuracy for the other two atom types is not greatly improved by an offset or scaling + offset. For H and HA the RMSDs are 0.9 and 0.7 ppm and the R values are 0.86 and 0.84. The result for H can be improved somewhat (to 0.6 ppm and R = 0.89) by removing a clear outlier and scaling + offset. Finally, the RMSD and R for N is 7.2 and 0.85.  The RMSD can be improved considerably by adding an offset (5.2 ppm and 0.88) or scaling + offset (4.4 ppm) after removing a clear outlier.

From additional analysis we know that, with the exception of HA, more than 50% of the error comes from the way we approximate the effect of the side chains on the residues immediately next to the residue of interest in the sequence.

Much of the variation in some of the chemical shifts comes from the nature of the side-chain itself and the side chains before and after in the sequence, which can lead to inflated R values. To separate the contributions of the sequence and the structure we subtract the measured sequence-corrected random coil values from both the ProCS and QM results. The R values for CA, CB, C, HA, H, and N are now 0.76, 0.69, 0.71, 0.83, 0.89, and 0.75, respectively, while the RMSD values are much the same (after an offset correction).  It is interesting to note that CB went from being most to least predictive (in terms of R values), that the amide proton is now most predictive, and that N and CA are roughly equally predictive even though the former has a larger RMSD.  The reason is that the corrected carbon chemical shifts span a range of about 10 ppm, while the corrected N chemical shifts span a range of about 20 ppm.  For comparison, H and HA corrected chemical shifts span a range of about 4 ppm.

How well does ProCS reproduce experiment?
Before I try to answer that there are a few points to consider. One point is that we are comparing computed chemical shieldings to experimental chemical shifts, so it only makes sense to compare results that have been corrected by scaling + offset using linear regression (or possibly just an offset). Another point is that deviations from experiment result not only from deficiencies in the model, but also from deficiencies in the structure, including the fact that several conformations could contribute to the measured chemical shifts.
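The scaling + offset correction is just a linear least-squares fit of shifts to shieldings. A minimal sketch with hypothetical numbers (not our actual data) might look like:

```python
import numpy as np

# Hypothetical computed shieldings (sigma) and measured shifts (delta), in ppm
sigma = np.array([178.2, 175.9, 180.1, 176.5, 179.0])
delta = np.array([1.8, 4.0, 0.2, 3.5, 1.1])

# Scaling + offset via linear regression: delta ~ a*sigma + b.
# The slope a is negative, since shielding anti-correlates with shift.
a, b = np.polyfit(sigma, delta, 1)
pred = a * sigma + b

rmsd = np.sqrt(np.mean((pred - delta) ** 2))   # RMSD after correction
r = np.corrcoef(pred, delta)[0, 1]             # correlation coefficient
```

An offset-only correction corresponds to fixing a = -1 and fitting only b.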

The R (RMSD) values for CA, CB, C, HA, H, and N for ProCS vs Exp are:

0.90 (2.0)   0.99 (2.2)   0.38 (1.8)   0.66 (0.4)   0.33 (0.6)   0.75 (4.5)

While the PCM/OPBE/6-31G(d,p)//PM6-D3H+ values are

0.90 (2.1)   0.98 (3.0)   0.48 (2.7)   0.61 (0.8)   0.38 (1.25)  0.81 (5.5)

According to these R values ProCS performs roughly the same as QM for CA, CB, and HA and could be improved a bit for C, H, and N, but a greater limitation appears to be PCM/OPBE/6-31G(d,p)//PM6 itself for C and H and to some extent HA.  Here it is worth noting that Zhu, He, and Zhang have shown that R for H atoms can be greatly improved by including a shell of explicit water molecules in addition to the continuum.

The corresponding ProCS R values for the random-coil corrected chemical shifts are 0.56, 0.42, 0.29, 0.69, 0.32, and 0.41, which are quite comparable to the QM values: 0.55, 0.46, 0.39, 0.63, 0.36, and 0.49.  Notice that the R value for H is no longer inordinately lower than for some of the other atoms. The ProCS R value for C is close to being statistically insignificant (R < 0.23).

How sensitive is the agreement with experiment to the structure?  
A lot. Here are R values for ProCS vs experiment for random coil-corrected values for different choices of energy function used for the geometry optimization, for CA, CB, C, HA, H, and N (ProCS on the left, the corresponding QM values on the right):

            CA    CB    C     HA    H     N    |  CA    CB    C     HA    H     N
PM6-D3H+    0.56  0.42  0.29  0.69  0.32  0.41 |  0.55  0.46  0.39  0.63  0.36  0.49
PM6-DH+     0.61  0.40  0.37  0.71  0.33  0.44 |  0.62  0.39  0.42  0.64  0.42  0.51
AMBER       0.61  0.38  0.09  0.60  0.24  0.42 |  0.58  0.46  0.39  0.71  0.17  0.44
CHARMM      0.71  0.36  0.39  0.81  0.34  0.44 |  0.71  0.48  0.58  0.80  0.45  0.60
AMOEBA      0.50  0.36  0.34  0.56  0.44  0.36 |  0.48  0.39  0.54  0.65  0.51  0.47
X-ray       0.71  0.30  0.35  0.83  0.30  0.49 |  0.58  0.46  0.39  0.71  0.17  0.44

The value for C for AMBER is not a typo! Not sure what is going on there. The corresponding QM value is 0.39, so it looks like a bug.  Anyway, the structural dependence is a positive in the sense that experimental chemical shifts can potentially be used to improve the structure. In all cases the ProCS and QM calculations lead to quite similar R values, though the R values tend to be consistently higher for QM-predicted C chemical shifts.

How does ProCS compare to other chemical shift predictors?
I don't have all the data I need yet, so consider this a preview of a subsequent blog post. CheShift2, which is also a QM-based predictor of CA and CB chemical shifts, gives the following R values for CA and CB:

            CA    CB
AMBER       0.53  0.38
CHARMM      0.68  0.61
AMOEBA      0.33  0.44
X-ray       0.59  0.47

So based on this preliminary data it appears that ProCS tends to do better for CA, while CheShift2 does better for CB.

Open questions
How does ProCS compare to empirical shift predictors such as Camshift, SHIFTX, and SPARTA?
How do the numbers presented in this blog post compare to corresponding numbers for other proteins?
What's the effect of averaging over different (side chain) conformers?
Is it possible to find a "universal" set of scaling and offset parameters for each atom type for a given choice of optimized structure?  In this way, ProCS could predict chemical shifts instead of chemical shieldings.
What's the best way to use ProCS (or chemical shieldings in general) to judge the accuracy of a protein structure?  If this turns out to be the R value instead of the RMSD, then we only need chemical shieldings.

Comments welcome as always


Monday, March 30, 2015

Definition of Gly HA2 and HA3 nomenclature in PDB and BMRB files

Note to self:

HA2 is pro-R and HA3 is pro-S. HA2 corresponds to HA in the other amino acids.
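In code form this is just a small mapping (my own naming, not any standard library's API):

```python
# Gly HA nomenclature in PDB/BMRB files (hypothetical helper, not a standard API)
GLY_HA_STEREO = {"HA2": "pro-R", "HA3": "pro-S"}

def gly_ha_equivalent(name):
    """HA2 plays the role that HA plays in the other amino acids."""
    return "HA" if name == "HA2" else name
```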

Thanks to +Lars Bratholm for figuring this out.


Saturday, March 28, 2015

Selecting full residues within a certain distance of another residue or atom in PyMOL

Note to self:

To select all atoms in a residue (plus any HETATMs) that is within 3 Å of any atom in residue 63 type:

select br. all within 3 of resi 63

To select all atoms in a residue (plus any HETATMs) that is within 3 Å of the CA atom in residue 63 type:

select br. all within 3 of 63/CA

If you want to exclude the HETATMs, type "pol." (short for polymer) instead of "all".


Thursday, March 26, 2015

Second reviews of the PCCP paper - almost there

PCCP writes: "I am pleased to inform you that your revised manuscript has been recommended for publication in Physical Chemistry Chemical Physics subject to further revision in line with the attached reports."

Referee: 1
Comments to the Author
Most of my comments have been addressed.

However, I am still concerned about the recommendations for treating low frequency vibrations.  The harmonic oscillator approximation is not appropriate for low frequencies because the entropy term goes to infinity.  As recommended by Grimme (2012), a hindered / free rotor may be a better treatment (for related methods and discussion see TRUHLAR, DG, J COMP CHEM 1991, 12, 266-270 DOI: 10.1002/jcc.540120217 and McClurg, RB, Flagan, RC, Goddard, WA JCP 1997, 106, 6675-6680 DOI: 10.1063/1.473664)

My comment about explicit water molecules also concerned entropy.  While including explicit water molecules may increase the CPU time, this can be overcome.  What extra CPU time cannot overcome is the fact that the explicit waters in reality can exchange with the bulk water, increasing their entropy compared to an RRHO calculation of their entropy.  It is best to include explicitly only tightly bound water molecules.

Immediate reactions
Point 1:
* The paper lists options for how to treat low frequency vibrations but there is not a specific recommendation because the issue as it applies to binding free energies has not been thoroughly compared.
* Grimme's approach is not derived from first principles and is therefore not a priori better than other approaches.  It will in some cases treat modes that are clearly stretch vibrations as free rotors.
* The free rotor entropy also goes to infinity as the frequency approaches zero.
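The last point can be checked numerically: with the standard RRHO vibrational entropy and a Grimme-style free-rotor entropy using the effective moment of inertia mu = h/(8 pi^2 nu), both terms grow without bound as the frequency goes to zero (a sketch with arbitrarily chosen frequencies, textbook formulas only):

```python
import math

h, kB, NA = 6.62607015e-34, 1.380649e-23, 6.02214076e23  # SI units
R, T = kB * NA, 298.15

def S_harmonic(nu):
    """RRHO entropy (J/mol/K) of one vibrational mode of frequency nu (Hz)."""
    x = h * nu / (kB * T)
    # S = R * [ x/(e^x - 1) - ln(1 - e^-x) ]
    return R * (x / math.expm1(x) - math.log(-math.expm1(-x)))

def S_free_rotor(nu):
    """Free-rotor entropy using Grimme's effective moment of inertia mu = h/(8 pi^2 nu)."""
    mu = h / (8 * math.pi**2 * nu)
    # S = R * [ 1/2 + ln( sqrt(8 pi^3 mu kB T) / h ) ]
    return R * (0.5 + math.log(math.sqrt(8 * math.pi**3 * mu * kB * T) / h))

c = 2.99792458e10  # cm/s, to convert wavenumbers (cm^-1) to Hz
for wn in (100.0, 10.0, 1.0, 0.1):
    nu = wn * c
    print(f"{wn:6.1f} cm-1: S_HO = {S_harmonic(nu):7.1f}, S_FR = {S_free_rotor(nu):7.1f} J/mol/K")
```

Both columns increase as the wavenumber drops; the free-rotor entropy diverges logarithmically, just more slowly than the harmonic-oscillator entropy.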

Point 2:
* Any such effect is included implicitly in the parameterization of the solvation free energy of H2O.
* Any error in this parameterization is largely cancelled by using the water cluster approach recommended in the paper.
* Bryantsev et al. have shown that using the water cluster approach leads to smooth convergence of solvation free energies of H+ and Cu2+ that are in good agreement with experiment.
