Proteins and Wave Functions: April 2015

Saturday, April 25, 2015

The main reason I use OA? It makes my research better

Going to hold a talk on #openaccess in some weeks. Whats the main reason you use #OA? Please RT or MT. Would be cool to show your tweets!
— Steven Ray Wilson (@StevenRayOslo) April 24, 2015

When I started publishing OA the answer was "The people who payed for this research should be able to read about the results". Now the answer is more complex and difficult to fit into 140 characters. Hence this blog post.

The OA movement has three important "side effects":
1. Pioneered by PLoS ONE, many OA journals have removed perceived "impact" as a review criterion
2. Pioneered by PLoS ONE, many OA journals are mega-journals where the appropriateness of the topic of the manuscript to the journal is not an acceptance criterion.
3. Most OA journals allow you to make your manuscript public prior to submission

I have found that points 1-2 has made my research much less risk-averse. I can focus on truly challenging and long-term research questions without worrying whether or where I will be able to publish. Before it was: "In order to do X I have to do Y and Z first, but where will I publish Y and Z?" or " If I manage to do X and Y this will sail in to Journal Z". Now it is "It's important to find out about X; let's try it and publish what we find".

I can share our work at any stage in any way I see fit. We put all our manuscripts, MS and PhD theses on preprint servers such as arXiv and we get a lot of great feedback long before the "official reviews" arrive.

It still puzzles me when I see tweets like "Manuscript submitted to X!" and "Paper finally accepted in Y!!!" without a link to a preprint. What's the information really communicated here? "Hope to score a publication point soon!"? and "Another line on my CV!!"?

Of course "The people who payed for this research should be able to read about the results" is still a major factor but it has become so much more.

This work is licensed under a Creative Commons Attribution 4.0

Wednesday, April 15, 2015

Megajournals: answers to six troubling questions

Jilian Buriak, editor-in-chief of the ACS journal Chemistry of Materials, recently asked six troubling questions related to the "explosive growth" of manuscripts published in mega journals that “consider all technically sound papers irrespective of originality or novelty of the proposed research.” Even more troubling, she did not provide any answers. No wonder. They are tough and often initially confusing questions. Either that or rhetorical. Anyway, I have given them some thought and have come up with some answers.

"(i) How will the community of peer reviewers (all practicing scientists) handle the onslaught of review requests?"

I was initially confused by this question. In my experience, papers rejected at one journal are just resubmitted to another journal until it is published. My initial impression was that the high rejection rate of some journals actually leads to more reviewing overall. And then it hit me! I should only review for select journals like Chemistry of Materials. That should keep my review pile nice and low. I know, I know, some one will eventually point out that in this way I only get to see papers deemed important by a single person at one of these journals. But no, at Chemistry of Materials this important decision is made by two people. And because "importance" is a completely subjective decision this can be done in as little as 5-7 days!

"(ii) How do we as scientists manage, and sort through, the vast increase in number of published journal pages, let alone read them?"

Yes, this is a tricky one. Computers and crowd-sourcing are no good at dealing with wast amounts of information. I think the only solution is to agree on a list of journals we, as chemists, will review for and submit to. Also, there should be a "one-strike-and-you're-out" rule. If one of these journals on the list rejects your paper, that's it. As long as the list is relatively short, this can be strictly enforced. I mean, if two people say "this is not important" what are the odds that a third person will say any different? This should keep the number of published papers down to manageable number. Well, or at least stop the increase.

"(iii) Some authors may be tempted by the apparent ease of publishing their work in a reports journal, but will these published reports make any lasting impact, or even be noticed?"

Of course it's tempting. As an author I spend loads of time writing and re-writing the lasting impact parts of my manuscript. It takes ages crafting sentences that makes these speculations as spectacular as possible without being demonstrably untrue! Pro-tip: a slap-dash literature search makes the novelty section so much easier to write and if "caught" you can honestly say that these key previous studies had escaped your attention. But remember, at this stage you have already gotten past the editor - sorry editors - so it's well worth the "risk". Anyway, it's worth it. Everyone knows that anything of lasting impacts are only published in the most selective journals. And what self-respecting scientist discovers papers by search algorithms these days? Especially now that one can easily peruse the table of contents - now with delightfully quirky graphics - online!

"(iv) Is the goal of serving the public good, of doing high quality science with taxpayer funds, being diluted by the ambition of increasing one’s own publication metrics (i.e., sheer number of publications)?"

Yes, a paper is not primarily a way to share results, it's primarily a way to keep score. Why would the taxpayer care about papers that suggest a promising idea doesn't work? Or that an important published study can't be replicated? The taxpayer cares first an foremost about the individual careers of scientists and this is best judged by the number of papers an individual has published - but only if a handful of people have said it's important. This is why it is so important where it is published. So when we make the list I propose in (ii) we should be sure to rank the journal in order of importance! I suggest a combination of collective gut feeling + some kind of average based on select citations.

"(v) Can the scientific community, which is already over-burdened, maintain high scientific standards under these conditions?"

No! While some people include "reproducibility" under "scientific standards" it is hardly a "high" scientific standard like "importance" or "novelty". So, a scientific community that tolerates the publication of numerous replication studies - positive as well as negative - can not be said to maintain a "high" scientific standard. Importance and novelty are key. This can only be done by publishing in a few select journals with that as a focus. Remember, these are subjective standards best left to a few "pros".

"(vi) How will the explosion of citations, including self-citations, skew existing metrics?"

Again, I was initially confused by this question. But some Googling revealed that the primary function of citations is not really to refer to previous studies, but another way to score the impact of a particular paper or researcher. Citation count is incredibly field-dependent and chemists have worked long and hard to establish a gut feeling for citation count vs impact and it would be a shame to loose this as more people start to share un-important information. This is why the rules I have outlined under (i) and (ii) must be adhered to by all and strictly enforced by a select few. Also, with some clever statistics and moderate data massaging, citations are a wonderful tool to rank journals!

This work is licensed under a Creative Commons Attribution 4.0

Friday, April 3, 2015

ProCS: Quantum Mechanics-based chemical shift predictions of backbone and CB atoms in proteins

We (Anders Larsen and +Lars Bratholm and myself) are in the process of turning Anders Larsen's MS thesis into a paper. This is a progress report. Note that any numbers in here are likely to change somewhat.

What is ProCS?
ProCS takes a protein structure and computes isotropic chemical shielding values of all backbone atoms as well as CB in less than a second. ProCS is basically a look-up and interpolation function based on about 2.35 million PCM/OPBE/6-31G(d,p)//PM6 calculations on tripeptides. In addition to this we correct select atoms for hydrogen bonding to H and HA and plus a ring-current correction for H atoms.

What is ProCS good for?
Using a previous incarnation of ProCS which only predicted H chemical shifts we have shown that one can improve hydrogen bond geometries in protein structures by finding structures that match measured chemical shifts. We further showed that using empirical chemical shift predictors did not result in hydrogen bond geometries. We want to extend this to other aspects of the protein structure by including NMR chemical shifts for more nuclei. Note that this chemical shift-based structure refinement requires many thousands of chemical shift predictions, so doing QM calculations on the entire protein is not feasible.

How well does ProCS reproduce QM calculations?
As an example, we optimize ubiquitin (1UBQ) with PCM/PM6-D3H+ and use the structure to compute PCM/OPBE/6-31G(d,p)//PM6 chemical shielding values, which we then compare to corresponding ProCS values computed using the same structure.

For CA, CB, and C atoms we get RMSD values of 2.0, 2.5, and 2.3 ppm with R values of 0.94, 0.98, and 0.74. The RMSD for CA can be reduced to 1.6 ppm by subtracting an offset of 1.1 ppm, but accuracy for the other two atoms types is not greatly improved by an offset or scaling + offset. For H and HA the RMSDs are 0.9 and 0.7 ppm and the R values are 0.86 and 0.84. The result for H can be improved somewhat (to 0.6 ppm and R = 0.89) by removing a clear outlier and scaling + offset. Finally, the RMSD and R for N is 7.2 and 0.85. The RMSD can be improved considerably by adding and offset (5.2 ppm and 0.88) or scaling + offset (4.4 ppm) after removing a clear outlier.

From additional analysis we know that, with the exception of HA, more than 50% of the error comes from the way we approximate the effect of the side chains on the residues immediately next to the residue of interest in the sequence.

Much of the variation in some of the chemical shifts comes from the nature of the side-chain itself and the side chains before and after in the sequence, which can lead to inflated R values. To separate the contributions of the sequence and the structure we subtract the measured sequence corrected random coil values from both the ProCs and QM results. The R values for CA, CB, C, HA, H, and N are now 0.76, 0.69, 0.71, 0.83, 0.89, and 0.75, respectively, while the RMSD values are much the same (after an offset correction). It is interesting to note that CB went from being most to least predictive (in terms of R values), that the amide proton is now most predictive, and that N and CA are roughly equally predictive even though the former has a larger RMSD. The reason is that the corrected carbon chemical shifts span a range of about 10 ppm, while the corrected N chemical shifts span a range of about 20 ppm. For comparison, H and HA corrected chemical shifts span a range of about 4 ppm.

How well does ProCS reproduce experiment?
Before I try to answer that there are a few point to consider. One point is that we are comparing computed chemical shieldings to experimental chemical shifts so it only makes sense to compare results that have been corrected by scaled + offset using linear regression (or possibly just an off-set). Another point is that deviation from experiment result not only for deficiencies in the model, but also deficiencies in the structure including the fact that several conformation could contribute to the measured chemical shifts.

The R (RMSD) values for CA, CB, C, HA, H, and N for ProCS vs Exp are:

0.90 (2.0) 0.99 (2.2) 0.38 (1.8) 0.66 (0.4) 0.33 (0.6) 0.75 (4.5)

While the PCM/OPBE/6-31G(d,p)//PM6-DH3+ values are

0.90 (2.1) 0.98 (3.0) 0.48 (2.7) 0.61 (0.8) 0.38 (1.25) 0.81 (5.5)

According to these R values ProCS performs roughly the same as QM for CA, CB, and HA and could be improved a bit for C, H, and N, but a greater limitation appears to be PCM/OPBE/6-31G(d,p)//PM6 itself for C and H and some extent HA. Here it is worth noting that Zu, He, and Zhang have shown that R for H atoms can be greatly improved by including a shell of explicit water molecules in addition to the continuum.

The corresponding ProCS (QM) R values for the random-coil corrected chemical shifts are: 0.56, 0.42, 0.29, 0.69, 0.32, and 0.41, which are quite comparable to the QM values: 0.55, 0.46, 0.39, 0.63, 0.36, and 0.49. Notice that the R value for H no longer is inordinately lower than for some of the other atoms. The ProCS R value for C is close to being statistically insignificant (R < 0.23).

How sensitive is the agreement with experiment to the structure?
A lot. Here are R values for ProsCS vs experiment for random coil-corrected values for choices of energy function for the geometry optimization for CA, CB, C, HA, H, and N (and next to that the corresponding QM values):

PM6-D3H+ 0.56 0.42 0.29 0.69 0.32 0.41 | 0.55 0.46 0.39 0.63 0.36 0.49
PM6-DH+ 0.61 0.40 0.37 0.71 0.33 0.44 | 0.62 0.39 0.42 0.64 0.42 0.51
AMBER 0.61 0.38 0.09 0.60 0.24 0.42 | 0.58 0.46 0.39 0.71 0.17 0.44
CHARMM 0.71 0.36 0.39 0.81 0.34 0.44 | 0.71 0.48 0.58 0.80 0.45 0.60
AMOEBA 0.50 0.36 0.34 0.56 0.44 0.36 | 0.48 0.39 0.54 0.65 0.51 0.47
X-ray 0.71 0.30 0.35 0.83 0.30 0.49 | 0.58 0.46 0.39 0.71 0.17 0.44

The value for C for AMBER is not a typo! Not sure what is going on there. The corresponding QM value is 0.39, so it looks like a bug. Anyway, the structural dependence is a positive in the sense that experimental chemical shifts potentially can be used to improve the structure. In all cases the ProCS and QM calculations lead to quite similar R factors, though the R values tend to be consistently higher for QM predicted C chemical shifts.

How does ProCS compare to other chemical shift predictors?
I don't have all the data I need yet, so consider this a preview of a subsequent blogpost. CheShift2, which is also a QM-based chemical shift predictor for CA and CB chemical shifts gives the following R values:

AMBER 0.53 0.38
CHARMM 0.68 0.61
AMOEBA 0.33 0.44
X-ray 0.59 0.47

So based on this preliminary data it appears that ProCS tends to do better for CA, while CheShift2 does better for CB.

Open questions
How does ProCS compare to empirical shift predictors such as Camshift, SHIFTX, and SPARTA?
How do these number presented in this blog post compare to corresponding numbers for other proteins?
Whats the effect of averaging over different (side chain) conformers?
Is is possible to find a "universal" set of scaling and offset parameters for each atom type for a given choice of optimized structure? In this way, ProCS could predict chemical shifts instead of chemical shieldings.
What's the best way to use ProCS (or chemical shieldings in general) to judge the accuracy of a protein structure? If this turns out to be the R value instead of the RMSD, then we only need chemical shieldings.

Comments welcome as always

This work is licensed under a Creative Commons Attribution 4.0