Thursday, December 15, 2016

The Open Access Reviewer Rewards Program or "Reviews for OA"

It's almost New Year so I'll soon get e-mails from journals saying how much they appreciate my reviews.  I might even get advertisement material thinly disguised as a calendar.

I've decided I want something more. Well, different - they can keep the calendar and they can keep the emails.  I want a partial APC voucher that I can use to publish open access in the journal. And I mean "CC-license open access", not your "pay-us-but-we-own-your-work-anyway" license.

How big a voucher, you ask?  I would say 5-10% of the CC-BY APC is reasonable.  Certainly, if I have reviewed 20 papers for a journal, I should be able to publish an accepted paper as OA free of charge there.

Obviously, it will take time to implement such a scheme.  I'll give them a year.  In 2018 I'll start saying "no" to journals that don't offer some kind of scheme like this. Or give them another year, I dunno.

And just as obviously, this won't happen by itself. Here's an example of my reply to the usual post-review thank-you email:

Dear Gus

You’re very welcome.  As you know reviewing takes a lot of time.  Would it be possible for JCTC to reward reviewers with a partial APC voucher along the lines described here: http://proteinsandwavefunctions.blogspot.dk/2016/12/the-open-access-reviewer-rewards.html?  This would be a tangible demonstration of how much you value your reviewers and increase open access publication, which is good for science.  If you like the idea perhaps you could pass this suggestion along to the ACS.

Best regards, Jan


2017.01.01 Update 
Journals/publishers who do something similar (the list may not be current)
Announcing a New Benefit for PeerJ Peer Reviewers  (HT @chanin_nanta)
Reviewing for MDPI Just Became More Rewarding (HT @chanin_nanta)
Reviewer Discount for BMC journals (HT @chanin_nanta)



To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighbouring rights to this work.

Thursday, November 24, 2016

Which method is more accurate? or Errors have error bars

2017.01.10 update: this blogpost is now available as a citeable preprint

This post is my attempt at distilling some of the information in two papers published by Anthony Nicholls (here and here). Anthony also very kindly provided some new equations, not found in the papers, in response to my questions.

Errors also have error bars
Say you have two methods, $A$ and $B$, for predicting some property and you want to determine which method is more accurate by computing the property using both methods for the same set of $N$ different molecules for which reference values are available. You evaluate the error (for example the RMSE) of each method relative to the reference values and compare. The point of this post is that these errors have uncertainties (error bars) that depend on the number of data points ($N$; more data, less uncertainty) and you have to take these uncertainties into consideration when you compare errors. 

The most common error bars reflect 95% confidence and that's what I'll use here.  

The expressions for the error bars assume a large $N$, where "large" in this context means roughly 10 or more data points.  If you use fewer points or would like more accurate estimates, please see the Nicholls papers for what to do.

Root-Mean-Square-Error (RMSE)
The error bars for the RMSE are asymmetric.  The lower and upper error bars on the RMSE for method $X$ $(RMSE_X)$ are
$$ L_X = RMSE_X - \sqrt {RMSE_X^2 - \frac{{1.96\sqrt 2 RMSE_X^2}}{{\sqrt {N - 1} }}} $$
$$ = RMSE_X \left( 1- \sqrt{ 1- \frac{1.96\sqrt{2}}{\sqrt{N-1}}}  \right) $$

$$ U_X =  RMSE_X \left(  \sqrt{ 1+ \frac{1.96\sqrt{2}}{\sqrt{N-1}}}-1  \right) $$
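In code, these RMSE expressions translate to something like the following sketch (the function name is my own). Note that the square root goes imaginary for very small $N$, consistent with the large-$N$ caveat above.

```python
import math

def rmse_error_bars(rmse, n, z=1.96):
    """Asymmetric 95% error bars on an RMSE computed from n data points.

    A sketch of the large-N expressions above (valid for roughly n >= 10);
    the function name is mine, not from the Nicholls papers.
    """
    t = z * math.sqrt(2) / math.sqrt(n - 1)  # the common 1.96*sqrt(2)/sqrt(N-1) term
    lower = rmse * (1 - math.sqrt(1 - t))    # L_X (goes imaginary for tiny n)
    upper = rmse * (math.sqrt(1 + t) - 1)    # U_X
    return lower, upper

# e.g. an RMSE of 1.5 pH units from 50 data points:
lo, up = rmse_error_bars(1.5, 50)
```

The same function works for the MAE below, since the expressions have the same form.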

Mean Absolute Error (MAE)
The error bars for the MAE are also asymmetric. The lower and upper error bars on the MAE for method $X$ $(MAE_X)$ are

$$ L_X =  MAE_X \left( 1- \sqrt{ 1- \frac{1.96\sqrt{2}}{\sqrt{N-1}}}  \right)  $$

$$ U_X =  MAE_X \left(  \sqrt{ 1+ \frac{1.96\sqrt{2}}{\sqrt{N-1}}}-1  \right)  $$

Mean Error (ME) 
The error bars for the mean error are symmetric and given by 
$$ L_X = U_X =  \frac{1.96 s_N}{\sqrt{N}} $$

where $s_N$ is the population standard deviation (e.g. STDEVP in Excel).

Pearson’s correlation coefficient, $\textbf{r}$
The first thing to check is whether your $r$ values themselves are statistically significant, i.e. $r_X > r_{significant}$ where

$$ r_{significant} = \frac{1.96}{\sqrt{N-2+1.96^2}}   $$

The error bars for the Pearson's $r$ value are asymmetric and given by 
$$ L_X = r_X - \frac{e^{2F_-}-1}{e^{2F_-}+1} $$
$$ U_X =  \frac{e^{2F_+}-1}{e^{2F_+}+1} - r_X $$

where

$$ F_{\pm} = \frac{1}{2} \ln \frac{1+r_X}{1-r_X} \pm r_{significant}$$
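A corresponding sketch for the Pearson $r$ expressions, via the Fisher transform (again, the function name is my own):

```python
import math

def pearson_error_bars(r, n, z=1.96):
    """Significance threshold and asymmetric 95% error bars on Pearson's r,
    following the Fisher-transform expressions above (function name is mine)."""
    r_sig = z / math.sqrt(n - 2 + z**2)        # smallest statistically significant |r|
    f = 0.5 * math.log((1 + r) / (1 - r))      # Fisher transform of r
    f_minus, f_plus = f - r_sig, f + r_sig
    # inverse Fisher transform of f_minus and f_plus gives the interval limits
    lower = r - (math.exp(2 * f_minus) - 1) / (math.exp(2 * f_minus) + 1)
    upper = (math.exp(2 * f_plus) - 1) / (math.exp(2 * f_plus) + 1) - r
    return r_sig, lower, upper

# e.g. r = 0.9 from 50 data points; 0.9 > r_sig, so the correlation is significant
r_sig, lo, up = pearson_error_bars(0.9, 50)
```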

Comparing two methods
If $error_X$ is some measure of the error, RMSE, MAE, etc, and $error_A > error_B$ then the difference is statistically significant only if 

$$ error_A - error_B > \sqrt {L_A^2 + U_B^2 - 2{r_{AB}}{L_A}{U_B}} $$

where $r_{AB}$ is the Pearson's $r$ value of method $A$ compared to $B$, not to be confused with $r_A$, which compares $A$ to the reference values.  Conversely, if this condition is not satisfied then you cannot say that method $B$ is more accurate than method $A$ with 95% confidence, because the error bars are too large.

Note also that if there is a high degree of correlation between the predictions ($r_{AB} \approx $ 1) and the error bars are similar in size $L_A \approx U_B$ then even small differences in error could be significant.

Usually one can assume that $r_{AB} > 0$ so if $error_A - error_B > \sqrt {L_A^2 + U_B^2}$ or $error_A - error_B > L_A + U_B$ then the difference is statistically significant, but it is better to evaluate $r_{AB}$ to be sure.
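A sketch of the comparison test, which also illustrates the point made above about correlated predictions (the function name is my own):

```python
import math

def is_significant(error_a, error_b, L_a, U_b, r_ab):
    """True if error_a > error_b with 95% confidence: error_a and error_b
    are e.g. RMSEs (error_a the larger one), L_a the lower error bar of
    method A, U_b the upper error bar of method B, and r_ab the Pearson
    correlation between the two methods' predictions."""
    composite = math.sqrt(L_a**2 + U_b**2 - 2 * r_ab * L_a * U_b)
    return (error_a - error_b) > composite

# the same 0.3 difference in error is significant for highly correlated
# methods but not for uncorrelated ones:
assert is_significant(1.8, 1.5, 0.4, 0.35, 0.95)
assert not is_significant(1.8, 1.5, 0.4, 0.35, 0.0)
```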

The meaning of 95% confidence
Say you compute errors for some property for 50 molecules using methods $A$ ($error_A$) and $B$ ($error_B$) and observe that the significance condition above is satisfied.  

Assuming no prior knowledge on the performance of $A$ and $B$, if you repeat this process an additional 40 times using all new molecules each time then in 38 cases (38/40 = 0.95) the errors observed for method $A$ will likely be between $error_A - L_A$ and $error_A + U_A$ and similarly for method $B$. For one of the remaining two cases the error is expected to be larger than this range, while for the other remaining case it is expected to be smaller. Furthermore, in 39 of the 40 cases $error_A$ is likely larger than $error_B$, while $error_A$ is likely smaller than $error_B$ in the remaining case. 



This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Sunday, November 6, 2016

Some useful PyMol commands

Here are some PyMol commands I found useful while writing this paper.

Raytracing (pretty pictures)
ray

Selections
select br. all within 3 of 63/CA
select br. all within 3 of resi 63
select Ala63, br. all within 3 of resi 63
select br. all within 3 of 2kpp///82/
sele tail,  2kzn///142-147/
sele tail,  2KPP///1-7+91-114/

NMR ensembles
set all_states, on
split_states your_object

Superimposing structures
# superimpose protA residues 10-25 and 33-46 to protB residues 22-37 and 41-54:
pair_fit protA///10-25+33-46/CA, protB///22-37+41-54/CA

# superimpose ligA atoms C1, C2, and C3 to ligB atoms C8, C4, and C10, respectively:
pair_fit ligA////C1, ligB////C8, ligA////C2, ligB////C4, ligA////C3, ligB////C10

align cluster_lowe///13-25+36-105+111-141/CA,native///13-25+36-105+111-141/CA

Color using numbers in B-factor column
spectrum b, green_red, selection=n. CA,minimum=0.0, maximum=2
spectrum b, blue_white_red, selection=n. CA
spectrum b, blue_white_red, selection=n. CA,minimum=-1.37, maximum=1.37

spectrum b, blue_red, selection=test////CA
spectrum b, green_red, selection=n. CA,minimum=0.0, maximum=3.6

Label atoms 
label n. CA and i. 44, "(%s%s, %s)" % (resn, resi, b)
label n. CA and i. 33+55, "(%s%s, %s)" % (resn, resi, b)
label n. CA and i. 2, "%s%s, %5.2f" % (resn, resi, b)

Get more digits on the distance measurement
set label_distance_digits, 2

Get the orientation data, which you can paste back in to restore orientation
get_view

Change cartoon representation (the first two commands go together)
alter 16:23/, ss='L'
rebuild

set ribbon_width, 8

create new_obj, cluster_lowe_all2
set cartoon_transparency, 0.5,new_obj



This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Saturday, November 5, 2016

Semi-automatic pKa prediction using QM

Here I outline how I automated the calculations for this paper.  The files and programs can be found in the smiles-code.zip file that's part of the supplementary material on Figshare.

Generating SMILES
I used PDF to Text to get a text version of Table 3 from this paper and extracted the first column with the names using Excel.  I removed some molecules and the superscripts on some of the names by hand and created a text file with a single name on each line.

I used a python script to convert the names to SMILES strings

python name2smiles.py table_3 > table_3.smiles

I used a python script to convert the SMILES string to 2D images

python smiles2png.py table_3.smiles

I used the Cover Flow option in Finder (Mac) to browse through the images looking for errors. It turned out that several SMILES strings included the HCl, which I removed.  If I saw other tautomer possibilities I created those by hand (I need to automate this).  A handful of molecules had carboxyl groups, which I deprotonated by hand by changing the SMILES string (I need to automate this).

I used a python script to protonate the amines

python protam.py table_3.smiles > table_3+.smiles

This program creates all possible single protonation states of all nitrogen atoms (except amides) in the molecule.  I deleted some of the protonation states I didn't want.  For example, for histamine I want to compute the pKa of the amine group but not of the imidazole, so I deleted the line with the histamine SMILES string for the protonated imidazole.  For a few of the molecules I also needed SMILES for doubly protonated molecules, so I used protam.py and table_3+.smiles to create a table_3++.smiles file and extracted the molecules I needed.  These steps were repeated for the reference molecules as well.

File naming
Each line in the .smiles files is the name of the molecule followed by a SMILES string; the name is used to construct the filename, and the subsequent workflow makes some assumptions about the filenames.  When I deprotonate a carboxyl group I add a "-" to the filename, e.g. "Phenylalanine-", in table_3.smiles, and protam.py adds a "+" to the filename, e.g. "Phenylalanine-+".  Molecules in the table_3++.smiles file will have names like, e.g., "Procaine++". If more than one nitrogen can be protonated, protam.py adds "_1", "_2", etc., e.g. "Histamine+_1".  When I make tautomers I add "_1" or "_A", etc., e.g. "Cimetidine+_A".  (So in principle you could have, e.g., "name_A_1", but that didn't happen in practice.)  Some manual editing is required for everything to come out right.
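A sketch of how this naming convention maps a filename to a net charge, assuming the charge is simply the number of "+"s minus the number of "-"s:

```python
def charge_from_name(name):
    """Net charge implied by the naming convention: each '+' adds one and
    each '-' subtracts one (a sketch of the convention described above;
    suffixes like '_1' or '_A' are ignored automatically)."""
    return name.count("+") - name.count("-")

assert charge_from_name("Phenylalanine-+") == 0   # zwitterion
assert charge_from_name("Procaine++") == 2
assert charge_from_name("Histamine+_1") == 1
```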

Creating input files from SMILES
I use a python script to create sdf files from the SMILES strings, e.g.

python smiles2sdf.py table_3.smiles

The script creates 20 different conformations for each SMILES string. The sdf files are named xxx_y.sdf, where xxx is the name in the .smiles file and y is an integer between 0 and 19.

I use OpenBabel to convert the .sdf files to input files.  For example for MOPAC

for x in *.sdf; do babel $x ${x%.*}.mop -xf ../pm6-dh+.header; done

The header file contains the keywords, and the charge is specified as "charge=0". So I need to change the charge by,

sed -i 's/charge=0/charge=1/g' *+_*.mop
sed -i 's/charge=1/charge=2/g' *++*.mop
sed -i 's/charge=1/charge=0/g' *-+*.mop
sed -i 's/charge=0/charge=-1/g' *-_*.mop

I change the method by, for example,

sed -i 's/pm6-dh+/am1/g' *.mop

Computing the pKa values
After all the jobs have run I extract energies from the output files. For MOPAC .out files:

grep "FINAL HEAT" *.out > xx.energies
grep "CURRENT" *.out >> xx.energies

The last line is needed in case MOPAC doesn't converge, in which case I just extract the last energy.

For  GAMESS log files
grep "TOTAL FREE ENERGY IN SOLVENT" *.log > xx.energies

I use a python script to compute the pKa values from the .energies files

MOPAC pKa values
python pka_morgan_list+sub.py xx.energies big_table_3.smiles > xx.pka

GAMESS pKa values
python pka_morgan_list+sub_gms.py xx.energies big_table_3.smiles > xx.pka

Here big_table_3.smiles is all the .smiles files, including reference molecules, combined into one.

The python script finds the lowest free energy for each protonation state and the appropriate reference molecule for each ionizable site.  The protonation state is defined by the "-"s and "+"s, so the lowest free energy for protonated histamine is the lowest free energy found in the output files named "Histamine+_*".  The pKa is related to the free energy difference between e.g. "Histamine+" and "Histamine", "Phenylalanine-+" and "Phenylalanine-", or "Procaine++" and "Procaine+", i.e.

delta_G = energies[name+"+"] - energies[name]

The python script contains SMILES strings for all the reference molecules and the appropriate reference for a titration is the reference molecule with the largest substructure match around the titrating nitrogen.
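The isodesmic relation behind this can be sketched as follows, where I define the deprotonation energy as the opposite sign of delta_G above; the function name and sign convention are my own, and the actual script may differ in details:

```python
import math

RT_LN10 = 1.364  # RT*ln(10) in kcal/mol at 298 K (MOPAC energies are in kcal/mol)

def isodesmic_pka(dG, dG_ref, pka_ref):
    """Sketch of an isodesmic pKa calculation: dG and dG_ref are deprotonation
    energies, e.g. energies["Histamine"] - energies["Histamine+"] and the
    corresponding difference for the reference molecule, in kcal/mol.
    The energy of the proton cancels in the isodesmic reaction."""
    return pka_ref + (dG - dG_ref) / RT_LN10

# a molecule about 1.36 kcal/mol harder to deprotonate than its reference
# (made-up numbers) comes out about one pKa unit higher than the reference
pka = isodesmic_pka(151.36, 150.0, 10.7)
```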

Some future improvements
I need to automate the tautomer generation and I shouldn't hand-pick the protonation state.  For example, I should consider both possible single protonation states in histamine and have the program automatically use the lowest free energy.  This also means that for, e.g., phenylalanine I should consider both the neutral and zwitterionic protonation states, i.e. "Phenylalanine-+_A" and "Phenylalanine-+_B", and pick the lowest energy.


This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Generating coordinates for molecules with diverse set of elements

It looks like PM6 in GAMESS is working, but we need to test it for more elements.  Here's how I generated the MOPAC input files.  First I made a list of SMILES strings for most of the elements, which I then converted to sdf files with smiles2coord.py.  Then I used OpenBabel to convert to MOPAC input files.

The implementation is only done for RHF, so the molecules need to be closed shell.  For example, for Sc I used ScCl$_3$. For many of the transition metals I couldn't really think of a closed shell molecule, so I didn't include those elements.  For the non-metals the Cactus server automatically adds hydrogens.  Cactus also interprets SMILES like Sc as sulphur, so square brackets are needed.



This work is licensed under a Creative Commons Attribution 3.0 Unported License.

writing unicode csv files from Excel for Mac

update: ... or you can update Excel to the latest version (15.27) and then you get the option to save as CSV UTF-8 (thanks fxcoudert #twitterrules)

Special characters become corrupted when saving Excel files in the csv format.  There is apparently no easy fix for Mac, so here is what I ended up doing.

Save the Excel file as UTF-16 Unicode Text.

Open the file in TextEdit.  Edit > Find > Find and Replace (OPTION-CMD-F).
In the search panel press OPTION-TAB, type "," in the replace panel, and press "all".
Change the extension to .csv when you save the file.
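If you'd rather script the conversion, here is a sketch in Python (the function name and filenames are hypothetical):

```python
import csv

def utf16_tsv_to_utf8_csv(src, dst):
    """Convert the UTF-16 tab-separated text Excel writes ("UTF-16 Unicode
    Text") to a plain UTF-8 CSV, preserving special characters."""
    with open(src, encoding="utf-16") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin, delimiter="\t"):
            writer.writerow(row)

# e.g. utf16_tsv_to_utf8_csv("table.txt", "table.csv")
```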

You can also do it in vi
:1,$s/(press CTRL+V, then TAB)/,/g



This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Thursday, October 20, 2016

Prediction of amine pKa values of drug-like molecules using semiempirical QM methods - take 2

I screwed up some of the calculations described in this post so I am starting fresh.

In an earlier study we showed that pKa values could be computed fairly accurately using PM6-based methods using isodesmic reactions and appropriate reference compounds.  Amines were especially challenging.  The study used small molecules.  The next question, which I address here, is how this approach performs for actual drug molecules containing amine functional groups.  These are preliminary results and may contain errors. 

The Molecules
I took about 50 drug-like molecules with experimentally measured amine pKa values from Table 3 in the paper by Eckert and Klamt.  I removed some of the smaller molecules, such as 2-methylbenzylamine, since they would differ very little from the corresponding reference molecules. The reference molecules are chosen to match the chemical environment of the nitrogen within a 2-bond radius as much as practically possible, and I try to match the ring size if the nitrogen sits in a ring. I end up with 35 reference molecules.  I compute most of the reference pKa values using the ACE JChem pKa predictor.

Many of the molecules contain more than one ionizable group. Only the pKa values of the amine indicated in Eckert and Klamt's Table 3 are computed and the protonation states are prepared according to standard pKa values.  For example, for phenylalanine the carboxyl group is deprotonated because the "standard" pKa value of a carboxyl group (e.g. in acetic acid) is lower than the standard pKa value of a primary amine (e.g. ethylamine). Notice that the cyanoguanidine group in cimetidine has a pKa value of about 0 and is therefore deprotonated when the imidazole group titrates.  Eckert and Klamt characterised the histamine pKa value of 9.7 as an amidine pKa and the thenyldiamine pKa as a pyridine pKa. These are corrected to a primary amine and a tertiary amine, respectively.  Also, the experimental pKa values of morphine and niacin are changed to 8.2 and 4.2, respectively, while the remaining experimental pKa values are taken from Eckert and Klamt.

The Methods
I test the following methods: PM6-DH+, PM6, PM3 and AM1 with the COSMO solvation method using MOPAC, and PM3, AM1, and DFTB3 with the SMD solvation method in GAMESS.  Jimmy's getting quite close to finishing the PM6 implementation for heavy elements in GAMESS, but he still needs to interface it with PCM, so PM6/SMD calculations are not yet possible.  I use RDKit to generate 20 starting geometries for each protonation state and I optimise with the solvation method turned on.  I don't include RRHO free energy contributions and I use only the lowest free energy structure for the pKa calculations.



The Results
The results are shown in the figure above (based on "Table 1 soln").  The negative outlier seen for the COSMO-based methods is cefadroxil and is due to proton transfer in the zwitterionic protonation state. Cefadroxil is also the negative outlier for DFTB3/SMD, although the proton doesn't transfer. Proton transfer in zwitterions is also a common problem for DFT/continuum calculations and is due to deficiencies in the continuum solvent method, not the electronic structure method.  The good performance observed for PM3/SMD is thus due to fortuitous cancellation of error.  For the three other zwitterions among the molecules, niacin, phenylalanine, and tryptophan, no proton transfer is observed and the error is relatively small.

The AM1- and PM3-based methods perform best with RMSE values of 1.4-1.6 ± 0.3-0.4, which are statistically identical.* The null model has an RMSE of 1.8 ± 0.4 which, given the error bars, is statistically no worse than the AM1- and PM3-based methods.  

If the cefadroxil outlier is removed, the RMSE values for PM3/COSMO and AM1/COSMO drop to 1.0 ± 0.2 and 1.1 ± 0.3, while PM3/SMD and AM1/SMD remain at 1.5 ± 0.4 and 1.6 ± 0.4.  So for this subset, the COSMO-based predictions can be said to outperform the SMD-based predictions, as well as the null model.

One of the main uses of pKa values is the prediction of the correct protonation state at physiological pH (7.4), i.e. is the predicted pKa above or below 7.4?  Here PM3/COSMO performs best by getting it right 94% of the time, compared to 91%, 77%, and 91% for AM1/COSMO, PM3/SMD, and the null model.

Summary and Outlook
Overall the best method for pKa prediction of drug-like molecules is either PM3/SMD, AM1/SMD, or the null model with RMSE values of 1.5 ± 0.4, 1.6 ± 0.4, and 1.8 ± 0.4.  The corresponding RMSE values for PM3/COSMO and AM1/COSMO are very similar, but in general they can fail for certain types of zwitterions.

For a set of molecules that does not include such zwitterions, the best method is either PM3/COSMO or AM1/COSMO, which deliver RMSE values of 1.0 ± 0.2 and 1.1 ± 0.3, respectively.  In this case, using gas phase geometries leads to slightly larger, but statistically identical, RMSE values of 1.4 ± 0.3 and 1.3 ± 0.3, respectively.

In this study I made sure that suitable reference molecules were available for all molecules.  This will be difficult in the general case and it will be interesting to see what the accuracy is for such molecules.  Cases where no good reference molecule can be found can be flagged based on similarity scores.

I will now start working on a manuscript draft (you can follow along here if you're interested)

*The RMSE uncertainties are computed using equation 14 in this paper and the two methods are taken to be statistically different if their difference in RMSE is larger than the composite error described on page 105 of this paper.


This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Monday, October 10, 2016

Prediction of amine pKa values of drug-like molecules using semiempirical QM methods

2016.10.20 update: please disregard this post and read this post instead.
2016.10.16 update: I used the wrong keyword in the MOPAC calculations so these numbers are not right.

In an earlier study we showed that pKa values could be computed fairly accurately using PM6-based methods using isodesmic reactions and appropriate reference compounds.  Amines were especially challenging.  The study used small molecules.  The next question, which I address here, is how this approach performs for actual drug molecules containing amine functional groups.  These are preliminary results and may contain errors. 

The Molecules
I took about 50 drug-like molecules with experimentally measured amine pKa values from Table 3 in the paper by Eckert and Klamt.  I removed some of the smaller molecules, such as 2-methylbenzylamine, since they would differ very little from the corresponding reference molecules. The reference molecules are chosen to match the chemical environment of the nitrogen within a 2-bond radius as much as practically possible, and I try to match the ring size if the nitrogen sits in a ring. I end up with 35 reference molecules.  I compute most of the reference pKa values using the ACE JChem pKa predictor.

Many of the molecules contain more than one ionizable group. Only the pKa values of the amine indicated in Eckert and Klamt's Table 3 are computed and the protonation states are prepared according to standard pKa values.  For example, for phenylalanine the carboxyl group is deprotonated because the "standard" pKa value of a carboxyl group (e.g. in acetic acid) is lower than the standard pKa value of a primary amine (e.g. ethylamine).  When preparing the protonation states I noticed that Eckert and Klamt mischaracterised the histamine pKa value of 9.7 as an amidine pKa. This is corrected to a primary amine.

The Methods
I test the following methods: PM6-DH+, PM6, PM3 and AM1 with the COSMO solvation method using MOPAC, and PM3, AM1, and DFTB3 with the SMD solvation method in GAMESS.  Jimmy's getting quite close to finishing the PM6 implementation for heavy elements in GAMESS, but he still needs to interface it with PCM, so PM6/SMD calculations are not yet possible.  I use RDKit to generate 20 starting geometries for each protonation state and I optimise with the solvation method turned on.  I don't include RRHO free energy contributions and I use only the lowest free energy structure for the pKa calculations.

The Results
You can see the results here (Table 1). The COSMO-based predictions are very bad with MAD values of 13.1 - 15.4 pH units and the SMD-based predictions are OK with MAD values of 1.5 - 2.0 pH units. If I simply use the reference pKa values the MAD is 1.4 pH units.  The corresponding maximum ADs are 71.3 - 136.1, 7.3 - 9.9, and 6.5 pH units.

Inspection of the molecules with large AM1/SMD and PM3/SMD errors suggests that two of the molecules (Cimetidine, Niacin) were prepared with incorrect protonation states.  The ACE JChem pKa predictor predicts that the pKa of the carboxyl group in Niacin is higher than that of the pyridine, while the pKa value of the nitrile-substituted guanidine group in Cimetidine is lower than that of the imidazole. Similarly, the ACE JChem pKa predictor predicts that the titrating amine in Thenyldiamine with pKa 8.9 is not the pyridine, as indicated by Eckert and Klamt, but the tertiary amine.

After making these changes the MADs are 10.4 - 12.4, 1.3 - 1.7, and 1.4 pH values and the max ADs are 71.3 - 136.1, 3.3 - 9.9, and 6.5 pH units.

One of the main uses of pKa values is the prediction of the correct protonation state at physiological pH (7.4), i.e. is the predicted pKa above or below 7.4?  The COSMO-based predictions get this right 74 - 77% of the time, while the SMD-predictions get it right 75 - 87% of the time.  In this regard PM3/SMD is considerably worse than AM1/SMD and DFTB3/SMD despite the fact that the MAD for PM3/SMD is 0.1 pH units lower than AM1/SMD and 0.4 pH units lower than DFTB/SMD. Simply using the reference values gets the protonation state right 91% of the time.

Why are COSMO-based predictions so bad?
The largest error occurs for Cefadroxil where the proton transfers in the zwitterionic state for the COSMO method.  The remaining large errors involve molecules that have a +2 charge where the pKa is much too low.  It appears that the COSMO method severely underestimates the solvation energy of +2 ions.

Outlook
In this study I made sure that suitable reference molecules were available for all molecules.  This will be difficult in the general case and it will be interesting to see what the accuracy is for such molecules.  Cases where no good reference molecule can be found can be flagged based on similarity scores.

I will now start working on a manuscript draft (you can follow along here if you're interested)


This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Sunday, September 18, 2016

Why is there no standard state temperature?

The standard state pressure is 1 bar, why is there no standard temperature?

The short answer
The standard state pressure is not an experimental condition, while the temperature is.

The long answer
The main reason the standard state is defined is because it leads to this very useful equation
$$K_p = e^{-\Delta G^\circ/RT}$$
Say you have this reaction: $A \rightleftharpoons B + C$. One way to use this equation is to compute the free energy of 1 mol of $A$, $B$, and $C$ at 1 bar using equations derived for an ideal gas, compute $\Delta G^\circ = G^\circ (B) + G^\circ (C) - G^\circ (A)$, and use that value to predict $K_p$.
If the gases behave like ideal gases "in real life" then the measured $K_p$ will match the $K_p$ computed from $\Delta G^\circ$.  You can do the measurement at any pressure you want, not just at 1 bar.* The standard state refers to the pressure you use when computing $\Delta G^\circ$.  The only thing it has to do with the experimental measurement is that it defines the units you should use for your partial pressures when computing $K_p$.

$\Delta G^\circ$ does also depend on temperature, but the temperature you choose should be the same as the experimental conditions.  So the temperature is not part of the standard state definition.
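As a quick numerical illustration of Equation 1 (the numbers are made up):

```python
import math

R = 8.314  # gas constant, J/(mol K)

def K_p(delta_G0, T):
    """Equation 1: the equilibrium constant from the standard free energy
    change delta_G0 (J/mol, computed at the 1 bar standard pressure) and
    the experimental temperature T (K)."""
    return math.exp(-delta_G0 / (R * T))

# e.g. a made-up reaction with delta_G° = -10 kJ/mol at 298 K;
# the partial pressures in the measured K_p must then be in bar
K = K_p(-10_000, 298)
```

Note that to predict $K_p$ at a different experimental temperature you must recompute $\Delta G^\circ$ at that temperature, while the 1 bar standard pressure stays the same.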

But what about "Standard temperature and pressure (STP)?"
Standard temperature and pressure (STP) refers to the conditions under which $K_p$ is measured, not the pressure used to compute $\Delta G^\circ$. I know, they couldn't have made it more confusing if they tried when they named these things.


*Of course if you do the measurements at very high pressures or low temperatures, then the assumption that the gases behave ideally will be less valid and the measured $K_p$ will differ more from the $K_p$ computed from Equation 1.  However, that is a separate issue unrelated to the standard state, because the $K_p$ in Equation 1 refers to the $K_p$ you would measure if the gases behaved ideally at the pressure and temperature used in the experiment.



This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Saturday, September 17, 2016

Why I tweet and blog

Update: here is the audio.  Something went wrong with Google Hangouts so the slides are missing. Despite the fact that I did a few practice runs yesterday ... and have a PhD in theoretical quantum chemistry.  WTH Google!

Update 2: the trick (in addition to this tip) is to start the Powerpoint slideshow before you start streaming.




On Tuesday I am giving presentations on tweeting and blogging in a Scientific Writing course.  Here are my slides and the message to the students

Dear Scientific Writing students

On Tuesday I will give two presentations: one on tweeting and one on blogging.  You can find the slides below.

You'll also do some writing so please bring a laptop and make sure you can get on Eduroam.

In preparation for Tuesday, please find one science related blogpost and twitter account you think looks interesting and share them on the discussion forum I created on Absalon.

Finally, I may try to live broadcast my talk using Google's Hangout On Air.  I've never tried it, so I am not sure if I can get it to work by Tuesday.  If you are uncomfortable with this, just send me an email and I won't do it.

See you Tuesday!

Slides




This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Monday, August 22, 2016

Finding the reference molecule for a pKa calculation using RDKit

This is prototype code related to this project.  I use Histamine as an example which has two ionizable sites: a primary amine and an imidazole ring.



The code figures out that imidazole is the best reference for the imidazole ring, while ethylamine is the best reference for the primary amine.  The code does this by figuring out which atom is being deprotonated, computing the Morgan fingerprint around this atom, and comparing it to the Morgan fingerprints of imidazole and ethylamine.
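The idea can be sketched roughly like this; this is a simplified illustration with RDKit, not the actual prototype code, and the function name and reference SMILES are my own:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def best_reference(smiles, atom_idx, ref_smiles_list, radius=2):
    """Pick the reference molecule whose environment around an ionizable
    nitrogen best matches the environment around atom_idx in the query,
    by comparing atom-centered Morgan fingerprints (a sketch of the idea)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprint(mol, radius, fromAtoms=[atom_idx])
    best, best_sim = None, -1.0
    for ref in ref_smiles_list:
        ref_mol = Chem.MolFromSmiles(ref)
        for atom in ref_mol.GetAtoms():
            if atom.GetSymbol() != "N":
                continue  # only compare against nitrogen-centered environments
            ref_fp = AllChem.GetMorganFingerprint(ref_mol, radius,
                                                  fromAtoms=[atom.GetIdx()])
            sim = DataStructs.TanimotoSimilarity(fp, ref_fp)
            if sim > best_sim:
                best, best_sim = ref, sim
    return best, best_sim

# histamine's primary amine (atom 0 in "NCCc1c[nH]cn1") should match
# ethylamine ("CCN") better than imidazole ("c1c[nH]cn1")
best, sim = best_reference("NCCc1c[nH]cn1", 0, ["CCN", "c1c[nH]cn1"])
```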

Thursday, August 11, 2016

Drug design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal PeerJ. It's ultimately related to making better drugs so first some background.

Background
Designing new drugs currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new drugs - a cost that is ultimately passed on to you and me as consumers.  There are many, many reasons why drug design is so difficult. One of them is that we often don't know fundamental properties of drug candidates, such as the charge of the molecule at a given pH. Obviously, it is hard to figure out whether or how a drug candidate interacts with the body if you don't even know whether it is positive, negative, or neutral.

It is not too difficult to measure the charge at a given pH, but modern day drug design involves the screening of hundreds of thousands of molecules and it is simply not feasible to measure them all. Besides, you have to make the molecules to do the measurement, which may be a waste of time if it turns out to have the wrong charge. There are several computer programs that can predict the charge at a given pH very quickly, but they have been known to fail quite badly from time to time.  The main problem is that these programs rely on a database of experimental data, and if the molecule of interest doesn't resemble anything in the database this approach will fail. The paper that just got published is a first step towards coming up with an alternative.

The New Study
We present a "new" method for predicting the charge of a molecule that relies less on experimental data but is fast enough to be of practical use in drug design. The paper shows that the basic approach works reasonably well for small prototypical molecules, and we even test one drug-like molecule where one of the commercial programs fails and show that our new method performs better (but not great).  However, we have to test this new method on a lot more molecules, and to do this we need to automate the prediction process, which currently requires some "manual" labor, so this is what we're working on now.



This work is licensed under a Creative Commons Attribution 4.0 

Sunday, August 7, 2016

Conformer search with RDKit

I'm teaching myself how to use RDKit.  Here is code for conformer search using RDKit that also computes the energy of each conformer using the MMFF94 force field.
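A minimal sketch of such a search, assuming RDKit's ETKDG embedding via EmbedMultipleConfs followed by MMFFOptimizeMoleculeConfs; the conformer count, seed, and example molecule are my choices, not necessarily the original code:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def conformer_energies(smiles, n_confs=20, seed=42):
    """Embed up to n_confs conformers and return the molecule plus the
    MMFF94 energy (kcal/mol) of each conformer after optimization."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=seed)
    # each entry is (not_converged, energy) for the corresponding conformer
    results = AllChem.MMFFOptimizeMoleculeConfs(mol)
    return mol, [energy for not_converged, energy in results]

mol, energies = conformer_energies("CCCCO")  # 1-butanol as an example
print(min(energies))
```

Sorting the conformers by energy then gives the usual low-energy-first conformer list.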

Comments welcome

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.  

Monday, August 1, 2016

Thoughts from the Gordon Research Conference on Computational Chemistry

Here are some of the things I took away from attending the GRC on Computational Chemistry.

Tweeting
The GRC had very strict "off-the-record" rules to encourage the presentation of unpublished results. However, most speakers devoted at least half their talks to published results, and I and others - especially Marc van der Kamp - tweeted some of these papers under the hashtag #compchemGRC.

Furthermore, I also explicitly waived my "off-the-record" rights at the beginning of my talk and encouraged tweeting.  I also shared my slides on Twitter - before the conference and immediately before my talk.  Seeing these slides on Twitter, FX Coudert alerted me to the fact that PM6 is now fully implemented in CP2K, which could be very useful for our work.

Open Access
I talked to a few people about my OA philosophy.  Here is what I put on my CV

"My publication policy since 2012:  If a paper has a shot at high impact journals such as JACS or PNAS then I will submit there. However, the majority of my papers are method development papers, which will be submitted to open access journals such as PLoS ONE or PeerJ as I fail to see a difference in impact between these journals and journals such as Journal of Chemical Theory and Computation and Journal of Computational Chemistry where I used to publish before."

However, it really doesn't have to be an all or nothing decision.  My best advice is one paper at a time.  Just try it once and see what happens.

For me "impact neutrality" has become just as important as OA.  It is so very liberating to just write down what I did and what I found rather than trying to put everything in the best possible light with elaborately constructed "technically-correct-but-not-really-telling-the-whole-story" paragraphs.

Reproducibility
Speakers usually show their "best" work at conferences, and precious speaker time is generally not wasted on pitfalls and caveats. It is easy to get the impression that everything is going great for everyone else while you are struggling with your own projects. Furthermore, when you see something potentially wonderful that you want to try, you just know from experience that it won't be as easy as the speaker makes it sound and, in fact, will be hard to reproduce from the published papers alone. (This is no reflection on any one particular speaker at the conference.)

This general sentiment was shared by a number of people I talked to.  It's not a new problem, but I do believe it is a growing one, in part because research projects are getting more complex, making it nearly impossible to describe all steps in sufficient detail to make them reproducible. The only solution is, in my opinion, to make everything available as supplementary material. Tar the whole thing - input files, output files, submission and analysis scripts, spreadsheets, etc. - and put it on a server such as Figshare.

Funding
The usual conference conversation starts with "Hey X, how are things going?", "Oh, fine, and you?", "Oh, fine."  But one person responded "Writing a lot of proposals and getting them rejected."  I really appreciated this honesty, and it makes me feel less bad about my own rejections. A few weeks ago I had a similar talk with another colleague about the possibility of having no PhD students in the not-too-distant future and how this affects the choice of research projects one can take on.  I think a lot of scientists are going through the same type of thing and it is important to be open about it.

Co-vice chair election (This will only make sense to people who were at the meeting, and that's fine)
A few people asked me why I effectively withdrew from consideration just before the vote. The short answer is that I didn't know for sure who else was running, nor that the candidates would be split up into two groups, until that morning. Had I thought a little faster, I probably could have gotten my name removed just in time. But I am not a fast thinker at the best of times, and certainly not at 8:30 am.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.  

Thursday, July 21, 2016

Finding disordered residues in an NMR ensemble

Note to self: here's how you identified disordered residues in the NMR ensemble 2KCU.pdb

1. In Pymol: "fetch 2kcu"

2. Action > align > states (*/CA)
2016.08.07 update: the above command also aligns the tails.  Use "intra_fit (2kcu///6-158/CA)"

3. "save 2kcu_aligned.pdb, state=0"

4. In terminal: grep CA 2kcu_aligned.pdb > lis

5. python disorder.py

disorder.py (given below) calculates the standard deviation of the x, y, and z coordinates of each CA atom $(\sigma_{x,i}, \sigma_{y,i}, \sigma_{z,i})$. It then averages these three standard deviations for each CA atom $(\sigma_i)$.  To find outliers, it averages these values for the entire protein $(\langle \sigma_i \rangle)$ and computes the standard deviation of the $\sigma_i$ values $(\sigma_{\langle \sigma_i \rangle})$. Any residue for which $\sigma_i > \langle \sigma_i \rangle + \sigma_{\langle \sigma_i \rangle}$ is identified as disordered.
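A sketch of what disorder.py could look like; the use of population standard deviations and the fixed PDB column slices are my assumptions:

```python
import statistics
from collections import defaultdict

def find_disordered(ca_coords):
    """ca_coords maps residue number -> list of (x, y, z) tuples, one per
    NMR model.  Returns residues whose averaged per-axis standard deviation
    exceeds the protein-wide mean plus one standard deviation."""
    sigma = {}
    for resid, points in ca_coords.items():
        # std dev of x, y, and z across the models, then averaged
        stds = [statistics.pstdev(axis) for axis in zip(*points)]
        sigma[resid] = sum(stds) / 3.0
    mean_sigma = statistics.mean(sigma.values())
    std_sigma = statistics.pstdev(list(sigma.values()))
    return sorted(r for r, s in sigma.items() if s > mean_sigma + std_sigma)

def read_ca_lines(filename="lis"):
    """Parse the grep'ed CA lines (standard fixed-width PDB columns)."""
    coords = defaultdict(list)
    with open(filename) as f:
        for line in f:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
                coords[int(line[22:26])].append(xyz)
    return coords
```

Calling find_disordered(read_ca_lines()) then prints the residue numbers to color in PyMOL.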

Here I've colored the disordered residues red (haven't updated the picture based on Step 2-change yet)



Yes, I know: "the 1970's called and want their Fortran code back". How very droll.



This work is licensed under a Creative Commons Attribution 4.0

Tuesday, July 12, 2016

Reproducing stats or verbose output from LINEST command in Excel or Google Sheet in Python



The Python code reproduces the output produced by the LINEST(y;x;true;true) command in Excel [LINEST(y,x,true,true) in Google Sheets] with a csv file as input.  In the csv file I have assumed that the x and y columns are labelled "x" and "y", respectively.  This page has a good explanation of the output (pdf).
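A sketch of such a calculation with NumPy; the return value mirrors LINEST's 5x2 stats table, and reading the "x" and "y" columns from the csv file is left out:

```python
import numpy as np

def linest(x, y):
    """Return the LINEST(y, x, true, true) stats table:
    [[slope, intercept], [se_slope, se_intercept], [r2, se_y],
     [F, degrees_of_freedom], [ss_regression, ss_residual]]"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    df = n - 2
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    ss_resid = np.sum((y - y_hat) ** 2)
    ss_reg = np.sum((y_hat - y.mean()) ** 2)
    r2 = ss_reg / (ss_reg + ss_resid)
    se_y = np.sqrt(ss_resid / df)          # standard error of the y estimate
    s_xx = np.sum((x - x.mean()) ** 2)
    se_slope = se_y / np.sqrt(s_xx)
    se_intercept = se_y * np.sqrt(1.0 / n + x.mean() ** 2 / s_xx)
    F = ss_reg / (ss_resid / df)
    return [[slope, intercept], [se_slope, se_intercept],
            [r2, se_y], [F, df], [ss_reg, ss_resid]]

# e.g. after reading the "x" and "y" columns from the csv file:
stats = linest([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```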



This work is licensed under a Creative Commons Attribution 4.0

Monday, July 11, 2016

2nd Reviews for Prediction of pKa values using the PM6 semiempirical method

2016.07.15 Update: our rebuttal can be found here.  Manuscript resubmitted.

The 2nd round of reviews on our latest PeerJ submission came in on July 7th.  The first round of reviews and a link to our response can be found here.

Editor's Comments
MINOR REVISIONS
Thank you for your efforts at addressing the reviewer's comments. In spite of that, the reviewers (and myself) still think that additional data should be moved from the Supporting Information to the main text as tables/graphs. Specifically:

-per reviewer 1's request, please include the ref. pKa data in table 1. The Supporting Material deposited in figshare is very complete, but it will enormously help the reader (and make your paper much more persuasive at first reading) if the most salient pieces were included in the paper itself. 

- contra reviewer 1's comment, I do acknowledge that p.799 of the quoted Stewart (2008) reference describes the pKa computation procedure which generated the data present in http://openmopac.net/pKa_table.html. This method (also used by Rayne) as well as the method by Juranic, however, do not compute pKa from the energy difference itself, but from an empirical fit of the O-H bond distances and approximate charges (or N and H charges, plus a dummy variable stating whether the amine is primary, secondary or tertiary, for Juranic, 2014). These pKa computation approaches are therefore fundamentally different from the one used in your paper. Your references to this literature in the introduction, however, do not make this clear enough. Please improve this to clearly compare the competing methods for PM6-based pKa computations to your approach.

-Do include the statistical data regarding slope, R-squared and outliers. A motivated reader may easily graph the data you have computed (and which are present in the spreadsheet referred to in your figshare area), but your explanation and discussion would be much more readable, and certainly more persuasive, if you included those graphs, slopes and correlation coefficients in the paper. That analysis shows more clearly than the aggregate tables exactly where PM6 affords better correlation/slope than even CBS-4B3 (pyridines), the identity of the outliers, how poorly all methods (even CBS-4B3) correlate to experimental pKa in amines (in spite of a seemingly low 0.2 MAD for CBS-4B3), etc.

-per reviewer 2's request (and also related to my previous request which I may not have worded clearly) please add data regarding the likely origin of the errors in the outliers: do they come from gas phase energies or the solvation? A simple comparison of the B3LYP gas-phase energy changes (on PM6-optimized geometries, to reduce computational effort) and solvation effects might be enough to tell whether the gas-phase acidities (and/or solvation) of PM6 generally track the DFT results.

Reviewer 1 (Anonymous)

Comments for the Author
Reviewer Comments – Reviewer 1
The authors have not adequately responded to any of the concerns raised in my original review. My original comments are shown first. The authors’ responses are shown next; and my further responses to them are shown below.

(1) Basic reporting 
This is an interesting manuscript, but a frustrating aspect is that the experimental pKa values used for comparison are not included for most compounds. These could easily be added to Table 1. In fact, the best solution would be to modify Table 2 to give the calculated pKa values from the various methods along with the experimental values.

Authors: The values are already provided in Supplementary Materials

The copy I received contained no reference at all to “Supplementary Materials”. If these materials are available directions for accessing them should be clearly presented in the normal position just before the References.

(2) Also, the authors should refer somewhere to the very relevant PM6 pKa calculations by Jimmy Stewart given in http://openmopac.net/pKa_table.html. 

Authors: We already refer to this approach in the introduction (Stewart 2008).

The reference Stewart (2008) concerns proteins and has nothing at all to do with pKa estimates. As clearly indicated, the relevant Stewart study is not a formal publication, but has been made widely available to workers by Stewart on the web page as indicated. Apparently the authors didn’t even bother to look at it.

(3) Validity of the findings 
It would be very helpful if the authors would provide a figure comparing the calculated and experimental values, and include in the text the relevant equation with proper statistics (n, r2, s, F) along with the uncertainties for the slope & intercept. (See, e.g., the book by Shields & Seybold on this topic, or their WIRES article.) 

Authors: The statistical analysis the reviewer refers to is done in the context of a QSAR prediction of pKa from QM data, i.e. to gauge the accuracy a linear fit to be used in the prediction of unknown pKa values. The statistics used in this paper is just aimed at gauging the accuracy of the predicted values and, in our opinion, is more than adequate for the task. If the reviewer can explain how the requested statistics is to be used in the context of the current paper we will be happy to reconsider the request.

This is a standard way to compare not just QSAR results, but any studies in this field. It would be helpful, and I don’t understand the authors’ reluctance to include it.

Annotated manuscript
The reviewer has also provided an annotated manuscript as part of their review:

Reviewer 2 (Anonymous)

Comments for the Author
I thank the authors for the revised manuscript. I would still like the authors to address my second point as to what is the major source of error in these calculations, especially for the outliers. Is it the gas phase energies, or the solvation component?

Sunday, June 26, 2016

Planned papers for 2016 - six months in

Pedro's post reminded me that mine was due

In January I wrote about the papers I plan to publish and made this list:

Submitted
None

Probable
1. Benchmarking of PM6 and DFTB3 for barrier heights computed using enzyme active site models.
2. pKa prediction using PM6 - part 1
3. Protein structure refinement using ProCS15 - starting from x-ray structure

Maybe
4. PM6 for all elements in GAMESS, including PCM interface
5. Protein structure refinement using ProCS15 - starting from 5 Å Cα RMSD
6. Vibrational effects on N amide chemical shifts
7. pKa prediction using PM6 - amines
8. Predicting binding free energies for CB7
9. Linear scaling HF-3c calculations by interface to FMO2 in GAMESS
10. Side chain chemical shift prediction with ProCS15
11. Rienstra-like chemical shift assignment in PHAISTOS

The status is

Published
1. Benchmarking of PM6 and DFTB3 for barrier heights computed using enzyme active site models.

Submitted
2. pKa prediction using PM6 - part 1
Received "minor revision" verdict.  Sent back last week.

Probable
3. Protein structure refinement using ProCS15 - starting from x-ray structure
Actively working on it. First draft about 2/3 done. It's a huge amount of work because I'm still new to the field and learning as I am writing. However, I am fairly confident that I'll get it published in 2016.

7. pKa prediction using PM6 - amines
The main issue here is whether we can automate all steps of the pKa calculation, and based on what we have learned so far I am pretty sure we can. The biggest challenge is the protonation. Once paper 3 is done I will start working on this.  The CPU requirements are not an issue, so I am fairly confident that I'll get it published in 2016.

Maybe in 2016
*. Improved prediction of chemical shifts using machine learning
This wasn't even on the drawing board in January.  Lars spent a few months in Anatole von Lilienfeld's lab working on increasing the accuracy of the ProCS15 data set using machine learning. Calculations are still ongoing, so I am a little hesitant to list it under "probable".
A companion paper in Scientific Data has also been discussed.


Probably not in 2016
4. PM6 for all elements in GAMESS, including PCM interface
Jimmy is working on the first part.  The interface is there but debugging is really tough.  I think we will get it working in 2016 but getting a paper published this year is unlikely.

5. Protein structure refinement using ProCS15 - starting from 5 Å Cα RMSD
I think we will get most of the calculations done in 2016.

6. Vibrational effects on N amide chemical shifts
I had a visiting student working on this. Bottom line: still no bug-free, black-box approach for computing vibrational effects that just works.  Much harder problem than I had anticipated.

8. Predicting binding free energies for CB7
Have some data but a long way to go yet.

9. Linear scaling HF-3c calculations by interface to FMO2 in GAMESS
This is actually working and being incorporated into the official version of GAMESS.  Not sure when we'll get around to generating data for a paper.

10. Side chain chemical shift prediction with ProCS15
Susanne is working on this but I doubt we will publish a paper on it in 2016. 

11. Rienstra-like chemical shift assignment in PHAISTOS
Still just an idea so far.



This work is licensed under a Creative Commons Attribution 4.0

Thursday, June 16, 2016

Reviews for Prediction of pKa values using the PM6 semiempirical method

2016.06.21 Update: Here's our response

Reviews of our latest PeerJ submission are in after only 15 days.  This must be some kind of record!

Personal comments from the editor:

Table 4 shows dramatic differences between PM6-D3H+ and PM6 although the previous tables did not show very large differences between both semiempirical methods. Please discuss this.

How do the errors in PM6 or PM6-D3H+ gas-phase protonation energies (vs. experiment or high level computation) change when moving from primary to secondary and tertiary amines? I believe that the addition of a table with these data (with each tested amine treated separately) would be very helpful for the readers and future practitioners.


Reviewer 1 (Anonymous)

Basic reporting

This is an interesting manuscript, but a frustrating aspect is that the experimental pKa values used for comparison are not included for most compounds. These could easily be added to Table 1. In fact, the best solution would be to modify Table 2 to give the calculated pKa values from the various methods along with the experimental values. Also, the authors should refer somewhere to the very relevant PM6 pKa calculations by Jimmy Stewart given in http://openmopac.net/pKa_table.html.

Experimental design

In general this work is properly designed.

Validity of the findings

It would be very helpful if the authors would provide a figure comparing the calculated and experimental values, and include in the text the relevant equation with proper statistics (n, r2, s, F) along with the uncertainties for the slope & intercept. (See, e.g., the book by Shields & Seybold on this topic, or their WIRES article.)

Comments for the Author

After improvements, this manuscript will be of interest to many people attempting to calculate pKas, especially those dealing with high throughput applications.


Reviewer 2 (Anonymous)

Basic reporting

The paper is well-written and organised in a manner that was easy to read. I did find the background / literature research on the short side. Specifically, the isodesmic or proton exchange scheme was developed quite some time ago by various groups. See for example: (a) http://dx.doi.org/10.1063/1.1337862 (b) 10.1021/ct800335v and (c) 10.1021/jp107890p. These studies have laid out quite clearly the effectiveness of an isodesmic scheme for error cancellation, as well as its limitations (e.g. the need for a structurally similar reference with accurately known pKas). Another minor point is there should be a footnote to explain what "**" in Table 2 means.

Experimental design

The research question is well-defined, namely whether contemporary semi-empirical methods can provide cost-effective predictions of pKas. I do have a number of suggestions for improvement:

(1) Computational methods: It was not clear how the solvation free energies were computed - e.g. were these done on gas phase or solution phase optimised geometries? Strictly speaking, the gas and solution phase components of the solvation free energy should be computed on geometries optimised in the respective phases. How sensitive are the results to this choice?

(2) There is a lot of data condensed into the Tables which could actually be used to provide even deeper insights. For example, I would love to see a breakdown of the solution phase energies into the gas phase and solvation contributions as laid out in eqn (6). This would be useful for identifying the sources of errors especially for the outliers.

(3) The dataset molecules in Table 1 are structurally very similar (the substituents are mostly aliphatic groups). It would be interesting to see a more diverse selection of molecules (e.g. EWG and EDG) as the authors alluded to in their conclusion.

Validity of the findings

I think the conclusions are fair based on the results presented. However, I do recommend the authors consider my earlier suggestions to provide clearer insights as to why semi-empirical methods can sometimes fail badly even for isodesmic reactions. This will spur further research into improving these methods.


Reviewer 3 (Anonymous)

Basic reporting

- Line 132: change "can play and important role" to "can play an important role"

- Citations need to conform to the journal style throughout: see, e.g., "taken from (Morgenthaler et al., 2007)" in Table 4 caption should be changed to "taken from Morgenthaler et al. (2007)"

- References in the bibliography need to be consistently formatted to journal guidelines

- Other groups have reported validation efforts for predicting pKa values using the PM6 method (see, e.g., Rayne et al. [2009], Juranić [2014], etc.). The authors should cite and incorporate the findings of all these prior PM6 pKa validation efforts into the current study to demonstrate an understanding of the prior literature in this field. Otherwise, it looks as though the authors are attempting to make their research appear more novel than it actually is.

Experimental design

- Experimental design is appropriate.

Validity of the findings

- The findings appear valid.

Friday, May 6, 2016

A cheminformatics problem: protonate with SMILES and InChI

I am teaching a python programming course and this is one of the projects I want to try this year.

The overall goal of the project is to write one or more programs that generate protonation states for a list of nitrogen-containing molecules specified by name. The project uses SMILES, and maybe InChI, which you can read more about here.

Getting started
1. Write code that from this list molecules = ["CCN", "CNC", "CN(C)C"] generates this output

CCN
CC[NH3+]

C[N@H]C
C[N@@H]C
C[NH2+]C

C[N@](C)C
C[N@@](C)C
C[N@H+](C)C
C[N@@H+](C)C

2. Write code that from this list  molecules = ["C(C(=O)O)N"] generates this output (the order is not important)

C(C(=O)O)N
C(C(=O)O)[NH3+]
C(C(=O)[O-])N
C(C(=O)[O-])[NH3+]

3. C(C(=O)O)N is the amino acid glycine.  Extend this program to work for alanine, asparagine, aspartate, and lysine.  Use this site to get SMILES strings for these amino acids. Find a picture of asparagine and make sure you're treating the side-chain correctly.

The project
4. (optional) Figure out how to generate a file containing SMILES strings from a file containing names. The best way is probably bash.  Get inspiration here, here, and here.

5. Generate all possible protonation state SMILES for the molecules in Table 2 in this paper. If you completed step 4 you can use tools like https://pdftables.com/ to generate a file with the names.

6. Repeat for Table 1 and 3

7. (optional) The neutral form of the amino acids histidine and arginine side chain groups have tautomers. Generate SMILES for all tautomers (InChI might help you identify tautomers).

8. (optional) Do any of the molecules in step 5 and 6 have tautomers? If so generate SMILES for all tautomers.

Some code snippets to get you started
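For example, here is one way to start on step 1 with RDKit; this sketch ignores the nitrogen stereochemistry (the [N@H]/[N@@H] forms would need extra work), so it is only a partial solution:

```python
from rdkit import Chem

def protonation_states(smiles):
    """Return the input molecule plus each singly N-protonated form as
    canonical SMILES.  Stereocenters at nitrogen are not enumerated."""
    mol = Chem.MolFromSmiles(smiles)
    states = [Chem.MolToSmiles(mol)]
    for atom in mol.GetAtoms():
        if atom.GetSymbol() == "N" and atom.GetFormalCharge() == 0:
            m = Chem.MolFromSmiles(smiles)
            m.GetAtomWithIdx(atom.GetIdx()).SetFormalCharge(1)
            Chem.SanitizeMol(m)  # recompute the implicit hydrogen count
            states.append(Chem.MolToSmiles(m))
    return states

for molecule in ["CCN", "CNC", "CN(C)C"]:
    print(protonation_states(molecule))
```

Deprotonating the carboxylic acid in step 2 works the same way, except the O gets a formal charge of -1.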



This work is licensed under a Creative Commons Attribution 4.0

Tuesday, May 3, 2016

Enzyme design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal PeerJ.  It's ultimately related to making better enzymes so first some background.

Background
Enzymes are proteins that make certain chemical reactions go faster and nearly every complex molecule in your body is made by, or broken down by, enzymes.  But people have also started using enzymes in commercial products, for example in washing powder to break down oily stains at lower temperatures. This saves money on heating the water and, being proteins, the enzymes are biodegradable. So there is a lot of interest in designing new enzymes that build or break down new molecules efficiently. For example, there is a rather large company (Novozymes) near Copenhagen that does nothing but design, produce, and sell enzymes on an industrial scale.

Designing new enzymes currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new enzymes - a cost that is ultimately passed on to you and me as consumers. My long-term goal is to reduce the amount of trial-and-error by writing a computer program that can predict what changes you have to make to improve the enzyme before you ever make it in the lab.

I had some modest success with a prototype program some years ago (you can find the papers here and here) for one enzyme.  But one of the many things we don't know is whether the method we base our approach on will work at all for other types of enzymes. The paper that just got published is a first small step in figuring this out.

The New Study
We've collected data for five other enzymes from other published papers that we trust reasonably well and tested two methods that are fast enough to design enzymes - one is the same method we used a few years ago and the other is a newer one that wasn't available to us before now.  The conclusion of our study is that the methods seem to work well enough for all but one system, and this system is different for the two methods.  This suggests that we can't just base future work on one method. We have to have both ready in case one of them fails.  We need to repeat the study for many other types of enzymes - I would say at least 15-20 more - and we need to improve the quality of the data so that we trust it completely, rather than "reasonably well".  In the paper we have extended an open invitation to other scientists to contribute to this effort.




This work is licensed under a Creative Commons Attribution 4.0 

Saturday, April 16, 2016

Computing pKa values for molecules with several ionizable groups

We're working on pKa prediction using semiempirical methods and need to compute pKa values for molecules with several ionizable groups. Here are my thoughts so far.

Background: one ionizable group
If there is only one titratable site
$$ \mathrm{BH \rightleftharpoons B + H^+} \ \ \  K=\mathrm{\frac{[B][H^+]}{[BH]}} $$
then the fraction of $\mathrm{BH}$ molecules $f_{\mathrm{BH}}$ is $$
\begin{split}
 f_{\mathrm{BH}} & =\mathrm{\frac{[BH]}{[B]+[BH]} } \\
& = \mathrm{\frac{[B]}{[B]}\frac{[BH]/[B]}{1+[BH]/[B]} } \\
& = \mathrm{\frac{[H^+]/K}{1+[H^+]/K} } \\
& = \mathrm{\frac{10^{p\textit{K}-pH}}{1+10^{p\textit{K}-pH}} }
\end{split}
$$ where $$ \mathrm{pH = -log[H^+] \implies [H^+] = 10^{-pH}} $$
and similarly for $K$.

From this we can see that the pK value is the pH value for which $f_{\mathrm{BH}}$ = 1/2. So you compute the pK value from the standard free energy difference
$$\text{p}K =\frac{ \Delta G^\circ(\mathrm{B})+\Delta G^\circ(\mathrm{H^+})- \Delta G^\circ(\mathrm{BH})}{RT\ln(10)} $$
and you're done.

Two ionizable groups
For a molecule with two titratable groups ($ \mathrm{HB_\alpha B_\beta H}$) and the following equilibrium constants $$ \mathrm{HBBH \rightleftharpoons BBH + H^+} \ \ K_{\alpha1}$$ $$\mathrm{HBBH \rightleftharpoons HBB + H^+} \ \ K_{\beta1}$$ $$ \mathrm{HBB \rightleftharpoons BB + H^+} \ \ K_{\alpha0}$$ $$\mathrm{BBH \rightleftharpoons BB + H^+} \ \ K_{\beta0}$$ The probability of, for example, $\mathrm{BBH}$ is $$ f_{\mathrm{BBH}} =\mathrm{\frac{[BBH]}{[BB]+[BBH]+[HBB]+[HBBH]}= \frac{[BBH]}{\textit{P}}} $$ $f_{\mathrm{BBH}}$ can be rewritten in terms of pK values $$f_{\mathrm{BBH}} = \mathrm{\frac{[BBH]/[BB]}{\textit{P}/[BB]} = \frac{10^{p\textit{K}_{\beta0}-pH}}{\textit{P}/[BB]}} $$ where $$ \mathrm{ \textit{P}/[BB] = 1+10^{p\textit{K}_{\alpha0}-pH}+10^{p\textit{K}_{\beta0}-pH}+ 10^{p\textit{K}_{\alpha0}+p\textit{K}_{\beta1}-2pH}}  $$ Similarly, $$ f_{\mathrm{HBB}} = \mathrm{\frac{10^{p\textit{K}_{\alpha0}-pH}}{\textit{P}/[BB]}} $$ and $$ f_{\mathrm{HBBH}} = \mathrm{\frac{10^{p\textit{K}_{\alpha0}+p\textit{K}_{\beta1}-2pH}}{\textit{P}/[BB]}} $$  The apparent pK value of the $\alpha$ group ($\mathrm{p}K_{\alpha}$) is the pH at which its protonation probability $$f_\alpha =f_{\mathrm{HBB}} + f_{\mathrm{HBBH}} $$ is 1/2 and similarly for the $\beta$ group.  So compute the microscopic pK values (Eq 4-7), then $f_\alpha$ and $f_\beta$ as a function of pH, and then $\mathrm{p}K_{\alpha}$ and $\mathrm{p}K_{\beta}$.

If one of the groups (say $\alpha$) titrates at a significantly lower pH than the other ($\mathrm{p}K_{\alpha1} << \mathrm{p}K_{\beta1}$) then $\mathrm{p}K_{\alpha}=\mathrm{p}K_{\alpha1}$ and $\mathrm{p}K_{\beta}=\mathrm{p}K_{\beta0}$ and it is not necessary to compute the free energy of $\mathrm{HBB}$, but it can be hard to determine this in advance.  Similarly, if there is no significant interaction between the sites then $\mathrm{p}K_{\alpha}=\mathrm{p}K_{\alpha1}=\mathrm{p}K_{\alpha0}$ and $\mathrm{p}K_{\beta}=\mathrm{p}K_{\beta1}=\mathrm{p}K_{\beta0}$ and one can skip one of the protonation states.

For $N$ ionizable groups one has to determine the free energies of $2^N$ protonation states, which quickly gets out of hand if one has to do a conformational search for each protonation state and the molecule is large.
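The recipe for two groups (compute the microscopic pK values, build $f_\alpha$ and $f_\beta$ as functions of pH, then solve for $f = 1/2$) can be sketched numerically; the microscopic pK values below are made up, chosen only to satisfy the thermodynamic cycle $\mathrm{p}K_{\alpha1}+\mathrm{p}K_{\beta0}=\mathrm{p}K_{\beta1}+\mathrm{p}K_{\alpha0}$:

```python
def site_fractions(pH, pKa1, pKb1, pKa0, pKb0):
    """Protonation probabilities (f_alpha, f_beta) at a given pH."""
    hbb  = 10 ** (pKa0 - pH)             # [HBB]/[BB]
    bbh  = 10 ** (pKb0 - pH)             # [BBH]/[BB]
    hbbh = 10 ** (pKa0 + pKb1 - 2 * pH)  # [HBBH]/[BB]
    p = 1 + hbb + bbh + hbbh             # P/[BB]
    return (hbb + hbbh) / p, (bbh + hbbh) / p

def apparent_pK(site, pKs, lo=0.0, hi=14.0, tol=1e-8):
    """Bisect for the pH where the site's protonation probability is 1/2
    (the probability decreases monotonically with pH)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if site_fractions(mid, *pKs)[site] > 0.5:
            lo = mid   # still mostly protonated: apparent pK lies higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

# made-up microscopic values: pKa1=9.0, pKb1=6.0, pKa0=8.0 -> pKb0=5.0
pKs = (9.0, 6.0, 8.0, 5.0)
print(apparent_pK(0, pKs), apparent_pK(1, pKs))
```

With these well-separated values the bisection recovers $\mathrm{p}K_{\alpha}\approx\mathrm{p}K_{\alpha0}=8$ and $\mathrm{p}K_{\beta}\approx\mathrm{p}K_{\beta1}=6$, consistent with the limiting case discussed above.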

Related post
Generating protonation states and conformations


This work is licensed under a Creative Commons Attribution 4.0

Wednesday, April 13, 2016

Reviewing for PeerJ: it's the little (and the not so little) things

I just did my first review for PeerJ and it was a real pleasure because there are a lot of "little things" that make your reviewing life easier:

1. Figures/tables are in the text and, get this, the captions are immediately above/below the corresponding figure/table.  Some other journals also do this, but not enough.

2. I annotate the mss in a pdf reader and usually this is a frustrating experience, since the publisher-generated pdf has all sorts of "quirks" that make highlighting and copying text hit-and-miss.  The previous pdf I reviewed turned every page with a figure into an image!  Annotating/copying in the PeerJ pdf worked flawlessly.

3. The pdf contained 3 front pages with the due date, a summary of the review criteria, a link to the page with the supplementary material, and a link to the page where I should submit my review.  No hunting around for the email with the link! I teared up a little bit when I saw that.

Other "little things" include stuff like not having to rank the perceived importance or impact of the work on some bogus 1-10 scale, a strict policy on making the raw data available, and a button to click to make my review non-anonymous.




This work is licensed under a Creative Commons Attribution 4.0

Thursday, April 7, 2016

Why I use Twitter: one scientist's perspective



Tomorrow I am giving a short talk on why and how I use Twitter to public relations people at the University.  I have 20 min + 10  min questions, but am aiming for a 10 min talk + 20 min questions.

Comments welcome




This work is licensed under a Creative Commons Attribution 4.0

Tuesday, April 5, 2016

ACS Omega is too expensive

Disclaimer: I applied to be a co-editor of this journal and was not selected.

ACS Omega has just announced its APCs: \$1500 (\$2000) under the ACS Authors Choice (CC-BY or CC-BY-NC-ND) license for members and an additional \$500 for non-members (since full membership costs \$162, this is the real additional cost for one yearly publication).

The ACS Authors Choice license is not open access: under the ACS Authors Choice license you assign copyright to the ACS.  While
For non-commercial research and education purposes only, users may access, download, copy, display and redistribute articles as well as adapt, translate, text and data mine content contained in articles, ... 
you still can't, for example, use a figure from such an article in a book chapter without the ACS permission.  Also, the ACS can, for example, sell your article or your figures.

So the cost to publish OA in ACS Omega is \$2000.  That's more expensive than other impact neutral OA journals: PLoS One (\$1495), Scientific Reports (\$1495), F1000Research (\$1000), Rio Journal (\$850), PeerJ (\$100/author, Bio-only), and Royal Society Open Science (\$0, for now).

Since all these journals are impact neutral and provide quick review (AFAIK), price is the main consideration and I see no reason to pay more to publish in ACS Omega.




This work is licensed under a Creative Commons Attribution 4.0