## Sunday, May 14, 2017

### Predicting pKa values using PM3 - conformer search using implicit solvation

Disclaimer: These are preliminary results and may contain errors

In a previous post I showed that minimising RDKit-generated conformers generated with MMFF led to slightly worse PM3/COSMO pKa predictions. One reason might be that the MMFF minimisation is done in the gas phase.  Anders Christensen reminded me that TINKER can do MMFF/GBSA minimisations so I used TINKER so minimise 20 and 50 RDKit-generated conformers in used these as initial guesses for the PM3/COSMO energy minimisations (20mmtk and 50mmtk

As you can see there is relatively little difference in the overall performance of Xmm and Xmmtk. The error distribution is a bit tighter for the tk variants and 50mmtk does better at finding structures closer to the "global" minimum, but 50nomm is still the best choice for avoiding >1 pKa errors.

So, either GBSA is not a good substitute for COSMO or MMFF is not a good substitute for PM3.  In light of this recent paper my money is on the latter.  So I'll go with 50nomm for now.

## Saturday, May 13, 2017

### Computing apparent pKa values using QM energies of isodesmic reactions

A few days ago I presented some preliminary pKa results.  Almost all the molecules in the set have more than one ionisable group. Here's how I compute the apparent pKa values

The microscopic pKa values are computed by

\label{eqn:pka}
\mathrm{pK_a}=\mathrm{pK_a^{ref}} + \frac{\Delta G^\circ}{RT\ln (10)}

where $\Delta G^\circ$ denotes the change in standard free energy for the isodesmic reaction

\mathrm{ BH^+ + B_{ref} \rightleftharpoons B + B_{ref}H^+ }

If there are more than one titrateable sites then several protonation states can contribute to the apparent pKa.  Here I follow the approach described by Bochevarov et al. If $\mathrm{D}$ and $\mathrm{P}$ differ by one proton and there are $M$ deprotonated sites and $N$ protonated sites then

\mathrm{K_{app}}  = \frac{([\mathrm{D}_1] + [\mathrm{D}_2] + ... [\mathrm{D}_M])[\mathrm{H}^+]}{[\mathrm{P}_1] + [\mathrm{P}_2] + ... [\mathrm{P}_N]}

which can be rewritten in terms of microscopic pKa values

\mathrm{K_{app}}  = \sum^M_j \frac{1}{\sum^N_i 10^{pK_{ij}} }

The sum contains contains microscopic pKa values for which more than one protonation state is changed. For example, molecule with three ionizable groups $(\mathrm{B_1H^+B_2H^+B_3H^+})$ will have the following microscopic pKa value

\label{eqn:pkapp}
K_{ij} = \frac{ [\mathrm{B_1B_2H^+B_3][H^+]}}{[\mathrm{B_1H^+B_2B_3H^+}]}

in the expression for the apparent pKa value for going from the doubly to the singly protonated state. However, the error in such a pKa value is considerably higher due to less error cancellation and such pKa values are therefore neglected in the sum.

In the Bochevarov et al. implementation the user defines the titrateable group for which the apparent pKa is computed.  However, in my approach all possible protonation states so the assignment of the apparent pKa value to a particular ionizable group is not immediately obvious.  Inspection of Eq $\ref{eqn:pkapp}$ shows that the largest microscopic pKa values will dominate the sum over $N$, while the smallest of these maximum pKa values will dominate the sum over $M$. Thus, the apparent pKa is assigned to the functional group corresponding to the microscopic pKa

\label{eqn:max}
pK^\prime_{ij} = \min_{j}(\{ \max_{i}(\{ pK_{ij} \} ) \} )

In some cases there are several microscopic pKa that contribute almost equally to the respective sums and for these cases the apparent pKa value cannot meaningfully be assigned to a single ionizable sites.  In my approach include all microscopic pKa values that are within 0.6 pKa units of the respective maximum and minimum values defined in Eq $\ref{eqn:max}$.   I choose 0.6 because that is the site-site interaction energy below which two groups titrate simultaneously.

You can find the code I wrote to do this here. It is not cleaned up yet.

## Thursday, May 11, 2017

### Small molecule energy minimisation and conformational search using Tinker

In this post I outlined the (possible) need for conformational analysis using a continuum model. Anders Christensen reminded me that Tinker can do this so I looked in to it. This blogpost and this GitHub repo were particularly helpful.  Here's what I found.

Generating atomic charges
Tinker has MMFF force field parameters except the atomic charges.  These can be generated from an sdf file using sdf2tinkerxyz (which relies on OpenBabel). When you download from SourceForge you get a bin file, which I didn't know what to do with so I downloaded the Linux executable from here instead.

./sdf2tinkerxyz -k default.key < xxx.sdf

creates yyy.xyz (coordinates) and yyy.key (charges), where yyy is the title in the sdf file (which I'll assume = xxx in the following). "default.key" is a file with Tinker keywords that is added to xxx.key. Here's mine

# Solvation Model
SOLVATE GBSA

#Force Field Parameters
PARAMETERS /opt/tinker/tinker/params/mmff.prm

Energy minimisation
To isolate the effect of continuum solvation on the pKa predictions, I want to generate conformers with RDKit and minimise them with MMFF/GBSA using Tinker.  This is done by

/opt/tinker/tinker/bin/minimize xxx.xyz -k xxx.key 0.01 > xxx.tout

"0.01" is the convergence criterium (in kcal/molÅ??). xxx.tout contains the energy information and a xxx.xyz_2 file is generated with the optimized coordinates.  This is in a format that OpenBabel can't handle, so I need to write a converter.  The main challenge is to translate the atom types into names. See the list called lookup in this code.

Conformational Search using SCAN
Tinker also has its own conformational search method. It looks like it's based on eigenvector following but I haven't looked closely at it yet.  This is done by

/opt/tinker/tinker/bin/scan xxx.xyz -k xxx.key 0 10 20 0.00001 > xxx.tout

Here I use the settings used in the DP4 GitHub repo.  "0" is automatic selection of torsional angles, "10" is the number of search directions (modes?), "20" is the energy threshold for local minima,and "0.00001" is the optimisation convergence criterium.  You get a bunch of coordinate files and xxx.tout holds the energy information.

I haven't really played with this option yet. If you have any tips, please leave comment.

## Sunday, May 7, 2017

### Predicting pKa values using PM3 - new code and conformer search

Disclaimer: These are preliminary results and may contain errors

The code I used in my previous pKa prediction paper had some limitations that I have now mostly removed: The setup of protonation states is automated and includes acids like acetic acid and phenol, there is a rudimentary tautomer generations, and the reference molecules are chosen a bit more systematically.

I ran the new code on the same set of molecules and saw a few differences.  Some were expected and due to different references and overall protonation states but most major differences were due to different conformations even though I used the same code to generate conformers. This let me to look at the effect of conformer generation.

The original study used 20 MMFF-minimised conformations ("20mm") generated using RDKit as starting points for PM3/COSMO minimisations.  Now I've tried skipping the MMFF minimisation ("nomm"), 50 conformations with and without MMFF minimisation, and a scheme where I MMFF-minimise 50 or 100 conformations and select conformers that have energies within either 10 or 20 kcal/mol of the lowest energy ("XecutY") for each protonation state/tautomer.

The RMSEs are very similar, but 50nomm has the fewest number of cases where the error in the pKa value (ΔpKa) is greater than 1 pH unit. However, the major outlier for 20mm and 20nomm is Sparteine for which 20mm and 20nomm actually find a conformer that is 4.9 kcal/mol lower than for 50nomm, for the protonated state.

This lead me to look at the energies themselves. The numbers labelled ΔE in the table are computed as follows. For a given molecule I find the lowest energy for the protonated and deprotonated state for each method and identify the method with the lowest energy $E_{min}$. Then I count the number of molecules for which the difference to $E_{min}$ is greater than 1.36 kcal/mol and average that number for protonated and deprotonated to get one number per method.

So there are 4 molecules for which 50nomm doesn't something close to the "global" minimum for either the protonated or deprotonated state, or both. In principle, I should go on and try 100nomm to see if I can "converge", but that's starting to become pretty expensive and the point is trying to develop a practically useful tool.

The 100ecut20 results suggest that the MM gas phase energetics don't represent the PM3/COSMO energetics all that well.  So it would be interesting to try a MM conformational search with an implicit solvent model. RDKit can't do this, but Macrmodel and NAMD should be able to do this.  It also looks like the next release of OpenMM will have this capability.

## Saturday, April 15, 2017

### Finding the function that is its own derivative

I saw this "derivation" some years ago but now I can't find it.  If anyone knows the source please let me know.

Let's say we want to find a function $f(x)$ for which
$$\frac{d}{dx} f(x) = f^\prime (x) = f(x)$$
Well, how about $f(x) = x$?
$$f(x) = x \implies f^\prime (x) = 1$$
No, because $f^\prime (x) \ne x$. OK, how about $f(x) = 1+x$?
$$f(x) = 1+x \implies f^\prime (x) = 1$$
That get's me a little closer.  I get the "$1$" back, but I need something whose derivative will give me back the $x$:
$$f(x) = 1+x+\tfrac{1}{2}x^2 \implies f^\prime (x) = 1 + x$$
Now I need something whose derivative will me back the $\tfrac{1}{2}x^2$:
$$f(x) = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 \implies f^\prime (x) = 1 + x + \tfrac{1}{2}x^2$$
OK, so if I use infinitely many terms then this function fits the bill
$$f(x) = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 + \cdots \tfrac{1}{n!}x^n \cdots \implies f^\prime (x) = 1 + x + \tfrac{1}{2}x^2 + \cdots \tfrac{1}{(n-1)!}x^{n-1} \cdots$$
Can I write $f(x)$ in some compact form?  Well, $f(0) = 1$ and $f(1)$ will give me some number, which I'll call $e$.
$$f(1) = 1+1+\tfrac{1}{2} + \tfrac{1}{2 \cdot 3} + \cdots = e$$
You only need a few terms before $e$ has converged to the first two decimal places (2.717).

For $x$ = 2 you need a few more terms to get the same precision, but the number (7.382) is roughly 2.717 x 2.717, an agreement that quickly gets better with a few more terms
$$f(2) = 1+2+\tfrac{1}{2}4 + \tfrac{1}{2 \cdot 3}8 + \cdots = e^2$$
So
$$f(x) = e^x = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 + \cdots \tfrac{1}{n!}x^n \cdots$$
and
$$\frac{d}{dx} e^x = e^x$$

### Where does the ln come from in S = k ln(W) ? - Take 2

Some years ago I wrote this post. Now I want to come at it from a different angle.

A closed system spontaneously goes towards the state with maximum multiplicity, $W$. For a system with $N$ molecules with energies $\varepsilon_1, \varepsilon_2, \ldots$ we therefore want to find the values of $N_i$ that maximises $W(N_1, N_2, \ldots)$.

This is easier to do for $\ln W$ than $W$, which is fine because $W$ is a maximum when $\ln W$ is a maximum, and this happens when
$$\frac{N_i}{N}= p_i = \frac{e^{-\beta \varepsilon_i}}{q}$$
$\beta$ can be found by comparison to experiment.  For example, the energy of an ideal monatomic gas with $N_A$ particles is
$$U^{\mathrm{Trans}} = N_A \langle \varepsilon^{\mathrm{Trans}} \rangle = N_A\sum_i p_i \varepsilon_i = \frac{3N_A}{2\beta} = \tfrac{3}{2}RT \implies \beta = \frac{1}{kT}$$
where $R$ is determined by measuring the temperature increase due to adding a known amount of energy to the system.

So far we have used $\ln W$ instead of $W$ for convenience, but is there something special about $\ln W$? Yes, the change in $\ln W$ has can be expressed quite simply
$$d \ln W = \beta \sum_i \varepsilon_i dN_i = N \beta \sum_i \varepsilon_i dp_i = \frac{dU}{kT}$$
So change in internal energy $U$ due to a redistribution of molecules among energy levels is equal to the change in $\ln W$ (as opposed to $W$) multiplied by $kT$
$$dU = Td\left( k \ln W \right) = TdS$$
The final question is whether there is something special about $\ln = \log_e$ as opposed to say $\log_a$ where $a \ne e$? Well, $\log_a W$ is a maximum when
$$\frac{N_i}{N}= p_i = \frac{a^{-\beta \varepsilon_i}}{q}$$
There is an extra term in the derivation but that cancels out in the end.  So no change there.

What about $\beta$?  There are two changes.  The previous derivation of $U^{\mathrm{Trans}}$ relied on this relation (I'll drop the "Trans" label for the moment)
$$\varepsilon_i e^{-\beta \varepsilon_i} = - \frac{d }{d \beta} e^{-\beta \varepsilon_i} \implies \langle \varepsilon \rangle = - \frac{1}{q} \frac{dq}{d\beta}$$
which now becomes
$$\varepsilon_i a^{-\beta \varepsilon_i} = - \frac{1}{\ln(a)} \frac{d }{d \beta} a^{-\beta \varepsilon_i} \implies \langle \varepsilon \rangle = - \frac{1}{q \ln(a)} \frac{dq}{d\beta}$$
While $q$ has an extra $\ln (a)$ term, the derivative wrt $\beta$ is still the same and
$$U^{\mathrm{Trans}} = N_A \langle \varepsilon^{\mathrm{Trans}} \rangle = N_A\sum_i p_i \varepsilon_i = \frac{3N_A}{2 \ln (a) \beta} = \tfrac{3}{2}RT \implies \beta = \frac{1}{\ln (a) kT}$$
So, $S = k \log_e W$ is special in the sense that the proportionality constant $k$ is the experimentally measured ideal gas constant divided by Avogadro's number.  In any other base we have to write either $S = k \ln (a) \log_a W$ where $k = R/N_A$ or $S = k^\prime \log_a W$ where $k^\prime = \ln(a)R/N_A$

Clearly, $S = k \log_e W$ is the most natural choice, and this is because $e$ is the (only) value of $a$ for which
$$\frac{d}{dx} a^x = a^x$$
In fact that is one way to define what $e$ actually is.

## Tuesday, February 28, 2017

### Biotechnology and drug-design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal Chemical Science. It's ultimately related to biotechnology and drug design so first some background.

Background
Most biotechnology and medicine involves proteins in some way. Many diseases involve mutations that alter the function of proteins, most drugs are molecules that bind to proteins and inhibit their actions, and a large part of industrial biotechnology involves making new types of proteins such as enzymes. Like with everything else, it is easier to deal with proteins of you know what they look like but protein structure determination can be very difficult and we don't know what many important proteins actually look like.

The most popular way of determining protein structure is a technique called x-ray crystallography where you basically take an x-ray of a crystal made from the protein. Unfortunately, it can be very difficult or impossible to grow crystals of some proteins and if you can't get the protein to crystallise you can't use x-ray crystallography to find the structure.  The other main way for determining protein structure is a technique called NMR spectroscopy where you basically take an MRI of a solution containing the protein. The advantage is that there is no need for crystallisation, but the disadvantage it that it is difficult to extract enough information from the "NMR-MRI" go get a good structure.

The "NMR-MRI" of a protein actually provides a unique fingerprint of each protein so in principle all one has to do is generate a lot of possible structures of a protein, compute the NMR fingerprint for each, and compare to the measured fingerprint. The structure with the best fingerprint match should be the correct protein structure.  The questions are how to best generate the structure and how to best predict the NMR fingerprint using the structure.

The New Study
In 2015 we published a new method for predicting NMR fingerprints and in the paper that just got published we combined it with a method for generating a lot of protein structures. We started with known x-ray structures and generated millions of relatively small variations of the structure and found the structure with the best match.  We started from a known structure to answer the question: what is the best match we can hope for? The answer is: not perfect but good enough.  Now that we know this the next step will be to start with a structure we know is wrong and see if the program can find the right structure.  Also, our NMR fingerprint method does not generate fingerprints for all parts of the protein so we need to improve the model as well.