## Saturday, November 5, 2016

### Semi-automatic pKa prediction using QM

Here I outline how I automated the calculations for this paper.  The files and programs can be found in the smiles-code.zip file that's part of the supplementary material on Figshare.

Generating SMILES
I used PDF to Text to get a text version of Table 3 from this paper and extracted the first column with the names using Excel.  I removed some molecules and the superscripts on some of the names by hand and created a text file with a single name on each line.

I used a python script to convert the names to SMILES strings

python name2smiles.py table_3 > table_3.smiles

I used a python script to convert the SMILES string to 2D images

python smiles2png.py tabel_3.smiles

I used the Cover Flow option in Finder (Mac) to browse through the images looking for errors. Turned out that several SMILES strings included the HCl which I removed.  If I saw other tautomer possibilities I created those by hand (I need to automate this).  A handful of molecules had carboxyl group which I deprotonated by hand by changing the SMILES string (I need to automate this).

I used a python script to protonate the amines

python protam.py table_3.smiles > table_3+.smiles

This program creates all possible single protonation states of all nitrogen atoms (except amides) in the molecule.  I deleted some of the protonation states I didn't want.  For example, for histamine I want to compute the pKa of the amine group but not of the imidazole, so I deleted the line with the histamine SMILES string for the protonated imidazole.  For a few of the molecules I also needed SMILES for doubly protontaes molecules to I used protam.py and table_3+.smiles to create a  table_3++.smiles file also and extracted the molecules I needed.  These steps where repeated for the reference molecules as well.

File naming
The each line in the .smiles files is the name of the molecule followed by a SMILES strings, the name is used to construct the filename, and the subsequent workflow makes some assumptions about the filenames.  When I deprotonoate carboxyl group I add a "-" to the filename, e.g. "Phenylalanine-", in the table_3.smmiles and protoam adds a "+" to the filename, e.g.  "Phenylalanine-+".  Molecules in the table_3++.smiles file will have names like.e.g Procaine++. If more than one nitrogen is protonated protam.py adds "_1" and "_2", etc., e.g. "Histamine+_1".  When I make tautomers I add "_1" or "_A", etc., e.g. "Cimetidine+_A".  (So in principle you could have, e.g. "name_A_1", but that didn't happen in practise).  Some manual editing is required for everything to come out right.

Creating input files from SMILES
I use a python script to create sdf files from the SMILES strings, e.g.

python smiles2sdf.py table_3.smiles

The script creates 20 different conformations for each SMILES string. The sdf files are named xxx_y.sdf where xxx is the names in the .smiles file and y is an integer between 0 and 19.

I use OpenBabel to convert the .sdf files to input files.  For example for MOPAC

for x in *.sdf; do babel \$x \${x%.*}.mop -xf ../pm6-dh+.header; done

The header file contains the keyword and the charge is specfied as "charge=0". So I need to change the charge by,

sed -i 's/charge=0/charge=1/g' *+_*.mop
sed -i 's/charge=1/charge=2/g' *++*.mop
sed -i 's/charge=1/charge=0/g' *-+*.mop
sed -i 's/charge=0/charge=-1/g' *-_*.mop

I change the method by, for example,

sed -i 's/pm6-dh+/am1/g' *.mop

Computing the pKa values
Af all the jobs have run I extract energies from the output files. For MOPAC out files by

grep "FINAL HEAT" *.out > xx.energies
grep "CURRENT" *.out >> xx.energies

The last line is needed in case MOPAC doesn't converge, in which case I just extract the last energy.

For  GAMESS log files
grep "TOTAL FREE ENERGY IN SOLVENT" *.log > xx.energies

I use a phython script to compute the pKa values from the .energy files

MOPAC pKa values
python pka_morgan_list+sub.py xx.energies big_table_3.smiles > xx.pka

GAMESS pKa values
python pka_morgan_list+sub_gms.py xx.energies big_table_3.smiles > xx.pka

Here big_table_3.smiles all the .smiles files, including reference molecules, combined into one.

The python script finds the most lowest free energy for each protonation state and the appropriate reference molecule for each ionizable site.  The protonation state is defined by the "-" and "+"s so that the lowest free energy for protonated histamine is the lowest free energy found in the output files named "Histamine+_*".  The pKa is related to the free energy difference between e.g. "Histamine+" and "Histamine" or "Phenylalanine-+" and "Phenylalanine-" or  "Procaine++" and "Procaine+", i.e.

delta_G = energies[name+"+"] - energies[name]

The python script contains SMILES strings for all the reference molecules and the appropriate reference for a titration is the reference molecule with the largest substructure match around the titrating nitrogen.

Some future improvements
I need to automate the tautomer generation and I shouldn't hand-pick the protonation state.  For example, I should consider both possible single protonation states in histamine and have the program automatically use the lowest free energy.  This also means that for, e.g. phenylalanine I should consider both the neutral and zwitterionic protonation state, i.e. "Phenylalanine-+_A" and "Phenylalanine-+_B" and  pick the lowest energy.