Proteins and Wave Functions: python

Showing posts with label python. Show all posts

Saturday, March 3, 2018

Very simple way to animate in Jupyter notebooks

I want to use Google Colaboratory for my python course, but the module we use for animation doesn't work there because the necessary libraries are not installed. I had planned to spend part of the weekend to see if I could find a solution but by amazing coincidence a tweet in my Twitter stream this morning had the answer!

It's alive!! #Python #procrastination #gameoflife #automaton pic.twitter.com/k9VYJ7y2KX
— Oscar Branson (@oscarbranson) March 2, 2018

The answer is simply to display the images sequentially and clear the output each time with clear_output(wait=True) from IPython.display. You can see the demo code I wrote below. This works very nicely on my off-line notebook (where you want to execute this line in a cell above the notebook itself: %matplotlib inline) while the animation is a bit choppy on Colaboratory, presumably due to network latency. But it's certainly good enough.

Thanks again to Twitter for saving me tons of time!

This work is licensed under a Creative Commons Attribution 4.0

Sunday, December 3, 2017

TS_conf_search: Conformer search for transition states

Here's som prototype code for performing a conformer search for a transition state. The idea is that you have found a TS for small model of the system and now you want to attach floppy ligands, generate different conformers for these ligands, and MMFF minimise them while keeping the core (model) frozen. Each conformer is then used as an initial guess for a TS search, perhaps preceded by a constrained minimisation using the QM method.

Usage
To use the program type "python TS_conf_search ts.sdf small_model.sdf 5".

ts.sdf is an sdf file that contains a guess structure of the TS you want to do a conformational search for. The coordinates of the small model-part of the structures does not have to match that of the small model and it doesn't even have to be very TS-like but it has reflect the bonding in the small model. Alternatively you can provide a SMILES string of the molecule.

small_model.sdf is an sdf file with with the optimized coordinates for the small model TS. You can use any name for the file as long as the extension is sdf. TS_conf will only use the coordinates of the non-hydrogen atoms, unless the word "keepHs" is part of the file name. If you use keepHs then the hydrogen atoms have to have corresponding hydrogen atoms in the big model (see examples below)

5 is the number conformers you want to generate and you can use any integer you want here. The default is 20.

What TS_conf_search does
TS_conf_search uses the ConstrainedEmbed function implemented in RDKit, which generates a conformer of a molecule, finds the atoms in the molecule corresponding to the small model, and then performs a FF minimization in which the Cartesian coordinates of these atoms are constrained to match the Cartesian coordinates of the small model. If the small model contains two separate molecules then intramolecular energy terms are ignored.

sdf files
sdf files contain the Cartesian coordinates of the atoms but also explicitly defines the bond orders (single, double, etc) and any formal charges. It is important the the bond orders in the ts.sdf and small_model.sdf files match. sdf files are basically the same as mol files, but if your program generates mol files be sure to change the file extension to sdf. You can use OpenBabel to generate sdf files from nearly any other format (however, see next section).

Examples (and when to use keepHs)
The GitHub repository contains two Diels-Alder examples, where the small models are the PM3 optimized transition states for R = H, and the big model is R = CH2CH2CH2CH3.

I created the small model sdf files by converting the GAMESS output files to sdf using Open Babel. However, the bond order didn't correspond to that shown in the picture above, so I manually edited the bond orders in the sdf file and remved the "RAD line" that OpenBabel created.

For Diels_Alder1 we type 'python TS_conf_search.py "Diels_Alder1;ClC=C.C=C(CCCC)C=C" Diels_Alder1_small.sdf 5' i.e. we build the TS guess from the SMILES string of the molecule.

For Diels_Alder2 we type "python TS_conf_search.py Diels_Alder2.sdf Diels_Alder2_small_keepHs.sdf 5". Here I got Diels_Alder2.sdf by loading the small model in to Avogadro, adding the CH2CH2CH2CH3 chain in some arbitrary conformation, and minimising the structure.

In the second example, I use keepHs because the dienophile (ethene) only has two heavy atoms, which is not enough to define the orientation of the plane of the molecule relative to the diene. Because I use keepHs I have removed the hydrogen corresponding to the R group, so that all the hydrogens in the small model can be matched to hydrogens in the large model.

While Diels-Alder2 provides one example where keepHs is useful. Another is where hydrogens atoms are an integral part of the mechanism, such as in hydrogen or proton transfer TSs.

Results

I used each of the geometries as a starting guess for a TS search using GAMESS where I calculated the Hessian to start with (hess=calc) and after every fifth step (ihrep=5).

In the first example none of the five runs found the correct TS, while for the latter example all 5 worked. When I re-ran the first example with python TS_conf_search.py "Diels_Alder1;ClC=C.C=C(CCCC)C=C" Diels_Alder1_small_keepHs.sdf 5 it worked fine.

Not using keepHs has the advantage that one doesn't have to make different small model sdf files when different hydrogens are replaced by substituents. But in this case an extra constrained minimisation step appears to be needed to get the hydrogen atoms placed correctly.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Sunday, November 12, 2017

Atom_mapper: determining atom correspondence between reactants and products

Many TS search algorithms interpolate between reactants and products and require that the atom order is identical. This is time consuming to enforce by hand and hard to automate, but this paper by the folks at Schrödinger describe a solution that seems to work pretty well. The general idea is also fairly easy to implement if you are familiar with RDKit or similar cheminformatics tools and here I describe my implementation so far.

The basic idea
If two molecules are identical but have different atom numbering then RDKit can map the atoms using "GetSubstructMatch". So (pseudocode) "OCC.GetSubstructMatch(CCO)" gives "(2,1,0)" meaning that atom 2 in CCO corresponds to atom 0 in OCC and so forth. And "RenumberAtoms(CCO, (2,1,0))" gives OCC.

Let's say our reaction is CH3-CH=O -> CH2=CH-OH. "product.GetSubstructMatch(reactant)" will not give a match because the molecules are different. To focus on connectivity we remove bond orders: CH3-CH-O -> CH2-CH-OH, but the molecules are still different. The next step is to break each bond in turn and compare. For the reactant you get [H + CH2-CH-O, CH3 + CH-OH, H + CH3-C-O, O + CH3-CH2] and the first entry will match the last entry in the corresponding list for the product [H + CH-CH-OH, CH2 + CH-OH, H + CH2-C-OH, H + CH2-CH-O].

It is very easy to check to for equivalence by converting the list of fragmented molecules to a set of corresponding SMILES strings and looking for identical elements in the two sets: "matches = list(set(react_frag_smiles) & set(prod_frag_smiles))". If there is a match then the atoms can be matched as before because they are identical (note that the bond breaking does not alter the atom numbering of the molecule). If no match is found then the procedure is repeated for the fragments.

One problem is that equivalent atoms are not matched in 3D. For example, if you label (i.e. number) the equivalent H atoms in a methyl group then the methyl group is chiral and RDKit will create a random chirality. So, the Cartesian coordinated of methyl hydrogens in the CCO example above are not necessarily superimposable, which creates problems in TS search algorithms that use interpolation.

Following the ideas presented in the Schrödinger paper, I try to solve this issue by switching pairs of equivalent atoms and comparing the RMSDs between reactant and product before and after. However, this will only work if the reactants and products have the same conformation, so I use a modified version of ConstrainedEmbed (ConstrainedEmbedRP) which uses either Cartesian or distance constraints to match two structures using a UFF force field minimization of one of the structures.

Using ConstrainedEmbed has the added advantage of producing reactant and product structures that are in the same orientation and, therefore, often good starting points for interpolation.

Atom_mapper
* I have implemented these ideas in Atom_mapper and made it available on GitHub under the MIT license.
Usage
* The input is SMILES strings of reactant and products but the code can be adapted to work with user-supplied coordinates via sdf files.
* The output is two sdf files labelled "template" and "mol" with (hopefully) identical atom ordering. Template is the most rigid of the two and the coordinates come from an unconstrained UFF minimization. Mol is superimposed on template using ConstrainedEmbedRP.
* Both geometries should probably be minimized using the energy function you plan to use for the TS search and, depending on the optimization algorithm, the structures should probably be rigidly superimposed before interpolation
Current limitations
* I haven't tested the code extensively. If you run into cases where it doesn't work, file a bug report on GitHub
* No conformation search is done so the conformations chosen will be random. Solution: start with coordinates rather than SMILES.
* If reactants or products consists of two molecules their relative orientation resulting from the UFF minimization will be somewhat arbitrary. This is not a problem if only the reactants or products consists of two molecules, since the coordinates will be superimposed on a more rigid molecule. However, it is a problem if the reactants and products both consist of two molecules. Solution: start with coordinates rather than SMILES, but not sure how to determine best intermolecular orientation automatically.
* The code only permutes pairs of equivalent atoms so it won't work for equivalent groups, e.g. methyl groups in CC(C)(C)CO because we need to switch several pairs (corresponding to C and 3 H atoms) simultaneously. Not sure how to fix that easily. Suggestions welcome.

There are also some general limitations to this "active bonds" approach as described in the article.

This work is licensed under a Creative Commons Attribution 4.0

Sunday, September 3, 2017

Automatic generation of a set of molecules

Many quantum chemistry projects have reached a point where setup and analysis consumes more human time than CPU time, e.g. it takes all day to set-up enough input files to keep the computer busy overnight. Many people use scripts to automatically extract and analyse the data, but few use scripts to generate the coordinates.

Here I show how this can easily be done using the RDKit toolkit. The following python script adds F and OH substituents to benzene making all possible 91 combinations of mono-, di-, hexa-substituted molecules.

However, the script can easily be changed to do something else by changing parent_smiles and rxn_smarts_list in line 4 and 5. If you are not familiar with SMILES start here and there are plenty of GUIs, such as this, that generate SMILES.

To use the Reaction SMARTS you have to learn SMARTS, which can be a bit tricky, but it is a very powerful tool. For example, if you change [cX3;H1:1]>>[*:1]F to [*;H1:1]>>[*:1]F then the program will add H to any atom with one H, i.e. also the OH group to create the OF substituent. So it you set substitutions = 2, you'll get mono-substituted Ph-OF in addition to mono- and di-substituted Ph-F and Ph-OH.

Similarly, ff you add [cX3;H1:1][cX3;H1:2]>>[c:1]1[c:2]cccc1 to the list (and use substitutions = 2) you'll get un- and mono-substituted napthalene as well as un-substituted anthracene and phenanthrene.

In my experience, the only thing that limits what I can build with this approach is my understanding of SMARTS. Hope this is of some use to you.

This work is licensed under a Creative Commons Attribution 4.0

Monday, August 22, 2016

Finding the reference molecule for a pKa calculation using RDKit

This is prototype code related to this project. I use Histamine as an example which has two ionizable sites: a primary amine and an imidazole ring.

The code figures out that imidazole is the best reference for the imidazole ring, while ethylamine is the best reference for the primary amine. The code does this by figuring out which atom is being deprotonated, computes the Morgan fingerprint around this atom, and compares it to the Morgan fingerprints of imizadole and ethylamine.

Sunday, August 7, 2016

Conformer search with RDKit

I'm teaching myself how to use RDKit. Here is code for conformer search using RDKit that also computes the energy of each conformer using the MMFF94 force field.

Comments welcome

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Thursday, July 21, 2016

Finding disordered residues in an NMR ensemble

Note to self: here's how you identified disordered residues in the NMR ensemble 2KCU.pdb

1. In Pymol: "fetch 2kcu"

2. ~~Action > align > states (*/CA)~~
2016.08.07 update: the above command also aligns the tails. Use "intra_fit (2kzn///6-158/CA)"

3. "save 2kcu_aligned.pdb, state=0"

4. In terminal: grep CA 2kcu_aligned.pdb > lis

5. python disorder.py

disorder.py (given below) calculates the standard deviation of the x, y, and z coordinate of each CA atom ($\sigma_{x,i}, \sigma_{y,i}, \sigma_{z,i})$. It then averages these three standard deviations for each CA atom $(\sigma_i)$. To find outliers, it averages these values for the entire protein $(\langle \sigma_i \rangle)$ and computes the standard deviation of this average $(\sigma_{\langle \sigma_i \rangle})$. Any residues for which $\sigma_i > \langle \sigma_i \rangle + \sigma_{\langle \sigma_i \rangle}$ is identified as disordered.

Here I've colored the disordered residues red (haven't updated the picture based on Step 2-change yet)

Yes, I know: "the 1970's called and want their Fortran code back". How very droll.

This work is licensed under a Creative Commons Attribution 4.0

Tuesday, July 12, 2016

Reproducing stats or verbose output from LINEST command in Excel or Google Sheet in Python

The python code above reproduces the output produced by the LINEST(y;x;true;true) command in Excel [LINEST(y,x,true,true) in Google Sheets] with a csv file as input. In the csv file I have assumed that the x and y columns are labelled "x" and "y" respectively. This page has a good explanation of the output (pdf).

This work is licensed under a Creative Commons Attribution 4.0

Sunday, May 10, 2015

Python peer instruction questions

I am teaching a molecular simulations/intro to python course and have just finished drafting the last sets of peer instruction questions. Here's the last question. The idea is that they have to write a small python program on the spot but this might be more of a take home question. Can you do it?

I'm not always this "evil". Here's a more typical one.

This work is licensed under a Creative Commons Attribution 4.0

Friday, April 26, 2013

Determining the distribution of multiple samples in python.

I am currently working on a probabilistic model for assigning peaks from protein NMR spectra to the corresponding spin-systems of the protein. So far I have assumed that the error for different measurements of the same chemical shift follows a normal distribution.
Analyzing a set of already assigned peak-lists for the protein S6, it is clear that there's more outliers than a normal distribution predicts, which justifies a deeper analysis of the data. However even if I assume that the error for e.g. all amid protons in the HNCA spectrum follows the same distribution, the samples variance might be different for the HNcaCO experiment.
A full analysis leaves me in this case with 45 different samples that most likely follow the same kind of distribution with differing parameters. So the question was how to continue from here. Luckily I found this blog post.

What the author did was to brute force all 80 distributions that scipy.stats offers and list the p and D values from a Kolmogorov-Smirnov test for a single sample. I ended up improving the code to support multiple samples, as well as displaying the log likelihood of the fit. (Any distribution that fits extremely poorly with a single sample is removed from the output)

The code can be found here: distribution-check.py

If all your samples are in the folder samples/ with the extension .txt, then simply run

python distribution-check.py samples/*.txt

As an example, here's the output for running the script on my 45 samples:

Apparently a normal distribution is not a very good 2-parameter representation of my samples, even though the simplicity of it might still be beneficial, so I will probably end up testing a few others. The -inf values probably occurs due to asymptotic behaviour at the mean and I would suggest just ignoring them, and use the KS test instead to determine if its a good fit