Tuesday, June 16, 2020

Generating a random molecule from a chemical formula

Theo posted the following question on the RDKit mailing list:

is there maybe a way with RDKit to generate random (but valid) molecules with a given chemical sumformula?
For example:
C12H9N could generate Carbazole as valid compound.
The output would be mol or SMILES.
This is actually a difficult problem if one wants to enumerate all the possibilities, but it is not too difficult to whip up code that suggests some possibilities, though some of the suggestions may be pretty unrealistic.

I start by generating a linear hydrocarbon with the correct number of heavy atoms. Then I randomly change some of the carbons to the other atoms in the molecule. If there are too many hydrogens, I introduce multiple bonds and rings until the hydrogen count is correct. Here I use some of the mutation operations from my graph-based genetic algorithm.

One issue is that it will only produce linear molecules for saturated systems. This can be fixed by adding some branching mutations, e.g. CCCC >> CC(C)C.
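The number of rings and multiple bonds that have to be introduced follows directly from the formula. Here is a minimal sketch of that bookkeeping in plain Python. It is not the code from the post: the function names and the VALENCE table are my own illustrative choices, and it assumes standard valences and a simple formula without brackets or charges.

```python
import re

# Assumed standard valences; elements not listed here raise a KeyError.
VALENCE = {"H": 1, "F": 1, "Cl": 1, "Br": 1, "I": 1,
           "O": 2, "S": 2, "N": 3, "P": 3, "C": 4}


def parse_formula(formula):
    """Turn a formula such as 'C12H9N' into {'C': 12, 'H': 9, 'N': 1}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def unsaturations(formula):
    """Rings plus pi bonds needed to match the formula.

    Uses the degree-of-unsaturation relation 1 + sum(n_i * (v_i - 2)) / 2,
    where n_i is the atom count and v_i the valence of element i.
    """
    counts = parse_formula(formula)
    twice = 2 + sum(n * (VALENCE[el] - 2) for el, n in counts.items())
    if twice < 0 or twice % 2:
        raise ValueError(f"no closed-shell molecule matches {formula}")
    return twice // 2


print(unsaturations("C12H9N"))  # carbazole: 3 rings + 6 double bonds
```

A saturated formula such as C4H10 gives zero, which is the case where only the heavy-atom swap (and branching) is needed.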

This work is licensed under a Creative Commons Attribution 4.0

Sunday, January 19, 2020

Computing Graph Edit Distance between two molecules using RDKit and Networkx

During a Twitter discussion Noel O'Boyle introduced me to Graph Edit Distance (GED) as a useful measure of molecular similarity. The advantages over other approaches such as Tanimoto similarity are discussed in these slides by Roger Sayle.

It turns out Networkx can compute this, so it's relatively easy to interface with RDKit and the implementation is shown below.
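The original implementation is not reproduced here. Below is a minimal sketch of such an interface, assuming atoms are matched on element symbol and bonds on bond order. The helper names mol_to_graph and molecular_ged are hypothetical; the only library calls used are networkx's graph_edit_distance and the standard RDKit atom and bond accessors.

```python
import networkx as nx


def mol_to_graph(mol):
    """Convert an RDKit molecule into a networkx graph with labelled atoms and bonds."""
    graph = nx.Graph()
    for atom in mol.GetAtoms():
        graph.add_node(atom.GetIdx(), element=atom.GetSymbol())
    for bond in mol.GetBonds():
        graph.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
                       order=bond.GetBondTypeAsDouble())
    return graph


def molecular_ged(graph1, graph2, timeout=None):
    """Graph edit distance where atoms match on element and bonds on bond order.

    timeout (in seconds) is passed through to networkx; when it is hit,
    the best distance found so far is returned instead of the exact one.
    """
    return nx.graph_edit_distance(
        graph1, graph2,
        node_match=lambda a, b: a["element"] == b["element"],
        edge_match=lambda a, b: a["order"] == b["order"],
        timeout=timeout,
    )
```

With RDKit installed, the input graphs would come from something like mol_to_graph(Chem.MolFromSmiles("CCO")).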

Unfortunately, the time required for computing GED increases exponentially with molecule size, so this implementation is not really of practical use.

Sayle's slides discuss one solution to this, but it's far from trivial to implement. If you know of other open source implementations, please let me know.

Update: GitHub page

This work is licensed under a Creative Commons Attribution 4.0

Saturday, January 18, 2020

Open access chemistry publishing options in 2020

Here is an updated list of affordable impact-neutral and other select OA publishing options for chemistry.

Impact neutral journals
$0 (in 2020) PeerJ chemistry journals. Open peer review. (Disclaimer: I am an editor for PeerJ Physical Chemistry)

$638 (normally $850) Results in Chemistry. Closed peer review.

$1000 F1000Research. Open peer review. Bio-related

$1095 PeerJ - Life and Environment. Open peer review. Bio-related. PeerJ also has a membership model, which may be cheaper than the APC.

$1250 ACS Omega. Closed peer review. WARNING: not real OA. You still sign away your copyright to the ACS.

(The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

$1350 Cogent Chemistry. Has a "pay what you can" policy. Closed peer review.

$1595 PLoS ONE. Closed peer review.

$1790 Scientific Reports. Closed peer review.

Free or reasonably priced journals that judge perceived impact
$0 Chemical Science. Closed peer review.

$0 CCS Chemistry. Closed peer review.

$0 Beilstein Journal of Organic Chemistry. Closed peer review.

$0 Beilstein Journal of Nanotechnology. Closed peer review.

$0 ACS Central Science. Closed peer review. ($500-1000 for CC-BY. WARNING: not real OA. You still sign away your copyright to the ACS as far as I know.)

$100 Living Journal of Computational Molecular Science. Closed peer review.

€500 Chemistry2. Closed peer review.

£750 RSC Advances. Closed peer review.

Let me know if I have missed anything.

This work is licensed under a Creative Commons Attribution 4.0

Wednesday, August 14, 2019

Machine Learning Basics

The Faculty of Science maintains a list of research presentations that high school classes can choose from when planning a visit. The description of the talk can include links to material the students can use to prepare and keep working on after the visit. So I made a series of video lectures about machine learning and Python for people with no other background than high school level mathematics.

I hope to add more videos/topics as I find the time and I hope this will get some of the students interested in programming and machine learning.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Tuesday, July 16, 2019

Planned papers for 2019 - six months in

In January I wrote about the papers I plan to publish in 2019 and made this list:

1. Graph-based Genetic Algorithm and Generative Model/Monte Carlo Tree Search for the Exploration of Chemical Space

2. Screening for energy storage capacity of meta-stable vinylheptafulvenes
3. Testing algorithms for finding the global minimum of drug-like compounds
4. Towards a barrier height benchmark set for biologically relevant systems - part 2
5. SMILES-based genetic algorithms for chemical space exploration

6. Further screening of bicyclo[2.2.2]octane-based molecular insulators
7. Screening for electronic properties using a graph-based genetic algorithm
8. Further screening for energy storage capacity of meta-stable vinylheptafulvenes

Six months later the status is:


Probably submitted in 2019
While we could certainly have gotten this version published, we decided to write an even better paper where we screen all 200 billion molecules and make an even better ML model. We're almost done with the additional calculations.

5. SMILES-based genetic algorithms for chemical space exploration
The calculations are basically done (here, here, and here) and I just started working on the paper now.

3. Testing algorithms for finding the global minimum of drug-like compounds
The coding is basically done and I started generating data for a paper, but then decided to work on paper 5 first. This paper is next.

I think that'll be it for 2019. I went on to the 2nd round for a research center application and had to write a big proposal, so I got behind on paper writing in the Spring. I also decided to spend more time on making excuses :).

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Monday, June 17, 2019

Useful introductory books and blogposts on neural networks

Here's a list of books and blogposts on neural networks and related aspects that I have found particularly useful. In general, I like very simple examples - preferably with Python code - to introduce me to a topic.

This book is an excellent place to start. The book explains the basics of NNs and guides you through writing your own 3-layer NN code from scratch and applying it to the MNIST set. The book even introduces you to Python, so this is something virtually anyone can do. My only (minor) complaint is that the code uses classes, which can be quite difficult for beginners to grasp and are not really needed here.

Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More
This book offers brief and to-the-point descriptions of some of the major classes of NNs, such as CNNs and RNNs, in the first chapters and then walks you through many interesting applications using the DeepChem library. This book gets you started using NNs very quickly and is an excellent supplement to the more basic or more theoretical approaches in this list.

Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning
This is a more formal treatment of deep learning but I still found it (mostly) very readable and there are several useful pseudo-code examples with Python equivalents. The topics are discussed in roughly chronological order, so you also get a good feel for how the NN field developed including major milestones.

This is basically the equivalent of Make Your Own NN but for an RNN applied to a toy problem.

Both posts offer some very simple Python examples of what convolution actually means for images.

A very simple Python introduction to graph convolution, which works quite a bit differently from image convolution.

This work is licensed under a Creative Commons Attribution 4.0

Friday, June 14, 2019

Comparison of SMILES-, DeepSMILES-, SELFIES-, and graph-based genetic algorithms Part 2

This post is a follow-up to this post. There are two changes:

In that post I generated the data for the string-based methods using my graph-based GA (GB-GA) code interfaced with new, string-based, crossover and mutation code. However, this involves going back and forth between graph- and string-based representations, which could potentially change the atom order. To make sure that doesn't happen I have now written a standalone string-based GA code, where strings are only converted to graphs when computing the score and graphs are never converted back to strings.

I also had another look at Brown et al.'s GA code and noticed that they remove duplicates from the population for each generation, which my code didn't. So I implemented that as well for both the graph- and string-based methods. In the table below I list the best results, where results from the original implementation that does not remove duplicates are indicated by a "*".
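The duplicate removal step can be sketched in a few lines of plain Python. This is an illustration under my own assumptions rather than the code used for the results: remove_duplicates and its key argument are hypothetical names, and population members are taken to be strings.

```python
def remove_duplicates(population, key=lambda member: member):
    """Return the population with repeated members dropped, keeping first occurrences.

    key maps a member to the value used for comparison. The default compares
    members directly; a canonicalising function could be passed instead so
    that different strings for the same molecule count as duplicates.
    """
    seen = set()
    unique = []
    for member in population:
        identity = key(member)
        if identity not in seen:
            seen.add(identity)
            unique.append(member)
    return unique


generation = ["CCO", "CC", "CCO", "OCC"]
print(remove_duplicates(generation))
```

Note that plain string comparison treats "CCO" and "OCC" as different members even though they describe the same molecule, so for string-based populations the choice of key matters.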

For GB the removal of duplicates only improves results for celecoxib, where it is now rediscovered 8 times instead of 4. Tiotixene is not rediscovered and troglitazone is only found once with GB-GA, when duplicates are removed.

The new string-based implementation improves results for SMILES and DeepSMILES, with the exception of SMILES for troglitazone, which is discovered once using the old implementation. For SELFIES the new implementation is a little bit worse, but I would say the difference is within the statistical uncertainty. 

GB still tends to outperform string-based methods, though they all perform much better than I had expected. Amazingly, DeepSMILES and SELFIES do not appear to offer a clear advantage over SMILES with the exception of troglitazone, where DeepSMILES performs significantly better.

Here are the high-scoring molecules found with string-based methods. Some of the molecules have radical centers (red boxes) due to misplaced chiral centers.


This work is licensed under a Creative Commons Attribution 4.0