Proteins and Wave Functions

Affordable open access chemistry publishing options in 2023

2023-01-11T08:08:00.004+01:00

Here is an updated list of affordable (APC < $2000) impact neutral and other select OA publishing options for chemistry

Impact neutral journals
\$1195 PeerJ Chemistry journals. Open peer review. (Disclaimer I am Editor-in-chief for PeerJ Physical Chemistry). PeerJ also has a membership model, which may be cheaper than the APC.

\$1210 Results in Chemistry. Closed peer review

\$1350 F1000Research. Open peer review. Bio-related

\$1595 PeerJ - Life and Environment. Open peer review. Bio-related. PeerJ also has a membership model, which may be cheaper than the APC.

\$1680. Royal Society Open Science. Open peer review.

(The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

\$1685 ACS Omega. Closed peer review.

$1805 PLoS ONE. Closed peer review.

Free or reasonably priced journals that judge perceived impact

$0 Chemical Science Closed peer review

$0 CCS Chemistry Closed peer review

$0 Beilstein Journal of Organic Chemistry. Closed peer review.

$0 Beilstein Journal of Nanotechnology. Closed peer review.

$0 SciPost Chemistry Open peer review. (Disclaimer: I am an editor for SciPost)

$0 SciPost Chemistry Core Open peer review. (Disclaimer: I am an editor for SciPost)

$0 Digital Discovery Open peer review. (Disclaimer: I am on the advisory board)

\$0 ACS Central Science. Open peer review.

$100 Living Journal of Computational Molecular Science. Closed peer review

€100 Chemistry². Closed peer review.

£1000 RSC Advances. Closed peer review.

Let me know if I have missed anything.

This work is licensed under a Creative Commons Attribution 4.0

Open access chemistry publishing options in 2022

2022-01-04T10:50:00.002+01:00

Here is an updated list of affordable impact neutral and other select OA publishing options for chemistry

Impact neutral journals
\$1000 Results in Chemistry. Closed peer review

\$1195 PeerJ Chemistry journals. Open peer review. (Disclaimer I am an editor for PeerJ Physical Chemistry). PeerJ also has a membership model, which may be cheaper than the APC.

\$1195 PeerJ - Life and Environment. Open peer review. Bio-related. PeerJ also has a membership model, which may be cheaper than the APC.

\$1350 F1000Research. Open peer review. Bio-related

$1350 PLoS ONE. Closed peer review.

$1680. Royal Society Open Science. Open peer review.

(The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

\$1685 ACS Omega. Closed peer review.

$1990 Scientific Reports. Closed peer review

Free or reasonably priced journals that judge perceived impact

$0 SciPost Chemistry Open peer review. (Disclaimer: I am an editor for SciPost)

$0 SciPost Chemistry Core Open peer review. (Disclaimer: I am an editor for SciPost)

$0 Digital Discovery Open peer review.

\$0 ACS Central Science. Open peer review. ($1000 for CC-BY)

$100 Living Journal of Computational Molecular Science. Closed peer review

€500 Chemistry². Closed peer review.

£750 RSC Advances. Closed peer review.

Let me know if I have missed anything.

This work is licensed under a Creative Commons Attribution 4.0

Can machine learning regression extrapolate?

2021-03-22T11:17:00.001+01:00

I recently developed a ML model to predict pIC50 values for molecules and used it together with a genetic algorithm code to search for molecules with large pIC50 values. However, the GA searches never found molecules with pIC50 values that where larger than in my training set.

This brought up the general question of whether ML models are capable of outputting values that are larger than those found in the training set. I made a simple example to investigate this issue for different ML models.

$\mathbf{X}_1 = (1, 0, 0)$ corresponds to 1, $\mathbf{X}_2 = (0, 1, 0)$ corresponds to 2, and $\mathbf{X}_3 = (0, 0, 1)$ corresponds to 3.

The code can be found here. If you are new to ML check out this site

Linear Regression
This training set can be fit by a linear regression model $y = \mathbf{wX}$ with the weights $\mathbf{w} = (1, 2, 3)$. Clearly this simple ML can extrapolate in the sense that, for example, $\mathbf{X} = (1, 0, 1)$ will yield 4, which is larger than max value in the training set (3). Similarly, $\mathbf{X} = (0, 0, 2)$ will yield 6.

Neural Network
Next I tried a NN with one hidden layer with 2 nodes and the sigmoid activation function. For this model $\mathbf{X} = (1, 0, 1)$ yields 1.6 and $\mathbf{X} = (0, 0, 2)$ yields 3.2, which is only slightly larger than 3.

The output of the NN is given by $\mathbf{O}_h\mathbf{w}_{ho}+\mathbf{b}_{ho}$, where $\mathbf{O}_h$ is the output of the hidden layer. Using the sigmoid function, the maximum value of $\mathbf{O}_h = (1, 1)$, for which $\mathbf{O}_h\mathbf{w}_{ho}+\mathbf{b}_{ho}$ = 3.3. So this is the maximum value this NN can output. For comparison, $\mathbf{O}_h = (0.99, 0.65)$ for $\mathbf{X}_3 = (0, 0, 1)$.

If I instead use the ReLU activation function (which doesn't have an upper bound), $\mathbf{X} = (1, 0, 1)$ yields 2.2 and $\mathbf{X} = (0, 0, 2)$ yields 4.2, which is somewhat larger than 3.

So, NNs that exclusively use ReLU can in principle yield values that are larger than those found in the training set. But if one layer uses bounded activation functions such as sigmoid, then it depends on how close the outputs of that layer are to 1 when predicting the largest value in the training set.

Random Forest
The example is so simple that it can be fit with a single decision tree (RF outputs the mean prediction of a collection of such decision trees):

Clearly the RF can only output values in the range of the training set.

This work is licensed under a Creative Commons Attribution 4.0

Open access chemistry publishing options in 2021

2021-01-08T10:49:00.016+01:00

Here is an updated list of affordable impact neutral and other select OA publishing options for chemistry

Impact neutral journals
\$1000 Results in Chemistry. Closed peer review

\$1250 ACS Omega. Closed peer review.

\$1350 F1000Research. Open peer review. Bio-related

$1680. Royal Society Open Science. Open peer review.

(The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

$1695 PLoS ONE. Closed peer review.

$1990 Scientific Reports. Closed peer review

Free or reasonably priced journals that judge perceived impact

$0 SciPost Chemistry Open peer review. (Disclaimer: I am an editor for SciPost)

$0 SciPost Chemistry Core Open peer review. (Disclaimer: I am an editor for SciPost)

$0 Digital Discovery Open peer review.

\$0 ACS Central Science. Open peer review. ($1000 for CC-BY)

Generating publication quality figures of molecules using RDKit

2020-12-22T14:53:00.001+01:00

With RDKit version 2020.09 the drawing code got a major overhaul and, IMO, it can now be used to make figures for publications. Based on Greg Landrum's show_mol function I made a simpler version for multiple molecules that'll also give you png and, especially, pdf files.

Blogger doesn't allow you to upload pdf files, so the picture above really doesn't do the corresponding pdf file justice.

This work is licensed under a Creative Commons Attribution 4.0

Mapping atoms to reactants in products made with reaction SMARTS

2020-11-06T12:57:00.001+01:00

Reaction SMARTS is an RDKit feature that is very useful for generating reactant products pairs for a given reaction. Unfortunately the algorithm changes the atom order between reactants and products, which creates problems one tries to locate the reaction paths using an interpolation-based algorithm such as nudged elastic band (NEB).

Fortunately, RDKit keeps track of the the change in atom order (thanks to my MS student Julius Seumer for the tip!) and it's easy to reorder the atoms:

def reorder_product(product):
  reorder_inverse = [int(atom.GetProp('react_atom_idx')) for atom in product.GetAtoms()]
  reorder = len(reorder_inverse)*[0]

  for i in range(len(reorder_inverse)):
    reorder[reorder_inverse[i]] = i

  product = Chem.RenumberAtoms(product, reorder)

  return product

If the 3D structures are to be used for interpolation it is important to embed the reactant structure before converting it to product. This keeps the "label-chirality" of groups such as CH2 the same in reactant and products.

Here is a demo notebook

This work is licensed under a Creative Commons Attribution 4.0

atom_mapper: matching all atoms in reactants and products

2020-10-31T12:50:00.003+01:00

It's almost 3 years ago that I wrote about atom mapper and now the code has received a major overhaul thanks to my PhD student Mads Koerstz. The 2D atom mapping part is basically left untouched, but the main new thing is a general implementation of the 3D mapping problem.

The 3D mapping problem is that if labels (i.e. atom orders) are considered then all tetrahedral centers are chiral and the chirality of centers with equivalent atoms, such as CH2 groups, generated by RDKit's embed function will be arbitrary and unlikely to match in reactants and products. This creates problems for methods such as nudged elastic band (NEB) that try to determine the reaction path by interpolation.

Mads found a clever approach using chiral atom tags, where he generates arbitrary tags for the reactant and makes sure the tags match in the product. If there are true chiral centers he also generates all enantiomers.

The old version had some code that tried to align the coordinates, but that has been removed since that can be done much better with xTBs reaction path method.

Here is a demo notebook

This work is licensed under a Creative Commons Attribution 4.0

Generating a random molecule from a chemical formula

2020-06-16T15:20:00.000+02:00

Theo posted the following question on the RDKit mailing list

is there maybe a way with RDKit to generate random (but valid) molecules with a given chemical sumformula?
For example:
C12H9N could generate Carbazole as valid compound.
The output would be mol or SMILES.

This is actually a difficult problem, if one wants to enumerate all the possibilities, but it is not too difficult to whip up code that suggests some possibilities, though some of the suggestions may be pretty unrealistic.

I start by generating a linear hydrocarbon with the correct number of heavy atoms. The randomly change some of the carbons to the other atoms in the molecule. If there are too many hydrogens, I introduce multiple bond and rings until the atom count is correct. Here I use some of the mutation operations from my graph based genetic algorithm.

One issue is that is it will only produce linear molecules for saturated systems. This can be fixed by adding som branching mutations, e.g. CCCC>> CC(C)C.

This work is licensed under a Creative Commons Attribution 4.0

Computing Graph Edit Distance between two molecules using RDKit and Networkx

2020-01-19T14:14:00.000+01:00

During a Twitter discussion Noel O'Boyle introduced me to Graph Edit Distance (GDE) as a useful measure of molecular similarity. The advantages over other approaches such as Tanimoto similarity is discussed in these slides by Roger Sayle.

It turns out Networkx can compute this, so it's relatively easy to interface with RDKit and the implementation is shown below.

Unfortunately, the time required for computing GDE increases exponentially with molecule size, so this implementation is not really of practical use.

Sayle's slides discusses one solution to this, but it's far from trivial to implement. If you know of other open source implementations, please let me know.

Update: GitHub page

This work is licensed under a Creative Commons Attribution 4.0

Open access chemistry publishing options in 2020

2020-01-18T12:02:00.001+01:00

Here is an updated list of affordable impact neutral and other select OA publishing options for chemistry

Impact neutral journals
$0 (in 2020) PeerJ chemistry journals. Open peer review. (Disclaimer I am an editor for PeerJ Physical Chemistry)

\$638 (normally \$850) Results in Chemistry. Closed peer review

$1000 F1000Research. Open peer review. Bio-related

$1095 PeerJ - Life and Environment. Open peer review. Bio-related. PeerJ also has a membership model, which may be cheaper than the APC.

$1250 ACS Omega. Closed peer review. WARNING: not real OA. You still sign away your copyright to the ACS.

$1260. Royal Society Open Science. Open peer review.

(The RSC manages "the journal’s chemistry section by commissioning articles and overseeing the peer-review process")

$1350 Cogent Chemistry. Has a "pay what you can" policy. Closed peer review.

$1595 PLoS ONE. Closed peer review.

$1790 Scientific Reports. Closed peer review

Free or reasonably priced journals that judge perceived impact

$0 Chemical Science Closed peer review

$0 CSS Chemistry Closed peer review

$0 Beilstein Journal of Organic Chemistry. Closed peer review.

$0 Beilstein Journal of Nanotechnology. Closed peer review.

$0 ACS Central Science. Closed peer review. ($500-1000 for CC-BY, WARNING: not real OA. You still sign away your copyright to the ACS as far as I know)

Machine Learning Basics

2019-08-14T12:43:00.001+02:00

The Faculty of Science maintains a list of research presentations that high school classes can choose from when planning a visit. The description of the talk can include links to material the students and use to prepare and keep working on after the visit. So I made a series of video lectures about machine learning and Python for people with no other background than high school level mathematics.

I hope to add more videos/topics as I find the time and I hope this will get some of the students interested in programming and machine learning.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Planned papers for 2019 - six months in

2019-07-16T11:12:00.000+02:00

In January I wrote about the papers I plan to publish in 2019 and made this list:

Submitted
1. Graph-based Genetic Algorithm and Generative Model/Monte Carlo Tree Search for the Exploration of Chemical Space

Probable
2. Screening for energy storage capacity of meta-stable vinylheptafulvenes
3. Testing algorithms for finding the global minimum of drug-like compounds
4. Towards a barrier height benchmark set for biologically relevant systems - part 2
5. SMILES-based genetic algorithms for chemical space exploration

Maybe
6. Further screening of bicyclo[2.2.2]octane-based molecular insulators
7. Screening for electronic properties using a graph-based genetic algorithm
8. Further screening for energy storage capacity of meta-stable vinylheptafulvenes

Six months later the status is:

Accepted

1. Graph-based Genetic Algorithm and Generative Model/Monte Carlo Tree Search for the Exploration of Chemical Space

Probably submitted in 2019

2. High Throughput Virtual Screening of 200 Billion Molecular Solar Heat Battery Candidates

While we could certainly have gotten this version published, we decided to write an even better paper were we screen all 200 billion molecules and make an even better ML-learning model. We're almost done with the additional calculations.

5. SMILES-based genetic algorithms for chemical space exploration

The calculations are basically done (here, here, and here) and I just started working on the paper now.

3. Testing algorithms for finding the global minimum of drug-like compounds

The coding is basically done and I started generating data for a paper, but then decided on working on paper 5. This paper is next.

I think that'll be it for 2019. I went on to the 2nd round for a research center application and had to write a big proposal, so I got behind on paper writing in the Spring. I also decided to spend more time on making excuses :).

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Useful introductory books and blogposts on neural networks

2019-06-17T12:56:00.003+02:00

Here's a list of books and blogposts on neural networks and related aspects that I have found particularly useful. In general, I like very simple examples - preferably with python code - to introduce me to a topic.

Books

Make Your Own Neural Network

This book is an excellent place to start. The book explains the basics of NNs and guides you through writing your own 3-layer NN code from scratch and applying it to the MNIST set. The book even introduces you to Python, so this is something virtually anyone can do. My only (minor) complaint is that the code uses classes, which can be quite difficult for beginners to grasp and it not really needed here.

Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More
This book offers brief and to-the-point descriptions of some of the major classes of NNs, such CNN and RNN in the first chapters and then walks you though many interesting applications using the DeepChem library. This book gets you started using NNs very quickly and is an excellent supplement to the more basic or more theoretical approaches in this list.

Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning
This is a more formal treatment of deep learning but I still found it (mostly) very readable and there are several useful pseudo-code examples with Python equivalents. The topics are discussed in roughly chronological order, so you also get a good feel for how the NN field developed including major milestones.

Blogposts

Build a Recurrent Neural Network from Scratch in Python – An Essential Read for Data Scientists

This is basically the equivalent of Make Your Own NN but for a RNN applied to a toy problem.

Introduction to Convolutions using Python and Playing with convolutions in Python

Both posts offer some very simple Python examples of what convolution actually means for images.

How to do Deep Learning on Graphs with Graph Convolutional Networks

A very simple Python introduction to graph convolution, which works quite bit differently from image convolution.

This work is licensed under a Creative Commons Attribution 4.0

Comparison of SMILES-, DeepSMILES-, SELFIES-, and graph-based genetic algorithms Part 2

2019-06-14T13:28:00.000+02:00

This post is a follow up to this post. There are two changes:

In that post I generated the data for the string based methods using my graph-based GA (GB-GA) code interfaced with new, string-based, crossover and mutation code. However, this involves going back and forth between graph and string-based representations which could potentially change the atom order. To make sure that doesn't happen I have now written a stand alone string-based GA code, where strings only are converted to graphs when computing the score and graphs are never converted back to strings.

I also had a another look at Brown et al.'s GA code and noticed that they remove duplicates from the population for each generation, which my code didn't. So implemented that as well for both the graph- and string-based methods. In the table below I list the best results, where the original implementation that does not remove duplicates are indicated by a "*".

For GB the removal of duplicates only improves results for celecoxib, where it is now rediscovered 8 times instead of 4. Tiotixene is not rediscovered and troglitazone is only found once with GB-GA, when duplicates are removed.

The new string-based implementation improves results for SMILES and DeepSMILES, with the exception of SMILES for troglitazone, which is discovered once using the old implementation. For SELFIES the new implementation is a little bit worse, but I would say the difference is within the statistical uncertainty.

GB still tends to outperform string based methods, though they all perform much better than I had expected. Amazingly, DeepSMILES and SELFIES do not appear to offer a clear advantage over SMILES with the exception of troglitazone, where DeepSMILES performs significantly better.

Here are the high scoring molecules found with string based methods. Some of the molecules have radical centers (red boxes) due to misplaced chiral centers.

x

This work is licensed under a Creative Commons Attribution 4.0

Comparison of SMILES-, DeepSMILES-, SELFIES-, and graph-based genetic algorithms

2019-06-09T11:47:00.000+02:00

This post is a follow up to this post. There are three main changes: 1) I have included Emilie's code in my code, 2) I have extended the implementation to SELFIES, and 3) the initial pool of molecules is now constructed exactly as described by Brown et al. (i.e. we use the 100 highest scoring molecules from ChEMBL, but remove molecules with scores higher than 0.323).

As before, I run the 10 GA searches, each for 1000 generations, and record the overall highest score found and the average high score. If the score is 1.00 I also record the number of times I found it, in parentheses. I also record the CPU time on 8 cores (note that I stop the search once the score is 1.00, so the time is not necessarily for 10 x 1000 generations).

Here are the high scoring molecules found with string based methods

Bottom line, DeepSMILES and SELFIES perform about the same, and both tend to outperform SMILES for rediscovery using GA.

This work is licensed under a Creative Commons Attribution 4.0

Comparison of SMILES-, DeepSMILES- and graph-based genetic algorithms

2019-06-05T14:24:00.001+02:00

Emilie is wrapping up her bachelor project and writing the report and here are some preliminary results (which are likely to change a bit).

I recently developed a graph-based genetic algorithm that seems to work pretty well. The crossover and mutation code is about 250 lines with a lot of hyperparameters that mainly specify the probabilities of performing different crossovers and mutations.

The question is whether all this was really necessary or could I have gotten away with about 25 lines of code that perform crossover and mutation operations on SMILES strings? For crossover you simply cut two strings at random places and recombine the fragments, e.g. OCC|C and CC|N (where "|" indicates the cut) yield OCCN and CCC and for mutation you simply change one character to another, e.g. CCC becomes C=C.

The potential problem with using SMILES is that one can imagine many scenarios where this wouldn't work, e.g. OC(|C)C and C1C|O1 would yield OC(O1 and C1CC)C, which are not valid SMILES string. But can you still find molecules with the desired properties using this approach? If so, do the molecules look very different than the ones you find with the graph-based approach? Which approach is more efficient?

And what about the DeepSMILES representation developed by O'Boyle and Dalke? Here, OC(|C)C and C1C|O1 are written as OC|C)C and CC|O3, which would yield OCO3 and CCC)C - both valid DeepSMILES strings.

Finding molecules with specific penalised logP values
We start by looking at what Brown et al. call a "trivial optimisation objective": finding molecules with a particular modified logP values. We use the same Gaussian modifier approach with a standard deviation of 2 logP units and select the initial population from the first 1000 molecules in the ZINC data set (after removing molecules with logP values within 2 units of the target). The mating pool size and mutation rates are 20 and 10%, respectively. The table shows the average number of generations (based on 10 runs) needed to find a molecule with a logP values within 0.01 of the target.

It is clear that a SMILES-based GA has no problems meeting the objective, but that using DeepSMILES is more effective both in terms of number of required generations and CPU time. The latter, because the percentage of valid strings generated by crossover and mutation (the succes rate) is considerably higher for DeepSMILES as expected. For this target there appears to be no real advantage in using graph-based GA.

Rediscovering a molecule
The next target is considerably harder: generating molecules with a Tanimoto similarity of 1.0 with a target molecule (naphthalene, celecoxib, or tiotixene). A Tanimoto similarity of 1.0 means that each atom has the same bonding pattern out to a certain radius (here 4 bonds), i.e. that the molecules are very, very similar. The population size is 100 and the mating pool size is 200. The initial populations is 100 molecules with the highest Tanimoto scores to the target (but with Tanimoto scores less than 0.323, following Brown et al.) found among the first 50,000 molecules in the ZINC data set.

Here it's clear that the graph-based approach offer an advantage over string-based methods, while DeepSMILES only offers an advantage over SMILES in terms of effciency. The Tanimoto score goes from 0 to 1, so the molecules found for the tiotexene search using graph-based GA look significantly more like tiotexene than those found with the string based methods. (When I run celecoxib search I succeed 5/10 times, and we are still trying to find the cause of this difference.)

Finding molecules that absorb at a particular wavelength
Inspired by the study of Tsuda and co-workers, we search for molecules that absorb at 400 and 800 nm. We use Grimme's semiempirical sTDA-xTB method to estimate the absorption wavelength and oscillator strength based on a low energy MMFF-optimised structure. We use a Gaussian(400/800,50) scoring function for the wavelength and a LinearThresholded(0.3) scoring function for the oscillator strength (0.3 is the oscillator strength computed for indigo dye). The population and mating pool sizes are 20 and 40, respectively and the mutation rate is 15%. We select the initial population from the first 1000 molecules in the ZINC data set (after removing molecules with absorption wavelengths within 300 nm of the target). The GA search is stopped if the top-scoring molecule has an absorption wave length within 5 nm of the target value.

It turns out that the find molecules that absorb relatively strongly at 400 and 800 nm is a relatively easy optimisation problem.

Conclusion
If there are very few ways to meet the target then graph-based GA performs better than string-based GA methods, but otherwise not. DeepSMILES-based GA is computationally more efficient than SMILES-based GA in many cases. It would be interesting to test the newly introduced SELFIES representation.

This work is licensed under a Creative Commons Attribution 4.0

Reviews of Graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space

2019-02-21T11:08:00.000+01:00

Here are the reviews of my latest paper which just appeared in Chemical Sciences. I submitted the paper December 1, 2018 and got these reviews on January 13, 2019. I resubmitted January 20, and got the final decision on February 8. As usual with Chem. Sci. a very efficient and positive experience. Kudos to Geoff Hutchison for signing the review (and being cool with me sharing it here) and kudos to Chem. Sci. for passing it on to me.

REVIEWER REPORT(S):
Referee: 1

Recommendation: Revisions required

Comments:
The main question of this paper is whether GA-based algorithms perform better than deep-learning. This paper includes interesting comparisons and free-to-use software, which is potentially a good resource for the AI-chemistry community. Yet I have following reservations about this paper.

1) Essentially GA-based method is faster than ML, creating more molecules. The logP computation is extremely fast, allowing GA to create many molecules in a fixed period of time. When simulation takes a longer time (like DFT), it may be beneficial to use more time to design. It is necessary to give a comparison in terms of the number of simulations needed to obtain good molecules. In that case, ML would be better, because it might be creating "high-quality" molecules using more time. Please provide comparisons with this respect.

2) It seems like the author tuned GA parameters such that the molecules are "realistic looking". It would be beneficial to readers if you can elaborate on this aspect. What exactly did you mean by "realistic looking"? Can you quantify somehow?

3) It seems to me that GA crossover parameters are inspired by chemical reactions. Can you claim that GA-based molecules are more synthesizable than deep-learning-based ones?

Referee: 2

Recommendation: Accept

Comments:
The manuscript “Graph-based genetic algorithm and generative model Monte Carlo tree search for the exploration of chemical space” by Jan Jensen is an excellent addition to recent work on using computational methods to generate new molecular compounds for target properties.

I will admit up-front that I am a proponent of GA strategies, so the conclusions were not surprising. I think the work should be published but would like to make some minor suggestions that I think will strongly improve the work.

- On page 2, the number of non-H atoms is described coming “from a distribution with a mean 39.15 and a standard deviation of 3.50” - this is a number without a unit. Based on the article, I think it should read “a mean of 39.15 atoms, with standard deviation of 3.50 atoms”
- The last paragraph on page 3 is perhaps a bit technical for the Chemical Science audience, discussing “leaf parallelization” and “leaf nodes.” I think the whole paragraph needs to be written for a general audience (i.e., not someone implementing a MCTS code) or moved to the supporting information. The code is, after all, open source and available.
- The penalized logP score could be described better. From the text, the penalty for “unrealistically large rings” was not described.
- The J(m) scores in Table 2 could perhaps be a bit expanded. For example, my assumption is that the SA scores and/or ring penalities may be higher in some methods than others. I think it would be useful to add columns for the raw logP, SA, and penalty scores - if not in the text then in the supporting information.
- The results and discussion could benefit from a figure indicating the rate of improvement with generations for the GA methods and/or the GB-GM-MCTS methods. We have, for example, shown that GA methods show high rate of improvement in early generations, but finding beneficial mutations slows over time. This would likely explain why the lower mutation rate shows better performance in this work - and moreover suggest an “early stop” (e.g., are all 50 generations needed for this problem?)
- Similarly, I find it strange that the author didn’t try longer runs or attempt to find an optimal mutation rate for the GA, particularly if the CPU time is so short.
- The caption for Table 3 includes a typo - I believe “BG-GM” should read “GB-GM”
- The conclusions suggest that the GB-GA approach “can traverse a relatively large distance in chemical space” - the author should really use similarity scores (e.g., a Tanimoto coefficient using ECFP fingerprints or similar) to quantify this - again, the discussion could be expanded.

Overall, I think it’s a great addition to the discussion on optimization of molecular structures for properties.

-Geoff Hutchison, University of Pittsburgh

Screening for large energy storage capacity of meta-stable dihydroazulenes Part 3

2019-01-30T14:34:00.000+01:00

This is a follow up to this post, which was a follow up to this post. Briefly, we (that is to say Mads) have computed $\Delta E_{rxn}$ and $\Delta E^\ddagger$ for about 32,500 molecules using xTB and PM3 respectively. We can afford to do a reasonably careful (DFT/TZV) study on at most 50 molecules, so the next question is how to identify the top 50 candidates. In other words to what extent can we trust the conformational search and the xTB and PM3 energies?

In my last post we saw that the xTB and PM3 energies can be trusted well enough and this post we'll see that the conformational search also can be trusted.

Conformational search
The xTB storage energy calculation is based on the lowest energy DHA and VHF structures found by optimisations of $5+5n_{rot}$ geometries generated using RDKit. The plot above shows a comparison of this approach to one where we generated all conformers by systematically rotating each rotateable bond by $\pm 120^\circ$ for a subset of 100 molecules. It is clear that the $5+5n_{rot}$-approach works really well for most molecules (including the 20 with high storage energy) and, if anything, overestimates the storage energy (i.e. at worst we will have som false positives).

From 35,588 to 41 to 6 to ?? candidates

The fact that we can trust the storage energies reasonably well means that we can proceed with the 41 molecules I identified in the previous post. As a first step we optimised the geometries at the DFT level and as expected most of the molecules have barriers that are too low. But 6 of them still look promising, so the next step is to perform a systematic conformer search using xTB (just to be safe) and then re-optimise all structures with energies close to the xTB minimum with DFT. Stay tuned ... with fingers crossed.

This work is licensed under a Creative Commons Attribution 4.0

Screening for large energy storage capacity of meta-stable dihydroazulene Part 2

2019-01-21T12:48:00.001+01:00

This is a follow up to this post. Briefly, we have computed $\Delta E_{rxn}$ and $\Delta E^\ddagger$ for about 32,500 molecules using xTB and PM3 respectively. We can afford to do a reasonably careful (DFT/TZV) study on at most 50 molecules, so the next question is how to identify the top 50 candidates. In other words to what extent can we trust the conformational search and the xTB and PM3 energies?

To try to answer the latter question we (i.e. Mads) randomly chose 20, 60, and 20 molecules with high, medium, and low xTB-$\Delta E_{rxn}$ values and recomputed $\Delta E_{rxn}$ at the M06-2X/6-31G(d) level of theory using the lowest xTB-energy DHA and VHF structures as starting geometries. The results are shown below

The vertical and and horizontal lines denote $\Delta E_{rxn}$ and $\Delta E^\ddagger$, respectively, for an experimentally characterised "reference" compound that we want to improve upon, i.e. we are looking for compounds that have larger $\Delta E_{rxn}$ and $\Delta E^\ddagger$ values. $\Delta E_{rxn}$ should ideally be at least 5 kcal/mol higher than the reference (but in general as large as possible) and $\Delta E^\ddagger$ should be at least 2 kcal/mol higher than the reference, but anything higher than that is not necessarily better. The half life depends very sensitively on the barrier, and is hard to compute accurately, so we have to be careful about excluding molecules with high storage capacity in cases where the barrier is close to the reference.

Using these criteria I would select 5 points for further study indicated by the red points (let's call them Square, Star, Triangle, Dot, and Diamond). Square is clearly the most promising one with significantly higher $\Delta E_{rxn}$ and $\Delta E^\ddagger$ compared to the reference. The remaining 4 points are chosen because they either have reasonably high $\Delta E_{rxn}$ and $\Delta E^\ddagger$-values that are only a few kcal/mol below the reference (Triangle and Star) or reasonably high $\Delta E^\ddagger$ and $\Delta E_{rxn}$ that are somewhat (ca 3 kcal/mol) higher than the reference. I'd call the last 4 points pseudo-positives.

The corresponding plot for SQM storage energies and barrier heights is shown above. The good news is that Square is still the clear winner and I would also have picked Star and Triangle for further investigation. However, there are many more points that I also would picked (false pseudo-positives) and Diamond and Dot would not have been picked (false pseudo-negatives).

The false positives are due to PM3 overestimating the barrier, so let's use B3LYP/6-31+G(d)//xTB barriers instead. Square is still the winner, Star would still be picked (but not Triangle), and the false positives are gone. However, the barriers for Dot and Diamond are now so low that they will not be picked (false pseudo-negatives).

Summary and outlook
SQM finds the only true positive (Square). PM3 tends to over estimate the barrier relative to the reference, which leads to false pseudo-positives. xTB tends to underestimate the storage capacity which leads to a few false pseudo-negatives. DFT/PM3 barriers removes the false pseudo-positives but also leads to some false pseudo-negatives.

The above plot shows the molecules with an xTB storage energy that is at least 4 kcal/mol higher than the reference (4 rather than 5 kcal/mol since xTB tends to underestimate the storage energy). The solid line shows the reference barrier and the dotted line lies 2 kcal/mol above that (since PM3 tends to underestimate the barrier). There are 41 points above that line, including good old Square, that we should investigate further and it's probably also good to look at a few more points with very high storage energy and reasonably high barrier.

But all this assumes that we can trust the conformational search so this has to be tested next.

This work is licensed under a Creative Commons Attribution 4.0

Open access chemistry publishing options in 2019

2019-01-16T11:02:00.000+01:00

Here is an updated list of affordable impact neutral and other select OA publishing options for chemistry

Impact neutral journals
$0 (in 2019) PeerJ chemistry journals. Open peer review. (Disclaimer I am an editor on PeerJ Physical Chemistry)

$425 (normally $850) Results in Chemistry. Closed peer review

$750 ACS Omega (+ ACS membership $166/year). Closed peer review. WARNING: not real OA. You still sign away your copyright to the ACS.

$1000 F1000Research. Open peer review. Bio-related

$1095 PeerJ - Life and Environment. Open peer review. Bio-related. PeerJ also has a membership model, which may be cheaper than the APC.

$1260. Royal Society Open Science. Open peer review.

Free or reasonably priced journals that judge perceived impact

$0 Chemical Science Closed peer review

$0 Beilstein Journal of Organic Chemistry. Closed peer review.

$0 Beilstein Journal of Nanotechnology. Closed peer review.

$0 ACS Central Science. Closed peer review. ($500-1000 for CC-BY, WARNING: not real OA. You still sign away your copyright to the ACS as far as I know)

Screening for large energy storage capacity of meta-stable dihydroazulenes

2019-01-10T14:38:00.003+01:00

Here's a summary of where we are at with Mads project

The Challenge

Dihydroazulenes (DHAs) are promising candidates for storing solar energy as chemical energy, which can be released as thermal energy when needed. The ideal DHA derivative has a large $\Delta H_{rxn}$ and a $\Delta G^{\ddagger}$ that is large enough to give a half life of days to months but low enough so that the energy release can be reduced. Of course any modification should not affect light adsorption adversely. This presents an interesting optimisation challenge!

We asked Mogens Brøndsted Nielsen to come up with a list of substituents and he suggested 40 different substituents and 7 positions, which results in 164 billion different molecules (we chose to interpret the right hand figure more generally). We decided to start by looking at all single and double substitutions, which amounts to 35,588 different molecules. The first question is what level of theory will allow us to screen this many molecules.

The initial screen

At a minimum we need to compute $\Delta E_{rxn}$ which involves at least a rudimentary conformational search for both reactants and products. We used an approach similar to this study, ($5+5n_{rot}$ RDKit generated start geometries) which results in over 1 million SQM geometry optimisations, but used GFN-xTB instead of PM3 because the former is about 10 times faster.

To find the barriers, we did a 12-point scan along the breaking bond in DHA (out to 3.5Å) starting from the lowest energy DHA conformer. The highest energy structure was then used as a starting point for a TS search using Gaussian and "calcall". We used ORCA for the scan and Gaussian for the TS search, and used PM3 because it is implemented in both programs. We also optimised the lowest VHF structure with PM3 to compute the barrier. We verify the TS by checking whether the imaginary normal mode lies along the reacting bond. This worked in 32,623 cases. The whole thing took roughly 5 days using roughly 250 cores.

Note that this approach finds a TS to cis-VHF, which we assume is in thermal equilibrium with the lower energy trans-VHF form. For both barriers and reaction energies we use the electronic energy differences rather than free energies of activation and reaction enthalpies.

The next step

We can afford to do a reasonably careful (DFT/TZV) study on at most 50 molecules, so the next question is how to identify the top 50 candidates. In other words to what extent can we trust the conformational search and the xTB and PM3 energies? I plan to cover this in a future blog post.

A more efficient initial screen
We now have data with which to test more efficient ways of performing the initial screen:

1. (a) We could perform the conformational search using MMFF and only xTB-optimise the lowest energy DHA and VHF structures.
(b) We could only perform the TS search for molecules with large $ \Delta E_{rxn}$ values.
(c) We could perform the bond-scan with xTB rather than ORCA.
(d) We could test whether the bond-scan barrier can be used instead of the TS-based barrier.

2. We could test the use ML-based energy functions such as ANI-1 instead of SQM.

3. We could test whether ML can be trained to predict $\Delta E_{rxn}$ and/or $\Delta E^{\ddagger}$ based on the DHA structure.

We'd be very happy to collaborate on this.

Beyond double substitution

No matter how efficient we make the initial screen, screening all 164 billion molecules is simply not feasible. Instead we'll need to use search algorithms such as genetic algorithms or random forest.

Other ideas/comments/questions on this or anything else related to this blogpost are very welcome.

This work is licensed under a Creative Commons Attribution 4.0

Planned papers for 2019

2019-01-04T10:55:00.001+01:00

A year ago I thought I'd probably publish three papers in 2018:

Accepted
1. Fast and accurate prediction of the regioselectivity of electrophilic aromatic substitution reactions

Probable
2. Random Versus Systematic Errors in Reaction Enthalpies Computed using Semi-empirical and Minimal Basis Set Methods
3. Improving Solvation Energy Predictions using the SMD Solvation Method and Semi-empirical Electronic Structure Methods

Unexpected

4. The Bicyclo[2.2.2]octane Motif: A Class of Saturated Group 14 Quantum Interference Based Single-Molecule Insulators

5. Empirical corrections and pair interaction energies in the fragment molecular orbital method

The end result was five papers. In addition I also published a proposal.

Here's the plan for 2019

Submitted

1. Graph-based Genetic Algorithm and Generative Model/Monte Carlo Tree Search for the Exploration of Chemical Space

Probable

2. Screening for energy storage capacity of meta-stable vinylheptafulvenes

3. Testing algorithms for finding the global minimum of drug-like compounds

4. Towards a barrier height benchmark set for biologically relevant systems - part 2

5. SMILES-based genetic algorithms for chemical space exploration

Maybe

6. Further screening of bicyclo[2.2.2]octane-based molecular insulators

7. Screening for electronic properties using a graph-based genetic algorithm

8. Further screening for energy storage capacity of meta-stable vinylheptafulvenes

I haven't started a draft on any of papers 2-5 so I'm not exactly brimming with confidence that all 4 will make it into print in 2019. However, we have 80-90% of all the data needed to write each paper, and I've blogged a bit about paper 3.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Conformational search for the global minimum

2018-11-25T12:20:00.000+01:00

We've been working on conformational search for a while and are nearing the point were we have enough to write it up. This post is to get my head around the central point of the study.

Motivation
I'm interested in conformational search because I want to compute accurate reaction energies. Therefore I need to find the global energy minimum of both reactants and products (or something very close to them as explained below).

I need to make to make 2 choices: what conformational search method and what level of theory to use.

Establishing a benchmark set
The only way to make reasonably sure you find the global minimum is to do a systematic search with a relatively fine dihedral angle resolution. This can be painful, even with MMFF, but needs only to be done once. So what if it runs for 24 hours?

A dihedral scan doesn't sample ring conformations directly, so you may have to repeat the systematic search for several ring conformations.

The global energy minimum is obviously fully energy minimised so when I say "search" I mean that you are generating starting geometries that are then fully energy minimised. Also, if you are interested in reaction enthalpies, the global minimum is the structure with the lowest enthalpy and similarly for free energies.

Test your new conformational search method
Here's the only test that matters: run your new conformational search method N times and record how many of the N runs find the global minimum, i.e. the success rate. Do that for M different molecules and calculate an average success rate.

Let's say the average success rate is 90%. That tells me that if I use your method once there is 10% chance I that I won't find the global minimum, but if I use the method twice on the same molecule that chance drops to (0.1 x 0.1) 1% (assuming your method is stochastic).

The larger N and M are, the lower the uncertainty in the average success rates: 5 is no good, 10 is borderline, >15-20 is acceptable.

The success rate will depend on the number of rotatable bonds so it's important that the sizes of your M molecules are representative of the molecules you want to study.

Obviously, if your new method finds a lower energy value then there you need to go back and look the way you found your benchmark set.

Depending on the accuracy you can live with, you can loosen the success criteria to having found a structure within, say, 0.5 kcal/mol of the global minimum.

Benchmarking energy functions for conformational search
Most benchmark studies of this kind compare conformational energies and report RMSD or MAE values. The problem with this approach is that large errors for high energy conformers can lead to large RMSD values which are misleading for the purposes of finding global minima.

The only test of an energy function that really matters is whether the "true" global minimum structure is among the predicted structures and, if so, how close in energy it is to the "predicted" global minimum. Here the "true" global minimum is the global minimum predicted by your reference energy function that you trust.

Let's say you want to compare a semiempirical method (SQM) to a DFT method you trust and you can afford to do 100 geometry optimisations with DFT per molecule. The SQM is cheap enough that you can perform either a systematic search or a search using a conformational search method you have found reliable. Now you take the 100 (unique) lowest energy structures and use them as starting points for 100 DFT optimisations. The lowest energy structure found with DFT is your best guess for the "true" global minimum and the question is what is the minimum number of DFT optimisations you need to perform to find the "true" global minimum for each molecule? The lower the number, the better the SQM method for that molecule.

Let's say you do this for 5 molecules and the answer for SQM1 is 3, 1, 4, 10, and 3 while for SQM2 it is 4, 6, 4, 7, and 4. The I would argue that SQM2 is better because you need to perform 7 DFT optimisations to be 100% sure to find the "true" global minimum for all 5 molecules, compared to 10 DFT optimisations for SQM1.

Another metric would be to say that in practice I can only afford, say, 5 DFT optimisations and compute the energy relative to the "true" global minimum, e.g. 0.0, 0.0. 0.0, 0.5, and 0.0 kcal/mol for SQM1 and 0.0, 0.2, 0.0, 0.8, and 0.0 kcal/mol for SQM2. In this case you could argue that SQM1 is better since the maximum error is smaller. The best metric depends on your computational resources and what kind of error you can live with.

A need for a global minimum benchmark sets
Most, if not all, conformer benchmark sets that are currently available are made starting from semirandomly chosen starting geometries, with no guarantee that the true global minimum is among the structures. What is really needed is a diverse set of molecular structures and total energies, obtained using trustworthy methods, that one is reasonably sure correspond to global minima.

As I mentioned above, the only "sure" way is to perform a systematic search but for large molecules this may be practically impossible for energy functions that you trust.

One option is to perform the systematic search using a cheaper method and then re-optimise the P lowest energy structures with the more expensive method. The danger here is that the global minimum on the expensive energy surface is not a minimum on the cheap energy surface or, more precisely, that none of the P starting geometries leads to the global minimum on the expensive surface. One way to test this is to start the stochastic search, which hopefully is so efficient that you can afford to use a trustworthy energy function, from the global minimum candidate you found.

Additionally, it is useful to use two different stochastic conformational search algorithms, such as Monte Carlo and genetic algorithm. If both method locate the same global minimum, then there is a good chance it truly is a global minimum, since it is very unlikely to find the global minimum by random chance.

This work is licensed under a Creative Commons Attribution 4.0

Why I support Plan S

2018-10-04T10:44:00.000+02:00

1. The world spends $10 billion annually on scientific publishing.

2. Most scientific papers are only accessible with subscription, which means they are only accessible to academia and large companies in countries with a large per capita GDP. The papers are not accessible to the tax payers who paid for the research and the university subscriptions, nor to small and medium-sized companies.

3. The price of scientific publishing has been increasing at an unsustainable rate

source

4. The increased price is not due to increased costs on the side of the publishers, but rather to their aggressive negotiation tactics, leading to profit margins unheard of in any other business.

source

5. Pushback during price-negotiations with publishers in the EU has resulted in the publishers denying access to their journals in several countries, as a negotiation tactic. This includes access to past issues. We can't do anything about it because they own the papers. They own the papers because scientists signed away their copyright.

I believe this is an untenable situation. Who's going to do anything about it?

6. Publishers are obviously not going to do anything about it on their own accord. Any company would fight to preserve these profit margins, and they do.

7. Scientific societies are not going to do anything about it. Scientific societies derive the bulk of their income from subscriptions and are every bit as ruthless in their negotiations on subscription price as commercial publishers. For all intends-and-purposes they are publishers first, societies second.

8. By-and-large (one notable exception is PLoS) scientists haven’t done anything about it either. In my experience, scientists are first on foremost focussed on career advancement and competition for research funds and don't think about the (rising) cost of publishing.

9. Now it seems some EU funders are finally trying to do something about it with Plan S. Plan S is designed to bring about change in the current system.

This is why I fully support Plan S

If Twitter is any indication most scientists are not happy with Plan S. From what I can tell, the worries center mostly on not being able to publish in "good" journals and can be classified into two main categories:

None of the "good" journals will change
Unless something changes within the next 2-3 years researchers would not be able to publish work funded by some EU-based funding agencies in most, but not all, "good" journals.

So your colleagues who are not funded by some EU-based funding agencies publish in "good" journals and get all the recognition. As a result you may not get promoted or get new grants.

It's an unlikely scenario but if this does happen remember that your colleagues, who decide on your promotion or that you are competing against for funds, are in exactly the same boat.

Another worry is that people won't want to collaborate with you because you can't publish together in "good" journals. In my experience, this is not how scientific collaborations work, but if you do happen to meet such a potential collaborator my advice would be to avoid them at all cost.

All of the "good" journals will change
If all the "good" journals change to Gold OA in response to Plan S, then people without funding who can't pay the APC won't be able to publish in "good" journals.

This is also an unlikely scenario, but if it does happen society would spend considerably less on subscription fees that could be used to pay APCs. Notice that Plan S calls for a APC-cap, meaning that Plan S-friendly journal should be affordable. Remember that the current APCs are designed to maintain a very large profit margin for the publishers, so there is plenty of "fat" to trim. Finally, APCs are tied to volume. If the number of submissions increase the cost of each individual paper decreases.

The most likely scenario in my opinion
Plan S will be (sadly) softened a little. Some "good" journals will change to comply with the final version of Plan S, some won't, and some new journals will spring up. Some researchers won't be able to publish some of their work in some of their favorit "good" journals and will have to find a new favorit for some of their papers. Your colleagues are in the same boat, so it won't have much effect on either career advancements nor funding success rate. The world of scientific publishing may become a little bit less ridiculous but not thanks to us scientists.

I view Plan S as a signal to the publishers for the next round of negotiations. We pay the bills and this is how we would like it. It's not an unreasonable position at all. Something does not need to change. Publisher will fight this tooth and nail and their main argument will be that scientists say it will be bad for science. Sadly, many scientists are saying just that. However, the main worry of the publishers is their profit margin and the main worry of the scientists is, I believe, their careers. The only thing they have in common is a fear of change.

This work is licensed under a Creative Commons Attribution 4.0

Reviews of Solvation Energy Predictions Using The SMD Solvation Method and Semiempirical Electronic Structure Methods

2018-09-11T15:42:00.000+02:00

Really late posting this. The paper is already out at JCP. Here are the reviews for the record.

Reviewer #1 Evaluations:
Recommendation: Revision
New Potential Energy Surface: No

Reviewer #1 (Comments to the Author):

In this contribution, the authors report a set of systematic analyses of semi-empirical (NDDO and DFTB) methods combined with continuum solvation models (COSMO and SMD) for the description of solvation free energies of well-documented benchmark cases (the MNSOL dataset). They found that the performance of NDDO and DFTB continuum solvation models can be substantially improved when the atomic radii are optimized, and that the results are most sensitive to the radii of HCNO. Another interesting observation is that the optimized radii have a considerable degree of transferability to other solvents.

Since an efficient computation of solvation free energies and related quantities (e.g., pKa values) is valuable in many chemical and biological applications, the results of this study are of considerable interest to the computational chemistry community.

In the current form, the ms can benefit from further discussion of several points:

1. The authors chose to optimize the atomic radii based entirely on element type (e.g., HCNOS). In the literature, many solvation models either further consider atom types (e.g., UAKS) or atomic charge distribution (based on either atomic point charges or charge density); in many cases, a higher degree of accuracy appears to be obtained. It would be useful to further clarify the principle behind the current optimization and the expected level of accuracy; for example, to what degree should we expect the same set of radii to work well for both neutral and ionic (especially anionic) species?
2. Although it is well known - it is useful to explicitly point out that the experimental values for neutral and charged species have different magnitudes of errors.
3. It would be informative to further dissect/discuss the physical origins for the errors of NDDO/DFTB continuum solvation models. For example, are the larger errors (as compared to, for example, HF based calculations) due primarily to the less reliable description of the solute charge density (e.g., multipole moments) or solute polarizability? Discussion along this line might be relevant to the transferability of the optimized model to non-aqueous solvents.

4. Cases with very large errors deserve further analysis/discussion - for example, some neutral solutes apparently have very large errors at HF, NDDO and DFTB levels - as much as 20 or even 30 kcal/mol! What are these molecules? Are the same set of molecules problematic for all methods? What is the physical origin for these large errors?

Reviewer #2 Evaluations:
Recommendation: Revision
New Potential Energy Surface: No

Reviewer #2 (Comments to the Author):

In this paper, the authors make the case for efficient solvation models in
fast electronic structure methods (currently heavily utilized for high-throughput
screening approaches). They extend an implementation of PM6 in the Gamess
programm to account for d orbitals. The SMD and COSMO continuum models in combination with
various semi-empirical NDDO and also DFT tight-binding approaches is considered.
Their analysis clearly highlights deficiencies of the semi-empiricial approaches
compared to HF/DFT. The authors then proceed to propose a remedy (changing the
radii for H, C, O, N, and S). Although this change was driven by data on aqueous
solvation energies, the authors find that other polar solvents (DMSO, CH3CN, CH3OH)
are also improved, which is a sign of transferability of this simple fix.
The prediction of pKa data, as an important application field, concludes the
results section. The paper is clearly written, however, it raises questions that
should be addressed in a revision:
1) Table 2 shows very (too) small Coulomb radii for H and on page 6 this is commented on.
The authors note that for radii smaller than 0.6 A the proton moved into the solvent.
However, no further analysis if provided. I assume that this is due to an increased outlying
charge and this outlying charge shoud be quantified. Apparently, some error compensation
is in operation. This also relates to the statement 'error for the ions is considerably larger
than for neutral molecules' on page 5. Error compensation also raises a concern about
transferability that the authors must address.
2) The authors should also review their list of references (I assume that the first author's
surname in Ref. 1 is misspelled, the abbreviations of journals are not in JCP style, Ref. 34
appears to lack a journal title, Ref. 16 lacks author names ...).
3) Moreover, figures 1-3 lack a label for the y-axis, figure 4 lacks units
on the y-axis.
4) Few typos need also be removed (see, e.g., "mainly only" -> "mainly on"
on page 2).

This work is licensed under a Creative Commons Attribution 4.0