Sunday, January 15, 2017

Making your computational protocol available to the non-expert

I recently read this paper by Jonathan Goodman and co-workers which I learned about through this highlight by Steven Bachrach.  The DP4 method is a protocol for computing chemical shifts of organic molecules using DFT and comparing the chemical shifts to experimental values.  This paper automates the method, switches to free software packages (NWCHEM instead of Gaussian and TINKER instead of Macromodel), and tests the applicability for drug like molecules.  The python and Java code is made available on Github under the MIT license.

I like everything about this paper and what follows is not a criticism of this paper.

The method is clearly aimed at organic chemists who use NMR to figure out what they made or isolated. Let's say they want to try DP4 to see how well it works on some molecule they are currently working on.

What's needed to get started
1. Access to multicore Linux computer.  The method requires quite many B3LYP/6-31G(d,p) NMR calculations and given the typical size of organic molecules it will probably not be practically possible to even test this method on a desktop computer.  Even if it is, the instructions for PyDP4 assumes you are using Linux to you'd have to somehow deal with that if you, like many, have a Windows machine.

2. Installation. You have to install NWCHEM, Tinker, OpenBabel and configure PyDP4.

3. Coordinates. PyDP4 requires an sdf file as input.  You have to figure out what that is and how to make one.

4. Familiarity with Linux.  All this assumes that you are familiar with Linux. How many synthetic organic chemists are?

If you'll be using DP4 a lot, all of this may be worth doing but perhaps not just to try it?  If you don't have access to a Linux cluster, buying one for the occasional NMR calculation may be hard to justify. If one is convinced/determined enough, the most likely solution would probably be to find and pay an undergrad to do all this using an older computer you were gonna throw out anyway.  Or maybe your department has a shared cluster and a sysadmin who could handle the installation.

Alternative 1: Web server
One alternative is to make DP4 available as a web server, where the user can upload the sdf file and other data.  If one includes a GUI all 4 problems are solved ... for the user.  The problem for the developer is that this could eat up a lot of your own computational resources. One could probably do something smart to only use idle cycles, but the best case scenario (lots of users) also becomes the worst case scenario.  Perhaps there's a way to crowdsource this?

Alternative 2: VM Virtual box
Another alternative is to make DP4 available as a virtual machine (VM).


This mostly solves the installation issue. The main problem here is that the user needs still needs to find a reasonably powerful computer to run this on. The other problem is that the developer needs to test the VM-installation on various operating systems and keep up to date as new ones appear. Perhaps there's a way to crowdsource all this?

Alternative 3: Amazon Web Services or Google Compute Engine
Another alternative is to make DP4 available as a VM image for AWS or GCE.  This mostly solves the CPU and installation issue. The user creates an AWS or GCE account and imports the VM image and then pays Amazon and Google for computer time using a credit card. For reasonably sized molecules the cost would probably be less than $10/molecule as far as I can tell.

I don't have any direct experience with AWS or GCE so I don't know how slick the interface can be made. All examples I have seen have involved ssh to the AWS/GCE console, so some Linux knowledge is required.

Alternative 4: AWS/GCE-based Web server
Another alternative is to combine 2 and 3. The problem here is how to bill the individual user for their CPU-usage. There is probably ways to to this but it's starting to sound like a lot of work to set up and manage. Perhaps by adding a surcharge one could pay someone to handle this on a part-time basis.  Perhaps existing companies would be interesting in offering such a service?

Licensing issues
As far as I can tell the licenses of NWCHEM, TINKER, and OpenBabel allow for all 4 alternatives.  

The bigger issue
A key step in making a computational chemistry-based methods such as DP4 usable to the non-expert is clearly automation and careful testing.  Another is using free software (I have access to Gaussian but I am not going to buy Macromodel just to try out DP4!). Kudos to Goodman and co-workers for doing this. But if we want to target the non-experts, I think we should try to go a bit further. One could even imagine something like this in the impact/dissemination section of a proposal:
The computational methodology is based on free software packages and the code needed for automatisation and analysis, that is written as part of the proposed work, will be made available on Github under an open source license.  Furthermore, Company X will make the approach available on the AWS cloud computing platform, which will allow the non-expert to use the approach without installation or investment in in-house computational resources and greatly increase usage. Company X handles the set-up, billing for on-demand CPU-time, usage-statistics, and provides a rudimentary GUI for the approach for a one-time fee of $2000, which is included in the budget.
Anyway, just some thoughts.  Have I missed other ways of getting a relatively CPU-intensive computational chemistry method in the hands of non-experts?


This work is licensed under a Creative Commons Attribution 4.0

No comments: