During a Twitter discussion Noel O'Boyle introduced me to Graph Edit Distance (GDE) as a useful measure of molecular similarity. The advantages over other approaches such as Tanimoto similarity is discussed in these slides by Roger Sayle.
It turns out Networkx can compute this, so it's relatively easy to interface with RDKit and the implementation is shown below.
Unfortunately, the time required for computing GDE increases exponentially with molecule size, so this implementation is not really of practical use.
Sayle's slides discusses one solution to this, but it's far from trivial to implement. If you know of other open source implementations, please let me know.
Update: GitHub page
This work is licensed under a Creative Commons Attribution 4.0
It turns out Networkx can compute this, so it's relatively easy to interface with RDKit and the implementation is shown below.
Unfortunately, the time required for computing GDE increases exponentially with molecule size, so this implementation is not really of practical use.
Sayle's slides discusses one solution to this, but it's far from trivial to implement. If you know of other open source implementations, please let me know.
Update: GitHub page
This work is licensed under a Creative Commons Attribution 4.0
3 comments:
Really nice, however, can you (or perhaps Noel and Roger) explain how the distance from benzene to cyclohexane is 3? I might have expected 6, or 12, but 3 is a mystery - my ignorance I suspect.
Here's my snippet of code:
def calc_GED(s1,n, s2, m):
mol1 = Chem.MolFromSmiles(s1)
mol2 = Chem.MolFromSmiles(s2)
G1 = get_graph(mol1)
G2 = get_graph(mol2)
GDE = nx.graph_edit_distance(G1, G2, edge_match=lambda a, b: a['weight'] == b['weight'])
print(n, m, GDE)
mol_dict = {
'benzene':'c1ccccc1',
'pyridine':'n1ccccc1',
'pyridazine':'n1ncccc1',
'cyclohexane':'C1CCCCC1'
}
for n,s1 in mol_dict.items():
for m,s2 in mol_dict.items():
if n != m:
calc_GED(s1,n, s2, m)
The implementation makes two arbitrary choices:
1. The structures are "kekulized" meaning that benzene is C1=CC=CC=C1 rather, meaning that 3 double bonds need to be changed to single bonds. I do this so that the distance between hexatriene (C=CC=CC=C) and benzene is 1. But whether that's a good idea is arguable.
2. The Hs are implicit, so when a double bond is changed to a single bond, then the number of Hs on the Cs adjust, and don't contribute to the edit distance.
I should also note that there is a bug in the implementation. The way I compare atoms (nodes) is by defining a "bond to itself". This means that additions of atoms is double counted (both as an edge and a node". The the code predicts the GED between C and CC to be 3, when in fact it is 2.
Hi Jan,
I appreciate your candour - to help move things forward here's a unittest file, these are what I think GEDs should be. My view is a unit of GED change should 1 per addition/removal of each atom and 1 per addition or deletion of a bond.
Ed
import unittest
from GED import *
class MyTestCase(unittest.TestCase):
def run_tests(self, d):
for n, p in d.items():
with self.subTest(p=p):
GED = calc_GED(p[0], p[1])
self.assertEqual(GED,p[2], msg='GED not correct for ' + n)
def test_aromatic_aromatic(self):
mol_dict = {
'benzene->pyridine': ('c1ccccc1', 'n1ccccc1',1),
'pyridine->pyridazine': ('n1ccccc1', 'n1ncccc1',1),
'benzene->pyridazine': ('c1ccccc1', 'n1ncccc1', 2)
}
self.run_tests(mol_dict)
def test_aromatic_substituents(self):
mol_dict = {
'benzene->toluene': ('c1ccccc1', 'c1ccccc1C',2),
'benzene->chlorobenzene': ('c1ccccc1', 'c1ccccc1Cl', 2),
'benzene->anisole':('c1ccccc1', 'c1ccccc1OC', 3)
}
self.run_tests(mol_dict)
def test_aliphatic_aromatic(self):
mol_dict = {
'benzene->cyclohexane': ('c1ccccc1', 'C1CCCCC1',6)
}
self.run_tests(mol_dict)
def test_aliphatic_ring_expansion(self):
mol_dict = {
'cyclopropane->cyclobutane': ('C1CC1', 'C1CCC1',4)
}
self.run_tests(mol_dict)
def test_aliphatic_aliphatic(self):
mol_dict = {
'ethane->propane': ('CC', 'CCC',2)
}
self.run_tests(mol_dict)
if __name__ == '__main__':
unittest.main()
Post a Comment