GSoC Report: From RDKit to the Universe and back 🚀

With the end of the summer comes the end of my awesome Google Summer of Code adventure with the MDAnalysis team. It’s also the occasion for me to report on the code I’ve implemented, the things I’ve learned and the challenges I’ve faced.

Summary of the project

The goal of my project, From RDKit to the Universe and back, was to provide interoperability between MDAnalysis and RDKit. i.e. to be able to:

With this in mind, the project was easily cut down in 2 main deliverables (the RDKit Reader/Parser and the Converter) and a few smaller ones for each “wrapper” functionality.

The motivation behind the interoperability project was to be able to benefit from all the features that are available in RDKit and MDAnalysis with as little hassle as possible.
For example, before this project if you wanted to compute molecular descriptors for a ligand in an MD trajectory, you would have to write a separate PDB file for each frame of the trajectory and then read each file through RDKit. With the work that I’ve done, you can now convert your MD trajectory to an RDKit molecule and compute descriptors from there, or you can directly use the descriptor wrapper on an MDAnalysis AtomGroup (see below).
It was also the occasion for me to increase my visibility in the community by working on an open-source software development project, and to learn how to write better code. After my PhD I’d like to develop software for computational chemistry and this Google-sponsored event will hopefully help me in that regard.

Contributions

Merged PRs

In progress

Left to do

Demo

Full circle

Here are all the possible conversions between RDKit and MDAnalysis:

import MDAnalysis as mda
from rdkit import Chem

# new feature
u1 = mda.Universe.from_smiles("CCO")
# new feature
mol1 = u1.atoms.convert_to("RDKIT")
# new feature
u2 = mda.Universe(mol1)
# before this project
u2.atoms.write("mol.pdb")
mol2 = Chem.MolFromPDBFile("mol.pdb")

Atom selections

There are two new selections available in MDAnalysis: aromatic for aromatic atoms, and smarts for the selection of atoms based on SMARTS queries. Let’s try them on this molecule: Depiction of a molecule through RDKit

>>> u = mda.Universe.from_smiles("Nc1cc(C[[email protected]]([O-])C=O)c[nH]1")
>>> u
<Universe with 20 atoms>
>>> u.select_atoms("aromatic")
<AtomGroup with 5 atoms>
# same as above
>>> u.select_atoms("smarts a")
<AtomGroup with 5 atoms>
# 4 aromatic carbon atoms
>>> u.select_atoms("smarts c").indices
array([1, 2, 3, 9])
# carbon atoms in a ring (not necessarily aromatic)
>>> u.select_atoms("smarts [#6;R]").indices
array([1, 2, 3, 9])
# 1 aromatic nitrogen
>>> u.select_atoms("smarts n").indices
array([10])
# not hydrogen and not in a ring but connected to a ring 
>>> u.select_atoms("smarts [$([!R][R])] and not type H").indices
array([0, 4])

Descriptors calculation

Soon, you will be able to compute descriptors directly from an AtomGroup, by either passing the name of the descriptor in RDKit, or by passing your own function that takes an RDKit molecule as argument. Here’s an example of what the current version looks like:

>>> from MDAnalysis.analysis.RDKit import RDKitDescriptors
>>> u = mda.Universe.from_smiles("CCO", numConfs=3)
>>> def num_atoms(mol):
...    return mol.GetNumAtoms()
>>> desc = RDKitDescriptors(u.atoms, "MolWt", "RadiusOfGyration",
...                         num_atoms).run()
>>> desc.results
array([[46.06900000000002, 1.161278342193013, 9],
       [46.06900000000002, 1.175492972121405, 9],
       [46.06900000000002, 1.173230936577319, 9]],
      dtype=object)

Fingerprint calculation

You will also be able to obtain fingerprints:

>>> from MDAnalysis.analysis.RDKit import get_fingerprint
>>> fp = get_fingerprint(u.atoms, "AtomPair", hashed=True, nBits=1024)
>>> fp.GetNonzeroElements()
{36: 1,
 106: 1,
 297: 1,
 569: 3,
 619: 1,
 624: 8,
 634: 2,
 699: 1,
 745: 5,
 819: 4,
 938: 6,
 945: 3}

Drawing with RDKit

Finally, you will be able to display small AtomGroups as images in notebooks:

>>> from MDAnalysis.visualization.RDKit import RDKitDrawer
>>> from nglview.datafiles import PDB, XTC
>>> u = mda.Universe(PDB, XTC)
>>> elements = mda.topology.guessers.guess_types(u.atoms.names)
>>> u.add_TopologyAttr('elements', elements)
>>> u.atoms
<AtomGroup with 5547 atoms>
>>> ag = u.select_atoms("resname LRT")
>>> ag

Depiction of a molecule through RDKit

You can also export any AtomGroup to a PNG or SVG, and even to a GIF for trajectories.

Conclusion

This project taught me a lot of things on software development and the Google Summer of Code experience has been incredible and valuable to me. All of this wouldn’t have been possible without the help of many people, including my amazing mentors (@IAlibay, @fiona-naughton and @richardjgowers), but also the rest of the MDAnalysis team as a lot of them got involved and gave me great feedback, so thank you to all of them!

@cbouy