Blog

A shiny, new and faster topology system

With MDAnalysis 0.16.0 on the horizon, we wanted to showcase a major development. In fall 2015, we (@richardjgowers and @dotsdl) set to work on redesigning the topology system from scratch. This system determines how atom, residue, and segment information is internally represented and exposed to everything in the API (Universe, AtomGroup, etc.), and the old scheme had issues with data duplication, maintaining consistency between atom and residue attributes, and performance for large systems. We hoped to resolve all of these issues with our new design.

The starting point of this work was (the now infamous) issue 363, which floated the idea of holding all atom, residue, and segment attributes in arrays instead of lists of Atom, Residue, and Segment objects. This approach turned the way topology data such as atom names, resids, masses, etc. are stored in a Universe on its head, going from an array of structs (list of Atom objects with individual attributes) to a struct of arrays (an array for each attribute, one entry per Atom).

Now, over a year later, the finishing touches on this work are being prepared for release. This post is meant to serve as a brief look at what has changed internally, what has changed externally, and what benefits this gives us going forward.

Invisible changes to make working with MDAnalysis faster

Most of the changes are (or should be) invisible to the user. But they made some of the most fundamental operations in MDAnalysis quite a bit faster. Although this section is mostly of interest to developers, it is useful for all users to know the operations that MDAnalysis can now do much faster than before (and why).

In the new system, each atom is a member of exactly one residue, and each residue is a member of exactly one segment. The new Topology object keeps an array giving the residue membership of each atom, and likewise an array giving segment membership of each residue. Getting the resname of the residue of a group of atoms, then, is achieved by taking the indices of these atoms to fancy-index the Atoms->Residues array, and then using the result of this to fancy-index the Resnames array. For example, if the Topology has 5 atoms and 3 residues, with membership (Atoms->Residues) and Resnames arrays as below:

       Atoms->Residues           Resnames
 index ---------------     index --------
     0 0                       0 GLU
     1 2                       1 LYS
     2 1                       2 ALA
     3 1
     4 2

calling AtomGroup.resnames for an AtomGroup with atoms [2, 0, 1, 2] will yield (pseudocode):

"Atoms->Residues"[[2, 0, 1, 2]] --> [1, 0, 2, 1]
"Resnames"[[1, 0, 2, 1]]        --> ['LYS', 'GLU', 'ALA', 'LYS']

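The same two-step lookup can be written directly in NumPy. This is only a minimal sketch of the idea, not the actual MDAnalysis internals; the variable names (atom_to_residue, resnames, atom_indices) are made up for illustration.

```python
import numpy as np

# Membership and attribute arrays from the example above
atom_to_residue = np.array([0, 2, 1, 1, 2])   # one entry per atom
resnames = np.array(["GLU", "LYS", "ALA"])    # one entry per residue

# Indices held by a group of atoms; repeated indices are allowed
atom_indices = np.array([2, 0, 1, 2])

# Step 1: fancy-index the membership array with the atom indices
residue_indices = atom_to_residue[atom_indices]

# Step 2: fancy-index the attribute array with the residue indices
group_resnames = resnames[residue_indices]

print(residue_indices.tolist())  # [1, 0, 2, 1]
print(group_resnames.tolist())   # ['LYS', 'GLU', 'ALA', 'LYS']
```

Both steps are single vectorized array operations, so the cost does not depend on looping over per-atom Python objects.
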
This scheme only works if each atom is a member of one and only one residue, and likewise if residues are members of one and only one segment. Furthermore, AtomGroups, ResidueGroups, and SegmentGroups are very thin, storing only the indices of their members as a numpy array. This gives a number of advantages:

  1. Performance. We get up to an 8x speedup over the old scheme when accessing attributes. Setting attributes can give up to a 40x speedup.
  2. Memory. We don’t store, for example, a resname for each atom, but instead store attributes at the level they make sense for.
  3. Consistency. Since attributes are stored in one place, we avoid cases where the topology is in an inconsistent state, e.g. two atoms in the same residue give a different resname.
  4. No staleness. Because e.g. ResidueGroups are only an array of indices, not a list of Residue objects generated upon creation of the group, changes of residue-level properties by another ResidueGroup are always reflected consistently by every other one. Data is not duplicated anywhere in this scheme, and is all contained in the Topology object.
  5. Serialization. Topologies become serializable and changes to topologies can be easily saved and communicated around. This is an important step towards implementing parallel algorithms in MDAnalysis.
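
To make the "thin group" and "no staleness" points concrete, here is a toy model. ToyTopology and ToyResidueGroup are hypothetical stand-ins written for this post, not the real MDAnalysis classes; the sketch assumes only that groups hold indices and a reference to one shared set of attribute arrays.

```python
import numpy as np

class ToyTopology:
    """Hypothetical stand-in for the Topology object: one array per attribute."""
    def __init__(self, atom_to_residue, resnames):
        self.atom_to_residue = np.asarray(atom_to_residue)
        # dtype=object so that longer names can be assigned later
        self.resnames = np.asarray(resnames, dtype=object)

class ToyResidueGroup:
    """Thin group: stores only residue indices and a reference to the topology."""
    def __init__(self, indices, topology):
        self.ix = np.asarray(indices)
        self.topology = topology

    @property
    def resnames(self):
        # Read straight from the shared array; nothing is cached in the group
        return self.topology.resnames[self.ix]

    @resnames.setter
    def resnames(self, values):
        # Write straight into the shared array
        self.topology.resnames[self.ix] = values

top = ToyTopology([0, 2, 1, 1, 2], ["GLU", "LYS", "ALA"])
first = ToyResidueGroup([0, 1], top)
second = ToyResidueGroup([1, 2], top)

# A change made through one group is visible through every other group
first.resnames = ["ASP", "ARG"]
print(second.resnames.tolist())  # ['ARG', 'ALA']
```

Because neither group keeps its own copy of the names, there is nothing that can go stale.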

For further performance comparisons, check out this notebook.

External changes that may affect how you use MDAnalysis

Previously, every object except Atom subclassed AtomGroup. This meant that calling .positions on any of them would give you the positions of the Atoms contained within that group.

Previous class structure:

Atom

AtomGroup  -> Residue
           -> ResidueGroup -> Segment
                           -> SegmentGroup

New class structure:

Group    -> AtomGroup
         -> ResidueGroup
         -> SegmentGroup

Atom
Residue
Segment

Now each object only contains information pertaining to that particular object. A Residue object only yields information about the residue; to get to the atoms, use Residue.atoms. Similarly, to get the atoms from a Segment or a SegmentGroup use Segment.atoms or SegmentGroup.atoms. As before, you can get all residues associated with a group with Group.residues (which returns a ResidueGroup) and all segments with Group.segments (a SegmentGroup). Bottom line: you should now always be explicit about what you want.

Why this was changed

Previously, because everything inherited from AtomGroup, it was unclear which level of the topology a given method or attribute operated on. For example, does ResidueGroup.charges give the charges of the residues or of the atoms? It was also unclear what size a given output would have (see issue 411).

How to work with the new system

To access atom-level information from anything that isn’t an AtomGroup, use the .atoms level accessor. For example, change all .positions calls on anything that isn’t an AtomGroup to .atoms.positions.
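
The pattern can be illustrated with a small toy model. The classes below are hypothetical and only mimic the new separation of levels; they are not the MDAnalysis implementation, and the ordering of atoms returned by the toy .atoms accessor is an assumption of this sketch.

```python
import numpy as np

class ToyAtomGroup:
    """Hypothetical atom-level group: the only level that exposes positions."""
    def __init__(self, indices, positions):
        self.ix = np.asarray(indices)
        self._positions = positions  # shared (n_atoms, 3) coordinate array

    @property
    def positions(self):
        return self._positions[self.ix]

class ToyResidueGroup:
    """Hypothetical residue-level group: has no positions attribute of its own."""
    def __init__(self, indices, atom_to_residue, positions):
        self.ix = np.asarray(indices)
        self._atom_to_residue = np.asarray(atom_to_residue)
        self._positions = positions

    @property
    def atoms(self):
        # Select all atoms whose residue is a member of this group
        mask = np.isin(self._atom_to_residue, self.ix)
        return ToyAtomGroup(np.flatnonzero(mask), self._positions)

coords = np.arange(15, dtype=float).reshape(5, 3)
residues = ToyResidueGroup([1], [0, 2, 1, 1, 2], coords)

print(hasattr(residues, "positions"))   # False: be explicit, go through .atoms
print(residues.atoms.ix.tolist())       # [2, 3]
print(residues.atoms.positions.shape)   # (2, 3)
```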

Going forward: what does this mean for MDAnalysis as a project?

A major benefit of the new topology system is that information about the topology of a Universe is now completely encapsulated in the Topology object. This not only makes development and maintenance easier, but also opens the door to some exciting new possibilities as simulation systems grow larger. A single Topology object can now be cleanly shared by multiple Universe instances, each with their own trajectory reader(s). This could make common operations such as fitting a trajectory to a reference structure or doing parallel analysis of many trajectories more efficient for large systems. The Topology object can also be serialized more easily. This should enable parallelization on workers without shared memory (using libraries such as distributed) out-of-the-box.
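
As a rough illustration of why a struct of arrays helps here, the sketch below shares one toy topology between two toy "universes" and sends it through pickle, as one might when shipping it to a worker process. The dictionaries are hypothetical placeholders; this says nothing about how Universe or Topology are actually serialized in MDAnalysis.

```python
import pickle
import numpy as np

# Hypothetical struct-of-arrays topology: plain arrays keyed by attribute name
topology = {
    "atom_to_residue": np.array([0, 2, 1, 1, 2]),
    "resnames": np.array(["GLU", "LYS", "ALA"]),
}

# Two hypothetical "universes" share one topology but keep their own coordinates
universe_a = {"topology": topology, "positions": np.zeros((5, 3))}
universe_b = {"topology": topology, "positions": np.ones((5, 3))}
shared = universe_a["topology"] is universe_b["topology"]

# Round trip through bytes, as if sending the topology to another process
payload = pickle.dumps(topology)
restored = pickle.loads(payload)
same = all(np.array_equal(topology[key], restored[key]) for key in topology)

print(shared, same)  # True True
```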

Making these things work is an ongoing effort, but the MDAnalysis coredevs are working to take advantage of all these possibilities. We look forward to the benefits this brings not only to the project, but also to all our users going forward. We hope you like what we’ve done here.

@dotsdl and @richardjgowers

MDAnalysis

Original draft of the new topology system by @richardjgowers and @dotsdl, November 2015 at ASU.

Google Summer of Code 2017


MDAnalysis has been accepted as a sub-org of the NumFOCUS foundation for Google Summer of Code 2017. If you are interested in working with us this summer as a student, read the advice and links below and write to us on the mailing list.

We are looking forward to all applications from interested students (undergraduates and graduates).

The application window deadline is April 3, 2017 at 12:00 (MST). As part of the application process you must familiarize yourself with Google Summer of Code 2017. Apply as soon as possible.

Project Ideas

We have listed several possible projects for you to work on in our wiki. Each project is rated with a difficulty and lists the possible mentors for it.

Alternatively, if you have another idea about a project please write to us on the developer list and we can discuss it there.

Information for Students

You must meet our own requirements if you want to be a student with MDAnalysis this year (read all the docs behind these links!). You must also meet the eligibility criteria.

As a start to get familiar with MDAnalysis and open source development you should follow these steps:

Complete the Tutorial

We have a tutorial explaining the basics of MDAnalysis. You should go through the tutorial at least once to understand how MDAnalysis is used.

Introduce yourself to us

Introduce yourself on the mailing list. Tell us what you plan to work on during the summer or what you have already done with MDAnalysis.

Close an issue of MDAnalysis

You must have at least one commit in the development branch of MDAnalysis in order to be eligible, i.e., you must demonstrate that you have been seriously engaged with the MDAnalysis project.

We have a list of easy bugs to work on in our issue tracker on GitHub. We also appreciate it if you write more tests or update/improve our documentation. To start developing for MDAnalysis, have a look at our guide for developers and write to us on the mailing list if you have more questions about setting up a development environment.

@kain88-de

MDAnalysis is a NumFOCUS affiliated project


We are glad to announce that, since February 2017, MDAnalysis is officially a NumFOCUS affiliated project. With this affiliation, MDAnalysis establishes itself as part of the wider scientific Python ecosystem, and we hope it will open up new opportunities in the future.

NumFOCUS is a 501(c)(3) nonprofit that supports and promotes world-class, innovative, open source scientific computing, reproducible research, and education in data science.

@MDAnalysis/coredevs