Background

We want to perform Path Similarity Analysis (PSA) [Seyler2015] on a relatively small data set of 400 molecular dynamics trajectories. PSA will group similar trajectories with each other and enables a quantitative comparison. 200 trajectories were produced with the DIMS MD method [Perilla2011] [Beckstein2009], which is based on molecular dynamics, and 200 were produced with FRODA [Farrell2010] , which is based on a geometric rigidity decomposition. We want to understand if these methods produce similar or different transition pathways, and, perhaps how they differ.

The PSA distance metric

PSA requires the calculation of the distances between all paths \(P_i\), the distance matrix

\[D_{ij} = \delta(P_i, P_j)\]

The MDAnalysis.analysis.psa module in MDAnalysis contains implementations for various distance functions \(\delta(P, Q)\). Here we are going to use the discrete Fréchet distance as implemented in MDAnalysis.analysis.psa.discrete_frechet(). Another good choice is the Hausdorff distance (MDAnalysis.analysis.psa.hausdorff()), and it is straightforward to alter the example code in this tutorial (namely, mdanalysis_psa_partial.py) to explore the use of a different distance function.

Computational cost

Because the path metric \(\delta(P, Q)\) is symmetric, the distance metric is also symmetric. Furthermore, \(\delta(P, P) = 0\). Therefore, we need to perform in total

\[M = \frac{N(N-1)}{2}\]

calculations. For \(N=400\), this means 79,800 individual trajectory comparisons.

Approach

Individual path comparisons are relatively compute intensive (they scale with \(\mathcal{O}(n_i n_j)\), the product of the number of time frames in each trajectory. But they are all independent and hence a simple map-reduce strategy can be easily implemented whereby individual jobs compute sub-matrices of the distance matrix (map) and the results are then combined into the complete solution (reduce).

Block distance matrix

We choose a block size \(w\) and decompose the distance matrix \(\mathsf{D} = [D_{ij}],\ 0 \leq i, j < N\) into block matrices \(\mathsf{D}^{\alpha}\) with \(0 \leq \alpha < N/w\) and

\[D^{\alpha}_{u(i),v(j)} = D_{i,j} \quad \text{with}\ u = i - \alpha w,\ v = j - \alpha w\]

(with the appropriate adjustments for blocks at the edges that contain less than \(w\) entries).

MDAnalysis

In MDAnalysis each trajectory is loaded into a Universe, using a topology file (which contains static information about the atoms, bonds etc) and a trajectory file (which contains the changing coordinates for each time step):

import MDAnalysis as mda
universe = mda.Universe(topology, trajectory)

Because we need to access all trajectory data for our analysis, we can increase performance by first loading the whole trajectory into memory 1 with the transfer_to_memory() method 2:

universe.transfer_to_memory()

We will restrict the calculation of the path distances to a subset of atoms, namely the “C-alpha” atoms that are distributed along the protein backbone and report on the larger conformational changes. The C-alpha atoms can be selected as a AtomGroup with the atom selection language:

ca = u.select_atoms("name CA")

In order to calculate a pairwise distance between two trajectories we extract the coordinates of all CA atoms for all time steps into a numpy array:

P = u.trajectory.timeseries(asel=ca, format="fac")

With the coordinates for a second trajectory Q we can then calculate the discrete Fréchet distance using an recursive dynamic programming algorithm [EiterManilla1994], implemented in MDAnalysis.analysis.psa.discrete_frechet():

from MDAnalysis.analysis import psa
dF = psa.discrete_frechet(P, Q)

dF contains the discrete Fréchet distance \(\delta_F(P, Q)\), a real non-negative number. Because we base the Fréchet distance on the root mean square distance (RMSD) between the CA coordinates for two frames in \(P\) and \(Q\) as its point-wise metric (see, for instance, [Seyler2015] for more details), \(\delta_F(P, Q)\) has the interpretation of the RMSD between the two frames in the two trajectories that best characterize the difference between the two trajectories (they form a Fréchet pair).

Note

For macromolecular systems, we typically remove all translational and rotational degrees of freedoms for all trajectories by superimposing all trajectory frames on a single reference structure [Seyler2015]. The superposition can be carried out in a pre-processing step using, for instance, MDAnalysis.analysis.align.AlignTraj or as part of PSA with MDAnalysis.analysis.psa.PSAnalysis. The trajectories for this tutorial were already superimposed appropriately (on the “CORE” domain of AdK, as described in more detail in [Seyler2015].)

Calculating a full Fréchet distance matrix \(D_{ij} = \delta(P_i, P_j)\) just requires more book-keeping in order to perform the above steps for the Cartesian product of all trajectories \(P_i \times P_j\).

Radical.pilot

We use radical.pilot to generate one compute unit for each block matrix computation. The pilot job distributes the individual compute units, which includes staging of input trajectories and retrieval of the output file (the block matrix), as well as running the MDAnalysis script that performs the calculation of the block matrix on a compute node.

Footnotes

1

Instead of using transfer_to_memory() one could also simply set the in_memory=True keyword argument of Universe as shown for in-memory representation of arbitrary trajectories. However, here we keep the two steps separate for conceptual clarity.

2

Loading trajectory data into memory makes use of the new MemoryReader functionality in MDAnalysis.coordinates.memory; this functionality has been available since the 0.16.0 release.