ENCORE ensemble similarity

We are happy to announce that the ENCORE ensemble similarity library has been integrated into the next version of MDAnalysis as MDAnalysis.analysis.encore.

ENCORE implements a variety of techniques for calculating similarity measures between structural ensembles in the form of trajectories, as described in:

Tiberti M, Papaleo E, Bengtsen T, Boomsma W, Lindorff-Larsen K (2015), ENCORE: Software for Quantitative Ensemble Comparison. PLoS Comput Biol 11(10): e1004415. doi:10.1371/journal.pcbi.1004415 .

The similarity measures share the same fundamental principle: estimate the probability density of the conformational states of a protein from the available ensemble data, then compare those densities using a measure of distance between probability distributions, such as the Jensen-Shannon divergence. ENCORE implements three similarity measures: HES (Harmonic Ensemble Similarity), CES (Clustering Ensemble Similarity) and DRES (Dimensionality Reduction Ensemble Similarity). In HES, the structures of each ensemble are treated as samples from a multivariate normal distribution whose parameters are estimated from the available data. CES partitions the conformational space of all the ensembles into clusters and uses the relative occurrence of each ensemble in the clusters to estimate the probability density. DRES projects the conformations onto a lower-dimensional space and estimates the probability density of each ensemble there with a kernel density estimate.
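To make the comparison step concrete, here is a minimal sketch of the Jensen-Shannon divergence between two discrete distributions, such as the cluster populations that a CES-style analysis would produce. It uses only the Python standard library; the function names and the example populations are hypothetical and are not part of the ENCORE API.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical populations of two ensembles over three clusters.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

js_same = js_divergence(p, p)                        # identical distributions
js_partial = js_divergence(p, q)                     # partial overlap
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # no overlap at all

print(js_same, js_partial, js_disjoint)
```

With natural logarithms the divergence is bounded between 0 (identical distributions) and ln 2 (no overlap), which makes values from different pairs of ensembles directly comparable.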

The ENCORE package implements the similarity measures themselves together with a number of other algorithms and features, also available standalone. ENCORE implements:

Details on implementation, use cases and expected performance can be found in the ENCORE paper (doi:10.1371/journal.pcbi.1004415).

The HES method is the fastest and least general of the three: its accuracy depends on how well the probability distribution underlying each ensemble can be modeled as a single multivariate normal, which is not guaranteed for simulation trajectories. CES and DRES do not rely on this assumption. However, both require the calculation of a full RMSD matrix for all the ensembles to be compared, followed by clustering or dimensionality reduction, respectively, of the conformational space, and thus have higher requirements in terms of computation time and memory.
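The harmonic idea can be illustrated in one dimension. The sketch below fits a normal distribution to each of two sets of samples and compares the fits with a symmetrized Kullback-Leibler divergence. Both the one-dimensional simplification and the choice of this particular divergence are assumptions made for illustration; the function names are hypothetical and this is not the ENCORE implementation.

```python
import statistics


def fit_normal(samples):
    """Estimate the mean and (sample) variance of a 1D normal distribution."""
    return statistics.mean(samples), statistics.variance(samples)


def symmetrized_kl_normal(fit_a, fit_b):
    """KL(A || B) + KL(B || A) for two 1D normals given as (mean, variance).

    The log-variance terms of the two KL divergences cancel in the sum.
    """
    (mu_a, var_a), (mu_b, var_b) = fit_a, fit_b
    d2 = (mu_a - mu_b) ** 2
    return (var_a + d2) / (2 * var_b) + (var_b + d2) / (2 * var_a) - 1.0


# Hypothetical 1D "ensembles": same spread, different centers.
ensemble_a = [1.0, 2.0, 3.0, 4.0, 5.0]
ensemble_b = [11.0, 12.0, 13.0, 14.0, 15.0]

fit_a = fit_normal(ensemble_a)
fit_b = fit_normal(ensemble_b)

self_distance = symmetrized_kl_normal(fit_a, fit_a)
shifted_distance = symmetrized_kl_normal(fit_a, fit_b)

print(self_distance, shifted_distance)
```

A divergence of this kind is zero for identical fits and has no upper bound: it grows with the squared distance between the means relative to the variances. If the samples are multimodal, a single normal fit describes them poorly, which is the limitation discussed above.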

Using the similarity measures is simply a matter of loading the trajectories or experimental ensembles that one would like to compare as MDAnalysis.Universe objects:

>>> from MDAnalysis import Universe
>>> import MDAnalysis.analysis.encore as encore
>>> from MDAnalysis.tests.datafiles import PSF, DCD, DCD2
>>> u1 = Universe(PSF, DCD)
>>> u2 = Universe(PSF, DCD2)

and running a similarity measure on them, for instance the Harmonic Ensemble Similarity measure (encore.hes()):

>>> hes_similarities, details = encore.hes([u1, u2])
>>> print(hes_similarities)
[[        0.         38279683.9587939]
 [ 38279683.9587939         0.       ]]

The similarities are returned as a square symmetric matrix with the same dimensions and ordering as the input list; each element is the similarity value for one pair of input ensembles. The other available measures are CES (encore.ces()) and DRES (encore.dres()). The details variable contains extra information about the calculation that was performed: with HES, the parameters of the estimated probability distributions; with CES, the output of the clustering; with DRES, the embedded space.
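As a small illustration of how such a matrix can be read, the sketch below pairs each off-diagonal value with the ensembles it refers to. The values are copied from the output above into a plain nested list so that the snippet runs without MDAnalysis; the labels and variable names are hypothetical.

```python
from itertools import combinations

# Labels follow the order of the input list [u1, u2].
labels = ["u1", "u2"]

# Values copied from the HES output shown above.
hes_similarities = [
    [0.0, 38279683.9587939],
    [38279683.9587939, 0.0],
]

n = len(labels)

# The matrix is symmetric, so the upper triangle holds every unique pair.
is_symmetric = all(
    hes_similarities[i][j] == hes_similarities[j][i]
    for i in range(n)
    for j in range(n)
)

pairwise = {
    (labels[i], labels[j]): hes_similarities[i][j]
    for i, j in combinations(range(n), 2)
}

for pair, value in pairwise.items():
    print(pair, value)
```

The same loop works unchanged for any number of ensembles, yielding n(n-1)/2 unique pairs.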

The clustering and dimensionality reduction functionality is also directly available through the cluster and reduce_dimensionality functions.

For instance, to cluster the conformations from the two universes defined above, we can write:

>>> cluster_collection = encore.cluster([u1,u2])
>>> print(cluster_collection)
0 (size:5,centroid:1): array([ 0,  1,  2,  3, 98])
1 (size:5,centroid:6): array([4, 5, 6, 7, 8])
2 (size:7,centroid:12): array([ 9, 10, 11, 12, 13, 14, 15])

Here each cluster element is a conformation belonging to an ensemble; the cluster_collection object keeps track, for each element, both of the standard cluster membership information and of the ensemble it belongs to, making it possible to evaluate how the different trajectories are represented in each cluster.
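To show what "how the different trajectories are represented in each cluster" can look like in practice, here is a minimal sketch that works on two plain parallel lists, one with the cluster of each conformation and one with the ensemble it came from. These lists and all variable names are hypothetical stand-ins; the sketch does not use the cluster_collection object or its API.

```python
from collections import Counter

# Hypothetical data: one entry per conformation.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2]   # cluster each conformation was assigned to
ensemble_ids = [1, 1, 2, 1, 2, 2, 2, 2]  # ensemble each conformation came from

# Count conformations per (cluster, ensemble) pair and per ensemble.
counts = Counter(zip(cluster_ids, ensemble_ids))
ensemble_sizes = Counter(ensemble_ids)

clusters = sorted(set(cluster_ids))
ensembles = sorted(set(ensemble_ids))

# Fraction of each ensemble that falls in each cluster.
occupancy = {
    e: [counts[(c, e)] / ensemble_sizes[e] for c in clusters]
    for e in ensembles
}

for e in ensembles:
    print("ensemble", e, occupancy[e])
```

Each row of occupancy is a discrete probability distribution over the clusters, which is the kind of quantity a clustering-based similarity measure can then compare between ensembles.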

By default, ENCORE uses our implementation of the Affinity Propagation algorithm. Users can switch to other clustering algorithms available in scikit-learn, which are automatically loaded into ENCORE if scikit-learn is installed.

For instance:

>>> cluster_collection =

In the same way, it is possible to use a dimensionality reduction algorithm other than the default, Stochastic Proximity Embedding:

>>> coordinates, details =

Similar options in encore.ces() and encore.dres() make it easy to change the algorithm that will be used by the methods on the fly.

For further details, see the documentation of the individual functions within ENCORE:

@mtiberti & @kain88-de