4.2.6.1.1. Ensemble Similarity Calculations — MDAnalysis.analysis.encore.similarity
¶
Author:  Matteo Tiberti, Wouter Boomsma, Tone Bengtsen 

New in version 0.16.0.
The module contains implementations of similarity measures between protein ensembles described in [LindorffLarsen2009]. The implementation and examples are described in [Tiberti2015].
The module includes facilities for handling ensembles and trajectories through
the Universe
class, performing clustering or dimensionality reduction
of the ensemble space, estimating multivariate probability distributions from
the input data, and more. ENCORE can be used to compare experimental and
simulationderived ensembles, as well as estimate the convergence of
trajectories from timedependent simulations.
ENCORE includes three different methods for calculations of similarity measures between ensembles implemented in individual functions:
 Harmonic Ensemble Similarity :
hes()
 Clustering Ensemble Similarity :
ces()
 Dimensional Reduction Ensemble Similarity :
dres()
as well as two methods to evaluate the convergence of trajectories:
 Clustering based convergence evaluation :
ces_convergence()
 Dimensionalityreduction based convergence evaluation :
dres_convergence()
When using this module in published work please cite [Tiberti2015].
References
[LindorffLarsen2009]  Similarity Measures for Protein Ensembles. LindorffLarsen, K. FerkinghoffBorg, J. PLoS ONE 2008, 4, e4203. 
[Tiberti2015]  (1, 2, 3, 4, 5, 6, 7, 8, 9) ENCORE: Software for Quantitative Ensemble Comparison. Matteo Tiberti, Elena Papaleo, Tone Bengtsen, Wouter Boomsma, Kresten LindorffLarsen. PLoS Comput Biol. 2015, 11, e1004415. 
Examples
The examples show how to use ENCORE to calculate a similarity measurement of two simple ensembles. The ensembles are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples first execute:
>>> from MDAnalysis import Universe
>>> import MDAnalysis.analysis.encore as encore
>>> from MDAnalysis.tests.datafiles import PSF, DCD, DCD2
To calculate the Harmonic Ensemble Similarity (hes()
)
two ensemble objects are first created and then used for calculation:
>>> ens1 = Universe(PSF, DCD)
>>> ens2 = Universe(PSF, DCD2)
>>> print encore.hes([ens1, ens2])
(array([[ 0. , 38279683.95892926],
[ 38279683.95892926, 0. ]]), None)
Here None is returned in the array as the default details parameter is False. HES can assume any nonnegative value, i.e. no upper bound exists and the measurement can therefore be used as an absolute scale.
The calculation of the Clustering Ensemble Similarity (ces()
)
is computationally more expensive. It is based on clustering algorithms that in
turn require a similarity matrix between the frames the ensembles are made
of. The similarity matrix is derived from a distance matrix (By default a RMSD
matrix; a full RMSD matrix between each pairs of elements needs to be computed).
The RMSD matrix is automatically calculated.
>>> ens1 = Universe(PSF, DCD)
>>> ens2 = Universe(PSF, DCD2)
>>> CES, details = encore.ces([ens1, ens2])
>>> print CES
[[ 0. 0.68070702]
[ 0.68070702 0. ]]
However, we may want to reuse the RMSD matrix in other calculations e.g. running CES with different parameters or running DRES. In this case we first compute the RMSD matrix alone:
>>> rmsd_matrix = encore.get_distance_matrix(
encore.utils.merge_universes([ens1, ens2]),
save_matrix="rmsd.npz")
In the above example the RMSD matrix was also saved in rmsd.npz on disk, and so can be loaded and reused at later times, instead of being recomputed:
>>> rmsd_matrix = encore.get_distance_matrix(
encore.utils.merge_universes([ens1, ens2]),
load_matrix="rmsd.npz")
For instance, the rmsd_matrix object can be reused as input for the
Dimensional Reduction Ensemble Similarity (dres()
) method.
DRES is based on the estimation of the probability density in
a dimensionallyreduced conformational space of the ensembles, obtained from
the original space using either the Stochastic Proximity Embedding algorithm or
the Principal Component Analysis.
As the algorithms require the distance matrix calculated on the original space,
we can reuse the previouslycalculated RMSD matrix.
In the following example the dimensions are reduced to 3 using the
saved RMSD matrix and the default SPE dimensional reduction method. :
>>> DRES,details = encore.dres([ens1, ens2],
distance_matrix = rmsd_matrix)
>>> print DRES
[[ 0. , 0.67453198]
[ 0.67453198, 0. ]]
In addition to the quantitative similarity estimate, the dimensional reduction
can easily be visualized, see the Example
section in
MDAnalysis.analysis.encore.dimensionality_reduction.reduce_dimensionality
.
Due to the stochastic nature of SPE, two identical ensembles will not
necessarily result in an exactly 0 estimate of the similarity, but will be very
close. For the same reason, calculating the similarity with the dres()
twice will not result in necessarily identical values but rather two very close
values.
It should be noted that both in ces()
and dres()
the similarity is
evaluated using the JensenShannon divergence resulting in an upper bound of
ln(2), which indicates no similarity between the ensembles and a lower bound
of 0.0 signifying two identical ensembles. In contrast, the hes()
function uses
a symmetrized version of the KullbackLeibler divergence, which is unbounded.
4.2.6.1.1.1. Functions for ensemble comparisons¶

MDAnalysis.analysis.encore.similarity.
hes
(ensembles, selection='name CA', cov_estimator='shrinkage', weights='mass', align=False, details=False, estimate_error=False, bootstrapping_samples=100, calc_diagonal=False)[source]¶ Calculates the Harmonic Ensemble Similarity (HES) between ensembles using the symmetrized version of KullbackLeibler divergence as described in [Tiberti2015].
Parameters:  ensembles (list) – List of Universe objects for similarity measurements.
 selection (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”
 cov_estimator (str, optional) – Covariance matrix estimator method, either shrinkage, shrinkage, or Maximum Likelyhood, ml. Default is shrinkage.
 weights (str/array_like, optional) – specify optional weights. If
mass
then chose masses of ensemble atoms  align (bool, optional) – Whether to align the ensembles before calculating their similarity. Note: this changes the ensembles inplace, and will thus leave your ensembles in an altered state. (default is False)
 details (bool, optional) – Save the mean and covariance matrix for each ensemble in a numpy array (default is False).
 estimate_error (bool, optional) – Whether to perform error estimation (default is False).
 bootstrapping_samples (int, optional) – Number of times the similarity matrix will be bootstrapped (default is 100), only if estimate_error is True.
 calc_diagonal (bool, optional) – Whether to calculate the diagonal of the similarity scores (i.e. the similarities of every ensemble against itself). If this is False (default), 0.0 will be used instead.
Returns: Harmonic similarity measurements between each pair of ensembles.
Return type: numpy.array (bidimensional)
Notes
The method assumes that each ensemble is derived from a multivariate normal distribution. The mean and covariance matrix are, thus, estimatated from the distribution of each ensemble and used for comparision by the symmetrized version of KullbackLeibler divergence defined as:
\[D_{KL}(P(x)  Q(x)) = \int_{\infty}^{\infty}P(x_i) ln(P(x_i)/Q(x_i)) = \langle{}ln(P(x))\rangle{}_P  \langle{}ln(Q(x))\rangle{}_P\]where the \(\langle{}.\rangle{}_P\) denotes an expectation calculated under the distribution P.
For each ensemble, the mean conformation is estimated as the average over the ensemble, and the covariance matrix is calculated by default using a shrinkage estimation method (or by a maximumlikelihood method, optionally).
Note that the symmetrized version of the KullbackLeibler divergence has no upper bound (unlike the JensenShannon divergence used by for instance CES and DRES).
When using this similarity measure, consider whether you want to align the ensembles first (see example below).
Example
To calculate the Harmonic Ensemble similarity, two ensembles are created as Universe objects from a topology file and two trajectories. The topology and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples see the module Examples for how to import the files:
>>> ens1 = Universe(PSF, DCD) >>> ens2 = Universe(PSF, DCD2) >>> HES, details = encore.hes([ens1, ens2]) >>> print HES [[ 0. 38279683.95892926] [ 38279683.95892926 0. ]]
You can use the align=True option to align the ensembles first. This will align everything to the current timestep in the first ensemble. Note that this changes the ens1 and ens2 objects:
>>> print encore.hes([ens1, ens2], align=True)[0] [[ 0. 6880.34140106] [ 6880.34140106 0. ]]
Alternatively, for greater flexibility in how the alignment should be done you can call use an AlignTraj object manually:
>>> from MDAnalysis.analysis import align >>> align.AlignTraj(ens1, ens1, select="name CA", in_memory=True).run() >>> align.AlignTraj(ens2, ens1, select="name CA", in_memory=True).run() >>> print encore.hes([ens1, ens2])[0] [[ 0. 7032.19607004] [ 7032.19607004 0. ]]

MDAnalysis.analysis.encore.similarity.
ces
(ensembles, selection='name CA', clustering_method=<MDAnalysis.analysis.encore.clustering.ClusteringMethod.AffinityPropagationNative object>, distance_matrix=None, estimate_error=False, bootstrapping_samples=10, ncores=1, calc_diagonal=False, allow_collapsed_result=True)[source]¶ Calculates the Clustering Ensemble Similarity (CES) between ensembles using the JensenShannon divergence as described in [Tiberti2015].
Parameters:  ensembles (list) – List of ensemble objects for similarity measurements
 selection (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”
 clustering_method – A single or a list of instances of the
MDAnalysis.analysis.encore.clustering.ClusteringMethod
classes from the clustering module. Different parameters for the same clustering method can be explored by adding different instances of the same clustering class. Clustering methods options are the Affinity Propagation (default), the DBSCAN and the KMeans. The latter two methods need the sklearn python module installed.  distance_matrix (encore.utils.TriangularMatrix) – Distance matrix clustering methods. If this parameter is not supplied the matrix will be calculated on the fly.
 estimate_error (bool, optional) – Whether to perform error estimation (default is False). Only bootstrapping mode is supported.
 bootstrapping_samples (int, optional) – number of samples to be used for estimating error.
 ncores (int, optional) – Maximum number of cores to be used (default is 1).
 calc_diagonal (bool, optional) – Whether to calculate the diagonal of the similarity scores (i.e. the similarities of every ensemble against itself). If this is False (default), 0.0 will be used instead.
 allow_collapsed_result (bool, optional) – Whether a return value of a list of one value should be collapsed into just the value.
Returns: ces, details – ces contains the similarity values, arranged in a numpy.array. If only one clustering_method is provided the output will be a 2dimensional square symmetrical numpy.array. The order of the matrix elements depends on the order of the input ensembles: for instance, if
ensemble = [ens1, ens2, ens3]
the matrix elements [0,2] and [2,0] will both contain the similarity value between ensembles ens1 and ens3. Elaborating on the previous example, if n ensembles are given and m clustering_methods are provided the output will be a list of m arrays ordered by the input sequence of methods, each with a n*x*n symmetrical similarity matrix.
details contains information on the clustering: the individual size of each cluster, the centroids and the frames associated with each cluster.
Return type: numpy.array, numpy.array
Notes
In the JensenShannon divergence the upper bound of ln(2) signifies no similarity between the two ensembles, the lower bound, 0.0, signifies identical ensembles.
To calculate the CES, the affinity propagation method (or others, if specified) is used to partition the whole space of conformations. The population of each ensemble in each cluster is then taken as a probability density function. Different probability density functions from each ensemble are finally compared using the JensenShannon divergence measure.
Examples
To calculate the Clustering Ensemble similarity, two ensembles are created as Universe object using a topology file and two trajectories. The topology and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples see the module Examples for how to import the files. Here the simplest case of just two instances of
Universe
is illustrated:>>> ens1 = Universe(PSF, DCD) >>> ens2 = Universe(PSF, DCD2) >>> CES,details = encore.ces([ens1,ens2]) >>> print CES [[ 0. 0.68070702] [ 0.68070702 0. ]]
To use a different clustering method, set the parameter clustering_method (Note that the sklearn module must be installed). Likewise, different parameters for the same clustering method can be explored by adding different instances of the same clustering class:
>>> CES, details = encore.ces([ens1,ens2], clustering_method = [encore.DBSCAN(eps=0.45), encore.DBSCAN(eps=0.50)]) >>> print "eps=0.45: ", CES[0] eps=0.45: [[ 0. 0.20447236] [ 0.20447236 0. ]] >>> print "eps=0.5: ", CES[1] eps=0.5: [[ 0. 0.25331629] [ 0.25331629 0. ]]"

MDAnalysis.analysis.encore.similarity.
dres
(ensembles, selection='name CA', dimensionality_reduction_method=<MDAnalysis.analysis.encore.dimensionality_reduction.DimensionalityReductionMethod.StochasticProximityEmbeddingNative object>, distance_matrix=None, nsamples=1000, estimate_error=False, bootstrapping_samples=100, ncores=1, calc_diagonal=False, allow_collapsed_result=True)[source]¶ Calculates the Dimensional Reduction Ensemble Similarity (DRES) between ensembles using the JensenShannon divergence as described in [Tiberti2015].
Parameters:  ensembles (list) – List of ensemble objects for similarity measurements
 selection (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”
 dimensionality_reduction_method – A single or a list of instances of the DimensionalityReductionMethod classes from the dimensionality_reduction module. Different parameters for the same method can be explored by adding different instances of the same dimensionality reduction class. Provided methods are the Stochastic Proximity Embedding (default) and the Principal Component Analysis.
 distance_matrix (encore.utils.TriangularMatrix) – conformational distance matrix, It will be calculated on the fly from the ensemble data if it is not provided.
 nsamples (int, optional) – Number of samples to be drawn from the ensembles (default is 1000). This is used to resample the density estimates and calculate the JensenShannon divergence between ensembles.
 estimate_error (bool, optional) – Whether to perform error estimation (default is False)
 bootstrapping_samples (int, optional) – number of samples to be used for estimating error.
 ncores (int, optional) – Maximum number of cores to be used (default is 1).
 calc_diagonal (bool, optional) – Whether to calculate the diagonal of the similarity scores (i.e. the simlarities of every ensemble against itself). If this is False (default), 0.0 will be used instead.
 allow_collapsed_result (bool, optional) – Whether a return value of a list of one value should be collapsed into just the value.
Returns: dres, details – dres contains the similarity values, arranged in numpy.array. If one number of dimensions is provided as an integer, the output will be a 2dimensional square symmetrical numpy.array. The order of the matrix elements depends on the order of the input ensemble: for instance, if
ensemble = [ens1, ens2, ens3]
then the matrix elements [0,2] and [2,0] will both contain the similarity value between ensembles ens1 and ens3. Elaborating on the previous example, if n ensembles are given and m methods are provided the output will be a list of m arrays ordered by the input sequence of methods, each with a n*x*n symmetrical similarity matrix.
details provide an array of the reduced_coordinates.
Return type: numpy.array, numpy.array
Notes
To calculate the similarity, the method first projects the ensembles into lower dimensions by using the Stochastic Proximity Embedding (or others) algorithm. A gaussian kernelbased density estimation method is then used to estimate the probability density for each ensemble which is then used to compute the JensenShannon divergence between each pair of ensembles.
In the JensenShannon divergence the upper bound of ln(2) signifies no similarity between the two ensembles, the lower bound, 0.0, signifies identical ensembles. However, due to the stochastic nature of the dimensional reduction in
dres()
, two identical ensembles will not necessarily result in an exact 0.0 estimate of the similarity but will be very close. For the same reason, calculating the similarity with thedres()
twice will not result in two identical numbers; small differences have to be expected.Examples
To calculate the Dimensional Reduction Ensemble similarity, two ensembles are created as Universe objects from a topology file and two trajectories. The topology and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples see the module Examples for how to import the files. Here the simplest case of comparing just two instances of
Universe
is illustrated:>>> ens1 = Universe(PSF,DCD) >>> ens2 = Universe(PSF,DCD2) >>> DRES, details = encore.dres([ens1,ens2]) >>> print DRES [[ 0. 0.67996043] [ 0.67996043 0. ]]
In addition to the quantitative similarity estimate, the dimensional reduction can easily be visualized, see the
Example
section inMDAnalysis.analysis.encore.dimensionality_reduction.reduce_dimensionality`
To use a different dimensional reduction methods, simply set the parameter dimensionality_reduction_method. Likewise, different parameters for the same clustering method can be explored by adding different instances of the same method class:
>>> DRES, details = encore.dres([ens1,ens2], dimensionality_reduction_method = encore.PrincipalComponentAnalysis(dimension=2)) >>> print DRES [[ 0. 0.69314718] [ 0.69314718 0. ]]
4.2.6.1.1.2. Function reference¶

MDAnalysis.analysis.encore.similarity.
ces
(ensembles, selection='name CA', clustering_method=<MDAnalysis.analysis.encore.clustering.ClusteringMethod.AffinityPropagationNative object>, distance_matrix=None, estimate_error=False, bootstrapping_samples=10, ncores=1, calc_diagonal=False, allow_collapsed_result=True)[source] Calculates the Clustering Ensemble Similarity (CES) between ensembles using the JensenShannon divergence as described in [Tiberti2015].
Parameters:  ensembles (list) – List of ensemble objects for similarity measurements
 selection (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”
 clustering_method – A single or a list of instances of the
MDAnalysis.analysis.encore.clustering.ClusteringMethod
classes from the clustering module. Different parameters for the same clustering method can be explored by adding different instances of the same clustering class. Clustering methods options are the Affinity Propagation (default), the DBSCAN and the KMeans. The latter two methods need the sklearn python module installed.  distance_matrix (encore.utils.TriangularMatrix) – Distance matrix clustering methods. If this parameter is not supplied the matrix will be calculated on the fly.
 estimate_error (bool, optional) – Whether to perform error estimation (default is False). Only bootstrapping mode is supported.
 bootstrapping_samples (int, optional) – number of samples to be used for estimating error.
 ncores (int, optional) – Maximum number of cores to be used (default is 1).
 calc_diagonal (bool, optional) – Whether to calculate the diagonal of the similarity scores (i.e. the similarities of every ensemble against itself). If this is False (default), 0.0 will be used instead.
 allow_collapsed_result (bool, optional) – Whether a return value of a list of one value should be collapsed into just the value.
Returns: ces, details – ces contains the similarity values, arranged in a numpy.array. If only one clustering_method is provided the output will be a 2dimensional square symmetrical numpy.array. The order of the matrix elements depends on the order of the input ensembles: for instance, if
ensemble = [ens1, ens2, ens3]
the matrix elements [0,2] and [2,0] will both contain the similarity value between ensembles ens1 and ens3. Elaborating on the previous example, if n ensembles are given and m clustering_methods are provided the output will be a list of m arrays ordered by the input sequence of methods, each with a n*x*n symmetrical similarity matrix.
details contains information on the clustering: the individual size of each cluster, the centroids and the frames associated with each cluster.
Return type: numpy.array, numpy.array
Notes
In the JensenShannon divergence the upper bound of ln(2) signifies no similarity between the two ensembles, the lower bound, 0.0, signifies identical ensembles.
To calculate the CES, the affinity propagation method (or others, if specified) is used to partition the whole space of conformations. The population of each ensemble in each cluster is then taken as a probability density function. Different probability density functions from each ensemble are finally compared using the JensenShannon divergence measure.
Examples
To calculate the Clustering Ensemble similarity, two ensembles are created as Universe object using a topology file and two trajectories. The topology and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples see the module Examples for how to import the files. Here the simplest case of just two instances of
Universe
is illustrated:>>> ens1 = Universe(PSF, DCD) >>> ens2 = Universe(PSF, DCD2) >>> CES,details = encore.ces([ens1,ens2]) >>> print CES [[ 0. 0.68070702] [ 0.68070702 0. ]]
To use a different clustering method, set the parameter clustering_method (Note that the sklearn module must be installed). Likewise, different parameters for the same clustering method can be explored by adding different instances of the same clustering class:
>>> CES, details = encore.ces([ens1,ens2], clustering_method = [encore.DBSCAN(eps=0.45), encore.DBSCAN(eps=0.50)]) >>> print "eps=0.45: ", CES[0] eps=0.45: [[ 0. 0.20447236] [ 0.20447236 0. ]] >>> print "eps=0.5: ", CES[1] eps=0.5: [[ 0. 0.25331629] [ 0.25331629 0. ]]"

MDAnalysis.analysis.encore.similarity.
ces_convergence
(original_ensemble, window_size, selection='name CA', clustering_method=<MDAnalysis.analysis.encore.clustering.ClusteringMethod.AffinityPropagationNative object>, ncores=1)[source]¶ Use the CES to evaluate the convergence of the ensemble/trajectory. CES will be calculated between the whole trajectory contained in an ensemble and windows of such trajectory of increasing sizes, so that the similarity values should gradually drop to zero. The rate at which the value reach zero will be indicative of how much the trajectory keeps on resampling the same regions of the conformational space, and therefore of convergence.
Parameters:  original_ensemble (
Universe
object) – ensemble containing the trajectory whose convergence has to estimated  window_size (int) – Size of window to be used, in number of frames
 selection (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”
 clustering_method (MDAnalysis.analysis.encore.clustering.ClusteringMethod) – A single or a list of instances of the ClusteringMethod classes from the clustering module. Different parameters for the same clustering method can be explored by adding different instances of the same clustering class.
 ncores (int, optional) – Maximum number of cores to be used (default is 1).
Returns: out – array of shape (number_of_frames / window_size, preference_values).
Return type: np.array
Example
To calculate the convergence of a trajectory using the clustering ensemble similarity method a Universe object is created from a topology file and the trajectory. The topology and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples see the module Examples for how to import the files. Here the simplest case of evaluating the convergence is illustrated by splitting the trajectory into a window_size of 10 frames :
>>> ens1 = Universe(PSF,DCD) >>> ces_conv = encore.ces_convergence(ens1, 10) >>> print ces_conv [[ 0.48194205] [ 0.40284672] [ 0.31699026] [ 0.25220447] [ 0.19829817] [ 0.14642725] [ 0.09911411] [ 0.05667391] [ 0. ]]
 original_ensemble (

MDAnalysis.analysis.encore.similarity.
clustering_ensemble_similarity
(cc, ens1, ens1_id, ens2, ens2_id, selection='name CA')[source]¶  Clustering ensemble similarity: calculate the probability densities from
 the clusters and calculate discrete JensenShannon divergence.
Parameters:  cc (encore.clustering.ClustersCollection) – Collection from cluster calculated by a clustering algorithm (e.g. Affinity propagation)
 ens1 (
Universe
) – First ensemble to be used in comparison  ens1_id (int) – First ensemble id as detailed in the ClustersCollection metadata
 ens2 (
Universe
) – Second ensemble to be used in comparison  ens2_id (int) – Second ensemble id as detailed in the ClustersCollection metadata
 selection (str) – Atom selection string in the MDAnalysis format. Default is “name CA”.
Returns: djs – JensenShannon divergence between the two ensembles, as calculated by the clustering ensemble similarity method
Return type:

MDAnalysis.analysis.encore.similarity.
cumulative_clustering_ensemble_similarity
(cc, ens1_id, ens2_id, ens1_id_min=1, ens2_id_min=1)[source]¶ Calculate clustering ensemble similarity between joined ensembles. This means that, after clustering has been performed, some ensembles are merged and the dJS is calculated between the probability distributions of the two clusters groups. In particular, the two ensemble groups are defined by their ensembles id: one of the two joined ensembles will comprise all the ensembles with id [ens1_id_min, ens1_id], and the other ensembles will comprise all the ensembles with id [ens2_id_min, ens2_id].
Parameters: Returns: djs – JensenShannon divergence between the two ensembles, as calculated by the clustering ensemble similarity method
Return type:

MDAnalysis.analysis.encore.similarity.
cumulative_gen_kde_pdfs
(embedded_space, ensemble_assignment, nensembles, nsamples, ens_id_min=1, ens_id_max=None)[source]¶ Generate Kernel Density Estimates (KDE) from embedded spaces and elaborate the coordinates for later use. However, consider more than one ensemble as the space on which the KDE will be generated. In particular, will use ensembles with ID [ens_id_min, ens_id_max].
Parameters:  embedded_space (numpy.array) – Array containing the coordinates of the embedded space
 ensemble_assignment (numpy.array) – array containing one int per ensemble conformation. These allow to distinguish, in the complete embedded space, which conformations belong to each ensemble. For instance if ensemble_assignment is [1,1,1,1,2,2], it means that the first four conformations belong to ensemble 1 and the last two to ensemble 2
 nensembles (int) – Number of ensembles
 nsamples (int) – Samples to be drawn from the ensembles. Will be required in a later stage in order to calculate dJS.
 ens_id_min (int) – Minimum ID of the ensemble to be considered; see description
 ens_id_max (int) – Maximum ID of the ensemble to be considered; see description. If None, it will be set to the maximum possible value given the number of ensembles.
Returns:  kdes (scipy.stats.gaussian_kde) – KDEs calculated from ensembles
 resamples (list of numpy.array) – For each KDE, draw samples according to the probability distribution of the kde mixture model
 embedded_ensembles (list of numpy.array) – List of numpy.array containing, each one, the elements of the embedded space belonging to a certain ensemble

MDAnalysis.analysis.encore.similarity.
dimred_ensemble_similarity
(kde1, resamples1, kde2, resamples2, ln_P1_exp_P1=None, ln_P2_exp_P2=None, ln_P1P2_exp_P1=None, ln_P1P2_exp_P2=None)[source]¶ Calculate the JensenShannon divergence according the the Dimensionality reduction method. In this case, we have continuous probability densities, this we need to integrate over the measurable space. The aim is to first calculate the KullbackLiebler divergence, which is defined as:
\[D_{KL}(P(x)  Q(x)) = \int_{\infty}^{\infty}P(x_i) ln(P(x_i)/Q(x_i)) = \langle{}ln(P(x))\rangle{}_P  \langle{}ln(Q(x))\rangle{}_P\]where the \(\langle{}.\rangle{}_P\) denotes an expectation calculated under the distribution P. We can, thus, just estimate the expectation values of the components to get an estimate of dKL. Since the JensenShannon distance is actually more complex, we need to estimate four expectation values:
\[ \begin{align}\begin{aligned}\langle{}log(P(x))\rangle{}_P\\\langle{}log(Q(x))\rangle{}_Q\\\langle{}log(0.5*(P(x)+Q(x)))\rangle{}_P\\\langle{}log(0.5*(P(x)+Q(x)))\rangle{}_Q\end{aligned}\end{align} \]Parameters:  kde1 (scipy.stats.gaussian_kde) – Kernel density estimation for ensemble 1
 resamples1 (numpy.array) – Samples drawn according do kde1. Will be used as samples to calculate the expected values according to ‘P’ as detailed before.
 kde2 (scipy.stats.gaussian_kde) – Kernel density estimation for ensemble 2
 resamples2 (numpy.array) – Samples drawn according do kde2. Will be used as sample to calculate the expected values according to ‘Q’ as detailed before.
 ln_P1_exp_P1 (float or None) – Use this value for \(\langle{}log(P(x))\rangle{}_P\); if None, calculate it instead
 ln_P2_exp_P2 (float or None) – Use this value for \(\langle{}log(Q(x))\rangle{}_Q\); if None, calculate it instead
 ln_P1P2_exp_P1 (float or None) – Use this value for \(\langle{}log(0.5*(P(x)+Q(x)))\rangle{}_P\); if None, calculate it instead
 ln_P1P2_exp_P2 (float or None) – Use this value for \(\langle{}log(0.5*(P(x)+Q(x)))\rangle{}_Q\); if None, calculate it instead
Returns: djs – JensenShannon divergence calculated according to the dimensionality reduction method
Return type:

MDAnalysis.analysis.encore.similarity.
discrete_jensen_shannon_divergence
(pA, pB)[source]¶ JensenShannon divergence between discrete probability distributions.
Parameters:  pA (iterable of floats) – First discrete probability density function
 pB (iterable of floats) – Second discrete probability density function
Returns: djs – Discrete JensenShannon divergence
Return type:

MDAnalysis.analysis.encore.similarity.
discrete_kullback_leibler_divergence
(pA, pB)[source]¶ KullbackLeibler divergence between discrete probability distribution. Notice that since this measure is not symmetric :: \(d_{KL}(p_A,p_B) != d_{KL}(p_B,p_A)\)
Parameters:  pA (iterable of floats) – First discrete probability density function
 pB (iterable of floats) – Second discrete probability density function
Returns: dkl – Discrete KullbackLiebler divergence
Return type:

MDAnalysis.analysis.encore.similarity.
dres
(ensembles, selection='name CA', dimensionality_reduction_method=<MDAnalysis.analysis.encore.dimensionality_reduction.DimensionalityReductionMethod.StochasticProximityEmbeddingNative object>, distance_matrix=None, nsamples=1000, estimate_error=False, bootstrapping_samples=100, ncores=1, calc_diagonal=False, allow_collapsed_result=True)[source] Calculates the Dimensional Reduction Ensemble Similarity (DRES) between ensembles using the JensenShannon divergence as described in [Tiberti2015].
Parameters:  ensembles (list) – List of ensemble objects for similarity measurements
 selection (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”
 dimensionality_reduction_method – A single or a list of instances of the DimensionalityReductionMethod classes from the dimensionality_reduction module. Different parameters for the same method can be explored by adding different instances of the same dimensionality reduction class. Provided methods are the Stochastic Proximity Embedding (default) and the Principal Component Analysis.
 distance_matrix (encore.utils.TriangularMatrix) – conformational distance matrix, It will be calculated on the fly from the ensemble data if it is not provided.
 nsamples (int, optional) – Number of samples to be drawn from the ensembles (default is 1000). This is used to resample the density estimates and calculate the JensenShannon divergence between ensembles.
 estimate_error (bool, optional) – Whether to perform error estimation (default is False)
 bootstrapping_samples (int, optional) – number of samples to be used for estimating error.
 ncores (int, optional) – Maximum number of cores to be used (default is 1).
 calc_diagonal (bool, optional) – Whether to calculate the diagonal of the similarity scores (i.e. the simlarities of every ensemble against itself). If this is False (default), 0.0 will be used instead.
 allow_collapsed_result (bool, optional) – Whether a return value of a list of one value should be collapsed into just the value.
Returns: dres, details – dres contains the similarity values, arranged in numpy.array. If one number of dimensions is provided as an integer, the output will be a 2dimensional square symmetrical numpy.array. The order of the matrix elements depends on the order of the input ensemble: for instance, if
ensemble = [ens1, ens2, ens3]
then the matrix elements [0,2] and [2,0] will both contain the similarity value between ensembles ens1 and ens3. Elaborating on the previous example, if n ensembles are given and m methods are provided the output will be a list of m arrays ordered by the input sequence of methods, each with a n*x*n symmetrical similarity matrix.
details provide an array of the reduced_coordinates.
Return type: numpy.array, numpy.array
Notes
To calculate the similarity, the method first projects the ensembles into lower dimensions by using the Stochastic Proximity Embedding (or others) algorithm. A gaussian kernelbased density estimation method is then used to estimate the probability density for each ensemble which is then used to compute the JensenShannon divergence between each pair of ensembles.
In the JensenShannon divergence the upper bound of ln(2) signifies no similarity between the two ensembles, the lower bound, 0.0, signifies identical ensembles. However, due to the stochastic nature of the dimensional reduction in
dres()
, two identical ensembles will not necessarily result in an exact 0.0 estimate of the similarity but will be very close. For the same reason, calculating the similarity with thedres()
twice will not result in two identical numbers; small differences have to be expected.Examples
To calculate the Dimensional Reduction Ensemble similarity, two ensembles are created as Universe objects from a topology file and two trajectories. The topology and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples see the module Examples for how to import the files. Here the simplest case of comparing just two instances of
Universe
is illustrated:>>> ens1 = Universe(PSF,DCD) >>> ens2 = Universe(PSF,DCD2) >>> DRES, details = encore.dres([ens1,ens2]) >>> print DRES [[ 0. 0.67996043] [ 0.67996043 0. ]]
In addition to the quantitative similarity estimate, the dimensional reduction can easily be visualized, see the
Example
section inMDAnalysis.analysis.encore.dimensionality_reduction.reduce_dimensionality`
To use a different dimensional reduction methods, simply set the parameter dimensionality_reduction_method. Likewise, different parameters for the same clustering method can be explored by adding different instances of the same method class:
>>> DRES, details = encore.dres([ens1,ens2], dimensionality_reduction_method = encore.PrincipalComponentAnalysis(dimension=2)) >>> print DRES [[ 0. 0.69314718] [ 0.69314718 0. ]]

MDAnalysis.analysis.encore.similarity.
dres_convergence
(original_ensemble, window_size, selection='name CA', dimensionality_reduction_method=<MDAnalysis.analysis.encore.dimensionality_reduction.DimensionalityReductionMethod.StochasticProximityEmbeddingNative object>, nsamples=1000, ncores=1)[source]¶ Use the DRES to evaluate the convergence of the ensemble/trajectory. DRES will be calculated between the whole trajectory contained in an ensemble and windows of such trajectory of increasing sizes, so that the similarity values should gradually drop to zero. The rate at which the value reach zero will be indicative of how much the trajectory keeps on resampling the same ares of the conformational space, and therefore of convergence.
Parameters:  original_ensemble (
Universe
object) – ensemble containing the trajectory whose convergence has to estimated  window_size (int) – Size of window to be used, in number of frames
 selection (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”
 dimensionality_reduction_method – A single or a list of instances of the DimensionalityReductionMethod classes from the dimensionality_reduction module. Different parameters for the same method can be explored by adding different instances of the same dimensionality reduction class.
 nsamples (int, optional) – Number of samples to be drawn from the ensembles (default is 1000). This is akin to the nsamples parameter of dres().
 ncores (int, optional) – Maximum number of cores to be used (default is 1).
Returns: out – array of shape (number_of_frames / window_size, preference_values).
Return type: np.array
Example
To calculate the convergence of a trajectory using the DRES method, a Universe object is created from a topology file and the trajectory. The topology and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples see the module Examples for how to import the files. Here the simplest case of evaluating the convergence is illustrated by splitting the trajectory into a window_size of 10 frames :
>>> ens1 = Universe(PSF,DCD) >>> dres_conv = encore.dres_convergence(ens1, 10) >>> print dres_conv [[ 0.5295528 ] [ 0.40716539] [ 0.31158669] [ 0.25314041] [ 0.20447271] [ 0.13212364] [ 0.06979114] [ 0.05214759] [ 0. ]]
Here, the rate at which the values reach zero will be indicative of how much the trajectory keeps on resampling the same ares of the conformational space, and therefore of convergence.
 original_ensemble (

MDAnalysis.analysis.encore.similarity.
gen_kde_pdfs
(embedded_space, ensemble_assignment, nensembles, nsamples)[source]¶ Generate Kernel Density Estimates (KDE) from embedded spaces and elaborate the coordinates for later use.
Parameters:  embedded_space (numpy.array) – Array containing the coordinates of the embedded space
 ensemble_assignment (numpy.array) – Array containing one int per ensemble conformation. These allow to distinguish, in the complete embedded space, which conformations belong to each ensemble. For instance if ensemble_assignment is [1,1,1,1,2,2], it means that the first four conformations belong to ensemble 1 and the last two to ensemble 2
 nensembles (int) – Number of ensembles
 nsamples (int) – samples to be drawn from the ensembles. Will be required in a later stage in order to calculate dJS.
Returns:  kdes (scipy.stats.gaussian_kde) – KDEs calculated from ensembles
 resamples (list of numpy.array) – For each KDE, draw samples according to the probability distribution of the KDE mixture model
 embedded_ensembles (list of numpy.array) – List of numpy.array containing, each one, the elements of the embedded space belonging to a certain ensemble

MDAnalysis.analysis.encore.similarity.
harmonic_ensemble_similarity
(sigma1, sigma2, x1, x2)[source]¶ Calculate the harmonic ensemble similarity measure as defined in [Tiberti2015].
Parameters:  sigma1 (numpy.array) – Covariance matrix for the first ensemble.
 sigma2 (numpy.array) – Covariance matrix for the second ensemble.
 x1 (numpy.array) – Mean for the estimated normal multivariate distribution of the first ensemble.
 x2 (numpy.array) – Mean for the estimated normal multivariate distribution of the second ensemble.
Returns: dhes – harmonic similarity measure
Return type:

MDAnalysis.analysis.encore.similarity.
hes
(ensembles, selection='name CA', cov_estimator='shrinkage', weights='mass', align=False, details=False, estimate_error=False, bootstrapping_samples=100, calc_diagonal=False)[source] Calculates the Harmonic Ensemble Similarity (HES) between ensembles using the symmetrized version of KullbackLeibler divergence as described in [Tiberti2015].
Parameters:  ensembles (list) – List of Universe objects for similarity measurements.
 selection (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”
 cov_estimator (str, optional) – Covariance matrix estimator method, either shrinkage, shrinkage, or Maximum Likelyhood, ml. Default is shrinkage.
 weights (str/array_like, optional) – specify optional weights. If
mass
then chose masses of ensemble atoms  align (bool, optional) – Whether to align the ensembles before calculating their similarity. Note: this changes the ensembles inplace, and will thus leave your ensembles in an altered state. (default is False)
 details (bool, optional) – Save the mean and covariance matrix for each ensemble in a numpy array (default is False).
 estimate_error (bool, optional) – Whether to perform error estimation (default is False).
 bootstrapping_samples (int, optional) – Number of times the similarity matrix will be bootstrapped (default is 100), only if estimate_error is True.
 calc_diagonal (bool, optional) – Whether to calculate the diagonal of the similarity scores (i.e. the similarities of every ensemble against itself). If this is False (default), 0.0 will be used instead.
Returns: Harmonic similarity measurements between each pair of ensembles.
Return type: numpy.array (bidimensional)
Notes
The method assumes that each ensemble is derived from a multivariate normal distribution. The mean and covariance matrix are, thus, estimatated from the distribution of each ensemble and used for comparision by the symmetrized version of KullbackLeibler divergence defined as:
\[D_{KL}(P(x)  Q(x)) = \int_{\infty}^{\infty}P(x_i) ln(P(x_i)/Q(x_i)) = \langle{}ln(P(x))\rangle{}_P  \langle{}ln(Q(x))\rangle{}_P\]where the \(\langle{}.\rangle{}_P\) denotes an expectation calculated under the distribution P.
For each ensemble, the mean conformation is estimated as the average over the ensemble, and the covariance matrix is calculated by default using a shrinkage estimation method (or by a maximumlikelihood method, optionally).
Note that the symmetrized version of the KullbackLeibler divergence has no upper bound (unlike the JensenShannon divergence used by for instance CES and DRES).
When using this similarity measure, consider whether you want to align the ensembles first (see example below).
Example
To calculate the Harmonic Ensemble similarity, two ensembles are created as Universe objects from a topology file and two trajectories. The topology and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. To run the examples see the module Examples for how to import the files:
>>> ens1 = Universe(PSF, DCD) >>> ens2 = Universe(PSF, DCD2) >>> HES, details = encore.hes([ens1, ens2]) >>> print HES [[ 0. 38279683.95892926] [ 38279683.95892926 0. ]]
You can use the align=True option to align the ensembles first. This will align everything to the current timestep in the first ensemble. Note that this changes the ens1 and ens2 objects:
>>> print encore.hes([ens1, ens2], align=True)[0] [[ 0. 6880.34140106] [ 6880.34140106 0. ]]
Alternatively, for greater flexibility in how the alignment should be done you can call use an AlignTraj object manually:
>>> from MDAnalysis.analysis import align >>> align.AlignTraj(ens1, ens1, select="name CA", in_memory=True).run() >>> align.AlignTraj(ens2, ens1, select="name CA", in_memory=True).run() >>> print encore.hes([ens1, ens2])[0] [[ 0. 7032.19607004] [ 7032.19607004 0. ]]

MDAnalysis.analysis.encore.similarity.
prepare_ensembles_for_convergence_increasing_window
(ensemble, window_size, selection='name CA')[source]¶ Generate ensembles to be fed to ces_convergence or dres_convergence from a single ensemble. Basically, the different slices the algorithm needs are generated here.
Parameters: Returns: The original ensemble is divided into different ensembles, each being a window_sizelong slice of the original ensemble. The last ensemble will be bigger if the length of the input ensemble is not exactly divisible by window_size.
Return type: tmp_ensembles

MDAnalysis.analysis.encore.similarity.
write_output
(matrix, base_fname=None, header='', suffix='', extension='dat')[source]¶ Write output matrix with a nice format, to stdout and optionally a file.
Parameters:  matrix (encore.utils.TriangularMatrix) – Matrix containing the values to be printed
 base_fname (str) – Basic filename for output. If None, no files will be written, and the matrix will be just printed on standard output
 header (str) – Text to be written just before the matrix
 suffix (str) – String to be concatenated to basename, in order to get the final file name
 extension (str) – Extension for the output file