.. -*- mode: rst; coding: utf-8 -*-
==========
Tutorial
==========

We will first test that MDAnalysis can perform PSA, using the small
test set. We will then look in detail at the radical.pilot script
that implements the map-step. Finally, we conclude with the
reduce-step and analysis of the data.

Preliminary test
================

Provide topology and trajectory files to the PSA script
:program:`mdanalysis_psa_partial.py` as two lists in a JSON file (see
:ref:`mdanalysis_psa_partial` for more details). Initially we just
want to check that the script can process the data:

.. code-block:: bash

   (mdaenv) $ $RPDIR/mdanalysis_psa_partial.py --inputfile testcase.json -n 5

This means:

- use the test case;
- compare the first 5 trajectories against the remaining 5
  trajectories.

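As an illustration of the input only: the exact schema is documented in
:ref:`mdanalysis_psa_partial`, and the key names used below
("topologies", "trajectories") as well as the file names are
assumptions, not necessarily what the script expects. A file pairing
topology and trajectory lists could be generated like this:

```python
import json

# ILLUSTRATION ONLY: the key names "topologies" and "trajectories"
# are assumptions; consult the mdanalysis_psa_partial reference for
# the actual schema expected by the script.
data = {
    "topologies": ["top_{:02d}.pdb".format(i) for i in range(10)],
    "trajectories": ["traj_{:02d}.dcd".format(i) for i in range(10)],
}
with open("testcase.json", "w") as f:
    json.dump(data, f, indent=2)

# Read the file back to check what the PSA script would see.
with open("testcase.json") as f:
    loaded = json.load(f)
print(len(loaded["trajectories"]))  # 10
```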
The ``-n`` (split) argument is important because we are going to use
it to decompose the full distance matrix into sub-matrices. (If you
just want to do all-vs-all comparisons, use the
:program:`mdanalysis_psa.py` script.)
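The effect of the split can be worked out directly: splitting ``N``
trajectories at position ``n`` yields an ``n x (N - n)`` block with
``n * (N - n)`` entries. A small sketch (``block_shape`` is an
illustrative helper, not part of the scripts):

```python
def block_shape(n_total, n_split):
    """Shape and entry count of the sub-matrix computed when the first
    n_split trajectories are compared against the remaining ones.

    (block_shape is an illustrative helper, not part of the scripts.)
    """
    rows, cols = n_split, n_total - n_split
    return rows, cols, rows * cols

print(block_shape(10, 5))  # (5, 5, 25) -- matches the "-n 5" run
print(block_shape(10, 7))  # (7, 3, 21) -- matches the "-n 7" run
```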
You should see output like:

.. code-block:: none

   Loading paths from JSON file testcase.json
   Processing 10 trajectories.
   Splitting trajectories in two blocks of length 5 and 5
   Calculating D (shape 5 x 5) with 25 entries
   ----------[ TIMING ]--------------------
   load Universes                     0.448 s
   universe.transfer_to_memory()      0.178 s
   PSA distance matrix                1.483 s
   saving output                      0.001 s
   ----------------------------------------
   total time                         2.111 s
   ----------------------------------------

This indicates that all MDAnalysis parts are working.

Similarly,

.. code-block:: bash

   (mdaenv) $ $RPDIR/mdanalysis_psa_partial.py --inputfile testcase.json -n 7

should give something like:

.. code-block:: none

   Loading paths from JSON file testcase.json
   Processing 10 trajectories.
   Splitting trajectories in two blocks of length 7 and 3
   Calculating D (shape 7 x 3) with 21 entries
   ----------[ TIMING ]--------------------
   load Universes                     0.422 s
   universe.transfer_to_memory()      0.132 s
   PSA distance matrix                1.436 s
   saving output                      0.001 s
   ----------------------------------------
   total time                         1.991 s
   ----------------------------------------

Supercomputer environment
=========================

We used TACC Stampede to run the calculations on 32 cores, but any
cluster or even a multi-core workstation might be sufficient.

- Make sure all environment variables are set (especially the MongoDB
  connection string :envvar:`RADICAL_PILOT_DBURL`) and that
  password-less ssh works.
- On Stampede: set the environment variable
  :envvar:`RADICAL_PILOT_PROJECT` to your XSEDE allocation:

  .. code-block:: bash

     export RADICAL_PILOT_PROJECT=TG-xxxxxx

- Activate the *mdaenv* environment.
- You should have the JSON files in the ``WORK`` directory.

Copy the two scripts to the ``WORK`` directory (at the moment, this is
a limitation of the scripts to keep them simple):

.. code-block:: bash

   cd WORK
   cp $RPDIR/{rp_psa.py,mdanalysis_psa_partial.py} .

Map-step: Radical.pilot script
==============================

RP script
---------

The pilot script is :program:`rp/rp_psa.py` (see :ref:`rp_psa` for
details). In the following, important parts of the script are
explained.

First, a session is created and a Pilot Manager is initialized:

.. literalinclude:: /code/rp/rp_psa.py
   :language: python
   :lines: 65-70
   :linenos:
   :lineno-start: 65

Second, a ComputePilot is described and submitted to the Pilot Manager:

.. literalinclude:: /code/rp/rp_psa.py
   :language: python
   :lines: 73-82
   :linenos:
   :lineno-start: 73

Third, a Unit Manager is created for the session and the Pilot is
added to it:

.. literalinclude:: /code/rp/rp_psa.py
   :language: python
   :lines: 85-88
   :linenos:
   :lineno-start: 85

The script reads the topology and trajectory files from a JSON file:

.. literalinclude:: /code/rp/rp_psa.py
   :language: python
   :lines: 91-92
   :linenos:
   :lineno-start: 91

The MDAnalysis script is staged:

.. literalinclude:: /code/rp/rp_psa.py
   :language: python
   :lines: 99-112
   :linenos:
   :lineno-start: 99

In a loop, a CU is set up for each block matrix --- this is the
**map** step. In particular, the trajectories are partitioned
according to the ``BLOCK_SIZE`` (the parameter :math:`w`) and all
necessary information is written to a JSON file that will be used as
the input for the :program:`mdanalysis_psa_partial.py` script:

.. literalinclude:: /code/rp/rp_psa.py
   :language: python
   :lines: 115-189
   :emphasize-lines: 29-58,70-72
   :linenos:
   :lineno-start: 115
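The index bookkeeping behind this partitioning can be sketched as
follows. This is a simplified illustration, not the actual
:program:`rp_psa.py` code; ``partition_blocks`` is a hypothetical
helper:

```python
def partition_blocks(n_traj, block_size):
    """Split trajectory indices into contiguous blocks of at most
    block_size and enumerate the block pairs (upper triangle,
    including the diagonal blocks) that together cover the full
    distance matrix.

    (Illustrative sketch; not the actual rp_psa.py logic.)
    """
    blocks = [list(range(start, min(start + block_size, n_traj)))
              for start in range(0, n_traj, block_size)]
    pairs = [(i, j)
             for i in range(len(blocks))
             for j in range(i, len(blocks))]
    return blocks, pairs

# 10 trajectories with block size w = 4: three blocks, six block pairs
blocks, pairs = partition_blocks(10, 4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(pairs)   # [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]
```

Each pair would become one CU that computes the corresponding block of
the distance matrix.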

For the :ref:`reduce step <section-reduce-step>`, all information
about each block matrix (filename and indices into the distance
matrix) is written to a JSON file (the "manifest"):

.. literalinclude:: /code/rp/rp_psa.py
   :language: python
   :lines: 193-194
   :linenos:
   :lineno-start: 193
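As an illustration of what such a manifest might contain --- the field
names below are assumptions, not necessarily what
:program:`rp_psa.py` actually writes:

```python
import json

# ILLUSTRATION ONLY: field names ("filename", "i", "j") are
# assumptions. Each record names a block-matrix output file and the
# [start, stop) row and column ranges of that block in the full
# distance matrix.
manifest = [
    {"filename": "subdistances_00.npy", "i": [0, 5], "j": [0, 5]},
    {"filename": "subdistances_01.npy", "i": [0, 5], "j": [5, 10]},
    {"filename": "subdistances_02.npy", "i": [5, 10], "j": [5, 10]},
]
with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```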

Finally, the CUs are submitted to execute on the compute resources and
the script waits until they are all complete:

.. literalinclude:: /code/rp/rp_psa.py
   :language: python
   :lines: 197-202
   :linenos:
   :lineno-start: 197

Launch pilot jobs
-----------------

Launch the pilot job from the login node of Stampede (or the cluster
where you set up your radical.pilot environment):

.. code-block:: bash

   python rp_psa.py trajectories.json 20 16 SPIDAL.001

The :program:`rp_psa.py` radical.pilot script takes as input (see
:ref:`rp_psa` for details):

- the JSON file with the trajectories (``trajectories.json``)
- the number of trajectories per block (20)
- the number of cores to request (16)
- a session name (an arbitrary string, here "SPIDAL.001"); note that
  you need to provide a new session name every time you rerun the
  command, e.g. "SPIDAL.002", "SPIDAL.003", ...

.. _section-reduce-step:

Reduce-step
===========

Once the pilot job has completed and all block matrices have been
computed, we combine all the data (**reduce**) into the distance
matrix :math:`D_{ij}` and analyze it:

.. code-block:: bash

   psa_reduce.py -t trajectories.json -m manifest.json \
                 -p psa_plot.pdf -o distance_matrix.npy

Combine block matrices
----------------------

The ``manifest.json`` file contains all the information that is needed
to re-assemble the distance matrix: for each output file it also
stores the indices of the sub-matrix in the distance matrix, so the
full distance matrix can be built with

.. literalinclude:: /code/rp/psa_reduce.py
   :language: python
   :pyobject: combine
   :emphasize-lines: 4,11,12
   :linenos:

The matrix is written to a numpy file ``distance_matrix.npy`` (and can
be loaded with :func:`numpy.load`).
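The re-assembly logic can be sketched independently of the actual
``combine`` function. Index conventions and field names below are
illustrative assumptions; in the real script each block would be read
with :func:`numpy.load` from the file named in the manifest:

```python
import numpy as np

def combine_blocks(entries, n):
    """Assemble the full n x n distance matrix from block entries.

    Each entry carries [start, stop) row/column index ranges plus the
    block data; in the real script the data would come from
    np.load(entry["filename"]). Field names are illustrative.
    """
    D = np.zeros((n, n))
    for entry in entries:
        (i0, i1), (j0, j1) = entry["i"], entry["j"]
        sub = entry["data"]
        D[i0:i1, j0:j1] = sub
        D[j0:j1, i0:i1] = sub.T  # the distance matrix is symmetric
    return D

# Toy data: a 4 x 4 matrix assembled from three 2 x 2 blocks
entries = [
    {"i": (0, 2), "j": (0, 2), "data": np.zeros((2, 2))},
    {"i": (0, 2), "j": (2, 4), "data": np.ones((2, 2))},
    {"i": (2, 4), "j": (2, 4), "data": np.zeros((2, 2))},
]
D = combine_blocks(entries, 4)
np.save("distance_matrix.npy", D)  # reload later with np.load
print(D[0, 3], D[3, 0])  # 1.0 1.0
```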

Analysis
--------

The distance matrix can be clustered to reveal patterns in the
trajectories. Here we use hierarchical clustering with `Ward's linkage
criterion`_ as implemented in :mod:`scipy.cluster.hierarchy`
[Seyler2015]_. The clustering and plotting code was taken from the
:meth:`~MDAnalysis.analysis.psa.PSAnalysis.cluster` and
:meth:`~MDAnalysis.analysis.psa.PSAnalysis.plot` methods of
:class:`~MDAnalysis.analysis.psa.PSAnalysis`. The plotting code is
fairly complicated but the clustering is straightforward:

.. literalinclude:: /code/rp/psa_reduce.py
   :language: python
   :pyobject: cluster
   :linenos:
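A minimal, self-contained sketch of this clustering step, assuming a
symmetric distance matrix with zero diagonal. This mirrors the idea of
the ``cluster`` function but is not the actual code;
``cluster_distance_matrix`` is a hypothetical helper:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_distance_matrix(D, n_clusters=2):
    """Ward-linkage hierarchical clustering of a symmetric,
    zero-diagonal distance matrix; returns flat cluster labels.

    (Illustrative sketch, not the actual psa_reduce.py code.)
    """
    condensed = squareform(D, checks=False)  # 1-D condensed form
    Z = linkage(condensed, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy distance matrix: two well-separated groups of "trajectories"
D = np.array([[0.0, 1.0, 9.0, 9.0],
              [1.0, 0.0, 9.0, 9.0],
              [9.0, 9.0, 0.0, 1.0],
              [9.0, 9.0, 1.0, 0.0]])
labels = cluster_distance_matrix(D)
print(labels)  # two groups, e.g. [1 1 2 2]
```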

In the resulting plot we indicate FRODA trajectories with a heavy
double bar and DIMS trajectories with a thin single bar.

.. _figure-psa:

.. figure:: /figures/psa_distance_matrix.png
   :alt: Clustered PSA distance matrix

   **PSA of AdK transitions**. Transitions of AdK from the closed to
   the open conformation were generated with the DIMS MD and the FRODA
   method. *Path similarity analysis* (PSA) was performed by
   hierarchically clustering the matrix of Fréchet distances between
   all trajectories, using `Ward's linkage criterion`_.

   +----------+-----------------------+
   | Symbol   | Method                |
   +==========+=======================+
   | ``||``   | FRODA                 |
   +----------+-----------------------+
   | ``|``    | DIMS                  |
   +----------+-----------------------+

Figure :ref:`PSA of AdK transitions <figure-psa>` shows clearly that
DIMS and FRODA transitions are different from each other. Each method
produces transitions that are characteristic of the method itself. In
a comparison of many different transition path sampling methods it was
also found that this pattern generally holds, which indicates that
there is likely no "best" method yet [Seyler2015]_.

.. _`Ward's linkage criterion`:
   https://en.wikipedia.org/wiki/Ward%27s_method