Usage

All datasets in MDAnalysisData are accessible via fetch_* functions in the MDAnalysisData.datasets module. Datasets are organized in submodules by the type of simulation they represent. The currently included datasets are:

adk_equilibrium
    AdK equilibrium trajectory without water.

adk_transitions
    Ensembles of AdK transitions.

PEG_1chain
    Molecular dynamics trajectory of a single PEG chain in TIP3P water.

ifabp_water
    MD simulation of I-FABP with water.

nhaa_equilibrium
    NhaA equilibrium trajectory without water.

yiip_equilibrium
    YiiP equilibrium trajectory with water.

vesicles
    Large vesicles library (coarse grained).

CG_fiber
    Coarse-grained molecular dynamics of an amphiphilic fiber.
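
Because all loaders share the fetch_ prefix, they can also be discovered interactively. The following is a minimal sketch that uses only standard Python introspection; the exact list of functions depends on the installed version of MDAnalysisData:

>>> from MDAnalysisData import datasets
>>> # list all dataset loaders (they all start with "fetch_")
>>> sorted(name for name in dir(datasets) if name.startswith("fetch_"))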

Accessing a dataset

As an example, we can access the AdK equilibrium dataset with the fetch_adk_equilibrium() function:

>>> from MDAnalysisData import datasets
>>> adk = datasets.fetch_adk_equilibrium()

This will download the dataset from figshare (doi: 10.6084/m9.figshare.5108170.v1) and unpack it into a cache directory. Only the first call to fetch_adk_equilibrium() is therefore slow; subsequent calls use the cached files. The resulting Bunch object can be introspected to see what the dataset includes. In particular, it features a DESCR attribute with a human-readable description of the dataset:

>>> print(adk.DESCR)

AdK equilibrium trajectory dataset
==================================

MD trajectory of apo adenylate kinase with CHARMM27 force field and
simulated with explicit water and ions in NPT at 300 K and 1
bar. Saved every 240 ps for a total of 1.004 µs. Produced on PSC
Anton. The trajectory only contains the protein and all solvent
stripped. Superimposed on the CORE domain of AdK by RMSD fitting.

The topology is contained in the PSF file (CHARMM format). The
trajectory is contained in the DCD file (CHARMM/NAMD format).

Notes
-----

Data set characteristics:

 :size: 161 MB
 :number of frames:  4187
 :number of particles: 3341
 :creator: Sean Seyler
 :URL:  `10.6084/m9.figshare.5108170.v1 <https://doi.org/10.6084/m9.figshare.5108170.v1>`_
 :license: `CC-BY 4.0 <https://creativecommons.org/licenses/by/4.0/legalcode>`_
 :reference: [Seyler2017]_


.. [Seyler2017]  Seyler, Sean; Beckstein, Oliver (2017): Molecular dynamics
           trajectory for benchmarking
           MDAnalysis. figshare. Fileset. doi:
           `10.6084/m9.figshare.5108170.v1
           <https://doi.org/10.6084/m9.figshare.5108170.v1>`_

The paths to the topology and trajectory files can be accessed:

>>> print(adk.topology)
>>> print(adk.trajectory)

and they can be loaded directly into an MDAnalysis.Universe:

>>> import MDAnalysis as mda
>>> u = mda.Universe(adk.topology, adk.trajectory)
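
From here, the usual MDAnalysis workflow applies. As a quick sanity check, the sketch below counts frames and selects the protein atoms; these are standard MDAnalysis calls, and the expected numbers are the ones quoted in the dataset's DESCR:

>>> len(u.trajectory)                    # 4187 frames according to DESCR
>>> u.select_atoms("protein").n_atoms    # 3341 particles (solvent was stripped)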

Managing data

When a dataset is downloaded from its remote location, it is copied to a local data directory and cached; subsequent accesses use the cached copy. By default, data are stored in the data directory ~/MDAnalysis_data (i.e., under the user's home directory).

The location of the data directory can be changed by setting the environment variable MDANALYSIS_DATA, for instance:

export MDANALYSIS_DATA=/tmp/MDAnalysis_data

All fetch_* functions also have a keyword argument data_home that can be used to set an alternative data directory.
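
For example, to fetch the AdK dataset into a custom location (here the same /tmp/MDAnalysis_data directory as above):

>>> adk = datasets.fetch_adk_equilibrium(data_home="/tmp/MDAnalysis_data")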

The location of the data directory can be obtained with MDAnalysisData.base.get_data_home().
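
For instance, to print the directory currently in use:

>>> from MDAnalysisData.base import get_data_home
>>> print(get_data_home())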

If a dataset or the whole data directory is removed, the data are downloaded again the next time they are needed. If data are distributed as archives (zip or tar files), both the archive and the unpacked data are stored; removing only the archive will trigger a re-download, because it is the archive itself that is verified against the checksum.

Only datasets that are needed are downloaded. However, the full data directory can take up more than 2 GB of space. Individual subdirectories (e.g., datasets that are currently not needed) may be deleted manually, and the whole data directory can be wiped (removed) with the function MDAnalysisData.base.clear_data_home().
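
For example, to remove the entire cache (it will be re-populated on demand by later fetch_* calls):

>>> from MDAnalysisData.base import clear_data_home
>>> clear_data_home()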