Blog

GSoC 2024 - Streaming H5MD trajectories from the cloud

03 Aug 2024

What is Zarrtraj?

Zarrtraj is an MDAKit for storing and analyzing trajectories in MDAnalysis from the cloud, representing a major milestone towards the MDAnalysis 3.0 goal of cloud streaming as well as a proof-of-concept for a new paradigm in the field of molecular dynamics as a whole. It can stream H5MD trajectories from cloud storage providers including AWS S3, Google Cloud buckets, and Azure Blob storage and data lakes. Using Zarrtraj, anyone can reproduce your analyses or train their machine learning (ML) model on trajectory data without ever having to download massive files to their disk.

This is possible thanks to the Zarr, fsspec, and Kerchunk packages, which have created a foundation for storing and interacting with large datasets in a uniform way across different storage backends. Interestingly, these projects were initially developed by geoscientists in the Pangeo project to make use of cloud computing credits from an NSF partnership with cloud providers, but have since seen wider adoption by the broader Python community. This project also represents one of the first forays of the molecular dynamics field into the excellent Pangeo ecosystem of tools.

Zarr is especially well-suited to the task of reading cloud-stored files due to its integration with dask for parallelized reading. This parallelization offsets the increased IO time in cloud streaming, speeding up common analysis algorithms by up to ~4x compared to sequential analysis. See the zarrtraj benchmarks for more.

In this project, we also decided to experiment with storing trajectories directly in Zarr-backed files using the same specification that H5MD uses, so Zarrtraj can read both .h5md and H5MD-formatted .zarrmd files. See this explanation of the modified format to learn more.

While this GSoC project started with the goal of building a new, Zarr-backed trajectory format, we pivoted to making the existing H5MD format streamable after getting feedback from the community that supporting widely adopted formats in the MD ecosystem makes code more sustainable and simplifies tool adoption.

The next section is a walkthrough of Zarrtraj’s features and usage (also available here).

How can I use it?

Zarrtraj is currently available via PyPI and Conda Forge.

Pip installation

pip install zarrtraj

Conda installation

conda install -c conda-forge zarrtraj

For more information on installation, see the installation guide.

This walkthrough will guide you through the process of reading and writing H5MD-formatted trajectories from cloud storage using AWS S3 as an example. To learn more about reading and writing trajectories from different cloud storage providers, including Google Cloud and Azure, see the API documentation.

Reading H5MD trajectories from cloud storage

Uploading your H5MD file

First, upload your H5MD trajectories to an AWS S3 bucket. This requires that an S3 bucket is set up and configured for write access using the credentials stored in “sample_profile”. If you’ve never configured an S3 bucket before, see this guide. You can set up a profile to easily manage AWS credentials using this VSCode extension. Here is a sample profile (stored in ~/.aws/credentials) where the key is an access key associated with a user that has read and write permissions for the bucket.

[sample_profile]
aws_access_key_id = <key>
aws_secret_access_key = <secret_key>

MDAnalysis can write a trajectory from any of its supported formats into H5MD. We recommend using the chunks kwarg with the MDAnalysis H5MDWriter with a value that yields ~8-16MB chunks of data for best S3 performance. Once written locally, you can upload the trajectory to S3 programmatically:

import logging
import os

import boto3
from botocore.exceptions import ClientError

os.environ["AWS_PROFILE"] = "sample_profile"
# This is the AWS region where the bucket is located
os.environ["AWS_REGION"] = "us-west-1"


def upload_h5md_file(bucket_name, file_name):
    """Upload a local H5MD file to an S3 bucket; return True on success."""
    s3_client = boto3.client("s3")
    # Use the file's base name as the S3 object key
    obj_name = os.path.basename(file_name)

    try:
        s3_client.upload_file(file_name, bucket_name, obj_name)
    except ClientError as e:
        # Log the failure instead of silently ignoring it
        logging.error(e)
        return False
    return True


if __name__ == "__main__":
    # Using test H5MD file from the Zarrtraj repo
    upload_h5md_file("sample-bucket-name", "zarrtraj/data/COORDINATES_SYNTHETIC_H5MD.h5md")

You can also upload the H5MD file directly using the AWS web interface by navigating to S3, the bucket name, and pressing “upload”.
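The ~8-16MB chunk recommendation above can be turned into a concrete chunks value with a little arithmetic. The following is a minimal sketch, assuming positions are stored as 4-byte floats in a (frames, n_atoms, 3) layout and ignoring compression; the helper name frames_per_chunk and the example system size are hypothetical and not part of Zarrtraj or MDAnalysis.

```python
def frames_per_chunk(n_atoms, target_mb=12.0, bytes_per_value=4):
    """Return how many frames fit in one chunk of roughly target_mb megabytes.

    Assumes each frame stores n_atoms x 3 values of bytes_per_value bytes
    (4 bytes corresponds to float32 positions) and no compression.
    """
    bytes_per_frame = n_atoms * 3 * bytes_per_value
    target_bytes = target_mb * 1024 * 1024
    # Always keep at least one frame per chunk, even for very large systems
    return max(1, int(target_bytes // bytes_per_frame))


n_atoms = 100_000  # hypothetical system size
n_frames = frames_per_chunk(n_atoms)
chunks = (n_frames, n_atoms, 3)
chunk_mb = n_frames * n_atoms * 3 * 4 / (1024 * 1024)
print(chunks, f"{chunk_mb:.1f} MB")
```

A tuple computed this way could then be passed as the chunks kwarg when writing the H5MD file locally, provided the writer expects the same (frames, n_atoms, 3) layout.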

Reading your H5MD file

After the file is uploaded, you can use the same credentials to stream the file into MDAnalysis:

import zarrtraj
import MDAnalysis as mda
# This sample topology requires installing MDAnalysisTests
from MDAnalysisTests.datafiles import COORDINATES_TOPOLOGY
import os

os.environ["AWS_PROFILE"] = "sample_profile"
os.environ["AWS_REGION"] = "us-west-1"

u = mda.Universe(COORDINATES_TOPOLOGY, "s3://sample-bucket-name/COORDINATES_SYNTHETIC_H5MD.h5md")
for ts in u.trajectory:
    pass

You can follow this same process for reading .zarrmd files with the added advantage that Zarrtraj can write .zarrmd files directly into an S3 bucket.

Writing trajectories from MDAnalysis into a zarrmd file in an S3 Bucket

Using the same credentials with read/write access, you can write a trajectory into your bucket.

You can change the stored precision of floating point values in the file with the optional precision kwarg and pass in any numcodecs.Codec compressor with the optional compressor kwarg. See numcodecs for more on the available compressors.

Chunking is automatically determined for all datasets to be optimized for cloud storage and is not configurable by the user. Initial benchmarks show this chunking strategy is effective for disk storage as well.

import zarrtraj
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD
import numcodecs
import os

os.environ["AWS_PROFILE"] = "sample_profile"
os.environ["AWS_REGION"] = "us-west-1"

u = mda.Universe(PSF, DCD)
with mda.Writer(
    "s3://sample-bucket-name/test.zarrmd",
    n_atoms=u.trajectory.n_atoms,
    precision=3,
    compressor=numcodecs.Blosc(cname="zstd", clevel=9),
) as W:
    for ts in u.trajectory:
        W.write(u.atoms)

If you have additional questions, please don’t hesitate to open a discussion on the Zarrtraj GitHub. The MDAnalysis Discord is also a great resource for asking questions and getting involved in MDAnalysis; instructions for joining the MDAnalysis Discord server can be found on the MDAnalysis website.

What’s next for Zarrtraj?

Zarrtraj is currently in a fully operational state and is ready for use! However, I’m excited about creating some new features in the future that will make Zarrtraj more flexible and faster.

Lazytimeseries

In MDAnalysis, many trajectory readers expose a timeseries method for getting access to coordinate data for a subselection of atoms across a trajectory. This provides a viable way to sidestep the Timestep (eagerly-loaded frame-based) paradigm that MDAnalysis uses for handling trajectory data. Zarrtraj could implement a “lazytimeseries” that returns a lazy dask array of a selection of atoms’ positions across the trajectory. Early benchmarks show that analysis based on such a lazy array can outperform Timestep-based analysis.

Asynchronous Reading

The performance impact of network IO could be reduced by creating a multi-threaded ZARRH5MDReader that isn’t blocked by analysis code executing. The reader could eagerly load the cache with the next frames the analysis code will need to reduce the impact of network IO on execution time.
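To make the idea concrete, here is a minimal sketch of frame prefetching using only the Python standard library. It is not Zarrtraj code: the class name PrefetchingFrameIterator, the load_frame callable, and the cache_size parameter are all hypothetical, and load_frame simply stands in for a slow network read.

```python
import queue
import threading
import time


class PrefetchingFrameIterator:
    """Iterate over frames while a background thread loads the next ones."""

    def __init__(self, load_frame, n_frames, cache_size=4):
        self._load_frame = load_frame
        self._n_frames = n_frames
        # A bounded queue acts as the frame cache
        self._cache = queue.Queue(maxsize=cache_size)
        self._finished = False
        worker = threading.Thread(target=self._fill_cache, daemon=True)
        worker.start()

    def _fill_cache(self):
        try:
            for index in range(self._n_frames):
                # put() blocks while the cache is full, so the worker stays
                # at most cache_size frames ahead of the consumer
                self._cache.put(("frame", self._load_frame(index)))
            self._cache.put(("done", None))
        except Exception as exc:
            # Hand loading errors to the consumer instead of hanging it
            self._cache.put(("error", exc))

    def __iter__(self):
        return self

    def __next__(self):
        if self._finished:
            raise StopIteration
        kind, payload = self._cache.get()
        if kind == "frame":
            return payload
        self._finished = True
        if kind == "error":
            raise payload
        raise StopIteration


def slow_load(index):
    """Hypothetical stand-in for fetching one frame over the network."""
    time.sleep(0.01)
    return index * 10


frames = list(PrefetchingFrameIterator(slow_load, n_frames=5))
print(frames)
```

While the consumer is busy analyzing one frame, the worker thread keeps fetching the following ones, so network latency overlaps with computation instead of adding to it.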

Acknowledgements

A big thanks to Google for supporting the Google Summer of Code program and to the GSoC team for enabling my project.

Thank you to Dr. Hugo MacDermott-Opeskin (@hmacdope) and Dr. Yuxuan Zhuang (@yuxuanzhuang) for their mentorship and feedback throughout this project and to Dr. Jenna Swarthout Goddard (@jennaswa) for supporting the GSoC program at MDAnalysis.

I also want to thank Dr. Oliver Beckstein (@orbeckst) and Edis Jakupovic (@edisj) for lending their expertise in H5MD and all things MDAnalysis.

Finally, another thanks to Martin Durant (@martindurant), author of Kerchunk, who was incredibly helpful in refining and merging a new feature in his codebase necessary for this project to work.

Citations

Alistair Miles, jakirkham, M Bussonnier, Josh Moore, Dimitri Papadopoulos Orfanos, Davis Bennett, David Stansby, Joe Hamman, James Bourbeau, Andrew Fulton, Gregory Lee, Ryan Abernathey, Norman Rzepka, Zain Patel, Mads R. B. Kristensen, Sanket Verma, Saransh Chopra, Matthew Rocklin, AWA BRANDON AWA, … shikharsg. (2024). zarr-developers/zarr-python: v3.0.0-alpha (v3.0.0-alpha). Zenodo. https://doi.org/10.5281/zenodo.11592827

de Buyl, P., Colberg, P. H., & Höfling, F. (2014). H5MD: A structured, efficient, and portable file format for molecular data. In Computer Physics Communications (Vol. 185, Issue 6, pp. 1546–1553). Elsevier BV. https://doi.org/10.1016/j.cpc.2014.01.018

Gowers, R., Linke, M., Barnoud, J., Reddy, T., Melo, M., Seyler, S., Domański, J., Dotson, D., Buchoux, S., Kenney, I., & Beckstein, O. (2016). MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. In Proceedings of the Python in Science Conference. Python in Science Conference. SciPy. https://doi.org/10.25080/majora-629e541a-00e

Jakupovic, E., & Beckstein, O. (2021). MPI-parallel Molecular Dynamics Trajectory Analysis with the H5MD Format in the MDAnalysis Python Package. In Proceedings of the Python in Science Conference. Python in Science Conference. SciPy. https://doi.org/10.25080/majora-1b6fd038-005

Michaud‐Agrawal, N., Denning, E. J., Woolf, T. B., & Beckstein, O. (2011). MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. In Journal of Computational Chemistry (Vol. 32, Issue 10, pp. 2319–2327). Wiley. https://doi.org/10.1002/jcc.21787

Extras: Upstream code merged during this GSoC project

MDAnalysis

One feature I needed for testing Zarrtraj was full use of writer kwargs when aligning a trajectory and writing it immediately rather than storing it in memory. However, this feature hadn’t yet been implemented, and luckily, it was a small change, so I worked with the core developers to merge this PR with the new feature and tests:

https://github.com/MDAnalysis/mdanalysis/pull/4565

MDAKits

Since Zarrtraj is based on the MDAnalysis MDAKit Cookiecutter (which is a fantastic tool for getting started in making and distributing Python packages), I was able to find and fix a few small bugs along the way in my GSoC journey including:

Kerchunk

Kerchunk is central to Zarrtraj’s ability to read HDF5 files using Zarr. However, H5MD files (stored in HDF5) contain linked datasets as per the H5MD standard, and Kerchunk previously did not translate these from HDF5 to Zarr. I worked alongside a Kerchunk core developer to add the ability to translate linked HDF5 datasets into Zarr, along with comprehensive tests of this new feature.

https://github.com/fsspec/kerchunk/pull/463

Extras: Lessons Learned

Here are a bunch of random things I learned while doing this project!

Any time you need to run code that will take several hours to execute due to file size or some other factor, create a minimal, quickly-executing example of the code to work out bugs before running the full thing. You will save yourself so, so much frustration.
It is worth investing time into getting a debugging environment properly configured.
If you’re hunting down a specific bug, it is worth the 20 minutes it will take to create a barebones example of the bug instead of trying to hunt it down “in-situ”. A lot of the time, just creating the example will make you realize what was wrong.
GH Actions runners sometimes behave differently than your development machine. This action for SSHing into a runner is FANTASTIC!
Maintain a “tmp” directory in your locally cloned repos, gitignore it, and use it for testing random ideas you have or working through bugs. Take the time to give each file in it a descriptive name! Having these random scripts and ideas all in one place will pay off massively later on.
Take risks with ideas if you suspect they might result in cleaner and faster code, even if you’re not 100% sure! Experimenting is worth it.
Don’t be afraid to read source code! Sometimes the fastest way to solving a problem is seeing how someone else solved it, and sometimes the fastest way to learning why someone else’s code isn’t doing what you expected is to read the code rather than the docs.

GSoC 2024 - 2D visualization for small molecules

02 Aug 2024

Contributor: Valerij Talagayev (@talagayev)

Mentors: Cédric Bouysset (@cbouy), Richard Gowers (@richardjgowers) and Yuxuan Zhuang (@yuxuanzhuang)

Organization: MDAnalysis

Release: GSoC 2024: 2D visualization for small molecules Release

During the Google Summer of Code 2024 program, I worked on the 2D visualization of small molecules. This is an important project that allows people to easily visualize the molecule in their file by selecting it as an AtomGroup in MDAnalysis.

Goals of the project

The goal of the project was to write code that lets the user select an AtomGroup with MDAnalysis, convert it into an RDKit mol object with the RDKit converter, and use that object for the visualization.

My Proposal consisted of the following steps:

Select the AtomGroup from user input: u = mda.Universe("input.pdb") creates an MDAnalysis Universe and ag = u.select_atoms("resname UNK") selects the AtomGroup with the molecule that needs to be displayed
Convert the selected AtomGroup into an RDKit mol object via ag.convert_to("RDKit")
Apply Chem.RemoveHs and AllChem.Compute2DCoords to obtain a 2D visualization of the molecule
Display the molecule via IpyWidgets with different checkboxes that highlight specific parts of the molecule when selected

After discussing the structure of the code with my mentors, we decided that I would follow these steps and later add further features to the IpyWidgets visualization that give the user important information about the characteristics of the visualized molecule.

What I did

The outcome of this GSoC program is a package called mdonatello, which was created as follows:

Main Visualization Code

The main part of the code consisted of setting up the widget for the visualization, which allows the molecule to be displayed in a Jupyter notebook. This is performed in the MoleculeVisualizer class, which is responsible for the visualization. It initializes several interactive widgets from IpyWidgets, such as a dropdown menu to select the molecule that the user wants to display and the checkboxes that control aspects of the displayed image, as explained in the next section.

This main code was the first part that I designed. In the beginning, it could only display the molecule and had a single interactive checkbox, which displayed the atom indices of the molecule. As more and more features were added, the visualization grew to accommodate them.

Feature Addition

Adding features was the next step and the biggest part of the project, since there was always another interesting feature to add, proposed either by myself or by one of the mentors.

The first features that I added were the physicochemical properties as well as the number of hydrogen bond donors and acceptors of the molecule.

Next, I added checkboxes that highlight the related features on the image. For example, I implemented pharmacophore checkboxes that highlight the selected pharmacophores: when the hydrophobic feature is selected, the atoms that the RDKit pharmacophore recognition identifies as hydrophobic are highlighted in the assigned color.

Here is a short summary of the features that show a value for the corresponding property or adjust the figure of the molecule to display the values:

Physicochemical Properties
Atom Indices
Partial Charges
Hydrogen Bond Donors & Acceptors

And here is a short summary of the features that highlight the corresponding atoms and bonds when selected:

Rotatable Bonds
Partial Charge Heatmap
Functional Groups
Stereocenters
Murcko Scaffold
Pharmacophores

There is a separate checkbox for each pharmacophoric feature, thus you can select the specific feature that needs to be highlighted.

For the functional groups, the code currently uses SMARTS patterns to identify certain functional groups. It then generates a checkbox for each functional group that was recognized, and selecting a checkbox shows the atoms of that functional group.

Code Restructure

Restructuring the code was work that continued throughout the whole project. In the beginning, the code consisted mainly of functions in one file, and this was one of the points where I was glad to have my mentors’ help. The restructuring took the code from being based only on functions to being organized into classes. In the end, the separate classes were moved into separate files. The benefit of such a structure is that each class or function has a single responsibility, so the code is divided into small parts that are easier to understand and maintain.

Actions, Testing and Documentation

This was the final part of my work, and I group these three topics together here. A test suite based on the pytest framework was created for this project, and coverage reports are uploaded to Codecov. Furthermore, a CI/CD workflow was created that installs the package and runs the tests on Linux, macOS and Windows with Python 3.10, 3.11 and 3.12 and with different IpyWidgets versions, the oldest supported version being 7.6.4. Finally, documentation based on the MDAKits template was created and can be found here: https://mdonatello.readthedocs.io/

Current State

MDonatello is available and can be installed and used on your system (see the How to use it section). Some improvements are planned as explained below.

What’s left to do

The code is working, but there are always things that can be improved. The main part that needs improvement is the functional group recognition: the current SMARTS patterns are not optimal and lead to overlapping functional groups, for example a hydroxyl group being recognized inside a carboxylic acid group, so such cases need to be separated. Additionally, the figure generation currently uses Draw.MolToImage, which the mentors and I agreed is not a good option and needs to be replaced. The Partial Charge Heatmap currently only highlights the atoms; displaying it as a true heatmap is possible but requires code adjustments. The update_display code is also not optimal and needs to be improved. There are some small code structure details to adjust to make the code cleaner, and last but not least, the code needs to be moved into the MDAKits and published on conda-forge and PyPI to make it more accessible.
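One possible way to separate overlapping functional groups, such as the hydroxyl inside a carboxylic acid mentioned above, is to discard any match whose atoms are fully contained in a larger match from another group. The sketch below only illustrates that idea and is not how mdonatello currently works; the function name drop_nested_matches, the input layout, and the atom indices are hypothetical.

```python
def drop_nested_matches(matches):
    """Remove matches that lie entirely inside a match of another group.

    matches maps a functional group name to a list of atom-index tuples,
    for example the results of substructure searches.
    """
    all_matches = [
        (name, frozenset(atoms))
        for name, hits in matches.items()
        for atoms in hits
    ]
    kept = {name: [] for name in matches}
    for name, atoms in all_matches:
        # "<" on sets tests for a strict subset
        nested = any(
            atoms < other
            for other_name, other in all_matches
            if other_name != name
        )
        if not nested:
            kept[name].append(tuple(sorted(atoms)))
    return kept


# Hypothetical atom indices: the hydroxyl at atom 3 is part of the acid,
# while the hydroxyl at atom 5 is a free hydroxyl group
matches = {
    "carboxylic acid": [(1, 2, 3)],
    "hydroxyl": [(3,), (5,)],
}
filtered = drop_nested_matches(matches)
print(filtered)
```

With this rule, only the free hydroxyl would still get its own checkbox highlight, while the one inside the acid is reported as part of the carboxylic acid group.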

What code got merged (or not) upstream

Multiple PRs were created for this project, but they were combined into one large summary PR, which is also the version used in the Release.

How to use it

Currently MDonatello is a separate package that can be git cloned and installed with the following steps:

First clone the repository:

git clone https://github.com/talagayev/MDonatello

Create a virtual environment and activate it:

conda create --name mdonatello python=3.12
conda activate mdonatello

Then go into the MDonatello folder:

cd MDonatello

Finally, install the package from source:

pip install -e .

To use the mdonatello package, you need to run a Jupyter notebook, so run the command:

jupyter notebook

Now that you have started Jupyter, create a notebook file and enter the following code to use mdonatello. Adjust the name of the PDB file and the resname of the molecule to match your own system:

import MDAnalysis as mda
from mdonatello import MoleculeVisualizer

u = mda.Universe("input.pdb")
ag = u.select_atoms("resname UNK")
visualizer = MoleculeVisualizer(ag, show_atom_indices=False, width=-1, height=-1)

This would lead you to obtain the following display, where you could then select the molecule and checkboxes:

Example of MDonatello Display

Lessons Learned During GSoC

Here are some lessons that I learned during GSoC 2024:

Have an idea of the overall structure of the code
Discuss the ideas with the mentors to see their opinion
Don’t be afraid to ask questions
Try to work on different aspects of the code (in my case I was almost always only focusing on feature addition and needed to get reminded that there are other parts)
Try to learn as much from the mentors and apply it in your code
Getting a meeting with all mentors that have different time zones is sometimes tricky :)

Conclusion & The Future

It was a fun project and I liked working on it, but the work doesn’t stop here. As mentioned above, some aspects of the code can still be improved, so I will continue to work on it after GSoC 2024. Here again are the short highlights of what I plan to work on:

Code structure improvement
Figure Generation and SMARTS Patterns need to be improved
Heatmap needs to be created for partial charges instead of atom highlighting
Adding the package to MDAKits and conda-forge
New Features can be added

Acknowledgements

I would like to thank MDAnalysis for giving me the opportunity to work on this project. The application process was very nice and detailed. I liked that during the application process I was able to get to know the MDAnalysis code better, contribute to it, and get to know some people from the community.

I would like to thank Oliver Beckstein (@orbeckst) and Jenna M Swarthout Goddard (@jennaswa) for their insights and help with the organization and structure during the application. I am glad that I was able to contribute to MDAnalysis with this project.

I would also like to thank my mentors Cédric Bouysset (@cbouy), Richard Gowers (@richardjgowers) and Yuxuan Zhuang (@yuxuanzhuang), who helped me a lot during the project with their helpful and insightful mentoring. I am glad that I was able to learn from you, and it helped me to improve my coding skills. Also, shoutout to Hugo MacDermott-Opeskin (@hmacdope) for his help with the PRs during the application process.

Finally, I would also like to thank Google for offering this program and supporting open-source software.

MDAnalysis Community Survey 2024

31 Jul 2024

We are excited to announce the launch of our MDAnalysis Community Survey 2024! Do you or your institution use, develop, contribute to, or sponsor MDAnalysis? Have you attended MDAnalysis workshops or user group meetings? Have you heard about MDAnalysis, but are still learning how to get started or looking for more ways to get involved? If any of these things sound like you, we want to hear from you as part of our valued community! Your feedback is crucial in helping us understand the diverse needs and experiences within our community and guiding our efforts to make MDAnalysis an even more inclusive and welcoming community.

Purpose of the Survey

This survey aims to gather your insights and feedback on various aspects of MDAnalysis, especially about your experiences as a community member and your engagement with the MDAnalysis project.

What will the data be used for?

The information from this survey will help us foster an inclusive and supportive community by:

Determining preferred MDAnalysis communication channels and developing a content strategy to keep the community informed.
Informing workshop and event planning efforts to reach diverse audiences based in various time zones across the globe.
Ensuring MDAnalysis resources are findable and accessible.

Confidentiality and Anonymity

We value your privacy so your responses will be kept confidential and anonymous and only shared with MDAnalysis survey administrators. Demographic information will only be reported in an aggregated format to generate high-level insights.

How to Participate

We invite all members of the MDAnalysis community to participate in the survey. We greatly appreciate your time answering a few questions (around 10 minutes) before August 31st, 2024. Please take a moment to share your thoughts and experiences with us by clicking the following link to access the survey:

Take the Survey

Thank you so much for being a part of the MDAnalysis community. We cannot wait to hear from you!😀

Acknowledgment

This survey was developed as part of an Outreachy project.

The following resources were consulted to aid in the development of this survey:

– @adetutudeborah (Outreachy intern working with MDAnalysis) @jennaswa @micaela-matta (Outreachy mentors)
