04 Sep 2024
Goals of the Project
The goal of this project is to develop a comprehensive social media and communications strategy to grow the MDAnalysis user and contributor base. This involves conducting surveys with MDAnalysis users and developers to understand which communication channels they prefer to use to seek information about and engage with the MDAnalysis project. It also entails finding ways to boost participation and involvement from everyone in the MDAnalysis community and setting up the MDAnalysis newsletter.
What I Did
At the start of the internship, I explored the MDAnalysis communication channels and documented my observations, recommendations, and current analytics for each channel. I conducted a keyword analysis, including keyword frequencies, of the MDAnalysis mailing lists (now archived), Discord, and GitHub Discussions forum to gain insights into common questions and user needs within the MDAnalysis community. I also investigated how participants discover MDAnalysis workshops by gathering data from past MDAnalysis workshops, and I created a methods and definitions document to clarify the metrics used in my report and my approach to the tasks.
Moving forward, I drafted a survey plan, starting with a description of the MDAnalysis target audience and developing research questions to understand what we aim to learn from the survey. From the research questions, I developed the survey questions and answer options, and the Google form to collect data that will inform our communications strategy. Also, I created a survey outreach plan detailing the outreach and advertising channels we plan to use for distributing the survey, the survey participation plan, and the promotional content (blog post, email, and social media post).
We launched the survey form on August 1st, and I sent follow-up reminders on August 15th and August 23rd. While waiting for the survey responses, I developed a content strategy, which included the proposed content plan and calendar for MDAnalysis. In the last week of the internship, I analyzed the survey results and wrote a summary of them and how they can inform the MDAnalysis communication strategy.
The Current State
Currently, the initial survey results have been analyzed and summarized to inform the MDAnalysis communication strategy, and a proposed content plan and calendar for MDAnalysis have been created.
What’s Left to Do
The next steps involve implementing the communications strategy based on the survey results and making adjustments to the strategy as needed. The survey results will guide the selection of communication channels and content types that will best engage the MDAnalysis community. Launching the MDAnalysis newsletter is another important part of the project. Beyond the internship, an ongoing effort will be needed to sustain the implementation of the communications strategy and grow the MDAnalysis user and contributor base.
Challenges
One of the main challenges I encountered was drafting research and survey questions that would give the feedback needed to inform the MDAnalysis communication strategy. It was important to ask the right questions to gather meaningful feedback from respondents. I was able to navigate this phase with the guidance and valuable input from my mentors and the MDAnalysis leadership.
Another challenge that I faced was ensuring active participation from the community in the survey. Beyond the initial outreach, my mentor and I sent follow-up reminders, and the survey was promoted during the MDAnalysis 2024 User Group Meeting. While the survey did receive a reasonable number of responses, broader participation from the community could have provided even more meaningful insights.
Lessons Learned
Here are some of the lessons I learned during the internship:
- Prioritize tasks/activities that directly contribute to the project goals. This will help you to stay focused and ensure the overall project goals are achieved.
- Before drafting survey questions, it’s important to first identify the key research questions.
- Learn to embrace and implement feedback from your mentors.
- Break down big goals into smaller, achievable weekly tasks. This helped me to stay on track and clearly identify the next steps.
- Communicate with your mentors. Don’t hesitate to ask questions and seek clarity when needed.
- Plan for multiple rounds of feedback and revisions when setting project timelines.
- Document your work and progress when working on a project.
- Be proactive.
Acknowledgements
I would like to extend my sincere gratitude to the Outreachy organizers for this incredible opportunity to contribute to the MDAnalysis project.
A special thanks to the MDAnalysis leadership team and my mentors, Jenna Swarthout Goddard (@jennaswa) and Micaela Matta (@micaela-matta), for their guidance and feedback throughout the internship. This experience has provided me with new skills I will carry forward in my future endeavors.
Finally, I want to express my gratitude to the entire MDAnalysis community for their participation in the survey and for helping make this project a success.
— @adetutudeborah
03 Aug 2024
What is Zarrtraj?
Zarrtraj is an MDAKit for storing and analyzing trajectories in MDAnalysis from the cloud, representing a major milestone towards the MDAnalysis 3.0 goal of cloud streaming as well as a proof-of-concept for a new paradigm in the field of molecular dynamics as a whole.
It can stream H5MD trajectories from cloud storage
providers including AWS S3, Google Cloud buckets, and Azure Blob storage and data lakes.
Using Zarrtraj, anyone can reproduce your analyses or train their machine learning (ML) model on trajectory data
without ever having to download massive files to their disk.
This is possible thanks to Zarr, fsspec,
and kerchunk packages that have created a foundation for storing and interacting with
large datasets in a uniform way across different storage backends. Interestingly, these projects were initially developed by geoscientists in the Pangeo project to make use of cloud computing credits from an NSF partnership with cloud providers, but have since undergone wider adoption by the broader Python community. This project also represents one of the first forays of the molecular dynamics field into the excellent Pangeo ecosystem of tools.
Zarr is especially well-suited to the task of reading cloud-stored files due to its integration with
dask for parallelized reading. This parallelization offsets
the increased IO time in cloud-streaming, speeding up common analysis algorithms by up to ~4x compared to sequential analysis.
See the zarrtraj benchmarks for more.
In this project, we also decided to experiment with storing trajectories directly in Zarr-backed files
using the same specification that H5MD uses, so Zarrtraj can read both .h5md
and H5MD-formatted .zarrmd
files. See this explanation of the modified format
to learn more.
While this GSoC project started with the goal of building a new, Zarr-backed trajectory format, we pivoted
to making the existing H5MD format streamable after getting feedback
from the community that supporting widely adopted formats in the MD ecosystem makes code more sustainable and simplifies tool adoption.
The next section is a walkthrough of Zarrtraj’s features and usage (also available
here).
How can I use it?
Zarrtraj is currently available via PyPI and Conda Forge.
Pip installation
pip install zarrtraj
Conda installation
conda install -c conda-forge zarrtraj
For more information on installation, see the installation guide.
This walkthrough will guide you through the process of reading and writing H5MD-formatted trajectories from cloud storage using
AWS S3 as an example. To learn more about reading and writing trajectories from different cloud storage providers,
including Google Cloud and Azure, see the API documentation.
Reading H5MD trajectories from cloud storage
Uploading your H5MD file
First, upload your H5MD trajectories to an AWS S3 bucket. This requires that an S3 Bucket is set up and configured for
write access using the credentials stored in “sample_profile”. If you’ve never configured an S3 Bucket before, see
this guide. You can set up a profile to easily manage AWS
credentials using this VSCode extension.
Here is a sample profile (stored in ~/.aws/credentials) where
the key is an access key associated with a user that has read and write permissions for the bucket.
[sample_profile]
aws_access_key_id = <key>
MDAnalysis can write a trajectory from
any of its supported formats into H5MD. We
recommend using the chunks
kwarg with the MDAnalysis H5MDWriter with a value that yields ~8-16MB chunks of data for best S3 performance.
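One way to arrive at a chunk shape in the recommended size range is to work backwards from the size of a single frame. The sketch below is only an illustration: the helper name `h5md_chunk_shape`, the 12 MB default target, and the assumption that positions are stored as 32-bit floats are not taken from the Zarrtraj or MDAnalysis documentation.

```python
import math


def h5md_chunk_shape(n_atoms, target_mb=12.0, bytes_per_value=4):
    """Return a (frames, atoms, 3) chunk shape close to target_mb megabytes.

    Hypothetical helper: assumes 32-bit float positions (4 bytes per value).
    """
    bytes_per_frame = n_atoms * 3 * bytes_per_value
    target_bytes = target_mb * 1024 * 1024
    # Always keep at least one frame per chunk, even for very large systems
    frames = max(1, math.floor(target_bytes / bytes_per_frame))
    return (frames, n_atoms, 3)


shape = h5md_chunk_shape(n_atoms=100_000)
# Size of one chunk in MB, under the same 4-bytes-per-value assumption
chunk_mb = shape[0] * shape[1] * shape[2] * 4 / (1024 * 1024)
print(shape, round(chunk_mb, 2))
```

The resulting tuple could then be passed as the chunks value when creating the writer. For very large systems a single frame may already exceed the target, in which case the helper falls back to one frame per chunk.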
Once written locally, you can upload the trajectory to S3 programmatically:
import logging
import os

import boto3
from botocore.exceptions import ClientError

os.environ["AWS_PROFILE"] = "sample_profile"
# This is the AWS region where the bucket is located
os.environ["AWS_REGION"] = "us-west-1"


def upload_h5md_file(bucket_name, file_name):
    """Upload a local H5MD file to the bucket; return True on success."""
    s3_client = boto3.client("s3")
    # Store the object under the file's base name
    obj_name = os.path.basename(file_name)
    try:
        s3_client.upload_file(file_name, bucket_name, obj_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True


if __name__ == "__main__":
    # Using test H5MD file from the Zarrtraj repo
    upload_h5md_file("sample-bucket-name", "zarrtraj/data/COORDINATES_SYNTHETIC_H5MD.h5md")
You can also upload the H5MD file directly using the AWS web interface by navigating to S3, the bucket name, and pressing
“upload”.
Reading your H5MD file
After the file is uploaded, you can use the same credentials to stream the file into MDAnalysis:
import zarrtraj
import MDAnalysis as mda
# This sample topology requires installing MDAnalysisTests
from MDAnalysisTests.datafiles import COORDINATES_TOPOLOGY
import os
os.environ["AWS_PROFILE"] = "sample_profile"
os.environ["AWS_REGION"] = "us-west-1"
u = mda.Universe(COORDINATES_TOPOLOGY, "s3://sample-bucket-name/COORDINATES_SYNTHETIC_H5MD.h5md")
for ts in u.trajectory:
    pass
You can follow this same process for reading .zarrmd
files with the added advantage
that Zarrtraj can write .zarrmd
files directly into an S3 bucket.
Writing trajectories from MDAnalysis into a zarrmd file in an S3 Bucket
Using the same credentials with read/write access, you can write a trajectory
into your bucket.
You can change the stored precision of floating point values in the file with the optional
precision
kwarg and pass in any numcodecs.Codec
compressor with the optional
compressor
kwarg. See numcodecs
for more on the available compressors.
Chunking is automatically determined for all datasets to be optimized for
cloud storage and is not configurable by the user.
Initial benchmarks show this chunking strategy is effective for disk storage as well.
import zarrtraj
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD
import numcodecs
import os
os.environ["AWS_PROFILE"] = "sample_profile"
os.environ["AWS_REGION"] = "us-west-1"
u = mda.Universe(PSF, DCD)
with mda.Writer("s3://sample-bucket-name/test.zarrmd",
                n_atoms=u.trajectory.n_atoms,
                precision=3,
                compressor=numcodecs.Blosc(cname="zstd", clevel=9)) as W:
    for ts in u.trajectory:
        W.write(u.atoms)
If you have additional questions, please don’t hesitate to open a discussion on the Zarrtraj GitHub.
The MDAnalysis Discord is also a
great resource for asking questions and getting involved in MDAnalysis; instructions for joining the MDAnalysis Discord server can be found on the MDAnalysis website.
What’s next for Zarrtraj?
Zarrtraj is currently in a fully operational state and is ready for use!
However, I’m excited about creating some new features in the future that will
make Zarrtraj more flexible and faster.
Lazytimeseries
In MDAnalysis, many trajectory readers expose a timeseries
method for getting access to
coordinate data for a subselection of atoms across a trajectory. This provides
a viable way to sidestep the Timestep
(eagerly-loaded frame-based) paradigm that
MDAnalysis uses for handling trajectory data. Zarrtraj could implement a
“lazytimeseries” that returns a lazy dask array of a selection of atoms’ positions
across the trajectory. Early benchmarks show that analysis based on such a lazy array
can outperform Timestep
-based analysis.
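As a rough illustration of the idea, the sketch below builds a lazy dask array over synthetic coordinates, selects a subset of atoms, and only evaluates the result at the end. The random data, the chunk sizes, and the variable names are all assumptions made for the example; this is not Zarrtraj’s API.

```python
import dask.array as da
import numpy as np

# Stand-in for trajectory data: 100 frames, 50 atoms, xyz coordinates
rng = np.random.default_rng(0)
positions = rng.random((100, 50, 3), dtype=np.float32)

# Wrap the data in a lazy dask array chunked along the frame axis
lazy_positions = da.from_array(positions, chunks=(10, 50, 3))

# Select every fifth atom; nothing is computed yet
selection = np.arange(0, 50, 5)
lazy_selection = lazy_positions[:, selection, :]

# Describe the per-frame geometric center, then evaluate it in one go
lazy_centers = lazy_selection.mean(axis=1)
centers = lazy_centers.compute()
print(centers.shape)
```

Because the whole computation is described before anything is loaded, dask can process chunks in parallel and skip data the analysis never touches.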
Asynchronous Reading
The performance impact of network IO could be reduced by creating a multi-threaded ZARRH5MDReader
that isn’t blocked
by analysis code executing. The reader could eagerly load the cache with the next frames
the analysis code will need to reduce the impact of network IO on execution time.
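A minimal sketch of this read-ahead pattern, using only the standard library, might look like the following. The `prefetch_frames` and `slow_read` names are hypothetical, and the sleep call simply simulates network latency; a real reader would also need error propagation and cancellation, which are left out here.

```python
import queue
import threading
import time


def prefetch_frames(read_frame, n_frames, cache_size=4):
    """Yield frames in order while a background thread reads ahead."""
    cache = queue.Queue(maxsize=cache_size)
    sentinel = object()

    def worker():
        try:
            for i in range(n_frames):
                # Blocks once the cache is full, bounding memory use
                cache.put(read_frame(i))
        finally:
            # Signal the consumer that no more frames are coming
            cache.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = cache.get()
        if item is sentinel:
            return
        yield item


def slow_read(i):
    time.sleep(0.01)  # simulated network latency
    return i * i


frames = list(prefetch_frames(slow_read, n_frames=5))
print(frames)  # [0, 1, 4, 9, 16]
```

While the consumer works on one frame, the worker thread is already fetching the next ones, so analysis time and IO time overlap instead of adding up.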
Acknowledgements
A big thanks to Google for supporting the Google Summer of Code program and to the GSoC team for enabling my project.
Thank you to Dr. Hugo MacDermott-Opeskin (@hmacdope) and Dr. Yuxuan Zhuang (@yuxuanzhuang) for their mentorship and feedback
throughout this project and to Dr. Jenna Swarthout Goddard (@jennaswa) for supporting the GSoC program
at MDAnalysis.
I also want to thank Dr. Oliver Beckstein (@orbeckst) and Edis Jakupovic (@edisj) for lending their expertise
in H5MD and all things MDAnalysis.
Finally, another thanks to Martin Durant (@martindurant), author of Kerchunk, who was incredibly helpful in refining and merging
a new feature in his codebase necessary for this project to work.
Citations
Alistair Miles, jakirkham, M Bussonnier, Josh Moore, Dimitri Papadopoulos Orfanos, Davis Bennett, David Stansby, Joe Hamman, James Bourbeau, Andrew Fulton, Gregory Lee, Ryan Abernathey, Norman Rzepka, Zain Patel, Mads R. B. Kristensen, Sanket Verma, Saransh Chopra, Matthew Rocklin, AWA BRANDON AWA, … shikharsg. (2024). zarr-developers/zarr-python: v3.0.0-alpha (v3.0.0-alpha). Zenodo. https://doi.org/10.5281/zenodo.11592827
de Buyl, P., Colberg, P. H., & Höfling, F. (2014). H5MD: A structured, efficient, and portable file format for molecular data. In Computer Physics Communications (Vol. 185, Issue 6, pp. 1546–1553). Elsevier BV. https://doi.org/10.1016/j.cpc.2014.01.018
Gowers, R., Linke, M., Barnoud, J., Reddy, T., Melo, M., Seyler, S., Domański, J., Dotson, D., Buchoux, S., Kenney, I., & Beckstein, O. (2016). MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. In Proceedings of the Python in Science Conference. Python in Science Conference. SciPy. https://doi.org/10.25080/majora-629e541a-00e
Jakupovic, E., & Beckstein, O. (2021). MPI-parallel Molecular Dynamics Trajectory Analysis with the H5MD Format in the MDAnalysis Python Package. In Proceedings of the Python in Science Conference. Python in Science Conference. SciPy. https://doi.org/10.25080/majora-1b6fd038-005
Michaud‐Agrawal, N., Denning, E. J., Woolf, T. B., & Beckstein, O. (2011). MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. In Journal of Computational Chemistry (Vol. 32, Issue 10, pp. 2319–2327). Wiley. https://doi.org/10.1002/jcc.21787
MDAnalysis
One feature I needed for testing Zarrtraj was full use of writer kwargs
when aligning a trajectory and writing it immediately rather than storing it in memory.
This feature hadn’t yet been implemented, but luckily it was a small change,
so I worked with the core developers to merge this PR with the new feature and tests:
MDAKits
Since Zarrtraj is based on the MDAnalysis MDAKit Cookiecutter
(which is a fantastic tool for getting started in making and distributing Python packages), I was able
to find and fix a few small bugs along the way in my GSoC journey including:
Kerchunk
Kerchunk is central to Zarrtraj’s ability
to read HDF5 files using Zarr. H5MD files (stored in HDF5) contain linked
datasets as per the H5MD standard, but Kerchunk previously did not translate these from
HDF5 to Zarr. I was able to work alongside a Kerchunk core developer
to add the ability to translate linked HDF5 datasets into Zarr, along with comprehensive
tests of this new feature.
Here are a bunch of random things I learned while doing this project!
- Any time you need to run code that will take several hours to execute
due to file size or some other factor, create a minimal, quickly-executing
example of the code to work out bugs before running the full thing. You will
save yourself so, so much frustration.
- It is worth investing time into getting a debugging environment properly configured.
If you’re hunting down a specific bug, it is worth the 20 minutes it will take to create
a barebones example of the bug instead of trying to hunt it down “in-situ”. A lot of the time,
just creating the example will make you realize what was wrong. GH Actions runners sometimes
behave differently than your development machine. This action
for SSHing into a runner is FANTASTIC!
- Maintain a “tmp” directory in your locally cloned repos, gitignore it, and use it for testing random ideas
you have or working through bugs. Take the time to give each file in it a descriptive name!
Having these random scripts and ideas all in one place will pay off massively later on.
- Take risks with ideas if you suspect they might result in cleaner and faster code,
even if you’re not 100% sure! Experimenting is worth it.
- Don’t be afraid to read source code! Sometimes the fastest way to solving a problem
is seeing how someone else solved it, and sometimes the fastest way to learning
why someone else’s code isn’t doing what you expected is to read the code rather than
the docs.