# GSOC 2018: Improvements in distance search methods

We are pleased to announce another successful year of Google Summer of Code with the NumFOCUS organization, thanks to Richard Gowers and Jonathan Barnoud for mentoring the GSoC students. This year one of the projects was to improve the performance of pairwise distance computations, which is used quite frequently in MDAnalysis in different forms.

MDAnalysis v0.19.0 and higher now include the new functions MDAnalysis.lib.distances.capped_distance and MDAnalysis.lib.distances.self_capped_distance which offer a much faster way to calculate all pairwise distances up to a certain maximum distance. By only considering distances up to a certain maximum, we can use various algorithms to optimise the number of pairwise comparisons that are performed. Behind the scenes, these functions are using one of three different algorithms: bruteforce which is a naive pairwise distance calculation algorithm, pkdtree which is a wrapper method around Scipy’s KD tree search algorithm and nsgrid which is an implementation of cell-list algorithm. This last algorithm uses the new MDAnalysis.lib.nsgrid module which was implemented with the help of Sebastien Buchoux.

For more information on these algorithms the reader is encouraged to read @ayushsuhane’s blog, which includes a comparison of these approaches and their performance in different conditions.

## The GSoC project

One of the major bottleneck in various analysis routines in MDAnalysis (and typically in molecular dynamics studies) is the evaluation of pairwise distances among the particles.

The primary problem revolves around fixed radius neighbor search algorithms. MDAnalysis offers a suite of algorithms including brute force method, tree-based binary search algorithms to solve such problems. While these methods are suitable for a variety of analysis functions using pairwise distances in MDAnalysis, one of the question was whether one can improve the performance of distance calculations using other established neighbor search methods.

This question led to the inception of a Google Summer of Code project with NumFOCUS. Ayush Suhane completed the project and was able to demonstrate performance improvements for specific cases of distance selections, identification of bonds, and radial distribution function in the analysis module of MDAnalysis. More details on the commit history, PR’s and blog posts can be found in the final report submitted to GSoC. Real-time benchmarks for specific modules in MDAnalysis can be found here.

## The new capped_distance() function

The major highlight of the project is the introduction of MDAnalysis.lib.distances.capped_distance which allows automatic selection of methods based on predefined set of rules to evaluate pairs of atoms in the neighborhood of any particle. It allows a user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. To test any new algorithm, one must comply with the following API:

def newmethod_capped(reference, configuration, max_cutoff, min_cutoff=None, box=None, return_distance=True):
"""
An Algorithm to evaluate pairs between reference and configuration atoms
and corresponding distances
"""
return pairs, distances

Once the method is defined, register the function name in _determine_method in MDAnalysis.lib.distances as:

from MDAnalysis.lib import distances
distances_methods['newmethod'] = newmethod_capped

That’s it. The new method is ready to be tested across functions which use capped_distance. For any specific application, it can be called as capped_distance(ref, conf, max_dist, method='newmethod') from the function.

## Treatment of periodic boundary conditions

While implementing any new algorithm for Molecular dynamics trajectories, one additional requirement is to handle the periodic boundary conditions. A combination of versatile function augment_coordinates and undo_augment can be used with any algorithm to handle PBC. The main idea is to extend the box by generating duplicate particles in the vicinity of the box by augment_coordinates. These duplicates, as well as the original particles, can now be used with any algorithm to evaluate the nearest neighbors. After the operation, the duplicate particles can be reverted back to their original particle indices using undo_augment. These functions are available in MDAnalysis.lib._augment. We encourage the interested readers to try different algorithms using these functions. Hopefully, you can help us improve the performance further with your feedback. As a starting point, the skeleton to enable PBC would take the following form:

aug, mapping = augment_coordinates(coords, box, radius)
all_coords = no.concatenate([coords, aug])
"""
Perform operations for distance evaluations
with **all_coords** using the new algorithm
and obtain the result in indices
"""
indices = undo_augment(indices, mapping, len(coords))
return indices

Finally, this function can be tested with capped_distance to check the performance against already implemented algorithms in MDAnalysis.

## Performance improvements

As a final note, we managed to get a speed improvement of

• ~ 2-3 times in Radial Distribution Function computation,
• ~ 10 times in identification of bonds, and
• ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis.

The performance is also found to improve with larger datasets but is not reported in benchmarks. Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world.

This was a flavor of what work was done during GSoC’18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead allow a user to focus on their analysis, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks.

# NumFOCUS Sponsored Workshop and Hackathon

We are happy to announce that thanks to a NumFOCUS development grant we will be hosting a 2 day workshop and hackathon aimed at introducing researchers to MDAnalysis. The event will be free to attend and will be held at Northwestern University, Evanston IL on the 12th and 13th of November 2018. In addition, there are small travel grants available for people attending from other US institutions aimed at promoting diversity in STEM. Women and minorities are especially encouraged to apply!

The first day of the workshop will focus on covering the basics of the Python programming language and an introduction to the MDAnalysis library. The second day will cover how to apply MDAnalysis to your own research data. The workshop will conclude with a hackathon which will teach you how to start contributing to open source software.

## Registration

To attend the Workshop please register here. Space is limited and registration closes on October 15th.

## Program and Materials

The workshop materials are online and are available for attendants and anyone else interested.

# Google Summer of Code Students 2018

We are happy to anounce that MDAnalysis is hosting two GSoC students for NumFOCUS this year, Ayush Suhane (@ayushsuhane) on GitHub) and Davide Cruz (@davidercruz).

## Ayush Suhane: Improve Distance Search Methods in MDAnalysis

With the capability of multiple MD codes to easily handle milions of atoms, a major roadblock to analysis of this vast amount of data corresponding to positions of each atoms at every timestep is the time to evaluate pairwise distance between multiple atoms. Almost every operation requires the distance between the pair of atoms, fast calculation of pairwise distance is of utmost importance. Multiple basic analysis functions like Radial Distribution Function, Contact Matrices, depepend very heavily on fast distance evaluations. Apart from naive approach for pairwise calculations which scale as $$\mathcal{O}(N^2)$$, other forms of data structures like KDTree, Octree are sugested for faster calulations based on the requirements. Based on the MDAnalysis, two use cases are identified as highly used in majority of the analysis algorithms. The goal of the project is to identify the data structure based on the requirements of the use case and implement in the MDAnalysis library along with clear documentations and test cases.

Ayush is a graduate student in Materials Engineering at UBC, Canada. He is working with Molecular dynamics simulations for his Master’s thesis. He wishes to contribute to the open source community by simplifying the complex analysis and visualization involved in MD. During GSoC, he aims to use the opportunity to learn from the already established open source contributors and continue the tradition by becoming an active member of the community. In his free time, he also likes to read fiction novels and play computer games.

Ayush will describe his progress on his blog.

## Davide Cruz: Implementing on-the-fly coordinate transformations

Implement trajectory transformations on the MDAnalysis API, to be called on-the-fly by the user, eliminating the requirement for multiple intermediate steps of modifying and saving the trajectory, and giving users a more efficient and simple workflow for simulation data analysis.

Davide is currently on the last year of his PhD on Molecular Biosciences at ITQB-NOVA in Lisbon, Portugal. For his thesis he is using MDAnalysis to analyse the results of molecular dynamics simulationations and this GSoC project is an opportunity to contribute to the community. He expects to learn a lot about python and software development during this summer.

Davide will describe his progress on his blog.

## Other NumFOCUS students

NumFOCUS is hosting 45 students this year for several of their supported and affiliated projects. You can find out about the other students here.