What is it about?

Anyone who has used a supercomputer (HPC cluster) knows the frustration of trying to install the software they need for their research. You don't have administrator rights, the system libraries are often years out of date, and getting everything to compile correctly is a nightmare known as "dependency hell." This makes truly reproducible science incredibly difficult. This paper argues that a powerful but underutilized tool called Nix is the ideal solution. Nix is a functional package manager: it treats software installation like a mathematical function, so the exact same set of inputs (source code, dependencies) is designed to produce the same compiled program every time, on any machine. Nix was not, however, designed to work with the job schedulers (like SLURM) that run supercomputers. Our paper provides a blueprint for integrating them, proposing a system where Nix submits compilations as formal, managed jobs to the cluster's queue.
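
To make the "installation as a mathematical function" idea concrete, here is a minimal Python sketch. It is not how Nix computes its store paths (the real derivation format and hash encoding differ), and the function name `store_path`, the package names, and the version strings are all made up for illustration. It only shows the principle: the install location is derived purely from the build inputs, so identical inputs map to the same location and any change in a dependency yields a different one.

```python
import hashlib
import json


def store_path(name, source, dependencies, build_script):
    """Return a hypothetical store path derived only from the build inputs.

    Illustrative only: real Nix hashes a derivation file and uses its own
    encoding; this sketch just hashes a JSON description of the inputs.
    """
    recipe = json.dumps(
        {
            "name": name,
            "source": source,
            # Sort so the order in which dependencies are listed is irrelevant.
            "dependencies": sorted(dependencies),
            "build_script": build_script,
        },
        sort_keys=True,
    )
    digest = hashlib.sha256(recipe.encode("utf-8")).hexdigest()[:32]
    return f"/nix/store/{digest}-{name}"


# Made-up inputs for illustration.
build = "./configure && make install"
a = store_path("fftw-3.3.10", "tarball-hash-abc", ["gcc-12.2.0", "openmpi-4.1.4"], build)
b = store_path("fftw-3.3.10", "tarball-hash-abc", ["openmpi-4.1.4", "gcc-12.2.0"], build)
c = store_path("fftw-3.3.10", "tarball-hash-abc", ["gcc-13.1.0", "openmpi-4.1.4"], build)

print(a == b)  # True: same inputs, same path
print(a == c)  # False: a different compiler version gives a different path
```

Because the path changes whenever an input changes, two versions of the same package can sit side by side without overwriting each other, which is what lets users on a shared cluster install software without touching the system libraries.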

Why is it important?

This approach solves "dependency hell" for scientists, freeing them from the time-consuming headache of managing software and allowing them to focus on their actual research. By ensuring that a computational environment can be replicated anywhere, it provides a solid foundation for reproducible science: a computation run today can be re-run by another scientist years from now. Tools like Docker raise security concerns on shared systems, because access to the Docker daemon is effectively root access; Nix lets users build and run software without being granted such privileges. Our proposed integration with job schedulers also makes cluster administrators' lives easier by ensuring that resource-intensive compilations are properly managed. It's a practical roadmap for HPC centers to create a more modern, robust, and user-friendly software environment.

Perspectives

This work grew directly out of my experience as a researcher trying to get my own complex codes to run, as an HPC systems specialist helping others, and as a contributor to many FOSS projects. I had just finished setting up much of the initial software tooling for the Icelandic national HPC cluster, and I saw the same pattern everywhere: scientists spending a huge fraction of their time fighting with broken software installations. The existing solutions all had fundamental flaws in security or reproducibility. Nix was the answer. Its functional, deterministic approach felt like it was designed to solve the exact problems of scientific computing; the only challenge was that it didn't speak the language of traditional HPC job schedulers. This paper was my blueprint for making them talk to each other. It was born from the very practical pain points I experienced, and it's about taking an elegant idea from the functional programming world to make life on a supercomputer less chaotic and more scientifically rigorous for everyone.

Rohit Goswami
University of Iceland

Read the Original

This page is a summary of: Reproducible High Performance Computing without Redundancy with Nix, November 2022, Institute of Electrical & Electronics Engineers (IEEE),
DOI: 10.1109/pdgc56933.2022.10053342.