What is it about?

Biologists who study evolution (phylogenetics) need the power of supercomputers (HPC clusters) to analyze huge genetic datasets. But using these machines is often frustrating: you can't easily install the software you need, and the workflow is complex and non-interactive. Many turn to simple point-and-click websites, but these tools are often "black boxes" that don't allow for serious, reproducible science. This paper provides a practical, step-by-step guide to doing this work effectively and reproducibly on a standard HPC cluster. We assembled a toolkit of modern, user-friendly software that works without administrator privileges. We show how to use micromamba to manage software, how to run an interactive R session with modern tools to analyze data in real time, and how to use Quarto for "literate programming", which weaves the code, results, and narrative into a single, transparent, and reproducible report.

Why is it important?

This work provides an accessible roadmap that empowers life scientists to take full advantage of high-performance computing without needing to become HPC experts. It lowers the barrier to entry for doing high-quality, large-scale computational biology. By championing a literate programming approach with self-contained software environments, this workflow makes complex bioinformatic analyses transparent and reproducible, and reproducibility is a cornerstone of modern science. It bridges the gap between the often-intimidating world of command-line HPC and the interactive, user-friendly environment that data scientists are used to, showing that you don't have to sacrifice interactivity for performance.

Perspectives

This paper grew out of a collaboration with an applied biology project and was a very pragmatic exercise. My role was to bring my expertise in high-performance and interoperable software to a real-world scientific problem. I saw that the biggest barrier for these biologists wasn't the science itself, but the frustrating experience of using the necessary computational tools. While my other work explores more "purist" solutions like Nix, the goal for this project was to find the most practical, user-friendly toolkit that would get the job done right now. This work is another example of my core interest: building bridges. Whether I'm connecting Fortran to Python as a NumPy maintainer, or connecting a biologist to a supercomputer with a better workflow, the goal is the same: to make powerful computational tools more accessible and reliable. This paper was about demonstrating how to connect all the pieces (data acquisition, HPC execution, and literate analysis) into a seamless and reproducible whole.

Rohit Goswami
University of Iceland

Read the Original

This page is a summary of: High Throughput Reproducible Literate Phylogenetic Analysis, November 2022, Institute of Electrical & Electronics Engineers (IEEE),
DOI: 10.1109/pdgc56933.2022.10053210.
