What is it about?

Phylogenetic analyses of large genomic datasets require high-performance computing (HPC) resources, but the workflow (installing specialized software, managing job submissions, tracking parameters) is complex and, in practice, poorly documented. Many biologists rely on ad hoc scripts that work on one system but cannot be reproduced elsewhere. We built a literate programming workflow for phylogenetic analysis on HPC clusters, using Nix for environment management and org-mode for documentation. The entire analysis, from data preprocessing through tree estimation to visualization, lives in a single document that interleaves code, results, and narrative. The Nix environment ensures that the same software versions run on any cluster. The literate document ensures that every parameter choice is recorded in context. We demonstrated the workflow on a real dataset from an applied biology project.

Why is it important?

Computational biology increasingly depends on HPC, but the barrier to entry remains high for researchers without systems administration experience. Package installation alone can consume days of effort. Our workflow addresses the usability and reproducibility problems together: Nix handles the environment, and literate programming handles the documentation. The result is an analysis that another researcher can re-run on a different cluster by cloning a repository and running a single command. The approach generalizes beyond phylogenetics to any HPC workflow that chains multiple sequential analysis steps with many parameters.

Perspectives

This paper came from a collaboration with an applied biology group that needed phylogenetic analyses on the national HPC cluster. The biologists were capable scientists but had no interest in becoming Linux systems administrators. My contribution was the infrastructure: packaging the phylogenetic tools with Nix and embedding the workflow in a literate document. The point was to make the HPC part invisible so the biologists could focus on the biology. The combination of Nix and literate programming turned out to work better than either alone. Nix handles the "it works on my machine" problem; literate programming handles the "what parameters did I use?" problem.

Rohit Goswami
University of Iceland

Read the Original

This page is a summary of: High Throughput Reproducible Literate Phylogenetic Analysis, October 2024, Center for Open Science, DOI: 10.31219/osf.io/7pzha.
