What is it about?

Microorganisms are ubiquitous and essential: they inhabit many ecological niches and contribute to the equilibrium of their host organism or ecosystem. Several studies have shown that when this balance is lost, dysbiosis (characteristic of some pathologies) or changes in the ecological community are observed. Next-generation sequencing technologies make it possible to measure the genetic material (sequences) of all microorganisms in a sample. Bioinformatics methods are then used to reconstruct the abundance of each microorganism from the sequence data, and downstream analysis methods mine information from the abundance data. One example is diversity analysis, which measures how the overall microbial community differs between samples belonging to different experimental groups. Differential abundance (DA) analysis goes beyond determining diversity and identifying patterns between the abundance profiles of different experimental groups: it aims to identify the specific microorganisms that underlie this diversity, in order to understand which microorganisms potentially play a role in the observed dysbiosis.

To date, several DA methods have been developed and are widely used to analyze microbial sequence data. However, there is no consensus on the best approach. Moreover, recently published studies demonstrate a lack of consistency among DA methods, casting doubt on the reliability of their findings. In this paper we shed light on the performance of DA methods by developing: (i) a robust, reliable, fair and reproducible simulation framework that generates synthetic abundance data in which the differentially abundant microorganisms are known (the so-called Ground Truth); (ii) a complete assessment framework that includes new methods and realistic experimental scenarios not yet investigated, evaluated on a wide set of metrics. With these two ingredients we overcome the limitations of previous comparative studies and perform an extensive benchmarking of the known DA methods.
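To make the idea of synthetic data with a known Ground Truth concrete, here is a minimal Python sketch. It is a toy illustration only and not the simulation model used in the paper: the function name `simulate_counts`, all of its parameters, the log-normal baseline and the fixed fold change are assumptions chosen for brevity.

```python
import random


def simulate_counts(n_taxa=50, n_per_group=10, n_da=5, fold=4.0,
                    depth=10_000, seed=0):
    """Toy two-group count table with a known set of changed taxa.

    Every parameter here is hypothetical; this is not the paper's model.
    """
    rng = random.Random(seed)
    # Baseline weights per taxon, skewed so that a few taxa dominate.
    base = [rng.lognormvariate(0.0, 1.5) for _ in range(n_taxa)]
    # Ground Truth: indices of taxa whose weight is changed in group B.
    truth = set(rng.sample(range(n_taxa), n_da))
    shifted = [w * fold if i in truth else w for i, w in enumerate(base)]

    def draw_sample(weights):
        # Multinomial draw: assign `depth` reads to taxa according to weights.
        counts = [0] * n_taxa
        for taxon in rng.choices(range(n_taxa), weights=weights, k=depth):
            counts[taxon] += 1
        return counts

    group_a = [draw_sample(base) for _ in range(n_per_group)]
    group_b = [draw_sample(shifted) for _ in range(n_per_group)]
    return group_a, group_b, truth


group_a, group_b, truth = simulate_counts()
print(sorted(truth))  # the taxa a DA method should ideally recover
```

Because every sample is drawn at a fixed depth, raising the weights of some taxa lowers the relative share of all the others in this toy model, which is one reason the precise definition of "differentially abundant" matters when building a Ground Truth.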

Why is it important?

The literature lacks: (i) a simulation framework to assess DA method performance against a known Ground Truth; (ii) an assessment framework that evaluates all possible technical and biological covariates; (iii) guidance on which DA methods to use for analysis, and on which data characteristics developers must take into account to improve their methods. In this work we respond to these needs by developing a reliable and reproducible simulation framework, based on a new definition of differentially abundant microorganisms, that maintains all the typical characteristics of microbial sequence data. Moreover, the assessment framework includes new methods as well as scenarios and covariates evaluated in all possible combinations. Finally, several important take-home messages emerge:

- Methods show good control of the type I error.
- More samples are needed to reach high recall and good control of the False Discovery Rate. This should be taken into consideration in the study design phase.
- Sequencing depth does not impact the methods' performance. Therefore, sequencing fewer reads and performing more biological replicates could be an effective strategy to increase the power and accuracy of results.
- The (biological) variability of microorganisms influences the methods' recall, but not their precision. The results show that the number of samples needed to obtain adequate recall depends on the variability of the dataset. Consequently, in the experimental design phase, a preliminary investigation of variability, on publicly available datasets from the niche of interest or on the datasets used in this work, could guide the choice of sample size.
- Low-abundance microorganisms are not handled properly by most DA methods. For developers, this result suggests that the intensity-variability characteristic of the data is a weak point of current DA methods. Analysts interested in studying DA microorganisms at low abundance, on the other hand, must be aware that even with large sample sizes they may not detect reliable biomarkers.
- Normalization (GMPR) does not affect the overall ranking of methods. Since all results in this work are obtained with each method's default normalization, the fact that an alternative normalization does not change performance is a signal to users that they can rely on the default.

To ensure the reproducibility of results, simulated data and assessment scripts are included in the metaBenchDA R package (see Resources section below). Moreover, a Docker container image containing the developed R package and the tested DA methods is available at the same link. This package could be a useful tool for developers who want to test their methods in a reliable simulation and assessment framework.
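The recall, precision, False Discovery Rate and type I error mentioned above can all be computed once the Ground Truth is known. The Python sketch below shows one way to do so for a single simulated dataset. It is not the metaBenchDA API: the function name `evaluate_calls` and its arguments are hypothetical. Note also that the False Discovery Rate is an expectation over repeated datasets, so for one dataset the sketch reports the false discovery proportion.

```python
def evaluate_calls(called, truth, n_taxa):
    """Compare taxa flagged by a DA method against the Ground Truth.

    called: taxa the method reported as differentially abundant
    truth:  taxa that truly differ (known only in simulation)
    n_taxa: total number of taxa tested
    All names are hypothetical and used for illustration only.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)    # true positives
    fp = len(called - truth)    # false positives
    n_null = n_taxa - len(truth)  # taxa with no real difference

    nan = float("nan")
    return {
        # Share of truly DA taxa that were detected.
        "recall": tp / len(truth) if truth else nan,
        # Share of reported taxa that are truly DA.
        "precision": tp / len(called) if called else nan,
        # False discovery proportion; its average over datasets estimates the FDR.
        "fdp": fp / len(called) if called else 0.0,
        # Share of non-DA taxa wrongly reported (empirical type I error rate).
        "false_positive_rate": fp / n_null if n_null else nan,
    }


metrics = evaluate_calls(called={1, 2, 3, 9}, truth={1, 2, 3, 4, 5}, n_taxa=20)
print(metrics["recall"], metrics["precision"])  # 0.6 0.75
```

In this example the method recovers three of the five truly changed taxa and makes one false call, so recall is 0.6, precision is 0.75 and the false discovery proportion is 0.25.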


I hope this work can be a starting point to guide analysts in the choice of tools, but also a first step towards the development of a fair, robust and reproducible framework for the assessment of DA methods to continuously evaluate current and new approaches.

Marco Cappellato
University of Padova

Read the Original

This page is a summary of: Investigating differential abundance methods in microbiome data: A benchmark study, PLoS Computational Biology, September 2022, PLOS,
DOI: 10.1371/journal.pcbi.1010467.



