What is it about?

Phylogenetic profiling predicts which genes are involved in the same biological process by comparing how their protein families are retained or lost across different species. It has mostly been used for bacteria, but is now being adopted for eukaryotes such as plants and animals. However, it can be slow and difficult to use with larger genomes. We developed a new method called HogProf, which is faster and scales to thousands of species. We tested it on known gene interactions and found that it works better than other methods.
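To make the idea concrete, here is a minimal sketch in Python. It assumes each protein family is reduced to the set of species in which it is retained, and it compares two families with the Jaccard index. The family names, species labels and the `jaccard` helper are hypothetical and for illustration only; HogProf itself builds richer profiles than this toy example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared members divided by all distinct members."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical toy profiles: the species in which each protein family is retained.
profiles = {
    "family_A": {"sp1", "sp2", "sp3", "sp5"},
    "family_B": {"sp1", "sp2", "sp3", "sp6"},
    "family_C": {"sp4", "sp7"},
}

# Families retained in mostly the same species get a high score.
sim_ab = jaccard(profiles["family_A"], profiles["family_B"])
# Families with no species in common get a score of zero.
sim_ac = jaccard(profiles["family_A"], profiles["family_C"])
```

Families A and B share three of five species in total and so score higher than A and C, which share none. Comparing every pair of families this way becomes expensive as the number of families and species grows, which is the scaling problem described above.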

Why is it important?

Scaling up the number of genomes used in phylogenetic profiling is important because it allows better predictions of which genes interact and are involved in the same biological process. With more genomes, we can see which protein families tend to be lost or retained across a wider range of species, giving us a more complete picture of how different organisms are related and how their genes function together. This is especially critical now that the amount of sequencing data is increasing exponentially, providing more and more genetic information to work with.


This work was a challenge to piece together since I hadn't worked with probabilistic data structures before. MinHash and LSH forest approaches are tricky to get your head around, but once you know what they can do, they become a great technique. Since then I've used them for a few other projects. I would recommend them to anyone dealing with big data where individual entries can be represented as sets. It was also interesting to dive into the study of evolutionary dynamics at a network level rather than focusing on one individual protein family, as was the case with my previous work.
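For readers curious about what MinHash does with entries represented as sets, the following is a small sketch using only the Python standard library. It is not the code used in the paper: the function names, the choice of salted BLAKE2b hashes and the toy species sets are all assumptions made for illustration. The idea is that the fraction of matching positions between two signatures approximates the Jaccard similarity of the underlying sets.

```python
import hashlib


def minhash_signature(items, num_perm=256):
    """For each of num_perm salted hash functions, keep the smallest hash seen.

    Assumes `items` is a non-empty collection of strings.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        )
        signature.append(smallest)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical sets of species labels; their true Jaccard similarity is 60 / 100 = 0.6.
set_a = {f"sp{i}" for i in range(0, 80)}
set_b = {f"sp{i}" for i in range(20, 100)}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)
estimate = estimate_jaccard(sig_a, sig_b)
```

Each signature has a fixed length regardless of how large the original set is, so comparisons stay cheap as more species are added. An LSH forest, not shown here, indexes such signatures so that similar entries can be retrieved without comparing every pair.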

David Moi
University of Lausanne

Read the Original

This page is a summary of: Scalable phylogenetic profiling using MinHash uncovers likely eukaryotic sexual reproduction genes, PLoS Computational Biology, July 2020, PLOS, DOI: 10.1371/journal.pcbi.1007553.


