What is it about?

Cryptococcus neoformans is a fungus that can cause serious infections in people with weakened immune systems. It is responsible for many deaths every year, especially in sub-Saharan Africa. However, little is known about where this fungus lives in nature. To learn more, scientists used a Natural Language Processing model (a type of AI) to analyze research articles that mention C. neoformans. They discovered that this fungus is often found in soil associated with decomposing wood. This new approach to analyzing research data can help scientists find connections between disease-causing agents and their environment.

Why is it important?

Natural Language Processing (NLP) is important because it enables computers to analyze and extract insights from vast amounts of written or spoken language. This is essential for many applications in fields such as business, healthcare, education, and science. With NLP, computers can better understand the specific characteristics and behaviors of a pathogen in its natural environment, which is crucial for developing more targeted and effective interventions, such as vaccines or treatments. Furthermore, NLP is increasingly used to develop conversational agents, chatbots, virtual assistants, and other tools that can interact with people in natural language, providing personalized support, advice, or entertainment. In summary, NLP is a powerful tool for unlocking the value of human language, advancing scientific research, and bridging the gap between people and machines.


This work was the culmination of at least five years of effort. One of the co-authors and I came up with the idea of searching through the NCBI SRA database, specifically for metabarcode datasets. What you need to know is that metabarcode datasets are DNA sequencing datasets that target a particular gene, which is then searched against an existing database. For a lot of people there are really two types of metabarcode datasets: microbiome datasets, which use one or a handful of genes, and environmental DNA (eDNA) datasets, which look for a particular gene of larger organisms (like fungi). So we were sitting out on the back patio of one of the research buildings and we basically said: "Hey, there are all these SRA datasets full of metabarcode data, why don't we use them to look for a fungus of interest?" So I set up a search to do so and, lo and behold, we didn't find the fungus we were interested in. But one of the fungi that I used to test whether the search would work was C. neoformans, which honestly I knew was studied but really knew next to nothing about.
Later, I was working with the library and was introduced to NLP, which the Digital Scholarship group was working on, and I made the connection: the dataset of samples we had searched through could be used in conjunction with NLP to maybe produce some results. So we went about setting up an analysis.
What I didn't know was that there would be a bunch of manual curation work in getting the journal articles. I basically Tom-Sawyer'd a bunch of people into curating the journal articles for me. Lots of pizza was given out. See, we had only a tenuous connection between a metabarcode dataset and its attached journal article, and we needed the journal articles for the NLP part of the analysis. So for each dataset we had to either find the direct link to a journal article, use an identification number, or just use the authors and context to find the associated articles. It was arduous...
But once we had that, we needed a way to confirm that what we were looking at was actually valid. I didn't really have an idea for that until one of my undergraduate researchers took an AI/ML course and came up with the idea of using a random forest to validate our results. All of this was a team effort, with lots of players coming together to create a unique analysis and article that really pushed the capabilities of what NLP could do, and it produced some really interesting results that warrant further study.

David Molik
United States Department of Agriculture - Agricultural Research Service

Read the Original

This page is a summary of: Combining natural language processing and metabarcoding to reveal pathogen-environment associations, PLoS Neglected Tropical Diseases, April 2021, PLOS, DOI: 10.1371/journal.pntd.0008755.