What is it about?
It is often said that ‘knowledge is power’. To this end, Evotec and many other pharma companies are constructing ‘knowledge graphs’ that contain millions of data points linking biological entities to one another, for example whether a drug is used to treat a disease or whether a small molecule binds to a particular protein. One way to use this information prospectively is through the prediction of novel links. For example, we can potentially repurpose well-known drugs by linking them to diseases that have no current treatments.
But how is data in the drug discovery domain collected? Our paper investigates the utility of a computational technique called “Natural Language Processing” (NLP) for automatically extracting facts from millions of sentences in the scientific literature, with the goal of including this data in knowledge graphs. We compared results from an NLP pipeline to a ‘ground truth’ dataset called ‘Nexus’, Evotec’s proprietary database containing over 2.7 million associations between small molecules and proteins. We looked at four categories of relationship between a small molecule and a protein (“agonists”, “antagonists”, “inhibitors”, “binders”) and at how often the NLP pipeline got these relationships correct.
Interestingly, we found that Nexus contained around 17 times more information than the NLP-derived dataset. This perhaps reflects the huge human effort that has gone into curating small molecule–protein interactions over the years (see databases such as ChEMBL, DrugBank, PubChem and others that Nexus incorporates). It appears that human curation is still able to extract information that this NLP pipeline cannot, either because the task is difficult for machines or because paywalls prevent machine reading of certain articles.
In conclusion, there are many biomedical domains, such as protein interactions in biological pathways or the way certain genes act as risk factors for diseases, where NLP could really make a difference, especially where structured data sources are not keeping up with the scientific literature. However, for small molecule–protein relationships (and associated quantitative datapoints), Evotec’s Nexus database is more comprehensive than data derived from the NLP pipelines we evaluated.
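To make the comparison described above more concrete, here is a minimal sketch of how NLP-extracted facts could be checked against a curated reference set, one relationship category at a time. It is an illustration only, not the method used in the paper: the triples, the identifiers (`mol_A`, `prot_1`, and so on) and the function `agreement_by_relationship` are all hypothetical, and no real Nexus or NLP data is shown.

```python
from collections import defaultdict

# Hypothetical toy data; not taken from Nexus or from the paper.
# Each fact is a (small_molecule, relationship, protein) triple.
reference = {
    ("mol_A", "inhibitor", "prot_1"),
    ("mol_B", "agonist", "prot_2"),
    ("mol_C", "antagonist", "prot_3"),
    ("mol_D", "binder", "prot_4"),
}
nlp_extracted = {
    ("mol_A", "inhibitor", "prot_1"),   # same pair, same relationship
    ("mol_B", "antagonist", "prot_2"),  # same pair, different relationship
    ("mol_E", "binder", "prot_5"),      # pair absent from the reference
}


def agreement_by_relationship(extracted, reference):
    """Count, per relationship type, how extracted facts compare to the reference."""
    # Index the reference by (molecule, protein) pair for quick lookup.
    ref_lookup = defaultdict(set)
    for molecule, relation, protein in reference:
        ref_lookup[(molecule, protein)].add(relation)

    counts = defaultdict(
        lambda: {"agree": 0, "disagree": 0, "not_in_reference": 0}
    )
    for molecule, relation, protein in extracted:
        known = ref_lookup.get((molecule, protein))
        if known is None:
            counts[relation]["not_in_reference"] += 1
        elif relation in known:
            counts[relation]["agree"] += 1
        else:
            counts[relation]["disagree"] += 1
    return {relation: dict(c) for relation, c in counts.items()}


summary = agreement_by_relationship(nlp_extracted, reference)
# Relative size of the two datasets (toy numbers only).
size_ratio = len(reference) / len(nlp_extracted)
```

Keeping "disagree" separate from "not_in_reference" distinguishes a wrong relationship label from a pair the reference simply does not cover, which are different kinds of mismatch.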
Read the Original
This page is a summary of: A large-scale evaluation of NLP-derived chemical-gene/protein relationships from the scientific literature: Implications for knowledge graph construction, PLOS One, September 2023, PLOS, DOI: 10.1371/journal.pone.0291142.