What is it about?

We manually curated the chemical structure data in a publicly available physicochemical property dataset. Building on that experience, we developed an automated curation procedure in KNIME and applied it to multiple other datasets. We then built QSAR prediction models and examined how data curation influenced their statistical performance.

Why is it important?

Data quality matters. For QSAR prediction models, this paper shows how data curation influences the statistical performance of the resulting models, and why the upfront investment in checking and validating the data is worthwhile. The work addressed only the chemical structures, not the property values themselves, and even this made a measurable difference to model performance.

Perspectives

I have worked on data quality issues for years, and this example clearly demonstrates their impact on QSAR models. The resulting models are available online at https://comptox.epa.gov, together with all of the relevant statistics for global and local domain of applicability as well as nearest neighbors. The QSAR Model Reporting Format (QMRF) reports detail the development of the models, and all training and test data are also available. This, I believe, is a major contribution to Open Science in our domain.

Dr Antony John Williams
United States Environmental Protection Agency

Read the Original

This page is a summary of: An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling, SAR and QSAR in Environmental Research, November 2016, Taylor & Francis,
DOI: 10.1080/1062936x.2016.1253611.