An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling

K. Mansouri; C. M. Grulke; A. M. Richard; R. S. Judson; A. J. Williams

doi:10.1080/1062936x.2016.1253611

What is it about?

We manually curated the structure-based data in a publicly available physicochemical property dataset. Building on that experience, we developed an automated procedure in KNIME to process multiple other datasets. We then built QSAR prediction models and examined how data curation influenced their statistical performance.

Why is it important?

Data quality is important. For the development of QSAR prediction models, this paper shows how data curation influences the statistical performance of the resulting models and why checking and validating the data is worth the upfront investment. This work focused only on the chemical structures, NOT the actual property values, and even this made a measurable difference to the performance of the models.

Perspectives

I have been working on data quality issues for years, and this example clearly demonstrates their impact on QSAR models. The resulting models are available online at https://comptox.epa.gov and are presented with all of the relevant statistics for global and local domain of applicability, as well as nearest neighbors. The QSAR Model Reporting Format (QMRF) reports detail the development of the models, and ALL training and test data are also available. This, I believe, is a major contribution to Open Science in our domain.
Dr Antony John Williams
United States Environmental Protection Agency

This page is a summary of: An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling, SAR and QSAR in Environmental Research, November 2016, Taylor & Francis,
DOI: 10.1080/1062936x.2016.1253611.

Does automated curation and data standardization contribute to improved QSAR Models?

Resources

The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists

An examination of data quality on QSAR Modeling in regards to the environmental sciences

PHYSPROP Curated training and test sets etc.
