An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling

K. Mansouri, C. M. Grulke, A. M. Richard, R. S. Judson, A. J. Williams
  • SAR and QSAR in Environmental Research, November 2016, Taylor & Francis
  • DOI: 10.1080/1062936x.2016.1253611

Do automated curation and data standardization contribute to improved QSAR models?

What is it about?

We manually curated the structure-based data in a publicly available physicochemical property dataset. Building on that experience, we developed an automated procedure in KNIME to process multiple other datasets. We then built QSAR prediction models and examined how data curation influenced their statistical performance.
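
As a rough illustration only, the kind of structure-level check that such a procedure automates can be sketched in a few lines of Python. This is not the KNIME workflow from the paper: the `curate` function, the record fields (`name`, `smiles`, `value`) and the three checks are assumptions made for this example, and the records are toy entries with placeholder values, not data from the study. Exact string matching on SMILES stands in for the structure standardization a real cheminformatics toolkit would perform.

```python
from collections import Counter


def curate(records):
    """Apply simple structure checks; return (kept_records, report).

    Hypothetical sketch: only the structure field is inspected, and the
    property value is passed through untouched.
    """
    kept = []
    seen = set()
    report = Counter()
    for rec in records:
        smiles = (rec.get("smiles") or "").strip()
        if not smiles:
            # No structure to check or model.
            report["missing_structure"] += 1
            continue
        if "." in smiles:
            # In SMILES a dot separates disconnected components
            # (for example salts or mixtures).
            report["multi_component"] += 1
            continue
        if smiles in seen:
            # Assumption: identical strings mean identical structures.
            # A real workflow would compare standardized structures.
            report["duplicate"] += 1
            continue
        seen.add(smiles)
        kept.append({**rec, "smiles": smiles})
    return kept, dict(report)


# Toy records with placeholder values (not measured data).
records = [
    {"name": "ethanol", "smiles": "CCO", "value": 1.0},
    {"name": "ethyl alcohol", "smiles": "CCO", "value": 1.1},
    {"name": "sodium acetate", "smiles": "CC(=O)[O-].[Na+]", "value": 2.0},
    {"name": "unknown", "smiles": "", "value": 3.0},
    {"name": "benzene", "smiles": "c1ccccc1", "value": 4.0},
]

kept, report = curate(records)
print([r["name"] for r in kept])
print(report)
```

The report counts how many records each check removed, so the effect of curation on a dataset can be inspected before any model is trained.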

Why is it important?

Data quality matters in the development of QSAR prediction models. This paper shows how data curation influences the statistical performance of the resulting models, and why checking and validating the data is worth the upfront investment. This work focused only on the chemical structures, NOT the actual property values, and even this made a measurable difference to the performance of the models.


Dr Antony John Williams
United States Environmental Protection Agency

I have been working on data quality issues for years, and this particular example clearly demonstrates their impact on QSAR models. The resulting models are available online and are exposed with all of the relevant statistics for the global and local domain of applicability, as well as nearest neighbors. The QSAR Model Report Format reports detail the development of the models, and ALL training and test data are also available. This, I believe, is a major contribution to Open Science in our domain.


The following have contributed to this page: Dr Antony John Williams