The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS

Igor V. Tetko, Daniel M. Lowe, Antony J. Williams
  • Journal of Cheminformatics, January 2016, Springer Science + Business Media
  • DOI: 10.1186/s13321-016-0113-y

Extracting and Modeling a Large Melting Point Dataset (300k) from a Patent Collection

What is it about?

Text mining was used to automatically extract melting point data from published patents. Almost 300,000 data points were collected and used to develop models that predict melting and pyrolysis (decomposition) points. The models are available for everyone to use!
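To give a feel for the extraction step described above, here is a minimal sketch of how melting point values might be pulled out of free text with a regular expression. It is not the software used in the paper: the pattern, the function name `extract_melting_points`, and the sample sentences are all hypothetical, and a real patent corpus would need far more robust handling of units, ranges, and compound names.

```python
import re

# Hypothetical pattern (an assumption, not the authors' method): matches
# phrases such as "mp 120-122 °C" or "Melting point: 85.5 °C".
# The leading \b stops "mp" from matching inside words such as "temp".
MP_PATTERN = re.compile(
    r"\b(?:m\.?p\.?|melting point)\s*[:=]?\s*"
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|–|to)\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text):
    """Return a list of (low, high) melting ranges in °C found in text.

    A single value such as "mp 85 °C" is returned as (85.0, 85.0).
    Ranges whose upper bound is below the lower bound are discarded,
    as a very simple stand-in for automated curation.
    """
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else low
        if high >= low:
            results.append((low, high))
    return results


if __name__ == "__main__":
    sample = "The product was a white solid, mp 120-122 °C."
    print(extract_melting_points(sample))
```

Each extracted range would still need to be linked to the right compound and checked before it could be used for modeling; the sketch covers only the pattern-matching part.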

Why is it important?

This paper indicates that it is now possible to text-mine property data directly out of a large corpus and that, following automated curation and validation, the data can be used as the basis for building models. This work focused on melting point data but could be extended to other properties such as logP and NMR data.


Dr Antony John Williams
United States Environmental Protection Agency

The manual extraction of data from literature, or in this case patents, is very time-consuming. The possibility of using text mining for the extraction of data has been of interest to me personally for years, and this collaboration with Daniel Lowe from NextMove to apply their software for extraction, and with Igor Tetko to perform the modeling, proves the point, I believe. Melting point is only one property, but this approach could now be extended to other properties.


The following have contributed to this page: Dr Antony John Williams