Open Source Bayesian Models. 2. Mining a “Big Dataset” To Create and Validate Models with ChEMBL

  • Alex M. Clark, Sean Ekins
  • Journal of Chemical Information and Modeling, June 2015, American Chemical Society (ACS)
  • DOI: 10.1021/acs.jcim.5b00144

Creating open source Bayesian models with a big dataset

What is it about?

We use open source fingerprints and a Bayesian algorithm to build thousands of computational models from data in ChEMBL, a very large public dataset. We demonstrate cross-validation of these models, make them openly accessible, and show how they can be imported into a mobile app and used for predictions.
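To make the idea of a fingerprint-based Bayesian model concrete, here is a minimal sketch. It assumes a Laplacian-corrected naive Bayesian estimator, which is one common choice for this kind of model; the summary above says only "a Bayesian algorithm", so the exact variant is an assumption. Fingerprints are represented as plain sets of integers standing in for hashed substructure features, and the function names `train_laplacian_bayes` and `predict` are hypothetical.

```python
import math
from collections import Counter


def train_laplacian_bayes(fingerprints, actives):
    """Build a per-feature contribution table.

    fingerprints: list of sets of int feature hashes (assumed representation)
    actives: list of bool, True where the molecule is labelled active
    """
    total = len(fingerprints)
    base_rate = sum(actives) / total  # fraction of actives in the training set
    seen = Counter()         # number of molecules containing each feature
    seen_active = Counter()  # number of active molecules containing each feature
    for fp, is_active in zip(fingerprints, actives):
        for feature in fp:
            seen[feature] += 1
            if is_active:
                seen_active[feature] += 1
    # Laplacian-corrected log ratio: rarely seen features are pulled toward zero.
    return {
        f: math.log((seen_active[f] + 1.0) / (seen[f] * base_rate + 1.0))
        for f in seen
    }


def predict(model, fp):
    """Sum the contributions of known features; unseen features add nothing."""
    return sum(model.get(f, 0.0) for f in fp)


# Toy data: the integers are placeholders, not real fingerprint hashes.
train_fps = [{1, 2, 3}, {1, 2}, {3, 4}, {4, 5}]
train_labels = [True, True, False, False]
model = train_laplacian_bayes(train_fps, train_labels)
score_a = predict(model, {1, 2})  # features seen only in actives
score_b = predict(model, {4, 5})  # features seen only in inactives
```

The raw score is a relative ranking rather than a probability, which is one reason validation and calibration matter for models of this kind.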

Why is it important?

We are not aware of anyone else using ChEMBL in this way with open source technologies and making the resulting thousands of models accessible. In addition, we describe a novel algorithm for detecting active/inactive thresholds in continuous data. Finally, we assess the effect of folding on the fingerprints.
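Fingerprint folding can be illustrated with a short sketch. It assumes that a sparse fingerprint is a set of integer hashes and that folding keeps only the low-order bits of each hash, so the result fits into a fixed number of slots. Masking is one simple scheme among several and is an assumption here, as is the hypothetical function name `fold_fingerprint`.

```python
def fold_fingerprint(hashes, bits):
    """Fold sparse integer hashes into 2**bits slots by keeping the low-order bits."""
    mask = (1 << bits) - 1
    return {h & mask for h in hashes}


# Placeholder hashes; the first two differ only in their high-order bits.
sparse = {0x1234ABCD, 0x0000ABCD, -17, 42}
folded = fold_fingerprint(sparse, 10)  # at most 1024 distinct slots
```

Folding trades resolution for a bounded size: distinct features can collide in the same slot, as the first two hashes do here, which is why its effect on model quality is worth measuring.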


Alex Michael Clark
Molecular Materials Informatics

The paper follows up on the previous description of open source Bayesian models, adding more detail about validation and calibration techniques. It describes a method for partitioning the ChEMBL database of bioactivity data into more than 2000 datasets, and an algorithm for automatically detecting a threshold for classifying data as active or inactive, which Bayesian algorithms require. Each of the datasets was used for model building in order to evaluate the technique. The results are made available, along with a description of the method.
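The partitioning and binarisation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: the record layout (target, measurement type, compound, value), the grouping key, the minimum dataset size and the function names are all hypothetical, and the threshold is supplied by hand rather than detected automatically. Values are assumed to be on a scale where higher means more active.

```python
from collections import defaultdict


def partition_records(records, min_size=3):
    """Group activity records by (target, measurement type); drop small groups."""
    groups = defaultdict(list)
    for target, kind, compound, value in records:
        groups[(target, kind)].append((compound, value))
    return {key: rows for key, rows in groups.items() if len(rows) >= min_size}


def binarise(dataset, threshold):
    """Label a compound active when its value meets or exceeds the threshold."""
    return [(compound, value >= threshold) for compound, value in dataset]


# Made-up records for illustration only.
records = [
    ("target_A", "pIC50", "cpd1", 7.2),
    ("target_A", "pIC50", "cpd2", 5.1),
    ("target_A", "pIC50", "cpd3", 6.4),
    ("target_B", "pKi", "cpd1", 8.0),
]
datasets = partition_records(records, min_size=3)
labels = binarise(datasets[("target_A", "pIC50")], threshold=6.0)
```

Each binarised dataset could then be passed to a model builder; an automatic threshold detector would replace the hand-picked `threshold` argument.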

The following have contributed to this page: Dr Sean Ekins and Alex Michael Clark