Are Bigger Data Sets Better for Machine Learning? Fusing Single-Point and Dual-Event Dose Response Data for Mycobacterium tuberculosis

  • Sean Ekins, Joel S. Freundlich, Robert C. Reynolds
  • Journal of Chemical Information and Modeling, July 2014, American Chemical Society (ACS)
  • DOI: 10.1021/ci500264r

Bigger datasets for TB machine learning

What is it about?

After focusing on dose-response data for modeling, we have now added the much larger body of inactive single-point data. The biggest models now have over 300,000 molecules in the training set. We show that for TB there is little improvement from adding these data, and we speculate that the smaller models may be adequate.
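
The comparison described above can be illustrated with a minimal sketch. It trains the same classifier once on a small set and once on that set fused with many extra inactives, then scores both on a held-out external set. Everything here is an assumption for illustration: the fingerprints are synthetic random bits, the Bernoulli naive Bayes classifier and the helper names (make_compounds, train_bernoulli_nb, score, roc_auc, evaluate) are hypothetical stand-ins rather than the models or software used in the publication, and the numbers it prints say nothing about the published results.

```python
import math
import random

def make_compounds(n, active, rng, n_bits=32):
    """Synthetic binary 'fingerprints' (assumption): actives set the first 8 bits more often."""
    rows = []
    for _ in range(n):
        bits = [int(rng.random() < (0.6 if active and i < 8 else 0.3))
                for i in range(n_bits)]
        rows.append((bits, int(active)))
    return rows

def train_bernoulli_nb(data):
    """Fit a Bernoulli naive Bayes model with add-one smoothing."""
    n_bits = len(data[0][0])
    model = {}
    for label in (0, 1):
        rows = [bits for bits, y in data if y == label]
        probs = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                 for i in range(n_bits)]
        model[label] = (math.log(len(rows) / len(data)), probs)
    return model

def score(model, bits):
    """Log-odds that a compound is active under the fitted model."""
    total = {}
    for label, (log_prior, probs) in model.items():
        total[label] = log_prior + sum(
            math.log(p if b else 1 - p) for b, p in zip(bits, probs))
    return total[1] - total[0]

def roc_auc(scores, labels):
    """Probability that a random active outscores a random inactive (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate(train, external):
    """Train on one set and report ROC AUC on the external set."""
    model = train_bernoulli_nb(train)
    scores = [score(model, bits) for bits, _ in external]
    return roc_auc(scores, [y for _, y in external])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
small_set = make_compounds(100, True, rng) + make_compounds(300, False, rng)
extra_inactives = make_compounds(3000, False, rng)  # stands in for single-point inactives
fused_set = small_set + extra_inactives
external = make_compounds(50, True, rng) + make_compounds(150, False, rng)

auc_small = evaluate(small_set, external)
auc_fused = evaluate(fused_set, external)
print(f"small: n={len(small_set)}, external AUC={auc_small:.3f}")
print(f"fused: n={len(fused_set)}, external AUC={auc_fused:.3f}")
```

Comparing the two AUC values on the same external set is the point of the exercise: if the fused model does not score clearly higher, the extra inactives have added training cost without improving external prediction.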

Why is it important?

Models built on bigger data sets may not always be better at predicting external compounds. We evaluate this hypothesis with the TB data sets we have collected. These models are a powerful resource for virtual screening.


The following have contributed to this page: Dr Sean Ekins