What is it about?

Modern artificial intelligence systems are often trained on enormous datasets, on the assumption that more data always leads to better performance. However, collecting, storing, and processing large volumes of data is expensive and energy‑intensive. This research asks a fundamental question: how much data is actually sufficient to train an accurate image‑classification model?

The study investigates whether all training samples contribute equally to learning. Using a dimensionality‑reduction technique, the authors analyse where each image lies relative to the centre of its class in a reduced‑dimension space. Images that lie far from the class centre tend to contain more distinctive information, while those closer to the centre are often more repetitive.

By systematically excluding certain training samples before learning begins, the research shows that models can often achieve the same or even better accuracy with less data. Across multiple well‑known image datasets, including handwritten digits and real‑world object images, removing a portion of the less informative data does not significantly harm performance, and in some cases improves it. This work provides an intuitive, data‑driven way to identify which examples matter most during training, helping to reduce unnecessary computation without sacrificing accuracy.
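To make the idea of ranking images by their distance from the class centre more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the summary does not name the dimensionality‑reduction technique, so plain PCA (computed with NumPy's SVD) stands in for it, and the choice to drop the samples closest to the class centre is an assumed selection rule. The function names `pca_project` and `select_far_from_centre`, and the parameters `keep_fraction` and `n_components`, are hypothetical.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components.

    PCA is an assumed stand-in for the unnamed dimensionality-reduction
    technique used in the paper.
    """
    Xc = X - X.mean(axis=0)                      # centre the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # shape: (n_samples, n_components)


def select_far_from_centre(X, y, keep_fraction=0.8, n_components=2):
    """Return sorted indices of the samples to keep.

    For each class, the samples farthest from the class centre in the
    reduced space are kept and the closest ones are dropped. This rule
    is an assumption made for illustration.
    """
    Z = pca_project(X, n_components)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)         # positions of this class
        centre = Z[idx].mean(axis=0)             # class centre in reduced space
        dist = np.linalg.norm(Z[idx] - centre, axis=1)
        n_keep = max(1, int(round(keep_fraction * len(idx))))
        farthest = np.argsort(dist)[-n_keep:]    # largest distances come last
        keep.extend(idx[farthest])
    return np.sort(np.array(keep))


# Synthetic stand-in for flattened images: two classes, 50 samples each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 16)), rng.normal(3, 1, (50, 16))])
y = np.repeat([0, 1], 50)

kept = select_far_from_centre(X, y, keep_fraction=0.8)
print(len(kept))  # 80: 40 of the 50 samples in each class are kept
```

The returned indices can be used to subset the training set (`X[kept]`, `y[kept]`) before a classifier is trained, so the selection happens before learning begins, as described above. How large a fraction can be dropped without harming accuracy is the empirical question the paper studies; the value 0.8 here is arbitrary.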

Perspectives

Working on this paper reinforced the idea that progress in AI is not only about building larger models, but also about understanding data more deeply. I found it especially rewarding to see that careful analysis of data structure can lead to practical reductions in training cost without compromising performance. I hope this work encourages researchers and practitioners to rethink how they define “enough data,” and to consider efficiency and sustainability as first‑class goals in machine learning research.

Prof Tatiana Kalganova
Brunel University

Read the Original

This page is a summary of: Towards an Analytical Definition of Sufficient Data, SN Computer Science, January 2023, Springer Science + Business Media, DOI: 10.1007/s42979-022-01549-4.
