What is it about?
Modern buildings increasingly use artificial intelligence (AI) to detect whether spaces are occupied, allowing heating, ventilation, and air‑conditioning systems to run only when needed. This can significantly reduce energy use and carbon emissions. However, achieving high accuracy often relies on collecting and processing very large volumes of sensor data, much of which may not actually be necessary. This research investigates how to identify which parts of a time‑series dataset are truly useful for training AI models, and which data can safely be removed without reducing performance.
Using real‑world building occupancy datasets collected from environmental sensors such as temperature, humidity, and air quality, the study explores several data‑reduction strategies designed specifically for imbalanced time‑series data, where some conditions (such as “occupied”) occur far more often than others. The paper introduces the concept of class density as a practical way to determine whether a dataset is suitable for data reduction before any AI model is trained. By selectively removing redundant data, particularly from the dominant class, the authors show that it is often possible to remove over half of the training data while maintaining classification accuracy. This leads to substantially shorter training times and lower energy consumption.
The work also evaluates whether combining data from multiple buildings (dataset fusion) improves performance, finding that more data is not always better, especially when efficiency and sustainability are priorities.
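To make the idea of thinning the dominant class more concrete, here is a minimal Python sketch. It is an illustration only, not the authors' method: this summary does not give the paper's definition of class density, so the sketch uses the simple share of each class as a stand-in, and the function names (`class_shares`, `thin_dominant_class`) and the `keep_every` parameter are hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that belong to each class.

    Assumption: this is a simple stand-in for inspecting class imbalance,
    not the paper's class density measure.
    """
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def thin_dominant_class(samples, labels, keep_every=2):
    """Keep every `keep_every`-th sample of the most frequent class and
    all samples of every other class, preserving the original time order.

    Assumption: uniform decimation is a placeholder for whichever
    reduction strategy is actually chosen.
    """
    if not labels:
        return [], []
    dominant = Counter(labels).most_common(1)[0][0]
    kept_samples, kept_labels = [], []
    seen = 0  # number of dominant-class samples encountered so far
    for x, y in zip(samples, labels):
        if y == dominant:
            seen += 1
            if (seen - 1) % keep_every != 0:
                continue  # drop this dominant-class sample
        kept_samples.append(x)
        kept_labels.append(y)
    return kept_samples, kept_labels


# Toy example: 1 = occupied (dominant here), 0 = unoccupied.
labels = [1] * 8 + [0] * 2
samples = list(range(10))  # stand-in for sensor readings in time order

shares = class_shares(labels)
reduced_samples, reduced_labels = thin_dominant_class(samples, labels, keep_every=2)
```

In this toy case the dominant class is halved while the minority class is left untouched. In practice, one would compare model accuracy before and after reduction to confirm that the removed samples were in fact redundant.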
Why is it important?
This work is timely because the environmental cost of AI is becoming a major concern, particularly for applications deployed at scale or on low‑power devices. Rather than focusing only on developing more complex models, this study demonstrates the value of a data‑centric approach to AI design. The key contribution is showing that dataset properties—specifically class density—can predict when data reduction will be effective. This allows practitioners to make informed decisions before training begins, saving computation, energy, and time. The findings challenge the common assumption that larger datasets always lead to better models and provide practical guidance for developing green AI systems suitable for edge devices and smart‑building applications.
Perspectives
Working on this paper highlighted how much hidden inefficiency exists in many AI pipelines. Collecting more data is often treated as a default solution, yet this research shows that understanding the structure of a dataset can be just as important as choosing the right model. I hope this work encourages researchers and engineers to think more carefully about the quality and necessity of their data. Reducing unnecessary computation is not just a technical optimisation—it is a meaningful step towards more sustainable and responsible AI deployment.
Prof Tatiana Kalganova
Brunel University
Read the Original
This page is a summary of: Identifying Suitability for Data Reduction in Imbalanced Time-Series Datasets, AI, May 2025, MDPI AG, DOI: 10.3390/ai6050098.