What is it about?

In our work, we establish how data curation principles can be applied in practice to the curation of ML datasets. To do this, we translated data curation principles and concepts for the ML dataset development context and developed an evaluation framework (composed of a toolkit and a rubric) for dataset documentation in ML. Our paper presents initial findings and a sample set of evaluations of ML datasets.

Why is it important?

Datasets are fundamental to machine learning work. For example, ML bias is often traced to choices made about the datasets used to train models, and datasets are frequently reused for tasks they were not originally created for. Appropriate data use is also hindered by the hidden, tacit, and undervalued nature of data work: the many activities involved in collecting, selecting, and combining datasets, as well as documenting, evaluating, sharing, reusing, and repurposing them. To address these issues, the study of data practices has become prominent.

In focusing on data work, ML has an opportunity to learn from fields such as archives and libraries, which have done this work for a long time. Data curation is a mature field with origins in librarianship and archives whose scholarship and thinking on data issues go back many centuries. It has found new relevance as ML recognizes the importance of data curation to advancing both applications and the fundamental understanding of machine learning models.

ML researchers have argued for adopting principles from archival studies and digital curation in dataset development processes for machine learning research. In practice, this has been difficult because these concepts do not easily apply to ML. In this paper, we translate data curation knowledge into clear terms for ML to illustrate its importance to dataset development and evaluation, helping researchers more rigorously curate their own datasets and assess others’ datasets for use, reuse, and reproducibility.

Read the Original

This page is a summary of: Machine learning data practices through a data curation lens: An evaluation framework, June 2024, ACM (Association for Computing Machinery),
DOI: 10.1145/3630106.3658955.

Contributors

The following have contributed to this page