What is it about?

This paper extends the "Datasheets for Datasets" framework to speech data by emphasizing documentation of often-overlooked "linguistic subpopulations": accents, dialects, varieties, languages, and speech arising from speech disorders or pathologies. Documentation questions include, for example: Why does the dataset focus on a specific language or region? Is there consensus on how a given accent is defined? Are speech types self-reported? Empty datasheet templates (in .docx and .tex) and worked examples for common speech datasets (CORAAL, CommonVoice, LibriSpeech, VoxPopuli, WHAM) are available on GitHub: https://github.com/SonyResearch/project_ethics_augmented_datasheets_for_speech_datasets

Why is it important?

Our augmented datasheets provide a standardized way to check the linguistic makeup of speech data. Datasheets are useful to dataset creators because linguistic subpopulations should be considered before data collection begins. Datasheets are also useful to dataset users: when combining multiple speech datasets to build a model, users can check whether the combined data covers enough speech types to train a robust model.

Read the Original

This page is a summary of: Augmented Datasheets for Speech Datasets and Ethical Decision-Making, June 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3593013.3594049.
