What is it about?

We introduce MultiGEC, a dataset for Grammatical Error Correction (GEC) in 12 European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. The data consists mostly of learner essays written by second-language speakers of these languages, but also includes texts written by schoolchildren, heritage speakers and members of the general population. Every text comes with one or more corrected versions.

Why is it important?

MultiGEC was built for the MultiGEC-2025 shared task, part of a series of initiatives aimed at fostering interest in low-resource languages within the Natural Language Processing community. By making the dataset available to a broader public, we hope it will have a longer-lasting impact and aid the development of GEC systems ready for educational settings.

Perspectives

As a PhD student, this was my first time leading such a large coordinated effort: 12 languages, 17 subcorpora and 30 co-authors! If you're just getting started with NLP, I hope our dataset can help you learn as much about GEC as I have about collaboration. And if you want to contribute data for a new language, please get in touch!

Arianna Masciolini
Göteborgs Universitet

Read the Original

This page is a summary of: Towards better language representation in Natural Language Processing, International Journal of Learner Corpus Research, April 2025, John Benjamins, DOI: 10.1075/ijlcr.24033.mas.