What is it about?

In this paper, we present Roman-Urdu-Parl, the first large-scale, publicly available Roman-Urdu parallel corpus, with 6.37M sentence pairs. The corpus was collected from diverse sources, annotated through crowd-sourcing, and checked for quality. It contains 92.76M Roman-Urdu words and 92.85M Urdu words, with a Roman-Urdu vocabulary of 42.9K words and an Urdu vocabulary of 43.8K words. Roman-Urdu-Parl was built to capture not only the morphological and linguistic features of the language but also the heterogeneity and variation that arise from demographic conditions.
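
For scale, the reported figures work out to roughly 14.6 Roman-Urdu words per sentence pair (92.76M / 6.37M). As an illustration of what such statistics measure, here is a minimal sketch that computes them for a three-pair toy corpus. The function name `corpus_stats`, the toy sentences, and the whitespace tokenization with lowercasing are all assumptions made for this example; the summary does not say how the authors tokenized or counted.

```python
from collections import Counter


def corpus_stats(pairs):
    """Count sentence pairs, tokens, and vocabulary for a parallel corpus.

    `pairs` is a list of (roman_urdu, urdu) sentence tuples.
    Tokenization is plain whitespace splitting (an assumption).
    """
    roman_counts = Counter()
    urdu_counts = Counter()
    for roman, urdu in pairs:
        # Lowercase the Roman-Urdu side so "Ali" and "ali" count as one word.
        roman_counts.update(roman.lower().split())
        urdu_counts.update(urdu.split())
    return {
        "sentence_pairs": len(pairs),
        "roman_tokens": sum(roman_counts.values()),
        "urdu_tokens": sum(urdu_counts.values()),
        "roman_vocab": len(roman_counts),
        "urdu_vocab": len(urdu_counts),
    }


# Hypothetical toy data, not taken from Roman-Urdu-Parl.
toy_pairs = [
    ("aap kaise hain", "آپ کیسے ہیں"),
    ("mera naam Ali hai", "میرا نام علی ہے"),
    ("aap ka naam kya hai", "آپ کا نام کیا ہے"),
]

stats = corpus_stats(toy_pairs)
print(stats)
```

On a real corpus, the same counts would be taken over all sentence pairs, and the vocabulary sizes would depend heavily on how spelling variants of the same Roman-Urdu word are treated.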

Why is it important?

The availability of large-scale datasets and corpora has played a huge role in advancing natural language understanding and processing by computers. These resources have catalyzed recent progress in these fields, but for languages that lack large-scale corpora, hardly any significant progress has been made. The same holds for machine transliteration, one of our target problems. Machine transliteration is the process of expressing a word's pronunciation in the source language using the alphabet of the target language. Addressing it is a major step toward bringing the world together. Although a lot of work has been done for resource-rich languages like French, German, or English, very little attention has been given to Urdu, a low-resource language. One of the major reasons for this lack of progress and interest is the absence of any publicly available parallel corpus. We address this gap for Roman-Urdu and Urdu by building a new large-scale parallel corpus, "Roman-Urdu-Parl", intended for problems such as learning word representations, machine transliteration, and conversational agent modeling.

Read the Original

This page is a summary of: Roman-Urdu-Parl: Roman-Urdu and Urdu Parallel Corpus for Urdu Language Understanding, ACM Transactions on Asian and Low-Resource Language Information Processing, January 2022, ACM (Association for Computing Machinery),
DOI: 10.1145/3464424.
