What is it about?

This article reports the development of a new language resource for Urdu, a resource-poor language. The resource is a set of word embeddings trained on a larger collection of Urdu text than other Urdu language researchers have used. The embeddings cover an Urdu vocabulary almost double the size of the state of the art for the language.

Why is it important?

Urdu is a resource-poor language. Word embeddings are a resource needed by researchers who use machine learning, deep learning, or neural network models for tasks in Urdu natural language processing (NLP). The embeddings can be used for NLP tasks such as parsing, machine translation, and sentiment analysis, as well as for developing large language models (LLMs) and generative AI for Urdu.
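As a rough illustration of how word embeddings are typically consumed by downstream tasks, the sketch below compares word vectors with cosine similarity. The three Urdu words and their 3-dimensional vectors are invented for this example and are not taken from the reported resource; the function names are likewise hypothetical. Real embeddings usually have hundreds of dimensions.

```python
import math

# Toy vectors, invented for illustration only (NOT from the reported embeddings).
toy_embeddings = {
    "کتاب": [0.9, 0.1, 0.0],   # "book"
    "رسالہ": [0.8, 0.2, 0.1],  # "magazine"
    "دریا": [0.0, 0.2, 0.9],   # "river"
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings):
    """Return the other vocabulary word whose vector is closest to `word`."""
    query = embeddings[word]
    candidates = [w for w in embeddings if w != word]
    return max(candidates, key=lambda w: cosine_similarity(query, embeddings[w]))

if __name__ == "__main__":
    # With these toy vectors, "book" lies closer to "magazine" than to "river".
    print(most_similar("کتاب", toy_embeddings))
```

Words that appear in similar contexts end up with nearby vectors, which is why a larger vocabulary of embeddings gives downstream models more words they can reason about.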

Perspectives

This research reports an important resource for natural language processing of Urdu with state-of-the-art machine learning and neural network models. The resource covers a vocabulary almost double the combined vocabulary size of the word embeddings developed by other researchers of the language.

Fatima Tuz Zuhra
Quaid-i-Azam University

Read the Original

This page is a summary of: Towards Development of New Language Resource for Urdu: The Large Vocabulary Word Embeddings, ACM Transactions on Asian and Low-Resource Language Information Processing, August 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3748308.