Annotated Universal Dependencies Dataset for Literary and Educational Uzbek Texts

Sanatbek Matlatipov; Mersaid Aripov; Makhmud Bobokandov; Gayrat Matlatipov

doi:10.1016/j.dib.2026.112857

What is it about?

This publication introduces a collection of Uzbek sentences that linguistic experts have manually analysed and labelled. The dataset contains 681 sentences taken from Uzbek literary books and educational fairy tales. Each word is tagged with its grammatical role, its root form (lemma), and its syntactic relationship to the other words in the sentence. In effect, the dataset is a detailed grammatical map: it provides the foundational examples needed to teach computers to read, understand, and process modern Uzbek text accurately.
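
To make the idea of a "grammatical map" concrete, the following is a minimal sketch of how such annotations are commonly read, assuming the dataset is distributed in the standard ten-column CoNLL-U format used by Universal Dependencies treebanks. The sample sentence and its labels are invented for illustration and are not taken from the dataset; the names `Token`, `parse_conllu`, and `describe` are hypothetical helpers, not part of any published tooling.

```python
from dataclasses import dataclass

# Hypothetical sentence invented for illustration ("I read a book.").
# It is NOT taken from the dataset; the labels are a plausible guess.
SAMPLE = """\
# text = Men kitob o'qidim.
1\tMen\tmen\tPRON\t_\t_\t3\tnsubj\t_\t_
2\tkitob\tkitob\tNOUN\t_\t_\t3\tobj\t_\t_
3\to'qidim\to'qi\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


@dataclass
class Token:
    idx: int      # position of the word in the sentence (1-based)
    form: str     # the word as written
    lemma: str    # root form
    upos: str     # universal part-of-speech tag (grammatical role)
    head: int     # index of the governing word, 0 for the sentence root
    deprel: str   # syntactic relation to the governing word


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of Token objects."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        current.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], int(cols[6]), cols[7])
        )
    if current:
        sentences.append(current)
    return sentences


def describe(sentence):
    """Return one readable line per token, naming the word it depends on."""
    forms = {tok.idx: tok.form for tok in sentence}
    lines = []
    for tok in sentence:
        governor = "ROOT" if tok.head == 0 else forms[tok.head]
        lines.append(f"{tok.form} [{tok.upos}, lemma={tok.lemma}] --{tok.deprel}--> {governor}")
    return lines


if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        print("\n".join(describe(sentence)))
```

Each printed line pairs a word with its part of speech, its lemma, and the word it depends on, which are the three layers of annotation described above.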


Why is it important?

Uzbek has historically lacked the digital resources that modern natural language processing requires. This work provides a fully manually annotated gold-standard dataset, filling a gap for an underrepresented language. Because it conforms to the Universal Dependencies v2 guidelines, researchers and developers worldwide can use it to build, train, and test tools such as morphological analysers, parsers, and sentiment analysis models for Uzbek. Its standardised structure also enables direct cross-lingual comparison with other Turkic languages, which supports the development of multilingual AI systems for low-resource languages.

Perspectives

From my perspective, creating the UzUDT dataset was an essential step towards giving the Uzbek language a rigorous, structured foundation in the global computational landscape. Watching our research team meticulously map the syntax of our native literature and educational texts to universal standards in the INCEpTION platform was highly rewarding. I believe this open repository will serve as a stepping stone both for training neural NLP pipelines and for applied pedagogical research, allowing students and developers to work algorithmically with the complex morphology and sentiment structures of Uzbek. This project should ultimately help ensure that future AI tools respect and reflect our language's rich structural heritage.
Mr. Makhmud Bobokandov
National University of Uzbekistan

This page is a summary of: Annotated Universal Dependencies Dataset for Literary and Educational Uzbek Texts, Data in Brief, May 2026, Elsevier,
DOI: 10.1016/j.dib.2026.112857.

Contributors

The following have contributed to this page

Mr. Makhmud Bobokandov
National University of Uzbekistan
