Fine-Grained Arabic Dialect Identification: Investigating Various Approaches Across Multiple Datasets

Ferihane Kboubi; Anja Habacha Chaibi

doi:10.1145/3758093

What is it about?

Understanding Arabic Dialect Identification Through Comparative Analysis and Model Interpretability

In this study, we focus on the fine-grained Arabic Dialect Identification (ADI) task. We investigate the effectiveness of various feature extraction techniques (BoW and word embeddings such as AraVec and FastText) and classification algorithms through a comparative study across three datasets: MADAR, NADI and NADAR. We considered three pools of classifiers: machine learning (MNB, SVM, LR, Stacking), deep learning (BiLSTM, CNN, CNN-BiLSTM) and pre-trained language models (AraBERTv1, MultiDialBert, QARIB, MARBERT, CAMeLBERT, AraBERTv0.2). We analyzed how classifier performance varies with the training set size, the dialect considered, and the dataset type (whether it is parallel, balanced or not). We employed the Explainable AI (XAI) technique LIME to investigate the interpretability of the models.
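To make the simplest of these combinations concrete, here is a minimal sketch of a bag-of-words (BoW) representation feeding a multinomial naive Bayes (MNB) classifier, written in plain Python. It is not the implementation used in the study: the whitespace tokenizer, the Laplace smoothing value `alpha=1.0`, the class name `BowNaiveBayes`, and the six toy sentences with city labels are all assumptions made for illustration, and none of the sentences are taken from MADAR, NADI or NADAR.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain whitespace tokenization, no normalization.
    return text.split()


class BowNaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_words = {
            label: sum(counts.values())
            for label, counts in self.word_counts.items()
        }
        return self

    def log_scores(self, text):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / n_docs)  # log prior
            denom = self.total_words[label] + self.alpha * vocab_size
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens unseen in training are skipped
                    count = self.word_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy training data: (sentence, city-level dialect label).
TRAIN = [
    ("ازيك عامل ايه", "Cairo"),
    ("انا عايز اروح دلوقتي", "Cairo"),
    ("كيفك شو الاخبار", "Beirut"),
    ("بدي روح هلق", "Beirut"),
    ("شلونك شخبارك", "Doha"),
    ("ابي اروح الحين", "Doha"),
]

texts, labels = zip(*TRAIN)
model = BowNaiveBayes().fit(texts, labels)
print(model.predict("عايز ايه دلوقتي"))
```

A real experiment would replace the toy data with a dataset split and would typically rely on an established library rather than a hand-written classifier. The sketch only shows how word counts per dialect turn into a class decision.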

Why is it important?

Arabic Dialect Identification (ADI) is a crucial task that can be integrated into a wide range of processes that handle dialectal Arabic text, such as sentiment analysis, hate speech detection, and machine translation. Given that most Arabic content on social platforms is written in dialects, ADI can significantly enhance the analysis of social media and user-generated content. It supports various applications, including dialect-aware machine translation, dialect mapping for understanding regional language use, and the development of chatbots or virtual assistants capable of understanding and responding in the user's dialect.

Perspectives

As future work, we plan to explore hierarchical classification techniques and to model the problem as a multilabel classification task, since certain expressions or sentences can belong to more than one dialect. Additionally, it is important to acknowledge that MADAR’s parallel translations may not fully reflect organic dialectal variations. To better represent real-world usage, future research could incorporate naturally occurring parallel corpora, such as subtitled media, which better capture the spontaneous and diverse nature of dialectal Arabic.
Ferihane Kboubi

This page is a summary of: Fine-Grained Arabic Dialect Identification: Investigating Various Approaches Across Multiple Datasets, ACM Transactions on Asian and Low-Resource Language Information Processing, September 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3758093.

Contributors

The following have contributed to this page

Ferihane Kboubi
