What is it about?
Understanding Arabic Dialect Identification Through Comparative Analysis and Model Interpretability
In this study, we focus on the fine-grained Arabic Dialect Identification (ADI) task, investigating the effectiveness of various feature extraction techniques (BoW and word embeddings such as AraVec and FastText) and classification algorithms through a comparative study across three datasets: MADAR, NADI and NADAR. We considered three pools of classifiers: machine learning (MNB, SVM, LR, Stacking), deep learning (BiLSTM, CNN, CNN-BiLSTM) and pre-trained language models (AraBERTv1, MultiDialBert, QARIB, MARBERT, CAMeLBERT, AraBERTv0.2). We analyzed how classifier performance varies with the training set size, the dialect considered, and the dataset type (whether or not it is parallel or balanced). We employed the Explainable AI (XAI) technique LIME to investigate the interpretability of the models.
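To make the simplest of these configurations concrete, the following is a minimal sketch of a bag-of-words (BoW) representation fed to a multinomial naive Bayes (MNB) classifier, written in plain Python. It is not the study's implementation: the class name `BowNaiveBayes`, the whitespace tokenizer, the smoothing value and the six transliterated training sentences are all assumptions made for illustration, and the toy sentences are hand-written rather than drawn from MADAR, NADI or NADAR.

```python
import math
from collections import Counter, defaultdict

# Toy, hand-written transliterated sentences with city-level labels.
# Illustrative only: NOT taken from MADAR, NADI or NADAR.
TRAIN = [
    ("ezzayak 3amel eh", "Cairo"),
    ("el akl helw awi", "Cairo"),
    ("kifak shu akhbarak", "Beirut"),
    ("el akl ktir tayyeb", "Beirut"),
    ("chnowa ahwalek", "Tunis"),
    ("el mekla bnina barcha", "Tunis"),
]


def tokenize(text):
    """Whitespace tokenizer (an assumption; real pipelines normalize Arabic script)."""
    return text.lower().split()


class BowNaiveBayes:
    """Hypothetical minimal BoW + multinomial naive Bayes classifier."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace (add-alpha) smoothing constant

    def fit(self, examples):
        self.class_counts = Counter()            # sentences per dialect
        self.word_counts = defaultdict(Counter)  # token counts per dialect
        self.vocab = set()
        for text, label in examples:
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        self.total = sum(self.class_counts.values())
        return self

    def log_scores(self, text):
        """Unnormalized log posterior of each dialect for the given sentence."""
        scores = {}
        vocab_size = len(self.vocab)
        for label, n in self.class_counts.items():
            score = math.log(n / self.total)  # log prior
            denom = sum(self.word_counts[label].values()) + self.alpha * vocab_size
            for tok in tokenize(text):
                if tok in self.vocab:  # out-of-vocabulary tokens are ignored
                    count = self.word_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


model = BowNaiveBayes().fit(TRAIN)
print(model.predict("kifak ya sahbi"))
```

In this toy setting the prediction is driven by a single dialect-marking token, since the other words are out of vocabulary. Word-level evidence of this kind is what an explanation technique such as LIME is meant to surface for the more complex models.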
Why is it important?
Arabic Dialect Identification (ADI) is a crucial task that can be integrated into a wide range of processes that handle dialectal Arabic text, such as sentiment analysis, hate speech detection, and machine translation. Given that most Arabic content on social platforms is written in dialects, ADI can significantly enhance the analysis of social media and user-generated content. It supports various applications, including dialect-aware machine translation, dialect mapping for understanding regional language use, and the development of chatbots or virtual assistants capable of understanding and responding in the user's dialect.
Perspectives
As future work, we plan to explore hierarchical classification techniques and to model the problem as a multilabel classification task, since certain expressions or sentences can belong to more than one dialect. Additionally, it is important to acknowledge that MADAR’s parallel translations may not fully reflect organic dialectal variations. To better represent real-world usage, future research could incorporate naturally occurring parallel corpora, such as subtitled media, which better capture the spontaneous and diverse nature of dialectal Arabic.
Ferihane Kboubi
Read the Original
This page is a summary of: Fine-Grained Arabic Dialect Identification: Investigating Various Approaches Across Multiple Datasets, ACM Transactions on Asian and Low-Resource Language Information Processing, September 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3758093.