Abusive and Hate speech Classification in Arabic Text Using Pre-trained Language Models and Data Augmentation

Nabil Badri; Ferihane Kboubi; Anja Habacha Chaibi

doi:10.1145/3679049

What is it about?

Hateful content on social media is a global issue that affects individuals and communities. Most research on automatic detection of such content focuses on English because resources are readily available, leaving low-resource languages like Arabic underexplored. This study addresses hate speech detection in mono-dialect and multi-dialect Arabic texts, emphasizing the unique challenges of Arabic dialects and the need for specialized models.


Why is it important?

The study hypothesizes that pre-trained language models (PLMs) designed for Arabic, combined with data augmentation, can significantly improve detection performance. Key research questions include: Does text augmentation improve results compared to unaugmented datasets? Do Arabic-specific PLMs outperform models using fastText and AraVec embeddings? Does training on multilingual datasets yield better results than training on monolingual datasets? The methodology compared Arabic PLMs (DziriBERT, AraBERT v2, Bert-base-arabic) using transfer learning and evaluated the impact of text augmentation. Results showed that training on augmented datasets improved performance metrics (accuracy, precision, recall, and F1-score) by 15–21%, highlighting the effectiveness of data augmentation in enhancing model generalization across Arabic dialects.
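For readers unfamiliar with the four metrics named above, the following is a minimal sketch of how they are commonly computed for a binary classifier. It is not the authors' evaluation code: the function name `binary_metrics`, the label convention (1 = hateful, 0 = not hateful), and the toy labels are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for the positive class (label 1).

    Hypothetical helper for illustration; label 1 is assumed to mean "hateful".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the texts flagged as hateful, how many really were.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly hateful texts, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for this example only (not data from the study).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(gold, pred)
print(scores)
```

Comparing such scores for a model trained on the original data and one trained on the augmented data is one way the reported improvement could be measured; the summary does not state whether the 15–21% figure is relative or in percentage points.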

Perspectives

Future research could focus on extending hate speech detection to other low-resource languages, developing language-specific pre-trained models, and refining data augmentation techniques tailored to linguistic nuances like those in Arabic. Enhancing Arabic PLMs by fine-tuning them with diverse datasets and exploring multilingual and cross-language transfer learning could improve model performance and generalizability. Real-world applications, such as social media content moderation, would benefit from collaboration with policymakers to ensure practical integration. Ethical considerations, including bias and fairness, should guide the development of these models, which could also be expanded to detect other harmful content types like misinformation or harassment.
Ferihane Kboubi

This page is a summary of: Abusive and Hate speech Classification in Arabic Text Using Pre-trained Language Models and Data Augmentation, ACM Transactions on Asian and Low-Resource Language Information Processing, August 2024, ACM (Association for Computing Machinery), DOI: 10.1145/3679049.

