What is it about?
The research investigates the linguistic variations of Portuguese on social media across three continents: America (Brazil), Europe (Portugal), and Africa (Mozambique). The objective is to understand whether pre-trained language models can identify the origin of texts and adapt to regional differences, including less-represented ones such as the Portuguese spoken in Mozambique. To this end, twelve BERT model variants, pre-trained on pt-BR, pt-PT, and multilingual corpora, are applied.
The models are evaluated in three complementary phases. In the first phase, a quantitative analysis uses metrics such as F1-score and accuracy to measure the models' performance on the task of origin classification. Next, a surprisal analysis evaluates how the models react to region-specific terms. Surprisal measures how unexpected a token is for the model: the lower the probability of a term occurring in a given variant, the higher the surprisal value assigned. This step highlights the models' sensitivity to regional linguistic traits and reveals potential learning gaps. Finally, an interpretative analysis explores the weights the models assign to tokens during classification, highlighting the impact of each term on the model's final decision. This phase enables the identification of regional patterns, such as slang, cultural expressions, and typical orthographic variations of each country, which directly influence the neural network's behavior.
The results show that the BERTweet.BR model, trained on Brazilian social media texts, performed best in most evaluations. However, the models had difficulty with tokens specific to Mozambique, which points to the lack of African corpora in their pre-training. The research concludes that the underrepresentation of African variants reduces the models' capacity to generalize to Mozambican Portuguese, and it emphasizes the need to expand corpora so that they better cover these linguistic variations.
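The surprisal measure described above can be illustrated with a minimal sketch. It assumes the common information-theoretic definition, the negative logarithm of the probability a model assigns to a token, expressed here in bits (base 2). The summary does not state which logarithm base the study uses, and the function name and probabilities below are hypothetical values chosen only for illustration, not figures from the study.

```python
import math


def surprisal(probability: float) -> float:
    """Return the surprisal, in bits, of a token with the given probability.

    Assumes the standard definition -log2(p); the base is an assumption.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities a model might assign to two tokens.
expected_token = surprisal(0.5)      # a token the model finds likely
unexpected_token = surprisal(0.001)  # a token the model finds unlikely

# The less probable token receives the higher surprisal value.
print(expected_token, unexpected_token)
```

Under this definition, a regional term that a model rarely saw during pre-training would receive a low probability and therefore a high surprisal, which is how the analysis can expose learning gaps.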
Why is it important?
The investigation of linguistic variations of Portuguese in different regions is essential for advancing Natural Language Processing (NLP) technologies in multilingual and multicultural contexts. Currently, most Portuguese language models focus on predominant variants, such as Brazilian Portuguese (pt-BR) and European Portuguese (pt-PT), while other variants, especially those spoken in African countries, remain underrepresented. Including variants like Mozambican Portuguese (pt-MZ) is crucial to ensure equity and representation in NLP applications. Social media, for example, are environments of great linguistic diversity, where local terms, regional slang, and cultural expressions are frequently used. The absence of specific corpora for these regions leads to errors in meaning interpretation, biased results, and limitations in classification and sentiment analysis tasks. In practical scenarios, this can result in misinterpretations, digital exclusion, and the reinforcement of linguistic inequalities.
Read the Original
This page is a summary of: Language Flavors in the Lusophone World: A BERT-Based Social Media Study of Portuguese in Brazil, Portugal, and Mozambique, March 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3672608.3707931.