What is it about?

This research paper is a comprehensive review of recent models that teach computers to describe what they see in images using natural language, with a particular focus on systems that work across multiple languages. It explains how modern “attention-based” transformer methods help machines focus on the most relevant parts of an image to produce accurate and fluent captions. The paper also covers popular datasets, standard evaluation metrics, common challenges such as the scarcity of training data in many languages, and future research directions for extending these systems to real-world applications.
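
To give a flavour of the mechanism, the “attention” at the core of these models is usually the scaled dot-product attention of the standard transformer (Vaswani et al., 2017). A minimal sketch in LaTeX notation, assuming the common captioning setup in which the queries Q come from the caption being generated and the keys K and values V come from image features:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

Here d_k is the dimensionality of the keys; the softmax weights determine how strongly each image region influences the next word of the caption.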

Why is it important?

The techniques discussed in this paper are important and timely because visual content is being created and shared at an unprecedented scale, often across languages and cultures. As images become central to global platforms and domains such as social media, education, healthcare, and accessibility tools, there is a growing need for systems that can automatically and accurately describe images in many languages. Attention-based transformer models represent a major step forward, producing more natural and meaningful descriptions while reducing language bias. Addressing these challenges now is essential to ensure that future AI systems are inclusive, reliable, and capable of serving multilingual users worldwide.

Perspectives

This paper was written to provide a clear and structured overview of how attention-based transformer models are used for image captioning across different languages. By systematically reviewing existing methods, datasets, and evaluation practices, it highlights the current strengths and limitations of multilingual image captioning. The work addresses the growing demand for inclusive and reliable vision–language systems and helps guide future research toward more robust, fair, and widely applicable image description technologies.

Dr Omar S Al-Kadi
University of Jordan

Read the Original

This page is a summary of: Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation, Computer Science Review, November 2025, Elsevier.
DOI: 10.1016/j.cosrev.2025.100766.
You can read the full text via the DOI above.
