What is it about?

Recent technological advances have made video a major part of daily life, particularly on social media, where it has become a primary medium of communication. Understanding and analyzing videos is therefore essential for managing the enormous volume of content being created. Video captioning, the automatic generation of textual descriptions for videos, is one of the most effective routes to such understanding. Thanks to progress in Natural Language Generation (NLG) and Computer Vision, strong video captioning models now exist, but choosing the right model for a given task remains difficult: evaluation methods are limited, and current models have significant shortcomings that prevent them from being widely adopted. This survey addresses these issues by providing a comprehensive overview of video captioning models, including a comparative study of eight representative, high-quality models on two benchmark datasets.

Why is it important?

Understanding videos automatically helps manage and make sense of the massive amount of content online. Video captioning makes videos searchable, accessible, and easier to analyze, which is crucial for social media platforms, education, media, and AI applications. This survey helps researchers and practitioners choose the right models and understand the strengths and weaknesses of current approaches.

Perspectives

The future of video captioning lies in making models more reliable, efficient, and versatile. Key directions include better handling of uncertainty in large language models, enabling long-video understanding through smarter frame selection and extended context, and exploring new architectures like hybrid Transformer–Mamba models for efficiency. Advances in prompt engineering and knowledge distillation can improve adaptability and make high-performing models more accessible. Finally, incorporating practical features such as person or landmark re-identification, context-aware captions, and scene-motion analysis could increase real-world usefulness and support applications across media, accessibility, and AI-driven video analysis.

Antoine Brimont
Télécom SudParis, Institut Polytechnique de Paris

Read the Original

This page is a summary of: A Survey on Video Captioning in the Era of Large Language Models, ACM Transactions on Multimedia Computing, Communications, and Applications, January 2026, ACM (Association for Computing Machinery).
DOI: 10.1145/3793908.
