What is it about?
This work reviews more than two decades of research on multimodal summarization: systems that generate summaries combining text with other media such as images, video, and audio. It explains how different models process and fuse information from multiple modalities, covering both traditional deep learning architectures and recent multimodal large language models (mLLMs). The review also compares evaluation methods and datasets, offering a comprehensive overview of the field.
Why is it important?
As digital information increasingly combines text, images, and video, effective multimodal summarization is essential for making content more accessible and useful. While recent advances in large language models have transformed the field, there has been little systematic comparison of methods and evaluation strategies. This review fills that gap, highlighting strengths, limitations, and future directions. It will help researchers design better models and guide practitioners in building more reliable multimodal applications.
Perspectives
This review was an opportunity to step back and make sense of a rapidly evolving field. I found it fascinating to trace the shift from early graph-based and deep learning approaches to the rise of multimodal large language models. One of the most striking insights for me was how evaluation methods have not kept pace with modeling advances — something I believe is critical for future progress. My hope is that this work not only maps the existing landscape but also inspires others to develop more robust and human-centered approaches to multimodal summarization.
Abid Ali
Macquarie University
Read the Original
This page is a summary of: A Systematic Literature Review on Multimodal Text Summarization, ACM Computing Surveys, September 2025, ACM (Association for Computing Machinery). DOI: 10.1145/3763245.