What is it about?

As a specific form of story generation, Image-guided Story Ending Generation (IgSEG) is a recently proposed task of generating a story ending for a given multi-sentence story plot and an ending-related image. Unlike image captioning or plain story ending generation, IgSEG aims to generate an ending that conforms to both the contextual logic and the relevant visual concepts. Existing methods for IgSEG ignore the relationships between modalities and do not integrate multimodal features appropriately. Therefore, in this work, we propose the Multimodal Memory Transformer (MMT), an end-to-end framework that models and fuses contextual and visual information to effectively capture the multimodal dependency for IgSEG. First, we extract textual and visual features separately with modality-specific large-scale pretrained encoders. Second, we use a memory-augmented cross-modal attention network to learn cross-modal relationships and perform fine-grained feature fusion. Finally, a multimodal transformer decoder builds attention over the fused multimodal features to learn story dependencies and generates informative, reasonable, and coherent story endings. Extensive automatic and human evaluations indicate a significant performance boost of our proposed MMT over state-of-the-art methods on two benchmark datasets.
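
For readers who want a concrete picture of the three stages described above, the sketch below illustrates the pipeline in PyTorch. All module names, feature dimensions, and the specific memory mechanism here are illustrative assumptions for exposition; they do not reproduce the exact implementation in the paper.

# Illustrative sketch of the three-stage MMT pipeline (assumptions, not the authors' code).
import torch
import torch.nn as nn

class MemoryAugmentedCrossModalAttention(nn.Module):
    # Cross-modal attention whose key/value set is extended with learnable memory slots.
    def __init__(self, d_model=512, n_heads=8, n_memory=16):
        super().__init__()
        self.memory_k = nn.Parameter(torch.randn(1, n_memory, d_model))
        self.memory_v = nn.Parameter(torch.randn(1, n_memory, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_feats, visual_feats):
        # Queries come from the story context; keys/values come from image regions
        # plus learned memory slots that can store cross-modal relationships.
        b = text_feats.size(0)
        keys = torch.cat([visual_feats, self.memory_k.expand(b, -1, -1)], dim=1)
        values = torch.cat([visual_feats, self.memory_v.expand(b, -1, -1)], dim=1)
        fused, _ = self.attn(text_feats, keys, values)
        return fused + text_feats  # residual fusion of contextual and visual cues

class MMTSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512):
        super().__init__()
        # Stage 1 (assumed dims): features from modality-specific pretrained encoders,
        # e.g. a BERT-like text encoder (768-d) and a CNN/ViT image encoder (2048-d),
        # projected into a shared d_model space.
        self.text_proj = nn.Linear(768, d_model)
        self.visual_proj = nn.Linear(2048, d_model)
        # Stage 2: memory-augmented cross-modal attention for fine-grained fusion.
        self.fusion = MemoryAugmentedCrossModalAttention(d_model)
        # Stage 3: a transformer decoder attends over the fused multimodal features
        # while generating the story ending token by token.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_feats, visual_feats, ending_tokens):
        memory = self.fusion(self.text_proj(text_feats), self.visual_proj(visual_feats))
        tgt = self.embed(ending_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(hidden)  # next-token logits for the story ending

A forward pass with dummy tensors of the assumed shapes, e.g. MMTSketch()(torch.randn(2, 40, 768), torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 12))), returns per-token vocabulary logits for the generated ending.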

Why is it important?

Telling a story is a natural task for humans and a fundamental problem for machine intelligence, covering the generation of various types of text, such as novels, scripts, and news. With automatic story generation systems, users can input story clues (e.g., storylines, related images) and receive automatically generated stories, which can significantly improve the efficiency and quality of composition.

Perspectives

This paper makes a very early attempt to generate stories coherent with both the input textual content and the input visual concepts. It enables users to obtain deeply customized stories by inputting specific multimodal guidance. I believe commercial applications with similar functions will emerge in the coming years.

Dizhan Xue
Institute of Automation, Chinese Academy of Sciences

Read the Original

This page is a summary of: MMT: Image-guided Story Ending Generation with Multimodal Memory Transformer, October 2022, ACM (Association for Computing Machinery),
DOI: 10.1145/3503161.3548022.
