What is it about?

Recent captioning models are limited in their ability to describe concepts that do not appear in paired image-sentence training data. This paper presents a multi-task learning framework for describing novel words that are absent from existing image-captioning datasets. The framework takes advantage of external sources: labeled images from image-classification datasets and semantic knowledge extracted from annotated text. We propose minimizing a joint objective that can learn from these diverse data sources and leverage distributional semantic embeddings. At inference time, we modify the beam-search step to consider both the caption model and a language model, enabling the model to generalize to novel words outside image-captioning datasets. We demonstrate that adding annotated text data helps the image-captioning model describe images with the correct corresponding novel words. Extensive experiments are conducted on the AI Challenger and MSCOCO image-captioning datasets, which cover two different languages, demonstrating the ability of our framework to describe novel words such as scenes and objects.
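To make the modified inference step concrete, here is a minimal sketch (not the authors' code) of a beam-search step that mixes the caption model's softmax with an external language model's softmax, so that words learned only from the text corpus can still be scored and emitted. All names (combined_step_scores, caption_logits, lm_logits, alpha) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def combined_step_scores(caption_logits: torch.Tensor,
                         lm_logits: torch.Tensor,
                         alpha: float = 0.7) -> torch.Tensor:
    """Per-token log-scores for one beam-search step.

    caption_logits: (beam, vocab) logits from the image-captioning decoder.
    lm_logits:      (beam, vocab) logits from a language model trained on the
                    annotated text corpus (a shared vocabulary is assumed).
    alpha:          interpolation weight between the two distributions.
    """
    cap_prob = F.softmax(caption_logits, dim=-1)
    lm_prob = F.softmax(lm_logits, dim=-1)
    # Weighted sum of the two softmax outputs, then log for beam scoring.
    mixed = alpha * cap_prob + (1.0 - alpha) * lm_prob
    return torch.log(mixed + 1e-12)

def beam_search_step(beam_scores, caption_logits, lm_logits, beam_size=3, alpha=0.7):
    """Expand each beam with the combined scores and keep the top candidates."""
    step_scores = combined_step_scores(caption_logits, lm_logits, alpha)
    total = beam_scores.unsqueeze(-1) + step_scores   # (beam, vocab)
    flat = total.view(-1)
    top_scores, top_idx = flat.topk(beam_size)
    beam_idx = top_idx // step_scores.size(-1)        # which beam to extend
    token_idx = top_idx % step_scores.size(-1)        # which word to append
    return top_scores, beam_idx, token_idx
```

The interpolation weight alpha and the simple weighted sum are stand-ins for whatever combination rule the paper actually uses; the point is only that both models' distributions enter the beam-search scoring.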


Why is it important?

This paper makes two main contributions. 1) We add annotated text data, which helps the image-captioning model describe images with the correct corresponding novel words. 2) We modify the beam search used in the inference step, which enables the model to describe images with novel words learned from the text corpus.

Perspectives

Our model still has some limitations. It does not yet support the multi-label setting, and the strategy of summing the two softmax outputs in the inference step could be improved. In future work, in addition to extending the model to the multi-label setting, we could use reinforcement learning to improve the logical coherence of the output sentences.

He Zheng

Read the Original

This page is a summary of: Multi-task Learning for Captioning Images with Novel Words, IET Computer Vision, October 2018, the Institution of Engineering and Technology (the IET). DOI: 10.1049/iet-cvi.2018.5005.

