What is it about?

In the context of long document classification (LDC), effectively exploiting the multi-modal information in such documents, i.e., both their texts and their images, has not received adequate attention. To address this gap, we propose a novel cross-modal method for long document classification, in which multiple-granularity feature shifting networks adaptively integrate the multi-scale text and visual features of long documents.
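
As a rough, hypothetical illustration of the idea (not the authors' exact formulation), a feature shifting step at one granularity level could look like the sketch below; the module names, tensor shapes, and the gating scheme are assumptions:

```python
import torch
import torch.nn as nn

class GranularityFeatureShift(nn.Module):
    """Hypothetical sketch: adaptively shift text features toward the
    visual modality at one granularity level (e.g. sentence or section)."""

    def __init__(self, dim: int):
        super().__init__()
        # gate decides, per feature, how much visual information to mix in
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.visual_proj = nn.Linear(dim, dim)

    def forward(self, text_feats: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_units, dim), e.g. sentence-level text features
        # vis_feats:  (batch, n_units, dim), visual features aligned to the same units
        vis = self.visual_proj(vis_feats)
        g = self.gate(torch.cat([text_feats, vis], dim=-1))
        # adaptive fusion: gated residual shift of text features by visual features
        return text_feats + g * vis
```

In the full model, analogous shifting operations would be applied at several granularity levels (for example token, sentence, and document), with the visual features pooled or broadcast to match each level; the paper gives the precise formulation.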

Why is it important?

A novel Cross-Modal Multiple Granularity Interactive Fusion Network (CM-MGIFN) is proposed for LDC, combining text and image features at different levels of granularity. To the best of our knowledge, this is the first work to integrate text and images at different granularity levels for LDC. A Multi-Modal Collaborative Pooling (MMCP) block is proposed to eliminate redundant textual information, thereby reducing computational complexity. Extensive experiments on the public Food101 dataset and two newly created multi-modal long document datasets show that our method outperforms both single-modal text methods and state-of-the-art multi-modal baselines.
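
The exact MMCP design is detailed in the full paper; as a minimal sketch under assumed names and shapes, an image-guided pooling step that keeps only the most relevant text tokens, and thereby shortens the sequence that later cross-modal layers must process, might look like this:

```python
import torch
import torch.nn as nn

class CollaborativePooling(nn.Module):
    """Hypothetical sketch: score text tokens against a global image
    representation and keep only the top-k, dropping redundant text."""

    def __init__(self, dim: int, keep: int):
        super().__init__()
        self.keep = keep
        self.score = nn.Linear(dim, 1)

    def forward(self, text_tokens: torch.Tensor, image_global: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (batch, seq_len, dim)
        # image_global: (batch, dim), e.g. a pooled visual embedding
        relevance = self.score(text_tokens * image_global.unsqueeze(1)).squeeze(-1)  # (batch, seq_len)
        topk = relevance.topk(self.keep, dim=1).indices                              # (batch, keep)
        # restore original token order, then gather the selected tokens
        idx = topk.sort(dim=1).values.unsqueeze(-1).expand(-1, -1, text_tokens.size(-1))
        return text_tokens.gather(1, idx)  # (batch, keep, dim)
```

Keeping only `keep` tokens before the subsequent cross-modal interaction is what lowers the cost of later attention layers; the actual block may use a softer pooling scheme than this hard top-k selection.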

Perspectives

The model currently uses visual features in a relatively simple manner. In future work, we will exploit the relations among images and finer-grained image features to further improve the performance of our method for LDC.

Tengfei Liu
Beijing University of Technology

Read the Original

This page is a summary of: Cross-Modal Multiple Granularity Interactive Fusion Network for Long Document Classification, ACM Transactions on Knowledge Discovery from Data, November 2023, ACM (Association for Computing Machinery).
DOI: 10.1145/3631711.
You can read the full text via the DOI above.
