What is it about?
In long document classification (LDC), the multi-modal information that documents contain, i.e., both text and images, has received little attention. To address this gap, we propose a novel cross-modal method for LDC, in which multiple-granularity feature shifting networks adaptively integrate the multi-scale text and visual features of long documents.
Why is it important?
We propose a novel Cross-Modal Multiple Granularity Interactive Fusion Network (CM-MGIFN) for LDC, which combines text and image features at different levels of granularity. To the best of our knowledge, this is the first work to integrate text and images at multiple granularity levels for LDC. We also propose a Multi-Modal Collaborative Pooling (MMCP) block that removes redundant textual information and thus reduces computational complexity. Extensive experiments on the public Food101 dataset and two newly created multi-modal long document datasets show that our method outperforms single-modal text methods and surpasses state-of-the-art multi-modal baselines.
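To make the idea concrete, here is a minimal, illustrative sketch of the two ingredients described above: visual-guided pruning of redundant text tokens (standing in for the MMCP block) and gated fusion of text and visual features at two granularity levels. All function names, feature dimensions, and the specific gating/pooling formulas are simplifying assumptions for illustration, not the published CM-MGIFN architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_fusion(text_feat, vis_feat, w_gate):
    """Fuse one text and one visual feature vector with a sigmoid gate.

    The gate decides, per dimension, how much visual information is
    shifted into the text representation. This is a generic cross-modal
    gating pattern, not the paper's exact feature shifting network.
    """
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([text_feat, vis_feat]) @ w_gate)))
    return gate * text_feat + (1.0 - gate) * vis_feat

def collaborative_pooling(token_feats, vis_summary, keep):
    """Keep only the `keep` text tokens most aligned with the visual
    summary, dropping redundant tokens to cut later computation
    (a hypothetical stand-in for the MMCP block)."""
    scores = token_feats @ vis_summary            # (n_tokens,) salience scores
    top = np.argsort(scores)[-keep:]              # most visually aligned tokens
    return token_feats[np.sort(top)]              # preserve original token order

d = 8
# Hypothetical features at two granularities: word level and section level.
token_text = rng.normal(size=(32, d))             # 32 word-level features
section_text = rng.normal(size=(4, d))            # 4 section-level features
vis = rng.normal(size=(d,))                       # one image feature vector
w_gate = rng.normal(size=(2 * d, d)) * 0.1        # toy gate weights

# 1) Prune redundant tokens with visual guidance (MMCP-style).
kept = collaborative_pooling(token_text, vis, keep=8)

# 2) Fuse each granularity with the visual feature.
fused_tokens = np.stack([gated_fusion(t, vis, w_gate) for t in kept])
fused_sections = np.stack([gated_fusion(s, vis, w_gate) for s in section_text])

# 3) Pool both granularities into one document vector for classification.
doc_vec = np.concatenate([fused_tokens.mean(axis=0), fused_sections.mean(axis=0)])
print(doc_vec.shape)
```

The pruning step is what keeps the cost manageable for long documents: only the 8 most visually salient of the 32 word-level features are fused, while the coarse section-level features are always retained.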
Read the Original
This page is a summary of: Cross-Modal Multiple Granularity Interactive Fusion Network for Long Document Classification, ACM Transactions on Knowledge Discovery from Data, November 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3631711.