What is it about?

Multi-modal action recognition is an essential task in human-centric machine learning. Humans perceive the world by processing and fusing information from multiple modalities, such as vision and audio. We introduce a novel transformer-based multi-modal architecture that outperforms existing state-of-the-art methods while significantly reducing computational cost.


Why is it important?

The key to our idea is a Token-Selector module that collates and condenses the most useful token combinations and shares only what is necessary for cross-modal modeling. We conduct extensive experiments on multiple multi-modal benchmark datasets and achieve state-of-the-art performance under similar experimental conditions while reducing computational cost by 30 percent.
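To make the idea concrete, here is a minimal sketch (in PyTorch) of one common way such a selector can work: score each token's importance, keep only the top-k tokens, and hand this condensed set to the cross-modal stage. The names (TokenSelector, scorer, keep) and the simple linear scorer are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Keeps only the top-k most useful tokens so that cross-modal
    attention runs over a condensed set instead of all tokens."""
    def __init__(self, dim: int, keep: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token importance score (assumed design)
        self.keep = keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.scorer(tokens).squeeze(-1)        # (batch, num_tokens)
        idx = scores.topk(self.keep, dim=1).indices     # indices of the top-k tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                    # (batch, keep, dim)

# Example: condense 196 video tokens to 32 before sharing them with the audio stream.
selector = TokenSelector(dim=768, keep=32)
video_tokens = torch.randn(2, 196, 768)
shared = selector(video_tokens)  # shape (2, 32, 768): fewer tokens, cheaper cross-attention

Because cross-attention cost grows with the number of tokens exchanged between modalities, shrinking the shared set in this way is what drives the compute savings described above.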

Read the Original

This page is a summary of: Cross-modal Token Selection for Video Understanding, October 2022, ACM (Association for Computing Machinery), DOI: 10.1145/3552458.3556444.
