What is it about?
Multi-modal action recognition is an essential task in human-centric machine learning. Humans perceive the world by processing and fusing information from multiple modalities, such as vision and audio. We introduce a novel transformer-based multi-modal architecture that outperforms existing state-of-the-art methods while significantly reducing computational cost.
Why is it important?
The key to our idea is a Token-Selector module that collates and condenses the most useful token combinations and shares only what is necessary for cross-modal modeling. We conduct extensive experiments on multiple multi-modal benchmark datasets and achieve state-of-the-art performance under comparable experimental conditions while reducing computation by 30 percent.
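To make the idea concrete, the sketch below shows one plausible way a token-selection step could sit in front of cross-modal attention: score the tokens of one modality, keep only the top-scoring fraction, and let the other modality attend to that reduced set. This is a minimal illustration under assumptions of our own (the TokenSelector name, the scoring MLP, and the 30 percent keep ratio are hypothetical), not the paper's actual implementation.

```python
# Minimal sketch of token selection before cross-modal attention.
# Assumptions (not from the paper): module name, MLP scorer, top-k keep ratio.
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Scores tokens in one modality and keeps only the top-k most useful
    ones to share cross-modally, reducing the cost of cross-attention."""

    def __init__(self, dim: int, keep_ratio: float = 0.3):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.scorer(tokens).squeeze(-1)           # (batch, num_tokens)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        top_idx = scores.topk(k, dim=1).indices            # most useful tokens
        batch_idx = torch.arange(tokens.size(0)).unsqueeze(-1)
        return tokens[batch_idx, top_idx]                  # (batch, k, dim)


# Usage: condense video tokens, then let audio tokens attend only to them.
video_tokens = torch.randn(2, 196, 256)   # e.g. patch tokens from video frames
audio_tokens = torch.randn(2, 64, 256)    # e.g. audio spectrogram tokens
selector = TokenSelector(dim=256, keep_ratio=0.3)
shared_video = selector(video_tokens)     # only selected tokens are shared
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
fused, _ = cross_attn(audio_tokens, shared_video, shared_video)
```

Because cross-attention cost scales with the number of key/value tokens, sharing only a condensed subset is where the reported compute savings would come from.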
Read the Original
This page is a summary of: Cross-modal Token Selection for Video Understanding, October 2022, ACM (Association for Computing Machinery),
DOI: 10.1145/3552458.3556444.