What is it about?

Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in videos according to the associated audio cues, where both modalities are affected by noise to different extents, such as background noise blended into the audio or distracting objects in the video. Most existing methods focus on learning interactions between modalities at high semantic levels but are incapable of filtering low-level noise or achieving fine-grained representational interactions during the early feature extraction phase. Consequently, they struggle with illusion issues, where nonexistent audio cues are erroneously linked to visual objects. In this paper, we present SelM, a novel architecture that leverages selective mechanisms to counteract these illusions. SelM employs a state space model for noise reduction and robust feature selection. By imposing additional bidirectional constraints on the audio and visual embeddings, it can precisely identify the crucial features corresponding to sound-emitting targets. To fill the existing gap in early fusion within AVS, SelM introduces a dual alignment mechanism specifically engineered to facilitate intricate spatio-temporal interactions between the audio and visual streams, yielding more fine-grained representations. Moreover, we develop a cross-level decoder for layered reasoning, significantly enhancing segmentation precision by exploring the complex relationships between audio and visual information. SelM achieves state-of-the-art performance on AVS tasks, especially on the challenging Audio-Visual Semantic Segmentation subset. The code can be found at https://github.com/Cyyzpoi/SelM.
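
To make the ideas in the abstract a bit more concrete, here is a minimal, hypothetical PyTorch sketch of two of them: bidirectional selection between the audio and visual embeddings, and an early-fusion dual alignment between the two streams. The module names, tensor shapes, and the simple gated linear layers used in place of SelM's actual state space blocks are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal, hypothetical sketch (not the authors' code): bidirectional selection
# between audio and visual embeddings, followed by a dual cross-attention
# alignment for early fusion. Gated linear layers stand in for SelM's actual
# state space blocks, which the abstract describes only at a high level.
import torch
import torch.nn as nn


class BidirectionalSelection(nn.Module):
    """Each modality gates the other, suppressing features without a counterpart."""

    def __init__(self, dim: int):
        super().__init__()
        self.audio_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.visual_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, Ta, D) frame-level audio tokens; visual: (B, Nv, D) patch tokens
        a_ctx = audio.mean(dim=1, keepdim=True)   # pooled audio summary, (B, 1, D)
        v_ctx = visual.mean(dim=1, keepdim=True)  # pooled visual summary, (B, 1, D)
        visual = visual * self.audio_gate(a_ctx)  # audio selects relevant visual features
        audio = audio * self.visual_gate(v_ctx)   # visual selects relevant audio features
        return audio, visual


class DualAlignment(nn.Module):
    """Early fusion: each stream attends to the other for fine-grained interaction."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.audio_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        v_aligned, _ = self.audio_to_visual(query=visual, key=audio, value=audio)
        a_aligned, _ = self.visual_to_audio(query=audio, key=visual, value=visual)
        return audio + a_aligned, visual + v_aligned  # residual connections


if __name__ == "__main__":
    B, Ta, Nv, D = 2, 5, 196, 256              # batch, audio frames, visual tokens, channels
    audio = torch.randn(B, Ta, D)
    visual = torch.randn(B, Nv, D)             # e.g. a flattened 14x14 feature map
    audio, visual = BidirectionalSelection(D)(audio, visual)
    audio, visual = DualAlignment(D)(audio, visual)
    print(audio.shape, visual.shape)           # torch.Size([2, 5, 256]) torch.Size([2, 196, 256])
```

In this sketch, each modality derives a gate from the other modality's pooled summary, so features with no audio-visual counterpart are suppressed before fusion, and cross-attention in both directions then provides the fine-grained early interaction the abstract describes; the cross-level decoder is omitted.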

Why is it important?

Existing AVS methods interact with audio and visual signals only at high semantic levels, so low-level noise passes through unfiltered and nonexistent audio cues end up linked to visual objects, the so-called illusion problem. SelM addresses this with selective state-space filtering under bidirectional audio-visual constraints, an early-fusion dual alignment between the two streams, and a cross-level decoder for layered reasoning. Together, these components yield state-of-the-art performance on AVS benchmarks, most notably on the challenging Audio-Visual Semantic Segmentation subset.

Read the Original

This page is a summary of: SelM: Selective Mechanism based Audio-Visual Segmentation, October 2024, ACM (Association for Computing Machinery), DOI: 10.1145/3664647.3680926.
You can read the full text via the DOI above.
