What is it about?

Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in videos according to the associated audio cues, where both modalities are affected by noise to different extents, such as background noise blended into the audio or distracting objects in the video. Most existing methods focus on learning interactions between modalities at high semantic levels but are incapable of filtering low-level noise or achieving fine-grained representational interactions during the early feature extraction phase. Consequently, they struggle with illusion issues, where nonexistent audio cues are erroneously linked to visual objects. In this paper, we present SelM, a novel architecture that leverages selective mechanisms to counteract these illusions. SelM employs a state space model for noise reduction and robust feature selection. By imposing additional bidirectional constraints on the audio and visual embeddings, it can precisely identify the crucial features corresponding to sound-emitting targets. To fill the existing gap in early fusion within AVS, SelM introduces a dual alignment mechanism specifically engineered to facilitate intricate spatio-temporal interactions between the audio and visual streams, yielding more fine-grained representations. Moreover, we develop a cross-level decoder for layered reasoning, significantly enhancing segmentation precision by exploring the complex relationships between audio and visual information. SelM achieves state-of-the-art performance on AVS tasks, especially on the challenging Audio-Visual Semantic Segmentation subset. The code can be found at https://github.com/Cyyzpoi/SelM.
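
To make the ideas in the abstract a bit more concrete, here is a minimal, hypothetical PyTorch sketch of two of them: bidirectional selection between the audio and visual embeddings, and an early-fusion dual alignment between the two streams. The module names, tensor shapes, and the simple gated linear layers used in place of SelM's actual state space blocks are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal, hypothetical sketch (not the authors' code): bidirectional selection
# between audio and visual embeddings, followed by a dual cross-attention
# alignment for early fusion. Gated linear layers stand in for SelM's actual
# state space blocks, which the abstract describes only at a high level.
import torch
import torch.nn as nn


class BidirectionalSelection(nn.Module):
    """Each modality gates the other, suppressing features without a counterpart."""

    def __init__(self, dim: int):
        super().__init__()
        self.audio_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.visual_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, Ta, D) frame-level audio tokens; visual: (B, Nv, D) patch tokens
        a_ctx = audio.mean(dim=1, keepdim=True)   # pooled audio summary, (B, 1, D)
        v_ctx = visual.mean(dim=1, keepdim=True)  # pooled visual summary, (B, 1, D)
        visual = visual * self.audio_gate(a_ctx)  # audio selects relevant visual features
        audio = audio * self.visual_gate(v_ctx)   # visual selects relevant audio features
        return audio, visual


class DualAlignment(nn.Module):
    """Early fusion: each stream attends to the other for fine-grained interaction."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.audio_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        v_aligned, _ = self.audio_to_visual(query=visual, key=audio, value=audio)
        a_aligned, _ = self.visual_to_audio(query=audio, key=visual, value=visual)
        return audio + a_aligned, visual + v_aligned  # residual connections


if __name__ == "__main__":
    B, Ta, Nv, D = 2, 5, 196, 256              # batch, audio frames, visual tokens, channels
    audio = torch.randn(B, Ta, D)
    visual = torch.randn(B, Nv, D)             # e.g. a flattened 14x14 feature map
    audio, visual = BidirectionalSelection(D)(audio, visual)
    audio, visual = DualAlignment(D)(audio, visual)
    print(audio.shape, visual.shape)           # torch.Size([2, 5, 256]) torch.Size([2, 196, 256])
```

In this sketch, each modality derives a gate from the other modality's pooled summary, so features with no audio-visual counterpart are suppressed before fusion, and cross-attention in both directions then provides the fine-grained early interaction the abstract describes; the cross-level decoder is omitted.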

Why is it important?

Existing AVS methods interact with audio and visual signals only at high semantic levels, so low-level noise passes through unfiltered and nonexistent audio cues end up linked to visual objects, the so-called illusion problem. SelM addresses this with selective state-space filtering under bidirectional audio-visual constraints, an early-fusion dual alignment between the two streams, and a cross-level decoder for layered reasoning. Together, these components yield state-of-the-art performance on AVS benchmarks, most notably on the challenging Audio-Visual Semantic Segmentation subset.

Read the Original

This page is a summary of: SelM: Selective Mechanism based Audio-Visual Segmentation, October 2024, ACM (Association for Computing Machinery), DOI: 10.1145/3664647.3680926.
You can read the full text via the DOI above.
