What is it about?

Previous video object segmentation approaches mainly rely on simplex (one-way) solutions that link appearance and motion, limiting effective feature collaboration between the two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) that addresses this issue with a better mutual-restraint scheme between motion and appearance, allowing cross-modal features to be exploited during both the fusion and decoding stages. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update inconsistent features in the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion) and compares favorably with leading methods for video object segmentation and salient object detection.
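To make the "bidirectional message propagation" idea concrete, here is a minimal NumPy sketch of cross-attention flowing in both directions between an appearance (RGB) branch and a motion (optical-flow) branch. The shapes, the residual fusion, and the function names are illustrative assumptions, not the paper's exact RCAM formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_feats):
    """Attend from one modality (query) to the other (key/value).
    Illustrative sketch only; FSNet's RCAM is more elaborate."""
    scale = np.sqrt(query_feats.shape[-1])
    attn = softmax(query_feats @ key_feats.T / scale, axis=-1)  # (Nq, Nk)
    return attn @ key_feats                                     # (Nq, C)

# Toy features: 16 spatial positions, 8 channels per modality.
rng = np.random.default_rng(0)
appearance = rng.standard_normal((16, 8))  # RGB-branch features (assumed)
motion = rng.standard_normal((16, 8))      # flow-branch features (assumed)

# Full-duplex: messages flow in both directions, then fuse residually.
appearance_enh = appearance + cross_attention(appearance, motion)
motion_enh = motion + cross_attention(motion, appearance)

print(appearance_enh.shape, motion_enh.shape)  # (16, 8) (16, 8)
```

The contrast with a simplex design is that only one of the two `cross_attention` calls would be present, so one branch would never receive messages from the other.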


Why is it important?

In this paper, we present a simple yet efficient framework, termed the full-duplex strategy network (FSNet), that fully leverages mutual constraints between appearance and motion cues to address the video object segmentation problem. It consists of two core modules: a relational cross-attention module in the encoding stage and an efficient bidirectional purification module in the decoding stage. The former abstracts features from the two modalities, while the latter re-calibrates inconsistent features step by step. We thoroughly validated the functional modules of our architecture through extensive experiments, leading to several interesting findings. Finally, FSNet acts as a unified solution that significantly advances state-of-the-art (SOTA) models for unsupervised video object segmentation (U-VOS) and video salient object detection (V-SOD).
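The "re-calibrate inconsistent features" step in the decoding stage can be pictured as a gating operation: where the two branches agree, each keeps its own features; where they disagree, information from the other branch is blended in. The gate below is a deliberately simple stand-in, and all names and shapes are assumptions for illustration, not the paper's bidirectional purification module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def purify(target, reference):
    """Re-calibrate `target` features with a gate derived from the
    per-position agreement between the two branches (illustrative
    stand-in, not FSNet's exact formulation)."""
    gate = sigmoid((target * reference).mean(axis=-1, keepdims=True))  # (N, 1)
    return gate * target + (1.0 - gate) * reference

# Toy decoding-stage features: 16 positions, 8 channels per branch.
rng = np.random.default_rng(1)
spatial = rng.standard_normal((16, 8))   # appearance-branch features (assumed)
temporal = rng.standard_normal((16, 8))  # motion-branch features (assumed)

# Bidirectional: each branch is purified against the other.
spatial_out = purify(spatial, temporal)
temporal_out = purify(temporal, spatial)
print(spatial_out.shape, temporal_out.shape)
```

Because the gate is computed per spatial position, a noisy region in one branch (e.g., motion blur corrupting the flow features) can be suppressed locally without discarding that branch everywhere.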

Perspectives

We dive into the full-duplex strategy for automated video object segmentation, which enforces bidirectional message passing at both the encoding and decoding stages. Our experiments show that such a strategy substantially enhances the interaction between inter- and intra-features extracted from the spatio-temporal space. We hope this article inspires further work on video content understanding tasks.

Deng-Ping Fan
ETH Zurich

Read the Original

This page is a summary of: Full-duplex strategy for video object segmentation, Computational Visual Media, October 2022, Tsinghua University Press,
DOI: 10.1007/s41095-021-0262-4.
