What is it about?
Binarized Vision Transformers (BiViTs) aim to make Vision Transformers (ViTs) efficient and lightweight enough for devices with limited computational resources. Yet current approaches to binarizing ViTs cause a substantial performance drop compared to the full-precision model, posing obstacles to practical deployment. Through empirical study, we reveal that spatial interaction (SI) is a critical factor affecting performance: binarization weakens token-level correlation, yet previous work has ignored this factor. To this end, we design a ViT binarization approach, dubbed SI-BiViT, that incorporates spatial interaction into the binarization process. Specifically, an SI module is placed alongside the Multi-Layer Perceptron (MLP) module to form a dual-branch structure. This structure not only leverages knowledge from pre-trained ViTs by distilling over the original MLP, but also enhances spatial interaction via the introduced SI module. Correspondingly, we design a decoupled training strategy to train the two branches more effectively. Importantly, SI-BiViT is orthogonal to existing binarized-ViT approaches and can be plugged in directly. Extensive experiments demonstrate the strong flexibility and effectiveness of SI-BiViT: we plug our method into four classic ViT backbones and evaluate on three downstream tasks, namely classification, detection, and segmentation.
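To make the dual-branch idea concrete, here is a minimal PyTorch sketch under stated assumptions, not the authors' actual implementation: the SI module is modeled as a depthwise 3x3 convolution over the 2D token grid (a common way to restore token-level correlation; the paper's exact SI design may differ), binarization uses a standard straight-through sign estimator, and the distillation loss and decoupled training schedule are omitted. All names here (BinarySign, BinaryLinear, DualBranchBlock, grid_size) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarySign(torch.autograd.Function):
    """Sign binarization with a straight-through gradient estimator (STE)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Clipped STE: pass gradients only where |x| <= 1.
        return grad_out * (x.abs() <= 1).float()


class BinaryLinear(nn.Linear):
    """Linear layer with binarized weights and activations."""

    def forward(self, x):
        w_bin = BinarySign.apply(self.weight)
        x_bin = BinarySign.apply(x)
        return F.linear(x_bin, w_bin, self.bias)


class DualBranchBlock(nn.Module):
    """Binarized MLP branch plus a spatial-interaction (SI) branch."""

    def __init__(self, dim, hidden_dim, grid_size):
        super().__init__()
        self.grid_size = grid_size  # tokens form a grid_size x grid_size map
        # Binarized token-wise MLP branch (operates on each token independently).
        self.mlp = nn.Sequential(
            BinaryLinear(dim, hidden_dim), nn.GELU(), BinaryLinear(hidden_dim, dim)
        )
        # SI branch: a depthwise conv mixes each token with its spatial neighbors.
        self.si = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):  # x: (B, N, C) with N == grid_size ** 2
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, self.grid_size, self.grid_size)
        si_out = self.si(grid).flatten(2).transpose(1, 2)
        return x + self.mlp(x) + si_out  # residual plus both branches


# Usage: a 14x14 token grid (ViT-style) with embedding dim 192.
block = DualBranchBlock(dim=192, hidden_dim=768, grid_size=14)
tokens = torch.randn(2, 14 * 14, 192)
print(block(tokens).shape)  # torch.Size([2, 196, 192])
```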
Why is it important?
Binarization is a promising route to running ViTs on resource-constrained devices, but existing binarized ViTs lose substantial accuracy relative to their full-precision counterparts, which has so far blocked practical deployment. By identifying spatial interaction as a key missing ingredient and restoring it, SI-BiViT narrows this gap while remaining compatible with existing binarization methods.
Read the Original
This page is a summary of: SI-BiViT: Binarizing Vision Transformers with Spatial Interaction, October 2024, ACM (Association for Computing Machinery). DOI: 10.1145/3664647.3680872.
You can read the full text via the DOI above.