What is it about?

CMAL is a novel cross-modal associative learning framework that combines anchor point detection with cross-modal associative learning for vision-language pre-training. Experiments on four well-known downstream vision-and-language (V+L) tasks demonstrate the effectiveness of CMAL, showing that it achieves competitive performance with a smaller pre-training corpus and lower computational cost.
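
To give a concrete sense of the two components named above, here is a minimal sketch in PyTorch. It is not the authors' implementation: the anchor heuristic used here (picking the text/image position pairs with the highest cross-modal similarity) and the helper names detect_anchors, associative_loss, and predictor are assumptions made purely for illustration.

    # Minimal, illustrative sketch of anchor detection + associative recall.
    # NOT the CMAL authors' code; all names and heuristics here are assumptions.
    import torch
    import torch.nn.functional as F

    def detect_anchors(text_feats, image_feats, k=3):
        """Pick the k text/image position pairs with the highest cosine
        similarity as 'anchor points' shared by both modalities (assumed heuristic)."""
        t = F.normalize(text_feats, dim=-1)   # (T, d) text token features
        v = F.normalize(image_feats, dim=-1)  # (R, d) image region features
        sim = t @ v.T                          # (T, R) cross-modal similarity
        flat = sim.flatten().topk(k).indices
        return [(i.item() // sim.size(1), i.item() % sim.size(1)) for i in flat]

    def associative_loss(text_feats, image_feats, predictor, anchors):
        """Mask each anchor on the text side and ask a small predictor to
        recall it from the paired image region (no contrastive negatives)."""
        losses = []
        for t_idx, v_idx in anchors:
            target = text_feats[t_idx].detach()       # masked-out anchor embedding
            recalled = predictor(image_feats[v_idx])  # associate from the other modality
            losses.append(F.mse_loss(recalled, target))
        return torch.stack(losses).mean()

    # Toy usage with random features (dimensions are arbitrary).
    d = 64
    text_feats = torch.randn(12, d)    # 12 text tokens
    image_feats = torch.randn(36, d)   # 36 image regions
    predictor = torch.nn.Linear(d, d)  # stands in for a cross-modal head

    anchors = detect_anchors(text_feats, image_feats, k=3)
    loss = associative_loss(text_feats, image_feats, predictor, anchors)
    loss.backward()

The point of the sketch is simply that the learning signal comes from recalling masked anchor features across modalities rather than from contrasting positive and negative pairs.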

Why is it important?

Inspired by the associative thinking of the human brain, CMAL offers a new paradigm for multimodal learning in machines. Moreover, CMAL requires no contrastive learning; experiments on four downstream V+L tasks demonstrate its effectiveness, showing that it achieves competitive performance.

Perspectives

I hope this paper offers our colleagues in the multimodal field a completely different, human-like, and potentially promising new technique, one that departs from current self-supervised methods such as masked learning and contrastive learning and has inherent advantages in cross-modal training.

Zhiyuan Ma
Huazhong University of Science and Technology

Read the Original

This page is a summary of: CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training, October 2022, ACM (Association for Computing Machinery),
DOI: 10.1145/3503161.3548292.
You can read the full text via the DOI above.
