What is it about?

We propose an effective unsupervised cross-modal hashing retrieval method, called Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval (VLKD). VLKD uses a vision-language pre-training (VLP) model to encode features from multi-modal data and then constructs a similarity matrix that provides soft similarity supervision for the student model, distilling the VLP model's multi-modal knowledge into the student. In addition, we design an end-to-end unsupervised hash learning model that incorporates a graph convolutional auxiliary network: guided by the similarity matrix distilled from the teacher, the auxiliary network aggregates information from similar data nodes to generate more consistent hash codes. Finally, the teacher network requires no additional training; it only guides the student network to learn high-quality hash representations, which makes VLKD efficient in both training and retrieval. A rough sketch of this pipeline is given below.
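To make the idea concrete, here is a minimal PyTorch-style sketch of the main components, assuming frozen VLP features as teacher input. All class and function names (teacher_similarity, StudentHashNet, GCNAux, distillation_loss) are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of VLKD-style distillation; names and details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def teacher_similarity(img_feat, txt_feat, alpha=0.5):
    """Soft similarity matrix distilled from frozen VLP (teacher) features."""
    img = F.normalize(img_feat, dim=1)
    txt = F.normalize(txt_feat, dim=1)
    s_img = img @ img.t()                       # image-image cosine similarity
    s_txt = txt @ txt.t()                       # text-text cosine similarity
    return alpha * s_img + (1 - alpha) * s_txt  # fused soft supervision

class StudentHashNet(nn.Module):
    """Maps one modality's features to K-bit relaxed hash codes."""
    def __init__(self, in_dim, hash_bits):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, hash_bits),
        )

    def forward(self, x):
        return torch.tanh(self.mlp(x))          # relaxed binary codes in (-1, 1)

class GCNAux(nn.Module):
    """Graph convolutional auxiliary branch: aggregates codes of similar samples."""
    def __init__(self, hash_bits):
        super().__init__()
        self.weight = nn.Linear(hash_bits, hash_bits, bias=False)

    def forward(self, codes, sim):
        adj = F.softmax(sim, dim=1)             # row-normalised distilled similarity
        return torch.tanh(self.weight(adj @ codes))

def distillation_loss(img_codes, txt_codes, sim):
    """Align the student's code similarities with the teacher's soft matrix."""
    k = img_codes.size(1)
    cross = img_codes @ txt_codes.t() / k       # scaled inner products in [-1, 1]
    return F.mse_loss(cross, sim)

# Toy usage with random features standing in for frozen VLP outputs.
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
sim = teacher_similarity(img_feat, txt_feat)
img_net, txt_net, gcn = StudentHashNet(512, 64), StudentHashNet(512, 64), GCNAux(64)
img_codes = gcn(img_net(img_feat), sim)
txt_codes = gcn(txt_net(txt_feat), sim)
loss = distillation_loss(img_codes, txt_codes, sim)
```

In this sketch, only the student hash networks and the auxiliary graph branch are trained; the teacher similarity matrix is computed once from the VLP features and reused as soft supervision.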


Why is it important?

Extensive experiments on three multimedia retrieval benchmark datasets show that VLKD achieves better retrieval performance than existing unsupervised cross-modal hashing methods, demonstrating its effectiveness.

Perspectives

I hope this article makes a small contribution to the field of cross-modal retrieval.

Lina Sun

Read the Original

This page is a summary of: Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval, June 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3591106.3592242.
