What is it about?
We propose an effective unsupervised cross-modal hashing retrieval method, called Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval (VLKD). VLKD uses a vision-language pre-training (VLP) model to encode multi-modal data into features and then constructs a similarity matrix that provides soft similarity supervision for the student model, distilling the VLP model's multi-modal knowledge into the student. In addition, we design an end-to-end unsupervised hashing learning model that incorporates a graph convolutional auxiliary network; guided by the similarity matrix distilled from the teacher model, the auxiliary network aggregates information from similar data nodes to generate more consistent hash codes. Finally, the teacher network requires no additional training and only guides the student network to learn high-quality hash representations, so VLKD is efficient in both training and retrieval. A minimal code sketch of this teacher-student idea is given below.
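To make the idea concrete, here is a minimal sketch (not the authors' code) of the core mechanism: a frozen vision-language teacher supplies a soft similarity matrix, and lightweight student heads learn binary-like hash codes whose similarities reproduce it. The module names, feature sizes, similarity fusion, and loss below are illustrative assumptions, and the graph convolutional auxiliary network is omitted for brevity.

```python
# Illustrative sketch of VLP-teacher-to-hashing-student distillation.
# All sizes, the similarity fusion, and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashHead(nn.Module):
    """Student projection mapping modality features to K-bit hash logits."""
    def __init__(self, in_dim: int, n_bits: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_bits), nn.Tanh(),  # tanh keeps outputs near {-1, +1}
        )
    def forward(self, x):
        return self.net(x)

def soft_similarity(img_feat, txt_feat):
    """Teacher-side soft similarity built from frozen VLP features
    (one plausible fusion; the paper's exact construction may differ)."""
    img = F.normalize(img_feat, dim=1)
    txt = F.normalize(txt_feat, dim=1)
    return (img @ img.t() + txt @ txt.t() + img @ txt.t()) / 3.0

def distillation_loss(img_code, txt_code, soft_sim, n_bits):
    """Align hash-code similarities with the teacher's soft similarities."""
    code_sim = img_code @ txt_code.t() / n_bits  # roughly in [-1, 1]
    return F.mse_loss(code_sim, soft_sim)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-ins for features from a frozen VLP teacher (e.g. CLIP embeddings).
    img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
    img_head, txt_head = HashHead(512), HashHead(512)
    opt = torch.optim.Adam(
        list(img_head.parameters()) + list(txt_head.parameters()), lr=1e-4
    )
    soft_sim = soft_similarity(img_feat, txt_feat)  # teacher is not trained
    for step in range(5):
        loss = distillation_loss(img_head(img_feat), txt_head(txt_feat),
                                 soft_sim, n_bits=64)
        opt.zero_grad(); loss.backward(); opt.step()
    # At retrieval time, continuous codes are binarized with sign().
    img_codes = torch.sign(img_head(img_feat))
```

In this sketch the teacher features are fixed tensors, reflecting that the VLP teacher needs no additional training; only the student hash heads are optimized.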
Why is it important?
Extensive experiments on three multimedia retrieval benchmark datasets show that the proposed method achieves better retrieval performance than existing unsupervised cross-modal hashing methods, demonstrating its effectiveness.
Perspectives
I hope this article makes a small contribution to the field of cross-modal retrieval.
Lina Sun
Read the Original
This page is a summary of: Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval, June 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3591106.3592242.
You can read the full text:
Resources
Contributors
The following have contributed to this page