What is it about?
Imagine being able to search through thousands of photos just by describing something like “the brown dog lying on a cushion near the fireplace.” Most current AI systems can either find whole images that match a caption or locate objects within a single picture, but not both at once. Our work bridges that gap. We introduce a new task called Referring Expression Instance Retrieval (REIR), which asks a computer to find and highlight a specific object described in natural language across large photo collections. We also build the first large dataset for this task and design an AI model named CLARE that learns to connect words and objects more precisely. This research moves us closer to search engines that truly understand detailed visual descriptions the way people do.
Featured Image: Photo by Alina Grubnyak on Unsplash
Why is it important?
This study is the first to unify retrieval (finding the right image) and localization (pinpointing the right object) into a single, efficient framework. It comes at a time when multimodal systems like GPT-4o and GPT-4V are rapidly advancing but still struggle with fine-grained visual understanding. Our newly created REIRCOCO dataset, generated with advanced vision-language models such as GPT-4o and DeepSeek R1, offers high-quality, instance-level text-image pairs that did not exist before. Combined with the CLARE model’s innovative “Contrastive Language-Instance Alignment” learning approach, this work provides a foundation for the next generation of intelligent visual search, surveillance analysis, and assistive applications. It enables AI to handle natural, human-like visual queries with unprecedented accuracy.
Perspectives
From my own perspective, this project represents a meaningful step toward making AI perceive the world in a way that aligns with how humans describe it. Developing REIR and CLARE was not only a technical challenge but also an exploration of how language and vision intertwine in real-world understanding. I believe this research will inspire future systems that interact with users more naturally—where describing what you see or imagine is enough to find it instantly. Personally, I find it rewarding that our work contributes to bridging human expression and machine perception in such an intuitive way.
Xiangzhao Hao
University of the Chinese Academy of Sciences
Read the Original
This page is a summary of: Referring Expression Instance Retrieval and A Strong End-to-End Baseline, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3746027.3755457.