What is it about?
Imagine being able to search through thousands of photos just by describing something like “the brown dog lying on a cushion near the fireplace.” Most current AI systems can either find whole images that match a caption or locate objects within a single picture, but not both at once. Our work bridges that gap. We introduce a new task called Referring Expression Instance Retrieval (REIR), which asks a computer to find and highlight a specific object described in natural language across large photo collections. We also build the first large dataset for this task and design an AI model named CLARE that learns to connect words and objects more precisely. This research moves us closer to search engines that truly understand detailed visual descriptions the way people do.
Featured Image: Photo by Alina Grubnyak on Unsplash
Why is it important?
This study is the first to unify retrieval (finding the right image) and localization (pinpointing the right object) into a single, efficient framework. It comes at a time when multimodal systems like GPT-4o and GPT-4V are rapidly advancing but still struggle with fine-grained visual understanding. Our newly created REIRCOCO dataset, generated with advanced vision-language models such as GPT-4o and DeepSeek R1, offers high-quality, instance-level text-image pairs that did not exist before. Combined with the CLARE model’s innovative “Contrastive Language-Instance Alignment” learning approach, this work provides a foundation for the next generation of intelligent visual search, surveillance analysis, and assistive applications. It enables AI to handle natural, human-like visual queries with unprecedented accuracy.
Perspectives
From my own perspective, this project represents a meaningful step toward making AI perceive the world in a way that aligns with how humans describe it. Developing REIR and CLARE was not only a technical challenge but also an exploration of how language and vision intertwine in real-world understanding. I believe this research will inspire future systems that interact with users more naturally—where describing what you see or imagine is enough to find it instantly. Personally, I find it rewarding that our work contributes to bridging human expression and machine perception in such an intuitive way.
Xiangzhao Hao
University of the Chinese Academy of Sciences
Read the Original
This page is a summary of: Referring Expression Instance Retrieval and A Strong End-to-End Baseline, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3746027.3755457.