What is it about?

This work studies how a robot can understand human instructions when those instructions are incomplete or ambiguous. In real-world settings, a robot often cannot see both the person and the target object at the same time. For example, a person might say “get the blue book” while pointing, but the robot must move around to figure out which book is meant. We develop a system that combines three types of information: what the person says (language), where they point (gesture), and what the robot sees (vision). Instead of treating these signals as certain, the robot keeps track of uncertainty—both about which object is the target and where it is located. As the robot moves and observes more of the environment, it updates its belief and becomes more confident about the correct object. We test this system in simulation and on a real robot, showing that combining language and gesture helps the robot find objects more reliably than using only one type of input.
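To make the idea of belief updating concrete, the sketch below illustrates (in simplified form, not the authors' implementation) how a robot might fuse a language match score and a pointing-direction score into a normalized belief over candidate objects. All object names, positions, scoring functions, and parameters here are hypothetical assumptions chosen for illustration only.

import math

# Hypothetical candidate objects the robot has detected so far,
# each with a 2D position and a descriptive label (illustrative only).
candidates = {
    "blue_book": {"pos": (2.0, 1.0), "label": "blue book"},
    "red_book":  {"pos": (2.2, 1.5), "label": "red book"},
    "blue_mug":  {"pos": (0.5, 3.0), "label": "blue mug"},
}

def language_likelihood(utterance, label):
    """Crude word-overlap score standing in for a real language grounding model."""
    words_u, words_l = set(utterance.lower().split()), set(label.split())
    overlap = len(words_u & words_l)
    return 0.1 + overlap  # small floor so no candidate gets zero probability

def gesture_likelihood(point_origin, point_dir, obj_pos, kappa=4.0):
    """Score how well a pointing ray explains an object's position."""
    dx, dy = obj_pos[0] - point_origin[0], obj_pos[1] - point_origin[1]
    angle_to_obj = math.atan2(dy, dx)
    point_angle = math.atan2(point_dir[1], point_dir[0])
    diff = math.atan2(math.sin(angle_to_obj - point_angle),
                      math.cos(angle_to_obj - point_angle))
    return math.exp(-kappa * diff ** 2)  # falls off as the object leaves the pointing direction

def fuse_belief(utterance, point_origin, point_dir):
    """Combine language and gesture evidence into a normalized belief over candidates."""
    scores = {}
    for name, obj in candidates.items():
        scores[name] = (language_likelihood(utterance, obj["label"])
                        * gesture_likelihood(point_origin, point_dir, obj["pos"]))
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

belief = fuse_belief("get the blue book", point_origin=(0.0, 0.0), point_dir=(1.0, 0.6))
print(belief)  # the blue book dominates once language and gesture cues agree

In the same spirit, when the robot moves and sees new objects, the candidate set and belief would be updated again, so confidence in the correct object grows as evidence accumulates.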

Why is it important?

This work is important because it provides a principled way for robots to reason under uncertainty while using multiple forms of human communication. By explicitly modeling both what the user means and where the object might be, the system can handle more realistic scenarios where information is incomplete. This approach moves robots closer to interacting with humans in natural ways—using speech and gestures—without requiring perfectly structured commands or fully observable environments.

Perspectives

This work is motivated by the goal of combining the controllability of a modular framework with the capabilities of modern AI models. Doing so extends robot functionality to more open, unstructured, real-world environments, moving toward robots that can operate in everyday life.

Ivy He
Brown University

Read the Original

This page is a summary of: LEGS-POMDP: Language and Gesture-Guided Object Search in Partially Observable Environments, March 2026, ACM (Association for Computing Machinery),
DOI: 10.1145/3757279.3785585.
You can read the full text via the DOI above.
