QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li; Yuhui Chen; Mingcai Zhou; Haoran Li

doi:10.65109/ljrk3716

What is it about?

Robots often struggle with fine-grained manipulation tasks because they lack strong 3D spatial understanding. While recent Vision–Language–Action (VLA) models can follow language instructions and recognize objects, they may fail to accurately reason about object positions, distances, and gripper interactions during complex manipulation. In this work, we propose QDepth-VLA, a framework that improves robotic spatial reasoning by introducing an auxiliary depth prediction task. Instead of directly predicting noisy pixel-level depth maps, our method learns compact quantized depth representations that better capture important geometric structures. We further design a dedicated “Depth Expert” module so that the robot can learn spatial cues without disrupting the original vision-language understanding capabilities. Experiments on both simulation benchmarks and real-world robotic tasks show that QDepth-VLA significantly improves manipulation performance and spatial reasoning ability. Our approach demonstrates that incorporating efficient geometric supervision can make VLA models more reliable for future embodied AI and robotic applications.
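To make the idea of quantized depth as auxiliary supervision concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not describe the actual depth tokenizer or the Depth Expert architecture, so uniform binning of patch-averaged depth stands in as an assumed quantizer. The function names (`quantize_depth`, `cross_entropy`, `total_loss`) and the parameters `patch`, `num_bins`, `max_depth`, and `weight` are all hypothetical choices for illustration.

```python
import math

def quantize_depth(depth, patch=2, num_bins=8, max_depth=4.0):
    """Map a dense depth map (list of rows, in metres) to one token per patch.

    Assumption: uniform binning of the patch mean stands in for whatever
    quantizer the actual method uses.
    """
    tokens = []
    for r in range(0, len(depth), patch):
        row_tokens = []
        for c in range(0, len(depth[0]), patch):
            cells = [depth[i][j]
                     for i in range(r, min(r + patch, len(depth)))
                     for j in range(c, min(c + patch, len(depth[0])))]
            mean = sum(cells) / len(cells)
            # Clamp to [0, max_depth], then map to an integer bin index.
            clamped = min(max(mean, 0.0), max_depth)
            bin_id = min(int(clamped / max_depth * num_bins), num_bins - 1)
            row_tokens.append(bin_id)
        tokens.append(row_tokens)
    return tokens

def cross_entropy(logits, target):
    """Numerically stable cross-entropy for one token (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def total_loss(action_loss, depth_logits, depth_tokens, weight=0.1):
    """Action loss plus a weighted auxiliary depth-token loss.

    depth_logits: one list of num_bins scores per patch, in row-major order
    (hypothetical output of a separate depth prediction head).
    """
    targets = [t for row in depth_tokens for t in row]
    depth_loss = sum(cross_entropy(l, t)
                     for l, t in zip(depth_logits, targets)) / len(targets)
    return action_loss + weight * depth_loss

# Toy 2x4 depth map: a near surface on the left, a far one on the right.
toy_depth = [[0.5, 0.5, 3.9, 3.9],
             [0.5, 0.5, 3.9, 3.9]]
toy_tokens = quantize_depth(toy_depth)
uniform_logits = [[0.0] * 8 for _ in range(2)]
toy_total = total_loss(1.0, uniform_logits, toy_tokens)
```

The point of the sketch is the shape of the objective: the depth target is a short sequence of discrete tokens rather than a per-pixel map, and its loss is added to the action loss with a small weight so that it shapes spatial features without dominating training.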

This page is a summary of: QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models, International Foundation for Autonomous Agents and Multiagent Systems, DOI: 10.65109/ljrk3716.

Contributors

The following have contributed to this page

Li Yixuan
University of the Chinese Academy of Sciences
