3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

arXiv:2606.31329v1 (cross-listed). Abstract: Hierarchical Vision-Language-Action (VLA) models decouple high-level planning from low-level control to improve generalization in robot manipulation. Recent work in this paradigm uses 2D end-effector trajectories predicted by a Vision-Language Model (VLM) as explicit guidance for a downstream policy. However, state-of-the-art low-level policies operate in 3D metric space on point clouds, and feeding them 2D guidance that lacks depth forces each waypoint to be assigned the depth of whatever scene surface lies beneath it, producing geometrically
As Vision-Language Models (VLMs) take on high-level planning while low-level robot policies operate on 3D point clouds, the hand-off between 2D plans and 3D execution has become a weak point.
Closing that gap targets more robust and generalizable robot manipulation, which matters to industries that depend on autonomous systems.
With 3D trajectory guidance, a low-level policy can receive waypoints that carry their own depth instead of inheriting the depth of the scene surface beneath them, which should improve task success and adaptability in complex environments.
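The depth problem described in the abstract can be illustrated with a small sketch. It is not the paper's method: the function names, the toy depth map, and the camera intrinsics are all hypothetical, and the camera is assumed to be a top-down pinhole camera. The sketch lifts a 2D pixel path to 3D by reading depth from the surface under each pixel, then compares it with a path whose depths are specified explicitly.

```python
# Illustrative sketch only. All names and numbers are assumptions, not taken from the paper.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def lift_with_surface_depth(waypoints_2d, depth_map, intrinsics):
    """Lift 2D waypoints by taking the depth of whatever surface lies under each pixel."""
    fx, fy, cx, cy = intrinsics
    return [backproject(u, v, depth_map[v][u], fx, fy, cx, cy) for u, v in waypoints_2d]

# Toy 4x4 depth map in metres, seen from above: the table is 1.0 m from the camera,
# and a box in columns 1-2 is 0.8 m away (it stands 0.2 m above the table).
depth_map = [[1.0, 0.8, 0.8, 1.0] for _ in range(4)]
intrinsics = (2.0, 2.0, 2.0, 2.0)  # hypothetical fx, fy, cx, cy

# A 2D path along image row 2 that passes over the box.
waypoints_2d = [(0, 2), (1, 2), (2, 2), (3, 2)]

# 2D guidance: every waypoint is pinned to the surface below it.
surface_path = lift_with_surface_depth(waypoints_2d, depth_map, intrinsics)

# 3D guidance: the planner states the depth itself, here a constant 0.6 m
# from the camera, so the path clears the box instead of tracing its top.
explicit_depths = [0.6, 0.6, 0.6, 0.6]
explicit_path = [
    backproject(u, v, d, *intrinsics)
    for (u, v), d in zip(waypoints_2d, explicit_depths)
]

surface_depths = [p[2] for p in surface_path]
explicit_path_depths = [p[2] for p in explicit_path]
```

In this toy scene the surface-lifted path follows the table and box top (depths 1.0, 0.8, 0.8, 1.0), while the explicit path holds a constant 0.6 m. The two paths share the same pixels but differ in 3D, which is the ambiguity that depth-free guidance leaves unresolved.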
- Robotics companies
- AI hardware manufacturers
- Logistics and manufacturing sectors
- Companies relying on less sophisticated robotic automation
More reliable and adaptable robot manipulation for industrial and service applications.
Accelerated deployment of advanced robotics in unstructured environments, impacting labor markets.
Enhanced human-robot collaboration as robots gain finer manipulation and understanding of physical space.
Read at arXiv cs.AI