Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs

arXiv:2606.29350v1 (cross-listed). Abstract: Vision-language models and vision-language action models endow the robot with unprecedented capabilities. However, the input of video and high-resolution images yields a massive number of visual tokens, leading to extremely high inference latency and severely hindering the robot's real-time control. To break through this computational bottleneck, we propose ST-Merge, a plug-and-play, training-free framework that efficiently fuses redundant tokens directly during the visual encoding phase. By explicitly constructing 3D spatiotemporal coordinates
Vision-Language Models (VLMs) and Vision-Language Action Models (VLAs) are extending what robots can do, but they also expose a computational bottleneck: processing video and high-resolution images in real time.
Low-latency inference is a prerequisite for real-time control and interaction, and therefore for deploying these models on robots in dynamic environments.
ST-Merge is presented as a plug-and-play, training-free way to reduce inference latency in robotic VLMs/VLAs by merging redundant visual tokens during visual encoding, which would make real-time control more feasible and widen the range of embodied-AI applications.
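To make the idea of merging redundant visual tokens concrete, here is a minimal NumPy sketch of generic spatio-temporal token merging. It is not ST-Merge's actual algorithm, which the truncated abstract does not specify. The function name `merge_tokens_3d`, the fixed block size, and the cosine-similarity threshold are all illustrative assumptions. The sketch groups tokens into non-overlapping (time, height, width) blocks and replaces a block with its mean only when every token in it is close to that mean.

```python
import numpy as np

def merge_tokens_3d(tokens, block=(2, 2, 2), threshold=0.9):
    """Merge redundant tokens inside non-overlapping (t, h, w) blocks.

    Hypothetical helper for illustration only; not the paper's method.
    tokens: array of shape (T, H, W, D), one D-dim feature per patch per frame.
    Returns (features, coords) with shapes (N, D) and (N, 3).
    """
    T, H, W, D = tokens.shape
    bt, bh, bw = block
    feats, coords = [], []
    for t0 in range(0, T, bt):
        for h0 in range(0, H, bh):
            for w0 in range(0, W, bw):
                cell = tokens[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw]
                ct, ch, cw = cell.shape[:3]
                flat = cell.reshape(-1, D)
                # (t, h, w) coordinate of every token in this block,
                # in the same order as the rows of `flat`.
                grid = np.stack(
                    np.meshgrid(
                        np.arange(t0, t0 + ct),
                        np.arange(h0, h0 + ch),
                        np.arange(w0, w0 + cw),
                        indexing="ij",
                    ),
                    axis=-1,
                ).reshape(-1, 3).astype(float)
                mean = flat.mean(axis=0)
                # Cosine similarity of each token to the block mean.
                denom = np.linalg.norm(flat, axis=1) * np.linalg.norm(mean) + 1e-8
                sim = flat @ mean / denom
                if sim.min() >= threshold:
                    # Redundant block: keep one token at the block centroid.
                    feats.append(mean[None])
                    coords.append(grid.mean(axis=0, keepdims=True))
                else:
                    # Informative block: keep every token unchanged.
                    feats.append(flat)
                    coords.append(grid)
    return np.concatenate(feats), np.concatenate(coords)
```

On a 4×4×4 grid of identical tokens, this sketch reduces 64 tokens to 8, one per block. If a single token in one block differs sharply from its neighbours, that block is left intact and the result has 15 tokens. A real system would likely use a more adaptive grouping than fixed blocks, but the trade-off is the same: fewer tokens passed downstream in exchange for a small similarity computation.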
- Robotics companies
- AI hardware manufacturers
- Logistics and manufacturing sectors
- Research institutions in AI/robotics
- Developers reliant on high-latency visual processing
- Specialized hardware for brute-force visual processing
More responsive and capable robots could become commercially viable across a wider range of applications.
Development and adoption of general-purpose humanoid robots and autonomous agents may accelerate as their real-time cognitive abilities improve.
Demand may grow for robust, efficient AI models that integrate cleanly with physical systems, potentially shifting investment toward embodied AI research.