SIGNAL · AI · Jun 30, 2026, 4:00 AM · Signal 75 · Short term

Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs

arXiv:2606.29350v1 Announce Type: cross Abstract: Vision-language models and vision-language action models endow the robot with unprecedented capabilities. However, the input of video and high-resolution images yields a massive number of visual tokens, leading to extremely high inference latency and severely hindering the robot's real-time control. To break through this computational bottleneck, we propose ST-Merge, a plug-and-play, training-free framework that efficiently fuses redundant tokens directly during the visual encoding phase. By explicitly constructing 3D spatiotemporal coordinates
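The abstract above is cut off before it says how ST-Merge uses its 3D coordinates, so the following is not the paper's algorithm. It is a minimal sketch of the general idea of training-free token merging, assuming a hypothetical token grid of shape (T, H, W, D): each token is tagged with a (t, y, x) coordinate, and tokens at the same patch position are averaged across frames whenever their cosine similarity exceeds a threshold. The function name `merge_tokens`, the `threshold` parameter, and the grid sizes are illustrative assumptions.

```python
import numpy as np

def merge_tokens(tokens, threshold=0.9):
    """Merge temporally redundant tokens (illustrative, not ST-Merge itself).

    tokens: array of shape (T, H, W, D), one D-dim token per patch per frame.
    Returns (merged, coords): merged has shape (N, D); coords has shape (N, 3)
    and holds the mean (t, y, x) position of the tokens in each merged group.
    """
    T, H, W, D = tokens.shape
    merged, coords = [], []
    for y in range(H):
        for x in range(W):
            group, times = [tokens[0, y, x]], [0]
            for t in range(1, T):
                tok = tokens[t, y, x]
                ref = np.mean(group, axis=0)  # running mean of the current group
                denom = np.linalg.norm(tok) * np.linalg.norm(ref) + 1e-8
                sim = float(tok @ ref) / denom  # cosine similarity
                if sim >= threshold:
                    # Redundant with the current group: fold it in.
                    group.append(tok)
                    times.append(t)
                else:
                    # Content changed: emit the group and start a new one.
                    merged.append(np.mean(group, axis=0))
                    coords.append((np.mean(times), y, x))
                    group, times = [tok], [t]
            merged.append(np.mean(group, axis=0))
            coords.append((np.mean(times), y, x))
    return np.stack(merged), np.array(coords, dtype=float)

# Hypothetical input: a perfectly static 6-frame clip on a 4x4 patch grid.
rng = np.random.default_rng(0)
frame = rng.normal(size=(4, 4, 8))
video = np.repeat(frame[None], 6, axis=0)  # shape (6, 4, 4, 8)

merged, coords = merge_tokens(video)
print(video.shape[0] * 16, "->", len(merged))  # 96 -> 16
```

In this static clip every patch is identical across frames, so the 96 input tokens collapse to 16, one per patch position. Real footage would merge less, and the saving would depend on the threshold chosen.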

Why this matters

Why now

Vision-Language Models (VLMs) and Vision-Language Action Models (VLAs) are rapidly extending what robots can do, but they also expose a computational bottleneck: video and high-resolution images produce so many visual tokens that inference becomes too slow for real-time control.

Why it’s important

Low-latency inference is a prerequisite for real-time control and interaction, and therefore for deploying VLM- and VLA-driven robots in dynamic environments.

What changes

ST-Merge is proposed as a plug-and-play, training-free framework that merges redundant visual tokens during visual encoding. If it delivers the latency reduction it targets, real-time control with robotic VLMs/VLAs becomes more feasible, which would broaden the applications open to embodied AI.

Winners

· Robotics companies
· AI hardware manufacturers
· Logistics and manufacturing sectors
· Research institutions in AI/robotics

Losers

· Developers reliant on high-latency visual processing
· Specialized hardware for brute-force visual processing

Second-order effects

Direct

More responsive and capable robots become commercially viable in various applications.

Second

Faster real-time perception and reasoning could accelerate the development and adoption of general-purpose humanoid robots and autonomous agents.

Third

Increased demand for robust and efficient AI models that can integrate seamlessly with physical systems, potentially shifting investment towards embodied AI research.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.AI

#cs.CV #cs.AI

Tracked by The Continuum Brief · live intelligence network
