Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

arXiv:2508.20072v4. Abstract: Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order with poor performance, or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with a discrete diffusion pattern, retaining progressive refinement inside
The increased sophistication of vision-language models makes their application to robotic control a natural next step, while challenges in action decoding necessitate new architectural innovations like discrete diffusion.
By improving how efficiently and effectively VLA models translate perception into physical actions, this development is a step towards more robust and generalizable robot action execution.
Current fragmented VLA architectures may be replaced by more unified, scalable models that integrate action decoding directly into the vision-language backbone, leading to improved performance in robot tasks.
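The two ideas named in the abstract, discretizing action chunks into tokens and refining them progressively, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the uniform binning, the `NUM_BINS` and `MASK` constants, the confidence-ranked unmasking schedule, and the `toy_predictor` stand-in, which replaces the real vision-language backbone with a function that simply returns the target tokens with random confidences.

```python
import random

NUM_BINS = 256   # hypothetical size of the action-token vocabulary
MASK = -1        # hypothetical sentinel marking a still-masked position


def discretize(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map a continuous action value to a bin index in [0, num_bins - 1]."""
    clipped = min(max(action, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * num_bins), num_bins - 1)


def undiscretize(token, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map a bin index back to the center of its bin."""
    return low + (token + 0.5) * (high - low) / num_bins


def toy_predictor(tokens, target, rng):
    """Stand-in for the policy (assumption, not the paper's model).

    Returns {position: (predicted_token, confidence)} for masked positions.
    A real model would condition on images, the instruction, and the
    tokens already committed; this one returns the target with a random
    confidence so the decoding loop can be exercised.
    """
    return {i: (target[i], rng.random())
            for i, t in enumerate(tokens) if t == MASK}


def iterative_unmask(target, num_steps=4, seed=0):
    """Start fully masked and commit the most confident tokens each step."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    for step in range(num_steps):
        proposals = toy_predictor(tokens, target, rng)
        if not proposals:
            break
        remaining_steps = num_steps - step
        # Ceiling division: every position is filled by the final step.
        k = -(-len(proposals) // remaining_steps)
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            tokens[i] = proposals[i][0]
    return tokens


# Usage: a hypothetical 7-dimensional action chunk.
chunk = [0.0, -1.0, 1.0, 0.25, -0.5, 0.9, 0.1]
target = [discretize(a) for a in chunk]
decoded = iterative_unmask(target, num_steps=4)
recovered = [undiscretize(t) for t in decoded]
```

Because the toy predictor always proposes the target token, `decoded` equals `target` once the loop finishes; the point of the sketch is the order of commitment, which follows confidence rather than a fixed left-to-right position, in contrast to the autoregressive decoding the abstract criticizes.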
- Robotics companies
- AI research labs
- Automation sector
- Developers of less efficient VLA architectures
- Companies reliant on fragile robot control systems
Improved performance and broader applicability of Vision-Language-Action models in robotics.
Accelerated development of more capable and autonomous robots for diverse applications from manufacturing to service industries.
Enhanced human-robot interaction and the potential for robots to perform a wider range of complex, unscripted tasks in dynamic environments.
Read at arXiv cs.LG