When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

arXiv:2606.11906v1. Abstract: Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translating the LIBERO benchmark into ten languages, revealing severe performance degradation under non-English instructions, with success rates dropping by 30-50%. Through fine-grained analysis of task executions, we find that language influence is highly non-uniform across steps
This paper systematically evaluates how multilingual instructions affect Vision-Language-Action models and exposes a critical limitation as AI deployment expands globally.
Strategic readers should care because robustness to linguistic variation determines whether VLA models can be deployed globally and whether they work fairly and effectively beyond English-speaking contexts.
Assessing AI model robustness requires more than purely technical metrics: linguistic and cultural sensitivity matter too, and the results indicate that current VLA models are not universally applicable without significant adaptation.
- AI researchers focused on multilingual NLP
- Companies developing localized AI solutions
- Open-source initiatives for diverse language datasets
- VLA model developers prioritizing English-only training
- Companies deploying unadapted VLA models globally
- Global consumers of AI services reliant on non-English instructions
Immediate performance degradation of VLA models when given non-English instructions.
Increased investment in multilingual AI research and development to address performance disparities.
The emergence of 'language-centric AI' as a distinct and critical subfield, potentially leading to new industry standards for linguistic robustness.
Read at arXiv cs.CL