
arXiv:2606.17188v1 (cross-listed). Abstract: Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in ano
As advanced vision-language models proliferate, they need more robust and linguistically comprehensive evaluation benchmarks, which reveal deficiencies that narrower evaluations masked.
The benchmark highlights a critical blind spot in current VLM development and evaluation: 'multilingual' claims often fail to account for script diversity within a single language, which affects equitable AI access and performance.
To genuinely serve global language diversity, VLM development will need to explicitly incorporate script consistency and multi-script language support, moving beyond the one-language-one-script assumption.
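One way to make "script consistency" concrete is to compare a model's correctness on the same item across all three scripts. The sketch below is illustrative only: the function name `script_gap_report`, the input format, and the three reported quantities are assumptions made for this example, not the metrics defined in the PuMVR paper.

```python
from collections import defaultdict

# Script labels follow the three Punjabi scripts named in the abstract.
SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def script_gap_report(results):
    """Summarise cross-script behaviour (hypothetical helper).

    results: iterable of (item_id, script, is_correct) tuples, one per
    model answer. This input format is an assumption for illustration.
    """
    by_item = defaultdict(dict)
    for item_id, script, is_correct in results:
        by_item[item_id][script] = bool(is_correct)

    # Keep only items answered in every script so the comparison is parallel.
    parallel = [r for r in by_item.values() if all(s in r for s in SCRIPTS)]
    n = len(parallel)
    if n == 0:
        raise ValueError("no item has results for all scripts")

    # Per-script accuracy over the parallel items.
    accuracy = {s: sum(r[s] for r in parallel) / n for s in SCRIPTS}

    # An item is inconsistent if it is solved in one script but failed in another.
    inconsistent = sum(1 for r in parallel if len({r[s] for s in SCRIPTS}) > 1)

    return {
        "accuracy": accuracy,
        "gap": max(accuracy.values()) - min(accuracy.values()),
        "inconsistency_rate": inconsistent / n,
    }
```

Reporting the inconsistency rate alongside per-script accuracy matters because two scripts can have similar aggregate accuracy while the model still succeeds and fails on different items in each.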
- Developers of inclusive AI models
- Researchers specializing in less-resourced languages
- Multilingual user communities
- VLM developers using narrow evaluation benchmarks
- Users of multi-script languages relying on current 'multilingual' VLMs
- Companies aiming for global AI adoption without addressing script gaps
This benchmark will likely lead to a re-evaluation and retraining of existing state-of-the-art vision-language models.
Increased investment in data collection and model architectures explicitly designed to handle multi-script languages will follow.
The concept of 'multilingual AI' will be redefined to include script-awareness, influencing future policy and funding for AI development.
Read at arXiv cs.CL