Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

arXiv:2604.04532v2. Abstract: Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge's language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44
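The run count quoted in the abstract can be sanity-checked with a short sketch. It assumes a full factorial design with one judge run per (language, task, framework, backbone) combination, which the stated total is consistent with but which the abstract does not spell out. The language names and the counts come from the abstract; the task, framework, and backbone labels are hypothetical placeholders, since the abstract does not list them.

```python
from itertools import product

# Language names and all four counts are taken from the abstract.
LANGUAGES = ["English", "Arabic", "Turkish", "Chinese", "Hindi"]

# Hypothetical placeholder labels: the abstract gives only the counts
# (55 tasks, 3 developer-agent frameworks, 6 judge backbones).
TASKS = [f"task_{i:02d}" for i in range(1, 56)]
FRAMEWORKS = [f"framework_{i}" for i in range(1, 4)]
BACKBONES = [f"backbone_{i}" for i in range(1, 7)]


def judge_runs():
    """Enumerate one judge run per combination (assumed full factorial grid)."""
    return list(product(LANGUAGES, TASKS, FRAMEWORKS, BACKBONES))


if __name__ == "__main__":
    runs = judge_runs()
    # 5 languages * 55 tasks * 3 frameworks * 6 backbones
    print(len(runs))
```

Under that assumption the grid has 5 × 55 × 3 × 6 = 4950 cells, matching the total reported in the abstract.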
The increasing sophistication and proliferation of AI agents demand more robust and culturally aware evaluation methods, highlighting the limitations of English-centric benchmarks.
This research shows that measured AI agent performance is sensitive to the language used for evaluation, to the point that changing the judge's language can invert backbone rankings. That sensitivity creates both challenges and opportunities for global AI deployment and development.
The findings challenge the assumption that English-based evaluations are representative of AI agent performance in general, which argues for multilingual testing and localization strategies.
- Developers of multilingual AI models
- AI localization specialists
- Organizations targeting diverse global markets
- AI developers relying solely on English benchmarks
- Companies with single-language AI deployment strategies
AI models will need to be evaluated and possibly tuned for performance across multiple languages to ensure robustness and fairness.
The development of AI agents will increasingly incorporate multilingual training and localization from the outset, changing standard development pipelines.
Global competition in AI may intensify if non-English-speaking nations develop stronger localized AI agents, potentially shifting technological leadership.
Read at arXiv cs.CL