
arXiv:2605.24904v1. Abstract: Machine-translated benchmarks are widely used to assess the multilingual capabilities of large language models (LLMs), yet translation errors in these benchmarks remain underexplored, raising concerns about the reliability and comparability of multilingual evaluation. We address two practical gaps: (i) how well automatic MQM-style error spans from LLM judges and a span-aware QE baseline (xCOMET-XXL) match expert human span annotations on benchmark translations, and (ii) how strongly translation errors (as opposed to source-side issues in the Engl
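To make gap (i) concrete, one common way to compare automatic error spans with human span annotations is character-level overlap. The sketch below is illustrative only: the excerpt does not say which agreement metric the paper uses, and the function names (`covered_chars`, `span_f1`) and the half-open offset convention are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

# Assumption: a span is a half-open [start, end) range of character offsets
# into the translated text. The paper's own span format is not given here.
Span = Tuple[int, int]


def covered_chars(spans: Iterable[Span]) -> Set[int]:
    """Return the set of character positions covered by at least one span."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_f1(predicted: Iterable[Span], gold: Iterable[Span]) -> Tuple[float, float, float]:
    """Character-level precision, recall and F1 of predicted spans against gold spans."""
    pred_chars = covered_chars(predicted)
    gold_chars = covered_chars(gold)

    # Both annotators marked nothing: treat this as perfect agreement.
    if not pred_chars and not gold_chars:
        return 1.0, 1.0, 1.0

    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical example: the human marks characters 0-9, the judge marks 5-14.
precision, recall, f1 = span_f1(predicted=[(5, 15)], gold=[(0, 10)])
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Character-level overlap gives partial credit to spans that only partly match, which suits annotations whose boundaries vary between annotators. Exact-match span scoring would be stricter.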
As multilingual LLMs proliferate, their evaluation increasingly relies on machine-translated benchmarks, so translation errors in those benchmarks directly threaten the accuracy of assessment. This research addresses that known limitation in current evaluation practice.
Reliable multilingual evaluation is critical for the global adoption and equitable development of advanced AI, because it is how developers verify that models perform consistently across languages and cultural contexts. Inaccurate evaluations can skew development priorities and lead to biased model deployments.
This research provides a framework for understanding and mitigating translation errors in multilingual LLM benchmarks, which could make model comparisons more accurate and model development better informed. It shifts the focus to the quality of the evaluation data itself, not just the model output.
- Multilingual LLM developers
- AI researchers focusing on fairness
- Translators and linguistic experts
- Developers relying solely on untrustworthy benchmarks
- Benchmarks with poor translation quality
Increased scrutiny and methodology improvements for multilingual LLM benchmarks.
More reliable cross-lingual performance comparisons, allowing for better identification of actual model capabilities versus benchmark artifacts.
Accelerated development of LLMs that are genuinely robust and equitable across diverse linguistic populations, fostering broader and safer global AI adoption.
Read at arXiv cs.CL