From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

arXiv:2606.01136v1. Abstract: Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each
As advanced LLMs proliferate, rigorous evaluation becomes essential, especially as enterprises consider them for complex, nuanced tasks in languages beyond the most widely used ones.
This research offers a more sophisticated methodology for evaluating LLM translation: instead of a single score, which can mask serious accuracy problems in culturally or historically rich texts, it compares model output against several established human translations.
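One way to picture a "reference envelope" is sketched below. This is an illustrative assumption, not the paper's method: the similarity measure (token-set Jaccard), the function names, and the outlier rule are all hypothetical. The idea is that a candidate translation is flagged only when it is farther from every human reference than the references are from each other.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def envelope_floor(references: list[str]) -> float:
    """Lowest pairwise similarity among the human references.

    Assumption: this spread stands in for 'legitimate variation'.
    """
    return min(jaccard(x, y) for x, y in combinations(references, 2))


def is_outlier(candidate: str, references: list[str]) -> bool:
    """Flag a candidate whose best match to any reference falls below
    the similarity the references have with one another."""
    best = max(jaccard(candidate, r) for r in references)
    return best < envelope_floor(references)


# Invented toy strings, not real translations of any passage.
refs = ["a b c d", "a b c e", "a b d e"]
print(is_outlier("a b c x", refs))  # inside the envelope
print(is_outlier("x y z", refs))    # outside the envelope
```

An outlier under a rule like this is only a candidate for review, not a confirmed error; deciding which outliers are actual mistakes is the adjudication step named in the title.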
Evaluation of LLM translation quality, particularly for classical or specialized languages, is shifting toward multi-reference validation, which points to a need for more robust evaluation frameworks and possibly more specialized models for niche linguistic tasks.
- LLM developers focusing on accuracy and reliable evaluation frameworks
- Human translators (as their nuanced understanding is validated)
- Researchers in humanities and classical studies
- Over-reliant adopters of off-the-shelf LLMs for complex translation
- Single-metric translation evaluation systems
Improved methodologies for auditing LLM performance in specialized linguistic tasks will emerge.
Demand for domain-specific fine-tuning or specialized LLMs will increase to address nuanced translation challenges.
The findings could prompt a re-evaluation of LLM capabilities in other highly nuanced or domain-specific applications, not just translation.
Read at arXiv cs.CL