Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

arXiv:2510.07061v2. Abstract: While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 29 automatic metrics with human judgments across six major
The rapid development and deployment of AI models for diverse global populations call for a reassessment of automatic evaluation metrics, to confirm that they are effective and fair across languages.
Accurate and reliable evaluation metrics are critical for guiding the development of robust AI systems for languages beyond English and other high-resource languages. For Indian languages alone, this affects over 1.5 billion speakers.
This research provides a benchmark (ITEM) to systematically assess existing metrics, potentially leading to the adoption of more appropriate evaluation standards for Indian languages, thus influencing future MT and TS model development.
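As a rough illustration of what "alignment with human judgments" can mean in practice, the sketch below correlates automatic metric scores with human ratings over the same outputs. It is a minimal sketch under stated assumptions: the brief does not say which correlation statistics ITEM uses, so Pearson and Kendall tau-a are stand-ins, and the metric names and all numbers are invented toy data, not results from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / all pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical human ratings for five system outputs (toy data).
human = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical scores from two made-up metrics on the same five outputs.
metric_scores = {
    "metric_a": [0.10, 0.25, 0.30, 0.55, 0.70],
    "metric_b": [0.60, 0.20, 0.50, 0.40, 0.30],
}

# Alignment of each metric with the human ratings.
alignment = {
    name: {
        "pearson": pearson(scores, human),
        "kendall": kendall_tau_a(scores, human),
    }
    for name, scores in metric_scores.items()
}

for name, stats in alignment.items():
    print(f"{name}: pearson={stats['pearson']:.3f} kendall={stats['kendall']:.3f}")
```

In this toy setup, a metric whose ranking of outputs matches the human ranking scores high on both statistics, while one that ranks outputs differently scores low or negative. Repeating such a comparison per language and per task is one way a benchmark could expose metrics that behave well in English but poorly elsewhere.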
- Indian language AI users
- Developers of Indian language MT/TS models
- Linguistic diversity advocates
- AI evaluation metrics developed solely for English
- Generative AI models with poor performance in Indian languages
Improved machine translation and summarization quality for Indian languages due to better evaluation metrics.
Increased investment and research into AI models specifically tailored for Indian languages, fostering local AI ecosystems.
Reduced digital divide for Indian language speakers and accelerated digital transformation within India through more relevant AI applications.
Read at arXiv cs.CL