
arXiv:2509.26600v2
Abstract: As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a cheap alternative to human curation. We show that this paradigm has a fundamental problem: LLM-generated benchmarks systematically favor the model that created them. Using machine translation as our primary testbed, we find that self-bias arises from two additive sources, LLM-as-a-testset and LLM-as-an-evaluat
As LLMs rapidly advance and saturate existing benchmarks, automated evaluation methods become necessary, which increases reliance on the LLM-as-a-benchmark paradigm.
The finding that LLM-generated benchmarks favor the model that created them reveals a fundamental flaw in automated LLM evaluation: a systematic self-bias that compromises the objectivity and reliability of benchmark results used in AI development and deployment.
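One rough way to picture self-bias is as the gap between a model's score on the benchmark it created and its average score on benchmarks created by other models. The sketch below is a minimal illustration under that assumption, not the paper's method: the function name `self_bias`, the `scores[creator][candidate]` layout, the model names, and all numbers are hypothetical. This crude gap also does not control for differences in benchmark difficulty.

```python
from statistics import mean


def self_bias(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Estimate a per-model self-bias gap (hypothetical helper).

    scores[creator][candidate] is the score `candidate` receives on the
    benchmark generated and judged by `creator`. Needs at least two models.
    """
    bias = {}
    for model in scores:
        # Score the model gets on its own benchmark.
        own = scores[model][model]
        # Scores the same model gets on benchmarks built by other models.
        others = [scores[creator][model] for creator in scores if creator != model]
        bias[model] = own - mean(others)
    return bias


# Made-up numbers, purely for illustration.
scores = {
    "model_a": {"model_a": 0.80, "model_b": 0.70},
    "model_b": {"model_a": 0.72, "model_b": 0.78},
}

for name, gap in self_bias(scores).items():
    print(f"{name}: self-bias gap = {gap:+.2f}")
```

A positive gap means the model does better on its own benchmark than on benchmarks built by others, which is the pattern the abstract describes.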
Scores from LLM-generated benchmarks must now be read as biased toward the model that produced the benchmark. This calls for new approaches to objective evaluation and may slow progress that depends on automated benchmarks.
- Human evaluators
- Independent AI audit firms
- Researchers focused on bias mitigation in AI
- LLM developers relying solely on self-benchmarking
- Organizations making decisions based on biased LLM performance metrics
- Automated benchmark creation methodologies
AI models will continue to exhibit self-favoring biases in automated evaluations, potentially leading to inflated performance claims and misdirected development.
There will be a renewed emphasis on human-in-the-loop evaluation and the development of new, more robust, and bias-resistant benchmarking methodologies.
The perceived progress and trustworthiness of LLMs could be undermined, impacting investment and adoption unless transparent and unbiased evaluation standards are established.
Read at arXiv cs.CL