CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

arXiv:2606.03650v1. Abstract: Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists, and standard public benchmarks cannot be trusted, their items having likely leaked into pretraining, so scores reflect memorization rather than fitness. We present CoEval, an open-source, reusable framework that closes this gap end to end: from only a description of a task or domain, teacher models synthesize a fresh, attribute-controlled benchmark with no human labels, contamination-free because items are generated anew on each ru
The rapid proliferation of diverse language models and the increasing recognition of benchmark contamination have created an urgent need for more robust evaluation methods.
CoEval offers a contamination-free, task-specific way to evaluate and rank language models, which matters for deploying them effectively in real-world applications.
Generating custom, uncontaminated benchmarks without human labels could change how organizations select and validate AI models, reducing their reliance on public benchmarks that may have leaked into pretraining data.
- Businesses adopting AI
- AI model developers aiming for transparency
- AI evaluation platforms
- Providers of contaminated benchmarks
- Language models optimized solely for public benchmarks
Enterprise adoption of language models is likely to accelerate as confidence in model selection increases.
Expect less focus on 'winning' public benchmarks and more emphasis on measured performance on the actual task.
This could lead to a more meritocratic ecosystem for language models, where real-world applicability trumps academic benchmark scores.
Read at arXiv cs.CL