Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

arXiv:2606.26101v1 (announce type: cross)

Abstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, explicit abstention expectations, contamination-risk metadata, and dual parsing with an official strict parser p
As large language models proliferate, responsible development and deployment depend on evaluation methods that are robust to data contamination and that probe where a model's knowledge ends.
The benchmark targets a core difficulty in assessing LLM capabilities: distinguishing genuine knowledge from memorization of contaminated data and from unsupported guessing.
A contamination-aware, multi-zone design should allow more precise evaluation of LLM knowledge by reducing the ambiguity between unsupported answering and data leakage.
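To make the distinction concrete, here is a minimal Python sketch of how responses might be bucketed against an abstention expectation and grouped by contamination risk. It is an illustration only, not the benchmark's actual schema or parser: the `Item` fields, the `ABSTAIN` marker, the outcome names, and the exact-match comparison are all assumptions introduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical marker that a strict parser might require for an abstention.
ABSTAIN_TOKEN = "ABSTAIN"


@dataclass(frozen=True)
class Item:
    # All field names are illustrative assumptions, not the benchmark's schema.
    item_id: str
    expect_abstain: bool        # label fixed at build time
    gold: Optional[str]         # reference answer; None when abstention is expected
    contamination_risk: str     # e.g. "low" or "high"


def classify(item: Item, response: str) -> str:
    """Bucket one response into an outcome category (names are hypothetical)."""
    text = response.strip()
    abstained = text == ABSTAIN_TOKEN
    if item.expect_abstain:
        return "correct_abstention" if abstained else "unsupported_guess"
    if abstained:
        return "over_abstention"
    # Case-insensitive exact match stands in for a real answer parser.
    if text.lower() == (item.gold or "").lower():
        return "supported_answer"
    return "wrong_answer"


def summarize(items: List[Item], responses: Dict[str, str]) -> Counter:
    """Count outcomes per contamination-risk level."""
    counts: Counter = Counter()
    for item in items:
        outcome = classify(item, responses[item.item_id])
        counts[(item.contamination_risk, outcome)] += 1
    return counts
```

Keeping the contamination-risk level in the count key means a high rate of supported answers on high-risk items can be read separately from the same rate on low-risk items, rather than being folded into one accuracy figure.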
- LLM developers prioritizing model integrity
- Researchers in AI safety and ethics
- Enterprises deploying LLMs in critical applications
- LLMs with superficial knowledge or high contamination
- Benchmarking methodologies lacking contamination awareness
Increased focus on data purity and knowledge grounding in LLM training and fine-tuning practices.
Improved trust and reliability of large language models in diverse applications, particularly those requiring factual accuracy.
Potential for new regulatory standards or industry best practices for LLM evaluation driven by sophisticated benchmarks.