C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

arXiv:2603.05167v2. Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with kn
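To make the two dimensions concrete, here is a minimal sketch of what a controlled perturbation of a reasoning chain could look like. It is an illustration only, not the benchmark's actual construction: the names `PerturbedChain`, `drop_step`, and `replace_step` are hypothetical, and the assumption that each perturbation records the altered position is ours, not taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class PerturbedChain:
    """A reasoning chain plus a record of how it was altered (hypothetical type)."""
    steps: list[str]
    kind: str   # "coverage" or "causality"
    index: int  # position in the original chain that was altered


def drop_step(steps: list[str], index: int) -> PerturbedChain:
    """Coverage-style perturbation: remove one intermediate step."""
    if not 0 <= index < len(steps):
        raise IndexError("index out of range")
    return PerturbedChain(steps[:index] + steps[index + 1:], "coverage", index)


def replace_step(steps: list[str], index: int, distractor: str) -> PerturbedChain:
    """Causality-style perturbation: swap one step for text that does not follow."""
    if not 0 <= index < len(steps):
        raise IndexError("index out of range")
    altered = list(steps)  # copy so the original chain is left untouched
    altered[index] = distractor
    return PerturbedChain(altered, "causality", index)


chain = ["x + 2 = 5", "x = 5 - 2", "x = 3"]
missing = drop_step(chain, 1)
broken = replace_step(chain, 1, "x = 5 * 2")
```

In this sketch, `missing` tests whether a judge notices an absent inference, while `broken` tests whether it notices a step that does not follow from the prior context.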
As LLMs are increasingly used as evaluators, benchmarks are needed to check whether their judgments of complex reasoning can be trusted, a critical step for developing more capable AI systems.
Reliable evaluation of LLM reasoning, beyond superficial plausibility, is crucial for advancing AI's safety, robustness, and ultimately, its utility in high-stakes applications.
Separating faithfulness into causality and coverage gives a more precise basis for deciding when to trust LLM judgments and for iterating on their development.
- AI developers
- AI researchers
- Companies adopting LLM evaluation
- LLM developers without strong evaluation methods
Improved methods for evaluating AI reasoning will accelerate the development of more robust and reliable large language models.
Enhanced LLM judgment capabilities could lead to more sophisticated AI agents capable of autonomous decision-making in complex environments.
The development of highly faithful AI judges might blur the lines between human and AI evaluation, potentially reshaping knowledge creation and validation processes.
Read at arXiv cs.AI