
arXiv:2503.13445v3
Abstract: When asked to explain their decisions, LLMs can often give explanations which sound plausible to humans. But are these explanations faithful, i.e. do they convey the factors actually responsible for the decision? In this work, we analyze counterfactual faithfulness across 75 models from 13 families. We analyze the tradeoff between conciseness and comprehensiveness, how correlational faithfulness metrics assess this tradeoff, and the extent to which metrics can be gamed. This analysis motivates two new metrics: the phi-CCT, a simplified varian
The rapid advancement and deployment of LLMs necessitate a deeper understanding of their internal reasoning to ensure reliability and trustworthiness.
Understanding the faithfulness of LLM explanations is critical for their safe and effective integration into sensitive decision-making processes and for mitigating risks of AI hallucination.
The proposed metrics could allow more rigorous evaluation of LLM self-explanations, separating faithful explanations from those that merely sound plausible.
- AI safety researchers
- Developers of robust LLM applications
- Sectors requiring high interpretability (e.g., healthcare, finance)
- LLM developers relying solely on superficial explainability
- Applications with unverified LLM reasoning
- Users who implicitly trust all LLM-generated explanations
Improved methods for evaluating LLM explanations, and potentially for training models to give more faithful ones, are likely to emerge.
Increased scrutiny and regulatory pressure on explanation fidelity for AI systems will likely follow.
This could lead to a new generation of 'verified' AI models whose internal reasoning is more transparent and auditable.
Source: arXiv cs.CL