Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

arXiv:2605.27773v1 (cross-list). Abstract: When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model's chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45
As advanced language models proliferate, a deeper understanding of their internal reasoning processes becomes necessary, especially where knowledge conflict and explainability are concerned.
Understanding how AI models reconcile conflicting information is crucial for building reliable and trustworthy AI systems, particularly as they are integrated into critical decision-making processes.
This research suggests that current CoT methods, while appearing to explain decisions, may not faithfully reflect the actual mechanisms of knowledge reconciliation within the model, introducing a new layer of complexity to AI interpretability.
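The comparison reported in the abstract (how similar the reasoning text is across "flip pairs" versus same-answer pairs, summarized by an effect size d) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the simplified whitespace-token ROUGE-L, the pooled-variance form of Cohen's d, the reading of "retains 96%" as a ratio of mean similarities, and the toy scores, none of which are taken from the paper.

```python
from statistics import mean, stdev


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 over lowercase whitespace tokens (no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cohens_d(xs, ys):
    """Cohen's d using the pooled sample standard deviation."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * stdev(xs) ** 2 + (ny - 1) * stdev(ys) ** 2) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / pooled_var ** 0.5


if __name__ == "__main__":
    # Hypothetical similarity scores between paired reasoning traces.
    same_answer_sims = [0.80, 0.85, 0.90]  # both traces reach the same answer
    flip_pair_sims = [0.78, 0.82, 0.86]    # traces reach opposite answers

    # Assumed reading of "retention": ratio of mean similarities.
    retention = mean(flip_pair_sims) / mean(same_answer_sims)
    print(f"retention: {retention:.2%}")
    print(f"Cohen's d: {cohens_d(same_answer_sims, flip_pair_sims):.2f}")
```

A high retention with a small d would mean the reasoning text barely changes even when the final decision flips, which is the pattern the abstract describes. The paper's primary similarity measure is not named in the truncated abstract, so ROUGE-L stands in here only because the abstract cites it as a confirming metric.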
- AI interpretability researchers
- AI developers focused on explainable AI
- Organizations deploying AI in high-stakes environments
- Over-reliance on current CoT for truthful explanations
- Developers neglecting intrinsic model faithfulness
- Applications requiring high-fidelity mechanistic transparency
This research complicates the direct interpretation of Chain-of-Thought outputs as faithful representations of model reasoning.
It will likely spur development of more introspectively faithful interpretability methods and metrics to better understand AI decision-making under uncertainty.
This could lead to a re-evaluation of regulatory requirements for AI explainability, pushing towards methods that probe deeper than surface-level reasoning structures.
Read at arXiv cs.LG