
arXiv:2509.23782v4
Abstract: While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questions (MCQs) even if they encode correct answers in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on MCQs through a three-step analysis of hidden representations. First, we quantify the prevalence and magnitu
The rapid advancement and deployment of large language models has put their erratic behavior, and their unfaithfulness to their own internal knowledge, under increased scrutiny, driving research into trustworthiness.
Improving the alignment between LLMs' internal knowledge and their output behavior is critical for their reliability, safety, and broader adoption in high-stakes applications.
This research quantifies and mitigates the 'knowledge-prediction gap' in LLMs on multiple-choice questions through a three-step analysis of hidden representations, a step toward more robust and trustworthy AI systems.
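
To make the idea of a knowledge-prediction gap concrete, here is a minimal sketch, not the paper's actual procedure. It assumes that for each question we already have the gold answer, the option the model output, and the option decoded from its hidden states by some probe. The names `McqRecord` and `knowledge_prediction_gap`, the `probe` field, and the two metrics (accuracy difference and "known but missed" rate) are all illustrative assumptions, and the records are toy data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class McqRecord:
    """One multiple-choice question (all fields are illustrative assumptions)."""
    gold: str    # correct option label, e.g. "B"
    output: str  # option the model actually answered
    probe: str   # option decoded from hidden states by a hypothetical probe


def knowledge_prediction_gap(records: Sequence[McqRecord]) -> dict:
    """Compare what a probe recovers from hidden states with what the model outputs.

    The metric definitions below are assumptions made for this sketch.
    """
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    output_correct = sum(r.output == r.gold for r in records)
    probe_correct = sum(r.probe == r.gold for r in records)
    # Questions where the hidden state encodes the answer but the output is wrong.
    known_but_missed = sum(
        r.probe == r.gold and r.output != r.gold for r in records
    )
    return {
        "output_accuracy": output_correct / n,
        "probe_accuracy": probe_correct / n,
        "gap": (probe_correct - output_correct) / n,
        "known_but_missed_rate": known_but_missed / n,
    }


# Toy data, invented for illustration only.
records = [
    McqRecord(gold="A", output="A", probe="A"),
    McqRecord(gold="B", output="C", probe="B"),
    McqRecord(gold="C", output="D", probe="C"),
    McqRecord(gold="D", output="D", probe="A"),
]
print(knowledge_prediction_gap(records))
```

On this toy set the probe is right on three of four questions and the output on two, so the sketch reports a gap of 0.25 and a "known but missed" rate of 0.5. The second number can exceed the first because the model sometimes answers correctly where the probe does not.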
- AI developers
- Enterprises deploying LLMs
- AI ethics researchers
- Developers of unreliable LLMs
- Applications reliant on unfaithful AI
- Skeptics of AI reliability
LLMs become more reliable in fact-based question answering, reducing errors and increasing user confidence.
Enhanced trustworthiness accelerates the integration of LLMs into critical decision-making processes across various industries.
Increased reliability in AI could lead to new regulatory frameworks and safety standards for autonomous AI systems.
Read at arXiv cs.CL