
arXiv:2606.12818v1. Abstract: Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B to 8B Qwen and Llama base and instruction-tuned models, we find
As large language models become more widespread and complex, their internal mechanisms need to be better understood, especially where biases and reasoning vulnerabilities such as anchoring effects are concerned.
Understanding how irrelevant numerical information influences LLM judgments is critical for improving model reliability, fairness, and safety in deployment across various sensitive applications.
This research provides a methodology for localizing and potentially mitigating anchoring pathways within LLMs, moving beyond mere observation of these effects to targeted intervention.
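The logit-difference metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes the metric is the logit of the correct option minus the logit of the anchor-matching option; the abstract does not state the sign convention or any normalization, so that choice is an assumption. The function name `anchor_logit_diff`, the option labels, and the logit values are hypothetical and are not results from the paper.

```python
from typing import Mapping


def anchor_logit_diff(
    option_logits: Mapping[str, float], correct: str, anchor: str
) -> float:
    """Logit of the correct option minus logit of the anchor option.

    Assumed sign convention: positive values favor the correct answer,
    negative values favor the option that matches the anchor.
    """
    if correct == anchor:
        raise ValueError("correct and anchor options must differ")
    return option_logits[correct] - option_logits[anchor]


# Hypothetical logits over four shared answer options (illustrative only).
# "A" is the correct option; "B" is the option matching the anchor number.
without_anchor = {"A": 2.0, "B": 0.5, "C": -1.0, "D": 0.0}
with_anchor = {"A": 1.0, "B": 1.75, "C": -1.0, "D": 0.0}

diff_clean = anchor_logit_diff(without_anchor, correct="A", anchor="B")
diff_anchored = anchor_logit_diff(with_anchor, correct="A", anchor="B")

print(diff_clean)     # 1.5
print(diff_anchored)  # -0.75
```

In this toy case the metric drops from positive to negative once the anchor is present, which is the kind of shift such a metric is meant to expose. Because every prompt shares the same answer options, the same two option positions can be compared across prompts.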
- AI researchers
- Developers of robust LLMs
- Industries relying on AI for critical decision-making
- Models uncritically deployed without bias mitigation
- Platforms exhibiting unchecked anchoring effects
Improved methods for auditing and debugging the internal reasoning processes of sophisticated AI models will emerge.
New architectural designs or training regimes may be developed specifically to reduce human-like cognitive biases in AI.
Trust in AI systems may grow as their decision-making becomes demonstrably more robust and less susceptible to anchoring, broadening their application scope.
Read at arXiv cs.CL