
arXiv:2605.26772v1 Announce Type: new Abstract: Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the C
This research gives a more nuanced account of how large reasoning models (LRMs) process information and refuse requests, moving beyond the single-direction picture that holds for instruction-tuned LLMs.
Controlling advanced AI models, especially their refusal and safety behavior, is critical for deploying them and for keeping them aligned with human intent.
Controlling a reasoning model's behavior depends substantially on its chain-of-thought (CoT), so steering methods need to account for the CoT rather than rely on a single activation direction alone.
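To make the single-direction baseline from the abstract concrete, here is a minimal sketch of activation steering along one "refusal direction". It is an illustration under stated assumptions, not the paper's method: the difference-of-means estimator, the toy three-dimensional activations, and the helper names (`refusal_direction`, `ablate`, `steer`) are all hypothetical, and the sketch does not model the CoT dependence that the abstract reports.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def refusal_direction(refused_acts, complied_acts):
    """Assumed estimator: normalized difference of class means."""
    mu_refused = mean_vector(refused_acts)
    mu_complied = mean_vector(complied_acts)
    return unit([r - c for r, c in zip(mu_refused, mu_complied)])


def ablate(h, direction):
    """Remove the component of activation h along the unit direction."""
    coeff = dot(h, direction)
    return [x - coeff * d for x, d in zip(h, direction)]


def steer(h, direction, alpha):
    """Shift activation h by alpha along the unit direction."""
    return [x + alpha * d for x, d in zip(h, direction)]


# Toy activations (made up for illustration, not real model states).
refused = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
complied = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]

r_hat = refusal_direction(refused, complied)
h = [1.5, 0.3, -0.4]
h_ablated = ablate(h, r_hat)        # no remaining component along r_hat
h_steered = steer(h, r_hat, -2.0)   # pushed away from the refusal side
```

In a real model the same arithmetic would be applied to hidden states at a chosen layer. The abstract's 39% versus 70% figures suggest that such a single-direction intervention is only partly effective while the CoT is held fixed.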
- AI safety researchers
- AI alignment companies
- Developers of advanced reasoning models
- Companies relying on simplistic steering methods
- Researchers oversimplifying AI control
New methods for influencing large reasoning models' behavior will emerge, specifically targeting the chain-of-thought.
The development of more reliable and safer AI systems will accelerate, leading to broader adoption of complex AI applications.
Increased trust in AI's refusal capabilities could lead to more autonomous and critical deployments, but also to more sophisticated exploits if control is imperfect.
Read at arXiv cs.AI