Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

arXiv:2605.28553v1. Abstract: In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with pa
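To make the per-block linear probe idea concrete, here is a minimal sketch that is not taken from the paper. It assumes synthetic Gaussian vectors in place of real residual-stream activations, a hypothetical label (1 = the prompt would be refused, 0 = it would be answered), and a label signal that is made stronger for "deeper" layers by construction. All function names and parameters are illustrative. One logistic-regression probe is fitted per layer, and held-out accuracy is compared across layers.

```python
import math
import random


def make_layer_activations(n, dim, signal, rng):
    """Synthetic stand-in for activations at one transformer block.

    Label 1 = prompt that would be refused, 0 = answered (hypothetical labels).
    """
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Shift one coordinate by +/- signal so the label is linearly readable.
        x[0] += signal if y == 1 else -signal
        data.append((x, y))
    return data


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe (weights w, bias b) with plain SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(probe, data):
    """Fraction of examples where the probe's sign matches the label."""
    w, b = probe
    hits = 0
    for x, y in data:
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((logit > 0) == (y == 1))
    return hits / len(data)


def probe_all_layers(signals, n_train=200, n_test=200, dim=8, seed=0):
    """Train one probe per layer and return held-out accuracy per layer."""
    rng = random.Random(seed)
    accs = []
    for s in signals:
        train = make_layer_activations(n_train, dim, s, rng)
        test = make_layer_activations(n_test, dim, s, rng)
        accs.append(accuracy(train_probe(train, dim), test))
    return accs


if __name__ == "__main__":
    # Assumed signal strength per layer; real values would come from a model.
    layer_signals = [0.0, 0.5, 1.0, 2.0, 3.0]
    for layer, acc in enumerate(probe_all_layers(layer_signals)):
        print(f"layer {layer}: probe accuracy {acc:.2f}")
```

With real models, the synthetic generator would be replaced by activations captured at each block for a labeled prompt set. The layer at which probe accuracy first rises well above chance is one rough indication of where refusal becomes linearly decodable.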
Wider deployment of large language models calls for more robust safety mechanisms, which has led researchers to study how model behavior can be monitored and controlled from internal activations.
This research shows that refusal can be read out of intermediate activations with linear probes before any token is decoded. That could support real-time detection of, and intervention on, refusal behavior, although the same signal is also used to guide an attack (Mechanistic AutoDAN).
If refusal can be predicted and exploited before decoding, LLM safety is partly a question of internal representations rather than solely of post-hoc output filtering, which changes how safety mechanisms may need to be engineered.
- AI safety researchers
- LLM developers
- Regulatory bodies
- AI-reliant industries
- Malicious actors exploiting LLMs
- Systems relying on external content filters
Improved safety and alignment of large language models through real-time internal monitoring.
Development of more sophisticated, fine-grained control mechanisms for LLM behavior, extending beyond safety to other desired attributes.
Longer term, the possibility of LLMs that monitor their own internal states and self-correct, integrating safety checks into more autonomous systems.
Read at arXiv cs.AI