Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

arXiv:2605.02958v2 Announce Type: replace-cross Abstract: Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a Refusal Trajectory: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that oper
As large language models grow more capable and jailbreak attacks grow more sophisticated alongside them, safety and alignment call for more robust, proactive detection methods.
The paper takes a white-box approach: instead of judging model outputs alone, it inspects internal activation patterns to detect refusal dynamics, which could make AI systems more auditable and controllable.
Identifying Refusal Trajectories gives safety mechanisms a deeper signal to work from, potentially making it harder for adversarial attacks to bypass a model's safeguards.
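To make the contrast between terminal and upstream signals concrete, here is a toy sketch in Python. It is not SALO and not the paper's method, whose details are not given in the abstract above. It assumes a hypothetical known refusal direction, a hand-picked set of (layer, token) positions, and synthetic activations. It only shows how a score read from a sparse set of upstream positions can stay high when the terminal position is suppressed.

```python
import numpy as np

def projection_scores(acts, direction):
    """Project every (layer, token) activation onto a unit direction."""
    unit = direction / np.linalg.norm(direction)
    return acts @ unit  # shape: (layers, tokens)

def terminal_score(acts, direction):
    """Score read only at the last layer and last token."""
    return float(projection_scores(acts, direction)[-1, -1])

def sparse_upstream_score(acts, direction, positions):
    """Mean projection over a small, fixed set of (layer, token) positions."""
    scores = projection_scores(acts, direction)
    layers, tokens = zip(*positions)
    return float(scores[list(layers), list(tokens)].mean())

def make_toy_activations(rng, direction, positions, suppress_terminal):
    """Synthetic data (assumption): small noise plus a planted signal."""
    n_layers, n_tokens, dim = 6, 8, direction.shape[0]
    acts = 0.1 * rng.standard_normal((n_layers, n_tokens, dim))
    unit = direction / np.linalg.norm(direction)
    for layer, token in positions:
        acts[layer, token] += 3.0 * unit   # planted upstream signal
    if not suppress_terminal:
        acts[-1, -1] += 3.0 * unit         # terminal signal, absent under "attack"
    return acts

rng = np.random.default_rng(0)
direction = rng.standard_normal(16)        # hypothetical refusal direction
positions = [(1, 2), (2, 3), (3, 5)]       # hypothetical upstream positions
threshold = 1.5                            # arbitrary toy threshold

plain = make_toy_activations(rng, direction, positions, suppress_terminal=False)
attacked = make_toy_activations(rng, direction, positions, suppress_terminal=True)

print("terminal, plain:   ", terminal_score(plain, direction) > threshold)
print("terminal, attacked:", terminal_score(attacked, direction) > threshold)
print("upstream, attacked:",
      sparse_upstream_score(attacked, direction, positions) > threshold)
```

In this toy setup the terminal check misses the "attacked" input while the upstream check still fires. In a real model the direction and positions would have to be found empirically, for example through the kind of causal tracing the abstract mentions.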
- AI Safety Researchers
- AI Developers and Deployers
- Cybersecurity Industry
- Regulatory Bodies
- Adversarial Attackers
- Black-box AI Safety Methods
Improved detection of harmful AI outputs and reduced instances of successful 'jailbreaks' against large language models.
Increased trust in AI systems due to enhanced safety mechanisms, potentially accelerating their adoption in sensitive applications.
The methodology could be extended to detect other latent adversarial behaviors, evolving into a foundational tool for general AI alignment and control.
Read at arXiv cs.LG