
arXiv:2606.22686v2 Announce Type: replace-cross Abstract: Modern Large Language Models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. In this work, we investigate whether safety compliance is a deep semantic decision or a manipulable linear feature. We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction" by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on
The rapid deployment of LLMs, and growing reliance on them, call for a deeper understanding of their safety mechanisms, particularly as these systems become more integrated into critical applications.
Understanding the 'geometry of refusal' allows for more robust and transparent control over AI safety, impacting trust, regulation, and the deployment of advanced AI systems.
The mechanistic basis of LLM safety alignment shifts from being an opaque 'black box' to a potentially manipulable and interpretable 'linear feature,' opening new avenues for control and auditing.
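To make the 'linear feature' idea concrete, here is a minimal sketch of a generic difference-of-means direction computed on synthetic vectors. It is not the paper's CLS procedure: the abstract above is truncated and does not say where CLS intervenes. The toy data, the dimension, and the helper names (`refusal_direction`, `steer`, `ablate`) are all assumptions made for illustration.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def refusal_direction(safe_states, unrestricted_states):
    """Unit vector pointing from the unrestricted mean toward the safe mean."""
    mu_safe = mean_vector(safe_states)
    mu_free = mean_vector(unrestricted_states)
    diff = [a - b for a, b in zip(mu_safe, mu_free)]
    norm = dot(diff, diff) ** 0.5
    return [x / norm for x in diff]


def steer(state, direction, alpha):
    """Shift a state along the direction; negative alpha moves away from it."""
    return [s + alpha * d for s, d in zip(state, direction)]


def ablate(state, direction):
    """Remove the component of a state that lies along the direction."""
    c = dot(state, direction)
    return [s - c * d for s, d in zip(state, direction)]


# Synthetic stand-ins for hidden states (assumption: real states would come
# from a model run under a "safe" and an "unrestricted" system prompt).
rng = random.Random(0)
dim = 8
true_dir = [1.0] + [0.0] * (dim - 1)


def sample(offset):
    """Small Gaussian noise plus an offset along the planted direction."""
    return [rng.gauss(0, 0.1) + offset * t for t in true_dir]


safe = [sample(2.0) for _ in range(50)]
free = [sample(0.0) for _ in range(50)]

d = refusal_direction(safe, free)
x = safe[0]
residual = dot(ablate(x, d), d)        # component left along d after ablation
shifted = dot(steer(x, d, -1.5), d)    # component along d after steering
```

If safety behavior really were a single linear feature, the recovered direction `d` would align with the planted one, and `ablate` or `steer` would be enough to change the component along it. Whether real models behave this way is the empirical question the paper investigates.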
- AI Safety Researchers
- Developers of Safety-Critical AI Systems
- Regulatory Bodies
- Malicious Actors circumventing AI safeguards
- Developers of proprietary, opaque safety systems
- Systems highly vulnerable to prompt injection attacks
Identifying a 'refusal direction' allows for more precise and potentially real-time steering of LLM behavior, making AI outputs more predictable.
This improved understanding could lead to the development of robust, auditable safety layers that are less susceptible to adversarial attacks, enhancing overall AI security.
The transparency gained might accelerate public and regulatory acceptance of more autonomous AI systems, given greater confidence in their controllable safety parameters.
Read at arXiv cs.LG