An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits

arXiv:2605.24583v1 (announce type: cross). Abstract: We audit alignment-induced shifts in residual-stream activations of three open-weight instruction-tuned LLMs (Llama-3.1-8B-Instruct, Gemma-2-9B-it, Qwen-2.5-7B-Instruct) using the effective rank of the alignment modification matrix on safety-relevant inputs, rho_eps := rank_eps(M_Ds)/d, which formalizes the single-refusal-direction observation of Arditi et al. (2024) as a continuous quantity. The paper has three contributions. (1) Confound-controlled measurement: a four-variant decomposition (M_naive, M_template, M_aligned, M_DiD) separates cha
The proliferation of instruction-tuned LLMs and increased scrutiny of their safety and alignment mechanisms necessitate a deeper technical understanding of their internal workings.
This research offers a more rigorous, quantitative method for measuring how alignment techniques shift a model's internal activations, moving beyond anecdotal observations.
The ability to audit alignment-induced activation shifts more effectively offers improved diagnostics and calibration methods for large language models.
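The quantity named in the abstract, rho_eps := rank_eps(M_Ds)/d, can be illustrated with a small sketch. The excerpt does not say how rank_eps is thresholded or how M_Ds is assembled, so the code below makes two labeled assumptions: rank_eps counts the singular values larger than eps times the largest one, and the matrix has one row per input and d columns (the hidden size). The function name `effective_rank_ratio` and the synthetic matrices are hypothetical and are not taken from the paper.

```python
import numpy as np


def effective_rank_ratio(M, eps=1e-2):
    """Return rank_eps(M) / d under an ASSUMED definition of rank_eps.

    Assumption (not confirmed by the source): rank_eps counts singular
    values strictly greater than eps times the largest singular value.
    d is taken to be the number of columns of M.
    """
    M = np.asarray(M, dtype=float)
    d = M.shape[1]
    # Singular values are returned in descending order.
    s = np.linalg.svd(M, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0.0  # an all-zero matrix has no significant directions
    rank_eps = int(np.sum(s > eps * s[0]))
    return rank_eps / d


# Synthetic stand-ins for an activation-shift matrix (hypothetical data).
rng = np.random.default_rng(0)
n, d = 200, 64

# Every row is a multiple of one unit vector: the single-direction case.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
coeffs = rng.normal(size=(n, 1))
M_single = coeffs * direction  # shape (n, d), rank one

# The same matrix with small isotropic noise added.
M_noisy = M_single + 1e-4 * rng.normal(size=(n, d))

# An unstructured matrix with no preferred direction.
M_full = rng.normal(size=(n, d))

rho_single = effective_rank_ratio(M_single)
rho_noisy = effective_rank_ratio(M_noisy)
rho_full = effective_rank_ratio(M_full)
```

In this synthetic setup the two single-direction matrices give rho = 1/64 and the unstructured matrix gives rho = 1.0, which is the sense in which a continuous ratio generalizes a "one direction" observation. The relative threshold is a design choice for the sketch; an absolute threshold would make the result depend on the overall scale of the activations.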
- AI developers
- AI safety researchers
- Regulators
- Developers relying on black-box safety
- Inferior alignment techniques
Improved understanding of LLM alignment mechanisms.
More robust and tunable safety features in future LLMs, potentially leading to fewer refusals of safe queries.
Standardized metrics and methodologies for evaluating LLM safety and bias across the industry, facilitating stronger governance frameworks.
Read at arXiv cs.CL