MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

arXiv:2606.01060v1. Abstract: Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectivel
The accelerating deployment and integration of large language models necessitates deeper understanding of their internal workings, especially concerning alignment and potential failure modes.
Understanding the internal mechanisms of AI alignment is critical for building more robust, trustworthy, and controllable AI systems, reducing risks associated with unintended behaviors.
This research provides a methodology to probe the internal changes in LLMs post-alignment, moving beyond behavioral evaluation to structural analysis of AI models.
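As a rough illustration of what comparing models structurally rather than behaviorally can look like, the sketch below computes a per-layer drift score between paired hidden states from two models. It is a generic cosine-distance comparison, not the torsion measure the paper defines, and everything in it is an assumption made for illustration: the function names `cosine` and `layerwise_drift`, the nested-list layout of activations, and the synthetic data standing in for real IT and PA activations.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layerwise_drift(hidden_it, hidden_pa):
    """Mean (1 - cosine similarity) per layer over paired token vectors.

    Both arguments are lists over layers, each a list over tokens,
    each token a list of floats (hypothetical layout).
    """
    drift = []
    for layer_it, layer_pa in zip(hidden_it, hidden_pa):
        dists = [1.0 - cosine(u, v) for u, v in zip(layer_it, layer_pa)]
        drift.append(sum(dists) / len(dists))
    return drift


# Synthetic stand-in for real activations: 4 layers, 8 tokens, 16 dimensions.
rng = random.Random(0)
hidden_it = [
    [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
    for _ in range(4)
]

# Hypothetical "aligned" activations: noise scaled by layer index,
# so layer 0 is unchanged and deeper layers move further.
hidden_pa = [
    [[x + 0.5 * layer * rng.gauss(0, 1) for x in vec] for vec in tokens]
    for layer, tokens in enumerate(hidden_it)
]

drift = layerwise_drift(hidden_it, hidden_pa)
for layer, d in enumerate(drift):
    print(f"layer {layer}: mean cosine drift = {d:.3f}")
```

With real models, the two inputs would be activations collected from the IT and PA checkpoints on the same prompts. A per-layer profile like this only shows where representations differ, not what kind of geometric change occurred.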
- AI safety researchers
- Developers of robust AI systems
- Organizations deploying sensitive AI applications
- Malicious actors exploiting AI vulnerabilities
- Users relying on black-box AI behavior
Improved debugging and understanding of large language model failures like jailbreaks and prompt injection.
Development of new alignment techniques that directly target internal model structures rather than just output behavior.
Enhanced ability to detect and prevent adversarial attacks, leading to more secure and reliable AI systems integrated into critical infrastructure.
Read at arXiv cs.CL