Constitutional Value Potentials: reading and steering internal priority margins in language models

arXiv:2606.15420v1 (cross-list). Abstract: A constitution tells a language model what to value, but little tells us whether it does. Adherence is judged from outputs, and output evidence is most fragile on value conflicts, where what matters is not which value a model mentions but which one it is willing to sacrifice. We provide evidence that this arbitration can be read from activations in a structured margin readout. We introduce Constitutional Value Potentials (CVP). For each value we learn a scalar potential from the hidden state: an internal pressure to preserve that value, supervi
The increasing sophistication and autonomy of language models necessitate new methods for evaluating and aligning their internal values, especially as they integrate into critical applications.
The ability to read and steer the 'internal priority margins' of language models is crucial for ensuring their safe, ethical, and aligned deployment, particularly in sensitive domains.
This research introduces a novel method (Constitutional Value Potentials) to internally observe and potentially control a language model's value arbitration, moving beyond output-based assessments alone.
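As a rough illustration of what "a scalar potential per value" and a "priority margin" could look like in practice, here is a minimal sketch. It is not the paper's implementation: the linear form of the readout, the definition of the margin as a difference of two potentials, the steering rule, the value names `honesty` and `harmlessness`, and the random toy weights are all assumptions made for this example.

```python
import random


def potential(hidden, weights, bias=0.0):
    """Scalar potential for one value: a linear readout of the hidden state.

    The linear form is an assumption of this sketch, not a detail from the source.
    """
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def priority_margin(hidden, probes, value_a, value_b):
    """Assumed margin: potential of value_a minus potential of value_b.

    A positive margin means value_a currently has the higher potential.
    """
    w_a, b_a = probes[value_a]
    w_b, b_b = probes[value_b]
    return potential(hidden, w_a, b_a) - potential(hidden, w_b, b_b)


def steer(hidden, probes, value_a, value_b, alpha):
    """Shift the hidden state along (w_a - w_b) to raise the a-over-b margin.

    This additive intervention is a hypothetical steering rule for illustration.
    """
    w_a, _ = probes[value_a]
    w_b, _ = probes[value_b]
    return [h + alpha * (a - b) for h, a, b in zip(hidden, w_a, w_b)]


# Toy setup: random probe weights and a random hidden state (all hypothetical).
rng = random.Random(0)
dim = 8
probes = {
    name: ([rng.gauss(0.0, 1.0) for _ in range(dim)], 0.0)
    for name in ("honesty", "harmlessness")
}
hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]

before = priority_margin(hidden, probes, "honesty", "harmlessness")
steered = steer(hidden, probes, "honesty", "harmlessness", alpha=0.5)
after = priority_margin(steered, probes, "honesty", "harmlessness")
print(f"margin before: {before:.3f}, after steering: {after:.3f}")
```

Under these assumptions, steering by `alpha` raises the margin by exactly `alpha` times the squared norm of the difference between the two probe weight vectors, which is why the margin moves in a predictable direction. In a real model the probes would be trained on labelled hidden states rather than drawn at random.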
- AI safety researchers
- Developers of constitutional AI
- Governments and regulators focusing on AI governance
- Malicious actors attempting to exploit unaligned AI
- Organisations relying solely on black-box AI evaluation
- Theories that AI alignment can only be evaluated post-hoc from outputs
Researchers gain a precise internal tool to diagnose and address value conflicts within large language models.
Improved internal visibility into AI decision-making accelerates the development of more trustworthy and robust autonomous AI agents.
The integration of such tools could lead to enforceable standards for explainable and ethically aligned AI, influencing regulatory frameworks globally.
Read at arXiv cs.AI