When Models Refuse: Political Steerability and Feature Richness as Measures of Ideological Depth

arXiv:2508.21448v3. Abstract: Large language models (LLMs) sometimes refuse to follow benign instructions, such as declining to argue a political position or adopt a stated persona, and such refusals are commonly read as safety guardrails at work. We ask whether they can instead signal a **capability deficit**: a shortage of the internal representations a model needs to reason from the instructed perspective. To investigate, we introduce **ideological depth**, a property with two components: (i) a model's ability to follow political instructions without *failure* (steerab
This research arrives as the capabilities and limitations of large language models face intense scrutiny, particularly with regard to their biases and control mechanisms.
Determining whether a refusal reflects a safety guardrail or a fundamental capability deficit matters for building robust, reliable, and ethically aligned AI systems.
The framing shifts from safety alone to an assessment of a model's internal 'ideological depth' and its ability to represent diverse perspectives.
- AI ethicists
- Developers of transparent AI architectures
- Platforms demanding fine-grained model control
- Companies relying solely on superficial 'safety' metrics
- Black-box AI development approaches
Further research is likely to focus on diagnosing and mitigating 'capability deficits' in LLMs related to ideological steerability.
This could lead to new benchmarks and regulatory requirements that assess a model's 'ideological depth' rather than just its safety guardrails.
Future AI systems may be designed with explicit modules for 'ideological representation' to ensure they can reason from a wider array of human perspectives.
Read at arXiv cs.CL