Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

arXiv:2602.10352v2. Abstract: Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (70% vs 50% generation scoring at 70B s
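The excerpt above does not spell out the adapter's exact form. One reading consistent with the $d_\text{model}+1$ parameter count is a single shared scalar scale plus a per-dimension bias vector, applied to an activation vector before it is handed to the frozen LM. The sketch below illustrates only that parameter count; the class name `ScalarAffineAdapter`, its fields, and the formula `scale * x + bias` are assumptions for illustration, not the paper's code.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ScalarAffineAdapter:
    """Hypothetical adapter: y = scale * x + bias.

    Assumption: one scalar `scale` shared across all dimensions and one
    bias entry per dimension, giving d_model + 1 trainable parameters.
    """

    d_model: int
    scale: float = 1.0
    bias: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Default to a zero bias so the untrained adapter is the identity map.
        if not self.bias:
            self.bias = [0.0] * self.d_model
        if len(self.bias) != self.d_model:
            raise ValueError("bias must have exactly d_model entries")

    def num_parameters(self) -> int:
        # One scalar scale plus d_model bias entries.
        return 1 + len(self.bias)

    def __call__(self, vector: Sequence[float]) -> List[float]:
        if len(vector) != self.d_model:
            raise ValueError("input must have exactly d_model entries")
        return [self.scale * x + b for x, b in zip(vector, self.bias)]


# Toy usage with d_model = 4 (real models use thousands of dimensions).
adapter = ScalarAffineAdapter(d_model=4, scale=2.0, bias=[0.5, 0.0, 0.0, -1.0])
mapped = adapter([1.0, 2.0, 3.0, 4.0])
```

In a setup like the one the abstract describes, only `scale` and `bias` would be updated during training while every LM weight stays fixed, which is why the trainable footprint is so small.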
As large language models grow more complex and move toward more autonomous applications, reliable interpretability methods become more important.
Reliable self-interpretation could make language models more explainable, and therefore easier to trust and deploy in sensitive domains, and could speed the development of robust AI systems.
The ability to train lightweight adapters for self-interpretation on frozen language models significantly lowers the barrier to achieving reliable internal state descriptions, allowing for broader application without costly retraining.
- AI developers
- AI deployment sectors (e.g., finance, healthcare)
- AI interpretability researchers
- Developers reliant on black-box AI
- Solutions requiring full LM fine-tuning for interpretability
Self-interpretation methods become more practical and widely adopted across various language models.
Improved interpretability leads to faster debugging, enhanced safety, and more compliant AI applications.
The increased transparency of AI internal states accelerates the development of more complex and autonomous AI agents capable of self-correction.
Read at arXiv cs.CL