Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

arXiv:2605.22719v1. Abstract: We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen's d| > 0.8). The strongest single correlate of failure -- feature 17,491, d=+2.93, Neuronpedia label 'cryptographic keys' -- is
As language models grow more complex, keeping them reliable and safe requires a finer-grained understanding of their internal computations.
Knowing which internal features are associated with a model's failures is a step toward more robust, interpretable, and controllable AI systems.
This paper offers a method, and concrete results for GPT-2 small on IOI, for correlating sparse-autoencoder feature activations with task failure, supporting a more empirical approach to AI debugging and alignment.
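The two statistics named in the abstract, Cohen's d and the Holm correction, can be sketched as follows. This is a minimal illustration and not the authors' code: the pooled-standard-deviation form of d, the function names `cohens_d` and `holm_reject`, the 0.05 significance level, and the synthetic activations are all assumptions. The abstract does not say which test produced the p-values, so `holm_reject` takes them as input. The split into 61 failed and 239 successful trials is inferred from 79.7% accuracy on 300 prompts (239/300), assuming accuracy is computed over all prompts.

```python
import numpy as np


def cohens_d(fail, success):
    """Per-feature Cohen's d (pooled SD); rows are trials, columns are features.

    Positive d means the feature is more active on failed trials.
    """
    fail = np.asarray(fail, dtype=float)
    success = np.asarray(success, dtype=float)
    n_f, n_s = fail.shape[0], success.shape[0]
    pooled_var = (
        (n_f - 1) * fail.var(axis=0, ddof=1)
        + (n_s - 1) * success.var(axis=0, ddof=1)
    ) / (n_f + n_s - 2)
    pooled_sd = np.sqrt(pooled_var)
    diff = fail.mean(axis=0) - success.mean(axis=0)
    # Features with zero variance get d = 0 instead of a division by zero.
    return np.divide(diff, pooled_sd, out=np.zeros_like(diff), where=pooled_sd > 0)


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: boolean mask of rejected null hypotheses."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value (k = 0, 1, ...) is compared with alpha / (m - k).
    passed = p[order] <= alpha / (m - np.arange(m))
    # Stop at the first comparison that fails; everything before it is rejected.
    n_reject = m if passed.all() else int(np.argmin(passed))
    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject


# Synthetic stand-in for SAE activations (assumed data, not from the paper):
# 61 failed and 239 successful trials, 5 features, feature 0 shifted on failures.
rng = np.random.default_rng(0)
fail_acts = rng.normal(size=(61, 5))
fail_acts[:, 0] += 3.0
success_acts = rng.normal(size=(239, 5))

d = cohens_d(fail_acts, success_acts)
large_effect = np.abs(d) > 0.8  # the large-effect cutoff quoted in the abstract

# Hand-picked p-values, only to show how the Holm step-down behaves.
example_reject = holm_reject([0.001, 0.04, 0.03, 0.005])
```

In a real audit the two input matrices would hold the 24,576 layer-8 SAE feature activations for each trial, and the p-values would come from whichever per-feature test the paper uses.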
- AI Safety Researchers
- AI Developers
- Model Explainability Tools
- Black-box AI approaches
- Uninterpretable AI systems
Improved debugging and error correction for large language models based on identified failure features.
Development of automated tools that can proactively identify and mitigate specific failure modes in AI.
Enhanced trust and broader adoption of AI systems due to increased transparency and reliability in critical applications.
Read at arXiv cs.LG