Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

arXiv:2605.21849v1. Abstract: Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric d
As AI models are deployed in diverse, changing real-world environments, interpretability methods must remain robust to distribution shift, which is an active research problem.
Faithful interpretability is crucial for trusted AI deployment, especially when models operate under conditions that differ from their training data. It bears directly on safety, reliability, and regulatory compliance.
This research identifies a fundamental challenge for dictionary-based interpretability under distribution shift and proposes a geometry-adaptive explainer intended to make such explanations more reliable.
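The abstract's claim that distribution shift rotates the subspace the model actively uses can be illustrated with a small numerical sketch. The code below is an assumption-laden illustration, not the paper's method: it does not implement the faithfulness gap (whose definition is cut off in the abstract above) or the proposed explainer. It uses principal angles between PCA subspaces as one standard way to quantify rotation, and the helper names (top_k_subspace, principal_angles, captured_variance_fraction) are hypothetical.

```python
import numpy as np


def top_k_subspace(acts, k):
    """Return an orthonormal basis (d x k) for the top-k principal
    directions of an activation matrix of shape (n_samples, d)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def principal_angles(basis_a, basis_b):
    """Principal angles (radians) between two subspaces, each given
    by a matrix with orthonormal columns."""
    sigma = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    # Clip guards against values slightly above 1 from rounding.
    return np.arccos(np.clip(sigma, -1.0, 1.0))


def captured_variance_fraction(acts, basis):
    """Fraction of activation variance lying inside span(basis).
    A low value means a dictionary confined to this subspace
    cannot represent most of what the activations are doing."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    projected = centered @ basis
    return float((projected ** 2).sum() / (centered ** 2).sum())


# Hypothetical toy setup: ID activations span (e0, e1) in 8 dimensions;
# the OOD subspace has e0 rotated 60 degrees toward e2.
eye = np.eye(8)
id_basis = eye[:, [0, 1]]
theta = np.pi / 3
ood_basis = np.stack(
    [np.cos(theta) * eye[:, 0] + np.sin(theta) * eye[:, 2], eye[:, 1]],
    axis=1,
)

rng = np.random.default_rng(0)
ood_acts = rng.normal(size=(200, 2)) @ ood_basis.T

print("principal angles:", principal_angles(id_basis, ood_basis))
print("OOD variance captured by ID subspace:",
      captured_variance_fraction(ood_acts, id_basis))
```

In this toy setting, a nonzero principal angle between the ID and OOD subspaces goes together with a drop in the share of OOD activation variance that an ID-fitted subspace captures, which is the kind of misalignment the abstract describes for dictionaries trained on ID activations.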
- AI safety researchers
- AI ethicists
- Developers of robust AI systems
- Sectors with high-stakes AI applications
- AI systems with poor OOD generalization
- Interpretability methods not robust to distribution shift
Increased understanding and development of more reliable AI interpretability tools capable of handling real-world data variability.
Greater trust and adoption of AI systems in critical applications where out-of-distribution performance and explainability are paramount.
Potential for new regulatory frameworks and industry standards to incorporate robustness to distribution shift as a requirement for explainable AI.
Read at arXiv cs.LG