C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders

arXiv:2606.30609v1. Abstract: Sparse Autoencoders (SAEs) are widely used to interpret large language models by decomposing activations into sparse, human-understandable features, but scaling to large dictionaries exposes fundamental challenges. Systematic studies reveal pervasive feature splitting that fragments coherent concepts into non-atomic latents and widespread feature absorption that creates arbitrary exceptions in general features, severely compromising latent reliability. These issues stem from inconsistent latent assignment across samples: without cross-sample cons
The increasing scale and complexity of large language models necessitate more effective interpretability tools.
Improved interpretability of large language models is crucial for their reliability, safety, and continued integration into critical applications.
The paper proposes C$^{2}$R, a cross-sample consistency regularization intended to mitigate feature splitting and absorption in sparse autoencoders, making the internal workings of large language models more transparent.
- AI researchers
- companies deploying LLMs
- AI audit and safety organizations
- developers of less interpretable AI methods
- malicious actors seeking to exploit opaque AI systems
The adoption of C$^{2}$R could lead to more robust and less error-prone sparse autoencoders.
Enhanced interpretability may accelerate the development of explainable AI (XAI) and foster greater trust in LLMs.
Increased transparency could influence regulatory frameworks for AI, potentially leading to demands for verifiable interpretability.
Source: arXiv cs.LG