
arXiv:2605.23035v1 (announce type: cross)

Abstract: Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap by bridging sparse autoencoders (SAEs) from mechanistic interpretability with neural encoding models, decomposing GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. A human-validated taxonomy ($\kappa \geq 0.74$) reveals that semantic features alone recover 94% of peak encoding performance ($r=0
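The pipeline the abstract describes (SAE features fed into a neural encoding model) can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the SAE weights are random rather than trained, the "brain responses" are simulated, the feature subset labeled `semantic_idx` is hypothetical, and ridge regression with held-out Pearson correlation is a common choice in encoding-model work, not necessarily the paper's exact method.

```python
import numpy as np


def sae_encode(acts, W_enc, b_enc, b_dec):
    """Standard SAE encoder form: ReLU((acts - b_dec) @ W_enc + b_enc)."""
    return np.maximum((acts - b_dec) @ W_enc + b_enc, 0.0)


def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha I) W = X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def pearson_per_column(Y_true, Y_pred):
    """Pearson r between matching columns (one value per simulated voxel)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)


def encoding_score(features, responses, n_train, alpha=1.0):
    """Fit on the first n_train rows; return mean Pearson r on the rest."""
    x_mean = features[:n_train].mean(axis=0)
    y_mean = responses[:n_train].mean(axis=0)
    W = fit_ridge(features[:n_train] - x_mean, responses[:n_train] - y_mean, alpha)
    pred = (features[n_train:] - x_mean) @ W + y_mean
    return float(pearson_per_column(responses[n_train:], pred).mean())


# Synthetic stand-ins (assumptions): real work would use LLM layer activations,
# a trained SAE, and recorded brain responses aligned to the same stimuli.
rng = np.random.default_rng(0)
n_samples, d_model, n_feat, n_voxels = 400, 32, 64, 10
acts = rng.normal(size=(n_samples, d_model))
W_enc = rng.normal(size=(d_model, n_feat)) / np.sqrt(d_model)
b_enc = np.full(n_feat, -0.5)  # negative bias keeps many features at zero
b_dec = np.zeros(d_model)
feats = sae_encode(acts, W_enc, b_enc, b_dec)

# Simulated responses driven only by a hypothetical "semantic" feature subset.
semantic_idx = np.arange(16)
true_w = rng.normal(size=(semantic_idx.size, n_voxels))
noise = 0.5 * rng.normal(size=(n_samples, n_voxels))
responses = feats[:, semantic_idx] @ true_w + noise

score_all = encoding_score(feats, responses, n_train=300)
score_subset = encoding_score(feats[:, semantic_idx], responses, n_train=300)
print(f"all features: r={score_all:.3f}, subset only: r={score_subset:.3f}")
```

Comparing `score_subset` against `score_all` mirrors the kind of question the abstract raises: how much of the full encoding performance a single category of features recovers on its own. The numbers this sketch prints reflect only the simulated data and say nothing about the paper's results.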
This research provides a mechanistic explanation for the established alignment between LLMs and human brain activity, advancing our understanding of how language processing occurs in both artificial and biological systems.
Understanding the mechanisms behind LLM-brain alignment matters for developing more biologically plausible AI and could lead to breakthroughs in both AI interpretability and neuroscience.
This research shifts the understanding of LLM interpretability from a purely computational problem to one with strong biological underpinnings, suggesting a convergence of AI and neuroscientific principles.
- AI interpretability researchers
- Computational neuroscientists
- AI developers
- Cognitive science
- Black-box AI approaches
Improved understanding of how LLMs represent and process meaning, gained by decomposing their activations with sparse autoencoders.
Development of new AI architectures that are designed to be intrinsically more interpretable and brain-like.
Enhanced brain-computer interfaces or diagnostic tools based on shared 'semantic topography' between AI and human minds.
Read at arXiv cs.AI