IG-Lens: Exact Additive Probability Attribution Across Transformer Layers via Telescoping Integrated Gradients

arXiv:2606.29693v1. Abstract: We ask a simple question about decoder-only transformers: between which two layers is the probability of a predicted token actually produced? Existing layer-wise readout tools answer only approximately. The logit lens and its trained variant report a per-layer level of probability but give no additive decomposition; their estimates are biased and non-monotone across depth. Direct Logit Attribution and related residual-stream methods are additive, but only in logit space -- the softmax nonlinearity breaks additivity in probabi
As large language models are adopted more widely, including in critical applications, more precise tools are needed to understand their internal workings.
Improved interpretability tools like IG-Lens are crucial for debugging, auditing, and enhancing the trustworthiness and control of complex AI systems, particularly transformers.
If the method works as described, researchers can identify which transformer layers are responsible for a given output probability, rather than relying on approximate or non-additive readouts.
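To make "additive in probability space" concrete, here is a minimal sketch. It is not the paper's algorithm, since the abstract above is cut off before the method is described. It only assumes what the title suggests: that a probability change can be split layer by layer as a telescoping sum, and that each layer's share can be recovered with integrated gradients along a straight path in logit space. The toy logit updates, the all-zero baseline, and every function name are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def telescoped_probability_deltas(layer_updates, token):
    """Split the target-token probability into per-layer differences.

    layer_updates: hypothetical per-layer additive logit contributions.
    Assumption: the baseline is the all-zero logit vector (uniform output).
    Returns (deltas, probs); sum(deltas) == probs[-1] - probs[0] by telescoping.
    """
    running = [0.0] * len(layer_updates[0])
    probs = [softmax(running)[token]]
    for update in layer_updates:
        running = [r + u for r, u in zip(running, update)]
        probs.append(softmax(running)[token])
    deltas = [b - a for a, b in zip(probs, probs[1:])]
    return deltas, probs


def integrated_gradient_delta(z_start, z_end, token, steps=256):
    """Midpoint-rule integrated gradients of p[token] from z_start to z_end.

    By the completeness property of integrated gradients, the result
    approximates p[token](z_end) - p[token](z_start).
    """
    diff = [e - s for s, e in zip(z_start, z_end)]
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        p = softmax([s + alpha * d for s, d in zip(z_start, diff)])
        # d p[token] / d z[j] = p[token] * (1[j == token] - p[j])
        total += sum(
            p[token] * ((1.0 if j == token else 0.0) - p[j]) * d
            for j, d in enumerate(diff)
        )
    return total / steps


# Made-up logit updates for a 3-layer, 3-token toy model (illustration only).
LAYER_UPDATES = [
    [0.5, -0.2, 1.0],
    [-0.3, 0.4, 2.0],
    [0.1, 0.1, -0.5],
]
deltas, probs = telescoped_probability_deltas(LAYER_UPDATES, token=2)
```

In this toy setup the per-layer `deltas` sum exactly to the final probability minus the baseline probability, which is the kind of bookkeeping that logit-space attributions alone do not provide once the softmax is applied.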
- AI researchers
- ML engineers
- AI safety community
- Developers of AI-driven products
Increased understanding of transformer decision-making processes.
Faster identification and mitigation of biases or emergent behaviors in LLMs.
Potential for new methods of fine-tuning or architecting transformers based on layer-specific contributions.
Read at arXiv cs.LG