Re-feeding Is Not Replaying: Measuring Replay Noise in Counterfactual Token-Credit Estimation

arXiv:2606.15621v1 Announce Type: cross Abstract: Per-token counterfactual credit estimation asks which token in a language-model rollout caused the final answer to be right or wrong: cut the transcript at a pivot, substitute an alternative token, replay continuations, and compare outcomes. Published methods re-feed the transcript prefix as a fresh prompt, assuming this reproduces the state the model passed through during generation. We measure what that assumption costs on a stock inference engine, with a three-pass design: continuations resumed from the verified decode-time KV state, an iden
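This paper addresses a technical limitation in current language-model attribution methods: the assumption that re-feeding a transcript prefix as a fresh prompt reproduces the state the model passed through during generation. It is part of an ongoing refinement of AI interpretability research as models become more complex.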
Improving the accuracy of token-level credit estimation is crucial for developing more reliable, explainable, and debuggable AI systems, which impacts their broader adoption and safety.
The research examines a discrepancy in how model state is handled during counterfactual analysis: a prefix re-fed as a fresh prompt may not reproduce the decode-time KV state that the original generation passed through, so replayed continuations can carry noise unrelated to the substituted token.
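The cut, substitute, replay, and compare loop that the abstract describes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's code: `sample_continuation` is a hypothetical two-token stand-in for a language model, `is_correct` is a hypothetical outcome check, and `counterfactual_credit` is a hypothetical helper name. Because the toy sampler depends only on the last token, it cannot exhibit the re-feed versus resumed-state discrepancy that the paper measures; it only shows the shape of the estimator.

```python
import random
from typing import List, Sequence

Token = str


def sample_continuation(prefix: Sequence[Token], rng: random.Random,
                        steps: int = 3) -> List[Token]:
    """Toy sampler (assumption): the next token repeats the last one with p=0.8."""
    tokens = list(prefix)
    for _ in range(steps):
        p_a = 0.8 if tokens[-1] == "a" else 0.2
        tokens.append("a" if rng.random() < p_a else "b")
    return tokens


def is_correct(tokens: Sequence[Token]) -> bool:
    """Toy outcome check (assumption): the rollout is 'right' if it ends in 'a'."""
    return tokens[-1] == "a"


def success_rate(prefix: Sequence[Token], n: int, rng: random.Random) -> float:
    """Fraction of n sampled continuations of `prefix` that end correctly."""
    hits = sum(is_correct(sample_continuation(prefix, rng)) for _ in range(n))
    return hits / n


def counterfactual_credit(transcript: Sequence[Token], pivot: int,
                          alternative: Token, n: int = 2000,
                          seed: int = 0) -> float:
    """Success rate with the original pivot token minus the rate with the swap."""
    rng = random.Random(seed)
    original_prefix = list(transcript[:pivot + 1])         # cut after the pivot
    swapped_prefix = list(transcript[:pivot]) + [alternative]  # substitute token
    return (success_rate(original_prefix, n, rng)
            - success_rate(swapped_prefix, n, rng))


if __name__ == "__main__":
    # A negative value means the original pivot token hurt the outcome
    # relative to the alternative.
    print(counterfactual_credit(["a", "b", "b", "b"], pivot=1, alternative="a"))
```

With a real model, both prefixes would be replayed through the inference engine, and the question raised by the paper is whether those replays start from the same state the original generation passed through.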
- AI interpretability researchers
- Developers of attribution tools
- Sectors requiring high-assurance AI
- Developers relying on current 're-feeding' methods without verification
More accurate methods for understanding internal language-model dynamics are likely to emerge.
This improved understanding could lead to more robust and less 'black-box' large language models.
Enhanced interpretability may accelerate regulatory acceptance and public trust in advanced AI applications.
Read at arXiv cs.CL