All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting

arXiv:2602.17234v2. Abstract: Backtesting LLMs on resolved events assumes models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff knowledge. We introduce a claim-level evaluation framework that decomposes prediction rationales into atomic claims and applies Shapley values to quantify each claim's decision impact, yielding Shapley-DCLR (Shapley-weighted Decision-Critical Leakage Rate), an interpretable metric measuring what fraction of decision-driving reasoning is contami
The increasing deployment of LLMs in critical applications, particularly where they are validated by backtesting, creates an urgent need to identify and mitigate data contamination robustly.
A claim-level leakage metric allows for more reliable and interpretable evaluation of LLM performance, which is essential for trustworthy integration into sensitive domains such as finance and regulatory compliance.
The ability to quantify and attribute 'leakage' in LLM reasoning provides a new layer of auditing and validation, challenging assumptions about pre-training data integrity.
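To make the idea of a Shapley-weighted leakage rate concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract excerpt does not give the exact formula, so the ratio used here (absolute Shapley mass on leaked claims divided by total absolute Shapley mass), the function names, the toy value function, and all numbers are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_claims, value):
    """Exact Shapley value of each claim for a coalition value function.

    `value` maps a frozenset of claim indices to a number, e.g. the model's
    predicted probability when only those claims are kept in the rationale.
    Exact enumeration is exponential, so this suits only a handful of claims.
    """
    phi = [0.0] * n_claims
    for i in range(n_claims):
        others = [j for j in range(n_claims) if j != i]
        for k in range(len(others) + 1):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n_claims - k - 1) / factorial(n_claims)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def shapley_weighted_leakage_rate(phi, leaked):
    """ASSUMED metric form: share of absolute Shapley mass on leaked claims."""
    total = sum(abs(p) for p in phi)
    if total == 0:
        return 0.0
    leaked_mass = sum(abs(p) for p, flag in zip(phi, leaked) if flag)
    return leaked_mass / total


# Hypothetical toy data: three claims with additive effects on the prediction.
CLAIM_EFFECTS = [0.30, 0.10, 0.05]
LEAKED = [True, False, False]  # assume claim 0 uses post-cutoff knowledge


def toy_value(coalition):
    # Hypothetical stand-in for re-querying the model with a subset of claims.
    return 0.5 + sum(CLAIM_EFFECTS[i] for i in coalition)


phi = shapley_values(len(CLAIM_EFFECTS), toy_value)
weighted_rate = shapley_weighted_leakage_rate(phi, LEAKED)
unweighted_rate = sum(LEAKED) / len(LEAKED)
print(f"unweighted: {unweighted_rate:.3f}, Shapley-weighted: {weighted_rate:.3f}")
```

In this toy setup only one of three claims is flagged as leaked, yet it carries about two thirds of the attributed decision impact, which is the sense in which some leaks count more than others. In practice the value function would come from model queries rather than a fixed additive table, and sampling-based Shapley approximations would likely replace exact enumeration.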
- LLM developers
- AI auditors
- Financial institutions
- Regulators
- Over-reliant LLM applications
- Unscrupulous data providers
Improved reliability and trust in LLM backtesting and historical data analysis.
Increased scrutiny and demand for transparent data provenance and training methodologies for LLMs.
Development of new industry standards and regulatory requirements for 'leakage' mitigation in AI, impacting deployment costs and timelines.
Read at arXiv cs.LG