
arXiv:2605.30393v1 Announce Type: new Abstract: Public numeric benchmarks appear in pretraining, so an evaluation that conditions on a date may be measuring memorized recall rather than out-of-sample skill. We introduce NumLeak, a measurement framework that combines API-boundary probes on production models with a white-box controlled validation on an open causal LM. Top-tier frontier LLMs recall the Fama-French market excess return at 3-seed pooled Pearson r=0.97-0.99 while staying within 0.15 within-25bps on the five sibling factors; comparable fidelity appears on U.S. unemployment, CPI infla
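The two metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes that "pooled" means concatenating the recalled values from all seeds before computing Pearson r, and that "within-25bps" means the fraction of recalled values within 25 basis points of the true value. The excerpt defines neither term, so both readings are assumptions. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def within_bps(truth, recalled, tol_bps=25.0):
    """Fraction of recalled values within tol_bps of the truth.

    Assumes values are in percent, so 1 percentage point = 100 bps.
    """
    hits = sum(abs(t - r) * 100.0 <= tol_bps for t, r in zip(truth, recalled))
    return hits / len(truth)


def pooled_scores(truth, recalled_by_seed, tol_bps=25.0):
    """Concatenate all seeds (assumed meaning of 'pooled'), then score."""
    pooled_truth, pooled_recalled = [], []
    for seed_values in recalled_by_seed:
        pooled_truth.extend(truth)
        pooled_recalled.extend(seed_values)
    return (
        pearson_r(pooled_truth, pooled_recalled),
        within_bps(pooled_truth, pooled_recalled, tol_bps),
    )


# Hypothetical monthly returns in percent, and two hypothetical model runs.
true_returns = [1.0, -2.0, 0.5, 3.0]
model_runs = [
    [1.0, -2.0, 0.5, 3.0],   # seed A: exact recall
    [1.1, -2.2, 0.4, 3.5],   # seed B: close, with one miss beyond 25 bps
]
r, hit_rate = pooled_scores(true_returns, model_runs)
```

A high r alongside a high hit rate would suggest near-verbatim recall under these assumptions, whereas a high r with a low hit rate would only indicate that the model tracks the direction of the series.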
As foundation models proliferate and their evaluation methods come under closer scrutiny, researchers are investigating how pretraining data leakage distorts benchmark results.
Anyone relying on frontier LLM evaluations needs to know whether a score reflects memorized recall or genuine out-of-sample skill, because mistaking one for the other skews evaluations and leads to flawed strategic decisions.
How LLMs acquire and retain public numerical data is still poorly understood, which calls for evaluation frameworks that can separate memorization from generalized skill.
- AI evaluation frameworks
- Open-source causal LMs
- Independent AI safety researchers
- Blind trust in LLM benchmark scores
- Companies relying on superficial model evaluations
- LLM developers downplaying data leakage
The NumLeak framework could provide a standardized method for assessing data leakage in production AI models.
A clearer picture of memorization could in turn inform more robust pretraining strategies and evaluation benchmarks for LLMs.
Greater transparency about LLM capabilities could influence investment in AI R&D, shifting focus toward models that demonstrate out-of-sample skill rather than data recall.
Read at arXiv cs.LG