Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

arXiv:2605.22672v2. Abstract: We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation.
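To make the "superlinear growth" structure concrete, here is a minimal Python sketch of a discrete-time SIR epidemic next to a straight-line series. It is an illustration only, not the paper's setup: the parameter values, the function names `simulate_sir` and `linear_control`, and the choice to "match" the control by sharing the endpoints of an early window are all assumptions made for this example.

```python
def simulate_sir(beta=0.4, gamma=0.1, population=10_000,
                 initial_infected=10, steps=30):
    """Discrete-time SIR model (illustrative parameters, not from the paper).

    Returns the susceptible, infected, and recovered counts at each step.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / population  # mass-action contact term
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


def linear_control(start, end, steps):
    """Straight line from start to end (an assumed notion of 'matched')."""
    slope = (end - start) / steps
    return [start + slope * t for t in range(steps + 1)]


history = simulate_sir()
infected = [i for _, i, _ in history]

# Compare the early outbreak window with a line through the same endpoints.
window = 10
control = linear_control(infected[0], infected[window], window)

for t in range(window + 1):
    print(f"t={t:2d}  SIR infected={infected[t]:8.2f}  linear={control[t]:8.2f}")
```

In the early window the SIR increments grow from step to step, while the linear series adds the same amount each step. A forecaster that extrapolates linearly from the first few points will therefore undershoot the later SIR values.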
The paper adds evidence of a specific limitation in LLM forecasting: on series with superlinear growth and tail risk of regime change, distributional forecasts get worse as model capability increases.
For decision-makers, the implication is that a more capable LLM can produce worse forecasts on high-stakes tasks in finance, epidemiology, and similar domains.
Known LLM forecasting limitations now include this "inverse scaling" effect, so model selection for such tasks cannot rest on general capability alone.
- Specialized statistical modeling firms
- Human domain experts
- Hybrid AI-human forecasting platforms
- Auditors of AI models
- LLM-only forecasting solutions
- Organizations relying solely on general-purpose LLMs for critical forecasts
- Generative AI evangelists
- Model providers ignoring these specific failure modes
Increased scrutiny and demand for robust validation benchmarks for LLM forecasting in high-stakes domains.
Development of specialized LLMs or hybrid models designed to mitigate 'inverse scaling' issues, potentially integrating traditional econometric methods.
A broader re-evaluation of 'capability' metrics for LLMs, moving beyond general benchmarks to task-specific performance in critical applications.
Read at arXiv cs.AI