SIGNAL · AI · May 25, 2026, 4:00 AM · Signal 75 · Medium term

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

arXiv:2605.22672v2. Abstract: We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation.
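For readers unfamiliar with the setup the abstract mentions, here is a minimal sketch of a synthetic SIR epidemic next to a linear control series. It is illustrative only and is not the paper's code: the function names, the parameter values (beta, gamma, population, horizon), and the choice to match the control on its endpoints are all assumptions made for this example, since the abstract does not say how the control is matched.

```python
def simulate_sir(beta=0.4, gamma=0.1, population=1_000_000,
                 initial_infected=10, days=60):
    """Discrete-time SIR model with a daily Euler step.

    Returns cumulative infections per day (length days + 1).
    All parameter values are hypothetical, chosen for illustration.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    cumulative = [float(initial_infected)]
    for _ in range(days):
        # New infections scale with the product S * I, which is what
        # produces the superlinear early growth.
        new_infections = beta * susceptible * infected / population
        recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - recoveries
        cumulative.append(cumulative[-1] + new_infections)
    return cumulative


def linear_control(start, end, days):
    """Straight line sharing the epidemic curve's first and last values.

    Matching on endpoints is an assumption of this sketch.
    """
    step = (end - start) / days
    return [start + step * t for t in range(days + 1)]


epidemic = simulate_sir()
control = linear_control(epidemic[0], epidemic[-1], len(epidemic) - 1)

# Daily increments: accelerating for the epidemic, constant for the control.
epidemic_steps = [b - a for a, b in zip(epidemic, epidemic[1:])]
control_steps = [b - a for a, b in zip(control, control[1:])]
```

In the early phase the epidemic's daily increments grow from one day to the next while the control's stay constant, so a forecaster who extrapolates the recent trend linearly will undershoot the epidemic but not the control.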

Why this matters

Why now

The paper adds controlled evidence to growing industry experience with LLMs and pins down a specific limitation in their forecasting: performance degrades on non-linear series with superlinear growth and tail risk of regime change.

Why it’s important

More capable LLMs can, paradoxically, produce worse forecasts on high-stakes tasks, which affects decisions in finance, epidemiology, and other critical sectors.

What changes

The known limitations of LLM forecasting now include a specific 'inverse scaling' phenomenon on series with superlinear growth and tail risk, which calls for more careful model selection and application.

Winners

· Specialized statistical modeling firms
· Human domain experts
· Hybrid AI-human forecasting platforms
· Auditors of AI models

Losers

· LLM-only forecasting solutions
· Organizations relying solely on general-purpose LLMs for critical forecasts
· Generative AI evangelists
· Model providers ignoring these specific failure modes

Second-order effects

Direct

Increased scrutiny and demand for robust validation benchmarks for LLM forecasting in high-stakes domains.

Second

Development of specialized LLMs or hybrid models designed to mitigate 'inverse scaling' issues, potentially integrating traditional econometric methods.

Third

A broader re-evaluation of 'capability' metrics for LLMs, moving beyond general benchmarks to task-specific performance in critical applications.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.AI

#cs.AI
