SIGNAL · AI · Jun 5, 2026, 4:00 AM · Signal 75 · Medium term

ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

arXiv:2606.05170v1 Announce Type: new Abstract: At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. F

Why this matters

Why now

As open-weight LLMs proliferate and move into critical applications, the potential for severe errors grows, and accuracy alone no longer suffices as an evaluation.

Why it’s important

This research introduces a benchmark that scores each error by severity rather than counting all errors as equivalent. That supports better model selection for sensitive tasks and exposes a blind spot in current LLM evaluation methods.

What changes

The focus of LLM evaluation shifts from scalar error rates to the full distribution of error severity, including its heavy tail, which affects model development and deployment strategies.
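
To make the contrast between a scalar error rate and a severity distribution concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the scores are invented rather than drawn from Errorquake-10k, a score of 0.0 is taken to mean a correct response, and the helper names (`error_rate`, `severe_share`, `mean_error_severity`) and the 3.0 threshold are hypothetical, not the paper's method.

```python
# Illustrative sketch only: the severity scores are invented, not taken
# from Errorquake-10k. Assumption: 0.0 means a correct response, and
# values up to 4.0 mean increasingly severe errors.

def error_rate(scores):
    """Fraction of responses with any error (severity > 0)."""
    return sum(1 for s in scores if s > 0) / len(scores)

def severe_share(scores, threshold=3.0):
    """Fraction of errors whose severity is at or above `threshold`.

    The threshold is a hypothetical cut-off chosen for this example.
    """
    errors = [s for s in scores if s > 0]
    if not errors:
        return 0.0
    return sum(1 for s in errors if s >= threshold) / len(errors)

def mean_error_severity(scores):
    """Average severity over erroneous responses only."""
    errors = [s for s in scores if s > 0]
    return sum(errors) / len(errors) if errors else 0.0

# Two hypothetical models: 20 responses each, 5 errors each.
model_a = [0.0] * 15 + [0.5, 0.5, 1.0, 1.0, 1.5]  # errors are all mild
model_b = [0.0] * 15 + [0.5, 0.5, 1.0, 3.5, 4.0]  # two errors are severe

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    print(
        name,
        "error_rate:", error_rate(scores),
        "severe_share:", severe_share(scores),
        "mean_error_severity:", mean_error_severity(scores),
    )
```

Both hypothetical models make the same number of errors, so a benchmark that reports only the error rate cannot tell them apart. The tail-focused summaries differ because only `model_b` produces errors near the top of the scale.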

Winners

· LLM developers focusing on robust error mitigation
· Enterprises deploying LLMs in high-stakes environments
· Researchers specializing in model safety and interpretability

Losers

· LLM benchmarks relying solely on aggregate error counts
· Open-weight LLMs with high rates of severe, catastrophic errors
· Users unaware that hallucinations differ widely in impact

Second-order effects

Direct

New benchmarks like Errorquake-10k become standard for evaluating the safety and reliability of LLMs.

Second

Model developers prioritize reducing the tail risk of severe errors over marginal improvements in overall accuracy.

Third

Regulatory bodies integrate severity-weighted error analysis into AI safety guidelines and certification processes.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.LG

#cs.LG

Tracked by The Continuum Brief · live intelligence network
