SIGNAL · AI · May 26, 2026, 4:00 AM · Signal 75 · Short term

When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

arXiv:2605.25981v1. Abstract: We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap surviv

Why this matters

Why now

This study measures how differently LLM agents respond to meaning-bearing versus presentation perturbations, a timely result as agentic systems become more prevalent.

Why it’s important

Knowing that LLM agents change their final answers more often under meaning-bearing perturbations (paraphrase, synonym substitution) than under presentation changes (formatting, reordering) helps developers target robustness testing and build more reliable systems.

What changes

The results suggest that agent answers are more sensitive to changes in meaning-bearing wording than to changes in presentation, even when perturbation severity is matched.
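To make the reported metric concrete, here is a minimal sketch of how a per-cell inconsistency gap and a paired t statistic could be computed. The function names and the toy rates are hypothetical and are not taken from the paper; the paper's actual pipeline, including its severity-matching step, is not described in this brief and is not reproduced here.

```python
import math
from statistics import mean, stdev


def inconsistency_rate(original_answer, variant_answers):
    """Fraction of perturbed variants whose final answer differs from the original's."""
    flipped = sum(1 for answer in variant_answers if answer != original_answer)
    return flipped / len(variant_answers)


def paired_gap(semantic_rates, surface_rates):
    """Return (mean gap in percentage points, paired t statistic, count of positive cells).

    Each position in the two lists is one cell (assumed here to be a
    model/benchmark combination) measured under both perturbation types.
    """
    gaps = [100.0 * (sem - sur) for sem, sur in zip(semantic_rates, surface_rates)]
    n = len(gaps)
    mean_gap = mean(gaps)
    # Paired t statistic: mean difference over its standard error.
    t_stat = mean_gap / (stdev(gaps) / math.sqrt(n))
    positive = sum(1 for g in gaps if g > 0)
    return mean_gap, t_stat, positive


# Hypothetical toy rates for four cells (NOT data from the paper).
semantic = [0.40, 0.35, 0.50, 0.20]
surface = [0.20, 0.25, 0.30, 0.22]

mean_gap, t_stat, positive = paired_gap(semantic, surface)
print(f"mean gap = {mean_gap:.2f} pp, t = {t_stat:.2f}, positive cells = {positive}/4")
```

A positive mean gap with most cells positive is the pattern the abstract describes; the paired design matters because each cell serves as its own control across the two perturbation types.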

Winners

· AI safety researchers
· Developers of robust LLM applications
· Companies specializing in AI testing and validation

Losers

· Developers of brittle or poorly tested LLM agents
· Users relying on unvalidated agentic systems

Second-order effects

Direct

The study reports an average inconsistency gap of +19.69 pp between semantic and surface perturbations after severity matching, with the gap positive in 64 of 68 cells.

Second

This observation is likely to prompt new research into making LLM agents more consistent under meaning-bearing rewordings and less sensitive to minor input variations.

Third

Improved understanding and mitigation of these inconsistencies could accelerate the dependable deployment of AI agents in critical, high-stakes environments.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.CL

#cs.CL

Tracked by The Continuum Brief · live intelligence network
