When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

arXiv:2605.25981v1. Abstract: We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap surviv
This research provides empirical evidence that LLM agents respond differently to semantic and surface perturbations, a relevant finding as agentic systems become more prevalent.
Understanding where LLM agents are brittle, particularly their sensitivity to meaning-bearing perturbations, is critical for developing robust and reliable AI systems.
The results suggest that agents' final answers change more often under changes in meaning than under changes in presentation, even when perturbation severity is matched.
- AI safety researchers
- Developers of robust LLM applications
- Companies specializing in AI testing and validation
- Developers of brittle or poorly tested LLM agents
- Users relying on unvalidated agentic systems
The study directly reveals a significant inconsistency gap in LLM agents' responses to semantic versus surface noise.
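To make the headline statistic concrete, the following is a minimal sketch of how a per-cell inconsistency gap and a paired t statistic could be computed. It is not the authors' code: the function names, the exact-match comparison of answers, and the toy numbers are all assumptions made for illustration, and the severity-matching step described in the abstract is not implemented here.

```python
import math
from statistics import mean, stdev


def inconsistency_rate(original_answer, variant_answers):
    """Fraction of perturbed variants whose final answer differs from the original's.

    Assumption: answers are compared by exact equality; a real study would
    likely normalize answers first.
    """
    if not variant_answers:
        raise ValueError("need at least one variant answer")
    changed = sum(1 for a in variant_answers if a != original_answer)
    return changed / len(variant_answers)


def paired_gap(semantic_rates, surface_rates):
    """Summarize the semantic-minus-surface gap across cells.

    Each list holds one inconsistency rate (0..1) per cell, in the same cell
    order. Returns (mean gap in percentage points, paired t statistic,
    number of cells with a positive gap).
    """
    if len(semantic_rates) != len(surface_rates):
        raise ValueError("rates must be paired cell by cell")
    diffs = [100.0 * (s - f) for s, f in zip(semantic_rates, surface_rates)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two cells for a paired t statistic")
    sd = stdev(diffs)  # sample standard deviation of the per-cell gaps
    if sd == 0:
        raise ValueError("all gaps are identical; t statistic is undefined")
    mean_gap = mean(diffs)
    t_stat = mean_gap / (sd / math.sqrt(n))
    n_positive = sum(1 for d in diffs if d > 0)
    return mean_gap, t_stat, n_positive


if __name__ == "__main__":
    # Toy, made-up rates for three hypothetical cells (not data from the paper).
    semantic = [0.5, 0.4, 0.6]
    surface = [0.3, 0.3, 0.3]
    gap, t_stat, n_pos = paired_gap(semantic, surface)
    print(f"mean gap = {gap:.2f} pp, t = {t_stat:.2f}, positive cells = {n_pos}/3")
```

In this sketch a "cell" is simply one paired entry in the two lists; the paper's cells span models, agent types, and benchmarks, and how they are constructed is not specified in this brief.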
This observation is likely to motivate new research into making LLM agents more robust to semantic perturbations such as paraphrase and synonym substitution.
Improved understanding and mitigation of these inconsistencies could accelerate the dependable deployment of AI agents in critical, high-stakes environments.
Read at arXiv cs.CL