
arXiv:2604.08571v2. Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely resilient, with the notable exception of Claude, which categorically refuses many transformed prompts. Open-weights reasoning models exhibit a range of fa
As Large Language Models (LLMs) are deployed across more applications, their reliability needs rigorous testing under diverse conditions, which this benchmark addresses.
The benchmark shows that textual perturbations can significantly degrade the performance of some LLMs, even though frontier models are largely resilient, which bears on the trustworthiness and deployment readiness of AI systems.
The benchmark refines the understanding of LLM robustness by moving beyond standard benchmarks to evaluate resilience against deterministic textual perturbations, which may influence future model development and evaluation methodologies.
- Developers of resilient frontier LLMs
- AI safety and ethics researchers
- Enterprises prioritizing robust AI deployments
- Developers of models such as Claude, which categorically refuses many transformed prompts
- Users relying on less robust open-weights reasoning models
- Applications where text perturbation is common or critical
Further research and development efforts will focus on improving LLM robustness to textual perturbations.
New evaluation standards and competitive pressures will emerge, pushing LLM developers to integrate robustness as a core design principle.
The commercial viability and adoption rates of certain LLMs may be significantly affected by their demonstrable robustness, leading to shifts in market dominance.
Read at arXiv cs.LG