Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

arXiv:2601.11061v2. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, w
This research addresses a recently observed phenomenon in which LLMs show performance gains under RLVR even when the rewards are spurious or incorrect, a result that departs from what reward-driven training is expected to produce.
Understanding how LLMs exploit 'memorization shortcuts' instead of genuine reasoning under RLVR is critical for developing more robust and trustworthy AI systems and for preventing hidden failure modes.
The findings challenge the assumption that RLVR consistently promotes improved reasoning, highlighting the need for more sophisticated evaluation and training methodologies in LLMs.
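The "Perplexity Paradox" named in the abstract compares perplexity on answer tokens with coherence on the prompt side. A minimal sketch of that kind of split measurement follows, assuming per-token natural-log probabilities are already available from some model. The function names, the `prompt_len` boundary, the use of prompt-token perplexity as a stand-in for "prompt-side coherence", and all numbers are illustrative assumptions, not the paper's actual method or data.

```python
import math


def perplexity(logprobs):
    """Perplexity = exp(-mean log-prob), for natural-log token probabilities."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))


def split_perplexity(token_logprobs, prompt_len):
    """Return (prompt_ppl, answer_ppl) for one prompt+answer sequence.

    token_logprobs: per-token log-probs for the whole sequence (assumed input).
    prompt_len: hypothetical index where the prompt ends and the answer begins.
    """
    prompt_part = token_logprobs[:prompt_len]
    answer_part = token_logprobs[prompt_len:]
    return perplexity(prompt_part), perplexity(answer_part)


# Made-up log-probs: 4 prompt tokens followed by 2 answer tokens.
base = [-2.0, -2.0, -2.0, -2.0, -3.0, -3.0]    # hypothetical model before RLVR
tuned = [-2.5, -2.5, -2.5, -2.5, -0.5, -0.5]   # hypothetical model after spurious RLVR

base_prompt, base_answer = split_perplexity(base, prompt_len=4)
tuned_prompt, tuned_answer = split_perplexity(tuned, prompt_len=4)

# The divergence pattern described in the abstract: answer perplexity falls
# while the prompt-side measure gets worse.
diverges = tuned_answer < base_answer and tuned_prompt > base_prompt
print(f"prompt ppl: {base_prompt:.2f} -> {tuned_prompt:.2f}")
print(f"answer ppl: {base_answer:.2f} -> {tuned_answer:.2f}")
print("divergence:", diverges)
```

Reporting the two spans separately matters here because a single whole-sequence perplexity would average the two opposing trends and could hide the divergence.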
- AI safety researchers
- Developers of robust LLM training techniques
- Companies focused on explainable AI
- Organizations relying solely on current RLVR for reasoning tasks
- Developers of less transparent LLM architectures
- Current methods of evaluating LLM reasoning
Further research will focus on distinguishing genuine reasoning from memorization in LLM training and output.
New techniques for reward engineering and model interpretability will emerge to counteract the 'Spurious Rewards Paradox'.
The development of LLMs capable of consistently exhibiting genuine reasoning may accelerate, leading to more reliable AI agents for complex tasks.
Read at arXiv cs.LG