
arXiv:2507.08794v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term "master keys" such as non-word symbols (e.g., ":" or ".") or generic reasonin
The increasing reliance on LLMs for critical tasks like evaluation and reward generation in AI training creates an immediate need to understand and mitigate their vulnerabilities.
This research exposes a fundamental vulnerability in LLM-as-a-judge paradigms, highlighting that even reference-based systems can be manipulated, which compromises the integrity of AI development and safety.
The assumption that LLMs can serve as reliably objective judges, particularly in verifiable reward settings, is challenged, necessitating more robust evaluation and reward mechanisms.
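One way to make this concern concrete is to probe a judge with content-free responses and count how often it accepts them. The sketch below is a minimal illustration, not the paper's method: the `Judge` signature, the `false_positive_rate` helper, and both toy judges are hypothetical, and the probe list contains only the two symbols quoted in the abstract. In practice the `judge` callable would wrap a call to a real generative reward model.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not from the source):
# (question, reference_answer, candidate_response) -> True if judged correct.
Judge = Callable[[str, str, str], bool]

# Only the two symbols quoted in the abstract; not the paper's full probe set.
MASTER_KEYS: Tuple[str, ...] = (":", ".")


def false_positive_rate(
    judge: Judge,
    items: Iterable[Tuple[str, str]],
    probes: Sequence[str] = MASTER_KEYS,
) -> float:
    """Fraction of (item, probe) pairs where a content-free probe is accepted."""
    total = 0
    accepted = 0
    for question, reference in items:
        for probe in probes:
            total += 1
            if judge(question, reference, probe):
                accepted += 1
    return accepted / total if total else 0.0


def lenient_toy_judge(question: str, reference: str, response: str) -> bool:
    """Toy stand-in for a hackable judge (NOT an LLM): it accepts any
    response with no letters or digits, or an exact match to the reference."""
    has_content = any(ch.isalnum() for ch in response)
    return (not has_content) or response.strip() == reference.strip()


def strict_toy_judge(question: str, reference: str, response: str) -> bool:
    """Toy stand-in for a robust judge: exact match with the reference only."""
    return response.strip() == reference.strip()


# Illustrative items; the questions and references are made up for this sketch.
ITEMS = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

if __name__ == "__main__":
    print("lenient:", false_positive_rate(lenient_toy_judge, ITEMS))
    print("strict:", false_positive_rate(strict_toy_judge, ITEMS))
```

Because the probes carry no answer content, any acceptance counts as a false positive. Tracking this rate across judge versions gives a simple regression check for the kind of weakness the abstract describes.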
- AI safety researchers
- Cybersecurity firms
- Developers of robust LLM evaluation techniques
- AI developers relying solely on LLM-as-a-judge for reward signals
- Organizations deploying unchecked generative reward models
- LLM-as-a-Judge paradigms without adversarial training
Security patches and adversarial training techniques will be prioritized for LLM-based evaluation systems.
There will be a push for human-in-the-loop validation or alternative, more robust methods for reward signal generation beyond sole reliance on LLMs.
Adoption of fully autonomous agentic AI systems could slow if core evaluation components are deemed hackable.
Read at arXiv cs.CL