
arXiv:2512.17375v2 Announce Type: replace Abstract: LLM-as-a-Judge systems supply the reward signal in modern RLHF and RLVR pipelines, but their binary verdict reduces to a single linear readout F_gap on one hidden state. We show this readout is shallow enough that short, low-perplexity tokens flip the verdict from "No" to "Yes". These tokens are sampled from the judge's own next-token distribution at the response position, with no manual seed set and no gradient-based optimization. Our procedure, AdvJudge-Zero, reaches >90% ensemble false-positive rate on 22 of 24 (model, dataset) cells acr
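The abstract's point that a binary verdict reduces to a linear readout on one hidden state can be illustrated with a toy sketch. It assumes, as one plausible reading that the truncated abstract does not confirm, that F_gap is the "Yes" logit minus the "No" logit at the response position. The matrix `W_U`, the token ids `YES_ID` and `NO_ID`, and the sizes are hypothetical, and the sketch edits the hidden state directly, whereas the attack described above moves it by inserting sampled tokens.

```python
import random

random.seed(0)
D_MODEL, VOCAB = 8, 20            # toy sizes, not taken from the paper
YES_ID, NO_ID = 3, 7              # hypothetical token ids for "Yes" and "No"

# Hypothetical unembedding matrix: one row of weights per vocabulary token.
W_U = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Readout direction: difference between the "Yes" row and the "No" row.
w = [y - n for y, n in zip(W_U[YES_ID], W_U[NO_ID])]

def f_gap(h):
    """Assumed form of F_gap: Yes-logit minus No-logit, linear in h."""
    return dot(w, h)

def verdict(h):
    """Binary verdict read off the sign of the gap."""
    return "Yes" if f_gap(h) > 0 else "No"

# Build a hidden state whose gap is -1, i.e. a "No" verdict.
h_raw = [random.gauss(0, 1) for _ in range(D_MODEL)]
scale = (dot(w, h_raw) + 1.0) / dot(w, w)
h_no = [x - scale * wi for x, wi in zip(h_raw, w)]

# Shifting h along w by 2 / ||w||^2 raises the gap by 2 and flips the verdict.
step = 2.0 / dot(w, w)
h_flipped = [x + step * wi for x, wi in zip(h_no, w)]

print(verdict(h_no), round(f_gap(h_no), 6))
print(verdict(h_flipped), round(f_gap(h_flipped), 6))
```

Because the gap is linear in the hidden state, any input change that moves the state far enough along the single direction `w` changes the verdict, regardless of what happens in the other dimensions.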
Because RLHF and RLVR pipelines increasingly rely on LLM-as-a-Judge systems for their reward signal, the vulnerability of these judges to simple adversarial attacks is a pressing concern.
The research reports that short, low-perplexity tokens can flip a judge's verdict from "No" to "Yes", which threatens the reliability of AI models trained or guided by such judges and has implications for safety and bias.
The finding undermines the perceived robustness of LLM-as-a-Judge systems and calls for more resilient evaluation mechanisms.
- AI safety researchers
- Adversarial AI developers
- Cybersecurity firms specializing in AI
- Developers of reward models for RLHF/RLVR
- AI evaluation platforms
- Systems relying on current LLM-as-a-Judge paradigms
LLM-as-a-Judge systems that supply reward signals are shown to be vulnerable to simple, non-gradient adversarial attacks.
This vulnerability could lead to the deployment of AI models trained on flawed reward signals, potentially exhibiting unintended or harmful behaviors.
Public distrust in AI systems could grow if adversarial manipulations of evaluation systems become widespread, leading to regulatory scrutiny and demands for explainability.
Read at arXiv cs.LG