SIGNAL · AI · May 28, 2026, 4:00 AM · Signal 85 · Short term

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

Source: arXiv cs.LG

arXiv:2512.17375v2 · Announce type: replace

Abstract: LLM-as-a-Judge systems supply the reward signal in modern RLHF and RLVR pipelines, but their binary verdict reduces to a single linear readout F_gap on one hidden state. We show this readout is shallow enough that short, low-perplexity tokens flip the verdict from "No" to "Yes". These tokens are sampled from the judge's own next-token distribution at the response position, with no manual seed set and no gradient-based optimization. Our procedure, AdvJudge-Zero, reaches >90% ensemble false-positive rate on 22 of 24 (model, dataset) cells acr
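The abstract's claim that the verdict "reduces to a single linear readout" can be made concrete with a toy sketch. It assumes the judge decides by comparing its "Yes" and "No" logits at one position, so the gap is the dot product of (w_yes - w_no) with the hidden state. The vectors, the numbers, and the helper names `gap_direction`, `logit_gap`, and `verdict` are hypothetical illustrations, not taken from the paper.

```python
# Toy sketch: all vectors and numbers below are made up for illustration.
# Assumption: the judge answers "Yes" when the Yes logit exceeds the No logit,
# and both logits are linear in the same hidden state.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def gap_direction(w_yes, w_no):
    """Direction in hidden space that the Yes-minus-No gap reads out."""
    return [y - n for y, n in zip(w_yes, w_no)]

def logit_gap(hidden, w_yes, w_no):
    """Yes logit minus No logit, written as one linear readout."""
    return dot(gap_direction(w_yes, w_no), hidden)

def verdict(hidden, w_yes, w_no):
    """Binary decision: the sign of the gap."""
    return "Yes" if logit_gap(hidden, w_yes, w_no) > 0 else "No"

# Hypothetical output-layer rows for the "Yes" and "No" tokens.
w_yes = [0.5, -1.0, 2.0]
w_no = [1.5, 0.0, 1.0]

# Hypothetical hidden state for a response the judge rejects.
hidden = [1.0, 2.0, 0.5]

# Moving the hidden state along the gap direction raises the gap by
# (step * squared norm of w_gap); here that is 1.0 * 3.0 = 3.0.
w_gap = gap_direction(w_yes, w_no)
shifted = [h + 1.0 * g for h, g in zip(hidden, w_gap)]

print(verdict(hidden, w_yes, w_no), logit_gap(hidden, w_yes, w_no))    # No -2.5
print(verdict(shifted, w_yes, w_no), logit_gap(shifted, w_yes, w_no))  # Yes 0.5
```

In this picture, appended text matters only through how far it moves the hidden state along `w_gap`. Whether short sampled tokens do this in a real judge is the paper's claim; the toy only shows why a linear readout makes a sign flip cheap.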

Why this matters
Why now

LLM-as-a-Judge systems increasingly supply the reward signal in RLHF and RLVR pipelines, so an attack this simple on their verdicts is a pressing concern.

Why it’s important

This research exposes a basic weakness in LLM-based evaluation: if a judge's "No" can be flipped to "Yes", models trained or guided by that judge become less reliable and trustworthy, with implications for safety and bias.

What changes

The perceived robustness of LLM-as-a-Judge systems is significantly undermined, necessitating rapid development of more resilient evaluation mechanisms.

Winners
  • AI safety researchers
  • Adversarial AI developers
  • Cybersecurity firms specializing in AI
Losers
  • Developers of reward models for RLHF/RLVR
  • AI evaluation platforms
  • Systems relying on current LLM-as-a-Judge paradigms
Second-order effects
Direct

LLM-as-a-Judge systems that supply reward signals are shown to be vulnerable to simple adversarial attacks that need no gradient-based optimization.

Second

Models trained on the resulting flawed reward signals could be deployed and exhibit unintended or harmful behaviors.

Third

Public distrust in AI systems could grow if adversarial manipulations of evaluation systems become widespread, leading to regulatory scrutiny and demands for explainability.

Editorial confidence: 90 / 100 · Structural impact: 75 / 100