How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv:2606.25487v1 (cross-list). Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite ways. The dedicated classifier over-flags (precisi
As LLMs proliferate and attention to their responsible deployment grows, adversarial-robustness claims are only as sound as the methods used to evaluate them, which makes judge reliability a pressing concern.
The reliability of automated judges directly impacts the safety and trustworthiness of AI systems, especially in critical applications sensitive to jailbreaks and prompt injection attacks.
If automated ASR scoring, a standard metric in AI safety research, is unreliable, then current LLM safety benchmarks and development practices that depend on it need to be re-evaluated.
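The comparison the abstract describes, scoring an automated judge against human majority votes, can be sketched in a few lines. This is a minimal illustration and not the paper's code: the function names `majority_vote` and `judge_report` are hypothetical, and the five toy completions and their labels are invented for the example rather than drawn from the HarmBench validation set.

```python
def majority_vote(votes):
    """Return 1 (attack succeeded) if strictly more than half of the votes are 1."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def judge_report(human_votes, judge_labels):
    """Compare judge labels (0/1) against the human majority label per completion."""
    gold = [majority_vote(v) for v in human_votes]
    pairs = list(zip(gold, judge_labels))
    tp = sum(1 for g, j in pairs if g == 1 and j == 1)  # judge and humans both flag
    fp = sum(1 for g, j in pairs if g == 0 and j == 1)  # judge over-flags
    fn = sum(1 for g, j in pairs if g == 1 and j == 0)  # judge misses
    return {
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "human_asr": sum(gold) / len(gold),                  # ASR under human labels
        "judge_asr": sum(judge_labels) / len(judge_labels),  # ASR the judge would report
    }


# Invented toy data: three human annotators per completion, one judge label each.
human_votes = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]]
judge_labels = [1, 1, 1, 0, 1]  # a judge that over-flags

report = judge_report(human_votes, judge_labels)
print(report)
```

On this toy data the judge catches every attack the humans flagged (recall 1.0) but only half of its flags are correct (precision 0.5), so it would report an ASR of 0.8 where the human labels give 0.4. The point of the sketch is only that an over-flagging judge inflates ASR; the actual magnitudes for real judges are whatever the paper measures.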
- developers of more robust AI safety evaluation methods
- companies prioritizing human-in-the-loop validation for AI safety
- researchers focused on adversarial AI and secure ML
- LLM developers relying solely on automated ASR judges
- benchmarking platforms with unvalidated scoring mechanisms
- users exposed to less secure AI systems
Increased scrutiny and re-evaluation of automated ASR judges and safety evaluation tools for LLMs.
Development of next-generation, human-validated, and more robust adversarial robustness benchmarks and scoring methodologies.
Shifts in LLM development priorities to incorporate judge reliability as a core design principle alongside attack surface reduction.
Read at arXiv cs.LG