SIGNAL · AI · May 29, 2026, 4:00 AM · Signal 75 · Medium term

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

arXiv:2605.30162v1 (announce type: cross). Abstract: Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it

Why this matters

Why now

As large language models advance and are deployed more widely, scrutiny of their safety and reliability is growing, particularly in sensitive domains like biosecurity.

Why it’s important

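The research indicates that refusal behavior can hinge on surface features such as chat-template formatting: Gemma 4 E2B-IT refused 65/75 prompts with the template and 0/75 without it. Biosecurity safeguards that rely on refusals alone may therefore be easier to bypass than assumed, which warrants attention from AI developers and policymakers.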

What changes

The understanding of AI refusal mechanisms shifts from 'is it refusing?' to 'how robust is that refusal?', highlighting a new dimension of potential algorithmic fragility and requiring more sophisticated auditing methods.
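One way to make "how robust is that refusal?" concrete is to compare refusal rates for the same prompt set under different framing conditions. The sketch below is illustrative only and is not the paper's metric or code: the `ConditionResult` class and `refusal_gap` function are hypothetical names, and the only data used are the 65/75 and 0/75 counts quoted in the abstract for Gemma 4 E2B-IT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionResult:
    """Refusal tally for one framing condition (hypothetical helper type)."""
    name: str
    refused: int
    total: int

    @property
    def refusal_rate(self) -> float:
        # Fraction of prompts in this condition that the model refused.
        return self.refused / self.total


def refusal_gap(a: ConditionResult, b: ConditionResult) -> float:
    """Absolute difference in refusal rate between two conditions.

    A large gap suggests the refusal depends on framing rather than content.
    """
    return abs(a.refusal_rate - b.refusal_rate)


# Counts taken from the abstract (Gemma 4 E2B-IT, 75 prompts per condition).
with_template = ConditionResult("chat template", refused=65, total=75)
without_template = ConditionResult("no chat template", refused=0, total=75)

gap = refusal_gap(with_template, without_template)

for result in (with_template, without_template):
    print(f"{result.name}: {result.refusal_rate:.1%} refused")
print(f"gap: {gap:.1%}")
```

A gap near zero would indicate that refusals are stable across the two conditions; here the gap equals the full with-template refusal rate, because the model refused nothing once the template was removed.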

Winners

· AI safety researchers
· Cybersecurity firms specializing in AI red-teaming
· National security agencies

Losers

· Developers of unhardened LLMs
· Organizations relying solely on current refusal mechanisms for hazardous content
· AI ethics-washing initiatives

Second-order effects

Direct

Regulators will likely impose stricter auditing requirements for AI models, especially those deployed in high-risk applications like biosecurity.

Second

An arms race could emerge between prompt engineers attempting to bypass safety features and the red-teamers and developers working to harden refusal mechanisms.

Third

Public trust in the safety claims of AI systems may erode as these vulnerabilities become more widely understood.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.LG

#cs.AI #cs.CR #cs.LG
