BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

arXiv:2605.30162v1 (announce type: cross)

Abstract: Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it
The rapid advancement and deployment of large language models has drawn growing scrutiny of their safety and reliability, particularly in sensitive domains such as biosecurity.
The research indicates that safety refusals can be fragile: a refusal may disappear under modest changes to prompt framing, formatting, or output length, so biosecurity safeguards that appear to work can be bypassed with little effort. That fragility warrants prompt attention from AI developers and policymakers.
The central question about AI refusals shifts from 'is it refusing?' to 'how robust is that refusal?', which exposes a new dimension of algorithmic fragility and calls for more sophisticated auditing methods.
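One way to make the "how robust is that refusal?" question concrete is to compare refusal rates for the same prompt set under different formatting conditions. The sketch below is illustrative only and is not the paper's method: the function names and the "robustness gap" metric are assumptions introduced here, and the only data used are the Gemma 4 E2B-IT counts quoted in the abstract (65/75 with chat-template formatting, 0/75 without).

```python
from fractions import Fraction


def refusal_rate(refused: int, total: int) -> Fraction:
    """Share of prompts refused under one condition (exact fraction)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= refused <= total:
        raise ValueError("refused must be between 0 and total")
    return Fraction(refused, total)


def robustness_gap(rates: dict[str, Fraction]) -> Fraction:
    """Hypothetical metric: largest difference in refusal rate
    between any two conditions. 0 means the refusal behaviour is
    identical across conditions; 1 means it flips completely."""
    return max(rates.values()) - min(rates.values())


# Counts for Gemma 4 E2B-IT as reported in the abstract.
rates = {
    "chat_template": refusal_rate(65, 75),
    "no_chat_template": refusal_rate(0, 75),
}

gap = robustness_gap(rates)
for condition, rate in rates.items():
    print(f"{condition}: {float(rate):.1%}")
print(f"robustness gap: {float(gap):.1%}")
```

A large gap on an identical prompt set suggests the refusal depends on surface formatting rather than on the content of the request, which is the kind of fragility the brief describes.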
- AI safety researchers
- Cybersecurity firms specializing in AI red-teaming
- National security agencies
- Developers of unhardened LLMs
- Organizations relying solely on current refusal mechanisms for hazardous content
- AI ethics-washing initiatives
Regulators will likely impose stricter auditing requirements for AI models, especially those deployed in high-risk applications like biosecurity.
An arms race could emerge between prompt engineers attempting to bypass safety features and AI red-teamers designing more robust refusal mechanisms.
Public trust in the safety claims of AI systems may erode as these vulnerabilities become more widely understood.
Read at arXiv cs.LG