Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

arXiv:2606.00813v1 (cross-listed)

Abstract: Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack success rate (ASR; mean +/- std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/- 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/- 1.8%). Replaying evolved attack archives across generations reveals that attacks from other g
This research provides new empirical evidence that LLM safety alignment is non-monotonic: safety does not consistently improve from one model generation to the next, and a newer model can be more vulnerable than its predecessor.
It challenges the assumption that newer LLM generations are inherently safer, highlighting the complex and potentially regressive nature of safety mechanisms.
Developers and red teams should not assume that a new LLM version is safer than the last; it may be more vulnerable to adversarial attacks, so evaluations need to be continuous and generation-specific.
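As a rough illustration of what a generation-specific check could look like, the sketch below computes an attack success rate, summarizes it across seeds, and tests whether ASR falls monotonically across generations. The function names are hypothetical, the use of the sample standard deviation is an assumption (the abstract does not say which estimator was used), and only the three mean ASR values quoted in the abstract are taken from the source. It is not the paper's evaluation code.

```python
from statistics import mean, stdev


def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(outcomes) / len(outcomes)


def summarize_seeds(per_seed_asr):
    """Mean and sample standard deviation of ASR across seeds.

    Assumption: sample std (n - 1 denominator); the abstract only says "std".
    """
    return mean(per_seed_asr), stdev(per_seed_asr)


def is_monotonic_decreasing(asr_by_generation):
    """True if ASR never rises from one generation to the next."""
    return all(
        later <= earlier
        for earlier, later in zip(asr_by_generation, asr_by_generation[1:])
    )


# Mean ASR (%) quoted in the abstract, in release order: Gemma 2, Gemma 3, Gemma 4.
reported_mean_asr = [45.5, 68.7, 33.9]
print(is_monotonic_decreasing(reported_mean_asr))  # False
```

The check fails because Gemma 3's reported ASR is higher than Gemma 2's, which is the regression the abstract describes. A comparison of means alone ignores seed-to-seed variance, which is why the abstract also reports a significance test.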
- Red-teaming expertise and services
- Cybersecurity firms specializing in AI
- Independent AI safety researchers
- LLM developers without robust, continuous safety testing
- Users relying solely on version numbers for safety assurance
- Generic, one-off safety audit methodologies
Increased emphasis on, and investment in, cross-generational AI safety evaluation and the study of how attacks transfer between model generations.
Potential for regulatory bodies to demand more stringent, continuous safety audits across LLM development cycles.
Divergence in LLM adoption based on proven, transparent safety methodologies rather than simply model size or generation number.
Read at arXiv cs.CL