Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

arXiv:2605.26332v1. Abstract: Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversa
The proliferation of advanced text-to-image models necessitates robust unlearning mechanisms, which are simultaneously being challenged by increasingly sophisticated adversarial attacks like BEAP.
The ability to unlearn or remove specific concepts from AI models is crucial for ethical AI development, intellectual property protection, and regulatory compliance, and attacks on this capability undermine these efforts.
The development of black-box, embedding-aware attacks like BEAP raises the bar for effective AI unlearning and highlights the ongoing cat-and-mouse game between AI safety mechanisms and adversarial techniques.
- AI security researchers
- Adversarial AI developers
- Organizations seeking to circumvent model restrictions
- AI model developers
- Users and companies relying on unlearned models
- Ethical AI governance
Attackers can reliably exploit text-to-image models to generate unlearned content, even without model internals.
This will drive increased investment into more resilient unlearning techniques and black-box defense mechanisms for AI models.
The perceived fragility of AI unlearning could lead to stricter regulatory mandates on model transparency or design for critical applications.
Read at arXiv cs.AI