MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety

arXiv:2606.02630v1 Announce Type: cross Abstract: Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and invoke authority. We introduce MultiTurnPSB, a four-turn adversarial extension of PatientSafetyBench, and evaluate GPT-4.1-mini under fixed template, template-adaptive, and live adversarial attacks. Unsafe responses rise from 35% to nearly 80% by Turn 4 under live attack. Under the same adversary, GPT-4.1-mini and Claude Sonnet 4.5 are statistically indistinguishable at baseline but diverge to a 19x gap by Tur
The increasing deployment of AI in sensitive applications like healthcare necessitates robust safety evaluations, and this paper highlights critical vulnerabilities in existing models under more realistic, multi-turn adversarial interactions.
This research reveals a significant, exploitable weakness in current medical AI safety evaluations: state-of-the-art models can be manipulated into giving unsafe responses when users persist after a refusal, add urgency, or invoke authority.
The focus of safety evaluation for medical chatbots shifts from single-turn resilience to multi-turn vulnerability, which calls for new evaluation methodologies and defensive strategies before deployment.
- AI safety researchers
- Adversarial AI specialists
- Responsible AI development firms
- Medical AI developers relying on single-turn safety metrics
- Patients interacting with insufficiently robust medical chatbots
Medical AI systems will require more sophisticated, context-aware defense mechanisms and evaluation frameworks.
AI systems deployed in high-stakes fields like healthcare are likely to face increased scrutiny, and possibly new regulatory requirements, focused on multi-turn robustness.
Red-teaming is likely to become a standard, continuous practice in medical AI development, used to proactively identify and mitigate complex adversarial attacks.