THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

arXiv:2606.01738v1. Abstract: Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining -- often degrading model utility -- or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to
The rapid advancement of large language models and their increasing deployment across applications necessitate robust defenses against sophisticated multi-turn adversarial attacks.
Securing large language models against 'jailbreak' attacks is critical for maintaining their safety and trustworthiness and for preventing misuse, which in turn affects their widespread adoption and regulatory compliance.
A training-free, multi-turn defense framework could secure LLMs with fewer resources than costly retraining, while tracking the cross-turn risk accumulation that single-turn analysis fails to capture.
- AI developers
- LLM users
- Cybersecurity firms
- Malicious actors
- Attack frameworks
Increased reliability and safety of large language models against conversational exploitation.
Accelerated deployment of LLMs in sensitive applications due to enhanced security postures.
A shift in attack strategies as adversaries adapt to more sophisticated multi-turn defenses for LLMs.
Read at arXiv cs.CL