Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

arXiv:2510.14207v3. Abstract: Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agen
As LLMs spread through interactive web applications, understanding how they can be misused requires looking beyond single-turn prompt attacks to attacks that unfold over multi-turn interactions.
This research provides benchmarks and methods to evaluate and mitigate the risk of LLMs being weaponized for online harassment through multi-turn agentic interactions, a risk that bears directly on trust and safety.
The focus of LLM safeguarding shifts from isolated prompt-based attacks to more complex, simulated multi-agent interactions, which call for stronger defense mechanisms and ethical guardrails.
- AI safety researchers
- Social media platforms
- Developers of robust LLM defense systems
- Unsecured LLM applications
- Users vulnerable to online harassment
- Developers neglecting multi-turn security
New benchmarks and methodologies will emerge for testing LLM resilience against multi-turn malicious interactions.
Red teaming and adversarial AI research will attract more investment and become a standard part of LLM deployment.
The development of 'ethical AI agents' designed to detect and neutralize harassment in real time will accelerate, potentially leading to new forms of proactive digital moderation.
Read at arXiv cs.AI