arXiv:2511.13725v4 Announce Type: replace-cross Abstract: Malicious AI causing harm to humans is not just a Hollywood fantasy. Indeed, as highly capable models such as Claude Mythos emerge and agent systems like OpenClaw rapidly spread, the question of how to stop an AI that acts maliciously -- whether by design or by accident -- has become urgent. To address this, we propose Killbench, a benchmark for evaluating the Killswitch: a mechanism that halts a malicious AI's in-progress behavior using only external signals. Targeting web agents -- the most widely deployed agent domain -- Killbench ev
Source: arXiv cs.AI — read the full report at the original publisher.
