
arXiv:2606.03601v1 Announce Type: cross Abstract: While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide
As LLMs are integrated into more critical applications, safety guardrails that cause overrefusal of benign queries are a growing obstacle to reliable deployment.
DDOR targets a known limitation of current LLM safety mechanisms: unwarranted refusals reduce reliability in applications where both safety and usefulness matter.
Because DDOR tests and repairs overrefusal in a black-box setting, developers could reduce unwarranted refusals without internal access to proprietary safety systems.
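The localization step mentioned in the abstract can be illustrated with a minimal sketch. The excerpt does not give DDOR's actual procedure, so the code below is not the authors' method: it is the classic ddmin delta-debugging loop applied to the words of a prompt, and it returns a subset that still triggers a refusal but stops doing so if any single word is removed. The names `ddmin` and `toy_is_refused` are hypothetical, and `toy_is_refused` is a keyword check standing in for a real black-box model query.

```python
from typing import Callable, List, Sequence


def ddmin(tokens: Sequence[str],
          is_refused: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `tokens` to a 1-minimal sublist that is still refused.

    `is_refused` is the only access to the model: it takes a candidate
    token list and reports whether the model refuses it.
    """
    current = list(tokens)
    if not is_refused(current):
        raise ValueError("the full prompt is not refused; nothing to localize")

    n = 2  # target number of chunks (granularity)
    while len(current) >= 2:
        size = max(1, len(current) // n)
        chunks = [current[i:i + size] for i in range(0, len(current), size)]
        reduced = False

        # Step 1: does a single chunk trigger the refusal on its own?
        for chunk in chunks:
            if is_refused(chunk):
                current, n, reduced = chunk, 2, True
                break

        # Step 2: otherwise, can one chunk be dropped without losing it?
        if not reduced:
            for i in range(len(chunks)):
                rest = [t for j, c in enumerate(chunks) if j != i for t in c]
                if is_refused(rest):
                    current, n, reduced = rest, max(n - 1, 2), True
                    break

        # Step 3: otherwise, split finer, or stop at single-token chunks.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)

    return current


def toy_is_refused(tokens: List[str]) -> bool:
    """Hypothetical oracle: pretend the model refuses any prompt with 'kill'."""
    return "kill" in (t.lower() for t in tokens)


if __name__ == "__main__":
    prompt = "How do I kill a Python process on Linux".split()
    print(ddmin(prompt, toy_is_refused))  # prints ['kill']
```

In practice the oracle would send the candidate text to the model and classify the reply as a refusal or not. Each oracle call is one model query, so caching results for repeated candidates would be a natural addition.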
- LLM Developers
- AI Safety Researchers
- Enterprises Adopting LLMs
- Users of LLM-powered applications
- LLM Systems with high overrefusal
- Organizations relying on opaque safety mechanisms
Increased public and industry trust in AI safety and reliability.
Accelerated adoption of LLMs in highly regulated and sensitive sectors due to improved refusal handling.
Potential for new regulatory frameworks and industry standards centered around explainable overrefusal testing.
Read at arXiv cs.AI