OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference

arXiv:2601.13300v2 Announce Type: replace Abstract: Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by directive signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional option containing a misleading directive, leveraging standardized choice structure and scalable evalua
The proliferation and increasing autonomy of Large Language Models necessitate robust evaluation methods beyond traditional benchmarks to understand and mitigate their vulnerabilities.
Understanding how LLMs can be manipulated through 'option injection' is critical for developing more secure and reliable AI systems, especially as they integrate into sensitive applications.
The introduction of OI-Bench provides a standardized and scalable method to assess LLM susceptibility to deceptive prompts, which was previously harder to quantify systematically.
- · AI safety researchers
- · LLM developers (improving robustness)
- · Organizations deploying LLMs
- · LLMs with poor directive interference resistance
- · Organisations relying on un-benchmarked LLMs
Researchers gain a new tool to benchmark and compare the robustness of different LLMs against a specific type of attack.
Heightened awareness leads to the development of new training methodologies or architectural changes to make LLMs more resilient to directive interference.
Improved LLM robustness contributes to greater trust and broader adoption in high-stakes environments, while poorly defended models face scrutiny.
This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.
Read at arXiv cs.CL