
arXiv:2606.15762v1 Announce Type: cross Abstract: We ran 300 repeated vulnerability-finding scans to measure how repeatable agentic large language model (LLM) security review is on the same JavaScript code, prompt, and benchmark harness. The headline result is that LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run. Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, when Claude matched a Snyk Cod
The rapid deployment of large language models for complex tasks, including security, and the growing reliance on them necessitate a deeper understanding of their reliability before widespread adoption.
Strategic readers should care because run-to-run variability in LLM output for critical tasks such as vulnerability detection creates operational risk and limits immediate utility without further development.
The findings temper expectations of consistent, repeatable performance from agentic LLMs in security applications and highlight the need for greater stability and robustness in their design and deployment.
- AI safety researchers
- Traditional cybersecurity firms
- Developers focused on LLM reliability
- Companies prematurely relying on LLMs for critical security tasks
- LLM providers with less robust models
- Organizations seeking rapid, unverified AI deployment
Security teams integrating LLMs will need extensive validation and human oversight, because model findings can differ between identical runs.
This variability could slow the adoption of fully autonomous agentic AI in sensitive domains, pushing towards augmented intelligence rather than full automation.
This variability may also drive the development of new LLM architectures specifically designed for deterministic or highly repeatable output, challenging current probabilistic models.
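One way a team might quantify this kind of variability is to repeat the same scan several times and count how many runs each finding recurs in, which is the sort of tally behind the abstract's figures (80 of 161 unique unmatched findings seen in only one of five repetitions, 22 seen in all five). The sketch below is illustrative only: the `repeat_counts` and `stability_histogram` helpers and the finding keys are hypothetical, not part of the paper's harness, and it assumes findings have already been normalized to comparable keys.

```python
from collections import Counter
from typing import Hashable, Iterable


def repeat_counts(runs: Iterable[Iterable[Hashable]]) -> Counter:
    """Count, for each finding, how many runs reported it at least once."""
    counts: Counter = Counter()
    for run in runs:
        # set() so a finding reported twice within one run is counted once
        counts.update(set(run))
    return counts


def stability_histogram(runs: list[list[Hashable]]) -> dict[int, int]:
    """Map k -> number of unique findings that appeared in exactly k runs."""
    per_finding = repeat_counts(runs)
    hist = Counter(per_finding.values())
    return {k: hist.get(k, 0) for k in range(1, len(runs) + 1)}


# Hypothetical finding keys (file:rule) from five identical scans.
runs = [
    ["a.js:xss", "b.js:sqli", "c.js:path"],
    ["a.js:xss", "b.js:sqli"],
    ["a.js:xss", "d.js:redos"],
    ["a.js:xss", "b.js:sqli", "b.js:sqli"],
    ["a.js:xss"],
]

hist = stability_histogram(runs)
# Findings reported in every run; a team might treat only these as stable.
stable = sorted(f for f, n in repeat_counts(runs).items() if n == len(runs))
print(hist)
print(stable)
```

Findings that land in the `k = 1` bucket are the ones most likely to need human review before anyone acts on them; where to draw that line is a policy choice, not something the source prescribes.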
Read at arXiv cs.AI