SIGNAL · AI · Jun 16, 2026, 4:00 AM · Signal 75 · Medium term

Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

arXiv:2606.15762v1 (announce type: cross)

Abstract: We ran 300 repeated vulnerability-finding scans to measure how repeatable agentic large language model (LLM) security review is on the same JavaScript code, prompt, and benchmark harness. The headline result is that LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run. Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, when Claude matched a Snyk Cod
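The repeatability figures in the abstract come down to counting how many of the five identical repetitions each unique finding appears in. Below is a minimal sketch of that tally, assuming each run can be reduced to a set of finding identifiers. The function name `appearance_histogram` and the finding IDs are hypothetical, and the excerpt does not describe how the paper matches or deduplicates findings.

```python
from collections import Counter


def appearance_histogram(runs):
    """Map k -> number of unique findings seen in exactly k runs."""
    per_finding = Counter()
    for findings in runs:
        # set() ensures a finding is counted at most once per run
        per_finding.update(set(findings))
    return Counter(per_finding.values())


# Hypothetical finding IDs from five identical repetitions (illustrative only)
runs = [
    {"xss-login", "sqli-search", "proto-pollution"},
    {"xss-login", "sqli-search"},
    {"xss-login", "path-traversal"},
    {"xss-login", "sqli-search"},
    {"xss-login", "open-redirect"},
]
hist = appearance_histogram(runs)

# Shares computed from the counts reported in the abstract
share_seen_once = 80 / 161      # unmatched findings seen in only one of five runs
share_seen_in_all = 22 / 161    # unmatched findings seen in all five runs
print(f"{share_seen_once:.1%} once, {share_seen_in_all:.1%} in all five")
```

Using the reported counts, roughly half of the unique unmatched findings (80 of 161) showed up in a single repetition, and about one in seven (22 of 161) showed up in every repetition.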

Why this matters

Why now

The rapid deployment of large language models and the growing reliance on them for complex tasks, including security, necessitate a deeper understanding of their reliability before widespread adoption.

Why it’s important

Strategic readers should care because run-to-run inconsistency in LLM vulnerability detection creates operational risk and limits how far these tools can be trusted without further development.

What changes

Expectations of consistent, repeatable performance from agentic LLMs in security applications are tempered, and the result highlights a need for improved stability and robustness in their design and deployment.

Winners

· AI safety researchers
· Traditional cybersecurity firms
· Developers focused on LLM reliability

Losers

· Companies prematurely relying on LLMs for critical security tasks
· LLM providers with less robust models
· Organizations seeking rapid, unverified AI deployment

Second-order effects

Direct

Security teams integrating LLMs will need extensive validation and human oversight, because the models' outputs are inconsistent across runs.
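One way such validation might look in practice is to repeat a scan and route findings by how often they recur. This is a sketch under stated assumptions, not a method taken from the paper: the `triage` function, the `min_runs` threshold, and the finding IDs are all hypothetical, and a finding that recurs rarely is not necessarily a false positive.

```python
from collections import Counter


def triage(runs, min_runs):
    """Split unique findings into (recurring, needs_review) by run count."""
    counts = Counter()
    for findings in runs:
        counts.update(set(findings))  # count each finding once per run
    recurring = sorted(f for f, c in counts.items() if c >= min_runs)
    needs_review = sorted(f for f, c in counts.items() if c < min_runs)
    return recurring, needs_review


# Hypothetical results from three repeated scans of the same code
scan_runs = [
    {"xss-login", "sqli-search"},
    {"xss-login", "path-traversal"},
    {"xss-login", "sqli-search"},
]
recurring, needs_review = triage(scan_runs, min_runs=3)
print("recurring:", recurring)
print("needs human review:", needs_review)
```

The threshold is a policy choice: a stricter value sends more findings to human reviewers, while a looser one accepts more findings on the strength of repetition alone.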

Second

This variability could slow the adoption of fully autonomous agentic AI in sensitive domains, pushing towards augmented intelligence rather than full automation.

Third

It may drive the development of new LLM architectures specifically designed for deterministic or highly repeatable output, challenging current probabilistic models.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.AI

#cs.CR #cs.AI #cs.SE

Tracked by The Continuum Brief · live intelligence network
