Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

arXiv:2605.23243v1 Announce Type: cross Abstract: We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sober
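The excerpt above does not say how reported findings are matched against the ground-truth vulnerabilities. As a rough illustration only, a black-box run could be scored along the following lines. The `Finding` fields, the exact-match rule on (application, location, CWE), and the sample data are all assumptions made for this sketch, not the paper's method or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One vulnerability record (hypothetical schema, not from the paper)."""
    app: str       # application under test
    location: str  # endpoint or function where the issue lives
    cwe: str       # CWE identifier, e.g. "CWE-89"


def score_findings(reported, ground_truth):
    """Compare reported findings with ground truth using exact matching.

    Assumption: a report counts as a true positive only if app, location,
    and CWE all match a ground-truth entry. A real benchmark may use a
    looser or manually adjudicated matching rule.
    """
    reported_set = set(reported)
    truth_set = set(ground_truth)

    tp = len(reported_set & truth_set)   # correctly reported
    fp = len(reported_set - truth_set)   # reported but not in ground truth
    fn = len(truth_set - reported_set)   # in ground truth but missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up example data, for illustration only.
truth = [
    Finding("shop", "/login", "CWE-89"),
    Finding("shop", "/search", "CWE-79"),
    Finding("shop", "/upload", "CWE-434"),
    Finding("shop", "/profile", "CWE-639"),
]
reported = [
    Finding("shop", "/login", "CWE-89"),    # matches ground truth
    Finding("shop", "/search", "CWE-79"),   # matches ground truth
    Finding("shop", "/cart", "CWE-352"),    # not in ground truth
]

print(score_findings(reported, truth))
```

Exact matching is the strictest choice. It penalizes a model that finds the right flaw but labels it with a neighboring CWE, so a real evaluation would need to state its matching rule explicitly.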
Frontier large language models (LLMs) are advancing rapidly and are already being applied to cybersecurity, a critical and complex domain, so their capabilities and limitations need to be evaluated now.
This research provides a critical independent assessment of LLM efficacy in cybersecurity, with direct implications for secure development practices, vulnerability management strategies, and the integration of AI tools into defensive and offensive cyber operations.
Expectations for LLM performance in specialized cybersecurity tasks should be tempered: the findings point to a need for significant domain-specific foundation model development or intensive fine-tuning, rather than reliance on generalist 'frontier' models.
- Cybersecurity consultancies
- Specialized AI security startups
- Organizations developing vertical foundation models
- Companies relying solely on generalist LLMs for security tasks
- Uncritical proponents of 'frontier' LLM cybersecurity readiness
The cybersecurity industry will likely invest more in domain-specific AI models and training data rather than leveraging generic LLMs off-the-shelf.
This could lead to a bifurcation in the AI industry, with specialized 'vertical' AI models gaining prominence for critical enterprise applications like cybersecurity, distinct from general-purpose LLMs.
A stronger focus on AI safety and ethical guidelines specific to cybersecurity applications could emerge, aimed at preventing misuse and preserving model integrity.
Read at arXiv cs.AI