Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

arXiv:2605.30189v1 Announce Type: cross Abstract: We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical
The proliferation of fine-tuned language models via LoRA adapters makes them a prime target for sophisticated attacks, and understanding these vulnerabilities is critical as AI adoption grows.
This research reveals a significant security vulnerability in a prevalent method for distributing fine-tuned LLMs, posing risks to AI integrity, trust, and national security.
The ease with which LoRA adapters can be backdoored at a token-feature level means that ensuring the trustworthiness of AI models requires new, more sophisticated detection mechanisms.
- · AI security researchers
- · Cybersecurity firms
- · Developers of robust AI model validation tools
- · Users of untrusted fine-tuned LLMs
- · Organizations relying on LoRA adapters without rigorous validation
- · Open-source AI communities without robust security protocols
Increased scrutiny and demand for secure fine-tuning and deployment practices for LLMs.
Development of industry standards and regulatory frameworks for AI model provenance and integrity.
Potential for nation-state actors to weaponize these backdoors for espionage or sabotage within critical AI infrastructure.
This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.
Read at arXiv cs.LG