
arXiv:2606.01695v1 Announce Type: new Abstract: Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state diffe
The proliferation of fine-tuning techniques for large language models makes the detection of deliberate or accidental data poisoning a pressing and immediate concern for AI safety and security.
This breakthrough provides a critical tool for identifying hidden vulnerabilities and malicious implants in language models, directly impacting the integrity and trustworthiness of AI systems deployed across various sectors.
Previously undetectable latent harmful behaviors stemming from fine-tuning contamination can now be identified with zero-label methods, significantly improving model auditability and safety protocols.
- · AI Safety Researchers
- · AI Model Developers
- · Organizations deploying LLMs
- · Cybersecurity firms
- · Adversaries attempting AI model poisoning
- · Untrustworthy AI service providers
- · Developers with poor data hygiene
Increased trust and security in fine-tuned language models due to enhanced detectability of contamination.
Development of industry standards and regulatory requirements for contamination auditing in AI models.
A shift in adversarial tactics towards more sophisticated, yet potentially still detectable, model manipulation techniques.
This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.
Read at arXiv cs.LG