Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

arXiv:2605.27958v1 (cross-listed). Abstract: Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet they report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design incl
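To make the clean-versus-shifted contrast in the abstract concrete, here is a minimal sketch of a single-direction linear probe scored by AUROC. It is not the paper's method or data: synthetic Gaussian vectors stand in for LLM activations, the "shift" is a toy rotation of the class-separating direction, and every name (`make_activations`, `fit_probe`, `auroc`, the dimension and angle) is a hypothetical choice made for illustration.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) formula; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def make_activations(rng, n, direction, noise=1.0):
    """Synthetic stand-in for activations: Gaussian noise plus a signed class direction."""
    labels = rng.integers(0, 2, size=n)
    x = rng.normal(0.0, noise, size=(n, direction.shape[0]))
    x += np.outer(2 * labels - 1, direction)  # label 1 -> +direction, label 0 -> -direction
    return x, labels

def fit_probe(x, y, steps=500, lr=0.1):
    """Logistic-regression probe trained with plain full-batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
dim = 64  # hypothetical activation width

# Training distribution: classes separate along the first coordinate axis.
train_dir = np.zeros(dim)
train_dir[0] = 1.5

# Toy shifted distribution: same signal strength, direction rotated by 80 degrees.
theta = np.deg2rad(80)
shift_dir = np.zeros(dim)
shift_dir[0] = 1.5 * np.cos(theta)
shift_dir[1] = 1.5 * np.sin(theta)

x_train, y_train = make_activations(rng, 2000, train_dir)
x_clean, y_clean = make_activations(rng, 2000, train_dir)
x_shift, y_shift = make_activations(rng, 2000, shift_dir)

w, b = fit_probe(x_train, y_train)
clean_auc = auroc(x_clean @ w + b, y_clean)
shifted_auc = auroc(x_shift @ w + b, y_shift)
print(f"clean AUROC:   {clean_auc:.3f}")
print(f"shifted AUROC: {shifted_auc:.3f}")
```

In this toy setup the probe scores well on held-out data from its training distribution and degrades once the separating direction rotates, which is the kind of gap a single-linear-direction hypothesis cannot absorb. Real activations and real distribution shifts are far less tidy, so the numbers here say nothing about any actual model.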
The proliferation of LLMs and growing interest in their reliability and safety call for more advanced deception-detection methods, especially as model capabilities scale.
Improving the robustness of deception probes is crucial for building trustworthy and safe AI systems, with consequences for safety, regulation, and the deployment of AI in critical applications.
Understanding the limitations and failure modes of current deception detection probes in LLMs allows for the development of more reliable and robust diagnostic tools.
- AI Safety Researchers
- LLM Developers
- Organizations deploying AI
- Overly simplistic AI diagnostic tools
- Unreliable AI applications
More accurate and resilient methods for detecting deceptive behavior in LLMs become available.
Increased trust in AI systems due to better diagnostic capabilities could accelerate their adoption in sensitive areas.
New regulatory frameworks may emerge, incorporating robust AI safety and deception-detection benchmarks.
Read at arXiv cs.LG