
arXiv:2605.22826v1 Announce Type: cross Abstract: Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of LLMs within the social deduction game Secret Hitler. I introduce an open-source framework and novel metrics to measure performance: Role Identification Accuracy, Deception Retention Rate, and Game State Impact Rate. By benchmarking models against rule-based algorithms and human games, I identify a gap between conversat
The rapid advancement and deployment of LLMs necessitate a deeper understanding of their complex social and deceptive capabilities for robust AI safety frameworks.
Quantifying LLM deception and social reasoning is crucial for anticipating risks and developing safeguards against potential misuse in sensitive applications and interactions.
The framework enhances our ability to systematically evaluate and benchmark LLMs' 'theory of mind' and deceptive potential in controlled environments, paving the way for more rigorous safety testing.
- AI Safety Researchers
- Evaluators of AI Ethics
- AI Governance Bodies
- Unregulated LLM Developers
- Users trusting LLMs uncritically
This research provides a standardized method to measure LLM deception, improving safety evaluations.
Improved measurement leads to more effective red-teaming and the development of LLMs more resistant to manipulative behaviors.
Transparent assessment of AI systems' limitations and potential for deception could increase public trust in them.
Read at arXiv cs.AI