SIGNAL · AI · May 27, 2026, 4:00 AM · Signal 75 · Medium term

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

arXiv:2605.27068v1 Announce Type: new Abstract: Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing

Why this matters

Why now

As LLMs advance and are deployed more widely, win rates alone are no longer enough to evaluate their interactive and multimodal capabilities. Demand for transparent, auditable AI systems is pushing evaluation toward more robust methods.

Why it’s important

Strategic readers should care because better auditability and grounding are prerequisites for reliably integrating multimodal AI agents into complex, real-world applications where reasoning, deception, and belief modeling matter.

What changes

QUACK provides an open-source environment and evaluation framework for multimodal social deduction agents. It shifts evaluation from outcome-based scoring to finer-grained analysis of whether an agent's communication is grounded in what it perceived and did.
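To make the contrast with win-rate scoring concrete, here is a minimal sketch of a grounding check. It is an illustration only and does not reflect QUACK's actual API or metrics: the `Claim` schema, the `grounding_rate` function, and the observation-log layout are all hypothetical, and the data is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A statement an agent made during discussion (hypothetical schema)."""
    speaker: str
    subject: str   # who the claim is about
    location: str  # where the speaker says they saw the subject
    step: int      # game step the claim refers to


def grounding_rate(claims, observations):
    """Fraction of claims backed by something the speaker actually observed.

    `observations` maps each speaker to a set of (subject, location, step)
    tuples taken from that speaker's own perception log (assumed format).
    """
    if not claims:
        return None  # nothing was said, so grounding is undefined
    supported = sum(
        (c.subject, c.location, c.step) in observations.get(c.speaker, set())
        for c in claims
    )
    return supported / len(claims)


# Made-up perception logs and claims for a single game.
observations = {
    "red": {("blue", "kitchen", 3), ("green", "hall", 4)},
    "blue": {("red", "kitchen", 3)},
}
claims = [
    Claim("red", "blue", "kitchen", 3),  # matches red's log
    Claim("red", "green", "lab", 4),     # red saw green in the hall, not the lab
    Claim("blue", "red", "kitchen", 3),  # matches blue's log
]

rate = grounding_rate(claims, observations)
```

A score like this says nothing about who won. It only flags claims the speaker's own log does not support, which a reviewer would still need to classify as deliberate deception or as a perception or reasoning failure.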

Winners

· AI researchers
· AI developers
· Auditing frameworks

Losers

· Black-box AI models
· Traditional game outcome-based evaluation

Second-order effects

Direct

AI agents can be more effectively debugged and their failure modes precisely identified.

Second

Increased transparency and trustworthiness could accelerate the adoption of complex AI agents in critical domains.

Third

New regulatory standards for AI transparency and auditability might emerge, incorporating principles from such evaluation frameworks.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.CL

#cs.CL #cs.AI #cs.MA

Tracked by The Continuum Brief · live intelligence network
