Who Gets the Reward & Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents

arXiv:2511.10687v3. Abstract: Large Language Models (LLMs) in multi-agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system-level evaluation with agent- and message-level learning. We propose a theoretical framework that unifies cooperative game-theoretic attribution with process reward modeling to transform system evaluation to agent credit to response-level signals. Unlike prior approaches that rely only on attribution (Shapley) or step-level labels (PRM), our method produces local, signed, and
The rapid development and deployment of multi-LLM agent systems necessitate new methods for performance evaluation and training to unlock their full potential and address current limitations in credit assignment.
This framework offers a principled approach to credit assignment, a core challenge in complex AI systems, and could enable more effective training and deployment of autonomous agents capable of collaborative problem-solving.
The framework aims to replace heuristic training methods for multi-agent LLM systems with a more rigorous, attributable, and granular way of connecting overall system performance to the contributions of individual agents and their actions.
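To make the attribution step concrete, here is a minimal sketch of exact Shapley credit computed from a system-level score. It illustrates generic Shapley attribution only, not the paper's combined framework, which also involves process reward modeling. The agent names (planner, solver, critic), the SCORES table, and the shapley_values helper are all hypothetical and chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(agents, value):
    """Exact Shapley value of each agent under the coalition value function `value`.

    `value` maps a frozenset of agent names to a system-level score.
    """
    n = len(agents)
    credit = {}
    for agent in agents:
        others = [a for a in agents if a != agent]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding `agent`.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of `agent` when joining coalition `s`.
                total += weight * (value(s | {agent}) - value(s))
        credit[agent] = total
    return credit


# Hypothetical system-level evaluation scores for every subset of agents.
SCORES = {
    frozenset(): 0.0,
    frozenset({"planner"}): 0.2,
    frozenset({"solver"}): 0.5,
    frozenset({"critic"}): 0.0,
    frozenset({"planner", "solver"}): 0.8,
    frozenset({"planner", "critic"}): 0.1,
    frozenset({"solver", "critic"}): 0.6,
    frozenset({"planner", "solver", "critic"}): 1.0,
}

credit = shapley_values(["planner", "solver", "critic"], SCORES.__getitem__)
for name, share in credit.items():
    print(f"{name}: {share:.4f}")
```

The per-agent credits sum to the full-system score minus the empty-coalition score, which is the efficiency property of Shapley values. Individual marginal contributions can be negative (here the critic lowers the score when paired only with the planner), which is one way a signed signal can arise. Exact enumeration costs exponential time in the number of agents, so larger systems would need sampling-based approximations.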
- AI agent developers
- Enterprises leveraging multi-agent systems
- Researchers in cooperative AI and game theory
- Inefficient multi-agent LLM architectures
- Heuristic credit assignment methods
More robust, efficient, and reliable multi-LLM agent systems will become feasible for complex tasks.
The proliferation of highly capable AI agents could accelerate automation across various industries, impacting white-collar workforces.
Improved multi-agent coordination could lead to autonomous systems tackling grand challenges currently beyond human or single-AI capabilities, potentially shifting economic and societal structures.
Read at arXiv cs.CL