Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

arXiv:2606.24391v1 (announce type: new)
Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded. The engine is private and each match uses a fresh random map seed and opponent, mitigating the data contamination that affects public benchmarks. Models receive a (near) rule-only prompt with
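The reliability dimension described in the abstract (a strict JSON schema per turn, with illegal actions silently discarded) can be illustrated with a minimal sketch. The engine is private, so everything below is an assumption made for illustration: the field names `action`, `x`, `y`, the action vocabulary, the function name `parse_turn`, and the reading of the 13x7 grid as 13 columns by 7 rows are hypothetical and not taken from the paper.

```python
import json

# Grid size from the abstract; treating it as 13 columns x 7 rows is an assumption.
GRID_WIDTH, GRID_HEIGHT = 13, 7

# Hypothetical action vocabulary; the real engine's action set is not public.
ALLOWED_ACTIONS = {"move", "attack", "message"}

# Hypothetical schema: a turn must contain exactly these keys.
REQUIRED_KEYS = {"action", "x", "y"}


def parse_turn(raw: str):
    """Return the validated turn as a dict, or None if it must be discarded."""
    try:
        turn = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON: discarded without feedback
    if not isinstance(turn, dict):
        return None
    # Strict schema: no missing keys and no extra keys.
    if set(turn) != REQUIRED_KEYS:
        return None
    action = turn["action"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    x, y = turn["x"], turn["y"]
    # bool is a subclass of int in Python, so compare exact types.
    if type(x) is not int or type(y) is not int:
        return None
    # Illegal target: coordinates outside the grid.
    if not (0 <= x < GRID_WIDTH and 0 <= y < GRID_HEIGHT):
        return None
    return turn
```

In this sketch a discarded turn simply yields `None`, so a model that emits an extra key or an off-grid coordinate loses its move without being told why, which is the kind of pressure the reliability dimension is described as applying.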
The proliferation of advanced LLMs calls for more robust, adversarial benchmarks that assess their capabilities in complex, dynamic, strategic environments.
Age of LLM targets the need to evaluate LLM reliability, reasoning, and diplomatic skill under stressors such as fog of war and a strict output schema, all of which matter for real-world agentic applications.
It offers a new standard for evaluating agentic capabilities, moving beyond static question answering to interactive, strategically demanding scenarios.
- AI researchers
- LLM developers
- Agentic AI platforms
- Defence tech
- LLMs lacking robustness
- Simple benchmarking methodologies
Researchers gain a more effective tool to identify strengths and weaknesses in LLMs for agentic applications.
The competitive benchmark could accelerate the development of more reliable, strategic, and diplomatically capable AI agents.
Improved LLM agent reliability could lead to their faster adoption in sensitive domains requiring strategic reasoning and robust output control.
Read at arXiv cs.AI