StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

arXiv:2507.07445v3. Abstract: Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked to perform essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community.
Rapid advances in large language models and agentic AI call for more holistic benchmarks that evaluate their capabilities in complex, realistic scenarios.
Evaluating agentic multimodal LLMs in integrated production and social simulations is crucial for understanding their true potential and limitations before deployment in critical applications.
The introduction of benchmarks like StarDojo shifts the focus from isolated skill evaluation to comprehensive assessment of open-ended, complex behaviors essential for autonomous agents in human-like environments.
- AI research labs developing multimodal LLMs
- Gaming platforms for simulation-based AI development
- Developers of embodied AI and robotics
- Benchmarks limited to narrow, single-task evaluations
- AI models unable to handle multimodal, open-ended tasks
StarDojo could become a key tool for driving progress in agentic AI, pushing models to integrate diverse skills.
Agents that perform robustly in StarDojo-like environments could be deployed sooner in complex real-world social and work settings.
The insights gained from these simulations may inform the design of future AI architectures and ethical guidelines for highly autonomous systems.