Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

arXiv:2606.05920v1 Announce Type: cross Abstract: Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the
The proliferation of code generation models necessitates more robust and realistic benchmarking to drive real-world utility and adoption, moving beyond simplified one-shot scenarios.
This benchmark addresses a critical gap in evaluating AI agents' ability to handle complex, iterative, and underspecified programming tasks, which is essential for developing truly autonomous software development agents.
The focus for code-generating AI will shift towards multi-round interaction, user clarification, and iterative refinement, mirroring real-world development workflows rather than just single-prompt completion.
- · AI Agent developers
- · Web development platforms
- · Software engineering researchers
- · Companies adopting AI for software development
- · Current single-shot code generation benchmarks
- · Developers relying solely on one-shot AI coding tools
Improved performance of AI code agents in handling real-world, iterative development tasks.
Accelerated development of more sophisticated AI agents capable of understanding and refining user intent over multiple interactions.
Potential for significant disruption in white-collar software development workflows as AI agents take on more autonomous, iterative project roles.
This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.
Read at arXiv cs.CL