Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

arXiv:2606.12864v1 (announce type: cross)

Abstract: Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code -- a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repai
The rapid advancement of LLMs in code generation necessitates new benchmarks that go beyond basic problem-solving to evaluate their educational and debugging utility.
This benchmark indicates a maturing understanding of AI's role in software development, shifting from pure generation to more nuanced tasks like error identification and repair, which is critical for future developer tooling and education.
The evaluation criteria for code-generating LLMs expand to cover debugging and 'hacking' human-written code (finding an input on which a submission fails), which pushes models toward more robust and interactive capabilities.
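To make the three task types concrete, here is a minimal sketch in Python. It is an illustration only and does not reproduce UOJ-Bench's actual protocol, problems, or scoring. The function names and the maximum-subarray problem are assumptions chosen for the example: a hypothetical buggy submission, a brute-force reference, a small search that "hacks" the submission by finding an input where the two disagree, and a repaired version.

```python
from itertools import product


def buggy_max_subarray(a):
    """Hypothetical human submission: wrong when every element is negative."""
    best = cur = 0  # bug: silently allows the empty subarray (sum 0)
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def reference_max_subarray(a):
    """Brute-force reference: best sum over all non-empty subarrays."""
    n = len(a)
    return max(sum(a[i:j]) for i in range(n) for j in range(i + 1, n + 1))


def find_hack(candidate, reference, max_len=3, values=range(-2, 3)):
    """Return the first small input where candidate and reference disagree, else None."""
    for n in range(1, max_len + 1):
        for a in product(values, repeat=n):
            if candidate(list(a)) != reference(list(a)):
                return list(a)
    return None


def repaired_max_subarray(a):
    """Repair: start from the first element so an all-negative input is handled."""
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


if __name__ == "__main__":
    print(find_hack(buggy_max_subarray, reference_max_subarray))
    print(find_hack(repaired_max_subarray, reference_max_subarray))
```

In this sketch the hack is found by exhaustive search over tiny inputs. A model asked to hack a submission would instead need to reason about the code to propose a failing input directly.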
- AI model developers
- Competitive programming platforms
- Software development education
- Software companies using AI for code review
- LLMs lacking advanced reasoning capabilities
- Developers resistant to AI-assisted debugging
LLMs will be trained and optimized to excel at code debugging and repair, not just generation.
The integration of AI into developer workflows will accelerate, potentially creating new 'AI-pair programmer' roles focused on debugging.
Future programming curricula could be significantly altered as AI takes on more of the diagnostic and repair burden, allowing humans to focus on higher-level design and innovation.
Read at arXiv cs.AI