Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment

arXiv:2606.14948v1 Announce Type: cross Abstract: LLMs have substantially improved software engineering, yet real-world development requires architectural understanding. Such understanding is prohibitively expensive to label manually and impossible to verify through tests alone. We propose an agentic judging pipeline using a strong LLM as a scalable proxy for expert architectural evaluation, comprising two judges: the Architecture Complexity Judge (ACJ), which estimates codebase-specific architectural understanding a task demands, and the Architecture Quality Judge (AQJ), which evaluates patch
The rapid advancement and widespread adoption of large language models in software development necessitate robust methods for evaluating complex architectural understanding, which traditional testing and manual labeling struggle to provide at scale.
This development addresses a critical bottleneck in leveraging AI for complex software engineering by enabling scalable and automated architectural quality judgment, accelerating AI's integration into high-level design tasks.
The ability to automatically assess and enhance architectural reasoning in code LLMs could let AI contribute more effectively to the design and quality assurance of complex software systems, rather than only generating code.
- Software Development Teams
- Open-source AI development
- AI-powered DevTools
- Large Language Model Developers
- Manual architectural review services
- Companies relying solely on traditional software testing
Architectural quality of AI-generated code will improve faster, leading to more robust software.
Software development cycles will shorten significantly for complex systems, increasing the pace of innovation.
The role of human architects may shift from primary design to oversight and strategic guidance, as AI handles more low-level architectural decisions.
Read at arXiv cs.AI