
arXiv:2602.07457v2. Abstract: Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised via high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline
The proliferation of advanced AI models has intensified the need for effective, scalable training methods for complex tasks like repository-level code editing, rather than continued reliance on complex agent scaffolding.
This development offers a path to significantly improve the practical capabilities of AI for software development, potentially reducing reliance on extensive human oversight for multi-file code modifications.
Training on real-world GitHub pull requests could make models' code editing more robust and autonomous across large projects.
- AI developers
- Software engineering teams
- Open-source projects
- Cloud infrastructure providers
- Manual code review processes
- Low-skilled software maintenance roles
AI systems will become significantly better at understanding and modifying large, complex codebases.
This improvement could lead to a faster pace of software development and a reduction in the effort required for maintenance and refactoring.
The enhanced autonomous code editing capabilities might eventually enable self-improving software or substantially accelerate the development of new AI applications.