Recall Before Rerank: Benchmarking Deep Learning Models for Large-Scale Code-to-Code Retrieval

arXiv:2606.27401v1 Announce Type: cross Abstract: Semantic code search and clone detection are essential for software development, maintenance, and reuse. This paper evaluates the effectiveness, efficiency, and scalability of contemporary deep learning models for first-stage recall in large-scale code-to-code search engines. Benchmarking across multiple programming languages and datasets reveals critical limits in the precision and scalability of these models on Terabyte-scale source-code collections. We present LLM-based code normalisation and query-rewriting schemes that yield significant ga
The proliferation of increasingly complex deep learning models and large language models (LLMs) is driving the need for more efficient and scalable code retrieval techniques in large-scale software development environments.
This research highlights critical limitations in current deep learning approaches to code search at scale. More robust methods are needed to manage vast codebases, with consequences for software development efficiency and potentially for security.
The findings suggest that current deep learning models struggle with precision and scalability on terabyte-scale code collections, which calls for advances in 'recall before rerank' strategies and in LLM-based normalization and query rewriting.
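To make the 'recall before rerank' idea concrete, here is a minimal sketch of a two-stage code-to-code search. It is an illustration under stated assumptions, not the paper's method: the function names (`embed`, `recall`, `rerank`, `search`) are hypothetical, the token-count vectors stand in for a learned code encoder, and `difflib.SequenceMatcher` stands in for a more expensive reranking model.

```python
import difflib
import math
import re
from collections import Counter


def embed(code: str) -> Counter:
    """Toy stand-in for a learned encoder: counts of identifier-like tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(query: str, corpus: list[str], k: int) -> list[int]:
    """First stage: score every snippet cheaply and keep the top k indices."""
    q = embed(query)
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q, embed(corpus[i])),
        reverse=True,
    )
    return ranked[:k]


def rerank(query: str, corpus: list[str], candidates: list[int]) -> list[int]:
    """Second stage: apply a costlier scorer to the recalled candidates only."""
    return sorted(
        candidates,
        key=lambda i: difflib.SequenceMatcher(None, query, corpus[i]).ratio(),
        reverse=True,
    )


def search(query: str, corpus: list[str], k: int = 2) -> list[int]:
    """Recall first, then rerank; snippets missed by recall are never rescored."""
    return rerank(query, corpus, recall(query, corpus, k))
```

The design point the sketch is meant to show is that the reranker only ever sees what the first stage returns, so a relevant snippet missed at recall cannot be recovered later. That is why the brief's emphasis falls on first-stage recall quality at scale.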
- Companies developing specialized code retrieval algorithms
- Software developers leveraging advanced search tools
- Firms focusing on LLM-based code understanding
- Companies relying solely on basic deep learning for large-scale code search
- Software projects with unsearchable or difficult-to-manage codebases
Improved semantic code search tools will allow developers to find and reuse code more effectively.
Enhanced code search capabilities could lead to faster software development cycles and reduced technical debt across industries.
The ability to efficiently search vast codebases could accelerate the development of autonomous software agents and more sophisticated AI-driven development tools.
Read at arXiv cs.LG