When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

arXiv:2606.11350v1. Abstract: Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power, and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dil
As RAG systems proliferate and the data behind them grows, retrieval has to hold up on large document collections instead of degrading.
This research highlights a critical limitation in current RAG architectures, one that affects the reliability and scalability of AI systems that depend on vast knowledge bases.
The finding that adding more documents to a RAG corpus can lower accuracy rather than raise it is likely to drive new retrieval strategies and architectural designs.
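To make the dilution effect concrete, here is a toy sketch in Python. It is not the paper's method, which this excerpt does not describe. It only contrasts a global top-k cosine search with one that first restricts the candidate pool to a single domain, the idea suggested by "domain-scoped" in the title. The `Chunk` class, the `domain` labels, the three-dimensional vectors, and the corpus itself are all invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    domain: str          # hypothetical metadata label attached at indexing time
    vector: list[float]  # toy embedding; a real system would use an embedding model


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k, domain=None):
    """Rank chunks by cosine similarity, optionally restricted to one domain."""
    pool = [c for c in chunks if domain is None or c.domain == domain]
    return sorted(pool, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:k]


# Invented corpus: two on-topic chunks plus look-alikes from other domains.
corpus = [
    Chunk("Bridge inspection interval", "bridges", [0.90, 0.10, 0.05]),
    Chunk("Bridge load rating", "bridges", [0.70, 0.30, 0.10]),
    Chunk("Vehicle inspection interval", "fleet", [0.92, 0.08, 0.05]),
    Chunk("Building inspection interval", "facilities", [0.91, 0.09, 0.06]),
]

# Stands in for the embedding of a question about bridge inspections.
query = [0.95, 0.05, 0.05]

global_hits = top_k(query, corpus, k=2)                    # search everything
scoped_hits = top_k(query, corpus, k=2, domain="bridges")  # search one domain

print([c.domain for c in global_hits])
print([c.text for c in scoped_hits])
```

In this toy corpus the two look-alike chunks from other domains score slightly higher than the on-topic ones, so the global search returns only off-domain results, while the scoped search returns the two bridge chunks. How a query is assigned to a domain in practice is left open here.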
- AI researchers in information retrieval
- Developers of domain-specific RAG solutions
- Enterprises with large, heterogeneous internal knowledge bases
- Generic RAG platforms relying solely on dense vector search
- Applications requiring high accuracy from very large, unstructured datasets
Further research and development is likely to focus on retrieval methods that keep RAG accurate on large-scale document collections.
That could lead to more robust and reliable AI agents and enterprise AI applications capable of handling complex RAG scenarios.
Improved RAG systems could accelerate the adoption of AI in sectors that need precise information extraction from massive, diverse data, with effects on a range of white-collar workflows.
Read at arXiv cs.CL