The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

arXiv:2605.27220v1. Abstract: In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for o
This research provides empirical evidence of inefficiencies in how current RAG pipelines apply query augmentation, at a time when the technology is rapidly scaling into production environments.
It highlights significant cost and latency issues in widely adopted RAG techniques, directly impacting the economic viability and user experience of AI-driven information systems.
The findings suggest that current default implementations of query augmentation in RAG systems are often counterproductive, prompting a re-evaluation of best practices for cost-effective and efficient retrieval.
- AI developers focused on efficiency
- Companies with proprietary RAG optimization techniques
- Users of RAG systems receiving faster, cheaper results
- Companies over-relying on generic LLM-based query augmentation
- Providers of LLMs used inefficiently for query expansion
System architects will re-evaluate and optimize RAG pipeline components to mitigate unnecessary LLM inference costs and latency.
There will be a shift towards more context-aware or dynamically triggered query augmentation strategies, rather than universal application.
New research and products will emerge focusing on intelligent pre-retrieval routing and selective augmentation to improve RAG efficiency.
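To make "dynamically triggered" augmentation concrete, here is a minimal Python sketch of one possible gating scheme: run a cheap first-pass retrieval, and call the LLM augmentation step only when the top score falls below a threshold. This is an illustration under stated assumptions, not the method evaluated in the paper. The names `route_query`, `toy_retrieve`, `toy_augment`, the `min_top_score` threshold, and the two-document corpus are all hypothetical; a real system would plug in its own retriever and an LLM-backed augmenter such as HyDE.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A hit is (document id, relevance score in [0, 1]).
Hit = Tuple[str, float]


@dataclass
class RoutedResult:
    hits: List[Hit]
    augmented: bool  # True if the (expensive) augmentation step was invoked


def _top_score(hits: List[Hit]) -> float:
    return hits[0][1] if hits else 0.0


def route_query(
    query: str,
    retrieve: Callable[[str], List[Hit]],
    augment: Callable[[str], str],
    min_top_score: float = 0.5,  # hypothetical threshold, to be tuned per system
) -> RoutedResult:
    """Retrieve first; augment only if the first pass looks weak."""
    first = retrieve(query)
    if _top_score(first) >= min_top_score:
        return RoutedResult(first, augmented=False)
    # Fallback path: pay for augmentation only on low-confidence queries.
    second = retrieve(augment(query))
    best = second if _top_score(second) > _top_score(first) else first
    return RoutedResult(best, augmented=True)


# --- Toy stand-ins so the sketch runs end to end (assumptions, not real APIs) ---
CORPUS = {
    "d1": "copenhagen is the capital of denmark",
    "d2": "the viking age in scandinavia",
}
SYNONYMS = {"norsemen": "viking", "era": "age"}


def toy_retrieve(query: str) -> List[Hit]:
    """Score each document by the fraction of query tokens it contains."""
    q = set(query.lower().split())
    hits = [(doc_id, len(q & set(text.split())) / len(q)) for doc_id, text in CORPUS.items()]
    return sorted(hits, key=lambda h: -h[1])


def toy_augment(query: str) -> str:
    """Stand-in for an LLM expansion call: append known synonyms to the query."""
    extra = [SYNONYMS[t] for t in query.lower().split() if t in SYNONYMS]
    return " ".join([query] + extra)


if __name__ == "__main__":
    for q in ("capital of denmark", "norsemen era"):
        result = route_query(q, toy_retrieve, toy_augment)
        print(q, "->", result.hits[0][0], "| augmented:", result.augmented)
```

In this toy setup the first query is answered by the plain retriever and never touches the augmenter, while the second triggers the fallback. Whether a score threshold is a reliable trigger on real traffic is exactly the kind of question the paper's production evaluation addresses, so the threshold should be treated as something to validate rather than a default.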
Read at arXiv cs.CL