Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies

arXiv:2605.29384v1 (cross-listed). Abstract: We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval while requiring no learned expansion objective or sparse
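To make the idea in the abstract concrete, here is a minimal sketch of BM25 scoring over "latent terms", assuming each document and query has already been encoded by a frozen retriever and passed through a Sparse Autoencoder. Everything in it is illustrative rather than the paper's implementation: the activation dictionaries are hand-written toy data, `to_latent_terms` and `LatentBM25` are hypothetical names, and turning an activation into a term frequency by thresholding and rounding up is an assumption made only for this example. The idf formula is the common non-negative variant, ln(1 + (N - df + 0.5) / (df + 0.5)).

```python
import math
from collections import Counter


def to_latent_terms(activations, threshold=0.0):
    """Turn sparse SAE activations {latent_id: value} into a bag of latent terms.

    Assumption for this sketch: a latent counts as a term when its activation
    exceeds `threshold`, and its term frequency is the activation rounded up.
    """
    return Counter(
        {lid: math.ceil(v) for lid, v in activations.items() if v > threshold}
    )


class LatentBM25:
    """Plain BM25 where the vocabulary is SAE latent ids instead of words."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [to_latent_terms(d) for d in docs]
        self.doc_len = [sum(d.values()) for d in self.docs]
        # Guard against an all-empty collection to avoid dividing by zero.
        self.avgdl = (sum(self.doc_len) / len(self.docs)) or 1.0
        # Document frequency: in how many documents each latent term is active.
        self.df = Counter()
        for d in self.docs:
            self.df.update(d.keys())

    def idf(self, term):
        n, df = len(self.docs), self.df.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query_activations, doc_index):
        doc = self.docs[doc_index]
        # Length normalization term shared by all query terms for this document.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_index] / self.avgdl)
        total = 0.0
        for term in to_latent_terms(query_activations):
            tf = doc.get(term, 0)
            if tf:
                total += self.idf(term) * tf * (self.k1 + 1) / (tf + norm)
        return total

    def rank(self, query_activations):
        scored = [(self.score(query_activations, i), i) for i in range(len(self.docs))]
        # Highest score first; ties broken by document index.
        return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy SAE activations (hypothetical): latent id -> activation strength.
docs = [
    {3: 1.7, 17: 0.4, 42: 2.1},
    {3: 0.9, 8: 1.2},
    {5: 0.6, 8: 0.3, 99: 1.5},
]
query = {42: 1.0, 17: 0.5}

index = LatentBM25(docs)
ranking = index.rank(query)
```

In this toy collection only the first document shares latent terms with the query, so it is the only one with a non-zero score. A real pipeline would replace the hand-written dictionaries with SAE outputs and would likely use an inverted index rather than scoring every document.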
This research builds on recent work in sparse and dense retrieval and bridges the two: sparse features are extracted from an existing dense retriever rather than learned with a separate expansion objective.
The finding could simplify building and deploying efficient retrieval systems: because existing dense models implicitly learn sparse features, it may reduce computational overhead and improve interpretability.
Retrieval system design could draw on the complementary strengths of dense and sparse methods more directly, potentially leading to more resource-efficient and better-performing information retrieval. It might also make advanced retrieval capabilities more widely accessible.
- Information Retrieval Developers
- Search Engine Providers
- AI Infrastructure Providers
- Companies with Large Document Corpora
- Companies solely reliant on very complex, resource-intensive hybrid retrieval se
More efficient and accurate information retrieval systems become widely accessible across various applications.
The cost of building and operating robust AI-powered search and knowledge management systems decreases.
New AI applications emerge that previously were too computationally expensive due to retrieval inefficiencies.
Read at arXiv cs.AI