Why Advanced Encoders Lag on Sparse Retrieval? The Answer and an Approach to Bridging Vocabulary Gaps

arXiv:2607.00004v1 (announce type: cross)
Abstract: While advanced foundation models like ModernBERT significantly outperform older architectures in dense retrieval, they surprisingly lag behind the aging BERT-base baseline in learned sparse retrieval (LSR). We identify the root cause as the Vocabulary Gap: modern tokenizers utilize raw, case-sensitive vocabularies designed for lossless reconstruction, which map single semantic units to redundant surface forms, wasting model capacity on morphological noise and hindering lexical matching. We formalize this intuition through a theoretical
The paper identifies a limitation in advanced foundation models that stems from tokenizer design, specifically raw, case-sensitive vocabularies, and presents it as previously overlooked.
The finding shows that in learned sparse retrieval (LSR) the older BERT-base baseline currently outperforms newer encoders such as ModernBERT, even though those encoders lead in dense retrieval.
This challenges the assumption that advanced foundation models are superior across all AI tasks and calls for re-evaluating model architectures for sparse retrieval.
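The Vocabulary Gap described in the abstract can be illustrated with a toy sketch. It is not the paper's method: the whitespace tokenizer, the term-count vectors standing in for learned sparse weights, and the use of lowercasing as the normalization step are all assumptions made for illustration. It shows how case-sensitive surface forms can prevent a query and a document from sharing any vocabulary entries, and how the same word can occupy several entries.

```python
from collections import Counter


def sparse_vector(text, normalize):
    """Term-count vector over whitespace tokens (a stand-in for learned sparse weights)."""
    tokens = text.split()
    if normalize:
        # Assumed normalization for illustration: fold case only.
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)


def lexical_score(query, doc, normalize):
    """Dot product over the vocabulary entries shared by query and document."""
    q = sparse_vector(query, normalize)
    d = sparse_vector(doc, normalize)
    return sum(q[t] * d[t] for t in q.keys() & d.keys())


query = "apple pie recipe"
doc = "Apple Pie Recipe with cinnamon"

raw_score = lexical_score(query, doc, normalize=False)   # case-sensitive matching
norm_score = lexical_score(query, doc, normalize=True)   # case-folded matching

# The same word in three casings needs three entries in a raw vocabulary.
surface_forms = ["Apple", "apple", "APPLE"]
raw_vocab = set(surface_forms)
norm_vocab = {s.lower() for s in surface_forms}

print(raw_score, norm_score, len(raw_vocab), len(norm_vocab))
```

In this toy setting the case-sensitive score is zero because no surface form matches exactly, while the case-folded score counts all three query terms. Real subword tokenizers and learned term weights behave in more nuanced ways, so the sketch only conveys the direction of the effect.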
- AI researchers focusing on tokenization and lexical matching
- Developers of specialized sparse retrieval systems
- Companies with less sophisticated, but performant, older models
- Developers blindly applying advanced foundation models to all retrieval tasks
- Teams that have heavily invested in current 'ModernBERT' type architectures for
- Users expecting uniform superiority from large foundation models
Increased research and development into more efficient and context-aware tokenization strategies for advanced models in retrieval tasks.
A potential shift in focus from purely scaling model parameters to optimizing pre-processing steps and tokenizer design for specific applications.
New benchmarks and evaluation metrics that specifically assess models' lexical understanding and ability to bridge vocabulary gaps in sparse retrieval scenarios.