Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

arXiv:2511.11041v2. Abstract: We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections -- subtracting $\mu$ directly (R1), or projecting each embedding off the mean direction (R2) -- and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark
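The two corrections named in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the synthetic data, and the choice to estimate $\mu$ as the sample mean of a batch are all assumptions made here, since the abstract does not specify them.

```python
import numpy as np


def r1_subtract_mean(embeddings, mu):
    """R1: subtract the (estimated) mean vector from every embedding."""
    return embeddings - mu


def r2_project_off_mean(embeddings, mu):
    """R2: remove each embedding's component along the mean direction."""
    u = mu / np.linalg.norm(mu)            # unit vector along the mean
    coeffs = embeddings @ u                # per-row component along u
    return embeddings - np.outer(coeffs, u)


# Synthetic stand-in for model outputs (assumption): residual + shared mean.
rng = np.random.default_rng(0)
dim = 8
mu_true = np.full(dim, 2.0)
embeddings = rng.normal(size=(100, dim)) + mu_true

# Assumption: the mean is estimated as the sample mean of the batch.
mu_hat = embeddings.mean(axis=0)
r1 = r1_subtract_mean(embeddings, mu_hat)
r2 = r2_project_off_mean(embeddings, mu_hat)

# Perturb the estimate purely along its own direction (a parallel error).
mu_scaled = 1.1 * mu_hat
r1_shift = np.abs(r1_subtract_mean(embeddings, mu_scaled) - r1).max()
r2_shift = np.abs(r2_project_off_mean(embeddings, mu_scaled) - r2).max()

# Largest remaining component of the R2 output along the mean direction.
u_hat = mu_hat / np.linalg.norm(mu_hat)
r2_residual_along_mean = np.abs(r2 @ u_hat).max()
```

Scaling the estimated mean along its own direction leaves the unit vector `u` unchanged, so the R2 output is unaffected (up to floating-point rounding), whereas the R1 output shifts by the full difference. This is one way to see why an error parallel to the mean would matter for R1 but not for R2, consistent with the claim in the abstract.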
As sentence-embedding models proliferate, their inherent biases need ongoing correction to improve practical performance across applications.
Text embeddings underpin search, recommendation, natural language understanding, and generative AI, so more accurate and reliable embeddings improve model performance and reduce downstream errors.
The mean-bias corrections studied here are training-free: they promise better embedding performance without the computational cost of retraining.
- AI developers
- NLP researchers
- AI-powered search engines
- Generative AI applications
- Inefficient embedding models
- Organizations relying on uncorrected biased embeddings
Sentence-embedding models could become more reliable and performant across a wide range of tasks.
The cost and complexity of deploying high-quality NLP systems may decrease due to fewer retraining cycles and better off-the-shelf performance.
The approach could inspire similar training-free corrections for other kinds of model bias.
Read at arXiv cs.LG