Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models

arXiv:2605.26575v1. Abstract: Multilingual embedding models are deployed under the assumption that cross-lingual retrieval is symmetric: if a query in language A retrieves its translation in language B, the reverse should also hold. In practice it does not. Using a parallel corpus of 6,518 idiomatic and proverbial expressions in English, Bangla, Hindi, and Arabic, embedded by five production-grade encoders (Gemini, Mistral, OpenAI-L, OpenAI-S, Qwen), we formalise this failure as a deficit in mutual nearest-neighbour reciprocity and test a single mechanistic claim: among the g
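The quantities the abstract names (directional retrieval, mutual nearest-neighbour reciprocity, and hubness) can be illustrated on toy data. The following is a minimal sketch, not the paper's implementation: the function names, the top-1 cosine retrieval rule, the reciprocity definition (a source item counts as reciprocal when its nearest target retrieves it back), and the 2-D toy vectors are all assumptions made for illustration.

```python
import math
from collections import Counter


def unit(deg):
    """Hypothetical helper: 2-D unit vector at the given angle in degrees."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(query, candidates):
    """Index of the candidate with the highest cosine similarity to query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def retrieval_stats(src, tgt):
    """Assumes src[i] and tgt[i] embed translations of the same expression."""
    n = len(src)
    fwd = [nearest(s, tgt) for s in src]  # language A -> language B
    bwd = [nearest(t, src) for t in tgt]  # language B -> language A
    fwd_acc = sum(fwd[i] == i for i in range(n)) / n
    bwd_acc = sum(bwd[j] == j for j in range(n)) / n
    # Assumed definition: item i is reciprocal if its nearest target's
    # nearest source is i again.
    reciprocity = sum(bwd[fwd[i]] == i for i in range(n)) / n
    # Hubness proxy: how many source queries pick each target as top-1.
    hub_counts = Counter(fwd)
    return fwd_acc, bwd_acc, reciprocity, hub_counts


# Toy data (assumed): sources are tightly clustered, and target 2 sits in
# the middle of that cluster, so it acts as a hub for forward queries.
src = [unit(10), unit(-10), unit(0)]
tgt = [unit(40), unit(-40), unit(0)]

fwd_acc, bwd_acc, reciprocity, hub_counts = retrieval_stats(src, tgt)
print(fwd_acc, bwd_acc, reciprocity, dict(hub_counts))
```

In this toy setup every forward query lands on the same target, so forward accuracy collapses while backward accuracy stays perfect. That is the kind of directional gap a hub can produce even when every translation pair is reasonably close.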
The work arrives as multilingual embedding models are increasingly deployed worldwide, and it highlights an issue that is easy to overlook in practice: cross-lingual retrieval is not symmetric.
Understanding the asymmetries in cross-lingual retrieval is crucial for developing more robust, fair, and reliable multilingual AI, impacting everything from search engines to international communication tools.
The paper's central claim, that hubness rather than anisotropy drives the asymmetry, points research and development on multilingual embeddings toward reducing hubness instead of correcting anisotropy alone.
- AI researchers focusing on representational geometry
- Developers of multilingual applications requiring high accuracy
- Users of AI tools in diverse linguistic contexts
- Platforms relying on naive cross-lingual retrieval symmetry
- Current generation of multilingual embedding models with unaddressed hubness
Further research and development will prioritize solutions for hubness in multilingual embedding models.
Improved retrieval accuracy will enhance cross-lingual information access and reduce translational biases.
More reliable multilingual AI could foster greater cross-cultural understanding and efficiency in global operations.
Read at arXiv cs.CL