
arXiv:2605.24310v1 Abstract: Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and or
The rapid advancement and widespread adoption of large language models (LLMs) are enabling new research avenues into linguistic challenges that were previously intractable, such as the systematic identification of lexical gaps.
Improved understanding and automatic detection of lexical gaps are crucial for enhancing machine translation accuracy, building more robust multilingual AI systems, and improving cross-lingual communication.
Identifying lexical gaps in a data-driven way replaces reliance on human judgment or static taxonomies with embedding-based analysis, potentially accelerating multilingual AI development.
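As a rough illustration of what embedding-based analysis of translation pairs can look like, the sketch below flags pairs whose source and target vectors have low cosine similarity. This is not the paper's method, which the truncated abstract does not fully describe. The function names, the `threshold` value, the `word_a`/`word_b` labels, and the 3-dimensional toy vectors are all hypothetical; in practice the vectors would be contextualized embeddings extracted from a bilingual LLM.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def flag_candidate_gaps(pairs, threshold=0.5):
    """Return, in sorted order, the source words whose translation looks dissimilar.

    pairs: dict mapping a source word to (source_vector, target_vector).
    threshold: hypothetical cut-off; a real system would tune it on labeled data.
    """
    return sorted(
        word
        for word, (src_vec, tgt_vec) in pairs.items()
        if cosine_similarity(src_vec, tgt_vec) < threshold
    )


# Made-up toy vectors, standing in for real contextualized embeddings.
toy_pairs = {
    "word_a": ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),  # close translation
    "word_b": ([0.1, 0.9, 0.2], [0.7, 0.0, 0.6]),  # distant translation
}

candidates = flag_candidate_gaps(toy_pairs)
print(candidates)
```

A single similarity threshold is the simplest possible decision rule; a trained classifier over the embedding features would be a natural alternative once labeled gap examples are available.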
- Machine Translation Developers
- Multilingual LLM Researchers
- AI-driven Localization Services
- Global Businesses
- Traditional Lexicographers (without AI integration)
- Anyone relying on manual linguistic analysis
- Translation systems with poor handling of cultural nuances
Machine translation quality could improve significantly, especially for nuanced or culturally specific terms.
Enhanced cross-lingual communication could foster greater cultural exchange and reduce misunderstandings in international contexts.
The reduced barrier of language differences could accelerate the global diffusion of ideas and technologies, potentially impacting economic and geopolitical dynamics.
Read at arXiv cs.CL