Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

arXiv:2606.02147v1 Announce Type: new Abstract: Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversati
The proliferation of AI models demands more sophisticated multilingual understanding, highlighting the immediate need for robust idiom datasets that reflect real-world language use.
Accurate multilingual idiom handling is critical for developing truly global and contextually aware AI agents, particularly for non-English and low-resource languages.
The introduction of MIDI shifts the focus from isolated idiom evaluations to discourse-embedded, native-speaker curated data across a wide range of language resource levels, enabling more realistic NLP development.
- Multilingual NLP developers
- AI agents specializing in language understanding
- Users of AI in low-resource language contexts
- Linguists and computational linguists
- AI models reliant on literal translations
- Monolingual NLP approaches
- AI development focused solely on high-resource languages
Improved performance of AI models in understanding and generating idiomatic expressions across various languages.
Enhanced cross-cultural communication facilitated by AI, reducing misunderstandings when dealing with nuanced language.
Potential for new AI applications in fields like cultural diplomacy or global content creation, enabled by more human-like language proficiency.
Read at arXiv cs.CL