Effective vocabulary expansion of multilingual language models for extremely low-resource languages

arXiv:2602.09388v2. Abstract: Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset from the model's original vocabulary, which is biased towards representing the source langu
The proliferation of advanced AI models highlights the growing challenge of language inclusivity, particularly for low-resource languages, and is prompting active research into methods that extend these models to languages they do not yet support.
This development allows for broader and more equitable access to advanced AI capabilities across diverse linguistic groups, reducing the digital divide and enabling new applications in previously underserved communities.
Multilingual pre-trained language models can now be more effectively adapted to extremely low-resource languages using targeted vocabulary expansion and screening, improving their performance and utility.
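To make "vocabulary expansion and screening" concrete, here is a minimal toy sketch in Python. It is not the paper's procedure: the whitespace tokenization, the `min_count` threshold, the screening criterion (drop original entries that never occur in the target corpus), and the function name `expand_and_screen` are all assumptions chosen for illustration, and the example tokens are made up.

```python
from collections import Counter


def expand_and_screen(original_vocab, target_corpus, min_count=2,
                      special_tokens=("[UNK]", "[PAD]")):
    """Toy sketch: return (new_vocab, added, screened_out).

    Assumptions (not from the source): whitespace tokens stand in for
    subwords, and screening drops original entries unseen in the target corpus.
    """
    # Count how often each whitespace-separated token occurs in the target corpus.
    counts = Counter(tok for line in target_corpus for tok in line.split())
    original = list(original_vocab)
    original_set = set(original)

    # Expansion: add frequent target-language tokens missing from the vocabulary.
    added = sorted(tok for tok, c in counts.items()
                   if c >= min_count and tok not in original_set)

    # Screening (assumed criterion): drop original entries that never occur
    # in the target corpus, but always keep the special tokens.
    screened_out = [tok for tok in original
                    if tok not in special_tokens and counts[tok] == 0]
    screened_set = set(screened_out)
    kept = [tok for tok in original if tok not in screened_set]

    return kept + added, added, screened_out


# Hypothetical data: a tiny "original" vocabulary and a made-up target corpus.
original_vocab = ["[UNK]", "[PAD]", "the", "model", "data"]
target_corpus = ["zeta model data", "zeta kappa", "kappa model"]

new_vocab, added, screened_out = expand_and_screen(original_vocab, target_corpus)
print(new_vocab)
print(added)
print(screened_out)
```

In a real model, the newly added entries would also need embedding rows, and the rows of screened-out entries could be removed; both steps are outside this sketch.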
- AI developers
- linguistic minorities
- developers in emerging markets
- local content creators
- monolingual AI models
- societies with limited linguistic diversity
AI models will become accessible and performant for a wider array of languages, fostering local language content creation and digital inclusion.
This could accelerate the development of AI tools tailored to specific cultural and linguistic contexts, driving new forms of localized innovation.
Increased linguistic equity in AI could subtly shift geopolitical soft power, as more nations and linguistic groups contribute to and benefit from cutting-edge AI.
Read at arXiv cs.CL