A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation

arXiv:2606.06420v1. Abstract: We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian translation under severe parallel-data scarcity in zero-shot and retrieval-based few
The proliferation of large language models is driving efforts to extend their capabilities to a wider range of human languages, including those that are endangered and resource-poor.
This work addresses linguistic diversity in the age of AI by applying LLM translation to a language that has traditionally been neglected for lack of data.
The ability to develop parallel corpora and evaluation protocols for extremely low-resource languages opens new avenues for preserving linguistic heritage and expanding AI's global reach.
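The leakage-aware evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each sentence pair carries a story identifier and that "leakage-aware" means no story contributes sentences to both the few-shot retrieval pool and the test set. The field names (`story_id`, `src`, `tgt`), the function names, and the character-trigram retrieval are hypothetical choices made for illustration; the source does not describe the paper's actual split or retrieval method.

```python
import random
from collections import defaultdict


def split_by_story(pairs, test_fraction=0.2, seed=0):
    """Split sentence pairs so that whole stories go to either pool or test.

    `pairs` is a list of dicts with hypothetical keys 'story_id', 'src', 'tgt'.
    """
    by_story = defaultdict(list)
    for pair in pairs:
        by_story[pair["story_id"]].append(pair)

    # Shuffle story identifiers, not sentences, so no story is split.
    story_ids = sorted(by_story)
    random.Random(seed).shuffle(story_ids)
    n_test = max(1, round(len(story_ids) * test_fraction))

    test = [p for sid in story_ids[:n_test] for p in by_story[sid]]
    pool = [p for sid in story_ids[n_test:] for p in by_story[sid]]
    return pool, test


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lowercased string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def retrieve_examples(query, pool, k=3):
    """Pick the k pool pairs whose source side is most similar to `query`.

    Similarity is Jaccard overlap of character trigrams, an assumed stand-in
    for whatever retriever a real few-shot setup would use.
    """
    query_grams = char_ngrams(query)

    def score(pair):
        grams = char_ngrams(pair["src"])
        union = query_grams | grams
        return len(query_grams & grams) / len(union) if union else 0.0

    return sorted(pool, key=score, reverse=True)[:k]
```

Because few-shot examples are drawn only from `pool`, a test sentence never sees other sentences from its own story as demonstrations, which is the kind of leakage that story identifiers make it possible to rule out.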
- Linguistic preservation efforts
- Developers of multilingual LLMs
- Speakers of endangered languages
- Computational linguists
- Language barriers
Increased accessibility of AI technologies for communities speaking low-resource languages.
Potential for AI to aid in the revitalization and documentation of endangered languages.
Reduced digital divide for linguistically diverse populations, fostering greater cultural exchange and economic participation.
Read at arXiv cs.CL