QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

arXiv:2606.23943v1. Abstract: Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark comparing four tokenization strategies (BPE, Unigram LM, WordPiece, and a morphology-aware PRPE tokenizer) for Southern Quechua (quz), a low-resource agglutinative language spoken by 8-10 million people in South America. Using a 200k-sentence corpus and the SQUOIA finite-state morphological analyzer (Rios, 2016) as silver
As AI development expands beyond English-centric datasets, low-resource languages need tokenization strategies suited to their structure.
Better tokenization for agglutinative, low-resource languages such as Quechua broadens the range of languages AI systems can serve and supports more equitable development, with less reliance on data from dominant linguistic groups.
The proposed QuechuaTok benchmark evaluates tokenizers with a morphology-aware metric, morphological boundary accuracy, shifting the focus from fertility rate (the average number of tokens per word) to linguistic correctness for agglutinative languages.
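To make the contrast between the two kinds of metric concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the precision/recall/F1 formulation of boundary accuracy, and the toy segmentations are all assumptions made for illustration. The example word is wasikunapi ("in the houses"), segmented as wasi-kuna-pi.

```python
def boundaries(segments):
    """Return the internal character offsets at which a segment ends."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def fertility(tokenized_words):
    """Average number of subword tokens per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def boundary_scores(predicted, gold):
    """Precision, recall and F1 of predicted token boundaries against
    reference morpheme boundaries (an assumed formulation, not the paper's)."""
    tp = fp = fn = 0
    for pred_segs, gold_segs in zip(predicted, gold):
        # Both segmentations must spell the same surface word.
        assert "".join(pred_segs) == "".join(gold_segs)
        p, g = boundaries(pred_segs), boundaries(gold_segs)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical reference segmentation: wasi (house) + kuna (plural) + pi (locative).
gold = [["wasi", "kuna", "pi"]]
# Two hypothetical tokenizer outputs with the same number of tokens.
tok_a = [["was", "ikun", "api"]]   # splits fall inside morphemes
tok_b = [["wasi", "kuna", "pi"]]   # splits fall on morpheme boundaries

print(fertility(tok_a), fertility(tok_b))
print(boundary_scores(tok_a, gold))
print(boundary_scores(tok_b, gold))
```

Both toy tokenizations produce three tokens for the word, so fertility cannot tell them apart, while the boundary scores separate the one that respects morpheme boundaries from the one that does not.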
- AI developers working with low-resource languages
- Speakers of agglutinative languages like Quechua
- NLP researchers
- South American linguistic communities
- AI models reliant on fertility rate for tokenization evaluation
- Monolingual AI development approaches
More accurate NLP models for agglutinative low-resource languages will emerge, leading to better language preservation and digital inclusion.
This methodology could be adapted for other complex morphological languages, increasing the global reach and utility of AI systems.
Enhanced AI capabilities in these languages may foster local digital economies and reduce dependence on linguistic data held by major powers.
Read at arXiv cs.CL