Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

arXiv:2605.22203v1. Abstract: In this study, we compare the performance of four text chunking approaches (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs.
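To make the chunking and overlap ideas concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how the Sentence-Based chunker works or how Khmer Coverage and Khmer Intersection over Union are defined. The sketch assumes that sentences end at the Khmer full stop "។" (U+17D4), that coverage is the share of reference tokens found in the retrieved text, and that IoU is computed over token sets. The names `sentence_chunks`, `coverage`, `iou`, and the `max_chars` parameter are hypothetical, and tokenization is left to the caller because Khmer does not separate words with spaces.

```python
import re

KHAN = "\u17d4"  # "។", the Khmer full stop (assumed sentence boundary)


def sentence_chunks(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole, not split.
    """
    # Each match is a run of non-terminator characters plus its optional "។".
    sentences = [s.strip() for s in re.findall(f"[^{KHAN}]+{KHAN}?", text)]
    chunks, current = [], ""
    for sentence in filter(None, sentences):
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def coverage(retrieved_tokens, reference_tokens):
    """Assumed definition: fraction of reference tokens present in the retrieved text."""
    reference = set(reference_tokens)
    if not reference:
        return 0.0
    return len(reference & set(retrieved_tokens)) / len(reference)


def iou(retrieved_tokens, reference_tokens):
    """Assumed definition: intersection over union of the two token sets."""
    retrieved, reference = set(retrieved_tokens), set(reference_tokens)
    union = retrieved | reference
    return len(retrieved & reference) / len(union) if union else 0.0


if __name__ == "__main__":
    # Illustrative text only, not taken from the paper's corpus.
    text = "ស្រូវជាដំណាំសំខាន់។ កសិករដាំស្រូវនៅរដូវវស្សា។"
    for chunk in sentence_chunks(text, max_chars=20):
        print(chunk)
    # Pre-tokenized word lists; a real pipeline would need a Khmer word segmenter.
    print(coverage(["ស្រូវ", "ដំណាំ"], ["ស្រូវ", "កសិករ"]))
    print(iou(["ស្រូវ", "ដំណាំ"], ["ស្រូវ", "កសិករ"]))
```

In a full pipeline, each chunk would then be embedded (the study uses BGE-M3) and indexed for nearest-neighbor search (the study uses FAISS). Those steps are omitted here to keep the sketch dependency-free.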
The study arrives as demand grows for effective AI solutions in low-resource languages, particularly in critical sectors such as agriculture, and as multilingual embedding models and RAG frameworks continue to advance.
Improving text embedding and retrieval for low-resource languages is crucial for expanding AI's utility beyond dominant languages, enabling better information access and decision-making in previously underserved regions and domains.
Optimizing chunking strategies directly affects how efficiently and accurately AI models process content in languages like Khmer, which in turn makes more local-language data usable.
- AI developers focused on multilingual models
- Agricultural sectors in low-resource language regions
- Local language content creators and data providers
- Retrieval-Augmented Generation (RAG) framework providers
- Monolingual AI solutions
- Systems reliant on manual information extraction in such regions
Improved performance of AI applications, such as enhanced agricultural advisories or market intelligence, in Khmer-speaking areas.
Increased digital literacy and data-driven decision-making among farmers and agricultural policymakers in these regions.
Potential for economic growth and food security improvements in nations with low-resource languages, fueled by better access to AI-driven insights.
Read at arXiv cs.CL