
arXiv:2606.29378v1 (announce type: new)
Abstract: Sinhala is a morphologically rich abugida spoken by roughly 16 million people in Sri Lanka, and to date there are no publicly available real-world datasets for page-level Sinhala OCR. All previous studies assessing Sinhala OCR models have used artificially generated data. To bridge this gap, we introduce sinhala-ocr-lk-acts-1010, an annotated dataset of 1,010 page-level images and their transcriptions collected from Sri Lankan Legislative Acts published in 1981-1989 and 2000-2019, split into 707 training examples, 101 validation examples
The increasing maturity of AI and NLP models is driving efforts to extend their capabilities to a wider array of less-resourced languages, leading to the creation of datasets like this one.
Improving OCR for morphologically rich, lower-resource languages like Sinhala expands access to historical documents and digital information, potentially fostering local digital economies and knowledge preservation.
The availability of a real-world, page-level Sinhala OCR dataset removes a significant barrier for developing and evaluating more accurate OCR systems for the language, shifting from synthetic to authentic data-driven research.
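Evaluating an OCR system against real transcriptions usually comes down to comparing predicted text with reference text. The sketch below computes a character error rate (CER), a common OCR metric; the source does not say which metric the dataset's authors use, so this is an assumption for illustration. The function names `edit_distance` and `character_error_rate` are hypothetical, and distances are counted in Unicode code points after NFC normalization, which is one reasonable choice for an abugida such as Sinhala but not the only one.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length (assumed metric, not from the source)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Toy strings for illustration only; they are not taken from the dataset.
    print(character_error_rate("පනත", "පනන"))
```

Counting in code points means a misread vowel sign and a misread base consonant each cost one error; a grapheme-cluster variant would weigh them differently, so reported numbers are only comparable when the unit is stated.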
- Sri Lankan linguistic researchers
- NLP developers focusing on South Asian languages
- Digital archivists and libraries in Sri Lanka
- Monolingual OCR solutions
- Legacy manual data entry processes for Sinhala texts
Improved Sinhala OCR would make it practical to digitize historical legislative and cultural documents, making them searchable and open to automated analysis.
This digitization could enable new applications in legal tech, historical research, and education within Sri Lanka, potentially reducing information asymmetry.
The success of this approach for Sinhala could serve as a blueprint for developing similar real-world datasets and OCR solutions for other low-resource languages globally, fostering digital inclusivity on a broader scale.
Read at arXiv cs.CL