A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

arXiv:2605.02270v2. Abstract: This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of "Shahnameh", diplomatic articles, texts of "Masnavi-i Ma'navi", official terminology lists, and transliterated correspondences. The initial dataset comprised 328,2
The increasing maturity of various AI architectures necessitates systematic comparative benchmarks to guide future development, particularly for less-resourced language pairs like Tajik-Persian.
This research provides a foundational dataset and comparative analysis for machine transliteration between two strategically important languages, supporting cross-cultural digital communication and information access.
The availability of a comprehensive benchmark and parallel corpus streamlines development efforts for Tajik-Persian language technologies, potentially bridging communication gaps.
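As an illustration of what comparing transliteration systems on a shared corpus can involve, the sketch below computes a character error rate (CER) from Levenshtein edit distance. This brief does not state which metrics the paper uses, so CER is an assumption chosen for illustration, and the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Total edits divided by total reference characters (assumed metric, not from the paper)."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(h, r) for h, r in zip(hypotheses, references))
    return total_edits / total_chars


if __name__ == "__main__":
    # Toy pair: the reference is the Persian spelling of "book"; the hypothesis drops one letter.
    refs = ["کتاب"]
    hyps = ["کتب"]
    print(character_error_rate(hyps, refs))  # 1 edit over 4 reference characters
```

Because the distance is computed over code points, any Unicode normalization of the Persian side (for example, Arabic versus Persian forms of kaf and yeh) would need to be applied consistently to both hypotheses and references before scoring.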
- AI researchers in natural language processing
- Tajik and Persian speaking communities
- Organizations operating across Central Asia and Iran
- Developers of less efficient, unbenchmarked transliteration models
Improved machine transliteration accuracy between Tajik and Persian will facilitate better information exchange.
Enhanced digital communication tools could lead to stronger economic and cultural ties between regions using these languages.
The methodology and dataset could inspire similar efforts for other low-resource language pairs, fostering global linguistic inclusivity in AI.
Read at arXiv cs.CL