IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

arXiv:2606.20089v1. Abstract: Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate remova
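To make the preprocessing idea concrete, here is a minimal Python sketch of the three stages the abstract names: normalization, exact-duplicate removal, and vector-based near-duplicate removal. It is not the authors' pipeline, which the truncated abstract does not describe. The normalization rules, the character-trigram count vectors (a stand-in for whatever embeddings the paper uses), the 0.9 cosine threshold, and all function names are assumptions made for illustration.

```python
import hashlib
import math
import unicodedata
from collections import Counter

# Assumed normalization rule: map Arabic yeh/kaf to their Persian forms.
# The source only says "normalization"; the actual rules are not given.
ARABIC_TO_PERSIAN = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def normalize(text: str) -> str:
    """Apply NFC, unify Arabic/Persian letter variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).translate(ARABIC_TO_PERSIAN)
    return " ".join(text.split())


def char_ngram_vector(text: str, n: int = 3) -> Counter:
    """Sparse count vector of character n-grams (hypothetical embedding stand-in)."""
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(docs, threshold: float = 0.9):
    """Keep the first occurrence of each document; drop exact and near duplicates."""
    seen_hashes = set()
    kept = []  # (normalized text, vector) pairs for documents kept so far
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        vec = char_ngram_vector(norm)
        if any(cosine(vec, kept_vec) >= threshold for _, kept_vec in kept):
            continue  # near duplicate of a document already kept
        kept.append((norm, vec))
    return [text for text, _ in kept]
```

The pairwise comparison here is quadratic in the number of kept documents, which is fine for a demonstration but not for a 45 GB corpus; at that scale an approximate nearest-neighbor index or hashing scheme would normally replace the inner loop.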
Efforts to improve domain-specific AI capabilities for non-English languages are growing as the global AI landscape matures beyond a handful of dominant languages and models.
Development of high-quality, monolingual Persian PLMs like IHUBERT reduces dependency on foreign AI infrastructure and enhances national AI capabilities for specific strategic use cases.
Persian-language AI applications can be built on a dedicated monolingual model rather than on less suitable general-purpose models or limited datasets, which may improve their accuracy and quality.
- Persian-speaking researchers
- Iranian tech sector
- Persian government
- Linguistic AI diversity proponents
- Generic English-first AI models
Improved performance of NLP applications for Persian.
Increased innovation in Persian-specific AI tools and services, fostering digital sovereignty.
Potential for other nations to accelerate similar localized AI development efforts and reduce linguistic dependency on major global models.
Read at arXiv cs.CL