SIGNAL · AI · May 27, 2026, 4:00 AM · Signal 75 · Medium term

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Source: arXiv cs.LG

arXiv:2510.10774v3. Abstract: Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimiz

Why this matters
Why now

The release of ParsVoice addresses a critical gap in open-source AI resources for less-resourced languages, coinciding with a global push for more inclusive and diverse AI development.

Why it’s important

This development is crucial for nations and regions seeking to develop their own AI capabilities and reduce dependency on models trained exclusively on dominant languages, fostering digital sovereignty.

What changes

The availability of a large-scale Persian speech corpus enables the development of advanced multi-speaker text-to-speech (TTS) systems and other speech technologies for Persian, a language that has so far lagged behind major languages in this area.

Winners
  • Iranian tech companies
  • Persian-speaking populations
  • AI researchers in low-resource languages
  • NLP/TTS developers
Losers
Second-order effects

Direct

Improved AI applications and services for Persian speakers, including voice assistants and accessibility tools.

Second

Increased regional digital autonomy and reduced reliance on foreign AI infrastructure for Persian language processing.

Third

Potential for other nations with underrepresented languages to accelerate similar domestic AI data and model development efforts.

Editorial confidence: 90 / 100 · Structural impact: 55 / 100