
arXiv:2606.03504v1 (Announce Type: new)

Abstract: We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a Word Error Rate (WER) of 30.07% on a held-out validation set of 538 utterances, down from a measured zero-shot baseline of 182.18% for Whisper-small on Balti. The dataset, fine-tuned
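A zero-shot WER above 100% can look like a typo, but it is possible under the standard definition: WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference, so a hypothesis that is longer than the reference and mostly wrong can score above 100%. The sketch below illustrates that definition only. It is not the paper's evaluation script, the function name `word_error_rate` is hypothetical, the example strings are placeholder tokens rather than Balti data, and any text normalization the authors may have applied is not modeled.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row for the empty reference prefix: j insertions to reach j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        # First column: i deletions to reach an empty hypothesis.
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens for illustration only (assumed, not from the corpus).
reference = "a b c"
close_hypothesis = "a b d"          # one substitution
long_hypothesis = "x y z w v u"     # three substitutions plus three insertions

print(f"{word_error_rate(reference, close_hypothesis):.2%}")  # 33.33%
print(f"{word_error_rate(reference, long_hypothesis):.2%}")   # 200.00%
```

The second call shows how a model that emits long, unrelated output for a language it has not seen can exceed 100% WER, which is consistent with the kind of zero-shot figure the abstract reports.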
As AI models proliferate, the shortage of training data for under-resourced languages has become a pressing obstacle to equitable AI development. This paper addresses that gap for Balti, a language that previously had no publicly available ASR resources.
The work narrows the AI language gap for one Tibetic language, which may support digital inclusion and the preservation of linguistic diversity while expanding the addressable market for ASR technologies. It also highlights data collection, here recordings contributed through Mozilla Common Voice, as a key enabler for local AI development.
Balti now has its first publicly available ASR resource. Fine-tuning Whisper-small on the corpus reduced WER from a 182.18% zero-shot baseline to 30.07% on the held-out validation set, giving local communities a usable starting point for speech-to-text and for further AI applications in a previously underserved language.
- Balti language speakers
- Linguistic diversity advocates
- OpenAI (Whisper)
- Researchers in low-resource NLP
The Balti community could gain access to AI-powered tools that were previously unavailable, such as voice assistants or transcription services built on the fine-tuned model.
This effort could inspire similar data collection and model fine-tuning initiatives for other low-resource languages, fostering a more inclusive global AI ecosystem.
Improved AI capabilities for local languages may lead to new localized AI products and services, creating economic opportunities and reducing dependence on dominant linguistic AI platforms.
Read at arXiv cs.CL