Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

arXiv:2606.07608v1 (announce type: cross)
Abstract: We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the A
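For readers unfamiliar with the headline metric, the following is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. The paper's actual scoring pipeline, its text normalization, and its definition of cWER are not given in the excerpt above and are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization, which a real evaluation would need to specify.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(word_error_rate("das ist ein test", "das ist test"))
```

Because the denominator is the reference length, insertions can push WER above 100%, and differences in spelling or transcription convention between reference and output count as errors even when the meaning is preserved.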
This research provides a current benchmark for fine-tuning a large speech recognition model (Whisper large-v3) for a low-resource language variety, reflecting ongoing efforts to improve ASR performance and address linguistic diversity.
A strategic reader should care because better ASR for less widely spoken languages broadens where speech technology can be used, reduces regional dependence on models built for major languages, and shows how strongly data quality and subtitle alignment constrain fine-tuning.
The study compares LoRA with full fine-tuning of Whisper across 16 training runs and quantifies the effect of data quality, subtitle alignment, and training strategy, suggesting pathways toward more efficient and accurate localization.
- AI researchers
- Swiss German language users
- Language technology companies
- Monolingual AI solutions
Improved voice interface accessibility and utility for speakers of low-resource languages.
Increased demand for curated, high-quality linguistic datasets for specialized AI training.
Potential for sovereign AI initiatives in smaller linguistic regions to develop tailored, performant models.
Read at arXiv cs.LG