WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

arXiv:2509.10452v2. Abstract: Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech
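
The abstract describes a VAE trained to model speech-encoder outputs from text but does not give its loss or architecture. As a rough illustration only, the sketch below shows the standard VAE objective (reconstruction error plus a KL term against a standard normal prior) on plain Python lists. The function names, the mean-squared-error reconstruction term, the `beta` weight, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def vae_loss(recon, target, mu, log_var, beta=1.0):
    """Standard VAE objective: reconstruction term plus beta-weighted KL.

    In a text-to-latent setup, `target` would stand in for a (hypothetical)
    speech-encoder output vector, while `mu` and `log_var` would be
    predicted from the transcript text alone.
    """
    return mse(recon, target) + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.2, -0.1, 0.4]    # toy stand-in for an encoder output
    mu = [0.1, 0.0, 0.3]         # toy posterior mean predicted from text
    log_var = [-2.0, -2.0, -2.0]  # toy posterior log-variance
    z = reparameterize(mu, log_var, rng)
    # Identity "decoder" for illustration: the sample itself is the reconstruction.
    print(vae_loss(z, target, mu, log_var))
```

In this toy setup the sampled latent is used directly as the reconstruction; a real system would pass it through a learned decoder before comparing it with the encoder output.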
Even powerful pretrained ASR models like Whisper still need domain adaptation, and where speech data is difficult to collect, that need is driving work on text-only methods.
This development allows pretrained ASR models to be fine-tuned for specialized vocabulary using only text, lowering the data collection barrier to high-performance voice AI.
Deeply supervised, text-only domain adaptation reduces the dependency on large, domain-specific audio datasets, making custom voice AI more practical for niche applications.
- ASR developers
- Companies with specialized jargon
- Industries with sensitive audio data
- Providers of expensive domain-specific audio datasets
- Companies reliant on large-scale speech data collection
ASR systems will become more versatile and quicker to adapt to new contexts.
This could accelerate the deployment of voice interfaces in highly specialized or sensitive domains where collecting speech data is impractical.
The reduced data barrier might lead to a proliferation of custom voice AI applications, potentially changing white-collar workflows through highly accurate speech-to-text transcription in specific fields.
Read at arXiv cs.CL