
arXiv:2606.14391v1 Abstract: Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a
As voice interaction with AI becomes more natural and ASR models grow more capable, handling speech disfluencies well matters more for both user experience and recognition performance.
This research addresses a key limitation in current ASR, which often struggles with the nuances of natural human speech, impacting the reliability and accuracy of voice-driven applications.
ASR systems may evolve from transcribing only idealized speech to accurately capturing and interpreting the full range of human vocalizations, including 'ums' and 'uhs', leading to more robust, human-like AI interactions.
- AI developers
- Customer service platforms
- Speech-to-text applications
- Accessibility technology
- Existing ASR models (without disfluency handling)
- Companies reliant on perfect speech inputs
ASR models become significantly more accurate and natural in understanding human speech.
Improved ASR accelerates the development and adoption of AI agents that can seamlessly interact with humans through voice.
Enhanced voice interfaces deepen human reliance on AI for daily tasks, blurring the lines between human and machine communication in numerous sectors.
Read at arXiv cs.AI