Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

arXiv:2607.01733v1. Abstract: Speech-LLM integration has shown promising results by leveraging extensive textual pretraining, yet its specific benefits for automatic speech recognition (ASR) remain unclear. We observe that as supervised ASR training data increases, the contribution of LLM priors becomes less evident, and simple speech-text joint training under-utilizes textual knowledge. We therefore propose Joint Speech-Text Interleaved Pretraining (JSTIP), an ASR-oriented pretraining strategy that constructs word-level and segment-level interleaved speech-text sequences wit
As large language models (LLMs) are increasingly integrated with speech systems, techniques are needed that make the combination actually pay off for ASR.
Better speech-LLM integration could improve ASR accuracy, which underpins many AI applications and interfaces.
Pretraining strategies such as JSTIP may make fuller use of textual knowledge for ASR, potentially leading to more robust and accurate speech recognition systems.
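To make the idea of word-level and segment-level interleaving concrete, here is a minimal Python sketch. It is not the paper's implementation: the `AlignedWord` type, the `interleave` function, the `segment_len` and `p_text` parameters, and the use of discrete speech units are all assumptions made for illustration. It only shows one plausible way to switch between speech and text at word or segment boundaries, given word-level alignments.

```python
import random
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """One word with its aligned speech span (hypothetical representation)."""
    text: str          # the word as text
    speech: list[int]  # assumed discrete speech units aligned to this word


def interleave(words, segment_len=1, p_text=0.5, seed=0):
    """Build one interleaved speech-text sequence.

    Consecutive words are grouped into segments of `segment_len` words
    (segment_len=1 gives word-level switching). Each segment is emitted
    either as text or as speech, chosen at random with probability `p_text`
    of picking text. Returns a list of (modality, payload) pairs.
    """
    rng = random.Random(seed)
    sequence = []
    for start in range(0, len(words), segment_len):
        segment = words[start:start + segment_len]
        if rng.random() < p_text:
            sequence.append(("text", [w.text for w in segment]))
        else:
            # Flatten the speech units of every word in the segment.
            sequence.append(("speech", [u for w in segment for u in w.speech]))
    return sequence


if __name__ == "__main__":
    utterance = [
        AlignedWord("hello", [1, 2]),
        AlignedWord("world", [3]),
        AlignedWord("again", [4, 5, 6]),
    ]
    print(interleave(utterance, segment_len=1))  # word-level switching
    print(interleave(utterance, segment_len=2))  # segment-level switching
```

In this sketch, setting `p_text` to 1.0 or 0.0 yields a text-only or speech-only sequence, which is a convenient sanity check; how the actual method chooses switch points and mixing ratios is not stated in the excerpt above.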
- AI developers
- ASR companies
- Voice assistant providers
- Speech-to-text service providers
- Traditional ASR models lacking LLM integration
- Companies with suboptimal speech-LLM integration strategies
ASR systems become more accurate and efficient, reducing errors in voice-controlled interfaces and transcriptions.
Enhanced ASR capabilities accelerate the development and adoption of advanced conversational AI and agentic systems.
Improved speech understanding could lower barriers to entry in digital economies and AI applications for speakers of languages around the world.
Read at arXiv cs.CL