
arXiv:2606.01483v1. Abstract: Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter
Growing demand for efficient, accurate AI models in real-world applications, especially conversational AI and ambient computing, calls for advances in ASR inference systems.
Improving ASR efficiency without sacrificing accuracy is critical for scaling AI applications, reducing computational costs, and enabling more seamless human-computer interaction across various industries.
Murmur aims to overcome the traditional trade-off between latency and accuracy in long-form ASR, which could make real-time speech-to-text practical to deploy more widely.
- AI software developers
- Cloud computing providers
- Enterprises adopting AI
- Users of voice interfaces
- ASR systems with high latency
- Specialized transcription services reliant on manual review
More accurate and faster transcription services become widely available, improving accessibility and productivity.
The reduced computational cost and improved performance could accelerate the development of more complex voice AI agents and ambient computing devices.
Enhanced ASR could contribute to a paradigm shift in human-computer interaction, making voice the primary interface for many applications, potentially impacting hardware design and software ecosystems.
Read at arXiv cs.LG