WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

arXiv:2604.08558v2. Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated
The proliferation of high-fidelity AR-TTS models necessitates research into more efficient architectures to overcome their inherent computational and memory limitations.
Efficient autoregressive text-to-speech models are crucial for scaling AI applications, reducing inference costs, and enabling broader access to advanced generative AI capabilities.
The proposed WAND framework adapts existing AR-TTS models to run with constant computational and memory complexity, making them more practical for real-world deployment.
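The split described in the abstract (global attention over conditioning tokens plus a sliding window over generated tokens) can be illustrated with a toy attention mask. This is a minimal sketch and not the paper's implementation: the function name `wand_style_mask`, the parameters `num_cond`, `num_gen`, and `window`, and the choice to apply a causal mask to the conditioning tokens are all assumptions made for illustration, since the abstract here is truncated and gives no such details.

```python
def wand_style_mask(num_cond: int, num_gen: int, window: int) -> list[list[bool]]:
    """Build a boolean attention mask (hypothetical helper, not from the paper).

    mask[q][k] is True when query position q may attend to key position k.
    Positions 0..num_cond-1 are conditioning tokens; the rest are generated tokens.
    """
    total = num_cond + num_gen
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):          # causal: only keys at or before the query
            is_global = k < num_cond    # conditioning tokens stay visible to every later query
            is_local = q - k < window   # the most recent `window` positions, including q itself
            mask[q][k] = is_global or is_local
    return mask


# Assumed toy sizes: 3 conditioning tokens, 10 generated tokens, window of 4.
mask = wand_style_mask(num_cond=3, num_gen=10, window=4)

# Number of keys each query attends to; for generated tokens this never
# exceeds num_cond + window, however many tokens have been generated.
visible_per_query = [sum(row) for row in mask]
print(max(visible_per_query))  # 7 here: 3 conditioning keys + 4 windowed keys
```

Under these assumptions, the number of keys a generated token attends to is capped at `num_cond + window`, which is why per-step compute and cache size stop growing with the length of the generated sequence.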
- AI model developers
- Cloud computing providers
- Generative AI application developers
- Edge AI device manufacturers
- Companies reliant on inefficient large-scale AR-TTS models without adaptation
Reduced operational costs and increased accessibility for high-quality text-to-speech generation across various applications.
Accelerated development and deployment of voice-enabled AI agents and digital assistants due to more efficient underlying models.
Enhanced user experience and personalization in AI interactions, potentially spurring new forms of human-computer interaction.