How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

arXiv:2606.20532v1. Abstract: Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the first time, and apply it to CapSpeech-TTS. Our method extracts per-token heatmaps across 25 layers and 24 ODE steps. We analyze 3,600 (style caption, text t
This research addresses a persistent problem in expressive text-to-speech: style captions control voice characteristics, but it is hard to explain or verify how individual words do so. The problem becomes more pressing as these models grow more capable and more widely deployed.
Better insight into how a model turns natural language instructions into speech should make failures easier to diagnose, improve reliability, and allow more precise control over generated audio.
The adoption of cross-attention attribution for speech diffusion models provides a new method for understanding and debugging expressive TTS, potentially leading to more robust and controllable systems.
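As a rough illustration of what per-token attribution over layers and ODE steps can look like, the sketch below sums cross-attention weights into one heatmap per caption token. It is a minimal sketch under stated assumptions, not the paper's implementation: the 25 layers and 24 steps come from the abstract, while the tensor layout, the head, frame, and token counts, the peak normalization, and the function names `token_heatmaps` and `token_share` are hypothetical. It also assumes every layer attends over the same number of acoustic frames, so no resampling is needed before summing.

```python
import numpy as np


def token_heatmaps(attn):
    """Aggregate cross-attention into one heatmap per caption token.

    attn: array of shape (steps, layers, heads, frames, tokens), assumed
    layout. Returns an array of shape (tokens, frames), with each token's
    row scaled so its peak is 1 (rows with no attention stay all zero).
    """
    attn = np.asarray(attn, dtype=np.float64)
    if attn.ndim != 5:
        raise ValueError("expected (steps, layers, heads, frames, tokens)")
    # Sum over ODE steps, layers, and heads, leaving (frames, tokens).
    summed = attn.sum(axis=(0, 1, 2))
    heat = summed.T  # (tokens, frames)
    peak = heat.max(axis=1, keepdims=True)
    # Avoid dividing by zero for tokens that received no attention.
    return heat / np.where(peak > 0, peak, 1.0)


def token_share(attn):
    """Fraction of total attention mass that each caption token receives."""
    attn = np.asarray(attn, dtype=np.float64)
    mass = attn.sum(axis=(0, 1, 2, 3))
    return mass / mass.sum()


if __name__ == "__main__":
    # Random stand-in for attention captured from a model (hypothetical data).
    rng = np.random.default_rng(0)
    steps, layers, heads, frames, tokens = 24, 25, 4, 50, 6
    logits = rng.normal(size=(steps, layers, heads, frames, tokens))
    # Softmax over caption tokens, as in a standard cross-attention layer.
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    print(token_heatmaps(weights).shape)
    print(token_share(weights))
```

With real attention weights in place of the random tensor, each row of the heatmap would indicate which stretches of the utterance a given caption word attended to most, which is the kind of signal a failure diagnosis could start from.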
- AI developers
- Voice AI companies
- Content creators using AI voices
- Researchers in explainable AI
- Companies with less sophisticated TTS models
- AI systems prone to black-box failures
Explaining how style captions influence speech allows for targeted improvements in AI voice synthesis accuracy and realism.
Greater control over voice characteristics could support more nuanced, emotionally appropriate speech in applications such as virtual assistants and customer service.
The methodology could generalize to other generative AI domains, improving interpretability and controllability across a broader spectrum of synthetic media creation.
Read at arXiv cs.AI