Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech

arXiv:2606.13989v1 Announce Type: cross

Abstract: Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference
Refining inference mechanisms for discrete probabilistic frameworks is one route to better generative AI models, particularly in text-to-speech.
This work targets the control and stability of AI-generated speech, both of which matter for high-quality synthetic media and more natural human-computer interaction.
A revisable inference stack for discrete flow matching models is intended to generate high-quality speech stably and with finer control, which would reduce artifacts and make synthetic voices more useful.
- AI researchers
- Speech synthesis developers
- Content creators using AI voices
- AI platform providers
- Legacy speech synthesis methods relying on explicit duration modeling
Improved fidelity and control in AI-generated speech.
Reduced barriers for sophisticated synthetic voice applications in entertainment, education, and accessibility.
Enhanced realism in virtual assistants and digital companions, potentially leading to deeper human-AI integration.
Read at arXiv cs.AI