
arXiv:2605.28456v1 (announce type: new)
Abstract: Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion Large Language Model (DLLM)-based VSR framework, formulating transcription as iterative masked denoising with flexible-order decoding. With confidence-based unmasking, DLLM-VSR commits high-confidence positions early and uses the committed tokens as bidirectional context
Advances in Large Language Models (LLMs) and diffusion models are increasingly being applied in other AI subfields, in this case visual speech recognition.
The work proposes a methodological change for VSR: moving away from strictly left-to-right decoding could improve accuracy and robustness where visual cues are ambiguous.
VSR decoding shifts from purely autoregressive generation to iterative masked denoising, in which high-confidence tokens are committed first and then serve as bidirectional context for the remaining positions.
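The confidence-based unmasking idea can be illustrated with a toy sketch. This is not the DLLM-VSR implementation: `confidence_unmask`, `toy_predict`, the example sentence, and all confidence scores are hypothetical stand-ins for a real denoising model. It only shows how committing the most confident masked position first lets a later, clearer word resolve an earlier, visually ambiguous one.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marker for a position that has not been committed yet

# A predictor maps the current partial sequence to one (token, confidence)
# proposal per position. A real system would run a denoising model here.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def confidence_unmask(
    length: int, predict: Predictor, per_step: int = 1
) -> Tuple[List[Optional[str]], List[int]]:
    """Iteratively commit the most confident masked positions.

    Returns the final tokens and the order in which positions were committed.
    """
    tokens: List[Optional[str]] = [MASK] * length
    order: List[int] = []
    while any(t is MASK for t in tokens):
        proposals = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Highest-confidence masked positions first.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = proposals[i][0]
            order.append(i)
    return tokens, order


def toy_predict(tokens: List[Optional[str]]) -> List[Tuple[str, float]]:
    """Hypothetical predictor with hand-picked scores (assumption, not data).

    Position 1 is ambiguous ("pat" and "bat" look alike on the lips) until
    the following word "flew" has been committed as context.
    """
    second = ("bat", 0.9) if tokens[2] == "flew" else ("pat", 0.4)
    return [("the", 0.95), second, ("flew", 0.8), ("away", 0.7)]


if __name__ == "__main__":
    final_tokens, commit_order = confidence_unmask(4, toy_predict)
    print(final_tokens, commit_order)
```

In this toy run the ambiguous second word is committed only after the word to its right. A strict left-to-right decoder using the same scores would have had to choose at position 1 before that context existed.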
- AI researchers in speech recognition
- Developers of accessibility technologies
- Industries requiring robust human-computer interaction
- Security and surveillance sectors
- Developers relying on outdated VSR methodologies
- Systems with limited computational resources for advanced AI models
Improved accuracy in recognizing speech from visual cues, particularly in noisy or ambiguous conditions.
Enhanced accessibility tools for individuals with hearing impairments and more reliable voice control in challenging environments.
Broader adoption of visual speech recognition in everyday devices, potentially enabling new forms of silent communication and interaction.