
arXiv:2510.20441v2 (Announce Type: replace-cross)

Abstract: Neural audio codecs have largely promoted the application of language models (LMs) for speech applications. However, the effectiveness of autoregressive LM-based models in unifying speech enhancement (SE) tasks remains underexplored. In this work, we propose UniSE, a unified decoder-only LM-based framework to handle different SE tasks including speech restoration, target speaker extraction, and speech separation. Conditioned on input speech features, it autoregressively generates target discrete tokens, facilitating compatibility …
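To make the abstract's description more concrete, here is a minimal sketch of the general pattern it names: a decoder that receives a conditioning input (assumed here to be a task indicator plus input speech features) and emits discrete target tokens one at a time until an end token appears. This is not UniSE's implementation. The task table `TASKS`, the functions `generate_tokens` and `toy_model`, the vocabulary size and the EOS id are all hypothetical stand-ins. Greedy argmax decoding is also an assumption, because the abstract does not say which decoding strategy is used.

```python
from typing import Callable, List, Sequence

# Hypothetical task IDs; the abstract does not specify how UniSE signals the task.
TASKS = {"restoration": 0, "target_speaker_extraction": 1, "separation": 2}

# Toy vocabulary (assumed sizes): ids 0..3 stand in for codec codes, id 4 is end-of-sequence.
VOCAB_SIZE = 5
EOS = 4

# Signature of a next-token scorer: (speech features, task id, tokens so far) -> logits.
Scorer = Callable[[Sequence[float], int, List[int]], List[float]]


def generate_tokens(
    next_token_logits: Scorer,
    speech_features: Sequence[float],
    task: str,
    max_len: int,
    eos_id: int,
) -> List[int]:
    """Greedy autoregressive decoding of discrete target tokens.

    In a decoder-only model the conditioning input and the generated tokens
    share one sequence; here the conditioning is simply passed to the scorer.
    """
    task_id = TASKS[task]
    tokens: List[int] = []
    for _ in range(max_len):
        logits = next_token_logits(speech_features, task_id, tokens)
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        if next_tok == eos_id:
            break
        tokens.append(next_tok)
    return tokens


def toy_model(features: Sequence[float], task_id: int, prefix: List[int]) -> List[float]:
    """Hypothetical stand-in scorer: quantizes the feature at the current step (ignores task_id)."""
    logits = [0.0] * VOCAB_SIZE
    step = len(prefix)
    if step >= len(features):
        logits[EOS] = 1.0  # all input frames consumed: predict end-of-sequence
    else:
        code = min(int(features[step] * 4), 3)  # map a feature in [0, 1) to codes 0..3
        logits[code] = 1.0
    return logits


print(generate_tokens(toy_model, [0.1, 0.6, 0.9], "separation", max_len=10, eos_id=EOS))
# → [0, 2, 3]
```

In a real system, `next_token_logits` would be a trained decoder-only transformer. The emitted tokens would typically be passed to a neural codec decoder to reconstruct the waveform. Because only the conditioning changes between tasks, a single decoding loop can serve restoration, extraction and separation alike.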
Large language models are rapidly being extended to multimodal applications, which makes their use in complex audio-processing tasks such as speech enhancement a timely development.
This work suggests that diverse speech enhancement tasks could be unified and simplified under a single decoder-only LM architecture, which could improve both the efficiency and the quality of audio applications.
Existing specialized speech enhancement models may eventually be replaced by more general LM-based frameworks, with potential gains in areas ranging from voice assistants to audio forensics.
- AI researchers
- Speech technology developers
- Companies offering audio-based services
- Developers of highly specialized, non-LM speech enhancement models
Improved performance and broader accessibility of advanced speech enhancement across various applications.
Increased demand for computational resources capable of running complex LM-based audio processing models.
Enhanced human-computer interaction through more natural and robust voice interfaces, potentially accelerating the development of agentic AI systems.
This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.
Read at arXiv cs.AI