UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition

arXiv:2509.14653v2. Abstract: This paper proposes a unimodal aggregation (UMA) based nonautoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates acoustic frames (with unimodal weights that first monotonically increase and then decrease) of the same text token to learn better representations than regular connectionist temporal classification (CTC). However, it only works well in Mandarin. It struggles with other languages, such as English, for which a single syllable may be tokenized into multiple fine-grain
Non-autoregressive speech recognition models are under steady pressure to become more efficient and accurate, and this paper addresses one known limitation: the original UMA works well only for Mandarin.
The work targets speech recognition accuracy and efficiency in both English and Mandarin, which is critical for global AI applications and expanding AI accessibility.
The UMA-Split model demonstrates a method to improve non-autoregressive speech recognition for both English and Mandarin, overcoming previous linguistic limitations of the UMA approach.
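The aggregation step that the abstract attributes to the original UMA can be illustrated with a minimal sketch. It assumes that a segment boundary falls at each valley of the per-frame weight sequence and that the frames of a segment are combined by a weighted average; both details, along with the function names `unimodal_segments` and `aggregate`, are illustrative assumptions rather than the authors' implementation, and the sketch does not cover the split mechanism that UMA-Split adds.

```python
def unimodal_segments(weights):
    """Split frame indices into segments whose weights rise and then fall.

    Assumption: a new segment starts at each valley, i.e. a frame whose
    weight is lower than the previous frame's and no higher than the next.
    Returns half-open index ranges (start, end).
    """
    segments = []
    start = 0
    for t in range(1, len(weights) - 1):
        if weights[t] < weights[t - 1] and weights[t] <= weights[t + 1]:
            segments.append((start, t))
            start = t  # the valley frame opens the next segment
    if weights:
        segments.append((start, len(weights)))
    return segments


def aggregate(frames, weights):
    """Collapse each segment into one vector by weighted averaging.

    Assumption: weights are strictly positive (e.g. sigmoid outputs),
    so the per-segment weight sum is never zero.
    """
    aggregated = []
    for start, end in unimodal_segments(weights):
        total = sum(weights[start:end])
        dim = len(frames[start])
        aggregated.append([
            sum(weights[t] * frames[t][d] for t in range(start, end)) / total
            for d in range(dim)
        ])
    return aggregated
```

For example, the weights `[0.1, 0.5, 0.9, 0.4, 0.2, 0.6, 0.8, 0.3]` have a single valley at index 4, so eight frames collapse into two aggregated vectors, one per rise-and-fall run.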
- AI developers
- Speech recognition companies
- English and Mandarin speaking users
- Multilingual AI services
More accurate and faster voice interfaces become available in widespread languages.
Improved speech recognition reduces barriers for AI integration in diverse linguistic markets.
Enhanced multilingual AI capabilities could accelerate the development of global AI agents and services.
Read at arXiv cs.CL