Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

arXiv:2606.06740v1 Announce Type: cross Abstract: Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results s
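The unit-extraction step the abstract refers to (k-means clustering of frame-level embeddings) can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the function name `kmeans_units`, the random stand-in embeddings, and all sizes are assumptions made for the example. Here `k` plays the role of the "cluster size" the abstract mentions.

```python
import numpy as np


def kmeans_units(frames, k, iters=20, seed=0):
    """Cluster frame embeddings with plain k-means.

    Hypothetical helper for illustration only.
    frames: array of shape (num_frames, dim).
    Returns (centroids, units), where units[i] is the cluster id of frame i.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    centroids = frames[rng.choice(len(frames), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        units = dists.argmin(axis=1)
        for c in range(k):
            members = frames[units == c]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


# Random vectors stand in for self-supervised embeddings (assumption).
frames = np.random.default_rng(1).normal(size=(200, 16))
centroids, units = kmeans_units(frames, k=8)
```

Each frame is replaced by the index of its nearest centroid, so whatever phonetic, speaker, and language information the embeddings carry is folded into a single integer per frame. A unit vocoder then has to reconstruct audio from that integer sequence, typically with extra conditioning.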
The increasing complexity of AI models, particularly large language models and speech-to-speech systems, necessitates deeper understanding and optimization of underlying discrete speech representations to overcome current limitations.
Improving unit vocoders is crucial for advancing multilingual and multi-speaker AI systems, leading to more robust and less biased generative AI applications, particularly in diverse linguistic environments like India.
The systematic analysis of discrete speech representations provides actionable insights for developing more performant and less problematic speech generation AI, potentially enabling broader adoption and better user experiences for non-English speakers.
- AI developers
- Speech technology companies
- Multilingual AI users
- Indian language AI initiatives
- Monolingual AI solutions
- AI companies ignoring linguistic diversity
Improved multilingual speech generation capabilities for AI models.
Reduced speaker mixing and cross-lingual interference in advanced speech AI applications, enhancing realism and utility.
Accelerated development of localized and culturally relevant AI experiences, fostering greater global AI adoption beyond English-centric systems.