Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

arXiv:2605.23975v1

Abstract: Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission, translation-instead-of-transcription, and hallucination. We apply Direct Preference Optimization (DPO) to align models, constructing preference pairs in which chosen responses preserve mixed-language content while rejected responses mimic failure patterns. Training three Audio LLMs on 100K pairs (570 hours), we observe
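For readers unfamiliar with DPO, the preference-pair setup described in the abstract can be sketched with the standard DPO loss for a single pair. This is a minimal illustration in plain Python, not the authors' training code: the `PreferencePair` fields, the example transcripts, the log-probability values, and `beta=0.1` are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training example (field names are hypothetical)."""
    audio_id: str       # identifier of the code-switched utterance
    chosen: str         # transcript that preserves both languages
    rejected: str       # transcript that mimics a failure mode
    failure_mode: str   # e.g. "omission", "translation", "hallucination"


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    where pi_* come from the model being trained and ref_* from a
    frozen reference copy of it.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) is the same as softplus(-m).
    return softplus(-margin)


# Invented example: the rejected transcript translates the Mandarin
# part into English instead of transcribing it.
pair = PreferencePair(
    audio_id="utt-0001",
    chosen="我今天要去 meeting",
    rejected="I am going to a meeting today",
    failure_mode="translation",
)

# Invented log-probabilities, for illustration only.
print(pair.failure_mode, dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1))
```

The loss equals log 2 when the trained model and the reference model score both transcripts identically, and it falls as the trained model shifts probability toward the chosen (mixed-language) transcript relative to the reference.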
As Audio LLMs are deployed more widely, their limitations on complex linguistic tasks such as code-switching are becoming apparent and need to be addressed before the models can be relied on in real-world applications.
Improving code-switching ability in Audio LLMs is critical for expanding their utility and accuracy in multilingual societies, reducing friction in human-computer interaction across diverse linguistic contexts.
Audio LLMs will become more effective at understanding and transcribing mixed-language speech, moving beyond current failure modes like omission and hallucination, especially in common language pairs like English-Mandarin.
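As a rough illustration of what a failure mode such as language omission looks like in practice, the sketch below flags a hypothesis transcript that has lost one of the scripts present in the reference. It is a simple heuristic assumed for this example, not a method from the paper: the function names are hypothetical, it only checks for Han characters and ASCII letters, and it cannot tell omission apart from translation.

```python
def scripts_present(text: str) -> tuple[bool, bool]:
    """Return (has_han, has_latin) for a transcript.

    Han is approximated by the CJK Unified Ideographs block
    (U+4E00 to U+9FFF); Latin by ASCII letters.
    """
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    return has_han, has_latin


def loses_a_language(reference: str, hypothesis: str) -> bool:
    """True if the reference contains a script that the hypothesis lacks."""
    ref_han, ref_latin = scripts_present(reference)
    hyp_han, hyp_latin = scripts_present(hypothesis)
    return (ref_han and not hyp_han) or (ref_latin and not hyp_latin)


# Invented transcripts, for illustration only.
reference = "我今天要去 meeting"
print(loses_a_language(reference, "我今天要去 meeting"))
print(loses_a_language(reference, "I am going to a meeting today"))
```

A check like this could serve as a cheap first-pass filter when inspecting model outputs, with a closer comparison against the reference needed to decide which failure mode actually occurred.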
- AI developers
- Multilingual users
- Speech recognition companies
- Global enterprise
- Monolingual speech recognition solutions
Audio LLMs will transcribe code-switched conversations significantly more accurately, making their output more reliable for downstream understanding.
The enhanced multilingual capabilities of these models will accelerate their adoption in customer service, legal, medical, and educational settings in diverse linguistic markets.
This improvement could reduce language barriers in digital communication and services, potentially fostering greater cross-cultural collaboration and access to information.
Read at arXiv cs.CL