
arXiv:2606.17281v1. Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation st
As multimodal LLMs are deployed in more linguistically diverse settings and integrated into critical systems, they need robust ways to keep their output in the language actually spoken.
When an LLM-based ASR system transcribes speech in the wrong language, transcription fidelity suffers, which degrades downstream application quality and erodes user trust.
The proposed soft prompting approach hints at likely spoken languages without strictly constraining the output, and the accompanying adherence metric quantifies violations; together they give developers a concrete way to control and evaluate multilingual robustness.
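The idea of hinting at languages and then counting adherence violations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract as quoted here does not give the paper's prompt wording or the definition of its metric, so the function names `build_language_hint` and `violation_rate`, the prompt text, and the "fraction of outputs outside the hinted set" formula are all hypothetical stand-ins, not the authors' method.

```python
from typing import Iterable, Sequence, Set


def build_language_hint(candidate_languages: Sequence[str]) -> str:
    """Build a hypothetical soft hint that lists likely spoken languages.

    The wording is an assumption; it suggests languages without
    forbidding others, so code-switching remains possible.
    """
    if not candidate_languages:
        return "Transcribe the audio."
    joined = ", ".join(candidate_languages)
    return f"Transcribe the audio. The speaker may be using: {joined}."


def violation_rate(detected: Iterable[str], allowed: Set[str]) -> float:
    """Fraction of transcripts whose detected language is not in `allowed`.

    An assumed, simplified stand-in for an adherence metric: one
    language label per transcript, no handling of code-switching.
    """
    labels = list(detected)
    if not labels:
        return 0.0
    violations = sum(1 for lang in labels if lang not in allowed)
    return violations / len(labels)
```

For example, if four transcripts are detected as `en`, `de`, `fr`, `en` and the hint listed only `en` and `de`, `violation_rate` counts one violation out of four. A real evaluation would also need a language identifier for the transcripts and a policy for mixed-language utterances, neither of which is covered here.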
- Multilingual LLM developers
- Users of voice AI interfaces
- Global technology companies
- AI researchers focused on robustness
- Companies relying on subpar multilingual ASR
- Applications with high-stakes language processing
- Monolingual AI solutions
Improved accuracy in multilingual AI applications, particularly those involving speech-to-text processing.
Increased user trust and adoption of voice-enabled AI technologies in diverse linguistic environments, potentially expanding market reach.
Enhanced global communication and collaboration facilitated by more reliable AI translation and transcription services, reducing language barriers in business and research.
Read at arXiv cs.CL