Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

arXiv:2605.27772v1. Abstract: Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs rev
The rapid advancement and deployment of Audio LLMs necessitate robust methods to evaluate and improve their understanding of nuanced human communication beyond just transcribed text.
Improving Audio LLMs' ability to understand paralinguistic cues is crucial for developing more natural, empathetic, and reliable AI systems, especially in areas requiring nuanced interpretation like customer service or mental health support.
This research introduces a standardized benchmark, VoxParadox, to systematically identify and mitigate deficiencies in Audio LLMs' paralinguistic understanding, guiding future model development.
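To make the "listen or read" question concrete, here is a minimal sketch of how answers on a mismatched benchmark of this kind could be tallied. It is an illustration under stated assumptions, not the paper's evaluation code: the `Example` fields, the `listen_read_rates` function, the three outcome names, and the sample tasks and labels are all hypothetical. The idea is that each example carries two candidate answers, one supported by the voice and one claimed by the transcript, so a model's prediction can be sorted by which of the two it followed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One mismatched item (hypothetical schema, not the benchmark's format)."""
    task: str         # paralinguistic task name, e.g. "emotion"
    audio_label: str  # what the speaking style actually conveys (ground truth)
    text_label: str   # what the transcript claims (the distractor)
    prediction: str   # the model's answer


def listen_read_rates(examples):
    """Return the fraction of answers that follow the audio, the text, or neither."""
    if not examples:
        raise ValueError("need at least one example")
    counts = Counter()
    for ex in examples:
        if ex.prediction == ex.audio_label:
            counts["listened"] += 1   # answer matches the acoustic evidence
        elif ex.prediction == ex.text_label:
            counts["read"] += 1       # answer matches the transcript claim
        else:
            counts["other"] += 1      # answer matches neither
    total = len(examples)
    return {key: counts[key] / total for key in ("listened", "read", "other")}


# Made-up predictions for illustration only; these are not results from the paper.
EXAMPLES = [
    Example("emotion", "sad", "happy", "happy"),
    Example("emotion", "angry", "calm", "angry"),
    Example("gender", "female", "male", "male"),
    Example("age", "elderly", "child", "adult"),
]

if __name__ == "__main__":
    print(listen_read_rates(EXAMPLES))
    # {'listened': 0.25, 'read': 0.5, 'other': 0.25}
```

Under this scheme, a high "read" rate would suggest that a model is leaning on the transcript rather than the voice, which is the failure mode the benchmark is designed to expose.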
- AI developers
- Speech technology companies
- Users of AI assistants
- Academic researchers
- AI models with poor paralinguistic understanding
Audio LLMs will become more sophisticated in interpreting human emotion and intent from speech alone.
This improved understanding will lead to better human-AI interaction across various applications, making AI systems feel more 'human'.
The development of truly empathic AI could revolutionize fields like healthcare, education, and entertainment, blurring the lines between human and artificial communication.
Read at arXiv cs.LG