Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

arXiv:2606.19951v1 (cross-listed)

Abstract: Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Resu
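As a rough illustration of what "controlled perturbations" can look like in practice, the sketch below applies two simple manipulations to a waveform and computes a human-versus-model score gap. It is not the paper's method: the abstract does not say which tools or settings the authors used, and the function names `add_noise_at_snr`, `change_rate`, and `mean_gap` are hypothetical. It assumes audio is a 1-D NumPy float array.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng):
    """Acoustic degradation: add white Gaussian noise at a target SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def change_rate(signal, rate):
    """Naive speaking-rate change by linear-interpolation resampling.

    rate > 1 shortens the signal (faster speech). This simple approach
    also shifts pitch; a pitch-preserving time stretch would need a
    dedicated algorithm, which is outside this sketch.
    """
    n_out = int(round(len(signal) / rate))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)


def mean_gap(human_mos, model_mos):
    """Average of (model score - human score) over paired samples."""
    return float(np.mean(np.asarray(model_mos) - np.asarray(human_mos)))


# Toy input: a 1-second 220 Hz tone at 16 kHz standing in for speech.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)

noisy = add_noise_at_snr(tone, snr_db=10.0, rng=rng)
faster = change_rate(tone, rate=2.0)
```

A positive `mean_gap` for a given perturbation would mean the model rates those samples higher than listeners do, which is the kind of discrepancy the study sets out to examine.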
The proliferation of advanced AI models generating speech necessitates more robust and human-aligned evaluation metrics, making this research timely.
Improving the accuracy of AI speech quality assessment directly impacts the development and user acceptance of text-to-speech technologies and conversational AI.
The work refines our understanding of where human and model perception of speech quality diverge, which could lead to more sophisticated, human-centric AI evaluation metrics.
- Text-to-speech developers
- Conversational AI companies
- Speech technology researchers
- Developers of AI evaluation metrics
- AI models with poor human-in-the-loop evaluation
- Inaccurate speech quality assessment tools
More accurate and human-aligned metrics for evaluating AI-generated speech will be developed.
Improved text-to-speech and conversational AI systems that better meet human perceptual expectations will emerge.
Adoption of and trust in AI systems that communicate through speech will increase, leading to new applications in various sectors.
Read at arXiv cs.CL