
arXiv:2606.15888v1 (announce type: cross)
Abstract: Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collec
As AI-driven speech synthesis grows more sophisticated and widespread, quality assessment needs to cover non-verbal cues as well as spoken language.
Better quality assessment of non-verbal vocalizations could make AI systems sound more emotionally expressive and human-like, which would broaden their usefulness across applications.
The focus shifts from overall naturalness and correct NV type and position to the perceptual quality of the NV events themselves, giving developers a finer-grained target to improve against.
- AI-driven voice synthesis companies
- Customer service automation
- Virtual assistants
- AI systems with poor emotional fidelity
- Low-quality non-verbal TTS systems
Refined and more believable interactive AI experiences.
Increased consumer adoption and trust in AI systems that can convey nuanced emotion.
The blurring of lines between human and AI emotional expression in digital interactions.
Read at arXiv cs.AI