
arXiv:2602.07106v2. Abstract: Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decoup
Rapid progress in large language models is pushing multimodal integration further, and unified human-computer interaction is becoming a more pressing goal.
This work matters for natural, intuitive human-computer interaction because it targets the gap between the discrete semantic reasoning of LLMs and the dense temporal dynamics that 3D facial motion requires.
Ex-Omni augments an OLLM so that it generates 3D facial animation alongside speech, a step toward more expressive AI-driven communication.
- AI-driven customer service platforms
- Metaverse and virtual reality developers
- Entertainment industries
- Open-source AI communities
- Companies reliant on static, text-only AI interactions
- Proprietary animation software companies
More realistic and engaging virtual avatars and AI assistants could become widely accessible.
Demand for compute capable of real-time 3D rendering and multimodal AI processing is likely to increase.
The definition of 'human-like' AI interaction may expand, blurring the line between digital and physical presence in communication.
Read at arXiv cs.CL