Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music

arXiv:2412.11449v2. Abstract: We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms, e.g., ENCODEC. However, one of the major drawbacks of this approach is handling the context length. It blows up for high-fidelity generative architecture if one has to account
The rapid advancement in transformer architectures and generative AI is pushing the boundaries of multimodal models, making hybrid approaches for audio increasingly viable.
This development could significantly advance the capabilities of generative audio models by addressing current limitations in context length and fidelity, opening new applications in speech and music synthesis.
Working with continuous audio representations and discrete tokens together in a single architecture could improve both the efficiency and the quality of generative audio, a key step toward real-world adoption.
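To make the context-length concern concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from the paper. It assumes a codec that emits a fixed number of frames per second, with several codebooks per frame, and a language model that flattens every codebook entry into its own token. The function name and the specific numbers (75 frames per second; 1, 8, or 32 codebooks) are illustrative assumptions.

```python
def flattened_token_count(seconds: float, frames_per_second: float, codebooks: int) -> int:
    """Sequence length if each codebook entry of each frame becomes one token."""
    return round(seconds * frames_per_second * codebooks)


# Assumed, illustrative settings; not figures reported in the abstract.
FRAMES_PER_SECOND = 75
CLIP_SECONDS = 10.0

for codebooks in (1, 8, 32):
    tokens = flattened_token_count(CLIP_SECONDS, FRAMES_PER_SECOND, codebooks)
    print(f"{codebooks:>2} codebooks -> {tokens} tokens for {CLIP_SECONDS:.0f} s of audio")
```

Under these assumptions, the sequence length grows linearly with the number of codebooks, which is why higher-fidelity settings put pressure on the context window.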
- AI development platforms
- Music industry (composers, producers)
- Speech technology companies
- Entertainment sector
- Companies reliant on older generative audio models
- Traditional audio production workflows (some aspects)
Improved high-fidelity generative audio leads to more realistic virtual assistants and synthetic media.
The reduced computational overhead could democratize advanced audio generation, fostering widespread creative applications.
This could lead to entirely new forms of interactive and customizable audio content, blurring lines between human and AI creation.
Read at arXiv cs.CL