DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

arXiv:2605.31432v1. Abstract: Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to gui
The proliferation of Large Language Models and the increasing demand for real-time, multilingual communication push the boundaries of simultaneous translation research.
This research explores simultaneous speech translation with decoder-only SpeechLLMs in place of encoder-decoder models, potentially enabling more efficient and versatile real-time communication across language barriers.
A new training-free attention policy for long-form simultaneous translation with decoder-only SpeechLLMs is introduced, potentially simplifying the development and deployment of such systems.
- AI researchers in NLP and speech
- Speech-to-text translation service providers
- Global businesses requiring real-time communication
- Traditional encoder-decoder architectures for SimulST
Improved performance and reduced training complexity for simultaneous speech translation systems.
Accelerated adoption of real-time translation in various applications, from conferences to personal devices.
Enhanced global communication and collaboration by lowering language barriers more effectively and economically.