
arXiv:2306.11252v2. Abstract: We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research when there are significant differences between the spoke
Advances in natural language processing and speech recognition are driving the development of specialized datasets and translation models.
Such datasets are important for improving speech translation accuracy, particularly for languages with complex linguistic features and for settings where verbatim transcripts are not always available, and they help bridge communication gaps.
HK-LegiCoST should enable researchers to train more robust speech translation models for non-verbatim input, which is common in real-world scenarios.
- AI researchers
- Speech translation developers
- Multilingual communication platforms
- Hong Kong
- Traditional verbatim-dependent speech translation models
Improved speech translation models for Cantonese, and potentially for other languages with similar non-verbatim transcript challenges, are likely to emerge.
Cross-linguistic communication in business, legal, and governmental contexts could become easier, especially in regions with high language diversity.
Reduced friction in understanding diverse spoken languages might contribute to a more interconnected global information ecosystem, affecting cultural exchange and content consumption.
Read at arXiv cs.CL