FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

arXiv:2605.27062v1. Abstract: State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP user
The increasing focus on AI model performance and the recognition of data scarcity for under-represented languages are driving current efforts in corpus development.
The FalAR corpus addresses a critical data gap for European Portuguese, potentially enabling improved AI systems and fostering greater digital equity for smaller language groups.
The availability of a large-scale, speaker-annotated corpus for European Portuguese significantly improves the prospects for developing high-performing ASR and other speech-based AI applications for this language.
- European Portuguese speakers
- AI developers in Portugal
- Language preservation efforts
- Regional tech autonomy
- Generic multilingual ASR models
Improved speech recognition accuracy for European Portuguese in various applications.
Increased adoption of voice-controlled interfaces and AI services by European Portuguese speakers due to better accessibility.
Potential for Portugal to develop a specialized AI industry focused on language technologies, reducing reliance on foreign models and strengthening national digital infrastructure.
Read at arXiv cs.LG