ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

arXiv:2606.15984v1. Abstract: Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the ROManian PARliamentary Speech Corpus (ROMPAR) dataset, a 17.80-hour corpus of Romanian and Moldavian parliamentary speech, featuring double-annotated ground truth and explicit labels for reconstructed word fragments. To build a robust ASR system, we propose a multi-task adversarial training framework that enforces demographic
Advances in AI and NLP call for specialized datasets that address the linguistic and demographic challenges of speech recognition, particularly for lesser-resourced languages and governmental proceedings.
This work contributes to sovereign AI capabilities by providing foundational data and methods for robust, demographically unbiased speech recognition of Romanian parliamentary proceedings. It reduces reliance on external models and could improve governmental transparency and efficiency.
The creation of a specialized, double-annotated corpus and an adversarial training framework offers a more accurate and less biased ASR solution for Romanian, which could be replicated for other languages facing similar challenges.
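One way to make the "more accurate and less biased" claim concrete is to compute word error rate (WER) separately for each demographic group and look at the spread between groups. The following is a minimal sketch under stated assumptions, not the paper's evaluation protocol: the function names (`word_edit_distance`, `wer_by_group`, `wer_gap`), the group labels, and the sample sentences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Pool errors and reference words per group, then return WER per group.

    `samples` is an iterable of (group, reference, hypothesis) tuples.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens = ref.split()
        errors[group] += word_edit_distance(ref_tokens, hyp.split())
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


def wer_gap(per_group):
    """Spread between the worst and best group WER; 0.0 means equal error rates."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical data: group labels and transcripts are invented for this sketch.
samples = [
    ("group_a", "doamnelor și domnilor deputați", "doamnelor și domnilor deputați"),
    ("group_b", "vă mulțumesc domnule președinte", "vă mulțumesc domnul președinte"),
]

per_group = wer_by_group(samples)
print(per_group)           # per-group WER
print(wer_gap(per_group))  # gap between worst and best group
```

A smaller gap at a similar overall WER would suggest that errors are spread more evenly across groups. Pooling errors before dividing, rather than averaging per-utterance WER, keeps short utterances from dominating the group score.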
- Romanian government
- NLP researchers
- AI developers in Eastern Europe
- Language technology companies
- Generic ASR models
- Companies without specialized linguistic expertise
Improved transcription accuracy and reduced demographic bias in Romanian parliamentary proceedings.
Enhanced public access to parliamentary data and potentially greater transparency in governance due to reliable automated transcription.
The methodology could serve as a blueprint for other nations to develop sovereign, demographically-aware AI language technologies for their specific governmental or institutional contexts.
Source: arXiv cs.CL