A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

arXiv:2605.22978v1. Abstract: Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen a
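The validation, snapshotting, and fixed-split steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (validate_sentence, snapshot, assign_split), the hash-based split rule, the 10% dev and test proportions, and the toy sentence are all assumptions made for illustration. The validator is also deliberately simplified and ignores multiword tokens and empty nodes.

```python
import hashlib

# The 17 universal POS tags defined by Universal Dependencies.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def validate_sentence(rows):
    """Return a list of problems for one sentence given as 10-column rows.

    Simplified check (assumption): plain integer token IDs only.
    """
    problems = []
    n = len(rows)
    roots = 0
    for i, cols in enumerate(rows, start=1):
        if len(cols) != 10:
            problems.append(f"token {i}: expected 10 columns, got {len(cols)}")
            continue
        if cols[0] != str(i):
            problems.append(f"token {i}: ID is {cols[0]!r}")
        if cols[3] not in UPOS_TAGS:
            problems.append(f"token {i}: unknown UPOS {cols[3]!r}")
        head = cols[6]
        if not head.isdecimal() or int(head) > n or int(head) == i:
            problems.append(f"token {i}: invalid HEAD {head!r}")
        elif int(head) == 0:
            roots += 1
    if roots != 1:
        problems.append(f"expected exactly one root, found {roots}")
    return problems


def snapshot(sentences):
    """Serialize {sent_id: rows} to CoNLL-U text in sorted order and hash it.

    Sorting by sent_id makes the output independent of insertion order.
    """
    blocks = []
    for sent_id in sorted(sentences):
        lines = [f"# sent_id = {sent_id}"]
        lines += ["\t".join(cols) for cols in sentences[sent_id]]
        blocks.append("\n".join(lines) + "\n\n")  # blank line ends a sentence
    text = "".join(blocks)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest


def assign_split(sent_id, dev_pct=10, test_pct=10):
    """Assign a sentence to train/dev/test from a hash of its ID (assumed rule)."""
    bucket = int(hashlib.sha256(sent_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


# Toy sentence invented for illustration; it is not taken from the paper's data.
TOY = [
    ["1", "Ερωτάται", "ερωτώ", "VERB", "_", "_", "0", "root", "_", "_"],
    ["2", "ο", "ο", "DET", "_", "_", "3", "det", "_", "_"],
    ["3", "Υπουργός", "υπουργός", "NOUN", "_", "_", "1", "nsubj:pass", "_", "_"],
    [".", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"],
]
TOY[3][0] = "4"  # give the final punctuation token its ID

if __name__ == "__main__":
    print(validate_sentence(TOY))
    text, digest = snapshot({"q-0001": TOY})
    print(digest, assign_split("q-0001"))
```

Hashing the sentence ID rather than shuffling with a random seed keeps each sentence in the same split even when new sentences are added later, which is one simple way to obtain a fixed split; the paper may use a different mechanism.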
Increasingly capable LLMs and NLP techniques make it feasible to build resources for historical and less-resourced language varieties that were previously considered impractical to annotate, in line with broader efforts to extend NLP beyond well-resourced languages.
This development supports the digital preservation and analysis of historical archives in varieties such as Katharevousa Greek, and may open new research avenues and inform language model development for other linguistic contexts.
Katharevousa Greek, previously underserved, now has a reproducible Universal Dependencies-style parsing pipeline for parliamentary questions. This enables automated syntactic analysis of such documents and reduces reliance on manual annotation by specialist linguists.
- Historians and linguistic researchers
- NLP developers in less-resourced languages
- Cultural preservation initiatives
- Greece's digital humanities sector
- Manual historical text annotators
The availability of this pipeline could accelerate research into parliamentary records from Greece's early post-junta period.
This successful methodology could be replicated for other historical or less-resourced languages, leading to a broader expansion of linguistically diverse AI tools.
Improved access to historical parliamentary data could inform comparative political science studies across different eras and language contexts, revealing new patterns in governance.
Read at arXiv cs.CL