
arXiv:2606.15510v1. Abstract: AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and
Advances in AI and NLP increase the demand for richer, more diverse linguistic datasets, both for model training and for historical linguistic analysis.
AthDGC provides a foundational resource for training AI models on diachronic Greek, and could open new research avenues in historical linguistics and cross-lingual understanding.
An open and comprehensive diachronic Greek treebank now exists, allowing for detailed computational analysis of language evolution and improved multilingual NLP capabilities.
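To give a sense of what computational analysis over a PROIEL-style treebank can look like, here is a minimal Python sketch that reads tokens from a small XML fragment and counts dependency relations. It is an illustration under stated assumptions, not AthDGC's own tooling: the element and attribute names (`sentence`, `token`, `form`, `lemma`, `part-of-speech`, `head-id`, `relation`) follow the PROIEL XML format as we understand it and should be checked against the files AthDGC actually ships, and the sample sentence and its annotations are hand-written for this example, not taken from the dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative fragment only: hand-written for this sketch, NOT taken from AthDGC.
# Element and attribute names are assumed to follow the PROIEL XML format.
SAMPLE = """\
<proiel schema-version="2.0">
  <source id="sample" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ἐν" lemma="ἐν" part-of-speech="R-" head-id="3" relation="adv"/>
        <token id="2" form="ἀρχῇ" lemma="ἀρχή" part-of-speech="Nb" head-id="1" relation="obl"/>
        <token id="3" form="ἦν" lemma="εἰμί" part-of-speech="V-" relation="pred"/>
        <token id="4" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="5" relation="aux"/>
        <token id="5" form="λόγος" lemma="λόγος" part-of-speech="Nb" head-id="3" relation="sub"/>
      </sentence>
    </div>
  </source>
</proiel>
"""


def read_tokens(xml_text):
    """Return one dict per <token>, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for sentence in root.iter("sentence"):
        for tok in sentence.iter("token"):
            tokens.append({
                "sentence": sentence.get("id"),
                "id": tok.get("id"),
                "form": tok.get("form"),
                "lemma": tok.get("lemma"),
                "pos": tok.get("part-of-speech"),
                # A token without head-id is treated as the sentence root.
                "head": tok.get("head-id"),
                "relation": tok.get("relation"),
            })
    return tokens


def relation_counts(tokens):
    """Count how often each dependency relation label occurs."""
    return Counter(t["relation"] for t in tokens)


tokens = read_tokens(SAMPLE)
for t in tokens:
    print(t["id"], t["form"], t["pos"], t["head"], t["relation"])
print(relation_counts(tokens))
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring`, and the same counts could be computed per period to compare relation frequencies across stages of Greek.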
- Linguistics researchers
- NLP developers
- Historians
- Cultural institutions
- Proprietary linguistic data providers
Researchers gain a new, openly licensed dataset for diachronic Greek language studies and NLP.
Improved AI models for ancient languages could emerge, potentially benefiting translation, archaeology, and digital humanities.
The methodology could inspire similar open-source, diachronic treebank projects for other under-resourced or historically significant languages.
Read at arXiv cs.CL