Towards Structuring an Arabic-English Machine-Readable Dictionary Using Parsing Expression Grammars

arXiv:2606.25231v1. Abstract: Dictionaries are rich sources of lexical information about words, which many applications of natural language processing and human language technology require. However, publishers prepare printed dictionaries for human use, not for machine processing. This paper presented a method to partly structure a machine-readable version of the Arabic-English Al-Mawrid dictionary. The method converted the entries of Al-Mawrid from a stream of words and punctuation marks into hierarchical structures. The hierarchical structure expresses the components
The increasing sophistication of NLP and AI models necessitates highly structured and machine-readable linguistic data, driving research into automated dictionary parsing.
Improved machine-readable dictionaries for languages like Arabic are crucial for advancing AI localization, enhancing cross-cultural communication, and developing more inclusive global AI applications.
This method provides a more efficient way to convert extensive human-oriented lexical resources into formats usable by AI, reducing manual effort and accelerating multilingual AI development.
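As a rough illustration of what turning an entry "from a stream of words and punctuation marks into hierarchical structures" can look like, the sketch below hand-writes a tiny parsing expression grammar (PEG) in Python. The entry layout it assumes (headword, colon, senses separated by semicolons, glosses separated by commas) is a hypothetical toy format, not Al-Mawrid's actual layout, and the grammar and function names are not taken from the paper.

```python
import re

# Toy PEG (hypothetical; NOT the grammar used in the paper):
#   entry <- text ':' sense (';' sense)* '.'? EOF
#   sense <- text (',' text)*
#   text  <- [^:;,.]+

_TEXT = re.compile(r"\s*([^:;,.]+)")


def lit(ch):
    """Build a parser for one literal character, skipping leading spaces."""
    pattern = re.compile(r"\s*" + re.escape(ch))

    def parse(s, i):
        m = pattern.match(s, i)
        return (ch, m.end()) if m else None

    return parse


def text(s, i):
    """Match a run of characters that contains no punctuation delimiter."""
    m = _TEXT.match(s, i)
    if not m or not m.group(1).strip():
        return None
    return m.group(1).strip(), m.end()


def sep_list(item, sep):
    """PEG pattern `item (sep item)*`, returning a list of item values."""

    def parse(s, i):
        first = item(s, i)
        if first is None:
            return None
        values, i = [first[0]], first[1]
        while True:
            s_res = sep(s, i)
            nxt = item(s, s_res[1]) if s_res else None
            if nxt is None:
                break  # backtrack: leave an unmatched separator unconsumed
            values.append(nxt[0])
            i = nxt[1]
        return values, i

    return parse


sense = sep_list(text, lit(","))
senses = sep_list(sense, lit(";"))


def parse_entry(s):
    """Parse one toy entry into a nested dict, or return None on failure."""
    head = text(s, 0)
    colon = lit(":")(s, head[1]) if head else None
    body = senses(s, colon[1]) if colon else None
    if body is None:
        return None
    i = body[1]
    dot = lit(".")(s, i)  # optional final full stop
    if dot:
        i = dot[1]
    if s[i:].strip():  # EOF check: no unparsed input may remain
        return None
    return {"headword": head[0], "senses": body[0]}


print(parse_entry("kitab: book; writing, letter."))
# {'headword': 'kitab', 'senses': [['book'], ['writing', 'letter']]}
```

The transliterated headword is used only to keep the example readable; the same functions accept Arabic script, since the terminals are defined by the delimiters they exclude rather than by an alphabet.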
- NLP researchers
- Multilingual AI developers
- Companies operating in Arabic-speaking markets
- Digital lexicography
- Manual data entry linguistic services
More accurate and nuanced AI applications become possible for Arabic and other under-resourced languages through better lexical data.
The automation of dictionary structuring could lead to a proliferation of specialized AI-ready dictionaries, accelerating language model development for diverse domains.
Enhanced linguistic AI for specific languages may reduce the 'language barrier' in global knowledge sharing and economic interactions, potentially leading to increased data availability in non-English contexts.
Read at arXiv cs.CL