How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation

arXiv:2606.27275v1. Abstract: Large language models (LLMs) are increasingly critical to digital library workflows, yet their ability to process historical language remains poorly understood. Historical difficulty is typically treated as a monolithic barrier, conflating orthographic variation, linguistic distance, and pretraining exposure. In this paper, we propose a diagnostic framework that decomposes this difficulty into four distinct dimensions: tokenization cost, predictive uncertainty (surprisal), semantic robustness, and context sensitivity. We evaluate this framework o
The increasing reliance on LLMs for digital library workflows necessitates a better understanding of their limitations when processing historical linguistic data.
This research provides a diagnostic framework for understanding and mitigating the difficulties LLMs have with historical language, which matters both for preserving cultural heritage and for extending AI's applicability beyond modern texts.
The ability to systematically decompose LLM difficulties with historical language data allows for targeted improvements in model training and application development.
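Two of the dimensions named in the abstract, tokenization cost and predictive uncertainty (surprisal), have common textbook definitions that can be sketched in a few lines. The following is a minimal illustration only, not the paper's method: it assumes tokenization cost is measured as tokens per whitespace-delimited word and surprisal as the mean of -log2 p over per-token probabilities. The function names and the toy tokenizer are hypothetical, and a real analysis would substitute an actual model tokenizer and model probabilities.

```python
import math


def toy_tokenize(text, max_len=4):
    """Hypothetical stand-in tokenizer: split each word into chunks of at most max_len characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return tokens


def tokens_per_word(text, tokenize):
    """Assumed proxy for tokenization cost: tokens emitted per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def mean_surprisal_bits(token_probs):
    """Mean surprisal in bits: the average of -log2(p) over the given token probabilities."""
    if not token_probs:
        return 0.0
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)
```

Under these assumptions, a text whose spelling fragments into more pieces yields a higher tokens-per-word ratio, and a text whose tokens the model assigns lower probability yields a higher mean surprisal. Comparing the two numbers between a modern and a historical sample keeps the two sources of difficulty apart.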
- Digital archivists
- Historians
- Linguists
- Organizations relying on historical document analysis
- LLM developers ignoring historical language nuances
Improved accuracy and utility of LLMs for processing and understanding historical documents.
Enhanced accessibility and discoverability of historical archives and cultural heritage through advanced AI tools.
The development of specialized LLMs and fine-tuning techniques specifically optimized for diverse historical languages, potentially creating new sub-sectors in AI.
Read at arXiv cs.CL