Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

arXiv:2605.27750v1. Abstract: Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations
As Vision-Language Models (VLMs) are increasingly applied to Optical Character Recognition (OCR), especially in challenging domains such as historical texts, their underlying mechanisms and failure modes are drawing closer scrutiny.
The research points to a basic limitation: VLMs may lean on linguistic priors instead of the visual evidence on the page, which undermines the reliability of AI systems in critical data-extraction tasks.
Recognizing that VLMs can 'guess' rather than 'read' is likely to focus development on stronger visual grounding and on more robust evaluation methods for OCR applications.
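The contrast the abstract draws between fluent substitutions and local recognition noise can be made concrete with a small scoring sketch. This is an illustration only, not the paper's evaluation method: the function names, the tiny lexicon, and the sample strings are all hypothetical, and the word comparison assumes the reference and the output have the same number of tokens. A wrong word that is still a valid lexicon entry is counted as a fluent error; a wrong word outside the lexicon is counted as noise.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def classify_word_errors(reference: str, hypothesis: str, lexicon: set) -> dict:
    """Count correct words, fluent errors (wrong but in the lexicon),
    and noise errors (wrong and not in the lexicon).

    Assumption: both strings have the same number of whitespace tokens,
    so words can be compared position by position.
    """
    ref_words = unicodedata.normalize("NFC", reference).split()
    hyp_words = unicodedata.normalize("NFC", hypothesis).split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("this sketch needs equal token counts")
    known = {unicodedata.normalize("NFC", w) for w in lexicon}
    counts = {"correct": 0, "fluent_error": 0, "noise_error": 0}
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        if ref_word == hyp_word:
            counts["correct"] += 1
        elif hyp_word in known:
            counts["fluent_error"] += 1
        else:
            counts["noise_error"] += 1
    return counts


# Hypothetical data: a toy lexicon and two made-up system outputs.
LEXICON = {"μῆνιν", "ἄειδε", "θεά", "θεός"}
REFERENCE = "μῆνιν ἄειδε θεά"
VLM_OUTPUT = "μῆνιν ἄειδε θεός"  # wrong, but a real Greek word
OCR_OUTPUT = "μῆνιν ἄειδε θεἀ"   # wrong, and not a word in the lexicon

if __name__ == "__main__":
    for name, output in [("vlm", VLM_OUTPUT), ("ocr", OCR_OUTPUT)]:
        print(name, cer(REFERENCE, output),
              classify_word_errors(REFERENCE, output, LEXICON))
```

A character error rate alone would not separate the two outputs well, since both differ from the reference by only a few characters; the lexicon check is what distinguishes a plausible substitution from recognition noise. A real evaluation would need token alignment and a full Ancient Greek lexicon, both of which are outside this sketch.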
- Researchers focused on visual grounding in AI
- Developers of traditional, robust OCR engines
- Sectors requiring high-fidelity data extraction from image-based sources
- Companies over-relying on un-audited VLM OCR for critical tasks
- Models uncritically emphasizing fluency over factual visual accuracy
Increased emphasis within AI research on disentangling linguistic priors from true visual understanding in multimodal models.
Development of new VLM architectures and training regimes specifically designed to enhance visual grounding and reduce 'plausible hallucination'.
Effects on trust in, and adoption of, AI in fields such as digital humanities, legal tech, and archival research, where accurate text recognition from images is paramount.
Read at arXiv cs.AI