arXiv:2604.00003v2 Announce Type: replace-cross Abstract: Extracting structured information from academic PDF documents is non trivial: a single page typically combines free text metadata with tabular regions, exhibits cross program variation, and is susceptible to Unicode encoding artifacts that interfere with downstream parsing. This study evaluates the reliability of information extraction approaches for tabular PDF documents, using academic course registration documents (Kartu Rencana Studi or KRS) from Indonesian higher education as a case study. Three strategies are compared: LLM only, H

Source: arXiv cs.AI — read the full report at the original publisher.

This is a curated wire item. The Continuum Brief does not republish full third-party articles; this entry links to the original source.