
arXiv:2603.25640v2 Abstract: Accurate parsing of citations is necessary for machine-readable scholarly infrastructure. But, despite sustained interest in this problem, existing evaluation techniques are often not generalizable, based on synthetic data, or not publicly available. We introduce RenoBench, a public domain benchmark for citation parsing, sourced from PDFs released on four publishing ecosystems: SciELO, Redalyc, the Public Knowledge Project, and Open Research Europe. Starting from 161,000 annotated citations, we apply automated validation and feature-bas
The proliferation of AI models interacting with scholarly literature necessitates more robust and standardized methods for parsing complex citation data, especially as AI agents become more sophisticated.
Accurate, machine-readable citation parsing is fundamental for building reliable scholarly infrastructure and enhancing the capabilities of AI in research, impacting discoverability, attribution, and knowledge synthesis.
The introduction of a public, large-scale benchmark for citation parsing will standardize evaluation and foster more effective development of parsing technologies, potentially improving data quality across academic systems.
- AI developers
- Scholarly publishers
- Academic researchers
- Digital libraries
- Systems relying on proprietary or low-quality citation parsing
- Manual data entry operators
Improved accuracy in citation extraction leads to more reliable bibliographic data in research databases.
Enhanced data quality enables advanced AI tools for literature review, knowledge graph construction, and scientific discovery.
More efficient and accurate parsing could accelerate the pace of scientific breakthroughs by making research more interconnected and discoverable for AI systems.
Read at arXiv cs.CL