Invisible to humans, visible to machines: a preregistered audit of Unicode fidelity across four biomedical bibliographic APIs

arXiv:2606.24897v1. Abstract: Biomedical text mining, scientometrics, and the construction of training corpora for biomedical large language models (LLMs) all assume that the abstract text returned by a bibliographic API faithfully reproduces the published abstract. This pre-registered audit (OSF osf.io/269b5) tests that assumption for four widely used public APIs (PubMed E-utilities, Crossref, OpenAlex, Semantic Scholar) against PubMed Central (PMC) JATS XML as a common ground truth. From a complete enumeration of the PMC Open Access subset for 2024 (about 700,000 records)
The proliferation of LLMs and their reliance on vast datasets make training-data fidelity a critical concern.
The accuracy of biomedical text mining and of LLMs depends heavily on the underlying data, and this audit points to potential systemic problems in how major bibliographic APIs reproduce abstract text.
The audit puts data provenance under scrutiny: it suggests that current assumptions about data fidelity in AI training and scientific research may be flawed, and that data sourcing and cleaning processes need re-evaluation.
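To make the idea of Unicode fidelity concrete, here is a minimal Python sketch of the kind of check a pipeline could run on an API-returned abstract against a reference copy. It is not the audit's actual method, which this summary does not describe. The function names `invisible_chars` and `fidelity_report` and the sample strings are hypothetical, chosen only for illustration; the sketch uses nothing beyond Python's standard `unicodedata` module.

```python
import unicodedata


def invisible_chars(text):
    """List (index, code point, name) for each format-category (Cf) character.

    Cf covers characters such as ZERO WIDTH SPACE and SOFT HYPHEN, which
    usually render as nothing but still change the underlying string.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def fidelity_report(reference, candidate):
    """Compare a candidate string with a reference string (hypothetical helper).

    exact_match: the code point sequences are identical.
    nfc_match:   the strings are identical after NFC normalization, so any
                 difference is only composed vs. decomposed forms.
    """
    nfc_reference = unicodedata.normalize("NFC", reference)
    nfc_candidate = unicodedata.normalize("NFC", candidate)
    return {
        "exact_match": reference == candidate,
        "nfc_match": nfc_reference == nfc_candidate,
        "invisible_in_reference": invisible_chars(reference),
        "invisible_in_candidate": invisible_chars(candidate),
    }


# Invented sample strings (assumptions for illustration, not audit data).
reference = "caf\u00e9 \u03b2-blocker"          # precomposed e-acute
decomposed = "cafe\u0301 \u03b2-blocker"        # e + combining acute accent
with_zwsp = "caf\u00e9 \u03b2-\u200bblocker"    # zero width space inserted

report_decomposed = fidelity_report(reference, decomposed)
report_zwsp = fidelity_report(reference, with_zwsp)
```

In this sketch the decomposed variant differs from the reference only until NFC normalization is applied, while the zero width space survives normalization and is reported by `invisible_chars`. Both variants look the same as the reference when rendered, which is why a check of this kind has to work on code points rather than on visual inspection.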
- Data validation services
- Researchers focused on data quality
- Systems with robust data ingestion pipelines
- Biomedical LLM developers using raw API data
- Scientific researchers relying solely on API data
- Bibliographic API providers with low data fidelity
In the near term, AI developers and researchers are likely to re-evaluate, and possibly distrust, current bibliographic API data.
This will likely lead to demand for new data-fidelity standards and for comprehensive curation services for scientific texts.
In the long term, improved data quality could lead to more robust and reliable AI systems in biomedical research, accelerating discoveries based on cleaner information.
Read at arXiv cs.CL