SIGNAL · AI · Jun 25, 2026, 4:00 AM · Signal 75 · Short term

Invisible to humans, visible to machines: a preregistered audit of Unicode fidelity across four biomedical bibliographic APIs

arXiv:2606.24897v1 · Announce Type: cross · Abstract: Biomedical text mining, scientometrics, and the construction of training corpora for biomedical large language models (LLMs) all assume that the abstract text returned by a bibliographic API faithfully reproduces the published abstract. This pre-registered audit (OSF osf.io/269b5) tests that assumption for four widely used public APIs (PubMed E-utilities, Crossref, OpenAlex, Semantic Scholar) against PubMed Central (PMC) JATS XML as a common ground truth. From a complete enumeration of the PMC Open Access subset for 2024 (about 700,000 records)

Why this matters

Why now

LLMs are proliferating, and they are trained on very large text corpora, so the fidelity of that training text has become a pressing concern.

Why it’s important

Biomedical text mining and biomedical LLMs are only as accurate as the abstract text they ingest, and this audit tests whether four major bibliographic APIs reproduce that text faithfully, pointing to potentially systemic reproduction issues.

What changes

The audit puts data provenance under scrutiny. If API-returned abstracts differ from the published text at the Unicode level, the fidelity assumption behind AI training corpora and scientometric research does not hold, and data sourcing and cleaning processes need re-evaluation.
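A re-evaluation of this kind can start with a per-record check that compares API-returned text against a trusted reference copy. The sketch below is illustrative only and is not the audit's method, which the truncated abstract does not describe. The function names `invisible_chars` and `fidelity_report` are hypothetical, and the choice to treat Unicode categories Cf and Cc as "invisible" is an assumption.

```python
import unicodedata
from collections import Counter

# Whitespace controls that are usually legitimate in abstract text.
_ALLOWED_CONTROLS = {"\n", "\t", "\r"}


def invisible_chars(text: str) -> Counter:
    """Count format (Cf) and control (Cc) characters, which render as nothing."""
    return Counter(
        ch
        for ch in text
        if ch not in _ALLOWED_CONTROLS
        and unicodedata.category(ch) in ("Cf", "Cc")
    )


def fidelity_report(reference: str, candidate: str) -> dict:
    """Compare a candidate string with a reference at several strictness levels."""
    ref_inv = invisible_chars(reference)
    cand_inv = invisible_chars(candidate)

    def same(form: str) -> bool:
        # Equal once both sides are brought to the same normalization form.
        return unicodedata.normalize(form, reference) == unicodedata.normalize(
            form, candidate
        )

    return {
        "exact_match": reference == candidate,
        "nfc_match": same("NFC"),    # canonical equivalence (e.g. composed accents)
        "nfkc_match": same("NFKC"),  # compatibility equivalence (e.g. ligatures)
        "invisible_only_in_candidate": sorted(
            f"U+{ord(ch):04X}" for ch in (cand_inv - ref_inv)
        ),
        "invisible_only_in_reference": sorted(
            f"U+{ord(ch):04X}" for ch in (ref_inv - cand_inv)
        ),
    }


if __name__ == "__main__":
    # A zero-width space (U+200B) looks identical on screen but changes the bytes.
    print(fidelity_report("\u03b2-blocker", "\u03b2\u200b-blocker"))
```

Reporting the exact, NFC, and NFKC comparisons separately distinguishes harmless re-encoding (same text under normalization) from differences that survive normalization, such as inserted zero-width characters.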

Winners

· Data validation services
· Researchers focused on data quality
· Systems with robust data ingestion pipelines

Losers

· Biomedical LLM developers using raw API data
· Scientific researchers relying solely on API data
· Bibliographic API providers with low data fidelity

Second-order effects

Direct

AI developers and researchers are likely to re-evaluate, and may come to distrust, the abstract text they currently pull from bibliographic APIs.

Second

This is likely to create demand for data-fidelity standards and for comprehensive curation services covering scientific text.

Third

In the long term, better data quality could make AI systems in biomedical research more robust and reliable, and discoveries built on cleaner text could come faster.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream; we do not republish source content.

Read at arXiv cs.CL

#cs.DL #cs.CL #cs.IR

Tracked by The Continuum Brief · live intelligence network
