
arXiv:2603.22327v2 (announce type: replace-cross)
Abstract: Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert-annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were derived from peer-reviewed studies on WHO priority pathogens and produced by domain experts
Large language models are growing more capable just as demand for efficient scientific knowledge synthesis is rising, which makes robust evaluation frameworks necessary.
AgentSLR offers a concrete, high-stakes benchmark for evaluating AI on complex white-collar knowledge work, specifically systematic literature reviews in epidemiology, which are critical for public health.
AgentSLR provides a standardized, large-scale evaluation harness for testing LLMs stage by stage in a critical scientific domain, which could accelerate the adoption and refinement of AI for research.
- AI research and development
- Epidemiological researchers
- Public health organizations
- LLM developers
- Tasks requiring manual exhaustive literature review
- Unspecialized AI models
- Traditional manual review industries
AI-driven systematic reviews become more reliable and widely adopted in medical and scientific fields.
Evidence synthesis takes less time and costs less across scientific disciplines, accelerating drug discovery and public health interventions.
Highly specialized AI agents emerge that can independently conduct and publish scientific reviews, leading to new forms of scientific knowledge generation and dissemination.
Read at arXiv cs.AI