
arXiv:2606.17588v1 Announce Type: cross Abstract: Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitatively investigate how and why LLMs fail. We also propose actionable recommendations. We analyzed disagreements between LLMs and researchers across six software engineering SRs and over 1,000 primary study papers. For each SR, papers were screened indepe
The proliferation of Large Language Models (LLMs) in various professional domains necessitates a deeper understanding of their reliability, especially where accuracy is critical, such as in systematic reviews.
This research provides crucial insights into the limitations and failure modes of LLMs in structured document analysis, informing the development of more robust AI and guiding effective human-AI collaboration.
The focus shifts from merely measuring LLM accuracy to qualitatively analyzing 'why' and 'how' they fail, leading to actionable recommendations for improving their utility and trustworthiness in critical applications.
- AI researchers and developers
- Systematic review methodologists
- Organizations implementing LLM-assisted workflows
- Academic publishing
- Companies offering 'black box' LLM solutions
- Researchers relying solely on unvetted LLM outputs
- Low-quality AI model developers
Improved understanding of LLM limitations and potential failure points in specific tasks.
Development of more reliable and interpretable LLMs, enhancing trust and accelerating adoption in high-stakes environments.
Re-evaluation of regulatory frameworks and best practices for incorporating AI into sensitive research and decision-making processes.
Read at arXiv cs.AI