SIGNAL · AI · Jun 17, 2026, 4:00 AM · Signal 75 · Short term

Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations

arXiv:2606.17588v1. Abstract: Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitatively investigate how and why LLMs fail. We also propose actionable recommendations. We analyzed disagreements between LLMs and researchers across six software engineering SRs and over 1,000 primary study papers. For each SR, papers were screened indepe

Why this matters

Why now

LLMs are increasingly used in professional work, so their reliability needs closer scrutiny, particularly in accuracy-critical tasks such as systematic reviews.

Why it’s important

The study examines how and why LLMs fail at title-abstract screening. Those findings can inform more robust tools and guide how human reviewers and LLMs divide the work.

What changes

The focus shifts from measuring LLM-human agreement to qualitatively analyzing how and why LLMs fail, with actionable recommendations for making LLM-assisted screening more useful and trustworthy.

Winners

· AI researchers and developers
· Systematic review methodologists
· Organizations implementing LLM-assisted workflows
· Academic publishing

Losers

· Companies offering 'black box' LLM solutions
· Researchers relying solely on unvetted LLM outputs
· Low-quality AI model developers

Second-order effects

Direct

Improved understanding of LLM limitations and potential failure points in specific tasks.

Second

Development of more reliable and interpretable LLMs, enhancing trust and accelerating adoption in high-stakes environments.

Third

Re-evaluation of regulatory frameworks and best practices for incorporating AI into sensitive research and decision-making processes.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.AI

#cs.SE #cs.AI

