StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence-Gap Detection in Multi-Hop Question Answering

arXiv:2605.24733v1 Announce Type: new Abstract: We present StepGap, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels: Contradicted Claim (CC), Irrelevant Evidence (IE), or Missing Bridge (MB), each tied to a concrete repair action. On 82 multi-hop questions (181 annotated steps, κ = 0.704), StepGap reaches sF1 = 72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1) but with a more decomposable structure: every StepGap stage hurts F1 when removed, whi
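The abstract's claim that one score falls "within the bootstrap confidence interval" of another can be illustrated with a percentile bootstrap over annotated steps. The following is a minimal sketch, not the paper's evaluation code: the abstract does not define sF1, so plain binary F1 stands in for it, and the function names, resample count, and seed are all assumptions made for illustration.

```python
import random


def f1(gold, pred):
    """Binary F1 over parallel lists of booleans (a stand-in for the paper's sF1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(gold, pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling steps with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Under this reading, a second system's point score is "within the interval" when it lies between the returned lower and upper bounds, which is why a 1.9-point gap on 181 steps need not indicate a reliable difference.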
As multi-hop question answering with large language models grows more complex, precise tools for identifying and correcting evidence gaps become critical to accuracy and reliability.
Improving the ability of AI systems to detect and rectify errors in reasoning and evidence chains directly enhances their trustworthiness and practical utility for complex analytical tasks.
The development of specific tools like StepGap offers a more granular approach to debugging and improving multi-hop reasoning in LLMs, moving beyond black-box assessments to structured error identification.
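To make the idea of structured, typed error identification concrete, here is a minimal sketch of a staged checker that returns one of the three labels named in the abstract. The abstract does not give the stage order, thresholds, or repair actions, so everything below is hypothetical: the names `check_step`, `NLIScores`, and `fixed_nli`, the 0.5 threshold, the ordering of the stages, and the strings in `REPAIR`. The NLI model and the LLM judge are passed in as plain callables rather than tied to any real library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Gap(Enum):
    CONTRADICTED_CLAIM = "CC"
    IRRELEVANT_EVIDENCE = "IE"
    MISSING_BRIDGE = "MB"


@dataclass
class NLIScores:
    entailment: float
    contradiction: float
    neutral: float


# Hypothetical repair actions: the abstract says each label maps to one,
# but does not name them in the excerpt above.
REPAIR = {
    Gap.CONTRADICTED_CLAIM: "revise or drop the claim",
    Gap.IRRELEVANT_EVIDENCE: "retrieve different evidence",
    Gap.MISSING_BRIDGE: "retrieve the intermediate fact",
}


def check_step(
    claim: str,
    evidence: str,
    nli: Callable[[str, str], NLIScores],
    llm_is_relevant: Callable[[str, str], bool],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Optional[Gap]:
    """Return a typed gap label for one reasoning step, or None if supported."""
    scores = nli(claim, evidence)
    # Stage 1 (NLI): strong contradiction is the cheapest signal to act on.
    if scores.contradiction >= threshold:
        return Gap.CONTRADICTED_CLAIM
    # Stage 2 (NLI): the evidence entails the claim, so there is no gap.
    if scores.entailment >= threshold:
        return None
    # Stage 3 (LLM): the step is unsupported; ask whether the evidence is on topic.
    if not llm_is_relevant(claim, evidence):
        return Gap.IRRELEVANT_EVIDENCE
    return Gap.MISSING_BRIDGE


def fixed_nli(entailment: float, contradiction: float, neutral: float):
    """Build a stub NLI callable that ignores its inputs (for demonstration only)."""
    return lambda claim, evidence: NLIScores(entailment, contradiction, neutral)
```

For example, calling `check_step` with `fixed_nli(0.1, 0.1, 0.8)` and a relevance judge that returns `True` yields `Gap.MISSING_BRIDGE`, and `REPAIR` then maps that label to a follow-up action. Because each stage is a separate branch, any one of them can be disabled to measure its contribution, which is the kind of ablation the abstract alludes to.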
- AI researchers
- LLM developers
- Industries relying on complex AI analysis
- AI systems with poor evidence chain validation
AI systems will become more robust in handling complex, multi-step queries by identifying reasoning flaws and missing information.
Increased reliability of AI-generated insights could accelerate their adoption in high-stakes decision-making environments.
The development of explainable AI (XAI) tools may be bolstered by the capacity to pinpoint specific logical gaps or contradictions in AI reasoning.
Read at arXiv cs.CL