Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

arXiv:2606.29876v1 (announce type: cross). Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical reasoning graphs, structured graph representations extracted from free-text LLM diagnostic traces using a domain-grounded ontology with 5 node types and 7 edge types. We apply this pipeline to 750 traces from five LLMs across 50 New England Journal of Medicine Clinicopathological Conference cases and three prompt condition
As LLMs spread into specialized domains like medicine, evaluation needs to go beyond simple accuracy scores and show, transparently, where the models are useful and where they fall short.
This research offers a structured way to evaluate LLM diagnostic reasoning. It moves beyond accuracy to reveal whether a model's competence is clinically grounded or merely pattern-based, which is critical for safe and effective deployment in sensitive applications.
The work makes it possible to distinguish stable, clinically grounded reasoning from pattern matching in LLMs, giving a more nuanced picture of their diagnostic capabilities and a methodology for structured evaluation.
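To make the idea of a typed reasoning graph concrete, here is a minimal Python sketch. It is not the paper's pipeline. The abstract gives only the counts (5 node types, 7 edge types), so the type labels below are placeholders, the clinical terms are made up for illustration, and the Jaccard overlap of edge sets is an assumed stand-in for a consistency measure rather than the authors' metric.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import product

# Placeholder labels: the source states 5 node types and 7 edge types
# but does not name them, so these identifiers are hypothetical.
NODE_TYPES = frozenset(f"node_type_{i}" for i in range(1, 6))
EDGE_TYPES = frozenset(f"edge_type_{i}" for i in range(1, 8))


@dataclass
class ReasoningGraph:
    """Typed directed graph for one diagnostic trace (illustrative only)."""

    nodes: dict[str, str] = field(default_factory=dict)  # node id -> node type
    edges: set[tuple[str, str, str]] = field(default_factory=set)  # (src, type, dst)

    def add_node(self, node_id: str, node_type: str) -> None:
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, src: str, edge_type: str, dst: str) -> None:
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((src, edge_type, dst))


def edge_jaccard(a: ReasoningGraph, b: ReasoningGraph) -> float:
    """Share of edges common to both graphs (assumed consistency proxy)."""
    union = a.edges | b.edges
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(a.edges & b.edges) / len(union)


# Two hypothetical traces for the same case under different prompt conditions.
g1 = ReasoningGraph()
g1.add_node("fever", "node_type_1")
g1.add_node("infection", "node_type_2")
g1.add_edge("fever", "edge_type_1", "infection")

g2 = ReasoningGraph()
g2.add_node("fever", "node_type_1")
g2.add_node("infection", "node_type_2")
g2.add_node("lymphoma", "node_type_2")
g2.add_edge("fever", "edge_type_1", "infection")
g2.add_edge("fever", "edge_type_1", "lymphoma")

overlap = edge_jaccard(g1, g2)  # one shared edge out of two distinct edges

# Trace count from the abstract: 5 LLMs x 50 cases x 3 prompt conditions.
trace_keys = list(product(range(5), range(50), range(3)))
```

Two traces can reach the same final diagnosis while sharing only part of their edge set, which is the kind of gap between accuracy and reasoning stability that a graph-level comparison is meant to expose.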
- AI ethicists and safety researchers
- Clinical AI developers focusing on transparency
- Healthcare providers seeking trustworthy AI assistance
- LLM developers who prioritize raw accuracy over explainability
- Healthcare systems that rush to deploy opaque LLM solutions
Structured evaluation methods like clinical reasoning graphs become a standard for medical AI.
Pressure increases on LLM developers to build explainability and robust reasoning into their models for critical applications.
Certification and regulatory frameworks for clinical AI begin to mandate transparency in reasoning, which significantly shapes which vendors lead the market.
Read at arXiv cs.AI