Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments

arXiv:2606.30987v1. Abstract: Decision-makers routinely rely on expert judgments accompanied by written explanations, yet explanation quality is difficult to measure at scale. Forecasting tournaments offer a natural testing ground: probabilistic judgments are paired with natural-language rationales and scored against realized outcomes. We introduce Explanation Quality Markers (EQMs), a set of sixty theory-guided reasoning patterns scored by large language models (LLMs). In a pre-registered analysis of over 55,000 forecast-rationale pairs from a multiyear forecasting tournament ...
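To make "scored against realized outcomes" concrete, here is a minimal Python sketch. It assumes the Brier score as the scoring rule, which is common in forecasting tournaments but is not named in the excerpt above. The records, the per-rationale marker counts, and the `threshold` split are all hypothetical illustrations, not the paper's data or method.

```python
from statistics import mean

def brier_score(probability: float, outcome: int) -> float:
    """Squared error between a forecast probability and a binary outcome (0 or 1).

    Lower is better: 0.0 is a perfect forecast, 1.0 is maximally wrong.
    """
    return (probability - outcome) ** 2

# Hypothetical records: (forecast probability, realized outcome, markers detected).
# The marker count stands in for an LLM-assigned explanation-quality score.
records = [
    (0.9, 1, 5),
    (0.6, 1, 3),
    (0.7, 0, 1),
    (0.2, 0, 4),
]

def mean_brier_by_marker_split(rows, threshold):
    """Compare mean Brier scores for rationales at/above vs. below a marker threshold.

    Assumes both groups are non-empty; statistics.mean raises on an empty list.
    """
    high = [brier_score(p, y) for p, y, m in rows if m >= threshold]
    low = [brier_score(p, y) for p, y, m in rows if m < threshold]
    return mean(high), mean(low)

high_mean, low_mean = mean_brier_by_marker_split(records, threshold=4)
print(f"mean Brier, marker-rich rationales: {high_mean:.3f}")
print(f"mean Brier, marker-poor rationales: {low_mean:.3f}")
```

In this toy data the marker-rich group happens to have the lower (better) mean Brier score. That is a property of the invented numbers only and says nothing about the paper's findings.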
As AI-generated content and expert systems proliferate, decision-makers need robust methods for evaluating the quality and reliability of the explanations, or rationales, behind decisions.
The paper proposes a scalable, LLM-driven approach to measuring the quality of natural-language explanations, which matters both for trustworthy AI adoption and for evaluating expert judgment.
Systematic, model-based scoring of explanation quality could improve how agentic systems and human-AI collaboration are developed, audited, and deployed.
- AI developers focused on explainability
- Organizations relying on expert forecasting
- Auditors of AI systems
- LLM providers
- Opaque black-box AI systems
- Experts providing low-quality rationales
- Traditional, manual explanation review processes
Refinement of AI agent reasoning capabilities through feedback loops on explanation quality.
Increased demand for explainable AI outputs across various industries, leading to new compliance standards.
Potential for 'explanation marketplaces' where valuable rationales are traded or licensed, fostering a new knowledge economy.
Read at arXiv cs.CL