A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

arXiv:2410.03296v4. Abstract: Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a good explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian
The proliferation of instruction-tuned large language models makes the assessment of their self-explanations a critical and timely research area for practical AI deployment.
This research examines how plausible AI-generated self-explanations are to humans, which bears directly on the trust and usability of advanced AI systems in critical applications.
Understanding of whether AI self-explanations align with human reasoning may shift, potentially leading to more deliberate design choices for AI interpretability features.
- AI interpretability researchers
- Developers of explainable AI (XAI) systems
- Organizations deploying LLMs in sensitive domains
- Developers relying solely on superficial self-explanations
Increased focus on empirical validation and benchmark creation for AI interpretability techniques.
Demand for AI models that can generate more human-aligned and plausible explanations, rather than just any explanation.
The development of new AI architectures specifically designed for intrinsic, verifiable explainability, moving beyond post-hoc methods.
Read at arXiv cs.AI