Can Global XAI Methods Reveal Injected Behaviours in LLMs? SHAP vs Rule Extraction vs RuleSHAP

arXiv:2505.11189v3 Abstract: Large language models (LLMs) can amplify misinformation, undermining societal goals such as the UN SDGs. We study three documented drivers of misinformation (valence framing, information overload, and oversimplification) often shaped by default beliefs. Building on evidence that LLMs encode such defaults (e.g., "joy is positive", "math is complex") and can act as "bags of heuristics", we ask whether belief-driven heuristics behind misinformation-related behaviour can be recovered from black-box LLM behaviour as explicit rules. A key obs
As advanced LLMs spread, robust methods are needed to identify and mitigate harmful embedded behaviours, in line with urgent calls for responsible AI development.
Understanding how to reveal and, by extension, control "injected behaviours" linked to misinformation in LLMs is crucial for keeping these models reliable and preventing their misuse in shaping public discourse.
Systematically extracting and correcting problematic rules within LLMs is moving from a theoretical concern to applied research, offering pathways to more transparent and safer AI systems.
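To make the idea of recovering an explicit rule from black-box behaviour concrete, here is a minimal sketch. It is not the paper's RuleSHAP method, and it uses neither SHAP nor an LLM: `black_box` is a hypothetical stand-in with one injected behaviour (shorter answers for complex topics), and `extract_rule` is an illustrative exhaustive search for the single-feature threshold rule that best explains the outputs. All names, features, and numbers are assumptions made for the demo.

```python
import random
from statistics import fmean


def black_box(complexity: float, valence: float) -> float:
    """Hypothetical stand-in for an LLM: returns an answer-length score.

    Injected behaviour (an assumption for this demo): topics with
    complexity above 0.7 get a much shorter, oversimplified answer.
    """
    length = 100.0 + 20.0 * valence
    if complexity > 0.7:
        length -= 60.0
    return length


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def extract_rule(rows, outputs, feature_names):
    """Find the one-feature threshold rule with the lowest squared error."""
    best = None
    for j, name in enumerate(feature_names):
        cuts = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between neighbouring values.
        for low, high in zip(cuts, cuts[1:]):
            threshold = (low + high) / 2
            below = [y for row, y in zip(rows, outputs) if row[j] <= threshold]
            above = [y for row, y in zip(rows, outputs) if row[j] > threshold]
            error = sse(below) + sse(above)
            if best is None or error < best["error"]:
                best = {
                    "feature": name,
                    "threshold": threshold,
                    "mean_below": fmean(below),
                    "mean_above": fmean(above),
                    "error": error,
                }
    return best


# Probe the black box on random inputs, observing only its outputs.
rng = random.Random(0)
rows = [(rng.random(), rng.random()) for _ in range(300)]
outputs = [black_box(c, v) for c, v in rows]

rule = extract_rule(rows, outputs, ["complexity", "valence"])
print(
    f"IF {rule['feature']} > {rule['threshold']:.2f} "
    f"THEN mean length {rule['mean_above']:.1f} "
    f"ELSE mean length {rule['mean_below']:.1f}"
)
```

The search sees only input and output pairs, yet it recovers a rule on `complexity` with a threshold close to the injected 0.7. A real audit would need far richer features and rule sets than this toy stump.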
- AI ethics researchers
- LLM developers
- Regulatory bodies
- Platforms combating misinformation
- Malicious actors using LLMs
- LLMs with unmitigated biases
- Unregulated AI deployment
Improved XAI methods could lead to more robust and less biased LLMs.
Public trust in AI systems may increase as their decision-making processes become more auditable and controllable.
New standards for AI accountability and transparency could emerge globally, influencing compliance and development cycles.
Read at arXiv cs.LG