Attribution Graphs and Causal Probing for Mechanistic Discovery and Bias Repair in Multimodal Generative Learning

arXiv:2510.12957v4 (announce type: replace). Abstract: We treat the internals of generative models as mechanistic objects rather than black boxes. We introduce Attribution Graphs (AGs), which extend GradCAM++ to circuit-level representations, and Causal Probing, a do-calculus intervention method for identifying causal latent structures, enabling detection and correction of spurious correlations, demographic biases, and misaligned decision circuits during training. We further propose the Cognitive Alignment Score (CAS), quantifying agreement between model-internal repres
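The abstract names an intervention-based probe but, as excerpted here, does not spell out the procedure. The following is only a generic sketch of the underlying idea, not the paper's method: clamp one latent unit to a fixed value (a do-style intervention), re-run the rest of the model, and compare outputs. The toy two-layer network and the names `forward` and `interventional_effect` are hypothetical and exist only for illustration.

```python
import numpy as np

def forward(x, W1, W2, do=None):
    """Toy two-layer network (an assumption, not the paper's model).

    do: optional dict mapping latent index -> value, i.e. do(z_k = value).
    """
    z = np.tanh(x @ W1)              # latent activations, shape (n, d_latent)
    if do is not None:
        z = z.copy()
        for k, value in do.items():
            z[:, k] = value          # overwrite the unit, cutting its dependence on x
    return z @ W2                    # outputs, shape (n, d_out)

def interventional_effect(x, W1, W2, k, lo=-1.0, hi=1.0):
    """Mean output change between do(z_k = hi) and do(z_k = lo)."""
    y_hi = forward(x, W1, W2, do={k: hi})
    y_lo = forward(x, W1, W2, do={k: lo})
    return (y_hi - y_lo).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(4, 2))
W2[3, :] = 0.0                       # latent 3 has no path to the output

# One row per latent unit: how much each output moves when that unit is clamped.
effects = np.stack([interventional_effect(x, W1, W2, k) for k in range(4)])
```

In this sketch a unit with no outgoing weights shows zero interventional effect however strongly it correlates with the input, which is the distinction between correlational and causal evidence that an intervention-based probe is meant to expose.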
The increasing complexity and scale of multimodal generative models necessitate advanced techniques for interpretability and bias mitigation at the foundational level.
This research provides critical tools for understanding and controlling the internal mechanisms of AI, moving generative models beyond black boxes towards reliable and accountable systems.
The ability to mechanistically debug and align AI models during training improves their trustworthiness, safety, and societal integration by directly addressing biases and spurious correlations.
- AI developers
- AI ethicists
- Regulatory bodies
- Industries deploying AI
- Developers of opaque AI systems
- Companies reliant on 'black box' AI
Increased adoption of interpretable and bias-corrected multimodal generative AI in sensitive applications.
Reduced incidence of AI failures and unintended societal harms due to improved internal model alignment.
Accelerated development of more robust and auditable AI governance frameworks and standards.
Read at arXiv cs.LG