Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies

arXiv:2606.29171v1 (announce type: new)
Abstract: While existing data attribution methods can identify which training examples build specific mechanistic circuits, they cannot explain how training data shapes the high-level behavioral decisions a model learns to make. To bridge this gap, we introduce Symbolic Mechanistic Data Attribution (SMDA), a framework that attributes training pairs to the interpretable symbolic policies governing model behavior. SMDA fits a closed-form Ridge regression over sparse autoencoder (SAE) features to model a target behavior, then analytically decomposes how each [...]
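The abstract is cut off before it says what SMDA decomposes, so the following is only a generic sketch of why a closed-form Ridge fit lends itself to analytic decomposition; it is not the paper's procedure. It assumes a small matrix of stand-in "SAE feature" activations (`features`), a scalar behavior score per example (`targets`), and a regularization strength `lam`. All names and the toy data are hypothetical. Because the Ridge solution w = (XᵀX + λI)⁻¹ Xᵀy is linear in y, it can be written as a sum of one term per training example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def solve_linear(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not mutated.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gram_plus_ridge(features: Matrix, lam: float) -> Matrix:
    """Return X^T X + lam * I (positive definite for lam > 0)."""
    d = len(features[0])
    gram = [[sum(row[i] * row[j] for row in features) for j in range(d)]
            for i in range(d)]
    for i in range(d):
        gram[i][i] += lam
    return gram


def ridge_weights(features: Matrix, targets: Vector, lam: float) -> Vector:
    """Closed-form Ridge solution w = (X^T X + lam I)^{-1} X^T y."""
    d = len(features[0])
    xty = [sum(row[j] * y for row, y in zip(features, targets))
           for j in range(d)]
    return solve_linear(gram_plus_ridge(features, lam), xty)


def per_example_contributions(features: Matrix, targets: Vector,
                              lam: float) -> Matrix:
    """Additive split of w: example i contributes (X^T X + lam I)^{-1} x_i y_i."""
    gram = gram_plus_ridge(features, lam)
    return [solve_linear(gram, [v * y for v in row])
            for row, y in zip(features, targets)]


# Hypothetical toy data: 4 examples, 3 stand-in "SAE features".
features = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
targets = [1.0, 0.0, 1.0, -1.0]
lam = 0.1

weights = ridge_weights(features, targets, lam)
contributions = per_example_contributions(features, targets, lam)
```

Summing the rows of `contributions` recovers `weights`. Note that this split holds the regularized Gram matrix fixed, so each term is an additive share of the fitted weights rather than a leave-one-out estimate of what would change if the example were removed.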
As AI models are deployed more widely, particularly in critical applications, robust explainability and attribution methods are needed to establish trust and address regulatory concerns.
Understanding how training data influences model behavior matters for debugging, auditing, and ethical AI development, and it bears directly on the adoption and reliability of powerful AI systems.
Attributing high-level model behaviors to specific training data points through symbolic mechanistic approaches adds a layer of interpretability beyond identifying component circuits.
- AI developers
- Auditors and regulators
- Researchers in AI safety
- Sectors deploying critical AI

- Black-box AI systems
- Developers ignoring explainability
- Applications with high-stakes, opaque AI
- Trust-poor AI applications
Improved understanding of model biases and decision-making processes, leading to more robust and ethical AI.
New standards and regulations for AI transparency and data attribution may emerge, impacting AI development cycles and costs.
Enhanced trust in AI could accelerate adoption in highly sensitive fields, potentially transforming industries that rely on critical decision support systems.
Read at arXiv cs.LG