
arXiv:2606.26257v1. Abstract: How much of my data was used to train a machine learning model? Dataset Usage Inference (DUI) aims to answer this by estimating what fraction of a dataset contributed to a model's training. However, existing DUI methods rely on assumptions that rarely hold in practice: they require training expensive shadow models to imitate the target model, and they assume access to both known training samples and an in-distribution held-out set confirmed to be absent from training. These conditions make current approaches impractical for modern large models an
The proliferation of advanced AI models necessitates improved methods for understanding data lineage and usage, especially in light of increasing regulatory and ethical concerns.
This research addresses a critical gap in AI model auditing and transparency, enabling better compliance, intellectual property protection, and trust in AI systems.
New methodologies for Dataset Usage Inference (DUI) will allow more practical and scalable assessment of how data contributes to large machine learning models, without the prohibitive costs of prior methods.
- AI developers
- Data owners
- Regulatory bodies
- Ethical AI advocates
- Developers of less transparent models
- Systems reliant on proprietary 'black box' data usage
Easier and more accurate assessment of data contributions to large AI models, improving transparency.
Increased accountability for AI model training data, potentially impacting intellectual property and privacy regulations.
'Data footprint' audits could become a standard requirement for AI model deployment and certification.
Read at arXiv cs.LG