arXiv:2605.30348v1 Announce Type: cross Abstract: The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data composition or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inv
Source: arXiv cs.LG — read the full report at the original publisher.
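
To make the DMS input and output concrete, here is a minimal sketch of a naive classify-and-count baseline. It is not the LLMSurgeon method, which the truncated abstract does not describe. The taxonomy, the keyword lists, and the function names `classify` and `estimate_mixture` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical three-domain taxonomy with toy keyword lists (assumption,
# not taken from the paper).
TAXONOMY = {
    "code": {"def", "return", "import"},
    "math": {"theorem", "proof", "equation"},
    "web": {"click", "subscribe", "login"},
}


def classify(text):
    """Assign a text to the domain whose keywords it contains most often."""
    tokens = text.lower().split()
    scores = {
        domain: sum(token in keywords for token in tokens)
        for domain, keywords in TAXONOMY.items()
    }
    # On ties, max() keeps the first domain in TAXONOMY order.
    return max(scores, key=scores.get)


def estimate_mixture(samples):
    """Return the fraction of generated samples assigned to each domain."""
    if not samples:
        raise ValueError("need at least one generated sample")
    counts = Counter(classify(sample) for sample in samples)
    return {domain: counts[domain] / len(samples) for domain in TAXONOMY}


# Stand-ins for text generated by a target LLM.
generated = [
    "import os def main return",
    "theorem and proof of the equation",
    "click here to subscribe",
    "def f return x",
]
mixture = estimate_mixture(generated)
```

The result is a probability vector over the taxonomy, which is the shape of answer DMS asks for. The frequencies of domains in a model's outputs need not match the proportions in its pretraining corpus, so this baseline should be read only as an illustration of the problem interface.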
