Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

arXiv:2605.20602v1 (announce type: cross). Abstract: Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions,
This research speaks to current debates about language models trained on their own outputs, including their self-reinforcing tendencies and the risk of 'model collapse'.
Understanding that self-training restructures language, rather than merely flattening it, matters for steering future AI development, especially for complex agentic systems that consume their own generated text.
The finding reframes self-training's impact: not a simple loss of diversity, but a restructuring of language, with implications for whether models can maintain nuanced communication.
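One way to tell restructuring apart from flattening is to track each feature category separately instead of relying on a single diversity score. The sketch below is a minimal illustration in Python, not the paper's method: the marker word lists are hypothetical, the function names `marker_rates` and `rate_changes` are invented here, and counting question marks is only a crude stand-in for the syntactic question structures the abstract mentions.

```python
import re
from collections import Counter

# Hypothetical marker lists, chosen for illustration only.
# The paper's actual feature definitions are not given in this brief.
CONNECTIVES = {"however", "moreover", "therefore", "furthermore", "thus"}
HEDGES = {"perhaps", "possibly", "likely", "might", "may", "seems"}

CATEGORIES = ("connectives", "hedges", "em_dashes", "questions")


def marker_rates(text: str) -> dict[str, float]:
    """Return occurrences per 1,000 word tokens for each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in CATEGORIES}
    counts = Counter(tokens)
    scale = 1000.0 / len(tokens)
    return {
        "connectives": sum(counts[w] for w in CONNECTIVES) * scale,
        "hedges": sum(counts[w] for w in HEDGES) * scale,
        # Count both the Unicode em-dash and the ASCII "--" convention.
        "em_dashes": (text.count("\u2014") + text.count("--")) * scale,
        # Crude proxy: question marks, not a syntactic parse.
        "questions": text.count("?") * scale,
    }


def rate_changes(first: str, last: str) -> dict[str, float]:
    """Per-category rate difference between two generations of text."""
    before, after = marker_rates(first), marker_rates(last)
    return {name: after[name] - before[name] for name in CATEGORIES}
```

Applied to text sampled from the first and last self-training generations, uniform flattening would be expected to move all categories in the same direction, whereas the restructuring described in the abstract would show surface-marker rates rising while the question rate falls.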
- AI researchers focusing on linguistic structure
- Developers of advanced reasoning AI
- Ethical AI developers
- Developers relying on 'flattened' language assumptions
- AI systems requiring deep syntactic understanding
Self-training mechanisms in AI are likely to be re-evaluated to prevent unintended linguistic biases.
New AI architectures may emerge that explicitly aim to preserve or enhance deep syntactic structures during self-improvement.
The development of highly sophisticated AI agents could be constrained by an inability to reliably maintain complex linguistic expression in self-generated data.
Read at arXiv cs.LG