
arXiv:2606.19468v1 Announce Type: new Abstract: The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-bas
As LLMs are increasingly pretrained on web-scale corpora, improving performance and addressing biases requires a finer-grained understanding of what that training data contains, which has led researchers to examine its narrative content.
Understanding the narrative composition of LLM training data is crucial for developing more sophisticated, human-like AI models and mitigating unintended consequences derived from unexamined data properties.
The ability to systematically characterize narrative content in massive datasets allows for more targeted data curation, potentially leading to significant advancements in LLM capabilities and interpretability.
- AI Researchers
- LLM Developers
- Data Scientists
- Ethical AI Advocates
- Ad-hoc Data Curation Practices
- LLMs with Unaddressed Narrative Biases
Improved understanding of how narrative structures influence LLM pretraining and output.
Development of new data curation techniques focused on specific narrative qualities, enabling more robust and less biased LLMs.
Enhanced AI systems capable of generating highly coherent and contextually appropriate narratives for complex tasks, ranging from content creation to strategic analysis.
Read at arXiv cs.CL