
arXiv:2605.19908v2 (announce type: replace)
Abstract: Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are similarly available at every layer in every model we probe, including an off-the-shelf control encoder, suggesting that the gap is not explained by their linear readability. Instead, causal intervention shows that the score
As language models grow more capable, understanding their internal workings matters more, which makes mechanistic interpretability a timely research focus.
Understanding how authorship signals emerge and are processed in language models is crucial for developing more robust attribution, misinformation detection, and honest communication systems.
The research indicates that stylistic features are readily available in every encoder probed, but whether a model uses them effectively depends heavily on its scoring mechanism; linear readability alone does not explain the performance gap.
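For illustration, the three stylistic features named in the abstract could be computed from raw text roughly as follows. This is a minimal sketch, not the paper's method: the function name `stylistic_features`, the tiny `FUNCTION_WORDS` list, and the whitespace tokenization are all assumptions made here. In a probing setup, values like these would typically serve as the targets a linear probe tries to predict from a layer's activations.

```python
import string

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch); real stylometric work uses much larger curated lists.
FUNCTION_WORDS = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "that", "it", "is"}
)


def stylistic_features(text: str) -> dict:
    """Return three simple style statistics for a piece of text."""
    # Whitespace tokenization, then strip surrounding punctuation and lowercase.
    words = [tok.strip(string.punctuation).lower() for tok in text.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation

    n_words = len(words)
    n_chars = len(text)
    n_punct = sum(ch in string.punctuation for ch in text)

    return {
        # Average number of characters per word.
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Fraction of all characters that are ASCII punctuation.
        "punctuation_density": n_punct / n_chars if n_chars else 0.0,
        # Fraction of words that appear in the function-word list.
        "function_word_rate": (
            sum(w in FUNCTION_WORDS for w in words) / n_words if n_words else 0.0
        ),
    }


if __name__ == "__main__":
    print(stylistic_features("The cat sat, and it purred."))
```

Empty input returns zeros for all three features instead of raising a division error, which keeps the function safe to map over a corpus that contains blank documents.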
- AI researchers
- Forensic linguistics
- Content authentication platforms
- Misinformation creators
- Plagiarism services
Improved authorship attribution models with clearer performance optimization paths.
Development of new interpretability tools specifically designed to analyze stylistic feature utilization in neural networks.
Enhanced ability to differentiate human-generated content from AI-generated content, impacting digital provenance and intellectual property.
Read at arXiv cs.CL