HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction

arXiv:2407.20595v5. Abstract: Authorship attribution asks whether two pieces of text share a writer, but topical confounding makes the task deceptively easy: two authors covering the same topic may look more alike than one author covering two topics. Scholarly prose offers a natural remedy: academic writers produce multiple papers on related but distinct topics while maintaining consistent stylistic habits. We introduce HALvest, a 17-billion-token multilingual corpus of open-access academic papers, and its English contrastive derivative HALvest-Contrastive, where same-
The proliferation of AI models makes robust authorship attribution increasingly critical, especially in academic and professional contexts, and it calls for advanced analytical techniques and large, clean datasets.
Improved authorship attribution methods can strengthen academic integrity and intellectual property protection, and can inform future AI-driven text analysis and generation, which in turn affects trust in digital content.
The introduction of a massive, multilingual academic corpus and a retrieval-like attribution model offers a more robust method for identifying authors, potentially reducing biases from topical confounding.
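To make "retrieval-like" and "late interaction" concrete, here is a minimal sketch of the MaxSim scoring rule that the term "late interaction" usually refers to in retrieval work. It is not taken from the paper: the brief does not describe the model itself, so the function names (`maxsim_score`, `rank_candidates`), the toy two-dimensional "patch" embeddings, and the choice of cosine similarity are all assumptions made for illustration. A "patch" is assumed here to be a chunk of text that has already been embedded as a vector.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_patches, doc_patches):
    """Late-interaction (MaxSim) score, assumed here for illustration.

    For each query patch embedding, take its best cosine match among the
    document's patch embeddings, then sum those best matches.
    """
    q = [normalize(p) for p in query_patches]
    d = [normalize(p) for p in doc_patches]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


def rank_candidates(query_patches, candidates):
    """Rank candidate documents (hypothetical name -> patch embeddings) by score."""
    scored = [
        (name, maxsim_score(query_patches, patches))
        for name, patches in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings, invented for the example; a real system would use an encoder.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "author_a": [[0.9, 0.1], [0.1, 0.9]],
    "author_b": [[-1.0, 0.0], [0.0, -1.0]],
}
ranking = rank_candidates(query, candidates)
```

In this toy setup `ranking` places `author_a` first, because each query patch finds a close match among that candidate's patches. The design point is that the comparison happens patch by patch rather than between two pooled document vectors, which is what distinguishes late interaction from single-vector similarity.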
- Academic institutions
- Plagiarism detection services
- Intellectual property rights holders
- Natural Language Processing researchers
- Plagiarists
- Creators of generative AI focused on indistinguishable human-like text
- Legacy authorship attribution tools
- Publishers with weak review processes
More accurate authorship attribution tools become widely adopted in academic and publishing fields.
AI language models find it increasingly difficult to generate text that is indistinguishable from that of human authors in terms of unique stylistic markers.
Enhanced trust and transparency in digital content creation and academic contributions, potentially leading to new forms of digital identity verification based on writing style.
Read at arXiv cs.CL