OARelatedWork: A Large-Scale Dataset of Related Work Sections with Full-texts from Open Access Sources

arXiv:2405.01930v2. Abstract: This paper introduces OARelatedWork: a dataset for related work generation from open-access sources. It is the first large-scale multi-document summarization dataset for related work generation, containing whole related work sections and full texts of cited papers. Its validation and test splits are constructed so that every cited paper is available in full text, enabling controlled evaluation of full-text related work generation. The dataset includes 94 450 papers and 5 824 689 unique referenced papers from multiple domains. With OARelatedWo
The proliferation of academic papers and the development of large language models are creating a demand for more sophisticated and automated methods of knowledge synthesis.
This dataset provides a resource for training AI models to understand, synthesize, and generate academic related work sections, which could shorten research and development cycles.
The availability of 'OARelatedWork' enables more advanced and rigorously evaluated AI systems for scientific literature review and knowledge discovery.
- AI researchers
- Open Access publishers
- Academic institutions
- Knowledge management platforms

- Manual literature review processes
- Academic plagiarism (potentially harder to conceal)
- Small-scale, non-reproducible AI datasets
AI models trained on 'OARelatedWork' are likely to become better at generating comprehensive and accurate related work sections for researchers.
The automation of related work generation could significantly reduce the time burden on researchers, accelerating scientific discovery and publication.
In the longer term, such automation could lead to a 'meta-AI' that not only analyzes research but also helps formulate new research questions by identifying gaps in the literature more efficiently.
Read at arXiv cs.CL