SciDef: Datasets and Tools for Automated Definition Extraction from Scientific Literature with LLMs

arXiv:2602.05413v2. Abstract: Scientific concepts are often defined inconsistently across papers, making it difficult to compare findings, reuse terminology, and build reliable downstream resources. We present SciDef, a resource suite for scientific definition extraction. The suite contains DefExtra, a benchmark of 268 human-validated author-stated definitions from 75 academic papers; DefSim, 60 human-labeled definition-pair similarity judgments; and an open LLM-based pipeline for PDF preprocessing, chunking, definition extraction, prompt optimization, and evaluation…
As LLMs proliferate and the scientific literature grows more complex, the need for automated knowledge extraction tools is increasing.
Resources like SciDef aim to improve clarity and consistency in scientific communication, which matters most in fast-moving fields like AI.
Automatically extracting and standardizing definitions could reduce ambiguity and make scientific concepts easier to access and compare across studies.
- AI researchers
- Scientific publishers
- Academia
- AI tool developers
- Researchers relying on manual literature review for definitions
Improved interoperability and reusability of scientific findings due to standardized terminology.
Faster innovation cycles in fields where precise definitions and concept understanding are critical.
Potential for new AI-driven discovery platforms that leverage formalized scientific knowledge graphs built from extracted definitions.
Read at arXiv cs.CL