Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities

arXiv:2606.01109v1 (cross-listed). Abstract: Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600
As AI models for text processing grow more sophisticated, demand is rising for granular, domain-specific datasets that improve performance beyond general-purpose applications.
Improved reference extraction in academic and legal fields could significantly enhance research efficiency and the development of AI tools for knowledge management in professions heavily reliant on complex citation structures.
The availability of a specialized, multilingual dataset like FOSSIL addresses a gap in training data, potentially leading to more accurate and robust AI tools for unstructured bibliographic data.
- Legal tech sector
- Humanities AI research
- Academic researchers
- Text analytics companies
- Manual data entry services
More accurate and versatile AI tools for processing legal and humanities texts are likely to emerge.
This could lead to new forms of scholarly analysis and knowledge discovery based on interconnected bibliographies.
The enhanced AI capabilities might reduce research costs and democratize access to sophisticated analytical tools in these fields.
Read at arXiv cs.CL