Introducing corpora Hlava Cor and Hlava AD: Human Label Variation in Coreference and Discourse Relations

arXiv:2606.25383v1. Abstract: As previous research on annotator disagreement in discourse phenomena has shown, understanding text coherence varies considerably from one individual to another. To explore this phenomenon, we created two corpora with multiple annotations of Czech texts, accompanied by annotators' explanations of their choices. The first corpus consists of 1,024 contexts annotated in parallel by three annotators. It captures differences in the identification of coreference across various text types and grammatical-semantic categories, including pronouns, full nou
This research arrives as AI systems, particularly large language models, grow more sophisticated, which makes the nuances of human language understanding and potential biases in training data increasingly relevant.
Understanding human label variation in coreference and discourse relations is crucial for building more robust, context-aware, and less biased AI models, particularly in natural language processing (NLP).
This research contributes to the foundational understanding of linguistic annotation challenges, which can improve data quality for training advanced NLP models and foster better human-AI collaboration in content creation and analysis.
- NLP researchers
- AI ethics committees
- Companies building advanced LLMs
- Developers relying on simplistic NLP models
- Applications vulnerable to contextual misunderstandings
Improved understanding of human disagreement in linguistic annotation.
Development of more sophisticated annotation guidelines and tools that account for human variation, leading to higher quality training datasets for AI.
More robust and less biased AI language models that can handle ambiguities and contextual intricacies more effectively, reducing misinterpretations in critical applications.
Read at arXiv cs.CL