Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale

arXiv:2512.09634v2. Abstract: This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment of automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. As the inter-annotator correlations were moderate, with some texts receiving scores
The proliferation of Large Language Models (LLMs) and the increasing need for nuanced natural language understanding are driving the creation of specialized datasets, especially for less-resourced languages.
This dataset aids in developing more sophisticated and localized AI models capable of understanding subjective content, which is crucial for applications ranging from content moderation to public opinion analysis in specific linguistic contexts.
A continuously scaled subjectivity dataset for Estonian makes it easier to build more accurate AI systems for that language, potentially reducing dependence on models trained exclusively on dominant languages.
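To make the annotation setup in the abstract more concrete, here is a minimal sketch of how agreement between four annotators rating documents on a 0 to 100 scale could be inspected. Everything in it is an assumption made for illustration: the annotator labels, the five sample scores, the choice of Pearson correlation, and the use of a per-document mean are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(ratings):
    """Map each annotator pair to the correlation of their scores.

    `ratings` maps an annotator label to a list of 0-100 scores,
    one score per document, in the same document order.
    """
    return {
        (a, b): pearson(ratings[a], ratings[b])
        for a, b in combinations(sorted(ratings), 2)
    }


def mean_scores(ratings):
    """Average the annotators' scores for each document."""
    names = sorted(ratings)
    n_docs = len(ratings[names[0]])
    return [sum(ratings[a][i] for a in names) / len(names) for i in range(n_docs)]


# Hypothetical scores for five documents (NOT data from the paper).
ratings = {
    "A1": [10, 80, 45, 95, 20],
    "A2": [15, 70, 60, 90, 5],
    "A3": [0, 85, 30, 100, 40],
    "A4": [25, 60, 50, 75, 10],
}

pairs = pairwise_correlations(ratings)
means = mean_scores(ratings)

for (a, b), r in pairs.items():
    print(f"{a} vs {b}: r = {r:.2f}")
print("mean score per document:", means)
```

Four annotators yield six annotator pairs. A rank-based measure such as Spearman correlation would be a reasonable alternative if the annotators use the scale differently.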
- Estonian AI developers
- NLP researchers
- Content analysis platforms
- Government digital services
- Generic LLM vendors
- Manual content review processes
Improved sentiment analysis and content moderation in Estonian-language digital spaces are likely to follow.
This could lead to national-level AI initiatives focusing on developing indigenous language models and applications, fostering digital sovereignty.
The success of such localized datasets might inspire other smaller nations to invest in similar language-specific AI infrastructure, decentralizing AI development.
Read at arXiv cs.CL