YTClickbait21K: Human-Annotated Multimodal Dataset for YouTube Clickbait Detection Across Diverse Channels and Content Categories

arXiv:2606.14780v1 Announce Type: cross
Abstract: Clickbait content on video-sharing platforms poses a significant challenge to information reliability, yet progress in automated detection has been constrained by the lack of large-scale, high-quality multimodal datasets. We present YTClickbait21K, a human-annotated YouTube clickbait dataset comprising 21,238 videos collected from 40 channels across 29 countries, covering diverse content categories such as news, entertainment, education, and gaming. Each sample includes structured metadata (title, description, engagement statistics) along with
The proliferation of AI-generated and algorithmically driven content calls for better tools to identify problematic content such as clickbait, a need this dataset aims to address.
A large-scale, human-annotated multimodal dataset for clickbait detection supports platform integrity and efforts against misinformation, particularly as AI-generated content becomes more sophisticated.
This dataset provides a new resource that could improve the accuracy and generalizability of AI models designed to detect clickbait across the channels, countries, and content categories it covers on YouTube.
- Platforms (e.g., YouTube)
- Content moderation service providers
- AI researchers in content integrity
- Users seeking reliable information
- Content creators using clickbait tactics
- Misinformation spreaders
Improved automated detection of clickbait on video platforms.
Reduced prevalence of misleading titles and thumbnails, potentially improving user trust and experience.
Enhanced overall information quality within online video ecosystems, creating a higher bar for content engagement strategies.
Read at arXiv cs.LG