WAON: A Large-Scale Japanese Image-Text Dataset for Cultural Adaptation in Contrastive Vision-Language Models

arXiv:2510.22276v3. Abstract: Contrastive vision-language models have achieved remarkable progress through large-scale pretraining. Recent work has shown that removing English-only caption filters and pretraining on global data is effective for improving multicultural performance. We study whether such global pretraining is sufficient for culture-specific understanding, or whether further adaptation with natively sourced data can boost performance beyond what global pretraining alone achieves. To enable this investigation, we present WAON, the largest publicly avail
The proliferation of globally trained large language and vision models highlights the need for culturally specific datasets that improve model performance and reduce bias for non-English-speaking populations.
This initiative addresses a critical gap in AI development by providing foundational data for improving multilingual and multicultural AI capabilities, moving beyond English-centric model training.
The availability of large-scale, natively sourced cultural datasets like WAON could change how contrastive vision-language models are trained, enabling better regional adaptation and understanding.
- Japanese AI developers
- Multilingual AI users
- Data localization initiatives
- Cultural content creators
- English-only AI models
- Tech companies ignoring cultural data
- Monolingual AI research
Increased accuracy and relevance of AI systems for Japanese language and culture.
Accelerated development of localized AI models in other non-English-speaking regions that follow similar data collection strategies.
Enhanced sovereignty over AI development as nations cultivate their own datasets and reduce reliance on externally trained models.
Read at arXiv cs.CL