Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

arXiv:2606.07032v1 Announce Type: cross Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent referen
As AI models become more general-purpose, assessing their capabilities and limitations accurately requires benchmarks that are genuinely zero-shot, meaning they are built from data the models have not already seen in training.
Better benchmarks for Zero-Shot Composed Image Retrieval (ZS-CIR) matter for building retrieval systems that behave reliably on real-world, unseen images, and for exposing biases inherited from training data.
ZeroSight offers a more rigorous evaluation framework for ZS-CIR: it challenges models such as CLIP, which were trained on public image datasets, with data intended to fall outside their training sets.
- AI researchers
- Model developers
- Zero-shot learning
- Computer vision
- Overfit AI models
- Biased datasets
Evaluation of zero-shot image retrieval capabilities should become more accurate and reliable.
This drives the development of more generalized and less dataset-dependent AI models for understanding image compositions.
These advances could accelerate autonomous systems' ability to interpret novel visual instructions and improve human-AI interaction in complex environments.
Read at arXiv cs.AI