
arXiv:2512.11982v2. Abstract: Finding scientifically interesting phenomena through slow manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that curre
The proliferation of very large datasets from new-generation telescopes, combined with advances in Vision-Language Models (VLMs) and astronomical foundation models, makes this approach feasible now.
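The abstract says the image model is "contrastively aligned" with the embedded descriptions but, as excerpted here, does not name the objective. The sketch below assumes a CLIP-style symmetric InfoNCE loss, which is one common choice and not necessarily the authors'. The function names, the toy vectors, and the temperature of 0.07 are all illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style loss: pair i of images and captions should match.

    image_embs[i] and text_embs[i] are embeddings of the same galaxy;
    every other pairing in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every caption, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: two galaxies with hypothetical 2-d embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(images, matched_captions)
loss_swapped = symmetric_contrastive_loss(images, swapped_captions)
```

In this toy batch the matched pairing yields a much lower loss than the swapped one. That difference is the signal a training loop would use to pull each image embedding toward its own caption.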
This work demonstrates a scalable method for semantic search over very large astronomical image datasets, removing the manual-labeling bottleneck that slows exploration of such data.
The ability to semantically search unlabeled galaxy images using AI-generated captions shifts astronomical data analysis from slow, manual processes to rapid, AI-driven exploration.
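Once images and text share an embedding space, semantic search reduces to ranking stored image embeddings by their similarity to an embedded query. The sketch below assumes cosine similarity and a brute-force scan. The galaxy identifiers, the 3-d vectors, and the `search` helper are hypothetical, and a real system over billions of images would likely use an approximate nearest-neighbor index instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, index, k=3):
    """Return the k (galaxy_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_emb, emb), gid) for gid, emb in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(gid, score) for score, gid in scored[:k]]

# Hypothetical index: galaxy id -> image embedding in the shared space.
index = {
    "galaxy_a": [0.9, 0.1, 0.0],
    "galaxy_b": [0.0, 1.0, 0.2],
    "galaxy_c": [0.7, 0.6, 0.1],
}

# Hypothetical embedding of a text query, produced by the text encoder.
query = [1.0, 0.0, 0.0]

top_hits = search(query, index, k=2)
top_ids = [gid for gid, _ in top_hits]
```

Here `top_ids` lists `galaxy_a` first and `galaxy_c` second, because their embeddings point most nearly in the query's direction.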
- Astronomical research institutions
- AI model developers
- Astrophysicists
- Space agencies
- Traditional manual image labeling services
Scientific discovery in astronomy accelerates dramatically as researchers can quickly identify novel phenomena in massive datasets.
New astrophysical theories and understandings emerge faster due to the improved ability to explore and correlate astronomical data.
The methodology is adopted across other scientific domains dealing with large, unlabeled image datasets, becoming a standard for scientific data exploration.
Read at arXiv cs.LG