NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

arXiv:2603.12824v2. Abstract: Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths
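To make the asymmetry concrete, here is a minimal Python sketch; it is not the paper's implementation. It assumes page embeddings were computed offline by a large VLM document encoder and that a small text-only encoder maps queries into the same vector space. `DOC_INDEX`, `TOY_WORD_VECTORS`, `encode_query`, and `search` are hypothetical names, and the 3-dimensional vectors are made-up stand-ins rather than real model outputs.

```python
import math

# Offline path (assumed): embeddings a large VLM document encoder would
# produce for each page image. These vectors are invented for illustration.
DOC_INDEX = {
    "invoice_page": [0.9, 0.1, 0.0],
    "chart_page": [0.1, 0.9, 0.1],
    "contract_page": [0.0, 0.2, 0.9],
}

# Online path (assumed): a toy stand-in for a small text-only query encoder.
# A real student model would be trained so that its query embeddings land in
# the same space as the document embeddings above.
TOY_WORD_VECTORS = {
    "invoice": [1.0, 0.0, 0.0],
    "total": [0.8, 0.2, 0.0],
    "revenue": [0.1, 1.0, 0.0],
    "chart": [0.0, 1.0, 0.0],
    "termination": [0.0, 0.0, 1.0],
    "clause": [0.0, 0.1, 1.0],
}


def encode_query(text):
    """Sum the toy word vectors of known words; unknown words contribute nothing."""
    vec = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        for i, value in enumerate(TOY_WORD_VECTORS.get(word, [0.0, 0.0, 0.0])):
            vec[i] += value
    return vec


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, top_k=1):
    """Rank the precomputed page embeddings against the text-only query embedding."""
    q = encode_query(query)
    ranked = sorted(DOC_INDEX, key=lambda name: cosine(q, DOC_INDEX[name]), reverse=True)
    return ranked[:top_k]
```

In this sketch only `encode_query` and a similarity lookup run at query time, so the query path never touches the document encoder. With these toy vectors, `search("invoice total")` ranks `invoice_page` first.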
The proliferation of large vision-language models (VLMs) and the increasing computational demands of their deployment are driving innovation in model distillation and efficiency.
NanoVDR targets the query-side computational bottleneck of VLM-based retrievers, potentially enabling broader, cheaper, and faster deployment of visual document search.
If a much smaller, text-only query encoder can preserve retrieval quality, query latency, GPU dependence, and inference costs could all fall substantially.
- AI infrastructure providers (optimized for smaller models)
- Cloud computing users
- Enterprises with large visual document repositories
- Edge AI applications
- Companies reliant on selling large, inefficient VLM inference
- Legacy document retrieval systems
Reduced operational costs and improved real-time performance for visual document retrieval systems, making advanced search more accessible.
Accelerated adoption of visual document understanding in various industries, from legal to healthcare, due to lower resource requirements.
Further research into asymmetric model architectures for other multimodal AI tasks, potentially leading to a new paradigm of efficiency-driven AI system design.
Read at arXiv cs.LG