Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

arXiv:2605.11374v3. Abstract: Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Since modern embedding models are distilled from LLM backbones, a frozen encoder should benefit from extra inference compute without retraining. An agentic program-search loop explores 144 candidate programs over a frozen encoder API and produces twelve Pareto-optimal programs spanning cost ratios from $c=1.2$ to $14.7$ over the single-pass baseline. The search independently rediscovers Rocchio pseudo-relevance feedback,
The paper argues that agentic program generation can improve the retrieval performance of frozen embedding models without retraining, making them more competitive. The work reflects the ongoing push to get more efficiency and capability out of existing AI models and infrastructure.
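The abstract describes keeping only the Pareto-optimal programs, trading cost ratio against retrieval quality. A minimal sketch of that kind of filter follows; the function name `pareto_front`, the tuple layout, and every number in the example are hypothetical illustrations, not values or code from the paper.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate dominates.

    Each candidate is (name, cost_ratio, quality). Lower cost is better,
    higher quality is better. This layout is an assumption for illustration.
    """
    front = []
    for name, cost, quality in candidates:
        dominated = any(
            # Another candidate is at least as good on both axes
            # and strictly better on at least one.
            (c2 <= cost and q2 >= quality) and (c2 < cost or q2 > quality)
            for _, c2, q2 in candidates
        )
        if not dominated:
            front.append((name, cost, quality))
    # Sort by cost so the front reads from cheapest to most expensive.
    return sorted(front, key=lambda item: item[1])


# Hypothetical candidates: cost ratio is relative to a single-pass baseline.
example = [
    ("baseline", 1.0, 0.50),
    ("a", 1.2, 0.55),
    ("b", 2.0, 0.54),  # costs more than "a" and scores lower, so it is dominated
    ("c", 3.0, 0.60),
]
front_names = [name for name, _, _ in pareto_front(example)]
print(front_names)
```

With these made-up numbers, candidate "b" is dropped because "a" is both cheaper and better, while the other three each represent a distinct cost and quality trade-off.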
By using existing models more effectively, this approach could reduce the compute and energy costs of building high-performing AI systems and help smaller organizations compete. It also suggests ways to improve model performance beyond traditional retraining.
The paper challenges the assumption that small embedding models cannot benefit from additional test-time compute, opening new avenues for efficiency and performance gains, particularly in retrieval tasks. This could broaden access to advanced AI capabilities by lowering development and operational costs.
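To make the idea of spending extra test-time compute on a frozen encoder concrete, here is a minimal sketch of Rocchio pseudo-relevance feedback, the classical technique the abstract says the search rediscovers. It assumes document and query embeddings have already been produced by some frozen encoder; the helper names, the weights `alpha` and `beta`, and the toy vectors are hypothetical, and this is not the paper's generated program.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(d))) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def rocchio_prf(query_vec, doc_vecs, k=1, alpha=1.0, beta=1.0):
    """Second-pass ranking with a query shifted toward the top-k centroid.

    The encoder is never retrained: only the query vector is adjusted,
    at the price of one extra scoring pass over the documents.
    """
    first_pass = rank(query_vec, doc_vecs)
    top = [normalize(doc_vecs[i]) for i in first_pass[:k]]
    if not top:
        return first_pass
    dim = len(query_vec)
    centroid = [sum(d[j] for d in top) / len(top) for j in range(dim)]
    q = normalize(query_vec)
    refined = [alpha * q[j] + beta * centroid[j] for j in range(dim)]
    return rank(refined, doc_vecs)


# Toy 2-D embeddings standing in for frozen-encoder outputs (hypothetical).
docs = [[0.6, 0.8], [0.7, -0.714], [0.8, 0.6], [0.5, 0.866]]
query = [1.0, 0.0]

single_pass = rank(query, docs)
with_feedback = rocchio_prf(query, docs, k=1, alpha=1.0, beta=1.0)
print(single_pass, with_feedback)
```

In this toy case the top document stays first, but the second pass demotes the document pointing away from it (index 1) from second place to last, which is the kind of reordering pseudo-relevance feedback is meant to produce. Setting `beta=0.0` recovers the single-pass ranking.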
- AI researchers and developers looking for cost-effective model improvements
- Smaller AI companies and startups
- Deep learning infrastructure providers focusing on optimization
- Sectors heavily reliant on efficient information retrieval
- Companies whose competitive advantage relies solely on massive retraining budgets
- Inefficient AI development pipelines
- Those slow to adopt agentic optimization techniques
Existing frozen embedding models could see substantial performance improvements without the need for additional data or retraining.
This could lead to a proliferation of more efficient and capable AI applications, broadening the accessibility of advanced AI systems.
Increased efficiency in AI model utilization may reduce overall computational demands, potentially impacting the trajectory of AI hardware scaling and energy consumption.
Read at arXiv cs.LG