
arXiv:2602.15257v3 Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several strong models of this kind are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context (LC) evaluations and ablations to br
As Vision-Language Models (VLMs) continue to scale, systematic studies of long-context training are needed: current open-weight models such as Qwen3 VL and GLM 4.5/6V are released without reproducible training recipes or data pipelines.
The study offers insight into reproducible training and optimization of long-context VLMs, capabilities that underpin more advanced multimodal AI and more sophisticated agentic systems.
By systematically studying continued pretraining, supervised finetuning, and preference optimization, the work puts the development of long-context VLMs on a more rigorous footing.
- AI researchers
- developers of long-document VQA systems
- enterprises using multimodal AI
- VLM developers without strong research capabilities
- models optimized solely for short contexts
Improved performance and broader application of Vision-Language Models in complex visual document understanding.
Accelerated development of AI agents capable of processing and reasoning over vast quantities of multimodal information.
Enhanced automation of tasks requiring deep visual and textual understanding, impacting sectors from legal to scientific research.
Read at arXiv cs.AI