Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models

arXiv:2606.14507v1 (announce type: new)

Abstract: Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware F1@0.3 from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop remo
Research on fine-tuning vision-language models is moving quickly, and studies like this one are surfacing new findings about model behavior and control.
The paper fine-tunes a vision-language model for better visual grounding, then measures and controls an unintended side effect, output repetition, which matters for robust AI applications.
More effective fine-tuning of vision-language models could improve performance on visual tasks and give deeper insight into model behavior and control surfaces.
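The abstract quotes two repeated-tail statistics, a duplicate rate and a max repeat, but the excerpt does not define them. The sketch below shows one plausible way to compute such statistics over a decoded coordinate list. It assumes (this is not confirmed by the source) that the duplicate rate is the fraction of entries that exactly repeat an earlier entry, and that max repeat is the longest run of consecutive identical entries. The function name `repeat_stats` and the `(label, x1, y1, x2, y2)` entry layout are hypothetical.

```python
from collections import Counter
from itertools import groupby


def repeat_stats(entries):
    """Return (duplicate_rate, max_repeat) for a list of hashable entries.

    Assumed definitions (not taken from the paper):
      duplicate_rate = share of entries that exactly repeat an earlier entry
      max_repeat     = length of the longest run of consecutive identical entries
    """
    if not entries:
        return 0.0, 0
    unique = len(Counter(entries))
    duplicate_rate = (len(entries) - unique) / len(entries)
    max_repeat = max(len(list(run)) for _, run in groupby(entries))
    return duplicate_rate, max_repeat


# Hypothetical decoded output: one cat box, then the same dog box emitted three times.
example = [
    ("cat", 10, 20, 50, 60),
    ("dog", 5, 5, 30, 40),
    ("dog", 5, 5, 30, 40),
    ("dog", 5, 5, 30, 40),
]
print(repeat_stats(example))
```

Under these assumed definitions, a model that loops on its final object raises both numbers at once, which is why the two are worth tracking together when checking for repeated tails.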
- AI/ML Research Institutions
- Developers of Vision-Language Models
- AI Application Developers
- Platforms with brittle AI integrations
- Developers reliant on out-of-the-box model performance
Improved visual grounding in AI models will enhance applications requiring precise object recognition and interaction.
Better understanding of model 'interference surfaces' will lead to more reliable and controllable AI systems across various domains.
The development of more predictable and adaptable AI models could accelerate the deployment of autonomous AI agents in real-world scenarios.
Read at arXiv cs.AI