arXiv:2605.28160v1 Announce Type: new Abstract: Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to syste
Source: arXiv cs.AI — read the full report at the original publisher.
