arXiv:2510.12784v2 Announce Type: replace-cross Abstract: Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a model's strong visual understanding often fails to transfer to visual generation: it may correctly judge prompt-image alignment while failing to generate a faithful image from the same prompt. This raises a compelling question: Can a model improve itself by using its understanding module to reward its generation module? We introduce SRUM, a self-r

Source: arXiv cs.CL — read the full report at the original publisher.

This is a curated wire item. The Continuum Brief does not republish full third-party articles; this entry links to the original source.