arXiv:2511.18775v2 Announce Type: replace-cross Abstract: Virtual Try-On (VTON) synthesizes realistic images of a person wearing a target garment, with broad applications in e-commerce and fashion. Diffusion-based dual-UNet methods achieve strong results but double the parameters by dedicating a separate network to garment conditioning. Spatial concatenation offers a simpler single-network alternative, yet both UNet- and DiT-based instantiations report that full fine-tuning is ineffective, and the community has settled for attention-only training. We ask: why does full fine-tuning fail, and ca
Source: arXiv cs.AI — read the full report at the original publisher.
