
arXiv:2604.09429v4 Abstract: Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and do camera-controlled video generation within a single fram
The paper leverages advances in video diffusion models to tackle a long-standing computer vision problem, integrating camera pose estimation with video generation in a unified framework.
This research represents a significant step towards more robust and generalizable AI systems capable of understanding and generating consistent 3D scenes from limited 2D data, bridging computer vision and graphics.
The traditional separation between camera parameter recovery and scene rendering gives way to a single joint model, which could enable more coherent and robust 3D scene understanding and synthesis from video.
- AI researchers (computer vision, graphics)
- Video game developers
- Metaverse platforms
- 3D content creation industries
- Fragmented 3D reconstruction pipelines
- Manual camera tracking workflows
Improved 3D reconstruction and novel view synthesis from sparse or ambiguous video data.
More realistic and controllable AI-generated video content with consistent camera motion and scene structure.
Accelerated development of virtual worlds and augmented reality applications requiring dynamic 3D understanding from real-world footage.
Read at arXiv cs.LG