SIGNAL · AI · Jun 4, 2026, 4:00 AM · Signal 75 · Medium term

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

arXiv:2506.06006v3 · Announce Type: replace-cross

Abstract: Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP

Why this matters

Why now

This paper highlights a current limitation of unified VLMs in physical world modeling, at a time when research into embodied AI and agents is progressing rapidly.

Why it’s important

Understanding the intrinsic difficulties and potential bootstrapping methods for AI to grasp real-world physics is crucial for developing robust general AI agents and humanoid robotics.

What changes

The focus might shift from directly training forward dynamics in VLMs to leveraging inverse dynamics as an easier, intermediate step for building world models.
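To make the forward/inverse distinction concrete, here is a minimal toy sketch in Python. It is not the paper's method: the abstract above is cut off before it says how IDP is used, so the bootstrapping step shown here (labeling unlabeled transitions with an inverse model, then fitting a forward model on those labels) is only an assumed illustration. States are integers rather than images, the action vocabulary is invented, and the inverse model is a hand-written rule standing in for a fine-tuned VLM. All function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy 1-D world (an assumption for illustration): a state is an integer
# position and an action is a word, standing in for images and language.
ACTIONS = {"left": -1, "right": +1, "stay": 0}


def inverse_dynamics(prev_state: int, next_state: int) -> str:
    """IDP: name the action that explains the change between two states.

    Hand-written here; in the setting described above this role would be
    played by a learned model.
    """
    delta = next_state - prev_state
    for name, step in ACTIONS.items():
        if step == delta:
            return name
    raise ValueError(f"no single action explains delta={delta}")


def pseudo_label(unlabeled_pairs):
    """Attach an inferred action to each unlabeled (prev, next) pair."""
    return [(s, inverse_dynamics(s, s2), s2) for s, s2 in unlabeled_pairs]


def fit_forward_model(labeled):
    """FDP: learn the most common state change observed for each action."""
    deltas = defaultdict(Counter)
    for s, a, s2 in labeled:
        deltas[a][s2 - s] += 1
    table = {a: c.most_common(1)[0][0] for a, c in deltas.items()}
    return lambda state, action: state + table[action]


# Transitions observed without action labels.
unlabeled = [(0, 1), (1, 2), (2, 1), (1, 1), (5, 4)]
forward = fit_forward_model(pseudo_label(unlabeled))
print(forward(10, "right"))  # 11
```

The point of the sketch is the data flow only: the inverse model turns raw transitions into (state, action, next state) triples, which is the supervision a forward model needs.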

Winners

· AI researchers focused on inverse dynamics
· Developers of embodied AI systems
· Companies investing in reinforcement learning from human feedback or physical in

Losers

· Approaches solely reliant on direct forward dynamics prediction in VLMs
· Companies prematurely deploying VLMs for complex physical prediction tasks

Second-order effects

Direct

This research suggests a more effective pathway for VLMs to acquire world models by focusing on inverse dynamics first.

Second

Improved world models could accelerate the development of more capable and reliable AI agents able to interact with complex environments.

Third

More capable AI agents could lead to breakthroughs in areas requiring fine-grained physical manipulation and understanding, including advanced manufacturing and humanoid robotics applications.

Editorial confidence: 90 / 100 · Structural impact: 60 / 100

Original report

This signal links to a primary source. Continuum Brief monitors and indexes it as part of the live intelligence stream — we do not republish source content.

Read at arXiv cs.AI

#cs.CV #cs.AI #cs.CL

Tracked by The Continuum Brief · live intelligence network
