EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

arXiv:2605.23610v1 (announce type: cross)

Abstract: Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible
Growing demand for consistent and efficient video generation, especially across multiple shots, is driving more sophisticated memory mechanisms in generative models.
The work targets both the efficiency and the cross-shot consistency of multi-shot video generation, which matter for applications in media, simulation, and creative AI.
Moving from full-frame to entity-centric memory is meant to cut computational cost and improve consistency by separating persistent entity information from transient scene context.
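To make the idea of an entity-indexed bank concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the class name `EntityMemoryBank`, the per-entity patch budget, and the use of plain float lists as stand-ins for latent patches are all hypothetical. It only shows the indexing idea, where a shot reads back patches for the entities it mentions rather than whole earlier frames.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Stand-in for one latent patch (a flattened token vector); an assumption for illustration.
Patch = List[float]


class EntityMemoryBank:
    """Hypothetical entity-indexed store of latent patches."""

    def __init__(self, max_patches_per_entity: int = 4) -> None:
        if max_patches_per_entity < 1:
            raise ValueError("max_patches_per_entity must be at least 1")
        self.max_patches_per_entity = max_patches_per_entity
        self._bank: Dict[str, List[Patch]] = defaultdict(list)

    def write(self, entity_id: str, patches: Sequence[Patch]) -> None:
        """Append patches for one entity, keeping only the most recent ones."""
        stored = self._bank[entity_id]
        stored.extend(list(p) for p in patches)
        # Drop the oldest patches once the per-entity budget is exceeded.
        overflow = len(stored) - self.max_patches_per_entity
        if overflow > 0:
            del stored[:overflow]

    def read(self, entity_ids: Sequence[str]) -> List[Patch]:
        """Return patches only for the entities named in the current shot."""
        tokens: List[Patch] = []
        for entity_id in entity_ids:
            # Unknown entities contribute nothing; no scene context is stored at all.
            tokens.extend(self._bank.get(entity_id, []))
        return tokens


bank = EntityMemoryBank(max_patches_per_entity=2)
bank.write("knight", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
bank.write("dragon", [[0.9, 0.8]])

# A shot that mentions only the knight conditions on the knight's patches alone.
conditioning = bank.read(["knight"])
print(len(conditioning))  # 2
```

In this sketch the conditioning set grows with the number of entities in a shot and the per-entity budget, not with the number of earlier frames, which is the intuition behind the efficiency argument above.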
- AI video generation platforms
- Content creators
- Gaming industry
- Simulation developers
- Methods relying on full-frame memory
- Less efficient video generation techniques
More realistic and consistent AI-generated multi-shot videos become feasible.
The cost and computational requirements of producing high-quality AI video content decrease, which could broaden access.
New forms of media and interactive experiences emerge that are highly personalized and contextually aware due to scalable, entity-consistent AI generation.