MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

arXiv:2502.10886v3 Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using three domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance ga
The rapid advancement of vision-language models necessitates robust evaluation methods to identify and address their limitations, particularly in complex multimodal reasoning tasks.
This benchmark highlights a critical bottleneck in current AI capabilities: models cannot yet consistently track entities across modalities, an ability that is fundamental to the reliable deployment of advanced AI systems.
The explicit identification of performance gaps in multimodal entity tracking will drive research and development towards more robust and generalizable AI models capable of coherent world modeling.
- AI research institutions
- Developers of multimodal AI
- Industries reliant on advanced automation
- Models with poor multimodal integration
- Developers neglecting entity tracking
- Applications demanding high reliability in dynamic environments
MET-Bench will become a standard for evaluating multimodal AI, pushing models to improve their contextual understanding.
Improved entity tracking will accelerate the development of more capable AI agents and robotic systems that operate effectively in the real world.
Enhanced world modeling capabilities in AI could lead to breakthroughs in areas currently limited by AI's contextual awareness, such as autonomous vehicles or complex scientific discovery.
Read at arXiv cs.CL