A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models

Nuo Chen, Lulin Liu, Zihao Li, Ziyao Zeng, Zihao Zhu, Wenyan Cong, Junyuan "Jason" Hong, Yunhao Yang, Zhengzhong Tu, Yan Wang, Boris Ivanovic, Marco Pavone, Zhangyang Wang, Yang Zhou, Zhiwen Fan

May 2026

Abstract

Generative world models hold promise as scalable simulators for autonomous systems, particularly for rare safety-critical multi-agent interactions such as vehicle collisions. However, current evaluation paradigms focus heavily on visual fidelity and semantic alignment, leaving a critical blind spot: they rarely quantify whether generated dynamics obey the physical laws required for reliable simulation. To bridge this gap, we introduce CrashTwin, a physics-grounded evaluation framework designed to stress-test the physical trustworthiness of world models. CrashTwin combines 25K synthetic sequences and 12K real-world crash sequences with a calibration-free reconstruction pipeline that recovers metric-scale physical attributes from uncalibrated videos. We evaluate spatio-temporal consistency, momentum and energy conservation, and world-dynamics integrity. Benchmarking representative world models reveals that high perceptual quality can mask severe physical violations during complex interactions. By quantitatively exposing these failure modes, CrashTwin provides a diagnostic tool for developing physically grounded world models capable of reliable real-world simulation.
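
As a rough illustration of what a momentum and energy conservation check can look like, the sketch below compares total linear momentum and total kinetic energy before and after a collision. It is not CrashTwin's actual metric: the function names, the per-agent mass and velocity inputs, and the normalization are all assumptions made for this example, and the check ignores external forces such as tire friction that matter in real crashes.

```python
import math

# Hypothetical helpers for illustration only; not taken from CrashTwin.

def total_momentum(masses, velocities):
    """Sum of m * v over all agents, returned per axis."""
    dims = len(velocities[0])
    return [sum(m * v[d] for m, v in zip(masses, velocities)) for d in range(dims)]

def total_kinetic_energy(masses, velocities):
    """Sum of 0.5 * m * |v|^2 over all agents."""
    return sum(0.5 * m * sum(c * c for c in v) for m, v in zip(masses, velocities))

def momentum_residual(masses, v_before, v_after, eps=1e-9):
    """Relative change in total momentum; 0 means momentum is conserved."""
    p0 = total_momentum(masses, v_before)
    p1 = total_momentum(masses, v_after)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p0)))
    return diff / max(math.sqrt(sum(a * a for a in p0)), eps)

def kinetic_energy_gain(masses, v_before, v_after, eps=1e-9):
    """Relative kinetic energy change; a positive value means energy appeared
    from nowhere, which a passive collision should never produce."""
    e0 = total_kinetic_energy(masses, v_before)
    e1 = total_kinetic_energy(masses, v_after)
    return (e1 - e0) / max(e0, eps)

# Two vehicles (kg, m/s): a moving car hits a stationary one.
masses = [1500.0, 1000.0]
v_before = [(20.0, 0.0), (0.0, 0.0)]

# Plausible outcome: perfectly inelastic collision, both move at 12 m/s.
v_plausible = [(12.0, 0.0), (12.0, 0.0)]

# Implausible outcome: both vehicles leave at the striking car's speed.
v_implausible = [(20.0, 0.0), (20.0, 0.0)]

plausible = (momentum_residual(masses, v_before, v_plausible),
             kinetic_energy_gain(masses, v_before, v_plausible))
implausible = (momentum_residual(masses, v_before, v_implausible),
               kinetic_energy_gain(masses, v_before, v_implausible))
```

In the plausible case the momentum residual is zero and kinetic energy drops, as expected when a crash dissipates energy through deformation. In the implausible case both quantities increase, the kind of violation that can go unnoticed when a generated video is judged on appearance alone.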

Type

Preprint

Publication

CVPR 2026 Workshop on Foundations of Multi-agent Evaluation and Analysis (FMEA)