A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models

Abstract

Generative world models hold promise as scalable simulators for autonomous systems, particularly for rare safety-critical multi-agent interactions such as vehicle collisions. However, current evaluation paradigms focus heavily on visual fidelity and semantic alignment, leaving a critical blind spot: they rarely quantify whether generated dynamics obey the physical laws required for reliable simulation. To bridge this gap, we introduce CrashTwin, a physics-grounded evaluation framework designed to stress-test the physical trustworthiness of world models. CrashTwin combines 25K synthetic sequences and 12K real-world crash sequences with a calibration-free reconstruction pipeline that recovers metric-scale physical attributes from uncalibrated videos. We evaluate three aspects of generated dynamics: spatio-temporal consistency, momentum and energy conservation, and world-dynamics integrity. Benchmarking representative world models reveals that high perceptual quality can mask severe physical violations during complex interactions. By quantitatively exposing these failure modes, CrashTwin provides a diagnostic tool for developing physically grounded world models capable of reliable real-world simulation.
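
To make the idea of a momentum and energy conservation check concrete, the following is a minimal Python sketch. It is not CrashTwin's metric, which the abstract does not specify. It assumes that per-agent masses and metric-scale velocities just before and just after a collision are already available, and the function names `momentum_violation` and `kinetic_energy_gain` are hypothetical, chosen only for illustration.

```python
import numpy as np

def momentum_violation(masses, v_before, v_after):
    """Relative change in total linear momentum across a collision.

    Hypothetical helper, not taken from CrashTwin. Inputs are per-agent
    masses (kg) and velocity vectors (m/s) with shape (n_agents, dims).
    Returns 0.0 when total momentum is exactly conserved.
    """
    m = np.asarray(masses, dtype=float)[:, None]
    p_before = (m * np.asarray(v_before, dtype=float)).sum(axis=0)
    p_after = (m * np.asarray(v_after, dtype=float)).sum(axis=0)
    # Guard against division by zero when the system starts at rest.
    scale = max(float(np.linalg.norm(p_before)), 1e-9)
    return float(np.linalg.norm(p_after - p_before) / scale)

def kinetic_energy_gain(masses, v_before, v_after):
    """Change in total translational kinetic energy (J) across a collision.

    Assumption: with no external energy input during the impact, a
    positive value signals a physically implausible outcome.
    """
    m = np.asarray(masses, dtype=float)
    ke_before = 0.5 * (m * (np.asarray(v_before, dtype=float) ** 2).sum(axis=1)).sum()
    ke_after = 0.5 * (m * (np.asarray(v_after, dtype=float) ** 2).sum(axis=1)).sum()
    return float(ke_after - ke_before)

# Two 1000 kg vehicles: one moving at 10 m/s strikes one at rest.
masses = [1000.0, 1000.0]
v_before = [[10.0, 0.0], [0.0, 0.0]]

# Plausible outcome: perfectly inelastic collision, both move at 5 m/s.
plausible = [[5.0, 0.0], [5.0, 0.0]]
# Implausible outcome: both vehicles leave the impact at 10 m/s.
implausible = [[10.0, 0.0], [10.0, 0.0]]

print(momentum_violation(masses, v_before, plausible))
print(kinetic_energy_gain(masses, v_before, plausible))
print(momentum_violation(masses, v_before, implausible))
print(kinetic_energy_gain(masses, v_before, implausible))
```

In this sketch the plausible outcome conserves momentum and loses kinetic energy, as an inelastic crash should, while the implausible outcome doubles the total momentum and gains energy. A generated video could look realistic in either case, which is the kind of failure a physics-grounded score is meant to surface.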

Publication
CVPR 2026 Workshop on Foundations of Multi-agent Evaluation and Analysis (FMEA)
Junyuan "Jason" Hong
Incoming Assistant Professor

My research interest lies at the intersection of responsible AI and healthcare.