Can World Simulators Reason?

Gen-ViRe

A Generative Visual Reasoning Benchmark

Xinxin Liu¹*, Zhaopan Xu²*, Ming Li¹, Kai Wang², Yong Jae Lee³, Yuzhang Shang¹†

¹University of Central Florida ²National University of Singapore ³UW-Madison

*Equal contribution †Corresponding author

TL;DR

Recent video generation models are emerging as potential world simulators, yet existing evaluations focus primarily on visual fidelity rather than reasoning capability. In this work, we introduce Gen-ViRe, a comprehensive benchmark designed to assess Chain-of-Frames (CoF) reasoning: the ability to solve complex tasks through continuous frame-by-frame visual simulation. Evaluating seven state-of-the-art commercial models on 24 tasks spanning six cognitive dimensions, we find that Sora-2 currently achieves the best overall performance in generative visual reasoning. However, our analysis reveals a critical nuance: despite its overall lead, even Sora-2 exhibits notable deficits in spatio-temporal reasoning, particularly in the Spatial Obstacle task. This suggests that abstract logical reasoning and physical-world simulation function as distinct capabilities in video generation models. Gen-ViRe establishes the baselines needed to guide the transition from video generators to genuine world simulators.

Visualization

Sora-2 demonstrates strong logic in Sudoku, yet fails to follow basic physical laws.

In the Sudoku task, Sora-2 exhibits an emergent, human-like thinking process. The model first places a question mark (?) as a placeholder for the unknown value in the third row, which suggests it can hold an internal state of the problem ("this cell is unsolved"). After the placeholder appears, the model generates frames that simulate numbers (2) "moving" into their correct, logically deduced positions.

Sora-2 masters abstract algorithms (e.g., Sudoku) yet fails at basic physical laws (e.g., object permanence). This demonstrates that proficiency in reasoning tasks does not guarantee genuine understanding of physical world dynamics in video generation models.

A spectrum of tool-use attempts across 7 SOTA models: Progressing from ignoring the tool to physically flawed execution.

Input: A person opens the box. (Fixed camera angle.)

Kling-v1: No interaction
Seedance-1.0-Lite: The tool is ignored
Seedance-1.0-Pro: The tool is ignored
Wan-2.5: The tool is ignored
Veo-3.1: The tool is deformed
Hailuo-2.3: The tool is deformed
Sora-2: Incorrect use of tool

Gen-ViRe Evaluation

Left: The main chart compares the overall performance of the 7 state-of-the-art models across the six core cognitive dimensions. Right: The six sub-charts provide a detailed performance breakdown for the individual subtasks within each dimension.

Leaderboard

Methods	#Videos	Avg.	Abstract	Algorithmic & Logical	Analogy	Perceptual	Planning	Spatio-Temporal
Kling-v1	360	0.198	0.071	0.057	0.117	0.140	0.443	0.359
Seedance-1.0-Lite	360	0.279	0.087	0.256	0.083	0.146	0.572	0.532
Seedance-1.0-Pro	360	0.301	0.154	0.164	0.083	0.171	0.609	0.621
Wan-2.5	360	0.490	0.412	0.411	0.500	0.378	0.702	0.536
Veo-3.1	360	0.486	0.440	0.451	0.367	0.386	0.722	0.550
Hailuo-2.3	360	0.493	0.494	0.355	0.383	0.425	0.778	0.524
👑 Sora-2	360	0.560	0.604	0.472	0.483	0.496	0.768	0.537
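
This page does not define how the Avg. column is computed. The reported values are, however, consistent with an unweighted mean of the six dimension scores, to within rounding. The following is a minimal sketch that checks this assumption and lists the top-scoring model in each dimension. The scores are copied from the table above; the names `LEADERBOARD`, `mean_score`, and `best_per_dimension` are hypothetical helpers introduced for illustration and are not part of any Gen-ViRe release.

```python
# Column order of the six cognitive dimensions in the leaderboard.
DIMENSIONS = [
    "Abstract", "Algorithmic & Logical", "Analogy",
    "Perceptual", "Planning", "Spatio-Temporal",
]

# model -> (reported Avg., per-dimension scores in DIMENSIONS order).
# Values are copied verbatim from the leaderboard table.
LEADERBOARD = {
    "Kling-v1": (0.198, [0.071, 0.057, 0.117, 0.140, 0.443, 0.359]),
    "Seedance-1.0-Lite": (0.279, [0.087, 0.256, 0.083, 0.146, 0.572, 0.532]),
    "Seedance-1.0-Pro": (0.301, [0.154, 0.164, 0.083, 0.171, 0.609, 0.621]),
    "Wan-2.5": (0.490, [0.412, 0.411, 0.500, 0.378, 0.702, 0.536]),
    "Veo-3.1": (0.486, [0.440, 0.451, 0.367, 0.386, 0.722, 0.550]),
    "Hailuo-2.3": (0.493, [0.494, 0.355, 0.383, 0.425, 0.778, 0.524]),
    "Sora-2": (0.560, [0.604, 0.472, 0.483, 0.496, 0.768, 0.537]),
}


def mean_score(scores):
    """Unweighted mean of the dimension scores (an assumed definition of Avg.)."""
    return sum(scores) / len(scores)


def best_per_dimension(table):
    """Return the highest-scoring model for each dimension."""
    return {
        dim: max(table, key=lambda model: table[model][1][i])
        for i, dim in enumerate(DIMENSIONS)
    }


if __name__ == "__main__":
    for model, (reported, scores) in LEADERBOARD.items():
        print(f"{model}: reported={reported:.3f} recomputed={mean_score(scores):.3f}")
    for dim, model in best_per_dimension(LEADERBOARD).items():
        print(f"{dim}: {model}")
```

Under this reading, Sora-2 leads Abstract, Algorithmic & Logical, and Perceptual, while Wan-2.5 leads Analogy, Hailuo-2.3 leads Planning, and Seedance-1.0-Pro leads Spatio-Temporal. This is in line with the observation above that the overall leader is not the strongest model on spatio-temporal tasks.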