Can World Simulators Reason?

Gen-ViRe

A Generative Visual Reasoning Benchmark

Xinxin Liu1*, Zhaopan Xu2*, Ming Li1, Kai Wang2, Yong Jae Lee3, Yuzhang Shang1

1University of Central Florida    2National University of Singapore    3UW-Madison

*Equal contribution    †Corresponding author

TL;DR

Recent video generation models are emerging as potential world simulators, yet existing evaluations focus primarily on visual fidelity rather than reasoning capability. In this work, we introduce Gen-ViRe, a comprehensive benchmark for assessing Chain-of-Frames (CoF) reasoning: the ability to solve complex tasks through continuous frame-by-frame visual simulation. Evaluating 7 state-of-the-art commercial models across six cognitive dimensions spanning 24 tasks, we find that Sora-2 currently achieves the best overall performance in generative visual reasoning. However, our analysis reveals a critical nuance: despite its overall lead, even Sora-2 exhibits notable deficits in spatio-temporal reasoning, particularly on the Spatial Obstacle task. This suggests that abstract logical reasoning and physical-world simulation function as distinct capabilities in video generation models. Gen-ViRe establishes the baselines needed to guide the transition from video generators to genuine world simulators.

Research Overview

Visualizations

Sora-2 demonstrates strong logic in Sudoku, yet fails to follow basic physical laws.

In the Sudoku task, Sora-2 exhibits an emergent, human-like thinking process. The model uses a question mark (?) as a placeholder for the unknown value in the third row, suggesting it can hold an internal state of the problem ("this cell is unsolved"). Following the placeholder, the model generates frames that simulate moving the number 2 into its correct, logically deduced position.
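For intuition, the deduction shown in the video corresponds to the classic "naked single" rule: when a cell's row, column, and box already contain every value but one, the remaining value is forced. The sketch below illustrates that rule on a hypothetical 4x4 grid; the grid and the naked_single function are illustrative only, not part of the benchmark.

```python
# Illustrative "naked single" deduction on a hypothetical 4x4 Sudoku.
# Not benchmark code: a sketch of the inference the video appears to show.

def naked_single(grid, r, c, size=4, box=2):
    """Return the unique legal value for empty cell (r, c), or None."""
    seen = set(grid[r])                                    # row values
    seen |= {grid[i][c] for i in range(size)}              # column values
    br, bc = box * (r // box), box * (c // box)            # box origin
    seen |= {grid[i][j]
             for i in range(br, br + box)
             for j in range(bc, bc + box)}                 # box values
    candidates = set(range(1, size + 1)) - seen
    return candidates.pop() if len(candidates) == 1 else None

grid = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [0, 1, 4, 3],  # 0 marks the unsolved cell, the '?' in the video
    [4, 3, 2, 1],
]
print(naked_single(grid, 2, 0))  # -> 2, the only value not ruled out
```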

Sora-2 masters abstract algorithms (e.g., Sudoku) yet fails at basic physical laws (e.g., object permanence). This demonstrates that proficiency in reasoning tasks does not guarantee genuine understanding of physical world dynamics in video generation models.

A spectrum of tool-use attempts across 7 SOTA models: Progressing from ignoring the tool to physically flawed execution.

Input prompt: "A person opens the box." (Fixed camera angle.)

Model              Observed behavior
Kling-v1           No interaction
Seedance-1.0-Lite  The tool is ignored
Seedance-1.0-Pro   The tool is ignored
Wan-2.5            The tool is ignored
Veo-3.1            The tool is deformed
Hailuo-2.3         The tool is deformed
Sora-2             Incorrect use of the tool

Gen-ViRe Evaluation

Evaluation Framework
Left: The main chart compares the overall performance of the 7 state-of-the-art models across the six core cognitive dimensions. Right: The six sub-charts provide a detailed performance breakdown for the individual subtasks within each dimension.

Leaderboard

Methods            #Videos   Avg.    Abstract   Algorithmic & Logical   Analogy   Perceptual   Planning   Spatio-Temporal
Kling-v1           360       0.198   0.071      0.057                   0.117     0.140        0.443      0.359
Seedance-1.0-Lite  360       0.279   0.087      0.256                   0.083     0.146        0.572      0.532
Seedance-1.0-Pro   360       0.301   0.154      0.164                   0.083     0.171        0.609      0.621
Wan-2.5            360       0.490   0.412      0.411                   0.500     0.378        0.702      0.536
Veo-3.1            360       0.486   0.440      0.451                   0.367     0.386        0.722      0.550
Hailuo-2.3         360       0.493   0.494      0.355                   0.383     0.425        0.778      0.524
👑 Sora-2           360       0.560   0.604      0.472                   0.483     0.496        0.768      0.537
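As a quick sanity check, the Avg. column agrees to within rounding (±0.001) with the unweighted mean of the six dimension scores. The sketch below recomputes it from the rows above; this aggregation rule is inferred from the numbers, not confirmed as the benchmark's official scoring code.

```python
# Recompute Avg. as the unweighted mean of the six dimension scores.
# The rule is inferred from the table; it is not Gen-ViRe's official code.

leaderboard = {
    # model: [Abstract, Algorithmic & Logical, Analogy,
    #         Perceptual, Planning, Spatio-Temporal]
    "Kling-v1":          [0.071, 0.057, 0.117, 0.140, 0.443, 0.359],
    "Seedance-1.0-Lite": [0.087, 0.256, 0.083, 0.146, 0.572, 0.532],
    "Seedance-1.0-Pro":  [0.154, 0.164, 0.083, 0.171, 0.609, 0.621],
    "Wan-2.5":           [0.412, 0.411, 0.500, 0.378, 0.702, 0.536],
    "Veo-3.1":           [0.440, 0.451, 0.367, 0.386, 0.722, 0.550],
    "Hailuo-2.3":        [0.494, 0.355, 0.383, 0.425, 0.778, 0.524],
    "Sora-2":            [0.604, 0.472, 0.483, 0.496, 0.768, 0.537],
}

for model, scores in leaderboard.items():
    print(f"{model:<18} avg = {sum(scores) / len(scores):.3f}")
# Sora-2 leads overall (0.560) but is mid-pack on Spatio-Temporal (0.537),
# where Seedance-1.0-Pro scores highest (0.621).
```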