A Generative Visual Reasoning Benchmark
Recent video generation models are emerging as potential world simulators, yet existing evaluations focus primarily on visual fidelity rather than reasoning capability. In this work, we introduce Gen-ViRe, a comprehensive benchmark designed to assess Chain-of-Frames (CoF) reasoning: the ability to solve complex tasks through continuous frame-by-frame visual simulation. Evaluating seven state-of-the-art commercial models on 24 tasks spanning six cognitive dimensions, we find that Sora-2 currently achieves the best overall performance in generative visual reasoning. However, our analysis reveals a critical nuance: despite its overall lead, even Sora-2 exhibits notable deficits in spatio-temporal reasoning, particularly in the Spatial Obstacle task. This suggests that abstract logical reasoning and physical reality simulation function as distinct capabilities in video generation models. Gen-ViRe establishes the baselines needed to guide the transition from video generators to genuine world simulators.
In the Sudoku task, Sora-2 exhibits an emergent, human-like thinking process. The model first places a question mark (?) as a placeholder for the unknown value in the third row, which suggests that it can hold an internal state of the problem ("this cell is unsolved"). It then generates frames that simulate "moving" numbers (2) into their correct, logically deduced positions.
Sora-2 handles abstract algorithmic tasks (e.g., Sudoku) well yet violates basic physical principles (e.g., object permanence). This indicates that proficiency in reasoning tasks does not guarantee a genuine understanding of physical-world dynamics in video generation models.
| Methods | #Videos | Avg. | Abstract | Algorithmic & Logical | Analogy | Perceptual | Planning | Spatio-Temporal |
|---|---|---|---|---|---|---|---|---|
| Kling-v1 | 360 | 0.198 | 0.071 | 0.057 | 0.117 | 0.140 | 0.443 | 0.359 |
| Seedance-1.0-Lite | 360 | 0.279 | 0.087 | 0.256 | 0.083 | 0.146 | 0.572 | 0.532 |
| Seedance-1.0-Pro | 360 | 0.301 | 0.154 | 0.164 | 0.083 | 0.171 | 0.609 | 0.621 |
| Wan-2.5 | 360 | 0.490 | 0.412 | 0.411 | 0.500 | 0.378 | 0.702 | 0.536 |
| Veo-3.1 | 360 | 0.486 | 0.440 | 0.451 | 0.367 | 0.386 | 0.722 | 0.550 |
| Hailuo-2.3 | 360 | 0.493 | 0.494 | 0.355 | 0.383 | 0.425 | 0.778 | 0.524 |
| 👑 Sora-2 | 360 | 0.560 | 0.604 | 0.472 | 0.483 | 0.496 | 0.768 | 0.537 |
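
The relationship between the per-dimension scores and the Avg. column can be checked with a short script. The sketch below assumes that Avg. is the unweighted mean of the six dimension scores; the source does not state how Avg. is computed, so this is an assumption, and the small gap for Seedance-1.0-Pro (0.300 recomputed versus 0.301 reported) may reflect rounding or a different aggregation. The names `DIMENSIONS`, `SCORES`, `dimension_mean`, and `best_per_dimension` are illustrative and not part of Gen-ViRe.

```python
# Scores copied from the results table; list order follows the table header.
DIMENSIONS = [
    "Abstract", "Algorithmic & Logical", "Analogy",
    "Perceptual", "Planning", "Spatio-Temporal",
]

# model -> (reported Avg., per-dimension scores)
SCORES = {
    "Kling-v1":          (0.198, [0.071, 0.057, 0.117, 0.140, 0.443, 0.359]),
    "Seedance-1.0-Lite": (0.279, [0.087, 0.256, 0.083, 0.146, 0.572, 0.532]),
    "Seedance-1.0-Pro":  (0.301, [0.154, 0.164, 0.083, 0.171, 0.609, 0.621]),
    "Wan-2.5":           (0.490, [0.412, 0.411, 0.500, 0.378, 0.702, 0.536]),
    "Veo-3.1":           (0.486, [0.440, 0.451, 0.367, 0.386, 0.722, 0.550]),
    "Hailuo-2.3":        (0.493, [0.494, 0.355, 0.383, 0.425, 0.778, 0.524]),
    "Sora-2":            (0.560, [0.604, 0.472, 0.483, 0.496, 0.768, 0.537]),
}


def dimension_mean(scores):
    """Unweighted mean over dimensions (assumed aggregation, not confirmed by the source)."""
    return sum(scores) / len(scores)


def best_per_dimension(table):
    """Return the top-scoring model for each dimension."""
    return {
        dim: max(table, key=lambda model: table[model][1][i])
        for i, dim in enumerate(DIMENSIONS)
    }


if __name__ == "__main__":
    for model, (reported, scores) in SCORES.items():
        print(f"{model:18s} reported={reported:.3f} recomputed={dimension_mean(scores):.3f}")
    for dim, model in best_per_dimension(SCORES).items():
        print(f"{dim:22s} best: {model}")
```

Under this assumption the recomputed means agree with the reported Avg. column to within 0.001, and the per-dimension leaders make the abstract's point concrete: Sora-2 leads overall, but the highest Spatio-Temporal score in the table belongs to Seedance-1.0-Pro.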