A Very Big Video Reasoning Suite

We bet on a future in which video reasoning becomes the next fundamental intelligence paradigm after language reasoning, one in which spatiotemporal, embodied experience of the world can be captured more naturally.

locate_twelve_o_clock_arrows
Knowledge training set
The image contains 2 clocks, each with only an hour hand. Exactly one clock has its hour hand pointing to 12 o'clock. First find the single clock pointing to 12 o'clock, then draw a red circle around it. Do not change anything else. Show the complete solution step by step.
return_to_correct_bin
Abstraction training set
Move each item into the bin that matches its color. Only move items, do not change anything else.
grid_obtaining_award
Spatiality training set
The scene shows a 10x10 grid with a green start point, a red end point, and 4 triangle reward items scattered across it. A circular agent starts at the green start point and can move to adjacent cells (up, down, left, right). The agent collects a reward by moving to its cell, and once collected, the reward disappears. Find the shortest path that collects all 4 triangle rewards before reaching the red end point.
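The grid_obtaining_award prompt describes a shortest-path problem in which the agent must visit every reward before finishing. One way to compute a reference answer is a breadth-first search over (cell, collected-rewards) states. The sketch below is illustrative only and is not taken from the benchmark's generator: the function name, the (row, column) coordinate convention, and the reading that the agent may not step onto the end cell until all rewards are collected are assumptions.

```python
from collections import deque


def shortest_reward_path(size, start, end, rewards):
    """Return a shortest list of cells from start to end that collects every reward.

    Assumptions (not from the source): cells are (row, col) tuples on a
    size x size grid, and the end cell may only be entered once all rewards
    have been collected. Returns None if no such path exists.
    """
    reward_index = {cell: i for i, cell in enumerate(rewards)}
    all_collected = (1 << len(rewards)) - 1

    def collect(cell, mask):
        # Set the bit for this cell's reward, if it holds one.
        i = reward_index.get(cell)
        return mask | (1 << i) if i is not None else mask

    start_state = (start, collect(start, 0))
    parents = {start_state: None}  # also serves as the visited set
    queue = deque([start_state])

    while queue:
        state = queue.popleft()
        cell, mask = state
        if cell == end and mask == all_collected:
            # Walk parent links back to the start to rebuild the path.
            path = []
            while state is not None:
                path.append(state[0])
                state = parents[state]
            return path[::-1]
        row, col = cell
        for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (row + d_row, col + d_col)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            nxt_mask = collect(nxt, mask)
            if nxt == end and nxt_mask != all_collected:
                continue  # assumed rule: do not reach the end early
            nxt_state = (nxt, nxt_mask)
            if nxt_state not in parents:
                parents[nxt_state] = state
                queue.append(nxt_state)
    return None
```

Because breadth-first search expands states in order of move count, the first state that sits on the end cell with every reward collected corresponds to a shortest valid path. With 4 rewards on a 10x10 grid there are at most 100 x 16 states, so the search is cheap.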
object_packing
Transformation training set
The scene shows objects on the left side and a container on the right side. Place the objects into the container one by one in the color order: orange - brown. Each object must be placed individually in the exact order specified, and all objects must end up inside the container.
draw_midpoint_perpendicular_line
Perception out-of-domain test set
Draw a red perpendicular line through the middle point between two parallel lines. The line should extend from the upper parallel line to the lower parallel line.
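The expected final frame for draw_midpoint_perpendicular_line can be described with a little geometry. The sketch below is a hypothetical helper, not the benchmark's own code: it assumes each parallel line is given as a segment with two endpoints, and it reads "the middle point between two parallel lines" as the average of the two segment centers. It returns the endpoints of the perpendicular segment that passes through that point and touches both lines.

```python
import math


def perpendicular_segment(upper, lower):
    """Return the two endpoints of the perpendicular through the middle point.

    upper and lower are ((x, y), (x, y)) segments assumed to be parallel.
    The "middle point" is assumed to be the mean of the two segment centers.
    """
    (x1, y1), (x2, y2) = upper
    (x3, y3), (x4, y4) = lower

    # Unit normal to the shared direction of the two lines.
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length

    # Middle point: average of the two segment centers.
    mx = (x1 + x2 + x3 + x4) / 4
    my = (y1 + y2 + y3 + y4) / 4

    # Signed distance from the middle point to each line along the normal.
    t_upper = (x1 - mx) * nx + (y1 - my) * ny
    t_lower = (x3 - mx) * nx + (y3 - my) * ny

    on_upper = (mx + t_upper * nx, my + t_upper * ny)
    on_lower = (mx + t_lower * nx, my + t_lower * ny)
    return on_upper, on_lower
```

For example, with an upper segment from (0, 0) to (10, 0) and a lower segment from (2, 4) to (12, 4), the middle point is (6, 2) and the perpendicular runs from (6, 0) to (6, 4).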

Inference Results

Task Domains
Identify Chinese Character (Knowledge, out-of-domain test set)
Shape Outline Fill (Abstraction, in-domain test set)
Multiple Keys One Door (Spatiality, out-of-domain test set)
Symbol Delete (Transformation, out-of-domain test set)
Attention Shift (Different) (Perception, in-domain test set)
Model Outputs
VBVR-Wan2.2
CogVideoX 1.5
Kling 2.6
LTX-2
Runway Gen-4
Sora 2
Veo 3
Wan 2.2 I2V
Hunyuan I2V

Leaderboard

Categories: Reference, Strong Baseline, Proprietary, Open-source

Rank  Model                 Score
#1    Human                 97.4%
#2    VBVR-Wan2.2           68.5%
#3    Sora 2                54.6%
#4    Veo 3.1               48.0%
#5    Runway Gen-4 Turbo    40.3%
#6    Wan2.2-I2V-A14B       37.1%
#7    Kling 2.6             36.9%
#8    LTX-2                 31.3%
#9    CogVideoX1.5-5B-I2V   27.3%
#9    HunyuanVideo-I2V      27.3%