A diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action.
- Action Simulation Fidelity
- Simulative Reasoning & Planning
- Smoothness Evaluation
- Generation Consistency Evaluation
This section evaluates the ability of video generation models to simulate actions faithfully based on given prompts.
We evaluate multiple state-of-the-art video generation models, categorized into local models and API-based models:
- [Cosmos-Predict1-14B-Video2World](https://github.com/nvidia-cosmos/cosmos-predict1) - GitHub
- [Cosmos-Predict2-14B-Video2World](https://github.com/nvidia-cosmos/cosmos-predict2) - GitHub
- [WAN 2.1-I2V-14B](https://github.com/Wan-Video/Wan2.1) - GitHub
- [WAN 2.2-I2V-A14B](https://github.com/Wan-Video/Wan2.2) - GitHub
- Gen-3
- KLING
- MiniMax-Hailuo
- PAN - Our proprietary model (requires custom endpoint, not yet publicly released)
- Sora 2
- Veo 3
For each model, create a dedicated conda environment and follow the installation instructions from their respective repositories:
- Cosmos-Predict1: Follow setup instructions at https://github.com/nvidia-cosmos/cosmos-predict1 (environment name: `cosmos-predict1`)
- Cosmos-Predict2: Follow setup instructions at https://github.com/nvidia-cosmos/cosmos-predict2 (environment name: `cosmos-predict2`)
- WAN 2.1: Follow setup instructions at https://github.com/Wan-Video/Wan2.1 (environment name: `wan2_1`)
- WAN 2.2: Follow setup instructions at https://github.com/Wan-Video/Wan2.2 (environment name: `wan2_2`)
Download the corresponding checkpoints for each model and place them in the respective directories:
```
thirdparty/
├── cosmos-predict1/checkpoints/
├── cosmos-predict2/checkpoints/
├── wan2_1/checkpoints/
└── wan2_2/checkpoints/
```
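The directory tree above can be created in one step before downloading any checkpoints. The brace expansion assumes a bash-compatible shell:

```shell
# Create the expected checkpoint directories (bash brace expansion):
mkdir -p thirdparty/{cosmos-predict1,cosmos-predict2,wan2_1,wan2_2}/checkpoints
```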
Execute the generation scripts using SLURM for local models:
```bash
# Local Models
sbatch action_simulation_fidelity_scripts/cosmos1.sh
sbatch action_simulation_fidelity_scripts/cosmos2.sh
sbatch action_simulation_fidelity_scripts/wan2_1.sh
sbatch action_simulation_fidelity_scripts/wan2_2.sh
```

For models that use API calls, create a shared environment:
```bash
conda create -n video-api python=3.10 -y
conda activate video-api
pip install -r requirements_api.txt
```

Execute the generation scripts for API-based models:
```bash
# API-based Models
bash action_simulation_fidelity_scripts/gen3.sh
bash action_simulation_fidelity_scripts/kling.sh
bash action_simulation_fidelity_scripts/minimax.sh
bash action_simulation_fidelity_scripts/pan.sh
bash action_simulation_fidelity_scripts/sora2.sh
bash action_simulation_fidelity_scripts/veo3.sh
```

After generating videos for each model, evaluate their action simulation fidelity using GPT-4o:
```bash
python action_simulation_fidelity_scripts/action_simulation_fidelity_eval.py \
    --openai_api_key YOUR_OPENAI_API_KEY \
    --base_path outputs/action_simulation_fidelity/MODEL_NAME \
    --dataset_json datasets/action_simulation_fidelity_subset/samples_subset.json \
    --save_name MODEL_NAME
```

Examples:
```bash
# Evaluate PAN
python action_simulation_fidelity_scripts/action_simulation_fidelity_eval.py \
    --openai_api_key YOUR_KEY \
    --base_path outputs/action_simulation_fidelity/pan \
    --dataset_json datasets/action_simulation_fidelity_subset/samples_subset.json \
    --save_name pan
```

Results will be saved in `outputs/action_simulation_fidelity/MODEL_NAME/MODEL_NAME_results.json`.
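If you want a single aggregate number per model, the per-sample results file can be reduced with a few lines of Python. The schema below (a top-level mapping from sample id to a dict with a `"score"` field) is an assumption — adjust the key names to match the actual `MODEL_NAME_results.json` produced by the eval script:

```python
import json

def mean_fidelity(results_path):
    """Average per-sample fidelity scores from a results JSON.

    Assumed (hypothetical) schema: {"sample_id": {"score": float, ...}, ...}.
    """
    with open(results_path) as f:
        results = json.load(f)
    scores = [entry["score"] for entry in results.values()]
    return sum(scores) / len(scores)
```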
This section evaluates video generation models on their ability to perform simulative reasoning and planning for robotic tasks. Part of the code has been uploaded; the remainder is being prepared for release.
Both Cosmos-Predict1 and Cosmos-Predict2 models need to be fine-tuned on specific datasets for the evaluation tasks:
| Task Type | Dataset | Models to Fine-tune | Purpose |
|---|---|---|---|
| Open-ended Simulation Planning | Agibot World Colosseo – “A large-scale manipulation platform for scalable and intelligent embodied systems” (Bu et al., 2025) | Cosmos-Predict1, Cosmos-Predict2 | Enables open-ended reasoning about robotic manipulation tasks |
| Structured Simulation Planning | Language Table – “Interactive language: Talking to robots in real time” (Lynch et al., 2023) | Cosmos-Predict1, Cosmos-Predict2 | Enables structured reasoning with specific action constraints |
Fine-tuning process:

1. Fine-tune each model on both datasets:
   - Agibot for open-ended simulation planning
   - Language Table for structured simulation planning
2. Follow the respective model repository instructions for fine-tuning.
3. Replace the original checkpoints with your fine-tuned versions in the `thirdparty/*/checkpoints/` directories.

Run the evaluation scripts and check the generated results for both open-ended and structured simulation planning.
Open-ended Simulation Planning:
```bash
# Cosmos-Predict1
sbatch simulative_reasoning_planning_scripts/open_ended_simulation_planning/VLM-WM_reasoning_cosmos1.sh
# Cosmos-Predict2
sbatch simulative_reasoning_planning_scripts/open_ended_simulation_planning/VLM-WM_reasoning_cosmos2.sh
```

Structured Simulation Planning:
Tasks may have a maximum of 5 or 10 actions:
```bash
# Maximum 5 actions
sbatch simulative_reasoning_planning_scripts/structured_simulation_planning/VLM-WM_reasoning_cosmos1_max_action_5.sh
sbatch simulative_reasoning_planning_scripts/structured_simulation_planning/VLM-WM_reasoning_cosmos2_max_action_5.sh
# Maximum 10 actions
sbatch simulative_reasoning_planning_scripts/structured_simulation_planning/VLM-WM_reasoning_cosmos1_max_action_10.sh
sbatch simulative_reasoning_planning_scripts/structured_simulation_planning/VLM-WM_reasoning_cosmos2_max_action_10.sh
```

Result Checking:
After execution, check the results in the following paths:
- Open-ended: `outputs/simulative_reasoning_planning/open_ended_simulation_planning/[task_name]/[model_name]/[task_name]_refined.json`
- Structured: `outputs/simulative_reasoning_planning/structured_simulation_planning/[task_name]/[model_name]/[task_name]_refined.json`
Analyze the action sequences to determine whether the models successfully completed the tasks.
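A minimal sketch of one part of that analysis: checking that a parsed plan respects the action budget from the structured setting. The result-file schema is not documented here, so `actions` is a hypothetical stand-in (a plain list of action strings) for whatever the `[task_name]_refined.json` files actually contain, and a full success check would also verify that the sequence completes the task:

```python
def plan_within_budget(actions, max_actions):
    """Return True if a plan is non-empty and uses at most `max_actions` steps.

    `actions` is a hypothetical list of action strings parsed from a
    *_refined.json result file; task completion must be checked separately.
    """
    return 0 < len(actions) <= max_actions
```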
This section evaluates the temporal smoothness of multi-round generated videos using optical flow. Consecutive frame pairs are processed with SEA-RAFT to compute velocity and acceleration magnitudes, which are combined into a smoothness score (vmag × exp(−λ × amag)).
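The score can be sketched directly from that formula. Everything below except the formula itself is an assumption: the per-pair optical-flow fields would come from SEA-RAFT, the magnitudes are aggregated by a simple mean, and the λ value is a placeholder:

```python
import numpy as np

def smoothness_score(flows, lam=0.1):
    """Smoothness = vmag * exp(-lam * amag), per the formula above.

    flows: sequence of optical-flow fields, each of shape (H, W, 2), one per
    consecutive frame pair (e.g. from SEA-RAFT). Mean aggregation and
    lam=0.1 are illustrative assumptions.
    """
    flows = np.stack(flows)                        # (T, H, W, 2)
    vmag = np.linalg.norm(flows, axis=-1).mean()   # mean velocity magnitude
    accel = np.diff(flows, axis=0)                 # flow change between pairs
    amag = np.linalg.norm(accel, axis=-1).mean()   # mean acceleration magnitude
    return float(vmag * np.exp(-lam * amag))
```

A constant flow field has zero acceleration, so the score reduces to the mean velocity magnitude; jittery motion inflates `amag` and drives the exponential penalty down.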
datasets/smoothness_eval/samples.json contains 100 photorealistic outdoor scenes, each with a 10-round sequential prompt list. Reference images are not bundled — set IMAGE_ROOT in the generation scripts to point to your local copy of the WorldScore-Dataset.
```bash
wget https://huggingface.co/datasets/memcpy/SEA-RAFT/resolve/main/Tartan-C-T-TSKH-spring540x960-M.pth \
    -O thirdparty/SEA-RAFT/checkpoints/Tartan-C-T-TSKH-spring540x960-M.pth
```

Scripts for all supported models are in `smoothness_eval_scripts/`. Example using PAN:
```bash
# Edit IMAGE_ROOT inside the script first, then:
bash smoothness_eval_scripts/pan.sh
```

Generated videos are saved under `outputs/smoothness_eval/pan/{instance_id}/rounds/`.
```bash
python smoothness_eval_scripts/compute_smoothness_scores.py \
    --videos_dir outputs/smoothness_eval/pan \
    --output_dir outputs/smoothness_eval/pan_scores \
    --raft_ckpt thirdparty/SEA-RAFT/checkpoints/Tartan-C-T-TSKH-spring540x960-M.pth \
    --num_workers 4
```

Per-instance results are written to `outputs/smoothness_eval/pan_scores/{instance_id}/smoothness.json`. An aggregate `summary.json` is written once all instances are scored. For multi-node SLURM evaluation, set `MODEL_NAME` in `smoothness_eval_scripts/eval.sh` and run `sbatch smoothness_eval_scripts/eval.sh`.
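If you need to re-aggregate outside the provided script (e.g. after a partial SLURM run), the per-instance files can be collected in a few lines. The `"smoothness"` field name inside each `smoothness.json` is an assumed key — check one file to confirm:

```python
import glob
import json
import os
import statistics

def aggregate_smoothness(scores_dir):
    """Mean smoothness across all scored instances under `scores_dir`.

    Expects the layout described above: {scores_dir}/{instance_id}/smoothness.json.
    The "smoothness" field name is a hypothetical assumption.
    """
    vals = []
    for path in glob.glob(os.path.join(scores_dir, "*", "smoothness.json")):
        with open(path) as f:
            vals.append(json.load(f)["smoothness"])
    return statistics.mean(vals) if vals else None
```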
This section evaluates video generation models on 7 aspects of multi-round generation consistency using the WorldScore benchmark framework (MIT License).
| Aspect | Metric | Key dependency |
|---|---|---|
| **camera_control** | camera reprojection error | DROID-SLAM |
| **object_control** | object detection score | GroundingDINO + SAM2 |
| content_alignment | CLIP score | CLIP |
| **3d_consistency** | reprojection error | DROID-SLAM |
| photometric_consistency | optical flow AEPE | SEA-RAFT |
| style_consistency | Gram matrix distance | VGG |
| subjective_quality | CLIP-IQA+, MUSIQ | QAlign, MUSIQ |
Each score is a list of per-round values rather than a single scalar. The bold aspects require heavy thirdparty dependencies (see WorldScore's own setup guide). The remaining four aspects (content_alignment, photometric_consistency, style_consistency, subjective_quality) can be run on any GPU without those dependencies.
```bash
git submodule update --init thirdparty/WorldScore
pip install -e thirdparty/WorldScore
```

Follow WorldScore's setup instructions for the thirdparty dependencies (DROID-SLAM, GroundingDINO, SAM2) if you need all 7 aspects.
```bash
bash generation_consistency_eval_scripts/install_patches.sh
```

This copies the modified evaluator into the WorldScore submodule.
Edit IMAGE_ROOT in the script to point to your local WorldScore-Dataset, then run:
```bash
bash generation_consistency_eval_scripts/pan.sh
```

Generated videos are saved under `outputs/generation_consistency_eval/pan/`.
```bash
python generation_consistency_eval_scripts/prepare_worldscore_dirs.py \
    --videos_root outputs/generation_consistency_eval/pan \
    --dataset_json datasets/generation_consistency_eval/samples.json \
    --output_root outputs/generation_consistency_eval/pan_eval
```

```bash
python generation_consistency_eval_scripts/run_evaluate_multiround.py \
    --model_name pan \
    --visual_movement static \
    --runs_root outputs/generation_consistency_eval/pan_eval \
    --num_jobs 24 \
    --use_slurm True \
    --slurm_partition main \
    --slurm_qos wm
```

Results are written to `outputs/generation_consistency_eval/pan_eval/worldscore_output/worldscore_multiround.json`. For SLURM-based end-to-end runs, set `MODEL_NAME` in `generation_consistency_eval_scripts/eval.sh` and run `sbatch generation_consistency_eval_scripts/eval.sh`.
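Since each aspect is reported as a list of per-round values, collapsing the multi-round results to one scalar per aspect is straightforward. The mapping-of-aspect-to-list schema below is assumed from the description above — verify it against your `worldscore_multiround.json`:

```python
import json

def per_aspect_means(path):
    """Average each aspect's per-round score list into a single value.

    Assumed (hypothetical) schema: {"aspect_name": [round_1, round_2, ...], ...}.
    """
    with open(path) as f:
        scores = json.load(f)
    return {aspect: sum(vals) / len(vals) for aspect, vals in scores.items()}
```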