Note: This repository contains the official implementation of RAE-NWM. The pre-trained RAE-NWM weights and complete model cards will be released here soon.
First, clone the repository and navigate into the project directory:
git clone https://github.com/zmkun20/raenwm.git
cd raenwmWe recommend using Anaconda to manage the environment:
conda create -y -n raenwm python=3.11.10
conda activate raenwm
# Install PyTorch
pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu128
# Install dependencies
conda install -y ffmpeg
pip install "numpy<2" omegaconf huggingface_hub decord einops evo transformers diffusers tqdm timm notebook dreamsim torcheval lpips ipywidgets accelerate==0.23.0 torchdiffeq==0.2.5 wandbTo download and preprocess data, please follow the steps from NoMaD.
Specifically, after downloading the datasets, run process_bags.py and process_recon.py to save each processed dataset to path/to/raenwm/data/<dataset_name>.
Your final data directory structure should look like this:
raenwm/data
└── <dataset_name>
├── <name_of_traj1>
│ ├── 0.jpg
│ ├── ...
│ ├── T_1.jpg
│ └── traj_data.pkl
├── <name_of_traj2>
│ ├── 0.jpg
│ ├── ...
│ └── traj_data.pkl
└── <name_of_trajN>
├── 0.jpg
├── ...
└── traj_data.pkl
We follow the official RAE instructions to download the DINOv2 decoder weights and normalization stats:
# Optional: Login is usually not required for public models.
# If you encounter download issues, use the command below to log in.
# hf auth login
# (Legacy command: huggingface-cli login)
mkdir -p models
# Download using the hf command
hf download nyu-visionx/RAE-collections \
--repo-type model \
--include "decoders/dinov2/wReg_base/ViTXL_n08/model.pt" \
--include "stats/dinov2/wReg_base/imagenet1k/stat.pt" \
--local-dir ./models
# If the `hf` command is not recognized in your environment, use the legacy command instead:
# huggingface-cli download nyu-visionx/RAE-collections \
# --repo-type model \
# --include "decoders/dinov2/wReg_base/ViTXL_n08/model.pt" \
# --include "stats/dinov2/wReg_base/imagenet1k/stat.pt" \
# --local-dir ./models⏳ [Coming Soon] The pre-trained weights for our model will be released and linked here.
This repository natively supports multi-GPU operations for both training and inference. To utilize multiple GPUs, simply replace python with torchrun and specify the number of processes. For example:
torchrun --nproc_per_node=8 train.py ...We support WandB for experiment tracking. Before starting the training process, it is recommended to log in to your WandB account:
wandb loginTo train the model from scratch, run:
python train.py \
--config config/raenwm.yaml \
--epochs 50 \
--global-seed 42 \
--log-every 100 \
--ckpt-every 5000 \
--eval-every 1000 \
--bfloat16 1 \
--torch-compile 1We provide a streamlined bash script to run the evaluation pipeline, which supports both time and rollout modes. The script allows you to selectively run ground-truth preparation (gt), inference (infer), evaluation (eval), or all steps at once (all).
# Usage: bash run_eval.sh [MODE] [STEP] [DATASET] [CKPT_PATH]
# Example: Generate Ground-Truth only
bash run_eval.sh time gt sacson path/to/checkpoints/checkpoint.pth.tar
# Example: Run Inference only
bash run_eval.sh time infer sacson path/to/checkpoints/checkpoint.pth.tar
# Example: Run all steps sequentially for rollout mode
bash run_eval.sh rollout all sacson path/to/checkpoints/checkpoint.pth.tarTo evaluate planning performance using CEM, simply run the provided planning script:
# Usage: bash run_plan.sh [DATASET] [curve | line] [CKPT_PATH]
bash run_plan.sh sacson curve path/to/checkpoints/checkpoint.pth.tarTo reproduce the representation analysis (probe experiment) mentioned in the paper, we provide the following scripts.
Train and Evaluate: Training the probe will automatically run the evaluation at the end of the process:
python train_probe.py \
--config config/probe.yaml \
--epochs 5 \
--global-seed 42 \
--log-every 100 \
--ckpt-every 1000 \
--eval-every 1000 \
--bfloat16 1 \
--torch-compile 0Our project is inspired by and built upon RAE, NWM, and NoMaD.
If you find our work helpful, please consider citing our paper:
@article{zhang2026rae,
title={RAE-NWM: Navigation World Model in Dense Visual Representation Space},
author={Zhang, Mingkun and Shen, Wangtian and Zhang, Fan and Qin, Haijian and Pei, Zihao and Meng, Ziyang},
journal={arXiv preprint arXiv:2603.09241},
year={2026}
}