Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation

📝Paper | 🌍Project Page | Pre-trained HunyuanVideo Checkpoint

Also refer to here for the latest version with the Wan 2.2 model.

Introduction

This repository contains the codebase for Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation.

Below you will find setup instructions and basic usage guidance for the code within the vidar folder.

Environment Setup

Our code has been tested with CUDA 12.4.
If you encounter errors, please also refer to known issues in HunyuanVideo-I2V.

1. Create a Conda Environment

conda create -n vidar python==3.11.9

2. Activate the Environment

conda activate vidar

3. Install PyTorch and CUDA Dependencies

For CUDA 12.4:

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

(Optional) Install the full CUDA toolkit:

conda install -c nvidia cuda-toolkit=12.4

4. Install Python Requirements

python -m pip install -r requirements.txt

5. Install Flash Attention v2 for Acceleration

Requires CUDA 11.8 or newer:

python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

6. Install xDiT for Parallel Inference

We recommend using PyTorch 2.4.0 and flash-attn 2.6.3:

python -m pip install xfuser==0.4.0
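
After the install steps, it can help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch, not part of the Vidar codebase: the function name check_versions is hypothetical, and the distribution names (torch, torchvision, flash-attn, xfuser) are assumed to be the usual PyPI names. It only reads installed package metadata and does not import the packages themselves.

```python
from importlib import metadata

# Versions named in the setup steps above. The distribution names are
# assumptions (the usual PyPI names), not taken from requirements.txt.
EXPECTED = {
    "torch": "2.4.0",
    "torchvision": "0.19.0",
    "flash-attn": "2.6.3",
    "xfuser": "0.4.0",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed_or_None, ok)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Some builds append a local tag such as "+cu124"; ignore it.
        ok = have is not None and have.split("+")[0] == want
        report[name] = (want, have, ok)
    return report


if __name__ == "__main__":
    for name, (want, have, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:12s} expected {want:8s} found {have} [{status}]")
```

A MISMATCH line does not necessarily mean the setup is broken; it only flags a difference from the versions this README was tested with.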

Troubleshooting: Floating Point Exceptions

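If you encounter floating point exceptions (core dump) on certain GPUs, try the following. The LD_LIBRARY_PATH below assumes a Python 3.8 install under /opt/conda; adjust it to the site-packages directory of your own environment: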

pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

Ensure you have CUDA 12.4, cuBLAS >= 12.4.5.8, and cuDNN >= 9.00 installed.
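
The cuBLAS library path depends on where the active environment keeps its site-packages. As a rough sketch (the helper name cublas_lib_dir is hypothetical, and it assumes the nvidia-cublas-cu12 wheel places its libraries under nvidia/cublas/lib, as the path above suggests), the directory can be located from inside the environment like this:

```python
import sysconfig
from pathlib import Path


def cublas_lib_dir(site_packages=None):
    """Return the nvidia/cublas/lib directory under site-packages, or None."""
    # "platlib" is where platform-specific wheels are installed.
    root = Path(site_packages or sysconfig.get_paths()["platlib"])
    candidate = root / "nvidia" / "cublas" / "lib"
    return candidate if candidate.is_dir() else None


if __name__ == "__main__":
    lib_dir = cublas_lib_dir()
    if lib_dir is None:
        print("nvidia-cublas-cu12 not found in this environment")
    else:
        # Print the export line to paste into the shell.
        print(f"export LD_LIBRARY_PATH={lib_dir}/")
```

Run it with the vidar environment activated, then paste the printed export line into your shell.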

Video Diffusion Model

Data Preparation

Prepare your metadata as follows:

{
    "video_path": "{VIDEO_PATH}",
    "raw_caption": {
        "long caption": "{PROMPT}"
    }
}
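
A record in this layout can be produced with a few lines of Python. This is only a sketch: the helper names make_metadata and write_metadata are hypothetical, and writing one JSON file per video is an assumption, so check the Hunyuan VAE extract documentation for how the metadata files are expected to be stored.

```python
import json
from pathlib import Path


def make_metadata(video_path, prompt):
    """Build one metadata record in the layout shown above."""
    return {
        "video_path": str(video_path),
        "raw_caption": {"long caption": prompt},
    }


def write_metadata(video_path, prompt, out_path):
    """Write the record for one video to out_path as JSON.

    One file per video is an assumption made for this sketch.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(make_metadata(video_path, prompt), f,
                  ensure_ascii=False, indent=4)
    return out_path
```

For example, write_metadata("videos/ep0.mp4", "pick up the cup", "meta/ep0.json") writes a single record whose "long caption" field holds the instruction text; the file names here are placeholders.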

You also need to encode the videos in your dataset before training:

vm/hyvae_extract/start.sh

For more details, refer to Hunyuan VAE extract.

Training

Edit scripts/vm/train.sh to match your platform settings, then run:

scripts/vm/train.sh

Inference

To test your trained model:

scripts/vm/sample.sh

This generates a video based on the first frame and your instruction.

Masked Inverse Dynamic Model

Data Preparation

Training data (default folder): assets/train
Testing data (default folder): assets/test
Files are organized as task_name/episode_idx.mp4 (a multi-view video) and task_name/episode_idx_qpos.pt (a 2D tensor with corresponding actions).
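
Before training, it may be worth checking that every video has a matching action file. The sketch below walks a data folder that follows the layout just described; the function name list_episodes is hypothetical and the code is not taken from the Vidar scripts.

```python
from pathlib import Path


def list_episodes(root):
    """Return (task_name, episode_idx, video_path, qpos_path) tuples.

    Results are sorted by video path. Only episodes with both the
    .mp4 and the matching _qpos.pt file are returned.
    """
    episodes = []
    for video in sorted(Path(root).glob("*/*.mp4")):
        # task_name/episode_idx.mp4 -> task_name/episode_idx_qpos.pt
        qpos = video.with_name(video.stem + "_qpos.pt")
        if qpos.is_file():
            episodes.append((video.parent.name, video.stem, video, qpos))
    return episodes


if __name__ == "__main__":
    for task, idx, video, qpos in list_episodes("assets/train"):
        print(task, idx, video, qpos)
```

Each qpos_path can then be loaded with torch.load, which should yield the 2D action tensor described above. Videos without a matching _qpos.pt file are silently skipped, so compare the count against the number of .mp4 files if you expect every episode to be complete.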

Training

Edit scripts/idm/train.sh as needed, then run:

scripts/idm/train.sh

Inference

To evaluate your model:

scripts/idm/eval.sh
