Authors: Yue Su, Chubin Zhang, Sijin Chen, Liufan Tan, Yansong Tang, Jianan Wang, Xihui Liu†
We include both DINOv3 (the default) and DINOv2 as the 2D encoder. Refer to policy and v_model to inspect and change this setup.
The v1 version of DSP: Dense Policy. You can easily reuse its Dense Head for action generation.
Refer to policy to follow DSPv2 easily and smoothly.
Please follow the installation guide to set up the dspv2 conda environment and its dependencies. Also remember to adjust the constant parameters in dataset/constants.py according to your own environment.
Our original datasets are collected and stored as HDF5 files. Each demo (trajectory) is organized as follows:
├─ Group: /images_dict
├─ Group: /images_dict/head
└─ Dataset: depth (Shape: (174, 720, 1280), Dtype: uint16)
└─ Dataset: rgb (Shape: (174, 720, 1280, 3), Dtype: uint8)
└─ ...
├─ Group: /images_dict/left
└─ ...
├─ Group: /images_dict/right
└─ ...
├─ Group: /images_dict/torso
└─ ...
├─ Group: /joints_dict
└─ Dataset: joints_position_state (Shape: (174, 25), Dtype: float64)
└─ ...
├─ Group: /poses_dict
└─ Dataset: astribot_arm_left (Shape: (174, 7), Dtype: float64)
└─ Dataset: astribot_arm_right (Shape: (174, 7), Dtype: float64)
└─ ...
└─ Dataset: merge_pose (Shape: (174, 37), Dtype: float64)
├─ ...
We use the multi-view images and the head-camera depth. poses_dict/merge_pose is used to organize states and actions; it is a concatenation of
[chassis pose, torso pose, left arm pose, left gripper, right arm pose, right gripper, head pose]. All of these poses are relative to the chassis. The chassis's own movement is expressed in the world frame and saved as the first 3 dimensions of joints_dict/joints_position_state. You can ignore the other data in the HDF5 files.
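As a sketch of how merge_pose might be unpacked: the per-component widths below are an assumption (7-dim poses and scalar grippers), chosen because it is consistent with the 37-dim total and the 7-dim arm poses shown in the tree above; only the arm widths are confirmed by the dataset shapes.

```python
import numpy as np

# Assumed layout of the 37-dim merge_pose vector: 7-dim poses and
# scalar grippers. Only the 7-dim arm poses are confirmed by the
# dataset shapes above; the remaining widths are an assumption.
MERGE_POSE_LAYOUT = [
    ("chassis", 7),
    ("torso", 7),
    ("left_arm", 7),
    ("left_gripper", 1),
    ("right_arm", 7),
    ("right_gripper", 1),
    ("head", 7),
]

def split_merge_pose(merge_pose: np.ndarray) -> dict:
    """Split a (T, 37) merge_pose array into named components."""
    assert merge_pose.shape[-1] == sum(w for _, w in MERGE_POSE_LAYOUT)
    out, start = {}, 0
    for name, width in MERGE_POSE_LAYOUT:
        out[name] = merge_pose[..., start:start + width]
        start += width
    return out

# e.g. with h5py: split_merge_pose(f["poses_dict/merge_pose"][:])
```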
Point-cloud projection, the sampling used in conventional methods, and the voxelization used in DSPv2 are all provided in dataset/preprocess_data.py, which also offers a function for calculating the delta of the chassis movement. Preprocessing the data with dataset/preprocess_data.py is essential for accelerating training.
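For intuition, voxelization can be sketched as grid-snapping points and keeping one per occupied cell. This is a minimal illustration, not the repo's actual implementation in dataset/preprocess_data.py, and the voxel size is a placeholder:

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float = 0.01) -> np.ndarray:
    """Keep one point per occupied voxel.

    points: (N, 3) xyz array; voxel_size is in the same units as points.
    """
    # Integer voxel coordinate of each point.
    coords = np.floor(points / voxel_size).astype(np.int64)
    # np.unique over rows returns the first point index in each voxel.
    _, keep = np.unique(coords, axis=0, return_index=True)
    # Sort indices so the original point ordering is preserved.
    return points[np.sort(keep)]
```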
We provide dummy HDF5 data here; you can inspect its structure with utils/hdf5_view.py and process it with dataset/preprocess_data.py. Note: some point clouds may be empty, since it is dummy data.
Before training, we recommend calculating the 5%-95% min-max values of each task for normalization. For each task, follow utils/minmax.py and save the values in dataset/pose.json under your_task_name, then add --task your_task_name in train.sh and launch training with the commands below.
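The percentile-based statistics can be sketched as follows. This is an illustrative sketch only: the function names and the on-disk JSON layout are assumptions, and utils/minmax.py defines the actual format.

```python
import numpy as np

def compute_minmax(actions: np.ndarray, low_pct: float = 5.0,
                   high_pct: float = 95.0):
    """Per-dimension 5%/95% percentiles over all frames of a task."""
    lo = np.percentile(actions, low_pct, axis=0)
    hi = np.percentile(actions, high_pct, axis=0)
    return lo, hi

def normalize(actions: np.ndarray, lo: np.ndarray, hi: np.ndarray,
              eps: float = 1e-8) -> np.ndarray:
    """Map each dimension to roughly [-1, 1] using the stored statistics."""
    return 2.0 * (actions - lo) / (hi - lo + eps) - 1.0

# Saving under your task name in dataset/pose.json (field names assumed):
# stats = {"your_task_name": {"min": lo.tolist(), "max": hi.tolist()}}
# json.dump(stats, open("dataset/pose.json", "w"))
```

Values beyond the 5th/95th percentile map slightly outside [-1, 1], which keeps outliers from squashing the usable range.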
conda activate dspv2
bash train.sh

For evaluation, run:

conda activate dspv2
python eval.py

If you find our work useful, please cite:

@article{dspv2,
title={DSPv2: Improved Dense Policy for Effective and Generalizable Whole-body Mobile Manipulation},
author={Yue Su and Chubin Zhang and Sijin Chen and Liufan Tan and Yansong Tang and Jianan Wang and Xihui Liu},
journal={arXiv preprint arXiv:2509.16063},
year={2025}
}

DSPv2 is licensed under CC BY-NC-SA 4.0.
