This repository contains the official implementation of our paper:
Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies
Jing Wang, Weiting Peng, Jing Tang, Zeyu Gong, Xihua Wang, Bo Tao, Li Cheng
[NeurIPS 2025]
Existing imitation learning methods freeze perception during action sequence generation, ignoring how humans naturally refine perception through ongoing actions. DP-AG (Action-Guided Diffusion Policy) closes this gap by evolving observation features dynamically with action feedback.
- Latent observations are modeled via variational inference.
- An action-guided SDE evolves features, driven by the Vector–Jacobian Product (VJP) of diffusion noise predictions.
- A cycle-consistent contrastive loss aligns evolving and static latents, ensuring smooth perception–action interplay.
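To make the second bullet concrete, here is a minimal, illustrative sketch of an action-guided feature update: one Euler–Maruyama step of an SDE whose drift is the vector–Jacobian product (VJP) of the noise prediction with respect to the latent observation. All names (`eps_net`, the choice of cotangent, the step sizes) are hypothetical placeholders, not the paper's exact formulation:

```python
import torch

def action_guided_step(z, a, eps_net, dt=0.01, sigma=0.05):
    """One Euler-Maruyama step of dz = -VJP dt + sigma dW.

    z        : latent observation features, shape (B, D)
    a        : current action (or action chunk) conditioning, shape (B, D)
    eps_net  : callable (a, z) -> diffusion noise prediction, shape (B, D)
    """
    z = z.detach().requires_grad_(True)
    eps = eps_net(a, z)  # noise prediction conditioned on the action
    # VJP of eps w.r.t. z; using eps itself as the cotangent is one
    # plausible choice for illustration.
    vjp = torch.autograd.grad(eps, z, grad_outputs=eps.detach())[0]
    noise = sigma * (dt ** 0.5) * torch.randn_like(z)
    return (z - vjp * dt + noise).detach()
```

The key point is that the latent evolves *during* action generation, so perception is refreshed by action feedback rather than frozen after encoding.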
DP-AG significantly outperforms state-of-the-art methods on Robomimic, Franka Kitchen, Push-T, Dynamic Push-T, and real-world UR5 tasks, delivering higher success rates, faster convergence, and smoother actions.
Figure: DP-AG extends Diffusion Policy by evolving observation features through an action-guided SDE and aligning perception–action interplay with a cycle-consistent contrastive loss.
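The cycle-consistent alignment between evolving and static latents can be sketched, for intuition only, as a symmetric InfoNCE-style loss where matching batch indices are positives and everything else is a negative. This is a generic contrastive sketch, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def cycle_contrastive_loss(z_evolved, z_static, temperature=0.1):
    """Symmetric InfoNCE sketch aligning evolving and static latents.

    Averaging both directions (evolved->static and static->evolved)
    gives the cycle-consistent, symmetric form.
    """
    z_e = F.normalize(z_evolved, dim=-1)
    z_s = F.normalize(z_static, dim=-1)
    logits = z_e @ z_s.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_e.size(0))        # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```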
We follow the same setup as Diffusion Policy. To reproduce simulation benchmarks, install the conda environment on a Linux machine with an NVIDIA GPU.
```bash
sudo apt install -y libosmesa6-dev libgl1-mesa-glx libglfw3 patchelf
```

We recommend Mambaforge:

```bash
mamba env create -f conda_environment.yaml
```

For conda:

```bash
conda env create -f conda_environment.yaml
```

Note: `conda_environment_macos.yaml` is only for development on macOS and does not support full benchmarks.
Create the data directory under the repo root:

```bash
mkdir data && cd data
```

Download training datasets:

```bash
wget https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip
unzip pusht.zip && rm -f pusht.zip && cd ..
```

Grab experiment configs:

```bash
wget -O image_pusht_diffusion_policy_cnn.yaml \
  https://diffusion-policy.cs.columbia.edu/data/experiments/image/pusht/diffusion_policy_cnn/config.yaml
```

We provide our Dynamic Push-T dataset in the `data/` folder of this repository for direct use with our implementation.
For all other simulation benchmarks (e.g., Robomimic, Franka Kitchen, Push-T), please refer to the official Diffusion Policy repository for instructions on downloading and preparing datasets.
We provide two Jupyter notebooks that contain the core implementation of DP-AG and are designed to be easy to use and understand:
- `PushT-Vision-Image-Action-Guided.ipynb` – Demonstrates our method on the Push-T benchmark.
- `Dynamic-PushT-Environment.ipynb` – Showcases our Dynamic Push-T environment with action–perception interplay.
👉 We strongly suggest starting from these notebooks, as they provide the clearest entry point for understanding and experimenting with DP-AG.
Activate the conda environment and log into wandb:

```bash
conda activate robodiff
wandb login
```

Train with a single seed:

```bash
python train.py --config-dir=. --config-name=image_pusht_diffusion_policy_cnn.yaml \
  training.seed=42 training.device=cuda:0
```

Train with multiple seeds using Ray:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2
ray start --head --num-gpus=3
python ray_train_multirun.py --config-dir=. --config-name=image_pusht_diffusion_policy_cnn.yaml \
  --seeds=42,43,44 --monitor_key=test/mean_score
```

Download a checkpoint (example):

```bash
wget https://diffusion-policy.cs.columbia.edu/data/experiments/low_dim/pusht/diffusion_policy_cnn/train_0/checkpoints/epoch=0550-test_mean_score=0.969.ckpt -O data/checkpoint.ckpt
```

Run evaluation:

```bash
python eval.py --checkpoint data/checkpoint.ckpt --output_dir data/pusht_eval_output --device cuda:0
```

Our framework has been validated on a UR5 robot with RealSense cameras and SpaceMouse teleoperation.
Please refer to demo_real_robot.py and eval_real_robot.py for data collection, training, and evaluation following the same structure as Diffusion Policy.
The codebase follows Diffusion Policy's modular design:
- Tasks: dataset wrappers, environments, configs.
- Policies: inference + training.
- Workspaces: manage experiment lifecycle.
`train.py` is the entry point.
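As a rough sketch of this layering (all class and method names below are hypothetical placeholders, not the repository's actual API): the workspace owns the experiment lifecycle, the policy owns inference and training, and the task owns data and environment configuration.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Dataset wrapper, environment, and config for one benchmark."""
    name: str
    dataset_path: str

class Policy:
    """Owns inference (and, in the real code, training) logic."""
    def predict_action(self, obs):
        # placeholder inference: echo the observation back as an action
        return {"action": obs}

@dataclass
class Workspace:
    """Manages the experiment lifecycle, wiring task and policy together."""
    task: Task
    policy: Policy = field(default_factory=Policy)

    def run(self):
        obs = {"agent_pos": [0.0, 0.0]}  # stand-in observation
        return self.policy.predict_action(obs)
```

This separation lets a new benchmark be added by writing a task, and a new method by writing a policy, without touching the experiment loop.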
If you use this work, please cite:
```bibtex
@inproceedings{wang2025dpag,
  title={Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies},
  author={Wang, Jing and Peng, Weiting and Tang, Jing and Gong, Zeyu and Wang, Xihua and Tao, Bo and Cheng, Li},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}
```

We build upon the foundational work of Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (Chi et al.).
Their open-source code, benchmarks, and datasets enabled our development of DP-AG.
We especially thank the authors for releasing simulation environments, vision and state-based notebooks, and experiment data.
