
$A_{0}$: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

Paper | Webpage | Hugging Face

Example results (animated previews) are shown for the HOI4D dataset, ManiSkill, and the DROID dataset.

Environment Setup

git clone https://github.com/A-embodied/A0.git
cd A0

conda create -n a0env python=3.10.0
conda activate a0env

# Install pytorch
# Look up https://pytorch.org/get-started/previous-versions/ with your cuda version for a correct command
pip install torch==2.1.0 torchvision==0.16.0  --index-url https://download.pytorch.org/whl/cu121

# Install flash-attn
pip install flash-attn --no-build-isolation
# or install prebuilt flash-attn wheels for faster setup: https://github.com/mjun0812/flash-attention-prebuild-wheels


# Install other prerequisites
pip install -r requirements.txt

Download Off-the-shelf Vision & Text Encoders
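The two encoders used below are Qwen2.5-7B and siglip-so400m-patch14-384. If you have not downloaded them yet, one option is the huggingface-cli tool (this assumes huggingface_hub is installed; the target paths are placeholders):

# Download the encoders from Hugging Face (target paths are placeholders)
huggingface-cli download Qwen/Qwen2.5-7B --local-dir /path/to/Qwen2.5-7B
huggingface-cli download google/siglip-so400m-patch14-384 --local-dir /path/to/siglip-so400m-patch14-384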

Then link the downloaded encoders into the repo directory:

# Under the root directory of this repo
mkdir -p google
mkdir -p Qwen

# Link the downloaded encoders to this repo
ln -s /path/to/Qwen2.5-7B Qwen/Qwen2.5-7B
ln -s /path/to/siglip-so400m-patch14-384 google/siglip-so400m-patch14-384

Data

Download the A0-Dataset from Hugging Face 🤗 and unzip the zip files (a batch-unzip sketch follows the directory listing below). Your dataset directory should look like:

├── maniskill # maniskill_path
├── droid-cotrack # droid_cotrack_path
├── droid_molmo_sam2 # droid_molmo_sam2_path
├── hoi4d_metadata # hoi4d_metadata_path
├── hoi4d_frame # hoi4d_frame_selection_path
└── HOI4D_release # hoi4d_rgb_path
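If the dataset comes as multiple archives, they can be unpacked in one go with a loop like the following (a minimal sketch; run it in the directory where the zip files were downloaded):

# Unzip every downloaded archive of the A0-Dataset in the current directory
for f in *.zip; do
    unzip "$f"
done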

Then set the dataset paths in configs/base.yaml:

# ...

dataset:
  droid_cotrack_path: /path/to/droid_cotrack
  droid_molmo_sam2_path: /path/to/droid_molmo_sam2
  hoi4d_metadata_path: /path/to/hoi4d_metadata
  hoi4d_rgb_path: /path/to/HOI4D_release
  hoi4d_frame_selection_path: /path/to/hoi4d_frame
  maniskill_path: /path/to/maniskill
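A quick way to catch typos in these paths before training is to check that every configured directory actually exists (a minimal sketch; substitute the paths you put in configs/base.yaml):

# Verify that each dataset directory configured in configs/base.yaml exists
for d in /path/to/maniskill /path/to/droid_cotrack /path/to/droid_molmo_sam2 \
         /path/to/hoi4d_metadata /path/to/HOI4D_release /path/to/hoi4d_frame; do
    [ -d "$d" ] || echo "Missing dataset directory: $d"
done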

Decode the videos of the HOI4D_release dataset into image frames with ffmpeg, using the official Python script utils/decode.py:

python utils/decode.py
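For reference, the underlying operation is a standard ffmpeg frame extraction. The call below only illustrates what happens for a single video; it is not a replacement for utils/decode.py, whose exact arguments and output layout may differ:

# Illustrative only: extract JPEG frames from one HOI4D video with ffmpeg
ffmpeg -i /path/to/video.mp4 -q:v 2 /path/to/output_frames/%05d.jpg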

Train

First, set a few variables in train.sh:

Run ifconfig to find your network interface, then export NCCL_SOCKET_IFNAME=<iface>.

Run ibstat to identify your InfiniBand device, then export NCCL_IB_HCA=<device:port>.

Set OUTPUT_DIR and CUDA_VISIBLE_DEVICES.
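For example, the resulting exports in train.sh might look like this (the interface name, InfiniBand device, and paths below are placeholders; use the values reported by ifconfig and ibstat on your machine):

# Example values only; replace with your own interface, IB device, and paths
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=mlx5_0:1
export OUTPUT_DIR=/path/to/checkpoints
export CUDA_VISIBLE_DEVICES=0,1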

Optionally, you can download the model pre-trained on the 1M pixmo-points dataset (🤗A0-1B-pretrain) and set --pretrained_model_name_or_path to use it as the initial weights.

source train.sh

Experimental Details

  • The default model configuration (hidden size: 2048, depth: 28) contains 1 billion parameters. By setting the hidden_size to 1024 and the depth to 14 in configs/base.yaml, you can obtain a model with approximately 170 million parameters.
  • In our experiments, we used 2 GPUs with a batch size of 100 and trained the model for 30,000 steps. The 170M model required 46 GB of GPU memory per card, while the 1B model required 73 GB per card.

Test

You can test with your own trained model or with the pre-trained models (🤗A0-1B and A0-170M).
Set the variable PRETRAINED_MODEL_NAME_OR_PATH in test_dataset.sh.

# test performance on the ManiSkill dataset
bash test_dataset.sh maniskill
# test performance on the HOI4D Frame Selection dataset
bash test_dataset.sh hoi4d_frame
# test performance on the HOI4D dataset
bash test_dataset.sh hoi4d
# test performance on the DROID dataset
bash test_dataset.sh droid

Inference

You can run inference with your own trained model or with the pre-trained models (🤗A0-1B and A0-170M).

# set keyword arguments --pretrained_model_name_or_path, --instruction and --image_path
bash inference.sh
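As a concrete example, the call inside inference.sh could be pointed at a checkpoint, an instruction, and an image like this (the entry-point name inference.py, the checkpoint path, the instruction text, and the image path are all placeholders for illustration; only the three flag names come from the script):

# The entry-point name and all values below are placeholders; only the flag names are from inference.sh
python inference.py \
    --pretrained_model_name_or_path /path/to/A0-1B \
    --instruction "pick up the cup" \
    --image_path /path/to/observation.png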

Citation

@article{xu2025a0,
      title={A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation}, 
      author={Rongtao Xu and Jian Zhang and Minghao Guo and Youpeng Wen and Haoting Yang and Min Lin and Jianzheng Huang and Zhe Li and Kaidong Zhang and Liqiong Wang and Yuxuan Kuang and Meng Cao and Feng Zheng and Xiaodan Liang},
      journal={arXiv preprint arXiv:2504.12636},
      year={2025},
}

Acknowledgement

RDT-1B
Track Anything
