Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
We introduce LEO, an embodied multi-modal generalist agent capable of grounding, reasoning, chatting, planning, and acting in the 3D world. LEO is trained in a two-stage scheme: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning.
We meticulously collect extensive and diverse data for training LEO. † indicates the task contains our generated data; see Task and Data for details. The data statistics are shown below:
| Dataset | Task | 2D required? | 3D assets | #data |
|---|---|---|---|---|
| LEO-align | object captioning | ✗ | Objaverse | 660k |
| | object referring† | ✗ | ScanNet + 3RScan | 354k |
| | scene captioning† | ✗ | 3RScan | 20k |
| LEO-instruct | 3D captioning | ✗ | ScanNet | 37k |
| | 3D QA† | ✗ | ScanNet + 3RScan | 83k |
| | 3D dialogue† | ✗ | 3RScan | 11k |
| | task planning† | ✗ | 3RScan | 14k |
| | navigation | ✓ | MP3D | 60k |
| | manipulation | ✓ | CLIPort | 300k |
[2024.07] We release a few EAI data examples for demonstration purposes.
[2024.05] LEO is accepted by ICML 2024.
[2024.04] We release the inference and scaling law analysis scripts, model weights, and training code for the EAI tasks.
[2024.03] We release the code and data. The embodied AI (EAI) tasks (navigation and manipulation) need further organization and will be released soon.
[2024.01] We release a Huggingface interactive demo. Chat with LEO and enjoy yourself.
- Clone the GitHub repo.
git clone [email protected]:embodied-generalist/embodied-generalist.git
cd embodied-generalist
- Create a conda environment and install dependencies.
conda create -n leo python=3.9
conda activate leo
# install PyTorch; our version is shown below as an example
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
# install other dependencies with pip
pip install -r requirements.txt
# install peft separately to escape its install_requires
pip install peft==0.5.0 --no-deps
- Install third-party libraries (for point cloud backbones). Note that if the installation of PointNext fails, you can either 1) comment out the line importing PointNext in model/pcd_backbone.py or 2) download the compiled file and place it at model/pointnext/cpp/pointnet2_batch/, which may help. (A guarded-import sketch is shown at the end of this installation section as an alternative to option 1.)
cd model
# default PointNet++
cd pointnetpp
python setup.py install
cd ..
# optional: PointNext (if you want to substitute the default PointNet++)
cd pointnext/cpp/pointnet2_batch
python setup.py build_ext --inplace
cd ../../../
cd ..
# sanity check
python -c 'from model.pointnetpp.pointnetpp import PointNetPP'
# for PointNext, run 'from model.pointnext.pointnext import PointNext'
- Go through the task and data and model weights sections below, and you are ready to run.
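If you would rather not delete the PointNext import by hand, a guarded import achieves the same effect as option 1) above. This is a minimal sketch, not the actual code in model/pcd_backbone.py:

```python
# Minimal sketch of a guarded PointNext import (assumed layout, not the
# actual content of model/pcd_backbone.py). If the compiled extension is
# missing, fall back to the default PointNet++ backbone.
try:
    from model.pointnext.pointnext import PointNext  # needs the compiled CUDA extension
except ImportError:
    PointNext = None  # PointNext unavailable; only PointNet++ can be used

from model.pointnetpp.pointnetpp import PointNetPP  # default backbone
```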
Data preparation. The data includes two components: scan data and language annotations.
- Scan data. To simplify preparation and save storage, we streamline the scan data (point clouds and instance segments) to less than 10GB, which is already sufficient for experiments on LEO. You can download the compressed files from the links below and arrange them according to the scan data structure illustrated below.
- ScanNet: pcd_with_global_alignment, mask (Mask3D proposals).
- 3RScan: 3RScan-ours-align.
- Cap3D. Please refer to Cap3D data for preparing the point clouds, where we use pcs_pt. The corresponding annotation file (Cap3D_automated_Objaverse_no3Dword.csv) is included in our released annotations.
# scan data structure
├── ${scannet_base}
├── scan_data
│ └── pcd_with_global_alignment
│ ├── ${scan_id}.pth
└── mask
├── ${scan_id}.mask.npz
├── ${rscan_base}
└── 3RScan-ours-align
├── ${scan_id}
├── pcds.pth
├── pcd-align.pth
└── inst_to_label.pth
├── ${cap3d_root}
├── Cap3D_pcs_pt
│ ├── ${obj_id}.pt
└── Cap3D_automated_Objaverse_no3Dword.csv # included in annotations
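As a quick sanity check after arranging the scan data, you can load a single scan with torch and numpy. The script below is a sketch: the paths follow the structure above, but the internal layout of each .pth file is an assumption, so inspect the loaded object before relying on specific fields.

```python
import os

import numpy as np
import torch

scannet_base = "/path/to/scannet_base"  # hypothetical path; set to your own
scan_id = "scene0000_00"                # hypothetical ScanNet scan id

# Point cloud with global alignment (a torch pickle); its internal structure
# (tuple vs. dict of points/colors/labels) is not documented here, so print
# and inspect it rather than assuming a layout.
pcd = torch.load(os.path.join(scannet_base, "scan_data",
                              "pcd_with_global_alignment", f"{scan_id}.pth"))
print(type(pcd))

# Mask3D instance proposals are stored as a compressed numpy archive.
masks = np.load(os.path.join(scannet_base, "mask", f"{scan_id}.mask.npz"))
print(masks.files)  # arrays contained in the archive
```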
- Language annotations. The annotations are categorized into two parts according to the training stage. We provide a compressed file that wraps up all the annotations, which should be organized in the following structure:
# annotations structure
├── ${alignment_base}
├── obj_caption -> ${cap3d_root}
│ ├── Cap3D_pcs_pt
│ │ ├── ${obj_id}.pt
│ └── Cap3D_automated_Objaverse_no3Dword.csv
├── obj_scene_caption
│ ├── 3rscan_prompted.json
│ ├── 3rscan_scanscribe.json
│ ├── scannet_referit3d_nr3d_train.json
│ └── scannet_referit3d_sr3d+_train.json
└── scene_caption
├── 3rscan_scenecap_train.json
└── 3rscan_scenecap_val.json
├── ${instruction_base}
├── scan2cap
│ ├── scanrefer_train.json
│ ├── scanrefer_val.json
│ └── scanrefer_corpus.json
├── scanqa
│ ├── ScanQA_v1.0_train.json
│ └── ScanQA_v1.0_val.json
├── sqa3d
│ ├── v1_balanced_questions_train_scannetv2.json
│ ├── v1_balanced_questions_val_scannetv2.json
│ ├── v1_balanced_questions_test_scannetv2.json
│ ├── v1_balanced_sqa_annotations_train_scannetv2.json
│ ├── v1_balanced_sqa_annotations_val_scannetv2.json
│ ├── v1_balanced_sqa_annotations_test_scannetv2.json
│ └── axisAlignment.pth
├── 3rscanqa
│ ├── 3rscan_qa_train.json
│ └── 3rscan_qa_val.json
├── dialogue
│ ├── 3rscan_dialog_train.json
│ └── 3rscan_dialog_val.json
└── planning
├── 3rscan_plan_train.json
└── 3rscan_plan_val.json
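To verify the annotations are extracted correctly, you can load one of the JSON files and count its entries. The snippet is a sketch; the per-record schema differs across tasks, so it only peeks at the first record.

```python
import json
import os

instruction_base = "/path/to/instruction_base"  # hypothetical path; set to your own

# Load one annotation file following the structure above and peek at it.
with open(os.path.join(instruction_base, "scanqa", "ScanQA_v1.0_train.json")) as f:
    records = json.load(f)

print(f"loaded {len(records)} entries")
# The schema varies per task; print the first entry (or the top-level keys)
# instead of assuming specific field names.
print(records[0] if isinstance(records, list) else list(records)[:5])
```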
Data configurations. After data preparation, check configs/data/default.yaml to update the paths, including scan_family_base, rscan_base, alignment_base, and instruction_base.
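A small script can confirm the four paths resolve before launching anything. This is a sketch that assumes the keys sit at the top level of configs/data/default.yaml; adjust the access pattern if they are nested.

```python
import os

from omegaconf import OmegaConf

# Load the data config without resolving interpolations (Hydra configs may
# reference values that are only available at launch time).
cfg = OmegaConf.to_container(OmegaConf.load("configs/data/default.yaml"), resolve=False)

# Assumes the four base paths are top-level keys; adjust if they are nested.
for key in ["scan_family_base", "rscan_base", "alignment_base", "instruction_base"]:
    path = cfg.get(key)
    status = "found" if isinstance(path, str) and os.path.isdir(path) else "missing"
    print(f"{key}: {path} ({status})")
```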
Dataloaders. The implementation of dataset per task lies in data/datasets.py, where LeoMix aggregates various datasets as the training dataset.
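Conceptually, LeoMix concatenates the per-task datasets behind a single dataset interface. The sketch below illustrates the idea with torch's ConcatDataset; it is not the actual implementation in data/datasets.py.

```python
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class DummyTaskDataset(Dataset):
    """Hypothetical stand-in for one per-task dataset (e.g. ScanQA or Scan2Cap)."""

    def __init__(self, name: str, size: int):
        self.name, self.size = name, size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # The real datasets return point clouds, prompts, and responses;
        # here we only return a tag to show how the mixing works.
        return {"task": self.name, "index": idx}


# Mix several task datasets into a single training dataset, as LeoMix does conceptually.
mix = ConcatDataset([DummyTaskDataset("scanqa", 100), DummyTaskDataset("scan2cap", 50)])
loader = DataLoader(mix, batch_size=4, shuffle=True)
print(len(mix), next(iter(loader))["task"])
```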
EAI. We release a small subset of the EAI tasks with a few data examples for demonstration purposes. You can download it here. We recommend putting the extracted folders (mp3d_objnav and cliport) right inside the instruction_base path. Though testing in the simulator is not incorporated yet, the data is ready for training and validation of the EAI tasks.
Pretrained weights to load.
- LLM: Vicuna-7B. We use Vicuna v1.1 from FastChat, which you can refer to for access to Vicuna-13B or more advanced versions. Remember to update cfg_path in configs/llm/*.yaml.
- Point cloud backbone: PointNet++, PointBERT. We have not tried PointNext, but everything is ready except the pretrained weights. Remember to update path in configs/vision3d/backbone/*.yaml.
Trained LEO weights. We release two checkpoints here:
- align.pth: the checkpoint after the alignment stage, trained with LoRA.
- sft_noact.pth: the checkpoint after the instruction tuning stage, based on align.pth and tuned without embodied acting tasks.
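Before pointing the config at a checkpoint, you can inspect it with a few lines of Python. This sketch only assumes the file is a torch-serialized state dict (possibly wrapped in a dict); the actual key layout is not guaranteed.

```python
import torch

# Inspect a released checkpoint (e.g. sft_noact.pth) on CPU.
ckpt = torch.load("sft_noact.pth", map_location="cpu")

# Some checkpoints wrap the weights under a "state_dict" key; handle both cases.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} entries")
for name in list(state_dict)[:5]:
    value = state_dict[name]
    print(name, tuple(value.shape) if hasattr(value, "shape") else type(value))
```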
Training. The training pipeline is elaborated in trainer/leo_trainer.py. Make sure the config file configs/default.yaml is properly set up before running.
- General setup. We use wandb as the default experiment logger. Remember to modify logger.entity to your account and initialize wandb. Modify name, note, and base_dir for proper experiment output.
- Model. The components of LeoAgent can be configured in configs/llm, configs/vision2d, and configs/vision3d.
- Task. You can configure the tasks by specifying a yaml in configs/task. You can also run new tasks by creating similar configs.
- GPU usage. We run the experiments on NVIDIA A100-80GB and A800-80GB GPUs. Modify the dataloader arguments for your GPUs if necessary.
We prepare some running scripts in scripts/, covering two-stage training and evaluation. The core is to run launch.py with proper arguments. There are three launch modes:
# python launch
python launch.py --mode python --config configs/default.yaml <HYDRA_CONFIG>
# accelerate launch
python launch.py --mode accelerate --config configs/default.yaml <HYDRA_CONFIG>
# SLURM submitit launch, default
python launch.py --mode submitit --config configs/default.yaml <HYDRA_CONFIG>
# for example, run alignment with submitit
python launch.py --mode submitit \
--config configs/default.yaml \
--name leo_tuning \ # job name
--qos lv0b \ # QoS
--time 48 \ # job execution duration (hour)
--num_nodes 1 \
--partition HGX \ # node type
--gpu_per_node 4 \
--mem_per_gpu 100 \ # memory per GPU
--port 2050 \
task=align \ # hydra: cfg.task, select task
note=align_lora  # hydra: cfg.note, for exp_dir
Inference. We prepare an inference script scripts/inference.sh, where we run a different Python script, inference.py, in python mode by default:
# single-GPU python-mode launch
python launch.py --mode python \
--run_file inference.py \
--config configs/default.yaml \
note=tuning_noact \
pretrained_ckpt_path=null
Modify the probe arguments in configs/default.yaml to customize the inputs for inference. You can select a checkpoint by specifying either note or pretrained_ckpt_path. For the former, note should align with the corresponding note of the training exp_dir. For the latter, you should assign a checkpoint folder wherein pytorch_model.bin exists.
Launch mode. For explanation of the launch arguments, use python launch.py --help. Refer to SLURM submitit or Accelerate for more information.
We manually modify some methods of accelerate.Accelerator in common/misc.py, including gather_for_metrics (to fix gathering of non-tensor objects), get_state_dict (to save only learnable parameters when calling save_state), and prepare_scheduler (to fix its behavior with gradient accumulation).
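The general pattern for such overrides is to replace the bound method on the Accelerator class. The sketch below illustrates the pattern only; the replacement body is hypothetical and not the logic in common/misc.py.

```python
import torch
from accelerate import Accelerator

# Keep a handle to the original method so the patched version can defer to it.
_original_gather_for_metrics = Accelerator.gather_for_metrics


def gather_for_metrics(self, input_data, *args, **kwargs):
    # Hypothetical fallback: pass non-tensor objects through untouched instead
    # of gathering them (NOT the actual fix in common/misc.py).
    if not isinstance(input_data, torch.Tensor):
        return input_data
    return _original_gather_for_metrics(self, input_data, *args, **kwargs)


# Patch the class so every Accelerator instance uses the modified method.
Accelerator.gather_for_metrics = gather_for_metrics
```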
@inproceedings{huang2024embodied,
title={An Embodied Generalist Agent in 3D World},
author={Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
year={2024}
}