ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model


πŸ“° News

  • May 29, 2025: We are excited to announce the release of ChatVLA-2! The paper is available here and the project website can be accessed here.
  • Feb 20, 2025: We are excited to announce the release of ChatVLA! The paper is available here and the project website can be accessed here.


Install

  1. Clone this repository and navigate to the ChatVLA folder
git clone https://github.com/tutujingyugang1/ChatVLA_public.git
cd ChatVLA_public
  2. Install packages
conda create -n chatvla python=3.10 -y
conda activate chatvla
pip install --upgrade pip
pip install -r requirements.txt
cd policy_heads
pip install -e .

For training acceleration, please install flash_attention.

pip install flash-attn --no-build-isolation
  3. For evaluation on the multimodal understanding task, install the additional packages listed here.

Download Qwen2_VL Weights

We construct the VLM backbone by integrating Qwen2-VL-2B, a powerful and efficient model, into our framework. Qwen2-VL-2B serves as the core of our architecture, providing robust vision-language capabilities. We use the off-the-shelf Qwen2-VL model without any post-training on the VLM itself. You can download the official weights from this link:

Model Link
Qwen2-VL (~2B) huggingface

❗❗ After downloading the standard weights, you must replace the official "config.json" with our "doc/config.json", which is designed for VLA.
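For example, assuming the weights were downloaded to ./Qwen2-VL-2B (a placeholder path; use wherever you saved them), the replacement can be sketched as:

```shell
# Back up the official config, then drop in the VLA-specific one.
# ./Qwen2-VL-2B is a placeholder for your download directory.
mv ./Qwen2-VL-2B/config.json ./Qwen2-VL-2B/config.json.bak
cp doc/config.json ./Qwen2-VL-2B/config.json
```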

Data Preparation

  1. Our data format is the same as DexVLA: convert your data into HDF5 (h5py) format. Refer to DexVLA for more example data and preparation details.
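DexVLA follows an ACT-style episode layout; the sketch below writes one toy episode in that style. The exact keys (/observations/qpos, /observations/images/&lt;camera&gt;, /action) and shapes are assumptions modeled on that format, so verify them against DexVLA's documentation and your own robot data.

```python
import h5py
import numpy as np

# Write one toy episode in an ACT/DexVLA-style HDF5 layout.
# Key names and shapes here are assumptions -- verify against DexVLA's docs.
T = 8                      # episode length (timesteps)
cameras = ["left", "right", "wrist"]

with h5py.File("episode_0.hdf5", "w") as f:
    f.create_dataset("/observations/qpos", data=np.zeros((T, 14), dtype=np.float32))
    for cam in cameras:
        f.create_dataset(f"/observations/images/{cam}",
                         data=np.zeros((T, 480, 640, 3), dtype=np.uint8))
    f.create_dataset("/action", data=np.zeros((T, 14), dtype=np.float32))

# Read it back to confirm the layout.
with h5py.File("episode_0.hdf5", "r") as f:
    assert f["/action"].shape == (T, 14)
    assert f["/observations/images/left"].shape == (T, 480, 640, 3)
```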

  2. Download the llava_v1_5_mix665k dataset from here, or use your own vision-language data in LLaVA format.

    [
        {
            "id": "000000033471",
            "image": "coco/train2017/000000033471.jpg",
            "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            }
            ]
        },
    ]
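A quick sanity check of custom VL data against this schema can be sketched with the standard library (the field names below follow the example entry above; the validator itself is illustrative, not part of the repo):

```python
import json

def validate_llava_entry(entry: dict) -> None:
    """Raise if an entry does not match the LLaVA conversation schema above."""
    assert "id" in entry and "image" in entry and "conversations" in entry
    turns = entry["conversations"]
    assert turns, "conversations must be non-empty"
    for turn in turns:
        assert turn["from"] in ("human", "gpt")
        assert isinstance(turn["value"], str)
    # The image placeholder should appear in a human turn.
    assert any("<image>" in t["value"] for t in turns if t["from"] == "human")

sample = json.loads("""
[{"id": "000000033471",
  "image": "coco/train2017/000000033471.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\\nWhat are the colors of the bus in the image?"},
    {"from": "gpt", "value": "The bus in the image is white and red."}]}]
""")
for entry in sample:
    validate_llava_entry(entry)
```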
  3. Add entries in constants.py to specify the path of your data, as follows.
    "example_tasks_stage_1": { # for Stage 1 w/o Vision-Language Data
        'dataset_dir': [
            ROBOT_DATA_DIR + '/your_task_1',
            ROBOT_DATA_DIR + '/your_task_2',
        ],
        'episode_len': 1000,
        'camera_names': ['left', 'right', 'wrist'],
    },
    "example_tasks_stage_2": { # for Stage 2 with Vision-Language Data
        'dataset_dir': [
            ROBOT_DATA_DIR + '/your_task_1',
            ROBOT_DATA_DIR + '/your_task_2',
        ],
        'vl_file': os.path.join(VL_IMAGE_DIR, "llava_v1_5_mix665k.json"), # replace to your own VL Data if needed
        'vl_image_dir': os.path.join(VL_IMAGE_DIR, "data"),
        'episode_len': 1000,
        'camera_names': ['left', 'right', 'wrist'],
    },
  4. Save the original Qwen2_VL weights to initialize the MoE.

    You can refer to save_mlp_weights_for_init_moe.py.
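The idea, sketched below on a plain dict standing in for a state dict, is to duplicate each pretrained MLP into per-expert copies so that every expert starts from the VLM weights. The key naming ("mlp" → "mlp.experts.&lt;i&gt;") is an illustrative assumption; the actual mapping is defined in save_mlp_weights_for_init_moe.py.

```python
import copy

def init_moe_from_mlp(state_dict: dict, num_experts: int = 2) -> dict:
    """Duplicate every '.mlp.' weight into per-expert copies.

    Key naming ('.mlp.' -> '.mlp.experts.<i>.') is an illustrative assumption;
    the real mapping lives in save_mlp_weights_for_init_moe.py.
    """
    out = {}
    for key, weight in state_dict.items():
        if ".mlp." in key:
            for i in range(num_experts):
                out[key.replace(".mlp.", f".mlp.experts.{i}.")] = copy.deepcopy(weight)
        else:
            out[key] = weight
    return out

sd = {"layers.0.mlp.up_proj.weight": [1.0], "layers.0.attn.q_proj.weight": [2.0]}
moe_sd = init_moe_from_mlp(sd)
```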

🦾Training

We provide training scripts for both stages, train_stage_1.sh and train_stage_2.sh. For each script, change the following parameters:

  1. OUTPUT: the save directory for training.

    ❗ the keyword "qwen2" must be included in OUTPUT.

  2. TASK: the tasks used for training. This should correspond to your own task name in constants.py.

    ❗ Stage 2 should use a different task name from Stage 1, as it utilizes vision-language data in training.

  3. MNOP: model name or path. Set this to the path of the pretrained VLM weights.

Other hyperparameters, such as "batch_size" and "save_steps", can be adjusted according to your compute resources.

Start training with the following commands:

Stage 1: Training with Robot Data only

Key arguments:

--using_moe True \
--init_moe True \
--freeze_vl_expert True \
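Putting the pieces together, a Stage 1 launch might look like the sketch below. The variable values are placeholders, and the remaining flags already live in train_stage_1.sh:

```shell
# In train_stage_1.sh, set (values below are placeholders):
#   OUTPUT=./checkpoints/chatvla_qwen2_stage1   # must contain the keyword "qwen2"
#   TASK=example_tasks_stage_1                  # task name defined in constants.py
#   MNOP=/path/to/Qwen2-VL-2B                   # pretrained VLM weights
# then launch:
bash train_stage_1.sh
```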

Stage 2: Co-training with VL Data

Key arguments:

--using_moe True \
--init_moe False \
--vl_ratio 0.33 \

vl_ratio controls the ratio of VL data to robot data; you can change it as needed.
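As an illustration of what the ratio means (this scheduling logic is a hypothetical sketch, not the repo's actual sampler), vl_ratio = 0.33 yields roughly one VL batch for every two robot batches:

```python
def batch_schedule(num_batches: int, vl_ratio: float) -> list:
    """Deterministically interleave 'vl' and 'robot' batches at vl_ratio.

    Hypothetical sketch: ChatVLA's real data loader may sample differently.
    """
    schedule, vl_seen = [], 0
    for step in range(1, num_batches + 1):
        if vl_seen / step < vl_ratio:   # behind quota -> serve a VL batch
            schedule.append("vl")
            vl_seen += 1
        else:
            schedule.append("robot")
    return schedule

sched = batch_schedule(9, 0.33)  # 3 VL batches, 6 robot batches
```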

ChatVLA Weights

You can download the ChatVLA weights from this link:

Model Link
ChatVLA huggingface

Evaluation

❗❗ Make sure your trained checkpoint directory contains two files: "preprocessor_config.json" and "chat_template.json". If not, copy them from the downloaded Qwen2_VL weights or this link.

Evaluation on Real Robot

You can refer to our evaluation script evaluate_robot.py to evaluate your ChatVLA.

Evaluation on Multi-modal Understanding Tasks

We leverage the excellent VLMEvalKit for evaluating ChatVLA. The toolkit has been integrated into our project with minor modifications to support ChatVLA's evaluation framework.

To evaluate on multi-modal understanding tasks, you should:

  1. Set a path "LMUData" where datasets will be downloaded (the default path is '~'). Your LMUData folder should look like:
LMUData
β”œβ”€β”€ images/
β”‚   β”œβ”€β”€ MMMU/
β”‚   └── MMStar/
β”œβ”€β”€ MMStar.tsv
└── MMMU_DEV_VAL.tsv
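If the path is configured via an environment variable (VLMEvalKit reads LMUData; confirm against its documentation), this can look like:

```shell
# Point VLMEvalKit at the dataset root; /data/LMUData is a placeholder path.
export LMUData=/data/LMUData
```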
  2. Modify the config config_vla.json to choose the model path and the benchmarks you want to evaluate on.
  3. Run the evaluation script evaluate_vqa.sh to evaluate ChatVLA on multi-modal understanding tasks.

Note: To evaluate ChatVLA on more benchmarks, modify the config following the original VLMEvalKit settings. You can refer to it for more details.

Acknowledgement

We build our project based on:

  • LLaVA: an amazing open-sourced project for vision language assistant
  • act-plus-plus: an amazing open-sourced project for robotics visuomotor learning
  • Mipha: an amazing open-sourced project for tiny vision language models
  • DexVLA: an amazing open-sourced Vision-Language Model with Plug-In Diffusion Expert for Visuomotor Policy Learning
  • VLMEvalKit: an amazing open-source evaluation toolkit of large vision-language models (LVLMs)

Citation

# ChatVLA
@article{zhou2025chatvla,
  title={Chatvla: Unified multimodal understanding and robot control with vision-language-action model},
  author={Zhou, Zhongyi and Zhu, Yichen and Zhu, Minjie and Wen, Junjie and Liu, Ning and Xu, Zhiyuan and Meng, Weibin and Cheng, Ran and Peng, Yaxin and Shen, Chaomin and others},
  journal={arXiv preprint arXiv:2502.14420},
  year={2025}
}

# ChatVLA-2
@article{zhou2025vision,
  title={Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge},
  author={Zhou, Zhongyi and Zhu, Yichen and Wen, Junjie and Shen, Chaomin and Xu, Yi},
  journal={arXiv preprint arXiv:2505.21906},
  year={2025}
}
