ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model


πŸ“° News

  • May 29, 2025: We are excited to announce the release of ChatVLA-2! The paper is available here and the project website can be accessed here.
  • Feb 20, 2025: We are excited to announce the release of ChatVLA! The paper is available here and the project website can be accessed here.


Install

  1. Clone this repository and navigate to the ChatVLA folder
git clone https://github.com/tutujingyugang1/ChatVLA_public.git
cd ChatVLA_public
  2. Install packages
conda create -n chatvla python=3.10 -y
conda activate chatvla
pip install --upgrade pip
pip install -r requirements.txt
cd policy_heads
pip install -e .

For training acceleration, please install flash_attention.

pip install flash-attn --no-build-isolation
  3. For evaluation on the multimodal understanding task, install the additional packages listed here.

Download Qwen2_VL Weights

We construct the VLM backbone by integrating Qwen2-VL-2B, a powerful and efficient model, into our framework. Qwen2-VL-2B serves as the core of our architecture, providing robust vision-language capabilities. We use the off-the-shelf Qwen2-VL model without any post-training on the VLM itself. You can download the official weights from this link:

Model Link
Qwen2-VL (~2B) huggingface

❗❗ After downloading the standard weights, you must replace the official "config.json" with our "doc/config.json", which is designed for VLA.
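For example, assuming the weights were downloaded to ./Qwen2-VL-2B (a placeholder path; use wherever you saved them), the replacement can be sketched as:

```shell
# Back up the official config, then drop in the VLA-specific one.
# ./Qwen2-VL-2B is a placeholder for your download directory.
mv ./Qwen2-VL-2B/config.json ./Qwen2-VL-2B/config.json.bak
cp doc/config.json ./Qwen2-VL-2B/config.json
```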

Data Preparation

  1. Our data format is the same as DexVLA: convert your data into HDF5 (h5py) format. Refer to DexVLA for more example data and preparation details.
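DexVLA follows an ACT-style episode layout; the sketch below writes one toy episode in that style. The exact keys (/observations/qpos, /observations/images/&lt;camera&gt;, /action) and shapes are assumptions modeled on that format, so verify them against DexVLA's documentation and your own robot data.

```python
import h5py
import numpy as np

# Write one toy episode in an ACT/DexVLA-style HDF5 layout.
# Key names and shapes here are assumptions -- verify against DexVLA's docs.
T = 8                      # episode length (timesteps)
cameras = ["left", "right", "wrist"]

with h5py.File("episode_0.hdf5", "w") as f:
    f.create_dataset("/observations/qpos", data=np.zeros((T, 14), dtype=np.float32))
    for cam in cameras:
        f.create_dataset(f"/observations/images/{cam}",
                         data=np.zeros((T, 480, 640, 3), dtype=np.uint8))
    f.create_dataset("/action", data=np.zeros((T, 14), dtype=np.float32))

# Read it back to confirm the layout.
with h5py.File("episode_0.hdf5", "r") as f:
    assert f["/action"].shape == (T, 14)
    assert f["/observations/images/left"].shape == (T, 480, 640, 3)
```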

  2. Download the llava_v1_5_mix665k dataset from here, or use your own vision-language data in LLaVA format.

    [
        {
            "id": "000000033471",
            "image": "coco/train2017/000000033471.jpg",
            "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            }
            ]
        },
    ]
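A quick sanity check of custom VL data against this schema can be sketched with the standard library (the field names below follow the example entry above; the validator itself is illustrative, not part of the repo):

```python
import json

def validate_llava_entry(entry: dict) -> None:
    """Raise if an entry does not match the LLaVA conversation schema above."""
    assert "id" in entry and "image" in entry and "conversations" in entry
    turns = entry["conversations"]
    assert turns, "conversations must be non-empty"
    for turn in turns:
        assert turn["from"] in ("human", "gpt")
        assert isinstance(turn["value"], str)
    # The image placeholder should appear in a human turn.
    assert any("<image>" in t["value"] for t in turns if t["from"] == "human")

sample = json.loads("""
[{"id": "000000033471",
  "image": "coco/train2017/000000033471.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\\nWhat are the colors of the bus in the image?"},
    {"from": "gpt", "value": "The bus in the image is white and red."}]}]
""")
for entry in sample:
    validate_llava_entry(entry)
```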
  3. Add entries in constants.py to specify the path of your data, as follows.
    "example_tasks_stage_1": { # for Stage 1 w/o Vision-Language Data
        'dataset_dir': [
            ROBOT_DATA_DIR + '/your_task_1',
            ROBOT_DATA_DIR + '/your_task_2',
        ],
        'episode_len': 1000,
        'camera_names': ['left', 'right', 'wrist'],
    },
    "example_tasks_stage_2": { # for Stage 2 with Vision-Language Data
        'dataset_dir': [
            ROBOT_DATA_DIR + '/your_task_1',
            ROBOT_DATA_DIR + '/your_task_2',
        ],
        'vl_file': os.path.join(VL_IMAGE_DIR, "llava_v1_5_mix665k.json"), # replace to your own VL Data if needed
        'vl_image_dir': os.path.join(VL_IMAGE_DIR, "data"),
        'episode_len': 1000,
        'camera_names': ['left', 'right', 'wrist'],
    },
  4. Save the original Qwen2_VL weights to initialize the MoE.

    You can refer to save_mlp_weights_for_init_moe.py.
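The idea, sketched below on a plain dict standing in for a state dict, is to duplicate each pretrained MLP into per-expert copies so that every expert starts from the VLM weights. The key naming ("mlp" → "mlp.experts.&lt;i&gt;") is an illustrative assumption; the actual mapping is defined in save_mlp_weights_for_init_moe.py.

```python
import copy

def init_moe_from_mlp(state_dict: dict, num_experts: int = 2) -> dict:
    """Duplicate every '.mlp.' weight into per-expert copies.

    Key naming ('.mlp.' -> '.mlp.experts.<i>.') is an illustrative assumption;
    the real mapping lives in save_mlp_weights_for_init_moe.py.
    """
    out = {}
    for key, weight in state_dict.items():
        if ".mlp." in key:
            for i in range(num_experts):
                out[key.replace(".mlp.", f".mlp.experts.{i}.")] = copy.deepcopy(weight)
        else:
            out[key] = weight
    return out

sd = {"layers.0.mlp.up_proj.weight": [1.0], "layers.0.attn.q_proj.weight": [2.0]}
moe_sd = init_moe_from_mlp(sd)
```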

🦾Training

We provide training scripts for both stages, train_stage_1.sh and train_stage_2.sh. For each script, change the following parameters:

  1. OUTPUT: the save directory for training.

    ❗ the keyword "qwen2" must be included in OUTPUT.

  2. TASK: the tasks used for training. This should correspond to your own task name in constants.py.

    ❗ Stage 2 should use a different task name from Stage 1, as it utilizes vision-language data in training.

  3. MNOP: model name or path. Set this to the path of the pretrained VLM weights.

Other hyperparameters, such as "batch_size" and "save_steps", can be adjusted according to your compute resources.

Start training with the following commands:

Stage 1: Training with Robot Data only

Key arguments:

--using_moe True \
--init_moe True \
--freeze_vl_expert True \
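Putting the pieces together, a Stage 1 launch might look like the sketch below. The variable values are placeholders, and the remaining flags already live in train_stage_1.sh:

```shell
# In train_stage_1.sh, set (values below are placeholders):
#   OUTPUT=./checkpoints/chatvla_qwen2_stage1   # must contain the keyword "qwen2"
#   TASK=example_tasks_stage_1                  # task name defined in constants.py
#   MNOP=/path/to/Qwen2-VL-2B                   # pretrained VLM weights
# then launch:
bash train_stage_1.sh
```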

Stage 2: Co-training with VL Data

Key arguments:

--using_moe True \
--init_moe False \
--vl_ratio 0.33 \

vl_ratio controls the ratio of VL data to robot data; you can change it as needed.
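As an illustration of what the ratio means (this scheduling logic is a hypothetical sketch, not the repo's actual sampler), vl_ratio = 0.33 yields roughly one VL batch for every two robot batches:

```python
def batch_schedule(num_batches: int, vl_ratio: float) -> list:
    """Deterministically interleave 'vl' and 'robot' batches at vl_ratio.

    Hypothetical sketch: ChatVLA's real data loader may sample differently.
    """
    schedule, vl_seen = [], 0
    for step in range(1, num_batches + 1):
        if vl_seen / step < vl_ratio:   # behind quota -> serve a VL batch
            schedule.append("vl")
            vl_seen += 1
        else:
            schedule.append("robot")
    return schedule

sched = batch_schedule(9, 0.33)  # 3 VL batches, 6 robot batches
```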

ChatVLA Weights

You can download the ChatVLA weights from this link:

Model Link
ChatVLA huggingface

Evaluation

❗❗ Make sure your trained checkpoint directory contains two files: "preprocessor_config.json" and "chat_template.json". If not, copy them from the downloaded Qwen2_VL weights or this link.

Evaluation on Real Robot

You can refer to our evaluation script evaluate_robot.py to evaluate your ChatVLA.

Evaluation on Multi-modal Understanding Tasks

We leverage the excellent VLMEvalKit for evaluating ChatVLA. The toolkit has been integrated into our project with minor modifications to support ChatVLA's evaluation framework.

To evaluate on multi-modal understanding tasks, you should:

  1. Set a path "LMUData" where datasets will be downloaded (the default path is '~'). Your LMUData folder should look like:
LMUData
β”œβ”€β”€ images/
β”‚   β”œβ”€β”€ MMMU/
β”‚   └── MMStar/
β”œβ”€β”€ MMStar.tsv
└── MMMU_DEV_VAL.tsv
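If the path is configured via an environment variable (VLMEvalKit reads LMUData; confirm against its documentation), this can look like:

```shell
# Point VLMEvalKit at the dataset root; /data/LMUData is a placeholder path.
export LMUData=/data/LMUData
```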
  2. Modify the config config_vla.json to choose the model path and the benchmarks you want to evaluate on.
  3. Run the evaluation script evaluate_vqa.sh to evaluate ChatVLA on multi-modal understanding tasks.

Note: To evaluate ChatVLA on more benchmarks, modify the config following the original VLMEvalKit settings. You can refer to it for more details.

Acknowledgement

We build our project based on:

  • LLaVA: an amazing open-sourced project for vision language assistant
  • act-plus-plus: an amazing open-sourced project for robotics visuomotor learning
  • Mipha: an amazing open-sourced project for tiny vision language models
  • DexVLA: an amazing open-sourced Vision-Language Model with Plug-In Diffusion Expert for Visuomotor Policy Learning
  • VLMEvalKit: an amazing open-source evaluation toolkit of large vision-language models (LVLMs)

Citation

# ChatVLA
@article{zhou2025chatvla,
  title={Chatvla: Unified multimodal understanding and robot control with vision-language-action model},
  author={Zhou, Zhongyi and Zhu, Yichen and Zhu, Minjie and Wen, Junjie and Liu, Ning and Xu, Zhiyuan and Meng, Weibin and Cheng, Ran and Peng, Yaxin and Shen, Chaomin and others},
  journal={arXiv preprint arXiv:2502.14420},
  year={2025}
}

# ChatVLA-2
@article{zhou2025vision,
  title={Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge},
  author={Zhou, Zhongyi and Zhu, Yichen and Wen, Junjie and Shen, Chaomin and Xu, Yi},
  journal={arXiv preprint arXiv:2505.21906},
  year={2025}
}
