
MM-ACT: Learn from Multimodal Parallel Generation to Act

arXiv Hugging Face Models Hugging Face Datasets


MM-ACT Arch

MM-ACT is a unified model that integrates text, image, and action into a shared token space, performing generation across all three modalities. It adopts a re-mask parallel decoding strategy for text and image generation, and a one-step parallel decoding strategy for action generation to improve efficiency.
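The two decoding strategies can be illustrated with a small sketch. This is not MM-ACT's actual code: `score_fn` is a dummy stand-in for the model's parallel forward pass, and the confidence-based commit schedule is one common way to implement re-masking, assumed here for illustration.

```python
import numpy as np

MASK = -1  # sentinel id for masked positions

def remask_decode(seq_len, n_steps, score_fn):
    """Re-mask parallel decoding (text/image): predict every masked
    position in parallel, commit only the most confident predictions,
    re-mask the rest, and repeat for n_steps."""
    tokens = np.full(seq_len, MASK)
    for step in range(n_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        preds, confs = score_fn(tokens, masked)    # aligned with `masked`
        # Commit enough positions to finish after n_steps total steps.
        k = max(1, int(np.ceil(masked.size / (n_steps - step))))
        order = np.argsort(-confs)[:k]             # most confident first
        tokens[masked[order]] = preds[order]
    return tokens

def one_step_decode(seq_len, score_fn):
    """One-step parallel decoding (action): a single forward pass
    fills every position at once -- no iterative refinement."""
    tokens = np.full(seq_len, MASK)
    masked = np.flatnonzero(tokens == MASK)
    preds, _ = score_fn(tokens, masked)
    tokens[masked] = preds
    return tokens

# Dummy scorer standing in for the real model forward pass.
rng = np.random.default_rng(0)
def dummy_score(tokens, masked):
    return rng.integers(0, 2048, masked.size), rng.random(masked.size)

text_out = remask_decode(seq_len=16, n_steps=4, score_fn=dummy_score)
action_out = one_step_decode(seq_len=7, score_fn=dummy_score)
```

The efficiency gap follows directly: re-mask decoding costs `n_steps` forward passes per sequence, while one-step action decoding costs exactly one.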

This repository contains:

  • Training pipelines and deployment scripts for one-step parallel decoding and re-mask parallel decoding strategies across three modalities: action, text, and image.
  • Scripts for evaluation on LIBERO and Robotwin, as well as the data collection pipeline used for task planning annotation on Robotwin.

🛠️ Installation

1. Clone Repo and Environment Setup

git clone https://github.com/HHYHRHY/MM-ACT.git
cd MM-ACT

# Create environment
conda create -n mmact python=3.13
conda activate mmact

# Install requirements
pip install -r requirements.txt

2. Dataset Preparation

  • LIBERO

    We use the LIBERO datasets from Huggingface_LeRobot and load robot data via LeRobot. Please download LIBERO-Object, LIBERO-Spatial, LIBERO-Goal, and LIBERO-10. For LIBERO-10, we also provide our task-planning datasets in LIBERO-10-task.

  • RoboTwin

    For the RoboTwin datasets, we use a dataset sampling pipeline that includes task-planning generation. You can download our datasets or collect your own with our pipeline in Robotwin_subtask. This branch updates the original RoboTwin data collection pipeline to support our subtask text annotations; collection usage is identical to the main branch. Please report any bugs or questions about the text annotations in MM-ACT's issues.

3. Model Weight Preparation

Download the base model weights from MMaDA: MMaDA-8B-Base and expand the original model's action codebook (we use 2048):

cp models/configuration_llada.py models/modeling_llada.py ${origin_model_path}
python model_utils/resize_model_vocab.py --model ${origin_model_path} --out ${output_model_path} --num_new ${action_codebook_size}

In addition, please download the image quantizer weights from showlab/magvitv2 and update your local weight path (e.g., "/xxx/magvitv2") in vq_model_name (training configs) and vq_model_path (experiments and deployment configs).
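Conceptually, the vocab-resize step grows the model's embedding (and output head) matrices by the action codebook size so action tokens get their own ids. A minimal numpy sketch of the idea (the real `resize_model_vocab.py` operates on the LLaDA checkpoint; the mean-initialization heuristic below is an assumption for illustration, not necessarily what the script does):

```python
import numpy as np

def expand_vocab(embed, num_new, rng):
    """Append `num_new` rows for action tokens to an embedding matrix.
    New rows start at the mean of existing rows plus small noise,
    a common heuristic when extending a vocabulary."""
    vocab, dim = embed.shape
    mean = embed.mean(axis=0, keepdims=True)
    new_rows = mean + 0.01 * rng.standard_normal((num_new, dim))
    return np.concatenate([embed, new_rows], axis=0)

rng = np.random.default_rng(0)
embed = rng.standard_normal((1000, 8))  # toy stand-in for the LM embedding table
expanded = expand_vocab(embed, num_new=2048, rng=rng)  # 2048 = action codebook size
```

With the paper's codebook size of 2048, a 1000-row toy table grows to 3048 rows; the real checkpoint grows by the same 2048 rows.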

🚀 Training

We provide training pipelines for both LIBERO and RoboTwin. You can refer to the explanations of the configuration settings in configs/README.md. Single-node training can be launched using accelerate:

accelerate launch \
  --config_file accelerate_configs/1_node_8_gpus_deepspeed_zero2.yaml \
  --main_process_port 8888 \
  training/{your_training_script}.py \
  config=configs/{your_training_config}.yaml

Multi-node training can follow the script in shell/training.sh, adapted to your cluster's launch commands. For LIBERO, we provide three specific pipelines:

1. Text-Only Training: training/train_mmact_libero_mmu.py, used in LIBERO-10 stage-1 training.

2. Action-Only Training: training/train_mmact_libero_action.py, used for action training on all LIBERO benchmarks.

3. Mixed (Text & Action) Training: training/train_mmact_libero_mix.py, used in LIBERO-10 stage-2 training.

For RoboTwin, we provide a unified mixed-modality pipeline in training/train_mmact_robotwin_mix.py. You can train on arbitrary modality combinations by adjusting the parameters in the configs.
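For illustration, a mixed-modality config might look like the fragment below. Apart from vq_model_name (mentioned above), every key name here is a hypothetical placeholder; consult configs/README.md for the actual schema.

```yaml
# Hypothetical fragment -- see configs/README.md for the real key names.
model:
  vq_model_name: /path/to/magvitv2   # image quantizer weights (see above)
data:
  modalities: [text, action]         # placeholder: which streams to mix
  mix_ratio: [0.3, 0.7]              # placeholder: per-modality sampling weights
training:
  batch_size: 8
  learning_rate: 1.0e-4
```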

For real robots, first convert your real-robot data into the LeRobot format, then refer to the training/train_mmact_libero_action.py script to train on it.
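As a starting point for that conversion, each timestep of a recorded episode becomes a per-frame record keyed by LeRobot-style feature names. The sketch below only builds the frame dicts with dummy data; the camera key is illustrative, and a real conversion should write episodes through LeRobot's own dataset-creation API rather than raw dicts.

```python
import numpy as np

def episode_to_frames(images, states, actions, fps=30):
    """Flatten one recorded episode into LeRobot-style frame dicts.
    `images`: (T, H, W, 3) uint8; `states`/`actions`: (T, D) float arrays."""
    assert len(images) == len(states) == len(actions)
    frames = []
    for t in range(len(images)):
        frames.append({
            "observation.images.cam_front": images[t],   # illustrative camera key
            "observation.state": states[t].astype(np.float32),
            "action": actions[t].astype(np.float32),
            "timestamp": t / fps,                        # seconds since episode start
        })
    return frames

# Dummy 5-step episode with a 64x64 camera and 7-DoF state/action.
rng = np.random.default_rng(0)
frames = episode_to_frames(
    images=rng.integers(0, 256, (5, 64, 64, 3), dtype=np.uint8),
    states=rng.standard_normal((5, 7)),
    actions=rng.standard_normal((5, 7)),
)
```

Keeping state and action as float32 at this stage matches what robot-learning dataloaders typically expect downstream.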

⚡ Evaluation & Deployment

Our trained model weights can be found at: MM-ACT-weights.

For LIBERO and RoboTwin evaluation, please refer to experiments/README.md for detailed instructions.

For real-world deployment, please refer to the script provided at: deployment/mmact_deploy.py

🎥 Real-world Experiments (Video Demo)

MM-ACT_demo.mp4

A video demonstration of MM-ACT on real-world Franka experiments is also available on YouTube: YouTube Link.

Citation

If you find our work helpful, please cite us:

@article{liang2025mm,
  title={MM-ACT: Learn from Multimodal Parallel Generation to Act},
  author={Liang, Haotian and Chen, Xinyi and Wang, Bin and Chen, Mingkang and Liu, Yitian and Zhang, Yuhao and Chen, Zanxin and Yang, Tianshuo and Chen, Yilun and Pang, Jiangmiao and others},
  journal={arXiv preprint arXiv:2512.00975},
  year={2025}
}

Acknowledgments

This work builds on MMaDA, RoboTwin, LIBERO, LeRobot, and OpenVLA. Thanks for these great works.

About

[CVPR'2026] "MM-ACT: Learn from Multimodal Parallel Generation to Act"
