Hongpeng Wang*, Zeyu Zhang*†, Wenhao Li, Hao Tang‡
*Equal contribution. †Project lead. ‡Corresponding author.
MoRL is a unified multimodal motion model designed to advance both human motion understanding and generation. Unlike prior approaches that treat user queries as a whole and lack explicit reasoning or planning, MoRL leverages a hierarchical post-training pipeline combining supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Our task-specific reward design is dual-headed: for motion understanding, we introduce semantic alignment and a novel reasoning coherence reward to enforce logically consistent reasoning traces; for motion generation, we combine text–motion consistency with a physical plausibility reward to ensure biomechanical validity and perceptual realism. To further enhance inference, we propose Chain-of-Motion (CoM), a test-time reasoning strategy that enables step-by-step planning and reflection. CoM improves both the robustness of reasoning-based motion understanding and the quality of motion generation through iterative selection and correction. This principle also guides the construction of two large-scale synthetic Chain-of-Thinking (CoT) datasets: MoUnd-CoT-140K and MoGen-CoT-140K, which align motion sequences with reasoning traces and concise action descriptions. Extensive experiments on HumanML3D and KIT-ML demonstrate that MoRL achieves significant gains over state-of-the-art baselines in both logical reasoning and perceptual realism. Our code, data, and models are open-sourced to facilitate further research in unified motion-language modeling.
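To make the dual-headed reward design concrete, here is a minimal sketch of how the two reward terms per task could be combined. The function names and the equal weighting are illustrative assumptions for exposition, not the repo's actual API; the paper specifies the reward terms but not these signatures.

```python
# Illustrative sketch of MoRL's dual-headed reward design.
# Function names and weights are hypothetical, not the repo's API.

def understanding_reward(semantic_alignment: float, reasoning_coherence: float,
                         w_sem: float = 0.5, w_coh: float = 0.5) -> float:
    """Motion understanding head: semantic alignment + reasoning coherence."""
    return w_sem * semantic_alignment + w_coh * reasoning_coherence

def generation_reward(text_motion_consistency: float, physical_plausibility: float,
                      w_con: float = 0.5, w_phy: float = 0.5) -> float:
    """Motion generation head: text-motion consistency + physical plausibility."""
    return w_con * text_motion_consistency + w_phy * physical_plausibility
```

In practice the weights would be tuned per task; the sketch only shows that each head scores a rollout with two verifiable terms.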
Overview of MoRL. MoRL unifies motion understanding and generation under a reinforcement learning paradigm. Motion and text inputs are tokenized into a shared representation space. A hierarchical post-training pipeline first applies SFT on large-scale synthetic CoT datasets to align motion sequences with reasoning traces and concise descriptions, then employs RLVR to refine outputs, enhancing semantic alignment, reasoning coherence, physical plausibility, and text–motion consistency. At inference, the Chain-of-Motion (CoM) decoding strategy enables step-by-step reasoning and reflection, improving both motion understanding and perceptually realistic motion generation.
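The "iterative selection and correction" behavior of Chain-of-Motion decoding can be sketched as a best-of-N loop with reflection steps. This is a hypothetical simplification under stated assumptions: `generate`, `refine`, and `score` stand in for the model's sampling, reflection, and reward scoring, and the accept-if-better rule is our reading of "selection and correction", not the repo's exact logic.

```python
from typing import Callable

def chain_of_motion(generate: Callable[[str], str],
                    refine: Callable[[str, str], str],
                    score: Callable[[str], float],
                    prompt: str,
                    n_candidates: int = 8,
                    refine_steps: int = 2) -> str:
    """Sketch of CoM decoding: sample N candidates, keep the best-scoring one,
    then iteratively refine it, accepting a revision only if it scores higher."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    best = max(candidates, key=score)           # selection
    for _ in range(refine_steps):               # reflection / correction
        revised = refine(prompt, best)
        if score(revised) > score(best):
            best = revised
    return best
```

The defaults mirror the evaluation flags below (8 candidates, 2 refinement steps).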
Motion CoT data engine. Built on MotionHubV2, one branch (MoUnd-CoT-140K) uses motion sequences and captions with Gemini to construct reasoning chains for understanding, while the other (MoGen-CoT-140K) builds reasoning chains for generation.

- Upload our paper to arXiv and build project pages.
- Upload the code.
- Release curated MoUnd-CoT / MoGen-CoT data. (see MoUnd-MoGen-CoT-140K)
- Release training checkpoints.
Install dependencies:
```bash
pip install -r requirements.txt
```

This repo provides helper scripts under prepare/:

- prepare/download_extractor.sh
- prepare/download_glove.sh
Run with bash:
```bash
bash prepare/download_extractor.sh
bash prepare/download_ckpt.sh
```

If you are on native Windows PowerShell, you can manually download and unzip these assets by following the URLs in the scripts.
You can download the pre-processed motion datasets from our HuggingFace page.
For custom data or the full AMASS/kitml/HumanML3D datasets, please follow the instructions in the dataset/ and prepare/ folders.
SFT for text-to-motion (t2m):

```bash
python train_mllm.py \
    --train-stage sft \
    --training-task t2m \
    --cot-train-jsonl path/to/cot_train.jsonl \
    --use-reasoning \
    --exp-name morl_sft_t2m
```

SFT for motion-to-text (m2t):

```bash
python train_mllm.py \
    --train-stage sft \
    --training-task m2t \
    --cot-train-jsonl path/to/cot_train.jsonl \
    --cot-task-filter m2t \
    --use-reasoning \
    --exp-name morl_sft_m2t
```

RLVR for text-to-motion (t2m):

```bash
python train_mllm.py \
    --train-stage rlvr \
    --training-task t2m \
    --cot-train-jsonl path/to/cot_train.jsonl \
    --rl-reference-ckpt experiments/morl_sft_t2m/motionllm_t2m_best.pth \
    --rl-epochs 3 \
    --rl-group-size 8 \
    --exp-name morl_rlvr_t2m
```

For m2t RLVR, set --training-task m2t and optionally --cot-task-filter m2t.
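As a rough intuition for what --rl-group-size controls, RLVR methods in this family typically sample a group of rollouts per prompt and normalize each rollout's verifiable reward against its own group. The sketch below shows that group normalization in isolation; it is an assumption about the general technique, not code from this repo.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group (size = --rl-group-size):
    advantage_i = (r_i - mean(group)) / (std(group) + eps).
    With identical rewards, every advantage is ~0 (no learning signal)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A larger group size gives a lower-variance baseline per prompt at the cost of more rollouts per update.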
Evaluate t2m:

```bash
python eval_mllm.py \
    --eval-task t2m \
    --eval-ckpt experiments/morl_rlvr_t2m/motionllm_rlvr_epoch_2.pth
```

Evaluate m2t:

```bash
python eval_mllm.py \
    --eval-task m2t \
    --eval-ckpt experiments/morl_rlvr_m2t/motionllm_rlvr_epoch_2.pth
```

Evaluate t2m with Chain-of-Motion (CoM) decoding:

```bash
python eval_mllm.py \
    --eval-task t2m \
    --eval-ckpt experiments/morl_rlvr_t2m/motionllm_rlvr_epoch_2.pth \
    --use-com \
    --com-candidates 8 \
    --com-refine-steps 2
```

If you find this project useful, please consider citing:
```bibtex
@article{wang2026morl,
  title={MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation},
  author={Wang, Hongpeng and Zhang, Zeyu and Li, Wenhao and Tang, Hao},
  journal={arXiv preprint arXiv:2602.14534},
  year={2026}
}
```

We thank the open-source communities behind Motion-Agent, MotionGPT, Qwen, and related motion-language benchmarks for their foundational contributions.