Cheng Luo1* · Bizhu Wu2,4,5* · Bing Li1† · Jianfeng Ren4 · Ruibin Bai4 · Rong Qu5 · Linlin Shen2,3† · Bernard Ghanem1
1King Abdullah University of Science and Technology
2School of Artificial Intelligence, Shenzhen University
3Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University
4School of Computer Science, University of Nottingham Ningbo China
5School of Computer Science, University of Nottingham, UK
*Equal contribution †Corresponding author
Paper | Project Page | Video | Hugging Face
We introduce Reactive Listener Motion Generation from Speaker Utterance — a new task that generates naturalistic listener body motions appropriately responding to a speaker's utterance. Our unified framework ReactMotion jointly models text, audio, emotion, and motion with preference-based objectives, producing natural, diverse, and appropriate listener responses.
- [2026.03.17] 🎮 Inference Demo & Gradio UI released
- [2026.03.16] 🎯 Full Training, Evaluation Code released
Modeling nonverbal listener behavior is challenging due to the inherently non-deterministic nature of human reactions — the same speaker utterance can elicit many appropriate listener responses.
🔥 ReactMotion generates naturalistic listener body motions from speaker utterance (text + audio + emotion), trained with preference-based ranking on our ReactMotionNet dataset that captures the one-to-many nature of listener behavior.
We present:
- ReactMotionNet — A large-scale dataset pairing speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness (gold/silver/negative)
- ReactMotion — A unified generative framework built on T5 that jointly models text, audio, emotion, and motion, trained with preference-based ranking objectives
- JudgeNetwork — A multi-modal contrastive scorer for best-of-K selection
- Preference-oriented evaluation protocols tailored to assess reactive appropriateness
| Component | Backbone | Input | Output |
|---|---|---|---|
| ReactMotion | T5-base | Text + Audio + Emotion | Motion token sequence |
| JudgeNetwork | T5 text enc + Mimi audio enc | Multi-modal conditions + motion | Ranking score |
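As a sketch of the best-of-K pipeline the JudgeNetwork enables: generate K candidate motion-token sequences, score each one under the multi-modal conditions, and keep the top-scoring candidate. The helper below is illustrative only, not repo code:

```python
# Best-of-K selection sketch (hypothetical inputs): given K candidate
# motion-token sequences and their judge scores, keep the highest-scoring one.
def best_of_k(candidates, scores):
    """Return the candidate with the highest judge score."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return candidates[best_idx]

# Toy usage: three made-up candidate token sequences with judge scores.
candidates = [[12, 7, 301], [5, 5, 99], [410, 2, 88]]
scores = [0.41, 0.87, 0.63]
print(best_of_k(candidates, scores))  # → [5, 5, 99]
```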
Conditioning modes — flexibly combine modalities:
| Mode | Description |
|---|---|
| t | Text (transcription) only |
| a | Audio only |
| t+e | Text + Emotion |
| a+e | Audio + Emotion |
| t+a | Text + Audio |
| t+a+e | Text + Audio + Emotion (full) |
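To illustrate how such modes could map onto a single T5 input string, here is a minimal sketch. The tag names and separator are invented for the example; the repo's actual format lives in `reactmotion/dataset/prompt_builder.py`:

```python
# Illustrative multi-modal prompt assembly for a T5-style generator.
# The field tags ("transcript:", "audio:", "emotion:") and the " | "
# separator are assumptions, not the repo's real prompt format.
def build_prompt(cond_mode, text=None, emotion=None, audio_placeholder="<audio>"):
    modes = cond_mode.split("+")
    parts = []
    if "t" in modes:
        parts.append(f"transcript: {text}")
    if "a" in modes:
        parts.append(f"audio: {audio_placeholder}")
    if "e" in modes:
        parts.append(f"emotion: {emotion}")
    return " | ".join(parts)

print(build_prompt("t+e", text="Nice to meet you!", emotion="excited"))
# → transcript: Nice to meet you! | emotion: excited
```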
```bash
conda create -n reactmotion python=3.11 -y
conda activate reactmotion
pip install -e .
```

Or install dependencies directly:

```bash
pip install -r requirements.txt
```

We use wandb to log and visualize the training process:

```bash
wandb login
```

Download the pretrained Motion VQ-VAE and evaluation models:

```bash
bash prepare/download_vqvae.sh
bash prepare/download_evaluators.sh
```

Once downloaded, your external/ directory should look like:

```
external/
├── pretrained_vqvae/
│   └── t2m.pth                        # Motion VQ-VAE checkpoint
└── t2m/
    ├── Comp_v6_KLD005/                # T2M evaluation model
    ├── text_mot_match/                # Text-motion matching model
    └── VQVAEV3_CB1024_CMT_H1024_NRES3/
        └── meta/
            ├── mean.npy               # Per-dim mean for normalization
            └── std.npy                # Per-dim std for normalization
```
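A small optional check (not part of the repo) that the downloaded assets are in place; the paths match the tree above:

```python
# Optional sanity check: list which required pretrained files are missing
# under a given root directory before starting training or inference.
from pathlib import Path

REQUIRED = (
    "external/pretrained_vqvae/t2m.pth",
    "external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy",
    "external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy",
)

def missing_assets(root="."):
    """Return the required files that are absent under `root`."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    missing = missing_assets()
    print("All pretrained assets found." if not missing else f"Missing: {missing}")
```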
| Resource | Description | Link |
|---|---|---|
| ReactMotionNet | speaker audio codes, raw audio, train.csv, val.csv, and test.csv | Google Drive |
Dataset structure:
```
dataset/
├── HumanML3D/
├── audio_code/
└── audio_raw/
```

Place train.csv, val.csv, and test.csv under the data/ directory:

```
reactmotion/
├── reactmotion/
│   ├── data/
│   │   ├── train.csv
│   │   ├── val.csv
│   │   └── test.csv
│   ├── dataset/
│   ├── eval/
│   ├── models/
│   └── ...
```
We use the HumanML3D 3D human motion-language dataset. Please follow the HumanML3D instructions to download and prepare the dataset, then place it under the dataset/ directory:
```
dataset/HumanML3D/
├── new_joint_vecs/    # Joint feature vectors (263-dim)
├── texts/             # Motion captions
├── Mean.npy           # Per-dim mean for normalization
├── Std.npy            # Per-dim std for normalization
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
```
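The Mean.npy / Std.npy files hold per-dimension statistics for z-normalizing the 263-dim joint features. A minimal sketch of the standard normalize/denormalize round trip, using toy 2-dim arrays in place of the real files:

```python
# Per-dimension z-normalization as used for HumanML3D-style features.
# Toy arrays stand in for the real Mean.npy / Std.npy files.
import numpy as np

def normalize(motion, mean, std):
    """Z-normalize features per dimension: (x - mean) / std."""
    return (motion - mean) / std

def denormalize(motion, mean, std):
    """Invert the normalization: x * std + mean."""
    return motion * std + mean

mean, std = np.array([1.0, 2.0]), np.array([0.5, 4.0])
x = np.array([[2.0, 10.0]])
z = normalize(x, mean, std)
print(z)                           # → [[2. 2.]]
print(denormalize(z, mean, std))   # → [[ 2. 10.]]
```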
Pre-encode the HumanML3D motions with the Motion VQ-VAE and place the codes as .npy files:
```
dataset/HumanML3D/VQVAE/
├── 000000.npy
├── 000001.npy
├── M000000.npy
└── ...
```
Each .npy file contains a 1D integer array of VQ codebook indices (codebook size = 512).
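A toy round trip showing the expected file format; the file name and code values below are made up for illustration:

```python
# Fake one VQ code file: a 1D integer array of codebook indices in [0, 512),
# saved as <motion_id>.npy, then loaded back and validated.
import numpy as np, tempfile, os

codes = np.random.default_rng(0).integers(0, 512, size=196)  # fake code sequence
path = os.path.join(tempfile.mkdtemp(), "000000.npy")
np.save(path, codes)

loaded = np.load(path)
assert loaded.ndim == 1 and loaded.dtype.kind == "i"
assert loaded.min() >= 0 and loaded.max() < 512  # within codebook size
print(len(loaded))  # → 196
```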
We provide both pre-extracted Mimi code indices and raw wav files:
| Resource | Description | Link |
|---|---|---|
| Audio Codes | Mimi encoder code indices (.npz) | Google Drive |
| Audio Raw | Raw speaker wav files (.wav) | Google Drive |
Download and place them under your dataset directory:
```
{DATASET_DIR}/
├── audio_code/        # Mimi code indices (for audio_mode=code)
│   ├── 001193_1_reaction_fearful_4.npz
│   └── ...
└── audio_wav/         # Raw speaker wav (for audio_mode=wav)
    ├── 001193_1_reaction_fearful_4.wav
    └── ...
```
The audio codes are pre-extracted with Mimi (from the Moshi project). If you want to re-encode from raw wav, the Mimi weights are downloaded automatically from Hugging Face on first use.
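For reference, a hedged sketch of inspecting one of the .npz code files. The array key ("codes") and the (n_codebooks, n_frames) shape are assumptions for illustration; check the actual files for the real layout:

```python
# Fake an audio-code file shaped like {DATASET_DIR}/audio_code/<stem>.npz,
# then load it back. Key name "codes" and the shape are assumptions.
import numpy as np, tempfile, os

fake_codes = np.random.default_rng(0).integers(0, 2048, size=(8, 120))  # (n_codebooks, n_frames)
path = os.path.join(tempfile.mkdtemp(), "001193_1_reaction_fearful_4.npz")
np.savez(path, codes=fake_codes)

with np.load(path) as data:
    codes = data["codes"]
print(codes.shape)  # → (8, 120)
```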
Prepare train.csv, val.csv, test.csv with the following columns:
| Column | Description |
|---|---|
| group_id | Unique group identifier |
| item_id | Unique item identifier |
| tier_label | Sample quality tier: gold / silver / neg |
| speaker_transcript | Speaker transcription text |
| speaker_emotion | Speaker emotion label |
| listener_motion_caption | Text description of the listener motion |
| motion_id | Motion file ID (6-digit zero-padded, e.g. 000267) |
| speaker_audio_wav | Audio file stem (maps to audio code/wav files) |
| group_w (optional) | Per-group weight for weighted training |
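A minimal example of writing and re-reading one row in this layout; every value below is an invented placeholder that simply matches the column spec above:

```python
# Write one ReactMotionNet-style CSV row with the documented columns,
# then read it back to confirm the layout round-trips.
import csv, io

COLUMNS = ["group_id", "item_id", "tier_label", "speaker_transcript",
           "speaker_emotion", "listener_motion_caption", "motion_id",
           "speaker_audio_wav", "group_w"]

row = {
    "group_id": "g0001",
    "item_id": "g0001_a",
    "tier_label": "gold",                    # gold / silver / neg
    "speaker_transcript": "It is really nice to meet you!",
    "speaker_emotion": "excited",
    "listener_motion_caption": "nods and smiles warmly",
    "motion_id": "000267",                   # 6-digit zero-padded
    "speaker_audio_wav": "001193_1_reaction_fearful_4",
    "group_w": "1.0",                        # optional per-group weight
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)

parsed = next(csv.DictReader(io.StringIO(buf.getvalue())))
print(parsed["tier_label"], parsed["motion_id"])  # → gold 000267
```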
Our pretrained models are available on Hugging Face:
| Model | Backbone | Description | Link |
|---|---|---|---|
| ReactMotion 1.0 | T5-base | Generator (Text + Audio + Emotion → Motion) | awakening-ai/ReactMotion1.0 |
| ReactMotion-Judge | T5 text enc + Mimi audio enc | Multi-modal judge network for best-of-K selection | awakening-ai/ReactMotion-Judge |
Download via CLI:

```bash
# Install huggingface_hub if needed
pip install huggingface_hub

# Download the generator
huggingface-cli download awakening-ai/ReactMotion1.0 --local-dir models/ReactMotion1.0

# Download the judge network
huggingface-cli download awakening-ai/ReactMotion-Judge --local-dir models/ReactMotion-Judge
```

Or in Python:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="awakening-ai/ReactMotion1.0", local_dir="models/ReactMotion1.0")
snapshot_download(repo_id="awakening-ai/ReactMotion-Judge", local_dir="models/ReactMotion-Judge")
```

Download the pretrained model from Hugging Face and make sure you have run the prepare scripts to download the VQ-VAE checkpoint and normalization files. Then run:
```bash
python demo_inference.py \
    --gen_ckpt models/ReactMotion1.0 \
    --vqvae_ckpt external/pretrained_vqvae/t2m.pth \
    --mean_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy \
    --std_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy \
    --text "It is really nice to meet you!" \
    --emotion "excited" \
    --cond_mode t+e \
    --num_gen 3 \
    --out_path output/demo_text_meet.mp4
```

The generated videos are saved in output/; an example is shown below:
demo_text_meet_gen2.mp4
More demo examples
Audio + Text input:
```bash
python demo_inference.py \
    --gen_ckpt models/ReactMotion1.0 \
    --vqvae_ckpt external/pretrained_vqvae/t2m.pth \
    --mean_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy \
    --std_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy \
    --text "Let's celebrate!" \
    --audio samples/speaker.wav \
    --cond_mode t+a+e --emotion "happy" \
    --out_path output/demo_fused.mp4
```

Audio-only input:
```bash
python demo_inference.py \
    --gen_ckpt models/ReactMotion1.0 \
    --vqvae_ckpt external/pretrained_vqvae/t2m.pth \
    --mean_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy \
    --std_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy \
    --audio samples/speaker.wav \
    --cond_mode a+e --emotion "happy" \
    --out_path output/demo_audio.mp4
```

Batch demo with the shell script:

```bash
bash scripts/demo.sh            # 3 text-only examples
bash scripts/demo.sh --audio    # text + audio example
```

Launch an interactive web demo:
```bash
python demo_gradio.py \
    --gen_ckpt models/ReactMotion1.0 \
    --vqvae_ckpt external/pretrained_vqvae/t2m.pth \
    --mean_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy \
    --std_path external/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy \
    --port 7860 --share
```

Or via the shell script:

```bash
bash scripts/demo.sh --gradio
```

Features:
- Text input, audio upload, emotion selection
- Adjustable generation parameters (temperature, top-p, top-k)
- Side-by-side comparison of multiple generated motions
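To illustrate what the temperature, top-k, and top-p knobs control, here is a standalone NumPy sampler over a toy logit vector. This is a generic sketch of the standard filtering scheme, not the repo's implementation:

```python
# Generic temperature / top-k / top-p (nucleus) token sampling over logits.
import numpy as np

def sample_filtered(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())          # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # indices by descending prob
    keep = np.ones(len(probs), dtype=bool)
    if top_k > 0:                                  # keep only the k most likely
        keep[order[top_k:]] = False
    if top_p < 1.0:                                # smallest set with cum. prob >= top_p
        csum = np.cumsum(probs[order])
        cutoff = np.searchsorted(csum, top_p) + 1
        keep[order[cutoff:]] = False
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                           # renormalize survivors
    return rng.choice(len(probs), p=probs)

token = sample_filtered([2.0, 1.0, 0.1, -1.0], temperature=0.8, top_k=2)
assert token in (0, 1)  # only the two highest-logit tokens survive top_k=2
```

Lower temperature sharpens the distribution, while top-k and top-p both truncate its tail; the demo's parameters map onto the same ideas.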
```bash
bash scripts/train_reactmotion.sh [cond_mode] [loss_type]

# Example:
bash scripts/train_reactmotion.sh t+a+e multi_ce_rank
```

Or run directly:

```bash
python -m reactmotion.train.train_reactmotion \
    --model_name google-t5/t5-base \
    --dataset_dir /path/to/dataset \
    --pairs_csv /path/to/data \
    --output_dir /path/to/output \
    --cond_mode t+a+e \
    --audio_mode code \
    --audio_code_dir /path/to/audio_code \
    --batch_size 8 \
    --learning_rate 5e-5 \
    --max_steps 100000
```

Key training hyperparameters:
| Parameter | Default |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 2 |
| Learning rate | 5e-5 |
| Max steps | 100,000 |
| K gold samples | 2 |
| Modality dropout | 0.30 |
| Rank loss margin | 0.5 |
| Loss weights (w_rank / w_gn) | 0.25 / 0.25 |
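The modality-dropout row can be sketched as follows: during training, each available conditioning modality is independently dropped with probability p (default 0.30), keeping at least one. This is an illustrative guess at the mechanism; the repo's exact behavior may differ:

```python
# Illustrative modality dropout: drop each of t/a/e independently with
# probability p, but never drop all of them.
import random

def dropout_modalities(modalities, p=0.30, rng=None):
    rng = rng if rng is not None else random.Random(0)
    kept = [m for m in modalities if rng.random() >= p]
    return kept if kept else [rng.choice(modalities)]  # keep at least one

print(dropout_modalities(["t", "a", "e"], p=0.30))
```

Training on randomly reduced condition sets is what lets a single checkpoint serve all the t / a / e combinations at inference time.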
```bash
bash scripts/train_judge.sh
```

Or run directly:

```bash
python -m reactmotion.train.train_judge \
    --dataset_dir /path/to/dataset \
    --pairs_csv /path/to/data \
    --audio_code_dir /path/to/audio_code \
    --save_dir /path/to/output \
    --batch_size 16 \
    --epochs 50
```

Key scorer hyperparameters:
| Parameter | Default |
|---|---|
| Batch size | 16 |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Epochs | 50 |
| Force single-modality ratio | 0.10 |
| Ordering margin (gold-silver / silver-neg) | 0.20 / 0.20 |
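The ordering margins can be read as a pair of hinge constraints: judge scores should satisfy gold > silver > neg with a 0.20 gap at each boundary. A plain-Python sketch of such a two-margin objective (the repo's contrastive details may differ):

```python
# Two-margin ordering objective: penalize whenever the gold-silver or
# silver-neg score gap falls below the margin (default 0.20).
def ordering_loss(s_gold, s_silver, s_neg, margin=0.20):
    hinge = lambda hi, lo: max(0.0, margin - (hi - lo))
    return hinge(s_gold, s_silver) + hinge(s_silver, s_neg)

print(ordering_loss(0.9, 0.6, 0.1))    # → 0.0 (both gaps exceed the margin)
print(ordering_loss(0.6, 0.55, 0.5))   # ≈ 0.30 (both gaps are only 0.05)
```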
```bash
bash scripts/eval_reactmotion.sh [cond_mode]

# Example:
bash scripts/eval_reactmotion.sh t+a+e
```

Or run directly:

```bash
python -m reactmotion.eval.eval_reactmotion \
    --gen_ckpt /path/to/checkpoint \
    --pairs_csv /path/to/test.csv \
    --dataset_dir /path/to/dataset \
    --cond_mode t+a+e \
    --num_gen 3 \
    --out_dir /path/to/output
```

Generate and rank candidates with the judge network:

```bash
python -m reactmotion.eval.eval_reactmotion_with_judge \
    --gen_ckpt /path/to/generator/checkpoint \
    --judge_ckpt /path/to/judge/best.pt \
    --pairs_csv /path/to/test.csv \
    --dataset_dir /path/to/dataset \
    --cond_mode t+a+e \
    --num_gen 3 \
    --out_dir /path/to/output
```

Run the full evaluation pipeline via the shell script:

```bash
bash scripts/evaluate.sh [cond_mode] [pipeline]

# pipeline: all | winrate | fid
# Example:
bash scripts/evaluate.sh t+a+e all
```

Or compute FID and diversity metrics directly:

```bash
python -m reactmotion.eval.eval_fid_diversity \
    --gen_root /path/to/generated \
    --dataset_dir /path/to/dataset \
    --pairs_csv /path/to/test.csv
```

Project structure:

```
reactmotion/
├── reactmotion/                            # Main package
│   ├── models/
│   │   ├── vqvae.py                        # Motion VQ-VAE (HumanVQVAE)
│   │   ├── encdec.py                       # 1D Conv Encoder/Decoder
│   │   ├── quantize_cnn.py                 # VQ codebook (EMA reset)
│   │   ├── resnet.py                       # Resnet1D blocks
│   │   └── judge_network.py                # JudgeNetwork scorer
│   ├── dataset/
│   │   ├── reactmotionnet_dataset.py       # ReactMotionNet dataset
│   │   ├── humanml3d_dataset.py            # HumanML3D T2M dataset
│   │   ├── joint_dataset.py                # Joint training dataset
│   │   ├── collator.py                     # Ranking-aware collator
│   │   ├── joint_collator.py               # Mixed-task collator
│   │   ├── prompt_builder.py               # Multi-modal prompt construction
│   │   ├── mimi_encoder.py                 # Mimi streaming audio encoder
│   │   └── audio_aug.py                    # Audio augmentation
│   ├── train/
│   │   ├── train_reactmotion.py            # Train generator (seq2seq + ranking)
│   │   ├── train_judge.py                  # Train scorer (contrastive)
│   │   └── trainer_reactmotion.py          # Custom Seq2SeqTrainer
│   ├── eval/
│   │   ├── eval_reactmotion.py             # Generate motions
│   │   ├── eval_reactmotion_with_judge.py  # Generate + rank
│   │   ├── eval_judge.py                   # Evaluate scorer ranking
│   │   └── eval_fid_diversity.py           # FID & diversity metrics
│   ├── visualization/
│   │   └── plot_3d_global.py               # 3D skeleton visualization
│   ├── utils/                              # Rotation, skeleton, losses
│   └── baselines/                          # Baseline methods
├── demo_inference.py                       # CLI inference demo
├── demo_gradio.py                          # Gradio web UI demo
├── scripts/                                # Training & evaluation scripts
│   ├── train_reactmotion.sh
│   ├── train_judge.sh
│   ├── eval_reactmotion.sh
│   ├── evaluate.sh
│   └── demo.sh
├── requirements.txt
└── setup.py
```
This project builds upon the following amazing open-source projects: HumanML3D, T2M-GPT, Moshi/Mimi, Hugging Face Transformers, wandb.
```bibtex
@article{luo2026reactmotion,
  title={ReactMotion: Generating Reactive Listener Motions from Speaker Utterance},
  author={Luo, Cheng and Wu, Bizhu and Li, Bing and Ren, Jianfeng and Bai, Ruibin and Qu, Rong and Shen, Linlin and Ghanem, Bernard},
  journal={arXiv preprint arXiv:2603.15083},
  year={2026}
}
```

This project is released under the MIT License.
