High-quality listener reaction generation with ReactDiff
- 🎉 ReactDiff v1.0 Released! A diffusion-based model for generating realistic listener facial reactions from speaker audio and visual cues! (Dec/2024)
- 🚀 Multi-GPU Support - Now supports distributed training and evaluation across multiple GPUs
- 🎬 30-Second Video Generation - Generate full-length realistic listener reaction videos
- 🔧 Enhanced Configuration - Separate configs for training and evaluation with detailed documentation
- 🛠️ Installation
- 👨‍🏫 Getting Started
- ⚙️ Configuration
- 🚀 Training
- 📊 Evaluation
- 🎯 Key Features
- 📁 Project Structure
- 🐛 Troubleshooting
- 🖊️ Citation
- 🤝 Acknowledgements
- 🐍 Python 3.8+ - 3.9 recommended; the PyTorch3D wheel below targets py39
- 🔥 PyTorch 1.9+ - 2.0.1 recommended; the PyTorch3D wheel below targets pyt201
- ⚡ CUDA 11.8+ - GPU acceleration support
- 💾 16GB+ RAM - Recommended for training
- 🎮 NVIDIA GPU - Required for CUDA acceleration
```bash
# Create a new conda environment with Python 3.9
conda create -n reactdiff python=3.9

# Activate the environment
conda activate reactdiff

# Install PyTorch 2.0.1 with CUDA 11.8 support
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 \
    --index-url https://download.pytorch.org/whl/cu118

# Install PyTorch3D for 3D face model operations
pip install --no-index --no-cache-dir pytorch3d \
    -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py39_cu118_pyt201/download.html

# Install all required packages from requirements.txt
pip install -r requirements.txt

# Test that all imports work correctly
python -c "
import torch
import torchvision
import numpy as np
import cv2
import transformers
print('✅ All dependencies installed successfully!')
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'CUDA devices: {torch.cuda.device_count()}')
"
```

Download and Setup Dataset
The REACT 2023/2024 Multimodal Challenge Dataset is compiled from public datasets of dyadic interactions, including NoXI and RECOLA (reflected in the directory layout below).
Apply for data access at:
Data organization under `data/` follows this structure:

`data/partition/modality/site/chat_index/person_index/clip_index/actual_data_files`
Example data structure:
```
data
├── test
├── val
└── train
    ├── Video_files
    │   ├── NoXI
    │   │   ├── 010_2016-03-25_Paris
    │   │   │   ├── Expert_video
    │   │   │   └── Novice_video
    │   │   │       ├── 1
    │   │   │       │   ├── 1.png
    │   │   │       │   ├── ....
    │   │   │       │   └── 751.png
    │   │   │       └── ....
    │   │   └── ....
    │   └── RECOLA
    ├── Audio_files
    │   ├── NoXI
    │   └── RECOLA
    │       ├── group-1
    │       │   ├── P25
    │       │   └── P26
    │       │       ├── 1.wav
    │       │       └── ....
    │       ├── group-2
    │       └── group-3
    ├── Emotion
    │   ├── NoXI
    │   └── RECOLA
    │       ├── group-1
    │       │   ├── P25
    │       │   └── P26
    │       │       ├── 1.csv
    │       │       └── ....
    │       ├── group-2
    │       └── group-3
    └── 3D_FV_files
        ├── NoXI
        └── RECOLA
            ├── group-1
            │   ├── P25
            │   └── P26
            │       ├── 1.npy
            │       └── ....
            ├── group-2
            └── group-3
```
Important details:
- Task: Predict one role's reaction ('Expert' or 'Novice', 'P25' or 'P26') to the other
- 3D_FV_files contain 3DMM coefficients (expression: 52 dim, angle: 3 dim, translation: 3 dim)
- Video specifications:
- Frame rate: 25 fps
- Resolution: 256x256
- Clip length: 751 frames (~30s)
- Audio sampling rate: 44,100 Hz
- CSV file lists for the train/val/test splits are available at `data/train.csv`, `data/val.csv`, and `data/test.csv`
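Given the 58-dimensional coefficient layout described above (52 expression + 3 angle + 3 translation), a clip's 3DMM array can be split into its named parts. This helper is purely illustrative, not part of the codebase, and assumes the expression/angle/translation ordering stated in the details above:

```python
import numpy as np

def split_3dmm(coeffs):
    """Split per-frame 3DMM coefficients into their named components.

    Assumes the layout described in the dataset details:
    52 expression dims, then 3 angle dims, then 3 translation dims.
    """
    coeffs = np.asarray(coeffs)
    assert coeffs.shape[-1] == 58, f"expected 58 dims, got {coeffs.shape[-1]}"
    return {
        "expression": coeffs[..., :52],
        "angle": coeffs[..., 52:55],
        "translation": coeffs[..., 55:58],
    }

# A 30-second clip at 25 fps is 751 frames of 58-dim coefficients.
clip = np.zeros((751, 58), dtype=np.float32)
parts = split_3dmm(clip)
print(parts["expression"].shape)  # → (751, 52)
```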
Download Additional Resources
- Listener Reaction Neighbors
  - Download the appropriate listener reaction neighbors dataset from here
  - Place the downloaded files in the dataset root folder
- Ground Truth 3DMMs
  - Download the ground truth 3DMMs (test set) for speaker-listener evaluation from here
  - Place the downloaded files in the `metric/gt` folder
Required Models and Tools
We use 3DMM coefficients for 3D listener/speaker representation and 3D-to-2D frame rendering.
- 3DMM Model Setup
  - Download the FaceVerse version 2 model (faceverse_simple_v2.npy)
  - Place it in `external/FaceVerse/data/`
  - Get pre-extracted data:
    - 3DMM coefficients (place in `dataset_root/3D_FV_files`)
    - Reference files (mean_face, std_face, reference_full); place in `external/FaceVerse/`
- PIRender Setup
  - We use PIRender for 3D-to-2D rendering
  - Download our retrained checkpoint (cur_model_fold.pth)
  - Place it in `external/PIRender/`
ReactDiff uses separate configuration files for different phases:
- `configs/config_train.json` - Training configuration (shorter sequences, augmentation enabled)
- `configs/config_eval.json` - Evaluation configuration (30-second sequences, PIRender enabled)
- `configs/config.json` - General configuration (legacy support)
- Model: 58D 3DMM input, U-Net architecture with cross-attention
- Dataset: Configurable sequence length (256 for training, 750 for evaluation)
- Training: AdamW optimizer, mixed precision FP16, gradient accumulation
- Evaluation: Window-based processing, PIRender integration, video generation
See Configuration Documentation for detailed parameter descriptions.
Training Options
Single-GPU training:

```bash
cd src/scripts
./run_train.sh single
```

Multi-GPU training:

```bash
cd src/scripts
./run_train.sh multi
```

Custom training run:

```bash
python train.py \
    --config ../../configs/config_train.json \
    --out-path ./results/training \
    --name reactdiff_model \
    --batch-size 100 \
    --window-size 16 \
    --weight-kinematics-loss 0.01 \
    --weight-velocity-loss 1.0
```

Training Features:
- Mixed precision training (FP16)
- Gradient accumulation
- Weights & Biases logging
- EMA (Exponential Moving Average) for model weights
- Configurable loss weights for kinematics and velocity
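The EMA feature above keeps a second, smoothed copy of the model weights that typically samples better than the raw training weights. A minimal numpy sketch of the update rule (an illustration of the idea, not the repo's implementation):

```python
import numpy as np

def ema_update(shadow, params, decay=0.999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return decay * shadow + (1.0 - decay) * params

# Toy illustration: the shadow weights drift toward the live weights.
shadow = np.zeros(4)
params = np.ones(4)
for _ in range(10):
    shadow = ema_update(shadow, params, decay=0.9)
print(shadow[0])  # ≈ 0.651 (= 1 - 0.9**10)
```

With a decay close to 1 (e.g. 0.999), the shadow weights change slowly and average out the noise of individual optimizer steps.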
Generate Results
Single-GPU evaluation:

```bash
cd src/scripts
./run_eval.sh single /path/to/checkpoint.pth
```

Multi-GPU evaluation:

```bash
cd src/scripts
./run_eval.sh multi /path/to/checkpoint.pth
```

Custom sampling run:

```bash
python sample.py \
    --config ../../configs/config_eval.json \
    --checkpoint /path/to/checkpoint.pth \
    --out-path ./results/evaluation \
    --window-size 64 \
    --steps 50 \
    --momentum 0.9
```

Evaluation Features:
- Full 30-second video generation (750 frames)
- PIRender integration for realistic rendering
- Chunked processing for memory efficiency
- Configurable sampling parameters
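The chunked, window-based processing above can be pictured as slicing the 750-frame sequence into fixed windows. The sketch below uses non-overlapping windows for simplicity; the actual pipeline may overlap and blend windows:

```python
def chunk_windows(num_frames, window_size):
    """Split a frame index range into consecutive fixed-size windows."""
    return [
        (start, min(start + window_size, num_frames))
        for start in range(0, num_frames, window_size)
    ]

# A 750-frame clip (30 s at 25 fps) processed with a 64-frame window:
windows = chunk_windows(750, 64)
print(len(windows), windows[-1])  # 12 windows; the last covers frames 704-750
```

Only one window's worth of frames needs to be resident at a time, which is what keeps memory bounded for long sequences.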
Our trained model checkpoint can be found on Google Drive
- 🎨 Karras et al. (2022) Framework - State-of-the-art diffusion model implementation
- 🏗️ U-Net Backbone - Robust architecture with cross-attention layers for multi-modal fusion
- ⏱️ Temporal Windowing - Online inference capability for real-time applications
- 📐 3DMM Parameter Prediction - 58-dimensional facial parameter generation (52 expression + 3 rotation + 3 translation)
- 🎤 Wav2Vec2 Integration - Pre-trained audio feature extraction for robust speech understanding
- 👤 3DMM Coefficients - Comprehensive facial representation using 3D Morphable Model
- 🔗 Cross-Modal Attention - Sophisticated attention mechanisms between audio and visual features
- 🎯 Multi-Scale Processing - Hierarchical feature extraction for different temporal scales
- 🎨 PIRender Integration - High-quality 3D-to-2D rendering for realistic video generation
- 👤 FaceVerse Processing - Advanced 3DMM parameter processing and facial modeling
- 💾 Chunked Processing - Memory-efficient processing for long sequences (30+ seconds)
- 📹 Configurable Output - Flexible video output formats (MP4, AVI) with customizable quality
- ⚙️ Separate Configurations - Optimized settings for training vs. evaluation phases
- ⚡ Mixed Precision Training - FP16 training for 2x speed improvement and memory efficiency
- 🖥️ Multi-GPU Support - Distributed training and evaluation via HuggingFace Accelerate
- 📊 Comprehensive Logging - Weights & Biases integration for experiment tracking
- 🔄 EMA (Exponential Moving Average) - Stable model weights for better convergence
- 🎬 Full-Length Video Generation - Generate complete 30-second reaction videos
- 🔄 Real-Time Processing - Online inference with configurable window sizes
- 🎯 Multiple Reaction Styles - Generate diverse and appropriate listener reactions
- 📈 Scalable Architecture - Support for various sequence lengths and batch sizes
- 🛠️ Extensive Customization - Highly configurable parameters for different use cases
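The cross-modal attention listed above boils down to standard scaled dot-product attention in which queries from one modality attend to keys/values derived from another (e.g. listener motion tokens attending to speaker audio features). A minimal numpy sketch; the dimensions are illustrative, not the model's actual sizes:

```python
import numpy as np

def cross_attention(q, k, v):
    """Scaled dot-product attention across two modalities."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 32))   # e.g. listener motion tokens
kv = rng.standard_normal((50, 32))  # e.g. speaker audio features
out = cross_attention(q, kv, kv)
print(out.shape)  # → (16, 32)
```

Note how the output keeps the query sequence length while mixing in information from all key/value positions, which is what lets the listener stream condition on the full speaker context.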
If this work helps in your research, please cite:
```bibtex
@article{cheng2025reactdiff,
  title={ReactDiff: Fundamental Multiple Appropriate Facial Reaction Diffusion Model},
  author={Luo, Cheng and Song, Siyang and Yan, Siyuan and Yu, Zhen and Ge, Zongyuan},
  journal={arXiv preprint arXiv:2510.04712},
  year={2025}
}
```

Thanks to the open-source contributions of the following projects:
```
ReactDiff/
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── configs/                   # Configuration files
│   ├── config_train.json      # Training configuration
│   ├── config_eval.json       # Evaluation configuration
│   ├── config.json            # General configuration
│   └── README.md              # Configuration documentation
├── src/                       # Source code
│   ├── models/                # Model implementations
│   │   └── k_diffusion/       # Diffusion model components
│   ├── data/                  # Data handling
│   │   └── dataset.py         # Dataset loading and preprocessing
│   ├── utils/                 # Utility functions
│   │   ├── utils.py           # General utilities
│   │   ├── render.py          # 3DMM to video rendering
│   │   └── metric/            # Evaluation metrics
│   ├── external/              # External dependencies
│   │   ├── facebook/          # Wav2Vec2 model
│   │   ├── FaceVerse/         # FaceVerse components
│   │   └── PIRender/          # PIRender components
│   └── scripts/               # Training and evaluation scripts
│       ├── train.py           # Training script
│       ├── sample.py          # Main sampling script
│       ├── run_train.sh       # Training shell script
│       └── run_eval.sh        # Evaluation shell script
├── docs/                      # Documentation
└── results/                   # Output directory (created during training)
```
- Import Errors: Ensure all dependencies are installed and paths are correct
- CUDA OOM: Reduce batch size or window size in configuration
- Short Videos: Check `clip_length` in the evaluation config (should be 750 for 30 s)
- PIRender Errors: Verify PIRender checkpoint and FaceVerse files are in place
- Use appropriate configuration file for your task
- Adjust batch size based on GPU memory
- Use multi-GPU for large-scale training
- Enable mixed precision for faster processing
- Use chunked processing for long sequences
For more detailed troubleshooting, see the Configuration Documentation.
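On the CUDA OOM tip above: gradient accumulation (which the training loop already supports) keeps the effective batch size while shrinking per-step memory, since only one micro-batch of activations is live at a time. Schematically, with illustrative numbers:

```python
import numpy as np

def accumulate_gradients(micro_batch_grads):
    """Average gradients over micro-batches before one optimizer step."""
    return sum(micro_batch_grads) / len(micro_batch_grads)

# Four micro-batches of 25 samples behave like one batch of 100,
# while peak activation memory scales with the micro-batch size only.
micro_grads = [np.full(3, g) for g in (1.0, 2.0, 3.0, 4.0)]
step_grad = accumulate_gradients(micro_grads)
print(step_grad)  # → [2.5 2.5 2.5]
```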