# RAM-Avatar: Real-Time Photo-Realistic Avatar from Monocular Videos with Full-Body Control (CVPR 2024)
Xiang Deng1, Zerong Zheng2, Yuxiang Zhang1, Jingxiang Sun1, Chao Xu2, Xiaodong Yang3, Lizhen Wang1, Yebin Liu1

1Tsinghua University 2NNKosmos Technology 3Li Auto
📄 Paper | 🎥 Video Demo
We present RAM-Avatar, a novel approach for learning real-time, photo-realistic full-body avatars from monocular videos. Our method enables full-body control with high-fidelity rendering of facial expressions, hand gestures, and body textures — all while maintaining real-time performance.
This paper advances the practicality of human avatar learning by introducing RAM-Avatar, a framework that learns real-time, photo-realistic avatars from monocular videos with full-body controllability. To model fine-grained variations in facial expressions and hand gestures, we employ dedicated statistical templates. For the body, a sparsely computed dual attention mechanism enhances texture fidelity on the torso and limbs. On top of this, a lightweight yet powerful StyleUNet architecture, coupled with a temporal-aware discriminator, enables efficient and realistic rendering at real-time speeds. To ensure robust animation under out-of-distribution poses, we propose a Motion Distribution Alignment (MDA) module that reduces the domain shift between training and inference. Extensive experiments validate the superiority of our method in both qualitative and quantitative evaluations, and we further demonstrate its practical potential with a real-time live avatar system. Code and models will be released for research purposes.
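The dual attention mechanism is only summarized above. As a rough illustration of the general idea, here is a minimal PyTorch sketch of one common dual-attention design (channel attention followed by spatial attention, in the spirit of CBAM). It is not the exact module used in RAM-Avatar; the class name, reduction ratio, and kernel size are illustrative choices of ours.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Generic dual-attention sketch (channel + spatial attention).

    NOTE: illustrative only -- not the module from the RAM-Avatar paper.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight feature channels.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: compress channels, re-weight spatial locations.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)  # (B, C, H, W), channel re-weighting
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values],
            dim=1,
        )  # (B, 2, H, W): avg- and max-pooled channel descriptors
        return x * self.spatial_gate(pooled)

feat = torch.randn(1, 64, 128, 128)
out = DualAttention(64)(feat)  # same shape as the input
```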
- Python 3.9.17
- PyTorch 2.0.0+cu118
- TorchVision 0.15.1+cu118
- setuptools 68.0.0
- scikit-image 0.22.0
- numpy 1.25.2

💡 We recommend using a Conda environment:
```bash
conda create -n ramavatar python=3.9
conda activate ramavatar
pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install scikit-image numpy setuptools
```
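Optionally, you can sanity-check the installation from a Python shell (a quick check of ours, not part of the official instructions):

```python
import torch
import torchvision

print(torch.__version__, torchvision.__version__)  # expect 2.0.0+cu118 and 0.15.1+cu118
print(torch.cuda.is_available())                   # should print True for GPU training
```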
To train RAM-Avatar, prepare your dataset as follows:
- Estimate SMPL-X parameters using ProxyCapV2.
- Fit FaceVerse parameters for facial dynamics.
- Render SMPL and facial maps using PyTorch3D.
- Organize the data directory structure:
```
dataset/train/
├── keypoints_mmpose_hand/
│   ├── 00000001.json
│   ├── 00000002.json
│   └── ...
├── smpl_map/
│   ├── 00000001.png
│   ├── 00000002.png
│   └── ...
├── smpl_map_001/
│   ├── 00000001.png
│   ├── 00000002.png
│   └── ...
├── track2/
│   ├── 00000001.png
│   ├── 00000002.png
│   └── ...
├── 00000001.png   # Original frames
├── 00000002.png
└── ...
```
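Before launching training, it can help to verify that every original frame has a matching file in each sub-directory. The snippet below is a small sketch of ours based only on the layout shown above (the folder names come from the tree; the script is not part of the released code):

```python
from pathlib import Path

root = Path("dataset/train")
# Frame ids come from the original frames at the dataset root, e.g. "00000001".
frames = sorted(p.stem for p in root.glob("*.png"))

expected = {
    "keypoints_mmpose_hand": ".json",
    "smpl_map": ".png",
    "smpl_map_001": ".png",
    "track2": ".png",
}

for subdir, ext in expected.items():
    present = {p.stem for p in (root / subdir).glob(f"*{ext}")}
    missing = [f for f in frames if f not in present]
    if missing:
        print(f"{subdir}: missing {len(missing)} files, e.g. {missing[:3]}")
    else:
        print(f"{subdir}: OK ({len(present)} files)")
```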
You can download our pre-trained models and sample datasets here:
- 🔗 Pretrained Checkpoints (Password: i8gt)
- 🔗 Sample Dataset (Password: e4wp)
⚠️ Note: These links are hosted on Baidu Netdisk. International users may need a download accelerator.
Single GPU:

```bash
CUDA_VISIBLE_DEVICES=0 python main_train.py --from_json configs/train.json --name train --nump 0
```

Multi-GPU (4 GPUs):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python main_train.py --from_json configs/train.json --name train --nump 4
```

Testing:

```bash
CUDA_VISIBLE_DEVICES=0 python main_test.py --from_json configs/test.json --name train --nump 0
```

This work is built upon excellent open-source projects, and we thank the authors for their contributions.
If you find our work useful in your research, please cite:
```bibtex
@inproceedings{deng2024ram,
  title     = {RAM-Avatar: Real-Time Photo-Realistic Avatar from Monocular Videos with Full-Body Control},
  author    = {Deng, Xiang and Zheng, Zerong and Zhang, Yuxiang and Sun, Jingxiang and Xu, Chao and Yang, Xiaodong and Wang, Lizhen and Liu, Yebin},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages     = {1996--2007},
  year      = {2024}
}
```