Fa-Ting Hong1,2, Zhan Xu2, Haiyang Liu2, Qinjie Lin3, Luchuan Song2, Zhixin Shu2, Yang Zhou2, Duygu Ceylan2, Dan Xu1
1HKUST, 2Adobe Research, 3Northwestern University
FVHuman is able to generate free-viewpoint human videos from multiple images.
For video demonstrations, qualitative comparisons, and detailed experimental analysis, please check out our project page.
Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses. Leveraging the generative capacity of diffusion models, existing approaches can generate high-fidelity poses, but they struggle with significant viewpoint changes, especially in zoom-in/zoom-out scenarios where the camera-character distance varies. This limits applications such as cinematic shot type planning or camera control.
We propose a pose-correlated reference selection diffusion network that supports substantial viewpoint variations in human animation. Our key idea is to enable the network to utilize multiple reference images as input, since significant viewpoint changes often lead to missing appearance details on the human body. To mitigate the resulting computational cost, we first introduce a novel pose correlation module that computes similarities between non-aligned target and source poses, and then propose an adaptive reference selection strategy that utilizes the attention map to identify key regions for animation generation.
- Free-viewpoint Human Animation: Generate human videos with substantial viewpoint changes
- Pose-correlated Reference Selection: Intelligent selection of relevant reference regions
- Multi-reference Input: Utilizes multiple reference images for comprehensive appearance modeling
- Adaptive Selection Strategy: Attention-based identification of key regions for animation
- Large Viewpoint Variations: Supports zoom-in/zoom-out scenarios and camera control
```bash
# Clone the repository
git clone https://github.com/harlanhong/FVHuman.git
cd FVHuman

# Create conda environment
conda create -n fvhuman python=3.8
conda activate fvhuman

# Install dependencies
pip install -r requirements.txt
```
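After installation, a quick sanity check can confirm that PyTorch sees your GPU. This is an optional sketch that assumes PyTorch is installed via `requirements.txt` (the README does not list its contents):

```python
# Optional sanity check after installation.
# Assumes PyTorch is pulled in by requirements.txt (not shown above).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```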
Training consists of two stages. You can specify the GPU device using `CUDA_VISIBLE_DEVICES`:

Stage 1:

```bash
CUDA_VISIBLE_DEVICES=1 accelerate launch train_s1.py --config config_s1.yaml --exp_name stage1
```

Stage 2:

```bash
CUDA_VISIBLE_DEVICES=1 accelerate launch train_s2.py --config config_s2.yaml --exp_name stage2
```

Run inference with trained models:
```bash
CUDA_VISIBLE_DEVICES=5 python inference_video_full.py \
    --config configs/inference/inference_ted.yaml \
    --checkpoint_path state2_ted_full/net-40000.pth \
    --save_name user_test/rst.mp4
```

Make sure you have the proper configuration files:

- `config_s1.yaml` - Stage 1 training configuration
- `config_s2.yaml` - Stage 2 training configuration
- `configs/inference/inference_ted.yaml` - Inference configuration
Download the pre-trained model checkpoints from: Checkpoint Download Link
The trained model checkpoint should be placed at:

- `state2_ted_full/net-40000.pth` - Stage 2 trained model
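Before running inference, you may want to verify the checkpoint is in place. The sketch below only checks the path and loads the file with standard PyTorch; the internal key layout of the checkpoint is repo-specific and not documented here:

```python
# Minimal sketch: verify the Stage 2 checkpoint exists and can be loaded.
# The internal structure of the .pth file is repo-specific; we only inspect it.
import os
import torch

ckpt_path = "state2_ted_full/net-40000.pth"
assert os.path.isfile(ckpt_path), f"Checkpoint not found at {ckpt_path}"

state = torch.load(ckpt_path, map_location="cpu")
if isinstance(state, dict):
    print(f"Loaded checkpoint with {len(state)} top-level entries")
else:
    print(f"Loaded checkpoint of type {type(state).__name__}")
```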
We introduce the Multi-Shot TED (MSTed) dataset, designed to capture significant variations in viewpoints and camera distances:
- 1,084 unique identities
- 15,260 video clips
- ~30 hours of total content
- Diverse viewpoints and camera distances
- Professional quality TED talk videos
Dataset Download: Link
```
data/
├── msted/
│   ├── videos/
│   │   ├── identity_001/
│   │   │   ├── clip_001.mp4
│   │   │   └── ...
│   │   └── ...
│   ├── poses/
│   │   ├── identity_001/
│   │   │   ├── clip_001_poses.json
│   │   │   └── ...
│   │   └── ...
│   └── metadata.json
```
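As a reading aid, here is a minimal sketch of how this layout could be traversed to pair each clip with its pose file. The schema inside `*_poses.json` and `metadata.json` is not documented here, so the sketch only loads and reports them:

```python
# Sketch: walk the MSTed layout shown above and pair clips with pose files.
# The contents of the JSON files are repo-specific and only inspected here.
import json
from pathlib import Path

root = Path("data/msted")

metadata = json.loads((root / "metadata.json").read_text())
print("metadata entries:", len(metadata))

for video_path in sorted(root.glob("videos/*/*.mp4")):
    identity = video_path.parent.name  # e.g. identity_001
    clip_id = video_path.stem          # e.g. clip_001
    pose_path = root / "poses" / identity / f"{clip_id}_poses.json"
    if not pose_path.is_file():
        print(f"[warn] missing poses for {identity}/{clip_id}")
        continue
    poses = json.loads(pose_path.read_text())
    print(f"{identity}/{clip_id}: {len(poses)} pose entries")
```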
Illustration of our framework. We feed a reference set into a reference UNet to extract reference features. To filter out redundant information in the reference feature set, we propose a pose correlation guider that produces a correlation map indicating the spatially informative regions of each reference. We then adopt a reference selection strategy that picks the informative tokens from the reference feature set according to the correlation map and passes them on to the following modules. A minimal sketch of this selection step is given below the component list.
Our framework consists of:
- Reference UNet: Extracts reference features from multiple input images
- Pose Correlation Module: Computes similarities between target and source poses
- Adaptive Reference Selection: Selects informative tokens based on correlation maps
- Animation Generation: Synthesizes final human animation
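To make the selection idea above concrete, here is a minimal, illustrative PyTorch sketch: a correlation map between target-pose and reference-pose features scores the reference tokens, and only the top-scoring tokens are kept for the subsequent attention layers. The tensor shapes, scoring rule, and keep ratio are assumptions for illustration, not the released implementation:

```python
# Illustrative sketch (not the released code): score reference tokens by
# target/reference pose correlation, then keep only the most informative ones.
import torch
import torch.nn.functional as F


def select_reference_tokens(ref_feats, ref_pose_feats, tgt_pose_feats, keep_ratio=0.25):
    """
    ref_feats:      (B, R, N, C) appearance tokens from the reference UNet
    ref_pose_feats: (B, R, N, C) pose features aligned with the reference tokens
    tgt_pose_feats: (B, M, C)    pose features of the target frame
    Returns kept tokens of shape (B, K, C), with K = keep_ratio * R * N.
    """
    B, R, N, C = ref_feats.shape
    ref_feats = ref_feats.reshape(B, R * N, C)
    ref_pose_feats = ref_pose_feats.reshape(B, R * N, C)

    # Correlation map between target pose and (non-aligned) reference poses.
    corr = torch.einsum("bmc,bnc->bmn", tgt_pose_feats, ref_pose_feats) / C ** 0.5
    corr = F.softmax(corr, dim=-1)                  # (B, M, R*N)

    # Score each reference token by the strongest attention it receives.
    token_scores = corr.max(dim=1).values           # (B, R*N)

    k = max(1, int(keep_ratio * R * N))
    top_idx = token_scores.topk(k, dim=-1).indices  # (B, K)
    kept = torch.gather(ref_feats, 1, top_idx.unsqueeze(-1).expand(-1, -1, C))
    return kept  # fed to the following attention modules


# Toy usage with random tensors:
B, R, N, M, C = 1, 4, 64, 64, 32
kept = select_reference_tokens(
    torch.randn(B, R, N, C), torch.randn(B, R, N, C), torch.randn(B, M, C)
)
print(kept.shape)  # torch.Size([1, 64, 32]) with keep_ratio=0.25
```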
Our method achieves superior performance compared to SOTA methods under large viewpoint changes:
- Qualitative Results: High-fidelity human animation with diverse viewpoints
- Quantitative Evaluation: Improved metrics on viewpoint variation scenarios
- User Studies: Preferred by users for realistic viewpoint transitions
For detailed experimental results and visual comparisons, please refer to our project page and the full paper.
If you find this work useful for your research, please cite:
```bibtex
@article{hong2024fvhuman,
  author  = {Hong, Fa-Ting and Xu, Zhan and Liu, Haiyang and Lin, Qinjie and Song, Luchuan and Shu, Zhixin and Zhou, Yang and Ceylan, Duygu and Xu, Dan},
  title   = {Free-viewpoint Human Animation with Pose-correlated Reference Selection},
  journal = {CVPR},
  year    = {2025},
}
```

- Project Page: https://harlanhong.github.io/publications/fvhuman/index.html
- Paper: arXiv
- CVPR 2025: Conference Page
This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
We thank the creators of TED talks for providing diverse and high-quality video content that made the MSTed dataset possible. We also acknowledge the support from HKUST and Adobe Research.
For questions and collaborations, please contact:
- Fa-Ting Hong: [email protected]

