
FVHuman: Free-viewpoint Human Animation with Pose-correlated Reference Selection

Highlight arXiv Project Page

Fa-Ting Hong1,2, Zhan Xu2, Haiyang Liu2, Qinjie Lin3, Luchuan Song2, Zhixin Shu2, Yang Zhou2, Duygu Ceylan2, Dan Xu1

1HKUST, 2Adobe Research, 3Northwestern University

🎬 Demo

FVHuman generates free-viewpoint human videos from multiple reference images.

Video Demo

For more video demonstrations, qualitative comparisons, and detailed experimental analysis, please see our project page.

📖 Abstract

Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses. Leveraging the generative capacity of diffusion models, existing approaches can generate high-fidelity poses, but they struggle with significant viewpoint changes, especially in zoom-in/zoom-out scenarios where the camera-character distance varies. This limits applications such as cinematic shot-type planning and camera control.

We propose a pose-correlated reference selection diffusion network that supports substantial viewpoint variations in human animation. Our key idea is to let the network use multiple reference images as input, since significant viewpoint changes often lead to missing appearance details on the human body. To keep the resulting computational cost manageable, we first introduce a novel pose correlation module that computes similarities between non-aligned target and source poses, and then propose an adaptive reference selection strategy that uses the attention map to identify the key regions for animation generation.

🌟 Key Features

  • Free-viewpoint Human Animation: Generate human videos with substantial viewpoint changes
  • Pose-correlated Reference Selection: Intelligent selection of relevant reference regions
  • Multi-reference Input: Utilizes multiple reference images for comprehensive appearance modeling
  • Adaptive Selection Strategy: Attention-based identification of key regions for animation
  • Large Viewpoint Variations: Supports zoom-in/zoom-out scenarios and camera control

🔧 Installation

# Clone the repository
git clone https://github.com/harlanhong/FVHuman.git
cd FVHuman

# Create conda environment
conda create -n fvhuman python=3.8
conda activate fvhuman

# Install dependencies
pip install -r requirements.txt

🚀 Quick Start

Training

Training consists of two stages. You can specify the GPU device using CUDA_VISIBLE_DEVICES:

Stage 1:

CUDA_VISIBLE_DEVICES=1 accelerate launch train_s1.py --config config_s1.yaml --exp_name stage1

Stage 2:

CUDA_VISIBLE_DEVICES=1 accelerate launch train_s2.py --config config_s2.yaml --exp_name stage2

Testing

Run inference with trained models:

CUDA_VISIBLE_DEVICES=5 python inference_video_full.py \
    --config configs/inference/inference_ted.yaml \
    --checkpoint_path state2_ted_full/net-40000.pth \
    --save_name user_test/rst.mp4

Configuration Files

Make sure you have the proper configuration files:

  • config_s1.yaml - Stage 1 training configuration
  • config_s2.yaml - Stage 2 training configuration
  • configs/inference/inference_ted.yaml - Inference configuration

Model Checkpoints

Download the pre-trained model checkpoints from: Checkpoint Download Link

The trained model checkpoint should be placed at:

  • state2_ted_full/net-40000.pth - Stage 2 trained model

📊 MSTed Dataset

We introduce the Multi-Shot TED (MSTed) dataset, designed to capture significant variations in viewpoints and camera distances:

  • 1,084 unique identities
  • 15,260 video clips
  • ~30 hours of total content
  • Diverse viewpoints and camera distances
  • Professional quality TED talk videos

Dataset Download: Link

Dataset Structure

data/
├── msted/
│   ├── videos/
│   │   ├── identity_001/
│   │   │   ├── clip_001.mp4
│   │   │   └── ...
│   │   └── ...
│   ├── poses/
│   │   ├── identity_001/
│   │   │   ├── clip_001_poses.json
│   │   │   └── ...
│   │   └── ...
│   └── metadata.json
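
The sketch below shows one way to iterate over clips laid out as above and pair each video with its pose annotation. It is a minimal illustration assuming the directory layout shown here; iter_msted_samples is a hypothetical helper written for this README, not a utility shipped with the repository.

import json
from pathlib import Path

# Hypothetical helper: walks the MSTed layout shown above and pairs each video
# clip with its corresponding pose annotation file.
def iter_msted_samples(root="data/msted"):
    root = Path(root)
    for video_path in sorted((root / "videos").glob("identity_*/clip_*.mp4")):
        identity = video_path.parent.name
        pose_path = root / "poses" / identity / f"{video_path.stem}_poses.json"
        if not pose_path.exists():
            continue  # skip clips that have no pose annotation
        with open(pose_path) as f:
            poses = json.load(f)
        yield {"identity": identity, "video": str(video_path), "poses": poses}

# Example usage: print the first available sample.
if __name__ == "__main__":
    sample = next(iter_msted_samples(), None)
    if sample is not None:
        print(sample["identity"], sample["video"])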

📝 Model Architecture

Illustration of our framework: a reference set is fed into a reference UNet to extract reference features. To filter out redundant information in the reference feature set, we propose a pose correlation guider that creates a correlation map indicating which regions of the references are spatially informative. We then adopt a reference selection strategy that picks the informative tokens from the reference feature set according to the correlation map and passes them to the subsequent modules.

Our framework consists of:

  1. Reference UNet: Extracts reference features from multiple input images
  2. Pose Correlation Module: Computes similarities between target and source poses
  3. Adaptive Reference Selection: Selects informative tokens based on correlation maps
  4. Animation Generation: Synthesizes final human animation
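
To make the selection mechanism concrete, here is a minimal, self-contained sketch of pose-correlation-guided token selection. It is an illustration under our own assumptions (tensor shapes, averaging the attention over target tokens, and the keep_ratio parameter are all hypothetical), not the authors' implementation.

import torch

# Hypothetical correlation: attention-style similarity between target-pose tokens
# and reference-pose tokens, averaged over the target tokens.
def pose_correlation(target_pose_feat, ref_pose_feats):
    # target_pose_feat: (B, M, C), ref_pose_feats: (B, N, C) -> scores of shape (B, N)
    scale = target_pose_feat.shape[-1] ** -0.5
    attn = torch.einsum("bmc,bnc->bmn", target_pose_feat, ref_pose_feats) * scale
    attn = attn.softmax(dim=-1)        # per target token, a distribution over reference tokens
    return attn.mean(dim=1)            # average over target tokens -> (B, N)

# Hypothetical selection: keep only the reference tokens the correlation map marks as informative.
def select_reference_tokens(ref_feats, correlation, keep_ratio=0.25):
    # ref_feats: (B, N, C), correlation: (B, N) -> (B, K, C) with K = keep_ratio * N
    B, N, C = ref_feats.shape
    k = max(1, int(N * keep_ratio))
    topk_idx = correlation.topk(k, dim=1).indices            # (B, K)
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, C)           # (B, K, C)
    return torch.gather(ref_feats, dim=1, index=idx)

# Example usage with random tensors.
if __name__ == "__main__":
    B, M, N, C = 2, 64, 256, 128
    ref_feats = torch.randn(B, N, C)
    corr = pose_correlation(torch.randn(B, M, C), torch.randn(B, N, C))
    print(select_reference_tokens(ref_feats, corr).shape)    # torch.Size([2, 64, 128])

In this sketch, only the selected tokens would be passed on to the subsequent attention layers, which is where the computational savings would come from.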

🎯 Results

Our method achieves superior performance compared to SOTA methods under large viewpoint changes:

  • Qualitative Results: High-fidelity human animation with diverse viewpoints
  • Quantitative Evaluation: Improved metrics on viewpoint variation scenarios
  • User Studies: Preferred by users for realistic viewpoint transitions

For detailed experimental results and visual comparisons, please refer to our project page and the full paper.

📚 Citation

If you find this work useful for your research, please cite:

@inproceedings{hong2024fvhuman,
  author    = {Hong, Fa-Ting and Xu, Zhan and Liu, Haiyang and Lin, Qinjie and Song, Luchuan and Shu, Zhixin and Zhou, Yang and Ceylan, Duygu and Xu, Dan},
  title     = {Free-viewpoint Human Animation with Pose-correlated Reference Selection},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
}

🔗 Related Links

📜 License

This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

🙏 Acknowledgments

We thank the creators of TED talks for providing diverse and high-quality video content that made the MSTed dataset possible. We also acknowledge the support from HKUST and Adobe Research.

📧 Contact

For questions and collaborations, please contact:


⭐ If you find this project helpful, please consider giving it a star! ⭐
