FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

Accepted by SIGGRAPH 2025

Shiyi Zhang1,2*, Junhao Zhuang1,2*, Zhaoyang Zhang2‡, Ying Shan2, Yansong Tang1✉
1Tsinghua University 2ARC Lab, Tencent PCG
*Equal Contribution  ‡Project Lead  ✉Corresponding Author

Your star means a lot to us in developing this project! ⭐⭐⭐

Demo video: FlexiAct_demo.mp4

🔥 Update Log

  • [2025/8/18] 📢 📢 The code for "Automatic Evaluations" is released.
  • [2025/5/6] 📢 📢 FlexiAct, a flexible action transfer framework for heterogeneous scenarios, is released.
  • [2025/5/6] 📢 📢 Our training data is released.

📋 TODO

  • Release the instructions for Windows
  • Update the gradio demo's instructions
  • Update the code of "Automatic Evaluations"
  • Release training and inference code
  • Release FlexiAct checkpoints (based on CogVideoX-5B)
  • Release training data
  • Release gradio demo

🛠️ Method Overview

We propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process attends to motion (low frequency) and appearance details (high frequency) to varying degrees at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, extracts actions directly during the denoising process.

🚀 Getting Started

Environment Requirement 🔧

Step 1: Clone this repo

git clone https://github.com/TencentARC/FlexiAct.git

Step 2: Install required packages

bash env.sh
conda activate cog
Data Preparation ⏬

Option 1: Official data

You can download the data we used in our paper here:

cd FlexiAct
git clone https://huggingface.co/datasets/shiyi0408/FlexiAct ./benchmark

By downloading the data, you agree to the terms and conditions of the license. The data structure should be as follows:

|-- benchmark
    |-- captions
        |-- animal
            |-- dogjump
                |-- crop.csv
                |-- val_image.csv
            |-- dogstand
            |-- ...
        |-- camera
            |-- camera_forward
                |-- crop.csv
                |-- val_image.csv
            |-- camera_rotate
            |-- ...
        |-- human
            |-- chest
                |-- crop.csv
                |-- val_image.csv
            |-- crouch
            |-- ...
    |-- reference_videos
        |-- animal
            |-- dogjump
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- dogstand
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- camera
            |-- camera_forward
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- camera_rotate
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- human
            |-- chest
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- crouch
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- extract_vid_and_crop.py
    |-- target_image
        |-- animal
            |-- animal_bird1.jpg
            |-- animal_capy1.webp
            |-- ...
        |-- camera
            |-- view1.jpg
            |-- view2.jpg
            |-- ...
        |-- human
            |-- game_girl_1.webp
            |-- game_girl_2.webp
            |-- ...
    

Option 2: Prepare your own data

For each action, we use crop.csv to store information about the reference videos used for training, and val_image.csv to store information about the target images used for validation during training. The specific steps are as follows:

Step 1: Prepare your reference video

Save your video in benchmark/reference_videos/{scenario} (using rotate.mp4 as an example, where {scenario} is human). Adjust the parameters in benchmark/reference_videos/extract_vid_and_crop.py according to your needs to determine the cropped segments:

action_name = "rotate" # your action name, same with the reference video name
subject_type = "human" # camera, human, animal
start_second = 3 # start second of the action
end_second = 9 # end second of the action

Then execute:

python benchmark/reference_videos/extract_vid_and_crop.py

After running the script, you will get the folder benchmark/reference_videos/{scenario}/{action_name}_crop, which contains 12 new videos produced by random cropping (see the second paragraph of Section 3.4 of our paper). This cropping helps prevent the Frequency-aware Embedding from focusing on the reference video's layout.

Step 2: Create crop.csv

To obtain captions for the reference videos, we recommend using CogVLM to generate video descriptions. Then create crop.csv in benchmark/captions/{scenario}/{action_name}. You can directly copy crop.csv from one of our provided examples, modify the action name in the path (first column) to {action_name}, and replace the caption in the last column with the generated caption. The other columns do not need to be modified.
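
If you prefer to script this step, the minimal sketch below does the same edit. It assumes, as described above, that the video path sits in the first column and the caption in the last column; the "chest" example file and the caption text are placeholders only:

import csv
import os

action_name = "rotate"                               # same name as your reference video
scenario = "human"                                   # camera, human, animal
caption = "A man slowly rotates in place."           # placeholder: use your CogVLM caption

src = "benchmark/captions/human/chest/crop.csv"      # example crop.csv from the benchmark
dst = f"benchmark/captions/{scenario}/{action_name}/crop.csv"

with open(src, newline="") as f:
    rows = list(csv.reader(f))

for row in rows:
    if not row:
        continue
    row[0] = row[0].replace("chest", action_name)    # point the path at your cropped videos
    row[-1] = caption                                # replace the caption in the last column

os.makedirs(os.path.dirname(dst), exist_ok=True)
with open(dst, "w", newline="") as f:
    csv.writer(f).writerows(rows)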

Step 3: Prepare target images and create val_image.csv

First, prepare the target images you want to animate in benchmark/target_images/{scenario}.

Then, create val_image.csv in benchmark/captions/{scenario}/{action_name} to store the paths and captions of the target images used for testing during training. We recommend using captions similar to those of the reference videos. The format of val_image.csv is shown below:

Path                                                          Caption
benchmark/target_images/{scenario}/{your_target_image1.jpg}   ...
benchmark/target_images/{scenario}/{your_target_image2.jpg}   ...
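
A minimal sketch for writing this file, assuming a plain comma-separated file with the Path and Caption columns shown above; the image paths and captions below are placeholders:

import csv
import os

scenario = "human"                                   # camera, human, animal
action_name = "rotate"                               # same action name as in Step 1
rows = [                                             # placeholder target images and captions
    ("benchmark/target_images/human/game_girl_1.webp", "A game character girl slowly rotates in place."),
    ("benchmark/target_images/human/game_girl_2.webp", "A game character girl slowly rotates in place."),
]

out_path = f"benchmark/captions/{scenario}/{action_name}/val_image.csv"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Path", "Caption"])
    writer.writerows(rows)
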
Checkpoints 📊

Checkpoints of FlexiAct can be downloaded from here. The ckpts folder contains:

  • RefAdapter pretrained checkpoints for CogVideoX-5b-I2V
  • 16 types of FAE pretrained checkpoints for CogVideoX-5b-I2V

You can download the checkpoints and place them in the ckpts folder by running:

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/shiyi0408/FlexiAct ckpts_temp
mv ckpts_temp/ckpts .
rm -r ckpts_temp

You also need to download the base model CogVideoX-5B-I2V to {your_cogvideoi2v_path} by:

git lfs install
git clone https://huggingface.co/THUDM/CogVideoX-5b-I2V {your_cogvideoi2v_path}

The ckpts folder structure should look like this:

|-- ckpts
    |-- FAE
        |-- motion_ckpts
            |-- camera_forward.pt
            |-- ...
        |-- reference_videos
            |-- camera_forward.mp4
            |-- ...
    |-- refnetlora_step40000_model.pt # RefAdapter ckpt

🏃🏼 Running Scripts

Training 🤯

Single GPU Memory Usage: 37GB

Note:
We have provided the pre-trained checkpoint for RefAdapter, so you don't need to train it yourself. We still provide scripts/train/RefAdapter_train.sh as its training script; if you wish to train RefAdapter, we recommend using MiraData as the training data. The following describes how to train FAE on your reference videos.

Training script:

# v: CUDA_VISIBLE_DEVICES
# a: your action name
bash scripts/train/FAE_train.sh -v 0,1,2,3 -a rotate
Inference 📜

Single GPU Memory Usage: 32GB

You can animate your target images with pretrained FAE checkpoints:

bash scripts/inference/Inference.sh
Evaluation 📈

We provide evaluation code for four metrics in the eval folder: Motion Fidelity, Appearance Consistency, Temporal Consistency, and Text Similarity.

Appearance Consistency and Temporal Consistency: set target_directory in the corresponding code to your output video folder, then run the script.

Text Similarity: set target_directory in the code to your output video folder, change the csv path to the one containing the prompts, then run the script.
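
Please use the provided scripts for reported numbers. Purely as an illustration of what a CLIP-based text-similarity metric of this kind typically computes, here is a self-contained sketch; the CLIP checkpoint, cv2 frame sampling, and example paths are our own assumptions, not necessarily what the eval code does:

import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_similarity(video_path, caption, num_frames=8):
    """Average cosine similarity between the caption and uniformly sampled frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB for CLIP
    cap.release()
    if not frames:
        return float("nan")
    inputs = processor(text=[caption], images=frames, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Example: score one generated video against its prompt (paths are placeholders)
print(clip_text_similarity("outputs/rotate/0.mp4", "A game character girl slowly rotates in place."))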

Motion Fidelity: we adopt CoTracker to evaluate the Motion Fidelity between the generated video and the original video. First, you need to set up the dependencies required by CoTracker following the instructions in eval/motion_fidelity/co-tracker/README.md. Then execute:

cd eval/motion_fidelity
git clone https://huggingface.co/shiyi0408/cotracker_ckpt_for_FlexiAct
mv cotracker_ckpt_for_FlexiAct checkpoints

to download the necessary checkpoint.

Next, in eval/motion_fidelity/configs/motion_fidelity_score_config.yaml, change the paths of the generated and original videos.

Finally, use the example code in eval/motion_fidelity/motion_fidelity.py to evaluate the Motion Fidelity between the generated video and the original video.

🤝🏼 Cite Us

@inproceedings{zhang2025flexiact,
  title={Flexiact: Towards flexible action control in heterogeneous scenarios},
  author={Zhang, Shiyi and Zhuang, Junhao and Zhang, Zhaoyang and Shan, Ying and Tang, Yansong},
  booktitle={Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  pages={1--11},
  year={2025}
}

🙏 Acknowledgement

Our code is modified from diffusers and CogVideoX. Thanks to all the contributors!
