Accepted by SIGGRAPH 2025
Shiyi Zhang1,2*, Junhao Zhuang1,2*, Zhaoyang Zhang2‡, Ying Shan2, Yansong Tang1✉
1Tsinghua University 2ARC Lab, Tencent PCG
*Equal Contribution ‡Project Lead ✉Corresponding Author
Your star means a lot to us in developing this project! ⭐⭐⭐
FlexiAct_demo.mp4
🔥 News
- [2025/8/18] 📢 📢 The code for "Automatic Evaluations" is released.
- [2025/5/6] 📢 📢 FlexiAct, a flexible action transfer framework for heterogeneous scenarios, is released.
- [2025/5/6] 📢 📢 Our training data is released.
TODO List
- Release the instructions for Windows
- Update the gradio demo's instructions
- Update the code of "Automatic Evaluations"
- Release training and inference code
- Release FlexiAct checkpoints (based on CogVideoX-5B)
- Release training data
- Release gradio demo
We propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process attends to motion (low frequency) and appearance details (high frequency) to varying degrees at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, extracts the action directly during the denoising process.
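As an intuition for the frequency-aware design described above, the sketch below illustrates how an action signal could be emphasized at low frequencies early in denoising and at high frequencies later. This is purely our illustration, not the released FAE implementation; the FFT-based split, the linear schedule, and the tensor shapes are all assumptions.

```python
# Illustrative sketch only (not the released FAE code): a timestep-dependent mix
# that favors low-frequency (motion) structure early in denoising and
# high-frequency (appearance) detail later.
import torch

def frequency_split(latent: torch.Tensor, cutoff: int = 4):
    """Split a latent into low- and high-frequency parts with a 2D FFT over the spatial dims."""
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    h, w = latent.shape[-2:]
    mask = torch.zeros(h, w, device=latent.device)
    mask[h // 2 - cutoff:h // 2 + cutoff, w // 2 - cutoff:w // 2 + cutoff] = 1.0
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    return low, latent - low

def timestep_weighted_mix(latent: torch.Tensor, t: int, num_steps: int = 50) -> torch.Tensor:
    """Early (large t): weight the low-frequency band more; late (small t): the high-frequency band."""
    low, high = frequency_split(latent)
    w_low = t / num_steps                      # hypothetical linear schedule
    return w_low * low + (1.0 - w_low) * high

latent = torch.randn(1, 16, 32, 32)            # dummy latent for demonstration
print(timestep_weighted_mix(latent, t=40).shape)
```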

Environment Requirement 🔧
Step 1: Clone this repo
git clone https://github.com/TencentARC/FlexiAct.git
Step 2: Install required packages
bash env.sh
conda activate cog
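After activating the environment, a quick sanity check (our suggestion, not part of the repo) is to confirm that PyTorch can see your GPU and that diffusers imports cleanly:

```python
# Optional sanity check (not part of the repo): verify GPU visibility and key imports.
import torch
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)
```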
Data Preparation ⏬
Option 1: Official data
You can download the data used in our paper here.
cd FlexiAct
git clone https://huggingface.co/datasets/shiyi0408/FlexiAct ./benchmark
By downloading the data, you agree to the terms and conditions of the license. The data should be organized as follows:
|-- benchmark
    |-- captions
        |-- animal
            |-- dogjump
                |-- crop.csv
                |-- val_image.csv
            |-- dogstand
            |-- ...
        |-- camera
            |-- camera_forward
                |-- crop.csv
                |-- val_image.csv
            |-- camera_rotate
            |-- ...
        |-- human
            |-- chest
                |-- crop.csv
                |-- val_image.csv
            |-- crouch
            |-- ...
    |-- reference_videos
        |-- animal
            |-- dogjump
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- dogstand
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- camera
            |-- camera_forward
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- camera_rotate
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- human
            |-- chest
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- crouch
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- extract_vid_and_crop.py
    |-- target_image
        |-- animal
            |-- animal_bird1.jpg
            |-- animal_capy1.webp
            |-- ...
        |-- camera
            |-- view1.jpg
            |-- view2.jpg
            |-- ...
        |-- human
            |-- game_girl_1.webp
            |-- game_girl_2.webp
            |-- ...
Option 2: Prepare your own data
For each action, we use crop.csv to store information about the reference videos used for training, and val_image.csv to store information about the target images used for validation during training. The specific steps are as follows:
Step1: Prepare your reference video
Save your video in benchmark/reference_videos/{scenario} (using rotate.mp4 as an example, where {scenario} is human). Adjust the parameters in benchmark/reference_videos/extract_vid_and_crop.py according to your needs to determine the cropped segments:
action_name = "rotate" # your action name, same with the reference video name
subject_type = "human" # camera, human, animal
start_second = 3 # start second of the action
end_second = 9 # end second of the action
Then execute:
python benchmark/reference_videos/extract_vid_and_crop.py
You will get a benchmark/reference_videos/{scenario}/{action_name}_crop folder containing 12 new videos produced by random cropping (see the second paragraph of Section 3.4 in our paper). This helps prevent the Frequency-aware Embedding from focusing on the reference video's layout.
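The authoritative cropping logic lives in extract_vid_and_crop.py; purely as an illustration of the idea (several random spatial crops of the same action segment, so the embedding cannot latch onto one fixed layout), a minimal ffmpeg-based sketch might look like this. The crop ratios, output names, and paths here are assumptions:

```python
# Illustrative sketch only (extract_vid_and_crop.py in the repo is the real script):
# write several randomly cropped variants of one reference clip with ffmpeg.
import random
import subprocess
from pathlib import Path

src = Path("benchmark/reference_videos/human/rotate.mp4")   # hypothetical input clip
out_dir = src.parent / f"{src.stem}_crop"
out_dir.mkdir(exist_ok=True)

for i in range(12):                       # 12 variants, matching the provided script's output
    scale = random.uniform(0.7, 0.95)     # hypothetical crop ratio
    x = random.random() * (1 - scale)     # random top-left corner, as a fraction of width/height
    y = random.random() * (1 - scale)
    crop = f"crop=iw*{scale:.2f}:ih*{scale:.2f}:iw*{x:.2f}:ih*{y:.2f}"
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-vf", crop, "-an", str(out_dir / f"{i}.mp4")],
        check=True,
    )
```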
Step2: Create crop.csv
To obtain captions for the reference videos, we recommend using CogVLM to generate video descriptions. Then create crop.csv in benchmark/captions/{scenario}/{action_name}. You can copy crop.csv from one of our provided examples, replace the action name in the path (first column) with {action_name}, and replace the caption in the last column with your generated caption. The other columns do not need to be modified.
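If you prefer to script this step, a small helper along these lines works; it is a sketch that assumes the example file has no header row, keeps the path in the first column, and keeps the caption in the last column, as described above:

```python
# Sketch: clone an example crop.csv and swap in your action name and caption.
# Assumes no header row; the path sits in the first column and the caption in the last.
import csv
from pathlib import Path

example_csv = Path("benchmark/captions/human/chest/crop.csv")    # an existing example
new_csv = Path("benchmark/captions/human/rotate/crop.csv")       # your new action
action_name = "rotate"
caption = "A person rotates in place."                           # hypothetical CogVLM caption
new_csv.parent.mkdir(parents=True, exist_ok=True)

with example_csv.open() as f_in, new_csv.open("w", newline="") as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        row[0] = row[0].replace("chest", action_name)    # update the action name in the path
        row[-1] = caption                                # update the caption
        writer.writerow(row)
```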
Step3: Prepare target images and create val_image.csv
First, prepare the target images you want to animate in benchmark/target_images/{scenario}.
Then, create val_image.csv in benchmark/captions/{scenario}/{action_name} to store the paths and captions of the target images used for testing during training. We recommend using captions similar to those of the reference videos. The format of val_image.csv is shown below:
| Path | Caption |
|---|---|
| benchmark/target_images/{scenario}/{your_target_image1.jpg} | ... |
| benchmark/target_images/{scenario}/{your_target_image2.jpg} | ... |
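If you have many target images, a small helper like the following can generate val_image.csv; it is a sketch, and the placeholder caption, image extensions, and header row should be adjusted to match the provided examples:

```python
# Sketch: build val_image.csv (Path, Caption) from a folder of target images.
# The caption is a placeholder; use captions similar to those of your reference videos.
import csv
from pathlib import Path

scenario, action_name = "human", "rotate"
image_dir = Path(f"benchmark/target_images/{scenario}")
out_csv = Path(f"benchmark/captions/{scenario}/{action_name}/val_image.csv")
out_csv.parent.mkdir(parents=True, exist_ok=True)

with out_csv.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Path", "Caption"])                 # header row, if your examples use one
    for img in sorted(image_dir.glob("*")):
        if img.suffix.lower() in {".jpg", ".jpeg", ".png", ".webp"}:
            writer.writerow([str(img), "A game character rotates in place."])  # placeholder caption
```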
Checkpoints 📊
Checkpoints of FlexiAct can be downloaded from here. The ckpts folder contains:
- RefAdapter pretrained checkpoints for CogVideoX-5b-I2V
- 16 types of FAE pretrained checkpoints for CogVideoX-5b-I2V
You can download the checkpoints and place them in the ckpts folder by:
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/shiyi0408/FlexiAct ckpts_temp
mv ckpts_temp/ckpts .
rm -r ckpts_temp
You also need to download the base model CogVideoX-5B-I2V to {your_cogvideoi2v_path} by:
git lfs install
git clone https://huggingface.co/THUDM/CogVideoX-5b-I2V {your_cogvideoi2v_path}
The checkpoint structure should be organized as follows:
|-- ckpts
    |-- FAE
        |-- motion_ckpts
            |-- camera_forward.pt
            |-- ...
        |-- reference_videos
            |-- camera_forward.mp4
            |-- ...
    |-- refnetlora_step40000_model.pt # RefAdapter ckpt
Training 🤯
Single GPU Memory Usage: 37GB
Note:
We provide a pre-trained checkpoint for RefAdapter, so you do not need to train it yourself. However, we still provide scripts/train/RefAdapter_train.sh as its training script. If you wish to train RefAdapter, we recommend using MiraData as the training data. The following describes how to train FAE on your reference videos.
Training script:
# v: CUDA_VISIBLE_DEVICES
# a: your action name
bash scripts/train/FAE_train.sh -v 0,1,2,3 -a rotate
Inference 📜
Single GPU Memory Usage: 32GB
You can animate your target images with pretrained FAE checkpoints:
bash scripts/inference/Inference.sh
Evaluation 📈
We provide evaluation code for four metrics in the eval folder: Motion Fidelity, Appearance Consistency, Temporal Consistency, and Text Similarity.
Appearance Consistency and Temporal Consistency: set target_directory in the code to your output video folder and run the script.
Text Similarity: set target_directory in the code to your output video folder, change the csv path to the one containing your prompts, and run the script (a generic illustration of this kind of metric is sketched at the end of this section).
Motion Fidelity: we adopt CoTracker to evaluate the Motion Fidelity between the generated video and the original video. First, you need to set up the dependencies required by CoTracker following the instructions in eval/motion_fidelity/co-tracker/README.md.
Then execute:
cd eval/motion_fidelity
git clone https://huggingface.co/shiyi0408/cotracker_ckpt_for_FlexiAct
mv cotracker_ckpt_for_FlexiAct checkpoints
to download the necessary checkpoint.
Next, in eval/motion_fidelity/configs/motion_fidelity_score_config.yaml, change the paths of the generated and original videos.
Finally, use the example code in eval/motion_fidelity/motion_fidelity.py to evaluate the Motion Fidelity between the generated video and the original video.
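As a reference for the Text Similarity metric mentioned above: metrics of this kind are typically computed as the mean CLIP similarity between sampled video frames and the prompt. The sketch below is only an illustration, not the repo's eval script; the CLIP checkpoint, frame sampling, and csv column names are assumptions.

```python
# Illustration of a CLIP-based text-video similarity metric (not the repo's eval code).
# Assumes openai/clip-vit-large-patch14 and a csv with "video" and "prompt" columns.
import torch
import decord
import pandas as pd
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def text_video_similarity(video_path: str, prompt: str, num_frames: int = 8) -> float:
    vr = decord.VideoReader(video_path)
    idx = torch.linspace(0, len(vr) - 1, num_frames).long().tolist()
    frames = [vr[i].asnumpy() for i in idx]              # sampled RGB frames
    inputs = processor(text=[prompt], images=frames, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    with torch.no_grad():
        out = model(**inputs)
    sims = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds)
    return sims.mean().item()

df = pd.read_csv("your_prompts.csv")                     # hypothetical prompt csv
scores = [text_video_similarity(row["video"], row["prompt"]) for _, row in df.iterrows()]
print("mean text similarity:", sum(scores) / len(scores))
```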
Citation
@inproceedings{zhang2025flexiact,
title={Flexiact: Towards flexible action control in heterogeneous scenarios},
author={Zhang, Shiyi and Zhuang, Junhao and Zhang, Zhaoyang and Shan, Ying and Tang, Yansong},
booktitle={Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
pages={1--11},
year={2025}
}
Our code is modified from diffusers and CogVideoX; thanks to all their contributors!