FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

Accepted by SIGGRAPH 2025

Shiyi Zhang1,2*, Junhao Zhuang1,2*, Zhaoyang Zhang2‡, Ying Shan2, Yansong Tang1✉
1Tsinghua University 2ARC Lab, Tencent PCG
*Equal Contribution  ‡Project Lead  ✉Corresponding Author

Your star means a lot to us in developing this project! ⭐⭐⭐

Demo video: FlexiAct_demo.mp4

🔥 Update Log

  • [2025/8/18] 📢 📢 The code for "Automatic Evaluations" is released.
  • [2025/5/6] 📢 📢 FlexiAct, a flexible action transfer framework for heterogeneous scenarios, is released.
  • [2025/5/6] 📢 📢 Our training data is released.

📋 TODO

  • Release the instructions for Windows
  • Update the gradio demo's instructions
  • Update the code of "Automatic Evaluations"
  • Release training and inference code
  • Release FlexiAct checkpoints (based on CogVideoX-5B)
  • Release training data
  • Release gradio demo

🛠️ Method Overview

We propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process attends to motion (low frequency) and appearance details (high frequency) to varying degrees at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, extracts actions directly during the denoising process.

🚀 Getting Started

Environment Requirement 🔧

Step 1: Clone this repo

git clone https://github.com/TencentARC/FlexiAct.git

Step 2: Install required packages

bash env.sh
conda activate cog
Data Preparation ⏬

Option 1: Official data

You can download the data we used in our paper here:

cd FlexiAct
git clone https://huggingface.co/datasets/shiyi0408/FlexiAct ./benchmark

By downloading the data, you agree to the terms and conditions of the license. The data structure should be as follows:

|-- benchmark
    |-- captions
        |-- animal
            |-- dogjump
                |-- crop.csv
                |-- val_image.csv
            |-- dogstand
            |-- ...
        |-- camera
            |-- camera_forward
                |-- crop.csv
                |-- val_image.csv
            |-- camera_rotate
            |-- ...
        |-- human
            |-- chest
                |-- crop.csv
                |-- val_image.csv
            |-- crouch
            |-- ...
    |-- reference_videos
        |-- animal
            |-- dogjump
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- dogstand
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- camera
            |-- camera_forward
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- camera_rotate
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- human
            |-- chest
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- crouch
                |-- 0.mp4
                |-- 1.mp4
                |-- ...
            |-- ...
        |-- extract_vid_and_crop.py
    |-- target_image
        |-- animal
            |-- animal_bird1.jpg
            |-- animal_capy1.webp
            |-- ...
        |-- camera
            |-- view1.jpg
            |-- view2.jpg
            |-- ...
        |-- human
            |-- game_girl_1.webp
            |-- game_girl_2.webp
            |-- ...
    

Option 2: Prepare your own data

For each action, we use crop.csv to store information about the reference videos used for training, and val_image.csv to store information about the target images used for validation during training. The specific steps are as follows:

Step 1: Prepare your reference video

Save your video in benchmark/reference_videos/{scenario} (using rotate.mp4 as an example, where {scenario} is human). Adjust the parameters in benchmark/reference_videos/extract_vid_and_crop.py according to your needs to determine the cropped segments:

action_name = "rotate" # your action name, same with the reference video name
subject_type = "human" # camera, human, animal
start_second = 3 # start second of the action
end_second = 9 # end second of the action

Then execute:

python benchmark/reference_videos/extract_vid_and_crop.py

After running the script, you will get the folder benchmark/reference_videos/{scenario}/{action_name}_crop, which contains 12 new videos produced by random cropping (see the second paragraph of Section 3.4 of our paper). This cropping helps prevent the Frequency-aware Embedding from focusing on the reference video's layout.

Step 2: Create crop.csv

To obtain captions for the reference videos, we recommend using CogVLM to generate video descriptions. Then create crop.csv in benchmark/captions/{scenario}/{action_name}. You can directly copy crop.csv from one of our provided examples, modify the action name in the path (first column) to {action_name}, and replace the caption in the last column with the generated caption. The other columns do not need to be modified.
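
If you prefer to script this step, the minimal sketch below does the same edit. It assumes, as described above, that the video path sits in the first column and the caption in the last column; the "chest" example file and the caption text are placeholders only:

import csv
import os

action_name = "rotate"                               # same name as your reference video
scenario = "human"                                   # camera, human, animal
caption = "A man slowly rotates in place."           # placeholder: use your CogVLM caption

src = "benchmark/captions/human/chest/crop.csv"      # example crop.csv from the benchmark
dst = f"benchmark/captions/{scenario}/{action_name}/crop.csv"

with open(src, newline="") as f:
    rows = list(csv.reader(f))

for row in rows:
    if not row:
        continue
    row[0] = row[0].replace("chest", action_name)    # point the path at your cropped videos
    row[-1] = caption                                # replace the caption in the last column

os.makedirs(os.path.dirname(dst), exist_ok=True)
with open(dst, "w", newline="") as f:
    csv.writer(f).writerows(rows)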

Step 3: Prepare target images and create val_image.csv

First, prepare the target images you want to animate in benchmark/target_images/{scenario}.

Then, create val_image.csv in benchmark/captions/{scenario}/{action_name} to store the paths and captions of the target images used for testing during training. We recommend using captions similar to those of the reference videos. The format of val_image.csv is shown below:

Path                                                          Caption
benchmark/target_images/{scenario}/{your_target_image1.jpg}   ...
benchmark/target_images/{scenario}/{your_target_image2.jpg}   ...
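
A minimal sketch for writing this file, assuming a plain comma-separated file with the Path and Caption columns shown above; the image paths and captions below are placeholders:

import csv
import os

scenario = "human"                                   # camera, human, animal
action_name = "rotate"                               # same action name as in Step 1
rows = [                                             # placeholder target images and captions
    ("benchmark/target_images/human/game_girl_1.webp", "A game character girl slowly rotates in place."),
    ("benchmark/target_images/human/game_girl_2.webp", "A game character girl slowly rotates in place."),
]

out_path = f"benchmark/captions/{scenario}/{action_name}/val_image.csv"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Path", "Caption"])
    writer.writerows(rows)
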
Checkpoints 📊

Checkpoints of FlexiAct can be downloaded from here. The ckpts folder contains:

  • RefAdapter pretrained checkpoints for CogVideoX-5b-I2V
  • 16 types of FAE pretrained checkpoints for CogVideoX-5b-I2V

You can download the checkpoints and place them in the ckpts folder by running:

# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/shiyi0408/FlexiAct ckpts_temp
mv ckpts_temp/ckpts .
rm -r ckpts_temp

You also need to download the base model CogVideoX-5B-I2V to {your_cogvideoi2v_path} by:

git lfs install
git clone https://huggingface.co/THUDM/CogVideoX-5b-I2V {your_cogvideoi2v_path}

The ckpts folder structure should look like this:

|-- ckpts
    |-- FAE
        |-- motion_ckpts
            |-- camera_forward.pt
            |-- ...
        |-- reference_videos
            |-- camera_forward.mp4
            |-- ...
    |-- refnetlora_step40000_model.pt # RefAdapter ckpt

🏃🏼 Running Scripts

Training 🤯

Single GPU Memory Usage: 37GB

Note:
We have provided the pre-trained checkpoint for RefAdapter, so you don't need to train it yourself. We still provide scripts/train/RefAdapter_train.sh as its training script; if you wish to train RefAdapter, we recommend using MiraData as the training data. The following describes how to train FAE on your reference videos.

Training script:

# v: CUDA_VISIBLE_DEVICES
# a: your action name
bash scripts/train/FAE_train.sh -v 0,1,2,3 -a rotate
Inference 📜

Single GPU Memory Usage: 32GB

You can animate your target images with pretrained FAE checkpoints:

bash scripts/inference/Inference.sh
Evaluation 📈

We provide evaluation code for four metrics in the eval folder: Motion Fidelity, Appearance Consistency, Temporal Consistency, and Text Similarity.

Appearance Consistency and Temporal Consistency: set target_directory in the corresponding code to your output video folder, then run the script.

Text Similarity: set target_directory in the code to your output video folder, change the csv path to the one containing the prompts, then run the script.
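
Please use the provided scripts for reported numbers. Purely as an illustration of what a CLIP-based text-similarity metric of this kind typically computes, here is a self-contained sketch; the CLIP checkpoint, cv2 frame sampling, and example paths are our own assumptions, not necessarily what the eval code does:

import cv2
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_similarity(video_path, caption, num_frames=8):
    """Average cosine similarity between the caption and uniformly sampled frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB for CLIP
    cap.release()
    if not frames:
        return float("nan")
    inputs = processor(text=[caption], images=frames, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Example: score one generated video against its prompt (paths are placeholders)
print(clip_text_similarity("outputs/rotate/0.mp4", "A game character girl slowly rotates in place."))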

Motion Fidelity: we adopt CoTracker to evaluate the Motion Fidelity between the generated video and the original video. First, you need to set up the dependencies required by CoTracker following the instructions in eval/motion_fidelity/co-tracker/README.md. Then execute:

cd eval/motion_fidelity
git clone https://huggingface.co/shiyi0408/cotracker_ckpt_for_FlexiAct
mv cotracker_ckpt_for_FlexiAct checkpoints

to download the necessary checkpoint.

Next, in eval/motion_fidelity/configs/motion_fidelity_score_config.yaml, change the paths of the generated and original videos.

Finally, use the example code in eval/motion_fidelity/motion_fidelity.py to evaluate the Motion Fidelity between the generated video and the original video.

🤝🏼 Cite Us

@inproceedings{zhang2025flexiact,
  title={Flexiact: Towards flexible action control in heterogeneous scenarios},
  author={Zhang, Shiyi and Zhuang, Junhao and Zhang, Zhaoyang and Shan, Ying and Tang, Yansong},
  booktitle={Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  pages={1--11},
  year={2025}
}

🙏 Acknowledgement

Our code is modified from diffusers and CogVideoX. Thanks to all the contributors!
