Abstract: Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing built on these foundation models has expanded rapidly. To enhance editing quality, structural controls are frequently employed; among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we introduce Mask Matching Cost (MMC), a metric that quantifies this variability, and propose **FreeMask**, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism across the full set of attention features, i.e., temporal, cross-, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning while enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
- 2025.1.16: Code Version 1 released (on base model Zeroscope; editing tasks include stylization & shape editing).
- 🎉 2024.12.9: Accepted by AAAI 2025!
- Code will be made publicly available after meticulous internal review. Stay tuned ⭐ for updates!
@inproceedings{cai2025freemask,
title={FreeMask: Rethinking the Importance of Attention Masks for Zero-shot Video Editing},
author={Cai, Lingling and Zhao, Kang and Yuan, Hangjie and Zhang, Yingya and Zhang, Shiwei and Huang, Kejie},
booktitle={Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)},
year={2025}
}

conda create -n freemask python==3.11
conda activate freemask
pip install -r requirements.txt

Download DAVIS2016 from https://davischallenge.org/davis2016/code.html. In DAVIS2016, videos with only a single-category segmentation map are selected, and 8 frames are chosen from each video for computation.
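If helpful, the frame selection can be scripted. Below is a minimal sketch assuming the standard DAVIS 2016 layout (JPEGImages/480p/&lt;video&gt;/*.jpg); the output directory matches the dataset/frames/&lt;video&gt; path used in the configs below, while the video list and the even-spacing selection strategy are illustrative assumptions, not part of the released code.

```python
# Hypothetical helper (not part of the repo): copy 8 evenly spaced frames
# per selected DAVIS 2016 video into dataset/frames/<video>.
import shutil
from pathlib import Path

DAVIS_ROOT = Path("DAVIS")            # assumed download location
OUT_ROOT = Path("dataset/frames")
SELECTED = ["bear", "blackswan"]      # videos with a single-category mask (example list)
NUM_FRAMES = 8

for name in SELECTED:
    frames = sorted((DAVIS_ROOT / "JPEGImages" / "480p" / name).glob("*.jpg"))
    # pick 8 roughly evenly spaced frames (an assumed selection strategy)
    step = max(len(frames) // NUM_FRAMES, 1)
    picked = frames[::step][:NUM_FRAMES]
    out_dir = OUT_ROOT / name
    out_dir.mkdir(parents=True, exist_ok=True)
    for f in picked:
        shutil.copy(f, out_dir / f.name)
```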
Prepare configs for all of your selected videos like this:
config/mask_bear.yaml
You need to change the following settings for different videos:
dataset_config:
path: "dataset/frames/bear" #change to your video frame path
prompt: "a bear is walking" #change to your prompt
...
editing_config:
cal_maps: True #True for cross-attention visualization
dataname: "bear" #change to your video name
word: ["bear","bear"] #change to your edited object
...
editing_prompts: [
a bear is walking
] #change to your prompt

Run for cross-attention visualization:
python cal_mask.py --config config/mask_bear.yaml
Then, the cross-attention maps for dataname across all layers and all timesteps will be saved at ./camap/dataname.
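Note that the MIoU example below compares a ground-truth mask with a binarized cross-attention map (binarized_bear_camap.jpg). The binarization step itself is not spelled out in this README; the following is a minimal sketch using Otsu thresholding on a saved grayscale map, where the input file name is a placeholder and the threshold choice is our own assumption.

```python
# Hypothetical binarization step (threshold choice is an assumption):
# turn a saved grayscale cross-attention map into a binary mask for MIoU.
import cv2

camap = cv2.imread("camap/bear/example_layer_timestep.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
# Otsu's method picks the threshold automatically from the intensity histogram.
_, binary = cv2.threshold(camap, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("dataset/miou_test/binarized_bear_camap.jpg", binary)
```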
Calculate the MIoU of all cross-attention maps with the ground-truth segmentation mask, then compute the TMMC and LMMC according to Eq. 2-6 in the paper.
We provide an example of MIoU calculation for one cross-attention map against the ground-truth segmentation mask:
python calculate_miou.py "dataset/miou_test/bear_mask.jpg" "dataset/miou_test/binarized_bear_camap.jpg"
After calculating the average MMC over all videos, you will obtain a codebook of MMC values across timesteps and layers.
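For reference, the sketch below shows one way to batch the per-map MIoU computation and average the scores into a per-(timestep, layer) table over all videos. The in-memory data layout and the MIoU definition here are assumptions for illustration (the repo's calculate_miou.py may differ), and the conversion of these averages into TMMC/LMMC (and hence the MMC codebook) follows Eq. 2-6 in the paper, which is not reproduced here.

```python
# Hypothetical batching skeleton (not the official script): average MIoU per
# (timestep, layer) over all videos, as the raw ingredient for the MMC codebook.
from collections import defaultdict
import numpy as np

def miou(pred, gt):
    """Mean IoU over the two classes (foreground and background) of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    ious = []
    for p, g in ((pred, gt), (~pred, ~gt)):
        union = (p | g).sum()
        ious.append((p & g).sum() / union if union > 0 else 1.0)
    return float(np.mean(ious))

def build_codebook(binarized_maps, gt_masks):
    # binarized_maps[(video, timestep, layer)] -> binary cross-attention map
    # gt_masks[video] -> binary ground-truth segmentation mask (same resolution)
    scores = defaultdict(list)
    for (video, t, layer), camap in binarized_maps.items():
        scores[(t, layer)].append(miou(camap, gt_masks[video]))
    # Average over videos; turning these averages into TMMC/LMMC follows
    # Eq. 2-6 in the paper and is not reproduced here.
    return {key: float(np.mean(vals)) for key, vals in scores.items()}
```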
Prepare a config like:
config/giraffe_style.yaml
Run for style translation:
python run.py --config config/giraffe_style.yaml
Prepare a config like:
config/girl_jump_shape.yaml
Run for shape editing:
python run.py --config config/girl_jump_shape.yaml