Fuse2Match: Training-Free Fusion of Flow, Diffusion, and Contrastive Models for Zero-Shot Semantic Matching
This repository contains the official implementation of our NeurIPS 2025 poster paper "Fuse2Match: Training-Free Fusion of Flow, Diffusion, and Contrastive Models for Zero-Shot Semantic Matching".
Recent work shows that features from Stable Diffusion (SD) and contrastively pretrained models like DINO can be directly used for zero-shot semantic correspondence via naive feature concatenation. In this paper, we explore the stronger potential of Stable Diffusion 3 (SD3), a rectified flow-based model with a multimodal transformer backbone (MM-DiT). We show that semantic signals in SD3 are scattered across multiple timesteps and transformer layers, and propose a multi-level fusion scheme to extract discriminative features. Moreover, we identify that naive fusion across models suffers from inconsistent feature distributions, which leads to suboptimal performance. To address this, we propose a simple yet effective confidence-aware feature fusion strategy that re-weights each model's contribution based on prediction confidence scores derived from its matching uncertainty. Notably, this fusion approach is not only training-free but also enables per-pixel adaptive integration of heterogeneous features. The resulting representation, Fuse2Match, significantly outperforms strong baselines on SPair-71k, PF-Pascal, and PSC6K, validating the benefit of combining SD3, SD, and DINO through our proposed confidence-aware feature fusion.
Environment Setup
conda create -n fuse2match python==3.12.3
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install diffusers timm transformers numpy accelerate sentencepiece

Requirements
An NVIDIA A100 40GB GPU is recommended for optimal performance.
Preparing Data
cd path/to/data # Replace with your dataset folder path
wget https://cvlab.postech.ac.kr/research/SPair-71k/data/SPair-71k.tar.gz
tar -xvf SPair-71k.tar.gz

Evaluation
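Evaluation on SPair-71k is conventionally reported as PCK (Percentage of Correct Keypoints). For reference, a minimal sketch of the metric (an illustrative helper, not code from this repository): a predicted keypoint counts as correct if it lies within alpha times the longer side of the target object's bounding box from the ground truth.

```python
import numpy as np

def pck(pred_kps, gt_kps, bbox_hw, alpha=0.10):
    # pred_kps, gt_kps: [K, 2] arrays of (x, y) keypoints in the target image.
    # bbox_hw: (height, width) of the target object's bounding box.
    # Returns PCK@alpha: fraction of predictions within alpha * max(h, w) of GT.
    thresh = alpha * max(bbox_hw)
    dists = np.linalg.norm(np.asarray(pred_kps, float) - np.asarray(gt_kps, float), axis=1)
    return float((dists <= thresh).mean())
```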
python evaluation.py \
--model_names sd3 sd2 dinov2 \
--dataset_name SPair71k \
--dataset_path path/to/SPair-71k \
\
--sd3_save_path sd3 \
--sd3_img_size 1024 1024 \
--sd3_t 721 621 521 \
--sd3_layers 24 25 \
--sd3_facets query value \
\
--sd2_save_path sd2 \
--sd2_img_size 768 768 \
--sd2_t 261 \
--up_ft_index 1 \
\
--dinov2_save_path dinov2 \
--dinov2_img_size 840 840 \
--dinov2_size large \
\
--ensemble_size 8 \
--load_from_local

Citation
If you find our work useful, please cite:
@inproceedings{zuo2025esd3,
title = {Fuse2Match: Training-Free Fusion of Flow, Diffusion, and Contrastive Models for Zero-Shot Semantic Matching},
author = {Zuo, Jing and Wang, Jiaqi and Qi, Yonggang and Song, Yi-Zhe},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2025}
}