Anurag Bagchi · Zhipeng Bao · Yu-Xiong Wang · Pavel Tokmakov · Martial Hebert
Official PyTorch implementation of the ICCV 2025 paper "ReferEverything".
We present the Refer Everything Model (REM), which repurposes text-to-video generation models to segment any concept in a video, zero-shot, from a text description.
## News
- [Coming Soon] Interactive demos, datasets, and MeViS checkpoints
- [Oct, 2025] Released the code and pretrained checkpoints for ModelScopeT2V-1.4B and Wan2.1-14B.
## Installation

Clone the repository:

```bash
git clone https://github.com/yourusername/ReferEverything.git
cd ReferEverything
```

Create the conda environment for ModelScopeT2V:

```bash
conda env create -f MS_env.yml
conda activate MS_env
```

Create the conda environment for Wan2.1:

```bash
conda env create -f Wan_env.yml
conda activate Wan_env
```

## Inference

Finetuned checkpoints for both models can be downloaded from Hugging Face.
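One convenient way to fetch them is the `huggingface-cli` tool. This is a minimal sketch; the repository ID below is a placeholder, so substitute the actual REM checkpoint repo:

```bash
# Minimal sketch: <rem-checkpoint-repo> is a placeholder for the actual
# Hugging Face repo ID of the REM finetuned checkpoints.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <rem-checkpoint-repo> --local-dir checkpoints/REM
```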
Run inference with the ModelScopeT2V-1.4B model:

```bash
bash run_REM_MS_sample.sh # Change the arguments in the script accordingly.
```

The Wan2.1-T2V-14B model is quite large. Please download the base Wan2.1-T2V-14B model from Hugging Face to an appropriate disk with enough space.
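For example, the base weights can be pulled with `huggingface-cli`; the repo ID below matches the public Wan2.1 release at the time of writing, and the target path is an assumption you should adapt to your storage:

```bash
# Assumed repo ID and destination; point --local-dir at a disk with enough free space.
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir /path/to/large/disk/Wan2.1-T2V-14B
```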
Then run inference with the Wan2.1-14B model:

```bash
bash run_REM_Wan14b_sample.sh # Change the arguments in the script accordingly.
```

## Training

We use RefCOCO/+/g and Refer-YouTube to train REM. Please follow ReferFormer to prepare the training data.
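For orientation, an illustrative layout of the prepared data, roughly following ReferFormer's conventions; the exact folder names here are assumptions, so defer to the ReferFormer repository:

```
data
├── coco              # RefCOCO/+/g images and referring annotations
│   ├── train2014
│   ├── refcoco
│   ├── refcoco+
│   └── refcocog
└── ref-youtube-vos   # Refer-YouTube videos and expressions
    ├── meta_expressions
    ├── train
    └── valid
```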
For the ModelScopeT2V-1.4B model, first train the spatial weights on RefCOCO/+/g:
```bash
bash train_REM_MS_imgs.sh # Change the arguments in the script accordingly.
```

Then train on Refer-YouTube:
```bash
bash train_REM_MS_vid.sh # Change the arguments in the script accordingly.
```

To save memory during training, we pre-compute the T5 text embeddings using `utils/encode_wantxt_T5.py`.
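A hypothetical invocation is sketched below; the actual command-line interface of `utils/encode_wantxt_T5.py` may differ, so check the script before running it:

```bash
# Hypothetical arguments for illustration only; inspect utils/encode_wantxt_T5.py
# for the real interface.
python utils/encode_wantxt_T5.py --prompt_file data/prompts.txt --out_dir data/t5_embeddings
```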
To train the Wan2.1-14B model, train jointly on Refer-YouTube and RefCOCO/+/g:
```bash
bash train_REM_Wan.sh # Change the arguments in the script accordingly.
```

## Evaluation

Please follow the instructions in Ref-DAVIS, Ref-YouTube, BURST, and VSPW-stuff.
