Official Implementation for the paper "S2D: Sparse-To-Dense Keymask Distillation For Unsupervised Video Instance Segmentation"

S2D: Sparse-To-Dense Keymask Distillation For Unsupervised Video Instance Segmentation

S2D is a simple unsupervised video instance segmentation (UVIS) method. Our approach is trained exclusively on real video data, without any human annotations and without relying on synthetic videos generated from ImageNet images.

S2D: Sparse-To-Dense Keymask Distillation For Unsupervised Video Instance Segmentation
Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski
Ulm University, Google, KAUST, TU Vienna

[arxiv]

Note: This repository is currently under development. We will provide detailed instructions for keymask discovery and model training soon.

Dataset Preparation

Follow the data preparation process from VideoCutLER.

Method Overview

S2D has three main stages:
  1. First, we predict single-frame masks using a SOTA unsupervised image instance segmentation model, CutS3D. Please find the weights here.
  2. We then perform Keymask Discovery to identify temporally-coherent, high-quality keymasks across the video.
  3. Finally, we perform Sparse-To-Dense Keymask Distillation to train a video instance segmentation model using the discovered keymasks. This is followed by another round of self-distillation.
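Detailed instructions for these stages are still to come, but the core idea behind stage 2 can be sketched already. Below is a minimal, illustrative Python sketch of keymask discovery: per-frame masks are greedily linked into tracks via IoU with the previous frame, and only temporally persistent tracks are kept. The function names, thresholds, and greedy matching strategy are our assumptions for illustration, not the actual S2D implementation.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union > 0 else 0.0

def discover_keymasks(frame_masks, iou_thresh=0.5, min_track_len=3):
    """Greedily link per-frame masks into tracks by IoU with the
    previous frame, then keep only tracks that persist for at least
    min_track_len consecutive frames.

    frame_masks: list over frames, each a list of boolean HxW masks
    (e.g. per-frame predictions from an image segmentation model).
    Returns: list of tracks, each a list of (frame_idx, mask)."""
    tracks = []
    for t, masks in enumerate(frame_masks):
        unmatched = list(range(len(masks)))
        for track in tracks:
            last_t, last_mask = track[-1]
            if last_t != t - 1:  # track already ended
                continue
            best, best_iou = None, iou_thresh
            for i in unmatched:
                iou = mask_iou(last_mask, masks[i])
                if iou > best_iou:
                    best, best_iou = i, iou
            if best is not None:
                track.append((t, masks[best]))
                unmatched.remove(best)
        for i in unmatched:  # start new tracks for leftover masks
            tracks.append([(t, masks[i])])
    return [tr for tr in tracks if len(tr) >= min_track_len]
```

In the actual method, the per-frame masks would come from the CutS3D predictions of stage 1, and the keymask quality criterion is richer than track length alone.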

Inference Demo for S2D with Pre-trained Models

We provide demo_video/demo.py, which can run demos of the built-in configs. Run it with:

cd model_training
python demo_video/demo.py \
  --config-file configs/imagenet_video/s2d_inference_kd_video_mask2former_R50_cls_agnostic.yaml \
  --input <your-video-path>/*.jpg \
  --confidence-threshold 0.8 \
  --output imgs/ \
  --opts MODEL.WEIGHTS s2d_zeroshot.pth

Our zero-shot S2D model, trained on a mixture of datasets (SA-V, MOSE, VIPSeg), can be obtained from here. Set MODEL.WEIGHTS to the downloaded checkpoint for evaluation. The command above runs inference, shows visualizations in an OpenCV window, and saves the results in mp4 format. For details on the command-line arguments, see demo.py -h or consult the source code. Some common arguments are:

  • To get a higher recall, use a smaller --confidence-threshold.
  • To save each frame's segmentation result, add --save-frames True before --opts.
  • To save each frame's segmentation masks, add --save-masks True before --opts.
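The effect of --confidence-threshold on recall can be made concrete with a small, illustrative helper (not part of the repository): recall over a set of ground-truth instances grows monotonically as the cutoff is lowered, at the cost of admitting lower-confidence masks.

```python
def recall_at_threshold(scores, is_true_positive, threshold):
    """Fraction of ground-truth instances recovered at a score cutoff.
    Illustrative simplification: assumes at most one prediction per
    ground-truth instance, so recall = kept true positives / all TPs."""
    kept_tps = sum(tp for s, tp in zip(scores, is_true_positive)
                   if s >= threshold)
    return kept_tps / sum(is_true_positive)

scores = [0.95, 0.85, 0.70, 0.40]   # per-prediction confidences
tps    = [1, 1, 1, 1]               # each matches a distinct GT instance
print(recall_at_threshold(scores, tps, 0.8))  # stricter cutoff, lower recall
print(recall_at_threshold(scores, tps, 0.5))  # looser cutoff, higher recall
```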

Unsupervised Zero-shot Evaluation

To evaluate a model's performance on various datasets, such as YouTubeVIS-2021, please refer to datasets/README.md for instructions on preparing the datasets. Next, download the model weights, specify "model_weights" and "config_file", set the path to "DETECTRON2_DATASETS", and run the following commands.

export DETECTRON2_DATASETS=/PATH/TO/DETECTRON2_DATASETS/
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_net_video.py --num-gpus 4 \
  --config-file configs/imagenet_video/s2d_inference_kd_video_mask2former_R50_cls_agnostic.yaml \
  --eval-only MODEL.WEIGHTS s2d_zeroshot.pth \
  OUTPUT_DIR OUTPUT-DIR/ytvis_2021 DATASETS.TEST '("ytvis_2021_val",)'
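For background, YouTube-VIS-style evaluation scores predicted tracks against ground truth with a spatio-temporal IoU: pixel intersection and union are accumulated over all frames of a track before dividing. A minimal sketch (the function name and interface are ours for illustration, not from the codebase):

```python
import numpy as np

def video_mask_iou(pred_track, gt_track):
    """Spatio-temporal IoU between two mask tracks, as used in
    YouTube-VIS-style evaluation: intersection and union are summed
    over all frames before dividing, so a track that misses the
    object in some frames is penalized through the union.

    pred_track, gt_track: lists of boolean HxW masks, one per frame."""
    inter = sum(np.logical_and(p, g).sum()
                for p, g in zip(pred_track, gt_track))
    union = sum(np.logical_or(p, g).sum()
                for p, g in zip(pred_track, gt_track))
    return float(inter / union) if union > 0 else 0.0
```

Because the division happens once per track rather than per frame, temporal consistency matters: a prediction that is perfect in half the frames and empty in the rest caps out at 0.5 IoU.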

Instructions for Keymask Discovery and Model Training

We will provide detailed instructions soon. ToDos:

  • Provide installation instructions
  • Write instructions for Single-Frame Mask Prediction with CutS3D
  • Write instructions for Keymask Discovery
  • Write instructions for Sparse-To-Dense Keymask Distillation

Acknowledgements

Our code is in large parts based on the VideoCutLER implementation. Thank you to the authors for releasing their code.

How to get support from us?

If you have any general questions, feel free to email Leon Sick. For code- or implementation-related questions, you are also welcome to reach out by email.

Citation

If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.

@article{sick2025s2d,
  title={S2D: Sparse-To-Dense Keymask Distillation For Unsupervised Video Instance Segmentation},
  author={Sick, Leon and Hoyer, Lukas and Engel, Dominik and Hermosilla, Pedro and Ropinski, Timo},
  journal={arXiv preprint arXiv:2512.14440},
  year={2025}
}
