S2D: Sparse-To-Dense Keymask Distillation For Unsupervised Video Instance Segmentation
Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski
Ulm University, Google, KAUST, TU Vienna
[arxiv]
S2D is a simple unsupervised video instance segmentation (UVIS) method. Our approach is trained exclusively on real video data without any human annotations, avoiding synthetic videos generated from ImageNet images entirely.
Note: This repository is currently under development. We will provide detailed instructions for keymask discovery and model training soon.
Follow the data preparation process from VideoCutLER.
S2D has three main stages (a minimal illustrative sketch of the keymask-discovery step follows the list):
- First, we predict single-frame masks using a SOTA unsupervised image instance segmentation model, CutS3D. Please find the weights here.
- We then perform Keymask Discovery to identify temporally-coherent, high-quality keymasks across the video.
- Finally, we perform Sparse-To-Dense Keymask Distillation to train a video instance segmentation model using the discovered keymasks. This is followed by another round of self-distillation.
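The following is a minimal, hypothetical Python sketch of the Keymask Discovery idea: per-frame masks (e.g., from CutS3D) are linked across time by mask IoU, and only temporally coherent tracks are kept as keymask candidates. The greedy matching, the IoU threshold, and the minimum track length are illustrative assumptions, not the exact criteria used by S2D.
import numpy as np

def mask_iou(a, b):
    # IoU between two binary masks of the same shape.
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def discover_keymasks(frame_masks, iou_thresh=0.5, min_track_len=3):
    # frame_masks: list over frames; each entry is a list of binary HxW numpy arrays
    # (e.g., per-frame CutS3D predictions). Returns tracks of (frame_idx, mask) pairs.
    tracks = []
    for t, masks in enumerate(frame_masks):
        extended = set()  # tracks already extended in this frame
        for m in masks:
            best_idx, best_iou = None, iou_thresh
            for i, track in enumerate(tracks):
                last_t, last_m = track[-1]
                if last_t == t - 1 and i not in extended:
                    iou = mask_iou(last_m, m)
                    if iou > best_iou:
                        best_idx, best_iou = i, iou
            if best_idx is not None:
                tracks[best_idx].append((t, m))
                extended.add(best_idx)
            else:
                tracks.append([(t, m)])  # start a new track
    # Keep only temporally coherent tracks as keymask candidates.
    return [tr for tr in tracks if len(tr) >= min_track_len]
In S2D, the discovered keymasks then serve as sparse pseudo-labels for the Sparse-To-Dense Keymask Distillation stage.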
We provide demo_video/demo.py, which can run demos with the built-in configs. Run it with:
cd model_training
python demo_video/demo.py \
--config-file configs/imagenet_video/s2d_inference_kd_video_mask2former_R50_cls_agnostic.yaml \
--input <your-video-path>/*.jpg \
--confidence-threshold 0.8 \
--output imgs/ \
--opts MODEL.WEIGHTS s2d_zeroshot.pth
Our zero-shot S2D model, trained on a mixture of datasets (SA-V, MOSE, VIPSeg), can be obtained from here. Set MODEL.WEIGHTS to the path of this checkpoint for inference and evaluation.
The above command runs inference, shows the visualizations in an OpenCV window, and saves the results in mp4 format.
For details of the command-line arguments, see demo.py -h or look at its source code to understand its behavior. Some common arguments are (an example command variant follows the list):
- To get a higher recall, use a smaller --confidence-threshold.
- To save each frame's segmentation result, add --save-frames True before --opts.
- To save each frame's segmentation masks, add --save-masks True before --opts.
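For example, the demo command above can be adapted as follows to lower the threshold for higher recall and to save per-frame results and masks; the threshold value of 0.5 is only an illustrative choice, and both save flags must appear before --opts:
cd model_training
python demo_video/demo.py \
--config-file configs/imagenet_video/s2d_inference_kd_video_mask2former_R50_cls_agnostic.yaml \
--input <your-video-path>/*.jpg \
--confidence-threshold 0.5 \
--save-frames True \
--save-masks True \
--output imgs/ \
--opts MODEL.WEIGHTS s2d_zeroshot.pth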
To evaluate a model's performance on various datasets, such as YouTubeVIS-2021, please refer to datasets/README.md for instructions on preparing the datasets. Next, download the model weights, specify the model weights and config file in the command below, set the path to DETECTRON2_DATASETS, and then run the following commands.
export DETECTRON2_DATASETS=/PATH/TO/DETECTRON2_DATASETS/
CUDA_VISIBLE_DEVICES=0 python train_net_video.py --num-gpus 1 \
--config-file configs/imagenet_video/s2d_inference_kd_video_mask2former_R50_cls_agnostic.yaml \
--eval-only MODEL.WEIGHTS s2d_zeroshot.pth \
OUTPUT_DIR OUTPUT-DIR/ytvis_2021 DATASETS.TEST '("ytvis_2021_val",)'
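For orientation, Mask2Former-style video codebases usually expect a YouTubeVIS-2021 layout under DETECTRON2_DATASETS roughly like the sketch below; the exact file and folder names here are an assumption, and datasets/README.md remains the authoritative reference:
DETECTRON2_DATASETS/
  ytvis_2021/
    train.json
    valid.json
    train/
      JPEGImages/
    valid/
      JPEGImages/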
We will provide detailed instructions soon. ToDos:
- Provide installation instructions
- Write instructions for Single-Frame Mask Prediction with CutS3D
- Write instructions for Keymask Discovery
- Write instructions for Sparse-To-Dense Keymask Distillation
Our code is largely based on the VideoCutLER implementation. Thank you to the authors for releasing their code.
If you have any general questions, feel free to email Leon Sick. For code or implementation-related questions, please also feel free to reach out to us by email.
If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.
@article{sick2025s2d,
  title={S2D: Sparse-To-Dense Keymask Distillation For Unsupervised Video Instance Segmentation},
  author={Sick, Leon and Hoyer, Lukas and Engel, Dominik and Hermosilla, Pedro and Ropinski, Timo},
  journal={arXiv preprint arXiv:2512.14440},
  year={2025}
}
