Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors by \textit{iteratively} refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is \textit{recurrently} shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues and thus hinder precise aggregation on the image or the video's first frame; also, non-first frames' queries are already sample-specific and thus require transforms different from the first frame's aggregation. We address these issues for the first time with our \textit{SmoothSA}: (1) to smooth SA iterations on the image or the video's first frame, we \textit{preheat} the cold-start queries with rich information from the input features, via a tiny module self-distilled inside OCL; (2) to smooth SA recurrences across all video frames, we \textit{differentiate} the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, object recognition and downstream benchmarks validate our method's effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences.
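The two ideas above can be sketched in a toy NumPy snippet. This is an illustration only, not the paper's implementation: `preheat_queries`, the mean pooling, the projection `W`, and all shapes are made up for the example; the paper's preheat module is a tiny self-distilled network.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, feats, eps=1e-8):
    # attention normalized over slots, so slots compete for features
    attn = softmax(feats @ slots.T, axis=1)            # (N, K)
    attn = attn / (attn.sum(axis=0, keepdims=True) + eps)
    return attn.T @ feats                              # weighted-mean update, (K, D)

def preheat_queries(feats, W, K):
    # toy "preheat": derive K sample-specific queries from pooled input features
    pooled = feats.mean(axis=0)                        # (D,)
    return np.tanh(pooled @ W).reshape(K, -1)          # (K, D)

rng = np.random.default_rng(0)
N, D, K = 16, 8, 3
feats = rng.normal(size=(N, D))
W = rng.normal(size=(D, K * D)) * 0.1

# first frame: preheated (not cold-start) queries, full iterations (typically 3)
slots = preheat_queries(feats, W, K)
for _ in range(3):
    slots = slot_attention_step(slots, feats)

# non-first frames: slots carried over are already sample-specific,
# so a single iteration is used instead of the full count
slots = slot_attention_step(slots, feats)
print(slots.shape)  # (3, 8)
```

The point of the sketch is the asymmetry: the first frame runs the full iteration count from preheated queries, while later frames reuse the previous slots and run one iteration.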
Official source code, model checkpoints and training logs for paper "Smoothing Slot Attention Iterations and Recurrences".
Object discovery accuracy (input resolution 256×256 (224×224); DINO2 ViT-S/14 is used for encoding): numbers are detailed in acc-v3.xlsx.
Object discovery visualization:
Object recognition accuracy: Numbers are detailed in acc-recogn-v3.xlsx.
⭐⭐⭐ Please check GitHub repo VQ-VFM-OCL. ⭐⭐⭐
- config-smoothsa/ # *** configs for our SmoothSA ***
- config-spot/ # configs for baseline SPOT
- object_centric_bench/
- datum/ # dataset loading and preprocessing
- model/ # model building
- ...
- smoothsa.py # *** for our SmoothSA model building ***
- randsfq.py # *** for our SmoothSA model building ***
- ...
- learn/ # metrics, optimizers and callbacks
- train.py
- eval.py
- requirements.txt
- archive-smoothsa/ # *** our SmoothSA model checkpoints and training logs ***
- archive-spot/ # baseline model checkpoints and training logs
- archive-recogn/ # object recognition models based on SmoothSA and SPOT

Datasets ClevrTex, COCO, VOC, MOVi-C, MOVi-D and YTVIS, which are converted into LMDB format and can be used off-the-shelf, are available as below.
- dataset-clevrtex: converted dataset ClevrTex.
- dataset-coco: converted dataset COCO.
- dataset-voc: converted dataset VOC.
- dataset-movi_c: converted dataset MOVi-C.
- dataset-movi_d: converted dataset MOVi-D.
- dataset-ytvis: converted dataset YTVIS, the high-quality version.
The checkpoints and training logs (@ random seeds 42, 43 and 44) for all models are available as releases. All backbones are unified as DINO2-S/14.
- archive-smoothsa: Our SmoothSA trained on datasets ClevrTex, COCO, VOC, MOVi-C/D and YTVIS.
- Model checkpoints and training logs of our own method.
- archive-spot: SPOT on ClevrTex, COCO and VOC.
- My implementation of paper SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers, CVPR 2024.
- For other image OCL baselines, SLATE, DINOSAUR, SlotDiffusion and DIAS, please check repos VQ-VFM-OCL and DIAS.
- For other video OCL baselines, VideoSAUR, SlotContrast and RandSF.Q, please check repo RandSF.Q.
- archive-recogn: Object recognition models based on our SmoothSA and baseline SPOT, trained on datasets COCO and YTVIS.
- Slots extracted by SmoothSA or SPOT are matched with ground-truth object segmentations under a matching threshold, and the matched slots are used to train category classification and bounding box regression.
- For other object recognition baselines, RandSF.Q and SlotContrast, please check repo RandSF.Q.
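The matching step above can be sketched roughly as follows. This is a hypothetical toy, not the repo's code: the repo only says slots are matched "by some threshold", and here that is illustrated with mask IoU; `match_slots` and its greedy pairing are assumptions for the example.

```python
import numpy as np

def iou(a, b):
    # a, b: boolean masks of the same shape
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def match_slots(slot_masks, gt_masks, thresh=0.5):
    # pair each ground-truth mask with its best-IoU slot, if above the threshold;
    # matched slots would then supervise classification / bbox regression heads
    matches = {}
    for g, gt in enumerate(gt_masks):
        ious = [iou(sm, gt) for sm in slot_masks]
        best = int(np.argmax(ious))
        if ious[best] >= thresh:
            matches[best] = g  # slot index -> ground-truth object index
    return matches

# toy 4x4 masks: slot 0 overlaps GT 0 exactly, slot 1 overlaps nothing
slot_masks = [np.zeros((4, 4), bool), np.zeros((4, 4), bool)]
slot_masks[0][:2, :2] = True
gt_masks = [np.zeros((4, 4), bool)]
gt_masks[0][:2, :2] = True
print(match_slots(slot_masks, gt_masks))  # {0: 0}
```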
Take SmoothSA on COCO as an example.
(1) Environment
To set up the environment, run:
# python 3.11
pip install -r requirements.txt

(2) Dataset
To prepare the dataset, download Converted Datasets and unzip them to path/to/your/dataset/. Or convert them yourself following the XxxDataset.convert_dataset() docs.
(3) Train
To train the model, run:
python train.py \
--seed 42 \
--cfg_file config-smoothsa/smoothsa_r-coco.py \
--data_dir path/to/your/dataset \
--save_dir save

(4) Evaluate
To evaluate the model, run:
python eval.py \
--cfg_file config-smoothsa/smoothsa_r-coco.py \
--data_dir path/to/your/dataset \
--ckpt_file archive-smoothsa/smoothsa_r-coco/best.pth \
--is_viz True \
--is_img True
# object discovery accuracy values will be printed in the terminal
# object discovery visualization will be saved to ./smoothsa_r-coco/

If you have any issues on this repo or cool ideas on OCL, please do not hesitate to contact me!
- page: https://genera1z.github.io
- email: [email protected], [email protected]
If you are applying OCL (not limited to this repo) to tasks like visual question answering, visual prediction/reasoning, world modeling and reinforcement learning, let us collaborate!
My further research works on OCL can be found in my repos or my academic page.
If you find this repo useful, please cite our work.
@article{zhao2025smoothsa,
title={{Smoothing Slot Attention Iterations and Recurrences}},
author={Zhao, Rongzhen and Yang, Wenyan and Kannala, Juho and Pajarinen, Joni},
journal={arXiv:2508.05417},
year={2025}
}




