Genera1Z/VQ-VFM-OCL
VVO: Vector-Quantized Vision Foundation Models for Object-Centric Learning



⚗️ (2026/01/06) Update !!!

Please check our brand new OCL works:

  • RandSF.Q: significantly surpasses state-of-the-art video OCL, e.g., SlotContrast, by up to 10 points!
  • SmoothSA: improves the state of the art even further, e.g., SPOT / DIAS (images) and SlotContrast / RandSF.Q (videos), with minimal modifications!




Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed *slots*. Its self-supervision of reconstructing the input from slots struggles with complex object textures, so Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simple: shared quantization of VFM representations in OCL aggregation and decoding. Experiments show that across different VFMs, aggregators and decoders, VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. We also mathematically analyze why VFM representations facilitate OCL aggregation and why their shared quantization as reconstruction targets strengthens OCL supervision.
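The shared quantization at the core of VVO can be pictured as a nearest-neighbor lookup into a codebook: the same quantized VFM features are reused in aggregation and as the reconstruction target. Below is a minimal NumPy sketch of this lookup; the codebook size and feature dimensions are arbitrary illustrations, not the paper's actual implementation.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (L2 distance).

    features: (N, D) array of VFM token features.
    codebook: (K, D) array of code vectors.
    Returns (quantized features, code indices).
    """
    # Pairwise squared distances between tokens and codes: (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)        # nearest code per token
    return codebook[idx], idx

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # 16 tokens, 8-dim features
codes = rng.normal(size=(4, 8))    # codebook with 4 entries
quantized, idx = vector_quantize(feats, codes)
```

Each token is replaced by its closest code, so both the aggregation input and the reconstruction target are drawn from the same discrete vocabulary.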

🎉 Accepted to ACM MM 2025 as a Poster

Official source code, model checkpoints and training logs for paper "Vector-Quantized Vision Foundation Models for Object-Centric Learning".

Supported OCL methods, categorized by their OCL decoding, are listed in Model Checkpoints & Training Logs below.

🏆 Performance

(1) ⭐⭐⭐ Re-evaluated Performance Values @ Version 3 ⭐⭐⭐

These are detailed in acc-v3.xlsx (encoding with backbone DINO2-S/14 at resolutions 256x256/224 and 384x384/336).

| model-dataset | ARI | ARI-fg | mBO | mIoU |
| --- | --- | --- | --- | --- |
| slate_r_vqvae-clevrtex | 17.4±2.9 | 87.4±1.7 | 44.5±2.2 | 43.3±2.4 |
| slate_r_vqvae-coco | 20.5±0.6 | 28.8±0.3 | 27.4±0.3 | 26.1±0.3 |
| slate_r_vqvae-voc | 22.4±0.2 | 26.3±0.8 | 37.8±0.4 | 36.6±0.3 |
| steve_c_vqvae-movi_d | 32.7±0.2 | 66.5±0.2 | 23.0±0.3 | 21.2±0.3 |
| vqdino_tfd_r-clevrtex | 55.4±18.2 | 85.9±0.7 | 54.4±2.4 | 53.6±2.5 |
| vqdino_tfd_r-coco | 24.4±2.0 | 31.5±1.1 | 30.2±0.6 | 28.8±0.8 |
| vqdino_tfd_r-voc | 26.9±0.9 | 26.9±1.4 | 40.5±0.3 | 39.5±0.4 |
| vqdino_tfdt_c-movi_d | 36.7±0.4 | 72.5±3.7 | 26.1±1.2 | 24.6±1.3 |
| dinosaur_r-clevrtex | 49.6±23.8 | 89.4±0.3 | 52.1±5.4 | 51.7±5.5 |
| dinosaur_r-coco | 21.1±1.0 | 37.0±1.2 | 28.7±0.5 | 27.3±0.5 |
| dinosaur_r-voc | 25.7±0.6 | 36.3±1.4 | 41.2±0.6 | 40.2±0.6 |
| vqdino_mlp_r-clevrtex | 64.8±0.3 | 88.8±0.6 | 56.1±0.4 | 55.7±0.5 |
| vqdino_mlp_r-coco | 22.1±0.4 | 36.0±0.6 | 29.1±0.3 | 27.8±0.3 |
| vqdino_mlp_r-voc | 25.9±0.9 | 35.6±0.9 | 41.5±0.2 | 40.6±0.2 |
| slotdiffusion_r_vqvae-clevrtex | 66.1±1.3 | 82.7±1.6 | 54.3±0.5 | 53.4±0.8 |
| slotdiffusion_r_vqvae-coco | 20.6±0.6 | 29.0±0.1 | 27.5±0.4 | 26.1±0.4 |
| slotdiffusion_r_vqvae-voc | 20.3±1.4 | 21.6±1.9 | 35.6±1.0 | 34.4±1.1 |
| vqdino_dfz_r-clevrtex | 72.2±0.2 | 81.9±2.0 | 57.6±0.7 | 56.8±0.8 |
| vqdino_dfz_r-coco | 21.2±0.3 | 28.7±1.0 | 27.7±0.1 | 26.4±0.1 |
| vqdino_dfz_r-voc | 22.6±0.5 | 24.4±0.3 | 37.3±0.1 | 36.3±0.2 |
| slate_r_vqvae-coco-r384 | 43.4±1.0 | 34.1±0.3 | 28.1±0.4 | 26.6±0.4 |
| vqdino_tfd_r-coco-r384 | 46.2±0.8 | 37.5±1.1 | 30.3±0.4 | 28.7±0.5 |
| dinosaur_r-coco-r384 | 46.8±0.1 | 42.3±0.6 | 30.4±0.1 | 29.0±0.1 |
| vqdino_mlp_r-coco-r384 | 46.5±0.7 | 42.6±0.5 | 30.3±0.2 | 29.0±0.2 |
| slotdiffusion_r_vqvae-coco-r384 | 43.4±0.5 | 34.5±0.4 | 28.3±0.2 | 26.8±0.2 |
| vqdino_dfz_r-coco-r384 | 45.3±1.2 | 34.3±0.4 | 29.0±0.7 | 27.5±0.7 |
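For reference, the ARI columns above measure clustering agreement between predicted and ground-truth segmentation masks, and are invariant to how cluster ids are numbered. A minimal example with scikit-learn on toy labels (not benchmark data):

```python
from sklearn.metrics import adjusted_rand_score

# Toy per-pixel cluster labels for a 6-pixel "image"
gt   = [0, 0, 1, 1, 2, 2]   # ground-truth object ids
pred = [1, 1, 0, 0, 2, 2]   # same grouping under permuted ids

score = adjusted_rand_score(gt, pred)
print(score)  # identical partitions up to relabeling → 1.0
```

ARI-fg is the same metric computed on foreground pixels only, which is why it can diverge sharply from full-image ARI on datasets with large backgrounds.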

(2) Old Performance Values

Object discovery performance with DINO2 ViT (S/14) for OCL encoding. VVO is instantiated as VQDINO; Tfd, TfdT, Mlp and Dfz denote Transformer, temporal Transformer, MLP and Diffusion OCL decoding, respectively.
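As a rough illustration of the Mlp decoding route, DINOSAUR-style decoders broadcast each slot to every token position, let an MLP predict a feature plus an alpha logit per position, and mix slots by softmax over the alphas. The sketch below follows that pattern; all layer sizes are assumptions, not the repo's actual configuration.

```python
import torch

class BroadcastMLPDecoder(torch.nn.Module):
    """Sketch of a spatial-broadcast MLP decoder for OCL (sizes illustrative)."""

    def __init__(self, slot_dim=64, feat_dim=384, num_tokens=256):
        super().__init__()
        # Learned positional embedding, one vector per token position
        self.pos = torch.nn.Parameter(torch.zeros(1, 1, num_tokens, slot_dim))
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(slot_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, feat_dim + 1))  # feature + alpha logit

    def forward(self, slots):                 # slots: (B, S, slot_dim)
        x = slots[:, :, None, :] + self.pos   # broadcast: (B, S, T, slot_dim)
        out = self.mlp(x)                     # (B, S, T, feat_dim + 1)
        feats, alpha = out[..., :-1], out[..., -1:]
        weights = alpha.softmax(dim=1)        # slots compete per position
        return (weights * feats).sum(dim=1)   # (B, T, feat_dim)

dec = BroadcastMLPDecoder()
recon = dec(torch.zeros(2, 6, 64))            # 2 samples, 6 slots each
```

The decoder's output is compared against the (quantized) VFM feature map, which is the reconstruction target in VVO.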

Using higher resolution.

Qualitative results.

🌟 Highlights

  • fp16 fast training: Automatic mixed precision (fp32+fp16) training is enabled. Most training runs finish in under 4 hours for image OCL, or 8 hours for video OCL, on a single V100 GPU.

  • less I/O overhead: Datasets are stored in LMDB format to reduce I/O overhead, which is especially beneficial on computing clusters.

  • config-driven experiments: This is a fully config-driven framework, largely inspired by OpenMMLab, but with much less encapsulation.

  • strong baselines: All models requiring a VAE are implemented with TinyVAE, a pretrained Stable Diffusion VAE; all models are trained with strong data augmentations; all models employ the vision foundation model DINO2 as their backbone.
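The fp16 training above follows PyTorch's standard automatic-mixed-precision pattern; a generic sketch of one training step (the linear model and MSE loss are placeholders, not the repo's actual training loop):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)        # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

def train_step(x, target):
    opt.zero_grad()
    # Forward pass runs in mixed precision when CUDA is available
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
    scaler.step(opt)               # unscales gradients, then optimizer step
    scaler.update()                # adapt the scale factor for the next step
    return loss.item()

loss = train_step(torch.randn(4, 8, device=device),
                  torch.randn(4, 8, device=device))
```

With `enabled=False` both the autocast context and the scaler degrade to plain fp32 no-ops, so the same loop runs unchanged on CPU.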

🚑️ Changelogs

  • [2026/01/06] Unify interfaces to RandSF.Q and SmoothSA, which are our brand new SotA methods!
  • [2025/11/07] Fix lmdb multiprocessing issues due to torch>=3.7.
  • ⭐⭐⭐ [2025/10/20] ⭐⭐⭐ Object discovery accuracy values are updated for version 3. Check this table file acc-v3.xlsx for details.
  • [2025/10/19] Version 3: re-implement segmentation evaluation; corresponding new dataset lmdb files are uploaded. Thus, object discovery acc could change a little, especially ARI values.

🧭 Repo Structure

Source code.

- config-slatesteve/    # configs for SLATE and STEVE
- config-dinosaur/      # configs for DINOSAUR
- config-slotdiffusion/ # configs for SlotDiffusion
- config-vqdino/        # *** configs for our VQDINO ***
- object_centric_bench/
  - datum/              # dataset loading and preprocessing
  - model/              # model building
    - ...
    - vaez.py           # *** for vector-quantization ***
    - vqvfmocl.py       # *** for our VVO model building ***
    - ...
  - learn/              # metrics, optimizers and callbacks
- train.py
- eval.py
- requirements.txt

Releases.

- dataset-clevrtex/     # dataset files in LMDB format
- dataset-coco/
- dataset-voc/
- dataset-movi_d/
- slatesteve/           # baseline model checkpoints and training logs
- dinosaur/
- slotdiffusion/
- vqdino_tfd/           # our VQDINO-Tfd models and logs
- vqdino_mlp/           # our VQDINO-Mlp models and logs
- vqdino_dfz/           # our VQDINO-Dfz models and logs
- r384/                 # models and logs trained at resolution 384x384

🚀 Converted Datasets

The ClevrTex, COCO, VOC and MOVi-D datasets, converted into LMDB format and usable off the shelf, are available as releases.

🧠 Model Checkpoints & Training Logs

The checkpoints and training logs (@ random seeds 42, 43 and 44) for all models are available as releases. All backbones are unified as DINO2-S/14.

  • slatesteve: SLATE on ClevrTex, COCO and VOC; STEVE on MOVi-D.
    • My implementation of paper Illiterate DALL-E Learns to Compose, ICLR 2022, achieving much better performance.
    • My implementation of paper Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos, NeurIPS 2022, achieving much better performance.
  • vqdino_tfd: VQDINO-Tfd on ClevrTex, COCO and VOC; VQDINO-TfdT on MOVi-D.
    • Our VVO's counterparts to SLATE and STEVE.
  • dinosaur: DINOSAUR on ClevrTex, COCO and VOC.
    • My implementation of paper Bridging the Gap to Real-World Object-Centric Learning, ICLR 2023, replacing its DINO2-B/14 with DINO2-S/14.
  • vqdino_mlp: VQDINO-Mlp on ClevrTex, COCO and VOC.
    • Our VVO's counterpart to DINOSAUR.
  • slotdiffusion: SlotDiffusion on ClevrTex, COCO and VOC.
    • My implementation of paper SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models, NeurIPS 2023, replacing its DINO1-B/8 with DINO2-S/14.
  • vqdino_dfz: VQDINO-Dfz on ClevrTex, COCO and VOC.
    • Our VVO's counterpart to SlotDiffusion.
  • vqdino-r384: VQDINO-Tfd/Mlp/Dfz on COCO with resolution 384x384 (336x336).
    • Our VVO's counterparts to the baselines trained at higher resolution.

🔥 How to Use

Take VQDINO-Tfd on COCO as an example.

(1) Environment

To set up the environment, run:

# python 3.11
pip install -r requirements.txt

(2) Dataset

To prepare the dataset, download Converted Datasets and unzip to path/to/your/dataset/. Or convert them by yourself according to XxxDataset.convert_dataset() docs.

(3) Train

To train the model, run:

# 1. pretrain the VQDINO VAE model
python train.py \
    --seed 42 \
    --cfg_file config-vqdino/vqdino-coco-c256.py \
    --data_dir path/to/your/dataset \
    --save_dir save

# *. place the best VAE checkpoint at archive-vqdino/vqdino-coco-c256/best.pth
mv save archive-vqdino

# 2. train the VQDINO OCL model
python train.py \
    --seed 42 \
    --cfg_file config-vqdino/vqdino_tfd_r-coco.py \
    --data_dir path/to/your/dataset \
    --save_dir save \
    --ckpt_file archive-vqdino/vqdino-coco-c256/best.pth

(4) Evaluate

To evaluate the model, run:

python eval.py \
    --cfg_file config-vqdino/vqdino_tfd_r-coco.py \
    --data_dir path/to/your/dataset \
    --ckpt_file archive-vqdino/vqdino_tfd_r-coco/best.pth \
    --is_viz True \
    --is_img True
# object discovery accuracy values will be printed in the terminal
# object discovery visualization will be saved to ./vqdino_tfd_r-coco/

💡 Tips

  1. Any config file can be converted into typical Python code by changing

     model = dict(type=ClassName, key1=value1, ...)

     to

     model = ClassName(key1=value1, ...)

  2. All config files follow a similar structure; you can use the file comparator Meld with the VSCode plugin Meld Diff to check their differences.
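Mechanically, such `dict(type=ClassName, ...)` configs are usually materialized by a small builder that pops the `type` key and calls it with the remaining entries. The `build` helper below is an illustrative sketch of that pattern, not necessarily the repo's actual builder:

```python
def build(cfg):
    """Instantiate a dict-style config: dict(type=Cls, k=v, ...) -> Cls(k=v, ...)."""
    cfg = dict(cfg)          # copy so the original config stays untouched
    cls = cfg.pop("type")
    # Recursively build nested dict-configs that themselves carry a 'type'
    kwargs = {k: build(v) if isinstance(v, dict) and "type" in v else v
              for k, v in cfg.items()}
    return cls(**kwargs)

class Linear:
    def __init__(self, in_dim, out_dim):
        self.in_dim, self.out_dim = in_dim, out_dim

model_cfg = dict(type=Linear, in_dim=8, out_dim=4)
model = build(model_cfg)
```

Because configs are plain Python dicts holding real classes, swapping a component means editing one `type=` entry, with no registry strings to resolve.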

📝 TODO

  • ⬜ SPOT & VVO-Tfd9: To be integrated into this framework;
  • ⬜ VideoSAUR & VVO-SmdT: To be integrated into this framework.

🤗 Contact & Support

If you have any issues on this repo or cool ideas on OCL, please do not hesitate to contact me!

If you are applying OCL (not limited to this repo) to tasks like visual question answering, visual prediction/reasoning, world modeling and reinforcement learning, let us collaborate!

📚 Citation

If you find this repo useful, please cite our work.

@inproceedings{zhao2025vvo,
  title={{Vector-Quantized Vision Foundation Models for Object-Centric Learning}},
  author={Zhao, Rongzhen and Wang, Vivienne and Kannala, Juho and Pajarinen, Joni},
  booktitle={ACM MM},
  year={2025}
}
