CVPR 2025 | Project Website
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. CTRL-O introduces language-based control, enabling directed object extractions and multimodal applications, and achieves strong results on downstream tasks such as text-to-image generation and visual question answering.
Our code is based on the Object Centric Learning Framework (OCLF).
OCLF (Object Centric Learning Framework) is a framework designed to ease running experiments for object-centric learning research, yet it is not limited to this use case. At its heart lies the idea that, while code is not typically composable, many experiments in machine learning are very similar and differ only by minor changes.
One such example is multi-task training where a model might be trained to solve multiple tasks at the same time. Different ablations of said model would then contain different model components but largely remain the same.
OCLF allows for such ablations without creating duplicate code by defining models and experiments in configuration files and allowing their composition in configuration space via hydra.
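To illustrate what composing experiments in configuration space means, the following minimal sketch merges a base configuration with a small ablation override using plain Python dictionaries. It does not use hydra or OCLF itself; the helper `deep_merge` and every config key shown (`model`, `n_slots`, `conditioning`, `trainer`, `max_steps`) are hypothetical and chosen only for illustration.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` is recursively merged into `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested configs: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the base entry.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base experiment (keys are illustrative, not OCLF's schema).
base_experiment = {
    "model": {"n_slots": 7, "conditioning": "language"},
    "trainer": {"max_steps": 1000},
}

# An ablation only states what differs from the base experiment.
ablation = {"model": {"conditioning": "none"}}

ablated_experiment = deep_merge(base_experiment, ablation)
print(ablated_experiment)
```

The ablation is three words of configuration rather than a copy of the training script, which is the kind of reuse the paragraph above describes. Hydra provides a richer version of this idea (config groups, defaults lists, command-line overrides) on top of YAML files.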
Installing OCLF requires at least Python 3.8. Installation can be done using poetry. After installing poetry, check out the repo and set up a development environment:
git clone [email protected]:dido1998/CTRL-O.git
cd CTRL-O
# check poetry config: `poetry config --list`
# change venv location (default is project root /venv): `poetry config virtualenvs.path /your/custom/path`
poetry self update
pip install --upgrade pip
poetry install
This installs the ocl package and the CLI scripts used for running experiments in a poetry-managed virtual environment.
Next, we need to prepare a dataset. Either download one of the pre-curated datasets or follow the steps below to install the dependencies needed for dataset conversion and creation.
We provide pre-curated datasets for training CTRL-O.
- VG + COCO: https://huggingface.co/adidolkar123/visual_genome_coco
- VG: https://huggingface.co/adidolkar123/visual_genome/
To download these datasets use:
huggingface-cli download <dataset_name> --local-dir scripts/datasets/outputs/ --local-dir-use-symlinks False
We also provide scripts to create your own datasets:
For the COCO dataset:
cd scripts/datasets
poetry install
bash download_scripts/download_coco_data.sh
bash download_and_convert.sh COCO
This should create a webdataset in the path scripts/datasets/outputs/coco.
To run the experiments, the dataset needs to be exposed to OCLF:
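A quick way to confirm that the conversion produced something is to look for shard files in the output directory. The sketch below assumes the shards are stored as `.tar` archives, as is conventional for the webdataset format; the exact file layout produced by the conversion script is not specified here, and `list_shards` is a hypothetical helper, not part of OCLF.

```python
from pathlib import Path


def list_shards(dataset_dir: str) -> list:
    """Return all .tar files below `dataset_dir`, sorted by path."""
    root = Path(dataset_dir)
    if not root.is_dir():
        # Missing directory: conversion has not run or wrote elsewhere.
        return []
    # Assumption: webdataset shards are tar archives with a .tar suffix.
    return sorted(p for p in root.rglob("*.tar") if p.is_file())


if __name__ == "__main__":
    shards = list_shards("scripts/datasets/outputs/coco")
    print(f"found {len(shards)} shard(s)")
```

Run it from the repository root; an empty result suggests the download or conversion step did not complete.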
cd ../.. # Go back to root folder
export DATASET_PREFIX=scripts/datasets/outputs # Expose dataset path
The main model from the paper is trained on VG+COCO data. To launch an experiment for this training run, you can use:
poetry run ocl_train +experiment=projects/prompting/vg/prompt_vg_small14_dinov2_mapping_lang_point_pred_sep
This run should achieve a binding hits score of ~60%.
The output of the training run should be stored at outputs/projects/prompting/vg/prompt_vg_small14_dinov2_mapping_lang_point_pred_sep/<timestamp>.
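Because each run is written to its own `<timestamp>` subdirectory, it can be handy to locate the most recent one programmatically. The sketch below is a hypothetical helper (`latest_run_dir` is not part of OCLF) and assumes that the timestamp directory names sort lexicographically in chronological order, which holds for zero-padded formats such as `YYYY-MM-DD_HH-MM-SS` but is not confirmed for this project.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(experiment_dir: str) -> Optional[Path]:
    """Return the run subdirectory with the greatest name, or None if there is none."""
    root = Path(experiment_dir)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    # Assumption: directory names are zero-padded timestamps, so the
    # lexicographically largest name is the most recent run.
    return max(runs, key=lambda p: p.name) if runs else None


if __name__ == "__main__":
    experiment = (
        "outputs/projects/prompting/vg/"
        "prompt_vg_small14_dinov2_mapping_lang_point_pred_sep"
    )
    print(latest_run_dir(experiment))
```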
For a more detailed guide on how to install, setup, and use OCLF check out the Tutorial in the docs.
We also provide inference and visualization scripts for the pretrained model in ocl/cli/inference.py.
Before running the script, make sure to update the paths in it so that they point to the pretrained model checkpoint and to the images you want to use for inference.
poetry run python ocl/cli/inference.py
We provide a pretrained CTRL-O model on Hugging Face. You can download it using the following command:
huggingface-cli download adidolkar123/pretrained_coco_vgcoco --local-dir pretrained_models/ctrlo --local-dir-use-symlinks False
This will download the model checkpoint and configuration file into the pretrained_models/ctrlo directory. After downloading, please update the paths in ocl/cli/inference.py to point to the downloaded files.
If you use CTRL-O in your work, please cite it using the BibTeX entry below:
@inproceedings{didolkar2025ctrlo,
title={CTRL-O: Language-Controllable Object-Centric Visual Representation Learning},
author={Didolkar, Aniket Rajiv and Zadaianchuk, Andrii and Awal, Rabiul and Seitzer, Maximilian and Gavves, Efstratios and Agrawal, Aishwarya},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}
This project is a fork of the Object Centric Learning Framework (OCLF) by Max Horn, Maximilian Seitzer, Andrii Zadaianchuk, Zixu Zhao, Dominik Zietlow, Florian Wenzel, and Tianjun Xiao.
CTRL-O extends OCLF by introducing language-based control for object-centric representation learning, enabling specific object targeting and multimodal applications.
The original project is licensed under Apache-2.0.
