CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

CVPR 2025 | Project Website

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. CTRL-O introduces language-based control, enabling directed object extractions and multimodal applications, and achieves strong results on downstream tasks such as text-to-image generation and visual question answering.

Our code is based on the Object Centric Learning Framwork

Object Centric Learning Framework (OCLF)

What is OCLF?

OCLF (Object Centric Learning framework) is a framework designed to ease running experiments for object centric learning research, yet is not limited to this use case. At its heart lies the idea that while code is not typically composable many experiments in machine learning very similar with minor changes and only represent minor changes.

One such example is multi-task training where a model might be trained to solve multiple tasks at the same time. Different ablations of said model would then contain different model components but largely remain the same.

OCLF allows for such ablations without creating duplicate code by defining models and experiments in configuration files and allowing their composition in configuration space via hydra.

Quickstart - Development setup

Installing OCLF requires at least python3.8. Installation can be done using poetry. After installing poetry, check out the repo and setup a development environment:

git clone [email protected]:dido1998/CTRL-O.git
cd CTRL-O
# check poetry config: `poetry config --list`
# change venv location (default is project root /venv): `poetry config virtualenvs.path /your/custom/path`
poetry self update
pip install --upgrade pip
poetry install

This installs the ocl package and the cli scripts used for running experiments in a poetry managed virtual environment.

Next we need to prepare a dataset. For this follow the steps below to install the dependencies needed for dataset conversion and creation.

Datasets

We provide pre-curated datasets for training CTRL-O.

To download these datasets use:

huggingface-cli download <dataset_name> --local-dir scripts/datasets/outputs/ --local-dir-use-symlinks False

We also provide scripts to create your own datasets:

For the coco dataset

cd scripts/datasets
poetry install
bash download_scripts/download_coco_data.sh
bash download_and_convert.sh COCO

This should create a webdataset in the path scripts/datasets/outputs/coco.

To run the experiments, the dataset needs to be exposed to OCLF

cd ../..   # Go back to root folder
export DATASET_PREFIX=scripts/datasets/outputs  # Expose dataset path

Training

The main model from the paper is trained on VG+COCO data. To launch an experiment for this training run you can use:

poetry run ocl_train +experiment=projects/prompting/vg/prompt_vg_small14_dinov2_mapping_lang_point_pred_sep

This run should achieve a binding hits of ~60%.

The output of the training run should be stored at outputs/projects/prompting/vg/prompt_vg_small14_dinov2_mapping_lang_point_pred_sep/<timestamp>.

For a more detailed guide on how to install, setup, and use OCLF check out the Tutorial in the docs.

Inference and Visualization

We also provide inference and visualization scripts for the pretrained model in ocl/cli/inference.py

Before running the script, make sure to update the paths to the pretrained model checkpoint here and the images you want to use for inference here.

poetry run python ocl/cli/inference.py

Pretrained models

We provide a pretrained CTRL-O model on Hugging Face. You can download it using the following command:

huggingface-cli download adidolkar123/pretrained_coco_vgcoco --local-dir pretrained_models/ctrlo --local-dir-use-symlinks False

This will download the model checkpoint and configuration file into the pretrained_models/ctrlo directory. After downloading, please update the paths in ocl/cli/inference.py to point to the downloaded files.

Citation

If you use CTRL-O in your work please cite the bibtex entry below

@inproceedings{didolkar2025ctrlo,
    title={CTRL-O: Language-Controllable Object-Centric Visual Representation Learning},
    author={Didolkar, Aniket Rajiv and Zadaianchuk, Andrii and Awal, Rabiul and Seitzer, Maximilian and Gavves, Efstratios and Agrawal, Aishwarya},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
}

Acknowledgements

This project is a fork of the Object Centric Learning Framework (OCLF) by Max Horn, Maximilian Seitzer, Andrii Zadaianchuk, Zixu Zhao, Dominik Zietlow, Florian Wenzel, and Tianjun Xiao.

CTRL-O extends OCLF by introducing language-based control for object-centric representation learning, enabling specific object targeting and multimodal applications.

Original project is licensed under Apache-2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
configs		configs
containers		containers
docs		docs
example_images		example_images
images		images
ocl		ocl
routed		routed
scripts		scripts
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Object Centric Learning Framework (OCLF)

What is OCLF?

Quickstart - Development setup

Datasets

Training

Inference and Visualization

Pretrained models

Citation

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Object Centric Learning Framework (OCLF)

What is OCLF?

Quickstart - Development setup

Datasets

Training

Inference and Visualization

Pretrained models

Citation

Acknowledgements

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages