
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

CVPR 2025 | Project Website

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. CTRL-O introduces language-based control, enabling directed object extractions and multimodal applications, and achieves strong results on downstream tasks such as text-to-image generation and visual question answering.

CTRL-O Demo

Our code is based on the Object Centric Learning Framework.

Object Centric Learning Framework (OCLF)


What is OCLF?

OCLF (Object Centric Learning Framework) is a framework designed to ease running experiments for object-centric learning research, yet it is not limited to this use case. At its heart lies the idea that, while code is typically not composable, many experiments in machine learning are very similar and differ only in minor changes.

One such example is multi-task training where a model might be trained to solve multiple tasks at the same time. Different ablations of said model would then contain different model components but largely remain the same.

OCLF allows for such ablations without creating duplicate code by defining models and experiments in configuration files and allowing their composition in configuration space via hydra.
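The composition idea can be illustrated with a small, self-contained sketch of hydra-style dotted overrides. This is plain Python, not OCLF or hydra code, and the config keys (model.encoder, num_slots, dataset) are hypothetical; override values are kept as strings for simplicity:

```python
import copy

def apply_overrides(config, overrides):
    """Return a copy of config with dotted 'a.b=value' overrides applied."""
    config = copy.deepcopy(config)
    for override in overrides:
        path, value = override.split("=", 1)
        *parents, leaf = path.split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})  # create nested sections on demand
        node[leaf] = value
    return config

# One base experiment, two ablations sharing the same definition.
base = {"model": {"encoder": "vit_small", "num_slots": 7}, "dataset": "coco"}
ablation = apply_overrides(base, ["model.encoder=dinov2", "dataset=vg_coco"])
print(ablation["model"]["encoder"])  # dinov2
```

In hydra proper, the same overrides are passed on the command line (as in the training command below), and the base config is selected with +experiment=... rather than defined inline.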

Quickstart - Development setup

Installing OCLF requires at least Python 3.8. Installation can be done using poetry. After installing poetry, check out the repo and set up a development environment:

git clone git@github.com:dido1998/CTRL-O.git
cd CTRL-O
# check poetry config: `poetry config --list`
# change venv location (default is project root /venv): `poetry config virtualenvs.path /your/custom/path`
poetry self update
pip install --upgrade pip
poetry install

This installs the ocl package and the CLI scripts used for running experiments in a poetry-managed virtual environment.

Next, we need to prepare a dataset. Follow the steps below to install the dependencies needed for dataset conversion and creation.

Datasets

We provide pre-curated datasets for training CTRL-O.

  1. VG + COCO: https://huggingface.co/adidolkar123/visual_genome_coco
  2. VG: https://huggingface.co/adidolkar123/visual_genome/

To download these datasets, use:

huggingface-cli download <dataset_name> --local-dir scripts/datasets/outputs/ --local-dir-use-symlinks False

For example, for the VG + COCO dataset:

huggingface-cli download adidolkar123/visual_genome_coco --local-dir scripts/datasets/outputs/ --local-dir-use-symlinks False

We also provide scripts to create your own datasets:

For the COCO dataset:

cd scripts/datasets
poetry install
bash download_scripts/download_coco_data.sh
bash download_and_convert.sh COCO

This should create a webdataset in the path scripts/datasets/outputs/coco.

To run the experiments, the dataset needs to be exposed to OCLF:

cd ../..   # Go back to root folder
export DATASET_PREFIX=scripts/datasets/outputs  # Expose dataset path
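Before launching a long training run, it can help to sanity-check that DATASET_PREFIX is set and points at an existing directory. This is a stand-alone sketch, not part of OCLF; only the variable name comes from the step above:

```python
import os
from pathlib import Path

def check_dataset_prefix(env=os.environ) -> Path:
    """Return the dataset root from DATASET_PREFIX, or raise with a clear message."""
    prefix = env.get("DATASET_PREFIX")
    if not prefix:
        raise RuntimeError("DATASET_PREFIX is not set; run `export DATASET_PREFIX=...` first")
    root = Path(prefix)
    if not root.is_dir():
        raise RuntimeError(f"DATASET_PREFIX points at a missing directory: {root}")
    return root
```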

Training

The main model from the paper is trained on VG+COCO data. To launch an experiment for this training run you can use:

poetry run ocl_train +experiment=projects/prompting/vg/prompt_vg_small14_dinov2_mapping_lang_point_pred_sep

This run should achieve a binding-hits score of ~60%.

The output of the training run should be stored at outputs/projects/prompting/vg/prompt_vg_small14_dinov2_mapping_lang_point_pred_sep/<timestamp>.
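Since each run is stored under its own <timestamp> subdirectory, a small helper can locate the most recent run, e.g. to load its checkpoint. This is a sketch, not part of OCLF, and it assumes the timestamp directory names sort chronologically (zero-padded names such as 2025-01-01_12-00-00 do):

```python
from pathlib import Path

def latest_run(experiment_dir: str) -> Path:
    """Return the most recent <timestamp> subdirectory of a training output dir."""
    runs = sorted(p for p in Path(experiment_dir).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no runs found under {experiment_dir}")
    # Lexicographically last == newest, assuming zero-padded timestamp names.
    return runs[-1]
```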

For a more detailed guide on how to install, setup, and use OCLF check out the Tutorial in the docs.

Inference and Visualization

We also provide inference and visualization scripts for the pretrained model in ocl/cli/inference.py.

Before running the script, make sure to update the paths in ocl/cli/inference.py to point to the pretrained model checkpoint and to the images you want to use for inference.

poetry run python ocl/cli/inference.py

Pretrained models

We provide a pretrained CTRL-O model on Hugging Face. You can download it using the following command:

huggingface-cli download adidolkar123/pretrained_coco_vgcoco --local-dir pretrained_models/ctrlo --local-dir-use-symlinks False

This will download the model checkpoint and configuration file into the pretrained_models/ctrlo directory. After downloading, please update the paths in ocl/cli/inference.py to point to the downloaded files.

Citation

If you use CTRL-O in your work, please cite the BibTeX entry below:

@inproceedings{didolkar2025ctrlo,
    title={CTRL-O: Language-Controllable Object-Centric Visual Representation Learning},
    author={Didolkar, Aniket Rajiv and Zadaianchuk, Andrii and Awal, Rabiul and Seitzer, Maximilian and Gavves, Efstratios and Agrawal, Aishwarya},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
}

Acknowledgements

This project is a fork of the Object Centric Learning Framework (OCLF) by Max Horn, Maximilian Seitzer, Andrii Zadaianchuk, Zixu Zhao, Dominik Zietlow, Florian Wenzel, and Tianjun Xiao.

CTRL-O extends OCLF by introducing language-based control for object-centric representation learning, enabling specific object targeting and multimodal applications.

The original project is licensed under Apache-2.0.
