Reconstructing Animals and the Wild (RAW)
Peter Kulits, Michael J. Black, Silvia Zuffi
We train an LLM to decode a frozen CLIP embedding of a natural image into a structured compositional scene representation encompassing both animals and their habitats.

Data can be found at https://raw.is.tue.mpg.de/download.php after registering on the project page. The environment can be configured with conda/micromamba from environment.yml or using the Dockerfile.
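
For example (a sketch: the activation name and image tag below are assumptions, not taken from the repository):

# Create the environment from environment.yml (conda shown; for micromamba: micromamba create -f environment.yml)
conda env create -f environment.yml
conda activate <name-declared-in-environment.yml>

# Or build a container image from the Dockerfile ("raw" is an arbitrary tag)
docker build -t raw .
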
After the data has been downloaded, training can be initiated with the following (X is a placeholder for the per-device batch size, and Y for a run name of your choice):
python train.py \
--images_tar data/train.tar \
--data_path data/train.gz.feather \
--images_val_tar data/val.tar \
--data_path_val data/val.gz.feather \
--per_device_train_batch_size X \
--output_dir ./checkpoints/RAW-Y \
--max_steps 40000 \
--image_aspect_ratio pad
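
The flag names (--per_device_train_batch_size, --max_steps, --output_dir) follow the Hugging Face transformers TrainingArguments convention. If train.py forwards other TrainingArguments as well (an assumption; this README does not confirm it), memory-constrained GPUs could trade batch size for gradient accumulation, e.g.:

# Assumption: train.py also accepts --gradient_accumulation_steps (a standard
# transformers TrainingArguments flag). 4 x 8 keeps an effective batch of 32.
python train.py \
--images_tar data/train.tar \
--data_path data/train.gz.feather \
--images_val_tar data/val.tar \
--data_path_val data/val.gz.feather \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8 \
--output_dir ./checkpoints/RAW-Y \
--max_steps 40000 \
--image_aspect_ratio pad
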
Inference can then be run on the validation set with:

python inference.py \
--model-path ./checkpoints/RAW-Y \
--images_tar data/val.tar \
--out_path ./out/RAW-Y.json.gz \
--image_aspect_ratio pad
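
The exact output schema is not documented here, but the gzipped JSON predictions can be sanity-checked from the shell:

# Decompress and preview the first 500 bytes of the predictions
gunzip -c ./out/RAW-Y.json.gz | head -c 500
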
See LICENSE for the terms of use.