This is the official code for the paper "Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models".
This code has been tested in the following settings, but is expected to work on other systems as well.
- Ubuntu 20.04
- CUDA 11.8
- NVIDIA RTX A6000
```shell
conda create -n oor python=3.8
conda activate oor
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone https://github.com/facebookresearch/pytorch3d.git
cd pytorch3d
git checkout -f v0.7.2
pip install -e .
pip install transformers==4.38.2 opencv-python==4.2.0.32 scipy==1.4.1 numpy==1.23.5 tensorboardX==2.5.1 pyrender==0.1.45 torchdiffeq matplotlib wandb trimesh[easy]
```

An unexpected accident caused the loss of all our data and code. We were able to recover the training and inference code, but not the data generation part. Instead, we release a rule-based dataset.
The rule-based dataset consists of the following 9 OOR pairs:

| Object category pair | Relationship |
| --- | --- |
| (desk, monitor) | on |
| (desk, keyboard) | on |
| (desk, mouse) | on |
| (desk, teacup) | on |
| (desk, teapot) | on |
| (monitor, keyboard) | in front of |
| (mouse, keyboard) | next to |
| (teacup, teapot) | around |
| (teacup, teapot) | pour |
Each OOR pair dataset contains 1000 OOR samples: 500 are generated with one object as the base and the other as the target, and the remaining 500 are generated with the base and target swapped.
Each OOR pair dataset is saved as a pickle file; see the `data/oor_pickle/` directory.
The object meshes used to create the training dataset are located in the `data/CAD` directory. They were all collected from Sketchfab:

- `desk.obj`: link
- `monitor.obj`: link
- `keyboard.obj`: link
- `mouse.obj`: link
- `teapot.obj`: link
- `teacup.obj`: link
We also provide a pre-trained model trained on the rule-based dataset; see `results/ckpts/OOR/ckpt_epoch20000.pth`.
If you want to train your model from scratch, run:

```shell
bash scripts/train_score_r_t_s.sh
```

If you don't change any arguments, checkpoints will be saved in `results/ckpts/OOR`.
Set `text_prompt`, `base_object`, and `target_object` in `scripts/inference_r_t_s.sh`. Then run:

```shell
bash scripts/inference_r_t_s.sh
```

The inference results will be saved to `results/inference/pairwise_oor/{input_text_prompt}/base-{base_object}_target-{target_object}/inference.pkl`.
This is similar to inpainting in image diffusion models, so additional information about the mask is required. The mask structure should be as follows:

```python
{
    'target_R': None or (3, 3) shape array,
    'target_t': None or (3,) shape array,
    'target_s': None or (3,) shape array,
    'base_s': None or (3,) shape array,
}
```

If an entry is not `None`, that information is masked and kept fixed during the inference process. The format should be a relative pose and scale defined in the base object's instance canonical space; for example, the longest component of `base_s` should be 1. Please refer to the method formulation section of the paper for details. See `data/mask_info_pickle/desk_teacup_on.pkl` for an example.
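The mask pickle above can be constructed directly. Below is a minimal sketch of building one, assuming the structure described above; the numeric values are made up for illustration (here we pin the target's relative translation and the base scale, leaving the rotation and target scale unmasked):

```python
import pickle
import numpy as np

# Hypothetical mask: fix the target's relative translation and the base
# scale; leave target rotation and target scale unmasked (None).
mask_info = {
    'target_R': None,                       # rotation: not masked
    'target_t': np.array([0.0, 0.5, 0.0]),  # relative translation: masked (kept fixed)
    'target_s': None,                       # target scale: not masked
    'base_s': np.array([1.0, 0.6, 0.8]),    # base scale in instance canonical space
}

# Sanity check: base_s is defined in the base object's instance canonical
# space, so its longest component should be 1.
if mask_info['base_s'] is not None:
    assert np.isclose(mask_info['base_s'].max(), 1.0)

with open('my_mask_info.pkl', 'wb') as f:
    pickle.dump(mask_info, f)
```

The file written this way can then be passed as `mask_info_path` in the inference script.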
Now, set `text_prompt`, `base_object`, `target_object`, and `mask_info_path` in `scripts/inference_masked_r_t_s.sh`. Then run:

```shell
bash scripts/inference_masked_r_t_s.sh
```

The inference results will be saved to `results/inference/masked_pairwise_oor/{input_text_prompt}/base-{base_object}_target-{target_object}/{mask_info_name}.pkl`.
Multi-object generation requires information about the scene. The structure is as follows:

```python
{
    'prompt_list': ['A monitor is on a desk', ...],
    'base_list': [('desk', 0), ...],
    'target_list': [('monitor', 1), ...],
}
```

- `len(prompt_list) == len(base_list) == len(target_list)`
- When drawing a graph with the (base, target) pairs, it must be a DAG with a single starting node (the global base).
- The object id of the global base must be 0.
See `data/multi_info_pickle/desk_monitor_keyboard_mouse.pkl` for an example.
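The graph constraints above can be checked before running inference. Below is a sketch of such a validator (the function name is ours, not part of the released code); it verifies the list lengths, the single starting node with id 0, and acyclicity via Kahn's algorithm:

```python
from collections import defaultdict

def validate_scene_info(info):
    """Check the multi-object scene-info constraints described above."""
    assert len(info['prompt_list']) == len(info['base_list']) == len(info['target_list'])
    edges = list(zip(info['base_list'], info['target_list']))
    nodes = {n for edge in edges for n in edge}
    indeg = {n: 0 for n in nodes}
    adj = defaultdict(list)
    for base, target in edges:
        adj[base].append(target)
        indeg[target] += 1
    # Exactly one starting node (the global base), and its object id must be 0.
    roots = [n for n in nodes if indeg[n] == 0]
    assert len(roots) == 1 and roots[0][1] == 0, "need a single global base with id 0"
    # Kahn's algorithm: every node must be topologically sortable (no cycles).
    order, frontier = [], list(roots)
    while frontier:
        node = frontier.pop()
        order.append(node)
        for nxt in adj[node]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                frontier.append(nxt)
    assert len(order) == len(nodes), "graph contains a cycle"

scene_info = {
    'prompt_list': ['A monitor is on a desk', 'A keyboard is on a desk'],
    'base_list': [('desk', 0), ('desk', 0)],
    'target_list': [('monitor', 1), ('keyboard', 2)],
}
validate_scene_info(scene_info)  # passes silently when the scene is valid
```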
Now, set `scene_info_path` in `scripts/inference_multi_r_t_s.sh`. Then run:

```shell
bash scripts/inference_multi_r_t_s.sh
```

The inference results will be saved to `results/inference/multi_oor/{scene_info_name}/inference.pkl`.
We need information about the existing scene and the OORs we want to add. The structure is as follows:

```python
{
    'prompt_list': ['A teapot on a desk', ...],
    'base_list': [('desk', 0), ...],
    'target_list': [('teapot', 3), ...],
    'existing_scene_info': {('desk', 0): {"R": ..., "t": ..., "s": ...}, ...},
}
```

- `len(prompt_list) == len(base_list) == len(target_list)`
- When drawing a graph with the (base, target) pairs, it must be a DAG with a single starting node (the global base).
- The object id of the global base must be 0.
- Objects added to the scene cannot be ancestors of nodes in the existing scene graph and must be descendants of at least one existing node (more specifically, an added object may not be the base of an edge whose target is an existing object).
See `data/add_scene_info_pickle/desk_teacupx2_teapotx2.pkl` for an example.
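The constraint specific to scene addition can also be checked programmatically. Below is a sketch (helper name and values are ours) that verifies an added object never appears as the base of an edge whose target is an existing object:

```python
import numpy as np

def validate_add_info(info):
    """Check the add-to-scene constraint described above."""
    assert len(info['prompt_list']) == len(info['base_list']) == len(info['target_list'])
    existing = set(info['existing_scene_info'])
    for base, target in zip(info['base_list'], info['target_list']):
        # An added object (not in the existing scene) may not be the base
        # when the target is an existing object.
        assert not (base not in existing and target in existing), \
            f"added object {base} cannot be the base of existing object {target}"

add_info = {
    'prompt_list': ['A teapot is on a desk'],
    'base_list': [('desk', 0)],
    'target_list': [('teapot', 3)],
    'existing_scene_info': {
        ('desk', 0): {'R': np.eye(3), 't': np.zeros(3), 's': np.ones(3)},
    },
}
validate_add_info(add_info)  # passes: the added teapot is a descendant of the desk
```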
Now, set `scene_info_path` in `scripts/inference_add_to_scene.sh`. Then run:

```shell
bash scripts/inference_add_to_scene.sh
```

The inference results will be saved to `results/inference/add_multi_oor/{scene_info_name}/inference.pkl`.
As described in the paper, the OOR change procedure is deterministic. Therefore, we remove the batch size from the inputs and instead load multiple scene infos from the input pickle file, using the number of scenes as the batch size. The structure is as follows:
```python
{
    'prompt_list': ['A teapot pours into a teacup.', 'A teapot pours into a teacup.'],
    'base_list': [('teacup', 1), ('teacup', 2)],
    'target_list': [('teapot', 3), ('teapot', 4)],
    'existing_scene_info_list': [{('desk', 0): {"R": ..., "t": ..., "s": ...}, ...}, ...],
}
```

- `len(prompt_list) == len(base_list) == len(target_list)`
- When drawing a graph with the (base, target) pairs, it must be a DAG (it does not have to have a single starting node).
- The object id of the global base must be 0.
- For each OOR to be changed, the (base object, target object) pair must already exist in `existing_scene_info`, and the global base cannot be the target object.
- The scale of the objects (both base and target) is assumed to be fixed.
See `data/change_scene_info_pickle/desk_teacupx2_teapotx2.pkl` for an example.
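The change-specific constraints can likewise be checked up front. Below is a sketch (helper name and values are ours) that verifies each (base, target) pair already exists in its scene and that the global base is never the target:

```python
import numpy as np

def validate_change_info(info):
    """Check the change-OOR constraints described above."""
    assert len(info['prompt_list']) == len(info['base_list']) == len(info['target_list'])
    for base, target, scene in zip(info['base_list'], info['target_list'],
                                   info['existing_scene_info_list']):
        # Both objects of the pair must already exist in the scene.
        assert base in scene and target in scene
        # The global base (object id 0) cannot be the target object.
        assert target[1] != 0

change_info = {
    'prompt_list': ['A teapot pours into a teacup.'],
    'base_list': [('teacup', 1)],
    'target_list': [('teapot', 2)],
    'existing_scene_info_list': [{
        ('desk', 0): {'R': np.eye(3), 't': np.zeros(3), 's': np.ones(3)},
        ('teacup', 1): {'R': np.eye(3), 't': np.zeros(3), 's': np.ones(3)},
        ('teapot', 2): {'R': np.eye(3), 't': np.zeros(3), 's': np.ones(3)},
    }],
}
validate_change_info(change_info)  # passes silently when the info is valid
```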
Now, set `scene_info_path` in `scripts/inference_change_oor.sh`. Then run:

```shell
bash scripts/inference_change_oor.sh
```

The inference results will be saved to `results/inference/change_multi_oor/{scene_info_name}/inference.pkl`.
Run:

```shell
python3 visualize_oor.py --pickle_path {pickle path}

# For example,
# 1) dataset
python3 visualize_oor.py --pickle_path data/oor_pickle/desk_monitor_on.pkl
# 2) inference
python3 visualize_oor.py --pickle_path results/inference/pairwise_oor/A_monitor_is_on_a_desk/base-desk_target-monitor/inference.pkl
```

We provide a very naive pyrender visualization; I personally recommend using Blender instead.
- Integrating inconsistency and collision terms into *Changing OORs in the Existing Scene*
```bibtex
@inproceedings{oor,
  title={Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models},
  author={Baik, Sangwon and Kim, Hyeonwoo and Joo, Hanbyul},
  booktitle={ICCV},
  year={2025}
}
```