[NeurIPS 2024] Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation
by Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Bo Li, Yang Tang, and Pan Zhou
In this paper, we present Modality Adaptation with text-to-image Diffusion Models (MADM). Leveraging the powerful generalization of Text-to-Image Diffusion Models (TIDMs), we extend domain adaptation to modality adaptation, aiming to segment other unexplored visual modalities in the real world.
Qualitative semantic segmentation results generated by SoTA methods MIC, Rein, and our proposed MADM on three modalities.
- Create a conda virtual env, activate it, and install packages.
conda create -n MADM python==3.10
conda activate MADM
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -U openmim
mim install mmcv==1.3.7
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install -r requirements.txt
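After installation, a quick sanity check like the following (a minimal sketch, not part of MADM) confirms that the main dependencies import correctly and that PyTorch can see your GPUs:
# Sanity check for the MADM environment (adjust or skip as needed).
import torch
import torchvision
import mmcv
import detectron2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("mmcv:", mmcv.__version__)
print("detectron2:", detectron2.__version__)
print("GPU count:", torch.cuda.device_count())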
- Cityscapes (RGB): Download the gtFine_trainvaltest.zip and leftImg8bit_trainvaltest.zip from Cityscapes. Generate *labelTrainIds.png via cityscapesscripts, and generate samples_with_class.json for rare class sampling (RCS) via the Data Preprocessing step in DAFormer (see the preprocessing sketch after this list).
- DELIVER (Depth): Download the DELIVER dataset.
- FMB (Infrared): Download the FMB dataset.
- DSEC (Event): Download the testing semantic labels and the training & testing events aggregated in the edge form.
- Modify the project path in line 393 and the dataset paths in lines 394-403 of main.py.
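For the Cityscapes preprocessing above, the sketch below assumes cityscapesscripts is installed and that DAFormer's tools/convert_datasets/cityscapes.py is run from a DAFormer checkout (the script name and arguments follow DAFormer's README); adjust the placeholder paths to your setup.
# Sketch of the Cityscapes preprocessing steps (paths are placeholders).
import os
import subprocess

CITYSCAPES_ROOT = "path/to/datasets/Cityscapes"  # adjust to your dataset path

# 1) Generate *labelTrainIds.png with cityscapesscripts.
subprocess.run(
    ["python", "-m", "cityscapesscripts.preparation.createTrainIdLabelImgs"],
    env={**os.environ, "CITYSCAPES_DATASET": CITYSCAPES_ROOT},
    check=True,
)

# 2) Generate samples_with_class.json for rare class sampling (RCS) with
#    DAFormer's data preprocessing script (run from a DAFormer checkout).
subprocess.run(
    ["python", "tools/convert_datasets/cityscapes.py", CITYSCAPES_ROOT, "--nproc", "8"],
    check=True,
)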
The data folder structure should look like this:
path/to/datasets
├── Cityscapes
│ ├── leftImg8bit
│ ├── gtFine
│ ├── samples_with_class.json
│ ├── ...
├── DELIVER
│ ├── depth
│ ├── semantic
│ ├── ...
├── FMB
│ ├── train
│ ├── test
│ ├── ...
├── DSEC
│ ├── 69mask_train_edges
│ ├── 69mask_test_edges
│ ├── test_semantic_labels
│ ├── ...
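Before training, a quick check like this sketch (with a placeholder dataset root) verifies that the expected folders and files from the structure above are in place:
# Verify the dataset folder structure (a sketch; adjust the root path).
import os

root = "path/to/datasets"  # adjust to your dataset path
expected = [
    "Cityscapes/leftImg8bit", "Cityscapes/gtFine", "Cityscapes/samples_with_class.json",
    "DELIVER/depth", "DELIVER/semantic",
    "FMB/train", "FMB/test",
    "DSEC/69mask_train_edges", "DSEC/69mask_test_edges", "DSEC/test_semantic_labels",
]
for rel in expected:
    status = "OK     " if os.path.exists(os.path.join(root, rel)) else "MISSING"
    print(status, rel)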
Follow the examples on Hugging Face to automatically download the stable-diffusion-v1-4 model, and modify stable_diffusion_name_or_path in config_files/common/models/mtmadise_multi_lora.py accordingly.
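As one way to pre-download the model, the sketch below (assuming the huggingface_hub package; you may need to log in to Hugging Face depending on the repository's access settings) fetches a local snapshot of stable-diffusion-v1-4, and the resulting path (or the repo id itself) can then be used for stable_diffusion_name_or_path:
# Pre-download stable-diffusion-v1-4 into the local Hugging Face cache (a sketch).
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="CompVis/stable-diffusion-v1-4")
print(local_path)  # point stable_diffusion_name_or_path to this path (or use the repo id)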
- Training our MADM requires 2 GPUs with more than 40 GB of memory each (a quick memory check is sketched after the commands below).
- Cityscapes (RGB) → DELIVER (Depth)
CUDA_VISIBLE_DEVICES=0,1 python main.py --config-file config_files/SemSeg/MTMADISE/mtmadise_cityscapes_rgb_to_depth_11.py --num-gpus 2 --bs 2 --tag RGB2Depth
- Cityscapes (RGB) → FMB (Infrared)
CUDA_VISIBLE_DEVICES=0,1 python main.py --config-file config_files/SemSeg/MTMADISE/mtmadise_cityscapes_rgb_to_infrared_9.py --num-gpus 2 --bs 2 --tag RGB2Infrared
- Cityscapes (RGB) → DSEC (Event)
CUDA_VISIBLE_DEVICES=0,1 python main.py --config-file config_files/SemSeg/MTMADISE/mtmadise_cityscapes_rgb_to_event_11.py --num-gpus 2 --bs 2 --tag RGB2Event
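To verify the GPU memory requirement mentioned above before launching a run, a quick PyTorch check like this sketch lists the visible devices and their total memory:
# List visible GPUs and their total memory (should exceed 40 GB each for training).
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")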
Download the trained models for Cityscapes (RGB) → DELIVER (Depth), Cityscapes (RGB) → FMB (Infrared), or Cityscapes (RGB) → DSEC (Event) and put them in the trained_checkpoints folder. Then you can run inference with them:
- Cityscapes (RGB) → DELIVER (Depth)
CUDA_VISIBLE_DEVICES=0,1 python main.py --config-file config_files/SemSeg/MTMADISE/mtmadise_cityscapes_rgb_to_depth_11.py --num-gpus 2 --bs 2 --tag RGB2Depth_eval --eval-only --init-from ./trained_checkpoints/model_RGB2Depth.pth
- Cityscapes (RGB) → FMB (Infrared)
CUDA_VISIBLE_DEVICES=0,1 python main.py --config-file config_files/SemSeg/MTMADISE/mtmadise_cityscapes_rgb_to_infrared_9.py --num-gpus 2 --bs 2 --tag RGB2Infrared_eval --eval-only --init-from ./trained_checkpoints/model_RGB2Infrared.pth
- Cityscapes (RGB) → DSEC (Event)
CUDA_VISIBLE_DEVICES=0,1 python main.py --config-file config_files/SemSeg/MTMADISE/mtmadise_cityscapes_rgb_to_event_11.py --num-gpus 2 --bs 2 --tag RGB2Event_eval --eval-only --init-from ./trained_checkpoints/model_RGB2Event.pth
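If a downloaded checkpoint seems problematic, a quick load check like this sketch confirms the file deserializes (the exact checkpoint format depends on how MADM saves models, so treat the key inspection as illustrative):
# Quick check that a downloaded checkpoint file loads (format details may vary).
import torch

ckpt = torch.load("./trained_checkpoints/model_RGB2Depth.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # inspect the top-level keys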
- For RGB2Infrared and RGB2Event, the previously trained checkpoints were lost, so we provide two new checkpoints with performance similar to that reported in the paper: RGB2Infrared (original: 62.23, new: 61.88) and RGB2Event (original: 56.31, new: 56.68).
Thanks to ODISE, DAFormer, Stable Diffusion, Detectron2, and MMCV for their public code and released models.
If you find this project useful, please consider citing:
@article{MADM,
  title={Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation},
  author={Xia, Ruihao and Liang, Yu and Jiang, Peng-Tao and Zhang, Hao and Li, Bo and Tang, Yang and Zhou, Pan},
  journal={arXiv preprint arXiv:2410.21708},
  year={2024}
}