ITA-MDT:
Image-Timestep-Adaptive Masked Diffusion Transformer Framework
for Image-Based Virtual Try-On
Ji Woo Hong,
Tri Ton,
Pham X. Trung,
Gwanhyeong Koo,
Sunjae Yoon,
Chang D. Yoo
Korea Advanced Institute of Science and Technology (KAIST)
git clone https://github.com/jiwoohong93/ita-mdt_code.git
cd ita-mdt_code
bash environment.sh
conda activate ITA-MDT

The above commands will create and activate the conda environment with all core dependencies for ITA-MDT.
(Optional) We recommend installing Adan and xFormers for improved training and generation efficiency.
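If you want these optional speedups, one possible install sketch follows; the package names and versions are assumptions (adan-pytorch is one publicly available Adan implementation, and environment.sh may already cover both):

pip install xformers        # memory-efficient attention kernels
pip install adan-pytorch    # assumed package name for an Adan optimizer implementation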
Two pre-trained components are required and will be automatically downloaded on the first run of training or generation:
- DINOv2 — Vision Transformer backbone for garment feature extraction.
- Stable Diffusion VAE — Variational Autoencoder for image encoding/decoding in latent space.
Once downloaded, they will be cached locally for subsequent runs.
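On machines with restricted internet access, you may want to warm the caches beforehand. A minimal sketch using torch.hub and diffusers; the exact DINOv2 variant and VAE weights that ITA-MDT loads are assumptions here, so check the code before relying on this:

# Warm the torch.hub cache with a DINOv2 backbone (variant is an assumption)
python -c "import torch; torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')"
# Warm the Hugging Face cache with a Stable Diffusion VAE (repo id is an assumption)
python -c "from diffusers import AutoencoderKL; AutoencoderKL.from_pretrained('stabilityai/sd-vae-ft-mse')"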
Download VITON-HD from HERE
Download DressCode from HERE
Place both datasets inside the DATA/ folder:
DATA/
├── zalando-hd-resized/
├── DressCode/
To generate the agnostic images and corresponding masks, we adopt the dataset preparation of CAT-DM. For DensePose, we use the new part-based color-map DensePose images provided by IDM-VTON to ensure consistency with the VITON-HD dataset.
These images are required for proper training and generation. They can all be downloaded from HERE.
After downloading, place each garment category’s folder and its images into the corresponding directory of the original DressCode dataset.
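For example, if the downloaded archive mirrors the per-category layout, the copy might look like this (the source folder name is hypothetical; adjust it to match the actual download):

# Hypothetical source layout; adjust names to match the actual download
for c in dresses lower_body upper_body; do
  cp -r downloaded_dresscode_extras/$c/* DATA/DressCode/$c/
done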
This script pre-computes the Salient Region Extraction (SRE) and saves the salient region images in advance for faster and more efficient training and generation.
Run the following command:
python preprocess_salient_region_extraction.py --path_to_datasets ./DATA

- --path_to_datasets should point to the folder containing the zalando-hd-resized and DressCode directories.
- This script will process both datasets and save the salient region images into the cloth_sr folder for each category.
Alternatively, you can download the pre-processed salient region images from HERE.
After downloading, place each garment category’s folder and its images into the corresponding directory.
zalando-hd-resized/
├── test/
│ ├── agnostic-mask
│ ├── agnostic-v3.2
│ ├── cloth
│ ├── cloth_sr
│ ├── image
│ └── image-densepose
├── train/
│ ├── agnostic-mask
│ ├── agnostic-v3.2
│ ├── cloth
│ ├── cloth_sr
│ ├── image
│ └── image-densepose
├── test_pairs.txt
└── train_pairs.txt
DressCode/
├── dresses/
│ ├── agnostic
│ ├── cloth_sr
│ ├── image-densepose
│ ├── images
│ ├── mask
│ ├── test_pairs_paired.txt
│ ├── test_pairs_unpaired.txt
│ └── train_pairs.txt
├── lower_body/
│ ├── agnostic
│ ├── cloth_sr
│ ├── image-densepose
│ ├── images
│ ├── mask
│ ├── test_pairs_paired.txt
│ ├── test_pairs_unpaired.txt
│ └── train_pairs.txt
└── upper_body/
├── agnostic
├── cloth_sr
├── image-densepose
├── images
├── mask
├── test_pairs_paired.txt
├── test_pairs_unpaired.txt
└── train_pairs.txt
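As a quick sanity check that both datasets match the layout above, a small illustrative bash loop (run from the repository root) reports any missing folders:

# Report any missing VITON-HD folders
for split in test train; do
  for sub in agnostic-mask agnostic-v3.2 cloth cloth_sr image image-densepose; do
    [ -d "DATA/zalando-hd-resized/$split/$sub" ] || echo "missing: DATA/zalando-hd-resized/$split/$sub"
  done
done
# Report any missing DressCode folders
for cat in dresses lower_body upper_body; do
  for sub in agnostic cloth_sr image-densepose images mask; do
    [ -d "DATA/DressCode/$cat/$sub" ] || echo "missing: DATA/DressCode/$cat/$sub"
  done
done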
Run:
bash train.sh

Key variables in train.sh:
- export CUDA_VISIBLE_DEVICES= → GPU IDs to use for training (comma-separated).
- NUM_GPUS= → Number of GPUs to use.
- export OPENAI_LOGDIR= → Directory to save training logs and checkpoints.
- LR= → Learning rate.
- BATCH_SIZE= → Batch size.
- SAVE_INTERVAL= → Save a model checkpoint every this many steps.
- MASTER_PORT= → Port used for inter-process communication in distributed training (change if a conflict occurs).
- (Optional) --resume_checkpoint → Uncomment and set a path to resume from a saved checkpoint.
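As an illustration, the top of train.sh might be filled in like this (all values below are example settings, not the ones used in the paper, and the resume path is a placeholder):

export CUDA_VISIBLE_DEVICES=0,1   # train on GPUs 0 and 1
NUM_GPUS=2
export OPENAI_LOGDIR=./logs/ita-mdt
LR=1e-4
BATCH_SIZE=4
SAVE_INTERVAL=10000
MASTER_PORT=29500                 # change if this port is already taken
# --resume_checkpoint ./logs/ita-mdt/checkpoint.pt   # placeholder path; uncomment to resume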
You can download the checkpoint of our ITA-MDT from HERE.
[2025-10-08] Reuploaded with the correct model weights.
Run for VITON-HD:

bash generate_vitonhd.sh

Run for DressCode:

bash generate_dc.sh

Common for both generate_vitonhd.sh and generate_dc.sh:
- export CUDA_VISIBLE_DEVICES= → GPU ID to use for generation.
- OUTPUT_DIR= → Path where generated images will be saved.
- MODEL_PATH= → Path to the trained weights (EMA).
- BATCH_SIZE= → Images generated per batch.
- NUM_SAMPLING_STEPS= → Diffusion sampling steps.
- UNPAIR=false → Whether to use unpaired garment-person combinations.

For generate_dc.sh only:
- SUBDATA= → Category of the DressCode dataset (dresses, upper_body, or lower_body).
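For example, generate_vitonhd.sh could be configured as follows (illustrative values; the checkpoint filename is a placeholder):

export CUDA_VISIBLE_DEVICES=0
OUTPUT_DIR=./results/vitonhd
MODEL_PATH=./checkpoints/ita-mdt_ema.pt   # placeholder filename
BATCH_SIZE=8
NUM_SAMPLING_STEPS=50
UNPAIR=false                              # set to true for unpaired try-on
# For generate_dc.sh, additionally set:
# SUBDATA=upper_body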
The evaluation code is adapted from LaDI-VTON. Please refer to the original repository for the environment required to run the evaluation.
Run:
bash eval.sh

- CUDA_VISIBLE_DEVICES= → GPU ID to use for evaluation.
- --batch_size= → Batch size for evaluation.
- --gen_folder= → Path to the generated images to be evaluated.
- --dataset= → Dataset to evaluate on (vitonhd or dresscode).
- --test_order= → Paired or unpaired evaluation (paired or unpaired). For unpaired, only FID is valid.
- --category= → Category for the DressCode dataset (upper_body, lower_body, or dresses).
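An illustrative invocation, assuming eval.sh forwards these flags to the underlying evaluation script (the paths are placeholders):

CUDA_VISIBLE_DEVICES=0 bash eval.sh \
    --batch_size=16 \
    --gen_folder=./results/vitonhd \
    --dataset=vitonhd \
    --test_order=paired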
We kindly encourage citation of our work if you find it useful.
@article{hong2025ita,
title={ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On},
author={Hong, Ji Woo and Ton, Tri and Pham, Trung X and Koo, Gwanhyeong and Yoon, Sunjae and Yoo, Chang D},
journal={arXiv preprint arXiv:2503.20418},
year={2025}
}

The code in this repository is released under the CC BY-NC-SA 4.0 license.
This work was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT)
(No. RS-2021-II211381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments),
and partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT)
(No. RS-2022-II220184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).