ITA-MDT:
Image-Timestep-Adaptive Masked Diffusion Transformer Framework
for Image-Based Virtual Try-On
Ji Woo Hong,
Tri Ton,
Pham X. Trung,
Gwanhyeong Koo,
Sunjae Yoon,
Chang D. Yoo
Korea Advanced Institute of Science and Technology (KAIST)
git clone https://github.com/jiwoohong93/ita-mdt_code.git
cd ita-mdt_code
bash environment.sh
conda activate ITA-MDT

The above commands will create and activate the conda environment with all core dependencies for ITA-MDT.
(Optional) We recommend installing Adan and xFormers for improved training and generation efficiency.
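If you want these optional speedups, one possible install sketch follows; the package names and versions are assumptions (adan-pytorch is one publicly available Adan implementation, and environment.sh may already cover both):

pip install xformers        # memory-efficient attention kernels
pip install adan-pytorch    # assumed package name for an Adan optimizer implementation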
Two pre-trained components are required and will be automatically downloaded on the first run of training or generation:
- DINOv2 — Vision Transformer backbone for garment feature extraction.
- Stable Diffusion VAE — Variational Autoencoder for image encoding/decoding in latent space.
Once downloaded, they will be cached locally for subsequent runs.
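On machines with restricted internet access, you may want to warm the caches beforehand. A minimal sketch using torch.hub and diffusers; the exact DINOv2 variant and VAE weights that ITA-MDT loads are assumptions here, so check the code before relying on this:

# Warm the torch.hub cache with a DINOv2 backbone (variant is an assumption)
python -c "import torch; torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')"
# Warm the Hugging Face cache with a Stable Diffusion VAE (repo id is an assumption)
python -c "from diffusers import AutoencoderKL; AutoencoderKL.from_pretrained('stabilityai/sd-vae-ft-mse')"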
Download VITON-HD from HERE
Download DressCode from HERE
Place both datasets inside the DATA/ folder:
DATA/
├── zalando-hd-resized/
├── DressCode/
To generate the agnostic images and corresponding masks, we adopt the dataset preparation of CAT-DM. For DensePose, we use the new part-based color-map DensePose images provided by IDM-VTON to ensure consistency with the VITON-HD dataset.
These images are required for proper training and generation. They can all be downloaded from HERE.
After downloading, place each garment category’s folder and its images into the corresponding directory of the original DressCode dataset.
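For example, if the downloaded archive mirrors the per-category layout, the copy might look like this (the source folder name is hypothetical; adjust it to match the actual download):

# Hypothetical source layout; adjust names to match the actual download
for c in dresses lower_body upper_body; do
  cp -r downloaded_dresscode_extras/$c/* DATA/DressCode/$c/
done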
This script pre-computes the Salient Region Extraction (SRE) and saves the salient region images in advance for faster and more efficient training and generation.
Run the following command:
python preprocess_salient_region_extraction.py --path_to_datasets ./DATA

- --path_to_datasets should point to the folder containing the zalando-hd-resized and DressCode directories.
- This script will process both datasets and save the salient region images into the cloth_sr folder for each category.
Alternatively, you can download the pre-processed salient region images from HERE.
After downloading, place each garment category’s folder and its images into the corresponding directory.
zalando-hd-resized/
├── test/
│ ├── agnostic-mask
│ ├── agnostic-v3.2
│ ├── cloth
│ ├── cloth_sr
│ ├── image
│ └── image-densepose
├── train/
│ ├── agnostic-mask
│ ├── agnostic-v3.2
│ ├── cloth
│ ├── cloth_sr
│ ├── image
│ └── image-densepose
├── test_pairs.txt
└── train_pairs.txt
DressCode/
├── dresses/
│ ├── agnostic
│ ├── cloth_sr
│ ├── image-densepose
│ ├── images
│ ├── mask
│ ├── test_pairs_paired.txt
│ ├── test_pairs_unpaired.txt
│ └── train_pairs.txt
├── lower_body/
│ ├── agnostic
│ ├── cloth_sr
│ ├── image-densepose
│ ├── images
│ ├── mask
│ ├── test_pairs_paired.txt
│ ├── test_pairs_unpaired.txt
│ └── train_pairs.txt
└── upper_body/
├── agnostic
├── cloth_sr
├── image-densepose
├── images
├── mask
├── test_pairs_paired.txt
├── test_pairs_unpaired.txt
└── train_pairs.txt
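As a quick sanity check that both datasets match the layout above, a small illustrative bash loop (run from the repository root) reports any missing folders:

# Report any missing VITON-HD folders
for split in test train; do
  for sub in agnostic-mask agnostic-v3.2 cloth cloth_sr image image-densepose; do
    [ -d "DATA/zalando-hd-resized/$split/$sub" ] || echo "missing: DATA/zalando-hd-resized/$split/$sub"
  done
done
# Report any missing DressCode folders
for cat in dresses lower_body upper_body; do
  for sub in agnostic cloth_sr image-densepose images mask; do
    [ -d "DATA/DressCode/$cat/$sub" ] || echo "missing: DATA/DressCode/$cat/$sub"
  done
done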
Run:
bash train.sh

Key variables in train.sh:
- export CUDA_VISIBLE_DEVICES= → GPU IDs to use for training (comma-separated).
- NUM_GPUS= → Number of GPUs to use.
- export OPENAI_LOGDIR= → Directory to save training logs and checkpoints.
- LR= → Learning rate.
- BATCH_SIZE= → Batch size.
- SAVE_INTERVAL= → Save a model checkpoint every this many steps.
- MASTER_PORT= → Port used for inter-process communication in distributed training (change if a conflict occurs).
- (Optional) --resume_checkpoint → Uncomment and set a path to resume from a saved checkpoint.
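As an illustration, the top of train.sh might be filled in like this (all values below are example settings, not the ones used in the paper, and the resume path is a placeholder):

export CUDA_VISIBLE_DEVICES=0,1   # train on GPUs 0 and 1
NUM_GPUS=2
export OPENAI_LOGDIR=./logs/ita-mdt
LR=1e-4
BATCH_SIZE=4
SAVE_INTERVAL=10000
MASTER_PORT=29500                 # change if this port is already taken
# --resume_checkpoint ./logs/ita-mdt/checkpoint.pt   # placeholder path; uncomment to resume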
You can download the checkpoint of our ITA-MDT from HERE.
[2025-10-08] Reuploaded with the correct model weights.
Run for VITON-HD:

bash generate_vitonhd.sh

Run for DressCode:

bash generate_dc.sh

Common for both generate_vitonhd.sh and generate_dc.sh:
- export CUDA_VISIBLE_DEVICES= → GPU ID to use for generation.
- OUTPUT_DIR= → Path where generated images will be saved.
- MODEL_PATH= → Path to the trained weights (EMA).
- BATCH_SIZE= → Images generated per batch.
- NUM_SAMPLING_STEPS= → Diffusion sampling steps.
- UNPAIR=false → Whether to use unpaired garment-person combinations.

For generate_dc.sh only:
- SUBDATA= → Category of the DressCode dataset (dresses, upper_body, or lower_body).
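For example, generate_vitonhd.sh could be configured as follows (illustrative values; the checkpoint filename is a placeholder):

export CUDA_VISIBLE_DEVICES=0
OUTPUT_DIR=./results/vitonhd
MODEL_PATH=./checkpoints/ita-mdt_ema.pt   # placeholder filename
BATCH_SIZE=8
NUM_SAMPLING_STEPS=50
UNPAIR=false                              # set to true for unpaired try-on
# For generate_dc.sh, additionally set:
# SUBDATA=upper_body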
The evaluation code is adapted from LaDI-VTON. Please refer to the original repository for the environment required to run the evaluation.
Run:
bash eval.sh

- CUDA_VISIBLE_DEVICES= → GPU ID to use for evaluation.
- --batch_size= → Batch size for evaluation.
- --gen_folder= → Path to the generated images to be evaluated.
- --dataset= → Dataset to evaluate on (vitonhd or dresscode).
- --test_order= → Paired or unpaired evaluation (paired or unpaired). For unpaired, only FID is valid.
- --category= → Category for the DressCode dataset (upper_body, lower_body, or dresses).
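An illustrative invocation, assuming eval.sh forwards these flags to the underlying evaluation script (the paths are placeholders):

CUDA_VISIBLE_DEVICES=0 bash eval.sh \
    --batch_size=16 \
    --gen_folder=./results/vitonhd \
    --dataset=vitonhd \
    --test_order=paired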
We kindly encourage citation of our work if you find it useful.
@article{hong2025ita,
title={ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On},
author={Hong, Ji Woo and Ton, Tri and Pham, Trung X and Koo, Gwanhyeong and Yoon, Sunjae and Yoo, Chang D},
journal={arXiv preprint arXiv:2503.20418},
year={2025}
}

The code in this repository is released under the CC BY-NC-SA 4.0 license.
This work was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT)
(No. RS-2021-II211381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments),
and partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT)
(No. RS-2022-II220184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).