This is the official PyTorch code for the paper:
DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution
Zheng-Peng Duan1,2 *, Jiawei Zhang2, Xin Jin1, Ziheng Zhang1, Zheng Xiong2, Dongqing Zou2,3, Jimmy S. Ren2,4, Chunle Guo1, Chongyi Li1 †
1 VCIP, CS, Nankai University, 2 SenseTime Research, 3 PBVR, 4 Hong Kong Metropolitan University
*This project is done during the internship at SenseTime Research.
†Corresponding author.
⭐ If DiT4SR is helpful to your images or projects, please help star this repo. Thank you! 👈
- 2025.07.07 Create this repo and release related code of our paper.
- Release a Hugging Face demo
- Release checkpoints
- Release NKUSR8K dataset
- Release training and inference code
- Release Chinese version and supplementary material
- Clone repo
git clone https://github.com/adam-duan/DiT4SR.git
cd DiT4SR
- Install packages
conda env create -f environment.yaml
Step 1: Download Checkpoints
- Download the [dit4sr_f and dit4sr_q] checkpoints and place them in the preset/dit4sr_f and preset/dit4sr_q directories, respectively.
- Download the [stable-diffusion-3.5-medium] checkpoint and place it in the preset/stable-diffusion-3.5-medium directory.
- Download the [clip-vit-large-patch14-336] and [llava-v1.5-13b] checkpoints and place them in the llava_ckpt directory.
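After downloading, the checkpoint folders are expected to look roughly as follows (paths follow the instructions above; the exact subfolder contents depend on the released checkpoints):
preset/
└── dit4sr_f/
└── dit4sr_q/
└── stable-diffusion-3.5-medium/
llava_ckpt/
└── clip-vit-large-patch14-336/
└── llava-v1.5-13b/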
Step 2: Prepare testing data
Place low-quality images in preset/datasets/test_datasets/.
You can download RealSR, DrealSR and RealLR200 from [SeeSR],
and download RealLQ250 from [DreamClear].
Thanks for their awesome work.
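One possible layout, assuming each benchmark is kept in its own subfolder (the folder names here are illustrative):
preset/datasets/test_datasets/
└── RealSR/
└── DrealSR/
└── RealLR200/
└── RealLQ250/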
Step 3: Run the testing command
# test w/o llava, one GPU is enough
bash bash/test_wollava.sh
# test w/ llava, two GPUs are required
bash bash/test_wllava.sh
Replace the placeholders [pretrained_model_name_or_path], [transformer_model_name_or_path], [image_path], [output_dir], and [prompt_path] with their respective paths before running the command.
The evaluation script (test_wollava.sh) is designed to run with pre-generated prompts in order to reduce the computational cost of LLaVA during testing.
We provide our pre-generated and processed prompts in the preset/prompts directory.
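As a hypothetical illustration of the filled-in placeholders (shown as shell variables for readability; in practice you edit the paths directly in the test script, and the output path is arbitrary):
pretrained_model_name_or_path="preset/stable-diffusion-3.5-medium"
transformer_model_name_or_path="preset/dit4sr_q"
image_path="preset/datasets/test_datasets/RealLR200"
output_dir="results/RealLR200"
prompt_path="preset/prompts"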
Step 4: Check the results
The processed results will be saved in the [output_dir] directory.
We provide a Gradio demo for DiT4SR. You can use the demo to test your own images.
CUDA_VISIBLE_DEVICES=0,1 python gradio_dit4sr.py \
--transformer_model_name_or_path "preset/models/dit4sr_f"
Note that dit4sr_q achieves superior performance in terms of perceptual quality, while dit4sr_f better preserves image fidelity. All results reported in the paper are generated using dit4sr_q.
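If perceptual quality is preferred, the same demo can be pointed at the dit4sr_q weights instead (this assumes the checkpoint is placed under preset/models/; adjust the path to wherever you stored it):
CUDA_VISIBLE_DEVICES=0,1 python gradio_dit4sr.py \
--transformer_model_name_or_path "preset/models/dit4sr_q"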
Step 1: Download the training data
Download the training datasets including DIV2K, DIV8K, Flickr2K, Flickr8K, and our [NKUSR8K] dataset.
Step 2: Prepare the training data
- Following [SeeSR], generate the LR-HR pairs for training using bash_data/make_pairs.sh.
- Use bash_data/make_prompt.sh to generate the prompts for each HR image.
- Use bash_data/make_latent.sh to generate the latent codes for both HR and LR images.
- Use bash_data/make_embedding.sh to generate the embedding for each prompt.
- Don't forget to download [NULL_pooled_prompt_embeds.pt and NULL_prompt_embeds.pt] and place them in the corresponding directories.
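A minimal sketch of running these preprocessing steps in order (assuming the input/output paths configured inside each script have been adjusted to your dataset locations):
# 1. Generate LR-HR training pairs (following the SeeSR degradation pipeline)
bash bash_data/make_pairs.sh
# 2. Generate a text prompt for each HR image
bash bash_data/make_prompt.sh
# 3. Encode HR and LR images into SD3 latent space
bash bash_data/make_latent.sh
# 4. Encode each prompt into SD3 prompt embeddings
bash bash_data/make_embedding.sh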
Data Structure After Preprocessing
preset/datasets/training_datasets/
└── gt
└── 0000001.png # GT images, (3, 512, 512)
└── ...
└── sr_bicubic
└── 0000001.png # Bicubic LR images, (3, 512, 512)
└── ...
└── prompt_txt
└── 0000001.txt # prompts for teacher model and lora model
└── ...
└── prompt_embeds
└── NULL_prompt_embeds.pt # SD3 prompt embedding tensors, (154, 4096)
└── 0000001.pt
└── ...
└── pooled_prompt_embeds
└── NULL_pooled_prompt_embeds.pt # SD3 pooled embedding tensors, (2048,)
└── 0000001.pt
└── ...
└── latent_hr
└── 0000001.pt # SD3 latent space tensors, (16, 64, 64)
└── ...
└── latent_lr
└── 0000001.pt # SD3 latent space tensors, (16, 64, 64)
└── ...
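As an optional sanity check (a simple sketch, assuming the layout above), you can compare file counts across the subdirectories to verify that every GT image has matching LR, prompt, embedding, and latent files:
cd preset/datasets/training_datasets
for d in gt sr_bicubic prompt_txt prompt_embeds pooled_prompt_embeds latent_hr latent_lr; do
    # prompt_embeds and pooled_prompt_embeds each hold one extra NULL_*.pt file
    echo "$d: $(ls "$d" | wc -l) files"
done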
Step 3: Start training
Use the following command to start the training process:
bash bash/train.sh
This project is licensed under the Pi-Lab License 1.0 - see the LICENSE file for details.
If you find our repo useful for your research, please consider citing our paper:
@inproceedings{duan2025dit4sr,
title={DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution},
author={Duan, Zheng-Peng and Zhang, Jiawei and Jin, Xin and Zhang, Ziheng and Xiong, Zheng and Zou, Dongqing and Ren, Jimmy and Guo, Chun-Le and Li, Chongyi},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}
For technical questions, please contact adamduan0211[AT]gmail.com
