
[NeurIPS 2025] Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations


csuhan/Tar


Tar + Lumina2: Lumina-Image-2.0 as a Strong Dif-DTok


🏠 Architecture

✨ Lumina-Accessory directly leverages the self-attention mechanism in DiT to enable interaction between condition and target image tokens, consistent with approaches such as OminiControl, DSD, and VisualCloze.

✨ Built on top of Lumina-Image-2.0, Lumina-Accessory introduces an additional condition processor, initialized with the weights of the latent processor.

✨ We pass TA-Tok's discrete tokens to Lumina-Accessory, which transforms the text-aligned representation into pixel space with high quality. We made minor modifications to Lumina-Accessory, such as iterative parquet dataset loading and TA-Tok condition support.
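The condition-target interaction described above can be illustrated with a minimal, single-head sketch (NumPy, not the actual Lumina-Accessory implementation): condition tokens and target image tokens are concatenated along the sequence axis, so a single self-attention pass lets every target token attend to the condition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(cond, target):
    """Hypothetical single-head sketch of OminiControl-style joint attention.

    cond:   (Lc, D) condition tokens (e.g. projected TA-Tok embeddings)
    target: (Lt, D) noisy target image tokens
    """
    seq = np.concatenate([cond, target], axis=0)   # (Lc + Lt, D): one joint sequence
    d = seq.shape[-1]
    scores = softmax(seq @ seq.T / np.sqrt(d))     # every token attends to all tokens
    out = scores @ seq
    return out[cond.shape[0]:]                     # keep only the target positions
```

A real DiT block would use learned Q/K/V projections, multiple heads, and RoPE, but the key point survives in the sketch: no extra cross-attention module is needed, only a longer joint sequence.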

💻 Finetuning Code

1. Create a conda environment and install PyTorch

```bash
conda create -n Lumina2 -y
conda activate Lumina2
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
```

2. Install dependencies

```bash
pip install -r requirements.txt
```

3. Install flash-attn

```bash
pip install flash-attn --no-build-isolation
```

4. Prepare data

We suggest using a parquet dataset for loading large-scale training data. See csuhan/ImageNet1K-T2I-QwenVL-QwenImage for an example.

5. Start finetuning

```bash
bash scripts/run_1024_finetune_tatok.sh
```

🚀 Inference Code

Please check the inference script in the main branch.
