
VLV-AutoEncoder

Website | arXiv | GitHub | HF Model | HF Dataset

Abstract

We introduce the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). We establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder.

Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. By fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash.

Paper

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Tiezheng Zhang1, Yitong Li3, Yu-Cheng Chou1, Jieneng Chen1, Alan L. Yuille1, Chen Wei2, and Junfei Xiao1
1 Johns Hopkins University, 2 Rice University, 3 Tsinghua University
paper | code

0. Installation

git clone https://github.com/Tiezheng11/Vision-Language-Vision.git

See the installation instructions to create an environment and install the requirements.
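
For reference, a minimal setup sketch is shown below. The environment name, Python version, and requirements file name are assumptions; if the repository's installation instructions differ, follow those instead.

# Minimal setup sketch (assumed environment name and Python version).
cd Vision-Language-Vision
conda create -n vlv python=3.10 -y
conda activate vlv
pip install -r requirements.txt               # assumed requirements file name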

1. Dataset Preparation

Please refer to laion2B-en-aesthetic for dataset preparation. We provide a subset of 6M image-text pairs on Hugging Face.
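
One possible way to fetch the subset is via the Hugging Face CLI, sketched below. The dataset repository id is a placeholder for the dataset listed on the project's Hugging Face page, and the local directory simply matches the Stage-1 train_dataset_path default.

# Sketch: download the 6M image-text subset (repository id is a placeholder).
pip install -U "huggingface_hub[cli]"
huggingface-cli download <dataset-repo-id> \
    --repo-type dataset \
    --local-dir ./data/Images_6M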

2. Checkpoints

Checkpoints can be downloaded from Hugging Face.
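
A download sketch using the Hugging Face CLI is shown below; the model repository id is a placeholder, and the target directory matches the pretrained_checkpoints paths used in the Stage-2 configuration.

# Sketch: fetch pretrained checkpoints (repository id is a placeholder).
huggingface-cli download <model-repo-id> \
    --local-dir ./pretrained_checkpoints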

3. Stage-1 Training: Vision-Language Alignment

Configuration Options:

train_dataset_path="./data/Images_6M"          # Path to training images
val_dataset_path="./assets"                    # Path to validation images
run_name="Stage1_Training"                     # Experiment name
lr=5e-5                                        # Learning rate
num_steps=10000                                # Total training steps
save_step=1000                                 # Save checkpoint interval
global_batch_size=64                           # Global batch size across all GPUs
num_gpus=8                                     # Number of GPUs to use

Training Command:

bash train_stage1.sh
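
The options above are read from train_stage1.sh. With global_batch_size=64 spread over num_gpus=8, each GPU processes 8 samples per step. A typical launch, sketched below, simply checks the data paths and tees the log to a file (the log file name is arbitrary).

# Sketch: sanity-check the data paths, then launch Stage-1 training.
test -d ./data/Images_6M || echo "warning: training images not found"
test -d ./assets || echo "warning: validation images not found"
bash train_stage1.sh 2>&1 | tee stage1_training.log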

4. Stage-2 Training: Caption Generation

Configuration Options:

epochs=2                                       # Number of training epochs
lr=1e-5                                        # Learning rate
batch_size=8                                   # Batch size per GPU
total_steps=10000                              # Total training steps
dataset_path="./data/Image_384_VL_Captions"    # Path to image-caption pairs
qwen_model="./pretrained_checkpoints/Qwen2.5-3B"  # Path to Qwen model
diffusion_model_path="./pretrained_checkpoints/stage1_checkpoint"  # Stage-1 checkpoint

Training Command:

bash train_stage2.sh
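
Stage-2 expects both the Qwen2.5-3B weights and the Stage-1 checkpoint to be in place before launch. A sketch of a typical run is below; the paths mirror the configuration above and the log file name is arbitrary.

# Sketch: verify prerequisites from the Stage-2 configuration, then launch.
test -d ./pretrained_checkpoints/Qwen2.5-3B || echo "warning: Qwen2.5-3B weights missing"
test -d ./pretrained_checkpoints/stage1_checkpoint || echo "warning: Stage-1 checkpoint missing"
bash train_stage2.sh 2>&1 | tee stage2_training.log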

5. Inference

The VLV framework supports two types of inference:

5.1 Caption Generation

Generate detailed captions for input images using the trained VLV model.

bash Caption_inference.sh

Configuration Options:

IMAGE_PATH="./assets"                          # Input image directory
CKPT_DIR="path/to/stage2/checkpoint"          # Stage-2 checkpoint path
OUTPUT_DIR="./captions.json"                  # Output JSON file
GUIDANCE_SCALE=2.0                            # Guidance scale for generation
NUM_INFERENCE_STEPS=50                        # Number of inference steps

Output Format:

The script generates a JSON file with the following structure:

{
  "image1.jpg": "A detailed caption describing the image content...",
  "image2.jpg": "Another detailed caption...",
  ...
}
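
The captions can be inspected directly from the shell; the snippet below assumes jq is installed and that the output JSON sits in the working directory.

# Sketch: print the first five image/caption pairs from captions.json.
jq -r 'to_entries[] | "\(.key)\t\(.value)"' captions.json | head -n 5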

5.2 Image Reconstruction

Reconstruct images from their learned representations to verify the quality of the information bottleneck.

bash Image_inference.sh

Configuration Options:

# Edit Image_inference.sh to customize inference
IMAGE_PATH="./assets"                          # Input image directory
CKPT_DIR="./pretrained_checkpoints/stage1_checkpoint"  # Stage-1 checkpoint path
OUTPUT_DIR="./reconstructed_images"           # Output directory
GUIDANCE_SCALE=2.0                            # Guidance scale
NUM_INFERENCE_STEPS=50                        # Number of diffusion steps
SEEDS=42                                      # Random seed for reproducibility
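
For qualitative checks it can help to reconstruct with several seeds. The sketch below assumes Image_inference.sh picks up SEEDS and OUTPUT_DIR from the environment; if the script hard-codes these variables, edit them in place instead.

# Sketch: sweep a few seeds, writing each run to its own output directory.
for seed in 0 42 1234; do
    SEEDS=$seed OUTPUT_DIR="./reconstructed_images/seed_${seed}" bash Image_inference.sh
done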

Output:

  • Reconstructed Images: Generated images from the learned representations
  • Side-by-Side Comparisons: Original and reconstructed images concatenated
  • Quality Metrics: FID scores for quantitative evaluation

6. Troubleshooting

Common Issues and Solutions

Memory Issues

  • Problem: CUDA out of memory during training
  • Solution: Reduce the batch size, enable gradient checkpointing, or use mixed precision (fp16/bf16); pay particular attention to the global batch size in Stage-1 training (see the sketch below).
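
As a concrete example, the sketch below halves the Stage-1 global batch size in place (per-GPU batch becomes 32 / 8 = 4); the variable name mirrors the Stage-1 listing, and precision or checkpointing settings would be edited the same way.

# Sketch: reduce the Stage-1 global batch size (GNU sed), then relaunch.
sed -i 's/^global_batch_size=.*/global_batch_size=32/' train_stage1.sh
bash train_stage1.sh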

Checkpoint Loading Issues

  • Problem: KeyError when loading Stage-1 checkpoint in Stage-2
  • Solution: Ensure the checkpoint path points to the correct Stage-1 model directory

WandB Issues

  • Problem: WandB login or API key issues
  • Solution: Set the WANDB_API_KEY environment variable or run wandb login (see below)
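
Both options look like this in practice:

# Option 1: export the API key before launching training.
export WANDB_API_KEY=<your-api-key>
# Option 2: log in interactively once per machine.
wandb login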

Dataset Issues

  • Problem: DataLoader errors or file not found
  • Solution: Verify the dataset paths in the configuration files and ensure the required JSON files exist; for Stage-2 training, make sure the image-caption dataloader is constructed first.

TODO

  • [ ] Working on Stable Diffusion XL and Stable Diffusion 3.5.

Citation

@article{zhang2025vision,
  title   = {Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models},
  author  = {Zhang, Tiezheng and Li, Yitong and Chou, Yu-Cheng and Chen, Jieneng and Yuille, Alan and Wei, Chen and Xiao, Junfei},
  journal = {arXiv preprint arXiv:2507.07104},
  year    = {2025}
}
