[Read our arXiv Paper] [Project Page]
Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda Shapiro, Ranjay Krishna
Multimodal language models (MLMs) struggle with visual reasoning tasks that require intermediate visual representations like depth maps or object bounding boxes, which they cannot naturally produce. To address this, we introduce Aurora, a method that augments MLMs with Perception Tokens, tokenized image representations that serve as auxiliary reasoning tools—resulting in significant improvements across visual benchmarks and enabling more effective multimodal reasoning.
Similar to LLaVA, follow these steps to set up the environment:
git clone https://github.com/mahtabbigverdi/Aurora-perception.git
cd Aurora-perception/LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install peft==0.11.1
pip install flash-attn==2.5.9.post1
To generate depth tokens, follow the installation and usage instructions provided in the AiT repository. Follow these steps to prepare the dataset:
- Download ADE20k
Get the ADE20K dataset from the official website.
- Generate Pseudo Depth Maps
Use DepthAnything to generate grayscale pseudo depth maps for the ADE images:
python run.py --encoder vits --img-path ADE20k --outdir ADE_depth --pred-only --grayscale
- Train VQVAE
Use the configuration file provided in this repo to train the VQVAE model from AiT:
cd AiT/vae
python -m torch.distributed.launch --nproc_per_node=1 train_depth_vqvae_dist.py configs/depth/ait_depth_vqvae.py
- Generate Depth Tokens
After training, use the script below to extract token sequences from depth maps. Update the script with your model path and input directory as needed:
cd AiT/vae
python get_codes.py
This will generate a .npy dictionary where keys are image filenames and values are depth token sequences in the format:
<DEPTH_START> <DEPTH_i1> ... <DEPTH_i100> <DEPTH_END>
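To sanity-check the output before converting it to QA data, here is a minimal loading sketch. It assumes the dictionary was written with np.save and that the output file is named ADE_codes.npy (the name referenced in the next step); the stored values may be token strings or integer id lists depending on how get_codes.py is configured.

```python
import numpy as np

# Load the filename -> depth-token-sequence dictionary (filename is an assumption).
codes = np.load("ADE_codes.npy", allow_pickle=True).item()

# Inspect one entry.
name, tokens = next(iter(codes.items()))
print(name)    # image filename
print(tokens)  # e.g. "<DEPTH_START> <DEPTH_12> ... <DEPTH_END>"
```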
- Create LLaVA-Style QA Data
Convert ADE_codes.npy into a LLaVA-compatible JSON format.
A sample output file is included: Data/train_depth_20k.json
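A minimal sketch of this conversion is shown below. It assumes the dictionary values are already formatted token strings; the prompt wording and output filename are illustrative, so mirror Data/train_depth_20k.json for the exact phrasing used in training.

```python
import json
import numpy as np

codes = np.load("ADE_codes.npy", allow_pickle=True).item()

records = []
for i, (filename, depth_tokens) in enumerate(codes.items()):
    records.append({
        "id": f"depth_{i}",
        "image": filename,
        "conversations": [
            # Prompt wording is illustrative; match the sample file Data/train_depth_20k.json.
            {"from": "human", "value": "<image>\nEstimate the depth map of the image."},
            {"from": "gpt", "value": depth_tokens},
        ],
    })

# Written to the working directory so the provided sample file is not overwritten.
with open("train_depth_20k.json", "w") as f:
    json.dump(records, f)
```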
- Prepare CoT (Chain-of-Thought) Data with Visual Markers
Aurora also uses multitask data including CoT-style visual reasoning. For this:
- Use the 500 randomly selected ADE images stored in AiT/vae/ADE_500files.npy
- Add 2–5 visual markers per image using the notebook AiT/vae/add_marks_ade.ipynb
Output: a folder ADE_blink/ with 500 marked images
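The notebook defines the exact marker style and placement. As a rough illustration only, the sketch below (with a hypothetical add_markers helper) shows the general idea of stamping a few labeled markers onto an image with PIL.

```python
import random
from PIL import Image, ImageDraw

def add_markers(path, out_path, n_markers=3, radius=8):
    """Illustrative only: stamp n_markers labeled dots at random positions."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for label in range(1, n_markers + 1):
        x = random.randint(radius, w - radius)
        y = random.randint(radius, h - radius)
        draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="red")
        draw.text((x + radius + 2, y - radius), str(label), fill="red")
    img.save(out_path)
```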
- Merge Depth & CoT Data for Curriculum Learning
Aurora follows a curriculum learning approach:
- Early epochs focus on depth generation
- Later epochs gradually include more CoT-style reasoning data
To prepare this:
Use AiT/vae/create_depth_annealing_data.ipynb
This creates a single JSON file:
train_depth_annealing_data.json
It contains the concatenated training data for all 10 epochs (used with a SequentialSampler in the DataLoader).
You can download train_depth_annealing_data.json from here.
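The per-epoch mixing ratios are defined in the notebook. The sketch below only illustrates the concatenation idea; the CoT filename and the linear ramp are assumptions, not the repository's exact schedule.

```python
import json
import random

# Filenames are illustrative: train_depth_20k.json is the depth QA file from the
# earlier step, and train_cot.json stands in for the CoT/marker QA data.
with open("Data/train_depth_20k.json") as f:
    depth_data = json.load(f)
with open("Data/train_cot.json") as f:
    cot_data = json.load(f)

num_epochs = 10
combined = []
for epoch in range(num_epochs):
    cot_fraction = epoch / (num_epochs - 1)          # illustrative schedule: 0.0 -> 1.0
    n_cot = int(cot_fraction * len(cot_data))
    epoch_data = depth_data + random.sample(cot_data, n_cot)
    random.shuffle(epoch_data)
    combined.extend(epoch_data)                      # consumed in order by a SequentialSampler

with open("train_depth_annealing_data.json", "w") as f:
    json.dump(combined, f)
```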
Just as with depth estimation, follow these steps to prepare the data for the counting task:
- Download the LVIS training dataset from the official website and install pycocotools.
- Download annotations.json from here and put it in Aurora-perception/Bbox.
- Open and run the notebook Bbox/create_bbox_annealing_data.ipynb, following the instructions in each cell. This will generate the training data for the counting task; you can find the data in Data/train_lvis_annealing_data.json.
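For reference, the sketch below shows how per-image instance counts can be read from the annotations with pycocotools, assuming the downloaded annotations.json follows the COCO/LVIS schema; the actual QA construction is done in the notebook.

```python
from collections import Counter
from pycocotools.coco import COCO

# Index the annotation file (path assumes the layout described above).
coco = COCO("Bbox/annotations.json")

# Count instances per category name for one example image.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=[img_id]))
counts = Counter(coco.loadCats([a["category_id"]])[0]["name"] for a in anns)
print(counts)
```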
We follow the LoRA fine-tuning instructions from the original LLaVA repository, with a few modifications to train.py and the fine-tuning scripts to support the introduction of new tokens.
You can find the relevant scripts under LLaVA/scripts/v1_5.
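For context, the token-expansion part of those modifications amounts to something like the sketch below, using the standard Hugging Face add_tokens / resize_token_embeddings APIs that LLaVA builds on. The token names and codebook size are illustrative; check the scripts for the exact values.

```python
def add_depth_tokens(tokenizer, model, codebook_size=128):
    """Add the depth Perception Tokens to the tokenizer and grow the model's
    embedding table to match. Names and codebook size are illustrative."""
    depth_tokens = (
        ["<DEPTH_START>", "<DEPTH_END>"]
        + [f"<DEPTH_{i}>" for i in range(codebook_size)]
    )
    num_added = tokenizer.add_tokens(depth_tokens, special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))
    return num_added
```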
depth estimation:
cd LLaVA
bash scripts/v1_5/finetune_task_lora_depth_annealing.sh
counting:
cd LLaVA
bash scripts/v1_5/finetune_task_lora_lvis_annealing.sh
You can download model checkpoints from here.
We use the same evaluation script as the original LLaVA repo, with minor modifications in LLaVA/llava/eval/model_vqa.py (e.g., changing the temperature parameter).
To generate answers for a .jsonl question file, use the following command:
cd LLaVA/llava/eval
python model_vqa.py --model-path /path/to/your/model --question-file /path/to/questions.jsonl --coordinates-data False --depth-data True --image-folder /path/to/image/folder --answers-file /path/to/output/file
You can then use a separate Python script or a language model to validate the answers against the ground truths.
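A minimal sketch of such a validation script is shown below. The field names assume LLaVA's model_vqa.py answer format ("question_id", "text") and a hypothetical ground-truth file ground_truth.jsonl with "question_id"/"answer" fields; adjust them to your files.

```python
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

answers = {r["question_id"]: r["text"] for r in load_jsonl("answers.jsonl")}
gts = {r["question_id"]: r["answer"] for r in load_jsonl("ground_truth.jsonl")}

# Simple containment check of the ground truth inside the generated answer.
correct = sum(
    str(gt).strip().lower() in answers.get(qid, "").strip().lower()
    for qid, gt in gts.items()
)
print(f"accuracy: {correct / len(gts):.3f}")
```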
Evaluation files (images, questions, and answers) for the HardBLINK benchmark are available in the Data/evals directory.
Note:
If you want to enable constrained decoding, replace the utils.py file in the Transformers package (transformers/generation/utils.py) with the version provided in Aurora-perception/utils.py.
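The replacement utils.py is the repository's actual mechanism. Purely as an illustration of the same idea with stock Transformers, generation can also be restricted through the prefix_allowed_tokens_fn hook of generate; the token names and codebook size below are illustrative.

```python
def build_depth_constraint(tokenizer, codebook_size=128):
    """Return a prefix_allowed_tokens_fn that only allows depth tokens (and the
    end token) once <DEPTH_START> has been generated. Names/size are illustrative."""
    depth_ids = tokenizer.convert_tokens_to_ids(
        [f"<DEPTH_{i}>" for i in range(codebook_size)] + ["<DEPTH_END>"]
    )
    start_id = tokenizer.convert_tokens_to_ids("<DEPTH_START>")

    def allowed_tokens(batch_id, input_ids):
        if start_id in input_ids.tolist():
            return depth_ids
        return list(range(len(tokenizer)))  # no restriction before <DEPTH_START>

    return allowed_tokens

# usage: model.generate(**inputs, prefix_allowed_tokens_fn=build_depth_constraint(tokenizer))
```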
- Visual Instruction Tuning
- All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
- BLINK: Multimodal Large Language Models Can See but Not Perceive
@article{bigverdi2024perception,
title={Perception Tokens Enhance Visual Reasoning in Multimodal Language Models},
author={Bigverdi, Mahtab and Luo, Zelun and Hsieh, Cheng-Yu and Shen, Ethan and Chen, Dongping and Shapiro, Linda G and Krishna, Ranjay},
journal={arXiv preprint arXiv:2412.03548},
year={2024}
}