[Read our arXiv Paper] [Project Page]
Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda Shapiro, Ranjay Krishna
Multimodal language models (MLMs) struggle with visual reasoning tasks that require intermediate visual representations like depth maps or object bounding boxes, which they cannot naturally produce. To address this, we introduce Aurora, a method that augments MLMs with Perception Tokens, tokenized image representations that serve as auxiliary reasoning tools—resulting in significant improvements across visual benchmarks and enabling more effective multimodal reasoning.
Similar to LLaVA, follow these steps to set up the environment:
git clone https://github.com/mahtabbigverdi/Aurora-perception.git
cd Aurora-perception/LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install peft==0.11.1
pip install flash-attn==2.5.9.post1
To generate depth tokens, follow the installation and usage instructions provided in the AiT repository. Follow these steps to prepare the dataset:
- Download ADE20k
Get the ADE20K dataset from the official website.
- Generate Pseudo Depth Maps
Use DepthAnything to generate grayscale pseudo depth maps for the ADE images:
python run.py --encoder vits --img-path ADE20k --outdir ADE_depth --pred-only --grayscale
- Train VQVAE
Use the configuration file provided in this repo to train the VQVAE model from AiT:
cd AiT/vae
python -m torch.distributed.launch --nproc_per_node=1 train_depth_vqvae_dist.py configs/depth/ait_depth_vqvae.py
- Generate Depth Tokens
After training, use the script below to extract token sequences from depth maps. Update the script with your model path and input directory as needed:
cd AiT/vae
python get_codes.py
This will generate a .npy dictionary where keys are image filenames and values are depth token sequences in the format:
<DEPTH_START> <DEPTH_i1> ... <DEPTH_i100> <DEPTH_END>
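To sanity-check the output before converting it to QA data, here is a minimal loading sketch. It assumes the dictionary was written with np.save and that the output file is named ADE_codes.npy (the name referenced in the next step); the stored values may be token strings or integer id lists depending on how get_codes.py is configured.

```python
import numpy as np

# Load the filename -> depth-token-sequence dictionary (filename is an assumption).
codes = np.load("ADE_codes.npy", allow_pickle=True).item()

# Inspect one entry.
name, tokens = next(iter(codes.items()))
print(name)    # image filename
print(tokens)  # e.g. "<DEPTH_START> <DEPTH_12> ... <DEPTH_END>"
```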
- Create LLaVA-Style QA Data
Convert ADE_codes.npy into a LLaVA-compatible JSON format.
A sample output file is included: Data/train_depth_20k.json
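A minimal sketch of this conversion is shown below. It assumes the dictionary values are already formatted token strings; the prompt wording and output filename are illustrative, so mirror Data/train_depth_20k.json for the exact phrasing used in training.

```python
import json
import numpy as np

codes = np.load("ADE_codes.npy", allow_pickle=True).item()

records = []
for i, (filename, depth_tokens) in enumerate(codes.items()):
    records.append({
        "id": f"depth_{i}",
        "image": filename,
        "conversations": [
            # Prompt wording is illustrative; match the sample file Data/train_depth_20k.json.
            {"from": "human", "value": "<image>\nEstimate the depth map of the image."},
            {"from": "gpt", "value": depth_tokens},
        ],
    })

# Written to the working directory so the provided sample file is not overwritten.
with open("train_depth_20k.json", "w") as f:
    json.dump(records, f)
```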
- Prepare CoT (Chain-of-Thought) Data with Visual Markers
Aurora also uses multitask data including CoT-style visual reasoning. For this:
- Use the 500 randomly selected ADE images stored in AiT/vae/ADE_500files.npy
- Add 2–5 visual markers per image using the notebook AiT/vae/add_marks_ade.ipynb
Output: a folder ADE_blink/ with 500 marked images
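The notebook defines the exact marker style and placement. As a rough illustration only, the sketch below (with a hypothetical add_markers helper) shows the general idea of stamping a few labeled markers onto an image with PIL.

```python
import random
from PIL import Image, ImageDraw

def add_markers(path, out_path, n_markers=3, radius=8):
    """Illustrative only: stamp n_markers labeled dots at random positions."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for label in range(1, n_markers + 1):
        x = random.randint(radius, w - radius)
        y = random.randint(radius, h - radius)
        draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="red")
        draw.text((x + radius + 2, y - radius), str(label), fill="red")
    img.save(out_path)
```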
- Merge Depth & CoT Data for Curriculum Learning
Aurora follows a curriculum learning approach:
- Early epochs focus on depth generation
- Later epochs gradually include more CoT-style reasoning data
To prepare this:
Use AiT/vae/create_depth_annealing_data.ipynb
This creates a single JSON file:
train_depth_annealing_data.json
It contains the concatenated training data for all 10 epochs (used with a SequentialSampler in the DataLoader).
You can download train_depth_annealing_data.json from here.
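The per-epoch mixing ratios are defined in the notebook. The sketch below only illustrates the concatenation idea; the CoT filename and the linear ramp are assumptions, not the repository's exact schedule.

```python
import json
import random

# Filenames are illustrative: train_depth_20k.json is the depth QA file from the
# earlier step, and train_cot.json stands in for the CoT/marker QA data.
with open("Data/train_depth_20k.json") as f:
    depth_data = json.load(f)
with open("Data/train_cot.json") as f:
    cot_data = json.load(f)

num_epochs = 10
combined = []
for epoch in range(num_epochs):
    cot_fraction = epoch / (num_epochs - 1)          # illustrative schedule: 0.0 -> 1.0
    n_cot = int(cot_fraction * len(cot_data))
    epoch_data = depth_data + random.sample(cot_data, n_cot)
    random.shuffle(epoch_data)
    combined.extend(epoch_data)                      # consumed in order by a SequentialSampler

with open("train_depth_annealing_data.json", "w") as f:
    json.dump(combined, f)
```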
Just as with depth estimation, follow these steps to prepare the data for the counting task:
- Download the LVIS training dataset from the official website and install pycocotools.
- Download annotations.json from here and put it in Aurora-perception/Bbox.
- Open and run the notebook Bbox/create_bbox_annealing_data.ipynb, following the instructions in each cell. This will generate the training data for the counting task; you can find the data in Data/train_lvis_annealing_data.json.
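For reference, the sketch below shows how per-image instance counts can be read from the annotations with pycocotools, assuming the downloaded annotations.json follows the COCO/LVIS schema; the actual QA construction is done in the notebook.

```python
from collections import Counter
from pycocotools.coco import COCO

# Index the annotation file (path assumes the layout described above).
coco = COCO("Bbox/annotations.json")

# Count instances per category name for one example image.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=[img_id]))
counts = Counter(coco.loadCats([a["category_id"]])[0]["name"] for a in anns)
print(counts)
```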
We follow the LoRA fine-tuning instructions from the original LLaVA repository, with a few modifications to train.py and the fine-tuning scripts to support the introduction of new tokens.
You can find the relevant scripts under LLaVA/scripts/v1_5.
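For context, the token-expansion part of those modifications amounts to something like the sketch below, using the standard Hugging Face add_tokens / resize_token_embeddings APIs that LLaVA builds on. The token names and codebook size are illustrative; check the scripts for the exact values.

```python
def add_depth_tokens(tokenizer, model, codebook_size=128):
    """Add the depth Perception Tokens to the tokenizer and grow the model's
    embedding table to match. Names and codebook size are illustrative."""
    depth_tokens = (
        ["<DEPTH_START>", "<DEPTH_END>"]
        + [f"<DEPTH_{i}>" for i in range(codebook_size)]
    )
    num_added = tokenizer.add_tokens(depth_tokens, special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))
    return num_added
```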
depth estimation:
cd LLaVA
bash scripts/v1_5/finetune_task_lora_depth_annealing.sh
counting:
cd LLaVA
bash scripts/v1_5/finetune_task_lora_lvis_annealing.sh
You can download model checkpoints from here.
We use the same evaluation script as the original LLaVA repo, with minor modifications in LLaVA/llava/eval/model_vqa.py (e.g., changing the temperature parameter).
To generate answers for a .jsonl question file, use the following command:
cd LLaVA/llava/eval
python model_vqa.py --model-path /path/to/your/model --question-file /path/to/questions.jsonl --coordinates-data False --depth-data True --image-folder /path/to/image/folder --answers-file /path/to/output/file
You can then use a separate Python script or a language model to validate the answers against the ground truths.
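A minimal sketch of such a validation script is shown below. The field names assume LLaVA's model_vqa.py answer format ("question_id", "text") and a hypothetical ground-truth file ground_truth.jsonl with "question_id"/"answer" fields; adjust them to your files.

```python
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

answers = {r["question_id"]: r["text"] for r in load_jsonl("answers.jsonl")}
gts = {r["question_id"]: r["answer"] for r in load_jsonl("ground_truth.jsonl")}

# Simple containment check of the ground truth inside the generated answer.
correct = sum(
    str(gt).strip().lower() in answers.get(qid, "").strip().lower()
    for qid, gt in gts.items()
)
print(f"accuracy: {correct / len(gts):.3f}")
```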
Evaluation files (images, questions, and answers) for the HardBLINK benchmark are available in the Data/evals directory.
Note:
If you want to enable constrained decoding, replace the utils.py file in the Transformers package (transformers/generation/utils.py) with the version provided in Aurora-perception/utils.py.
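The replacement utils.py is the repository's actual mechanism. Purely as an illustration of the same idea with stock Transformers, generation can also be restricted through the prefix_allowed_tokens_fn hook of generate; the token names and codebook size below are illustrative.

```python
def build_depth_constraint(tokenizer, codebook_size=128):
    """Return a prefix_allowed_tokens_fn that only allows depth tokens (and the
    end token) once <DEPTH_START> has been generated. Names/size are illustrative."""
    depth_ids = tokenizer.convert_tokens_to_ids(
        [f"<DEPTH_{i}>" for i in range(codebook_size)] + ["<DEPTH_END>"]
    )
    start_id = tokenizer.convert_tokens_to_ids("<DEPTH_START>")

    def allowed_tokens(batch_id, input_ids):
        if start_id in input_ids.tolist():
            return depth_ids
        return list(range(len(tokenizer)))  # no restriction before <DEPTH_START>

    return allowed_tokens

# usage: model.generate(**inputs, prefix_allowed_tokens_fn=build_depth_constraint(tokenizer))
```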
- Visual Instruction Tuning
- All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
- BLINK: Multimodal Large Language Models Can See but Not Perceive
@article{bigverdi2024perception,
title={Perception Tokens Enhance Visual Reasoning in Multimodal Language Models},
author={Bigverdi, Mahtab and Luo, Zelun and Hsieh, Cheng-Yu and Shen, Ethan and Chen, Dongping and Shapiro, Linda G and Krishna, Ranjay},
journal={arXiv preprint arXiv:2412.03548},
year={2024}
}