
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy

arXiv | Project Page | Puffin Model | Puffin-4M Dataset | Demo

Introduction

We make the first attempt to seamlessly integrate camera geometry into a unified multimodal model, introducing Puffin, a camera-centric framework that advances multimodal spatial intelligence.

📝 Changelog & News

  • 2025.10.10: The paper, project page, code, model, dataset, and demo of Puffin are online.
  • 2026.01.10: The scripts for the camera-centric evaluation have been released.
  • Release the scripts of the dataset construction pipeline.
  • Release the camera captions (generated by our method) for commonly used large-scale text-to-image datasets, such as megalith-10m.

🖥️ Requirements and Installation

The code has been implemented with PyTorch 2.7.0 and CUDA 12.6.

Example installation commands are provided below:

# git clone this repository
git clone https://github.com/KangLiao929/Puffin
cd Puffin

# create new anaconda env
conda create -n Puffin python=3.10
conda activate Puffin

# install python dependencies
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
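
After installation, a quick sanity check (not part of the repository) can confirm that the PyTorch build and CUDA runtime are picked up as expected, for example with the following Python snippet:

# Quick sanity check (not part of the repository): print the installed PyTorch
# version, the CUDA version it was built against, and whether a GPU is visible.
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())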

πŸ‚ Demo & Quick Inference

We release three model variants to accommodate different application needs: Puffin-Base, Puffin-Thinking, and Puffin-Instruct. Puffin-Base provides a foundation model for unified camera-centric understanding and generation; Puffin-Thinking enhances spatial reasoning and generation by thinking with camera; and Puffin-Instruct is optimized by instruction tuning, supporting cross-view tasks and complex multimodal interactions.

Download the model checkpoints from 🤗 KangLiao/Puffin and organize them as follows:

Puffin/
├── checkpoints
│   ├── Puffin-Align.pth # provided for customized SFT
│   ├── Puffin-Base.pth
│   ├── Puffin-Thinking.pth
│   └── Puffin-Instruct.pth

It is recommended to use the following command to download the checkpoints:

# pip install -U "huggingface_hub[cli]"
huggingface-cli download KangLiao/Puffin  --local-dir checkpoints --repo-type model
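
Alternatively, the checkpoints can be fetched programmatically with the huggingface_hub Python API (a minimal sketch; it assumes huggingface_hub is installed and mirrors the directory layout above):

# Minimal sketch: download the Puffin checkpoints via the huggingface_hub API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="KangLiao/Puffin",   # model repository on the Hugging Face Hub
    repo_type="model",
    local_dir="checkpoints",     # matches the expected layout above
)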

Camera-controllable Image Generation

Images can be generated from a text prompt and camera prompts (roll: -r, pitch: -p, vertical field-of-view: -f, all in radians) using the following command:

export PYTHONPATH=./:$PYTHONPATH
python scripts/demo/generation.py configs/pipelines/stage_2_base.py \
          --checkpoint checkpoints/Puffin-Base.pth --output generation_result.jpg \
          --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
          -r -0.3939 -p 0.0277 -f 0.7595
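
The camera prompts are plain radian values; if your roll, pitch, and field-of-view are given in degrees, they can be converted with a small Python snippet (illustrative only; the numbers below roughly reproduce the example values used above):

import math

# Illustrative only: convert degree-valued camera prompts into the radians
# expected by -r / -p / -f. These degrees roughly match the example above.
roll_deg, pitch_deg, fov_deg = -22.57, 1.59, 43.51
print(math.radians(roll_deg), math.radians(pitch_deg), math.radians(fov_deg))
# -> approximately -0.3939 0.0277 0.7595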

To enable the thinking mode for image generation, simply switch to the thinking configuration and checkpoint and append the --thinking flag:

python scripts/demo/generation.py configs/pipelines/stage_3_thinking.py \
          --checkpoint checkpoints/Puffin-Thinking.pth --output generation_result_thinking.jpg \
          --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
          -r -0.3939 -p 0.0277 -f 0.7595 \
          --thinking

Camera Understanding

The camera understanding results (scene descriptions and camera parameters) can be obtained using the following command:

python scripts/demo/understanding.py configs/pipelines/stage_2_base.py \
          --checkpoint checkpoints/Puffin-Base.pth --image_path assets/test_img/test.jpg \
          --save_dir vis_results/

The visualization results (pixel-wise camera maps) are also saved to the directory specified by --save_dir.

As with camera-controllable generation, the thinking mode can be enabled by switching to the thinking configuration and checkpoint and appending the --thinking flag:

python scripts/demo/understanding.py configs/pipelines/stage_3_thinking.py \
          --checkpoint checkpoints/Puffin-Thinking.pth --image_path assets/test_img/test.jpg \
          --save_dir vis_results/ \
          --thinking

World Exploration

A target view can be generated from an initial view and camera prompts (roll: -r, pitch: -p, yaw: -y, all in radians) using the following command:

python scripts/demo/world_exploration.py configs/pipelines/stage_4_instruction_tuning.py \
          --checkpoint checkpoints/Puffin-Instruct.pth --init_image assets/test_img/test_cross_view.jpg \
          --output world_exploration_result.jpg \
          -r 0.1 -p -0.1 -y 0.2

The above process can be extended to 3D world generation (e.g., Figure A8 in the paper), in the spirit of world models, where multi-view results are generated around an initial view:

python scripts/demo/world_exploration_3D.py configs/pipelines/stage_4_instruction_tuning.py \
          --checkpoint checkpoints/Puffin-Instruct.pth --init_view_path assets/test_img/ \
          --output world_exploration_3D/
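
The per-view camera prompts for such a sweep are again radian offsets relative to the initial view; for example, evenly spaced yaw angles around a full turn could be laid out as follows (illustrative only, not tied to the script's internals):

import math

# Illustrative only: sample N yaw offsets (in radians) evenly around a full
# turn, e.g., to drive repeated world-exploration calls from one initial view.
num_views = 8
yaw_offsets = [2 * math.pi * i / num_views for i in range(num_views)]
print([round(y, 4) for y in yaw_offsets])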

Spatial Imagination

Given an initial view and an expected relative location (left, behind, or right), Puffin can imagine the scene description of the target view using the following command:

python scripts/demo/spatial_imagination.py configs/pipelines/stage_4_instruction_tuning.py \
          --checkpoint checkpoints/Puffin-Instruct.pth --image assets/test_img/test_cross_view.jpg \
          --location behind

Photographic Guidance

Puffin can suggest camera parameter adjustments from an initial view to produce images with higher photographic aesthetics. The deviation (pitch and yaw) between the target and initial images can be obtained using the following command:

python scripts/demo/photographic_guidance.py configs/pipelines/stage_4_instruction_tuning.py \
          --checkpoint checkpoints/Puffin-Instruct.pth --image assets/test_img/test_cross_view.jpg

Puffin-4M Dataset

Datasets and benchmarks that span vision, language, and camera modalities remain scarce in the domain of spatial multimodal intelligence. To address this gap, we introduce Puffin-4M, a large-scale, high-quality dataset comprising 4 million vision-language-camera triplets. We release the training data and evaluation benchmark in 🤗 KangLiao/Puffin-4M. The whole dataset is approximately 449 GB in size. Note that we omit the camera maps from the uploaded training data due to their large total size (~3 MB each, amounting to ~11.4 TB in total). However, these maps can be easily generated from the captions using the following command:

python scripts/camera/cam_dataset.py \
          --input_root Puffin-4M/training_data/cap_folder \
          --output_root Puffin-4M/training_data/cam_folder

The scripts of the dataset construction pipeline for Puffin-4M will be released in Dataset Pipeline soon.

✈️ Training

We adopt a multi-stage training strategy. In the first stage, the vision encoder, LLM, and diffusion model are aligned. In the SFT stage, the model is then jointly optimized on both the base and thinking datasets. Finally, an instruction-tuning stage covers various cross-view generation and understanding tasks. The implementation details are provided in Training.

🖼️ Evaluation

We evaluate our camera-centric generation and understanding performance on public datasets and our constructed benchmark (🤗 KangLiao/Puffin-4M/benchmark).

For camera understanding, we conduct evaluations on three common datasets: MegaDepth, TartanAir, and LaMAR. Notably, images from these datasets are primarily captured or simulated in well-structured environments, and the camera parameters of some datasets are limited in distribution. To complement these settings, we construct a more challenging dataset, Puffin-Und, designed for a comprehensive assessment of camera understanding. This dataset contains 1,000 images spanning diverse camera configurations and scenarios (🤗 KangLiao/Puffin-4M/benchmark/Puffin-Und). Additionally, since no benchmark dataset exists for text-to-image generation with precise camera parameters, we construct Puffin-Gen to fill this gap. This dataset consists of 650 caption–camera pairs spanning diverse scenarios and camera configurations (🤗 KangLiao/Puffin-4M/benchmark/Puffin-Gen). The evaluation details are provided in Evaluation.

📚 Citation

If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:

  @article{liao2025puffin,
    title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
    author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
    journal={arXiv preprint arXiv:2510.08673},
    year={2025}
  }

🗞️ License

This project is licensed under NTU S-Lab License 1.0.

🙏 Acknowledgement

The project builds upon OpenUni, MetaQuery, Qwen2.5, RADIOv3, SD3, and GeoCalib.
