DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

ArXiv | Hugging Face Dataset

Weihao Xuan, Junjue Wang, Heli Qi, Zihang Chen, Zhuo Zheng, Yanfei Zhong, Junshi Xia, Naoto Yokoya

About

DynamicVL is a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. This repository ships the DVL-Suite dataset, task-specific benchmarks, and evaluation scripts covering both vision-language tasks (multiple-choice QA, reports, and captions) and pixel-level referring change detection.

News

  • 2025/08   DynamicVL was accepted to NeurIPS 2025! We will add encoder-decoder-based semantic change detection implementations to this repo. Stay tuned!

Environment Setup

# Create the conda environment
conda create -n dvl python=3.10 -y
conda activate dvl

# Install the package
(dvl): pip install -e .

# Optional: manually install PyTorch if the vLLM dependency conflicts with your environment
# Note: swap cu128 for an older CUDA wheel index if it conflicts with your CUDA driver.
(dvl): pip install -U torch torchvision xformers --index-url https://download.pytorch.org/whl/cu128

# Optional: fix "version `GLIBCXX_3.4.32' not found" errors
(dvl): conda install -c conda-forge gcc=13 gxx=13 -y
(dvl): export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
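
After installation, a quick sanity check (a minimal sketch, not part of the repo) confirms that the installed PyTorch build can see your GPU before launching any evaluation:

# Optional sanity check: verify PyTorch and CUDA are usable after installation.
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # should print True if the CUDA wheel matches your driver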

Data Setup

Download the DVL-Suite dataset and unzip the training and test archives:

mkdir data && cd data
unzip train.zip
unzip test.zip

Expected directory layout:

data/
├── train/                          # DVL-Instruct (Training Set)
│   ├── images/{city}/{region}/{image_id_timestamp}.tif
│   ├── cd_sem_masks/
│   ├── cd_refer_seg_masks/
│   ├── regional_caption/
│   ├── metadata.json
│   ├── basic_change_choice_qa.json
│   ├── basic_change_report_qa.json
│   ├── change_speed_choice_qa.json
│   ├── change_speed_report_qa.json
│   ├── change_referring_seg_qa.json
│   ├── eco_assessment.json
│   ├── dense_temporal_caption.json
│   └── regional_caption.json
└── test/                           # DVL-Bench (Test Set)
    └── [same structure as train/]

Usage

Vision-Language Tasks

Load Data

from dvl.vqa.dataset import DynamicVLVQA

dataset = DynamicVLVQA(subset="BCA-QA", data_dir="data/train")
for item in dataset:
    # images: List[PIL.Image] across time
    # messages: multi-turn Q&A dicts
    # metadata: contains id, task_type, prompts, options_str, image_list, time_stamps
    print(item)
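
The metadata fields listed above are handy for quick dataset inspection, for example counting items per task type (a small sketch; it assumes dictionary access to the keys documented in the comments above):

# Count dataset items per task type using the metadata keys documented above.
from collections import Counter

from dvl.vqa.dataset import DynamicVLVQA

dataset = DynamicVLVQA(subset="BCA-QA", data_dir="data/train")
counts = Counter(item["metadata"]["task_type"] for item in dataset)
print(counts)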

Evaluate Open-Source Models (vLLM)

(dvl): python -m dvl.vqa.run_vllm \
    --model_id Qwen/Qwen2.5-VL-3B-Instruct \
    --subset BCA-QA

Available subsets:

  • BCA-QA - Basic Change Analysis (QA)
  • CSE-QA - Change Speed Estimation (QA)
  • BCA-Report - Basic Change Analysis (Report)
  • CSE-Report - Change Speed Estimation (Report)
  • DTC - Dense Temporal Caption
  • RCC - Regional Change Caption
  • EA - Environmental Assessment

Note: Set --batch_size 1 for llava-hf/llava-onevision-qwen2-7b-ov-hf to avoid GPU OOM.

Output: results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/ stores .jsonl predictions and .json summaries.
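
To sweep one model over every subset listed above, a simple driver script can invoke the same entry point in a loop (a convenience sketch; only the module path and flags come from the command above):

# Run the vLLM evaluation across all vision-language subsets for a single model.
import subprocess

subsets = ["BCA-QA", "CSE-QA", "BCA-Report", "CSE-Report", "DTC", "RCC", "EA"]
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

for subset in subsets:
    # Mirrors: python -m dvl.vqa.run_vllm --model_id <model> --subset <subset>
    subprocess.run(
        ["python", "-m", "dvl.vqa.run_vllm", "--model_id", model_id, "--subset", subset],
        check=True,
    )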

Evaluate Commercial Models (Azure OpenAI)

export AZURE_OPENAI_BASE="{your-azure-endpoint}"
export AZURE_OPENAI_KEY="{your-api-key}"
export AZURE_OPENAI_API_VERSION="{your-api-version}"

(dvl): python -m dvl.vqa.run_azure_openai \
    --model_id gpt-4o \
    --subset BCA-QA

Output: results/vqa/gpt-4o/ stores task-specific .jsonl predictions and .json metrics.

GPT-Based Evaluation for Reports and Captions

export AZURE_OPENAI_BASE="{your-azure-endpoint}"
export AZURE_OPENAI_KEY="{your-api-key}"
export AZURE_OPENAI_API_VERSION="{your-api-version}"

(dvl): python -m dvl.vqa.pretty_print.gpt_eval \
    --gpt_model_id gpt-4.1-mini \
    --eval_model_id "Qwen/Qwen2.5-VL-3B-Instruct" \
    --subset DTC

Supported subsets:

  • BCA-Report
  • CSE-Report
  • DTC
  • RCC

Output: results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/ includes GPT-scored .jsonl files (for example DTC.gpt-4.1-mini.jsonl).
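
To inspect these scores directly, a short sketch such as the one below reads a scored file and averages the results (the exact field name holding the score is an assumption; adjust it to match the actual records):

# Average the scores in a GPT-scored .jsonl file.
# NOTE: the "score" key is an assumed field name, not confirmed by the repo.
import json
from pathlib import Path

path = Path("results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/DTC.gpt-4.1-mini.jsonl")
records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
scores = [r["score"] for r in records if "score" in r]
print(f"{len(records)} records, mean score = {sum(scores) / max(len(scores), 1):.3f}")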

Aggregate Metrics

# Multi-choice QA tasks (BCA-QA, CSE-QA, EA)
(dvl): python -m dvl.vqa.pretty_print.acc_table

# Open-ended generation tasks (Reports & Captions)
(dvl): python -m dvl.vqa.pretty_print.gen_table --gpt_model_id gpt-4.1-mini

Tabulated metrics are printed to console and saved in results/vqa/.

Referring Change Detection

Load Data

from dvl.vqa.dataset import DynamicVLReferSeg

dataset = DynamicVLReferSeg(data_dir="data/train")
for item in dataset:
    # t1_image, t2_image: np.ndarray of shape (1024, 1024, 3)
    # gt_mask: binary change mask
    # messages: instruction-response history
    # cd_info: source/target land-cover classes and indices
    # metadata: contains the unique evaluation id
    print(item)

Evaluate Predictions

Organize predicted masks using item["metadata"]["id"] as the filename stem:

{your-pred-dir}/
├── change_referring_seg_qa_0.png
├── change_referring_seg_qa_1.png
└── ...
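
A minimal sketch for writing masks in this layout is shown below; only the item["metadata"]["id"] filename convention comes from the instructions above, while the placeholder predictor and the choice of the test split are assumptions to replace with your own inference code:

# Save predicted binary change masks so the evaluation scripts can find them by id.
from pathlib import Path

import numpy as np
from PIL import Image

from dvl.vqa.dataset import DynamicVLReferSeg


def predict_change_mask(t1_image, t2_image, messages):
    """Placeholder predictor: replace with your model's inference. Returns an empty mask."""
    return np.zeros(t1_image.shape[:2], dtype=np.uint8)


pred_dir = Path("preds")  # use this directory as {your-pred-dir}
pred_dir.mkdir(parents=True, exist_ok=True)

dataset = DynamicVLReferSeg(data_dir="data/test")
for item in dataset:
    mask = predict_change_mask(item["t1_image"], item["t2_image"], item["messages"])
    mask = (np.asarray(mask) > 0).astype(np.uint8) * 255  # store as a 0/255 PNG
    Image.fromarray(mask).save(pred_dir / f"{item['metadata']['id']}.png")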

Run the evaluation utilities:

# LISA-style binary IoU metrics
(dvl): python -m dvl.vqa.pretty_print.referseg_iou --pred_dir "{your-pred-dir}"

# MambaCD-style semantic change detection metrics
(dvl): python -m dvl.vqa.pretty_print.referseg_cd --pred_dir "{your-pred-dir}"

Scores are printed to console and stored alongside the submitted prediction masks.
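
For reference, the per-mask score underlying the LISA-style metrics is plain intersection-over-union between the predicted and ground-truth change masks. The sketch below illustrates that computation for a single pair; it is a simplified illustration (with an illustrative ground-truth path), not the official referseg_iou script:

# Simplified binary IoU for one predicted/ground-truth mask pair.
import numpy as np
from PIL import Image

pred = np.array(Image.open("preds/change_referring_seg_qa_0.png")) > 0
gt = np.array(Image.open("gt_masks/change_referring_seg_qa_0.png")) > 0  # illustrative path

intersection = np.logical_and(pred, gt).sum()
union = np.logical_or(pred, gt).sum()
iou = intersection / union if union > 0 else 1.0  # treat two empty masks as a perfect match
print(f"IoU = {iou:.4f}")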

Citation

If you find DynamicVL useful, please cite:

@article{xuan2025dynamicvl,
  title={DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding},
  author={Xuan, Weihao and Wang, Junjue and Qi, Heli and Chen, Zihang and Zheng, Zhuo and Zhong, Yanfei and Xia, Junshi and Yokoya, Naoto},
  journal={arXiv preprint arXiv:2505.21076},
  year={2025}
}

License

DynamicVL is released under the Apache-2.0 License.

Acknowledgements

DynamicVL builds on NAIP aerial imagery and on tools from the open-source multimodal community. We appreciate all contributors who benchmarked cutting-edge MLLMs on our dataset and shared feedback during the public release.
