DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

ArXiv | Hugging Face Dataset

Weihao Xuan, Junjue Wang, Heli Qi, Zihang Chen, Zhuo Zheng, Yanfei Zhong, Junshi Xia, Naoto Yokoya

About

DynamicVL is a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. This repository ships the DVL-Suite dataset, task-specific benchmarks, and evaluation scripts covering both vision-language tasks (multiple-choice QA, reports, and captions) and pixel-level referring change detection.

News

  • 2025/08   DynamicVL was accepted to NeurIPS 2025! We will add encoder-decoder-based semantic change detection implementations to this repo. Stay tuned!

Environment Setup

# Create the conda environment
conda create -n dvl python=3.10 -y
conda activate dvl

# Install the package
(dvl): pip install -e .

# Optional: manually install PyTorch if the vLLM dependency conflicts with your environment
# Note: swap cu128 for an older CUDA wheel index if it conflicts with your CUDA driver.
(dvl): pip install -U torch torchvision xformers --index-url https://download.pytorch.org/whl/cu128

# Optional: fix "version `GLIBCXX_3.4.32' not found" errors
(dvl): conda install -c conda-forge gcc=13 gxx=13 -y
(dvl): export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
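
After installation, a quick sanity check (a minimal sketch, not part of the repo) confirms that the installed PyTorch build can see your GPU before launching any evaluation:

# Optional sanity check: verify PyTorch and CUDA are usable after installation.
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # should print True if the CUDA wheel matches your driver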

Data Setup

Download the DVL-Suite dataset and unzip the training and test archives:

mkdir data && cd data
unzip train.zip
unzip test.zip

Expected directory layout:

data/
├── train/                          # DVL-Instruct (Training Set)
│   ├── images/{city}/{region}/{image_id_timestamp}.tif
│   ├── cd_sem_masks/
│   ├── cd_refer_seg_masks/
│   ├── regional_caption/
│   ├── metadata.json
│   ├── basic_change_choice_qa.json
│   ├── basic_change_report_qa.json
│   ├── change_speed_choice_qa.json
│   ├── change_speed_report_qa.json
│   ├── change_referring_seg_qa.json
│   ├── eco_assessment.json
│   ├── dense_temporal_caption.json
│   └── regional_caption.json
└── test/                           # DVL-Bench (Test Set)
    └── [same structure as train/]

Usage

Vision-Language Tasks

Load Data

from dvl.vqa.dataset import DynamicVLVQA

dataset = DynamicVLVQA(subset="BCA-QA", data_dir="data/train")
for item in dataset:
    # images: List[PIL.Image] across time
    # messages: multi-turn Q&A dicts
    # metadata: contains id, task_type, prompts, options_str, image_list, time_stamps
    print(item)
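
The metadata fields listed above are handy for quick dataset inspection, for example counting items per task type (a small sketch; it assumes dictionary access to the keys documented in the comments above):

# Count dataset items per task type using the metadata keys documented above.
from collections import Counter

from dvl.vqa.dataset import DynamicVLVQA

dataset = DynamicVLVQA(subset="BCA-QA", data_dir="data/train")
counts = Counter(item["metadata"]["task_type"] for item in dataset)
print(counts)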

Evaluate Open-Source Models (vLLM)

(dvl): python -m dvl.vqa.run_vllm \
    --model_id Qwen/Qwen2.5-VL-3B-Instruct \
    --subset BCA-QA

Available subsets:

  • BCA-QA - Basic Change Analysis (QA)
  • CSE-QA - Change Speed Estimation (QA)
  • BCA-Report - Basic Change Analysis (Report)
  • CSE-Report - Change Speed Estimation (Report)
  • DTC - Dense Temporal Caption
  • RCC - Regional Change Caption
  • EA - Environmental Assessment

Note: Set --batch_size 1 for llava-hf/llava-onevision-qwen2-7b-ov-hf to avoid GPU OOM.

Output: results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/ stores .jsonl predictions and .json summaries.
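
To sweep one model over every subset listed above, a simple driver script can invoke the same entry point in a loop (a convenience sketch; only the module path and flags come from the command above):

# Run the vLLM evaluation across all vision-language subsets for a single model.
import subprocess

subsets = ["BCA-QA", "CSE-QA", "BCA-Report", "CSE-Report", "DTC", "RCC", "EA"]
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

for subset in subsets:
    # Mirrors: python -m dvl.vqa.run_vllm --model_id <model> --subset <subset>
    subprocess.run(
        ["python", "-m", "dvl.vqa.run_vllm", "--model_id", model_id, "--subset", subset],
        check=True,
    )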

Evaluate Commercial Models (Azure OpenAI)

export AZURE_OPENAI_BASE="{your-azure-endpoint}"
export AZURE_OPENAI_KEY="{your-api-key}"
export AZURE_OPENAI_API_VERSION="{your-api-version}"

(dvl): python -m dvl.vqa.run_azure_openai \
    --model_id gpt-4o \
    --subset BCA-QA

Output: results/vqa/gpt-4o/ stores task-specific .jsonl predictions and .json metrics.

GPT-Based Evaluation for Reports and Captions

export AZURE_OPENAI_BASE="{your-azure-endpoint}"
export AZURE_OPENAI_KEY="{your-api-key}"
export AZURE_OPENAI_API_VERSION="{your-api-version}"

(dvl): python -m dvl.vqa.pretty_print.gpt_eval \
    --gpt_model_id gpt-4.1-mini \
    --eval_model_id "Qwen/Qwen2.5-VL-3B-Instruct" \
    --subset DTC

Supported subsets:

  • BCA-Report
  • CSE-Report
  • DTC
  • RCC

Output: results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/ includes GPT-scored .jsonl files (for example DTC.gpt-4.1-mini.jsonl).
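
To inspect these scores directly, a short sketch such as the one below reads a scored file and averages the results (the exact field name holding the score is an assumption; adjust it to match the actual records):

# Average the scores in a GPT-scored .jsonl file.
# NOTE: the "score" key is an assumed field name, not confirmed by the repo.
import json
from pathlib import Path

path = Path("results/vqa/Qwen--Qwen2.5-VL-3B-Instruct/DTC.gpt-4.1-mini.jsonl")
records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
scores = [r["score"] for r in records if "score" in r]
print(f"{len(records)} records, mean score = {sum(scores) / max(len(scores), 1):.3f}")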

Aggregate Metrics

# Multi-choice QA tasks (BCA-QA, CSE-QA, EA)
(dvl): python -m dvl.vqa.pretty_print.acc_table

# Open-ended generation tasks (Reports & Captions)
(dvl): python -m dvl.vqa.pretty_print.gen_table --gpt_model_id gpt-4.1-mini

Tabulated metrics are printed to console and saved in results/vqa/.

Referring Change Detection

Load Data

from dvl.vqa.dataset import DynamicVLReferSeg

dataset = DynamicVLReferSeg(data_dir="data/train")
for item in dataset:
    # t1_image, t2_image: np.ndarray of shape (1024, 1024, 3)
    # gt_mask: binary change mask
    # messages: instruction-response history
    # cd_info: source/target land-cover classes and indices
    # metadata: contains the unique evaluation id
    print(item)

Evaluate Predictions

Organize predicted masks using item["metadata"]["id"] as the filename stem:

{your-pred-dir}/
├── change_referring_seg_qa_0.png
├── change_referring_seg_qa_1.png
└── ...
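
A minimal sketch for writing masks in this layout is shown below; only the item["metadata"]["id"] filename convention comes from the instructions above, while the placeholder predictor and the choice of the test split are assumptions to replace with your own inference code:

# Save predicted binary change masks so the evaluation scripts can find them by id.
from pathlib import Path

import numpy as np
from PIL import Image

from dvl.vqa.dataset import DynamicVLReferSeg


def predict_change_mask(t1_image, t2_image, messages):
    """Placeholder predictor: replace with your model's inference. Returns an empty mask."""
    return np.zeros(t1_image.shape[:2], dtype=np.uint8)


pred_dir = Path("preds")  # use this directory as {your-pred-dir}
pred_dir.mkdir(parents=True, exist_ok=True)

dataset = DynamicVLReferSeg(data_dir="data/test")
for item in dataset:
    mask = predict_change_mask(item["t1_image"], item["t2_image"], item["messages"])
    mask = (np.asarray(mask) > 0).astype(np.uint8) * 255  # store as a 0/255 PNG
    Image.fromarray(mask).save(pred_dir / f"{item['metadata']['id']}.png")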

Run the evaluation utilities:

# LISA-style binary IoU metrics
(dvl): python -m dvl.vqa.pretty_print.referseg_iou --pred_dir "{your-pred-dir}"

# MambaCD-style semantic change detection metrics
(dvl): python -m dvl.vqa.pretty_print.referseg_cd --pred_dir "{your-pred-dir}"

Scores are printed to console and stored alongside the submitted prediction masks.
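
For reference, the per-mask score underlying the LISA-style metrics is plain intersection-over-union between the predicted and ground-truth change masks. The sketch below illustrates that computation for a single pair; it is a simplified illustration (with an illustrative ground-truth path), not the official referseg_iou script:

# Simplified binary IoU for one predicted/ground-truth mask pair.
import numpy as np
from PIL import Image

pred = np.array(Image.open("preds/change_referring_seg_qa_0.png")) > 0
gt = np.array(Image.open("gt_masks/change_referring_seg_qa_0.png")) > 0  # illustrative path

intersection = np.logical_and(pred, gt).sum()
union = np.logical_or(pred, gt).sum()
iou = intersection / union if union > 0 else 1.0  # treat two empty masks as a perfect match
print(f"IoU = {iou:.4f}")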

Citation

If you find DynamicVL useful, please cite:

@article{xuan2025dynamicvl,
  title={DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding},
  author={Xuan, Weihao and Wang, Junjue and Qi, Heli and Chen, Zihang and Zheng, Zhuo and Zhong, Yanfei and Xia, Junshi and Yokoya, Naoto},
  journal={arXiv preprint arXiv:2505.21076},
  year={2025}
}

License

DynamicVL is released under the Apache-2.0 License.

Acknowledgements

DynamicVL builds on NAIP aerial imagery and on tools from the open-source multimodal community. We appreciate all contributors who benchmarked cutting-edge MLLMs on our dataset and shared feedback during the public release.
