This section tracks the development and release timeline of GTR-Bench.
| Date | Milestone |
|---|---|
| Apr 2025 | Project launched. Initial benchmark design and task formulation for geo-temporal reasoning in multi-camera networks. |
| May–Sep 2025 | Data collection and annotation. Benchmark construction using CityFlow (outdoor) and MTMMC (indoor) datasets. Design of 7 task types spanning basic reasoning (GL, ATI, MS) and combinatorial tasks (CR, NSF, TF, MTTF). |
| Oct 2025 | Benchmark completion. Evaluation of 10+ VLMs; analysis of performance gaps and model deficiencies. |
| Nov 2025 | Paper published on arXiv and code/data open-sourced on GitHub. |
| Jan 2026 | Accepted by ICLR 2026 🎉 |
Recently, the spatial-temporal intelligence of Vision-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI, and Artificial General Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric-perspective reasoning with image/video context, or on geographic-perspective reasoning with graphics context (e.g., a map), and thus fail to assess VLMs' geographic spatial-temporal intelligence with both image/video and graphics context, which is important for areas like traffic management and emergency response. To address these gaps, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging because it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis of GTR-Bench reveals three primary deficiencies of current models in geo-temporal reasoning: (1) VLMs' reasoning is impaired by imbalanced utilization of the spatial-temporal context; (2) VLMs are weak at temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks; (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence.
📄 Paper: arXiv:2510.07791
Overview of geo-temporal reasoning and GTR-Bench. Given a graphic map and multiple video clips from non-overlapping cameras, geo-temporal reasoning infers the motion states of moving targets in a large-scale camera network. GTR-Bench comprises 3 basic reasoning tasks (Geo-Location, Arrival Time-Interval, and Motion-State) and 4 combinatorial tasks (Causal Reordering, Next Spot Forecasting, Trajectory Forecasting, and Multi-Target Trajectory Forecasting). GTR-Bench covers both outdoor (vehicles) and indoor (pedestrians) scenarios.
- Geo-Location (GL): Infer intermediate locations between start/end points
- Arrival Time-Interval (ATI): Predict time interval of target's arrival at specific location
- Motion-State (MS): Infer target's motion state at intermediate locations
- Causal Reordering (CR): Determine correct chronological sequence from unordered video clips
- Next Spot Forecasting (NSF): Predict next camera location and time interval
- Trajectory Forecasting (TF): Forecast complete future trajectory sequence
- Multi-Target Trajectory Forecasting (MTTF): Predict meeting point of two targets
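For quick reference when parsing or filtering results, the seven task types and their abbreviations can be captured in a small Python mapping. This is an illustrative sketch only, not code shipped with the benchmark:

```python
from enum import Enum

class GTRTask(str, Enum):
    """The 7 GTR-Bench task types (abbreviation -> full name); illustrative only."""
    GL = "Geo-Location"                           # basic: infer intermediate locations
    ATI = "Arrival Time-Interval"                 # basic: predict arrival time interval
    MS = "Motion-State"                           # basic: infer motion state at intermediate spots
    CR = "Causal Reordering"                      # combinatorial: order unordered clips
    NSF = "Next Spot Forecasting"                 # combinatorial: next camera + time interval
    TF = "Trajectory Forecasting"                 # combinatorial: full future trajectory
    MTTF = "Multi-Target Trajectory Forecasting"  # combinatorial: meeting point of two targets

BASIC_TASKS = {GTRTask.GL, GTRTask.ATI, GTRTask.MS}
COMBINATORIAL_TASKS = {GTRTask.CR, GTRTask.NSF, GTRTask.TF, GTRTask.MTTF}
```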
*Figures: overview of GTR-Bench results, showing average performance across VLMs (OS = open-source; PM = proprietary), and per-task performance on GTR-Bench.*
| Methods | Rank | GL | ATI | MS | CR | NSF | TF | MTTF | GL | ATI | MS | CR | NSF | TF | MTTF | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models (API) | ||||||||||||||||
| Claude-3-7-Sonnet | 7 | 53.33 | 66.67 | 26.67 | 46.67 | 25.75 | 8.90 | 21.97 | 36.67 | 40.00 | 13.33 | 46.67 | 9.48 | 9.51 | 3.56 | 29.23 |
| GPT-4o | 4 | 56.67 | 76.67 | 40.00 | 63.33 | 20.53 | 0.00 | 23.10 | 30.00 | 53.33 | 40.00 | 50.00 | 13.00 | 0.00 | 2.79 | 33.53 |
| Claude-4-Sonnet | 3 | 73.33 | 50.00 | 50.00 | 63.33 | 8.05 | 6.18 | 16.94 | 66.67 | 33.33 | 43.33 | 58.62 | 2.60 | 4.01 | 0.00 | 34.03 |
| GPT-5 | 2 | 53.33 | 76.67 | 40.00 | 40.00 | 12.04 | 12.12 | 7.34 | 60.00 | 30.00 | 43.33 | 86.21 | 11.34 | 2.55 | 1.75 | 34.05 |
| Gemini-2.5-Pro | 1 | 60.00 | 46.67 | 33.33 | 56.67 | 19.13 | 13.16 | 19.18 | 63.33 | 13.33 | 26.67 | 70.00 | 25.11 | 28.09 | 14.37 | 34.93 |
| Open-source Models | ||||||||||||||||
| Qwen2-VL-2B-Instruct | 13 | 33.33 | 16.67 | 20.00 | 56.67 | 0.00 | 0.28 | 0.00 | 30.00 | 13.33 | 33.33 | 43.33 | 0.00 | 0.21 | 0.00 | 17.65 |
| InternVL3-2B | 12 | 33.33 | 46.67 | 23.33 | 36.67 | 6.15 | 0.00 | 9.95 | 13.33 | 23.33 | 33.33 | 30.00 | 0.65 | 0.08 | 1.62 | 18.46 |
| Qwen2.5-VL-7B-Instruct | 11 | 23.33 | 46.67 | 3.33 | 60.00 | 0.00 | 0.00 | 0.51 | 40.00 | 30.00 | 6.67 | 63.33 | 0.00 | 0.00 | 0.00 | 19.56 |
| InternVL3-8B | 10 | 26.67 | 60.00 | 33.33 | 50.00 | 0.00 | 4.79 | 5.42 | 20.00 | 26.67 | 30.00 | 50.00 | 0.00 | 0.79 | 1.67 | 22.10 |
| Qwen2-VL-7B-Instruct | 9 | 43.33 | 60.00 | 16.67 | 50.00 | 5.78 | 0.00 | 10.01 | 20.00 | 40.00 | 36.67 | 36.67 | 3.62 | 0.00 | 0.00 | 23.05 |
| GLM-4.1V-9B-Thinking | 8 | 60.00 | 57.14 | 25.00 | 62.07 | 10.29 | 0.00 | 25.38 | 26.67 | 38.46 | 34.48 | 55.17 | 2.87 | 0.00 | 1.67 | 28.51 |
| Qwen2.5-VL-32B-Instruct | 6 | 43.33 | 60.00 | 33.33 | 66.67 | 0.65 | 0.00 | 15.72 | 33.33 | 56.67 | 43.33 | 70.00 | 3.33 | 0.00 | 0.00 | 30.45 |
| InternVL3-38B | 5 | 40.00 | 73.33 | 30.00 | 53.33 | 8.27 | 8.20 | 20.58 | 50.00 | 56.67 | 26.67 | 37.93 | 11.10 | 4.37 | 10.24 | 30.76 |
| Human Level | - | 90.00 | 84.25 | 90.91 | 89.75 | 68.31 | 51.24 | 55.83 | 98.20 | 90.78 | 89.45 | 97.35 | 74.64 | 57.36 | 62.46 | 78.61 |
Our evaluation of more than 10 popular VLMs reveals a critical performance gap relative to human performance. The two blocks of per-task columns (GL–MTTF) report results separately for the benchmark's two scenarios (outdoor and indoor).
⚠️ Important: Before using GTR-Bench, you need to download and prepare the required datasets.
**CityFlow (Outdoor)**
- Source: AI City Challenge
- Purpose: Vehicle tracking and re-identification
Expected directory structure:
./data/outdoor/cityflow/
└── AICity22_Track1_MTMC_Tracking/
    └── train/
        └── S04/
            └── c020/
                └── vdo.avi
**MTMMC (Indoor)**
- Source: MTMMC Dataset
- Purpose: Multi-target, multi-modal camera tracking
Expected directory structure:
./data/indoor/mtmmc/
└── train/
    └── s01/
        └── c03/
            └── rgb/
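After downloading both datasets, a minimal sketch like the following can verify that the expected paths exist. The paths mirror the trees above; the script itself is illustrative and not part of the official tooling:

```python
from pathlib import Path

# Paths mirror the expected directory structures shown above.
EXPECTED_PATHS = [
    Path("data/outdoor/cityflow/AICity22_Track1_MTMC_Tracking/train/S04/c020/vdo.avi"),
    Path("data/indoor/mtmmc/train/s01/c03/rgb"),
]

def check_datasets() -> bool:
    """Return True only if every expected dataset path exists on disk."""
    ok = True
    for path in EXPECTED_PATHS:
        exists = path.exists()
        print(f"[{'OK' if exists else 'MISSING'}] {path}")
        ok = ok and exists
    return ok

if __name__ == "__main__":
    check_datasets()
```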
GTR-Bench organizes data by scenario type:
data/
├── indoor/               # Indoor scenario (pedestrians)
│   ├── mtmmc/            # MTMMC dataset
│   └── homography/       # Indoor camera homography data
└── outdoor/              # Outdoor scenario (vehicles)
    └── cityflow/         # CityFlow dataset
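The indoor homography data relates camera pixels to map coordinates. The sketch below shows the general projection step with NumPy; the identity matrix is a placeholder, since the on-disk format and naming of the files under `data/indoor/homography/` are not specified in this README:

```python
import numpy as np

def pixel_to_map(point_xy, H):
    """Project a camera pixel (x, y) onto map coordinates with a 3x3 homography H."""
    x, y = point_xy
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]  # normalize by the homogeneous coordinate

# Placeholder homography; real per-camera matrices live under data/indoor/homography/.
H = np.eye(3)
print(pixel_to_map((640.0, 360.0), H))  # identity matrix -> [640. 360.]
```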
- Download the source datasets from the official links above
- Place the data in the corresponding directories following the expected structure
- Reference the map images and video clips for geo-temporal reasoning tasks
- The benchmark supports both outdoor (vehicle) and indoor (pedestrian) scenarios across 7 task types
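As a usage illustration (the map-image path below is a placeholder, and uniform frame sampling is only one possible way to build the video context), pairing a map with sampled frames from a referenced CityFlow clip might look like this:

```python
import cv2

def sample_frames(video_path, num_frames=8):
    """Uniformly sample frames from a video clip as BGR numpy arrays."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / max(num_frames, 1)))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Placeholder map path; the clip path follows the expected layout above.
map_img = cv2.imread("path/to/map_image.png")
clip = "data/outdoor/cityflow/AICity22_Track1_MTMC_Tracking/train/S04/c020/vdo.avi"
frames = sample_frames(clip)
print(f"map loaded: {map_img is not None}, frames sampled: {len(frames)}")
```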
- CityFlow Dataset: AI City Challenge organizers
- MTMMC Dataset: Multi-Target Multi-Modal Camera tracking team