WikiTiLo: Can Vision-Language Models Be a Good Guesser? Exploring VLMs for Times and Location Reasoning
Accepted to WACV 2024 in Waikoloa, Hawaii, USA
Can Vision-Language Models (VLMs), which are pre-trained on large-scale image-text resources, infer where and when a photo was taken, as humans can?
To address this question, we propose a two-stage RECOGNITION & REASONING probing task, applied to both discriminative and generative VLMs, to uncover whether VLMs can recognize times- and location-relevant features from visual cues and then reason about them.
For this task, we construct a new dataset, WikiTiLo (WikiCommon Times and Location), which comprises images captured over a broad time range and is geographically balanced to mitigate cultural bias. The dataset has been carefully curated so that each image contains distinct visual cues that align with human expert knowledge.
We evaluate three discriminative VLMs and two generative VLMs on WikiTiLo. Experiments show that the visual encoders in discriminative VLMs can generate context-agnostic visual features that help identify times and locations, but generative VLMs fail to reason based on the visual cues. The results table reports zero-shot performance (without training) on RECOGNITION and REASONING, compared with the Frequency baseline and the human baseline.
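For readers unfamiliar with the term, a frequency baseline is commonly understood as always predicting the most frequent label. The sketch below illustrates that idea under this assumption; the function name `frequency_baseline_accuracy` and the toy labels are hypothetical and are not taken from the WikiTiLo code or data.

```python
from collections import Counter


def frequency_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test item.

    Assumption: "Frequency baseline" means a majority-class predictor.
    Returns the predicted label and its accuracy on test_labels.
    """
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, correct / len(test_labels)


# Toy region labels for illustration only (not WikiTiLo data).
toy_train = ["Europe", "Europe", "Asia"]
toy_test = ["Europe", "Asia", "Africa", "Europe"]
predicted, accuracy = frequency_baseline_accuracy(toy_train, toy_test)
print(predicted, accuracy)
```

A baseline of this kind gives a floor that any model should exceed if it uses the image content at all.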

The WikiTiLo dataset (available for download from the Hugging Face repo) consists of 6296 images, each annotated with the specific time and country in which it was taken. Some samples from WikiTiLo are shown above. The dataset covers 30 countries in 8 regions and spans the years 1826 to 2021. We split the dataset into a training set (80%), a validation set (10%), and a test set (10%). The following charts show all the countries and regions in WikiTiLo, along with the distribution of the location and time labels.
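An 80/10/10 split of this kind can be sketched as follows. This is a minimal illustration, assuming a seeded random shuffle with the rounding remainder assigned to the test set; the released split may have been produced differently, and `split_indices` is a hypothetical helper, not part of the dataset's tooling.

```python
import random


def split_indices(n_items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_items * train_frac)
    n_val = int(n_items * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# 6296 is the image count stated above; the seed is an arbitrary choice.
train_idx, val_idx, test_idx = split_indices(6296)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting over indices rather than files keeps the sketch independent of how the images and annotations are stored.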
@inproceedings{zhang2024can,
title={Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning},
author={Zhang, Gengyuan and Zhang, Yurui and Zhang, Kerui and Tresp, Volker},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={636--645},
year={2024}
}


