WikiTiLo: Can Vision-Language Models Be a Good Guesser? Exploring VLMs for Times and Location Reasoning

Accepted to WACV 2024 in Waikoloa, Hawaii, USA

Can Vision-Language Models (VLMs), which are pre-trained on large-scale image-text resources, infer where and when a photo was taken, as humans can?

To address this question, we propose a two-stage RECOGNITION & REASONING probing task, applied to both discriminative and generative VLMs, to uncover whether VLMs can recognize time- and location-relevant features from visual cues and then reason about them.

For this task, we construct a new dataset, WikiTiLo (WikiCommon Times and Location), which comprises images captured over a broad time range and is geographically balanced to mitigate cultural bias. The dataset has been carefully curated so that each image contains distinct visual cues that align with human expert knowledge.

We evaluated three discriminative VLMs and two generative VLMs on WikiTiLo. Experiments show that the visual encoders in discriminative VLMs can generate context-agnostic visual features that help identify times and locations, but generative VLMs fail to reason based on the visual cues. The results table reports zero-shot performance (without training) on RECOGNITION and REASONING, compared with the Frequency baseline and the human baseline.
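This README does not define the Frequency baseline. A common reading, assumed here, is a predictor that always outputs the most frequent label seen in the training set. The sketch below illustrates that reading on made-up region labels; the function name `frequency_baseline_accuracy` and the toy labels are hypothetical and are not taken from the WikiTiLo code or data.

```python
from collections import Counter


def frequency_baseline_accuracy(train_labels, test_labels):
    """Always predict the most frequent training label (assumed definition)."""
    # Find the single most common label in the training set.
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    # Count how many test examples happen to carry that label.
    correct = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, correct / len(test_labels)


# Toy labels for illustration only; these are not WikiTiLo statistics.
toy_train = ["Europe", "Europe", "Asia", "Africa"]
toy_test = ["Europe", "Asia", "Europe", "Africa"]
predicted_label, accuracy = frequency_baseline_accuracy(toy_train, toy_test)
print(predicted_label, accuracy)
```

A baseline of this kind gives a floor for interpreting zero-shot scores: a model that scores near it has learned little beyond the label distribution.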

Dataset

The WikiTiLo dataset (available for download from the Hugging Face repo) consists of 6296 images, each annotated with the specific time and country in which it was taken. Some samples from WikiTiLo are shown above. The dataset covers 30 countries in 8 regions and spans the years 1826 to 2021. We split the dataset into 80% for training, 10% for validation, and 10% for testing. The following charts show all the countries and regions in WikiTiLo, along with the distributions of the location and time labels.
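As a rough illustration of an 80/10/10 split over 6296 items, the sketch below shuffles indices with a fixed seed and slices them. The function name `split_indices`, the seed, and the rounding rule are assumptions for illustration; the released dataset ships its own split, which may differ in exact sizes and membership.

```python
import random


def split_indices(n_items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle item indices and slice them into train/val/test lists."""
    indices = list(range(n_items))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_train = int(n_items * train_frac)
    n_val = int(n_items * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    # The test set takes whatever remains after rounding down.
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(6296)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the fractions are rounded down, the test set absorbs the leftover items, so the three parts always add up to the full dataset.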

Citation

@inproceedings{zhang2024can,
  title={Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning},
  author={Zhang, Gengyuan and Zhang, Yurui and Zhang, Kerui and Tresp, Volker},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={636--645},
  year={2024}
}
