GAEA: A Geolocation Aware Conversational Assistant [WACV 2026🔥]

Ron Campos*, Ashmal Vayani*, Parth Parag Kulkarni*, Rohit Gupta, Aizan Zafar, Aritra Dutta, Mubarak Shah

* Equally contributing first authors

University of Central Florida

If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

Official GitHub repository for `GAEA: A Geolocation Aware Conversational Assistant`.

📢 Latest Updates

Sep-05-25: GAEA is accepted at WACV 2026! 🔥🔥
Aug-02-25: Datasets and model are released on Hugging Face! 🔥🔥
Mar-20-25: Technical report of GAEA is released on arXiv! 🔥🔥
Mar-20-25: GAEA-1.4M, GAEA-Bench, and code are released. GAEA-Bench contains 4,000 diverse conversational QA pairs equipped with geolocalization capabilities. GAEA-1.4M contains over 1.4M QA pairs for enhancing the conversational capabilities of the geolocalizable LMM, GAEA. 🔥🔥

🏆 Highlights

Figure: We compare the performance of various LMMs on the geographically-grounded visual-question-answering task, included in our new GAEA-Bench benchmark. Most LMMs can describe the Wat Pho statue, but only GAEA, our Geolocation Aware Assistant, retrieves the correct nearby cafe, Cafe Amazon (left). Qualitative SVQA comparison showing GAEA’s ability to provide accurate, location-specific answers where other LMMs fail (right).

Abstract: Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot use the model to learn anything beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed: beyond general tasks, LMMs struggle with more specialized downstream tasks, one of which is geolocalization. In this work, we propose to solve this problem by introducing a conversational model, GAEA, that can provide information regarding the location of an image, as required by a user. No large-scale dataset enabling the training of such a model exists. Thus, we propose a comprehensive dataset, GAEA-1.4M, with 800K images and around 1.4M question-answer pairs constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark, GAEA-Bench, comprising 4K image-text pairs to evaluate conversational capabilities across diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model, GPT-4o, by 8.28%. We will publicly release our dataset and codes.

GAEA is the first open-source conversational model equipped with global-scale geolocalization capabilities.

Main contributions:

GAEA-1.4M: A Diverse Training Dataset: We propose GAEA-1.4M, a new dataset designed for training conversational image geolocalization models, incorporating diverse visual and contextual data.

GAEA-Bench: Evaluating Conversational Geolocalization: To assess conversational capabilities in geolocalization, we introduce GAEA-Bench, a benchmark featuring various question-answer formats.

GAEA: An Interactive Geolocalization Chatbot: We present GAEA, a conversational chatbot that extends beyond geolocalization to provide rich contextual insights about locations from images.

Benchmarking Against State-of-the-Art LMMs: We quantitatively compare our model's performance against 8 open-source and 3 proprietary LMMs, including GPT-4o and Gemini-2.0-Flash.

🗂️ Dataset

GAEA-1.4M Dataset Overview

Figure: Data Collection and Annotation Pipeline. (Left) GAEA-1.4M includes geographically diverse visual samples from various data sources, such as MP-16, GLD-v2, and CityGuessr68k. (Middle) We also incorporate OpenStreetMap (OSM) metadata and auxiliary context for each image, ranging from climate zones to geographical clues about the country. (Right) Using open-source LLMs and GPT-4o, we generate four diverse question-answer pairs across geolocation, reasoning, and conversational subsets.

GAEA-Bench Curation Pipeline

Figure: Overview of `GAEA-Bench`. `GAEA-Bench` is designed to evaluate the conversational abilities of various LMMs across different question types, including MCQs, T/F, and both short and long VQAs. We have carefully selected a subset of 3.5k samples from MP-16 and generated corresponding OSM metadata to generate QA pairs using GPT-4o. `GAEA-Bench` aims to fill the gap in conversational benchmarks by incorporating geolocalization capabilities.

Conversational Evaluation Pipeline

Figure: Evaluation pipeline for conversational benchmarking on GAEA-Bench, highlighting various question types we introduce in our GAEA-Bench. Each question type is evaluated with various defined criteria using GPT-4o as a judge. For instance, SVQA is evaluated against Accuracy and Correctness, and LVQA is evaluated on Consistency, Fluency, and Relevancy criteria.
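To make the per-type criteria concrete, here is a minimal sketch of how judge scores could be averaged per question type. It is not the repository's evaluation code: the record layout, the `aggregate` helper, the numeric score scale, and the criteria listed for MCQ and TF are all assumptions made for illustration. Only the SVQA and LVQA criteria come from the caption above.

```python
from collections import defaultdict
from statistics import mean

# SVQA and LVQA criteria are taken from the figure caption.
# The MCQ and TF entries are assumptions for illustration only.
CRITERIA = {
    "SVQA": ("accuracy", "correctness"),
    "LVQA": ("consistency", "fluency", "relevancy"),
    "MCQ": ("accuracy",),
    "TF": ("accuracy",),
}


def aggregate(records):
    """Average judge scores per question type.

    Each record is a hypothetical dict of the form
    {"type": "SVQA", "scores": {"accuracy": 8, "correctness": 6}}.
    """
    per_type = defaultdict(list)
    for rec in records:
        criteria = CRITERIA[rec["type"]]
        # First average over the criteria of one answer ...
        per_type[rec["type"]].append(mean(rec["scores"][c] for c in criteria))
    # ... then average over all answers of the same question type.
    return {qtype: mean(values) for qtype, values in per_type.items()}
```

A caller would pass one record per judged answer and read back one averaged score per question type.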

Classification Accuracy Evaluation Pipeline

Figure: The classification and distance-threshold accuracy pipeline evaluates geolocalization performance in two ways at once. City- and country-level accuracy is computed by comparing model predictions with ground-truth labels derived from reverse-geocoding the GPS coordinates. Accuracy at different distance thresholds is computed by geocoding the model's predictions.
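The distance-threshold part of this pipeline can be sketched as follows, assuming predictions have already been geocoded to (latitude, longitude) pairs. This is an illustrative sketch, not the repository's implementation: the function names are hypothetical, and the thresholds of 1, 25, 200, 750, and 2500 km are the values commonly used in the geolocalization literature rather than values stated in this README.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumed thresholds (km); common in geolocalization work, not stated here.
THRESHOLDS_KM = (1, 25, 200, 750, 2500)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def threshold_accuracy(preds, truths, thresholds=THRESHOLDS_KM):
    """Fraction of predictions within each distance threshold.

    preds and truths are equal-length sequences of (lat, lon) tuples.
    """
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    return {t: sum(d <= t for d in dists) / len(dists) for t in thresholds}
```

For example, `threshold_accuracy([(48.85, 2.35)], [(48.86, 2.34)])` returns a dictionary keyed by threshold, with each value the fraction of predictions that fall within that distance of the ground truth.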

Dataset Statistics

| Statistic | Value |
|---|---|
| Total images | 822,951 |
| Total cities / countries | 41,481 / 234 |
| Total questions | 1,432,519 |
| Total geo-localization questions | 822,951 |
| Total explanatory captions | 236,935 |
| Total open-ended questions | 267,668 |
| Total multiple-choice questions | 48,673 |
| Total true/false questions | 56,292 |
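As a quick sanity check on these statistics, the five per-type question counts add up exactly to the reported total, and the number of geo-localization questions equals the number of images. The short sketch below verifies both facts and prints each type's share of the dataset. The variable and function names are illustrative and not part of the repository.

```python
# Values copied from the Dataset Statistics table.
TOTAL_IMAGES = 822_951
TOTAL_QUESTIONS = 1_432_519
QUESTIONS_BY_TYPE = {
    "geo-localization": 822_951,
    "explanatory captions": 236_935,
    "open-ended": 267_668,
    "multiple-choice": 48_673,
    "true/false": 56_292,
}


def question_shares(counts):
    """Return each question type's fraction of the total question count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# The per-type counts sum to the reported total.
assert sum(QUESTIONS_BY_TYPE.values()) == TOTAL_QUESTIONS
# There are exactly as many geo-localization questions as images.
assert QUESTIONS_BY_TYPE["geo-localization"] == TOTAL_IMAGES

for name, share in question_shares(QUESTIONS_BY_TYPE).items():
    print(f"{name:22s} {share:6.1%}")
```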

Qualitative Example of GAEA-1.4M

Figure: Examples of the four question types in our dataset: SVQA, MCQ, TF, and LVQA. Each type targets a distinct reasoning skill grounded in geographical, visual, or contextual understanding. Our dataset has three categories, including Geolocalization, Reasoning (LVQA), and Conversational (SVQA, MCQ, TF) QAs, as shown in the figure.

Training

Downloading and Setting Up GAEA-1.4M Dataset

The GAEA-1.4M dataset can be downloaded from our Hugging Face page. GAEA-1.4M consists of 1.4M question-answer (QA) pairs spanning four question types: MCQs, TF, and short and long VQAs. The general structure of our dataset is as follows:

    GAEA-1.4M/
    |–– MP-16/
    |   |–– ###/
    |   |   |–– ###/
    |   |   |   |–– ##########.jpg
    |   |   |   |–– ... # remaining images
    |   |–– ... # remaining folders with similar structure
    |–– GLDv2/
    |   |–– #/
    |   |   |–– #/
    |   |   |   |–– #/
    |   |   |   |   |–– ##########.jpg
    |   |   |   |   |–– ... # remaining images
    |   |–– ... # remaining folders with similar structure
    |–– CityGuessr/
    |   |–– city_#_######.jpg
    |   |–– ... # remaining images
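After downloading, it can be useful to confirm that the images landed in the layout shown above. The following sketch counts `.jpg` files under each top-level source folder. It assumes only the directory structure described in this section; the `count_images` helper is hypothetical and not part of the repository's scripts.

```python
from pathlib import Path


def count_images(root):
    """Count .jpg files under each top-level folder of the dataset root.

    Returns a dict such as {"CityGuessr": ..., "GLDv2": ..., "MP-16": ...}.
    """
    root = Path(root)
    counts = {}
    for source_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # rglob descends through the nested ###/###/ shard folders.
        counts[source_dir.name] = sum(1 for _ in source_dir.rglob("*.jpg"))
    return counts


if __name__ == "__main__":
    # "GAEA-1.4M" is the dataset root from the tree above; adjust as needed.
    if Path("GAEA-1.4M").is_dir():
        for source, n in count_images("GAEA-1.4M").items():
            print(f"{source}: {n} images")
```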

Download the dataset

    # Download the GAEA-1.4M dataset
    cd scripts
    chmod +x download_gaea_train.sh
    ./download_gaea_train.sh

Download the Qwen2.5-VL weights

    # Download Qwen2.5-VL base model
    git lfs install
    git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

Installation

    conda create -n gaea python=3.10
    conda activate gaea
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
    pip install -r requirements.txt
    pip install qwen-vl-utils
    pip install flash-attn==2.5.8 --no-build-isolation

Please install the latest transformers from git to finetune Qwen2.5-VL

pip install git+https://github.com/huggingface/transformers accelerate

Training

    cd scripts
    chmod +x train_gaea.sh  # make script executable
    ./train_gaea.sh

Evaluation

Downloading and Setting Up GAEA-Bench Dataset

The GAEA-Bench dataset can be downloaded from our Hugging Face page. GAEA-Bench consists of 4k conversational QA pairs built from MP-16 images and OpenStreetMap (OSM) metadata, covering several question types: MCQs, TF, and short and long VQAs.

    # Download the GAEA-Bench dataset
    cd scripts
    chmod +x download_gaea_bench.sh
    ./download_gaea_bench.sh

Preparing the dataset

    # Organize the GAEA-Bench dataset
    cd scripts
    chmod +x prepare_gaea_bench.sh
    ./prepare_gaea_bench.sh

Conversational GAEA-Bench Evaluation

Run the following commands for evaluation:

    cd scripts
    chmod +x run_gaea_bench.sh  # make script executable
    ./run_gaea_bench.sh

Standard Geolocalization Evaluation

Download the IM2GPS, IM2GPS3k, YFCC4k, YFCC26k, and GWS15k datasets to run the evaluation. Once they are in place, update the paths in the shell script and run the evaluation commands.

    cd scripts
    chmod +x run_distance_metrics.sh  # make script executable
    ./run_distance_metrics.sh

Classification Accuracy Evaluation

Download the CityGuessr, GeoDE, and Dollar Street datasets to run the evaluation. Once they are in place, update the paths in the shell script and run the evaluation commands.

    cd scripts
    chmod +x run_cc_preds.sh  # make script executable
    ./run_cc_preds.sh

Benchmarking and Evaluation Results

GAEA-Bench Evaluation

Figure: We benchmark 11 open-source and proprietary LMMs on GAEA-Bench. Notably, GAEA outperforms all open-source models and scores higher than the proprietary models on decision-making questions (MCQs and TFs). We provide the relative performance change for each model compared to `GAEA`. We use GPT-4o as a judge for evaluation, and LLM judges have been documented to prefer their own long-form output; hence, the scores for these models are likely overestimated.

Standard Geolocalization Evaluation Results

Figure: We benchmark the performance of various specialized models on standard geolocation datasets. `GAEA` demonstrates competitive results, outperforming GaGA on multiple distance thresholds in both IM2GPS and IM2GPS3k.

Classification Accuracy Evaluation Results

Figure: Classification accuracy for both city and country labels, where `GAEA` establishes itself as a strong baseline, surpassing several recent LMMs in performance.

📂 License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The images in GAEA and GAEA-Bench dataset are collected from public domains and sources (refer to main paper for more details) and are for academic research use only. By using GAEA and GAEA-Bench, you agree not to use the dataset for any harm or unfair discrimination. Please note that the data in this dataset may be subject to other agreements. Video copyrights belong to the original dataset providers, video creators, or platforms.

📜 Citation

If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:

    @misc{campos2025gaeageolocationawareconversational,
      title={GAEA: A Geolocation Aware Conversational Assistant},
      author={Ron Campos and Ashmal Vayani and Parth Parag Kulkarni and Rohit Gupta and Aritra Dutta and Mubarak Shah},
      year={2025},
      eprint={2503.16423},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.16423},
    }

🙏 Acknowledgements

This repository borrows Video-LMM evaluation code from TimeChat and LLaMA-VID. We also borrowed partial code from the ALM-Bench and CVRR-Evaluation-Suite repositories. We thank the authors for releasing their code.
