Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Conghui He, Weijia Li
Sun Yat-Sen University, Shanghai AI Laboratory, Sensetime Research, Wuhan University
- [2025.06] 🌱 We uploaded our dataset and code.
- [2025.06] 😄 We are very happy to announce that *Where am I?* was accepted by ICCV 2025.
- [2024.12] 🔥 We have released Where am I? Cross-View Geo-localization with Natural Language Descriptions. Check out the paper.
Novel task setting: We introduce and formalize the Cross-View Geo-localization task based on natural language descriptions, using scene text descriptions to retrieve corresponding OSM or satellite images for geographical localization.
Dataset Contribution: We propose CVG-Text, a dataset with well-aligned street-view images, satellite images, OSM data, and text descriptions across three cities and over 30,000 coordinates. Additionally, a progressive scene text generation framework based on LMMs is presented, which reduces vague descriptions and generates high-quality scene text.
New retrieval method: We introduce CrossText2Loc, a novel text-based localization method that excels at handling long texts and offers strong interpretability. It achieves over a 10% improvement in Top-1 recall compared to existing methods, while providing retrieval reasoning beyond similarity scores.
Ensure your environment meets the following requirements:
conda create -n CVG-Text python=3.9 -y
conda activate CVG-Text
pip install -r requirements.txt
Dataset: The images and annotation files for CVG-Text can be found at https://huggingface.co/datasets/LHL3341/CVG-Text_full
Due to restrictions on Google Street View and Google Maps imagery, the dataset release will not include the raw street-view or satellite images themselves. Instead, each image is assigned a unique identifier (ID). Researchers can independently retrieve the corresponding images using their own API keys through the Google Street View API https://www.google.com/streetview/ and Google Maps Static API https://www.google.com/maps.
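The sketch below shows one possible way to re-download this imagery with your own key via the Google Street View Static API and Google Maps Static API (using the `requests` library). It is a minimal, hypothetical example: how the dataset's image IDs map to Street View panorama IDs or coordinates is an assumption here, so consult the annotation files for the exact fields.

```python
# Hypothetical helper for re-downloading imagery with your own Google API key.
# The mapping from dataset IDs to panorama IDs / coordinates is an assumption;
# check the CVG-Text annotation files for the actual fields.
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"


def fetch_street_view(pano_id: str, out_path: str, size: str = "640x640") -> None:
    # Street View Static API: returns a single street-view image for a panorama ID.
    url = "https://maps.googleapis.com/maps/api/streetview"
    params = {"pano": pano_id, "size": size, "key": API_KEY}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)


def fetch_satellite(lat: float, lng: float, out_path: str,
                    zoom: int = 18, size: str = "640x640") -> None:
    # Maps Static API: returns a satellite tile centered on the given coordinate.
    url = "https://maps.googleapis.com/maps/api/staticmap"
    params = {"center": f"{lat},{lng}", "zoom": zoom, "size": size,
              "maptype": "satellite", "key": API_KEY}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
```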
Path Configuration: After downloading, update the /path/to/dataset/ placeholders in ./config.yaml with the actual dataset paths.
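As an optional sanity check (not part of the repo), the snippet below loads ./config.yaml and reports whether each path-like value actually exists on disk. The config keys are repo-specific, so the check simply walks whatever is in the file.

```python
# Optional sanity check for ./config.yaml: flag path-like values that do not exist.
# The key names are repo-specific; this walks the whole file heuristically.
import os

import yaml  # pip install pyyaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)


def check_paths(node, prefix=""):
    if isinstance(node, dict):
        for key, value in node.items():
            check_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for idx, value in enumerate(node):
            check_paths(value, f"{prefix}{idx}.")
    elif isinstance(node, str) and ("/" in node or node.startswith(".")):
        status = "ok" if os.path.exists(node) else "MISSING"
        print(f"{status:7s} {prefix.rstrip('.')} -> {node}")


check_paths(cfg)
```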
Model Checkpoints: Our model checkpoints are available at: https://huggingface.co/CVG-Text/CrossText2Loc
To retrieve satellite images (sat) using NewYork-mixed (panoramic + single-view) text and the Ours model, run:
python zeroshot.py --version NewYork-mixed --img_type sat --model CLIP-L/14@336 --expand
You can also evaluate a specific checkpoint by setting --checkpoint {your_checkpoint_path}.
For more examples, please refer to the script in ./scripts/evaluate.sh.
For attention visualization and the explainable retrieval module (ERM), please run the code in the ./visualize directory.
To train the Ours model on Brisbane-mixed and OSM datasets, use the following command:
python -m torch.distributed.run --nproc_per_node=4 finetune.py --lr 1e-5 --batch_size 128 --epochs 40 --version Brisbane-mixed --model CLIP-L/14@336 --expand --img_type sat --logging
The --logging flag determines whether to save log files and model checkpoints.
AttributeError: 'ResidualAttentionBlock' object has no attribute 'attn_probs'
You may encounter this error if you run the visualization code directly. This is because the original OpenAI CLIP model does not store attention weights by default.
To resolve this, please follow the solution provided in this issue: hila-chefer/Transformer-MM-Explainability#39.
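For reference, below is a minimal sketch (not the repo's official fix) of one way to expose these weights by monkey-patching ResidualAttentionBlock.attention in the OpenAI CLIP package so each block records its attention map as attn_probs. The model name and loading call are assumptions based on the commands above; the visualization code may additionally need gradients of the attention maps, in which case follow the patched implementation from the linked issue instead.

```python
# Minimal sketch (assumption, not the repo's official fix): make OpenAI CLIP's
# ResidualAttentionBlock store its attention weights as `attn_probs`.
import types

import clip   # pip install git+https://github.com/openai/CLIP.git
import torch


def _attention_with_probs(self, x: torch.Tensor):
    # Same computation as the original block, but with need_weights=True so the
    # per-head attention map can be saved for visualization.
    mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
    out, probs = self.attn(
        x, x, x,
        need_weights=True,
        average_attn_weights=False,  # keep per-head weights (recent PyTorch versions)
        attn_mask=mask,
    )
    self.attn_probs = probs  # what the visualization code looks for
    return out


def patch_attn_probs(model: torch.nn.Module) -> None:
    """Monkey-patch every ResidualAttentionBlock (vision + text) of a ViT-based CLIP model."""
    blocks = list(model.visual.transformer.resblocks) + list(model.transformer.resblocks)
    for block in blocks:
        block.attention = types.MethodType(_attention_with_probs, block)


model, preprocess = clip.load("ViT-L/14@336px")  # model name assumed from the commands above
patch_attn_probs(model)
```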
If you have any questions, feel free to contact me!
@article{ye2024cross,
title={Where am I? Cross-View Geo-localization with Natural Language Descriptions},
author={Ye, Junyan and Lin, Honglin and Ou, Leyan and Chen, Dairong and Wang, Zihao and He, Conghui and Li, Weijia},
journal={arXiv preprint arXiv:2412.17007},
year={2024}
}