Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Conghui He, Weijia Li
Sun Yat-Sen University, Shanghai AI Laboratory, Sensetime Research, Wuhan University
- [2025.06] 🌱 We uploaded our dataset and code.
- [2025.06] 😄 We are very happy to announce that *Where am I?* was accepted by ICCV 2025.
- [2024.12] 🔥 We have released Where am I? Cross-View Geo-localization with Natural Language Descriptions. Check out the paper.
Novel task setting: We introduce and formalize the Cross-View Geo-localization task based on natural language descriptions, using scene text descriptions to retrieve corresponding OSM or satellite images for geographical localization.
Dataset Contribution: We propose CVG-Text, a dataset with well-aligned street-view images, satellite images, OSM data, and text descriptions across three cities and over 30,000 coordinates. Additionally, a progressive scene text generation framework based on LMMs is presented, which reduces vague descriptions and generates high-quality scene text.
New retrieval method: We introduce CrossText2Loc, a novel text-based localization method that excels at handling long texts and offers strong interpretability. It achieves over a 10% improvement in Top-1 recall compared to existing methods, while providing retrieval reasoning beyond similarity scores.
Ensure your environment meets the following requirements:
conda create -n CVG-Text python=3.9 -y
conda activate CVG-Text
pip install -r requirements.txt
Dataset: The images and annotation files for CVG-Text can be found at https://huggingface.co/datasets/LHL3341/CVG-Text_full
Due to restrictions on Google Street View and Google Maps imagery, the dataset release will not include the raw street-view or satellite images themselves. Instead, each image is assigned a unique identifier (ID). Researchers can independently retrieve the corresponding images using their own API keys through the Google Street View API https://www.google.com/streetview/ and Google Maps Static API https://www.google.com/maps.
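The sketch below shows one possible way to re-download this imagery with your own key via the Google Street View Static API and Google Maps Static API (using the `requests` library). It is a minimal, hypothetical example: how the dataset's image IDs map to Street View panorama IDs or coordinates is an assumption here, so consult the annotation files for the exact fields.

```python
# Hypothetical helper for re-downloading imagery with your own Google API key.
# The mapping from dataset IDs to panorama IDs / coordinates is an assumption;
# check the CVG-Text annotation files for the actual fields.
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"


def fetch_street_view(pano_id: str, out_path: str, size: str = "640x640") -> None:
    # Street View Static API: returns a single street-view image for a panorama ID.
    url = "https://maps.googleapis.com/maps/api/streetview"
    params = {"pano": pano_id, "size": size, "key": API_KEY}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)


def fetch_satellite(lat: float, lng: float, out_path: str,
                    zoom: int = 18, size: str = "640x640") -> None:
    # Maps Static API: returns a satellite tile centered on the given coordinate.
    url = "https://maps.googleapis.com/maps/api/staticmap"
    params = {"center": f"{lat},{lng}", "zoom": zoom, "size": size,
              "maptype": "satellite", "key": API_KEY}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
```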
Path Configuration: After downloading, update the /path/to/dataset/ placeholders in ./config.yaml with the actual dataset paths.
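As an optional sanity check (not part of the repo), the snippet below loads ./config.yaml and reports whether each path-like value actually exists on disk. The config keys are repo-specific, so the check simply walks whatever is in the file.

```python
# Optional sanity check for ./config.yaml: flag path-like values that do not exist.
# The key names are repo-specific; this walks the whole file heuristically.
import os

import yaml  # pip install pyyaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)


def check_paths(node, prefix=""):
    if isinstance(node, dict):
        for key, value in node.items():
            check_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for idx, value in enumerate(node):
            check_paths(value, f"{prefix}{idx}.")
    elif isinstance(node, str) and ("/" in node or node.startswith(".")):
        status = "ok" if os.path.exists(node) else "MISSING"
        print(f"{status:7s} {prefix.rstrip('.')} -> {node}")


check_paths(cfg)
```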
Model Checkpoints: Our model checkpoints are available at: https://huggingface.co/CVG-Text/CrossText2Loc
To retrieve satellite images (sat) using NewYork-mixed (panoramic + single-view) text and the Ours model, run:
python zeroshot.py --version NewYork-mixed --img_type sat --model CLIP-L/14@336 --expand
You can also evaluate a specific checkpoint by setting --checkpoint {your_checkpoint_path}.
For more examples, please refer to the script in ./scripts/evaluate.sh.
For attention visualization and the explainable retrieval module (ERM), please run the code in the ./visualize directory.
To train the Ours model on Brisbane-mixed and OSM datasets, use the following command:
python -m torch.distributed.run --nproc_per_node=4 finetune.py --lr 1e-5 --batch_size 128 --epochs 40 --version Brisbane-mixed --model CLIP-L/14@336 --expand --img_type sat --logging
The --logging flag determines whether to save log files and model checkpoints.
AttributeError: 'ResidualAttentionBlock' object has no attribute 'attn_probs'
You may encounter this error if you run the visualization code directly. This is because the original OpenAI CLIP model does not store attention weights by default.
To resolve this, please follow the solution provided in this issue: hila-chefer/Transformer-MM-Explainability#39.
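For reference, below is a minimal sketch (not the repo's official fix) of one way to expose these weights by monkey-patching ResidualAttentionBlock.attention in the OpenAI CLIP package so each block records its attention map as attn_probs. The model name and loading call are assumptions based on the commands above; the visualization code may additionally need gradients of the attention maps, in which case follow the patched implementation from the linked issue instead.

```python
# Minimal sketch (assumption, not the repo's official fix): make OpenAI CLIP's
# ResidualAttentionBlock store its attention weights as `attn_probs`.
import types

import clip   # pip install git+https://github.com/openai/CLIP.git
import torch


def _attention_with_probs(self, x: torch.Tensor):
    # Same computation as the original block, but with need_weights=True so the
    # per-head attention map can be saved for visualization.
    mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
    out, probs = self.attn(
        x, x, x,
        need_weights=True,
        average_attn_weights=False,  # keep per-head weights (recent PyTorch versions)
        attn_mask=mask,
    )
    self.attn_probs = probs  # what the visualization code looks for
    return out


def patch_attn_probs(model: torch.nn.Module) -> None:
    """Monkey-patch every ResidualAttentionBlock (vision + text) of a ViT-based CLIP model."""
    blocks = list(model.visual.transformer.resblocks) + list(model.transformer.resblocks)
    for block in blocks:
        block.attention = types.MethodType(_attention_with_probs, block)


model, preprocess = clip.load("ViT-L/14@336px")  # model name assumed from the commands above
patch_attn_probs(model)
```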
If you have any questions, feel free to contact me!
@article{ye2024cross,
title={Where am I? Cross-View Geo-localization with Natural Language Descriptions},
author={Ye, Junyan and Lin, Honglin and Ou, Leyan and Chen, Dairong and Wang, Zihao and He, Conghui and Li, Weijia},
journal={arXiv preprint arXiv:2412.17007},
year={2024}
}