Tian Liu*1 · Anwesha Basu*1 · James Caverlee1 · Shu Kong2
1Texas A&M University 2University of Macau
*The first two authors contributed equally.
We explore the capability of Large Multimodal Models (LMMs) in visual species recognition (VSR), a challenging fine-grained visual classification task. Surprisingly, we find that LMMs still struggle with this task, falling far behind a specialized few-shot recognition expert model. Yet, by leveraging the top-k predictions from the expert model to guide LMMs through in-context learning, we can significantly enhance VSR performance. Our method, Post-hoc Correction (POC), achieves state-of-the-art results on multiple VSR benchmarks, outperforming prior few-shot learning (FSL) methods by >10% accuracy. Importantly, POC requires no extra training, validation, or manual intervention, serving as a plug-and-play module that significantly enhances various existing FSL methods.
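At a high level, POC passes the expert model's top-k candidate species (optionally alongside few-shot reference images) to the LMM as in-context guidance and asks it to pick the final label. Below is a minimal sketch of this idea in Python; it is not the repo's actual prompt template, and the candidate names and confidences are placeholder values.

# minimal sketch of POC-style prompting: turn an expert model's top-k
# predictions into an in-context query for an LMM (placeholder data)
def build_poc_prompt(topk_names, topk_confidences):
    lines = ["An expert few-shot classifier ranked these candidate species:"]
    for rank, (name, conf) in enumerate(zip(topk_names, topk_confidences), start=1):
        lines.append(f"{rank}. {name} (confidence {conf:.2f})")
    lines.append("Given the query image, answer with the single most likely species name.")
    return "\n".join(lines)

print(build_poc_prompt(
    ["Anas platyrhynchos", "Anas rubripes", "Anas acuta"],
    [0.52, 0.31, 0.08],
))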
- 2025-12-16: POC code is released.
- 2025-12-10: arXiv preprint is published.
Create a conda environment and install the dependencies.
# our lab server runs CUDA 12.8, so we use pytorch-cuda=12.1 for compatibility
# DINOv3 requires python=3.10
conda create -n poc python=3.10 -y
conda activate poc
conda install pytorch torchvision torchaudio torchmetrics pytorch-cuda=12.1 -c pytorch -c nvidia
# install openclip and clip
pip install open_clip_torch
pip install git+https://github.com/openai/CLIP.git
pip install pandas scikit-learn
# clone dinov3
git clone https://github.com/facebookresearch/dinov3.git
# install gdown for downloading datasets
pip install gdown
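To sanity-check the poc environment, you can run a short import test; this is a minimal sketch, and it only exercises the packages installed above.

# quick sanity check for the poc environment
import torch
import torchvision
import open_clip  # from open_clip_torch
import clip       # from the OpenAI CLIP repo

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("open_clip pretrained tags (first 5):", open_clip.list_pretrained()[:5])
print("CLIP models:", clip.available_models())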
For LMM inference with Qwen, you can follow the instructions or the steps below to set up Qwen2.5-VL-7B locally.
# set up Qwen2.5-VL-7B locally using Hugging Face Transformers
conda create --name qwen --clone poc
conda activate qwen
pip install transformers==4.51.3 accelerate
pip install qwen-vl-utils[decord]
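To verify that the qwen environment can load the model, here is a minimal smoke test following the standard Hugging Face usage of Qwen2.5-VL; the image path is a placeholder to replace with any local image.

# minimal smoke test for Qwen2.5-VL-7B-Instruct (standard Hugging Face usage)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/any/test/image.jpg"},  # placeholder path
        {"type": "text", "text": "What species is shown in this image?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])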
Prepare the datasets following the instructions in DATASETS.md.
The few-shot splits are provided in data/${dataset_name}/fewshot${num_shots}_seed${seed}.txt, and the test splits are in data/${dataset_name}/test.txt.
You can simply download the datasets without repeating the sampling process.
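As a quick check that the splits are in place, you can count the samples per split file; a minimal sketch assuming one sample per line, shown for semi-aves with 4 shots and seed 1.

# count samples in the few-shot and test splits (assumes one sample per line)
from pathlib import Path

dataset, shots, seed = "semi-aves", 4, 1
for split_file in (Path(f"data/{dataset}/fewshot{shots}_seed{seed}.txt"),
                   Path(f"data/{dataset}/test.txt")):
    num_samples = sum(1 for line in split_file.open() if line.strip())
    print(f"{split_file}: {num_samples} samples")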
- Obtain the top-k predictions on the test set using a few-shot finetuned model.
# activate conda environment
conda activate poc
# few-shot linear probing
bash scripts/run_dataset_seed_probing.sh semi-aves 1
# few-shot finetuning
bash scripts/run_dataset_seed_fewshot_finetune.sh semi-aves 1
# obtain top-k predictions on test set for a pretrained model
bash scripts/run_dataset_seed_topk.sh semi-aves 1
# we can also run batch experiments for multiple datasets and seeds
bash scripts/batch_probing.sh
bash scripts/batch_fewshot_finetune.sh
bash scripts/batch_topk.sh
# run other FSL baselines; more FSL baselines will be released soon
bash scripts/batch_cmlp.sh
To run FineR, use the command below. Note that this runs FineR alone, not POC on top of FineR. You can change the number of shots to 4, 8, or 16; update the --train_list and --output_json arguments accordingly (a sketch for looping over shot counts follows the command).
# Activate your environment
conda activate poc
# Obtain FineR predictions.
python finer_topk.py \
--model_cfg ViT-B-32 \
--pretrained laion400m_e32 \
--device cuda \
--dataset_name semi-aves \
--dataset_root_train ../path/to/your/semi-aves/ \
--dataset_root_test ../path/to/your/semi-aves/ \
--train_list data/semi-aves/fewshot4_seed1.txt \
--test_list data/semi-aves/test.txt \
--metrics_json data/semi-aves/semi-aves_labels.json \
--name_key most_common_name \
--use_random_aug --aug_repeats 10 --encode_chunk_size 512 \
--alpha 0.7 \
--logit_scale_eval 1.0 \
--logit_scale_export 50 \
--batch_size 256 --workers 8 \
--topk 10 \
--output_json ../path/to/your/fineR_semi-aves_4shot_topk_fused.json
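To run FineR for several shot counts in one go, the sketch below loops over 4/8/16 shots and updates --train_list and --output_json accordingly; the paths are the same placeholders as in the command above.

# run finer_topk.py for 4/8/16 shots; arguments mirror the command above
import subprocess

for shots in (4, 8, 16):
    subprocess.run([
        "python", "finer_topk.py",
        "--model_cfg", "ViT-B-32", "--pretrained", "laion400m_e32", "--device", "cuda",
        "--dataset_name", "semi-aves",
        "--dataset_root_train", "../path/to/your/semi-aves/",
        "--dataset_root_test", "../path/to/your/semi-aves/",
        "--train_list", f"data/semi-aves/fewshot{shots}_seed1.txt",
        "--test_list", "data/semi-aves/test.txt",
        "--metrics_json", "data/semi-aves/semi-aves_labels.json",
        "--name_key", "most_common_name",
        "--use_random_aug", "--aug_repeats", "10", "--encode_chunk_size", "512",
        "--alpha", "0.7", "--logit_scale_eval", "1.0", "--logit_scale_export", "50",
        "--batch_size", "256", "--workers", "8", "--topk", "10",
        "--output_json", f"../path/to/your/fineR_semi-aves_{shots}shot_topk_fused.json",
    ], check=True)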
- Query LMM for post-hoc correction.
conda activate qwen
cd POC_dev/lmm-inference/
# generate few-shot reference images by stitching them together
python pregenerate_reference_images.py \
--class-json ../data/species196_insecta/species196_insecta_labels.json \
--k 4 \
--dataset species196_insecta \
--seed-file ../data/species196_insecta/fewshot16_seed1.txt \
--image-root /home/ltmask/dataset/species196_insecta \
--output-dir /home/ltmask/dataset/species196_insecta
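For intuition, the pregeneration step above stitches each class's few-shot images into a single reference image. The snippet below is only a simplified illustration of that idea with PIL (stitch_references is a hypothetical helper, not the repo's script).

# simplified illustration: stitch k few-shot images into one horizontal strip
from PIL import Image

def stitch_references(image_paths, tile_size=224):
    # resize each shot to tile_size x tile_size and paste them side by side
    tiles = [Image.open(p).convert("RGB").resize((tile_size, tile_size)) for p in image_paths]
    canvas = Image.new("RGB", (tile_size * len(tiles), tile_size))
    for i, tile in enumerate(tiles):
        canvas.paste(tile, (i * tile_size, 0))
    return canvas

# example: stitch 4 shots of one class into a reference image
stitch_references([f"class0_shot{i}.jpg" for i in range(4)]).save("class0_reference_4shot.jpg")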
# run POC with Qwen2.5-VL-7B-Instruct
python run_inference_local_hf.py \
--prompt-template top5-multimodal-4shot-with-confidence_ranking \
--prompt-dir species196_insecta \
--backend huggingface \
--hf-model-name-or-path Qwen/Qwen2.5-VL-7B-Instruct \
--config-yaml ../config.yml \
--image-dir species196_insecta \
--image-paths ../data/species196_insecta/test.txt \
--ref-image-dir /home/ltmask/dataset/species196_insecta/pregenerated_references_4shot \
--taxonomy-json ../data/species196_insecta/species196_insecta_labels.json \
--topk-json /home/ltmask/POC_dev/hanna/finetune_vitb32_openclip_laion400m_species196_insecta_4_1_topk_test_predictions_fixed.json \
--output-csv /home/ltmask/POC_dev/hanna/output.csv \
--error-file /home/ltmask/POC_dev/hanna/error.txt \
--max_new_tokens 900
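After inference finishes, you can score the output CSV; the column names below (prediction and label) are hypothetical placeholders, so adjust them to match the CSV actually written by run_inference_local_hf.py.

# compute accuracy from the POC output CSV (column names are placeholders)
import pandas as pd

df = pd.read_csv("output.csv")
correct = df["prediction"].str.strip().str.lower() == df["label"].str.strip().str.lower()
print(f"POC accuracy: {correct.mean():.4f} ({int(correct.sum())}/{len(df)})")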
See QUERYLMM.md for more detailed instructions on running queries with each LMM.
Check out our related works below:
- SWIFT (arXiv 2025): enabling successful semi-supervised learning with a VLM
- VEST (arXiv 2025): retrieving open data for validation in few-shot learning
- SWAT (CVPR 2025): retrieving open data for few-shot finetuning of a VLM
- REAL (CVPR 2024): uncovering the failures and causes in zero-shot VLMs
If you find our project useful, please consider citing our works:
@article{liu2025poc,
title={Surely Large Multimodal Models (Don’t) Excel in Visual Species Recognition?},
author={Liu, Tian and Basu, Anwesha and Kong, Shu},
journal={arXiv preprint arXiv:2512.15748},
year={2025}
}
@article{liu2025swift,
title={Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective},
author={Liu, Tian and Basu, Anwesha and Kong, Shu},
journal={arXiv preprint arXiv:2512.10244},
year={2025}
}
@article{wang2025enabling,
title={Enabling Validation for Robust Few-Shot Recognition},
author={Wang, Hanxin and Liu, Tian and Kong, Shu},
journal={arXiv preprint arXiv:2506.04713},
year={2025}
}
@inproceedings{liu2025few,
title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}
@inproceedings{parashar2024neglected,
title={The Neglected Tails in Vision-Language Models},
author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}