Tian Liu*1 · Anwesha Basu*1 · James Caverlee1 · Shu Kong2
1Texas A&M University 2University of Macau
*The first two authors contributed equally.
We explore the capability of Large Multimodal Models (LMMs) in visual species recognition (VSR), a challenging fine-grained visual classification task. Surprisingly, we find that LMMs still struggle with this task, falling far behind a specialized few-shot recognition expert model. Yet, by leveraging the top-k predictions from the expert model to guide LMMs through in-context learning, we can significantly enhance VSR performance. Our method, Post-hoc Correction (POC), achieves state-of-the-art results on multiple VSR benchmarks, outperforming prior few-shot learning (FSL) methods by >10% accuracy. Importantly, POC requires no extra training, validation, or manual intervention, serving as a plug-and-play module that significantly enhances various existing FSL methods.
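At a high level, POC passes the expert model's top-k candidate species (optionally alongside few-shot reference images) to the LMM as in-context guidance and asks it to pick the final label. Below is a minimal sketch of this idea in Python; it is not the repo's actual prompt template, and the candidate names and confidences are placeholder values.

# minimal sketch of POC-style prompting: turn an expert model's top-k
# predictions into an in-context query for an LMM (placeholder data)
def build_poc_prompt(topk_names, topk_confidences):
    lines = ["An expert few-shot classifier ranked these candidate species:"]
    for rank, (name, conf) in enumerate(zip(topk_names, topk_confidences), start=1):
        lines.append(f"{rank}. {name} (confidence {conf:.2f})")
    lines.append("Given the query image, answer with the single most likely species name.")
    return "\n".join(lines)

print(build_poc_prompt(
    ["Anas platyrhynchos", "Anas rubripes", "Anas acuta"],
    [0.52, 0.31, 0.08],
))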
- 2025-12-16: POC code is released.
- 2025-12-10: arXiv preprint is published.
Create a conda environment and install the dependencies.
# our lab server runs CUDA 12.8, so we use pytorch-cuda=12.1 for compatibility
# DINOv3 requires python=3.10
conda create -n poc python=3.10 -y
conda activate poc
conda install pytorch torchvision torchaudio torchmetrics pytorch-cuda=12.1 -c pytorch -c nvidia
# install openclip and clip
pip install open_clip_torch
pip install git+https://github.com/openai/CLIP.git
pip install pandas scikit-learn
# clone dinov3
git clone https://github.com/facebookresearch/dinov3.git
# install gdown for downloading datasets
pip install gdown
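To sanity-check the poc environment, you can run a short import test; this is a minimal sketch, and it only exercises the packages installed above.

# quick sanity check for the poc environment
import torch
import torchvision
import open_clip  # from open_clip_torch
import clip       # from the OpenAI CLIP repo

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("open_clip pretrained tags (first 5):", open_clip.list_pretrained()[:5])
print("CLIP models:", clip.available_models())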
For LMM inference with Qwen, you can follow the instructions or the steps below to set up Qwen2.5-VL-7B locally.
# set up Qwen2.5-VL-7B locally using Hugging Face Transformers
conda create --name qwen --clone poc
conda activate qwen
pip install transformers==4.51.3 accelerate
pip install qwen-vl-utils[decord]
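To verify that the qwen environment can load the model, here is a minimal smoke test following the standard Hugging Face usage of Qwen2.5-VL; the image path is a placeholder to replace with any local image.

# minimal smoke test for Qwen2.5-VL-7B-Instruct (standard Hugging Face usage)
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/any/test/image.jpg"},  # placeholder path
        {"type": "text", "text": "What species is shown in this image?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])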
Prepare the datasets following the instructions in DATASETS.md.
The few-shot splits are provided in data/${dataset_name}/fewshot${num_shots}_seed${seed}.txt, and the test splits are in data/${dataset_name}/test.txt.
You can simply download the datasets without repeating the sampling process.
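As a quick check that the splits are in place, you can count the samples per split file; a minimal sketch assuming one sample per line, shown for semi-aves with 4 shots and seed 1.

# count samples in the few-shot and test splits (assumes one sample per line)
from pathlib import Path

dataset, shots, seed = "semi-aves", 4, 1
for split_file in (Path(f"data/{dataset}/fewshot{shots}_seed{seed}.txt"),
                   Path(f"data/{dataset}/test.txt")):
    num_samples = sum(1 for line in split_file.open() if line.strip())
    print(f"{split_file}: {num_samples} samples")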
- Obtain the top-k predictions on the test set using a few-shot finetuned model.
# activate conda environment
conda activate poc
# few-shot linear probing
bash scripts/run_dataset_seed_probing.sh semi-aves 1
# few-shot finetuning
bash scripts/run_dataset_seed_fewshot_finetune.sh semi-aves 1
# obtain top-k predictions on test set for a pretrained model
bash scripts/run_dataset_seed_topk.sh semi-aves 1
# we can also run batch experiments for multiple datasets and seeds
bash scripts/batch_probing.sh
bash scripts/batch_fewshot_finetune.sh
bash scripts/batch_topk.sh
# run other FSL baselines; more FSL baselines will be released soon
bash scripts/batch_cmlp.sh
To run FineR, use the command below. Note that this runs FineR alone, not POC on top of FineR. You can change the number of shots to 4, 8, or 16; update the --train_list and --output_json arguments accordingly (a sketch for looping over shot counts follows the command).
# Activate your environment
conda activate poc
# Obtain FineR predictions.
python finer_topk.py \
--model_cfg ViT-B-32 \
--pretrained laion400m_e32 \
--device cuda \
--dataset_name semi-aves \
--dataset_root_train ../path/to/your/semi-aves/ \
--dataset_root_test ../path/to/your/semi-aves/ \
--train_list data/semi-aves/fewshot4_seed1.txt \
--test_list data/semi-aves/test.txt \
--metrics_json data/semi-aves/semi-aves_labels.json \
--name_key most_common_name \
--use_random_aug --aug_repeats 10 --encode_chunk_size 512 \
--alpha 0.7 \
--logit_scale_eval 1.0 \
--logit_scale_export 50 \
--batch_size 256 --workers 8 \
--topk 10 \
--output_json ../path/to/your/fineR_semi-aves_4shot_topk_fused.json
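To run FineR for several shot counts in one go, the sketch below loops over 4/8/16 shots and updates --train_list and --output_json accordingly; the paths are the same placeholders as in the command above.

# run finer_topk.py for 4/8/16 shots; arguments mirror the command above
import subprocess

for shots in (4, 8, 16):
    subprocess.run([
        "python", "finer_topk.py",
        "--model_cfg", "ViT-B-32", "--pretrained", "laion400m_e32", "--device", "cuda",
        "--dataset_name", "semi-aves",
        "--dataset_root_train", "../path/to/your/semi-aves/",
        "--dataset_root_test", "../path/to/your/semi-aves/",
        "--train_list", f"data/semi-aves/fewshot{shots}_seed1.txt",
        "--test_list", "data/semi-aves/test.txt",
        "--metrics_json", "data/semi-aves/semi-aves_labels.json",
        "--name_key", "most_common_name",
        "--use_random_aug", "--aug_repeats", "10", "--encode_chunk_size", "512",
        "--alpha", "0.7", "--logit_scale_eval", "1.0", "--logit_scale_export", "50",
        "--batch_size", "256", "--workers", "8", "--topk", "10",
        "--output_json", f"../path/to/your/fineR_semi-aves_{shots}shot_topk_fused.json",
    ], check=True)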
- Query LMM for post-hoc correction.
conda activate qwen
cd POC_dev/lmm-inference/
# generate few-shot reference images by stitching them together
python pregenerate_reference_images.py \
--class-json ../data/species196_insecta/species196_insecta_labels.json \
--k 4 \
--dataset species196_insecta \
--seed-file ../data/species196_insecta/fewshot16_seed1.txt \
--image-root /home/ltmask/dataset/species196_insecta \
--output-dir /home/ltmask/dataset/species196_insecta
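For intuition, the pregeneration step above stitches each class's few-shot images into a single reference image. The snippet below is only a simplified illustration of that idea with PIL (stitch_references is a hypothetical helper, not the repo's script).

# simplified illustration: stitch k few-shot images into one horizontal strip
from PIL import Image

def stitch_references(image_paths, tile_size=224):
    # resize each shot to tile_size x tile_size and paste them side by side
    tiles = [Image.open(p).convert("RGB").resize((tile_size, tile_size)) for p in image_paths]
    canvas = Image.new("RGB", (tile_size * len(tiles), tile_size))
    for i, tile in enumerate(tiles):
        canvas.paste(tile, (i * tile_size, 0))
    return canvas

# example: stitch 4 shots of one class into a reference image
stitch_references([f"class0_shot{i}.jpg" for i in range(4)]).save("class0_reference_4shot.jpg")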
# run POC with Qwen2.5-VL-7B-Instruct
python run_inference_local_hf.py \
--prompt-template top5-multimodal-4shot-with-confidence_ranking \
--prompt-dir species196_insecta \
--backend huggingface \
--hf-model-name-or-path Qwen/Qwen2.5-VL-7B-Instruct \
--config-yaml ../config.yml \
--image-dir species196_insecta \
--image-paths ../data/species196_insecta/test.txt \
--ref-image-dir /home/ltmask/dataset/species196_insecta/pregenerated_references_4shot \
--taxonomy-json ../data/species196_insecta/species196_insecta_labels.json \
--topk-json /home/ltmask/POC_dev/hanna/finetune_vitb32_openclip_laion400m_species196_insecta_4_1_topk_test_predictions_fixed.json \
--output-csv /home/ltmask/POC_dev/hanna/output.csv \
--error-file /home/ltmask/POC_dev/hanna/error.txt \
--max_new_tokens 900
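After inference finishes, you can score the output CSV; the column names below (prediction and label) are hypothetical placeholders, so adjust them to match the CSV actually written by run_inference_local_hf.py.

# compute accuracy from the POC output CSV (column names are placeholders)
import pandas as pd

df = pd.read_csv("output.csv")
correct = df["prediction"].str.strip().str.lower() == df["label"].str.strip().str.lower()
print(f"POC accuracy: {correct.mean():.4f} ({int(correct.sum())}/{len(df)})")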
See QUERYLMM.md for more detailed instructions on running queries with each LMM.
Check out our related works below:
- SWIFT (arXiv 2025): enabling successful semi-supervised learning with a VLM
- VEST (arXiv 2025): retrieving open data for validation in few-shot learning
- SWAT (CVPR 2025): retrieving open data for few-shot finetuning of a VLM
- REAL (CVPR 2024): uncovering the failures and causes in zero-shot VLMs
If you find our project useful, please consider citing our works:
@article{liu2025poc,
title={Surely Large Multimodal Models (Don’t) Excel in Visual Species Recognition?},
author={Liu, Tian and Basu, Anwesha and Kong, Shu},
journal={arXiv preprint arXiv:2512.15748},
year={2025}
}
@article{liu2025swift,
title={Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective},
author={Liu, Tian and Basu, Anwesha and Kong, Shu},
journal={arXiv preprint arXiv:2512.10244},
year={2025}
}
@article{wang2025enabling,
title={Enabling Validation for Robust Few-Shot Recognition},
author={Wang, Hanxin and Liu, Tian and Kong, Shu},
journal={arXiv preprint arXiv:2506.04713},
year={2025}
}
@inproceedings{liu2025few,
title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}
@inproceedings{parashar2024neglected,
title={The Neglected Tails in Vision-Language Models},
author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}