
Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?

Tian Liu*1 · Anwesha Basu*1 · James Caverlee1 · Shu Kong2

1Texas A&M University   2University of Macau
*The first two authors contributed equally.

Paper PDF · Project Page

We explore the capability of Large Multimodal Models (LMMs) in visual species recognition (VSR), a challenging fine-grained visual classification task. Surprisingly, we find that LMMs still struggle with this task, falling far behind a specialized few-shot recognition expert model. Yet, by leveraging the top-k predictions from the expert model to guide LMMs through in-context learning, we can significantly enhance VSR performance. Our method, Post-hoc Correction (POC), achieves state-of-the-art results on multiple VSR benchmarks, outperforming prior few-shot learning (FSL) methods by >10% accuracy. Importantly, POC requires no extra training, validation, or manual intervention, serving as a plug-and-play module that significantly enhances various existing FSL methods.

[teaser figure]

News

  • 2025-12-16: POC code is released.
  • 2025-12-10: arXiv preprint is published.

Create Environment

Create conda environment and install dependencies.

# our lab server runs CUDA 12.8, so we use pytorch-cuda=12.1 for compatibility
# DINOv3 requires python=3.10

conda create -n poc python=3.10 -y
conda activate poc
conda install pytorch torchvision torchaudio torchmetrics pytorch-cuda=12.1 -c pytorch -c nvidia

# install openclip and clip
pip install open_clip_torch
pip install git+https://github.com/openai/CLIP.git

pip install pandas scikit-learn 

# clone dinov3
git clone https://github.com/facebookresearch/dinov3.git

# install gdown for downloading datasets
pip install gdown

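Optionally, verify the installation with a quick import check (our suggestion, not one of the repo's scripts):

# should print the torch version and True if a GPU is visible
python -c "import torch, open_clip, clip; print(torch.__version__, torch.cuda.is_available())"
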
For LMM inference with Qwen, you can follow the official instructions or the steps below to set up Qwen2.5-VL-7B locally.

# setup Qwen2.5-VL-7B locally using huggingface transformers
conda create --name qwen --clone poc
conda activate qwen
pip install transformers==4.51.3 accelerate
pip install "qwen-vl-utils[decord]"
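
A minimal smoke test, adapted from the Qwen2.5-VL model card (the image path below is a placeholder, not a repo asset):

# smoke_test_qwen.py -- sanity-check the local Qwen2.5-VL setup
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/test_image.jpg"},  # placeholder path
        {"type": "text", "text": "What species is shown in this image?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
# decode only the newly generated tokens
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])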

Dataset Preparation

Prepare the datasets following the instructions in DATASETS.md. The few-shot splits are provided in the data/${dataset_name}/fewshot${num_shots}_seed${seed}.txt files (e.g., data/semi-aves/fewshot4_seed1.txt), and the test splits in data/${dataset_name}/test.txt. You can simply download the datasets without repeating the sampling process.

Code Usage

  1. Obtain the top-k predictions on the test set using a few-shot finetuned model (a minimal sketch of what linear probing does follows the command block below).
# activate conda environment
conda activate poc

# few-shot linear probing
bash scripts/run_dataset_seed_probing.sh semi-aves 1

# few-shot finetuning
bash scripts/run_dataset_seed_fewshot_finetune.sh semi-aves 1

# obtain top-k predictions on test set for a pretrained model
bash scripts/run_dataset_seed_topk.sh semi-aves 1

# we can also run batch experiments for multiple datasets and seeds
bash scripts/batch_probing.sh
bash scripts/batch_fewshot_finetune.sh
bash scripts/batch_topk.sh

# run other FSL baselines (more baselines will be released soon)
bash scripts/batch_cmlp.sh

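For reference, few-shot linear probing amounts to training a linear classifier on frozen backbone features. Below is a minimal, self-contained sketch of the idea using open_clip and scikit-learn; it is illustrative only, not the repo's probing script, and it assumes the split files contain one "<image_path> <label>" pair per line (our guess; see DATASETS.md for the actual format).

# linear_probe_sketch.py -- illustrative only, not the repo's probing script
import numpy as np
import torch
import open_clip
from PIL import Image
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
model = model.to(device).eval()

def read_split(path):
    # assumed format: "<image_path> <label>" per line (check DATASETS.md)
    pairs = [line.split()[:2] for line in open(path)]
    return [p for p, _ in pairs], [int(y) for _, y in pairs]

@torch.no_grad()
def encode(paths):
    feats = []
    for p in paths:
        img = preprocess(Image.open(p).convert("RGB")).unsqueeze(0).to(device)
        f = model.encode_image(img)
        feats.append((f / f.norm(dim=-1, keepdim=True)).cpu())  # L2-normalize
    return torch.cat(feats).numpy()

train_paths, train_labels = read_split("data/semi-aves/fewshot4_seed1.txt")
test_paths, _ = read_split("data/semi-aves/test.txt")

clf = LogisticRegression(max_iter=1000).fit(encode(train_paths), train_labels)
probs = clf.predict_proba(encode(test_paths))
top5 = np.argsort(-probs, axis=1)[:, :5]  # per-image top-5 class indices
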
To run FineR, use the command below. Note that this runs FineR alone, not POC on top of FineR. You can set the number of shots to 4, 8, or 16; update the --train_list and --output_json arguments accordingly.

# Activate your environment
conda activate poc

# Obtain FineR predictions. 
python finer_topk.py \
  --model_cfg ViT-B-32 \
  --pretrained laion400m_e32 \
  --device cuda \
  --dataset_name semi-aves \
  --dataset_root_train ../path/to/your/semi-aves/ \
  --dataset_root_test  ../path/to/your/semi-aves/ \
  --train_list data/semi-aves/fewshot4_seed1.txt \
  --test_list  data/semi-aves/test.txt \
  --metrics_json data/semi-aves/semi-aves_labels.json \
  --name_key most_common_name \
  --use_random_aug --aug_repeats 10 --encode_chunk_size 512 \
  --alpha 0.7 \
  --logit_scale_eval 1.0 \
  --logit_scale_export 50 \
  --batch_size 256 --workers 8 \
  --topk 10 \
  --output_json ../path/to/your/fineR_semi-aves_4shot_topk_fused.json

  2. Query the LMM for post-hoc correction.
conda activate qwen
cd POC_dev/lmm-inference/

# generate few-shot reference images by stitching them together
python pregenerate_reference_images.py \
  --class-json ../data/species196_insecta/species196_insecta_labels.json \
  --k 4 \
  --dataset species196_insecta \
  --seed-file ../data/species196_insecta/fewshot16_seed1.txt \
  --image-root /home/ltmask/dataset/species196_insecta \
  --output-dir /home/ltmask/dataset/species196_insecta

# run POC with Qwen2.5-VL-7B-Instruct
python run_inference_local_hf.py \
  --prompt-template top5-multimodal-4shot-with-confidence_ranking \
  --prompt-dir species196_insecta \
  --backend huggingface \
  --hf-model-name-or-path Qwen/Qwen2.5-VL-7B-Instruct \
  --config-yaml ../config.yml \
  --image-dir species196_insecta \
  --image-paths ../data/species196_insecta/test.txt \
  --ref-image-dir /home/ltmask/dataset/species196_insecta/pregenerated_references_4shot \
  --taxonomy-json ../data/species196_insecta/species196_insecta_labels.json \
  --topk-json /home/ltmask/POC_dev/hanna/finetune_vitb32_openclip_laion400m_species196_insecta_4_1_topk_test_predictions_fixed.json \
  --output-csv /home/ltmask/POC_dev/hanna/output.csv \
  --error-file /home/ltmask/POC_dev/hanna/error.txt \
  --max_new_tokens 900

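For intuition, stitching the k few-shot images into one reference sheet boils down to basic PIL operations; the sketch below is illustrative only (pregenerate_reference_images.py is the repo's actual implementation):

# stitch_refs_sketch.py -- illustrative only
from PIL import Image

def stitch(paths, tile=224):
    """Resize each image to tile x tile and paste them side by side into one sheet."""
    imgs = [Image.open(p).convert("RGB").resize((tile, tile)) for p in paths]
    sheet = Image.new("RGB", (tile * len(imgs), tile))
    for i, img in enumerate(imgs):
        sheet.paste(img, (i * tile, 0))
    return sheet

# e.g., stitch(["ref1.jpg", "ref2.jpg", "ref3.jpg", "ref4.jpg"]).save("sheet.jpg")
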
See QUERYLMM.md for more detailed instructions on running queries with each LMM.
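
For intuition on what each query sends, the hypothetical sketch below assembles a multimodal message from the test image, the expert's top-5 predictions with confidences, and the pregenerated reference sheets. The actual prompt is defined by the top5-multimodal-4shot-with-confidence_ranking template in this repo; the reference-sheet file naming here is our guess.

# poc_prompt_sketch.py -- hypothetical illustration, not the repo's template code
def build_poc_messages(test_image, candidates, ref_image_dir):
    """candidates: list of (species_name, confidence) pairs from the expert's top-5."""
    content = [
        {"type": "image", "image": f"file://{test_image}"},
        {"type": "text", "text": "An expert model ranked the 5 most likely species for "
                                 "the query image above. Each candidate below is followed "
                                 "by a stitched sheet of reference photos of that species."},
    ]
    for rank, (name, conf) in enumerate(candidates, 1):
        content.append({"type": "text", "text": f"{rank}. {name} (confidence {conf:.2f})"})
        # reference-sheet naming is a guess; the repo pregenerates these sheets
        content.append({"type": "image", "image": f"file://{ref_image_dir}/{name}.jpg"})
    content.append({"type": "text",
                    "text": "Compare the query image against the reference sheets and "
                            "answer with the single most likely species name."})
    return [{"role": "user", "content": content}]

# e.g., build_poc_messages("test.jpg", [("Anas platyrhynchos", 0.41), ...], "refs/")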

Related Works

Check out our related works below:

  • SWIFT (arXiv 2025): enabling successful semi-supervised learning with VLMs
  • VEST (arXiv 2025): retrieving open data for validation in few-shot learning
  • SWAT (CVPR 2025): retrieving open data for few-shot finetuning a VLM
  • REAL (CVPR 2024): uncovering the failures and causes in zero-shot VLMs

Citations

If you find our project useful, please consider citing our works:

@article{liu2025poc,
    title={Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?},
    author={Liu, Tian and Basu, Anwesha and Kong, Shu},
    journal={arXiv preprint arXiv:2512.15748},
    year={2025}
}

@article{liu2025swift,
    title={Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective},
    author={Liu, Tian and Basu, Anwesha and Kong, Shu},
    journal={arXiv preprint arXiv:2512.10244},
    year={2025}
}

@article{wang2025enabling,
    title={Enabling Validation for Robust Few-Shot Recognition},
    author={Wang, Hanxin and Liu, Tian and Kong, Shu},
    journal={arXiv preprint arXiv:2506.04713},
    year={2025}
}

@inproceedings{liu2025few,
    title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
    author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
}

@inproceedings{parashar2024neglected,
    title={The Neglected Tails in Vision-Language Models},
    author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024}
}
