Tian Liu1 · Anwesha Basu1 · James Caverlee1 · Shu Kong2
1Texas A&M University 2University of Macau
We investigate the failures of representative semi-supervised learning (SSL) methods, e.g., FixMatch and DebiasPL, in the challenging few-shot setup of finetuning a pretrained VLM. Our analyses trace the root cause to the rather "flat" softmax probabilities produced by contrastively pretrained VLMs, which yield weak supervision and zero utilization of pseudo-labeled data.
To address this, we propose simple yet effective remedies, including classifier initialization and temperature tuning. Building on these insights, our final method SWIFT effectively finetunes a VLM on limited labeled data, abundant unlabeled data, and task-relevant retrieved data. SWIFT outperforms recent few-shot learning (FSL) and SSL methods by 5% in accuracy across five benchmarks, even rivaling fully supervised finetuning with all labels.
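Why do SSL methods fail here? FixMatch only pseudo-labels an unlabeled image when the model's max softmax probability exceeds a confidence threshold (0.95 by default). A contrastively pretrained VLM scores classes by image-text cosine similarity, which is confined to [-1, 1], so the softmax over hundreds of classes is nearly uniform and the threshold is never met. The NumPy sketch below illustrates this failure and the temperature-tuning remedy; the similarity values and class count are illustrative stand-ins, not values from our codebase.

```python
# Minimal sketch of the "flat softmax" failure mode and the temperature fix.
import numpy as np

rng = np.random.default_rng(0)
num_classes = 200  # e.g., Semi-Aves has 200 classes

# CLIP-style zero-shot logits are image-text cosine similarities, which
# typically cluster in a narrow band well inside [-1, 1].
cos_sim = rng.normal(0.20, 0.01, num_classes)
cos_sim[0] = 0.30  # the correct class is only slightly more similar

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Without temperature scaling the distribution is nearly uniform: the max
# probability never clears FixMatch's 0.95 confidence threshold, so no
# unlabeled image is ever pseudo-labeled.
p_flat = softmax(cos_sim)
print(f"max prob w/o temperature: {p_flat.max():.4f}")  # ~0.006

# Dividing the logits by a small temperature (CLIP's learned value is ~0.01)
# sharpens the distribution and restores confident pseudo-labels.
tau = 0.01
p_sharp = softmax(cos_sim / tau)
print(f"max prob w/ temperature:  {p_sharp.max():.4f}")  # ~0.99, above threshold
```

In practice, SWIFT pairs this temperature tuning with classifier initialization via linear probing on the few-shot labeled data, which is exactly what the first script below performs.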
- 2025-12-30: SWIFT code is released.
- 2025-12-11: arXiv preprint is published.
1. Create a conda environment and install dependencies following the instructions in ENV.md.
2. Prepare the datasets following the instructions in DATASETS.md.
3. Retrieve relevant pretraining data following the instructions in RETRIEVAL.md.
To run SWIFT:

# first, run linear probing on the few-shot data to initialize the classifier
bash scripts/run_dataset_seed_probing.sh semi-aves 1

# then, run stage-2 and stage-3 training with the initialized classifier
bash scripts/run_dataset_seed_swift.sh semi-aves 1

To run FS-FT (few-shot finetuning):

# run FS-FT with the text-initialized classifier
bash scripts/run_dataset_seed_FSFT_text-init.sh semi-aves 1

To run the FixMatch and DebiasPL baselines:

# run for a single dataset
bash scripts/run_dataset_seed_fixmatch.sh semi-aves 1
bash scripts/run_dataset_seed_debiasPL.sh semi-aves 1

To run the fully supervised references:

# w/o retrieval augmentation (RA)
bash scripts/run_dataset_seed_oracle1.sh semi-aves 1

# w/ retrieval augmentation (RA)
bash scripts/run_dataset_seed_oracle2.sh semi-aves 1

Check out our related works below:
- POC (arXiv 2025): harnessing large multimodal models for few-shot visual species recognition
- SWIFT (arXiv 2025): enabling successful semi-supervised learning with VLMs
- VEST (arXiv 2025): retrieving open data for validation in few-shot learning
- SWAT (CVPR 2025): retrieving open data for few-shot finetuning of a VLM
- REAL (CVPR 2024): uncovering the failures and causes in zero-shot VLMs
If you find our project useful, please consider citing our works:
@article{liu2025swift,
  title={Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective},
  author={Liu, Tian and Basu, Anwesha and Kong, Shu},
  journal={arXiv preprint arXiv:2512.10244},
  year={2025}
}

@article{liu2025poc,
  title={Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?},
  author={Liu, Tian and Basu, Anwesha and Kong, Shu},
  journal={arXiv preprint arXiv:2512.15748},
  year={2025}
}

@article{wang2025enabling,
  title={Enabling Validation for Robust Few-Shot Recognition},
  author={Wang, Hanxin and Liu, Tian and Kong, Shu},
  journal={arXiv preprint arXiv:2506.04713},
  year={2025}
}

@inproceedings{liu2025few,
  title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
  author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}

@inproceedings{parashar2024neglected,
  title={The Neglected Tails in Vision-Language Models},
  author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}