Recent advances in vision-language foundation models, such as CLIP, have led to significant strides in zero-shot classification. However, the extensive parameterization of models like CLIP makes full fine-tuning resource-intensive. In response, TIP-Adapter and SuS-X have introduced training-free methods aimed at improving downstream task performance. While these approaches incorporate support sets to maintain consistency between the knowledge cache and the test data distribution, they often generalize poorly when the test data exhibits substantial distributional variation. In this work, we present CapS-Adapter, a method that employs a caption-based support set, effectively harnessing both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios. CapS-Adapter constructs support sets that closely mirror the target distribution, utilizing instance-level distribution features extracted from multimodal large models. By leveraging CLIP's single- and cross-modal strengths, CapS-Adapter enhances predictive accuracy through the use of multimodal support sets. Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19% over the previous leading method. These contributions are substantiated through extensive validation on multiple benchmark datasets, demonstrating superior performance and robust generalization.
Our code was tested on Python 3.8.18 and PyTorch 1.13.1+cu117.
We recommend setting up a conda virtual environment and installing all the requirements. You can follow these steps to set up the environment correctly:
git clone https://github.com/WLuLi/CapS-Adapter.git
cd caps
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda activate
conda env create -f environment.yml

Afterward, you can activate the conda environment using:
conda activate caps

We provide detailed instructions on how to prepare all the needed datasets in DATA.md.
You can run Zero-shot CLIP inference using:
python run_zs_baseline.py --dataset <dataset> --backbone <CLIP_visual_backbone>

The backbone parameter can be one of [RN50, RN101, ViT-B/32, ViT-B/16, all]; passing all runs every listed backbone.
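For reference, the zero-shot baseline follows the standard CLIP recipe: encode one or more prompts per class, encode the test image, and pick the class with the highest cosine similarity. Below is a minimal sketch using the openai/clip package; the class names, prompt template, and image path are placeholders, and the prompt ensembling and dataset handling inside run_zs_baseline.py may differ.

import clip
import torch
from PIL import Image

# Minimal zero-shot CLIP sketch (illustrative only; run_zs_baseline.py handles
# prompt ensembling and dataset loading internally).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

class_names = ["golden retriever", "tabby cat"]  # placeholder classes
texts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    text_feats = model.encode_text(texts)
    image_feats = model.encode_image(image)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feats @ text_feats.T  # scaled cosine similarity

print(class_names[logits.argmax(dim=-1).item()])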
To ensure reproducibility, we use the CuPL prompt files generated by SuS-X, provided in gpt3-prompts. These prompts are used both for CuPL and CuPL+e inference and for the CuPL and CuPL+e text classifiers.
You can run the CuPL and CuPL+e baselines using:
python run_cupl_baseline.py --dataset <dataset> --backbone <CLIP_visual_backbone>

We provide scripts for CapS construction.
For generating captions, we first need to sample from the target dataset. You can run the following script to extract k samples from each class in the dataset:
python datasampler.py --k <k> --dataset <dataset_name>

To generate captions, we deploy ShareCaptioner locally. You can run the following script to automatically download ShareCaptioner to ./models/share_captioner and generate captions for the sampled images from each class in the dataset:
python generate_captions.py --dataset <dataset>

Generating images with the Stable-Diffusion v1-4 checkpoint requires a Hugging Face token. Please create an account on Hugging Face and find your token under the access tokens tab.
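As a rough illustration of what the generation step involves, here is a minimal caption-to-image sketch using the Hugging Face diffusers StableDiffusionPipeline. It assumes diffusers is installed and that generate_caps.py follows the standard Stable-Diffusion v1-4 workflow; the script itself handles per-class looping, batching, and file naming, and the caption and output path below are placeholders.

import torch
from diffusers import StableDiffusionPipeline

# Illustrative sketch of caption-conditioned generation with Stable-Diffusion v1-4.
token = "<huggingface_key>"  # your Hugging Face access token
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    use_auth_token=token,
).to("cuda")

caption = "a photo of a golden retriever lying on the grass"  # placeholder caption
image = pipe(caption, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated_sample.png")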
You can run the following script to generate <images_number> images for each class in the dataset:
python generate_caps.py --dataset <dataset> --huggingface_key <huggingface_key> --images_per_class <images_number>

You can create the test and validation image features using:

python encode_datasets.py --dataset <dataset>

This script will save the test and validation features in ./data/features/<dataset>.
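If you want to inspect or reuse the cached features outside the provided scripts, the encoding step amounts to running the chosen CLIP backbone over every image and saving the L2-normalized embeddings. A minimal sketch, assuming a dataset that already applies the CLIP preprocess transform and a torch-format output file; the exact file names and layout produced by encode_datasets.py may differ.

import clip
import torch
from torch.utils.data import DataLoader

# Sketch of batch feature extraction with a CLIP backbone.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def encode_split(dataset, out_path):
    # dataset is expected to yield (preprocessed_image_tensor, label) pairs
    loader = DataLoader(dataset, batch_size=64, num_workers=4)
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            f = model.encode_image(images.to(device))
            f = f / f.norm(dim=-1, keepdim=True)  # L2-normalize for cosine similarity
            feats.append(f.cpu())
            labels.append(targets)
    torch.save({"features": torch.cat(feats), "labels": torch.cat(labels)}, out_path)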
You can create the different text classifier weights using:
python generate_text_classifier_weights.py --dataset <dataset>

This script will again save all the text classifier weights in ./data/features/<dataset>.
It should be noted that the features for the two text classifier modes, CuPL and CuPL+e, need to be generated by running the script run_cupl_baseline.py.
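Conceptually, each text classifier is a matrix of class embeddings: the prompts for a class (hand-crafted templates, CuPL sentences, or their ensemble for CuPL+e) are encoded with CLIP's text encoder and averaged. A minimal sketch, assuming a dict mapping class names to prompt lists; the on-disk format produced by the repo's scripts may differ.

import clip
import torch

# Sketch: build classifier weights by averaging CLIP text embeddings of per-class prompts.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)

def build_text_classifier(prompts_per_class):
    """prompts_per_class: dict mapping class name -> list of prompt strings."""
    weights = []
    with torch.no_grad():
        for class_name, prompts in prompts_per_class.items():
            tokens = clip.tokenize(prompts).to(device)
            emb = model.encode_text(tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)
            mean_emb = emb.mean(dim=0)
            weights.append(mean_emb / mean_emb.norm())
    return torch.stack(weights)  # (num_classes, embed_dim)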
You can create the CapS features (features of the <images_number> images generated for each class in the dataset) using:
python encode_caps.py --dataset <dataset> --images_per_class <images_number> --regenerate

This script will also save the CapS image and text features in ./data/features/<dataset>. If the --regenerate flag is added, it will clear the existing feature files in ./data/features/<dataset> and regenerate them.
You can create the few-shot weights (1, 2, 4, 8, 16 shots) for <dataset> using:
python encode_few_shot.py --dataset <dataset> --regenerate

If the --regenerate flag is added, it will clear the existing feature files in ./data/features/few_features/<dataset> and regenerate them.
It should be noted that M-Adapter requires caption features for inference. Therefore, encode_few_shot.py randomly draws the k-shot images from the samples for which captions have already been generated. Please ensure that datasampler.py and generate_captions.py have been run beforehand.
Once you have correctly saved all the feature files, you can run M-Adapter using:
python madapter.py --dataset <dataset> --backbone <CLIP_visual_backbone> --sus_type <sus_type> --log_file_path <filepath>

or
python madapter.py --dataset <dataset> --backbone <CLIP_visual_backbone> --sus_type <sus_type> --k <k> --log_file_path <filepath>

The sus_type parameter is caps to use CapS features and fewshot to use few-shot features. The parameter k is only effective in the few-shot setting and specifies the number of shots. The log_file_path parameter specifies the output path for the results of madapter.py.
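For orientation, M-Adapter follows the training-free cache paradigm of TIP-Adapter and SuS-X: test features are compared against the support (CapS or few-shot) features, the resulting affinities are converted into cache logits over the support labels, and these are blended with the CLIP text-classifier logits. The sketch below shows only that generic mechanism, with illustrative alpha/beta values; the exact way madapter.py fuses image and caption features is defined in the script itself.

import torch

# Generic training-free cache inference (TIP-Adapter / SuS-X style).
# madapter.py additionally combines image and caption support features;
# this sketch shows only the shared cache mechanism. alpha/beta are illustrative.
def cache_inference(test_feats, support_feats, support_labels, clip_logits,
                    alpha=1.0, beta=5.5):
    # All features are assumed L2-normalized:
    #   test_feats:     (num_test, dim)
    #   support_feats:  (num_support, dim)
    #   support_labels: (num_support, num_classes) one-hot
    #   clip_logits:    (num_test, num_classes) from the text classifier
    affinity = test_feats @ support_feats.T                       # cosine similarities
    cache = ((-beta * (1.0 - affinity)).exp()) @ support_labels   # cache model logits
    return clip_logits + alpha * cache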
If you found this work useful, please consider citing it as:
We build on several well-maintained repositories, including CLIP, CoOp, CLIP-Adapter, TIP-Adapter, CuPL, and SuS-X. We thank the authors for providing such amazing code and for enabling further research towards better vision-language model adaptation. We also thank the authors of the amazing Stable-Diffusion, which is a pivotal component of our method.
If you have any questions about the code, feel free to open an issue on the GitHub repository or email us at [email protected].
