This repository is the official PyTorch implementation of "Masked Images Are Counterfactual Samples for Robust Fine-tuning" [paper], accepted by CVPR 2023.
- 2023-03-24: Code released.
Our experiments are conducted on:
- OS: Ubuntu 20.04.4
- GPU: NVIDIA GeForce RTX 3090
- Python 3.9
- PyTorch 1.11
- cudatoolkit 11.3.1
- torchvision 0.12.0
- tensorboard 2.8.0
- scikit-learn 1.0.2
- torchattacks
- tqdm
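Assuming a matching NVIDIA driver, an environment like the one above could be created as follows. This is an illustrative sketch, not an official environment file from the repository; the environment name `masked-ft` and the channel choices are assumptions:

```shell
# Illustrative setup (names/channels are assumptions, not from the repo)
conda create -n masked-ft python=3.9
conda activate masked-ft
conda install pytorch=1.11 torchvision=0.12 cudatoolkit=11.3 -c pytorch
pip install tensorboard==2.8.0 scikit-learn==1.0.2 torchattacks tqdm
```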
The data directory (DATA_DIR) should contain the following sub-directories:
- ILSVRC2012: ImageNet
- imagenet-a: ImageNet-A
- imagenet-r: ImageNet-R
- imagenet-sketch: ImageNet-Sketch
- imagenetv2-matched-frequency: ImageNet-V2
- objectnet-1.0: ObjectNet
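As a quick sanity check before training, you can verify that DATA_DIR contains all of the expected sub-directories. The helper below is illustrative only (it is not part of the repository); the directory names are taken from the list above:

```python
import os

# Expected sub-directories of DATA_DIR (names from this README; the
# helper itself is illustrative and not part of the repository).
EXPECTED_DIRS = [
    "ILSVRC2012",                    # ImageNet
    "imagenet-a",                    # ImageNet-A
    "imagenet-r",                    # ImageNet-R
    "imagenet-sketch",               # ImageNet-Sketch
    "imagenetv2-matched-frequency",  # ImageNet-V2
    "objectnet-1.0",                 # ObjectNet
]

def missing_datasets(data_dir):
    """Return the expected dataset sub-directories absent from data_dir."""
    return [d for d in EXPECTED_DIRS
            if not os.path.isdir(os.path.join(data_dir, d))]
```

For example, `missing_datasets("/data")` returns an empty list when every dataset is in place.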
Please modify lines 3-6 of the main script run.sh to set the proper directories:
- LOG_DIR: root directory for the logs of all experiments and runs
- DATA_DIR: the directory for all datasets, as described above
- MODEL_DIR: the directory for pre-trained model weights (i.e., CLIP weights; the weights are downloaded automatically if they do not exist)
- EXP_NAME: experiment name; becomes a sub-directory of LOG_DIR
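For illustration, the top of run.sh might be configured as follows. The paths below are placeholders to substitute with your own, not directories used by the repository:

```shell
# Lines 3-6 of run.sh (placeholder paths -- substitute your own)
LOG_DIR=/path/to/logs        # root directory for experiment logs
DATA_DIR=/path/to/datasets   # directory containing the datasets listed above
MODEL_DIR=/path/to/models    # CLIP weights (auto-downloaded if missing)
EXP_NAME=my_experiment       # sub-directory of LOG_DIR for this run
```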
The bash script run.sh provides a uniform, simplified interface to the Python scripts for training and evaluation. It accepts the following positional arguments:
- script mode: whether to train or evaluate a model; one of `train`, `eval`, or `train-eval`
- architecture: `clip_{arch}`, where `{arch}` can be `ViT-B/32`, `ViT-B/16`, or `ViT-L/14`
- method: the training method (see `example.sh` or `run.sh` for available options)
- masking: the masking strategy (see `example.sh`)
- seed: an integer seed (note: we use three seeds (0, 1, 2) in the paper)
- other arguments, which are passed through to the Python scripts
The following commands show an example of fine-tuning a CLIP ViT-B/32 model with our proposed method, using object masking (threshold 0.3) and single-fill. Please refer to example.sh for more examples.
```shell
# Build the zero-shot model
CUDA_VISIBLE_DEVICES=0 bash run.sh train 'clip_ViT-B/32' 'zeroshot' '' 0

# Fine-tune using our approach
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run.sh train 'clip_ViT-B/32' 'FT_FD_image_mask' 'ObjMaskSingleFill(0.3)' 0

# Evaluate the fine-tuned model (replace `train` with `eval`)
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run.sh eval 'clip_ViT-B/32' 'FT_FD_image_mask' 'ObjMaskSingleFill(0.3)' 0
```
Some of the code in this repository is based on the following repositories:
- CLIP: https://github.com/openai/CLIP
- WiSE-FT: https://github.com/mlfoundations/wise-ft
- CAM for ViT: https://github.com/hila-chefer/Transformer-MM-Explainability