SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Xiaofu Chen, Israfel Salazar, Yova Kementchedjhieva
- 🔥 **Specificity Metric**: SPECS introduces a novel specificity-focused evaluation metric based on CLIP, improving the assessment of fine-grained visual detail in dense image captions.
- 🔥 **Human Alignment**: Achieves state-of-the-art correlation with human judgments among representational-similarity metrics, and is highly competitive with many LLM-based evaluation methods.
- 🔥 **Efficiency**: Provides a faster and more resource-efficient evaluation than computationally expensive LLM-based metrics.
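Concretely, SPECS rescales the cosine similarity between the SPEC image and text embeddings from $[-1, 1]$ to $[0, 1]$ (as implemented in the usage example below):

$$\mathrm{SPECS}(I, T) = \max\left(0,\ \frac{\cos\left(f_{\text{img}}(I),\, f_{\text{txt}}(T)\right) + 1}{2}\right)$$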
Please first clone our repo from GitHub and set up the environment by running the following commands:

```bash
git clone https://github.com/mbzuai-nlp/SPECS.git
cd SPECS
conda env create -f environment.yml
pip install git+https://github.com/openai/CLIP.git
python -m spacy download en_core_web_sm
```

Then, download the checkpoints of our model SPEC and place them under `./checkpoints`.
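After installation, a quick sanity check can confirm the environment is set up. This is a minimal sketch; it only verifies that the dependencies load and assumes nothing beyond the steps above:

```python
# Minimal environment sanity check: verifies that the installed
# dependencies import and load correctly.
import torch
import spacy

# The spaCy model downloaded above should load without error.
nlp = spacy.load("en_core_web_sm")
print("spaCy OK:", [t.text for t in nlp("A quick test.")])

# Report whether a GPU is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
```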
Prepare the ShareGPT4V dataset
First, download all the images we used:
- LAION-CC-SBU-558K: images.zip
- COCO: train2017
- WebData: images. Only for academic usage.
- SAM: images. We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K images from here.
- GQA: images
- OCR-VQA: download script. We save all files as `.jpg`
- TextVQA: trainvalimages
- VisualGenome: part1, part2
Then, download the long captions for these images: share-captioner_coco_lcs_sam_1246k_1107.json, and organize the data as follows in `projects/ShareGPT4V/data`:

```
ShareGPT4V
├── ...
├── data
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│ ├── llava
│ │ ├── llava_pretrain
│ │ │ ├── images
│ ├── coco
│ │ ├── train2017
│ ├── sam
│ │ ├── images
│ ├── gqa
│ │ ├── images
│ ├── ocr_vqa
│ │ ├── images
│ ├── textvqa
│ │ ├── train_images
│ ├── vg
│ │ ├── VG_100K
│ │ ├── VG_100K_2
│ ├── share_textvqa
│ │ ├── images
│ ├── web-celebrity
│ │ ├── images
│ ├── web-landmark
│ │ ├── images
│ ├── wikiart
│ │ ├── images
├── ...
```
After downloading the ShareGPT4V dataset, use `/SPECS/data/create_sharegpt4v.py` to preprocess it.
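For reference, here is a minimal sketch of how the downloaded annotation file can be inspected before preprocessing. It assumes the JSON follows the standard LLaVA-style schema (entries with an `image` path relative to the data root); check `create_sharegpt4v.py` for the exact fields it expects:

```python
import json
import os

# Assumption: LLaVA-style annotation entries with an "image" field
# holding a path relative to the ShareGPT4V data root.
data_root = "projects/ShareGPT4V/data"
ann_path = os.path.join(data_root, "share-captioner_coco_lcs_sam_1246k_1107.json")

with open(ann_path) as f:
    entries = json.load(f)

# Verify that every referenced image was actually downloaded.
missing = [e for e in entries
           if not os.path.exists(os.path.join(data_root, e["image"]))]
print(f"{len(entries)} entries, {len(missing)} with missing images")
```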
The sDCI dataset consists of 7,805 images, each paired with 10 captions.
You can compute SPECS scores for an image–caption pair using the following code:

```python
from PIL import Image
import torch
import torch.nn.functional as F

from model import longclip

# Device configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the SPECS model checkpoint
model, preprocess = longclip.load("spec.pt", device=device)
model.eval()

# Load and preprocess the image
image_path = "SPECS/images/cat.png"
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

# Define text descriptions of increasing specificity
texts = [
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven jumper.",
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven blanket.",
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven blanket with fringed edges.",
]

# Tokenize the captions
text_tokens = longclip.tokenize(texts).to(device)

# Encode image and text, then compute SPECS
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity between the image and each caption
    similarity = F.cosine_similarity(image_features.unsqueeze(1), text_features.unsqueeze(0), dim=-1)

    # SPECS: rescale cosine similarity from [-1, 1] to [0, 1]
    specs_scores = torch.clamp((similarity + 1.0) / 2.0, min=0.0)

# Output results
print("SPECS")
for i, score in enumerate(specs_scores.squeeze()):
    print(f"  Text {i+1}: {score:.4f}")
```

This shows that SPECS successfully assigns progressively higher scores to captions with more fine-grained and correct details:
- Text 1: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven jumper."
  → Score: 0.4293
- Text 2: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket."
  → Score: 0.4457
- Text 3: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket with fringed edges."
  → Score: 0.4583
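The same computation extends naturally to batches of image–caption pairs. Below is a minimal sketch, assuming the imports and the `model`/`preprocess` objects from the example above; the helper name is illustrative, not part of the repo:

```python
def specs_batch(model, preprocess, image_paths, captions, device="cuda"):
    """Illustrative helper: SPECS score for each aligned (image, caption) pair."""
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    tokens = longclip.tokenize(captions).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(images)  # (B, D)
        txt_feats = model.encode_text(tokens)   # (B, D)
        # Pairwise cosine similarity, rescaled from [-1, 1] to [0, 1]
        sim = F.cosine_similarity(img_feats, txt_feats, dim=-1)
        return torch.clamp((sim + 1.0) / 2.0, min=0.0)
```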
To run zero-shot classification on the ImageNet dataset, run the following commands after preparing the data:

```bash
cd /SPECS/evaluation/extrinsic/classification/imagenet
python imagenet.py
```

Similarly, run the following commands for the CIFAR datasets:

```bash
cd /SPECS/evaluation/extrinsic/classification/cifar
python cifar10.py   # CIFAR-10
python cifar100.py  # CIFAR-100
```

To run text-image retrieval on COCO2017 or Urban1k, run the following commands after preparing the data:

```bash
cd /SPECS/evaluation/extrinsic/retrieval
python coco.py     # COCO2017
python Urban1k.py  # Urban1k
```
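Under the hood, text-image retrieval with the SPEC encoder reduces to ranking candidates by cosine similarity. Here is a minimal recall@1 sketch (names are illustrative, not the repo's API; it assumes L2-normalized feature matrices in which row `i` of each corresponds to the same ground-truth pair):

```python
import torch

def recall_at_1(image_features: torch.Tensor, text_features: torch.Tensor) -> float:
    """Fraction of captions whose top-ranked image is the ground-truth match."""
    # (num_texts, num_images) similarity matrix; rows assumed L2-normalized
    sims = text_features @ image_features.T
    top1 = sims.argmax(dim=1)
    targets = torch.arange(len(text_features), device=top1.device)
    return (top1 == targets).float().mean().item()
```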
Download the dataset from HuggingFace.

```bash
cd /SPECS/evaluation/intrinsic/scripts
python specs \
    --model_name_or_path \
    --data_dir \
    --data_split test \
    --output_dir /path/to/save/results \
    --postfix _longclip
```

To train the SPEC model, run:

```bash
bash /SPECS/train/run_spec.sh
```