VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
🚀 Welcome to the official repository of VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding!
Applying in-context learning to video-language tasks is challenging because video LMMs have limited context length, and videos require far more tokens than images. To address this, we propose VideoICL, a novel video in-context learning framework for OOD video understanding tasks that extends the effective context length without incurring high costs.
Our VideoICL implementation includes the following key features:
- ✅ Similarity-based Example Selection: Selects relevant video-question pairs based on query relevance.
- 🔁 Confidence-based Iterative Inference: Iteratively refines the results until a high-confidence response is obtained.
- 🏆 State-of-the-Art Performance: Outperforms existing baselines, including GPT-4o and Gemini, on multiple benchmarks with a 7B model.
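The confidence-based iterative loop can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation: `rank_examples`, `run_inference`, the batch size `k`, and the confidence threshold are all hypothetical stand-ins.

```python
# Sketch of confidence-based iterative inference (hypothetical API).
# Each iteration feeds the next batch of k most-similar examples to the
# model; inference stops early once the answer confidence is high enough.

def videoicl_infer(query, example_pool, rank_examples, run_inference,
                   k=8, threshold=0.9, max_iters=4):
    """Iterate over batches of top-ranked in-context examples until the
    model's answer confidence exceeds `threshold`; return the best answer."""
    ranked = rank_examples(query, example_pool)  # most similar first
    best_answer, best_conf = None, -1.0
    for i in range(max_iters):
        batch = ranked[i * k:(i + 1) * k]  # next k in-context examples
        if not batch:
            break  # example pool exhausted
        answer, conf = run_inference(query, batch)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:  # high-confidence answer: stop iterating
            break
    return best_answer, best_conf
```

Because low-confidence queries trigger further iterations with fresh examples while easy queries stop after one pass, the framework spends extra compute only where it is needed.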
In this repository, we evaluate the Qwen2-VL-7B model with VideoICL on video classification using the UCF-Crime dataset.
```sh
conda create -n videoicl python=3.10 -y
conda activate videoicl
git clone https://github.com/KangsanKim07/VideoICL.git
cd VideoICL
```

Download the following files to the `data/UCF-Crimes/raw` folder from this link.
- Anomaly-Videos-Part-1~4.zip
- Normal_Videos_for_Event_Recognition.zip
- UCF-Crimes-Train-Test-Split.zip
Then run:

```sh
sh data/UCF-Crimes/preprocess.sh
```

After preprocessing, the `data` folder should look like this:
```
data
└── UCF-Crimes
    ├── raw
    │   ├── Anomaly-Videos-Part-*.zip
    │   ├── Normal_Videos_for_Event_Recognition.zip
    │   ├── UCF-Crimes-Train-Test-Split.zip
    │   └── ...
    ├── videos
    │   ├── Normal_Videos_event
    │   ├── Abuse
    │   ├── Arrest
    │   ├── ...
    │   └── Vandalism
    └── Action_Recognition_splits
        ├── test_001.txt
        ├── test_002.txt
        ├── ...
        ├── train_003.txt
        └── train_004.txt
```
Download the InternVideo2 checkpoint and run:

```sh
sh scripts/extract_visual_feat.sh ${PATH_TO_InternVideo2-stage2_1b-224p-f4.pt}
```

This will generate a file of video features at `data/UCF-Crimes/vid_feat.pkl`.
```sh
sh scripts/get_simrank.sh
```

This will generate similarity rankings for each test video in `data/UCF-Crimes/simrank`.
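Conceptually, the similarity-ranking step orders the training videos by their relevance to each test query. A minimal sketch of one such ranking, assuming pooled per-video feature vectors and cosine similarity (the actual script's feature format and metric may differ):

```python
# Hypothetical sketch: rank training videos by cosine similarity to a query.
import numpy as np

def cosine_rank(query_feat, train_feats):
    """Return indices of training videos sorted by cosine similarity
    to the query feature vector, most similar first."""
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q               # cosine similarity per training video
    return np.argsort(-sims)   # descending similarity
```

The resulting index order is what the iterative inference stage would consume batch by batch.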
```sh
pip install qwen-vl-utils
sh scripts/run_videoicl.sh
```

| Model | #examples | Animal Kingdom | Sports-QA | Pit-VQA | UCF-Crime | Drive&Act | CapERA |
|---|---|---|---|---|---|---|---|
| GPT-4o | 0 | 58.2 | - | 6.9 | 58.0 | - | 0.173 |
| Gemini-1.5 Pro | 0 | 72.9 | - | 14.7 | 55.1 | - | 0.176 |
| LLaVA-Video-72B | 0 | 69.7 | 25.7 | 5.7 | 35.6 | 14.6 | 0.170 |
| LLaVA-Video-7B | 0 | 68.0 | 25.5 | 6.7 | 39.3 | 20.2 | 0.181 |
| +VideoICL | 8 | 72.3 | 47.6 | 61.3 | 53.3 | 53.4 | 0.178 |
| Qwen2-VL-7B | 0 | 58.6 | 26.8 | 5.8 | 36.1 | 10.6 | 0.138 |
| +VideoICL | 8 | 66.3 | 51.5 | 59.6 | 48.7 | 49.3 | 0.189 |
| Oryx-1.5-7B | 0 | 58.6 | 28.3 | 3.8 | 11.9 | 10.7 | 0.151 |
| +VideoICL | 8 | 58.5 | 52.0 | 58.4 | 44.0 | 57.3 | 0.179 |
If you find this work useful, please cite our paper:
```bibtex
@article{kim2024videoicl,
  title={VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding},
  author={Kim, Kangsan and Park, Geon and Lee, Youngwan and Yeo, Woongyeong and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2412.02186},
  year={2024}
}
```
