LLMVS is a video summarization framework that leverages the capabilities of recent Large Language Models (LLMs). Our method translates video frames into captions using a Multi-modal Large Language Model (M-LLM) and assesses frame importance through an LLM based on local context. These local importance scores are refined through a global attention mechanism, ensuring summaries effectively reflect both details and the overarching narrative.
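For orientation, the sketch below shows how such a caption-then-score pipeline fits together. It is a minimal illustration, not the code in this repository: the function names (`caption_frame`, `llm_local_importance`), the window size, and the summary ratio are placeholders, and the global attention refinement is only noted in a comment.

```python
from typing import List

def caption_frame(frame) -> str:
    # Placeholder for the M-LLM call that turns a frame into a caption.
    return "a person walks through a park"

def llm_local_importance(window: List[str]) -> float:
    # Placeholder for the LLM call that rates the center frame given nearby captions.
    return 0.5

def summarize(frames, window_size: int = 2, ratio: float = 0.15) -> List[int]:
    # 1) Frame -> caption with an M-LLM.
    captions = [caption_frame(f) for f in frames]
    # 2) An LLM scores each frame from its local caption context.
    local_scores = [
        llm_local_importance(captions[max(0, i - window_size): i + window_size + 1])
        for i in range(len(captions))
    ]
    # 3) LLMVS further refines these scores with a global attention module
    #    (see the training options below); here we simply rank by the local scores.
    order = sorted(range(len(frames)), key=lambda i: local_scores[i], reverse=True)
    return sorted(order[: max(1, int(len(frames) * ratio))])

print(summarize(list(range(20))))  # frame indices chosen for the summary
```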
- Clone Repository
git clone https://github.com/mlee47/LLMVS.git
cd LLMVS
- Create and Activate Virtual Environment
conda create -n llmvs python=3.8
conda activate llmvs
pip install pip==23.3.2
conda install hdf5=1.10.6 h5py=2.10.0
pip install -r requirements.txt
- Download Path: PGL-SUM or direct download
- SumMe LLaMA Embeddings: download
- TVSum LLaMA Embeddings: download
- (Additional) MR.HiSum LLaMA Embeddings: download
Note: All LLaMA embeddings are max-pooled.
Download the datasets and LLaMA embeddings, then organize them in the following directory structure:
LLMVS/
├── llama_emb/
│   ├── summe_sum/
│   │   ├── gen/gen_pool.h5
│   │   └── user_prompt/user_prompt_pool.h5
│   └── tvsum_sum/
│       ├── gen/gen_pool.h5
│       └── user_prompt/user_prompt_pool.h5
├── SumMe/
│   └── eccv16_dataset_summe_google_pool5.h5
└── TVSum/
    ├── eccv16_dataset_tvsum_google_pool5.h5
    └── ydata-tvsum50-anno.tsv
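To verify the downloads, the snippet below (assumed to be run from the LLMVS/ root) walks each embedding file with h5py and prints the stored dataset names and shapes. It makes no assumption about the internal HDF5 key layout, which is not documented here.

```python
import h5py

def inspect(path: str) -> None:
    # Print every dataset stored in the HDF5 file along with its shape and dtype.
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    with h5py.File(path, "r") as f:
        f.visititems(show)

inspect("llama_emb/summe_sum/gen/gen_pool.h5")
inspect("llama_emb/summe_sum/user_prompt/user_prompt_pool.h5")
```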
We use 5 GPUs for distributed training.
bash train.sh
SumMe Dataset Training:
python train.py --tag summe_split0 --model summe_head2_layer3 --lr 0.000119 --epochs 200 --dataset summe --reduced_dim 2048 --num_heads 2 --num_layers 3 --split_idx 0 --pt_path 'llama_emb/summe_sum/'
TVSum Dataset Training:
python train.py --tag tvsum_split0 --model tvsum_head2_layer3 --lr 0.00007 --epochs 200 --dataset tvsum --reduced_dim 2048 --num_heads 2 --num_layers 3 --split_idx 0 --pt_path 'llama_emb/tvsum_sum/'
- --dataset: Dataset selection (summe or tvsum)
- --split_idx: Data split index (0-4)
- --epochs: Number of training epochs (default: 200)
- --lr: Learning rate
  - SumMe: 0.000119
  - TVSum: 0.00007
- --reduced_dim: Dimension reduction size (default: 2048)
- --num_heads: Number of Transformer heads (default: 2)
- --num_layers: Number of Transformer layers (default: 3)
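As a rough illustration of what the last three flags control, the sketch below builds a 2-head, 3-layer Transformer encoder over 2048-dimensional features and predicts one score per frame. It is only a hedged example with those default values, not the model architecture implemented in train.py.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Toy scorer: Transformer encoder over per-frame features, one score per frame."""
    def __init__(self, dim: int = 2048, num_heads: int = 2, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) frame/caption features
        return self.head(self.encoder(x)).squeeze(-1)  # (batch, num_frames) scores

scores = FrameScorer()(torch.randn(1, 120, 2048))  # toy input: 120 frames
print(scores.shape)  # torch.Size([1, 120])
```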
You can download model checkpoints from the following link:
Model Checkpoints: Download
After downloading, extract the checkpoints to the following directory structure:
LLMVS/
├── Summaries/
│   ├── {SumMe model_name}/
│   │   └── summe/
│   │       ├── summe_split0/
│   │       │   ├── best_rho_model/
│   │       │   ├── best_tau_model/
│   │       │   └── configuration.txt
│   │       ├── summe_split1/
│   │       ├── ...
│   │       └── summe_split4/
│   └── {TVSum model_name}/
│       └── tvsum/
│           ├── tvsum_split0/
│           │   ├── best_rho_model/
│           │   ├── best_tau_model/
│           │   └── configuration.txt
│           ├── tvsum_split1/
│           ├── ...
│           └── tvsum_split4/
bash test.sh
CUDA_VISIBLE_DEVICES=0,2,4,5,6 python test.py --dataset {dataset} --split_idx 0 --tag {tag} --weights 'Summaries/{model_name}/{dataset}/{tag}/ckpt_file' --pt_path llama_emb/{dataset}/ --result_dir 'Summaries/{model_name}/{dataset}/' --num_heads 2 --num_layers 3 --reduced_dim 2048
When using this project, please cite as follows:
@inproceedings{lee2025video,
title={Video Summarization with Large Language Models},
author={Lee, Min Jung and Gong, Dayoung and Cho, Minsu},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18981--18991},
year={2025}
}