LLMVS is a video summarization framework that leverages the capabilities of recent Large Language Models (LLMs). Our method translates video frames into captions using a Multi-modal Large Language Model (M-LLM) and assesses frame importance through an LLM based on local context. These local importance scores are refined through a global attention mechanism, ensuring summaries effectively reflect both details and the overarching narrative.
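For orientation, the sketch below shows how such a caption-then-score pipeline fits together. It is a minimal illustration, not the code in this repository: the function names (`caption_frame`, `llm_local_importance`), the window size, and the summary ratio are placeholders, and the global attention refinement is only noted in a comment.

```python
from typing import List

def caption_frame(frame) -> str:
    # Placeholder for the M-LLM call that turns a frame into a caption.
    return "a person walks through a park"

def llm_local_importance(window: List[str]) -> float:
    # Placeholder for the LLM call that rates the center frame given nearby captions.
    return 0.5

def summarize(frames, window_size: int = 2, ratio: float = 0.15) -> List[int]:
    # 1) Frame -> caption with an M-LLM.
    captions = [caption_frame(f) for f in frames]
    # 2) An LLM scores each frame from its local caption context.
    local_scores = [
        llm_local_importance(captions[max(0, i - window_size): i + window_size + 1])
        for i in range(len(captions))
    ]
    # 3) LLMVS further refines these scores with a global attention module
    #    (see the training options below); here we simply rank by the local scores.
    order = sorted(range(len(frames)), key=lambda i: local_scores[i], reverse=True)
    return sorted(order[: max(1, int(len(frames) * ratio))])

print(summarize(list(range(20))))  # frame indices chosen for the summary
```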
- Clone Repository
git clone https://github.com/mlee47/LLMVS.git
cd LLMVS
- Create and Activate Virtual Environment
conda create -n llmvs python=3.8
conda activate llmvs
pip install pip==23.3.2
conda install hdf5=1.10.6 h5py=2.10.0
pip install -r requirements.txt
- Download Path: PGL-SUM or direct download
- SumMe LLaMA Embeddings: download
- TVSum LLaMA Embeddings: download
- (Additional) MR.HiSum LLaMA Embeddings: download
Note: All LLaMA embeddings are max-pooled.
Download the datasets and LLaMA embeddings, then organize them in the following directory structure:
LLMVS/
├── llama_emb/
│   ├── summe_sum/
│   │   ├── gen/gen_pool.h5
│   │   └── user_prompt/user_prompt_pool.h5
│   └── tvsum_sum/
│       ├── gen/gen_pool.h5
│       └── user_prompt/user_prompt_pool.h5
├── SumMe/
│   └── eccv16_dataset_summe_google_pool5.h5
└── TVSum/
    ├── eccv16_dataset_tvsum_google_pool5.h5
    └── ydata-tvsum50-anno.tsv
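To verify the downloads, the snippet below (assumed to be run from the LLMVS/ root) walks each embedding file with h5py and prints the stored dataset names and shapes. It makes no assumption about the internal HDF5 key layout, which is not documented here.

```python
import h5py

def inspect(path: str) -> None:
    # Print every dataset stored in the HDF5 file along with its shape and dtype.
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    with h5py.File(path, "r") as f:
        f.visititems(show)

inspect("llama_emb/summe_sum/gen/gen_pool.h5")
inspect("llama_emb/summe_sum/user_prompt/user_prompt_pool.h5")
```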
We use 5 GPUs for distributed training.
bash train.sh
SumMe Dataset Training:
python train.py --tag summe_split0 --model summe_head2_layer3 --lr 0.000119 --epochs 200 --dataset summe --reduced_dim 2048 --num_heads 2 --num_layers 3 --split_idx 0 --pt_path 'llama_emb/summe_sum/'
TVSum Dataset Training:
python train.py --tag tvsum_split0 --model tvsum_head2_layer3 --lr 0.00007 --epochs 200 --dataset tvsum --reduced_dim 2048 --num_heads 2 --num_layers 3 --split_idx 0 --pt_path 'llama_emb/tvsum_sum/'
- --dataset: Dataset selection (summe or tvsum)
- --split_idx: Data split index (0-4)
- --epochs: Number of training epochs (default: 200)
- --lr: Learning rate
  - SumMe: 0.000119
  - TVSum: 0.00007
- --reduced_dim: Dimension reduction size (default: 2048)
- --num_heads: Number of Transformer heads (default: 2)
- --num_layers: Number of Transformer layers (default: 3)
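As a rough illustration of what the last three flags control, the sketch below builds a 2-head, 3-layer Transformer encoder over 2048-dimensional features and predicts one score per frame. It is only a hedged example with those default values, not the model architecture implemented in train.py.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Toy scorer: Transformer encoder over per-frame features, one score per frame."""
    def __init__(self, dim: int = 2048, num_heads: int = 2, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) frame/caption features
        return self.head(self.encoder(x)).squeeze(-1)  # (batch, num_frames) scores

scores = FrameScorer()(torch.randn(1, 120, 2048))  # toy input: 120 frames
print(scores.shape)  # torch.Size([1, 120])
```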
You can download model checkpoints from the following link:
Model Checkpoints: Download
After downloading, extract the checkpoints to the following directory structure:
LLMVS/
├── Summaries/
│   ├── {SumMe model_name}/
│   │   └── summe/
│   │       ├── summe_split0/
│   │       │   ├── best_rho_model/
│   │       │   ├── best_tau_model/
│   │       │   └── configuration.txt
│   │       ├── summe_split1/
│   │       ├── ...
│   │       └── summe_split4/
│   └── {TVSum model_name}/
│       └── tvsum/
│           ├── tvsum_split0/
│           │   ├── best_rho_model/
│           │   ├── best_tau_model/
│           │   └── configuration.txt
│           ├── tvsum_split1/
│           ├── ...
│           └── tvsum_split4/
bash test.sh
CUDA_VISIBLE_DEVICES=0,2,4,5,6 python test.py --dataset {dataset} --split_idx 0 --tag {tag} --weights 'Summaries/{model_name}/{dataset}/{tag}/ckpt_file' --pt_path llama_emb/{dataset}/ --result_dir 'Summaries/{model_name}/{dataset}/' --num_heads 2 --num_layers 3 --reduced_dim 2048
When using this project, please cite as follows:
@inproceedings{lee2025video,
title={Video Summarization with Large Language Models},
author={Lee, Min Jung and Gong, Dayoung and Cho, Minsu},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18981--18991},
year={2025}
}