Jinhui Yi*, Syed Talal Wasim*, Yanan Luo*, Muzammal Naseer, Juergen Gall
*Equal Contribution
Abstract: We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders, using only 45M parameters for visual processing (at least a 6.5$\times$ reduction compared to traditional approaches). The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model's effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects such as correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach, while achieving 3-4$\times$ faster processing speeds than previous methods.
- [Feb. 26, 2025] 💥💥💥 Video-Panda has been accepted to CVPR 2025! 💥
- [Dec. 29, 2024] 🔥 We release the pretrained and finetuned Video-Panda models. You can download them from Huggingface or Onedrive.
- [Dec. 25, 2024] 💥 The paper and code are released.
Video-Panda is an encoder-free video conversation model that directly processes video inputs through a novel Spatio-Temporal Alignment Block (STAB). It eliminates the need for heavyweight pretrained encoders and requires fewer than 50M parameters for visual processing.
Comparison to existing video-language model architectures.
Detailed architecture of our Spatio-Temporal Alignment Block (STAB).
Model performance on MSVD-QA versus the size of the visual component, on a logarithmic scale. The bubble size indicates the amount of finetuning data (in thousands). Models using the same training dataset as ours (100K samples) are shown in dark green, while those using different datasets are in blue.
Qualitative examples showing the impact of removing Frame-wise Spatial Relationship Aggregator (FSRA) and Global Spatio-Temporal Relationship Aggregator (GSTRA).
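To give a feel for how the components named above fit together before diving into the code, here is a minimal, self-contained PyTorch sketch of an encoder-free alignment block in the spirit of STAB. The class name `ToySTAB`, the layer choices, and all dimensions are illustrative assumptions, not the repository's actual implementation; please refer to the released code for the real architecture.

```python
# Toy sketch of a STAB-like block: local spatio-temporal encoding,
# learned-attention spatial downsampling, frame-level and video-level
# aggregation. Names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class ToySTAB(nn.Module):  # hypothetical name, not the repo's class
    def __init__(self, dim=256, patch=14, num_queries=16):
        super().__init__()
        # Local spatio-temporal encoding: a small 3D conv patchifier that
        # turns raw RGB frames into spatio-temporal tokens.
        self.local_encode = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                                      stride=(1, patch, patch))
        # Spatial downsampling via learned attention: a set of learned
        # queries pools the spatial tokens of each frame.
        self.pool_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.pool_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        # Frame-wise aggregation (self-attention within each frame).
        self.frame_attn = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        # Global spatio-temporal aggregation (self-attention across frames).
        self.video_attn = nn.TransformerEncoderLayer(dim, 4, batch_first=True)

    def forward(self, video):            # video: (B, 3, T, H, W)
        x = self.local_encode(video)     # (B, D, T, h, w)
        B, D, T, h, w = x.shape
        x = x.permute(0, 2, 3, 4, 1).reshape(B * T, h * w, D)
        q = self.pool_queries.unsqueeze(0).expand(B * T, -1, -1)
        x, _ = self.pool_attn(q, x, x)   # downsampled spatial tokens per frame
        x = self.frame_attn(x)           # frame-level relationships
        x = x.reshape(B, T * x.shape[1], D)
        x = self.video_attn(x)           # video-level relationships
        return x                         # visual tokens handed to the LLM


if __name__ == "__main__":
    tokens = ToySTAB()(torch.randn(1, 3, 8, 224, 224))
    print(tokens.shape)                  # torch.Size([1, 128, 256])
```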
1. Prepare the code and the environment
- Python >= 3.10
- PyTorch == 2.1.0
- CUDA Version >= 11.7
git clone https://github.com/jh-yi/Video-Panda
cd Video-Panda
conda create -n videopanda python=3.10 -y
conda activate videopanda
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn==2.6.3 --no-build-isolation
pip install git+https://github.com/huggingface/accelerate.git
export PYTHONPATH="$PYTHONPATH:."
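Optionally, you can sanity-check the environment before moving on. This snippet is not part of the official setup; it only assumes the packages installed above are importable.

```python
# Quick environment check: versions and CUDA visibility.
import torch

print("PyTorch:", torch.__version__)           # expected 2.1.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)     # expected >= 11.7

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expected 2.6.3
except ImportError:
    print("flash-attn not installed")
```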
2. Prepare the pretrained models and configs
We train our model based on EVE. Download EVE-7B-Pretrain-v1.0 and extract it into the checkpoints/ directory. Replace checkpoints/EVE-7B-Pretrain-v1.0/config.json with videopanda/config/config.json.
For evaluation, download the pretrained and finetuned Video-Panda models from Huggingface or Onedrive and extract them into the checkpoints/ directory.
After downloading all of them, organize the models as follows.
checkpoints
├── EVE-7B-Pretrain-v1.0
│   ├── config.json -> config.json
│   └── ...
└── Video-Panda-7B
    ├── videopanda_fitu
    │   ├── config.json
    │   └── ...
    └── videopanda_prtr1
        ├── config.json
        └── ...
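Optionally, the following small sketch (not part of the official instructions) checks that the expected checkpoint files from the tree above are in place:

```python
# Verify the checkpoint layout before training/evaluation.
from pathlib import Path

expected = [
    "checkpoints/EVE-7B-Pretrain-v1.0/config.json",
    "checkpoints/Video-Panda-7B/videopanda_fitu/config.json",
    "checkpoints/Video-Panda-7B/videopanda_prtr1/config.json",
]
for p in expected:
    status = "ok" if Path(p).exists() else "MISSING"
    print(f"{status:7s} {p}")
```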
3. Prepare the datasets
Video-Panda was trained on the Valley-702k and Video-ChatGPT-100k datasets and evaluated on four open-ended VideoQA benchmarks: MSRVTT-QA, MSVD-QA, TGIF-QA, and ActivityNet-QA. Please follow the instructions in Video-LLaVA for downloading the data.
After downloading all of them, organize the data as follows in DATA_ROOT.
DATA_ROOT
├── train
│   ├── train_json
│   ├── valley
│   └── videochatgpt_tune
└── eval
    └── GPT_Zero_Shot_QA
        ├── Activitynet_Zero_Shot_QA
        ├── MSRVTT_Zero_Shot_QA
        ├── MSVD_Zero_Shot_QA
        └── TGIF_Zero_Shot_QA

The training and validation instructions are in TRAIN_AND_VALIDATE.md.
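Similarly, an optional sketch to verify the DATA_ROOT layout above. Reading DATA_ROOT from an environment variable is purely an illustrative assumption; adjust the path to wherever you placed the data.

```python
# Check that the expected training and evaluation folders exist under DATA_ROOT.
import os
from pathlib import Path

root = Path(os.environ.get("DATA_ROOT", "DATA_ROOT"))
expected_dirs = [
    "train/train_json",
    "train/valley",
    "train/videochatgpt_tune",
    "eval/GPT_Zero_Shot_QA/Activitynet_Zero_Shot_QA",
    "eval/GPT_Zero_Shot_QA/MSRVTT_Zero_Shot_QA",
    "eval/GPT_Zero_Shot_QA/MSVD_Zero_Shot_QA",
    "eval/GPT_Zero_Shot_QA/TGIF_Zero_Shot_QA",
]
for d in expected_dirs:
    status = "ok" if (root / d).is_dir() else "MISSING"
    print(f"{status:7s} {root / d}")
```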
Our code is based on the Video-LLaVA and EVE repositories. We thank the authors for releasing their code. If you use our model, please consider citing these works as well.
This work was supported by the Federal Ministry of Education and Research (BMBF) under grant no. 01IS22094A WEST-AI and the ERC Consolidator Grant FORHUE (101044724). For the computations involved in this research, we acknowledge EuroHPC Joint Undertaking for awarding us access to Leonardo at CINECA, Italy, through EuroHPC Regular Access Call - proposal No. EHPC-REG-2024R01-076.
If you find our work, this repository, or pretrained models useful, please consider giving a star ⭐ and citation 📝.
@inproceedings{yi2024video-panda,
  author    = {Jinhui Yi* and Syed Talal Wasim* and Yanan Luo* and Muzammal Naseer and Juergen Gall},
  title     = {Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models},
  booktitle = {CVPR},
  year      = {2025},
}

The content of this project is released under the Apache License 2.0 as found in the LICENSE file.
If you have any questions, please create an issue on this repository or contact us at [email protected], [email protected], or [email protected].




