
Learning Streaming Video Representation via Multitask Training

Official implementation of Learning Streaming Video Representation via Multitask Training, ICCV 2025 (Oral)

Yibin Yan*, Jilan Xu*, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie

(*: equal contribution)

TODO

  • Add instructions for quick start.
  • Add downstream evaluation pipelines.
  • Release StreamFormer Checkpoints.
  • Release Datasets Annotations.

Quick Start

Installation

git clone https://github.com/Go2Heart/StreamFormer.git
cd StreamFormer
conda create -n streamformer python=3.10
conda activate streamformer
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
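After installing, a quick sanity check (generic, not part of the StreamFormer codebase) can confirm that the pinned PyTorch build is importable and that CUDA is visible:

```python
# Verify the environment created above: print the installed torch version
# and whether a CUDA device is available for training/inference.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```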

Pre-trained Model Usage

We have uploaded our StreamFormer model, pre-trained at global, temporal, and spatial granularities, to 🤗 Hugging Face.

Inference Usage

from models import TimesformerMultiTaskingModelSigLIP
import torch
model = TimesformerMultiTaskingModelSigLIP.from_pretrained("StreamFormer/streamformer-timesformer").eval()
with torch.no_grad():
    fake_frames = torch.randn(1, 16, 3, 224, 224)  # [B, T, C, H, W]
    fake_frames = fake_frames.to(model.device)
    output = model(fake_frames)
    # global representation [B, D]
    print(output.pooler_output[:,-1].shape, output.pooler_output[:,-1])
    
    # temporal representation [B, T, D]
    print(output.pooler_output.shape, output.pooler_output)
    
    # spatial representation [B, T, HxW, D]
    print(output.last_hidden_state.shape, output.last_hidden_state)
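As one possible downstream use of the global representation above, two clips can be compared by the cosine similarity of their `[B, D]` embeddings. The snippet below is a self-contained sketch: it uses random placeholder tensors of a hypothetical dimension D=768 in place of `output.pooler_output[:, -1]` from two real clips.

```python
# Compare two clips via their global embeddings (placeholders stand in for
# real StreamFormer outputs, e.g. output_a.pooler_output[:, -1]).
import torch
import torch.nn.functional as F

emb_a = torch.randn(1, 768)  # global embedding of clip A, shape [B, D]
emb_b = torch.randn(1, 768)  # global embedding of clip B, shape [B, D]

# Cosine similarity in [-1, 1]; higher means more similar content.
sim = F.cosine_similarity(emb_a, emb_b, dim=-1)
print("cosine similarity:", sim.item())
```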

Pre-training

To download our pre-training video annotations, use this link.

Update the necessary paths in scripts/pretrain_streamformer.sh and in the dataset metadata, then run the script:

bash scripts/pretrain_streamformer.sh 

Evaluations

1. Action Recognition

Check the README of Action Recognition.

2. Online Action Detection

Check the README of Online Action Detection.

3. OVIS

Follow the README of CTVIS to install the corresponding environment.

Train StreamFormer for OVIS.

export DETECTRON2_DATASETS=/PATH/TO/VIS/DATA;
python -m downstream.OVIS.train_ctvis --resume --config-file downstream/OVIS/configs/ytvis_2019/CTVIS_Streamformer.yaml --num-gpus 4

4. VideoQA

[🔥News!] We have released our new VideoQA model (VideoMME, w/o subtitles: 55.0) based on Qwen2.5: [🤗HF Link].

The model can now run inference on streaming video input (e.g., when the input video and the user query arrive asynchronously), with KV-Cache enabled for StreamFormer! For a usage example, please check out our naive test script test_kvcache.py.
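To illustrate the idea behind KV caching for streaming input, here is a minimal, generic single-head attention sketch. This is NOT the repo's actual test_kvcache.py API; all names (`wq`, `step`, etc.) are hypothetical. Each incoming frame attends over keys and values accumulated from all previous frames, so per-frame work stays small instead of reprocessing the whole stream.

```python
# Generic KV-cache sketch: causal attention over an incrementally growing
# key/value cache, one frame at a time.
import torch
import torch.nn.functional as F

D = 64  # hypothetical feature dimension
wq, wk, wv = (torch.randn(D, D) for _ in range(3))  # toy projection weights
k_cache, v_cache = [], []

def step(frame_feat):
    """Process one incoming frame feature [1, D], reusing cached K/V."""
    q = frame_feat @ wq
    k_cache.append(frame_feat @ wk)  # cache grows by one entry per frame
    v_cache.append(frame_feat @ wv)
    k = torch.cat(k_cache, dim=0)    # [t, D]: all frames seen so far
    v = torch.cat(v_cache, dim=0)
    attn = F.softmax(q @ k.T / D ** 0.5, dim=-1)
    return attn @ v                  # [1, D] causal representation

for t in range(4):                   # frames arrive one by one
    out = step(torch.randn(1, D))
print("streamed output shape:", tuple(out.shape))
```

The cache makes the representation causal by construction: frame t only ever sees frames 1..t, matching the streaming setting where the user query may arrive before the video ends.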

Follow the README of LLaVA-NeXT to install the corresponding environment.

Prepare the necessary data:

Train the StreamFormer checkpoint in 3 stages.

cd downstream/VideoQA
## stage 1 for pretraining
bash scripts/train/stage1_pretrain_timesformer_siglip_base.sh

## stage 2 for image-qa instruction tuning
bash scripts/train/stage2_direct_finetune_timesformer_siglip_base.sh 

## stage 3 for video-qa instruction tuning
bash scripts/train/stage3_direct_finetune_timesformer_video_only.sh 

For VideoQA evaluation, you can use this initial model checkpoint for now to run the evaluation code in our example (swapping the checkpoint path).

Acknowledgements

Thanks to the codebases of UMT, transformers, MAT, CTVIS, and LLaVA-NeXT.

Citations

If you find our work useful, please cite:

@InProceedings{Yan_2025_ICCV,
    author    = {Yan, Yibin and Xu, Jilan and Di, Shangzhe and Liu, Yikun and Shi, Yudi and Chen, Qirui and Li, Zeqian and Huang, Yifei and Xie, Weidi},
    title     = {Learning Streaming Video Representation via Multitask Training},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {9900-9912}
}
