Official implementation of Learning Streaming Video Representation via Multitask Training, ICCV 2025 (Oral)
Yibin Yan*, Jilan Xu*, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
(*: equal contribution)
- Add instructions for quick start.
- Add downstream evaluation pipelines.
- Release StreamFormer Checkpoints.
- Release Datasets Annotations.
git clone https://github.com/Go2Heart/StreamFormer.git
cd StreamFormer
conda create -n streamformer python=3.10
conda activate streamformer
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txtWe have uploaded our streamformer pre-trained on Global-, Temporal- and Spatial- granularities to 🤗huggingface.
```python
from models import TimesformerMultiTaskingModelSigLIP
import torch

model = TimesformerMultiTaskingModelSigLIP.from_pretrained("StreamFormer/streamformer-timesformer").eval()
with torch.no_grad():
    fake_frames = torch.randn(1, 16, 3, 224, 224)
    fake_frames = fake_frames.to(model.device)
    output = model(fake_frames)
    # global representation [B, D]
    print(output.pooler_output[:, -1].shape, output.pooler_output[:, -1])
    # temporal representation [B, T, D]
    print(output.pooler_output.shape, output.pooler_output)
    # spatial representation [B, T, HxW, D]
    print(output.last_hidden_state.shape, output.last_hidden_state)
```

To download our pre-training video annotations, use this link.
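The model consumes frames shaped `[B, T, 3, 224, 224]`. As a minimal sketch of how raw decoded frames might be brought into that layout (the bilinear resize and `[-1, 1]` scaling below are assumptions, not the repository's official preprocessing — check the dataloader for the exact transform):

```python
import torch
import torch.nn.functional as F

def prepare_frames(frames: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Resize a [T, 3, H, W] float tensor of frames in [0, 1] to the
    [1, T, 3, size, size] layout used in the quick-start example.

    NOTE: the resize mode and [-1, 1] scaling are assumptions; see the
    repository's dataloader for the official preprocessing.
    """
    frames = F.interpolate(frames, size=(size, size), mode="bilinear", align_corners=False)
    frames = frames * 2.0 - 1.0  # assumed SigLIP-style scaling to [-1, 1]
    return frames.unsqueeze(0)   # add the batch dimension

# Stand-in for 16 decoded RGB frames at 360x640, values in [0, 1].
raw = torch.rand(16, 3, 360, 640)
batch = prepare_frames(raw)
print(batch.shape)  # torch.Size([1, 16, 3, 224, 224])
```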
Update the necessary paths in scripts/pretrain_streamformer.sh and the dataset metadata, then run the script:

```shell
bash scripts/pretrain_streamformer.sh
```

Check the README of Action Recognition.
Check the README of Online Action Detection.
Follow the README of CTVIS to install the corresponding environment.
Train StreamFormer for OVIS.
```shell
export DETECTRON2_DATASETS=/PATH/TO/VIS/DATA;
python -m downstream.OVIS.train_ctvis --resume --config-file downstream/OVIS/configs/ytvis_2019/CTVIS_Streamformer.yaml --num-gpus 4
```

[🔥News!] We have released our new VideoQA model (VideoMME (w/o subtitles): 55.0) based on Qwen2.5: [🤗HF Link].
The model now supports inference on streaming video input (e.g., when the input video and the user query arrive asynchronously), with KV-Cache enabled for StreamFormer! For a usage example, please check out our test script test_kvcache.py.
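To see why a KV-Cache matters for asynchronous streams, the toy sketch below (hypothetical names, not StreamFormer's actual API — see test_kvcache.py for the real usage) caches per-frame states so that each arriving chunk only pays for its new frames instead of re-encoding the whole history:

```python
class ToyKVCache:
    """Toy illustration of KV-caching for streaming input: previously
    processed frames keep their cached states, so arriving chunks only
    encode the new frames. Hypothetical sketch, not StreamFormer's API."""

    def __init__(self):
        self.cache = []          # one cached "state" per processed frame
        self.encode_calls = 0    # counts the (expensive) per-frame work

    def _encode(self, frame):
        self.encode_calls += 1
        return frame * 2         # stand-in for real feature extraction

    def process_chunk(self, frames):
        # Only the new frames are encoded; the cache supplies the rest.
        self.cache.extend(self._encode(f) for f in frames)
        return list(self.cache)  # "attend" over the full cached history

stream = ToyKVCache()
stream.process_chunk([1, 2, 3])      # first chunk: 3 frames encoded
history = stream.process_chunk([4])  # second chunk: only 1 new encode
print(len(history), stream.encode_calls)  # 4 4
```

Without the cache, the second chunk would re-encode all four frames; with it, the encoder runs exactly once per frame over the life of the stream.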
Follow the README of LLaVA-NeXT to install the corresponding environment.
Prepare the necessary data:
Train the StreamFormer checkpoint in 3 stages:
```shell
cd downstream/VideoQA
## stage 1 for pretraining
bash scripts/train/stage1_pretrain_timesformer_siglip_base.sh
## stage 2 for image-qa instruction tuning
bash scripts/train/stage2_direct_finetune_timesformer_siglip_base.sh
## stage 3 for video-qa instruction tuning
bash scripts/train/stage3_direct_finetune_timesformer_video_only.sh
```

For VideoQA evaluation, you can use this initial model checkpoint for now to run the evaluation code in our example (swapping in the checkpoint path).
Thanks to the codebases of UMT, transformers, MAT, CTVIS, and LLaVA-NeXT.
If you find our work useful, please cite:
```bibtex
@InProceedings{Yan_2025_ICCV,
    author    = {Yan, Yibin and Xu, Jilan and Di, Shangzhe and Liu, Yikun and Shi, Yudi and Chen, Qirui and Li, Zeqian and Huang, Yifei and Xie, Weidi},
    title     = {Learning Streaming Video Representation via Multitask Training},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {9900-9912}
}
```
