Official implementation of Learning Streaming Video Representation via Multitask Training, ICCV 2025 (Oral)
Yibin Yan*, Jilan Xu*, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
(*: equal contribution)
- Add instructions for quick start.
- Add downstream evaluation pipelines.
- Release StreamFormer Checkpoints.
- Release Datasets Annotations.
git clone https://github.com/Go2Heart/StreamFormer.git
cd StreamFormer
conda create -n streamformer python=3.10
conda activate streamformer
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txtWe have uploaded our streamformer pre-trained on Global-, Temporal- and Spatial- granularities to 🤗huggingface.
```python
from models import TimesformerMultiTaskingModelSigLIP
import torch

model = TimesformerMultiTaskingModelSigLIP.from_pretrained("StreamFormer/streamformer-timesformer").eval()
with torch.no_grad():
    fake_frames = torch.randn(1, 16, 3, 224, 224)
    fake_frames = fake_frames.to(model.device)
    output = model(fake_frames)
    # global representation [B, D]
    print(output.pooler_output[:, -1].shape, output.pooler_output[:, -1])
    # temporal representation [B, T, D]
    print(output.pooler_output.shape, output.pooler_output)
    # spatial representation [B, T, HxW, D]
    print(output.last_hidden_state.shape, output.last_hidden_state)
```

To download our pre-training video annotations, use this link.
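The model consumes frames shaped `[B, T, 3, 224, 224]`. As a minimal sketch of how raw decoded frames might be brought into that layout (the bilinear resize and `[-1, 1]` scaling below are assumptions, not the repository's official preprocessing — check the dataloader for the exact transform):

```python
import torch
import torch.nn.functional as F

def prepare_frames(frames: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Resize a [T, 3, H, W] float tensor of frames in [0, 1] to the
    [1, T, 3, size, size] layout used in the quick-start example.

    NOTE: the resize mode and [-1, 1] scaling are assumptions; see the
    repository's dataloader for the official preprocessing.
    """
    frames = F.interpolate(frames, size=(size, size), mode="bilinear", align_corners=False)
    frames = frames * 2.0 - 1.0  # assumed SigLIP-style scaling to [-1, 1]
    return frames.unsqueeze(0)   # add the batch dimension

# Stand-in for 16 decoded RGB frames at 360x640, values in [0, 1].
raw = torch.rand(16, 3, 360, 640)
batch = prepare_frames(raw)
print(batch.shape)  # torch.Size([1, 16, 3, 224, 224])
```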
Update the necessary paths in scripts/pretrain_streamformer.sh and the dataset metadata, then run the script:

```shell
bash scripts/pretrain_streamformer.sh
```

Check the README of Action Recognition.
Check the README of Online Action Detection.
Follow the README of CTVIS to install the corresponding environment.
Train StreamFormer for OVIS.
```shell
export DETECTRON2_DATASETS=/PATH/TO/VIS/DATA;
python -m downstream.OVIS.train_ctvis --resume --config-file downstream/OVIS/configs/ytvis_2019/CTVIS_Streamformer.yaml --num-gpus 4
```

[🔥News!] We have released our new VideoQA model (VideoMME (w/o subtitles): 55.0) based on Qwen2.5: [🤗HF Link].
The model now supports inference on streaming video input (e.g., when the input video and the user query arrive asynchronously), with KV-Cache enabled for StreamFormer! For a usage example, please check out our test script test_kvcache.py.
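To see why a KV-Cache matters for asynchronous streams, the toy sketch below (hypothetical names, not StreamFormer's actual API — see test_kvcache.py for the real usage) caches per-frame states so that each arriving chunk only pays for its new frames instead of re-encoding the whole history:

```python
class ToyKVCache:
    """Toy illustration of KV-caching for streaming input: previously
    processed frames keep their cached states, so arriving chunks only
    encode the new frames. Hypothetical sketch, not StreamFormer's API."""

    def __init__(self):
        self.cache = []          # one cached "state" per processed frame
        self.encode_calls = 0    # counts the (expensive) per-frame work

    def _encode(self, frame):
        self.encode_calls += 1
        return frame * 2         # stand-in for real feature extraction

    def process_chunk(self, frames):
        # Only the new frames are encoded; the cache supplies the rest.
        self.cache.extend(self._encode(f) for f in frames)
        return list(self.cache)  # "attend" over the full cached history

stream = ToyKVCache()
stream.process_chunk([1, 2, 3])      # first chunk: 3 frames encoded
history = stream.process_chunk([4])  # second chunk: only 1 new encode
print(len(history), stream.encode_calls)  # 4 4
```

Without the cache, the second chunk would re-encode all four frames; with it, the encoder runs exactly once per frame over the life of the stream.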
Follow the README of LLaVA-NeXT to install the corresponding environment.
Prepare the necessary data:
Train the StreamFormer checkpoint in 3 stages:
```shell
cd downstream/VideoQA
## stage 1 for pretraining
bash scripts/train/stage1_pretrain_timesformer_siglip_base.sh
## stage 2 for image-qa instruction tuning
bash scripts/train/stage2_direct_finetune_timesformer_siglip_base.sh
## stage 3 for video-qa instruction tuning
bash scripts/train/stage3_direct_finetune_timesformer_video_only.sh
```

For VideoQA evaluation, you can use this initial model checkpoint for now to run the evaluation code in our example (swapping in the checkpoint path).
Thanks to the codebases of UMT, transformers, MAT, CTVIS, and LLaVA-NeXT.
If you find our work useful, please cite:
```bibtex
@InProceedings{Yan_2025_ICCV,
    author    = {Yan, Yibin and Xu, Jilan and Di, Shangzhe and Liu, Yikun and Shi, Yudi and Chen, Qirui and Li, Zeqian and Huang, Yifei and Xie, Weidi},
    title     = {Learning Streaming Video Representation via Multitask Training},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {9900-9912}
}
```
