This repository is the official implementation of CogStream: Context-guided Streaming Video Question Answering.
Figure 1: An overview of our proposed CogReasoner model.
Despite advancements in Video Large Language Models (Vid-LLMs), significant challenges persist in streaming video reasoning. The core issues are twofold: the immense computational burden from processing ever-growing historical context, and the model's distraction by irrelevant information, which undermines the reasoning process.
To address this, we introduce CogStream, a new and challenging task named Context-guided Streaming Video Reasoning. This task simulates real-world scenarios, requiring models to intelligently identify and utilize the most relevant historical context to answer questions about an ongoing video stream.
To support this research, we present:
- A new, densely annotated dataset featuring extensive and hierarchical question-answer pairs, built via a semi-automatic pipeline.
- CogReasoner, a novel baseline model that efficiently tackles the CogStream task by leveraging Visual Stream Compression and Historic Dialogue Retrieval.
The CogStream dataset is designed to evaluate and validate a model's capabilities in context-guided streaming video reasoning.
- Data Sources: We collected 6,361 videos from six public sources: MovieChat (40.2%), MECD (16.8%), QVhighlights (9.8%), VideoMME (6.5%), COIN (18.0%), and YouCook2 (8.6%).
- Scale: The final dataset comprises 1,088 high-quality videos and 59,032 QA pairs, formally split into a training set (852 videos) and a testing set (236 videos).
- Key Features: QA pairs are categorized into three types based on the required temporal context: Basic QA, Streaming QA, and Global QA. Many questions in the Streaming and Global categories require referencing previous dialogue turns for accurate answering, testing a model's deep reasoning abilities.
Figure 2: An illustration of the CogStream task and the hierarchical structure of our dataset.
➡️ Download the Dataset from Huggingface
For detailed instructions on generating your own dataset using our pipeline, please see the guide in the generation directory: ./dataset_gen_pipeline/README.md.
Our proposed CogReasoner framework is designed to efficiently process streaming video and dialogue by focusing on relevant information. It consists of three key modules:
- Visual Stream Compression: This module intelligently processes the incoming video stream. It uses Temporal-Semantic Clustering to group frames into coherent events and then employs Question-aware Streaming Compression to preserve relevant events in high detail while aggressively compressing less relevant ones.
- Historic Dialogue Retrieval: To handle the ever-growing textual context, this module uses an LLM to select only the most relevant historical QA pairs pertinent to the current question. It also determines if a question can be answered using text alone, avoiding unnecessary visual processing.
- Video-text Interleave Reasoning: Finally, the compressed visual information and the retrieved textual context are interleaved chronologically to form the final input, which the LLM uses to generate the answer.
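To make the Temporal-Semantic Clustering idea in Visual Stream Compression concrete, here is a minimal sketch, assuming per-frame feature vectors and a greedy similarity threshold. The function name, threshold value, and feature shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cluster_frames(features: np.ndarray, sim_threshold: float = 0.85):
    """Greedy temporal-semantic clustering (illustrative sketch).

    Walks the stream in temporal order and starts a new event whenever
    the cosine similarity between the next frame and the running event
    centroid drops below `sim_threshold`. Returns (start, end) index
    ranges, one per event.
    """
    events = []
    start = 0
    centroid = features[0].astype(float)
    count = 1
    for i in range(1, len(features)):
        f = features[i]
        c = centroid / count
        sim = float(f @ c / (np.linalg.norm(f) * np.linalg.norm(c) + 1e-8))
        if sim < sim_threshold:
            events.append((start, i - 1))
            start, centroid, count = i, f.astype(float), 1
        else:
            centroid += f
            count += 1
    events.append((start, len(features) - 1))
    return events

# Two synthetic "scenes": frames near one direction, then another.
rng = np.random.default_rng(0)
a = np.tile([1.0, 0.0], (5, 1)) + 0.01 * rng.standard_normal((5, 2))
b = np.tile([0.0, 1.0], (5, 1)) + 0.01 * rng.standard_normal((5, 2))
print(cluster_frames(np.vstack([a, b])))  # → [(0, 4), (5, 9)]
```

In the full model, events judged relevant to the current question would then be kept at high frame detail while the rest are compressed, per the Question-aware Streaming Compression step above.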
Note: Run all commands from the repository root directory to ensure correct path resolution.
We follow the environment setup of VideoLLaMA3. To install requirements:
conda env create -f environment.yml
Download only the VideoLLaMA3 model weights (the .safetensors files) from here and place them in the ./model folder of this repository.
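If you script the download, a glob filter can restrict it to the weight shards. The filter logic below is self-contained; the commented-out `snapshot_download` call shows how the same pattern list would drive `huggingface_hub`, with the repo id left as an assumption (use the link above).

```python
from fnmatch import fnmatch

# Keep only the model weight shards; other repo files are skipped.
ALLOW_PATTERNS = ["*.safetensors"]

def wanted(filename: str) -> bool:
    return any(fnmatch(filename, p) for p in ALLOW_PATTERNS)

listing = ["model-00001-of-00002.safetensors", "config.json", "tokenizer.model"]
print([f for f in listing if wanted(f)])  # → ['model-00001-of-00002.safetensors']

# With huggingface_hub installed, the same filter drives the download
# (the repo id below is a placeholder — take it from the link in this README):
# from huggingface_hub import snapshot_download
# snapshot_download(repo_id="<VideoLLaMA3 repo id>", local_dir="./model",
#                   allow_patterns=ALLOW_PATTERNS)
```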
To train the Historic Dialogue Retrieval module in the paper (First Stage), run this command:
torchrun --nproc_per_node=<number of processes> train/language_model_training.py --model_path <path to the base model directory> --QA_path <path to the dataset QA directory>
--nproc_per_node=<number of processes>: Specifies the number of processes to run per node, typically set to the number of available GPUs (e.g., 8 for 8 GPUs).
To train the Video-text Interleave Reasoning module in the paper (Second Stage), you need to configure accelerate before training. Load the provided accelerate configuration file by running:
accelerate config --load_config accelerate_config.yaml

Then, run the training command:
accelerate launch train/second_stage_training.py --model_path <path to the base model directory> --video_dir <directory containing train video files> --query_dir <directory containing train query (QA) files> --num_epochs <training epochs number>
To evaluate our model on CogStream, first run the following command to generate answers on our dataset:
torchrun --nproc_per_node=<number of processes> evaluate/answer_generate.py --model_path <path to the base model directory> --lora_adapter_1_path <path to the first stage LoRA adapter> --lora_adapter_2_path <path to the second stage LoRA adapter> --video_dir <directory containing test video files> --query_dir <directory containing test query (QA) files> --save_dir <directory to save the result>
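For scripted evaluation runs, the long command above can be assembled programmatically. Every path below is a placeholder assumption chosen for illustration; substitute your own checkpoints and data directories.

```python
import shlex

def build_eval_cmd(nproc, model_path, lora1, lora2, video_dir, query_dir, save_dir):
    """Assemble the answer-generation command shown above (paths are examples)."""
    args = [
        "torchrun", f"--nproc_per_node={nproc}",
        "evaluate/answer_generate.py",
        "--model_path", model_path,
        "--lora_adapter_1_path", lora1,
        "--lora_adapter_2_path", lora2,
        "--video_dir", video_dir,
        "--query_dir", query_dir,
        "--save_dir", save_dir,
    ]
    return shlex.join(args)  # shell-safe quoting of each argument

print(build_eval_cmd(8, "./model", "./ckpt/stage1", "./ckpt/stage2",
                     "./data/test/videos", "./data/test/qa", "./results"))
```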
Pretrained LoRA weights will be released soon.
A visualization of an example result demonstrating the model's performance is shown below.

Performance metrics of different models across 11 CogStream capabilities are shown below. Prm. denotes the number of model parameters; Frm. denotes the number of sampled frames. Models denoted by
If you find our work useful for your research, please consider citing our paper:
@article{zhao2025cogstream,
title={CogStream: Context-guided Streaming Video Question Answering},
author={Zhao, Zicheng and Wang, Kangyu and Li, Shijie and Qian, Rui and Lin, Weiyao and Liu, Huabin},
journal={arXiv preprint arXiv:2506.10516},
year={2025}
}

This project is licensed under the MIT License. All contributions to the code must be made under this license. See the LICENSE file for details.

