PerceptionComp is a benchmark for complex perception-centric video reasoning. It targets questions that cannot be solved from a single frame, a single moment, or a short caption: models must revisit visually complex videos, gather evidence from temporally separated segments, and combine multiple perceptual constraints before answering.
- Complex perception-centric reasoning instead of caption-level shortcut solving.
- 1,114 manually annotated five-choice questions.
- Seven categories spanning outdoor tour, shopping, sport, variety show, home tour, game, and movie.
- Unified workflow for download, local video storage, and evaluation.
- Extensible evaluation entry point that supports OpenAI-compatible APIs, Gemini, and custom model runners.
PerceptionComp is released in two parts:
- GitHub repository: contains benchmark annotations, evaluation code, runner templates, analysis utilities, and documentation.
- Hugging Face dataset: stores the benchmark videos referenced by video_id.
git clone https://github.com/hrinnnn/PerceptionComp.git
cd PerceptionComp
pip install -r requirements.txt
Download the benchmark videos from the Hugging Face dataset using the official helper script:
python scripts/download_data.py --repo-id hrinnnn/PerceptionComp
If the Hugging Face dataset requires authentication:
python scripts/download_data.py \
--repo-id hrinnnn/PerceptionComp \
--hf-token YOUR_HF_TOKEN
This script downloads the videos from the Hugging Face data/ directory, flattens the downloaded snapshot into the local layout expected by the evaluator, and validates the result against the official annotation file.
After the script finishes successfully, your local layout is ready for evaluation:
benchmark/
  videos/
    <video_id>.mp4
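If you want to double-check the layout yourself, a minimal sketch along the following lines may help. It assumes you already have the list of video ids from the annotation file (how to load them is not shown here, since the annotation schema is not described in this section), and `find_missing_videos` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_missing_videos(video_ids, video_dir="benchmark/videos"):
    """Return the ids that have no <video_id>.mp4 file under video_dir."""
    root = Path(video_dir)
    # Deduplicate ids, since several questions may share one video.
    unique_ids = set(video_ids)
    return sorted(vid for vid in unique_ids if not (root / f"{vid}.mp4").is_file())
```

An empty return value suggests every referenced video is present locally; any ids returned would need to be downloaded again.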
PerceptionComp currently supports three evaluation modes:
- api: OpenAI-compatible APIs
- gemini: Gemini video-upload workflow
- custom: your own model runner
Use this for GPT-style APIs, Qwen API deployments, GLM-compatible endpoints, Doubao-style endpoints, and similar services.
python evaluate/evaluate.py \
--model YOUR_MODEL_NAME \
--provider api \
--api-key YOUR_API_KEY \
--base-url YOUR_BASE_URL \
--video-dir benchmark/videos
Optional arguments:
- --output-dir: change where results are written
- --frames: control the number of sampled frames
- --proxy: pass a proxy for API calls
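To give a feel for what a frame budget such as --frames means in practice, here is a small sketch of uniform frame sampling. This is an assumption for illustration only: the evaluator's actual sampling strategy is not described in this section, and `uniform_frame_indices` is a hypothetical helper.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never request more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    # Split the video into equal-width segments and take each segment's midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

With a larger frame count, each sampled frame covers a shorter stretch of the video, which matters for questions whose evidence is spread across temporally separated segments.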
python evaluate/evaluate.py \
--model YOUR_GEMINI_MODEL_NAME \
--provider gemini \
--api-key YOUR_GEMINI_API_KEY \
--video-dir benchmark/videos
Optional arguments:
- --force-thinking: retry when <think> tags are missing
- --output-dir: change where results are written
Evaluation outputs are written to:
evaluate/results/Results-<model>.json
evaluate/results/Results-<model>.csv
The JSON file stores per-question predictions and raw responses. The CSV file stores aggregated scores.
If your model is local, implement a custom runner. You can follow these steps:
cp evaluate/tools/runners/custom_template.py evaluate/tools/runners/my_model.py
Open evaluate/tools/runners/my_model.py and replace run_your_model(...) with your own inference logic.
Your function should take:
- video_path
- prompt
- model_name
- custom_config (optional)
and return:
- a raw string response from the model
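As a rough illustration of that contract, the stub below accepts the four inputs and returns a raw string. It is a sketch under assumptions, not the template's actual code: the argument order follows the list above, custom_config is assumed to arrive as an already-parsed dict, and the `default_choice` key is hypothetical.

```python
def run_your_model(video_path, prompt, model_name, custom_config=None):
    """Stand-in runner: returns a fixed answer string instead of real inference."""
    # Assumption: custom_config is a dict (or None when no config is given).
    config = custom_config or {}
    # A real runner would load `model_name`, decode frames from `video_path`,
    # and generate text for `prompt`. This stub only illustrates the
    # inputs-to-string contract.
    default_choice = config.get("default_choice", "A")  # hypothetical key
    return f"Answer: {default_choice}"
```

A stub like this can be handy for checking that the evaluation pipeline runs end to end before plugging in real inference.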
The simplest recommended output format is:
Answer: A
or, if your model supports reasoning traces:
<think>
your reasoning here
</think>
<answer>
A
</answer>
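Before running a full evaluation, it can be useful to check that your runner's raw output matches one of the two formats above. The sketch below is not the evaluator's own parsing logic (which should stay unmodified); `extract_choice` is a hypothetical helper, and it assumes the five choices are lettered A through E.

```python
import re

# Assumption: five-choice questions use the letters A through E.
_TAGGED = re.compile(r"<answer>\s*([A-E])\s*</answer>", re.IGNORECASE)
_PLAIN = re.compile(r"Answer:\s*([A-E])\b", re.IGNORECASE)


def extract_choice(response):
    """Return the chosen letter from either output format, or None."""
    # Prefer the tagged format, then fall back to the plain "Answer: X" form.
    match = _TAGGED.search(response) or _PLAIN.search(response)
    return match.group(1).upper() if match else None
```

A return value of None on your model's typical outputs is a hint that the response format needs adjusting on the runner side.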
python evaluate/evaluate.py \
--model YOUR_MODEL_NAME \
--provider custom \
--custom-runner evaluate/tools/runners/my_model.py \
--video-dir benchmark/videos
If your runner needs an extra config file:
python evaluate/evaluate.py \
--model YOUR_MODEL_NAME \
--provider custom \
--custom-runner evaluate/tools/runners/my_model.py \
--custom-config path/to/your_config.json \
--video-dir benchmark/videos
When adapting your own model, do not modify:
- the annotation format,
- the question prompt structure,
- the answer parsing logic,
- the metric computation,
- the output schema.
Only change the model-side inference path. That is what keeps your results comparable to other models.
The default custom runner template is now a near-runnable local transformers scaffold. If your model follows a Hugging Face VLM workflow, you can often start from the template directly instead of writing a runner from scratch.
If you use PerceptionComp, please cite the corresponding paper once the public version is finalized.
@misc{perceptioncomp2026,
title={PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning},
year={2026}
}
