
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Dataset Paper Website

Introduction

[Figure: PerceptionComp teaser]

PerceptionComp is a benchmark for complex perception-centric video reasoning. It targets questions that cannot be solved from a single frame, a single moment, or a short caption: models must revisit visually complex videos, gather evidence from temporally separated segments, and combine multiple perceptual constraints before answering.

✨ Highlights

  • Complex perception-centric reasoning instead of caption-level shortcut solving.
  • 1,114 manually annotated five-choice questions.
  • Seven categories spanning outdoor tour, shopping, sport, variety show, home tour, game, and movie.
  • Unified workflow for download, local video storage, and evaluation.
  • Extensible evaluation entry point that supports OpenAI-compatible APIs, Gemini, and custom model runners.

📦 Data Release

PerceptionComp is released in two parts:

  1. GitHub repository: contains benchmark annotations, evaluation code, runner templates, analysis utilities, and documentation.
  2. Hugging Face dataset: stores the benchmark videos referenced by video_id.

📊 Main Results

[Figure: PerceptionComp main results]

🚀 Quick Start

Step 1. Clone the Repository

git clone https://github.com/hrinnnn/PerceptionComp.git
cd PerceptionComp

Step 2. Install Dependencies

pip install -r requirements.txt

Step 3. Download the Benchmark Videos

Download the benchmark videos from the Hugging Face dataset using the official helper script:

python scripts/download_data.py --repo-id hrinnnn/PerceptionComp

If the Hugging Face dataset requires authentication:

python scripts/download_data.py \
  --repo-id hrinnnn/PerceptionComp \
  --hf-token YOUR_HF_TOKEN

This script downloads the videos from the Hugging Face data/ directory, flattens the downloaded snapshot into the local layout expected by the evaluator, and validates the result against the official annotation file.

After the script finishes successfully, your local layout is ready for evaluation:

benchmark/
  videos/
    <video_id>.mp4
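
If you want to re-check the layout later without re-running the download script, a small standalone check can help. The sketch below is not part of the repository: `find_missing_videos` is a hypothetical helper, and it assumes you have already read the list of video_id values out of the annotation file yourself.

```python
from pathlib import Path


def find_missing_videos(video_ids, video_dir="benchmark/videos"):
    """Return the sorted video_ids that have no <video_id>.mp4 file in video_dir.

    Hypothetical helper for illustration; the official validation lives in
    scripts/download_data.py and may check more than file presence.
    """
    root = Path(video_dir)
    # Deduplicate first: several questions may reference the same video.
    return sorted(v for v in set(video_ids) if not (root / f"{v}.mp4").is_file())
```

An empty list means every referenced video is present under the expected name.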
Step 4. Run Evaluation with a Built-in Backend

PerceptionComp currently supports three evaluation modes:

  • api: OpenAI-compatible APIs
  • gemini: Gemini video-upload workflow
  • custom: your own model runner

Option A. OpenAI-Compatible API

Use this for GPT-style APIs, Qwen API deployments, GLM-compatible endpoints, Doubao-style endpoints, and similar services.

python evaluate/evaluate.py \
  --model YOUR_MODEL_NAME \
  --provider api \
  --api-key YOUR_API_KEY \
  --base-url YOUR_BASE_URL \
  --video-dir benchmark/videos

Optional arguments:

  • --output-dir: change where results are written
  • --frames: control the number of sampled frames
  • --proxy: pass a proxy for API calls
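
To build intuition for what a frame budget means, here is a minimal sketch of uniform frame sampling. This README does not state how the evaluator picks frames, so treat the strategy below as an assumption; `uniform_frame_indices` is a hypothetical function, not part of the repository.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick num_frames indices spread evenly over a video of total_frames frames.

    Assumption: uniform sampling. The evaluator's actual strategy may differ.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    # Never request more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    # Take the midpoint of each of the num_frames equal-length segments.
    return [int(step * i + step / 2) for i in range(num_frames)]


print(uniform_frame_indices(100, 4))  # [12, 37, 62, 87]
```

With a small budget, events that fall between sampled indices are never seen, which is one reason the frame count can matter on questions that depend on temporally separated evidence.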
Option B. Gemini

python evaluate/evaluate.py \
  --model YOUR_GEMINI_MODEL_NAME \
  --provider gemini \
  --api-key YOUR_GEMINI_API_KEY \
  --video-dir benchmark/videos

Optional arguments:

  • --force-thinking: retry when <think> tags are missing
  • --output-dir: change where results are written

Step 5. Check the Outputs

Evaluation outputs are written to:

evaluate/results/Results-<model>.json
evaluate/results/Results-<model>.csv

The JSON file stores per-question predictions and raw responses. The CSV file stores aggregated scores.
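
If you want to recompute a score from the per-question records yourself, something like the sketch below may be a starting point. The output schema is not documented in this README, so the field names `prediction` and `answer` are assumptions; inspect your own Results-<model>.json and pass the real keys.

```python
def accuracy(records, pred_key="prediction", gold_key="answer"):
    """Fraction of records whose predicted letter matches the reference letter.

    pred_key and gold_key are hypothetical field names; adjust them to the
    actual schema of the results file. Records without a reference are skipped,
    and records without a prediction count as wrong.
    """
    scored = [r for r in records if gold_key in r]
    if not scored:
        return 0.0
    correct = sum(
        1
        for r in scored
        if str(r.get(pred_key, "")).strip().upper() == str(r[gold_key]).strip().upper()
    )
    return correct / len(scored)
```

The official numbers remain the ones in the CSV file; this is only a convenience for quick sanity checks.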

🛠️ Evaluate Your Own Model

If your model is local, implement a custom runner. You can follow these steps:

Step 1. Copy the Template

cp evaluate/tools/runners/custom_template.py evaluate/tools/runners/my_model.py

Step 2. Implement the Model Hook

Open evaluate/tools/runners/my_model.py and replace run_your_model(...) with your own inference logic.

Your function should take:

  • video_path
  • prompt
  • model_name
  • custom_config (optional)

and return:

  • a raw string response from the model

The simplest recommended output format is:

Answer: A

or, if your model supports reasoning traces:

<think>
your reasoning here
</think>
<answer>
A
</answer>
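
As a rough illustration of the hook contract, the sketch below shows a placeholder `run_your_model` that takes the four arguments listed above and returns a raw string in the tagged format. It performs no real inference. The exact signature should be taken from custom_template.py. `extract_choice` is a hypothetical self-check for your own output, not the benchmark's answer parser, and the letter range A to E is an assumption based on the five-choice format.

```python
import re


def run_your_model(video_path, prompt, model_name, custom_config=None):
    """Placeholder hook: replace the body with your model's inference call."""
    reasoning = f"(no real inference; received {video_path} for {model_name})"
    return f"<think>\n{reasoning}\n</think>\n<answer>\nA\n</answer>"


def extract_choice(response):
    """Hypothetical format check: return the answer letter, or None if absent.

    Not the benchmark's parser; use it only to confirm that your runner emits
    one of the two recommended output formats.
    """
    # Prefer the tagged form so stray text inside <think> is not picked up.
    match = re.search(r"<answer>\s*([A-E])\s*</answer>", response, re.IGNORECASE)
    if match is None:
        match = re.search(r"Answer:\s*([A-E])\b", response, re.IGNORECASE)
    return match.group(1).upper() if match else None


print(extract_choice(run_your_model("benchmark/videos/demo.mp4", "Q?", "my-model")))  # A
```

Running a check like this on a handful of responses before a full evaluation can catch formatting problems early.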
Step 3. Run Evaluation with the Custom Runner

python evaluate/evaluate.py \
  --model YOUR_MODEL_NAME \
  --provider custom \
  --custom-runner evaluate/tools/runners/my_model.py \
  --video-dir benchmark/videos

If your runner needs an extra config file:

python evaluate/evaluate.py \
  --model YOUR_MODEL_NAME \
  --provider custom \
  --custom-runner evaluate/tools/runners/my_model.py \
  --custom-config path/to/your_config.json \
  --video-dir benchmark/videos

Step 4. Keep the Benchmark Protocol Fixed

When adapting your own model, do not modify:

  • the annotation format,
  • the question prompt structure,
  • the answer parsing logic,
  • the metric computation,
  • the output schema.

Only change the model-side inference path. That is what keeps your results comparable to other models.

The default custom runner template is a near-runnable scaffold for local transformers inference. If your model follows a Hugging Face VLM workflow, you can often start from the template directly instead of writing a runner from scratch.

📚 Citation

If you use PerceptionComp, please cite the corresponding paper once the public version is finalized.

@misc{perceptioncomp2026,
  title={PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning},
  year={2026}
}
