📢 Note: This project is under active development, and the benchmark will be continuously maintained.
If you find this project helpful, please give us a ⭐️ on GitHub to stay up to date with the latest updates.
TeleEgo is a comprehensive omni benchmark designed for multi-person, multi-scene, multi-task, and multimodal long-term memory reasoning over egocentric video streams. It reflects realistic personal assistant scenarios in which continuous egocentric video is collected over hours or even days, requiring models to maintain long-term memory and to perform understanding and cross-memory reasoning over it. Omni here means that TeleEgo covers the full spectrum of roles, scenes, tasks, modalities, and memory horizons, offering all-round evaluation for egocentric AI assistants.
TeleEgo provides:
- Omni-scale, diverse egocentric data from 5 roles across 4 daily scenarios.
- Multi-modal annotations: video, narration, and speech transcripts.
- Fine-grained QA benchmark: 3 cognitive dimensions, 12 subcategories.
- Participants: 5 (gender-balanced)
- Scenarios:
- Work & Study
- Lifestyle & Routines
- Social Activities
- Outings & Culture
- Recording: 3 days/participant (~14.4 hours each)
- Modalities:
- Egocentric video streams
- Speech & conversations
- Narration and event descriptions
TeleEgo-QA evaluates models along three main dimensions:
- Memory
  - Short-term / Long-term / Ultra-long Memory
  - Entity Tracking
  - Temporal Comparison & Interval
- Understanding
  - Causal Understanding
  - Intent Inference
  - Multi-step Reasoning
  - Cross-modal Understanding
- Cross-Memory Reasoning
  - Cross-temporal Causality
  - Cross-entity Relation
  - Temporal Chain Understanding
Each QA instance includes:
- Question type: Single-choice, Multi-choice, Binary, Open-ended
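To make the QA format concrete, here is a minimal sketch of what a single instance might look like. The field names below are illustrative assumptions, not the actual schema of the released JSON files:

```python
# Hypothetical QA instance (field names are assumptions for illustration only).
qa_instance = {
    "participant": "P1",                # which recorder the question refers to
    "dimension": "Memory",              # Memory / Understanding / Cross-Memory Reasoning
    "subcategory": "Entity Tracking",   # one of the 12 subcategories
    "question_type": "Single-choice",   # Single-choice / Multi-choice / Binary / Open-ended
    "question": "What did the participant pick up before leaving the office?",
    "options": ["A. Keys", "B. Umbrella", "C. Laptop", "D. Coffee cup"],
    "answer": "C",
}
```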
The repository is organized as follows:

```
TeleEgo/
├── teleego_data/                  # Dataset samples / metadata
│   ├── outputs/                   # Output results
│   ├── QAs/                       # Question-Answer pairs
│   └── video_merged/              # Merged video files
├── weights/                       # Pre-trained weights (MiniCPM-o, Qwen2.5-Omni, ...)
├── evaluate_gemini25_pro.py       # Evaluation script for Gemini 2.5 Pro
├── evaluate_gpt_4o.py             # Evaluation script for GPT-4o
├── evaluate_minicpm_o.py          # Evaluation script for MiniCPM-o
├── evaluate_qwen25_omni.py        # Evaluation script for Qwen2.5-Omni
├── evaluate_qwen25_vl.py          # Evaluation script for Qwen2.5-VL
├── evaluate_videochat_online.py   # Evaluation script for VideoChat-Online
├── metrics.py                     # Evaluation metrics
├── utils.py                       # Utility functions
├── run.sh                         # Execution script
└── README.md                      # This file
```
1. Download the dataset from Hugging Face (TeleEgo Dataset) or Baidu Netdisk (TeleEgo Dataset).

2. Organize the dataset in the following structure:
```
./TeleEgo/teleego_data/
├── QAs/                       # Question-Answer dataset
│   ├── merged_P1_A.json       # QA data for participant P1
│   ├── merged_P2_A.json       # QA data for participant P2
│   ├── merged_P3_A.json       # QA data for participant P3
│   ├── merged_P4_A.json       # QA data for participant P4
│   └── merged_P5_A.json       # QA data for participant P5
├── outputs/                   # Evaluation outputs
│   ├── gemini25_pro/          # Results for Gemini 2.5 Pro
│   ├── gpt-4o/                # Results for GPT-4o
│   ├── minicpm_o/             # Results for MiniCPM-o
│   ├── qwen25_omni/           # Results for Qwen2.5-Omni
│   ├── qwen25_vl/             # Results for Qwen2.5-VL
│   └── videochat-online/      # Results for VideoChat-Online
└── video_merged/              # Merged long videos with timestamps
    ├── merged_P1.mp4          # P1's 3-day video merged into one file
    ├── merged_P2.mp4          # P2's 3-day video merged into one file
    ├── merged_P3.mp4          # P3's 3-day video merged into one file
    ├── merged_P4.mp4          # P4's 3-day video merged into one file
    ├── merged_P5.mp4          # P5's 3-day video merged into one file
    ├── timeline_P1.json       # P1's timestamp mapping file
    ├── timeline_P2.json       # P2's timestamp mapping file
    ├── timeline_P3.json       # P3's timestamp mapping file
    ├── timeline_P4.json       # P4's timestamp mapping file
    └── timeline_P5.json       # P5's timestamp mapping file
```
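As a sanity check after organizing the data, a short script like the following can verify that the QA files and timeline mappings load correctly. The exact JSON layout is an assumption here, so adjust the keys to match the released files:

```python
import json
from pathlib import Path

DATA_ROOT = Path("./TeleEgo/teleego_data")

# Load each participant's QA file and timeline mapping and print a summary.
for pid in ["P1", "P2", "P3", "P4", "P5"]:
    qa_path = DATA_ROOT / "QAs" / f"merged_{pid}_A.json"
    timeline_path = DATA_ROOT / "video_merged" / f"timeline_{pid}.json"

    with qa_path.open(encoding="utf-8") as f:
        qa_items = json.load(f)
    with timeline_path.open(encoding="utf-8") as f:
        timeline = json.load(f)

    # The top-level structure (list vs. dict) is an assumption.
    print(f"{pid}: {len(qa_items)} QA entries, timeline entries: {len(timeline)}")
```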
Set up your environment according to the official requirements of the model you want to evaluate:
- Qwen2.5-Omni: Follow the official Qwen2.5-Omni setup guide
- MiniCPM-o: Follow the official MiniCPM-o setup guide
- Qwen2.5-VL: Follow the official Qwen2.5-VL setup guide
- VideoChat-Online: Follow the official VideoChat-Online setup guide
- GPT-4o / Gemini 2.5 Pro: Configure your API credentials in `run.sh`
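For the API-based models, a common pattern is to export the keys in `run.sh` and read them inside the evaluation scripts. The variable names below are assumptions; check the scripts for the names they actually expect:

```python
import os

# Hypothetical environment variable names; the evaluation scripts may use different ones.
openai_key = os.environ.get("OPENAI_API_KEY")
google_key = os.environ.get("GOOGLE_API_KEY")

if openai_key is None or google_key is None:
    raise RuntimeError(
        "Set the API keys in run.sh (e.g. `export OPENAI_API_KEY=...`) before running."
    )
```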
To evaluate a model on a specific GPU, use the following command format:

```
sh run.sh <eval_function> <gpu_id>
```

Examples:

```
# Evaluate Qwen2.5-Omni on GPU 0
sh run.sh eval_qwen25_omni 0
```

Available evaluation functions:

- `eval_qwen25_omni` - Qwen2.5-Omni model
- `eval_qwen25_vl` - Qwen2.5-VL model
- `eval_minicpm_o` - MiniCPM-o model
- `eval_videochat_online` - VideoChat-Online model
- `eval_gpt_4o` - GPT-4o (requires API key)
- `eval_gemini25_pro` - Gemini 2.5 Pro (requires API key)
After evaluation, the results will be saved in `./teleego_data/outputs/<model_name>/`. To compute evaluation metrics:

```
python metrics.py
```

This will calculate performance metrics across all evaluation dimensions (Memory, Understanding, Cross-Memory Reasoning).
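As an illustration of what such a metrics pass typically involves, the sketch below computes per-dimension accuracy from saved predictions. The output file layout and field names are assumptions, not the actual format produced by the evaluation scripts:

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical output layout: JSON files of records per model directory,
# each record carrying the QA dimension, the prediction, and the gold answer.
OUTPUT_DIR = Path("./teleego_data/outputs/qwen25_omni")

correct = defaultdict(int)
total = defaultdict(int)

for result_file in OUTPUT_DIR.glob("*.json"):
    with result_file.open(encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        dim = rec["dimension"]  # e.g. "Memory", "Understanding", ...
        total[dim] += 1
        if rec["prediction"] == rec["answer"]:
            correct[dim] += 1

for dim in sorted(total):
    print(f"{dim}: {correct[dim] / total[dim]:.2%} ({correct[dim]}/{total[dim]})")
```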
Submit your results to our Online Leaderboard.
If you find TeleEgo useful in your research, please cite:
```bibtex
@article{yan2025teleego,
  title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
  author={Yan, Jiaqi and Ren, Ruilong and Liu, Jingren and Xu, Shuning and Wang, Ling and Wang, Yiheng and Zhong, Xinlin and Wang, Yun and Zhang, Long and Chen, Xiangyu and Sun, Changzhi and others},
  journal={arXiv preprint arXiv:2510.23981},
  year={2025}
}
```

This project is licensed under the MIT License. Dataset usage is restricted under a research-only license.
If you have any questions, please feel free to reach out: [email protected].
TeleEgo is an omni benchmark and a step toward building personalized AI assistants with true long-term memory, reasoning, and decision-making in real-world wearable scenarios.
Made with ❤️ by the Ubiquitous AGI team at TeleAI.

