📢 Note: This project is under active development, and the benchmark will be continuously maintained.
If you find this project helpful, please give us a ⭐️ on GitHub to stay up to date with the latest updates.
TeleEgo is a comprehensive omni benchmark designed for multi-person, multi-scene, multi-task, and multimodal long-term memory reasoning over egocentric video streams. It reflects realistic personal assistant scenarios in which continuous egocentric video is collected over hours or even days, requiring models to maintain long-term memory and to perform understanding and cross-memory reasoning over it. Omni here means that TeleEgo covers the full spectrum of roles, scenes, tasks, modalities, and memory horizons, offering all-round evaluation for egocentric AI assistants.
TeleEgo provides:
- Omni-scale, diverse egocentric data from 5 roles across 4 daily scenarios.
- Multi-modal annotations: video, narration, and speech transcripts.
- Fine-grained QA benchmark: 3 cognitive dimensions, 12 subcategories.
- Participants: 5 (gender-balanced)
- Scenarios:
- Work & Study
- Lifestyle & Routines
- Social Activities
- Outings & Culture
- Recording: 3 days/participant (~14.4 hours each)
- Modalities:
- Egocentric video streams
- Speech & conversations
- Narration and event descriptions
TeleEgo-QA evaluates models along three main dimensions:
- Memory
  - Short-term / Long-term / Ultra-long Memory
  - Entity Tracking
  - Temporal Comparison & Interval
- Understanding
  - Causal Understanding
  - Intent Inference
  - Multi-step Reasoning
  - Cross-modal Understanding
- Cross-Memory Reasoning
  - Cross-temporal Causality
  - Cross-entity Relation
  - Temporal Chain Understanding
Each QA instance includes:
- Question type: Single-choice, Multi-choice, Binary, Open-ended
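To make the QA format concrete, here is a minimal sketch of what a single instance might look like. The field names below are illustrative assumptions, not the actual schema of the released JSON files:

```python
# Hypothetical QA instance (field names are assumptions for illustration only).
qa_instance = {
    "participant": "P1",                # which recorder the question refers to
    "dimension": "Memory",              # Memory / Understanding / Cross-Memory Reasoning
    "subcategory": "Entity Tracking",   # one of the 12 subcategories
    "question_type": "Single-choice",   # Single-choice / Multi-choice / Binary / Open-ended
    "question": "What did the participant pick up before leaving the office?",
    "options": ["A. Keys", "B. Umbrella", "C. Laptop", "D. Coffee cup"],
    "answer": "C",
}
```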
The repository is organized as follows:

```
TeleEgo/
├── teleego_data/                  # Dataset samples / metadata
│   ├── outputs/                   # Output results
│   ├── QAs/                       # Question-Answer pairs
│   └── video_merged/              # Merged video files
├── weights/                       # Pre-trained weights (MiniCPM-o, Qwen2.5-Omni, ...)
├── evaluate_gemini25_pro.py       # Evaluation script for Gemini 2.5 Pro
├── evaluate_gpt_4o.py             # Evaluation script for GPT-4o
├── evaluate_minicpm_o.py          # Evaluation script for MiniCPM-o
├── evaluate_qwen25_omni.py        # Evaluation script for Qwen2.5-Omni
├── evaluate_qwen25_vl.py          # Evaluation script for Qwen2.5-VL
├── evaluate_videochat_online.py   # Evaluation script for VideoChat-Online
├── metrics.py                     # Evaluation metrics
├── utils.py                       # Utility functions
├── run.sh                         # Execution script
└── README.md                      # This file
```
1. Download the dataset from Hugging Face (TeleEgo Dataset) or Baidu Netdisk (TeleEgo Dataset).

2. Organize the dataset in the following structure:
```
./TeleEgo/teleego_data/
├── QAs/                       # Question-Answer dataset
│   ├── merged_P1_A.json       # QA data for participant P1
│   ├── merged_P2_A.json       # QA data for participant P2
│   ├── merged_P3_A.json       # QA data for participant P3
│   ├── merged_P4_A.json       # QA data for participant P4
│   └── merged_P5_A.json       # QA data for participant P5
├── outputs/                   # Evaluation outputs
│   ├── gemini25_pro/          # Results for Gemini 2.5 Pro
│   ├── gpt-4o/                # Results for GPT-4o
│   ├── minicpm_o/             # Results for MiniCPM-o
│   ├── qwen25_omni/           # Results for Qwen2.5-Omni
│   ├── qwen25_vl/             # Results for Qwen2.5-VL
│   └── videochat-online/      # Results for VideoChat-Online
└── video_merged/              # Merged long videos with timestamps
    ├── merged_P1.mp4          # P1's 3-day video merged into one file
    ├── merged_P2.mp4          # P2's 3-day video merged into one file
    ├── merged_P3.mp4          # P3's 3-day video merged into one file
    ├── merged_P4.mp4          # P4's 3-day video merged into one file
    ├── merged_P5.mp4          # P5's 3-day video merged into one file
    ├── timeline_P1.json       # P1's timestamp mapping file
    ├── timeline_P2.json       # P2's timestamp mapping file
    ├── timeline_P3.json       # P3's timestamp mapping file
    ├── timeline_P4.json       # P4's timestamp mapping file
    └── timeline_P5.json       # P5's timestamp mapping file
```
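As a sanity check after organizing the data, a short script like the following can verify that the QA files and timeline mappings load correctly. The exact JSON layout is an assumption here, so adjust the keys to match the released files:

```python
import json
from pathlib import Path

DATA_ROOT = Path("./TeleEgo/teleego_data")

# Load each participant's QA file and timeline mapping and print a summary.
for pid in ["P1", "P2", "P3", "P4", "P5"]:
    qa_path = DATA_ROOT / "QAs" / f"merged_{pid}_A.json"
    timeline_path = DATA_ROOT / "video_merged" / f"timeline_{pid}.json"

    with qa_path.open(encoding="utf-8") as f:
        qa_items = json.load(f)
    with timeline_path.open(encoding="utf-8") as f:
        timeline = json.load(f)

    # The top-level structure (list vs. dict) is an assumption.
    print(f"{pid}: {len(qa_items)} QA entries, timeline entries: {len(timeline)}")
```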
Set up your environment according to the official requirements of the model you want to evaluate:
- Qwen2.5-Omni: Follow the official Qwen2.5-Omni setup guide
- MiniCPM-o: Follow the official MiniCPM-o setup guide
- Qwen2.5-VL: Follow the official Qwen2.5-VL setup guide
- VideoChat-Online: Follow the official VideoChat-Online setup guide
- GPT-4o / Gemini 2.5 Pro: Configure your API credentials in `run.sh`
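For the API-based models, a common pattern is to export the keys in `run.sh` and read them inside the evaluation scripts. The variable names below are assumptions; check the scripts for the names they actually expect:

```python
import os

# Hypothetical environment variable names; the evaluation scripts may use different ones.
openai_key = os.environ.get("OPENAI_API_KEY")
google_key = os.environ.get("GOOGLE_API_KEY")

if openai_key is None or google_key is None:
    raise RuntimeError(
        "Set the API keys in run.sh (e.g. `export OPENAI_API_KEY=...`) before running."
    )
```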
To evaluate a model on a specific GPU, use the following command format:

```
sh run.sh <eval_function> <gpu_id>
```

Examples:

```
# Evaluate Qwen2.5-Omni on GPU 0
sh run.sh eval_qwen25_omni 0
```

Available evaluation functions:

- `eval_qwen25_omni` - Qwen2.5-Omni model
- `eval_qwen25_vl` - Qwen2.5-VL model
- `eval_minicpm_o` - MiniCPM-o model
- `eval_videochat_online` - VideoChat-Online model
- `eval_gpt_4o` - GPT-4o (requires API key)
- `eval_gemini25_pro` - Gemini 2.5 Pro (requires API key)
After evaluation, the results will be saved in `./teleego_data/outputs/<model_name>/`. To compute evaluation metrics:

```
python metrics.py
```

This will calculate performance metrics across all evaluation dimensions (Memory, Understanding, Cross-Memory Reasoning).
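As an illustration of what such a metrics pass typically involves, the sketch below computes per-dimension accuracy from saved predictions. The output file layout and field names are assumptions, not the actual format produced by the evaluation scripts:

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical output layout: JSON files of records per model directory,
# each record carrying the QA dimension, the prediction, and the gold answer.
OUTPUT_DIR = Path("./teleego_data/outputs/qwen25_omni")

correct = defaultdict(int)
total = defaultdict(int)

for result_file in OUTPUT_DIR.glob("*.json"):
    with result_file.open(encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        dim = rec["dimension"]  # e.g. "Memory", "Understanding", ...
        total[dim] += 1
        if rec["prediction"] == rec["answer"]:
            correct[dim] += 1

for dim in sorted(total):
    print(f"{dim}: {correct[dim] / total[dim]:.2%} ({correct[dim]}/{total[dim]})")
```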
Submit your results to our Online Leaderboard.
If you find TeleEgo useful in your research, please cite:
```bibtex
@article{yan2025teleego,
  title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
  author={Yan, Jiaqi and Ren, Ruilong and Liu, Jingren and Xu, Shuning and Wang, Ling and Wang, Yiheng and Zhong, Xinlin and Wang, Yun and Zhang, Long and Chen, Xiangyu and Sun, Changzhi and others},
  journal={arXiv preprint arXiv:2510.23981},
  year={2025}
}
```

This project is licensed under the MIT License. Dataset usage is restricted under a research-only license.
If you have any questions, please feel free to reach out: [email protected].
TeleEgo is an omni benchmark and a step toward building personalized AI assistants with true long-term memory, reasoning, and decision-making in real-world wearable scenarios.
Made with ❤️ by the Ubiquitous AGI team at TeleAI.

