
TeleEgo:
Benchmarking Egocentric AI Assistants in the Wild


πŸ“’ Note: This project is still under active development, and the benchmark will be continuously maintained.


If you find this project helpful, please give us a ⭐️ on GitHub to stay up to date with the latest developments.

πŸ“Œ Introduction

TeleEgo is a comprehensive omni benchmark for multi-person, multi-scene, multi-task, and multimodal long-term memory reasoning in egocentric video streams. It reflects realistic personal-assistant scenarios in which continuous egocentric video is collected over hours or even days, requiring models to maintain long-horizon memory and to exercise it through understanding and cross-memory reasoning. Omni here means that TeleEgo covers the full spectrum of roles, scenes, tasks, modalities, and memory horizons, offering all-round evaluation of egocentric AI assistants.

TeleEgo provides:

  • 🧠 Omni-scale, diverse egocentric data from 5 roles across 4 daily scenarios.
  • 🎀 Multi-modal annotations: video, narration, and speech transcripts.
  • ❓ Fine-grained QA benchmark: 3 cognitive dimensions, 12 subcategories.

πŸ“Š Dataset Overview

  • Participants: 5 (balanced gender)
  • Scenarios:
    • Work & Study
    • Lifestyle & Routines
    • Social Activities
    • Outings & Culture
  • Recording: 3 days/participant (~14.4 hours each)
  • Modalities:
    • Egocentric video streams
    • Speech & conversations
    • Narration and event descriptions

πŸ§ͺ Benchmark Tasks

TeleEgo-QA evaluates models along three main dimensions:

  1. Memory

    • Short-term / Long-term / Ultra-long Memory
    • Entity Tracking
    • Temporal Comparison & Interval
  2. Understanding

    • Causal Understanding
    • Intent Inference
    • Multi-step Reasoning
    • Cross-modal Understanding
  3. Cross-Memory Reasoning

    • Cross-temporal Causality
    • Cross-entity Relation
    • Temporal Chain Understanding

Each QA instance includes:

  • Question type: Single-choice, Multi-choice, Binary, or Open-ended (an illustrative instance is sketched below)
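
For a concrete picture, a single-choice instance might carry fields along the following lines. This sketch is purely illustrative: the key names are assumptions, and the authoritative schema is defined by the JSON files in teleego_data/QAs/.

# Hypothetical QA instance, for illustration only; the real field names
# are defined by the files in teleego_data/QAs/.
qa_example = {
    "dimension": "Memory",
    "subcategory": "Entity Tracking",
    "question_type": "single_choice",  # or multi_choice / binary / open_ended
    "question": "Where did the wearer put the red notebook before leaving the office?",
    "options": ["In the backpack", "On the desk", "In the drawer", "On the shelf"],
    "answer": "In the backpack",
}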

πŸ—‚οΈ Repository Structure

TeleEgo/
β”œβ”€β”€ teleego_data/                # Dataset samples / metadata
β”‚   β”œβ”€β”€ outputs/                 # Output results
β”‚   β”œβ”€β”€ QAs/                     # Question-Answer pairs
β”‚   └── video_merged/            # Merged video files
β”œβ”€β”€ weights/                     # Pre-trained weights (MiniCPM-o, Qwen2.5-Omni, ...)
β”œβ”€β”€ evaluate_gemini25_pro.py     # Evaluation script for Gemini 2.5 Pro
β”œβ”€β”€ evaluate_gpt_4o.py           # Evaluation script for GPT-4o
β”œβ”€β”€ evaluate_minicpm_o.py        # Evaluation script for MiniCPM-o
β”œβ”€β”€ evaluate_qwen25_omni.py      # Evaluation script for Qwen2.5-Omni
β”œβ”€β”€ evaluate_qwen25_vl.py        # Evaluation script for Qwen2.5-VL
β”œβ”€β”€ evaluate_videochat_online.py # Evaluation script for VideoChat-Online
β”œβ”€β”€ metrics.py                   # Evaluation metrics
β”œβ”€β”€ utils.py                     # Utility functions
β”œβ”€β”€ run.sh                       # Execution script
└── README.md                    # This file

πŸš€ Usage

πŸ“₯ Dataset Setup

  1. Download the dataset from Hugging Face: πŸ”— TeleEgo Dataset

    or from Baidu Netdisk: πŸ”— TeleEgo Dataset

  2. Organize the dataset in the following structure:

./TeleEgo/teleego_data/
β”œβ”€β”€ QAs/                              # Question-Answer dataset
β”‚   β”œβ”€β”€ merged_P1_A.json             # QA data for participant P1
β”‚   β”œβ”€β”€ merged_P2_A.json             # QA data for participant P2
β”‚   β”œβ”€β”€ merged_P3_A.json             # QA data for participant P3
β”‚   β”œβ”€β”€ merged_P4_A.json             # QA data for participant P4
β”‚   └── merged_P5_A.json             # QA data for participant P5
β”œβ”€β”€ outputs/                          # Evaluation outputs
β”‚   β”œβ”€β”€ gemini25_pro/                # Results for Gemini 2.5 Pro
β”‚   β”œβ”€β”€ gpt-4o/                      # Results for GPT-4o
β”‚   β”œβ”€β”€ minicpm_o/                   # Results for MiniCPM-o
β”‚   β”œβ”€β”€ qwen25_omni/                 # Results for Qwen2.5-Omni
β”‚   β”œβ”€β”€ qwen25_vl/                   # Results for Qwen2.5-VL
β”‚   └── videochat-online/            # Results for VideoChat-Online
└── video_merged/                     # Merged long videos with timestamps
    β”œβ”€β”€ merged_P1.mp4                # P1's 3-day video merged into one file
    β”œβ”€β”€ merged_P2.mp4                # P2's 3-day video merged into one file
    β”œβ”€β”€ merged_P3.mp4                # P3's 3-day video merged into one file
    β”œβ”€β”€ merged_P4.mp4                # P4's 3-day video merged into one file
    β”œβ”€β”€ merged_P5.mp4                # P5's 3-day video merged into one file
    β”œβ”€β”€ timeline_P1.json             # P1's timestamp mapping file
    β”œβ”€β”€ timeline_P2.json             # P2's timestamp mapping file
    β”œβ”€β”€ timeline_P3.json             # P3's timestamp mapping file
    β”œβ”€β”€ timeline_P4.json             # P4's timestamp mapping file
    └── timeline_P5.json             # P5's timestamp mapping file
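
Once the files are in place, you can sanity-check the layout with a short script such as the one below. It only verifies that each participant's QA file, merged video, and timeline mapping exist and parse; it assumes nothing about the JSON schemas beyond their being valid JSON.

import json
from pathlib import Path

DATA_ROOT = Path("./TeleEgo/teleego_data")

# Check each participant's QA file, merged video, and timeline mapping.
for pid in ["P1", "P2", "P3", "P4", "P5"]:
    qa_path = DATA_ROOT / "QAs" / f"merged_{pid}_A.json"
    video_path = DATA_ROOT / "video_merged" / f"merged_{pid}.mp4"
    timeline_path = DATA_ROOT / "video_merged" / f"timeline_{pid}.json"

    assert video_path.exists(), f"missing video: {video_path}"
    with open(qa_path, encoding="utf-8") as f:
        qa = json.load(f)
    with open(timeline_path, encoding="utf-8") as f:
        timeline = json.load(f)
    print(f"{pid}: QA and timeline parsed ({len(qa)} top-level entries)")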

πŸ”§ Environment Setup

Set up your environment according to the official requirements of the model you want to evaluate.

πŸ§ͺ Running Evaluations

To evaluate a model on a specific GPU, use the following command format:

sh run.sh <eval_function> <gpu_id>

Examples:

# Evaluate Qwen2.5-Omni on GPU 0
sh run.sh eval_qwen25_omni 0

Available evaluation functions:

  • eval_qwen25_omni - Qwen2.5-Omni model
  • eval_qwen25_vl - Qwen2.5-VL model
  • eval_minicpm_o - MiniCPM-o model
  • eval_videochat_online - VideoChat-Online model
  • eval_gpt_4o - GPT-4o (requires API key)
  • eval_gemini25_pro - Gemini 2.5 Pro (requires API key)
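
Since the GPU id is just the second argument, sweeping the open-weight models across devices is a matter of repeated invocations (the GPU ids below are illustrative; adjust them to your machine):

# Illustrative sweep over the open-weight models, one GPU each
sh run.sh eval_qwen25_omni 0
sh run.sh eval_qwen25_vl 1
sh run.sh eval_minicpm_o 2
sh run.sh eval_videochat_online 3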

πŸ“Š Computing Metrics

After evaluation, the results will be saved in ./teleego_data/outputs/<model_name>/. To compute evaluation metrics:

python metrics.py

This will calculate performance metrics across all evaluation dimensions (Memory, Understanding, Cross-Memory Reasoning).
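
metrics.py is the authoritative scorer. If you want to slice the outputs yourself, a minimal per-dimension aggregation sketch might look like the following; the output filenames and the "dimension"/"correct" fields are assumptions for illustration, not the repo's confirmed schema.

import json
from collections import defaultdict
from pathlib import Path

# Minimal per-dimension accuracy aggregation. Filenames and the
# "dimension"/"correct" fields are assumed for illustration; defer to
# metrics.py for the actual schema and scoring rules.
results_dir = Path("./teleego_data/outputs/qwen25_omni")

totals, hits = defaultdict(int), defaultdict(int)
for path in results_dir.glob("*.json"):
    with open(path, encoding="utf-8") as f:
        for item in json.load(f):
            dim = item["dimension"]  # Memory / Understanding / Cross-Memory Reasoning
            totals[dim] += 1
            hits[dim] += int(item["correct"])

for dim in sorted(totals):
    print(f"{dim}: {hits[dim] / totals[dim]:.3f} ({hits[dim]}/{totals[dim]})")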

πŸ“€ Submit Results

Submit your results to our πŸ† Online Leaderboard.


πŸ“œ Citation

If you find TeleEgo helpful in your research, please cite:

@article{yan2025teleego,
  title={TeleEgo: Benchmarking Egocentric AI Assistants in the Wild},
  author={Yan, Jiaqi and Ren, Ruilong and Liu, Jingren and Xu, Shuning and Wang, Ling and Wang, Yiheng and Zhong, Xinlin and Wang, Yun and Zhang, Long and Chen, Xiangyu and Sun, Changzhi and others},
  journal={arXiv preprint arXiv:2510.23981},
  year={2025}
}

πŸͺͺ License

This project is licensed under the MIT License. Dataset usage is restricted under a research-only license.


πŸ“¬ Contact

If you have any questions, please feel free to reach out: [email protected].




TeleEgo is an omni benchmark and a step toward building personalized AI assistants with true long-term memory, reasoning, and decision-making in real-world wearable scenarios.

Made with ❀️ by the Ubiquitous AGI team at TeleAI.

