Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding (ICCV 2025 Spotlight)
This is the official code repository of Embodied VideoAgent. Embodied VideoAgent is an Embodied AI system that understands scenes from videos and embodied sensors, and accomplishes tasks through perception, planning and reasoning.
Please also check out our preceding work VideoAgent, a multi-modal agent for video understanding.
This project is tested on Ubuntu 22.04 with an NVIDIA RTX 4090 (24 GB VRAM).
- Use the following commands to build the conda environment:

  ```bash
  conda env create -f environment.yaml
  conda activate e-videoagent
  conda install habitat-sim==0.3.0 withbullet -c conda-forge -c aihabitat
  pip install -r requirements.txt
  ```

- Download `habitat_data.zip` from here and unzip it under `Embodied-VideoAgent`. Rename it as `data`.

- Fill in your Azure API key in `config/api.yaml`.
In the two-agent framework, an LLM plays the role of a user and proposes embodied tasks, while Embodied VideoAgent serves as the robot and accomplishes the tasks in the habitat-sim simulator.
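At a high level, the interaction looks like the sketch below. All names in it (`propose_task`, `execute`, `observe`) are hypothetical illustrations, not the repository's API; see `two_agent_pipeline.py` for the actual implementation.

```python
# Minimal sketch of the two-agent interaction loop (hypothetical names).
def two_agent_loop(user_llm, embodied_agent, max_turns=10):
    for _ in range(max_turns):
        task = user_llm.propose_task()          # LLM "user" proposes an embodied task
        outcome = embodied_agent.execute(task)  # robot perceives, plans, and acts in habitat-sim
        user_llm.observe(outcome)               # user reviews the outcome and may follow up
```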
Run the two-agent pipeline:

```bash
python two_agent_pipeline.py
```

We provide a script to visualize and debug the persistent object memory constructed by the YOLO-World detector and 2D-3D lifting. Run:

```bash
python test_reid.py
```

Use the `w`, `a`, and `d` keys to navigate in the scene.
You can also test the performance of the persistent object memory on customized scene datasets (these should include RGB frames, depth maps, camera poses, and the camera FOV). Prepare your data for the following function in `object_memory.py`:
```python
object_memory.process_a_frame(
    timestamp=timestamp,
    rgb=rgb,
    depth=depth,
    depth_mask=depth_mask,
    pos=pos,    # camera translation
    rmat=rot,   # camera rotation
    fov=hfov,   # horizontal field of view
)
```

ATTENTION: the object memory relies on the 3D-related functions in `utils.py`, which assume the following camera coordinate convention: the x-axis points right, the y-axis points down, and the z-axis points into the image. Please make sure your customized datasets conform to this convention.
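For intuition, the snippet below lifts a pixel to a world-space point under this convention. It is a minimal sketch of standard pinhole unprojection (deriving the focal length from the horizontal FOV), not the actual code in `utils.py`.

```python
import numpy as np

def unproject(u, v, depth, width, height, hfov_deg, pos, rmat):
    """Lift pixel (u, v) with metric depth to a world-space point,
    assuming +x right, +y down, +z into the image. A sketch of
    standard pinhole unprojection, not the code in utils.py."""
    # Focal length in pixels from the horizontal field of view.
    fx = (width / 2.0) / np.tan(np.deg2rad(hfov_deg) / 2.0)
    fy = fx                                 # assume square pixels
    cx, cy = width / 2.0, height / 2.0      # principal point at the image center
    # Camera-space point: the ray through (u, v) scaled by depth along +z.
    p_cam = np.array([(u - cx) / fx * depth,
                      (v - cy) / fy * depth,
                      depth])
    # World-space point, with rmat/pos the camera-to-world rotation/translation.
    return rmat @ p_cam + pos
```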
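Putting it together, a customized dataset can be streamed into the memory frame by frame, roughly as sketched below. `load_frames` and the frame fields are placeholders for your own loader, and constructing `object_memory` the way `test_reid.py` does is an assumption; adapt both to your setup.

```python
# Hedged sketch: stream a customized dataset into the persistent object
# memory. `load_frames` and the frame dictionary fields are placeholders
# for your own data loader; build `object_memory` as in test_reid.py.
for t, frame in enumerate(load_frames("my_dataset/")):
    object_memory.process_a_frame(
        timestamp=t,
        rgb=frame["rgb"],               # HxWx3 RGB image
        depth=frame["depth"],           # HxW metric depth map
        depth_mask=frame["depth"] > 0,  # e.g. mask out invalid (zero) depth
        pos=frame["cam_pos"],           # camera translation in the world frame
        rmat=frame["cam_rot"],          # 3x3 camera-to-world rotation matrix
        fov=frame["hfov"],              # horizontal field of view
    )
```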
Use:

```bash
python generate_episodes.py
```

to generate different object layouts in a scene. You can then change the episode in `two_agent_pipeline.py`.
If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation 📝.
```bibtex
@inproceedings{fan2025embodied,
  title={Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding},
  author={Fan, Yue and Ma, Xiaojian and Su, Rongpeng and Guo, Jun and Wu, Rujie and Chen, Xi and Li, Qing},
  booktitle={International Conference on Computer Vision},
  year={2025}
}

@inproceedings{fan2024videoagent,
  title={VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding},
  author={Fan, Yue and Ma, Xiaojian and Wu, Rujie and Du, Yuntao and Li, Jiaqi and Gao, Zhi and Li, Qing},
  booktitle={European Conference on Computer Vision},
  pages={75--92},
  year={2024},
  organization={Springer}
}
```
