EgoLoc: Temporal Interaction Localization in Egocentric Videos

Authors: Erhang Zhang#, Junyi Ma#, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang*

[demo GIF]

EgoLoc is a vision-language model (VLM)-based framework that localizes hand-object contact and separation timestamps in egocentric videos in a zero-shot manner. Our approach extends the traditional scope of temporal action localization (TAL) to a finer level, which we define as temporal interaction localization (TIL).

📄 Read our paper – accepted at IROS 2025.

📄 Extended journal version for more details.

📣 We have released the extended version of EgoLoc for long untrimmed videos. Feel free to use it!

from [🧭TAL] to [🎯TIL]

[TAL-to-TIL diagram]

We greatly appreciate Yuchen Xie's help in organizing our repository and developing the VDA (Video-Depth-Anything)-based version.


1. Getting Started

We provide videos from the EgoPAT3D-DT dataset and our self-recorded data for quick experimentation.


1.1 EgoLoc & Grounded-SAM Environment Setup & Dependency Installation 🚀

EgoLoc One-Liner Installation - For Conda

conda create -n egoloc python=3.10 -y && conda activate egoloc && \
git clone https://github.com/IRMVLab/EgoLoc.git && cd EgoLoc && \
pip install -r requirements.txt
Grounded-SAM Dependency Installation (Mandatory)

Step 1 – Clone Grounded-SAM (with submodules)

git clone --recursive https://github.com/IDEA-Research/Grounded-Segment-Anything.git

If you plan to use CUDA (recommended for speed) outside Docker, set:

export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True          # ensures CUDA kernels are compiled
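
Before building, you can optionally confirm that a CUDA-enabled PyTorch is visible (a quick check, assuming PyTorch was already installed via requirements.txt):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If this prints False, check your CUDA setup before proceeding, since the flags above only matter for a CUDA build.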

Step 2 – Build Grounded-SAM components

# 2-A  Segment Anything (SAM)
python -m pip install -e Grounded-Segment-Anything/segment_anything

# 2-B  Grounding DINO
pip install --no-build-isolation -e Grounded-Segment-Anything/GroundingDINO
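
As a quick optional sanity check (assuming both editable installs above finished without errors), confirm that the packages import:

python -c "import segment_anything, groundingdino; print('Grounded-SAM imports OK')"

If the segment_anything import fails, see the known bug in Section 1.3.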

Step 3 – Vision-language extras

# Diffusers (for prompt-based image generation; optional but handy)
pip install --upgrade 'diffusers[torch]'

Step 4 – OSX module (object-centric cross-attention)

git submodule update --init --recursive
cd Grounded-Segment-Anything/grounded-sam-osx
bash install.sh          # compiles custom ops
cd ../..                 # return to project root

Step 5 – RAM & Tag2Text (open-vocabulary tagger)

git clone https://github.com/xinyu1205/recognize-anything.git
pip install -r recognize-anything/requirements.txt
pip install -e recognize-anything/

Step 6 – Optional utilities

pip install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel

These are needed for COCO-format mask export, ONNX export, and Jupyter notebooks.

Step 7 – Download pretrained weights (place inside Grounded-Segment-Anything)

cd Grounded-Segment-Anything

# Grounding DINO (Swin-T, object-grounded captions)
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

# Segment Anything (ViT-H)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
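
You can optionally verify that the SAM checkpoint loads (a quick check assuming the weights downloaded completely; run it from inside Grounded-Segment-Anything so the relative path resolves, and note that ViT-H needs a few GB of memory):

python -c "from segment_anything import sam_model_registry; sam_model_registry['vit_h'](checkpoint='sam_vit_h_4b8939.pth'); print('SAM ViT-H checkpoint OK')"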

Step 8 – Download the BERT backbone (for text embeddings) inside the Grounded-Segment-Anything repo

git clone https://huggingface.co/google-bert/bert-base-uncased

1.2 Required Dependencies for EgoLoc-3D

Quick installation (extra steps for the 3D demo)
# ---- inside the EgoLoc root --------------------------------------------------
# 1) external repos
git clone https://github.com/geopavlakos/hamer.git
git clone https://github.com/DepthAnything/Video-Depth-Anything.git

# 2) python packages
# Install HaMeR dependencies (follow the setup instructions in the hamer repository)
# Install Video-Depth-Anything dependencies (follow the setup instructions in the Video-Depth-Anything repository)
pip install opencv-python open3d matplotlib scipy tqdm

# 3) Download the Video-Depth-Anything checkpoints (see the Video-Depth-Anything repository for links)

1.3 Known Bugs (assuming all dependencies were installed successfully)

  1. If you encounter a module error regarding segment_anything, add an __init__.py file inside ./Grounded-Segment-Anything/segment_anything with the following content:
from .segment_anything import SamPredictor, sam_model_registry
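
One way to apply this fix in a single command from the EgoLoc root (assuming Grounded-Segment-Anything was cloned there as in Section 1.1):

echo "from .segment_anything import SamPredictor, sam_model_registry" > Grounded-Segment-Anything/segment_anything/__init__.py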

If you encounter a bug, please do not hesitate to open an issue or submit a PR.


2. Running EgoLoc for Short Video Clips

We provide both 2D and 3D demos for you to test out.

2.1 Running EgoLoc-2D (RGB Video Only)

We provide several example videos to demonstrate how our 2D version of EgoLoc performs in a closed-loop setup. To run the demo:

python egoloc_2D_demo.py \
  --video_path ./video1.mp4 \
  --output_dir output \
  --config Grounded-Segment-Anything/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint Grounded-Segment-Anything/groundingdino_swint_ogc.pth \
  --sam_checkpoint Grounded-Segment-Anything/sam_vit_h_4b8939.pth \
  --bert_base_uncased_path Grounded-Segment-Anything/bert-base-uncased/ \
  --text_prompt hand \
  --box_threshold 0.3 \
  --text_threshold 0.25 \
  --device cuda \
  --credentials auth.env \
  --action "Grasping the object" \
  --grid_size 3 \
  --max_feedbacks 1

The temporal interaction localization results will be saved in the output directory.

[Results table: contact and separation frames for video1 and video2]

Note: Due to inherent randomness in VLM-based reasoning, EgoLoc may produce slightly different results on different runs.


2.2 Running EgoLoc-3D (RGB Video + Automatically Synthesized Depth)

We also provide our newest 3D version of EgoLoc, which uses 3D hand velocities for adaptive sampling. VDA is used here to synthesize pseudo-depth observations, removing the reliance on RGB-D cameras and enabling more flexible applications. To run the demo:

python egoloc_3D_demo.py \
  --video_path video3.mp4 \
  --output_dir output \
  --device cuda \
  --credentials auth.env \
  --encoder vits \
  --grid_size 3

The temporal interaction localization results will be saved in the output directory.

[Results table: pseudo depth, contact frame, and separation frame for video3]

Note: Due to inherent randomness in VLM-based reasoning, EgoLoc may produce slightly different results on different runs.


2.3 Configuration Parameters

Here are some key arguments you can adjust when running EgoLoc. For file paths related to Grounded-SAM, please refer to its original repository.

  • video_path: Path to the input egocentric video
  • output_dir: Directory to save the output frames and results
  • text_prompt: Prompt used for hand grounding (e.g., "hand")
  • box_threshold: Threshold for hand box grounding confidence
  • grid_size: Grid size for image tiling used in VLM prompts
  • max_feedbacks: Number of feedback iterations
  • credentials: File containing your OpenAI API key (see the illustrative example below)
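
For reference, a minimal auth.env could look like the sketch below. This is illustrative only: the exact variable name the demo scripts read is defined in the EgoLoc code (OPENAI_API_KEY is assumed here), so check the scripts if authentication fails.

cat > auth.env << 'EOF'
# Hypothetical layout; adjust the variable name to whatever the demo scripts expect
OPENAI_API_KEY=sk-...
EOF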

3. Running EgoLoc for Long Videos

We have released a new version of EgoLoc that can handle long videos (without additional dependencies). Please use the following command to run it:

python egoloc3d_long.py  --video_path ManiTIL/cabinet/video/video1.mp4  --speed_json ManiTIL/cabinet/speed/video1_speed.json  --video_type long

We provide hand speeds precomputed from raw depth for quick use. Note that our method still works even when no true depth observations are available, since relative 3D hand speed can be extracted by VDA. You can run the VDA-based version for long videos via the following command:

python egoloc3d_long.py  --video_path ManiTIL/cabinet/video/video1.mp4  --video_type long
  • The example long videos in our ManiTIL benchmark are available here. Please download and unzip them into the ./ManiTIL/ directory.

  • This is the initial release of the extended EgoLoc for long videos. We are still cleaning, refactoring, and documenting the updated codebase and benchmarks, and plan to complete these updates over the coming months. We appreciate your interest and patience. In the meantime, feel free to use EgoLoc in your own projects!


4. Citation

πŸ™ If you find EgoLoc useful in your research, please consider citing:

@INPROCEEDINGS{zhang2025zero,
  author={Zhang, Erhang and Ma, Junyi and Zheng, Yin-Dong and Zhou, Yixuan and Xu, Fan},
  booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, 
  title={Zero-Shot Temporal Interaction Localization for Egocentric Videos}, 
  year={2025},
  pages={19554-19561}
}

@article{ma2025egoloc,
  title={EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos},
  author={Ma, Junyi and Zhang, Erhang and Zheng, Yin-Dong and Xie, Yuchen and Zhou, Yixuan and Wang, Hesheng},
  journal={arXiv preprint arXiv:2508.12349},
  year={2025}
}

5. License

Copyright 2025, IRMV Lab, SJTU.

This project is free software made available under the MIT License. For more details see the LICENSE file.
