EgoM2P: Egocentric Multimodal Multitask Pretraining

Website | BibTeX | Paper

Official implementation and pre-trained models for:

EgoM2P: Egocentric Multimodal Multitask Pretraining, ICCV 2025
Gen Li, Yutong Chen*, Yiqian Wu*, Kaifeng Zhao*, Marc Pollefeys, Siyu Tang


[EgoM2P main figure]

EgoM2P: A large-scale egocentric multimodal and multitask model, pretrained on eight extensive egocentric datasets. It incorporates four modalities—RGB and depth video, gaze dynamics, and camera trajectories—to handle challenging tasks like monocular egocentric depth estimation, camera tracking, gaze estimation, and conditional egocentric video synthesis. For simplicity, we only visualize four frames here.

Table of contents

- Usage: Installation, Data, Tokenization, EgoM2P Training, Inference
- Pretrained models
- License
- Citation

Usage

Installation

We provide two installation methods:

  1. Container for aarch64 (our model was trained and tested with this setup): first follow the environment setup in this Swiss-AI tutorial and complete all steps up to this section. The provided Dockerfile can serve as a reference.
# Before installing, edit req.txt and remove the lines "torch==2.6.0" and "decord==0.6.0"!
pip install -r req.txt

You still need to build decord from source, since there is no precompiled package for linux-aarch64.
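A sketch of the usual source build, following the decord README (assumes cmake and the ffmpeg development headers are available; see the decord repository for the full dependency list):

# Build decord from source (CPU-only build shown here)
git clone --recursive https://github.com/dmlc/decord
cd decord
mkdir build && cd build
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make
# Install the Python bindings into the active environment
cd ../python
pip install .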

  2. Conda environment for x86-64 (not tested for multi-node training):
conda create -n egom2p python=3.12.3 -y
conda activate egom2p
pip install -r req.txt

Data

See README_DATA.md for instructions on how to prepare aligned multimodal datasets.

Tokenization

See README_TOKENIZATION.md for instructions on how to train modality-specific tokenizers.

EgoM2P Training

See README_TRAINING.md for instructions on how to train EgoM2P models.

Inference

See README_INFERENCE.md for instructions on how to perform inference.

Pretrained models

EgoM2P model

| Model | # Mod. | Datasets | # Params | Config | Weights |
|-------|--------|----------|----------|--------|---------|
| EgoM2P | 4 | Eight egocentric datasets | 400M (incl. embedding layers) | Config | Checkpoint |

We tried to scale the model further, but due to the limited dataset size we leave this as promising future work.
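As a quick sanity check after downloading, the checkpoint can be inspected with plain PyTorch. A minimal sketch, assuming the file stores a standard state_dict (the file name below is a placeholder):

import torch

# Placeholder file name; use the path of the downloaded checkpoint.
ckpt = torch.load("egom2p_checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

# Verify the ~400M parameter count (incl. embedding layers).
n_params = sum(v.numel() for v in state_dict.values() if torch.is_tensor(v))
print(f"{n_params / 1e6:.0f}M parameters")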

Tokenizers

EgoM2P handles clips of 2 seconds in length. Video modalities are downsampled from 30 fps to 8 fps, while non-video modalities stay at 30 fps.

| Modality | Shape | Number of tokens | Codebook size | Weights |
|----------|-------|------------------|---------------|---------|
| Camera Trajectory | 60x9 | 30 | 256 | Checkpoint |
| Gaze Dynamics | 60x2 | 30 | 256 | Checkpoint |
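The shapes follow directly from the clip setup; a small sketch of the arithmetic, using only the numbers above:

clip_seconds = 2
video_fps = 8    # video downsampled from 30 fps
signal_fps = 30  # camera trajectory and gaze stay at 30 fps

video_frames = clip_seconds * video_fps   # 16 -> the 16x256x256 video clips below
signal_steps = clip_seconds * signal_fps  # 60 -> the 60x9 camera and 60x2 gaze inputs
print(video_frames, signal_steps)         # 16 60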

We use recent pretrained Cosmos Tokenizers for video modalities:

| Modality | Shape | Number of tokens | Codebook size | Weights |
|----------|-------|------------------|---------------|---------|
| RGB/Depth Video | 16x256x256 | 5120 | 64k | Download encoder.jit and decoder.jit from Checkpoint |

We observed that the Cosmos Tokenizers do not work well on low-fps, low-resolution videos (although video models usually need to be trained at low resolutions first). Feel free to use other video tokenizers and retrain EgoM2P :)
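For reference, the downloaded encoder.jit and decoder.jit can be driven through the Cosmos-Tokenizer interface. A minimal sketch following the Cosmos-Tokenizer README; the checkpoint paths are placeholders, and the exact input normalization follows the Cosmos documentation:

import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer  # from the Cosmos-Tokenizer repo

# Placeholder paths; point these at the downloaded encoder.jit / decoder.jit.
encoder = CausalVideoTokenizer(checkpoint_enc="ckpts/encoder.jit")
decoder = CausalVideoTokenizer(checkpoint_dec="ckpts/decoder.jit")

# A 2 s clip at 8 fps: (batch, channels, frames, height, width).
video = torch.randn(1, 3, 16, 256, 256).to("cuda").to(torch.bfloat16)
indices, codes = encoder.encode(video)  # discrete token indices (5120 per clip, per the table)
reconstruction = decoder.decode(indices)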

License

EgoM2P builds on 4M, a multimodal image foundation model, and extends it to the video and motion domains. We acknowledge 4M and the projects it in turn acknowledges. While most of its dependencies are not used in EgoM2P, we keep some unused 4M code in our codebase to allow for possible extensions.

We follow its license. The code in this repository is released under the Apache 2.0 license as found in the LICENSE file.

The model weights in this repository are released under the Sample Code license as found in the LICENSE_WEIGHTS file.

Citation

If you find this repository helpful, please consider citing our work:

@InProceedings{Li_2025_ICCV,
    author    = {Li, Gen and Chen, Yutong and Wu, Yiqian and Zhao, Kaifeng and Pollefeys, Marc and Tang, Siyu},
    title     = {EgoM2P: Egocentric Multimodal Multitask Pretraining},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {10830-10843}
}

@InProceedings{li2024egogen,
    author    = {Li, Gen and Zhao, Kaifeng and Zhang, Siwei and Lyu, Xiaozhong and Dusmanu, Mihai and Zhang, Yan and Pollefeys, Marc and Tang, Siyu},
    title     = {EgoGen: An Egocentric Synthetic Data Generator},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14497-14509}
}
