Official implementation and pre-trained models for:
EgoM2P: Egocentric Multimodal Multitask Pretraining, ICCV 2025
Gen Li, Yutong Chen*, Yiqian Wu*, Kaifeng Zhao*, Marc Pollefeys, Siyu Tang
EgoM2P is a large-scale egocentric multimodal, multitask model pretrained on eight extensive egocentric datasets. It incorporates four modalities (RGB video, depth video, gaze dynamics, and camera trajectories) to handle challenging tasks such as monocular egocentric depth estimation, camera tracking, gaze estimation, and conditional egocentric video synthesis. For simplicity, we only visualize four frames here.
We provide two installation methods:
- Container for aarch64 (our model was trained and tested with this setup): first follow the environment setup in this Swiss-AI tutorial and complete all steps up to this section. You can also use the provided Dockerfile as a reference.
```bash
# Before installing, edit req.txt and remove the lines "torch==2.6.0" and "decord==0.6.0"!
pip install -r req.txt
```
You still need to build decord from source, since there is no precompiled wheel for linux-aarch64.
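As a rough reference, below is a minimal sketch of a CPU-only decord build following the upstream dmlc/decord instructions; the system packages and CMake flags are assumptions and may need adjusting for your container image (for example, enabling CUDA-accelerated decoding).

```bash
# Minimal sketch of a CPU-only decord build, following the upstream dmlc/decord
# instructions; adjust the apt packages and CMake flags for your image.
apt-get update && apt-get install -y build-essential python3-dev python3-setuptools \
    make cmake libavcodec-dev libavfilter-dev libavformat-dev libavutil-dev
git clone --recursive https://github.com/dmlc/decord
cd decord && mkdir build && cd build
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
cd ../python && python3 setup.py install --user
```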
- Conda environment for x86-64 (not tested for multi-node training):
```bash
conda create -n egom2p python=3.12.3 -y
conda activate egom2p
pip install -r req.txt
```
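With either setup, a quick import check is a convenient way to confirm the environment before moving on. This is only an optional sanity-check sketch based on the dependencies named above (torch, decord), not a step from the official instructions.

```bash
# Optional sanity check (not part of the official setup): verify that torch and
# decord import correctly and that PyTorch can see a GPU.
python -c "import torch, decord; print(torch.__version__, decord.__version__, torch.cuda.is_available())"
```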
See README_DATA.md for instructions on how to prepare aligned multimodal datasets.
See README_TOKENIZATION.md for instructions on how to train modality-specific tokenizers.
See README_TRAINING.md for instructions on how to train EgoM2P models.
See README_INFERENCE.md for instructions on how to perform inference.
| Model | # Modalities | Datasets | # Params | Config | Weights |
|---|---|---|---|---|---|
| EgoM2P | 4 | Eight egocentric datasets | 400M (incl. embedding layers) | Config | Checkpoint |
We tried to scale the model up further, but given the limited dataset size, we leave this as promising future work.
EgoM2P operates on 2-second clips. Video modalities are downsampled from 30 fps to 8 fps (16 frames per clip), while non-video modalities stay at 30 fps (60 timesteps per clip). We train modality-specific tokenizers for the non-video modalities:
| Modality | Shape (timesteps x dims) | Number of tokens | Codebook size | Weights |
|---|---|---|---|---|
| Camera Trajectory | 60x9 | 30 | 256 | Checkpoint |
| Gaze Dynamics | 60x2 | 30 | 256 | Checkpoint |
We use recent pretrained Cosmos Tokenizers for video modalities:
| Modality | Shape (frames x H x W) | Number of tokens | Codebook size | Weights |
|---|---|---|---|---|
| RGB/Depth Video | 16x256x256 | 5120 | 64k | Download encoder.jit and decoder.jit from Checkpoint |
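If the Checkpoint link points to a Hugging Face repository, the JIT files can be fetched with the Hugging Face CLI. The snippet below is only a sketch: `<COSMOS_TOKENIZER_REPO>` is a placeholder for the repository behind the link above, and the local directory is an arbitrary example.

```bash
# Example only: <COSMOS_TOKENIZER_REPO> stands for the Hugging Face repo linked
# in the "Weights" column above; checkpoints/cosmos_tokenizer is an arbitrary
# local target directory.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <COSMOS_TOKENIZER_REPO> encoder.jit decoder.jit \
    --local-dir checkpoints/cosmos_tokenizer
```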
We observed that the Cosmos Tokenizers do not work well for low-fps, low-resolution videos (even though video models usually need to be trained at low resolution first). Feel free to use other video tokenizers and retrain EgoM2P. :)
EgoM2P builds on 4M, a multimodal image foundation model, and extends it to the video and motion domains. We thank the 4M authors and the projects acknowledged in 4M. Although most of 4M's dependencies are not used in EgoM2P, we keep some unused 4M code in our codebase to make future extensions easier.
We follow 4M's license terms. The code in this repository is released under the Apache 2.0 license, as found in the LICENSE file.
The model weights in this repository are released under the Sample Code license as found in the LICENSE_WEIGHTS file.
If you find this repository helpful, please consider citing our work:
```bibtex
@InProceedings{Li_2025_ICCV,
    author    = {Li, Gen and Chen, Yutong and Wu, Yiqian and Zhao, Kaifeng and Pollefeys, Marc and Tang, Siyu},
    title     = {EgoM2P: Egocentric Multimodal Multitask Pretraining},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {10830-10843}
}

@InProceedings{li2024egogen,
    author    = {Li, Gen and Zhao, Kaifeng and Zhang, Siwei and Lyu, Xiaozhong and Dusmanu, Mihai and Zhang, Yan and Pollefeys, Marc and Tang, Siyu},
    title     = {EgoGen: An Egocentric Synthetic Data Generator},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14497-14509}
}
```
