Official implementation and pre-trained models for:
EgoM2P: Egocentric Multimodal Multitask Pretraining, ICCV 2025
Gen Li, Yutong Chen*, Yiqian Wu*, Kaifeng Zhao*, Marc Pollefeys, Siyu Tang
EgoM2P is a large-scale egocentric multimodal, multitask model pretrained on eight extensive egocentric datasets. It incorporates four modalities (RGB video, depth video, gaze dynamics, and camera trajectories) to handle challenging tasks such as monocular egocentric depth estimation, camera tracking, gaze estimation, and conditional egocentric video synthesis. For simplicity, we only visualize four frames here.
We provide two installation methods:
- Container for aarch64 (our model was trained and tested with this setup): first follow the environment setup in this Swiss-AI tutorial and complete all steps up to this section. You can also use the provided Dockerfile as a reference.
```bash
# Before installing, edit req.txt and remove the lines "torch==2.6.0" and "decord==0.6.0"!
pip install -r req.txt
```
You still need to build decord from source, since there is no precompiled wheel for linux-aarch64.
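As a rough reference, below is a minimal sketch of a CPU-only decord build following the upstream dmlc/decord instructions; the system packages and CMake flags are assumptions and may need adjusting for your container image (for example, enabling CUDA-accelerated decoding).

```bash
# Minimal sketch of a CPU-only decord build, following the upstream dmlc/decord
# instructions; adjust the apt packages and CMake flags for your image.
apt-get update && apt-get install -y build-essential python3-dev python3-setuptools \
    make cmake libavcodec-dev libavfilter-dev libavformat-dev libavutil-dev
git clone --recursive https://github.com/dmlc/decord
cd decord && mkdir build && cd build
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
cd ../python && python3 setup.py install --user
```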
- Conda environment for x86-64 (not tested for multi-node training):
```bash
conda create -n egom2p python=3.12.3 -y
conda activate egom2p
pip install -r req.txt
```
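With either setup, a quick import check is a convenient way to confirm the environment before moving on. This is only an optional sanity-check sketch based on the dependencies named above (torch, decord), not a step from the official instructions.

```bash
# Optional sanity check (not part of the official setup): verify that torch and
# decord import correctly and that PyTorch can see a GPU.
python -c "import torch, decord; print(torch.__version__, decord.__version__, torch.cuda.is_available())"
```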
See README_DATA.md for instructions on how to prepare aligned multimodal datasets.
See README_TOKENIZATION.md for instructions on how to train modality-specific tokenizers.
See README_TRAINING.md for instructions on how to train EgoM2P models.
See README_INFERENCE.md for instructions on how to perform inference.
| Model | # Modalities | Datasets | # Params | Config | Weights |
|---|---|---|---|---|---|
| EgoM2P | 4 | Eight egocentric datasets | 400M (incl. embedding layers) | Config | Checkpoint |
We tried to scale the model up further, but given the limited dataset size, we leave this as promising future work.
EgoM2P operates on 2-second clips. Video modalities are downsampled from 30 fps to 8 fps (16 frames per clip), while non-video modalities stay at 30 fps (60 timesteps per clip). We train modality-specific tokenizers for the non-video modalities:
| Modality | Shape (timesteps x dims) | Number of tokens | Codebook size | Weights |
|---|---|---|---|---|
| Camera Trajectory | 60x9 | 30 | 256 | Checkpoint |
| Gaze Dynamics | 60x2 | 30 | 256 | Checkpoint |
We use recent pretrained Cosmos Tokenizers for video modalities:
| Modality | Shape (frames x H x W) | Number of tokens | Codebook size | Weights |
|---|---|---|---|---|
| RGB/Depth Video | 16x256x256 | 5120 | 64k | Download encoder.jit and decoder.jit from Checkpoint |
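If the Checkpoint link points to a Hugging Face repository, the JIT files can be fetched with the Hugging Face CLI. The snippet below is only a sketch: `<COSMOS_TOKENIZER_REPO>` is a placeholder for the repository behind the link above, and the local directory is an arbitrary example.

```bash
# Example only: <COSMOS_TOKENIZER_REPO> stands for the Hugging Face repo linked
# in the "Weights" column above; checkpoints/cosmos_tokenizer is an arbitrary
# local target directory.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <COSMOS_TOKENIZER_REPO> encoder.jit decoder.jit \
    --local-dir checkpoints/cosmos_tokenizer
```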
We observed that the Cosmos Tokenizers do not work well for low-fps, low-resolution videos (even though video models usually need to be trained at low resolution first). Feel free to use other video tokenizers and retrain EgoM2P. :)
EgoM2P builds on 4M, a multimodal image foundation model, and extends it to the video and motion domains. We thank the 4M authors and the projects acknowledged in 4M. Although most of 4M's dependencies are not used in EgoM2P, we keep some unused 4M code in our codebase to make future extensions easier.
We follow 4M's license terms. The code in this repository is released under the Apache 2.0 license, as found in the LICENSE file.
The model weights in this repository are released under the Sample Code license as found in the LICENSE_WEIGHTS file.
If you find this repository helpful, please consider citing our work:
```bibtex
@InProceedings{Li_2025_ICCV,
    author    = {Li, Gen and Chen, Yutong and Wu, Yiqian and Zhao, Kaifeng and Pollefeys, Marc and Tang, Siyu},
    title     = {EgoM2P: Egocentric Multimodal Multitask Pretraining},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {10830-10843}
}

@InProceedings{li2024egogen,
    author    = {Li, Gen and Zhao, Kaifeng and Zhang, Siwei and Lyu, Xiaozhong and Dusmanu, Mihai and Zhang, Yan and Pollefeys, Marc and Tang, Siyu},
    title     = {EgoGen: An Egocentric Synthetic Data Generator},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14497-14509}
}
```
