Masked Modeling for Human Motion Recovery Under Occlusions
Zhiyin Qian, Siwei Zhang, Bharat Lal Bhatnagar, Federica Bogo, Siyu Tang
3DV 2026
```bash
git clone https://github.com/mikeqzy/MoRo
conda env create -f environment.yml
conda activate moro
conda install pytorch3d -c pytorch3d
pip install --no-build-isolation chumpy==0.70
pip install -r requirements.txt
```

We use smplfitter to fit the non-parametric mesh to SMPL-X parameters. Please follow their provided script to download the required files and put them under `body_models`.
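After installation, a quick way to confirm the key dependencies resolved correctly is to probe for them inside the `moro` environment (a small sketch, not part of the repo):

```python
import importlib.util

# Modules the installation steps above should have provided;
# run this inside the activated `moro` environment.
for mod in ("torch", "pytorch3d", "chumpy"):
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'ok' if found else 'MISSING'}")
```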
Additionally, download the mesh connection matrices for the SMPL-X topology (used by the fully convolutional mesh autoencoder in Mesh-VQ-VAE) and the other regressors for evaluation here. Extract them and put them under `body_models` as well.
We train the tokenizer on AMASS, MOYO, and BEDLAM. Download the SMPL-X neutral annotations from their official project pages and unzip the files.
We preprocess the datasets with the scripts at `models/mesh_vq_vae/data/preprocess`; change the dataset paths accordingly.
MoRo is trained on a mixture of datasets prepared as follows:

- AMASS: We preprocess AMASS into 30 fps sequences following RoHM; please refer to the instructions here.
- MPII, Human3.6M, MPI-INF-3DHP, COCO: The training data uses the SMPL annotations from BEDLAM. Follow the instructions here, in the section "Training CLIFF model with real images", to obtain the required training images and annotations. We further convert the SMPL annotations to SMPL-X using the provided script at `scripts/preprocess/process_hmr_smplx.py`.
- BEDLAM: Download the BEDLAM dataset from their official project page. We use the SMPL-X neutral annotations for training.
- EgoBody: First, download the EgoBody dataset from the official project page. Additionally, download `keypoints_cleaned` and `mask_joint` from here and `egobody_occ_info.csv` from here, then place them under the dataset directory. Finally, run the provided preprocessing script at `scripts/preprocess/process_egobody_bbox.py` to generate the bounding box files.
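The 30 fps resampling mentioned above for AMASS can be illustrated with a simple index-selection sketch (this is not the RoHM preprocessing script; the `(T, D)` array layout is an assumption):

```python
import numpy as np

def resample_to_30fps(poses: np.ndarray, src_fps: float) -> np.ndarray:
    """Downsample a (T, D) pose sequence to roughly 30 fps by index selection."""
    if src_fps <= 30.0:
        return poses
    idx = np.round(np.arange(0.0, len(poses), src_fps / 30.0)).astype(int)
    return poses[idx[idx < len(poses)]]

# A 240-frame clip captured at 120 fps keeps every 4th frame.
clip = np.zeros((240, 63))
print(resample_to_30fps(clip, 120.0).shape)  # -> (60, 63)
```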
Additionally, we test our method on PROX and RICH. Follow the dataset-specific steps below to place the files in the expected locations.

- RICH: Download the RICH dataset from their official project page. The preprocessed annotations can be downloaded from the GVHMR repo; put the `hmr4d_support` folder under the RICH dataset directory.
- PROX: First, download the PROX dataset from their official project page. Additionally, download `keypoints_openpose` and `mask_joint` from here and place them under the dataset directory. Finally, run the provided preprocessing script at `scripts/preprocess/process_prox_bbox.py` to generate the bounding box files.
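For intuition, bounding boxes like those produced by these preprocessing scripts can be derived as a square box around the confident 2D keypoints. The sketch below is a generic illustration, not the repo's script; the OpenPose-style `(x, y, confidence)` layout, the threshold, and the scale factor are all assumptions:

```python
import numpy as np

def keypoints_to_bbox(kp: np.ndarray, conf_thresh: float = 0.3, scale: float = 1.2):
    """Square bbox (center_x, center_y, size) around confident 2D keypoints.

    `kp` is (J, 3) with columns x, y, confidence (OpenPose-style).
    Returns None when no keypoint passes the confidence threshold.
    """
    pts = kp[kp[:, 2] > conf_thresh, :2]
    if len(pts) == 0:
        return None
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    center = (lo + hi) / 2.0
    size = scale * float((hi - lo).max())
    return float(center[0]), float(center[1]), size
```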
Make sure the required pretrained weights and released checkpoints are placed at the exact paths below.
The checkpoints for Mesh-VQ-VAE and MoRo can be downloaded here. Place the tokenizer checkpoint at `ckpt/tokenizer/tokenizer.ckpt` and the MoRo checkpoint at `exp/mask_transformer/MIMO-vit-release/video_train/checkpoints/last.ckpt`.
To train from scratch, we use the pretrained weights from 4DHumans. You can download `vit_pose_hmr2.pth` from here and place it under `ckpt/backbones`.
The original 4DHumans checkpoint is available here. From this model, we extract the ViT backbone weights and re-save them for easier loading.
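The extraction step might look like the following sketch. The `backbone.` key prefix and the checkpoint layout are assumptions about the 4DHumans checkpoint, not verified against it:

```python
def extract_prefixed_weights(state_dict: dict, prefix: str = "backbone.") -> dict:
    """Keep only the keys under `prefix`, dropping the prefix itself."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

# Usage sketch (paths are placeholders, not the real checkpoint names):
#   ckpt = torch.load("4dhumans.ckpt", map_location="cpu")
#   sd = ckpt.get("state_dict", ckpt)
#   torch.save(extract_prefixed_weights(sd), "ckpt/backbones/vit_pose_hmr2.pth")
```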
The data should be organized as follows:
```
MoRo
├── body_models
│   ├── smpl
│   ├── smplh
│   ├── smplx
│   ├── smplx_ConnectionMatrices
│   ├── J_regressor_h36m.npy
│   ├── smpl_neutral_J_regressor.pt
│   ├── smplx2smpl_sparse.pt
├── ckpt
│   ├── backbones
│   │   ├── vit_pose_hmr2.pth
│   ├── tokenizer
│   │   ├── tokenizer.ckpt
├── exp
│   ├── mask_transformer
│   │   ├── MIMO-vit-release
│   │   │   ├── video_train
│   │   │   │   ├── checkpoints
│   │   │   │   │   ├── last.ckpt
├── datasets
│   ├── mesh_vq_vae
│   │   ├── bedlam_animations
│   │   ├── AMASS_smplx
│   │   ├── MOYO
│   ├── mask_transformer
│   │   ├── AMASS
│   │   ├── BEDLAM
│   │   ├── coco
│   │   ├── h36m_train
│   │   ├── mpi-inf-3dhp
│   │   ├── mpii
│   │   ├── EgoBody
│   │   ├── PROX
│   │   ├── rich
```
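A small helper (not part of the repo) can flag missing required files from the layout above before training or inference:

```python
from pathlib import Path

# A few of the key paths from the tree above; extend as needed.
REQUIRED = [
    "body_models/smplx",
    "ckpt/backbones/vit_pose_hmr2.pth",
    "ckpt/tokenizer/tokenizer.ckpt",
    "exp/mask_transformer/MIMO-vit-release/video_train/checkpoints/last.ckpt",
]

def missing_paths(root: str = ".") -> list:
    """Return the required paths that do not exist under `root`."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]

for p in missing_paths():
    print("missing:", p)
```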
For a quick demo on a custom video taken from a static camera, run the following command on a 30 fps video:

```bash
python demo.py option=demo demo.video_path=/path/to/demo.mp4 demo.name=<video_name> demo.focal_length=<focal_length>
```

or on an image directory with sorted frames:

```bash
python demo.py demo.video_path=/path/to/image_dir demo.name=<video_name> demo.focal_length=<focal_length>
```

By default, the rendering result will be saved to `./exp/mask_transformer/MIMO-vit-release/video_train`, the same directory as the released model checkpoint.
The focal length is optional. If not provided, it will be estimated via HumanFOV from CameraHMR; please download the checkpoint following here and place it at `ckpt/cam_model_cleaned.ckpt`.
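For reference, a field-of-view estimate converts to a pixel focal length via the standard pinhole relation. This is the generic formula, not necessarily CameraHMR's exact convention (which FOV axis it uses is an assumption here):

```python
import math

def fov_to_focal(fov_deg: float, image_extent_px: float) -> float:
    """Pinhole camera: focal length in pixels from the FOV spanning `image_extent_px`."""
    return (image_extent_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

# A 90-degree FOV over a 1000 px extent gives a 500 px focal length.
print(round(fov_to_focal(90.0, 1000.0), 3))  # -> 500.0
```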
Training consists of (1) training the mesh tokenizer and (2) multi-stage training for MoRo. Configuration locations are listed in each subsection.
You can train the mesh tokenizer by running:
```bash
python train_mesh_vqvae.py
```

The configuration file is at `configs/mesh_vq_vae/config.yaml`.
We adopt a multi-stage training strategy for MoRo:
```bash
# Stage 1: Pose pretraining
python train_mask_transformer.py option=pose_pretrain tag=default
# Stage 2: Motion pretraining
python train_mask_transformer.py option=motion_pretrain tag=default
# Stage 3: Image pretraining on image datasets
python train_mask_transformer.py option=image_pretrain tag=default
# Stage 4: Image pretraining on video datasets
python train_mask_transformer.py option=video_pretrain tag=default
# Stage 5: Finetuning on video datasets
python train_mask_transformer.py option=video_train tag=default
```

The main configuration file is at `configs/mask_transformer/config.yaml`; the specific options for each training stage can be found in `configs/option`.
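The five stages can also be launched programmatically. A minimal sketch, assuming each stage picks up the previous stage's checkpoint through its config (as the staged setup implies) and using the same `option=`/`tag=` overrides as above:

```python
import subprocess

STAGES = ["pose_pretrain", "motion_pretrain", "image_pretrain",
          "video_pretrain", "video_train"]

def stage_cmd(stage: str, tag: str = "default") -> list:
    """Build the command line for one training stage."""
    return ["python", "train_mask_transformer.py", f"option={stage}", f"tag={tag}"]

for stage in STAGES:
    print(" ".join(stage_cmd(stage)))
    # subprocess.run(stage_cmd(stage), check=True)  # uncomment to launch
```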
The training logs and checkpoints will be saved under `exp/mask_transformer/MIMO-vit-<tag>/<stage>`.
Inference writes its results to `exp/mask_transformer/MIMO-vit-<tag>/video_train`, and the evaluation scripts read from the corresponding result directories.
We set `tag=release` here to reproduce the results reported in the paper.

```bash
# EgoBody
python train_mask_transformer.py option=[inference,video_train] tag=release data=egobody
python eval_egobody.py --saved_data_dir=./exp/mask_transformer/MIMO-vit-release/video_train/result_egobody/inference_5_1 --recording_name=all --render

# RICH
python train_mask_transformer.py option=[inference,video_train] tag=release data=rich
python eval_rich.py --saved_data_dir=./exp/mask_transformer/MIMO-vit-release/video_train/result_rich/inference_5_1 --seq_name=all --render

# PROX
python train_mask_transformer.py option=[inference,video_train] tag=release data=prox
python eval_egobody.py --saved_data_dir=./exp/mask_transformer/MIMO-vit-release/video_train/result_prox/inference_5_1 --recording_name=all --render
```

This work was supported as part of the Swiss AI initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID #36 on Alps, enabling large-scale training.
Some code in this repository is adapted from the following repositories:
If you find this code useful for your research, please use the following BibTeX entry.
```bibtex
@inproceedings{qian2026moro,
  title={Masked Modeling for Human Motion Recovery Under Occlusions},
  author={Qian, Zhiyin and Zhang, Siwei and Bhatnagar, Bharat Lal and Bogo, Federica and Tang, Siyu},
  booktitle={3DV},
  year={2026}
}
```
