DexImit
Learning Bimanual Dexterous Manipulation
from Monocular Human Videos
Abstract
Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos, as a direct carrier of manipulation knowledge, offer significant potential for scaling up robot learning. However, the substantial embodiment gap between human hands and robotic dexterous hands makes direct pretraining from human videos extremely challenging. To bridge this gap and unleash the potential of large-scale human manipulation video data, we propose DexImit, an automated framework that converts monocular human manipulation videos into physically plausible robot data without requiring any additional information. DexImit employs a four-stage generation pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints at near-metric scale; (2) decomposing the task into subtasks and scheduling bimanual actions; (3) synthesizing robot trajectories consistent with the demonstrated interactions; and (4) augmenting the data comprehensively for zero-shot real-world deployment. Building on these designs, DexImit can generate large-scale robot data from human videos, whether sourced from the Internet or produced by video generation models. DexImit handles diverse manipulation tasks, including tool use (e.g., cutting an apple), long-horizon tasks (e.g., making a beverage), and fine-grained manipulation (e.g., stacking cups).
Overview of DexImit: We adopt a four-stage paradigm: Reconstruction, Scheduling, Action, Augmentation. (1) Reconstruct 4D hand-object interactions and transform them into a unified world frame. (2) Decompose the manipulation process into subtasks and schedule bimanual actions for long-horizon tasks using an Action-Centric Scheduling Algorithm. (3) Generate robot trajectories via grasp synthesis and motion planning. (4) Comprehensively augment the resulting source data to enable robust policy learning.
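For readers who prefer code, the sketch below shows how the four stages compose into a single video-to-data pipeline. This is a minimal illustration only: every class and function name here (HandObjectFrame, Subtask, reconstruct, schedule, synthesize_actions, augment, video_to_robot_data) is a hypothetical placeholder, not DexImit's actual interface, and the stage bodies are left unimplemented.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical containers; the paper does not specify its internal data structures.
@dataclass
class HandObjectFrame:
    """One reconstructed frame: hand and object poses in a unified world frame."""
    left_hand_pose: List[float]
    right_hand_pose: List[float]
    object_poses: Dict[str, List[float]]  # object name -> 6-DoF pose

@dataclass
class Subtask:
    """A scheduled single-arm subtask (e.g., 'right hand grasps the cup')."""
    arm: str                        # "left" or "right"
    frames: List[HandObjectFrame]   # the interaction segment it covers

def reconstruct(video_path: str) -> List[HandObjectFrame]:
    """Stage 1 (placeholder): recover 4D hand-object interactions at
    near-metric scale and express them in a unified world frame."""
    raise NotImplementedError

def schedule(frames: List[HandObjectFrame]) -> List[Subtask]:
    """Stage 2 (placeholder): decompose the demonstration into subtasks and
    order bimanual actions (the Action-Centric Scheduling Algorithm)."""
    raise NotImplementedError

def synthesize_actions(subtasks: List[Subtask]) -> List[dict]:
    """Stage 3 (placeholder): retarget each subtask to the robot via grasp
    synthesis and motion planning, yielding robot trajectories."""
    raise NotImplementedError

def augment(trajectories: List[dict], num_variants: int = 50) -> List[dict]:
    """Stage 4 (placeholder): comprehensively augment the source trajectories
    to support robust, zero-shot real-world deployment."""
    raise NotImplementedError

def video_to_robot_data(video_path: str) -> List[dict]:
    """End-to-end pipeline: one monocular human video in, augmented robot data out."""
    frames = reconstruct(video_path)
    subtasks = schedule(frames)
    trajectories = synthesize_actions(subtasks)
    return augment(trajectories)
```

The key design point the sketch conveys is that each stage consumes only the previous stage's output, so the pipeline runs automatically on a raw monocular video with no extra annotations.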
We introduce DexImit, a framework for learning dexterous manipulation directly from videos. DexImit leverages generated or in-the-wild videos to synthesize physically plausible demonstrations for challenging tool-use, long-horizon, and fine-grained tasks. The gallery highlights the breadth of manipulation tasks generated by DexImit.
DexImit can generate physically plausible data for long-horizon and fine-grained real-world tasks.
Visualization of the synthesized actions.
Sim2Real Results: Learning from Human Videos
Side-by-side comparison of human manipulation videos (left) and the corresponding learned policy rollouts (right)
Place Apple
Human
Robot
Place Potato & Pepper
Human
Robot
Place Pot
Human
Robot
Pour Water
Human
Robot
Long-Horizon Task Data
Cook
Cut-Apple
Make-Beverage
Stack-Cup
Simulation Results: Human Demonstrations and Generated Robot Data
Side-by-side comparison of human manipulation videos (left) and corresponding robot trajectories (right)
Bimanual Pick & Place
Human
Robot
Bimanual Grasping
Human
Robot
Unimanual Pick & Place
Human
Robot
Pouring
Human
Robot
Long-Horizon 1
Human
Robot
Long-Horizon 2
Human
Robot
BibTeX
@article{Mu2025DexImit,
  title   = {DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos},
  author  = {Juncheng Mu and Sizhe Yang and Yiming Bao and Hojin Bae and Tianming Wei and Linning Xu and Boyi Li and Huazhe Xu and Jiangmiao Pang},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://arxiv.org/abs/2602.10105}
}



