Brotato
Counter-Strike 2
Stardew Valley
Minecraft
Slime Rancher
Raft
Barony
Dinkum
Generalist-IDM uses a single model to label actions on video-only data
across 2D/3D games and visual navigation/UI interactions, with no game-specific processing.
Remarkably, in Counter-Strike videos where spectator mode begins around the 10-second mark,
it distinguishes active gameplay from spectating by recognizing subtle UI elements,
and suppresses action predictions during spectator phases.
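The labeling loop described above can be sketched as follows. This is a minimal illustration, not the released OWA/D2E API: `ActionEvent`, `pseudo_label`, and the `predict` callable are hypothetical names standing in for the real Generalist-IDM interface, and the toy predictor mimics the spectator-mode suppression with a string check.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class ActionEvent:
    t_ms: int      # event timestamp (milliseconds)
    kind: str      # "key", "mouse", or "none" (no action, e.g. spectator footage)
    payload: dict  # event details, e.g. {"key": "w", "down": True}

def pseudo_label(
    frames: Sequence,           # decoded video frames
    times_ms: Sequence[int],    # per-frame timestamps
    predict: Callable,          # IDM stand-in: frame -> ActionEvent
) -> List[ActionEvent]:
    """Label video-only data, dropping frames the model flags as
    action-free (e.g. spectator mode recognized from UI elements)."""
    events = []
    for frame, t in zip(frames, times_ms):
        pred = predict(frame)
        if pred.kind == "none":   # spectator phase: emit no action
            continue
        events.append(ActionEvent(t, pred.kind, pred.payload))
    return events

# Toy stand-in predictor: frames tagged "spec" represent spectator footage.
def toy_predict(frame):
    if frame == "spec":
        return ActionEvent(0, "none", {})
    return ActionEvent(0, "key", {"key": "w", "down": True})

labels = pseudo_label(["play", "spec", "play"], [0, 100, 200], toy_predict)
```

The key property is that spectator segments contribute no pseudo-labels at all, rather than noisy "no-op" actions.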
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments, particularly gaming, offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework demonstrating that desktop interactions can serve as an effective pretraining substrate for embodied AI. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152× compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve a 96.6% overall success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA Toolkit, both human-collected and pseudo-labeled datasets, and VAPT-trained models.
The Generalist Inverse Dynamics Model (G-IDM) learns to predict actions from observation transitions across diverse desktop environments. Trained on our multi-domain corpus collected via the OWA Toolkit, G-IDM achieves strong performance across all in-distribution environments, yielding large gains over specialist baselines in Pearson correlation (e.g., +39.5 points on Stardew Valley X) and keyboard accuracy (e.g., +57.6 points on Brotato), demonstrating robust generalization over diverse control dynamics.
A key design choice of the Generalist-IDM is NEP-τ (Next-Event Prediction with Temporal Offset), which shifts the observation window forward by τ milliseconds to incorporate future visual context when predicting the current action. Without any offset (τ = 0), Pearson correlations collapse near zero and keyboard accuracy drops sharply, confirming that future context is essential for resolving the current action. A small offset (τ = 50 ms) recovers mouse prediction but remains suboptimal for keyboard accuracy. Performance stabilizes at τ ≥ 100 ms with only minor variation up to 200 ms, showing that NEP-τ is robust to the exact offset once sufficient future context is provided. We adopt τ = 100 ms as the default in all experiments.
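The windowing behind NEP-τ can be made concrete with a short sketch. This is illustrative only, with hypothetical names (`nep_tau_window`, `span_ms`); it assumes per-frame timestamps in milliseconds and shows how the observation window is shifted forward by τ so it includes future visual context past the action timestamp.

```python
def nep_tau_window(frame_times_ms, action_t_ms, span_ms=500, tau_ms=100):
    """Select observation frames for predicting the action at `action_t_ms`.

    The window end is shifted tau_ms into the future, so the model sees
    up to tau_ms of visual context *after* the action it must predict.
    """
    end = action_t_ms + tau_ms   # look tau_ms past the action timestamp
    start = end - span_ms        # keep a fixed-length context window
    return [t for t in frame_times_ms if start <= t <= end]

frames = list(range(0, 1001, 50))  # one frame every 50 ms
# With the default tau = 100 ms, the window for an action at t = 500 ms
# covers 300..600 ms, i.e. 100 ms of future context.
ctx = nep_tau_window(frames, action_t_ms=500, span_ms=300, tau_ms=100)
```

Setting `tau_ms=0` makes the window end exactly at the action timestamp, which is the degenerate case where, per the ablation above, correlations collapse: the visual consequences of the action have not yet appeared on screen.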
Ground Truth
IDM
G-IDM
We evaluate G-IDM on two unseen games: Battlefield 6 (3D) and Ogu and the Secret Forest (2D). In Battlefield 6, G-IDM achieves 63% keyboard accuracy, matching or slightly outperforming the Specialist-IDM. When provided with a few-shot prefix, the predicted mouse scale improves significantly, demonstrating in-context adaptation to mouse sensitivity. In Ogu and the Secret Forest, G-IDM more than doubles the Specialist-IDM's performance (from ~12% to nearly 28%), showing substantial gains even under a large domain gap.
Ground Truth
IDM (Fine Tune)
G-IDM (Zero Shot)
G-IDM (Few Shot)
G-IDM (Fine Tune)
In Battlefield 6, G-IDM (Zero Shot) predicts mouse movements at a different scale from the ground truth, whereas G-IDM (Few Shot) matches the ground-truth scale almost exactly. The few-shot context examples let the model calibrate the appropriate movement scale in-context.
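One way to build intuition for this calibration effect is a simple scalar fit: given a few-shot prefix with ground-truth and predicted mouse deltas, a single least-squares scale factor already closes most of the gap. To be clear, this is a hand-rolled illustration of the *effect*; the model performs the adaptation implicitly in-context, and `estimate_mouse_scale` is a hypothetical name, not part of the paper's method.

```python
def estimate_mouse_scale(gt_deltas, pred_deltas, eps=1e-8):
    """Least-squares scalar s minimizing ||gt - s * pred||^2 over a prefix."""
    num = sum(g * p for g, p in zip(gt_deltas, pred_deltas))
    den = sum(p * p for p in pred_deltas)
    return num / (den + eps)

# Few-shot prefix: zero-shot predictions come out at half the true scale.
prefix_gt   = [10.0, -20.0, 30.0]   # ground-truth mouse dx
prefix_pred = [5.0, -10.0, 15.0]    # zero-shot predicted dx

s = estimate_mouse_scale(prefix_gt, prefix_pred)   # recovers s ≈ 2.0
calibrated = [s * p for p in [8.0, -4.0]]          # rescaled new predictions
```

In practice different games map the same physical mouse motion to very different in-game camera deltas (mouse sensitivity), so a per-game scale is exactly the kind of quantity a short prefix can pin down.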
To validate the effectiveness of desktop pretraining for embodied AI, we evaluate our approach on three challenging downstream tasks: LIBERO manipulation, CANVAS navigation, and SO101 real-world pick-and-place. These benchmarks span diverse embodied scenarios requiring different sensorimotor skills: precise object manipulation, spatial navigation, and real-world grasping. Our Vision-Action Pretraining (VAPT) framework transfers desktop-pretrained representations to these physical domains, demonstrating that sensorimotor patterns learned from gaming environments can generalize to real-world robotic tasks.
VAPT without pseudo-labels achieves 96.6% total success and 93.6% on long-horizon tasks, comparable to or surpassing much larger models such as π0 (3.3B) and OpenVLA (7B). Our 1B-parameter model shows particularly strong advantages on long-horizon tasks that require careful action sequencing.
Baseline (42%)
+ VAPT w/o pseudo (98%)
+ VAPT w/ pseudo (100%)
Adding pseudo-labeled demonstrations increases navigation performance to 83.3%, an 8-point improvement over the baseline. The benefit is especially large under misleading instructions, as in sim_orchard (86.7% vs. 53.3%) and sim_street_sidewalk (73.3% vs. 40.0%), indicating that pseudo-labeling is particularly useful for navigation tasks where success depends on high-level planning rather than precise low-level control.
Baseline (fail)
+ VAPT w/ pseudo (success)
VAPT consistently outperforms the baseline on Meta-World across all difficulty levels, with gains most pronounced on Hard and Very Hard tasks. We further validate our approach with a real-world pick-and-place experiment using an SO101 robot arm, following the evaluation protocol of SmolVLA (Shukor et al., 2025). The task requires grasping a blue cube and placing it in a white box, with the cube placed at five distinct initial positions. We collect 208 demonstration episodes and evaluate each trained policy over 30 rollouts. The baseline InternVL3-1B achieves a 70% success rate, while both VAPT variants reach 80%, confirming that VAPT transfers effectively to real-world hardware.
Baseline (fail) - right view
+ VAPT (success) - right view
Baseline (fail) - top view
+ VAPT (success) - top view
@article{choi2025d2e,
title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
journal={arXiv preprint arXiv:2510.05684},
year={2025}
}