World2Act
Latent Action Post-Training via Skill-Compositional World Models
TL;DR
World2Act aligns video and action in a shared latent space, then post-trains a residual policy using transferred world-model dynamics. A skill-compositional data pipeline makes arbitrary-length rollouts stable and improves both simulation and real-world control.
Abstract
World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.
Motivation
Existing world-model post-training still depends heavily on pixel-space supervision. World2Act instead transfers dynamics in latent space.
The key gap is not whether world models can imagine behavior, but how to move those dynamics priors into executable VLA control without relying on fragile rendered supervision.
Method
The method is a two-stage pipeline: latent alignment first, then residual-policy post-training using transferred world-model dynamics.
Main pipeline
World2Act first aligns world-model video-dynamics latents with VLA action latents in a shared latent space via a contrastive matching objective, then uses the transferred dynamics signal to post-train a residual policy on top of the base VLA.
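A minimal sketch of the two ingredients is below. The symmetric InfoNCE form, the cosine similarity, the temperature value, and the additive residual are illustrative assumptions, not World2Act's exact loss or architecture:

```python
import math

def cosine(u, v):
    """Cosine similarity between two latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_alignment_loss(video_latents, action_latents, temperature=0.1):
    """Symmetric InfoNCE over paired (video-dynamics, action) latents.

    Matched pairs (same batch index) are pulled together; every other
    pairing in the batch serves as a negative. This is a generic stand-in
    for the contrastive matching objective, not the paper's exact loss.
    """
    n = len(video_latents)
    sims = [[cosine(v, a) / temperature for a in action_latents] for v in video_latents]
    loss = 0.0
    for i in range(n):
        row = sims[i]                         # video -> action direction
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        col = [sims[j][i] for j in range(n)]  # action -> video direction
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)

def residual_action(base_action, residual_correction):
    """Stage two (conceptual): the post-trained residual policy emits a
    small correction that is added to the frozen base VLA action."""
    return [a + d for a, d in zip(base_action, residual_correction)]
```

With matched latents the loss sits near its minimum; shuffling the pairing raises it, and that gradient is what pulls action latents toward the corresponding video-dynamics latents instead of toward rendered pixels.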
Experimental Results
World2Act improves over strong world-model-based post-training baselines across RoboCasa, LIBERO, and real-world robot evaluation.
RoboCasa
Best success rate shown for GR00T-N1.6-ft + World2Act.
LIBERO
Top average score shown for Cosmos Policy + World2Act.
Real-world tasks
Pick & place, pick bowl, and close drawer.
Real-world improvement
Headline gain reported for physical robot evaluation after World2Act post-training.
RoboCasa
World2Act sets a new RoboCasa state of the art: Cosmos Policy + World2Act reaches 66.3% with only 50 real demos and 50 imagined trajectories, and improves GR00T-N1.6-ft by 2.5% under the same synthetic-data budget.
LIBERO
World2Act achieves the top LIBERO average across the spatial, object, goal, and long-horizon suites: it improves GR00T-N1.6-ft from 97.0% to 98.1%, lifts Cosmos Policy by 0.1%, and outperforms DreamGen, which drops to 92.6%.
Real-world evaluation
On physical hardware, World2Act improves the baseline by an average of 6.67% across all tasks, matching the simulation trend and validating the practical viability of the WM-to-VLA pipeline.
Qualitative Results
Skill-Compositional World Model
Real-World Execution
Summary: World2Act
Bridging the gap between WM and VLA via latent alignment
Key Contributions
Latent Action Alignment: We align WM video dynamics directly with VLA action spaces through feature alignment, reducing dependence on pixel space.
Automatic Data Pipeline: We automate skill-level segmentation and alignment to reduce task-length variance, enabling stable, arbitrary-length generation through our new RoboCasa-Skill and LIBERO-Skill datasets.
Skill-Compositional Framework: We enable arbitrary-length imagination rollouts by decomposing multi-step instructions into atomic subgoals for consistent, autoregressive video generation.
Empirical Gains: Across simulation and real-world benchmarks, World2Act yields significant improvements over prior WM-based policies and post-training baselines.
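The decomposition-then-rollout idea in the contributions above can be caricatured in a few lines. The real pipeline queries an LLM to segment instructions; here a hypothetical connective-splitting heuristic stands in for the LLM call, and `rollout_by_skill` only builds a rollout plan rather than running a world model:

```python
import re

def decompose_instruction(instruction):
    """Toy stand-in for the LLM-based skill decomposition: split a
    high-level instruction into atomic, low-level subgoal prompts.
    Splitting on ", then" / "and" is a hypothetical heuristic used
    purely for illustration."""
    text = instruction.strip().rstrip(".")
    parts = re.split(r",\s*then\s+|\s+then\s+|,\s*and\s+|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]

def rollout_by_skill(subgoals, steps_per_skill=16):
    """Chain one fixed-length world-model rollout per subgoal, so the
    composed rollout length adapts to the task horizon even though the
    underlying WM was trained on fixed-length clips (conceptual sketch;
    steps_per_skill is an assumed hyperparameter)."""
    return [{"prompt": goal, "horizon": steps_per_skill} for goal in subgoals]
```

For example, "open the drawer, then pick the bowl and place it on the counter" decomposes into three atomic prompts, and the composed plan covers three fixed-length segments, which is what keeps autoregressive generation temporally consistent across varying task horizons.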