World2Act
Latent Action Post-Training via Skill-Compositional World Models
TL;DR
World2Act aligns video and action in a shared latent space, then post-trains a residual policy using transferred world-model dynamics. A skill-compositional data pipeline makes arbitrary-length rollouts stable and improves both simulation and real-world control.
Abstract
World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.
Motivation
Existing world-model post-training still depends heavily on pixel-space supervision. World2Act instead transfers dynamics in latent space.
The key gap is not whether world models can imagine behavior, but how to move those dynamics priors into executable VLA control without relying on fragile rendered supervision.
Method
The method is a two-stage pipeline: latent alignment first, then residual-policy post-training using transferred world-model dynamics.
Main pipeline
World2Act first aligns world-model video-dynamics latents with VLA action latents in a shared latent space via a contrastive matching objective, then uses the transferred dynamics signal to post-train a residual policy on top of the base VLA.
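A minimal sketch of the two ingredients is below. The symmetric InfoNCE form, the cosine similarity, the temperature value, and the additive residual are illustrative assumptions, not World2Act's exact loss or architecture:

```python
import math

def cosine(u, v):
    """Cosine similarity between two latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_alignment_loss(video_latents, action_latents, temperature=0.1):
    """Symmetric InfoNCE over paired (video-dynamics, action) latents.

    Matched pairs (same batch index) are pulled together; every other
    pairing in the batch serves as a negative. This is a generic stand-in
    for the contrastive matching objective, not the paper's exact loss.
    """
    n = len(video_latents)
    sims = [[cosine(v, a) / temperature for a in action_latents] for v in video_latents]
    loss = 0.0
    for i in range(n):
        row = sims[i]                         # video -> action direction
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        col = [sims[j][i] for j in range(n)]  # action -> video direction
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)

def residual_action(base_action, residual_correction):
    """Stage two (conceptual): the post-trained residual policy emits a
    small correction that is added to the frozen base VLA action."""
    return [a + d for a, d in zip(base_action, residual_correction)]
```

With matched latents the loss sits near its minimum; shuffling the pairing raises it, and that gradient is what pulls action latents toward the corresponding video-dynamics latents instead of toward rendered pixels.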
Experimental Results
World2Act improves over strong world-model-based post-training baselines across RoboCasa, LIBERO, and real-world robot evaluation.
RoboCasa
Best success rate shown for GR00T-N1.6-ft + World2Act.
LIBERO
Top average score shown for Cosmos Policy + World2Act.
Real-world tasks
Pick & place, pick bowl, and close drawer.
Real-world improvement
Headline gain reported for physical robot evaluation after World2Act post-training.
RoboCasa
World2Act sets a new RoboCasa state of the art: Cosmos Policy + World2Act reaches 66.3% with only 50 real demos and 50 imagined trajectories, and improves GR00T-N1.6-ft by 2.5% under the same synthetic-data budget.
LIBERO
World2Act achieves the top LIBERO average across the spatial, object, goal, and long-horizon suites: it improves GR00T-N1.6-ft from 97.0% to 98.1%, lifts Cosmos Policy by 0.1%, and outperforms DreamGen, which drops to 92.6%.
Real-world evaluation
On physical hardware, World2Act improves the baseline by an average of 6.67% across all tasks, matching the simulation trend and validating the practical viability of the WM-to-VLA pipeline.
Qualitative Results
Skill-Compositional World Model
Real-World Execution
Summary: World2Act
Bridging the gap between WM and VLA via latent alignment
Key Contributions
Latent Action Alignment: We align WM video dynamics directly with VLA action spaces through feature alignment, reducing dependence on pixel space.
Automatic Data Pipeline: We automate skill-level segmentation and alignment to reduce task-length variance, enabling stable, arbitrary-length generation through our new RoboCasa-Skill and LIBERO-Skill datasets.
Skill-Compositional Framework: We enable arbitrary-length imagination rollouts by decomposing multi-step instructions into atomic subgoals for consistent, autoregressive video generation.
Empirical Gains: Across simulation and real-world benchmarks, World2Act yields significant improvements over prior WM-based policies and post-training baselines.
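The decomposition-then-rollout idea in the contributions above can be caricatured in a few lines. The real pipeline queries an LLM to segment instructions; here a hypothetical connective-splitting heuristic stands in for the LLM call, and `rollout_by_skill` only builds a rollout plan rather than running a world model:

```python
import re

def decompose_instruction(instruction):
    """Toy stand-in for the LLM-based skill decomposition: split a
    high-level instruction into atomic, low-level subgoal prompts.
    Splitting on ", then" / "and" is a hypothetical heuristic used
    purely for illustration."""
    text = instruction.strip().rstrip(".")
    parts = re.split(r",\s*then\s+|\s+then\s+|,\s*and\s+|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]

def rollout_by_skill(subgoals, steps_per_skill=16):
    """Chain one fixed-length world-model rollout per subgoal, so the
    composed rollout length adapts to the task horizon even though the
    underlying WM was trained on fixed-length clips (conceptual sketch;
    steps_per_skill is an assumed hyperparameter)."""
    return [{"prompt": goal, "horizon": steps_per_skill} for goal in subgoals]
```

For example, "open the drawer, then pick the bowl and place it on the counter" decomposes into three atomic prompts, and the composed plan covers three fixed-length segments, which is what keeps autoregressive generation temporally consistent across varying task horizons.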