Learning Skills from Action-Free Videos

¹National Taiwan University  ²NVIDIA
ICML Workshop 2025 (early version)
We propose SOF, a method that leverages the temporal structure in videos while enabling easier translation to low-level control. SOF learns a latent skill space from optical flow representations, which better aligns video and action dynamics and thereby improves long-horizon performance.

Motivation

Temporal structure lies at the center of decision making. However, VLA models built on VLM backbones primarily capture static semantic correlations rather than temporal dynamics. One promising direction is to learn from large-scale internet videos that contain rich temporal dynamics. There are two popular paradigms:


Video Model for Decision Making

Representative work in this line includes UniPi, which learns a video planner and extracts actions using an inverse dynamics model. However, planning directly in pixel space does not guarantee a strong temporal representation: the information density between consecutive frames is relatively low (especially compared to language), and generating full-resolution pixels often fails to capture global semantics. As a result, the planner can easily hallucinate pixels that are hard to translate into low-level actions.
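To make this pipeline concrete, here is a minimal sketch of a UniPi-style setup, assuming PyTorch and placeholder module internals (the names `InverseDynamicsModel` and `plan_and_act`, and all shapes, are ours, not UniPi's actual implementation): a video planner imagines a rollout of future frames, and an inverse dynamics model maps each consecutive frame pair back to an action.

```python
# Hedged sketch of a UniPi-style pipeline: plan in pixel space, then recover
# actions with an inverse dynamics model (IDM). Module internals are placeholders.
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Maps a pair of consecutive RGB frames to a low-level action."""
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, action_dim),
        )

    def forward(self, frame_t, frame_tp1):
        # Stack the two frames along the channel dimension.
        return self.net(torch.cat([frame_t, frame_tp1], dim=1))

def plan_and_act(video_planner, idm, first_frame, goal_text):
    """Generate a pixel-space plan, then translate it into actions."""
    frames = video_planner(first_frame, goal_text)  # (T, 3, H, W) imagined rollout
    actions = [idm(frames[t:t + 1], frames[t + 1:t + 2]) for t in range(len(frames) - 1)]
    return torch.cat(actions, dim=0)                # (T - 1, action_dim)
```

The weak link is exactly the one described above: if the imagined frames drift or hallucinate, the inverse dynamics model has nothing reliable to decode into actions.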

Latent Action Model

The other line of work aims to learn a latent space that better aligns with actions. A representative example is LAPA, which learns a latent action space and models forward dynamics within this space. This results in a more compact representation that enables easier translation to low-level actions. However, prior work typically focuses on single-step forward dynamics, which fails to capture the longer-term temporal structure present in videos.
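To make the single-step limitation concrete, below is a hedged sketch of a LAPA-style latent action model; the class name, shapes, codebook size, and the simplified quantization step are our assumptions, not the paper's implementation. The encoder compresses one transition (o_t, o_{t+1}) into a discrete latent action, and the decoder predicts only the immediately next frame, so nothing in the objective looks beyond a single transition.

```python
# Hedged sketch of a latent action model with single-step forward dynamics.
# Assumes 64x64 RGB observations; straight-through gradients and auxiliary
# VQ losses are omitted for brevity.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, codebook_size: int = 32, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(      # (o_t, o_{t+1}) -> continuous latent
            nn.Conv2d(6, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, latent_dim),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # discrete latent actions
        self.obs_enc = nn.Sequential(      # o_t -> conditioning feature
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.decoder = nn.Sequential(      # (o_t feature, latent action) -> o_{t+1}
            nn.Linear(32 + latent_dim, 3 * 64 * 64),
            nn.Unflatten(1, (3, 64, 64)),
        )

    def forward(self, o_t, o_tp1):
        z_e = self.encoder(torch.cat([o_t, o_tp1], dim=1))
        dists = torch.cdist(z_e, self.codebook.weight)   # nearest-code lookup
        z_q = self.codebook(dists.argmin(dim=1))
        return self.decoder(torch.cat([self.obs_enc(o_t), z_q], dim=1))
```

However compact the latent action is, the training signal here spans only one transition, which is precisely the longer-horizon gap our method targets.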

Method

The idea is simple: we aim to learn a representation that captures the temporal structure in videos while allowing easy translation to low-level actions. Our key idea is to learn latent skills directly from optical flow, which captures only the motion between frames and is therefore closer to actions.


To model temporal structure, we use a causal convolutional layer to process optical flow over time. In the decoder, we incorporate a positional inductive bias by conditioning on the initial frame, leveraging the fact that optical flow captures relative motion. This conditioning allows the model to disentangle skill-relevant dynamics from absolute position.
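A minimal sketch of this design is given below, under assumed shapes and module choices (64x64 inputs, the hyperparameters, and the name `FlowSkillAutoencoder` are ours, not taken from the paper): per-step optical-flow features are aggregated over time by a causal 1D convolution into latent skills, and the decoder reconstructs the flow sequence conditioned on features of the initial frame.

```python
# Hedged sketch: causal temporal convolution over optical-flow features,
# with an initial-frame-conditioned decoder. Assumes 64x64 inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowSkillAutoencoder(nn.Module):
    def __init__(self, flow_dim: int = 128, skill_dim: int = 32, kernel: int = 4):
        super().__init__()
        self.kernel = kernel
        self.flow_enc = nn.Sequential(   # per-step optical flow (2, H, W) -> feature
            nn.Conv2d(2, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, flow_dim),
        )
        # Causal temporal convolution: left-pad so step t only sees steps <= t.
        self.temporal = nn.Conv1d(flow_dim, skill_dim, kernel_size=kernel)
        self.frame_enc = nn.Sequential(  # initial RGB frame -> positional context
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.decoder = nn.Linear(skill_dim + 32, 2 * 64 * 64)  # reconstruct per-step flow

    def forward(self, flows, first_frame):
        """flows: (B, T, 2, H, W) optical flow; first_frame: (B, 3, H, W)."""
        B, T = flows.shape[:2]
        feats = self.flow_enc(flows.flatten(0, 1)).view(B, T, -1)    # (B, T, flow_dim)
        feats = F.pad(feats.transpose(1, 2), (self.kernel - 1, 0))   # causal left padding
        skills = self.temporal(feats).transpose(1, 2)                # (B, T, skill_dim)
        ctx = self.frame_enc(first_frame).unsqueeze(1).expand(B, T, -1)
        recon = self.decoder(torch.cat([skills, ctx], dim=-1))
        return recon.view(B, T, 2, 64, 64)
```

Because the temporal convolution is left-padded, the skill at step t depends only on flow up to step t; conditioning the decoder on the initial frame supplies the absolute layout that relative flow alone cannot provide.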