Motivation
Temporal structure is central to decision making. However, VLA models built on VLM backbones primarily capture static semantic correlations rather than temporal dynamics. One promising direction is to learn from large-scale internet videos, which contain rich temporal dynamics. Two paradigms are popular:
Video Model for Decision Making
Representative work in this line includes UniPi, which learns a video planner and extracts actions with an inverse dynamics model. However, planning directly in pixel space does not guarantee a strong temporal representation: the information density between consecutive frames is relatively low (especially compared to language), and generating full-resolution pixels often fails to capture global semantics. As a result, the planner can easily hallucinate pixels that are hard to translate into low-level actions.
Latent Action Model
The other line of work aims to learn a latent space that aligns better with actions. A representative example is LAPA, which learns a latent action space and models forward dynamics within it. The result is a more compact representation that is easier to translate into low-level actions. However, prior work typically focuses on single-step forward dynamics, which fails to capture the longer-term temporal structure present in videos.
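For reference, the sketch below illustrates the single-step latent-action paradigm in its simplest form: an inverse model compresses a pair of consecutive observations into a small latent action, and a forward model must predict the next observation from the current one plus that latent. This is a hypothetical, simplified illustration of the paradigm, not the published LAPA architecture; the class name, dimensions, and MSE objective are placeholders chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStepLatentAction(nn.Module):
    """Simplified single-step latent-action model (illustrative, not LAPA itself).

    An inverse model compresses (o_t, o_{t+1}) into a small latent action,
    and a forward model predicts o_{t+1} from (o_t, latent action)."""

    def __init__(self, obs_dim=512, latent_dim=16):
        super().__init__()
        self.inverse = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.forward_model = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),
        )

    def forward(self, o_t, o_next):
        # Latent action: whatever transition information the bottleneck lets through.
        a_latent = self.inverse(torch.cat([o_t, o_next], dim=-1))
        # Single-step forward dynamics conditioned on the latent action.
        o_pred = self.forward_model(torch.cat([o_t, a_latent], dim=-1))
        loss = F.mse_loss(o_pred, o_next)
        return a_latent, loss

# One transition at a time: exactly the "single-step" limitation noted above;
# no structure beyond adjacent frames is modeled.
o_t, o_next = torch.randn(4, 512), torch.randn(4, 512)
a_latent, loss = SingleStepLatentAction()(o_t, o_next)
```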
Method
The goal is simple: learn a representation that captures the temporal structure of videos while remaining easy to translate into low-level actions. Our key idea is to learn latent skills directly from optical flow, which encodes only the motion between frames and is therefore much closer to actions than raw pixels.
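A minimal sketch of the data-preparation step this implies: turn a clip into a stack of dense flow fields so that the learning signal contains motion only, with appearance and absolute scene layout stripped out. The OpenCV Farneback estimator and the function name `clip_to_flow` are assumptions for illustration; any off-the-shelf flow estimator (e.g., RAFT) could stand in.

```python
import cv2
import numpy as np

def clip_to_flow(frames):
    """Turn a list of RGB frames (H, W, 3, uint8) into dense optical flow
    between consecutive frames, returned as an array of shape (T-1, H, W, 2)."""
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    flows = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Farneback dense flow; positional args are pyr_scale, levels, winsize,
        # iterations, poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    # The result carries motion only: appearance and absolute layout are gone,
    # which is what makes it closer to actions than raw pixels.
    return np.stack(flows)
```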
To model temporal structure, we use a causal convolutional layer to process optical flow over time. In the decoder, we incorporate a positional inductive bias by conditioning on the initial frame, leveraging the fact that optical flow captures relative motion. This conditioning allows the model to disentangle skill-relevant dynamics from absolute position.
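Below is a hedged PyTorch sketch of what such an encoder and decoder could look like: per-frame flow features are aggregated with a left-padded (causal) temporal convolution into latent skills, and the decoder reconstructs flow from a latent skill together with features of the initial RGB frame. All layer sizes and module names (`FlowSkillEncoder`, `FlowDecoder`) are assumptions, not the exact architecture described here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Temporal convolution that only sees past frames (left padding only)."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                        # x: (B, C, T)
        return self.conv(F.pad(x, (self.pad, 0)))

class FlowSkillEncoder(nn.Module):
    """Per-frame flow features aggregated causally over time into latent skills."""
    def __init__(self, feat_dim=256, skill_dim=64):
        super().__init__()
        self.spatial = nn.Sequential(            # encode each 2-channel flow field
            nn.Conv2d(2, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.temporal = CausalConv1d(feat_dim, skill_dim)

    def forward(self, flow):                     # flow: (B, T, 2, H, W)
        B, T = flow.shape[:2]
        feats = self.spatial(flow.flatten(0, 1)).view(B, T, -1)
        z = self.temporal(feats.transpose(1, 2)).transpose(1, 2)
        return z                                 # (B, T, skill_dim)

class FlowDecoder(nn.Module):
    """Reconstructs flow from a latent skill plus the initial RGB frame, which
    supplies the absolute-position context the skill should not have to carry."""
    def __init__(self, skill_dim=64, frame_dim=128):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, frame_dim, 4, 2, 1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Conv2d(frame_dim + skill_dim, 64, 3, 1, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, 2, 1),  # 2-channel flow field
        )

    def forward(self, z, frame0):                # z: (B, skill_dim), frame0: (B, 3, H, W)
        f = self.frame_enc(frame0)               # (B, frame_dim, H/4, W/4)
        z_map = z[:, :, None, None].expand(-1, -1, f.shape[2], f.shape[3])
        return self.head(torch.cat([f, z_map], dim=1))

# Shape check on random data.
flow = torch.randn(2, 8, 2, 64, 64)              # 8 flow fields per clip
frame0 = torch.randn(2, 3, 64, 64)
z = FlowSkillEncoder()(flow)                     # (2, 8, 64)
recon = FlowDecoder()(z[:, -1], frame0)          # (2, 2, 64, 64)
```

Because flow is relative motion, the same latent skill applied to different initial frames should produce flow at different absolute locations; conditioning the decoder on the initial frame is what lets the skill itself stay position-agnostic.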