💪 We propose CronusVLA, a unified framework that extends single-frame VLA models to a multi-frame paradigm. CronusVLA follows a two-stage process: (1) single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) multi-frame post-training, which adapts the vision-language backbone from predicting discrete tokens to producing learnable features, and aggregates historical information via feature chunking. 🔥 CronusVLA effectively addresses the key challenges of multi-frame modeling while improving performance and observational robustness:
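As a rough illustration of the feature-chunking idea, the minimal sketch below keeps a sliding window of per-frame features and mixes them into a single aggregated feature for an action head. All names, shapes, and the fixed mixing weights are hypothetical; the actual CronusVLA backbone, learnable features, and action head are not reproduced here.

```python
# Hypothetical sketch of multi-frame feature chunking: buffer the most recent
# frame features and aggregate them with mixing weights (fixed here; learnable
# in a real model). Not the actual CronusVLA implementation.
from collections import deque
import numpy as np

class FeatureChunkAggregator:
    """Keeps the last `horizon` per-frame features and aggregates them."""

    def __init__(self, horizon: int = 4, dim: int = 8):
        self.horizon = horizon
        self.dim = dim
        self.buffer = deque(maxlen=horizon)      # historical frame features
        self.weights = np.ones(horizon) / horizon  # stand-in mixing weights

    def step(self, frame_feature: np.ndarray) -> np.ndarray:
        """Append the newest frame feature and return the aggregated chunk."""
        self.buffer.append(frame_feature)
        chunk = np.stack(list(self.buffer))      # (t, dim) with t <= horizon
        w = self.weights[-len(self.buffer):]
        w = w / w.sum()                          # renormalize for short history
        return (w[:, None] * chunk).sum(axis=0)  # weighted aggregation, (dim,)

agg = FeatureChunkAggregator(horizon=4, dim=8)
for t in range(6):
    # Stand-in for a per-frame feature from the vision-language backbone.
    fused = agg.step(np.full(8, float(t)))
print(fused.shape)  # aggregated feature that would feed the action head
```

With `horizon=4`, the final step aggregates only the four most recent frame features, so early history is discarded as the window slides.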
SimplerEnv-OR (Observational Robustness) is designed to evaluate the robustness of VLA models against observational disturbances along temporal (left) and spatial (right) dimensions:
We evaluate several methods on SimplerEnv-OR, including pi-0 (JAX), pi-0 (LeRobot), TraceVLA, RoboVLMs, SpatialVLA, CogACT, and CronusVLA:
We show qualitative comparisons of spatial- and temporal-dimension testing on SimplerEnv-OR under disturbances such as Cyclic Global Full Occlusion (upper left), Cyclic Local Partial Occlusion (lower left), Constant Global Jittering (upper right), and Sparse Discrete Impulse Noise (lower right):
We evaluate our method on several real-world tasks with a Franka Research 3 robot, using a third-person camera for visual input. Three task suites are designed: (1) simple pick-and-place tasks; (2) long-horizon tasks; and (3) generalization and observational robustness tasks.
Simulation experiments include: (1) performance comparisons on the Google Robot and WidowX robot of SimplerEnv, conducted across 12 tasks under both visual matching (VM) and variant aggregation (VA) settings; (2) main results on LIBERO, reporting the average success rate across 3 seeds over 500 trials per task.
@article{li2025cronusvla,
title={CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation},
author={Li, Hao and Yang, Shuai and Chen, Yilun and Tian, Yang and Yang, Xiaoda and Chen, Xinyi and Wang, Hanqing and Wang, Tai and Zhao, Feng and Lin, Dahua and others},
journal={arXiv preprint arXiv:2506.19816},
year={2025}
}