CronusVLA: Towards Efficient and Robust
Manipulation via Multi-Frame Vision-Language-Action Modeling

Hao Li1,2*, Shuai Yang2,3*, Yilun Chen2, Xinyi Chen2, Yang Tian2, Xiaoda Yang3, Hanqing Wang2, Tai Wang2, Dahua Lin4, Feng Zhao1, Jiangmiao Pang2
1University of Science and Technology of China, 2Shanghai Artificial Intelligence Laboratory,
3Zhejiang University, 4The Chinese University of Hong Kong

*Indicates Equal Contribution, Email: [email protected]

Introduction

💪 We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. 🔥 CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness:
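The multi-frame post-training step above can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: `FeatureChunkCache`, the window size, and the mean pooling are all stand-ins (CronusVLA learns its aggregation); the sketch only shows the idea of caching per-frame features and aggregating a sliding window of history.

```python
import numpy as np
from collections import deque

class FeatureChunkCache:
    """Hypothetical sketch of feature chunking: cache per-frame features
    from the vision-language backbone and aggregate a sliding window of
    history. The mean pooling here is a placeholder for learned fusion."""

    def __init__(self, window: int = 4):
        self.chunks = deque(maxlen=window)   # keeps only the most recent frames

    def push(self, frame_feat: np.ndarray) -> np.ndarray:
        # frame_feat: (tokens, dim) learnable features for the current frame
        self.chunks.append(frame_feat)
        history = np.stack(self.chunks)      # (t, tokens, dim), t <= window
        return history.mean(axis=0)          # (tokens, dim) aggregated feature
```

The aggregated feature would then condition the action head, so inference only runs the heavy backbone on the newest frame while history is reused from the cache.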

[Figure: Introduction of CronusVLA]

Highlights

  • Efficient multi-frame (temporal) modeling and inference, available in both 7B and 0.5B versions.
  • Leading performance on SimplerEnv, LIBERO, and real-world Franka experiments.
  • High robustness on the SimplerEnv-OR benchmark and real-world robustness tests.
  • SimplerEnv-OR, a novel benchmark for quantitative evaluation of robustness under observational disturbances, featuring 24 types of observational disturbances and 120 severity levels.


Robustness (SimplerEnv-OR)

SimplerEnv-OR (Observational Robustness) is designed to evaluate the robustness of VLA models against observational disturbances along both temporal and spatial dimensions.


We evaluate several methods on SimplerEnv-OR, including pi-0 (JAX), pi-0 (LeRobot), TraceVLA, RoboVLMs, SpatialVLA, CogACT, and CronusVLA.


We show several visualizations of spatial- and temporal-dimension testing on SimplerEnv-OR: qualitative comparisons under disturbances such as Cyclic Global Full Occlusion, Cyclic Local Partial Occlusion, Constant Global Jittering, and Sparse Discrete Impulse Noise.

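As a rough illustration of how such disturbances can be synthesized, the sketch below applies a cyclic full occlusion and sparse impulse noise to a stream of frames. The function names, periods, and noise rates are illustrative assumptions, not the benchmark's actual parameters.

```python
import numpy as np

def cyclic_global_full_occlusion(frames, period=10, duty=3):
    """Black out the entire image for `duty` out of every `period` frames
    (a temporal disturbance; the parameters are illustrative)."""
    return [np.zeros_like(f) if (t % period) < duty else f
            for t, f in enumerate(frames)]

def sparse_impulse_noise(frame, p=0.05, seed=0):
    """Flip a random ~p fraction of pixels to pure black or white
    (a spatial disturbance; the rate is illustrative)."""
    rng = np.random.default_rng(seed)
    noisy = frame.copy()
    mask = rng.random(frame.shape[:2]) < p   # choose pixels to corrupt
    noisy[mask] = rng.integers(0, 2, size=(int(mask.sum()), 1)) * 255
    return noisy
```

Sweeping the period, duty cycle, or noise rate is one natural way to produce graded severity levels like those the benchmark defines.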

We evaluate our method on several real-world tasks with a Franka Research 3 robot, using a third-person camera for visual input. Three task suites are designed: (1) simple pick-and-place tasks; (2) long-horizon tasks; and (3) generalization and observational-robustness tasks.


Simulation

Simulation experiments include: (1) performance comparisons on the Google Robot and WidowX Robot setups of SimplerEnv, conducted across 12 tasks under both the visual matching (VM) and variant aggregation (VA) settings; (2) main results on LIBERO, reporting the average success rate across 3 seeds over 500 trials per task.

[Tables: SimplerEnv and LIBERO simulation results]
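The LIBERO metric in (2) is simply the mean of per-seed success rates. A minimal sketch, where the helper and the trial outcomes are made up for illustration:

```python
import numpy as np

def mean_success_rate(trials_per_seed):
    """Average the per-seed success rates; each entry is one seed's
    boolean trial outcomes (illustrative helper, not project code)."""
    return float(np.mean([np.mean(t) for t in trials_per_seed]))

# three seeds, four trials each (made-up outcomes)
seeds = [np.array([1, 1, 0, 1]),
         np.array([1, 0, 1, 1]),
         np.array([1, 1, 1, 0])]
rate = mean_success_rate(seeds)   # each seed scores 0.75, so the mean is 0.75
```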

BibTeX


        @article{li2025cronusvla,
          title={CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation},
          author={Li, Hao and Yang, Shuai and Chen, Yilun and Tian, Yang and Yang, Xiaoda and Chen, Xinyi and Wang, Hanqing and Wang, Tai and Zhao, Feng and Lin, Dahua and others},
          journal={arXiv preprint arXiv:2506.19816},
          year={2025}
        }