CronusVLA: Towards Efficient and Robust
Manipulation via Multi-Frame Vision-Language-Action Modeling

Hao Li1,2*, Shuai Yang2,3*, Yilun Chen2, Xinyi Chen2, Yang Tian2, Xiaoda Yang3, Hanqing Wang2, Tai Wang2, Dahua Lin4, Feng Zhao1, Jiangmiao Pang2
1University of Science and Technology of China, 2Shanghai Artificial Intelligence Laboratory,
3Zhejiang University, 4The Chinese University of Hong Kong

*Indicates Equal Contribution, Email: [email protected]

Introduction

💪 We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. 🔥 CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness:
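The multi-frame post-training step above can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: `FeatureChunkCache`, the window size, and the mean pooling are all stand-ins (CronusVLA learns its aggregation); the sketch only shows the idea of caching per-frame features and aggregating a sliding window of history.

```python
import numpy as np
from collections import deque

class FeatureChunkCache:
    """Hypothetical sketch of feature chunking: cache per-frame features
    from the vision-language backbone and aggregate a sliding window of
    history. The mean pooling here is a placeholder for learned fusion."""

    def __init__(self, window: int = 4):
        self.chunks = deque(maxlen=window)   # keeps only the most recent frames

    def push(self, frame_feat: np.ndarray) -> np.ndarray:
        # frame_feat: (tokens, dim) learnable features for the current frame
        self.chunks.append(frame_feat)
        history = np.stack(self.chunks)      # (t, tokens, dim), t <= window
        return history.mean(axis=0)          # (tokens, dim) aggregated feature
```

The aggregated feature would then condition the action head, so inference only runs the heavy backbone on the newest frame while history is reused from the cache.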

[Figure: Introduction of CronusVLA]

Highlights

  • Efficient multi-frame (temporal) modeling and inference, available in both 7B and 0.5B versions.
  • Leading performance on SimplerEnv, LIBERO, and real-world Franka experiments.
  • High robustness on the SimplerEnv-OR benchmark and real-world robustness tests.
  • SimplerEnv-OR, a novel benchmark for quantitative evaluation of robustness under observational disturbances, featuring 24 types of observational disturbances and 120 severity levels.


Robustness (SimplerEnv-OR)

SimplerEnv-OR (Observational Robustness) is designed to evaluate the robustness of VLA models against observational disturbances along both temporal and spatial dimensions.


We evaluate several methods on SimplerEnv-OR, including pi-0 (JAX), pi-0 (LeRobot), TraceVLA, RoboVLMs, SpatialVLA, CogACT, and CronusVLA.


We show several visualizations of spatial- and temporal-dimension testing on SimplerEnv-OR: qualitative comparisons under disturbances such as Cyclic Global Full Occlusion, Cyclic Local Partial Occlusion, Constant Global Jittering, and Sparse Discrete Impulse Noise.

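As a rough illustration of how such disturbances can be synthesized, the sketch below applies a cyclic full occlusion and sparse impulse noise to a stream of frames. The function names, periods, and noise rates are illustrative assumptions, not the benchmark's actual parameters.

```python
import numpy as np

def cyclic_global_full_occlusion(frames, period=10, duty=3):
    """Black out the entire image for `duty` out of every `period` frames
    (a temporal disturbance; the parameters are illustrative)."""
    return [np.zeros_like(f) if (t % period) < duty else f
            for t, f in enumerate(frames)]

def sparse_impulse_noise(frame, p=0.05, seed=0):
    """Flip a random ~p fraction of pixels to pure black or white
    (a spatial disturbance; the rate is illustrative)."""
    rng = np.random.default_rng(seed)
    noisy = frame.copy()
    mask = rng.random(frame.shape[:2]) < p   # choose pixels to corrupt
    noisy[mask] = rng.integers(0, 2, size=(int(mask.sum()), 1)) * 255
    return noisy
```

Sweeping the period, duty cycle, or noise rate is one natural way to produce graded severity levels like those the benchmark defines.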

We evaluate our method on several real-world tasks with a Franka Research 3 robot, using a third-person camera for visual input. Three task suites are designed: (1) simple pick-and-place tasks; (2) long-horizon tasks; and (3) generalization and observational-robustness tasks.


Simulation

Simulation experiments include: (1) performance comparisons on the Google Robot and WidowX Robot setups of SimplerEnv, conducted across 12 tasks under both the visual matching (VM) and variant aggregation (VA) settings; (2) main results on LIBERO, reporting the average success rate across 3 seeds over 500 trials per task.

[Tables: SimplerEnv and LIBERO simulation results]
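The LIBERO metric in (2) is simply the mean of per-seed success rates. A minimal sketch, where the helper and the trial outcomes are made up for illustration:

```python
import numpy as np

def mean_success_rate(trials_per_seed):
    """Average the per-seed success rates; each entry is one seed's
    boolean trial outcomes (illustrative helper, not project code)."""
    return float(np.mean([np.mean(t) for t in trials_per_seed]))

# three seeds, four trials each (made-up outcomes)
seeds = [np.array([1, 1, 0, 1]),
         np.array([1, 0, 1, 1]),
         np.array([1, 1, 1, 0])]
rate = mean_success_rate(seeds)   # each seed scores 0.75, so the mean is 0.75
```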

BibTeX


        @article{li2025cronusvla,
          title={CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation},
          author={Li, Hao and Yang, Shuai and Chen, Yilun and Tian, Yang and Yang, Xiaoda and Chen, Xinyi and Wang, Hanqing and Wang, Tai and Zhao, Feng and Lin, Dahua and others},
          journal={arXiv preprint arXiv:2506.19816},
          year={2025}
        }