An open-source research platform for integrating and exploring cutting-edge technologies for generalist robots.
📢 Citation Update: Our technical report is now on arXiv (2604.05014). We kindly invite you to use the updated BibTeX for any ongoing or future citations. If you have already cited StarVLA in a previous version of your work, we would greatly appreciate it if you could update the citation entry in your camera-ready or future revisions. Thank you for your understanding and support! 🙏
In StarVLA (also a pun on "start VLA"), each functional component (model, data, trainer, config, evaluation, etc.) follows a top-down, intuitive separation and the high-cohesion, low-coupling principle, enabling plug-and-play design, rapid prototyping, and independent debugging.
⚠️ Branch notice: The `starVLA_dev` branch is where we actively merge new features and may be temporarily unstable. For verified results, use the stable `starVLA` branch. Thanks to StarVLA's low-coupling design, switching between branches is painless. We encourage trying `starVLA_dev` and welcome PRs if you spot any issues!
💡 Tip: Files under any `**/bar/` directory are git-ignored, so you can place your custom scripts there (e.g., `examples/LIBERO/train_files/bar/my_train.sh`) without polluting the repo.
[2026/04/09] 🔜 🚀 A unified multi-benchmark co-training example (combining LIBERO, SimplerEnv, RoboTwin, VLA-Arena, etc.) is coming soon. Stay tuned!
[2026/04/19] 📋 As community PRs grow rapidly, we are establishing PR guidelines to maintain code quality and stability. Thank you all for your contributions! Please review the new PR Guidelines and Branching Strategy before submitting PRs.
[2026/04/09] 🎯 Thanks to the RLinf team, StarVLA now supports RL post-training! Check out the StarVLA × RLinf tutorial to get started.
[2026/04/09] 🔥 WM4A (World Model for Action) is now integrated! Use pretrained video-generation DiT models (Cosmos-Predict2, Wan2.2) as backbones for action prediction. See docs/WM4A.md for architecture details and training instructions.
[2026/03/29] 🔥 Thanks to the ABot-M0 team for providing the pre-trained weights. For Qwen3-VL 4B, you can reload the qwen_vl_interface module in various frameworks!
[2026/03/19] 🔥 StarVLA now provides a complete real-robot development case with Franka robot examples!
[2026/03/03] 🔥 We now support Qwen3.5 as a backbone for VLA — the fastest integration in the community ⚡ With more model size options: 0.8B, 2B, 4B, and 9B! Build your VLA flexibly on top of native multimodal models!
[2026/01/29] 🔥 StarVLA Training Efficiency Report & Training Curves released! Training configs and efficiency benchmarks for community reference.
[2026/01/29] Calvin benchmark experiments were conducted by the UNT team. For inquiries, please contact Zhijie Song ([email protected]) or Feng Yan ([email protected]).
[2025/12/25] We've simultaneously established pipelines for Behavior-1K, RoboTwin 2.0, and CALVIN. We'd love to collaborate and share baseline results for more benchmarks with the community!
Prior Timeline
[2025/12/25] We've released RoboCasa evaluation support; the model was trained without pretraining and reached SOTA performance. Check out more details in examples/Robocasa_tabletop.
[2025/12/15] Completed a release regression check to ensure the public code runs smoothly. Routine updates—including recent support for the LeRobot dataset v3.0 and DeepSpeed ZeRO-3—will continue to appear in the 🚧 Daily Development Log.
[2025/12/09] Became the first open-source repository to support "train your VLM", "train your VLA", and "train your VLA with VLM". Check out how to co-train your VLA with multimodal data in examples/CoTrainVLM.
[2025/11/12] We now support Florence-2 as a smaller VLM for resource-constrained development. StarVLA can now run on a single A100 GPU. See the 🚀Train with a smaller VLM section for more details.
[2025/10/30]: We released the LIBERO Training & Evaluation README. Results are very promising. More details are in examples/LIBERO.
[2025/10/25]: We fixed several script links, so everything runs more smoothly now. Thanks to the community for the feedback.
Overview of the StarVLA framework. We present a unified and modular pipeline that connects heterogeneous data sources, pluggable dataloaders, and flexible data representations with a standardized model forwarding interface. The framework supports diverse vision-language foundation models and VLA architectures, enabling end-to-end training and deployment.
Various VLA Frameworks
All variants share the same data interface and infrastructure; only the action head differs.
- StarVLA-FAST: Autoregressive discrete action tokens via a fast tokenizer (à la π₀-fast).
- StarVLA-OFT: Parallel continuous action decoding with an MLP head (à la OpenVLA-OFT/EO).
- StarVLA-PI: Flow-Matching action expert for diffusion-based continuous actions (à la π₀).
- StarVLA-GR00T: Dual-system architecture — VLM as System 2, Flow-Matching as System 1 (à la GR00T).
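The "only the action head differs" contract can be pictured as a small interface. The sketch below is purely illustrative: the names `ActionHead`, `MLPHead`, and `predict` are ours, not StarVLA's API, and the "MLP" is reduced to a single linear map to stay self-contained.

```python
from abc import ABC, abstractmethod
import numpy as np

class ActionHead(ABC):
    """Hypothetical pluggable action-head interface: every variant consumes
    the same fused VLM features and emits an action chunk."""

    @abstractmethod
    def predict(self, features: np.ndarray) -> np.ndarray:
        """Map fused features [hidden_dim] to actions [T, action_dim]."""

class MLPHead(ActionHead):
    """OFT-style parallel continuous decoding, reduced to one linear map."""

    def __init__(self, hidden_dim: int, horizon: int, action_dim: int):
        rng = np.random.default_rng(0)
        self.w = rng.normal(0.0, 0.02, (hidden_dim, horizon * action_dim))
        self.horizon, self.action_dim = horizon, action_dim

    def predict(self, features: np.ndarray) -> np.ndarray:
        # One parallel decode of the whole action chunk, no autoregression.
        return (features @ self.w).reshape(self.horizon, self.action_dim)

head: ActionHead = MLPHead(hidden_dim=16, horizon=8, action_dim=7)
actions = head.predict(np.zeros(16))
print(actions.shape)  # (8, 7)
```

Under this contract, swapping StarVLA-OFT for StarVLA-PI would mean replacing only the `ActionHead` implementation; data interface and trainer stay untouched.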
Various Training Recipes
Every recipe is paradigm-agnostic and applies uniformly to all supported frameworks.
- Supervised fine-tuning (SFT)
- Multimodal Multi-objectives Co-Training
- Cross-embodiment Co-Training
- Reinforcement Learning Adaptation
Broad Benchmark Integration
StarVLA achieves state-of-the-art (SOTA) performance on a variety of benchmarks:
- SimplerEnv
- LIBERO
- LIBERO-plus
- RoboCasa
- RoboTwin
- BEHAVIOR
- SO101
- CALVIN (see details in examples/calvin)
- RLBench
📖 New to StarVLA? Check out our step-by-step Quick Start Guide — a complete walkthrough from installation to training to evaluation using the LIBERO benchmark.
We have more results for RoboCasa, RoboTwin 2.0, Behavior-1K, and CALVIN. See our 🍀 Overleaf, which continuously presents our real-time experimental results.
See the full list of released models and checkpoints in docs/model_zoo.md.
👇 StarVLA achieves "Lego-like" development via the following designs:
1. Smoke test any submodule
StarVLA emphasizes a modular model design. Each major framework file can be run standalone for rapid debugging and smoke-testing your code. For example:
# model
python starVLA/model/framework/QwenOFT.py --config_yaml starvla_cotrain_oxe.yaml
# dataloader
python starVLA/dataloader/lerobot_datasets.py --config_yaml starvla_cotrain_oxe.yaml

Note: starVLA/model/framework/yourframework.py is the single external API surface of the model; it should mirror (be structurally isomorphic to) the framework diagram in your paper.
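One common way to make every module file standalone-runnable is a tiny entry point that builds the component from a config and prints a status line. The sketch below shows only the pattern; `build_framework` and `smoke_test` are hypothetical stand-ins (in the real files the config would come from `--config_yaml`, here it is inlined):

```python
# Hypothetical smoke-test pattern for a framework file.

def build_framework(cfg: dict) -> dict:
    """Stand-in for the real framework constructor."""
    name = cfg.get("framework", {}).get("name", "QwenOFT")
    return {"name": name, "ready": True}

def smoke_test(cfg: dict) -> str:
    """Build the component once and fail loudly if it cannot initialize."""
    model = build_framework(cfg)
    assert model["ready"], "framework failed to initialize"
    return f"[smoke-test] built {model['name']} OK"

# The real files read the config from --config_yaml; we inline one here.
print(smoke_test({"framework": {"name": "QwenOFT"}}))
```

Because each file carries its own entry point, a broken dataloader or head can be debugged in isolation before any full training launch.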
2. Explicit model boundaries
StarVLA follows top‑down decomposition and the principle of high cohesion & low coupling.
For example:
- Dataloader
- Returns a raw, model‑agnostic dict only; no model‑specific preprocessing (e.g., tokenizer, image encoding).
- A single sample should include (add/remove as needed):
- image: list[PIL.Image] | np.ndarray
- lang: str
- action: np.ndarray[T, action_dim]
- state: Optional[np.ndarray[..., state_dim]]
Both framework.forward() and framework.predict_action() operate directly on raw inputs, keeping train/test boundaries explicit and easy to hack.
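A minimal dataloader honoring this contract might look like the following sketch. The field names follow the list above; the dataset class itself and its constant sample are our own illustration, not StarVLA code.

```python
import numpy as np

class RawSampleDataset:
    """Hypothetical model-agnostic dataset: __getitem__ returns raw fields
    only, leaving tokenization and image preprocessing to the framework's
    forward()/predict_action()."""

    def __init__(self, horizon: int = 8, action_dim: int = 7):
        self.horizon, self.action_dim = horizon, action_dim

    def __len__(self) -> int:
        return 1

    def __getitem__(self, idx: int) -> dict:
        return {
            "image": np.zeros((224, 224, 3), dtype=np.uint8),  # or list[PIL.Image]
            "lang": "pick up the red block",
            "action": np.zeros((self.horizon, self.action_dim), dtype=np.float32),
            "state": None,  # optional proprioceptive state
        }

sample = RawSampleDataset()[0]
print(sorted(sample))  # ['action', 'image', 'lang', 'state']
```

Keeping the sample this plain is what lets the same dataloader feed every framework variant: any model-specific preprocessing lives behind the framework boundary.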
3. Flexible configuration system
StarVLA uses a single global configuration object. Parameters are passed primarily via extensible dicts, allowing overrides and controlled redundancy.
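Dict-based overrides of the kind shown in the launch examples (e.g. `--framework.qwenvl.base_vlm ...`) amount to a dotted-key merge into the global config. The helper below is an illustrative sketch of that idea, not StarVLA's actual implementation:

```python
def apply_override(cfg: dict, dotted_key: str, value) -> dict:
    """Set cfg['a']['b']['c'] = value for dotted_key 'a.b.c', creating
    intermediate dicts as needed (the 'controlled redundancy' part)."""
    node = cfg
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return cfg

cfg = {"framework": {"qwenvl": {"base_vlm": "Qwen/Qwen2.5-VL-3B-Instruct"}}}
apply_override(cfg, "framework.qwenvl.base_vlm", "Qwen/Qwen2.5-VL-7B-Instruct")
apply_override(cfg, "framework.action_model.new_module", "my_module")
print(cfg["framework"]["qwenvl"]["base_vlm"])  # Qwen/Qwen2.5-VL-7B-Instruct
```

Because unknown dotted keys simply create new entries, a plug-in module can register its own options without touching the base YAML.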
See docs/faq.md for common questions on configuration, freezing, learning rates, checkpointing, smaller VLMs, and more.
StarVLA was co-founded by Jinhui Ye and Weiyu Guo, and has been gradually expanded by community contributors. Community contributors are the driving force behind StarVLA's growing ecosystem. We deeply appreciate every PR, bug fix, and piece of feedback from the open-source community — your efforts keep StarVLA evolving rapidly. A full, continuously updated contributor list is maintained at starvla.github.io/contributors.
Thanks to all the people who have contributed to StarVLA:
See docs/CONTRIBUTING.md for guidelines on reporting bugs, proposing features, and submitting PRs.
NeuroVLA: A Brain-like Embodied Intelligence for Fluid and Fast Reflexive Robotics Control
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Examples:
accelerate launch \
--config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \
--num_processes 8 \
starVLA/training/train_internvla.py \
--config_yaml ./starVLA/config/training/starvla_cotrain_oxe.yaml \
--framework.qwenvl.base_vlm Qwen/Qwen2.5-VL-7B-Instruct \ # override framework choice
--framework.action_model.new_module ${module_name} # plug in a new module to the action model

Note: framework.action_model.new_module only adds the entry to the global config; its behavior is up to your framework implementation.
Q: Can I freeze the VLM via parameters?
A: Yes. StarVLA uses a regex / name list to control freezing. Example:
--trainer.freeze_modules "qwen_vl_interface.model.model.visual,dino_encoder" \
Tips: You can print(your_model) first to check the relative paths of your modules and list them as comma-separated values.
(implementation in TrainerUtils.freeze_backbones.)
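In spirit, name-list freezing marks every parameter frozen whose qualified name starts with one of the listed prefixes. The torch-free sketch below conveys the idea only; the function name is ours, and StarVLA's real version lives in TrainerUtils.freeze_backbones.

```python
def freeze_modules(named_params: dict, freeze_spec: str) -> dict:
    """Hypothetical sketch: decide trainability from a comma-separated list
    of module-name prefixes (as in --trainer.freeze_modules "a.b,c").
    Returns {param_name: True if trainable, False if frozen}."""
    prefixes = [p.strip() for p in freeze_spec.split(",") if p.strip()]
    return {
        name: not any(name.startswith(p) for p in prefixes)
        for name in named_params
    }

params = {
    "qwen_vl_interface.model.model.visual.blocks.0.attn.qkv.weight": None,
    "dino_encoder.patch_embed.proj.weight": None,
    "action_model.net.0.weight": None,
}
trainable = freeze_modules(
    params, "qwen_vl_interface.model.model.visual,dino_encoder"
)
print(trainable["action_model.net.0.weight"])  # True
```

In the real trainer the same decision would flip `requires_grad` on each parameter instead of returning a dict.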
Q: Can I set different learning rates for different modules?
A: Yes, starVLA uses a name: value dict to control learning-rate groups. Config example:
trainer:
learning_rate:
base: 1e-05 # other modules
qwen_vl_interface: 1.0e-05
action_model: 1.0e-04

(Also referenced in trainer_tools.build_param_lr_groups.)
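The config above maps onto optimizer parameter groups roughly as follows. This is a simplified stand-in for trainer_tools.build_param_lr_groups; the grouping rule (match on the top-level module name, fall back to `base`) is our assumption about its behavior.

```python
def build_param_lr_groups(param_names: list, lr_cfg: dict) -> list:
    """Hypothetical sketch: bucket parameters by top-level module name,
    falling back to lr_cfg['base'] when no specific rate is configured."""
    buckets: dict = {}
    for name in param_names:
        module = name.split(".", 1)[0]
        lr = lr_cfg.get(module, lr_cfg["base"])
        buckets.setdefault(lr, []).append(name)
    # Shape matches what an optimizer's param_groups would expect.
    return [{"params": names, "lr": lr} for lr, names in buckets.items()]

lr_cfg = {"base": 1e-05, "qwen_vl_interface": 1.0e-05, "action_model": 1.0e-04}
groups = build_param_lr_groups(
    ["qwen_vl_interface.lm_head.weight",
     "action_model.net.0.weight",
     "dino_encoder.proj.weight"],  # no entry -> gets the base rate
    lr_cfg,
)
print(len(groups))  # 2 (one group at 1e-05, one at 1e-04)
```

In a real trainer these dicts would be handed to the optimizer as per-parameter `param_groups`.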
Q: Can I resume training from a checkpoint?
A: Yes, partially. Specify the latest checkpoint path in config.yaml, e.g.:
trainer:
pretrained_checkpoint: path_to_steps_10000.pt
reload_modules: "action_model"

An empty reload_modules loads the full model. Note that starVLA does not save optimizer state: it requires a lot of memory/disk while bringing limited benefit.
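Selective reloading of the kind `reload_modules: "action_model"` describes can be sketched as filtering a checkpoint's state dict by module prefix before loading. The helper name and exact matching rule below are our illustration, not StarVLA's implementation.

```python
def filter_state_dict(state_dict: dict, reload_modules: str) -> dict:
    """Hypothetical sketch: keep only entries whose name starts with one of
    the comma-separated module prefixes; an empty spec keeps everything."""
    if not reload_modules.strip():
        return dict(state_dict)
    prefixes = [p.strip() for p in reload_modules.split(",")]
    return {k: v for k, v in state_dict.items()
            if any(k.startswith(p) for p in prefixes)}

ckpt = {"action_model.net.0.weight": 0,
        "qwen_vl_interface.lm_head.weight": 1}
partial = filter_state_dict(ckpt, "action_model")
print(sorted(partial))  # ['action_model.net.0.weight']
```

The filtered dict would then be loaded non-strictly, so modules outside the listed prefixes keep their freshly initialized (or pretrained) weights.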
🚀 Train with a smaller VLM
accelerate launch \
--config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank $SLURM_PROCID \
--num_machines $SLURM_NNODES \
--num_processes=${TOTAL_GPUS} \
starVLA/training/train_starvla.py \
--config_yaml ./starVLA/config/training/starvla_cotrain_oxe.yaml \
--framework.name QwenGR00T \
--framework.qwenvl.base_vlm microsoft/Florence-2-large \
--run_root_dir ${run_root_dir} \
--run_id ${run_id} \
--wandb_project your_project \
--wandb_entity your_name

Note: To ensure better compatibility with already released checkpoints, we are continuing to use --framework.qwenvl. This parameter will be unified in the next release.
StarVLA is released under the MIT License, which permits commercial use, modification, distribution, and private use. Rebases are allowed for forks and feature branches; when rebasing from upstream StarVLA, use descriptive commit messages (e.g., "chore: rebase from StarVLA") and keep the two latest upstream commits separate. See License for details.
@article{community2026starvla,
title={StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
author={Community, StarVLA},
journal={arXiv preprint arXiv:2604.05014},
year={2026}
}

This project draws inspiration and references from several notable open-source initiatives, including:
The codebase was originally forked from InternVLA-M1.
Here's how our community has grown over time:




