| Paper | Website | Model & Data |
|---|---|---|
A simple and efficient Vision-Language-Action (VLA) model for robot manipulation tasks.
## Installation

```bash
conda create -n simvla python=3.10 -y
conda activate simvla
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install "transformers>=4.57.0"
pip install peft accelerate fastapi tensorboard uvicorn json_numpy safetensors scipy einops timm mmengine pyarrow h5py mediapy num2words av wandb websockets msgpack_numpy
pip install flash-attn==2.5.6 --no-build-isolation
pip install tensorflow tensorflow-datasets
```

**Important:** Use `transformers>=4.57.0`.
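After installation, a quick sanity check like the following (an illustrative snippet, not part of the repo) confirms the version pin and the optional flash-attn build:

```python
# Illustrative post-install sanity check (not part of the repo).
import torch
import transformers
from packaging.version import Version  # installed as a transformers dependency

# This repo requires transformers>=4.57.0.
assert Version(transformers.__version__) >= Version("4.57.0"), transformers.__version__
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # built above with --no-build-isolation
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; attention falls back to the default implementation")
```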
## Dataset Preparation

Download the LIBERO dataset and place it under `./datasets/metas/`.

First, build the training metadata file:
```bash
python create_libero_meta.py \
    --data_dir ./datasets/metas \
    --subsets libero_10 libero_goal libero_object libero_spatial \
    --output ./datasets/metas/libero_train.json
```

Then compute the action normalization statistics:

```bash
python compute_libero_norm_stats.py \
    --data_dir ./datasets/metas \
    --subsets libero_10 libero_goal libero_object libero_spatial \
    --output ./norm_stats/libero_norm.json
```
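The resulting `libero_norm.json` holds per-dimension statistics used to normalize actions during training and to denormalize model outputs at inference. A minimal sketch of that pattern follows; the field names here are assumptions, and `compute_libero_norm_stats.py` defines the actual schema:

```python
# Per-dimension action normalization sketch (field names are assumptions;
# consult compute_libero_norm_stats.py for the real schema).
import json
import numpy as np

with open("./norm_stats/libero_norm.json") as f:
    stats = json.load(f)

action_mean = np.asarray(stats["action"]["mean"])
action_std = np.asarray(stats["action"]["std"])

def normalize(action: np.ndarray) -> np.ndarray:
    """Map raw actions to roughly zero mean / unit variance for training."""
    return (action - action_mean) / (action_std + 1e-8)

def denormalize(action: np.ndarray) -> np.ndarray:
    """Invert normalization on model outputs before sending them to the robot."""
    return action * (action_std + 1e-8) + action_mean
```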
## Training

**Small Model Configuration:**

```bash
bash train_smolvlm_small.sh
```

**Large Model Configuration:**
```bash
bash train_smolvlm_large.sh
```

## Evaluation

```bash
cd evaluation/libero
```
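The scripts under `evaluation/libero` drive the benchmark; for orientation, a typical LIBERO rollout looks roughly like the sketch below. The `libero` calls follow the upstream LIBERO benchmark package, and `predict_action` is a hypothetical stand-in for inference with a trained SimVLA checkpoint:

```python
# Rough sketch of a LIBERO evaluation rollout (illustrative only).
import os
import numpy as np
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

def predict_action(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical stand-in: replace with inference through a trained
    SimVLA checkpoint. Returns a dummy 7-DoF action here."""
    return np.zeros(7)

task_suite = benchmark.get_benchmark_dict()["libero_spatial"]()
task = task_suite.get_task(0)
bddl_file = os.path.join(
    get_libero_path("bddl_files"), task.problem_folder, task.bddl_file
)

env = OffScreenRenderEnv(bddl_file_name=bddl_file, camera_heights=256, camera_widths=256)
env.reset()
obs = env.set_init_state(task_suite.get_task_init_states(0)[0])

for _ in range(500):  # per-episode step budget
    action = predict_action(obs["agentview_image"], task.language)
    obs, reward, done, info = env.step(action)
    if done:  # task solved
        break
env.close()
```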
## Model Architecture

- Vision-Language Backbone: SmolVLM-500M-Instruct (576 hidden dim)
- Action Transformer: configurable depth and width (see the configuration sketch below)
  - Small: 768 hidden, 12 layers, 12 heads
  - Large: 1024 hidden, 24 layers, 16 heads
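A minimal sketch of these two configurations as they might be expressed in code; the dataclass and field names are illustrative, not the repo's actual config API:

```python
# Illustrative configuration sketch; names are hypothetical, the numbers
# mirror the list above.
from dataclasses import dataclass

@dataclass
class ActionTransformerConfig:
    hidden_size: int
    num_layers: int
    num_heads: int
    vlm_hidden_size: int = 576  # SmolVLM-500M-Instruct backbone width

SMALL = ActionTransformerConfig(hidden_size=768, num_layers=12, num_heads=12)
LARGE = ActionTransformerConfig(hidden_size=1024, num_layers=24, num_heads=16)
```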
## Citation

If you find our code useful, please consider citing our work:
```bibtex
@article{luo2026simvla,
  title={SimVLA: A Simple VLA Baseline for Robotic Manipulation},
  author={Luo, Yuankai and Chen, Woping and Liang, Tong and Wang, Baiqiao and Li, Zhenguo},
  journal={arXiv preprint arXiv:2602.18224},
  year={2026}
}
```