Green-VLA

Staged Vision-Language-Action Model for Generalist Robots


Sber Robotics Center · Manipulation Team


Paper | Project Page | Models | Quick Start | Benchmarks






Overview

Green-VLA is a ~4B-parameter Vision-Language-Action model built on a staged curriculum:

| Stage | Name | Purpose |
|---|---|---|
| Base | Multi-Embodiment Pretrained Model | VLM backbone with multimodal grounding, trained on 3,000+ hours of demonstrations |
| R1 | Embodiment-Specific Adaptation | Supervised fine-tuning for target robot |
| R2 | RL Policy Alignment | Trajectory optimization beyond behavior cloning |

For full details, see our paper and project page.



Models

We release model checkpoints spanning base pretrained models and embodiment-adapted variants:

| Model | Stage | Params | Description | Link |
|---|---|---|---|---|
| GreenVLA-2b-base | Base | 2B | Base pretrained model (lightweight) | Hub |
| GreenVLA-5b-base-stride-1 | Base | 5B | Base pretrained, action expert depth = VLM depth | Hub |
| GreenVLA-5b-base-stride-4 | Base | 5B | Base pretrained, action expert depth = VLM depth / 4 | Hub |
| GreenVLA-5b-stride-1-R1-bridge | R1 | 5B | Fine-tuned on Bridge (WidowX) | Hub |
| GreenVLA-5b-stride-1-R2-bridge | R2 | 5B | RL-aligned on Bridge (WidowX) | Hub |
| GreenVLA-5b-stride-4-R1-fractal | R1 | 5B | Fine-tuned on Fractal (Google Robot) | Hub |
| GreenVLA-5b-stride-4-R2-calvin | R2 | 5B | RL-aligned on CALVIN | Hub |

Recommendation: Start with GreenVLA-5b-base-stride-1 for fine-tuning on your own embodiment, or use one of the R1/R2 checkpoints for direct evaluation.
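
As a rough illustration of how the released checkpoints map to stages and datasets, the sketch below indexes them by (stage, dataset). `CHECKPOINTS` and `pick_checkpoint` are hypothetical names introduced here and are not part of the repository. The `SberRoboticsCenter/` prefix is taken from the Hub ID used in the Quick Start example; that every checkpoint lives under the same organization is an assumption.

```python
# Hypothetical lookup over the 5B checkpoints listed in the Models table.
# Assumption: every checkpoint is hosted under the same Hub organization
# as the one shown in the Quick Start example.
HUB_ORG = "SberRoboticsCenter"

CHECKPOINTS = {
    ("Base", None): "GreenVLA-5b-base-stride-1",  # recommended fine-tuning start
    ("R1", "bridge"): "GreenVLA-5b-stride-1-R1-bridge",
    ("R2", "bridge"): "GreenVLA-5b-stride-1-R2-bridge",
    ("R1", "fractal"): "GreenVLA-5b-stride-4-R1-fractal",
    ("R2", "calvin"): "GreenVLA-5b-stride-4-R2-calvin",
}


def pick_checkpoint(stage="Base", dataset=None):
    """Return a Hub-style ID for a released checkpoint, or raise KeyError."""
    key = (stage, dataset)
    if key not in CHECKPOINTS:
        raise KeyError(f"no released checkpoint for stage={stage!r}, dataset={dataset!r}")
    return f"{HUB_ORG}/{CHECKPOINTS[key]}"


print(pick_checkpoint())                # base model for fine-tuning
print(pick_checkpoint("R2", "bridge"))  # RL-aligned Bridge checkpoint
```

A combination that is not in the table, such as R1 on CALVIN, raises `KeyError` rather than guessing a name.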

Choosing stride-1 vs stride-4 (5B base)

The 5B base is available in two action-expert variants: stride-1 (action expert has the same number of layers as the VLM) and stride-4 (action expert has 4× fewer layers). Use the following to decide:

| Criterion | Stride-1 | Stride-4 |
|---|---|---|
| Inference VRAM | ~12.5 GB | ~11.1 GB |
| Training batch size (80 GB GPU, 3×448² images, tokenizer_max_length=640) | 5 | 6 |
| Inference speed (4B proxy, see below) | Slower | Faster |

Inference time (mean seconds per forward pass, 4B model; 5 warmup + 50 benchmark iterations):

| Compiled | Stride-1 | Stride-4 |
|---|---|---|
| No | 0.273 s | 0.124 s |
| Yes | 0.181 s | 0.098 s |

  • Prefer stride-1 when you need maximum action capacity and have enough VRAM. It is the recommended starting point for fine-tuning on a new embodiment, and the released Bridge checkpoints (R1 and R2) use this variant.
  • Prefer stride-4 when you are memory- or latency-bound. The released Fractal R1 and CALVIN R2 checkpoints use this variant.
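
To make the trade-off concrete, the sketch below works through the arithmetic from the tables above: the speedup of stride-4 over stride-1, the corresponding forward-pass rate, and a simple budget check. `choose_variant` and the budget values are hypothetical and introduced only for illustration. The VRAM figures are for the 5B base while the timings are for the 4B proxy, so combining them is a simplification.

```python
# Numbers copied from the tables above.
# Latency: mean seconds per forward pass (4B proxy); VRAM: inference, 5B base.
LATENCY_S = {
    ("stride-1", False): 0.273,
    ("stride-4", False): 0.124,
    ("stride-1", True): 0.181,
    ("stride-4", True): 0.098,
}
VRAM_GB = {"stride-1": 12.5, "stride-4": 11.1}


def forward_rate_hz(variant, compiled):
    """Forward passes per second implied by the mean latency."""
    return 1.0 / LATENCY_S[(variant, compiled)]


def stride4_speedup(compiled):
    """How many times faster stride-4 is than stride-1."""
    return LATENCY_S[("stride-1", compiled)] / LATENCY_S[("stride-4", compiled)]


def choose_variant(vram_budget_gb, max_latency_s, compiled=True):
    """Hypothetical helper: prefer stride-1 if it fits both budgets,
    fall back to stride-4, and return None if neither fits."""
    for variant in ("stride-1", "stride-4"):
        fits_memory = VRAM_GB[variant] <= vram_budget_gb
        fits_latency = LATENCY_S[(variant, compiled)] <= max_latency_s
        if fits_memory and fits_latency:
            return variant
    return None


for compiled in (False, True):
    print(f"compiled={compiled}: stride-4 is {stride4_speedup(compiled):.2f}x faster")
# compiled=False: stride-4 is 2.20x faster
# compiled=True: stride-4 is 1.85x faster

print(choose_variant(vram_budget_gb=12.0, max_latency_s=0.2))  # stride-4
```

Compilation narrows the gap between the two variants (from about 2.2× to about 1.85×), so the latency argument for stride-4 is strongest when the model runs uncompiled.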


Quick Start

Installation

Using uv (recommended)

git clone https://github.com/greenvla/GreenVLA.git
cd GreenVLA
uv sync

Using micromamba / conda

git clone https://github.com/greenvla/GreenVLA.git
cd GreenVLA

micromamba env create -n greenvla python=3.11
micromamba activate greenvla
pip install -e .

Inference

Load a model and build the full inference pipeline in a single call:

from lerobot.common.policies.factory import load_pretrained_policy

policy, input_transforms, output_transforms = load_pretrained_policy(
    "SberRoboticsCenter/GreenVLA-5b-base-stride-1",
    data_config_name="bridge",
)

This downloads the config, weights, and normalization statistics from the Hub automatically. It also works with local checkpoint paths:

policy, input_transforms, output_transforms = load_pretrained_policy(
    "/path/to/checkpoint",
    data_config_name="bridge",
    config_overrides={"device": "cuda:0"},
)

See docs/INFERENCE.md for the full inference guide and example notebooks.
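
The three objects returned by `load_pretrained_policy` suggest a preprocess, predict, postprocess flow. The sketch below shows that pattern with stand-in objects only. It assumes, without confirmation from this README, that the transforms are callables over dictionaries and that the policy exposes a single prediction callable (in upstream LeRobot this would be `policy.select_action`). `predict_action` and all `fake_*` names are hypothetical; docs/INFERENCE.md remains the authoritative reference for the real call conventions and observation keys.

```python
def predict_action(policy_fn, input_transforms, output_transforms, observation):
    """Hypothetical single-step wrapper: preprocess, predict, postprocess."""
    model_inputs = input_transforms(observation)   # e.g. normalization
    model_outputs = policy_fn(model_inputs)        # assumption: one prediction call
    return output_transforms(model_outputs)        # e.g. un-normalization


# Stand-ins so the sketch runs without the real model or weights.
def fake_input_transforms(obs):
    # Normalize the state with made-up statistics (mean=1.0, std=2.0).
    return {"state": [(x - 1.0) / 2.0 for x in obs["state"]], "prompt": obs["prompt"]}


def fake_policy(inputs):
    # Pretend the model doubles the normalized state.
    return {"action": [2.0 * x for x in inputs["state"]]}


def fake_output_transforms(outputs):
    # Undo the made-up normalization.
    return {"action": [x * 2.0 + 1.0 for x in outputs["action"]]}


result = predict_action(
    fake_policy,
    fake_input_transforms,
    fake_output_transforms,
    {"state": [1.0, 3.0], "prompt": "pick up the spoon"},
)
print(result)  # {'action': [1.0, 5.0]}
```

Keeping the prediction step as a plain callable lets the same wrapper be exercised with stand-ins in tests and with the loaded policy at run time.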



Benchmarks

SimplerEnv — Google Robot (Fractal)

| Model | Visual Matching | Variant Aggregation | Overall |
|---|---|---|---|
| Green-VLA stride-4 R1 | 77.0% | 66.7% | 71.8% |

SimplerEnv — WidowX (Bridge)

| Model | Partial Avg | Entire Avg |
|---|---|---|
| Green-VLA stride-1 R2 | 94.5% | 80.5% |
| Green-VLA stride-1 R1 | 89.6% | 72.9% |

CALVIN

| Model | Avg Chain Length |
|---|---|
| Green-VLA stride-4 R2 | 4.57 |


Documentation

| Guide | Description |
|---|---|
| Fine-Tuning (docs/FINE_TUNING.md) | Dataset statistics, configuration, training |
| Inference (docs/INFERENCE.md) | Loading models, running inference, example notebooks |


Project Structure

GreenVLA/
├── assets/                  # Images, videos, and other media
├── docs/                    # Detailed guides
│   ├── FINE_TUNING.md       # Fine-tuning guide
│   └── INFERENCE.md         # Inference guide & examples
├── lerobot/
│   ├── conf/                # Hydra configs (policy, dataset, training)
│   ├── common/
│   │   └── policies/        # Policy implementations
│   │       └── greenvla_policy/
│   └── scripts/             # Training & inference scripts
└── examples/                # Inference examples & notebooks


Citation

If you find Green-VLA useful, please cite our paper:

@misc{apanasevich2026greenvlastagedvisionlanguageactionmodel,
    title   = {Green-VLA: Staged Vision-Language-Action Model for Generalist Robots},
    author  = {I. Apanasevich and M. Artemyev and R. Babakyan and P. Fedotova and
               D. Grankin and E. Kupryashin and A. Misailidi and D. Nerus and
               A. Nutalapati and G. Sidorov and I. Efremov and M. Gerasyov and
               D. Pikurov and Y. Senchenko and S. Davidenko and D. Kulikov and
               M. Sultankin and K. Askarbek and O. Shamanin and D. Statovoy and
               E. Zalyaev and I. Zorin and A. Letkin and E. Rusakov and
               A. Silchenko and V. Vorobyov and S. Sobolnikov and A. Postnikov},
    year    = {2026},
    eprint  = {2602.00919},
    archivePrefix = {arXiv},
    primaryClass  = {cs.RO},
    url     = {https://arxiv.org/abs/2602.00919},
}

Acknowledgements

This project draws inspiration and references from several notable open-source initiatives. The codebase was originally forked from LeRobot.


© 2026 Sber Robotics Center · Manipulation Team
