Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous flow matching (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure.
AsyncVLA addresses these limitations by introducing a novel framework that brings temporal flexibility through asynchronous flow matching (AFM) and enables self-correction in action generation. Unlike vanilla SFM in traditional VLA models, AsyncVLA generates action tokens in a non-uniform time schedule with action context awareness, significantly improving stability and performance in complex robotic tasks.
- 🔄 Asynchronous Flow Matching (AFM): Non-uniform time scheduling for action token generation with action context awareness
- 🎯 Self-Correction Mechanism: Confidence-based selective refinement of inaccurate action tokens before execution
- 📊 Confidence Rating: Built-in confidence estimation that identifies and corrects uncertain action predictions
- 🔀 Unified Training: Single model supporting both SFM and AFM modes with improved KV-cache utilization
- 📈 Data Efficiency: Superior performance with reduced training-data requirements
- Temporal Flexibility: Breaks from rigid uniform time schedules of traditional synchronous flow matching
- Action Context Awareness: Incorporates contextual information for better action generation
- Error Prevention: Self-correction capabilities reduce failure rates
- Dual-Mode Architecture: Seamlessly switches between synchronous and asynchronous modes
- State-of-the-Art Performance: Achieves leading results across general embodied evaluations
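To make the asynchronous schedule concrete, here is a minimal, self-contained sketch (plain illustrative Python, not AsyncVLA code): each action token gets its own flow-matching time `t` instead of one shared `t` for the whole chunk, and the linear interpolation path `x_t = (1 - t) * noise + t * action` is applied per token. All names are illustrative.

```python
import random

def sample_async_timesteps(num_tokens, rng=random):
    """Draw an independent flow-matching time t in [0, 1] for each action
    token (asynchronous), instead of one shared t per chunk (synchronous)."""
    return [rng.random() for _ in range(num_tokens)]

def noisy_action_tokens(actions, noise, times):
    """Linear interpolation path used in flow matching,
    x_t = (1 - t) * noise + t * action, applied per token with its own t."""
    return [(1.0 - t) * n + t * a for a, n, t in zip(actions, noise, times)]

rng = random.Random(0)
actions = [0.5, -0.2, 0.8, 0.1]                 # a toy 4-token action chunk
noise = [rng.gauss(0.0, 1.0) for _ in actions]  # per-token Gaussian noise
times = sample_async_timesteps(len(actions), rng)

x_t = noisy_action_tokens(actions, noise, times)
# tokens with t near 1 are nearly clean actions; t near 0 are mostly noise
```

Under this view, SFM is just the special case where `times` is the same value for every token.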
```
AsyncVLA/
├── models/                      # Core model implementations
│   ├── model/                   # Model architectures
│   │   ├── configuration_asyncvla.py
│   │   ├── modeling_AFM_SFM.py
│   │   ├── modeling_asyncvla.py
│   │   └── processing_asyncvla.py
│   ├── data/                    # Data processing utilities
│   └── train/                   # Training infrastructure
├── experiments/                 # Experiment configurations
│   ├── LIBERO/                  # LIBERO dataset experiments
│   └── Simpler_env/             # Bridge/Fractal experiments
├── scripts/                     # Utility scripts
│   ├── train_AFM_SFM.py         # Main training script
│   ├── train_confidence_rater.py
│   └── eval_policy.py           # Evaluation script
└── requirements.txt             # Dependencies
```
- Python 3.10+
- CUDA 12.4+ (for GPU acceleration)
- Conda or Miniconda
- 8+ GPUs recommended for training
- Clone the repository

- Create and activate the conda environment

  ```bash
  conda create -n AsyncVLA python=3.10
  conda activate AsyncVLA
  ```

- Install dependencies

  ```bash
  cd AsyncVLA
  pip install -r requirements.txt
  ```

- Install Flash Attention (optional but recommended)

  ```bash
  pip install flash-attn==2.8.3 --no-build-isolation
  ```
- `torch>=2.7.0` - PyTorch framework
- `transformers>=4.56.0` - Hugging Face Transformers
- `accelerate>=1.10.1` - Distributed training support
- `lerobot>=0.3.3` - Robotics dataset integration
- `flash-attn>=2.8.3` - Efficient attention implementation
- `wandb>=0.21.3` - Experiment tracking (optional)
Train the complete AsyncVLA model with the unified training procedure, which supports both asynchronous and synchronous flow matching modes:
Bridge Dataset:

```bash
bash experiments/Simpler_env/train_bridge_AFM_SFM.sh
```

LIBERO Dataset:

```bash
bash experiments/LIBERO/train_libero_AFM_SFM.sh
```

Fractal Dataset:

```bash
bash experiments/Simpler_env/train_fractal_AFM_SFM.sh
```

Train the confidence rating component that enables selective refinement of uncertain action tokens:
```bash
# Bridge dataset
bash experiments/Simpler_env/train_bridge_confidence_rater.sh

# LIBERO dataset
bash experiments/LIBERO/train_libero_confidence_rater.sh
```

Key training parameters can be adjusted in the training scripts:

- `chunk_size`: Action sequence length (default: 4)
- `learning_rate`: Base learning rate (default: 1e-4)
- `vision_lr`: Vision tower learning rate (default: 2e-5)
- `merger_lr`: Merger module learning rate (default: 1e-4)
- `per_device_batch_size`: Batch size per GPU (default: 128)
- `num_train_epochs`: Training epochs (default: 20)
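The per-GPU batch size above implies a much larger effective batch. A quick sanity check, assuming 8 GPUs and no gradient accumulation (both are assumptions, not values read from the training scripts):

```python
# Effective global batch size under the listed default of 128 per device.
# num_gpus and grad_accum_steps are assumptions for illustration.
per_device_batch_size = 128
num_gpus = 8
grad_accum_steps = 1

global_batch_size = per_device_batch_size * num_gpus * grad_accum_steps
print(global_batch_size)  # 1024
```

If you reduce `per_device_batch_size` to fit smaller GPUs, scaling `grad_accum_steps` up keeps the effective batch, and thus the learning-rate schedule's behavior, roughly unchanged.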
Evaluate trained models on various robotic tasks:
```bash
# Bridge environment evaluation
cd experiments/Simpler_env/eval
bash eval_bridge.sh

# LIBERO environment evaluation
cd experiments/LIBERO/eval
bash eval_libero.sh

# Fractal environment evaluation
cd experiments/Simpler_env/eval
bash eval_fractal.sh
```

- LIBERO Datasets: Configure in `experiments/LIBERO/data-libero-*.yaml`
  - `data-libero-all.yaml`: Complete LIBERO dataset
  - `data-libero-goal.yaml`: Goal-conditioned tasks
  - `data-libero-object.yaml`: Object manipulation tasks
  - `data-libero-spatial.yaml`: Spatial reasoning tasks
  - `data-libero-long.yaml`: Long-horizon tasks
- Bridge Dataset: Configure in `experiments/Simpler_env/data-bridge.yaml`
- Fractal Dataset: Configure in `experiments/Simpler_env/data-fractal.yaml`
All datasets follow the LeRobot format with the following structure:
```yaml
lerobot_datasets:
  - repo_id: dataset_name
    root: /path/to/dataset
    select_video_keys: [observation.images.image_0]
    select_state_keys: [observation.state]
    select_action_keys: [action]
```

- Non-uniform Time Scheduling: Generates action tokens with flexible temporal arrangements
- Action Context Awareness: Incorporates contextual information for improved action generation
- Self-Correction Capability: Enables refinement of uncertain action predictions during generation
- Traditional Approach: Uniform time scheduling for baseline comparison
- Unified Framework: Seamlessly integrated with AFM in a single model
- KV-Cache Optimization: Improved memory utilization through dual-mode architecture
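A minimal sketch of how confidence-based selective refinement can work in principle (illustrative Python, not the repository's implementation): tokens whose confidence falls below a threshold are regenerated before execution, while confident tokens are kept as-is.

```python
def select_tokens_to_refine(confidences, threshold=0.5):
    """Indices of action tokens whose confidence falls below the threshold;
    only these are sent back for another round of refinement."""
    return [i for i, c in enumerate(confidences) if c < threshold]

def refine(actions, confidences, regenerate, threshold=0.5):
    """Keep confident tokens, regenerate uncertain ones before execution.
    `regenerate` stands in for another pass of the action generator."""
    out = list(actions)
    for i in select_tokens_to_refine(confidences, threshold):
        out[i] = regenerate(i)
    return out

# Toy usage: token 1 is uncertain and gets replaced.
actions = [0.5, -0.9, 0.3]
confidences = [0.9, 0.2, 0.8]
refined = refine(actions, confidences, regenerate=lambda i: 0.0)
print(refined)  # [0.5, 0.0, 0.3]
```

The threshold value and the form of `regenerate` are placeholders; in AsyncVLA the confidence rater and the AFM generator play these roles.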
- Vision: Processes RGB images and video sequences
- Language: Natural language instruction understanding
- Action: Continuous action space prediction with uncertainty
- LIBERO: Robot manipulation tasks
  - Long-horizon tasks, goal-conditioned, object manipulation, spatial reasoning
  - Multi-camera setup (front + wrist cameras)
  - 7-DOF action space
- Bridge: Real-world robotic manipulation
  - Single camera observation
  - Continuous action space
  - Real-world data distribution
- Fractal: Procedurally generated environments
  - Diverse visual patterns
  - Generalization testing
  - Synthetic data augmentation
- Multi-GPU Training: Distributed training across 8+ GPUs
- Mixed Precision: BF16 training for memory efficiency
- Gradient Checkpointing: Memory optimization for large models
- Flash Attention: Efficient attention computation
- Success Rate: Task completion percentage
- Action Accuracy: Precision of predicted actions
- Confidence Calibration: Reliability of uncertainty estimates
- Temporal Consistency: Smoothness of action sequences
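The first and last of these metrics are straightforward to compute; a small illustrative sketch (the exact definitions used in the AsyncVLA evaluations may differ):

```python
def success_rate(outcomes):
    """Fraction of episodes that completed the task (outcomes are 0/1)."""
    return sum(outcomes) / len(outcomes)

def temporal_consistency(actions):
    """Mean absolute difference between consecutive actions; lower is
    smoother. A simple proxy for the temporal-consistency metric."""
    diffs = [abs(b - a) for a, b in zip(actions, actions[1:])]
    return sum(diffs) / len(diffs)

print(success_rate([1, 1, 0, 1]))  # 0.75
print(temporal_consistency([0.0, 0.1, 0.1, 0.4]))  # the jump at the end raises it
```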
- Create a dataset configuration file:

  ```yaml
  lerobot_datasets:
    - repo_id: your_dataset_name
      root: /path/to/your/dataset
      select_video_keys: [your_video_keys]
      select_state_keys: [your_state_keys]
      select_action_keys: [your_action_keys]
  ```

- Update the training script with the new dataset path
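Before launching training, it can be handy to check that each dataset entry carries the keys the LeRobot-format example uses. This helper is a convenience sketch, not part of the AsyncVLA codebase:

```python
# Minimal check that a dataset entry has the keys the LeRobot-format
# config expects. The field names follow the example above.
REQUIRED_KEYS = {
    "repo_id", "root",
    "select_video_keys", "select_state_keys", "select_action_keys",
}

def validate_dataset_entry(entry):
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"dataset entry missing keys: {sorted(missing)}")
    return True

entry = {
    "repo_id": "your_dataset_name",
    "root": "/path/to/your/dataset",
    "select_video_keys": ["observation.images.image_0"],
    "select_state_keys": ["observation.state"],
    "select_action_keys": ["action"],
}
assert validate_dataset_entry(entry)
```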
Modify `models/model/configuration_asyncvla.py` for custom configurations:

- `action_chunk_size`: Sequence length for action prediction
- `max_action_dim`: Maximum action dimensionality
- `num_denoise_steps`: Flow matching denoising iterations
- `num_action_layers`: Action projection network depth
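As a reading aid, these fields can be pictured as a small dataclass. The defaults other than `action_chunk_size` (listed as 4 among the training parameters) are placeholders, and the real definitions in `configuration_asyncvla.py` may differ:

```python
from dataclasses import dataclass

# Hypothetical mirror of the configuration fields above, for illustration only.
@dataclass
class AsyncVLAConfigSketch:
    action_chunk_size: int = 4    # action tokens predicted per step
    max_action_dim: int = 7       # assumption: 7-DOF action space
    num_denoise_steps: int = 10   # assumption: placeholder default
    num_action_layers: int = 2    # assumption: placeholder default

cfg = AsyncVLAConfigSketch(num_denoise_steps=20)
print(cfg.num_denoise_steps)  # 20
```

Raising `num_denoise_steps` trades inference speed for flow-matching precision, which is the same trade-off named in the hyperparameter notes below.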
Key hyperparameters to adjust:
- Learning Rates: Separate rates for vision, language, and action components
- Batch Size: Balance between memory usage and training stability
- Chunk Size: Trade-off between temporal modeling and computational cost
- Denoising Steps: Flow matching precision vs. inference speed
AsyncVLA achieves state-of-the-art results across general embodied evaluations through its innovative asynchronous generation approach:
- Data Efficiency: Superior performance with reduced training data requirements
- High Success Rates: Consistently high success rates across the LIBERO, Bridge, and Fractal benchmarks
- Self-Correction: Reduced failure rates through confidence-based action refinement
- Training: 8x A100/H100 GPUs (recommended)
- Inference: Single GPU (RTX 3090 or better)
- Memory: 80GB+ GPU memory for full model training
- Storage: 500GB+ for dataset storage
- Built upon the excellent Qwen2.5-VL architecture
- Utilizes LeRobot for dataset management
- Inspired by advances in open source vision-language-action models