YuhuaJiang2002/AsyncVLA

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

📖 Overview

Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous flow matching (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure.

AsyncVLA addresses these limitations by introducing a novel framework that brings temporal flexibility through asynchronous flow matching (AFM) and enables self-correction in action generation. Unlike vanilla SFM in traditional VLA models, AsyncVLA generates action tokens on a non-uniform time schedule with action context awareness, significantly improving stability and performance in complex robotic tasks.

Key Innovations

  • 🔄 Asynchronous Flow Matching (AFM): Non-uniform time scheduling for action token generation with action context awareness
  • 🎯 Self-Correction Mechanism: Confidence-based selective refinement of inaccurate action tokens before execution
  • 📊 Confidence Rating: Built-in confidence estimation that identifies and corrects uncertain action predictions
  • 🔀 Unified Training: Single model supporting both SFM and AFM modes with improved KV-cache utilization
  • 📈 Data Efficiency: Achieves superior performance with efficient data utilization
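
The scheduling difference behind the first bullet can be illustrated with a small sketch. This is a toy in plain Python, not the repository's actual scheduler; the confidence values and the "more steps for uncertain tokens" rule are assumptions made for illustration:

```python
# Sketch: synchronous vs. asynchronous flow-matching time schedules.
# Illustrative only -- not AsyncVLA's real scheduler.

def sfm_schedule(num_tokens: int, num_steps: int) -> list[list[float]]:
    """Synchronous FM: every action token shares one uniform time grid."""
    grid = [s / num_steps for s in range(num_steps + 1)]
    return [grid for _ in range(num_tokens)]

def afm_schedule(confidences: list[float], num_steps: int) -> list[list[float]]:
    """Asynchronous FM (toy rule): low-confidence tokens get denser grids,
    i.e. more denoising steps, while confident tokens finish early."""
    schedules = []
    for c in confidences:
        # Hypothetical rule: extra steps proportional to uncertainty.
        steps = num_steps + round((1.0 - c) * num_steps)
        schedules.append([s / steps for s in range(steps + 1)])
    return schedules

if __name__ == "__main__":
    print(sfm_schedule(2, 4)[0])  # [0.0, 0.25, 0.5, 0.75, 1.0]
    print([len(g) for g in afm_schedule([1.0, 0.5], 4)])  # [5, 7]
```

The point of the sketch: under SFM every token marches through the same grid, so an early error propagates unchecked; under AFM the per-token grids differ, leaving room to spend extra denoising effort on uncertain tokens.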

Technical Advantages

  • Temporal Flexibility: Breaks from rigid uniform time schedules of traditional synchronous flow matching
  • Action Context Awareness: Incorporates contextual information for better action generation
  • Error Prevention: Self-correction capabilities reduce failure rates
  • Dual-Mode Architecture: Seamlessly switches between synchronous and asynchronous modes
  • State-of-the-Art Performance: Achieves leading results across general embodied evaluations

🛠️ Development

Project Structure

AsyncVLA/
├── models/                    # Core model implementations
│   ├── model/                # Model architectures
│   │   ├── configuration_asyncvla.py
│   │   ├── modeling_AFM_SFM.py
│   │   ├── modeling_asyncvla.py
│   │   └── processing_asyncvla.py
│   ├── data/                 # Data processing utilities
│   └── train/                # Training infrastructure
├── experiments/              # Experiment configurations
│   ├── LIBERO/              # LIBERO dataset experiments
│   └── Simpler_env/         # Bridge/Fractal experiments
├── scripts/                 # Utility scripts
│   ├── train_AFM_SFM.py    # Main training script
│   ├── train_confidence_rater.py
│   └── eval_policy.py      # Evaluation script
└── requirements.txt         # Dependencies

🚀 Installation

Prerequisites

  • Python 3.10+
  • CUDA 12.4+ (for GPU acceleration)
  • Conda or Miniconda
  • 8+ GPUs recommended for training

Environment Setup

  1. Clone the repository

  2. Create and activate conda environment

    conda create -n AsyncVLA python=3.10
    conda activate AsyncVLA
  3. Install dependencies

    cd AsyncVLA
    pip install -r requirements.txt
  4. Install Flash Attention (optional but recommended)

    pip install flash-attn==2.8.3 --no-build-isolation

Key Dependencies

  • torch>=2.7.0 - PyTorch framework
  • transformers>=4.56.0 - Hugging Face transformers
  • accelerate>=1.10.1 - Distributed training support
  • lerobot>=0.3.3 - Robotics dataset integration
  • flash-attn>=2.8.3 - Efficient attention implementation
  • wandb>=0.21.3 - Experiment tracking (optional)

🎯 Usage

Training

1. Unified AFM+SFM Training

Train the complete AsyncVLA model with a unified training procedure that supports both asynchronous and synchronous flow matching modes:

Bridge Dataset:

bash experiments/Simpler_env/train_bridge_AFM_SFM.sh

LIBERO Dataset:

bash experiments/LIBERO/train_libero_AFM_SFM.sh

Fractal Dataset:

bash experiments/Simpler_env/train_fractal_AFM_SFM.sh

2. Confidence Rater Training

Train the confidence rating component that enables selective refinement of uncertain action tokens:

# Bridge dataset
bash experiments/Simpler_env/train_bridge_confidence_rater.sh

# LIBERO dataset
bash experiments/LIBERO/train_libero_confidence_rater.sh

Training Configuration

Key training parameters can be adjusted in the training scripts:

  • chunk_size: Action sequence length (default: 4)
  • learning_rate: Base learning rate (default: 1e-4)
  • vision_lr: Vision tower learning rate (default: 2e-5)
  • merger_lr: Merger module learning rate (default: 1e-4)
  • per_device_batch_size: Batch size per GPU (default: 128)
  • num_train_epochs: Training epochs (default: 20)
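
The separate learning rates above imply per-module optimizer parameter groups. A minimal sketch of how such groups might be assembled; the module-name prefixes (`vision_tower.`, `merger.`) are assumptions based on the parameter names, not verified against the training scripts:

```python
# Sketch: mapping the script's learning-rate knobs onto optimizer
# parameter groups. Module-name prefixes are hypothetical.

DEFAULTS = {
    "learning_rate": 1e-4,  # base LR (language/action components)
    "vision_lr": 2e-5,      # vision tower
    "merger_lr": 1e-4,      # merger module
}

def build_param_groups(named_params, cfg=DEFAULTS):
    """Route each named parameter to the LR for its module prefix."""
    groups = {"vision": [], "merger": [], "base": []}
    for name, p in named_params:
        if name.startswith("vision_tower."):
            groups["vision"].append(p)
        elif name.startswith("merger."):
            groups["merger"].append(p)
        else:
            groups["base"].append(p)
    # Same shape as torch.optim per-parameter option dicts.
    return [
        {"params": groups["vision"], "lr": cfg["vision_lr"]},
        {"params": groups["merger"], "lr": cfg["merger_lr"]},
        {"params": groups["base"], "lr": cfg["learning_rate"]},
    ]

params = [("vision_tower.layer0.weight", "w0"),
          ("merger.proj.weight", "w1"),
          ("action_head.weight", "w2")]
print([g["lr"] for g in build_param_groups(params)])  # [2e-05, 0.0001, 0.0001]
```

The returned list matches the per-parameter-group format that `torch.optim` optimizers accept, so it could be passed directly to e.g. `AdamW` in a real setup.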

Evaluation

Policy Evaluation

Evaluate trained models on various robotic tasks:

# Bridge environment evaluation
cd experiments/Simpler_env/eval
bash eval_bridge.sh

# LIBERO environment evaluation  
cd experiments/LIBERO/eval
bash eval_libero.sh

# Fractal environment evaluation
cd experiments/Simpler_env/eval
bash eval_fractal.sh

Data Configuration

Dataset Setup

  1. LIBERO Datasets: Configure in experiments/LIBERO/data-libero-*.yaml

    • data-libero-all.yaml: Complete LIBERO dataset
    • data-libero-goal.yaml: Goal-conditioned tasks
    • data-libero-object.yaml: Object manipulation tasks
    • data-libero-spatial.yaml: Spatial reasoning tasks
    • data-libero-long.yaml: Long-horizon tasks
  2. Bridge Dataset: Configure in experiments/Simpler_env/data-bridge.yaml

  3. Fractal Dataset: Configure in experiments/Simpler_env/data-fractal.yaml

Dataset Format

All datasets follow the LeRobot format with the following structure:

lerobot_datasets:
  - repo_id: dataset_name
    root: /path/to/dataset
    select_video_keys: [observation.images.image_0]
    select_state_keys: [observation.state]
    select_action_keys: [action]
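
After parsing (e.g. with PyYAML), each entry is a plain mapping. A small sketch of sanity-checking one entry; the dict below mirrors the YAML example above, while the required-key list and validation rules are assumptions, not the repository's actual loader:

```python
# Sketch: sanity-checking a parsed lerobot_datasets entry.
# Key names come from the YAML example above; the validation
# logic itself is an illustrative assumption.

REQUIRED_KEYS = {"repo_id", "root", "select_video_keys",
                 "select_state_keys", "select_action_keys"}

def validate_dataset_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks usable."""
    problems = [f"missing key: {k}"
                for k in sorted(REQUIRED_KEYS - entry.keys())]
    for k in ("select_video_keys", "select_state_keys", "select_action_keys"):
        if k in entry and not isinstance(entry[k], list):
            problems.append(f"{k} must be a list of feature names")
    return problems

entry = {
    "repo_id": "dataset_name",
    "root": "/path/to/dataset",
    "select_video_keys": ["observation.images.image_0"],
    "select_state_keys": ["observation.state"],
    "select_action_keys": ["action"],
}
print(validate_dataset_entry(entry))  # [] -> entry is well-formed
```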

🏗️ Architecture

Key Features

Asynchronous Flow Matching (AFM)

  • Non-uniform Time Scheduling: Generates action tokens with flexible temporal arrangements
  • Action Context Awareness: Incorporates contextual information for improved action generation
  • Self-Correction Capability: Enables refinement of uncertain action predictions during generation
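
The selective-refinement idea can be sketched as follows. The threshold, the source of the confidence scores, and the refine step are all illustrative assumptions, not the repository's implementation:

```python
# Sketch: confidence-based selective refinement of an action chunk.
# Tokens scored below the threshold are re-denoised; confident tokens
# are kept as-is. Threshold and refine rule are hypothetical.

def refine_low_confidence(actions, confidences, threshold=0.8, refine=None):
    refine = refine or (lambda a: a)  # placeholder for extra AFM steps
    refined_idx = [i for i, c in enumerate(confidences) if c < threshold]
    out = list(actions)
    for i in refined_idx:
        out[i] = refine(out[i])  # re-run denoising on this token only
    return out, refined_idx

actions = [0.1, 0.4, -0.2, 0.3]
conf    = [0.95, 0.60, 0.90, 0.70]
_, idx = refine_low_confidence(actions, conf)
print(idx)  # [1, 3] -> only the uncertain tokens are refined
```

Because refinement touches only the flagged indices, confident tokens keep their already-computed representations, which is consistent with the KV-cache benefit described for the unified architecture.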

Synchronous Flow Matching (SFM)

  • Traditional Approach: Uniform time scheduling for baseline comparison
  • Unified Framework: Seamlessly integrated with AFM in a single model
  • KV-Cache Optimization: Improved memory utilization through dual-mode architecture

Multi-Modal Integration

  • Vision: Processes RGB images and video sequences
  • Language: Natural language instruction understanding
  • Action: Continuous action space prediction with uncertainty

📊 Experiments

Supported Environments

  1. LIBERO: Robot manipulation tasks

    • Long-horizon tasks, goal-conditioned, object manipulation, spatial reasoning
    • Multi-camera setup (front + wrist cameras)
    • 7-DOF action space
  2. Bridge: Real-world robotic manipulation

    • Single camera observation
    • Continuous action space
    • Real-world data distribution
  3. Fractal: Procedurally generated environments

    • Diverse visual patterns
    • Generalization testing
    • Synthetic data augmentation

Training Strategies

  • Multi-GPU Training: Distributed training across 8+ GPUs
  • Mixed Precision: BF16 training for memory efficiency
  • Gradient Checkpointing: Memory optimization for large models
  • Flash Attention: Efficient attention computation

Evaluation Metrics

  • Success Rate: Task completion percentage
  • Action Accuracy: Precision of predicted actions
  • Confidence Calibration: Reliability of uncertainty estimates
  • Temporal Consistency: Smoothness of action sequences

🔧 Advanced Usage

Custom Dataset Integration

  1. Create dataset configuration file:
lerobot_datasets:
  - repo_id: your_dataset_name
    root: /path/to/your/dataset
    select_video_keys: [your_video_keys]
    select_state_keys: [your_state_keys]
    select_action_keys: [your_action_keys]
  2. Update the training script with the new dataset path

Model Customization

Modify models/model/configuration_asyncvla.py for custom configurations:

  • action_chunk_size: Sequence length for action prediction
  • max_action_dim: Maximum action dimensionality
  • num_denoise_steps: Flow matching denoising iterations
  • num_action_layers: Action projection network depth
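
A sketch of how such a configuration object might look. The field names come from the list above, but the defaults are illustrative guesses, not the actual values in configuration_asyncvla.py:

```python
# Sketch: a configuration object with the fields described above.
# Defaults are illustrative, NOT taken from configuration_asyncvla.py.
from dataclasses import dataclass

@dataclass
class AsyncVLAConfigSketch:
    action_chunk_size: int = 4   # action tokens predicted per chunk
    max_action_dim: int = 7      # upper bound on action dimensionality
    num_denoise_steps: int = 10  # flow-matching denoising iterations
    num_action_layers: int = 2   # depth of the action projection network

cfg = AsyncVLAConfigSketch(num_denoise_steps=5)
print(cfg.num_denoise_steps)  # 5
```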

Hyperparameter Tuning

Key hyperparameters to adjust:

  • Learning Rates: Separate rates for vision, language, and action components
  • Batch Size: Balance between memory usage and training stability
  • Chunk Size: Trade-off between temporal modeling and computational cost
  • Denoising Steps: Flow matching precision vs. inference speed

📈 Performance

Benchmark Results

AsyncVLA achieves state-of-the-art results across general embodied evaluations through its innovative asynchronous generation approach:

  • Data Efficiency: Superior performance with reduced training data requirements
  • High Success Rates: Enhanced success rates across LIBERO, Bridge, and Fractal benchmarks
  • Self-Correction: Reduced failure rates through confidence-based action refinement

Computational Requirements

  • Training: 8x A100/H100 GPUs (recommended)
  • Inference: Single GPU (RTX 3090 or better)
  • Memory: 80GB+ GPU memory for full model training
  • Storage: 500GB+ for dataset storage

🙏 Acknowledgments

  • Built upon the excellent Qwen2.5-VL architecture
  • Utilizes LeRobot for dataset management
  • Inspired by advances in open source vision-language-action models
