Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous flow matching (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure.
AsyncVLA addresses these limitations by introducing a novel framework that brings temporal flexibility through asynchronous flow matching (AFM) and enables self-correction in action generation. Unlike vanilla SFM in traditional VLA models, AsyncVLA generates action tokens in a non-uniform time schedule with action context awareness, significantly improving stability and performance in complex robotic tasks.
- 🔄 Asynchronous Flow Matching (AFM): Non-uniform time scheduling for action token generation with action context awareness
- 🎯 Self-Correction Mechanism: Confidence-based selective refinement of inaccurate action tokens before execution
- 📊 Confidence Rating: Built-in confidence estimation that identifies and corrects uncertain action predictions
- 🔀 Unified Training: Single model supporting both SFM and AFM modes with improved KV-cache utilization
- 📈 Data Efficiency: Superior performance with reduced training-data requirements
- Temporal Flexibility: Breaks from rigid uniform time schedules of traditional synchronous flow matching
- Action Context Awareness: Incorporates contextual information for better action generation
- Error Prevention: Self-correction capabilities reduce failure rates
- Dual-Mode Architecture: Seamlessly switches between synchronous and asynchronous modes
- State-of-the-Art Performance: Achieves leading results across general embodied evaluations
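To make the asynchronous schedule concrete, here is a minimal, self-contained sketch (plain illustrative Python, not AsyncVLA code): each action token gets its own flow-matching time `t` instead of one shared `t` for the whole chunk, and the linear interpolation path `x_t = (1 - t) * noise + t * action` is applied per token. All names are illustrative.

```python
import random

def sample_async_timesteps(num_tokens, rng=random):
    """Draw an independent flow-matching time t in [0, 1] for each action
    token (asynchronous), instead of one shared t per chunk (synchronous)."""
    return [rng.random() for _ in range(num_tokens)]

def noisy_action_tokens(actions, noise, times):
    """Linear interpolation path used in flow matching,
    x_t = (1 - t) * noise + t * action, applied per token with its own t."""
    return [(1.0 - t) * n + t * a for a, n, t in zip(actions, noise, times)]

rng = random.Random(0)
actions = [0.5, -0.2, 0.8, 0.1]                 # a toy 4-token action chunk
noise = [rng.gauss(0.0, 1.0) for _ in actions]  # per-token Gaussian noise
times = sample_async_timesteps(len(actions), rng)

x_t = noisy_action_tokens(actions, noise, times)
# tokens with t near 1 are nearly clean actions; t near 0 are mostly noise
```

Under this view, SFM is just the special case where `times` is the same value for every token.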
```
AsyncVLA/
├── models/                      # Core model implementations
│   ├── model/                   # Model architectures
│   │   ├── configuration_asyncvla.py
│   │   ├── modeling_AFM_SFM.py
│   │   ├── modeling_asyncvla.py
│   │   └── processing_asyncvla.py
│   ├── data/                    # Data processing utilities
│   └── train/                   # Training infrastructure
├── experiments/                 # Experiment configurations
│   ├── LIBERO/                  # LIBERO dataset experiments
│   └── Simpler_env/             # Bridge/Fractal experiments
├── scripts/                     # Utility scripts
│   ├── train_AFM_SFM.py         # Main training script
│   ├── train_confidence_rater.py
│   └── eval_policy.py           # Evaluation script
└── requirements.txt             # Dependencies
```
- Python 3.10+
- CUDA 12.4+ (for GPU acceleration)
- Conda or Miniconda
- 8+ GPUs recommended for training
- Clone the repository

- Create and activate the conda environment

  ```bash
  conda create -n AsyncVLA python=3.10
  conda activate AsyncVLA
  ```

- Install dependencies

  ```bash
  cd AsyncVLA
  pip install -r requirements.txt
  ```

- Install Flash Attention (optional but recommended)

  ```bash
  pip install flash-attn==2.8.3 --no-build-isolation
  ```
- `torch>=2.7.0` - PyTorch framework
- `transformers>=4.56.0` - Hugging Face Transformers
- `accelerate>=1.10.1` - Distributed training support
- `lerobot>=0.3.3` - Robotics dataset integration
- `flash-attn>=2.8.3` - Efficient attention implementation
- `wandb>=0.21.3` - Experiment tracking (optional)
Train the complete AsyncVLA model with the unified training procedure, which supports both asynchronous and synchronous flow matching modes:
Bridge Dataset:

```bash
bash experiments/Simpler_env/train_bridge_AFM_SFM.sh
```

LIBERO Dataset:

```bash
bash experiments/LIBERO/train_libero_AFM_SFM.sh
```

Fractal Dataset:

```bash
bash experiments/Simpler_env/train_fractal_AFM_SFM.sh
```

Train the confidence rating component that enables selective refinement of uncertain action tokens:
```bash
# Bridge dataset
bash experiments/Simpler_env/train_bridge_confidence_rater.sh

# LIBERO dataset
bash experiments/LIBERO/train_libero_confidence_rater.sh
```

Key training parameters can be adjusted in the training scripts:

- `chunk_size`: Action sequence length (default: 4)
- `learning_rate`: Base learning rate (default: 1e-4)
- `vision_lr`: Vision tower learning rate (default: 2e-5)
- `merger_lr`: Merger module learning rate (default: 1e-4)
- `per_device_batch_size`: Batch size per GPU (default: 128)
- `num_train_epochs`: Training epochs (default: 20)
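The per-GPU batch size above implies a much larger effective batch. A quick sanity check, assuming 8 GPUs and no gradient accumulation (both are assumptions, not values read from the training scripts):

```python
# Effective global batch size under the listed default of 128 per device.
# num_gpus and grad_accum_steps are assumptions for illustration.
per_device_batch_size = 128
num_gpus = 8
grad_accum_steps = 1

global_batch_size = per_device_batch_size * num_gpus * grad_accum_steps
print(global_batch_size)  # 1024
```

If you reduce `per_device_batch_size` to fit smaller GPUs, scaling `grad_accum_steps` up keeps the effective batch, and thus the learning-rate schedule's behavior, roughly unchanged.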
Evaluate trained models on various robotic tasks:
```bash
# Bridge environment evaluation
cd experiments/Simpler_env/eval
bash eval_bridge.sh

# LIBERO environment evaluation
cd experiments/LIBERO/eval
bash eval_libero.sh

# Fractal environment evaluation
cd experiments/Simpler_env/eval
bash eval_fractal.sh
```

- LIBERO Datasets: Configure in `experiments/LIBERO/data-libero-*.yaml`
  - `data-libero-all.yaml`: Complete LIBERO dataset
  - `data-libero-goal.yaml`: Goal-conditioned tasks
  - `data-libero-object.yaml`: Object manipulation tasks
  - `data-libero-spatial.yaml`: Spatial reasoning tasks
  - `data-libero-long.yaml`: Long-horizon tasks
- Bridge Dataset: Configure in `experiments/Simpler_env/data-bridge.yaml`
- Fractal Dataset: Configure in `experiments/Simpler_env/data-fractal.yaml`
All datasets follow the LeRobot format with the following structure:
```yaml
lerobot_datasets:
  - repo_id: dataset_name
    root: /path/to/dataset
    select_video_keys: [observation.images.image_0]
    select_state_keys: [observation.state]
    select_action_keys: [action]
```

- Non-uniform Time Scheduling: Generates action tokens with flexible temporal arrangements
- Action Context Awareness: Incorporates contextual information for improved action generation
- Self-Correction Capability: Enables refinement of uncertain action predictions during generation
- Traditional Approach: Uniform time scheduling for baseline comparison
- Unified Framework: Seamlessly integrated with AFM in a single model
- KV-Cache Optimization: Improved memory utilization through dual-mode architecture
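A minimal sketch of how confidence-based selective refinement can work in principle (illustrative Python, not the repository's implementation): tokens whose confidence falls below a threshold are regenerated before execution, while confident tokens are kept as-is.

```python
def select_tokens_to_refine(confidences, threshold=0.5):
    """Indices of action tokens whose confidence falls below the threshold;
    only these are sent back for another round of refinement."""
    return [i for i, c in enumerate(confidences) if c < threshold]

def refine(actions, confidences, regenerate, threshold=0.5):
    """Keep confident tokens, regenerate uncertain ones before execution.
    `regenerate` stands in for another pass of the action generator."""
    out = list(actions)
    for i in select_tokens_to_refine(confidences, threshold):
        out[i] = regenerate(i)
    return out

# Toy usage: token 1 is uncertain and gets replaced.
actions = [0.5, -0.9, 0.3]
confidences = [0.9, 0.2, 0.8]
refined = refine(actions, confidences, regenerate=lambda i: 0.0)
print(refined)  # [0.5, 0.0, 0.3]
```

The threshold value and the form of `regenerate` are placeholders; in AsyncVLA the confidence rater and the AFM generator play these roles.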
- Vision: Processes RGB images and video sequences
- Language: Natural language instruction understanding
- Action: Continuous action space prediction with uncertainty
- LIBERO: Robot manipulation tasks
  - Long-horizon tasks, goal-conditioned, object manipulation, spatial reasoning
  - Multi-camera setup (front + wrist cameras)
  - 7-DOF action space
- Bridge: Real-world robotic manipulation
  - Single camera observation
  - Continuous action space
  - Real-world data distribution
- Fractal: Procedurally generated environments
  - Diverse visual patterns
  - Generalization testing
  - Synthetic data augmentation
- Multi-GPU Training: Distributed training across 8+ GPUs
- Mixed Precision: BF16 training for memory efficiency
- Gradient Checkpointing: Memory optimization for large models
- Flash Attention: Efficient attention computation
- Success Rate: Task completion percentage
- Action Accuracy: Precision of predicted actions
- Confidence Calibration: Reliability of uncertainty estimates
- Temporal Consistency: Smoothness of action sequences
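The first and last of these metrics are straightforward to compute; a small illustrative sketch (the exact definitions used in the AsyncVLA evaluations may differ):

```python
def success_rate(outcomes):
    """Fraction of episodes that completed the task (outcomes are 0/1)."""
    return sum(outcomes) / len(outcomes)

def temporal_consistency(actions):
    """Mean absolute difference between consecutive actions; lower is
    smoother. A simple proxy for the temporal-consistency metric."""
    diffs = [abs(b - a) for a, b in zip(actions, actions[1:])]
    return sum(diffs) / len(diffs)

print(success_rate([1, 1, 0, 1]))  # 0.75
print(temporal_consistency([0.0, 0.1, 0.1, 0.4]))  # the jump at the end raises it
```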
- Create a dataset configuration file:

  ```yaml
  lerobot_datasets:
    - repo_id: your_dataset_name
      root: /path/to/your/dataset
      select_video_keys: [your_video_keys]
      select_state_keys: [your_state_keys]
      select_action_keys: [your_action_keys]
  ```

- Update the training script with the new dataset path
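Before launching training, it can be handy to check that each dataset entry carries the keys the LeRobot-format example uses. This helper is a convenience sketch, not part of the AsyncVLA codebase:

```python
# Minimal check that a dataset entry has the keys the LeRobot-format
# config expects. The field names follow the example above.
REQUIRED_KEYS = {
    "repo_id", "root",
    "select_video_keys", "select_state_keys", "select_action_keys",
}

def validate_dataset_entry(entry):
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"dataset entry missing keys: {sorted(missing)}")
    return True

entry = {
    "repo_id": "your_dataset_name",
    "root": "/path/to/your/dataset",
    "select_video_keys": ["observation.images.image_0"],
    "select_state_keys": ["observation.state"],
    "select_action_keys": ["action"],
}
assert validate_dataset_entry(entry)
```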
Modify `models/model/configuration_asyncvla.py` for custom configurations:

- `action_chunk_size`: Sequence length for action prediction
- `max_action_dim`: Maximum action dimensionality
- `num_denoise_steps`: Flow matching denoising iterations
- `num_action_layers`: Action projection network depth
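As a reading aid, these fields can be pictured as a small dataclass. The defaults other than `action_chunk_size` (listed as 4 among the training parameters) are placeholders, and the real definitions in `configuration_asyncvla.py` may differ:

```python
from dataclasses import dataclass

# Hypothetical mirror of the configuration fields above, for illustration only.
@dataclass
class AsyncVLAConfigSketch:
    action_chunk_size: int = 4    # action tokens predicted per step
    max_action_dim: int = 7       # assumption: 7-DOF action space
    num_denoise_steps: int = 10   # assumption: placeholder default
    num_action_layers: int = 2    # assumption: placeholder default

cfg = AsyncVLAConfigSketch(num_denoise_steps=20)
print(cfg.num_denoise_steps)  # 20
```

Raising `num_denoise_steps` trades inference speed for flow-matching precision, which is the same trade-off named in the hyperparameter notes below.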
Key hyperparameters to adjust:
- Learning Rates: Separate rates for vision, language, and action components
- Batch Size: Balance between memory usage and training stability
- Chunk Size: Trade-off between temporal modeling and computational cost
- Denoising Steps: Flow matching precision vs. inference speed
AsyncVLA achieves state-of-the-art results across general embodied evaluations through its innovative asynchronous generation approach:
- Data Efficiency: Superior performance with reduced training data requirements
- High Success Rates: Consistently high success rates across the LIBERO, Bridge, and Fractal benchmarks
- Self-Correction: Reduced failure rates through confidence-based action refinement
- Training: 8x A100/H100 GPUs (recommended)
- Inference: Single GPU (RTX 3090 or better)
- Memory: 80GB+ GPU memory for full model training
- Storage: 500GB+ for dataset storage
- Built upon the excellent Qwen2.5-VL architecture
- Utilizes LeRobot for dataset management
- Inspired by advances in open source vision-language-action models