- Paper: https://arxiv.org/abs/2510.14902
- Project Page: https://vla-2.github.io
- 10.27.25: Initial upload.
- 11.03.25: Deployment updated.
VLA-2/
├── experiments/                     # Main experimental code
│   ├── robot/                       # Core VLA-2 implementation
│   │   ├── openvla_utils.py         # OpenVLA utility functions
│   │   ├── robot_utils.py           # Robot interaction utilities
│   │   └── libero_run/              # Main scripts for the LIBERO environment
│   │       ├── main_agent_clean.py          # 🎯 Main execution script; client that requests services from vision_planner_service
│   │       ├── vision_planner_service.py    # Vision & planning service
│   │       ├── qwenvl.py                    # Verification module wrapper
│   │       ├── libero_utils.py              # LIBERO environment utilities
│   │       ├── regenerate_libero_dataset.py # Dataset regeneration
│   │       ├── mps_start.sh                 # Multi-process service start
│   │       └── mps_stop.sh                  # Multi-process service stop
│   └── val_zsh/                     # Validation shell scripts
│       ├── 0.sh, 10.sh              # 0 and 10 test scenarios
│       ├── goal.sh, goal_new.sh     # Goal-based evaluations
│       ├── objects.sh               # Object manipulation tests
│       ├── orange.sh                # Specific object tests
│       └── spatial.sh               # Spatial reasoning tests
├── script/                          # Tool and utility scripts
│   ├── __init__.py                  # Package initialization
│   ├── auto_DL.py                   # Automatic search utilities
│   ├── color.json                   # Color configuration
│   ├── Judge_simple.py              # Simple judgment module
│   ├── mmgdino.py                   # MM-GroundingDINO integration (vision and language grounding)
│   ├── mmgdino_simple.py            # Simplified MM-GroundingDINO
│   ├── qwenvl_meg.py                # QwenVL model enhancement
│   ├── SAM2_1.py                    # Segment Anything Model 2.1
│   ├── SAPdivision.py               # SAP (Sub-Action Planning) division
│   ├── segvideo.py                  # Video segmentation
│   ├── segvideo_simple.py           # Simplified video segmentation
│   ├── Wholebody.py                 # A media function
│   └── test_images/                 # Test images and configurations
│       ├── info.json                # Image metadata
│       ├── replacetest.py           # Replacement testing
│       ├── smoke_results.json       # Smoke test results
│       └── test.py                  # Test runner
├── prismatic/                       # OpenVLA codebase (original)
└── vla-scripts/                     # Training, fine-tuning, and deployment scripts
    ├── deploy.py                    # Model deployment script
    ├── finetune.py                  # Fine-tuning script
    ├── train.py                     # Training script
    └── extern/                      # External conversion utilities
        ├── convert_openvla_weights_to_hf.py # Weight conversion
        ├── test_openvla.py          # OpenVLA testing
        └── verify_openvla.py        # OpenVLA verification
- main_agent_clean.py: Main execution script containing all tool module calls and the agent logic implementation.
- vision_planner_service.py: Service server for the planner, vision, and language modules. Due to library version compatibility issues, the execution and verification module code runs in a separate process and communicates with the main process via sockets. For module naming and content details, please refer to the paper.
- qwenvl.py: Wrapper function for the verification module.
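The split between the two processes can be pictured roughly as below. This is only a minimal sketch of the socket pattern described above; the port, message format, and function names are assumptions made for the example, not the actual protocol of vision_planner_service.py.

```python
import json
import socket

HOST, PORT = "127.0.0.1", 9999  # assumed local port; not the project's real setting

def request_service(payload: dict, timeout: float = 60.0) -> dict:
    """Client side (e.g. main_agent_clean.py): send one JSON request and wait for the reply."""
    with socket.create_connection((HOST, PORT), timeout=timeout) as conn:
        conn.sendall(json.dumps(payload).encode("utf-8"))
        conn.shutdown(socket.SHUT_WR)  # tell the server the request is complete
        reply = b"".join(iter(lambda: conn.recv(4096), b""))
    return json.loads(reply)

def serve(handle) -> None:
    """Server side (e.g. vision_planner_service.py): answer one request per connection."""
    with socket.create_server((HOST, PORT)) as srv:
        while True:
            conn, _ = srv.accept()
            with conn:
                request = b"".join(iter(lambda: conn.recv(4096), b""))
                result = handle(json.loads(request))  # run the vision / planning model here
                conn.sendall(json.dumps(result).encode("utf-8"))
```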
- Computer Vision: SAM2_1.py, segvideo.py, mmgdino.py (advanced vision processing)
- Language Models: qwenvl_meg.py, Judge_simple.py (language understanding and judgment)
- Planning: SAPdivision.py (sub-action planning and task decomposition)
- Utilities: auto_DL.py, Wholebody.py (automation and analysis tools)
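To make the division of labor concrete, the sketch below shows one way such tools could be wired into a single agent step. All names and signatures here are hypothetical placeholders for illustration; they are not the actual interfaces of the modules in script/.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class ToolBox:
    """Hypothetical bundle of tool callables; field names are placeholders, not real APIs."""
    ground: Callable[[Any, str], list]    # open-vocabulary detection (MM-GroundingDINO-style)
    segment: Callable[[Any, list], list]  # promptable segmentation (SAM 2.1-style)
    plan: Callable[[str], List[str]]      # sub-action planning (SAPdivision-style)
    verify: Callable[[Any, str], bool]    # VLM-based success judgment (Qwen-VL-style)

def agent_step(image: Any, instruction: str, tools: ToolBox) -> bool:
    """One illustrative detect -> segment -> plan -> verify pass."""
    boxes = tools.ground(image, instruction)     # locate task-relevant objects
    tools.segment(image, boxes)                  # refine boxes into masks for tracking
    subgoals = tools.plan(instruction)           # split the instruction into sub-actions
    return all(tools.verify(image, g) for g in subgoals)  # check each sub-goal
```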
The remaining code in the experiments folder is based on the OpenVLA codebase.
- Backbone Models: Support for various LLM and vision architectures
- VLA Integration: Specialized vision-language-action model implementations
- Training Infrastructure: Distributed training with DDP/FSDP support
- Data Processing: RLDS dataset integration and preprocessing
- Comprehensive test scenarios covering different aspects of robot manipulation
- Goal-oriented tasks, object manipulation, and spatial reasoning evaluations
This project uses a dual conda environment setup to avoid library version conflicts, particularly with transformers. We recommend using OpenVLA's recommended configuration for the main environment and our specified requirements for the server environment.
- Anaconda/Miniconda: Latest version
- Git: For repository cloning
- NVIDIA Driver: 550.54.14+
- CUDA: Compatible with PyTorch 2.2/2.3
- OpenVLA: Core VLA framework
- LIBERO_ZERO: Evaluation benchmark
- Bulk-Bing-Image-downloader: Image downloading utility
- Cutie: Video object segmentation
- MM-GroundingDINO: Grounding DINO integration
- SAM 2.1: Segment Anything Model
- Qwen-VL: Vision-Language model
- GLM-4.1V: Thinking model
# Create and activate client environment
conda env create -f client.yml
conda activate client
# Install video segmentation library
git clone https://github.com/hkchengrex/Cutie
cd Cutie && pip install -e .
cd ..
# Install robot learning benchmark
git clone https://github.com/zhangjiaxuan-Xuan/LIBERO_ZERO
# Optional: cd LIBERO_ZERO && pip install -e .
# Recommended: Import LIBERO_ZERO by absolute path
# Install OpenVLA dependencies
pip install dlimp@git+https://github.com/moojink/dlimp_openvla
pip install thinplate@git+https://github.com/cheind/py-thin-plate-spline
# Optional: Install Flash Attention for performance
pip install flash-attn==2.5.5

# Create and activate server environment
conda env create -f server.yml
conda activate server
# Install bulk image downloader
pip install git+https://github.com/ostrolucky/Bulk-Bing-Image-downloader
# Install latest transformers (includes tokenizers)
pip install git+https://github.com/huggingface/transformers.git
# Optional: Install Flash Attention for performance
pip install flash-attn==2.6.1

- Download required model weights to local storage
- Update model paths in all files under experiments/ and script/ as needed (for importing LIBERO_ZERO by absolute path, see the sketch after this list)
- Use the validation scripts in the val_zsh/ folder for initial testing
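For the absolute-path import of LIBERO_ZERO recommended earlier, one option is to prepend the clone location to sys.path before importing the benchmark. The path below is a placeholder, and the import assumes LIBERO_ZERO keeps the upstream LIBERO package layout.

```python
import sys
from pathlib import Path

# Placeholder: replace with the directory where you cloned LIBERO_ZERO.
LIBERO_ZERO_ROOT = Path("/path/to/LIBERO_ZERO")
sys.path.insert(0, str(LIBERO_ZERO_ROOT))

from libero.libero import benchmark  # assumes the upstream LIBERO package layout

suite = benchmark.get_benchmark_dict()["libero_spatial"]()  # e.g. the spatial task suite
print(suite.get_task(0).language)  # instruction of the first task
```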
Enter the 'val_zsh' directory and run a test script, e.g.,
cd val_zsh
zsh 0.sh

If you find this project useful in your research, please consider citing:
@misc{zhaozhang2025vla2,
title={VLA$^2$: Empowering Vision-Language-Action Models with an Agentic Framework for Unseen Concept Manipulation},
author={Han Zhao and Jiaxuan Zhang and Wenxuan Song and Pengxiang Ding and Donglin Wang},
eprint={2510.14902},
archivePrefix={arXiv},
primaryClass={cs.RO},
year={2025}
}

- OpenVLA: Open-source Vision-Language-Action model (https://arxiv.org/abs/2304.09103, https://github.com/openvla/openvla)
- Agentic-Robot: Referenced codebase (https://github.com/Agentic-Robot/agentic-robot)
- LIBERO: Lifelong Robot Learning Benchmark (https://arxiv.org/abs/2307.01620)
- Qwen-VL: Qwen Vision-Language Model (https://github.com/QwenLM/Qwen3-VL)
- MM-GroundingDINO: Grounding DINO Model (https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino)
- Segment Anything Model 2.1: (https://docs.ultralytics.com/zh/models/sam-2/#interactive-segmentation)
- GLM-V: GLM Vision-Language Model (https://github.com/zai-org/GLM-V)
- Under active development; new features coming soon.