Vision-Language-Action (VLA) models have demonstrated strong generalization capabilities in robotic manipulation tasks. However, their inference process typically relies on large-scale vision and language models, resulting in high computational overhead that makes real-time deployment challenging.
In practice, we observe that action generation exhibits significant temporal redundancy: many continuous motion phases have stable trends that can be quickly predicted based on historical information, while complex vision-language reasoning is only needed at critical decision points (e.g., grasping or placing).
Based on this observation, we propose SP-VLA, an inference-acceleration framework for VLA models that improves efficiency through two complementary mechanisms, Model Scheduling and Token Pruning:
- Model Scheduling: Dynamically selects inference paths based on action types, using a lightweight action generator during simple motion phases and invoking the full VLA model at critical decision stages.
- Token Pruning: Filters key visual tokens using spatial and semantic information, reducing input size while maintaining spatial understanding capabilities.
By combining these two mechanisms, SP-VLA significantly reduces inference overhead while ensuring model decision-making capabilities, making VLA models more suitable for practical robotic system deployment.
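The scheduling idea above can be sketched as a simple dispatch loop. This is a minimal illustrative sketch, not the repository's actual control loop: `vla_model`, `is_critical_phase`, and `fit_next_action` are hypothetical stand-ins for the full VLA policy, the critical-phase detector, and the lightweight action generator.

```python
def control_step(obs, action_history, vla_model, is_critical_phase, fit_next_action):
    """One control step of an SP-VLA-style scheduler (illustrative sketch).

    `vla_model`, `is_critical_phase`, and `fit_next_action` are hypothetical
    stand-ins for the full VLA policy, the critical-phase detector, and the
    lightweight regression-based action generator.
    """
    if is_critical_phase(obs, action_history):
        # Critical decision point (e.g. grasping/placing): run full VLA inference
        action = vla_model(obs)
    else:
        # Stable motion phase: extrapolate cheaply from recent actions
        action = fit_next_action(action_history)
    action_history.append(action)
    return action
```

In this sketch the expensive model is invoked only when the phase detector fires; every other step costs one call to the lightweight generator.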
- Attention-based Pruning: Leverages attention scores from the SigLIP vision encoder to identify important visual tokens
- Edge Detection Enhancement: Combines Canny edge detection to ensure retention of tokens containing critical edge information
- Dynamic Pruning Rate: Adaptively adjusts pruning intensity based on robot motion velocity
  - Retains more tokens during high-speed motion to preserve accuracy
  - Applies more aggressive pruning during low-speed motion to improve speed
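The token-selection logic described above can be sketched as follows. This is a simplified illustration, not the repository's implementation: the keep-ratio bounds (`r_min`, `r_max`, `v_max`) are hypothetical hyperparameters, and `edge_mask` stands in for a per-patch mask derived from Canny edge detection.

```python
import numpy as np

def dynamic_keep_ratio(velocity, v_max=1.0, r_min=0.3, r_max=0.9):
    # Faster motion -> keep more tokens; slower motion -> prune more aggressively.
    # v_max / r_min / r_max are illustrative values, not from the paper.
    v = min(abs(velocity), v_max) / v_max
    return r_min + (r_max - r_min) * v

def prune_tokens(attn_scores, edge_mask, velocity):
    """Select visual tokens by attention score, always keeping edge tokens.

    attn_scores: (N,) per-token attention importance (e.g. from SigLIP)
    edge_mask:   (N,) boolean, True where the patch contains edge pixels
    Returns sorted indices of retained tokens.
    """
    n = attn_scores.shape[0]
    k = max(1, int(round(dynamic_keep_ratio(velocity) * n)))
    top_idx = np.argsort(attn_scores)[::-1][:k]           # top-k by attention
    keep = np.union1d(top_idx, np.nonzero(edge_mask)[0])  # union with edge tokens
    return np.sort(keep)
```

Taking the union with the edge mask guarantees that edge-bearing patches survive even when the velocity-dependent budget is tight.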
- Motion Pattern Detection: Identifies scenarios where the robot performs smooth planar movements
- Linear Regression Prediction: Uses ridge regression to fit historical actions and predict the next action
- Conditional Skipping: Skips model inference only when all of the following conditions are met:
  - Vertical motion is sufficiently small relative to horizontal motion
  - Absolute vertical displacement is below a threshold
  - The action history buffer contains enough previously generated actions
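The prediction and skip checks above can be sketched as follows. This is an illustrative sketch under stated assumptions: the ridge penalty `alpha` and the thresholds `min_history`, `z_ratio`, and `z_abs` are hypothetical values, and action dimensions 0-2 are assumed to be (x, y, z) end-effector deltas.

```python
import numpy as np

def fit_next_action(history, alpha=0.01):
    """Predict the next action by ridge-regressing actions against time.

    history: (T, D) array of recent actions; alpha is an illustrative
    ridge penalty, not a value from the paper.
    """
    T = history.shape[0]
    t = np.arange(T, dtype=float)
    X = np.stack([t, np.ones(T)], axis=1)  # linear trend + bias
    # Closed-form ridge solution: (X^T X + alpha I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ history)
    return np.array([float(T), 1.0]) @ W   # extrapolate one step, to t = T

def can_skip(history, min_history=5, z_ratio=0.1, z_abs=0.01):
    """Skip VLA inference only for smooth, near-planar motion.

    min_history / z_ratio / z_abs are hypothetical thresholds; dims 0..2
    of each action are assumed to be (x, y, z) deltas.
    """
    if len(history) < min_history:
        return False                      # not enough actions buffered
    deltas = np.diff(history[:, :3], axis=0)
    horiz = np.linalg.norm(deltas[:, :2], axis=1).mean()
    vert = np.abs(deltas[:, 2]).mean()
    # Vertical motion small both relative to horizontal motion and in absolute terms
    return vert < z_ratio * max(horiz, 1e-8) and vert < z_abs
```

A small ridge penalty keeps the fit stable when recent actions are nearly constant, while the extrapolation to `t = T` yields the next action in one closed-form solve.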
On the LIBERO benchmark, SP-VLA achieves 1.5× inference acceleration with no loss in task success rate.
In the SimplerEnv environment, SP-VLA not only achieves 2.4× acceleration but also improves task performance by 6%.
These tasks cover various robotic manipulation scenarios including spatial reasoning, long-horizon operations, and complex object interactions.
Real robot experiments conducted on the Franka Panda manipulator demonstrate:
- 2.5× end-to-end inference acceleration
- Task success rate decreases by only 1%
```bash
# Clone this repository
git clone https://github.com/ChildTang/SP-VLA.git
cd SP-VLA

# Follow OpenVLA's installation steps to set up the project
# See: https://github.com/openvla/openvla
```

```bash
# Evaluate on LIBERO Spatial tasks
cd experiments/robot/libero
python run_libero.py \
  --pretrained_checkpoint /path/to/checkpoint \
  --task_suite_name libero_spatial \
  --cuda_device 0
```

```
openvla_rule/
├── experiments/
│   └── robot/
│       ├── libero/
│       │   └── run_libero.py         # LIBERO evaluation script
│       ├── openvla_utils.py          # OpenVLA utility functions
│       └── robot_utils.py            # General robot utilities
├── prismatic/
│   ├── extern/hf/
│   │   └── modeling_prismatic.py     # Core model implementation (with optimizations)
```
Token pruning is implemented in `PrismaticVisionBackbone.forward()` in `prismatic/extern/hf/modeling_prismatic.py`:

```python
# 1. Extract attention scores
siglip_attn = compute_attention_scores(siglip_q, siglip_k)

# 2. Calculate the dynamic threshold
threshold = calculate_dynamic_threshold(z_trans, cfg)

# 3. Select important tokens
important_idx = select_tokens_by_attention(siglip_attn, threshold)

# 4. Combine with edge detection
edge_idx = detect_edge_tokens(raw_image)
important_idx = union(important_idx, edge_idx)

# 5. Apply pruning
patches = patches[:, important_idx]
```

Step skipping is implemented in `OpenVLAForActionPrediction.predict_action()`:
```python
# 1. Check the motion pattern
if is_planar_movement(recent_actions):
    # Use linear regression prediction
    action = fit_next_action(action_history)
else:
    # Run full VLA inference
    action = vla.generate(...)
```

If you use this code, please cite our paper:
```bibtex
@article{li2025sp,
  title={SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration},
  author={Li, Ye and Meng, Yuan and Sun, Zewen and Ji, Kangye and Tang, Chen and Fan, Jiajun and Ma, Xinzhu and Xia, Shutao and Wang, Zhi and Zhu, Wenwu},
  journal={arXiv preprint arXiv:2506.12723},
  year={2025}
}
```

This project is built upon OpenVLA.
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This implementation focuses on inference acceleration. For training-related functionality, please refer to the original OpenVLA repository.