Welcome to PAT (Prefix-Aware Attention), a high-performance optimization framework that accelerates LLM decoding by exploiting shared prefix patterns across batched sequences. This documentation guides developers who are new to the project through understanding, installing, and integrating PAT into LLM serving systems.
What is PAT?
PAT is an approach to optimizing transformer attention computation during the decoding phase of Large Language Models. Published at ASPLOS 2026, PAT targets a primary bottleneck in LLM inference: redundant KV cache reads during attention computation.
The core idea is to identify complex shared prefix patterns within batched sequences and schedule each shared prefix onto its own Cooperative Thread Array (CTA) computation. Because multiple requests often share common prefixes (such as system prompts or frequently reused text), the prefix's KV cache can be read once per group instead of once per request, cutting redundant memory traffic and computation.
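To see why splitting attention at a prefix boundary is lossless, note that softmax attention over a concatenated KV cache can be computed as two partial results (prefix and suffix) merged with a log-sum-exp correction; this is the mathematical basis for computing the shared prefix in a separate CTA. The sketch below is illustrative only and is not PAT's actual kernel code; all function names (`partial_attn`, `merge`) and shapes are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (arbitrary for the demo)

def partial_attn(q, K, V):
    """Attention of one query over one KV segment, plus softmax stats."""
    s = K @ q / np.sqrt(d)      # scaled dot-product scores
    m = s.max()                 # segment max, for numerical stability
    w = np.exp(s - m)
    l = w.sum()
    return (w @ V) / l, m, l

def merge(o1, m1, l1, o2, m2, l2):
    """Combine two partial attention results via a log-sum-exp merge."""
    m = max(m1, m2)
    a, b = l1 * np.exp(m1 - m), l2 * np.exp(m2 - m)
    return (a * o1 + b * o2) / (a + b)

# One shared prefix KV segment, reused by every request in the group.
Kp, Vp = rng.standard_normal((16, d)), rng.standard_normal((16, d))
for _ in range(2):  # two requests sharing the prefix
    q = rng.standard_normal(d)
    Ks, Vs = rng.standard_normal((4, d)), rng.standard_normal((4, d))
    # A dedicated CTA could serve the prefix segment for the whole
    # group, so Kp/Vp are read from memory once rather than per request.
    op, mp, lp = partial_attn(q, Kp, Vp)
    osf, ms, ls = partial_attn(q, Ks, Vs)
    merged = merge(op, mp, lp, osf, ms, ls)
    # The merged result matches attention over the full concatenated cache.
    full, _, _ = partial_attn(q, np.vstack([Kp, Ks]), np.vstack([Vp, Vs]))
    assert np.allclose(merged, full)
```

The merge step is the same trick used by split-KV decoding kernels: each segment keeps its running max `m` and normalizer `l`, so partial outputs can be combined exactly in any order.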
Key Features and Capabilities
PAT achieves its speedups through several mechanisms:
| Feature | Description | Benefit |
|---|---|---|