Welcome to PAT (Prefix-Aware Attention), a high-performance optimization framework that accelerates LLM decoding by exploiting shared prefix patterns across batched sequences. This documentation guides developers who are new to the project through understanding, installing, and integrating PAT into LLM serving systems.
What is PAT?
PAT is an approach to optimizing transformer attention computation during the decoding phase of Large Language Models. Published at ASPLOS 2026, PAT targets a primary bottleneck in LLM inference: redundant KV cache reads during attention computation.
The core idea is to identify complex shared prefix patterns within batched sequences and schedule each shared prefix onto its own Cooperative Thread Array (CTA) computation. Because multiple requests often share common prefixes (such as system prompts or frequently reused text), the prefix's KV cache can be read once per group instead of once per request, cutting redundant memory traffic and computation.
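To see why splitting attention at a prefix boundary is lossless, note that softmax attention over a concatenated KV cache can be computed as two partial results (prefix and suffix) merged with a log-sum-exp correction; this is the mathematical basis for computing the shared prefix in a separate CTA. The sketch below is illustrative only and is not PAT's actual kernel code; all function names (`partial_attn`, `merge`) and shapes are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (arbitrary for the demo)

def partial_attn(q, K, V):
    """Attention of one query over one KV segment, plus softmax stats."""
    s = K @ q / np.sqrt(d)      # scaled dot-product scores
    m = s.max()                 # segment max, for numerical stability
    w = np.exp(s - m)
    l = w.sum()
    return (w @ V) / l, m, l

def merge(o1, m1, l1, o2, m2, l2):
    """Combine two partial attention results via a log-sum-exp merge."""
    m = max(m1, m2)
    a, b = l1 * np.exp(m1 - m), l2 * np.exp(m2 - m)
    return (a * o1 + b * o2) / (a + b)

# One shared prefix KV segment, reused by every request in the group.
Kp, Vp = rng.standard_normal((16, d)), rng.standard_normal((16, d))
for _ in range(2):  # two requests sharing the prefix
    q = rng.standard_normal(d)
    Ks, Vs = rng.standard_normal((4, d)), rng.standard_normal((4, d))
    # A dedicated CTA could serve the prefix segment for the whole
    # group, so Kp/Vp are read from memory once rather than per request.
    op, mp, lp = partial_attn(q, Kp, Vp)
    osf, ms, ls = partial_attn(q, Ks, Vs)
    merged = merge(op, mp, lp, osf, ms, ls)
    # The merged result matches attention over the full concatenated cache.
    full, _, _ = partial_attn(q, np.vstack([Kp, Ks]), np.vstack([Vp, Vs]))
    assert np.allclose(merged, full)
```

The merge step is the same trick used by split-KV decoding kernels: each segment keeps its running max `m` and normalizer `l`, so partial outputs can be combined exactly in any order.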
Key Features and Capabilities
PAT achieves its speedups through several mechanisms:
| Feature | Description | Benefit |
|---|---|---|