Lecture 13: 04/03/25
Jayashree Mohan
MSR India
[email protected]
IISc
04/03/2025
Last class
• LLM Inference Lifecycle
• Characteristics and challenges
• Inference serving stack – vLLM : Memory perspective
• Paged attention
• vAttention
Today
• e2e LLM Serving
• Scheduling
• Scheduling and routing: Orca, Sarathi, DistServe
• Efficient, fast kernels: FlashAttention, FlashDecoding, sliding window attention
No batching vs. batching
[Figure: timeline of requests R1–R4 served one at a time (no batching) vs. two at a time (batch size = 2).]
Why batch?
Exploit hardware parallelism, better serving efficiency
Static batching
Problem: GPU utilization drops as sequences complete.
[Figure, source: AnyScale. Legend: yellow = prompt token, blue = generated token, red = end-of-sequence token.]
Continuous batching
[Figure, source: AnyScale. Top: static batching; bottom: continuous batching. Legend: yellow = prompt token, blue = generated token, red = end-of-sequence token.]
Solution: Iteration-Level Scheduling (Orca)
Default: Request-Level Scheduling
• The inference server's scheduler pulls a batch from the request queue (e.g., x2: "I love", x3: "my cat is") and hands it to the execution engine, which runs the model until every request in the batch has finished.
• When a request finishes early, its slot in the batch sits inactive for the rest of the batch.
• New requests (e.g., x1: "I think", x4: "what is") must wait until the batch finishes, which introduces latency.
Orca: Iteration-Level Scheduling
• The scheduler selects requests from a Request Pool for this iteration, the execution engine runs exactly one iteration (one new token per selected request), and the result is returned to the scheduler.
• A request that emits <eos> (e.g., x3: "my cat is so cute <eos>") returns its response immediately and leaves the pool, freeing its slot.
• A newly arrived request (e.g., x4: "what is your opinion", x1: "I think") can be selected for the very next iteration instead of waiting for the whole batch to finish.
• Example progression across iterations: x2: "I love" → "I love the asian food"; x3: "my cat is" → "my cat is so cute <eos>".
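To make iteration-level scheduling concrete, here is a minimal sketch, assuming a toy Request type, a stand-in run_one_iteration model call, and an assumed EOS_TOKEN_ID; it is not Orca's code, only an illustration of re-forming the batch every iteration.

```python
from collections import deque
from dataclasses import dataclass, field

EOS_TOKEN_ID = 2  # assumed end-of-sequence token id, for illustration

@dataclass
class Request:
    prompt_tokens: list
    generated: list = field(default_factory=list)
    finished: bool = False

def run_one_iteration(batch):
    # Stand-in for one model iteration: one new token per request
    # (a real engine would run prefill for new requests, decode for the rest).
    return [EOS_TOKEN_ID if len(r.generated) >= 4 else 100 for r in batch]

def serve(request_pool: deque, max_batch_size: int = 8):
    running = []
    while request_pool or running:
        # Re-select the batch at *every* iteration: finished requests have
        # already left, and newly arrived requests can join immediately.
        while request_pool and len(running) < max_batch_size:
            running.append(request_pool.popleft())

        for req, tok in zip(running, run_one_iteration(running)):
            req.generated.append(tok)
            if tok == EOS_TOKEN_ID:          # request completes mid-batch
                req.finished = True          # its response is returned right away

        running = [r for r in running if not r.finished]

serve(deque([Request([1, 2, 3]), Request([4, 5])]))
```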
Scheduling Algorithm
• The Attention K/V Manager hands out K/V cache slots as tokens are processed.
• Iteration 1: the request "my cat is" is prefilled; slots for "my", "cat", "is" are allocated.
• Iteration 2: a second request with prompt "I" is admitted and generates "like", while the first generates "so"; both now consume one new slot every iteration.
• Iteration 3: the K/V manager has no free slots left, so neither request can take its next step and neither can finish to release memory: deadlock!
Scheduling Algorithm: reserve slots up front
• At admission, reserve K/V slots for the worst case, i.e., the prompt plus the maximum number of output tokens (e.g., max_tokens = 5 for "my cat is").
• Iteration 1: the prompt tokens "my", "cat", "is" fill part of the reservation; the remaining reserved slots guarantee the request can always run to completion, so the deadlock above cannot occur.
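A minimal sketch of this reservation-based admission (assumed class and method names, not Orca's implementation): a request is admitted only if the K/V manager can reserve its worst-case footprint, prompt length plus max_tokens, up front.

```python
class KVSlotManager:
    """Toy K/V slot manager that reserves a request's worst-case footprint
    at admission time, which rules out the mid-generation deadlock above."""

    def __init__(self, total_slots: int):
        self.free_slots = total_slots

    def try_admit(self, prompt_len: int, max_tokens: int) -> bool:
        need = prompt_len + max_tokens          # worst-case K/V usage
        if need > self.free_slots:
            return False                        # keep the request waiting
        self.free_slots -= need                 # reserve everything up front
        return True

    def release(self, prompt_len: int, max_tokens: int):
        self.free_slots += prompt_len + max_tokens

kv = KVSlotManager(total_slots=8)
print(kv.try_admit(prompt_len=3, max_tokens=5))   # True: "my cat is", max_tokens = 5
print(kv.try_admit(prompt_len=1, max_tokens=5))   # False: the second request must wait
```

The price is over-reservation: most requests finish well before max_tokens, leaving reserved slots unused, which is exactly the waste that vLLM's paged KV cache (next bullet) targets.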
Recall: What can we do better?
• Dynamic memory allocation ("I do not have enough memory to complete decoding")
TTFT vs. TPOT
• TTFT (Time To First Token): initial response time.
• TPOT (Time Per Output Token): average time between two subsequent generated tokens.
High Throughput ≠ High Goodput
• Throughput = completed requests / time (e.g., 10 rps).
• Goodput = requests completed within the SLO / time (e.g., 3 rps).
• A system with high throughput can still have low goodput: requests complete, but not within the SLO criteria, which means poor UX.
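A small worked example of the two metrics, with made-up per-request latencies and an assumed TTFT/TPOT SLO, reproducing the 10 rps vs. 3 rps numbers above:

```python
# (finished, ttft_ms, avg_tpot_ms) for 10 requests completed in a 1-second window.
requests = [
    (True, 120, 30), (True, 900, 95), (True, 150, 60), (True, 700, 80),
    (True, 400, 35), (True, 850, 90), (True, 180, 25), (True, 950, 99),
    (True, 160, 45), (True, 880, 85),
]
WINDOW_S = 1.0
SLO_TTFT_MS, SLO_TPOT_MS = 300, 50    # assumed SLO, for illustration only

throughput = sum(done for done, _, _ in requests) / WINDOW_S
goodput = sum(done and ttft <= SLO_TTFT_MS and tpot <= SLO_TPOT_MS
              for done, ttft, tpot in requests) / WINDOW_S
print(f"throughput = {throughput:.0f} rps, goodput = {goodput:.0f} rps")
# throughput = 10 rps, goodput = 3 rps
```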
Impact of batching on the prefill phase
▪ Excess parallelism: a prompt can contain 1000s of tokens.
▪ Beyond a certain point, increasing the prompt size increases latency, not throughput.
[Figure: prefill latency (ms) and throughput (tokens/s, x1000) vs. prompt size (128 to 2048 tokens).]
Impact of batching on the decode phase
▪ Lacks parallelism: one token per request per iteration.
▪ Latency is almost entirely dominated by the time to read the model weights.
▪ Increasing the batch size increases throughput, not latency.
[Figure: decode latency (ms) and throughput (tokens per second) vs. batch size (1 to 32).]
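A rough back-of-the-envelope check of the "latency ≈ time to read the weights" claim, under assumed numbers (a 13B-parameter model in fp16 on a GPU with roughly 2 TB/s of HBM bandwidth):

```python
params = 13e9              # assumed model size
bytes_per_param = 2        # fp16
hbm_bandwidth = 2.0e12     # assumed ~2 TB/s HBM bandwidth

# Every decode iteration streams (essentially) all weights through the GPU once.
decode_latency_ms = params * bytes_per_param / hbm_bandwidth * 1e3
print(f"~{decode_latency_ms:.0f} ms per decode iteration")   # ~13 ms
```

To first order this cost is the same whether the batch holds 1 or 32 requests, which is why larger decode batches raise throughput without raising per-token latency.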
Scheduling objectives
• High throughput
• Low latency
Batching is essential for high throughput
[Figure: decode throughput (tokens per second) grows with batch size (1 to 32).]
Key scheduling challenge
LLM serving systems today
Prefill-prioritizing schedules:
• Timeline: Ad, Bd, Cd → (C exits; D, E enter) → Dp → Ep → Ad, Bd, Dd, Ed …
• Prefills increase TBT latency: ongoing decodes stall behind newly admitted prefills.
Decode-prioritizing schedules:
• Timeline: Ad, Bd, Cd → (C exits; D, E enter) → Ad, Bd → Ad, Bd → Bd → Bd → Dp, Ep …
• As A and B exit, the remaining decodes run with a low batch size.
Stall-free scheduling without compromising on throughput?
Option 1: decode prioritizing (e.g., FasterTransformer)
• Timeline: (C, D enter) → Ad, Bd → Ad, Bd → Bd → Bd → (A exits, B exits) → Cp → Dp → Ad, Bd, Cd, Dd …
• Decodes are never stalled, but they run at a low batch size: poor throughput.
Scheduling trade-off
[Figure: throughput (QPS) vs. TBT latency. Prefill-prioritizing schedulers (Orca, vLLM) get high throughput at high TBT; decode-prioritizing schedulers (FasterTransformer) get low TBT at low throughput; LightLLM (dynamic priority) falls in between; the ideal scheduler achieves both high throughput and low TBT.]
Mixed Batching
• Combine the prefill and decode computation in the same batch.
• vLLM timeline (decode latency SLO = 10 ms): AD | BP | AD, BD.
Key Insight
Observation: Arithmetic Intensity Slack
• Prefill: latency is dominated by compute and grows linearly with the number of tokens.
• Decode: latency is dominated by weight fetch time and is largely independent of the number of tokens; it is constrained by memory overhead, not compute.
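The slack can be quantified with a rough roofline-style estimate under assumed GPU numbers: during decode, each fp16 weight fetched supports only about one multiply-add per token in the batch, far below what the GPU's compute-to-bandwidth ratio could sustain.

```python
peak_flops = 312e12        # assumed fp16 tensor-core peak (A100-class)
hbm_bandwidth = 2.0e12     # assumed ~2 TB/s

machine_balance = peak_flops / hbm_bandwidth        # ~156 FLOPs per byte
for batch_tokens in (1, 8, 32, 256):
    # Per fp16 weight (2 bytes): one multiply-add (2 FLOPs) per token in the batch.
    intensity = 2 * batch_tokens / 2
    bound = "memory-bound" if intensity < machine_balance else "compute-bound"
    print(f"{batch_tokens:4d} tokens: ~{intensity:.0f} FLOPs/byte ({bound})")
# Decode batches stay memory-bound until the token count approaches ~156, so
# prefill tokens can piggyback on a decode batch with little extra latency.
```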
Stall-free batching in Sarathi (decode latency SLO = 10 ms)
• Baseline vLLM: AD | BP | AD, BD | AD, BD → inter-token latency = 16 ms (B's full prefill stalls A's decode past the SLO).
• Sarathi-Serve: AD | BP1 + AD | BP2 + AD → latency = 9 ms. Gain: roughly uniform inter-token latency.
• The prefill chunk size is chosen based on the TBT SLO.
SARATHI in Action
(a) Baseline iteration-level scheduling: the full prefills Ap, Bp, Cp, Dp run before the decodes Ad1, Bd1, Cd1, Dd1, Ad2, Bd2, …, leaving bubbles in the schedule.
(b) SARATHI: chunked prefills with decode-maximal batching. Each prefill is split into chunks (Ap1, Ap2, Bp1, Bp2, Bp3, Cp1, Cp2, Dp1, Dp2); every batch is filled with ongoing decodes first and then topped up with a prefill chunk.
Picking chunk size: the largest chunk that still meets the TBT target.
Key insight: the low arithmetic intensity of decodes allows adding compute-intensive prefills with negligible decode latency cost.
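A minimal sketch of decode-maximal batching with chunked prefills in the spirit of Sarathi-Serve (not its actual code; the request structure and token_budget are illustrative): each iteration is filled with all ongoing decodes first and then topped up with a prefill chunk no larger than the remaining token budget.

```python
def build_batch(decode_reqs, prefill_queue, token_budget=256):
    """Return (decode_reqs, prefill_chunks) for one iteration.

    token_budget is the per-iteration token limit chosen so that a full
    batch still meets the TBT SLO (illustrative value).
    """
    prefill_chunks = []
    budget = token_budget - len(decode_reqs)        # each decode costs ~1 token
    while prefill_queue and budget > 0:
        req = prefill_queue[0]
        remaining = req["prompt_len"] - req["prefilled"]
        chunk = min(remaining, budget)              # chunk the prefill
        prefill_chunks.append((req["id"], chunk))
        req["prefilled"] += chunk
        budget -= chunk
        if req["prefilled"] == req["prompt_len"]:
            prefill_queue.pop(0)                    # prefill done; decoding starts next
    return decode_reqs, prefill_chunks

# Two ongoing decodes (A, B) plus a 500-token prompt C arriving:
prefills = [{"id": "C", "prompt_len": 500, "prefilled": 0}]
print(build_batch(["A", "B"], prefills))   # (['A', 'B'], [('C', 254)])
```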
Is this a perfect solution?
Summary: problems caused by colocating prefill and decode
• Continuous batching causes interference between the two phases.
• The parallelism strategy (TP/PP) is coupled: both phases must use the same configuration even though their preferred configurations differ.
Prompt computation vs. token generation
Example: prompt "Which is better, pizza or burger?" → generation "Pizza is better."
Inefficient to run both phases on the same hardware
[Figure: the prompt phase and the token-generation phase of the example request share the same server's GPUs.]
Splitwise / DistServe split the phases onto different servers
[Figure: the prompt phase ("Which is better, pizza or burger?") runs on one pool of GPUs and the token-generation phase ("Pizza is better.") runs on another.]
Splitting inference requires fast state transfers
• The KV cache built during the prompt phase must be shipped to the token-generation server.
Splitwise uses GPU InfiniBand to ship state
[Figure: the KV cache is transferred from the prompt GPUs to the token GPUs over GPU InfiniBand.]
Parallelize transfers for high bandwidth
[Figure: the KV state is transferred in parallel across GPU pairs at about 25 GB/s per link. The prompt-phase server finishes the prompt phase and transfers state; the token-phase server receives it and starts the token phase after a short delay.]
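A rough estimate of the transfer cost, with assumed model shapes (a LLaMA-2-13B-like KV layout in fp16) and the 25 GB/s per-link figure from the slide:

```python
layers, kv_heads, head_dim, dtype_bytes = 40, 40, 128, 2   # assumed 13B-like shapes
prompt_tokens = 1024
link_bw = 25e9            # 25 GB/s per GPU pair, as in the figure
parallel_links = 8        # assumed: one transfer per GPU pair

kv_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes * prompt_tokens  # K and V
serial_ms = kv_bytes / link_bw * 1e3
print(f"KV cache: {kv_bytes / 1e9:.2f} GB, "
      f"{serial_ms:.1f} ms over one link, "
      f"{serial_ms / parallel_links:.1f} ms across {parallel_links} links")
# ~0.84 GB; a few milliseconds once the transfer is parallelized across GPU pairs.
```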
Disaggregation achieves better goodput
• Colocate: 1 GPU handles both prefill and decode.
• Disaggregate (2P1D): 2 GPUs for prefill + 1 GPU for decode.
• Simple disaggregation achieves 2x goodput (per GPU).
Evaluation: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang.
Power-aware
[Figure: allocated power vs. power drawn during prompt phases and token phases; 3x BLOOM-176B inference requests on 8 GPUs.]
What are the issues or potential downsides of disaggregation?
User Interface
End-user interface (vllm/entrypoints) and developer interface (vllm/engine):
• Synchronous: LLM (vllm/entrypoints/llm.py) for batched inference, built on LLMEngine (vllm/engine/llm_engine.py) with add_request(), abort_request(), and step() → List[RequestOutput], which streams one token of output for a batch of requests per call.
• Asynchronous: api_server (vllm/entrypoints/api_server.py) and the OpenAI-compatible openai_api_server (vllm/entrypoints/openai/api_server.py), built on AsyncLLMEngine with async def generate() and async def abort(), plus a background engine loop.
Under the engine:
• Scheduler (vllm/core/scheduler.py) with a BlockSpaceManager (vllm/core/block_manager.py).
• N× Worker (vllm/worker/worker.py), each with a CacheEngine (vllm/worker/cache_engine.py) and Worker.model (vllm/model_executor/models).
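For reference, a minimal use of the synchronous end-user interface (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# LLM (vllm/entrypoints/llm.py) wraps LLMEngine and runs batched, synchronous inference.
llm = LLM(model="facebook/opt-125m")                      # example model
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The future of Artificial Intelligence",
                        "Which is better, pizza or burger?"], params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```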
Before serving requests
1. Initialize and load the model: LLMEngine (vllm/engine/llm_engine.py) creates N Workers (vllm/worker/worker.py), each of which loads Worker.model (vllm/model_executor/models).
2. Profile memory: Worker.profile_num_available_blocks profiles the memory usage of a batch with the maximum possible number of tokens.
   #GPU KV blocks = remaining memory / block byte size.
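Spelling out the block-count formula with assumed shapes (a 7B-like model, vLLM's default block_size of 16 tokens, fp16; the real numbers come from the profiling pass):

```python
block_size  = 16          # tokens per KV block (vLLM default)
layers      = 32          # assumed 7B-like shapes
kv_heads    = 32
head_dim    = 128
dtype_bytes = 2           # fp16

# K and V for every layer, for block_size tokens.
block_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes * block_size

gpu_memory       = 40e9   # example GPU
weights_and_peak = 20e9   # weights + peak activations found by profiling (assumed)
remaining        = gpu_memory - weights_and_peak

num_gpu_blocks = int(remaining // block_bytes)
print(f"block: {block_bytes / 2**20:.0f} MiB, "
      f"#GPU KV blocks: {num_gpu_blocks}, "
      f"max cached tokens: {num_gpu_blocks * block_size}")
# block: 8 MiB, #GPU KV blocks: 2384, max cached tokens: 38144
```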
3. Pre-allocate KV blocks: the Scheduler's BlockSpaceManager (vllm/core/block_manager.py) and each Worker's CacheEngine (vllm/worker/cache_engine.py) pre-allocate the GPU (and CPU swap) KV cache blocks.

Serving a request
Request prompt: "The future of Artificial Intelligence"
LLMEngine (vllm/engine/llm_engine.py):
1. Tokenization: [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286]
2. Add the request to the Scheduler's (vllm/core/scheduler.py) waiting queue.
At every step, the scheduler
• BlockSpaceManager (vllm/core/block_manager.py):
  ● Allocates space for new KV cache.
  ● Handles block sharing.
  ● Swaps out or deletes preempted KV blocks.
• BlockAllocator (GPU/CPU) (vllm/core/block_manager.py) → produces cache instructions and the block table.
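A toy sketch of what the block allocator and block table amount to (assumed names, not vLLM's classes): physical blocks come from a free list, sharing is reference counting, and a sequence's block table grows by one block every block_size tokens.

```python
BLOCK_SIZE = 16

class ToyBlockAllocator:
    """Hands out fixed-size physical KV blocks from a free list."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block):            # share a block between sequences (e.g., beam search)
        self.refcount[block] += 1

    def free_block(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

def append_token(block_table, num_tokens, allocator):
    """Grow a sequence by one token, allocating a new physical block
    whenever the last block is full (a stand-in for BlockSpaceManager logic)."""
    if num_tokens % BLOCK_SIZE == 0:
        block_table.append(allocator.allocate())
    return num_tokens + 1

alloc, table, n = ToyBlockAllocator(num_blocks=8), [], 0
for _ in range(20):                   # 20 tokens -> 2 physical blocks
    n = append_token(table, n, alloc)
print(table, n)                       # e.g., [7, 6] 20
```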
At every step, the worker
• Each of the N Workers (vllm/worker/worker.py) applies the cache instructions (e.g., swapping blocks between GPU and CPU) through its CacheEngine and then runs one forward pass of the model.
At every step, the model
• The model receives the token IDs and the block table; attention layers use the block table to locate each sequence's KV cache blocks, and tensor-parallel layers such as ParallelLinear (vllm/parallel_utils/layers.py) split the computation across workers.
Streaming results
• Requests (prompts) enter the LLMEngine (vllm/engine/llm_engine.py); each step produces 1 new token ID for each request.
• The LLMEngine detokenizes the new tokens and streams the results back.
• Unfinished requests go back through the Scheduler (vllm/core/scheduler.py), its BlockSpaceManager (vllm/core/block_manager.py), and the N× Workers (vllm/worker/worker.py) with their CacheEngine (vllm/worker/cache_engine.py) and Worker.model (vllm/model_executor/models) on the next step.
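Putting the loop together, a minimal sketch of driving LLMEngine directly, based on the method names above (exact argument forms vary across vLLM versions, so treat this as illustrative):

```python
from vllm import EngineArgs, LLMEngine, SamplingParams

# Builds the engine: loads the model, profiles memory, pre-allocates KV blocks.
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))

params = SamplingParams(max_tokens=32)
engine.add_request("req-0", "The future of Artificial Intelligence", params)
engine.add_request("req-1", "my cat is", params)

# Each step() runs one scheduling decision plus one model iteration and
# returns the (detokenized) outputs for the in-flight requests.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.request_id, "->", output.outputs[0].text)
```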