
LLM Scheduling

Jayashree Mohan
MSR India
[email protected]

IISc
04/03/2025
Last class
• LLM Inference Lifecycle
• Characteristics and challenges
• Inference serving stack – vLLM: a memory perspective
• PagedAttention
• vAttention
Today
• e2e LLM Serving
• Scheduling
The LLM serving stack:
• Scheduling and routing – Orca, Sarathi, DistServe
• Memory management – KV cache, PagedAttention, vAttention
• Efficient, fast kernels – FlashAttention, FlashDecoding, sliding-window attention
• Efficient decoding – speculative decoding, quantization
Acknowledgements
• Some slides borrowed and adapted from the Anyscale, DistServe, Sarathi, and Splitwise conference presentations, Arvind Krishnamurthy's slides, and the vLLM meetup
E2e LLM Serving
• Apps 1…k submit requests to a global request queue
• A load balancer (e.g., round robin) distributes requests across replicas
• Each replica has its own request queue with a prioritization policy (e.g., FCFS)
• A replica scheduler (Sarathi, Orca, vLLM, etc.) batches requests onto the replica's GPUs/CPUs

Scheduling

• No batching: requests R1–R4 run one after another on the GPU
• Batching (batch size = 2): R1 and R2 run together, then R3 and R4
Why batch?
Exploit hardware parallelism, better serving efficiency
Static batching

● Batching multiple sequences on the GPU, aka "static batching"
● Legend: yellow = prompt token, blue = generated token, red = end-of-sequence token
● Problem: GPU utilization drops as sequences complete

source: AnyScale
Continuous batching

● Top: static batching; bottom: continuous batching
● Legend: yellow = prompt token, blue = generated token, red = end-of-sequence token
● As a sequence emits its end-of-sequence token, a new sequence is slotted into its place

Orca [OSDI 22]
source: AnyScale
Solution: Iteration-Level Scheduling (Orca)

• The inference server keeps a request pool; the scheduler selects requests from the pool
• The execution engine runs exactly one iteration for the selected batch and returns the result
• Finished requests leave the pool immediately; new requests can join at any iteration

* maximum batch size = 3    source: Orca OSDI 2022 Talk
Example

Request   Given input   iter 1   iter 2    iter 3   iter 4   iter 5
x1        I think       this     idea      is       bad      <eos>
x2        I love        the      asian     food     <eos>
x3        my cat is     so       cute      <eos>
x4        what is       your     opinion   <eos>
Default: Request Scheduling

* Assume we set the max_batch_size = 3

• Timestep 1: the scheduler forms a batch with x2 ("I love") and x3 ("my cat is") and hands it to the execution engine.
• Timestep 2: one decode step runs for the whole batch: x2 → "the", x3 → "so".
• Timestep 3: a new request x4 ("what is") arrives but must wait in the request queue; meanwhile x2 → "asian", x3 → "cute".
• Timestep 4: x3 finishes ("my cat is so cute <eos>") but its batch slot stays occupied; x4 keeps waiting.
• Timestep 5: x2 finishes ("I love the asian food <eos>"); finished requests sit inactive while the engine waits for the entire batch, introducing latency.
• Timestep 6: only after the whole batch drains does the scheduler admit the waiting x4, together with newly arrived requests such as x1 ("I think"), into a fresh batch.
Drawback: High Latency
1. Low efficiency due to inactive requests
2. Newly arrived requests wait until all requests in the current batch have finished
Orca in action

* Assume we set the max_batch_size = 3

• Timestep 0: the scheduler selects x2 ("I love") and x3 ("my cat is") from the request pool for this iteration.
• Timestep 1: the execution engine runs one iteration and returns the result to the pool: x2 → "the", x3 → "so".
• Timestep 2: the next iteration runs on x2 ("I love the") and x3 ("my cat is so") and returns x2 → "asian", x3 → "cute".
• Timestep 3: a new request x4 ("what is") joins the request pool; after this iteration the pool holds x2 ("I love the asian"), x3 ("my cat is so cute"), and x4.
• Timestep 4: x2, x3, and x4 run together; x2 → "food", x4 → "your", and x3 finishes ("my cat is so cute <eos>"). x3 exits the engine and its response is returned immediately, without waiting for the rest of the batch.
• Timestep 5: x2 finishes ("I love the asian food <eos>") and is returned; x4 → "opinion"; a new request x1 ("I think") joins the pool.
• Timestep 6: the engine runs the next iteration with x1 batched alongside the remaining requests.

Allows requests to enter and exit the execution engine at iteration granularities:
- No more wasted wait time before returning results
- Efficient use of resources
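To make the iteration-level loop concrete, here is a minimal Python sketch of a continuous-batching engine in the spirit of Orca; the Request, RequestPool, and model_forward names are illustrative, not Orca's or vLLM's actual APIs.

```python
from dataclasses import dataclass, field
from collections import deque

MAX_BATCH_SIZE = 3   # mirrors the "max_batch_size = 3" assumption in the slides

@dataclass
class Request:
    rid: str
    prompt_tokens: list
    generated: list = field(default_factory=list)
    finished: bool = False

class RequestPool:
    """Stand-in for Orca's request pool: requests wait here between iterations."""
    def __init__(self):
        self.pool = deque()
    def add(self, req):
        self.pool.append(req)
    def remove(self, req):
        self.pool.remove(req)
    def select(self, k):
        # FCFS: pick up to k unfinished requests for this iteration.
        return [r for r in self.pool if not r.finished][:k]

def model_forward(req):
    # Dummy stand-in for one transformer pass (prefill or a single decode step).
    return "<eos>" if len(req.generated) >= 2 else f"tok{len(req.generated)}"

def run_one_iteration(batch):
    """Execution engine: run exactly ONE iteration for the selected batch."""
    for req in batch:
        token = model_forward(req)
        req.generated.append(token)
        if token == "<eos>":
            req.finished = True

def engine_loop(pool, emit):
    while True:
        batch = pool.select(MAX_BATCH_SIZE)   # scheduler: select requests for this iteration
        if not batch:
            break
        run_one_iteration(batch)
        for req in batch:
            if req.finished:                  # finished requests exit immediately...
                pool.remove(req)
                emit(req)                     # ...and their responses are returned right away
        # Requests added to the pool meanwhile are eligible at the very next iteration.

pool = RequestPool()
pool.add(Request("x2", ["I", "love"]))
pool.add(Request("x3", ["my", "cat", "is"]))
engine_loop(pool, emit=lambda r: print(r.rid, r.generated))
```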
How to perform iteration-level batching?
• To exploit parallelism with batching, all requests need to have the same number of tokens processed.
• This is because the Attention mechanism requires input tensors to have the same shape (which is based on the number of processed tokens).
• Need to coalesce B × [L, H] tensors into a [B, L, H] tensor for batching.
  (B = batch size, L = input length, H = hidden size)
How to perform iteration-level batching?
• Batching is only applicable when
  • requests are in the same phase (prefill or decode)
  • requests have the same length

Three cases cannot be batched normally (e.g., X1: "I think this is", X2: "I love you", X3: "A person"):
1. Both requests are in the prefill phase but have different numbers of input tokens
2. Both requests are in the decode phase but are processing tokens at different indices
3. The requests are in different phases (one in prefill, one in decode)
In each case, the attention matrices have different shapes.
Orca Uses Selective Batching
Not all operations are incompatible with irregularly shaped tensors.
Matrix multiplication and layer normalization can be batched, because they do not distinguish between different requests.

Example: X1 and X2 in decode contribute [1, H] each, X3 ("A person is") in prefill contributes [3, H]; flattened together they form a single [5, H] tensor that is fed through LayerNorm and the Linear layers, while Attention is handled per request.
Solution: Selective Batching
Instead of batching all tensor operations, only batch the ones you can.
• Flatten the input tensors into a 2-D tensor of shape [∑L, H], and feed it into the non-Attention operations (Linear, LayerNorm, Add, GeLU).
• For Attention operations, split the flattened tensor back into individual requests and run the Attention operation separately for each. Once completed, merge the requests back.
• An Attention K/V manager provides Keys/Values separately for each request.
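To make selective batching concrete, here is a minimal PyTorch sketch (illustrative, not Orca's code): the non-attention operators run once on the flattened [∑L, H] tensor, while attention runs per request. For simplicity it uses plain self-attention over each request's rows and ignores the K/V cache.

```python
import torch
import torch.nn as nn

H = 64  # hidden size (illustrative)

# Requests with different lengths: decode requests contribute 1 token, prefills contribute L tokens.
lengths = [1, 1, 3]                       # e.g., X1, X2 in decode, X3 in prefill
xs = [torch.randn(L, H) for L in lengths]

# 1) Non-attention ops (Linear, LayerNorm, Add, GeLU) don't care which request a row belongs to,
#    so flatten everything into a single [sum(L), H] tensor and batch them.
flat = torch.cat(xs, dim=0)               # [5, H] in this example
linear = nn.Linear(H, H)
norm = nn.LayerNorm(H)
flat = norm(nn.functional.gelu(linear(flat)))

# 2) Attention needs per-request shapes (and per-request K/V), so split the flat tensor back
#    into individual requests, run attention separately, then merge.
def attention_per_request(x):             # hypothetical single-request attention
    scores = x @ x.transpose(0, 1) / (H ** 0.5)
    return torch.softmax(scores, dim=-1) @ x

outs, start = [], 0
for L in lengths:
    chunk = flat[start:start + L]         # this request's rows
    outs.append(attention_per_request(chunk))
    start += L

merged = torch.cat(outs, dim=0)           # back to [sum(L), H] for the next layer
```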
Scheduling Algorithm: the Attention K/V Manager

• Every processed token (prompt or generated) needs a K/V slot in the manager's fixed-size pool.
• Iteration 1: a request with prompt "my cat is" is admitted; slots are allocated for "my", "cat", "is" and then for its generated tokens (e.g., "so").
• Iteration 2: a second request with prompt "I" is admitted; slots are allocated for "I" and its generated tokens (e.g., "like").
• Iteration 3: the pool is exhausted before either request reaches <eos> – neither can allocate the slot it needs for its next token. Deadlock!!
Scheduling Algorithm

Solution: Memory-Aware Allocation – reserve enough GPU space for all iterations at the 1st iteration itself.

Predetermine 'max_tokens', the maximum number of tokens any request can generate.
Example: with max_tokens = 5, admitting the request "my cat is" reserves K/V slots for its 3 prompt tokens plus up to 5 generated tokens at the 1st iteration itself; a request is only admitted if that full reservation fits.
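A minimal sketch of this memory-aware admission policy, assuming each request declares max_tokens up front; the KVManager class and slot counts are illustrative, not Orca's implementation.

```python
class KVManager:
    """Fixed pool of K/V slots; every prompt or generated token consumes one slot."""
    def __init__(self, total_slots: int):
        self.total_slots = total_slots
        self.reserved = 0

    def try_admit(self, prompt_len: int, max_tokens: int) -> bool:
        # Reserve the worst-case footprint up front, so the request can never
        # deadlock waiting for a slot mid-generation.
        need = prompt_len + max_tokens
        if self.reserved + need > self.total_slots:
            return False          # keep the request queued instead of admitting it
        self.reserved += need
        return True

    def release(self, prompt_len: int, max_tokens: int):
        # On completion, give back everything that was reserved (used or not).
        self.reserved -= prompt_len + max_tokens


# Example: a pool of 10 slots with max_tokens = 5, as in the slide.
kv = KVManager(total_slots=10)
print(kv.try_admit(prompt_len=3, max_tokens=5))   # "my cat is" -> reserves 8 slots -> True
print(kv.try_admit(prompt_len=1, max_tokens=5))   # "I" -> would need 6 more slots -> False, must wait
```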
Recall: What can we do better?
• Dynamic memory allocation
• But deadlocks can still happen!
• vLLM: checkpoint and recompute!
  • "I do not have enough memory to complete decoding" → move that request's KV tokens to DRAM (swap out)
  • When memory frees up, move the tokens back to HBM (swap in), or recompute them
Results
Takeaway

• Orca introduced primitives to enable iteration-level scheduling
• But how do we meet latency SLOs?
• How do we schedule requests to improve overall system throughput while meeting their SLOs?
Scheduling with latency-throughput constraints

• TTFT (time to first token): initial response time
• TPOT (time per output token): average time between two subsequent generated tokens

Chatbot: needs a fast initial response (tight TTFT); TPOT only needs to keep up with human reading speed (P99 latency = 250 ms)
Summarization: the user can tolerate a longer initial response (relaxed TTFT); the output is consumed as data, so TPOT is tight (P99 latency = 35 ms)
High Throughput ≠ High Goodput

• Throughput = completed requests / time (e.g., 10 rps)
• Goodput = completed requests within SLO / time (e.g., only 3 rps)
• A system with high throughput can still have low goodput – and low goodput means poor UX
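The distinction is easy to compute from per-request latencies; a minimal sketch with made-up numbers (the 10 rps / 3 rps figures mirror the slide, the SLO thresholds are illustrative):

```python
# (ttft_seconds, tpot_seconds) for each request completed in a 1-second window.
completed = [
    (0.15, 0.04), (0.20, 0.09), (0.18, 0.26), (1.20, 0.05), (0.22, 0.31),
    (0.95, 0.12), (0.19, 0.08), (0.40, 0.27), (0.16, 0.33), (0.50, 0.06),
]
window_seconds = 1.0
TTFT_SLO, TPOT_SLO = 0.25, 0.10      # illustrative SLO thresholds

throughput = len(completed) / window_seconds
goodput = sum(1 for ttft, tpot in completed
              if ttft <= TTFT_SLO and tpot <= TPOT_SLO) / window_seconds

print(f"throughput = {throughput:.0f} rps")   # 10 rps: every completed request counts
print(f"goodput    = {goodput:.0f} rps")      # 3 rps: only requests meeting both SLOs count
```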
Impact of batching on Prefill phase
▪ Excess parallelism: a prompt can contain 1000s of tokens
▪ Beyond a certain point, increasing prompt size increases latency, not throughput

[Plots: latency (ms) and throughput (tokens per second, x1000) vs. prompt size from 128 to 2048 tokens]
Impact of batching on Decode phase
▪ Lacks parallelism: one token per request
▪ Latency is almost entirely dominated by the time to read model weights
▪ Increasing batch size increases throughput, not latency

[Plots: latency (ms) and throughput (tokens per second) vs. batch size from 1 to 32]
Scheduling objectives
High throughput, low latency:
▪ Time-to-first-token (TTFT) – can be somewhat relaxed
  • Scheduling delay + prefill time
▪ Time-between-tokens (TBT) – constrained
  • Influences request execution time and user experience
▪ Serving capacity
  • Maximum QPS serviceable under a specific latency SLO
Batching is essential for high throughput

▪ Because decodes are memory-bound

[Plot: decode throughput (tokens per second) vs. batch size from 1 to 32]
Key scheduling challenge

▪ How do we increase the decode batch size?
▪ A request needs to compute its prefill first
▪ Batching therefore involves an interleaving of prefills and decodes

How do we efficiently interleave prefills and decodes?
LLM serving systems today

• Prefill-prioritizing schedules: while the decode batch (Ad, Bd, Cd) is running, C exits and D, E enter; their prefills (Dp, Ep) are scheduled next, and decoding then resumes with the larger batch (Ad, Bd, Dd, Ed). The interposed prefills increase TBT latency.
• Decode-prioritizing schedules: the decode batch (Ad, Bd, Cd) is run down to completion (Ad, Bd → … → Bd) before the prefills Dp, Ep are admitted. Decodes run with a low batch size.
Can we get stall-free scheduling without compromising on throughput?


How is scheduling done today?

• Option 1 – Prefill prioritizing (e.g., vLLM): when C and D enter, their prefills Cp, Dp run before the next decode iteration (Ad, Bd, Cd, Dd). Decodes for requests A and B are stalled.
• Option 2 – Decode prioritizing (e.g., FasterTransformer): decodes Ad, Bd run until A and B exit, and only then does Cp run. The low batch size gives poor throughput.

Can we process the prefill of a new request without pausing the ongoing decodes?
Scheduling trade-off

[Figure: throughput (QPS) vs. TBT (latency). Prefill-prioritizing schedulers (Orca, vLLM) reach high throughput at high TBT; decode-prioritizing schedulers (FasterTransformer) reach low TBT at low throughput; LightLLM (dynamic priority) sits in between; the ideal scheduler would deliver both.]

Existing systems compromise on either throughput or latency
Mixed Batching (Mistral 7B on A100; decode latency SLO = 10 ms)

• vLLM baseline: a decode-only iteration AD takes 8 ms, but when request B arrives its full prefill BP runs as its own iteration and takes 24 ms, stalling A's decode and violating the SLO.
• Mixed batching combines the prefill and decode computation in one batch: BP + AD takes 16 ms.
• But we still violate SLOs – latency remains high because a long prefill is combined with the decodes.
Key Insight

Chunked prefills: prefills can be split into smaller chunks that only marginally add to the decode batch time.
Observation: Arithmetic Intensity Slack

• Prefill: latency is dominated by compute and grows linearly with the number of tokens.
• Decode: latency is dominated by weight fetch time and is largely independent of the number of tokens, because the phase is constrained by memory overhead.
• A decode batch therefore leaves compute slack that extra prefill tokens can use almost for free.
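A back-of-the-envelope sketch of that slack, using rough A100 peak numbers and the common ≈2·params FLOPs-per-token approximation; all values are illustrative:

```python
# Roofline-style estimate of decode "slack" for a 7B-parameter model on an A100.
# Peak numbers and the 2*params FLOPs/token rule are rough approximations;
# KV-cache reads and other overheads are ignored.
params = 7e9
bytes_per_param = 2                 # FP16 weights
peak_flops = 312e12                 # ~A100 FP16 tensor-core peak
hbm_bw = 2.0e12                     # ~A100 HBM bandwidth (bytes/s)

weight_fetch_time = params * bytes_per_param / hbm_bw   # fixed cost per decode iteration
flops_per_token = 2 * params

def iteration_time(num_tokens):
    compute_time = num_tokens * flops_per_token / peak_flops
    return max(compute_time, weight_fetch_time)          # memory-bound until compute catches up

print(f"weight fetch: {weight_fetch_time*1e3:.1f} ms per iteration")
for n in (8, 64, 128, 256, 512):
    print(f"{n:4d} tokens in the batch -> {iteration_time(n)*1e3:.1f} ms")

slack_tokens = weight_fetch_time * peak_flops / flops_per_token
print(f"~{slack_tokens:.0f} tokens fit before the iteration becomes compute-bound")
```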
Stall-free batching in Sarathi (decode latency SLO = 10 ms)

• Baseline (vLLM): AD (8 ms) → BP (24 ms) → AD, BD → AD, BD. The monolithic prefill BP stalls decodes and violates the SLO.
• Mixed batching: BP + AD = 16 ms – still violates the SLO.
• Sarathi-Serve: split large prefills into smaller chunks – just enough to consume the leftover compute budget in decode batches – with the chunk size picked from the TBT SLO: AD → BP1 + AD → BP2 + AD at ~9 ms per iteration.
• Gain: ~uniform inter-token latency without sacrificing prefill progress.
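A minimal sketch of Sarathi-style stall-free batch formation under an assumed per-iteration token budget; the Req fields, budget, and tile size are illustrative, not Sarathi's actual implementation:

```python
from dataclasses import dataclass

TILE = 128   # keep chunks a multiple of the GPU matmul tile size (see "Picking Chunk Size")

@dataclass
class Req:
    rid: str
    prompt_len: int          # total prompt tokens
    prefill_done: int = 0    # prompt tokens already processed
    decoding: bool = False   # True once the prompt has been fully prefilled

def form_batch(running, token_budget=256):
    """Build one hybrid iteration: all ongoing decodes, plus prefill chunks sized to
    consume only the leftover token budget (budget derived from the TBT SLO)."""
    batch, used = [], 0
    # 1) Decodes are cheap (1 token each) and latency-critical: always include them.
    for r in running:
        if r.decoding:
            batch.append((r, 1))
            used += 1
    # 2) Spend the remaining budget on chunks of pending prefills.
    for r in running:
        if not r.decoding and used < token_budget:
            remaining = r.prompt_len - r.prefill_done
            chunk = min(remaining, token_budget - used)
            if chunk < remaining:                       # intermediate chunk: keep it tile-aligned
                chunk = (chunk // TILE) * TILE
            if chunk == 0:
                continue
            batch.append((r, chunk))
            r.prefill_done += chunk
            used += chunk
            if r.prefill_done == r.prompt_len:
                r.decoding = True                       # joins the decode set next iteration
    return batch

# Example: two ongoing decodes plus a 600-token prompt prefilled in chunks.
reqs = [Req("A", 80, 80, True), Req("B", 120, 120, True), Req("C", 600)]
for it in range(5):
    print(it, [(r.rid, n) for r, n in form_batch(reqs)])
```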
SARATHI in Action

(a) Baseline iteration-level scheduling: full prefills Ap, Bp, Cp, Dp run as whole iterations, followed by decode iterations (Ad1 Bd1, Cd1 Dd1, Ad2 Bd2); pipeline bubbles appear in between.
(b) SARATHI: chunked prefills with decode-maximal batching – prefills are split into chunks (Ap → Ap1, Ap2; Bp → Bp1, Bp2, Bp3; Cp → Cp1, Cp2; Dp → Dp1, Dp2) and each chunk is coalesced with ongoing decodes (e.g., Cp1 + Ad1, Dp1 + Ad2, Cp2 + Bd1), eliminating decode stalls and shrinking bubbles.
Picking Chunk Size
• Pick a chunk size that meets the TBT SLO
• A larger chunk is not always better: "tile quantization" in GPU matmuls causes extraneous computation if tensor shapes are not a multiple of the tile size
• Each chunk should be a multiple of the tile size

Example (per-token cost):
  Chunk size   Per-token time (us)
  128          0.43
  256          0.27
  257          0.36
Serving Capacity under SLOs

Setup: ShareGPT4 trace on A100 GPUs with strict (S) and relaxed (R) latency SLOs; Sarathi adapts using different chunk sizes.

Result: 5-6x higher capacity.
Summary

• Problem: state-of-the-art systems sacrifice decode latency to achieve higher throughput
• Key insight: the low arithmetic intensity of decodes allows adding compute-intensive prefills with negligible decode latency cost
• Key results: low latency and high throughput simultaneously, leading to up to 6x higher capacity under SLO constraints
• Industry adoption: available in all major serving frameworks and more
Is this a perfect solution?

• Balancing low latency and high prefill throughput is hard
  • The overhead of chunking increases as chunk size reduces
  • We need smaller chunks for lower TBT
  • Smaller chunks => lower prefill throughput
Colocation → Coupled Parallelism

With colocation, prefill and decode must share the same parallelism strategy (pipeline vs. tensor parallelism), even when, e.g., TTFT is tight and TPOT is loose – prefill and decode have different preferences.
Summary: Problems caused by Colocation

• Continuous batching causes interference between prefill and decode
• The parallelism strategy is coupled across the two phases

Is there a better way to achieve better goodput per GPU?
Prompt computation vs. token generation

Example: prompt "Which is better, pizza or burger?" → first token "Pizza", then the rest of the output tokens ("is better.").

• Prompt phase: compute and power intensive; limited batching benefits
• Token phase: memory intensive; batching improves throughput
Inefficient to run both on the same hardware
• Today, the prompt phase and token phase of a request run on the same GPUs of the same server.

Splitwise/DistServe split the phases onto different servers
• The prompt server computes the prompt phase and the first token (small batches, maximum power); the token server generates the rest of the output tokens (large batches, power capped).
• Each phase could even use its preferred hardware.

Splitting inference requires fast state transfers
• The KV cache produced on the prompt server must be shipped to the token server.
• Splitwise uses GPU InfiniBand to ship this state from the prompt GPUs to the token GPUs.
Parallelize transfers for high bandwidth

• Assuming 4-way tensor parallelism, each prompt GPU holds 1/4th of the KV cache and ships it to the corresponding token GPU over its own 25 GB/s link.
• Total transfer bandwidth: 100 GB/s.
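A rough sketch of the arithmetic behind these transfers, with assumed KV dimensions (a GQA-style 70B-class model in FP16) and the 25 GB/s-per-link, 4-way tensor-parallel setup from the slide; the numbers are illustrative, not Splitwise's measurements:

```python
# Back-of-the-envelope KV-cache transfer time from the prompt server to the token server.
num_layers = 80
num_kv_heads = 8            # grouped-query attention (illustrative)
head_dim = 128
bytes_per_elem = 2          # FP16
prompt_tokens = 2048

# K and V per token, across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
kv_bytes_total = kv_bytes_per_token * prompt_tokens

per_link_bw = 25e9          # 25 GB/s per GPU pair
tp_degree = 4               # each GPU ships its 1/4th of the cache in parallel
aggregate_bw = per_link_bw * tp_degree   # 100 GB/s

serial_time = kv_bytes_total / per_link_bw
parallel_time = kv_bytes_total / aggregate_bw

print(f"KV cache for the prompt: {kv_bytes_total/1e9:.2f} GB")
print(f"one link : {serial_time*1e3:.1f} ms")
print(f"4 links  : {parallel_time*1e3:.1f} ms (and Splitwise further overlaps this with prompt layers)")
```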
Transfer overheads may still be large

• Done naively, the prompt server finishes the prompt phase, then transfers the state, and only then can the token server start the token phase – the transfer shows up as added delay.
• KV cache sizes can be hundreds of GBs!
Splitwise overlaps transfer with the prompt phase

• Start shipping the KV-cache after the first prompt layer completes, so most of the transfer is hidden behind the remaining prompt computation.
Disaggregation achieves better goodput

• Colocate: 1 GPU serves both prefill and decode.
• Disaggregate (2P1D): 2 GPUs for prefill + 1 GPU for decode.
• Simple disaggregation achieves 2x goodput (per GPU).
Evaluation

DistServe achieves 2.0x - 4.48x higher goodput compared to vanilla vLLM:
● Chatbot: 2.0 - 3.4x
● Code completion: 3.2x
● Summarization: 4.5x

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang.
Power-aware

[Figure: allocated power over time for prompt phases vs. token phases; 3x BLOOM-176B inference requests on 8 GPUs. Prompt phases use the full allocated power, while token phases leave power headroom.]
What are issues or potential downsides with disaggregation?

• Inefficient resource utilization compared to colocation-based designs like Sarathi
• Operational overhead
Goal of the walkthrough (vLLM code)

1. Understand how vLLM processes a request and generates its outputs. Background for later content.
2. Learn where to modify if you would like to make a specific modification/contribution.
User Interface

End-user interface (vllm/entrypoints):
● LLM (vllm/entrypoints/llm.py) – synchronous, batched inference
● api_server (vllm/entrypoints/api_server.py) – simple demo API server
● openai_api_server (vllm/entrypoints/openai/api_server.py) – OpenAI-compatible API server + background engine loop
● User custom server

Developer interface (vllm/engine):
● LLMEngine (vllm/engine/llm_engine.py) – synchronous
  def add_request()
  def abort_request()
  def step() → List[RequestOutput]  # stream one token output for a batch of requests
● AsyncLLMEngine (vllm/engine/async_llm_engine.py) – asynchronous
  async def generate()
  async def abort()
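For orientation, a short usage sketch of the two interfaces. The calls mirror the vLLM API from the era of this walkthrough (exact signatures vary across vLLM versions), and the model name is just an example.

```python
from vllm import LLM, SamplingParams, EngineArgs, LLMEngine

params = SamplingParams(temperature=0.8, max_tokens=64)

# End-user interface: synchronous, batched inference.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["The future of Artificial Intelligence"], params)
print(outputs[0].outputs[0].text)

# Developer interface: drive the engine loop yourself.
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
engine.add_request(request_id="0",
                   prompt="The future of Artificial Intelligence",
                   sampling_params=params)
while engine.has_unfinished_requests():
    step_outputs = engine.step()      # one scheduler + model iteration, streams token outputs
    for out in step_outputs:
        if out.finished:
            print(out.outputs[0].text)
```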
Requests (prompts) flow into the LLMEngine (vllm/engine/llm_engine.py), which splits work between:

Centralized Controller (same process as LLMEngine, CPU only):
● Scheduler (vllm/core/scheduler.py)
● BlockSpaceManager (vllm/core/block_manager.py)
● BlockAllocator (GPU/CPU) (vllm/core/block_manager.py)

Distributed Workers (distributed processes with GPUs):
● N× Worker (vllm/worker/worker.py)
● CacheEngine (vllm/worker/cache_engine.py)
● Worker.model (vllm/model_executor/models)
● PagedAttention (vllm/model_executor/layers/attention.py)
Before serving requests
1. Initialize & load model

LLMEngine (vllm/engine/llm_engine.py) → Worker ×N (vllm/worker/worker.py) → Worker.model (vllm/model_executor/models)
● model.load_weights (vllm/model_executor/models): initialize & load from the HuggingFace Model Hub
Before serving requests
2. Profile memory usage

LLMEngine (vllm/engine/llm_engine.py) → Worker ×N (vllm/worker/worker.py) → Worker.profile_num_available_blocks → Worker.model

● Profile the memory usage with a batch with the max possible #tokens.
● #GPU KV blocks = remaining memory / block byte size
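A sketch of the block-count arithmetic from this step. The block byte size accounts for K and V, for every layer and head, for block_size tokens; the model and GPU numbers are illustrative, and real vLLM measures the peak activation usage by profiling rather than assuming it.

```python
# How many GPU KV blocks can we pre-allocate?
gpu_mem = 80e9                      # A100-80GB (illustrative)
weight_bytes = 14e9                 # e.g., a 7B model in FP16
peak_activation_bytes = 6e9         # measured by profiling a max-size batch (illustrative)
gpu_memory_utilization = 0.90       # fraction of GPU memory the engine is allowed to use

remaining = gpu_mem * gpu_memory_utilization - weight_bytes - peak_activation_bytes

# Bytes per KV block = 2 (K and V) * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes
block_size = 16                     # tokens per block
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
block_bytes = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes

num_gpu_blocks = int(remaining // block_bytes)
print(f"{block_bytes/1e6:.2f} MB per block -> {num_gpu_blocks} GPU KV blocks "
      f"(~{num_gpu_blocks * block_size} cacheable tokens)")
```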
Before serving requests
3. Pre-allocate KV blocks

LLMEngine (vllm/engine/llm_engine.py)
● Scheduler (vllm/core/scheduler.py) → BlockSpaceManager / BlockAllocator (GPU/CPU) (vllm/core/block_manager.py): initialize the block table
● Worker ×N (vllm/worker/worker.py) → CacheEngine (vllm/worker/cache_engine.py): allocate KV cache memory
When requests arrive

Request prompt: "The future of Artificial Intelligence"

LLMEngine (vllm/engine/llm_engine.py):
1. Tokenization: [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286]
2. Add to the scheduler's waiting queue

Scheduler (vllm/core/scheduler.py) maintains three queues:
● Waiting request queue
● Running request queue
● Swapped request queue
At every step, the scheduler

Scheduler (vllm/core/scheduler.py) decides the set of requests to run at the current step:
● When there is free KV block memory: waiting → running
● When no KV block memory is available for new tokens:
  Swapping: running → swapped
  Recomputation: running → waiting

BlockSpaceManager (vllm/core/block_manager.py):
● Allocate space for new KV cache
● Handle block sharing
● Swap/delete preempted KV blocks

BlockAllocator (GPU/CPU) (vllm/core/block_manager.py) → cache instructions & block table
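A condensed, illustrative sketch of this step (a toy counting block manager stands in for BlockSpaceManager, and the preemption policy is simplified):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Seq:
    sid: str
    prompt_blocks: int       # KV blocks needed to admit the prompt
    used_blocks: int = 0

class ToyBlockManager:
    """Counting-only stand-in for vLLM's BlockSpaceManager (illustrative, not the real API)."""
    def __init__(self, num_gpu_blocks: int):
        self.free_blocks = num_gpu_blocks
    def can_allocate(self, seq):     return self.free_blocks >= seq.prompt_blocks
    def allocate(self, seq):         self.free_blocks -= seq.prompt_blocks; seq.used_blocks = seq.prompt_blocks
    def can_append_slot(self, seq):  return self.free_blocks >= 1   # simplification: one new block per step
    def append_slot(self, seq):      self.free_blocks -= 1; seq.used_blocks += 1
    def can_swap_out(self, seq):     return True                    # pretend CPU swap space is always available
    def swap_out(self, seq):         self.free_blocks += seq.used_blocks   # blocks move to CPU
    def free_seq(self, seq):         self.free_blocks += seq.used_blocks; seq.used_blocks = 0

def schedule_step(waiting: deque, running: list, swapped: deque, bm: ToyBlockManager):
    """One scheduling step: keep running requests supplied with KV space, preempt on
    memory pressure (swap or recompute), then admit waiting requests into the batch."""
    scheduled = []
    for seq in list(running):
        if bm.can_append_slot(seq):
            bm.append_slot(seq)                                          # space for this step's new token
            scheduled.append(seq)
        elif bm.can_swap_out(seq):
            running.remove(seq); bm.swap_out(seq); swapped.append(seq)   # running -> swapped
        else:
            running.remove(seq); bm.free_seq(seq); waiting.appendleft(seq)  # running -> waiting (recompute)
    while waiting and bm.can_allocate(waiting[0]):
        seq = waiting.popleft()
        bm.allocate(seq)                                                 # waiting -> running
        running.append(seq); scheduled.append(seq)
    return scheduled

# Example: a tiny pool of 8 blocks quickly forces preemption.
waiting, running, swapped = deque([Seq("A", 3), Seq("B", 4), Seq("C", 4)]), [], deque()
bm = ToyBlockManager(num_gpu_blocks=8)
for step in range(4):
    batch = schedule_step(waiting, running, swapped, bm)
    print(step, [s.sid for s in batch], "free:", bm.free_blocks)
```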
At every step, the worker

N× Worker (vllm/worker/worker.py) receives the cache instructions and the token IDs & block table:
● CacheEngine (vllm/worker/cache_engine.py): swap blocks between GPU & CPU; perform copy-on-write for shared blocks
● Worker.model (vllm/model_executor/models): execute the model with tensor parallelism
At every step, the model

Worker.model (vllm/model_executor/models) is a PyTorch model (a tensor-model-parallel shard). Given the token IDs & block table, it runs:
● PagedAttention (vllm/model_executor/layers/attention.py): xformers/FlashAttention for the prompt, the PagedAttention kernel for generation
● ParallelLinear (vllm/parallel_utils/layers.py) and QuantizedLinear (vllm/model_executor/layers/quantized_linear/): optimized kernels for other layers & distributed execution
● Faster kernels (e.g., LayerNorm) (vllm/model_executor/layers/…)
● Sampler (vllm/model_executor/layers/sampler.py): greedy/random/beam search → generated tokens
Streaming results

Each step, the workers return 1 new token ID for each request. The LLMEngine (vllm/engine/llm_engine.py) detokenizes the new tokens and streams results back to the clients, while unfinished requests are fed back to the Scheduler (vllm/core/scheduler.py) for the next step, which again drives the distributed workers (Worker, CacheEngine, Worker.model, PagedAttention).
