
LLM Scheduling

Jayashree Mohan
MSR India
[email protected]

IISc
04/03/2025
Last class
• LLM Inference Lifecycle
• Characteristics and challenges
• Inference serving stack – vLLM: a memory perspective
• PagedAttention
• vAttention
Today
• e2e LLM Serving
• Scheduling
The LLM serving stack:
• Scheduling and routing – Orca, Sarathi, DistServe
• Memory management – KV cache, PagedAttention, vAttention
• Efficient, fast kernels – FlashAttention, FlashDecoding, sliding-window attention
• Efficient decoding – speculative decoding, quantization
Acknowledgements
• Some slides borrowed and adapted from the Anyscale, DistServe, Sarathi, and Splitwise conference presentations, Arvind Krishnamurthy's slides, and the vLLM meetup
E2e LLM Serving
• Apps 1…k submit requests to a global request queue
• A load balancer (e.g., round robin) distributes requests across replicas
• Each replica has its own request queue with a prioritization policy (e.g., FCFS)
• A replica scheduler (Sarathi, Orca, vLLM, etc.) batches requests onto the replica's GPUs/CPUs

Scheduling

• No batching: requests R1–R4 run one after another on the GPU
• Batching (batch size = 2): R1 and R2 run together, then R3 and R4
Why batch?
Exploit hardware parallelism, better serving efficiency
Static batching

● Batching multiple sequences on the GPU, aka "static batching"
● Legend: yellow = prompt token, blue = generated token, red = end-of-sequence token
● Problem: GPU utilization drops as sequences complete

source: AnyScale
Continuous batching

● Top: static batching; bottom: continuous batching
● Legend: yellow = prompt token, blue = generated token, red = end-of-sequence token
● As a sequence emits its end-of-sequence token, a new sequence is slotted into its place

Orca [OSDI 22]
source: AnyScale
Solution: Iteration-Level Scheduling (Orca)

• The inference server keeps a request pool; the scheduler selects requests from the pool
• The execution engine runs exactly one iteration for the selected batch and returns the result
• Finished requests leave the pool immediately; new requests can join at any iteration

* maximum batch size = 3    source: Orca OSDI 2022 Talk
Example

Request   Given input   iter 1   iter 2    iter 3   iter 4   iter 5
x1        I think       this     idea      is       bad      <eos>
x2        I love        the      asian     food     <eos>
x3        my cat is     so       cute      <eos>
x4        what is       your     opinion   <eos>
Default: Request Scheduling

* Assume we set the max_batch_size = 3

• Timestep 1: the scheduler forms a batch with x2 ("I love") and x3 ("my cat is") and hands it to the execution engine.
• Timestep 2: one decode step runs for the whole batch: x2 → "the", x3 → "so".
• Timestep 3: a new request x4 ("what is") arrives but must wait in the request queue; meanwhile x2 → "asian", x3 → "cute".
• Timestep 4: x3 finishes ("my cat is so cute <eos>") but its batch slot stays occupied; x4 keeps waiting.
• Timestep 5: x2 finishes ("I love the asian food <eos>"); finished requests sit inactive while the engine waits for the entire batch, introducing latency.
• Timestep 6: only after the whole batch drains does the scheduler admit the waiting x4, together with newly arrived requests such as x1 ("I think"), into a fresh batch.
Drawback: High Latency
1. Low efficiency due to inactive requests
2. Newly arrived requests wait until all requests in the current batch have finished
Orca in action

* Assume we set the max_batch_size = 3

• Timestep 0: the scheduler selects x2 ("I love") and x3 ("my cat is") from the request pool for this iteration.
• Timestep 1: the execution engine runs one iteration and returns the result to the pool: x2 → "the", x3 → "so".
• Timestep 2: the next iteration runs on x2 ("I love the") and x3 ("my cat is so") and returns x2 → "asian", x3 → "cute".
• Timestep 3: a new request x4 ("what is") joins the request pool; after this iteration the pool holds x2 ("I love the asian"), x3 ("my cat is so cute"), and x4.
• Timestep 4: x2, x3, and x4 run together; x2 → "food", x4 → "your", and x3 finishes ("my cat is so cute <eos>"). x3 exits the engine and its response is returned immediately, without waiting for the rest of the batch.
• Timestep 5: x2 finishes ("I love the asian food <eos>") and is returned; x4 → "opinion"; a new request x1 ("I think") joins the pool.
• Timestep 6: the engine runs the next iteration with x1 batched alongside the remaining requests.

Allows requests to enter and exit the execution engine at iteration granularities:
- No more wasted wait time before returning results
- Efficient use of resources
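To make the iteration-level loop concrete, here is a minimal Python sketch of a continuous-batching engine in the spirit of Orca; the Request, RequestPool, and model_forward names are illustrative, not Orca's or vLLM's actual APIs.

```python
from dataclasses import dataclass, field
from collections import deque

MAX_BATCH_SIZE = 3   # mirrors the "max_batch_size = 3" assumption in the slides

@dataclass
class Request:
    rid: str
    prompt_tokens: list
    generated: list = field(default_factory=list)
    finished: bool = False

class RequestPool:
    """Stand-in for Orca's request pool: requests wait here between iterations."""
    def __init__(self):
        self.pool = deque()
    def add(self, req):
        self.pool.append(req)
    def remove(self, req):
        self.pool.remove(req)
    def select(self, k):
        # FCFS: pick up to k unfinished requests for this iteration.
        return [r for r in self.pool if not r.finished][:k]

def model_forward(req):
    # Dummy stand-in for one transformer pass (prefill or a single decode step).
    return "<eos>" if len(req.generated) >= 2 else f"tok{len(req.generated)}"

def run_one_iteration(batch):
    """Execution engine: run exactly ONE iteration for the selected batch."""
    for req in batch:
        token = model_forward(req)
        req.generated.append(token)
        if token == "<eos>":
            req.finished = True

def engine_loop(pool, emit):
    while True:
        batch = pool.select(MAX_BATCH_SIZE)   # scheduler: select requests for this iteration
        if not batch:
            break
        run_one_iteration(batch)
        for req in batch:
            if req.finished:                  # finished requests exit immediately...
                pool.remove(req)
                emit(req)                     # ...and their responses are returned right away
        # Requests added to the pool meanwhile are eligible at the very next iteration.

pool = RequestPool()
pool.add(Request("x2", ["I", "love"]))
pool.add(Request("x3", ["my", "cat", "is"]))
engine_loop(pool, emit=lambda r: print(r.rid, r.generated))
```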
How to perform iteration-level batching?
• To exploit parallelism with batching, all requests need to have the same number of tokens processed.
• This is because the Attention mechanism requires input tensors to have the same shape (which is based on the number of processed tokens).
• Need to coalesce B × [L, H] tensors into a [B, L, H] tensor for batching.
  (B = batch size, L = input length, H = hidden size)
How to perform iteration-level batching?
• Batching is only applicable when
  • requests are in the same phase (prefill or decode)
  • requests have the same length

Three cases cannot be batched normally (e.g., X1: "I think this is", X2: "I love you", X3: "A person"):
1. Both requests are in the prefill phase but have different numbers of input tokens
2. Both requests are in the decode phase but are processing tokens at different indices
3. The requests are in different phases (one in prefill, one in decode)
In each case, the attention matrices have different shapes.
Orca Uses Selective Batching
Not all operations are incompatible with irregularly shaped tensors.
Matrix multiplication and layer normalization can be batched, because they do not distinguish between different requests.

Example: X1 and X2 in decode contribute [1, H] each, X3 ("A person is") in prefill contributes [3, H]; flattened together they form a single [5, H] tensor that is fed through LayerNorm and the Linear layers, while Attention is handled per request.
Solution: Selective Batching
Instead of batching all tensor operations, only batch the ones you can.
• Flatten the input tensors into a 2-D tensor of shape [∑L, H], and feed it into the non-Attention operations (Linear, LayerNorm, Add, GeLU).
• For Attention operations, split the flattened tensor back into individual requests and run the Attention operation separately for each. Once completed, merge the requests back.
• An Attention K/V manager provides Keys/Values separately for each request.
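To make selective batching concrete, here is a minimal PyTorch sketch (illustrative, not Orca's code): the non-attention operators run once on the flattened [∑L, H] tensor, while attention runs per request. For simplicity it uses plain self-attention over each request's rows and ignores the K/V cache.

```python
import torch
import torch.nn as nn

H = 64  # hidden size (illustrative)

# Requests with different lengths: decode requests contribute 1 token, prefills contribute L tokens.
lengths = [1, 1, 3]                       # e.g., X1, X2 in decode, X3 in prefill
xs = [torch.randn(L, H) for L in lengths]

# 1) Non-attention ops (Linear, LayerNorm, Add, GeLU) don't care which request a row belongs to,
#    so flatten everything into a single [sum(L), H] tensor and batch them.
flat = torch.cat(xs, dim=0)               # [5, H] in this example
linear = nn.Linear(H, H)
norm = nn.LayerNorm(H)
flat = norm(nn.functional.gelu(linear(flat)))

# 2) Attention needs per-request shapes (and per-request K/V), so split the flat tensor back
#    into individual requests, run attention separately, then merge.
def attention_per_request(x):             # hypothetical single-request attention
    scores = x @ x.transpose(0, 1) / (H ** 0.5)
    return torch.softmax(scores, dim=-1) @ x

outs, start = [], 0
for L in lengths:
    chunk = flat[start:start + L]         # this request's rows
    outs.append(attention_per_request(chunk))
    start += L

merged = torch.cat(outs, dim=0)           # back to [sum(L), H] for the next layer
```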
Scheduling Algorithm: the Attention K/V Manager

• Every processed token (prompt or generated) needs a K/V slot in the manager's fixed-size pool.
• Iteration 1: a request with prompt "my cat is" is admitted; slots are allocated for "my", "cat", "is" and then for its generated tokens (e.g., "so").
• Iteration 2: a second request with prompt "I" is admitted; slots are allocated for "I" and its generated tokens (e.g., "like").
• Iteration 3: the pool is exhausted before either request reaches <eos> – neither can allocate the slot it needs for its next token. Deadlock!!
Scheduling Algorithm

Solution: Memory-Aware Allocation – reserve enough GPU space for all iterations at the 1st iteration itself.

Predetermine 'max_tokens', the maximum number of tokens any request can generate.
Example: with max_tokens = 5, admitting the request "my cat is" reserves K/V slots for its 3 prompt tokens plus up to 5 generated tokens at the 1st iteration itself; a request is only admitted if that full reservation fits.
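A minimal sketch of this memory-aware admission policy, assuming each request declares max_tokens up front; the KVManager class and slot counts are illustrative, not Orca's implementation.

```python
class KVManager:
    """Fixed pool of K/V slots; every prompt or generated token consumes one slot."""
    def __init__(self, total_slots: int):
        self.total_slots = total_slots
        self.reserved = 0

    def try_admit(self, prompt_len: int, max_tokens: int) -> bool:
        # Reserve the worst-case footprint up front, so the request can never
        # deadlock waiting for a slot mid-generation.
        need = prompt_len + max_tokens
        if self.reserved + need > self.total_slots:
            return False          # keep the request queued instead of admitting it
        self.reserved += need
        return True

    def release(self, prompt_len: int, max_tokens: int):
        # On completion, give back everything that was reserved (used or not).
        self.reserved -= prompt_len + max_tokens


# Example: a pool of 10 slots with max_tokens = 5, as in the slide.
kv = KVManager(total_slots=10)
print(kv.try_admit(prompt_len=3, max_tokens=5))   # "my cat is" -> reserves 8 slots -> True
print(kv.try_admit(prompt_len=1, max_tokens=5))   # "I" -> would need 6 more slots -> False, must wait
```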
Recall: What can we do better?
• Dynamic memory allocation
• But deadlocks can still happen!
• vLLM: checkpoint and recompute!
  • "I do not have enough memory to complete decoding" → move that request's KV tokens to DRAM (swap out)
  • When memory frees up, move the tokens back to HBM (swap in), or recompute them
Results
Takeaway

• Orca introduced primitives to enable iteration-level scheduling
• But how do we meet latency SLOs?
• How do we schedule requests to improve overall system throughput while meeting their SLOs?
Scheduling with latency-throughput constraints

• TTFT (time to first token): initial response time
• TPOT (time per output token): average time between two subsequent generated tokens

Chatbot: needs a fast initial response (tight TTFT); TPOT only needs to keep up with human reading speed (P99 latency = 250 ms)
Summarization: the user can tolerate a longer initial response (relaxed TTFT); the output is consumed as data, so TPOT is tight (P99 latency = 35 ms)
High Throughput ≠ High Goodput

• Throughput = completed requests / time (e.g., 10 rps)
• Goodput = completed requests within SLO / time (e.g., only 3 rps)
• A system with high throughput can still have low goodput – and low goodput means poor UX
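The distinction is easy to compute from per-request latencies; a minimal sketch with made-up numbers (the 10 rps / 3 rps figures mirror the slide, the SLO thresholds are illustrative):

```python
# (ttft_seconds, tpot_seconds) for each request completed in a 1-second window.
completed = [
    (0.15, 0.04), (0.20, 0.09), (0.18, 0.26), (1.20, 0.05), (0.22, 0.31),
    (0.95, 0.12), (0.19, 0.08), (0.40, 0.27), (0.16, 0.33), (0.50, 0.06),
]
window_seconds = 1.0
TTFT_SLO, TPOT_SLO = 0.25, 0.10      # illustrative SLO thresholds

throughput = len(completed) / window_seconds
goodput = sum(1 for ttft, tpot in completed
              if ttft <= TTFT_SLO and tpot <= TPOT_SLO) / window_seconds

print(f"throughput = {throughput:.0f} rps")   # 10 rps: every completed request counts
print(f"goodput    = {goodput:.0f} rps")      # 3 rps: only requests meeting both SLOs count
```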
Impact of batching on Prefill phase
▪ Excess parallelism: a prompt can contain 1000s of tokens
▪ Beyond a certain point, increasing prompt size increases latency, not throughput

[Plots: latency (ms) and throughput (tokens per second, x1000) vs. prompt size from 128 to 2048 tokens]
Impact of batching on Decode phase
▪ Lacks parallelism: one token per request
▪ Latency is almost entirely dominated by the time to read model weights
▪ Increasing batch size increases throughput, not latency

[Plots: latency (ms) and throughput (tokens per second) vs. batch size from 1 to 32]
Scheduling objectives
High throughput, low latency:
▪ Time-to-first-token (TTFT) – can be somewhat relaxed
  • Scheduling delay + prefill time
▪ Time-between-tokens (TBT) – constrained
  • Influences request execution time and user experience
▪ Serving capacity
  • Maximum QPS serviceable under a specific latency SLO
Batching is essential for high throughput

▪ Because decodes are memory-bound

[Plot: decode throughput (tokens per second) vs. batch size from 1 to 32]
Key scheduling challenge

▪ How do we increase the decode batch size?
▪ A request needs to compute its prefill first
▪ Batching therefore involves an interleaving of prefills and decodes

How do we efficiently interleave prefills and decodes?
LLM serving systems today

• Prefill-prioritizing schedules: while the decode batch (Ad, Bd, Cd) is running, C exits and D, E enter; their prefills (Dp, Ep) are scheduled next, and decoding then resumes with the larger batch (Ad, Bd, Dd, Ed). The interposed prefills increase TBT latency.
• Decode-prioritizing schedules: the decode batch (Ad, Bd, Cd) is run down to completion (Ad, Bd → … → Bd) before the prefills Dp, Ep are admitted. Decodes run with a low batch size.
Can we get stall-free scheduling without compromising on throughput?


How is scheduling done today?

• Option 1 – Prefill prioritizing (e.g., vLLM): when C and D enter, their prefills Cp, Dp run before the next decode iteration (Ad, Bd, Cd, Dd). Decodes for requests A and B are stalled.
• Option 2 – Decode prioritizing (e.g., FasterTransformer): decodes Ad, Bd run until A and B exit, and only then does Cp run. The low batch size gives poor throughput.

Can we process the prefill of a new request without pausing the ongoing decodes?
Scheduling trade-off

[Figure: throughput (QPS) vs. TBT (latency). Prefill-prioritizing schedulers (Orca, vLLM) reach high throughput at high TBT; decode-prioritizing schedulers (FasterTransformer) reach low TBT at low throughput; LightLLM (dynamic priority) sits in between; the ideal scheduler would deliver both.]

Existing systems compromise on either throughput or latency
Mixed Batching (Mistral 7B on A100; decode latency SLO = 10 ms)

• vLLM baseline: a decode-only iteration AD takes 8 ms, but when request B arrives its full prefill BP runs as its own iteration and takes 24 ms, stalling A's decode and violating the SLO.
• Mixed batching combines the prefill and decode computation in one batch: BP + AD takes 16 ms.
• But we still violate SLOs – latency remains high because a long prefill is combined with the decodes.
Key Insight

Chunked prefills: prefills can be split into smaller chunks that only marginally add to the decode batch time.
Observation: Arithmetic Intensity Slack

• Prefill: latency is dominated by compute and grows linearly with the number of tokens.
• Decode: latency is dominated by weight fetch time and is largely independent of the number of tokens, because the phase is constrained by memory overhead.
• A decode batch therefore leaves compute slack that extra prefill tokens can use almost for free.
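A back-of-the-envelope sketch of that slack, using rough A100 peak numbers and the common ≈2·params FLOPs-per-token approximation; all values are illustrative:

```python
# Roofline-style estimate of decode "slack" for a 7B-parameter model on an A100.
# Peak numbers and the 2*params FLOPs/token rule are rough approximations;
# KV-cache reads and other overheads are ignored.
params = 7e9
bytes_per_param = 2                 # FP16 weights
peak_flops = 312e12                 # ~A100 FP16 tensor-core peak
hbm_bw = 2.0e12                     # ~A100 HBM bandwidth (bytes/s)

weight_fetch_time = params * bytes_per_param / hbm_bw   # fixed cost per decode iteration
flops_per_token = 2 * params

def iteration_time(num_tokens):
    compute_time = num_tokens * flops_per_token / peak_flops
    return max(compute_time, weight_fetch_time)          # memory-bound until compute catches up

print(f"weight fetch: {weight_fetch_time*1e3:.1f} ms per iteration")
for n in (8, 64, 128, 256, 512):
    print(f"{n:4d} tokens in the batch -> {iteration_time(n)*1e3:.1f} ms")

slack_tokens = weight_fetch_time * peak_flops / flops_per_token
print(f"~{slack_tokens:.0f} tokens fit before the iteration becomes compute-bound")
```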
Stall-free batching in Sarathi (decode latency SLO = 10 ms)

• Baseline (vLLM): AD (8 ms) → BP (24 ms) → AD, BD → AD, BD. The monolithic prefill BP stalls decodes and violates the SLO.
• Mixed batching: BP + AD = 16 ms – still violates the SLO.
• Sarathi-Serve: split large prefills into smaller chunks – just enough to consume the leftover compute budget in decode batches – with the chunk size picked from the TBT SLO: AD → BP1 + AD → BP2 + AD at ~9 ms per iteration.
• Gain: ~uniform inter-token latency without sacrificing prefill progress.
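A minimal sketch of Sarathi-style stall-free batch formation under an assumed per-iteration token budget; the Req fields, budget, and tile size are illustrative, not Sarathi's actual implementation:

```python
from dataclasses import dataclass

TILE = 128   # keep chunks a multiple of the GPU matmul tile size (see "Picking Chunk Size")

@dataclass
class Req:
    rid: str
    prompt_len: int          # total prompt tokens
    prefill_done: int = 0    # prompt tokens already processed
    decoding: bool = False   # True once the prompt has been fully prefilled

def form_batch(running, token_budget=256):
    """Build one hybrid iteration: all ongoing decodes, plus prefill chunks sized to
    consume only the leftover token budget (budget derived from the TBT SLO)."""
    batch, used = [], 0
    # 1) Decodes are cheap (1 token each) and latency-critical: always include them.
    for r in running:
        if r.decoding:
            batch.append((r, 1))
            used += 1
    # 2) Spend the remaining budget on chunks of pending prefills.
    for r in running:
        if not r.decoding and used < token_budget:
            remaining = r.prompt_len - r.prefill_done
            chunk = min(remaining, token_budget - used)
            if chunk < remaining:                       # intermediate chunk: keep it tile-aligned
                chunk = (chunk // TILE) * TILE
            if chunk == 0:
                continue
            batch.append((r, chunk))
            r.prefill_done += chunk
            used += chunk
            if r.prefill_done == r.prompt_len:
                r.decoding = True                       # joins the decode set next iteration
    return batch

# Example: two ongoing decodes plus a 600-token prompt prefilled in chunks.
reqs = [Req("A", 80, 80, True), Req("B", 120, 120, True), Req("C", 600)]
for it in range(5):
    print(it, [(r.rid, n) for r, n in form_batch(reqs)])
```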
SARATHI in Action

(a) Baseline iteration-level scheduling: full prefills Ap, Bp, Cp, Dp run as whole iterations, followed by decode iterations (Ad1 Bd1, Cd1 Dd1, Ad2 Bd2); pipeline bubbles appear in between.
(b) SARATHI: chunked prefills with decode-maximal batching – prefills are split into chunks (Ap → Ap1, Ap2; Bp → Bp1, Bp2, Bp3; Cp → Cp1, Cp2; Dp → Dp1, Dp2) and each chunk is coalesced with ongoing decodes (e.g., Cp1 + Ad1, Dp1 + Ad2, Cp2 + Bd1), eliminating decode stalls and shrinking bubbles.
Picking Chunk Size
• Pick a chunk size that meets the TBT SLO
• A larger chunk is not always better: "tile quantization" in GPU matmuls causes extraneous computation if tensor shapes are not a multiple of the tile size
• Each chunk should be a multiple of the tile size

Example (per-token cost):
  Chunk size   Per-token time (us)
  128          0.43
  256          0.27
  257          0.36
Serving Capacity under SLOs

Setup: ShareGPT4 trace on A100 GPUs with strict (S) and relaxed (R) latency SLOs; Sarathi adapts using different chunk sizes.

Result: 5-6x higher capacity.
Summary

• Problem: state-of-the-art systems sacrifice decode latency to achieve higher throughput
• Key insight: the low arithmetic intensity of decodes allows adding compute-intensive prefills with negligible decode latency cost
• Key results: low latency and high throughput simultaneously, leading to up to 6x higher capacity under SLO constraints
• Industry adoption: available in all major serving frameworks and more
Is this a perfect solution?

• Balancing low latency and high prefill throughput is hard
  • The overhead of chunking increases as chunk size reduces
  • We need smaller chunks for lower TBT
  • Smaller chunks => lower prefill throughput
Colocation → Coupled Parallelism

With colocation, prefill and decode must share the same parallelism strategy (pipeline vs. tensor parallelism), even when, e.g., TTFT is tight and TPOT is loose – prefill and decode have different preferences.
Summary: Problems caused by Colocation

• Continuous batching causes interference between prefill and decode
• The parallelism strategy is coupled across the two phases

Is there a better way to achieve better goodput per GPU?
Prompt computation vs. token generation

Example: prompt "Which is better, pizza or burger?" → first token "Pizza", then the rest of the output tokens ("is better.").

• Prompt phase: compute and power intensive; limited batching benefits
• Token phase: memory intensive; batching improves throughput
Inefficient to run both on the same hardware
• Today, the prompt phase and token phase of a request run on the same GPUs of the same server.

Splitwise/DistServe split the phases onto different servers
• The prompt server computes the prompt phase and the first token (small batches, maximum power); the token server generates the rest of the output tokens (large batches, power capped).
• Each phase could even use its preferred hardware.

Splitting inference requires fast state transfers
• The KV cache produced on the prompt server must be shipped to the token server.
• Splitwise uses GPU InfiniBand to ship this state from the prompt GPUs to the token GPUs.
Parallelize transfers for high bandwidth

• Assuming 4-way tensor parallelism, each prompt GPU holds 1/4th of the KV cache and ships it to the corresponding token GPU over its own 25 GB/s link.
• Total transfer bandwidth: 100 GB/s.
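A rough sketch of the arithmetic behind these transfers, with assumed KV dimensions (a GQA-style 70B-class model in FP16) and the 25 GB/s-per-link, 4-way tensor-parallel setup from the slide; the numbers are illustrative, not Splitwise's measurements:

```python
# Back-of-the-envelope KV-cache transfer time from the prompt server to the token server.
num_layers = 80
num_kv_heads = 8            # grouped-query attention (illustrative)
head_dim = 128
bytes_per_elem = 2          # FP16
prompt_tokens = 2048

# K and V per token, across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
kv_bytes_total = kv_bytes_per_token * prompt_tokens

per_link_bw = 25e9          # 25 GB/s per GPU pair
tp_degree = 4               # each GPU ships its 1/4th of the cache in parallel
aggregate_bw = per_link_bw * tp_degree   # 100 GB/s

serial_time = kv_bytes_total / per_link_bw
parallel_time = kv_bytes_total / aggregate_bw

print(f"KV cache for the prompt: {kv_bytes_total/1e9:.2f} GB")
print(f"one link : {serial_time*1e3:.1f} ms")
print(f"4 links  : {parallel_time*1e3:.1f} ms (and Splitwise further overlaps this with prompt layers)")
```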
Transfer overheads may still be large

• Done naively, the prompt server finishes the prompt phase, then transfers the state, and only then can the token server start the token phase – the transfer shows up as added delay.
• KV cache sizes can be hundreds of GBs!
Splitwise overlaps transfer with the prompt phase

• Start shipping the KV-cache after the first prompt layer completes, so most of the transfer is hidden behind the remaining prompt computation.
Disaggregation achieves better goodput

• Colocate: 1 GPU serves both prefill and decode.
• Disaggregate (2P1D): 2 GPUs for prefill + 1 GPU for decode.
• Simple disaggregation achieves 2x goodput (per GPU).
Evaluation

DistServe achieves 2.0x - 4.48x higher goodput compared to vanilla vLLM:
● Chatbot: 2.0 - 3.4x
● Code completion: 3.2x
● Summarization: 4.5x

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang.
Power-aware

[Figure: allocated power over time for prompt phases vs. token phases; 3x BLOOM-176B inference requests on 8 GPUs. Prompt phases use the full allocated power, while token phases leave power headroom.]
What are issues or potential downsides with disaggregation?

• Inefficient resource utilization compared to colocation-based designs like Sarathi
• Operational overhead
Goal of the walkthrough (vLLM code)

1. Understand how vLLM processes a request and generates its outputs. Background for later content.
2. Learn where to modify if you would like to make a specific modification/contribution.
User Interface

End-user interface (vllm/entrypoints):
● LLM (vllm/entrypoints/llm.py) – synchronous, batched inference
● api_server (vllm/entrypoints/api_server.py) – simple demo API server
● openai_api_server (vllm/entrypoints/openai/api_server.py) – OpenAI-compatible API server + background engine loop
● User custom server

Developer interface (vllm/engine):
● LLMEngine (vllm/engine/llm_engine.py) – synchronous
  def add_request()
  def abort_request()
  def step() → List[RequestOutput]  # stream one token output for a batch of requests
● AsyncLLMEngine (vllm/engine/async_llm_engine.py) – asynchronous
  async def generate()
  async def abort()
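For orientation, a short usage sketch of the two interfaces. The calls mirror the vLLM API from the era of this walkthrough (exact signatures vary across vLLM versions), and the model name is just an example.

```python
from vllm import LLM, SamplingParams, EngineArgs, LLMEngine

params = SamplingParams(temperature=0.8, max_tokens=64)

# End-user interface: synchronous, batched inference.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["The future of Artificial Intelligence"], params)
print(outputs[0].outputs[0].text)

# Developer interface: drive the engine loop yourself.
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
engine.add_request(request_id="0",
                   prompt="The future of Artificial Intelligence",
                   sampling_params=params)
while engine.has_unfinished_requests():
    step_outputs = engine.step()      # one scheduler + model iteration, streams token outputs
    for out in step_outputs:
        if out.finished:
            print(out.outputs[0].text)
```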
Requests (prompts) flow into the LLMEngine (vllm/engine/llm_engine.py), which splits work between:

Centralized Controller (same process as LLMEngine, CPU only):
● Scheduler (vllm/core/scheduler.py)
● BlockSpaceManager (vllm/core/block_manager.py)
● BlockAllocator (GPU/CPU) (vllm/core/block_manager.py)

Distributed Workers (distributed processes with GPUs):
● N× Worker (vllm/worker/worker.py)
● CacheEngine (vllm/worker/cache_engine.py)
● Worker.model (vllm/model_executor/models)
● PagedAttention (vllm/model_executor/layers/attention.py)
Before serving requests
1. Initialize & load model

LLMEngine (vllm/engine/llm_engine.py) → Worker ×N (vllm/worker/worker.py) → Worker.model (vllm/model_executor/models)
● model.load_weights (vllm/model_executor/models): initialize & load from the HuggingFace Model Hub
Before serving requests
2. Profile memory usage

LLMEngine (vllm/engine/llm_engine.py) → Worker ×N (vllm/worker/worker.py) → Worker.profile_num_available_blocks → Worker.model

● Profile the memory usage with a batch with the max possible #tokens.
● #GPU KV blocks = remaining memory / block byte size
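A sketch of the block-count arithmetic from this step. The block byte size accounts for K and V, for every layer and head, for block_size tokens; the model and GPU numbers are illustrative, and real vLLM measures the peak activation usage by profiling rather than assuming it.

```python
# How many GPU KV blocks can we pre-allocate?
gpu_mem = 80e9                      # A100-80GB (illustrative)
weight_bytes = 14e9                 # e.g., a 7B model in FP16
peak_activation_bytes = 6e9         # measured by profiling a max-size batch (illustrative)
gpu_memory_utilization = 0.90       # fraction of GPU memory the engine is allowed to use

remaining = gpu_mem * gpu_memory_utilization - weight_bytes - peak_activation_bytes

# Bytes per KV block = 2 (K and V) * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes
block_size = 16                     # tokens per block
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
block_bytes = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes

num_gpu_blocks = int(remaining // block_bytes)
print(f"{block_bytes/1e6:.2f} MB per block -> {num_gpu_blocks} GPU KV blocks "
      f"(~{num_gpu_blocks * block_size} cacheable tokens)")
```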
Before serving requests
3. Pre-allocate KV blocks

LLMEngine (vllm/engine/llm_engine.py)
● Scheduler (vllm/core/scheduler.py) → BlockSpaceManager / BlockAllocator (GPU/CPU) (vllm/core/block_manager.py): initialize the block table
● Worker ×N (vllm/worker/worker.py) → CacheEngine (vllm/worker/cache_engine.py): allocate KV cache memory
When requests arrive

Request prompt: "The future of Artificial Intelligence"

LLMEngine (vllm/engine/llm_engine.py):
1. Tokenization: [1, 450, 5434, 310, 3012, 928, 616, 3159, 28286]
2. Add to the scheduler's waiting queue

Scheduler (vllm/core/scheduler.py) maintains three queues:
● Waiting request queue
● Running request queue
● Swapped request queue
At every step, the scheduler

Scheduler (vllm/core/scheduler.py) decides the set of requests to run at the current step:
● When there is free KV block memory: waiting → running
● When no KV block memory is available for new tokens:
  Swapping: running → swapped
  Recomputation: running → waiting

BlockSpaceManager (vllm/core/block_manager.py):
● Allocate space for new KV cache
● Handle block sharing
● Swap/delete preempted KV blocks

BlockAllocator (GPU/CPU) (vllm/core/block_manager.py) → cache instructions & block table
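A condensed, illustrative sketch of this step (a toy counting block manager stands in for BlockSpaceManager, and the preemption policy is simplified):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Seq:
    sid: str
    prompt_blocks: int       # KV blocks needed to admit the prompt
    used_blocks: int = 0

class ToyBlockManager:
    """Counting-only stand-in for vLLM's BlockSpaceManager (illustrative, not the real API)."""
    def __init__(self, num_gpu_blocks: int):
        self.free_blocks = num_gpu_blocks
    def can_allocate(self, seq):     return self.free_blocks >= seq.prompt_blocks
    def allocate(self, seq):         self.free_blocks -= seq.prompt_blocks; seq.used_blocks = seq.prompt_blocks
    def can_append_slot(self, seq):  return self.free_blocks >= 1   # simplification: one new block per step
    def append_slot(self, seq):      self.free_blocks -= 1; seq.used_blocks += 1
    def can_swap_out(self, seq):     return True                    # pretend CPU swap space is always available
    def swap_out(self, seq):         self.free_blocks += seq.used_blocks   # blocks move to CPU
    def free_seq(self, seq):         self.free_blocks += seq.used_blocks; seq.used_blocks = 0

def schedule_step(waiting: deque, running: list, swapped: deque, bm: ToyBlockManager):
    """One scheduling step: keep running requests supplied with KV space, preempt on
    memory pressure (swap or recompute), then admit waiting requests into the batch."""
    scheduled = []
    for seq in list(running):
        if bm.can_append_slot(seq):
            bm.append_slot(seq)                                          # space for this step's new token
            scheduled.append(seq)
        elif bm.can_swap_out(seq):
            running.remove(seq); bm.swap_out(seq); swapped.append(seq)   # running -> swapped
        else:
            running.remove(seq); bm.free_seq(seq); waiting.appendleft(seq)  # running -> waiting (recompute)
    while waiting and bm.can_allocate(waiting[0]):
        seq = waiting.popleft()
        bm.allocate(seq)                                                 # waiting -> running
        running.append(seq); scheduled.append(seq)
    return scheduled

# Example: a tiny pool of 8 blocks quickly forces preemption.
waiting, running, swapped = deque([Seq("A", 3), Seq("B", 4), Seq("C", 4)]), [], deque()
bm = ToyBlockManager(num_gpu_blocks=8)
for step in range(4):
    batch = schedule_step(waiting, running, swapped, bm)
    print(step, [s.sid for s in batch], "free:", bm.free_blocks)
```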
At every step, the worker

N× Worker (vllm/worker/worker.py) receives the cache instructions and the token IDs & block table:
● CacheEngine (vllm/worker/cache_engine.py): swap blocks between GPU & CPU; perform copy-on-write for shared blocks
● Worker.model (vllm/model_executor/models): execute the model with tensor parallelism
At every step, the model

Worker.model (vllm/model_executor/models) is a PyTorch model (a tensor-model-parallel shard). Given the token IDs & block table, it runs:
● PagedAttention (vllm/model_executor/layers/attention.py): xformers/FlashAttention for the prompt, the PagedAttention kernel for generation
● ParallelLinear (vllm/parallel_utils/layers.py) and QuantizedLinear (vllm/model_executor/layers/quantized_linear/): optimized kernels for other layers & distributed execution
● Faster kernels (e.g., LayerNorm) (vllm/model_executor/layers/…)
● Sampler (vllm/model_executor/layers/sampler.py): greedy/random/beam search → generated tokens
Streaming results

Each step, the workers return 1 new token ID for each request. The LLMEngine (vllm/engine/llm_engine.py) detokenizes the new tokens and streams results back to the clients, while unfinished requests are fed back to the Scheduler (vllm/core/scheduler.py) for the next step, which again drives the distributed workers (Worker, CacheEngine, Worker.model, PagedAttention).
