CENG443
Heterogeneous Parallel Programming
CUDA Shared Memory
Işıl ÖZ, IZTECH, Fall 2024
31 December 2024
Image Blurring Kernel
// Get the average of the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
    for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}
In every iteration of the inner loop, one global memory access is performed for
one floating-point addition
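For context, a complete kernel around this loop nest could look as follows; this is a minimal sketch, and the kernel signature, pixel type, and BLUR_SIZE value are assumptions rather than something given on the slide:

#define BLUR_SIZE 1   // assumed box radius for this sketch

__global__ void blurKernel(const float *in, float *out, int w, int h) {
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    if (Col < w && Row < h) {
        float pixVal = 0.0f;
        int pixels = 0;
        // Get the average of the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
        for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
                int curRow = Row + blurRow;
                int curCol = Col + blurCol;
                if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
                    pixVal += in[curRow * w + curCol];   // one global memory load per addition
                    pixels++;
                }
            }
        }
        out[Row * w + Col] = pixVal / pixels;            // write the averaged pixel
    }
}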
Compute-to-Global Memory Access Ratio
The number of floating-point calculations performed for each access to the global memory within a region of a program
It is 1 for the image blur kernel: one floating-point add per global memory access
Memory-bound programs: execution speed is limited by memory access throughput
GPU Performance
Consider a GPU device with a global memory bandwidth of 1,000 GB/s (1 TB/s)
Each single-precision floating-point value is 4 bytes
1000/4 = 250 giga single-precision operands can be loaded per second
So no more than 250 giga floating-point operations per second (GFLOPS) if each operation needs one operand from global memory
Peak single-precision performance of 12 TFLOPS
Only a tiny fraction of peak: 250/12,000 ≈ 2%
Need to find ways of reducing global memory
accesses!
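Stated as a worked bound, with compute-to-global-memory-access ratio $r$ and bandwidth $B$ (both taken from this slide):

\[
\text{attainable FLOP/s} \le r \cdot \frac{B}{4\ \text{bytes}} = 1 \cdot \frac{1000\ \text{GB/s}}{4\ \text{B}} = 250\ \text{GFLOPS} \approx 2\%\ \text{of the 12 TFLOPS peak}
\]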
CUDA Memories
[Figure: CUDA device memory model. Each block has its own shared memory and each thread its own registers; all threads in the grid, and the host, access global memory and constant memory]
Registers
[Figure: CUDA memory model with the register files highlighted]
• Fastest
• Only accessible by a thread
• Lifetime of a thread
• Limited capacity of registers
Local Variables
All scalar variables declared in kernel and device
functions are placed into registers
Automatic array variables are not stored in
registers (The compiler may decide to store an
array into registers if all accesses are done with
constant index values)
Similar to scalar variables, the scope of these
arrays is limited to individual threads
Once a thread terminates its execution, the
contents of its local variables also cease to exist
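A minimal sketch of these rules (the kernel and variable names here are made up for illustration):

__global__ void localVars(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // scalar variable: placed in a register
    float tmp = 2.0f * data[i];                      // scalar variable: placed in a register
    float buf[8];                                    // automatic array: generally not kept in registers
    buf[0] = tmp;                                    // (constant indices like this may let the compiler use registers)
    data[i] = buf[0] + tmp;                          // all of these variables disappear when the thread terminates
}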
Image Blurring Kernel
// Get the average of the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
    for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}
Shared Memory
[Figure: CUDA memory model with the per-block shared memories highlighted]
• Extremely fast (~4 cycles)
• Highly parallel
• Restricted to a block
• Small (typically 48 KB per SM)
Shared Memory in CUDA
A special type of memory whose contents are
explicitly defined and used in the kernel source
code
One in each SM
Accessed at much higher speed (in both latency and
throughput) than global memory
Scope of access and sharing - thread blocks
Lifetime – thread block; contents disappear after the corresponding thread block terminates execution
Accessed by memory load/store instructions
A form of scratchpad memory in computer architecture
Shared Memory in Fermi
64KB configurable shared memory and L1 cache
48 KB shared memory and 16 KB L1 cache OR
16 KB shared memory and 48 KB L1 cache
With shared memory, you get full control as to
what gets stored where, while with cache,
everything is done automatically
Even though the compiler and the GPU can still
be very clever in optimizing memory accesses,
you can sometimes still find a better way
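On Fermi this split can be requested per kernel through the runtime API; a short sketch (MyKernel is a placeholder name):

// prefer 48 KB shared memory / 16 KB L1 cache for this kernel
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);
// or prefer 16 KB shared memory / 48 KB L1 cache
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);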
Shared Variables
Shared variables reside in the shared memory
The scope of a shared variable is a thread block: all threads in a block see the same version of the variable (subject to race conditions)
A private version of the shared variable is created for
and used by each thread block during kernel execution
When a kernel terminates its execution, the contents
of its shared variables cease to exist
An efficient means for threads within a block to
collaborate with one another, to hold the portion of
global memory data that are heavily used in a kernel
execution phase
Shared Variable Example
Statically (size known at compile time)
__global__ void Kernel(int count)
{
__shared__ int a[1024];
...
}
Dynamically (size not known until runtime)
__global__ void Kernel(int count)
{
extern __shared__ int a[];
...
}
Kernel<<< gridDim, blockDim, numBytesSharedMem >>>(count);
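In the dynamic case the third launch parameter gives the shared-memory size in bytes; for example, assuming a[] should hold count ints (an assumption for this sketch):
Kernel<<< gridDim, blockDim, count * sizeof(int) >>>(count);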
Global Memory
[Figure: CUDA memory model with global memory highlighted]
• Typically implemented in DRAM
• High access latency: 400-800 cycles
• Finite access bandwidth
• Potential of traffic congestion
Global Variables
Visible to all threads of all kernels, can be used
as a means for threads to collaborate across
blocks
The only easy way to synchronize between
threads from different thread blocks or to
ensure data consistency across threads when
accessing global memory is by terminating the
current kernel execution
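As an illustration (a minimal sketch; the variable and kernel names are made up), a __device__ variable in global memory is visible to all threads of all kernels:

__device__ int blockCounter;              // lives in global memory, visible to every kernel

__global__ void countBlocks() {
    if (threadIdx.x == 0)
        atomicAdd(&blockCounter, 1);      // threads of different blocks collaborate through global memory
}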
Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);
    // ... kernel launch (not shown on this slide) ...
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
Global memory:
• Allocate with cudaMalloc(void** devPtr, size_t size)
• Free with cudaFree(void* devPtr)
Hardware View of CUDA Memories
[Figure: hardware view of a processor (SM): a processing unit with register file and ALU, a control unit with PC and IR, and on-chip shared memory, connected through I/O to off-chip global memory]
Tiling for Reduced Memory Traffic
The global memory is large but slow, whereas
the shared memory is small but fast
Partition the data into subsets called tiles so
that each tile fits into the shared memory
A large wall (i.e., the global memory data) can
be covered by tiles (i.e., subsets that each can
fit into the shared memory)
Matrix Multiplication
A Basic Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
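A possible host-side launch for this kernel (a sketch; the 16x16 block size and the device pointer names d_M, d_N, d_P are assumptions, not taken from the slides):

dim3 dimBlock(16, 16);                                        // one thread per P element
dim3 dimGrid((Width + 15) / 16, (Width + 15) / 16);           // enough blocks to cover the whole matrix
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);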
4x4 P: Thread to Data Mapping
P matrix divided into 4 parts
Each block (2x2 threads) calculates 1 part
[Figure: Block(0,0), Block(0,1), Block(1,0), Block(1,1) each cover one 2x2 quadrant of the 4x4 P matrix; within a block, Thread(0,0), Thread(0,1), Thread(1,0), Thread(1,1) each compute one P element]
Global memory accesses performed by threads in Block(0,0)
Each thread performs 2*Width = 8 global memory accesses, i.e. 8*4 = 32 accesses for the four threads of the block
Every M and N element needed by the block is fetched by two of its threads; if the threads collaborate so that each element is loaded from global memory only once, the total number of accesses to the global memory can be reduced by half
Global Memory Access Pattern
[Figure: without tiling, each thread fetches its operands directly from global memory]
Tiling
Divide the global memory content into tiles
[Figure: with tiling, threads first load a tile of global memory data into on-chip memory, then access it from there]
Tiling
[Figure: once a tile has been processed, the threads load the next tile into on-chip memory and repeat]
Basic Concept of Tiling
In a congested traffic system, a significant reduction in the number of vehicles can greatly reduce the delay seen by all vehicles
Carpooling for commuters (only cars carrying two or more people are allowed in carpool lanes)
Tiling for global memory accesses
drivers = threads accessing their memory data operands
cars = memory access requests
Carpools Need Synchronization
Good when people have similar schedules
[Figure: aligned sleep/work/dinner timelines for Worker A and Worker B]
[Figure: misaligned timelines: Worker A parties and sleeps while Worker B sleeps and works]
Bad when people have very different schedules
Tiling Requires Synchronization Among Threads
Good when threads have similar access timing
[Figure: Thread 1 and Thread 2 access the same data at roughly the same times]
[Figure: Thread 1 and Thread 2 access the same data at very different times]
Bad when threads have very different timing
Tiling
Localizes the memory locations accessed among
threads and the timing of their accesses
Divides the long access sequences of each
thread into phases and uses barrier
synchronization to keep the timing of accesses
to each section at close intervals
Controls the amount of on-chip memory required
by localizing the accesses both in time and in
space
Outline of Tiling
Identify a tile of global memory contents that are
accessed by multiple threads
Load the tile from global memory into on-chip
memory
Use barrier synchronization to make sure that all
threads are ready to start the phase
Have the multiple threads access their data from the on-chip memory
Use barrier synchronization to make sure that all
threads have completed the current phase
Move on to the next tile
A Basic Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
A Basic Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
4x4 P: Thread to Data Mapping
P matrix divided into 4 parts
Each block (2x2 threads) calculates 1 part
[Figure: the same 4x4 P thread-to-data mapping with BLOCK_WIDTH = 2: Block(0,0), Block(0,1), Block(1,0), Block(1,1) each compute one 2x2 quadrant of P]
Calculation of P0,0 and P0,1
[Figure: P0,0, P0,1, P1,0, P1,1 are produced from rows 0-1 of M and columns 0-1 of N]
Tiled Matrix Multiplication
Break up the execution of each thread into phases (to utilize shared memory)
So that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
The tile is BLOCK_WIDTH elements in each dimension
[Figure: M, N, and P matrices of size WIDTH x WIDTH; a BLOCK_WIDTH x BLOCK_WIDTH tile of M (along Row) and a tile of N (along Col) contribute to one tile of P]
2x2 Tiles
[Figure: M, N, and P each divided into a 2x2 arrangement of 2x2 tiles]
Phase 0 Load for Block (0,0)
Each thread loads one M element and one N
element
[Figure: the M tile {M0,0, M0,1, M1,0, M1,1} and the N tile {N0,0, N0,1, N1,0, N1,1} are copied from global memory into shared memory]
Phase 0 Use for Block (0,0) (iteration 0)
[Figure: iteration 0 uses the first column of the M tile (M0,0, M1,0) and the first row of the N tile (N0,0, N0,1) from shared memory to update P0,0, P0,1, P1,0, P1,1]
Phase 0 Use for Block (0,0) (iteration 1)
[Figure: iteration 1 uses the second column of the M tile (M0,1, M1,1) and the second row of the N tile (N1,0, N1,1) to update P0,0, P0,1, P1,0, P1,1]
Phase 1 Load for Block (0,0)
Each thread loads one M element and one N
element (remaining elements)
[Figure: the next M tile {M0,2, M0,3, M1,2, M1,3} and the next N tile {N2,0, N2,1, N3,0, N3,1} are copied from global memory into shared memory, overwriting the phase 0 tiles]
Phase 1 Use for Block (0,0) (iteration 0)
[Figure: iteration 0 of phase 1 uses M0,2, M1,2 and N2,0, N2,1 from shared memory to update P0,0, P0,1, P1,0, P1,1]
Phase 1 Use for Block (0,0) (iteration 1)
[Figure: iteration 1 of phase 1 uses M0,3, M1,3 and N3,0, N3,1 to complete P0,0, P0,1, P1,0, P1,1]
Execution Phases
[Figure: table of the execution phases: in each phase every thread loads one M element into Mds and one N element into Nds, then performs two multiply-add iterations using the shared values]
Shared memory allows each value to be accessed by multiple threads
Data in Shared Memory
Mds and Nds: shared memory arrays for M and N
elements
They are reused to hold input values, allowing a
much smaller shared memory to serve most of
the accesses to global memory
Each phase focuses on a small subset of the
input matrix elements: locality
Barrier Synchronization
Synchronize all threads in a block
__syncthreads()
All threads in the same block must reach the
__syncthreads() before any of them can move on
Best used to coordinate the phased execution of tiled algorithms
To ensure that all elements of a tile are loaded at the beginning of a
phase
To ensure that all elements of a tile are consumed at the end of a phase
Use 1D Indexing
M[Row][p*TILE_WIDTH+tx]  →  M[Row*Width + p*TILE_WIDTH + tx]
N[p*TILE_WIDTH+ty][Col]  →  N[(p*TILE_WIDTH+ty)*Width + Col]
where p is the sequence number of the current phase
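The same flattening can be wrapped in a small helper (a hypothetical convenience, not part of the kernel on the next slide):

__device__ __forceinline__ int flat(int row, int col, int width) {
    return row * width + col;   // row-major 2D index -> 1D offset
}
// M[Row][p*TILE_WIDTH+tx]  becomes  M[flat(Row, p*TILE_WIDTH + tx, Width)]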
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;
    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;
    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;
    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}
Before-After
for (int k = 0; k < Width; ++k) {
Pvalue += M[Row*Width+k]*N[k*Width+Col];
}
for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();
    for (int i = 0; i < TILE_WIDTH; ++i)
        Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
}
Tile (Thread Block) Size Considerations
Each thread block should have many threads
TILE_WIDTH of 16 gives 16*16 = 256 threads
TILE_WIDTH of 32 gives 32*32 = 1024 threads
(tiling reduces global memory accesses by a factor of TILE_WIDTH)
For 16, in each phase, each block performs 2*256 = 512
float loads from global memory for 256 * (2*16) = 8,192
mul/add operations. (16 floating-point operations for each
memory load)
For 32, in each phase, each block performs 2*1024 = 2048
float loads from global memory for 1024 * (2*32) = 65,536
mul/add operations (32 floating-point operations for each memory load)
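These per-phase numbers follow a general pattern: a block of $T^2$ threads (with $T =$ TILE_WIDTH) issues $2T^2$ loads and performs $2T^3$ mul/add operations per phase, so

\[
\text{compute-to-memory ratio} = \frac{2T^3}{2T^2} = T \qquad (= 16 \text{ or } 32 \text{ above})
\]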
Shared Memory
For an SM with 16KB shared memory
Shared memory size is implementation dependent!
For TILE_WIDTH = 16, 256 threads, each thread block uses 2*256*4B
= 2KB shared memory per block, up to 8 thread blocks
This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per
block)
For TILE_WIDTH = 32, 1024 threads, 2*1024*4B= 8KB shared memory
usage per block, allowing 2 thread blocks active at the same time
However, in a GPU where the thread count is limited to 1536 threads per SM, the
number of blocks per SM is reduced to one!
Each __syncthreads() can reduce the number of active
threads for a block
More thread blocks can be advantageous
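Because the capacity is implementation dependent, it can be queried at runtime; a minimal sketch using the standard device-properties API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}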
Handling Matrix of Arbitrary Size
The kernel so far handles only square matrices whose dimension (Width) is a multiple of the tile width (TILE_WIDTH)
Real applications need to handle arbitrarily sized matrices
One could pad the rows and columns (add elements) up to multiples of the tile size, but this would incur significant space and data transfer time overhead
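A common alternative (not shown on these slides) is to keep the tile size fixed, run the phase loop ceil(Width/TILE_WIDTH) times, and guard each load and the final store with boundary checks; a sketch of the guarded accesses inside the tiled kernel:

ds_M[ty][tx] = (Row < Width && p*TILE_WIDTH + tx < Width)
             ? M[Row*Width + p*TILE_WIDTH + tx] : 0.0f;
ds_N[ty][tx] = (p*TILE_WIDTH + ty < Width && Col < Width)
             ? N[(p*TILE_WIDTH + ty)*Width + Col] : 0.0f;
// ... inner-product loop and __syncthreads() calls as before ...
if (Row < Width && Col < Width)
    P[Row*Width + Col] = Pvalue;   // guarded store of the result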
NVIDIA Nsight Compute
References
Chapter 4.2, 4.4, 4.5, 4.6 of Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk and Wen-Mei W. Hwu, Morgan Kaufmann Publishers, 3rd edition
https://developer.nvidia.com/nsight-compute
https://events.prace-ri.eu/