CENG443
Heterogeneous Parallel Programming
CUDA Shared Memory
Işıl ÖZ, IZTECH, Fall 2024
31 December 2024
Image Blurring Kernel
// Get the average of the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
    for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}
In every iteration of the inner loop, one global memory access is performed for
one floating-point addition
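For context, a complete kernel around this loop nest could look as follows; this is a minimal sketch, and the kernel signature, pixel type, and BLUR_SIZE value are assumptions rather than something given on the slide:

#define BLUR_SIZE 1   // assumed box radius for this sketch

__global__ void blurKernel(const float *in, float *out, int w, int h) {
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    if (Col < w && Row < h) {
        float pixVal = 0.0f;
        int pixels = 0;
        // Get the average of the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
        for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
                int curRow = Row + blurRow;
                int curCol = Col + blurCol;
                if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
                    pixVal += in[curRow * w + curCol];   // one global memory load per addition
                    pixels++;
                }
            }
        }
        out[Row * w + Col] = pixVal / pixels;            // write the averaged pixel
    }
}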
Compute-to-Global Memory Access Ratio
The number of floating-point calculations performed for each access to the global memory within a region of a program
It is 1 for the image blur kernel: one floating-point add per global memory access
Memory-bound programs: execution speed is limited by memory access throughput
GPU Performance
Consider a GPU device with a global memory bandwidth of 1,000 GB/s (1 TB/s)
Each single-precision floating-point value is 4 bytes
1000/4 = 250 giga single-precision operands can be loaded per second
So no more than 250 giga floating-point operations per second (GFLOPS) if each operation needs one operand from global memory
Peak single-precision performance of 12 TFLOPS
Only a tiny fraction of peak: 250/12,000 ≈ 2%
Need to find ways of reducing global memory
accesses!
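Stated as a worked bound, with compute-to-global-memory-access ratio $r$ and bandwidth $B$ (both taken from this slide):

\[
\text{attainable FLOP/s} \le r \cdot \frac{B}{4\ \text{bytes}} = 1 \cdot \frac{1000\ \text{GB/s}}{4\ \text{B}} = 250\ \text{GFLOPS} \approx 2\%\ \text{of the 12 TFLOPS peak}
\]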
CUDA Memories
[Figure: CUDA device memory model. Each block has its own shared memory and each thread its own registers; all threads in the grid, and the host, access global memory and constant memory]
Registers
[Figure: CUDA memory model with the register files highlighted]
• Fastest
• Only accessible by a thread
• Lifetime of a thread
• Limited capacity of registers
Local Variables
All scalar variables declared in kernel and device
functions are placed into registers
Automatic array variables are not stored in
registers (The compiler may decide to store an
array into registers if all accesses are done with
constant index values)
Similar to scalar variables, the scope of these
arrays is limited to individual threads
Once a thread terminates its execution, the
contents of its local variables also cease to exist
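A minimal sketch of these rules (the kernel and variable names here are made up for illustration):

__global__ void localVars(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // scalar variable: placed in a register
    float tmp = 2.0f * data[i];                      // scalar variable: placed in a register
    float buf[8];                                    // automatic array: generally not kept in registers
    buf[0] = tmp;                                    // (constant indices like this may let the compiler use registers)
    data[i] = buf[0] + tmp;                          // all of these variables disappear when the thread terminates
}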
Image Blurring Kernel
// Get the average of the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
    for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}
Shared Memory
[Figure: CUDA memory model with the per-block shared memories highlighted]
• Extremely fast (~4 cycles)
• Highly parallel
• Restricted to a block
• Small (typically 48 KB per SM)
Shared Memory in CUDA
A special type of memory whose contents are
explicitly defined and used in the kernel source
code
One in each SM
Accessed at much higher speed (in both latency and
throughput) than global memory
Scope of access and sharing - thread blocks
Lifetime – thread block; contents disappear after the corresponding thread block terminates execution
Accessed by memory load/store instructions
A form of scratchpad memory in computer architecture
Shared Memory in Fermi
64KB configurable shared memory and L1 cache
48 KB shared memory and 16 KB L1 cache OR
16 KB shared memory and 48 KB L1 cache
With shared memory, you get full control as to
what gets stored where, while with cache,
everything is done automatically
Even though the compiler and the GPU can still
be very clever in optimizing memory accesses,
you can sometimes still find a better way
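On Fermi this split can be requested per kernel through the runtime API; a short sketch (MyKernel is a placeholder name):

// prefer 48 KB shared memory / 16 KB L1 cache for this kernel
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);
// or prefer 16 KB shared memory / 48 KB L1 cache
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);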
Shared Variables
Shared variables reside in the shared memory
The scope of a shared variable is a thread block: all threads in a block see the same version of the variable (subject to race conditions)
A private version of the shared variable is created for
and used by each thread block during kernel execution
When a kernel terminates its execution, the contents
of its shared variables cease to exist
An efficient means for threads within a block to
collaborate with one another, to hold the portion of
global memory data that are heavily used in a kernel
execution phase
Shared Variable Example
Statically (size known at compile time)
__global__ void Kernel(int count)
{
__shared__ int a[1024];
...
}
Dynamically (size not known until runtime)
__global__ void Kernel(int count)
{
extern __shared__ int a[];
...
}
Kernel<<< gridDim, blockDim, numBytesSharedMem >>>(count);
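In the dynamic case the third launch parameter gives the shared-memory size in bytes; for example, assuming a[] should hold count ints (an assumption for this sketch):
Kernel<<< gridDim, blockDim, count * sizeof(int) >>>(count);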
Global Memory
[Figure: CUDA memory model with global memory highlighted]
• Typically implemented in DRAM
• High access latency: 400-800 cycles
• Finite access bandwidth
• Potential of traffic congestion
Global Variables
Visible to all threads of all kernels, can be used
as a means for threads to collaborate across
blocks
The only easy way to synchronize between
threads from different thread blocks or to
ensure data consistency across threads when
accessing global memory is by terminating the
current kernel execution
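As an illustration (a minimal sketch; the variable and kernel names are made up), a __device__ variable in global memory is visible to all threads of all kernels:

__device__ int blockCounter;              // lives in global memory, visible to every kernel

__global__ void countBlocks() {
    if (threadIdx.x == 0)
        atomicAdd(&blockCounter, 1);      // threads of different blocks collaborate through global memory
}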
Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);
    // ... kernel launch (not shown on this slide) ...
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
Global memory:
• Allocate with cudaMalloc(void** devPtr, size_t size)
• Free with cudaFree(void* devPtr)
Hardware View of CUDA Memories
[Figure: hardware view of a processor (SM): a processing unit with register file and ALU, a control unit with PC and IR, and on-chip shared memory, connected through I/O to off-chip global memory]
Tiling for Reduced Memory Traffic
The global memory is large but slow, whereas
the shared memory is small but fast
Partition the data into subsets called tiles so
that each tile fits into the shared memory
A large wall (i.e., the global memory data) can
be covered by tiles (i.e., subsets that each can
fit into the shared memory)
Matrix Multiplication
A Basic Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
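A possible host-side launch for this kernel (a sketch; the 16x16 block size and the device pointer names d_M, d_N, d_P are assumptions, not taken from the slides):

dim3 dimBlock(16, 16);                                        // one thread per P element
dim3 dimGrid((Width + 15) / 16, (Width + 15) / 16);           // enough blocks to cover the whole matrix
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);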
4x4 P: Thread to Data Mapping
P matrix divided into 4 parts
Each block (2x2 threads) calculates 1 part
[Figure: Block(0,0), Block(0,1), Block(1,0), Block(1,1) each cover one 2x2 quadrant of the 4x4 P matrix; within a block, Thread(0,0), Thread(0,1), Thread(1,0), Thread(1,1) each compute one P element]
Global memory accesses performed by threads in Block(0,0)
Each thread performs 2*Width = 8 global memory accesses, i.e. 8*4 = 32 accesses for the four threads of the block
Every M and N element needed by the block is fetched by two of its threads; if the threads collaborate so that each element is loaded from global memory only once, the total number of accesses to the global memory can be reduced by half
Global Memory Access Pattern
[Figure: without tiling, each thread fetches its operands directly from global memory]
Tiling
Divide the global memory content into tiles
[Figure: with tiling, threads first load a tile of global memory data into on-chip memory, then access it from there]
Tiling
[Figure: once a tile has been processed, the threads load the next tile into on-chip memory and repeat]
Basic Concept of Tiling
In a congested traffic system, a significant reduction in the number of vehicles can greatly reduce the delay seen by all vehicles
Carpooling for commuters (only cars carrying two or more people are allowed in carpool lanes)
Tiling for global memory accesses
drivers = threads accessing their memory data operands
cars = memory access requests
Carpools Need Synchronization
Good when people have similar schedules
[Figure: aligned sleep/work/dinner timelines for Worker A and Worker B]
[Figure: misaligned timelines: Worker A parties and sleeps while Worker B sleeps and works]
Bad when people have very different schedules
Tiling Requires Synchronization Among Threads
Good when threads have similar access timing
[Figure: Thread 1 and Thread 2 access the same data at roughly the same times]
[Figure: Thread 1 and Thread 2 access the same data at very different times]
Bad when threads have very different timing
Tiling
Localizes the memory locations accessed among
threads and the timing of their accesses
Divides the long access sequences of each
thread into phases and uses barrier
synchronization to keep the timing of accesses
to each section at close intervals
Controls the amount of on-chip memory required
by localizing the accesses both in time and in
space
Outline of Tiling
Identify a tile of global memory contents that are
accessed by multiple threads
Load the tile from global memory into on-chip
memory
Use barrier synchronization to make sure that all
threads are ready to start the phase
Have the multiple threads access their data from the on-chip memory
Use barrier synchronization to make sure that all
threads have completed the current phase
Move on to the next tile
A Basic Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
A Basic Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
4x4 P: Thread to Data Mapping
P matrix divided into 4 parts
Each block (2x2 threads) calculates 1 part
[Figure: the same 4x4 P thread-to-data mapping with BLOCK_WIDTH = 2: Block(0,0), Block(0,1), Block(1,0), Block(1,1) each compute one 2x2 quadrant of P]
Calculation of P0,0 and P0,1
[Figure: P0,0, P0,1, P1,0, P1,1 are produced from rows 0-1 of M and columns 0-1 of N]
Tiled Matrix Multiplication
Break up the execution of each thread into phases (to utilize shared memory)
So that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
The tile is BLOCK_WIDTH elements in each dimension
[Figure: M, N, and P matrices of size WIDTH x WIDTH; a BLOCK_WIDTH x BLOCK_WIDTH tile of M (along Row) and a tile of N (along Col) contribute to one tile of P]
2x2 Tiles
[Figure: M, N, and P each divided into a 2x2 arrangement of 2x2 tiles]
Phase 0 Load for Block (0,0)
Each thread loads one M element and one N
element
[Figure: the M tile {M0,0, M0,1, M1,0, M1,1} and the N tile {N0,0, N0,1, N1,0, N1,1} are copied from global memory into shared memory]
Phase 0 Use for Block (0,0) (iteration 0)
[Figure: iteration 0 uses the first column of the M tile (M0,0, M1,0) and the first row of the N tile (N0,0, N0,1) from shared memory to update P0,0, P0,1, P1,0, P1,1]
Phase 0 Use for Block (0,0) (iteration 1)
[Figure: iteration 1 uses the second column of the M tile (M0,1, M1,1) and the second row of the N tile (N1,0, N1,1) to update P0,0, P0,1, P1,0, P1,1]
Phase 1 Load for Block (0,0)
Each thread loads one M element and one N
element (remaining elements)
[Figure: the next M tile {M0,2, M0,3, M1,2, M1,3} and the next N tile {N2,0, N2,1, N3,0, N3,1} are copied from global memory into shared memory, overwriting the phase 0 tiles]
Phase 1 Use for Block (0,0) (iteration 0)
[Figure: iteration 0 of phase 1 uses M0,2, M1,2 and N2,0, N2,1 from shared memory to update P0,0, P0,1, P1,0, P1,1]
Phase 1 Use for Block (0,0) (iteration 1)
[Figure: iteration 1 of phase 1 uses M0,3, M1,3 and N3,0, N3,1 to complete P0,0, P0,1, P1,0, P1,1]
Execution Phases
[Figure: table of the execution phases: in each phase every thread loads one M element into Mds and one N element into Nds, then performs two multiply-add iterations using the shared values]
Shared memory allows each value to be accessed by multiple threads
Data in Shared Memory
Mds and Nds: shared memory arrays for M and N
elements
They are reused to hold input values, allowing a
much smaller shared memory to serve most of
the accesses to global memory
Each phase focuses on a small subset of the
input matrix elements: locality
Barrier Synchronization
Synchronize all threads in a block
__syncthreads()
All threads in the same block must reach the
__syncthreads() before any of them can move on
Best used to coordinate the phased execution of tiled algorithms
To ensure that all elements of a tile are loaded at the beginning of a
phase
To ensure that all elements of a tile are consumed at the end of a phase
Use 1D Indexing
M[Row][p*TILE_WIDTH+tx]  →  M[Row*Width + p*TILE_WIDTH + tx]
N[p*TILE_WIDTH+ty][Col]  →  N[(p*TILE_WIDTH+ty)*Width + Col]
where p is the sequence number of the current phase
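The same flattening can be wrapped in a small helper (a hypothetical convenience, not part of the kernel on the next slide):

__device__ __forceinline__ int flat(int row, int col, int width) {
    return row * width + col;   // row-major 2D index -> 1D offset
}
// M[Row][p*TILE_WIDTH+tx]  becomes  M[flat(Row, p*TILE_WIDTH + tx, Width)]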
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;
    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;
    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;
    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}
Before-After
for (int k = 0; k < Width; ++k) {
Pvalue += M[Row*Width+k]*N[k*Width+Col];
}
for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();
    for (int i = 0; i < TILE_WIDTH; ++i)
        Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
}
Tile (Thread Block) Size Considerations
Each thread block should have many threads
TILE_WIDTH of 16 gives 16*16 = 256 threads
TILE_WIDTH of 32 gives 32*32 = 1024 threads
(tiling reduces global memory accesses by a factor of TILE_WIDTH)
For 16, in each phase, each block performs 2*256 = 512
float loads from global memory for 256 * (2*16) = 8,192
mul/add operations. (16 floating-point operations for each
memory load)
For 32, in each phase, each block performs 2*1024 = 2048
float loads from global memory for 1024 * (2*32) = 65,536
mul/add operations (32 floating-point operations for each memory load)
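These per-phase numbers follow a general pattern: a block of $T^2$ threads (with $T =$ TILE_WIDTH) issues $2T^2$ loads and performs $2T^3$ mul/add operations per phase, so

\[
\text{compute-to-memory ratio} = \frac{2T^3}{2T^2} = T \qquad (= 16 \text{ or } 32 \text{ above})
\]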
Shared Memory
For an SM with 16KB shared memory
Shared memory size is implementation dependent!
For TILE_WIDTH = 16, 256 threads, each thread block uses 2*256*4B
= 2KB shared memory per block, up to 8 thread blocks
This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per
block)
For TILE_WIDTH = 32, 1024 threads, 2*1024*4B= 8KB shared memory
usage per block, allowing 2 thread blocks active at the same time
However, in a GPU where the thread count is limited to 1536 threads per SM, the
number of blocks per SM is reduced to one!
Each __syncthreads() can reduce the number of active
threads for a block
More thread blocks can be advantageous
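Because the capacity is implementation dependent, it can be queried at runtime; a minimal sketch using the standard device-properties API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}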
Handling Matrix of Arbitrary Size
The kernel so far handles only square matrices whose dimension (Width) is a multiple of the tile width (TILE_WIDTH)
Real applications need to handle arbitrarily sized matrices
One could pad the rows and columns (add elements) up to multiples of the tile size, but this would incur significant space and data transfer time overhead
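A common alternative (not shown on these slides) is to keep the tile size fixed, run the phase loop ceil(Width/TILE_WIDTH) times, and guard each load and the final store with boundary checks; a sketch of the guarded accesses inside the tiled kernel:

ds_M[ty][tx] = (Row < Width && p*TILE_WIDTH + tx < Width)
             ? M[Row*Width + p*TILE_WIDTH + tx] : 0.0f;
ds_N[ty][tx] = (p*TILE_WIDTH + ty < Width && Col < Width)
             ? N[(p*TILE_WIDTH + ty)*Width + Col] : 0.0f;
// ... inner-product loop and __syncthreads() calls as before ...
if (Row < Width && Col < Width)
    P[Row*Width + Col] = Pvalue;   // guarded store of the result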
NVIDIA Nsight Compute
References
Chapter 4.2, 4.4, 4.5, 4.6 of Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk and Wen-Mei W. Hwu, Morgan Kaufmann Publishers, 3rd edition
https://developer.nvidia.com/nsight-compute
https://events.prace-ri.eu/