CUDA NVIDIA Guide - Part 3

1. Reset a previous persisting memory region with the access property,
   cudaAccessPropertyNormal.
2. Reset all persisting L2 cache lines to normal by calling
   cudaCtxResetPersistingL2Cache().
3. Eventually untouched lines are automatically reset to normal. Reliance on
   automatic reset is strongly discouraged because of the undetermined length of
   time required for automatic reset to occur.
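For reference, here is a minimal sketch of the first two options, assuming a previously persisting region ptr of window_size bytes configured on stream (the stream access policy window API is covered earlier in this guide):

// Option 1: overwrite the previous window so its lines use the normal access property.
cudaStreamAttrValue stream_attribute = {};
stream_attribute.accessPolicyWindow.base_ptr  = ptr;
stream_attribute.accessPolicyWindow.num_bytes = window_size;
stream_attribute.accessPolicyWindow.hitRatio  = 1.0f;
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyNormal;  // hits are no longer persisting
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyNormal;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

// Option 2: reset all persisting L2 cache lines in the current context to normal.
cudaCtxResetPersistingL2Cache();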

3.2.3.6. Manage Utilization of L2 set-aside cache


Multiple CUDA kernels executing concurrently in different CUDA streams may have a
different access policy window assigned to their streams. However, the L2 set-aside
cache portion is shared among all these concurrent CUDA kernels. As a result, the net
utilization of this set-aside cache portion is the sum of all the concurrent kernels’
individual use. The benefits of designating memory accesses as persisting diminish as
the volume of persisting accesses exceeds the set-aside L2 cache capacity.

To manage utilization of the set-aside L2 cache portion, an application must consider
the following:

- Size of L2 set-aside cache.
- CUDA kernels that may concurrently execute.
- The access policy window for all the CUDA kernels that may concurrently execute.
- When and how L2 reset is required to allow normal or streaming accesses to utilize
  the previously set-aside L2 cache with equal priority.
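To make this bookkeeping concrete, here is a hedged sketch with two concurrent streams whose kernels both use persisting windows; stream1, stream2, data1, data2, size1, and size2 are hypothetical names, and the access policy window API is described earlier in this guide:

cudaStreamAttrValue attr1 = {};
attr1.accessPolicyWindow.base_ptr  = data1;
attr1.accessPolicyWindow.num_bytes = size1;
attr1.accessPolicyWindow.hitRatio  = 0.6f;                        // fraction of accesses that receive hitProp
attr1.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr1.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream1, cudaStreamAttributeAccessPolicyWindow, &attr1);

cudaStreamAttrValue attr2 = {};
attr2.accessPolicyWindow.base_ptr  = data2;
attr2.accessPolicyWindow.num_bytes = size2;
attr2.accessPolicyWindow.hitRatio  = 0.6f;
attr2.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr2.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream2, cudaStreamAttributeAccessPolicyWindow, &attr2);

// Kernels launched in stream1 and stream2 may run concurrently, so their persisting
// footprints add: roughly 0.6 * size1 + 0.6 * size2 should stay within the
// cudaLimitPersistingL2CacheSize set-aside for persisting accesses to remain effective.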

3.2.3.7. Query L2 cache Properties


Properties related to the L2 cache are part of the cudaDeviceProp struct and can be
queried using the CUDA runtime API cudaGetDeviceProperties().

CUDA Device Properties include:

- l2CacheSize: The amount of available L2 cache on the GPU.
- persistingL2CacheMaxSize: The maximum amount of L2 cache that can be set aside
  for persisting memory accesses.
- accessPolicyMaxWindowSize: The maximum size of the access policy window.
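For example, these properties can be printed with a short host-side snippet (device 0 is used here as an example; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("l2CacheSize:               %d bytes\n", prop.l2CacheSize);
    printf("persistingL2CacheMaxSize:  %d bytes\n", prop.persistingL2CacheMaxSize);
    printf("accessPolicyMaxWindowSize: %d bytes\n", prop.accessPolicyMaxWindowSize);
    return 0;
}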

3.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access


The L2 set-aside cache size for persisting memory accesses is queried using CUDA
runtime API cudaDeviceGetLimit and set using CUDA runtime API cudaDeviceSetLimit as
a cudaLimit . The maximum value for setting this limit is
cudaDeviceProp::persistingL2CacheMaxSize .
enum cudaLimit {
/* other fields not shown */
cudaLimitPersistingL2CacheSize
};
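For example, a minimal sketch of reserving part of the L2 cache for persisting accesses (three quarters of the maximum is an arbitrary choice for illustration):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// Request a set-aside size; it must not exceed persistingL2CacheMaxSize.
size_t setAsideSize = 3 * prop.persistingL2CacheMaxSize / 4;
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAsideSize);

// Read back the limit that is currently in effect.
size_t currentLimit = 0;
cudaDeviceGetLimit(&currentLimit, cudaLimitPersistingL2CacheSize);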

3.2.4. Shared Memory


As detailed in Variable Memory Space Specifiers, shared memory is allocated using the
__shared__ memory space specifier.

Shared memory is expected to be much faster than global memory, as mentioned in
Thread Hierarchy and detailed in Shared Memory. It can be used as scratchpad
memory (or software-managed cache) to minimize global memory accesses from a
CUDA block, as illustrated by the following matrix multiplication example.

The following code sample is a straightforward implementation of matrix
multiplication that does not take advantage of shared memory. Each thread reads one
row of A and one column of B and computes the corresponding element of C as
illustrated in Figure 8. A is therefore read B.width times from global memory and B is
read A.height times.
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

Figure 8: Matrix Multiplication without Shared Memory

The following code sample is an implementation of matrix multiplication that does
take advantage of shared memory. In this implementation, each thread block is
responsible for computing one square sub-matrix Csub of C and each thread within
the block is responsible for computing one element of Csub. As illustrated in Figure 9,
Csub is equal to the product of two rectangular matrices: the sub-matrix of A of
dimension (A.width, block_size) that has the same row indices as Csub, and the
sub-matrix of B of dimension (block_size, A.width) that has the same column indices
as Csub. In order to fit into the device’s resources, these two rectangular matrices are
divided into as many square matrices of dimension block_size as necessary and Csub
is computed as the sum of the products of these square matrices. Each of these
products is performed by first loading the two corresponding square matrices from
global memory to shared memory with one thread loading one element of each
matrix, and then by having each thread compute one element of the product. Each
thread accumulates the result of each of these products into a register and once done
writes the result to global memory.

By blocking the computation this way, we take advantage of fast shared memory and
save a lot of global memory bandwidth since A is only read (B.width / block_size) times
from global memory and B is read (A.height / block_size) times.

The Matrix type from the previous code sample is augmented with a stride field, so
that sub-matrices can be efficiently represented with the same type. __device__
functions are used to get and set elements and build any sub-matrix from a matrix.
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                              + BLOCK_SIZE * col];
    return Asub;
}

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
Figure 9: Matrix Multiplication with Shared Memory

3.2.5. Distributed Shared Memory


Thread block clusters, introduced in compute capability 9.0, provide the ability for
threads in a thread block cluster to access the shared memory of all the participating
thread blocks in the cluster. This partitioned shared memory is called Distributed Shared
Memory, and the corresponding address space is called the distributed shared memory
address space. Threads that belong to a thread block cluster can read, write, or
perform atomics at any address in the distributed address space, regardless of whether
the address belongs to the local thread block or a remote thread block. Whether or not
a kernel uses distributed shared memory, the shared memory size specification, static
or dynamic, is still per thread block. The size of distributed shared memory is just the
number of thread blocks per cluster multiplied by the size of shared memory per
thread block.

Accessing data in distributed shared memory requires all the thread blocks to exist. A
user can guarantee that all thread blocks have started executing using cluster.sync()
from the Cluster Group API. The user also needs to ensure that all distributed shared
memory operations happen before the exit of a thread block; for example, if a remote
thread block is trying to read a given thread block’s shared memory, the user needs to
ensure that the shared memory read by the remote thread block is completed before
that block can exit.

CUDA provides a mechanism to access distributed shared memory, and applications
can benefit from leveraging its capabilities. Let’s look at a simple histogram
computation and how to optimize it on the GPU using a thread block cluster. A standard
way of computing histograms is to do the computation in the shared memory of each
thread block and then perform global memory atomics. A limitation of this approach is
the shared memory capacity. Once the histogram bins no longer fit in shared memory, a
user needs to compute the histogram, and hence perform the atomics, directly in
global memory. With distributed shared memory, CUDA provides an intermediate step
where, depending on the number of histogram bins, the histogram can be computed in
shared memory, distributed shared memory, or global memory directly.

The CUDA kernel example below shows how to compute histograms in shared memory
or distributed shared memory, depending on the number of histogram bins.
#include <cooperative_groups.h>

// Distributed Shared memory histogram kernel
__global__ void clusterHist_kernel(int *bins, const int nbins, const int bins_per_block,
                                   const int *__restrict__ input, size_t array_size)
{
    extern __shared__ int smem[];
    namespace cg = cooperative_groups;
    int tid = cg::this_grid().thread_rank();

    // Cluster initialization, size and calculating local bin offsets.
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int clusterBlockRank = cluster.block_rank();
    int cluster_size = cluster.dim_blocks().x;

    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
    {
        smem[i] = 0; // Initialize shared memory histogram to zeros
    }

    // cluster synchronization ensures that shared memory is initialized to zero in
    // all thread blocks in the cluster. It also ensures that all thread blocks
    // have started executing and they exist concurrently.
    cluster.sync();

    for (int i = tid; i < array_size; i += blockDim.x * gridDim.x)
    {
        int ldata = input[i];

        // Find the right histogram bin.
        int binid = ldata;
        if (ldata < 0)
            binid = 0;
        else if (ldata >= nbins)
            binid = nbins - 1;

        // Find destination block rank and offset for computing
        // distributed shared memory histogram
        int dst_block_rank = (int)(binid / bins_per_block);
        int dst_offset = binid % bins_per_block;

        // Pointer to target block shared memory
        int *dst_smem = cluster.map_shared_rank(smem, dst_block_rank);

        // Perform atomic update of the histogram bin
        atomicAdd(dst_smem + dst_offset, 1);
    }

    // cluster synchronization is required to ensure all distributed shared
    // memory operations are completed and no thread block exits while
    // other thread blocks are still accessing distributed shared memory
    cluster.sync();

    // Perform global memory histogram, using the local distributed memory histogram
    int *lbins = bins + cluster.block_rank() * bins_per_block;
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
    {
        atomicAdd(&lbins[i], smem[i]);
    }
}
The above kernel can be launched at runtime with a cluster size that depends on the
amount of distributed shared memory required. If the histogram is small enough to fit
in the shared memory of just one block, the user can launch the kernel with a cluster
size of 1. The code snippet below shows how to launch a cluster kernel dynamically
depending on shared memory requirements.

// Launch via extensible launch
{
    cudaLaunchConfig_t config = {0};
    config.gridDim = array_size / threads_per_block;
    config.blockDim = threads_per_block;

    // cluster_size depends on the histogram size.
    // ( cluster_size == 1 ) implies no distributed shared memory, just thread block local shared memory
    int cluster_size = 2; // size 2 is an example here
    int nbins_per_block = nbins / cluster_size;

    // dynamic shared memory size is per block.
    // Distributed shared memory size = cluster_size * nbins_per_block * sizeof(int)
    config.dynamicSmemBytes = nbins_per_block * sizeof(int);

    CUDA_CHECK(::cudaFuncSetAttribute((void *)clusterHist_kernel,
                                      cudaFuncAttributeMaxDynamicSharedMemorySize,
                                      config.dynamicSmemBytes));

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = cluster_size;
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;

    config.numAttrs = 1;
    config.attrs = attribute;

    cudaLaunchKernelEx(&config, clusterHist_kernel, bins, nbins, nbins_per_block, input,
                       array_size);
}

3.2.6. Page-Locked Host Memory


The runtime provides functions to allow the use of page-locked (also known as pinned)
host memory (as opposed to regular pageable host memory allocated by malloc()):

- cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;
- cudaHostRegister() page-locks a range of memory allocated by malloc() (see the
  reference manual for limitations).

Using page-locked host memory has several benefits:

- Copies between page-locked host memory and device memory can be performed
  concurrently with kernel execution for some devices as mentioned in
  Asynchronous Concurrent Execution.
- On some devices, page-locked host memory can be mapped into the address
  space of the device, eliminating the need to copy it to or from device memory as
  detailed in Mapped Memory.
- On systems with a front-side bus, bandwidth between host memory and device
  memory is higher if host memory is allocated as page-locked and even higher if in
  addition it is allocated as write-combining as described in Write-Combining
  Memory.

Note: Page-locked host memory is not cached on non I/O coherent Tegra devices. Also,
cudaHostRegister() is not supported on non I/O coherent Tegra devices.

The simple zero-copy CUDA sample comes with a detailed document on the page-
locked memory APIs.
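As a brief sketch of the two allocation paths above (bytes is an assumed buffer size; error checking omitted):

#include <cstdlib>
#include <cuda_runtime.h>

size_t bytes = 1 << 20;

// Allocate page-locked host memory directly.
float* h_pinned = nullptr;
cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
// ... use h_pinned as the source or destination of copies ...
cudaFreeHost(h_pinned);

// Or page-lock an existing pageable allocation.
float* h_pageable = (float*)malloc(bytes);
cudaHostRegister(h_pageable, bytes, cudaHostRegisterDefault);
// ... use h_pageable ...
cudaHostUnregister(h_pageable);
free(h_pageable);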

3.2.6.1. Portable Memory


A block of page-locked memory can be used in conjunction with any device in the
system (see Multi-Device System for more details on multi-device systems), but by
default, the benefits of using page-locked memory described above are only available
in conjunction with the device that was current when the block was allocated (and
with all devices sharing the same unified address space, if any, as described in Unified
Virtual Address Space). To make these advantages available to all devices, the block
needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or
page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister() .
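A minimal sketch, assuming a buffer that should be usable as pinned memory from every device in the system (bytes and h_existing are hypothetical names):

// Allocate the block as portable pinned memory.
float* h_buf = nullptr;
cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocPortable);

// Or make an existing malloc() allocation portable when page-locking it.
cudaHostRegister(h_existing, bytes, cudaHostRegisterPortable);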

3.2.6.2. Write-Combining Memory


By default page-locked host memory is allocated as cacheable. It can optionally be
allocated as write-combining instead by passing flag cudaHostAllocWriteCombined to
cudaHostAlloc() . Write-combining memory frees up the host’s L1 and L2 cache
resources, making more cache available to the rest of the application. In addition,
write-combining memory is not snooped during transfers across the PCI Express bus,
which can improve transfer performance by up to 40%.

Reading from write-combining memory from the host is prohibitively slow, so write-
combining memory should in general be used for memory that the host only writes to.

Using CPU atomic instructions on WC memory should be avoided because not all CPU
implementations guarantee that functionality.
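A minimal sketch of a write-combined staging buffer that the host only fills and then copies to the device (d_buf and bytes are assumed to exist):

float* h_wc = nullptr;
cudaHostAlloc((void**)&h_wc, bytes, cudaHostAllocWriteCombined);

// Host writes only; reading h_wc back on the host would be very slow.
for (size_t i = 0; i < bytes / sizeof(float); ++i)
    h_wc[i] = 0.0f;

cudaMemcpy(d_buf, h_wc, bytes, cudaMemcpyHostToDevice);
cudaFreeHost(h_wc);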

3.2.6.3. Mapped Memory


A block of page-locked host memory can also be mapped into the address space of
the device by passing flag cudaHostAllocMapped to cudaHostAlloc() or by passing flag
cudaHostRegisterMapped to cudaHostRegister() . Such a block has therefore in general
two addresses: one in host memory that is returned by cudaHostAlloc() or malloc() ,
and one in device memory that can be retrieved using cudaHostGetDevicePointer() and
then used to access the block from within a kernel. The only exception is for pointers
allocated with cudaHostAlloc() and when a unified address space is used for the host
and the device as mentioned in Unified Virtual Address Space.

Accessing host memory directly from within a kernel does not provide the same
bandwidth as device memory, but does have some advantages:

 There is no need to allocate a block in device memory and copy data between this
block and the block in host memory; data transfers are implicitly performed as
needed by the kernel;
 There is no need to use streams (see Concurrent Data Transfers) to overlap data
transfers with kernel execution; the kernel-originated data transfers automatically
overlap with kernel execution.

Since mapped page-locked memory is shared between host and device however, the
application must synchronize memory accesses using streams or events (see
Asynchronous Concurrent Execution) to avoid any potential read-after-write, write-
after-read, or write-after-write hazards.

To be able to retrieve the device pointer to any mapped page-locked memory, page-
locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the
cudaDeviceMapHost flag before any other CUDA call is performed. Otherwise,
cudaHostGetDevicePointer() will return an error.

cudaHostGetDevicePointer() also returns an error if the device does not support


mapped page-locked host memory. Applications may query this capability by checking
the canMapHostMemory device property (see Device Enumeration), which is equal to 1 for
devices that support mapped page-locked host memory.
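Putting these pieces together, a hedged sketch of mapped memory usage might look like the following (myKernel, grid, block, and bytes are placeholders; error checking omitted):

cudaSetDeviceFlags(cudaDeviceMapHost);          // must precede any other CUDA call

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory) {
    float* h_data = nullptr;
    cudaHostAlloc((void**)&h_data, bytes, cudaHostAllocMapped);

    float* d_data = nullptr;                    // device-side address of the same block
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    myKernel<<<grid, block>>>(d_data);          // kernel accesses host memory directly
    cudaDeviceSynchronize();                    // synchronize before the host reads h_data

    cudaFreeHost(h_data);
}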

Note that atomic functions (see Atomic Functions) operating on mapped page-locked
memory are not atomic from the point of view of the host or other devices.

Also note that CUDA runtime requires that 1-byte, 2-byte, 4-byte, 8-byte, and 16-byte
naturally aligned loads and stores to host memory initiated from the device are
preserved as single accesses from the point of view of the host and other devices. On
some platforms, atomics to memory may be broken by the hardware into separate
load and store operations. These component load and store operations have the same
requirements on preservation of naturally aligned accesses. The CUDA runtime does
not support a PCI Express bus topology where a PCI Express bridge splits 8-byte
naturally aligned operations and NVIDIA is not aware of any topology that splits 16-
byte naturally aligned operations.

3.2.7. Memory Synchronization Domains


3.2.7.1. Memory Fence Interference
Some CUDA applications may see degraded performance due to memory fence/flush
operations waiting on more transactions than those necessitated by the CUDA
memory consistency model.

__managed__ int x = 0;
__device__  cuda::atomic<int, cuda::thread_scope_device> a(0);
__managed__ cuda::atomic<int, cuda::thread_scope_system> b(0);

Thread 1 (SM):
    x = 1;
    a = 1;

Thread 2 (SM):
    while (a != 1) ;
    assert(x == 1);
    b = 1;

Thread 3 (CPU):
    while (b != 1) ;
    assert(x == 1);

Consider the example above. The CUDA memory consistency model guarantees that
the asserted condition will be true, so the write to x from thread 1 must be visible to
thread 3, before the write to b from thread 2.

The memory ordering provided by the release and acquire of a is only sufficient to
make x visible to thread 2, not thread 3, as it is a device-scope operation. The
system-scope ordering provided by release and acquire of b , therefore, needs to
ensure not only writes issued from thread 2 itself are visible to thread 3, but also
writes from other threads that are visible to thread 2. This is known as cumulativity. As
the GPU cannot know at the time of execution which writes have been guaranteed at
the source level to be visible and which are visible only by chance timing, it must cast a
conservatively wide net for in-flight memory operations.

This sometimes leads to interference: because the GPU is waiting on memory
operations it is not required to wait on at the source level, the fence/flush may take
longer than necessary.

Note that fences may occur explicitly as intrinsics or atomics in code, like in the
example, or implicitly to implement synchronizes-with relationships at task boundaries.

A common example is when a kernel is performing computation in local GPU memory,
and a parallel kernel (e.g., from NCCL) is performing communications with a peer. Upon
completion, the local kernel will implicitly flush its writes to satisfy any synchronizes-
with relationships to downstream work. This may unnecessarily wait, fully or partially,
on slower NVLink or PCIe writes from the communication kernel.
3.2.7.2. Isolating Traffic with Domains
Beginning with Hopper architecture GPUs and CUDA 12.0, the memory
synchronization domains feature provides a way to alleviate such interference. In
exchange for explicit assistance from code, the GPU can reduce the net cast by a
fence operation. Each kernel launch is given a domain ID. Writes and fences are tagged
with the ID, and a fence will only order writes matching the fence’s domain. In the
concurrent compute vs communication example, the communication kernels can be
placed in a different domain.

When using domains, code must abide by the rule that ordering or synchronization
between distinct domains on the same GPU requires system-scope fencing. Within
a domain, device-scope fencing remains sufficient. This is necessary for cumulativity
as one kernel’s writes will not be encompassed by a fence issued from a kernel in
another domain. In essence, cumulativity is satisfied by ensuring that cross-domain
traffic is flushed to the system scope ahead of time.

Note that this modifies the definition of thread_scope_device. However, because
kernels will default to domain 0 as described below, backward compatibility is
maintained.

3.2.7.3. Using Domains in CUDA


Domains are accessible via the new launch attributes cudaLaunchAttributeMemSyncDomain
and cudaLaunchAttributeMemSyncDomainMap . The former selects between logical domains
cudaLaunchMemSyncDomainDefault and cudaLaunchMemSyncDomainRemote , and the latter
provides a mapping from logical to physical domains. The remote domain is intended
for kernels performing remote memory access in order to isolate their memory traffic
from local kernels. Note, however, the selection of a particular domain does not affect
what memory access a kernel may legally perform.

The domain count can be queried via the device attribute cudaDevAttrMemSyncDomainCount.
Hopper has 4 domains. To facilitate portable code, domains functionality can be used
on all devices and CUDA will report a count of 1 prior to Hopper.
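As a hedged sketch of these attributes, a communication kernel could be tagged with the remote logical domain as shown below; the union member name memSyncDomain, and the names commKernel (assumed here to take no arguments), commStream, grid, and block are assumptions for illustration:

// Tag a launch with the remote logical domain so its fences only order its own traffic.
cudaLaunchAttribute domainAttr;
domainAttr.id  = cudaLaunchAttributeMemSyncDomain;
domainAttr.val.memSyncDomain = cudaLaunchMemSyncDomainRemote;

cudaLaunchConfig_t config = {0};
config.gridDim  = grid;
config.blockDim = block;
config.stream   = commStream;
config.attrs    = &domainAttr;
config.numAttrs = 1;

cudaLaunchKernelEx(&config, commKernel /* , kernel arguments if any */);

// Portable code can first check how many domains the device exposes
// (1 on GPUs before Hopper):
int domainCount = 0;
cudaDeviceGetAttribute(&domainCount, cudaDevAttrMemSyncDomainCount, 0);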

Having logical domains eases application composition. An individual kernel launch at a
low level in the stack, such as from NCCL, can select a semantic logical domain
without concern for the surrounding application architecture. Higher levels can steer
logical domains using the mapping. The default value for the logical domain if it is not
set is the default domain, and the default mapping is to map the default domain to 0
and the remote domain to 1 (on GPUs with more than 1 domain). Specific libraries may
tag launches with the remote domain in CUDA 12.0 and later; for example, NCCL 2.16
will do so. Together, this provides a beneficial use pattern for common applications out
of the box, with no code changes needed in other components, frameworks, or at the
application level. An alternative use pattern, for example in an application using
