CUDA NVIDIA Guide - Part 3

1. Reset a previous persisting memory region with the access property,
   cudaAccessPropertyNormal.
2. Reset all persisting L2 cache lines to normal by calling
   cudaCtxResetPersistingL2Cache().
3. Eventually untouched lines are automatically reset to normal. Reliance on
   automatic reset is strongly discouraged because of the undetermined length of
   time required for automatic reset to occur.
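For reference, here is a minimal sketch of the first two options, assuming a previously persisting region ptr of window_size bytes configured on stream (the stream access policy window API is covered earlier in this guide):

// Option 1: overwrite the previous window so its lines use the normal access property.
cudaStreamAttrValue stream_attribute = {};
stream_attribute.accessPolicyWindow.base_ptr  = ptr;
stream_attribute.accessPolicyWindow.num_bytes = window_size;
stream_attribute.accessPolicyWindow.hitRatio  = 1.0f;
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyNormal;  // hits are no longer persisting
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyNormal;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

// Option 2: reset all persisting L2 cache lines in the current context to normal.
cudaCtxResetPersistingL2Cache();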

3.2.3.6. Manage Utilization of L2 set-aside cache


Multiple CUDA kernels executing concurrently in different CUDA streams may have a
different access policy window assigned to their streams. However, the L2 set-aside
cache portion is shared among all these concurrent CUDA kernels. As a result, the net
utilization of this set-aside cache portion is the sum of all the concurrent kernels’
individual use. The benefits of designating memory accesses as persisting diminish as
the volume of persisting accesses exceeds the set-aside L2 cache capacity.

To manage utilization of the set-aside L2 cache portion, an application must consider
the following:

- Size of L2 set-aside cache.
- CUDA kernels that may concurrently execute.
- The access policy window for all the CUDA kernels that may concurrently execute.
- When and how L2 reset is required to allow normal or streaming accesses to utilize
  the previously set-aside L2 cache with equal priority.
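To make this bookkeeping concrete, here is a hedged sketch with two concurrent streams whose kernels both use persisting windows; stream1, stream2, data1, data2, size1, and size2 are hypothetical names, and the access policy window API is described earlier in this guide:

cudaStreamAttrValue attr1 = {};
attr1.accessPolicyWindow.base_ptr  = data1;
attr1.accessPolicyWindow.num_bytes = size1;
attr1.accessPolicyWindow.hitRatio  = 0.6f;                        // fraction of accesses that receive hitProp
attr1.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr1.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream1, cudaStreamAttributeAccessPolicyWindow, &attr1);

cudaStreamAttrValue attr2 = {};
attr2.accessPolicyWindow.base_ptr  = data2;
attr2.accessPolicyWindow.num_bytes = size2;
attr2.accessPolicyWindow.hitRatio  = 0.6f;
attr2.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr2.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream2, cudaStreamAttributeAccessPolicyWindow, &attr2);

// Kernels launched in stream1 and stream2 may run concurrently, so their persisting
// footprints add: roughly 0.6 * size1 + 0.6 * size2 should stay within the
// cudaLimitPersistingL2CacheSize set-aside for persisting accesses to remain effective.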

3.2.3.7. Query L2 cache Properties


Properties related to the L2 cache are part of the cudaDeviceProp struct and can be
queried using the CUDA runtime API cudaGetDeviceProperties().

CUDA Device Properties include:

- l2CacheSize: The amount of available L2 cache on the GPU.
- persistingL2CacheMaxSize: The maximum amount of L2 cache that can be set aside
  for persisting memory accesses.
- accessPolicyMaxWindowSize: The maximum size of the access policy window.
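For example, these properties can be printed with a short host-side snippet (device 0 is used here as an example; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("l2CacheSize:               %d bytes\n", prop.l2CacheSize);
    printf("persistingL2CacheMaxSize:  %d bytes\n", prop.persistingL2CacheMaxSize);
    printf("accessPolicyMaxWindowSize: %d bytes\n", prop.accessPolicyMaxWindowSize);
    return 0;
}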

3.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access


The L2 set-aside cache size for persisting memory accesses is queried using CUDA
runtime API cudaDeviceGetLimit and set using CUDA runtime API cudaDeviceSetLimit as
a cudaLimit . The maximum value for setting this limit is
cudaDeviceProp::persistingL2CacheMaxSize .
enum cudaLimit {
/* other fields not shown */
cudaLimitPersistingL2CacheSize
};
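For example, a minimal sketch of reserving part of the L2 cache for persisting accesses (three quarters of the maximum is an arbitrary choice for illustration):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// Request a set-aside size; it must not exceed persistingL2CacheMaxSize.
size_t setAsideSize = 3 * prop.persistingL2CacheMaxSize / 4;
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAsideSize);

// Read back the limit that is currently in effect.
size_t currentLimit = 0;
cudaDeviceGetLimit(&currentLimit, cudaLimitPersistingL2CacheSize);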

3.2.4. Shared Memory


As detailed in Variable Memory Space Specifiers, shared memory is allocated using the
__shared__ memory space specifier.

Shared memory is expected to be much faster than global memory, as mentioned in
Thread Hierarchy and detailed in Shared Memory. It can be used as scratchpad
memory (or software-managed cache) to minimize global memory accesses from a
CUDA block, as illustrated by the following matrix multiplication example.

The following code sample is a straightforward implementation of matrix
multiplication that does not take advantage of shared memory. Each thread reads one
row of A and one column of B and computes the corresponding element of C as
illustrated in Figure 8. A is therefore read B.width times from global memory and B is
read A.height times.
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

Figure 8: Matrix Multiplication without Shared Memory

The following code sample is an implementation of matrix multiplication that does
take advantage of shared memory. In this implementation, each thread block is
responsible for computing one square sub-matrix Csub of C and each thread within
the block is responsible for computing one element of Csub. As illustrated in Figure 9,
Csub is equal to the product of two rectangular matrices: the sub-matrix of A of
dimension (A.width, block_size) that has the same row indices as Csub, and the
sub-matrix of B of dimension (block_size, A.width) that has the same column indices
as Csub. In order to fit into the device’s resources, these two rectangular matrices are
divided into as many square matrices of dimension block_size as necessary and Csub
is computed as the sum of the products of these square matrices. Each of these
products is performed by first loading the two corresponding square matrices from
global memory to shared memory with one thread loading one element of each
matrix, and then by having each thread compute one element of the product. Each
thread accumulates the result of each of these products into a register and once done
writes the result to global memory.

By blocking the computation this way, we take advantage of fast shared memory and
save a lot of global memory bandwidth since A is only read (B.width / block_size) times
from global memory and B is read (A.height / block_size) times.

The Matrix type from the previous code sample is augmented with a stride field, so
that sub-matrices can be efficiently represented with the same type. __device__
functions are used to get and set elements and build any sub-matrix from a matrix.
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                              + BLOCK_SIZE * col];
    return Asub;
}

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
Figure 9: Matrix Multiplication with Shared Memory

3.2.5. Distributed Shared Memory


Thread block clusters, introduced in compute capability 9.0, provide the ability for
threads in a thread block cluster to access the shared memory of all the participating
thread blocks in the cluster. This partitioned shared memory is called Distributed Shared
Memory, and the corresponding address space is called the distributed shared memory
address space. Threads that belong to a thread block cluster can read, write, or
perform atomics at any address in the distributed address space, regardless of whether
the address belongs to the local thread block or a remote thread block. Whether or not
a kernel uses distributed shared memory, the shared memory size specification, static
or dynamic, is still per thread block. The size of distributed shared memory is just the
number of thread blocks per cluster multiplied by the size of shared memory per
thread block.

Accessing data in distributed shared memory requires all the thread blocks to exist. A
user can guarantee that all thread blocks have started executing using cluster.sync()
from the Cluster Group API. The user also needs to ensure that all distributed shared
memory operations happen before the exit of a thread block; for example, if a remote
thread block is trying to read a given thread block’s shared memory, the user needs to
ensure that the shared memory read by the remote thread block is completed before
that block can exit.

CUDA provides a mechanism to access distributed shared memory, and applications
can benefit from leveraging its capabilities. Let’s look at a simple histogram
computation and how to optimize it on the GPU using a thread block cluster. A standard
way of computing histograms is to do the computation in the shared memory of each
thread block and then perform global memory atomics. A limitation of this approach is
the shared memory capacity. Once the histogram bins no longer fit in shared memory, a
user needs to compute the histogram, and hence perform the atomics, directly in
global memory. With distributed shared memory, CUDA provides an intermediate step
where, depending on the number of histogram bins, the histogram can be computed in
shared memory, distributed shared memory, or global memory directly.

The CUDA kernel example below shows how to compute histograms in shared memory
or distributed shared memory, depending on the number of histogram bins.
#include <cooperative_groups.h>

// Distributed Shared memory histogram kernel
__global__ void clusterHist_kernel(int *bins, const int nbins, const int bins_per_block,
                                   const int *__restrict__ input, size_t array_size)
{
    extern __shared__ int smem[];
    namespace cg = cooperative_groups;
    int tid = cg::this_grid().thread_rank();

    // Cluster initialization, size and calculating local bin offsets.
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int clusterBlockRank = cluster.block_rank();
    int cluster_size = cluster.dim_blocks().x;

    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
    {
        smem[i] = 0; // Initialize shared memory histogram to zeros
    }

    // cluster synchronization ensures that shared memory is initialized to zero in
    // all thread blocks in the cluster. It also ensures that all thread blocks
    // have started executing and they exist concurrently.
    cluster.sync();

    for (int i = tid; i < array_size; i += blockDim.x * gridDim.x)
    {
        int ldata = input[i];

        // Find the right histogram bin.
        int binid = ldata;
        if (ldata < 0)
            binid = 0;
        else if (ldata >= nbins)
            binid = nbins - 1;

        // Find destination block rank and offset for computing
        // distributed shared memory histogram
        int dst_block_rank = (int)(binid / bins_per_block);
        int dst_offset = binid % bins_per_block;

        // Pointer to target block shared memory
        int *dst_smem = cluster.map_shared_rank(smem, dst_block_rank);

        // Perform atomic update of the histogram bin
        atomicAdd(dst_smem + dst_offset, 1);
    }

    // cluster synchronization is required to ensure all distributed shared
    // memory operations are completed and no thread block exits while
    // other thread blocks are still accessing distributed shared memory
    cluster.sync();

    // Perform global memory histogram, using the local distributed memory histogram
    int *lbins = bins + cluster.block_rank() * bins_per_block;
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
    {
        atomicAdd(&lbins[i], smem[i]);
    }
}
The above kernel can be launched at runtime with a cluster size that depends on the
amount of distributed shared memory required. If the histogram is small enough to fit
in the shared memory of just one block, the user can launch the kernel with a cluster
size of 1. The code snippet below shows how to launch a cluster kernel dynamically
depending on shared memory requirements.

// Launch via extensible launch
{
    cudaLaunchConfig_t config = {0};
    config.gridDim = array_size / threads_per_block;
    config.blockDim = threads_per_block;

    // cluster_size depends on the histogram size.
    // ( cluster_size == 1 ) implies no distributed shared memory, just thread block local shared memory
    int cluster_size = 2; // size 2 is an example here
    int nbins_per_block = nbins / cluster_size;

    // dynamic shared memory size is per block.
    // Distributed shared memory size = cluster_size * nbins_per_block * sizeof(int)
    config.dynamicSmemBytes = nbins_per_block * sizeof(int);

    CUDA_CHECK(::cudaFuncSetAttribute((void *)clusterHist_kernel,
                                      cudaFuncAttributeMaxDynamicSharedMemorySize,
                                      config.dynamicSmemBytes));

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = cluster_size;
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;

    config.numAttrs = 1;
    config.attrs = attribute;

    cudaLaunchKernelEx(&config, clusterHist_kernel, bins, nbins, nbins_per_block, input,
                       array_size);
}

3.2.6. Page-Locked Host Memory


The runtime provides functions to allow the use of page-locked (also known as pinned)
host memory (as opposed to regular pageable host memory allocated by malloc()):

- cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;
- cudaHostRegister() page-locks a range of memory allocated by malloc() (see the
  reference manual for limitations).

Using page-locked host memory has several benefits:

- Copies between page-locked host memory and device memory can be performed
  concurrently with kernel execution for some devices as mentioned in
  Asynchronous Concurrent Execution.
- On some devices, page-locked host memory can be mapped into the address
  space of the device, eliminating the need to copy it to or from device memory as
  detailed in Mapped Memory.
- On systems with a front-side bus, bandwidth between host memory and device
  memory is higher if host memory is allocated as page-locked and even higher if in
  addition it is allocated as write-combining as described in Write-Combining
  Memory.

Note: Page-locked host memory is not cached on non I/O coherent Tegra devices. Also,
cudaHostRegister() is not supported on non I/O coherent Tegra devices.

The simple zero-copy CUDA sample comes with a detailed document on the page-
locked memory APIs.
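As a brief sketch of the two allocation paths above (bytes is an assumed buffer size; error checking omitted):

#include <cstdlib>
#include <cuda_runtime.h>

size_t bytes = 1 << 20;

// Allocate page-locked host memory directly.
float* h_pinned = nullptr;
cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
// ... use h_pinned as the source or destination of copies ...
cudaFreeHost(h_pinned);

// Or page-lock an existing pageable allocation.
float* h_pageable = (float*)malloc(bytes);
cudaHostRegister(h_pageable, bytes, cudaHostRegisterDefault);
// ... use h_pageable ...
cudaHostUnregister(h_pageable);
free(h_pageable);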

3.2.6.1. Portable Memory


A block of page-locked memory can be used in conjunction with any device in the
system (see Multi-Device System for more details on multi-device systems), but by
default, the benefits of using page-locked memory described above are only available
in conjunction with the device that was current when the block was allocated (and
with all devices sharing the same unified address space, if any, as described in Unified
Virtual Address Space). To make these advantages available to all devices, the block
needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or
page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister() .
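A minimal sketch, assuming a buffer that should be usable as pinned memory from every device in the system (bytes and h_existing are hypothetical names):

// Allocate the block as portable pinned memory.
float* h_buf = nullptr;
cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocPortable);

// Or make an existing malloc() allocation portable when page-locking it.
cudaHostRegister(h_existing, bytes, cudaHostRegisterPortable);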

3.2.6.2. Write-Combining Memory


By default page-locked host memory is allocated as cacheable. It can optionally be
allocated as write-combining instead by passing flag cudaHostAllocWriteCombined to
cudaHostAlloc() . Write-combining memory frees up the host’s L1 and L2 cache
resources, making more cache available to the rest of the application. In addition,
write-combining memory is not snooped during transfers across the PCI Express bus,
which can improve transfer performance by up to 40%.

Reading from write-combining memory from the host is prohibitively slow, so write-
combining memory should in general be used for memory that the host only writes to.

Using CPU atomic instructions on WC memory should be avoided because not all CPU
implementations guarantee that functionality.
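A minimal sketch of a write-combined staging buffer that the host only fills and then copies to the device (d_buf and bytes are assumed to exist):

float* h_wc = nullptr;
cudaHostAlloc((void**)&h_wc, bytes, cudaHostAllocWriteCombined);

// Host writes only; reading h_wc back on the host would be very slow.
for (size_t i = 0; i < bytes / sizeof(float); ++i)
    h_wc[i] = 0.0f;

cudaMemcpy(d_buf, h_wc, bytes, cudaMemcpyHostToDevice);
cudaFreeHost(h_wc);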

3.2.6.3. Mapped Memory


A block of page-locked host memory can also be mapped into the address space of
the device by passing flag cudaHostAllocMapped to cudaHostAlloc() or by passing flag
cudaHostRegisterMapped to cudaHostRegister() . Such a block has therefore in general
two addresses: one in host memory that is returned by cudaHostAlloc() or malloc() ,
and one in device memory that can be retrieved using cudaHostGetDevicePointer() and
then used to access the block from within a kernel. The only exception is for pointers
allocated with cudaHostAlloc() and when a unified address space is used for the host
and the device as mentioned in Unified Virtual Address Space.

Accessing host memory directly from within a kernel does not provide the same
bandwidth as device memory, but does have some advantages:

 There is no need to allocate a block in device memory and copy data between this
block and the block in host memory; data transfers are implicitly performed as
needed by the kernel;
 There is no need to use streams (see Concurrent Data Transfers) to overlap data
transfers with kernel execution; the kernel-originated data transfers automatically
overlap with kernel execution.

Since mapped page-locked memory is shared between host and device however, the
application must synchronize memory accesses using streams or events (see
Asynchronous Concurrent Execution) to avoid any potential read-after-write, write-
after-read, or write-after-write hazards.

To be able to retrieve the device pointer to any mapped page-locked memory, page-
locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the
cudaDeviceMapHost flag before any other CUDA call is performed. Otherwise,
cudaHostGetDevicePointer() will return an error.

cudaHostGetDevicePointer() also returns an error if the device does not support


mapped page-locked host memory. Applications may query this capability by checking
the canMapHostMemory device property (see Device Enumeration), which is equal to 1 for
devices that support mapped page-locked host memory.
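Putting these pieces together, a hedged sketch of mapped memory usage might look like the following (myKernel, grid, block, and bytes are placeholders; error checking omitted):

cudaSetDeviceFlags(cudaDeviceMapHost);          // must precede any other CUDA call

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory) {
    float* h_data = nullptr;
    cudaHostAlloc((void**)&h_data, bytes, cudaHostAllocMapped);

    float* d_data = nullptr;                    // device-side address of the same block
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    myKernel<<<grid, block>>>(d_data);          // kernel accesses host memory directly
    cudaDeviceSynchronize();                    // synchronize before the host reads h_data

    cudaFreeHost(h_data);
}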

Note that atomic functions (see Atomic Functions) operating on mapped page-locked
memory are not atomic from the point of view of the host or other devices.

Also note that CUDA runtime requires that 1-byte, 2-byte, 4-byte, 8-byte, and 16-byte
naturally aligned loads and stores to host memory initiated from the device are
preserved as single accesses from the point of view of the host and other devices. On
some platforms, atomics to memory may be broken by the hardware into separate
load and store operations. These component load and store operations have the same
requirements on preservation of naturally aligned accesses. The CUDA runtime does
not support a PCI Express bus topology where a PCI Express bridge splits 8-byte
naturally aligned operations and NVIDIA is not aware of any topology that splits 16-
byte naturally aligned operations.

3.2.7. Memory Synchronization Domains


3.2.7.1. Memory Fence Interference
Some CUDA applications may see degraded performance due to memory fence/flush
operations waiting on more transactions than those necessitated by the CUDA
memory consistency model.

__managed__ int x = 0;
__device__  cuda::atomic<int, cuda::thread_scope_device> a(0);
__managed__ cuda::atomic<int, cuda::thread_scope_system> b(0);

Thread 1 (SM):
    x = 1;
    a = 1;

Thread 2 (SM):
    while (a != 1) ;
    assert(x == 1);
    b = 1;

Thread 3 (CPU):
    while (b != 1) ;
    assert(x == 1);

Consider the example above. The CUDA memory consistency model guarantees that
the asserted condition will be true, so the write to x from thread 1 must be visible to
thread 3, before the write to b from thread 2.

The memory ordering provided by the release and acquire of a is only sufficient to
make x visible to thread 2, not thread 3, as it is a device-scope operation. The
system-scope ordering provided by release and acquire of b , therefore, needs to
ensure not only writes issued from thread 2 itself are visible to thread 3, but also
writes from other threads that are visible to thread 2. This is known as cumulativity. As
the GPU cannot know at the time of execution which writes have been guaranteed at
the source level to be visible and which are visible only by chance timing, it must cast a
conservatively wide net for in-flight memory operations.

This sometimes leads to interference: because the GPU is waiting on memory
operations it is not required to wait on at the source level, the fence/flush may take
longer than necessary.

Note that fences may occur explicitly as intrinsics or atomics in code, like in the
example, or implicitly to implement synchronizes-with relationships at task boundaries.

A common example is when a kernel is performing computation in local GPU memory,
and a parallel kernel (e.g., from NCCL) is performing communications with a peer. Upon
completion, the local kernel will implicitly flush its writes to satisfy any synchronizes-
with relationships to downstream work. This may unnecessarily wait, fully or partially,
on slower NVLink or PCIe writes from the communication kernel.
3.2.7.2. Isolating Traffic with Domains
Beginning with Hopper architecture GPUs and CUDA 12.0, the memory
synchronization domains feature provides a way to alleviate such interference. In
exchange for explicit assistance from code, the GPU can reduce the net cast by a
fence operation. Each kernel launch is given a domain ID. Writes and fences are tagged
with the ID, and a fence will only order writes matching the fence’s domain. In the
concurrent compute vs communication example, the communication kernels can be
placed in a different domain.

When using domains, code must abide by the rule that ordering or synchronization
between distinct domains on the same GPU requires system-scope fencing. Within
a domain, device-scope fencing remains sufficient. This is necessary for cumulativity
as one kernel’s writes will not be encompassed by a fence issued from a kernel in
another domain. In essence, cumulativity is satisfied by ensuring that cross-domain
traffic is flushed to the system scope ahead of time.

Note that this modifies the definition of thread_scope_device. However, because
kernels will default to domain 0 as described below, backward compatibility is
maintained.

3.2.7.3. Using Domains in CUDA


Domains are accessible via the new launch attributes cudaLaunchAttributeMemSyncDomain
and cudaLaunchAttributeMemSyncDomainMap . The former selects between logical domains
cudaLaunchMemSyncDomainDefault and cudaLaunchMemSyncDomainRemote , and the latter
provides a mapping from logical to physical domains. The remote domain is intended
for kernels performing remote memory access in order to isolate their memory traffic
from local kernels. Note, however, the selection of a particular domain does not affect
what memory access a kernel may legally perform.

The domain count can be queried via the device attribute cudaDevAttrMemSyncDomainCount.
Hopper has 4 domains. To facilitate portable code, domains functionality can be used
on all devices and CUDA will report a count of 1 prior to Hopper.
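As a hedged sketch of these attributes, a communication kernel could be tagged with the remote logical domain as shown below; the union member name memSyncDomain, and the names commKernel (assumed here to take no arguments), commStream, grid, and block are assumptions for illustration:

// Tag a launch with the remote logical domain so its fences only order its own traffic.
cudaLaunchAttribute domainAttr;
domainAttr.id  = cudaLaunchAttributeMemSyncDomain;
domainAttr.val.memSyncDomain = cudaLaunchMemSyncDomainRemote;

cudaLaunchConfig_t config = {0};
config.gridDim  = grid;
config.blockDim = block;
config.stream   = commStream;
config.attrs    = &domainAttr;
config.numAttrs = 1;

cudaLaunchKernelEx(&config, commKernel /* , kernel arguments if any */);

// Portable code can first check how many domains the device exposes
// (1 on GPUs before Hopper):
int domainCount = 0;
cudaDeviceGetAttribute(&domainCount, cudaDevAttrMemSyncDomainCount, 0);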

Having logical domains eases application composition. An individual kernel launch at a
low level in the stack, such as from NCCL, can select a semantic logical domain
without concern for the surrounding application architecture. Higher levels can steer
logical domains using the mapping. The default value for the logical domain if it is not
set is the default domain, and the default mapping is to map the default domain to 0
and the remote domain to 1 (on GPUs with more than 1 domain). Specific libraries may
tag launches with the remote domain in CUDA 12.0 and later; for example, NCCL 2.16
will do so. Together, this provides a beneficial use pattern for common applications out
of the box, with no code changes needed in other components, frameworks, or at the
application level. An alternative use pattern, for example in an application using
