Inter-Block Synchronization

haojia.li · November 7, 2025, 11:39pm

Hi,

I want to build a cross-block synchronization mechanism, but I’m not sure I understand how causality order composes transitively across different scopes in CUDA (global or block).

Here is an example. Assume all operations are relaxed, and that y and z live in shared memory and are used by Block 1 for its own intra-block synchronization.

Block 0, T0            Block 1, T0            Block 1, T1
A.store(1, global)                            B.store(1, global)
fence_global()                                fence_global()
                       y.add(1, shared)       y.add(1, shared)
                       while (y < 2) {}       while (y < 2) {}
                       fence_global()         fence_global()
G.add(1, global)       G.add(1, global)
while (G < 2) {}       while (G < 2) {}
fence_global()         fence_global()
B.load() == 1?         A.load() == 1?
                       z.add(1, shared)       z.add(1, shared)
                                              while (z < 2) {}
                       fence_global()         fence_global()
                                              A.load() == 1?

My question is: can operations on shared memory also participate in the global causality order?

If so, Block 0, T0 should load the correct value of B. Here is my reasoning.

From Block 0, T0’s perspective:

B.store() happens before y.add() in B1 T1

y.add() from B1 T1 happens before G.add() in B1 T0

G.add() in B1 T0 is observed by B0 T0 when it reads G == 2, and that read happens before B.load() in B0 T0

So B.load() in B0 T0 happens after B.store() in B1 T1, and it should read the stored value.

However, if my assumption does not hold for the shared-memory variables y and z, the chain is broken: B0 is not aware of the ordering of y.add(), so we cannot say that B.store() in B1 T1 happens before G.add() in B1 T0.
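
For concreteness, this is roughly how I would write the schedule above with libcu++ atomics. It is only a sketch under my own assumptions: fence_global() is taken to be a sequentially consistent fence at device scope, the global operations use cuda::thread_scope_device, the shared-memory operations use cuda::thread_scope_block, and the Flags struct, the handshake kernel, and the run_handshake helper are hypothetical names added so the loaded values can be copied back to the host. It also assumes both blocks are resident at the same time (they spin on each other) and a toolkit recent enough to ship cuda::atomic_ref.

```cuda
#include <cstdio>
#include <cuda/atomic>
#include <cuda_runtime.h>

// Hypothetical layout: A, B, G mirror the example; the last three fields
// record what each thread loaded so the host can inspect it.
struct Flags {
    int A, B, G;
    int b0t0_B, b1t0_A, b1t1_A;
};

// Assumption: fence_global() means a seq_cst fence at device scope.
__device__ inline void fence_global() {
    cuda::atomic_thread_fence(cuda::std::memory_order_seq_cst,
                              cuda::thread_scope_device);
}

__global__ void handshake(Flags* f) {
    __shared__ int y, z;
    if (threadIdx.x == 0) { y = 0; z = 0; }
    __syncthreads();  // only used to publish the initial values of y and z

    constexpr auto rlx = cuda::std::memory_order_relaxed;
    cuda::atomic_ref<int, cuda::thread_scope_device> A(f->A), B(f->B), G(f->G);
    cuda::atomic_ref<int, cuda::thread_scope_block> ys(y), zs(z);

    if (blockIdx.x == 0) {                 // Block 0, T0
        if (threadIdx.x != 0) return;
        A.store(1, rlx);
        fence_global();
        G.fetch_add(1, rlx);
        while (G.load(rlx) < 2) {}
        fence_global();
        f->b0t0_B = B.load(rlx);           // B.load() == 1?
        return;
    }

    // Block 1, T0 and T1
    if (threadIdx.x == 1) {
        B.store(1, rlx);
        fence_global();
    }
    ys.fetch_add(1, rlx);                  // y.add(1, shared)
    while (ys.load(rlx) < 2) {}
    fence_global();
    if (threadIdx.x == 0) {
        G.fetch_add(1, rlx);
        while (G.load(rlx) < 2) {}
        fence_global();
        f->b1t0_A = A.load(rlx);           // A.load() == 1?
    }
    zs.fetch_add(1, rlx);                  // z.add(1, shared)
    if (threadIdx.x == 1) {
        while (zs.load(rlx) < 2) {}
    }
    fence_global();
    if (threadIdx.x == 1) {
        f->b1t1_A = A.load(rlx);           // A.load() == 1?
    }
}

// Hypothetical host helper: runs two blocks of two threads and copies back.
cudaError_t run_handshake(Flags* out) {
    Flags* d = nullptr;
    cudaError_t err = cudaMalloc(&d, sizeof(Flags));
    if (err != cudaSuccess) return err;
    err = cudaMemset(d, 0, sizeof(Flags));
    if (err == cudaSuccess) {
        handshake<<<2, 2>>>(d);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(out, d, sizeof(Flags), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err;
}

int main() {
    Flags out{};
    cudaError_t err = run_handshake(&out);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("B0T0 loaded B=%d, B1T0 loaded A=%d, B1T1 loaded A=%d\n",
                out.b0t0_B, out.b1t0_A, out.b1t1_A);
    return 0;
}
```

The three printed values are what B0 T0, B1 T0, and B1 T1 loaded. Whether the memory model guarantees that all of them are 1 is exactly what I am asking; a single run printing 1 would not prove it.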

Thanks very much.
