Hi,
I want to build a cross-block synchronization mechanism, but I'm not sure I understand how the causality order composes transitively across different scopes in CUDA (global or block).
Here is an example. Assume all operations are relaxed, and that y and z live in Block 1's shared memory and are used for synchronization inside Block 1.
| Block 0, T0 | Block 1, T0 | Block 1, T1 |
| --- | --- | --- |
| A.store(1, global) | | B.store(1, global) |
| fence_global() | | fence_global() |
| | y.add(1, shared) | y.add(1, shared) |
| | while(y < 2)() | while(y < 2)() |
| | fence_global() | fence_global() |
| G.add(1, global) | G.add(1, global) | |
| while(G < 2)() | while(G < 2)() | |
| fence_global() | fence_global() | |
| B.load()==1? | A.load()==1? | |
| | z.add(1, shared) | z.add(1, shared) |
| | | while(z < 2)() |
| | fence_global() | fence_global() |
| | | A ?= 1 |
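For concreteness, this is roughly how the three threads could be written with libcu++. It is only a sketch under several assumptions: `fence_global()` is taken to be a device-scope `seq_cst` fence, `LitmusResult`, `litmus` and `run_litmus` are made-up names, the kernel is launched with 2 blocks of 2 threads that are both resident at the same time, and the GPU is Volta or newer so that two threads of the same warp can spin on each other without deadlocking.

```cuda
#include <cuda/atomic>
#include <cuda_runtime.h>

// Hypothetical record of what each observing thread loaded.
struct LitmusResult { int b0t0_saw_B; int b1t0_saw_A; int b1t1_saw_A; };

using DevRef = cuda::atomic_ref<int, cuda::thread_scope_device>;
using BlkRef = cuda::atomic_ref<int, cuda::thread_scope_block>;

// Assumption: fence_global() in the table means a device-scope fence.
__device__ void fence_global() {
    cuda::atomic_thread_fence(cuda::std::memory_order_seq_cst,
                              cuda::thread_scope_device);
}

__global__ void litmus(int* A, int* B, int* G, LitmusResult* out) {
    __shared__ int y, z;
    if (threadIdx.x == 0) { y = 0; z = 0; }
    __syncthreads();  // setup only, runs before any operation of the test

    DevRef a(*A), b(*B), g(*G);   // global variables, device scope
    BlkRef ys(y), zs(z);          // shared variables, block scope
    const auto rlx = cuda::std::memory_order_relaxed;

    if (blockIdx.x == 0 && threadIdx.x == 0) {         // Block 0, T0
        a.store(1, rlx);
        fence_global();
        g.fetch_add(1, rlx);
        while (g.load(rlx) < 2) {}
        fence_global();
        out->b0t0_saw_B = b.load(rlx);                 // B.load()==1?
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {  // Block 1, T0
        ys.fetch_add(1, rlx);
        while (ys.load(rlx) < 2) {}
        fence_global();
        g.fetch_add(1, rlx);
        while (g.load(rlx) < 2) {}
        fence_global();
        out->b1t0_saw_A = a.load(rlx);                 // A.load()==1?
        zs.fetch_add(1, rlx);
        fence_global();
    } else if (blockIdx.x == 1 && threadIdx.x == 1) {  // Block 1, T1
        b.store(1, rlx);
        fence_global();
        ys.fetch_add(1, rlx);
        while (ys.load(rlx) < 2) {}
        fence_global();
        zs.fetch_add(1, rlx);
        while (zs.load(rlx) < 2) {}
        fence_global();
        out->b1t1_saw_A = a.load(rlx);                 // A ?= 1
    }
}

// Hypothetical host helper: runs the test once and returns the observations.
LitmusResult run_litmus() {
    int* vars;             // vars[0] = A, vars[1] = B, vars[2] = G
    LitmusResult* d_out;
    cudaMalloc(&vars, 3 * sizeof(int));
    cudaMemset(vars, 0, 3 * sizeof(int));
    cudaMalloc(&d_out, sizeof(LitmusResult));
    cudaMemset(d_out, 0xFF, sizeof(LitmusResult));     // -1 means "not written"
    litmus<<<2, 2>>>(vars, vars + 1, vars + 2, d_out);
    LitmusResult h;
    cudaMemcpy(&h, d_out, sizeof(h), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(vars);
    cudaFree(d_out);
    return h;
}
```

Running this many times can only show that a stale value was never observed on one particular GPU. It cannot show that the memory model forbids it, which is why I am asking about the model itself.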
My question is: can operations on shared memory also participate in the global causality order?
If so, Block 0, T0 should load the correct value of B. Here is my reasoning.
From Block 0, T0's perspective:
B.store() in B1 T1 happens before y.add() in B1 T1.
y.add() in B1 T1 happens before G.add() in B1 T0.
G.add() in B1 T0, which B0 T0 observes by reading G == 2, happens before B.load() in B0 T0.
So B.load() in B0 T0 happens after B.store() in B1 T1 and should read 1.
However, if this assumption does not hold for the shared-memory variables y and z, the chain is broken: B0 is not aware of the ordering established by y.add(), so we cannot say that B.store() in B1 T1 happens before G.add() in B1 T0.
Thanks very much.