CUDA Thread Divergence Costs
arXiv:1504.01650v1, 7 Apr 2015

Abstract—All modern processors include a set of vector instructions. While this gives a tremendous boost to performance, it requires vectorized code that can take advantage of such instructions. As ideal vectorization is hard to achieve in practice, one has to […]

On NVIDIA architectures threads are grouped into warps of 32 threads that must all execute the same instructions in an essentially SIMD (vector) fashion. This entails that any branching instruction whose condition does not give the same result across the whole warp leads to thread divergence […]
TABLE II
Devices used in test.

    architecture   device
    Fermi          GTX 480, GT 610
    Kepler         GT 650M, GT 755M, GTX 770
    Maxwell        GTX 850M

    #define EXPR_INNER 1.3333f
    #define EXPR_OUTER 2.3333f

    __global__ void single_loop(int* limits, float* out, llong* timer) {

      int tid = blockDim.x * blockIdx.x + threadIdx.x;
      int M = limits[threadIdx.x];
      float sum = out[tid];

      #pragma unroll
      for (int k = 0; k < N_PREHEAT; k++)
        for (int i = 0; i < M; i++) {
          sum += EXPR_INNER;
        }

      __syncthreads();
      llong start = clock64();

      #pragma unroll
      for (int k = 0; k < N_UNROLL; ++k)
        for (int i = 0; i < M; i++) {
          sum += EXPR_INNER;
        }

      llong stop = clock64();
      __syncthreads();

      out[tid] = sum;
      timer[2 * tid] = start;
      timer[2 * tid + 1] = stop;
    }

Listing 1. Single loop kernel.

Fig. 1. Single loop timings corresponding to kernel in Listing 1. [Plot: cycles (1,000–3,500) vs. n (0–32) for Fermi, Kepler, Maxwell.]
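The timing harness in Listing 1 (preheat, barrier, clock64(), timed loop, clock64()) is a generic microbenchmark pattern. As a sketch only, using std::chrono on the CPU in place of clock64() and with illustrative N_PREHEAT/N_UNROLL values that are not the paper's, the same structure looks like this:

```cpp
#include <chrono>

// CPU-side sketch of the preheat/measure pattern from Listing 1.
// N_PREHEAT, N_UNROLL and EXPR_INNER mirror the kernel's names, but the
// loop counts here are illustrative, not the values used in the paper.
constexpr int   N_PREHEAT  = 1000;
constexpr int   N_UNROLL   = 1000;
constexpr float EXPR_INNER = 1.3333f;

float measure(int M, long long* elapsed_ns) {
    volatile float sum = 0.0f;  // volatile: keep the compiler from folding the loops away

    // Preheat phase: run the same work untimed to warm caches and stabilize clocks.
    for (int k = 0; k < N_PREHEAT; k++)
        for (int i = 0; i < M; i++)
            sum = sum + EXPR_INNER;

    auto start = std::chrono::steady_clock::now();   // analogue of the first clock64()
    for (int k = 0; k < N_UNROLL; ++k)
        for (int i = 0; i < M; i++)
            sum = sum + EXPR_INNER;
    auto stop = std::chrono::steady_clock::now();    // analogue of the second clock64()

    *elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return sum;
}
```

On the GPU the two __syncthreads() calls additionally fence the whole block around the timed region; there is no direct single-thread analogue of that here.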
TABLE I
Loop limits used in the single loop measurements.
(entry = loop limit M for thread tid in measurement row n)

 n \ tid   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  0       32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32
  1       32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 31
  2       32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 31 30
  ...
 28       32 32 32 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4
 29       32 32 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3
 30       32 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
 31       32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
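The limits in Table I follow a simple rule: in row n the first 32 − n lanes keep the full limit of 32, and each later lane's limit drops by one. A one-line reconstruction of that rule (ours, not code from the paper):

```cpp
#include <algorithm>

// Loop limit M for warp lane `tid` (0..31) in measurement row `n` (0..31),
// reconstructed from Table I: lanes past 31 - n get successively smaller
// limits, so exactly n lanes drop out of the loop one iteration apart.
int loop_limit(int n, int tid) {
    return 32 - std::max(0, tid - (31 - n));
}
```

For row n this produces exactly n diverging iterations of the single loop: one thread leaves the warp at each of the last n trips.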
    #define EXPR_INNER 1.3333f
    #define EXPR_OUTER 2.3333f

    __global__ void double_loops(int* limits, float* out, llong* timer) {

      int tid = blockDim.x * blockIdx.x + threadIdx.x;
      int M = limits[2 * threadIdx.x];
      int N = limits[2 * threadIdx.x + 1];
      float sum = out[0];

      #pragma unroll
      for (int k = 0; k < N_PREHEAT; k++)
        for (int i = 0; i < M; i++) {
          for (int j = 0; j < N; j++) {
            sum += EXPR_INNER;
          }
          sum += EXPR_OUTER;
        }

      __syncthreads();
      llong start = clock64();

      #pragma unroll
      for (int k = 0; k < N_UNROLL; ++k)
        for (int i = 0; i < M; i++) {
          for (int j = 0; j < N; j++) {
            sum += EXPR_INNER;
          }
          sum += EXPR_OUTER;
        }

      llong stop = clock64();
      __syncthreads();
      out[tid] = sum;
      timer[2 * tid] = start;
      timer[2 * tid + 1] = stop;
    }

Listing 2. Double loop kernel.

Fig. 2. Double loop timings corresponding to kernel in Listing 2. [Plot: cycles (·10^5, 0.4–1.0) vs. n (0–32) for Fermi, Kepler, Maxwell.]

[…] is based on reference [10]. We choose reference [10] over [9] because of the SSY instruction specification. While in [9] it is described as taking no arguments, the cuobjdump listing shows that it expects one argument, which matches the description in [10]. We concentrate only on one type of control instruction which is relevant to our example, the predicated branch instruction, ignoring all others.

The streaming multiprocessor (SMX) maintains for each warp an active mask that indicates which threads in the warp are active. Before a potentially diverging instruction (branch), the compiler issues one set-synchronization SSY instruction. This instruction causes a new synchronization token to be pushed on top of the synchronization stack. Such a token consists of three parts: mask, id and program counter (pc), and takes 64 bits, 32 of which are taken by the mask. In the case of the SSY instruction the mask is set to the active mask, the id to SYNC, and the pc to the address of the synchronization point. The actual divergence is caused by the predicated branch instruction, which has the form @P0 BRA label and translates to the pseudocode in Algorithm 1. P0 is a 32-bit predicate register that indicates which threads in the warp should execute (take) the branch.

Instructions are executed according to the pseudocode in Algorithm 2. An instruction may have a pop-bit set, also denoted as a synchronization command and indicated by the suffix .S in the assembler output. An instruction suffixed with .S will be called a carrier instruction. When encountered, it signals the stack unwinding: the token is popped from the stack and used to set the active mask and the program counter. Reference [10] does not specify exactly when carrier instructions are executed, before or after popping the stack, so we […]
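The token described above (64 bits: a 32-bit mask plus an id and the synchronization-point pc) can be written down as a small struct. This is a sketch of our reading of the text; the paper fixes only the mask width, so the field types chosen for id and pc are assumptions:

```cpp
#include <cstdint>

// Sketch of a synchronization-stack token: mask, id, and program counter.
// Only "64 bits, 32 of which are the mask" is given in the text; the rest
// of the layout below is assumed for illustration.
enum TokenId : std::uint8_t { SYNC, DIV };

struct Token {
    std::uint32_t mask;  // snapshot of the active mask (one bit per warp lane)
    TokenId       id;    // SYNC (pushed by SSY) or DIV (pushed by a diverging branch)
    std::uint32_t pc;    // synchronization point (SSY) or not-taken address (branch)
};

// SSY: build the token that gets pushed, with mask = active mask, id = SYNC,
// and pc = address of the synchronization point.
Token make_ssy_token(std::uint32_t active_mask, std::uint32_t sync_point) {
    return Token{active_mask, SYNC, sync_point};
}
```

The address 0x178 used below comes from the SSY operand visible in Listing 6.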
Algorithm 2 (fragment; steps 1–2 are not part of this extraction):
 3:   token.mask ← active_mask
 4:   token.id ← SYNC
 5:   token.pc ← label
 6:   pc ← pc + 1
 7: else if instruction is a predicated branch instruction then
 8:   Execute the instruction according to Algorithm 1
 9: else
10:   if pop-bit is set in instruction then
11:     token ← pop()
12:     active_mask ← token.mask
13:     pc ← token.pc
14:     Execute the instruction
15:   else

Fig. 3. Walk through the single loop (n = 2) showing the stack history. [Recovered panels (current pc and active mask, then stack contents top-down):
 3. pc = 0x0100, active mask 0x7FFFFFFF; stack: DIV(0x0120, 0x80000000), SYNC(0x0128, 0xFFFFFFFF)
 4. pc = 0x0100, active mask 0x3FFFFFFF; stack: DIV(0x0120, 0x40000000), DIV(0x0120, 0x80000000), SYNC(0x0128, 0xFFFFFFFF)
 6. pc = 0x0120, active mask 0x80000000; stack: SYNC(0x0128, 0xFFFFFFFF)
 7. pc = 0x0128, active mask 0xFFFFFFFF; stack: empty]

IV. Analysis

We will try to understand the results of the benchmarks from Section II in view of the implementation described in Section III. We begin by disassembling the kernels. The results are presented in Listings 3 and 4; we have included only the relevant portions of the assembler code. Looking at Listing 3 we notice all the instructions described in the previous section. To better understand what is happening here we will make a "walk-through" of this listing assuming n = 2.
We start on line 1: all bits in the active mask are set and the stack is empty (Figure 3.1). The predicate register P0 is set to the result of the comparison M < 1. In our test this condition is never fulfilled, so all bits of P0 are cleared.

As line 4 contains a potentially diverging instruction, an SSY instruction is issued on line 3 with an address pointing past the loop to line 12. The active mask and the address are pushed onto the stack (Figure 3.2).

On line 4 we have a predicated branch instruction, but as the register P0 is all false, this does not produce any divergence, and so does not change the active mask and does not push any token on the stack. The control then moves on to the next instructions.

Ignoring two NOPs, the next instruction on line 7 increments the loop counter i in register R4, which was set to zero by an instruction not shown in the listing. The next instruction, on line 8, performs the addition.

On line 9 register P0 is set to the result of testing the loop condition i < M. On the next line this register is used to predicate the branch instruction. As i = 1, all bits in P0 are set, the branch is taken by all threads, and no token is pushed onto the stack. All the threads are active. The control jumps to line 7, where the loop counter is again incremented. This continues until i = 30 without any changes to the stack.

When i = 30 the condition on line 9 fails for the last thread of the warp and the last bit of P0 is cleared (P0 = 0x7FFFFFFF). The branch instruction on line 10 is now diverging, so a DIV token is pushed on the stack with the mask set to !P0 = 0x80000000 (the not-taken mask) and the address equal to PC+1 (line 11) (Figure 3.3). The active mask is set to P0, which disables thread 31, and the execution jumps to line 7. Then the loop counter i is incremented to 31, but only in the active threads.

At line 9, as i = 31, the comparison fails for thread 30. The predicate register P0 will have its bit 30 cleared (P0 = 0x3FFFFFFF), so not all active threads will take the branch on line 10. Again a DIV token will be pushed on the stack with the mask set to the not-taken mask (0x7FFFFFFF & !0x3FFFFFFF = 0x40000000) and PC = 0x0120. Then the active mask is set to P0 = 0x3FFFFFFF, disabling threads 30 and 31, and control jumps to line 7 (Figure 3.4).

The loop counter is incremented again, to 32, so all bits of the predicate P0 are cleared. There is no divergence in the branch on line 10, as none of the active threads take the branch, and the control moves to line 11.

The NOP instruction on line 11 has the pop-bit set, so a token is popped from the stack. The active mask is set to the token mask and the PC to the token's PC. The NOP instruction is executed, and the control jumps again to line 11, which was the token's program counter (Figure 3.5). The stack is popped again, and after executing the NOP instruction the control goes back to line 11 (Figure 3.6).

Then the last item is popped from the stack. This is the SYNC token that restores the 0xFFFFFFFF active mask and sets the PC to line 12, ending the loop (Figure 3.7).

This walk-through lets us see the general pattern: as n is increased we start pushing DIV tokens on the stack earlier and earlier at each iteration through the loop, and the stack unwinding takes place at the end. The total number of push/pop instructions is n + 1 (the SSY instruction always pushes a SYNC token on the stack) and the maximum stack size is also n + 1. This is illustrated in Figure 4, where we plot the stack history showing the length of the stack as a function of the number of executed instructions. This figure was obtained using a very simple emulator implementing Algorithms 1 and 2.

Fig. 4. Re-convergence stack history for the single loop kernel. [Plot: stack length (0–32) vs. #instructions (0–140) for n = 23 and n = 31.]

We have also instrumented our code using the CUPTI API from NVIDIA [11]. While this API cannot show the push/pop instructions, we have found out that for n = 0 . . . 15 the number of executed instructions indeed increases by one with increasing n, and that this is due to the increasing number of executions of the stack-unwinding instruction NOP.S on line 11. This corresponds with the linear rise of the number of cycles taken by the loop, as shown in Figure 1. By fitting to the first 16 points we have found that the penalty for one diverging branch is exactly 32 cycles on Kepler (see the continuous line in Figure 1) and 26 cycles on Maxwell. Those 32 cycles include pushing the token on the stack, executing a synchronization command, and popping the stack. In Section IV-A we will take a more detailed look at how exactly those cycles are divided between those operations.

Looking again at Figure 1, we notice that the first jump occurs when the length of the stack exceeds 16, i.e. when n > 15. We can assume that this is the physical length of the fast on-chip memory allocated for the stack, and that growing the stack longer requires spilling the entries into some other memory (as mentioned in [10]). After this jump the number of cycles again increases by 32 at each step for the next four entries, leading us to assume that the entries are spilled four at a time, i.e. in chunks of 32 bytes. This is exactly the length of the L2 cache line, suggesting that the entries are spilled to the global memory. This would also explain why running a preheat loop before the measurements […]
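The "very simple emulator implementing Algorithms 1 and 2" is not listed in the paper, but its single-loop behavior can be re-created in a few lines. The sketch below is our reconstruction, not the authors' code; it tracks only token masks (ignoring ids, pcs, and instruction counts) and reproduces the two properties noted above: n + 1 pushes/pops and a maximum stack depth of n + 1.

```cpp
#include <algorithm>
#include <cstdint>
#include <stack>

struct LoopStats {
    int pushes = 0;     // tokens pushed (SYNC + DIV)
    int pops = 0;       // tokens popped during unwinding
    int max_depth = 0;  // maximum stack depth reached
};

LoopStats emulate_single_loop(const int limits[32]) {
    std::stack<std::uint32_t> stk;       // token masks only; ids and pcs omitted
    std::uint32_t active = 0xFFFFFFFFu;  // all 32 lanes active
    LoopStats s;

    // SSY before the loop: push one SYNC token (per Algorithm 2).
    stk.push(active); s.pushes++;
    s.max_depth = std::max(s.max_depth, (int)stk.size());

    int max_iter = *std::max_element(limits, limits + 32);
    for (int i = 1; i <= max_iter; ++i) {
        // Backward-branch predicate: a lane takes the branch while i < limits[lane].
        std::uint32_t taken = 0;
        for (int lane = 0; lane < 32; ++lane)
            if (i < limits[lane]) taken |= (1u << lane);
        taken &= active;

        if (taken == 0) break;   // branch taken by no active lane: fall through
        if (taken != active) {   // diverging branch (Algorithm 1): push DIV token
            stk.push(active & ~taken); s.pushes++;  // not-taken mask
            s.max_depth = std::max(s.max_depth, (int)stk.size());
            active = taken;
        }
    }
    // Unwinding: carrier (.S) instructions pop tokens until the stack is empty.
    while (!stk.empty()) {
        active = stk.top(); stk.pop(); s.pops++;
    }
    return s;
}
```

Feeding it the Table I limits for row n yields exactly 1 SYNC plus n DIV pushes, matching the walk-through for n = 2 above.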
[…] entries. Likewise, when popping the stack would require access to elements that were spilled, they are loaded back from memory. We have implemented this scenario in our emulator. Comparing it with the timing measurements, we have found that the spilling and loading back together take around 84 cycles on the Kepler architecture and 176 cycles on Maxwell.

Fig. 5. Re-convergence stack history for the double loops kernel. [Plot: stack length (0–16) vs. #instructions (0–5,000).]

Using the CUPTI API we have looked for additional load/store or cache hit/miss instructions that would indicate the spill into the global memory, but found none; however, we have noticed the appearance of some additional branch instructions. Those instructions did not show up in the execution trace, so they could not be attributed to particular instructions. The number of those additional instructions was, however, exactly equal to the number of spills. Without detailed knowledge of the NVIDIA micro-architecture it is impossible to ascertain the meaning of those instructions. One plausible explanation would be that the branch instructions are reissued after the stack content is spilled to memory.

[Unrecovered figure: #branches vs. #spills; only the legend survives the extraction: "Branches (Kepler)", "Spills (Emulator)".]
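The spill hypothesis above (16 on-chip entries; spills in 32-byte chunks of four 8-byte tokens) can be condensed into a small bookkeeping model. The chunk size comes from the paper's conjecture; the exact trigger conditions below are our guess at one consistent implementation, not the authors' emulator:

```cpp
// Sketch of the spill hypothesis: a 16-entry on-chip stack; growing past it
// writes out a chunk of 4 tokens (4 x 8 bytes = one 32-byte L2 line), and
// unwinding into spilled territory reads a chunk back.
struct SpillModel {
    static constexpr int ON_CHIP = 16;  // physical on-chip stack entries
    static constexpr int CHUNK = 4;     // tokens per spill/fill transfer

    int depth = 0;         // current logical stack depth
    int spilled = 0;       // tokens currently resident in memory
    int spill_events = 0;  // chunk writes to memory
    int fill_events = 0;   // chunk reads from memory

    void push() {
        if (depth - spilled == ON_CHIP) {  // on-chip part full: spill one chunk
            spilled += CHUNK;
            spill_events++;
        }
        depth++;
    }
    void pop() {
        if (depth - spilled == 0 && spilled > 0) {  // on-chip part empty: refill
            spilled -= CHUNK;
            fill_events++;
        }
        depth--;
    }
};
```

Comparing trace counts from such a model against measured timings is how the combined spill-and-reload cost of around 84 cycles (Kepler) and 176 cycles (Maxwell) was estimated.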
[Unrecovered figure: cycles (·10^4, ≈5–8) vs. n; legend: Kepler, Maxwell, Emulator.]

     1 __global__ void single_loop_precise(int *limits, float *out, llong *timer) {
     2
     3   int tid = blockDim.x * blockIdx.x + threadIdx.x;
     4   int M = limits[threadIdx.x];
     5   float sum = out[tid];
     6
     7   #pragma unroll
     8   for (int k = 0; k < N_PREHEAT; k++)
     9     for (int i = 0; i < M; i++) {
    10       sum += 1.3333f;
    11     }
    12
    13   timer[tid] = clock64();
    14
    [listing continues beyond this extraction]

Fig. 8. Difference between the actual and predicted number of cycles for the double loop kernel on Kepler (GTX 770) architecture. [Plot: cycles (0–60) vs. n (0–32).]

     1 /*0068*/ S2R R9, SR_CLOCKHI;
     2 /*0070*/ IADD [Link], R8, R8;
     3 /*0078*/ [Link] R5, R5, R9, R8;
     4 /*0080*/ IADD.X R5, R5, R5;
     5 /*0088*/ MOV32I R9, 0x8;
     6 /*0090*/ IMAD [Link], R0, R9, c[0x0][0x30];
     7 /*0098*/ [Link].X R9, R0, R9, c[0x0][0x34];
     8 /*00a0*/ ST.E.64 [R8], R4;
     9 /*00a8*/ S2R R5, SR_CLOCKHI;
    10 /*00b0*/ S2R R8, SR_CLOCKLO;
    11 /*00b8*/ S2R R9, SR_CLOCKHI;
    12 /*00c0*/ IADD [Link], R8, R8;
    13 /*00c8*/ [Link] R5, R5, R9, R8;
    14 /*00d0*/ IADD.X R5, R5, R5;
    15 /*00d8*/ [Link] P0, PT, R7, 0x1, PT;
    16 /*00e0*/ MOV R10, RZ;
    17 /*00e8*/ SSY 0x178;
    18 /*00f0*/ @P0 BRA 0x170;
    19 /*00f8*/ MOV32I R11, 0x8;
    20 /*0100*/ IADD R10, R10, 0x1;
    21 /*0108*/ FADD32I R6, R6, 1.333;
    22 /*0110*/ ISCADD R9, R10, R0, 0x5;
    23 /*0118*/ IMAD [Link], R9, R11, c[0x0][0x30];
    24 /*0120*/ [Link].X R9, R9, R11, c[0x0][0x34];
    25 /*0128*/ ST.E.64 [R8], R4;
    26 /*0130*/ S2R R5, SR_CLOCKHI;
    27 /*0138*/ S2R R8, SR_CLOCKLO;
    28 /*0140*/ S2R R9, SR_CLOCKHI;
    29 /*0148*/ IADD [Link], R8, R8;
    30 /*0150*/ [Link] R9, R5, R9, R8;
    31 /*0158*/ IADD.X R5, R9, R9;
    32 /*0160*/ [Link] P0, PT, R10, R7, PT;
    33 /*0168*/ @P0 BRA 0x100;
    34 /*0170*/ IADD.S R0, R0, 0x420;

Listing 6. Disassembly of the single loop kernel with detailed timings without synchronization.

[…] additional timing instruction inserted inside the loop (line 17). According to the walk-through in the previous section those instructions should be executed before unwinding the stack, while the last instruction on line 21 should be executed after.

At first we did not include the synchronization command on line 20, as we deemed it unnecessary within one warp. The results were, however, not compatible with the previous ones. Inspection of the disassembly in Listing 6 revealed that actually all the timing instructions were executed before the stack unwinding triggered by the synchronization command on line 34.

Adding the __syncthreads() instruction on line 20 solved the problem. The overall results now matched the results from the previous section. Of course the total times differed substantially, as the code now had extra instructions, but the differences in the number of cycles between the runs
with various values of n were identical for n ≤ 15 (no spills) and differed by a few cycles for n > 15.

We have gathered the timelines for each n and each thread in the warp, but we used only the timings for thread 0 – the one that always iterates 32 times through the loop. We have found that when there were no spills (n ≤ 15) the total cost was incurred in the last code segment: the unwinding of the stack. There was no cost associated with the predicated branch instruction, i.e. the push operation. This may be explained by the fact that the latency of the push operation can be hidden, as the token pushed onto the stack is not needed by the next instruction. While unwinding the stack, however, the popped token is used immediately to set the active mask and PC. Additionally, the carrier instruction has to be executed; even if this is a NOP it still has to be fetched and decoded.

When n was greater than 15, i.e. the stack was spilled into memory, the cost was divided between the branch instruction, which now took around 40 cycles longer, and the stack unwinding, which now took around 42 cycles longer. This corresponds well with the overall 84-cycle penalty estimated in the previous section.

V. Summary

The Single Instruction Multiple Threads (SIMT) programming model introduced by NVIDIA with CUDA technology gives the programmer the illusion of writing a multithreaded application. In reality the threads are executed in groups of 32 threads (warps) performing essentially vector operations. The illusion of multithreading is maintained by an elaborate mechanism for managing the divergent threads within a warp. This mechanism is completely hidden from the programmer and no description is provided by NVIDIA in their programming guide. In this contribution we have presented a detailed study of two CUDA kernels leading to a large thread divergence. Following other sources, we have assumed that the management of the diverging threads is based on a synchronization stack and follows closely the idea described in reference [10], which is an NVIDIA-owned patent. That assumption allowed us to fit the observed execution times with accuracy better than 1%.

We have estimated the cost of a diverging branch instruction to be exactly 32 cycles on the Kepler architecture, provided that the maximum stack length did not exceed 16. After that, stack entries had to be spilled into memory, which carried an additional penalty of ∼84 cycles. On Maxwell the estimated cost of branch divergence was considerably shorter, only 24 cycles, but the cost of spilling was much higher, ∼176 cycles. More detailed timings suggest that all of that cost, in the case of no spills, can be attributed to stack unwinding.

We also performed the measurements on the Fermi architecture; there the pattern seems to be slightly different, suggesting that the physical thread synchronization stack is shorter on this architecture.

We have analyzed only a very specific case, loops with different bounds across the threads; this is however a quite natural scenario. We would like to emphasize that in this case the additional costs reported by us are the costs of the divergence management mechanism alone, excluding the additional penalty of incomplete vectorization. Nevertheless, they implied an overhead of ∼20% in the case of 14 diverging threads for the double loop kernel and ∼25% for the single loop. In the more general case there would be additional costs associated with the serialization of the different branches.

Our work was confined to NVIDIA CUDA technology; it would however be interesting to see how this corresponds to shader programming, both in DirectX and OpenGL, and we are planning to extend our work in this direction in the future.

References

[1] P. Białas, J. Kowal, A. Strzelecki et al., "GPU accelerated image reconstruction in a two-strip J-PET tomograph," arXiv preprint arXiv:1502.07478, 2015.
[2] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[3] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU microarchitecture through microbenchmarking," in Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE, 2010, pp. 235–246.
[4] N. Wilt, The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education, 2013.
[5] S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A parallel functional simulator for GPGPU," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2010 IEEE International Symposium on. IEEE, 2010, pp. 351–360.
[6] ——, "Comparaison d'algorithmes de branchements pour le simulateur de processeur graphique Barra," in 13ème Symposium sur les Architectures Nouvelles de Machines, 2009, pp. 1–12.
[7] T. M. Aamodt et al., "GPGPU-Sim," 2012. [Online]. Available: [Link]
[8] A. Habermaier and A. Knapp, "On the correctness of the SIMT execution model of GPUs," in Programming Languages and Systems. Springer, 2012, pp. 316–335.
[9] B. W. Coon and J. E. Lindholm, "System and method for managing divergent threads in a SIMD architecture," US Patent US7 353 369 B1, Apr. 1, 2008. [Online]. Available: [Link]
[10] B. W. Coon, J. R. Nickolls, L. Nyland, P. C. Mills, and J. E. Lindholm, "Indirect function call instructions in a synchronous parallel thread processor," US Patent US8 312 254 B2, Nov. 13, 2012. [Online]. Available: [Link]
[11] NVIDIA, "CUPTI: CUDA toolkit documentation," 2014. [Online]. Available: [Link]