
Benchmarking the cost of thread divergence in CUDA

P. Bialas and A. Strzelecki

The authors are with the Faculty of Physics, Astronomy and Computer Science, Jagiellonian University, ul. Łojasiewicza 11, 30-348 Krakow, Poland.

Abstract—All modern processors include a set of vector instructions. While this gives a tremendous boost to performance, it requires vectorized code that can take advantage of such instructions. As ideal vectorization is hard to achieve in practice, one has to decide when different instructions may be applied to different elements of the vector operand. This is especially important in implicit vectorization, as in the NVIDIA CUDA Single Instruction Multiple Threads (SIMT) model, where the vectorization details are hidden from the programmer. In order to assess the costs incurred by incompletely vectorized code, we have developed a micro-benchmark that measures the characteristics of the CUDA thread divergence model on different architectures, focusing on loop performance.

Index Terms—GPU, CUDA, multithreading, SIMT, SIMD

I. Introduction

Most current processors derive their performance from some form of SIMD instructions. This holds true for Intel i-Series and Xeon processors (8-lane wide AVX instructions), the Intel Xeon Phi accelerator (16-lane wide AVX instructions), AMD Graphics Core Next (64-lane wide wavefronts) and NVIDIA CUDA (32-lane wide warps). By the very nature of those instructions, all the operations on a vector are performed in parallel, at least conceptually. That places a constraint on the class of algorithms that can be efficiently implemented on such architectures. If we want to perform different operations on different components of the vector operands, we are essentially forced to issue different instructions, masking the unused components in each of them, which of course carries a performance penalty. The aim of this paper is to assess those costs. We will concentrate on NVIDIA CUDA, as that is the architecture we use in our current work [1].

The CUDA and OpenCL programming models make thread divergence very easy to achieve, as they rely on implicit vectorization. They model the computing environment as a great number of threads executing a single program called a kernel [2], and to the programmer this may look like multithreaded parallel execution. In fact the CUDA Programming Guide states: "For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge."

On NVIDIA architectures threads are grouped into warps of 32 threads that must all execute the same instructions in an essentially SIMD (vector) fashion. This entails that any branching instruction whose condition does not give the same result across the whole warp leads to thread divergence: while some threads take one branch the others do nothing, and then the other branch is taken with the roles reversed. This is the picture given in the NVIDIA programming guide, which warns against thread divergence but is otherwise missing any details. Only in the Tesla architecture white-paper is it mentioned that this is achieved using a branch synchronization stack [2].
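As an illustration (our sketch, not taken from the benchmark itself), a minimal kernel exhibiting this behavior might look as follows; the condition on the thread index splits the warp into two halves that are executed one after the other:

    // Illustrative sketch: the condition is true for only half of the warp,
    // so the multiplication is executed with threads 16-31 masked out and
    // the addition with threads 0-15 masked out, one after the other.
    __global__ void divergent_kernel(float* data)
    {
        int tid = threadIdx.x;
        if (tid < 16)
            data[tid] *= 2.0f;
        else
            data[tid] += 1.0f;
    }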
As the strict avoidance of thread divergence would severely limit the algorithms that could be implemented using CUDA, it is interesting to check what the real costs are. The necessity of following both branches is one obvious performance obstacle. We will, however, be interested in the overhead associated with the re-convergence mechanism itself, as manifested in loops.

Loops, which are probably the most used control statement in programming, are an interesting example. While it is easy to picture the two-way branching model as in an if-else statement, it is much harder to follow the execution of loops, especially nested ones. In this contribution we will benchmark the performance of single and double loops with bounds that differ across the threads. While such a pattern is discouraged in CUDA programming, it can nevertheless appear naturally when porting an existing multi-core algorithm to a GPU. To the best of our knowledge such measurements have not been published to date; we have only found some data in reference [3], but not explicit timings of the synchronization stack operations. We will analyze our results in view of the best information on the CUDA thread divergence model that we have found.

This paper is organized as follows: in the next section we present the setup we used for benchmarking the execution of single and double loop kernels. Then we present the CUDA stack-based re-convergence mechanism, and in the last section we use it to analyze our results.
II. Timings

A. Single loop

Our first test consisted of running a single loop using the kernel from Listing 1. Each thread of a warp (we used only one warp) runs through the same loop but with an upper limit, denoted by M, set individually for each thread.
1  #define EXPR_INNER 1.3333f
2  #define EXPR_OUTER 2.3333f
3
4  __global__ void single_loop(int* limits, float* out, llong* timer) {
5
6    int tid = blockDim.x * blockIdx.x + threadIdx.x;
7    int M = limits[threadIdx.x];
8    float sum = out[tid];
9
10   #pragma unroll
11   for (int k = 0; k < N_PREHEAT; k++)
12     for (int i = 0; i < M; i++) {
13       sum += EXPR_INNER;
14     }
15
16   __syncthreads();
17   llong start = clock64();
18
19   #pragma unroll
20   for (int k = 0; k < N_UNROLL; ++k)
21     for (int i = 0; i < M; i++) {
22       sum += EXPR_INNER;
23     }
24
25   llong stop = clock64();
26   __syncthreads();
27
28   out[tid] = sum;
29   timer[2 * tid] = start;
30   timer[2 * tid + 1] = stop;
31 }

Listing 1. Single loop kernel.

architecture   device
Fermi          GTX 480, GT 610
Kepler         GT 650M, GT 755M, GTX 770
Maxwell        GTX 850M

TABLE II
Devices used in the tests.

Fig. 1. Single loop timings corresponding to the kernel in Listing 1. (plot: cycles, roughly 1,000-3,500, as a function of n for Fermi, Kepler and Maxwell)

We measured the number of cycles taken by the loop using the CUDA clock64() function, repeating the loop N_UNROLL times if needed. We also added the possibility of running the same loop N_PREHEAT times before making the measurements. We repeated each measurement 256 times and used those samples to estimate the error. We found that when using N_PREHEAT=1 we had essentially zero error even with N_UNROLL=1. When no "preheating" was used, the results were more erratic, with differences of a few cycles between the samples.
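For reference, a host-side harness along these lines (our sketch, not the authors' code; it assumes the kernel of Listing 1 is compiled in the same file, with N_PREHEAT and N_UNROLL given as compile-time defines) could look like:

    #include <cstdio>
    #include <cuda_runtime.h>

    typedef long long llong;

    // single_loop kernel of Listing 1 assumed above, compiled with e.g.
    //   nvcc -DN_PREHEAT=1 -DN_UNROLL=1 -arch=sm_20 -Xptxas -O3

    int main()
    {
        const int W = 32;                   // a single warp
        int h_limits[W];
        for (int t = 0; t < W; ++t)         // e.g. the n = 2 row of Table I
            h_limits[t] = 32 - (t - 29 > 0 ? t - 29 : 0);

        int* d_limits; float* d_out; llong* d_timer;
        cudaMalloc(&d_limits, W * sizeof(int));
        cudaMalloc(&d_out,    W * sizeof(float));
        cudaMalloc(&d_timer,  2 * W * sizeof(llong));
        cudaMemcpy(d_limits, h_limits, W * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemset(d_out, 0, W * sizeof(float));

        single_loop<<<1, W>>>(d_limits, d_out, d_timer);  // one warp only
        cudaDeviceSynchronize();

        llong h_timer[2 * W];
        cudaMemcpy(h_timer, d_timer, sizeof(h_timer), cudaMemcpyDeviceToHost);
        // thread 0 stores start at timer[0] and stop at timer[1];
        // the paper repeats this 256 times to estimate the error
        printf("loop took %lld cycles\n", h_timer[1] - h_timer[0]);
        return 0;
    }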
We start our measurements with all loops having the same upper limit of M = 32 (no divergence). Next we decrease the upper limit for one of the threads, and continue until all threads in the warp have different upper bounds (see Table I). The results of the measurements as a function of n are presented in Figure 1. We have tested three CUDA architectures (see Table II) and found that the number of cycles depended only on the architecture, i.e. the Compute Capability of the card, not on the particular device. We used the CUDA 7.0 environment for all our tests and compiled our benchmarks with the -arch=sm_20 -Xptxas -O3 flags, as the sm_20 architecture did not change the results but, contrary to sm_30, produced no undefined instructions in the cuobjdump disassembly listing.

There are two interesting things to notice in this plot, apart from the fact that GPUs are getting faster with each new architecture. Firstly, although the longest loop always has the same upper bound (M = 32), the time increases linearly with each new divergent thread. This is a clear signal of an additional cost associated with the divergence, as in fact altogether less work is done by the threads. Secondly, this cost of thread divergence is proportional to the number of divergent threads up to n = 15; after that it exhibits several "jumps" every four steps, at least for the two newest architectures (Kepler and Maxwell). The behavior on the Fermi architecture is slightly different, but we will not be concerned with this architecture in this contribution and provide its plots only for comparison.
M as a function of the thread index (columns: tid = 0 ... 31) and of the number of divergent threads (rows: n):

n = 0:  32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32
n = 1:  32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 31
n = 2:  32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 31 30
...
n = 28: 32 32 32 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4
n = 29: 32 32 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3
n = 30: 32 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
n = 31: 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

TABLE I
Loop limits used in the single loop measurements.
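The pattern of Table I admits a compact closed form (our reading of the table, not code from the paper): for a given n the first 32 - n threads keep M = 32 and the remaining n threads get 31, 30, ..., 32 - n.

    // Fill the per-thread loop limits for a given number n of divergent
    // threads, reproducing the rows of Table I.
    void fill_limits(int* limits, int n)
    {
        for (int tid = 0; tid < 32; ++tid) {
            int excess = tid - (31 - n);
            limits[tid] = 32 - (excess > 0 ? excess : 0);
        }
    }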

B. Double loop

In our second test we used a kernel with nested loops; again the loop upper bounds were set individually for each thread of the warp (see Listing 2). We used the same patterns as for the single loop (Table I), setting both loop bounds to the same value (M = N). The timings are presented in Figure 2. We can observe a pattern similar to the single loop case: the number of cycles at first increases smoothly with the number of divergent threads, and then exhibits "jumps" every four steps.

1  #define EXPR_INNER 1.3333f
2  #define EXPR_OUTER 2.3333f
3
4  __global__ void double_loops(int* limits, float* out, llong* timer) {
5
6    int tid = blockDim.x * blockIdx.x + threadIdx.x;
7    int M = limits[2 * threadIdx.x];
8    int N = limits[2 * threadIdx.x + 1];
9    float sum = out[0];
10
11   #pragma unroll
12   for (int k = 0; k < N_PREHEAT; k++)
13     for (int i = 0; i < M; i++) {
14       for (int j = 0; j < N; j++) {
15         sum += EXPR_INNER;
16       }
17       sum += EXPR_OUTER;
18     }
19
20   __syncthreads();
21   llong start = clock64();
22
23   #pragma unroll
24   for (int k = 0; k < N_UNROLL; ++k)
25     for (int i = 0; i < M; i++) {
26       for (int j = 0; j < N; j++) {
27         sum += EXPR_INNER;
28       }
29       sum += EXPR_OUTER;
30     }
31
32   llong stop = clock64();
33   __syncthreads();
34   out[tid] = sum;
35   timer[2 * tid] = start;
36   timer[2 * tid + 1] = stop;
37 }

Listing 2. Double loop kernel.

Fig. 2. Double loop timings corresponding to the kernel in Listing 2. (plot: cycles, of order 10^5, as a function of n for Fermi, Kepler and Maxwell)
III. CUDA divergence model

To understand the observed behavior it is necessary to find out the details of the CUDA thread divergence/re-convergence model. This is not explicitly stated by NVIDIA apart from a brief mention of the branch synchronization stack in [2]. A more detailed description can be found in references [4], [5], [6], [7] and [8]. There it was established that the CUDA implementation follows the approach described in US patents [9] and [10]. Here we briefly describe this algorithm. Our description is based on reference [10]. We choose reference [10] over [9] because of the SSY instruction specification: while in [9] it is described as taking no arguments, the cuobjdump listing shows that it expects one argument, which matches the description in [10]. We concentrate only on the one type of control instruction relevant to our example, the predicated branch instruction, ignoring all others.

The streaming multiprocessor (SMX) maintains for each warp an active mask that indicates which threads in the warp are active. Before a potentially diverging (branch) instruction the compiler inserts one set-synchronization SSY instruction. This instruction causes a new synchronization token to be pushed on top of the synchronization stack. Such a token consists of three parts: mask, id and program counter (pc), and takes 64 bits, 32 of which are taken by the mask. In the case of the SSY instruction the mask is set to the active mask, the id to SYNC, and the pc to the address of the synchronization point.

The actual divergence is caused by the predicated branch instruction, which has the form @P0 BRA label and translates to the pseudocode in Algorithm 1. P0 is a 32-bit predicate register that indicates which threads in the warp should execute (take) the branch.
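A plausible C representation of such a token (our sketch; only the 32-bit mask field is documented, and the split of the remaining 32 bits between id and pc is our assumption) is:

    #include <cstdint>

    // 64-bit synchronization token: a 32-bit active-mask plus an id and the
    // re-convergence program counter packed into the remaining 32 bits.
    enum TokenId { SYNC, DIV };

    struct Token {
        uint32_t mask;       // active mask to restore at re-convergence
        uint32_t id  : 3;    // token type: SYNC, DIV, ...
        uint32_t pc  : 29;   // re-convergence address
    };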
1:  if no active thread takes the branch then
2:    pc ← pc + 1
3:  else
4:    if !(all active threads take the branch) then
5:      token.mask ← active_mask && !P0
6:      token.id ← DIV
7:      token.pc ← pc + 1
8:      push(token)
9:    end if
10:   active_mask ← active_mask && P0
11:   pc ← label
12: end if

Algorithm 1: Algorithm for executing a predicated branch instruction @P0 BRA label.

The instructions are executed according to the pseudocode in Algorithm 2. An instruction may have a pop-bit set, also denoted as a synchronization command and indicated by the suffix .S in the assembler output. An instruction suffixed with .S will be called a carrier instruction. When encountered, it signals the stack unwinding: the token is popped from the stack and used to set the active mask and the program counter. Reference [10] does not specify exactly when carrier instructions are executed, before or after popping the stack, so we have assumed that this is done after unwinding the stack. Actually, in both kernels that we have studied this carrier instruction is a NOP, but this is not necessarily the case in general.

Summarizing: SSY and predicated branch instructions (the latter only if some active threads diverge) push a token onto the stack, and the synchronization command pops it.
1:  Fetch the instruction
2:  if Instruction is a SSY label instruction then
3:    token.mask ← active_mask
4:    token.id ← SYNC
5:    token.pc ← label
6:    push(token)
7:    pc ← pc + 1
8:  else if Instruction is a predicated branch instruction then
9:    Execute the instruction according to Algorithm 1
10: else
11:   if pop-bit set in instruction then
12:     token ← pop()
13:     active_mask ← token.mask
14:     pc ← token.pc
15:     Execute the instruction
16:   else
17:     Execute the instruction
18:     pc ← pc + 1
19:   end if
20: end if
21: goto 1

Algorithm 2: Algorithm for executing a CUDA instruction.
IV. Analysis

We will try to understand the results of the benchmarks from Section II in view of the implementation described in Section III. We begin by disassembling the kernels; the results are presented in Listings 3 and 4, where we have included only the relevant portions of the assembler code. Looking at Listing 3 we notice all the instructions described in the previous section. To better understand what is happening we will make a "walk-through" of this listing, assuming n = 2.

1  /*0060*/ ISETP.LT.AND P0, PT, R5, 0x1, PT;
2  ...
3  /*00e0*/ SSY 0x128;
4  /*00e8*/ @P0 BRA 0x120;
5  /*00f0*/ NOP;
6  /*00f8*/ NOP;
7  /*0100*/ IADD R4, R4, 0x1;
8  /*0108*/ FADD32I R0, R0, 1.3332999944686889648;
9  /*0110*/ ISETP.LT.AND P0, PT, R4, R5, PT;
10 /*0118*/ @P0 BRA 0x100;
11 /*0120*/ NOP.S;
12 /*0128*/ S2R R5, SR_CLOCKHI;

Listing 3. Disassembly of the single loop kernel.

Fig. 3. Walk through the single loop (n = 2) showing the stack history. (diagram: the current PC and active mask, together with the SYNC and DIV tokens on the stack, at seven consecutive stages)
We start on line 1; all bits in the active mask are set and the stack is empty (Figure 3.1). The predicate register P0 is set to the result of the comparison M < 1. In our test this condition is never fulfilled, so all bits of P0 are cleared. As line 4 contains a potentially diverging instruction, a SSY instruction is issued on line 3 with the address pointing past the loop, to line 12. The active mask and the address are pushed onto the stack (Figure 3.2).

On line 4 we have a predicated branch instruction, but as the register P0 is all false it does not produce any divergence, so it neither changes the active mask nor pushes a token on the stack. Control then moves on to the next instructions.

Ignoring the two NOPs, the next instruction, on line 7, increments the loop counter i in register R4, which was set to zero by an instruction not shown in the listing. The next instruction, on line 8, performs the addition.

On line 9 register P0 is set to the result of testing the loop condition i < M. On the next line this register is used to predicate the branch instruction. As i = 1, all bits in P0 are set, the branch is taken by all threads, and no token is pushed onto the stack. All the threads remain active. Control jumps to line 7, where the loop counter is again incremented. This continues until i = 30 without any changes to the stack.

When i = 30 the condition on line 9 fails for the last thread of the warp and the last bit of P0 is cleared (P0=0x7FFFFFFF). The branch instruction on line 10 is now diverging, so a DIV token is pushed on the stack with the mask set to !P0=0x80000000 (the not-taken mask) and the address equal to PC+1 (line 11) (Figure 3.3). The active mask is set to P0, which disables thread 31, and the execution jumps to line 7. The loop counter i is then incremented to 31, but only in the active threads.

At line 9, as i = 31, the comparison fails for thread 30. The predicate register P0 has its bit 30 cleared (P0=0x3FFFFFFF), so not all active threads take the branch on line 10. Again a DIV token is pushed on the stack, with the mask set to the not-taken mask (0x7FFFFFFF & !0x3FFFFFFF = 0x40000000) and PC=0x0120. The active mask is then set to P0=0x3FFFFFFF, disabling threads 30 and 31, and control jumps to line 7 (Figure 3.4).

The loop counter is incremented again, to 32, so all bits of the predicate P0 are cleared. There is no divergence in the branch on line 10, as none of the active threads take the branch, and control moves to line 11.

The NOP instruction on line 11 has the pop-bit set, so a token is popped from the stack. The active mask is set to the token mask and PC to the token's PC. The NOP instruction is executed and control jumps again to line 11, which was the token's program counter (Figure 3.5). The stack is popped again and, after executing the NOP instruction, control goes back to line 11 (Figure 3.6).

Then the last item is popped from the stack. This is the SYNC token, which restores the 0xFFFFFFFF active mask and sets the PC to line 12, ending the loop (Figure 3.7).

This walk-through lets us see the general pattern: as n is increased we start pushing DIV tokens on the stack earlier and earlier at each iteration through the loop, and the stack unwinding takes place at the end. The total number of push/pop operations is n + 1 (the SSY instruction always pushes a token on the stack) and the maximum stack size is also n + 1. This is illustrated in Figure 4, where we plot the stack history, i.e. the length of the stack as a function of the number of executed instructions. This figure was obtained using a very simple emulator implementing Algorithms 1 and 2.

Fig. 4. Re-convergence stack history for the single loop kernel. (plot: stack length, 0-32, as a function of the number of executed instructions, for n = 23 and n = 31)

We have also instrumented our code using the CUPTI API from NVIDIA [11]. While this API cannot show the push/pop operations, we found that for n = 0 ... 15 the number of executed instructions indeed increases by one with increasing n, and that this is due to the increasing number of executions of the stack unwinding instruction NOP.S on line 11. This corresponds to the linear rise of the number of cycles taken by the loop, as shown in Figure 1. By fitting to the first 16 points we found that the penalty for one diverging branch is exactly 32 cycles on Kepler (see the continuous line in Figure 1) and 26 cycles on Maxwell. Those 32 cycles include pushing the token on the stack, executing a synchronization command and popping the stack. In Section IV-A we will take a more detailed look at how exactly those cycles are divided between those operations.

Looking again at Figure 1, we notice that the first jump occurs when the length of the stack exceeds 16, i.e. when n > 15. We can assume that this is the physical length of the fast on-chip memory allocated for the stack, and that growing the stack longer requires spilling the entries into some other memory (as mentioned in [10]). After this jump the number of cycles again increases by 32 at each step for the next four entries, leading us to assume that the entries are spilled four at a time, i.e. in chunks of 32 bytes. This is exactly the length of the L2 cache line, suggesting that the entries are spilled to the global memory. This would also explain why running a preheat loop before the measurements reduces the overall number of cycles taken by the loops and makes the measurements deterministic, as it populates the cache.
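The stack histories of Figures 4 and 5 require only the stack part of Algorithms 1 and 2; a minimal sketch of such an emulator core (our code, not the authors'; it tracks the stack and ignores the arithmetic) might be:

    #include <cstdint>
    #include <vector>

    struct StackEmulator {
        struct Token { uint32_t mask; bool is_sync; };  // pc/id omitted

        std::vector<Token> stack;
        uint32_t active = 0xFFFFFFFFu;
        int pushes = 0, pops = 0, max_depth = 0;

        void ssy() {                          // SSY: push a SYNC token
            stack.push_back({active, true});
            note_push();
        }
        // Predicated branch @P0 BRA (Algorithm 1); p0 = mask of threads
        // whose predicate is true. Returns true if any active thread jumps.
        bool branch(uint32_t p0) {
            uint32_t taken = active & p0;
            if (taken == 0) return false;     // nobody jumps: fall through
            if (taken != active) {            // divergence: push a DIV token
                stack.push_back({active & ~p0, false});
                note_push();
            }
            active = taken;
            return true;
        }
        void pop_s() {                        // carrier instruction with .S
            active = stack.back().mask;
            stack.pop_back();
            ++pops;
        }
        void note_push() {
            ++pushes;
            if ((int)stack.size() > max_depth) max_depth = (int)stack.size();
        }
    };

Driving branch() with the predicate sequence of Listing 3 for a row of Table I reproduces the n + 1 pushes and the maximum stack depth of n + 1 quoted above.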
We have assumed that during a push that would overflow the stack, the four lowest entries are spilled into the global memory, leaving room for four new entries at the top. Likewise, when popping the stack would require access to elements that were spilled, they are loaded back from memory. We have implemented this scenario in our emulator. Comparing it with the timing measurements, we found that the spilling and loading back together take around 84 cycles on the Kepler architecture and 176 cycles on Maxwell.
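In the emulator this scenario amounts to a few lines of bookkeeping; a sketch under the stated assumptions (a 16-entry on-chip stack and a spill granularity of 4 tokens, i.e. one 32-byte L2 line; the per-spill cost is the fitted ~84 cycles on Kepler and ~176 on Maxwell):

    struct SpillModel {
        static const int ONCHIP = 16;   // assumed on-chip stack capacity
        static const int CHUNK  = 4;    // assumed spill granularity
        int spilled = 0;                // tokens currently in global memory
        int spills  = 0;                // spill events (compare Figure 6)

        void on_push(int depth) {       // depth = stack size after the push
            if (depth - spilled > ONCHIP) { spilled += CHUNK; ++spills; }
        }
        void on_pop(int depth) {        // depth = stack size after the pop
            if (depth < spilled) spilled -= CHUNK;  // reload one chunk
        }
    };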
Using the CUPTI API we have looked for additional load/store or cache hit/miss events that would indicate the spill into the global memory, but found none; however, we noticed the appearance of some additional branch instructions. Those instructions did not show up in the execution trace, so they could not be attributed to particular instructions. The number of those additional instructions was, however, exactly equal to the number of spills. Without detailed knowledge of the NVIDIA micro-architecture it is impossible to ascertain the meaning of those instructions. One plausible explanation would be that the branch instructions are reissued after the stack content is spilled to memory.

Fig. 5. Re-convergence stack history for the double loops kernel. (plot: stack length as a function of the number of executed instructions, for n = 23 and n = 31)

We did the same analysis for the double loop kernel, using the disassembly presented in Listing 4.
20
1  /*0060*/ ISETP.LT.AND P0, PT, R8, 0x1, PT;
2  ...
3  /*0118*/ SSY 0x1a0;
4  /*0120*/ @P0 BRA 0x198;
5  /*0128*/ MOV R6, RZ;
6  /*0130*/ ISETP.LT.AND P0, PT, R9, 0x1, PT;
7  /*0138*/ MOV R7, RZ;
8  /*0140*/ SSY 0x178;
9  /*0148*/ @P0 BRA 0x170;
10 /*0150*/ IADD R7, R7, 0x1;
11 /*0158*/ FADD32I R0, R0, 1.3332999944686889648;
12 /*0160*/ ISETP.LT.AND P0, PT, R7, R9, PT;
13 /*0168*/ @P0 BRA 0x150;
14 /*0170*/ NOP.S;
15 /*0178*/ IADD R6, R6, 0x1;
16 /*0180*/ FADD32I R0, R0, 2.3333001136779785156;
17 /*0188*/ ISETP.LT.AND P0, PT, R6, R8, PT;
18 /*0190*/ @P0 BRA 0x130;
19 /*0198*/ NOP.S;
20 /*01a0*/ S2R R7, SR_CLOCKHI;

Listing 4. Disassembly of the double loop kernel.

This is a more complicated case, as illustrated by the stack history presented in Figure 5. The maximum length of the stack is now n + 2, and we found that the number of push operations is n(65 − n)/2 + 33 (see the continuous line in Figure 2).
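One way to see where this count can come from (our derivation, not given in the paper): with M = N the fully convergent kernel (n = 0) pushes one SSY token for the outer loop plus one SSY token for each of the 32 inner loop executions, giving the constant 33. If the thread with limit 32 − k (k = 1 ... n) additionally pushes one DIV token when it leaves the outer loop and one DIV token in each of the 32 − k inner loops it takes part in, it contributes 33 − k DIV tokens, and the number of pushes in excess of 33 is

sum_{k=1}^{n} (33 − k) = 33n − n(n + 1)/2 = n(65 − n)/2,

in agreement with the fitted formula.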
As in the single loop case, we found that an additional number of branch instructions is executed, and that this number is exactly predicted by the number of spills (see Figure 6).

Fig. 6. Number of additional branch instructions issued in the double loop kernel. (plot: branches measured on Kepler and spills predicted by the emulator, as a function of n)

Using the parameters obtained from the single loop measurements, we used our emulator to predict the timings of the double loop kernel. The results are presented in Figure 7. The agreement, while not perfect, is still very good (a 76-cycle difference in the worst case, which amounts to 0.3%). Looking at Figure 8 we see that the difference between the actual and predicted number of cycles follows a very regular pattern and occurs only where a spill takes place, suggesting some simple mechanism contributing to the exact number of cycles needed to perform a stack spill, depending on the previous spill history.

A. Detailed timings

In the previous section we calculated the costs associated with diverging predicated branch instructions. That was the total cost, covering pushing and popping the stack as well as executing the synchronization command carrier instruction. In this section we attempt a more detailed view. To this end we use the kernel from Listing 5, which is essentially the single loop kernel with an additional timing instruction inserted inside the loop (line 17).
1  __global__ void single_loop_precise(int *limits, float *out, llong *timer) {
2
3    int tid = blockDim.x * blockIdx.x + threadIdx.x;
4    int M = limits[threadIdx.x];
5    float sum = out[tid];
6
7    #pragma unroll
8    for (int k = 0; k < N_PREHEAT; k++)
9      for (int i = 0; i < M; i++) {
10       sum += 1.3333f;
11     }
12
13   timer[tid] = clock64();
14
15   for (int i = 0; i < M; i++) {
16     sum += 1.3333f;
17     timer[(i + 1) * 32 + tid] = clock64();
18   }
19
20   __syncthreads();
21   timer[33 * warpSize + tid] = clock64();
22
23   out[tid] = sum;
24 }

Listing 5. Single loop kernel with detailed timings.

Fig. 7. Comparison of the double loop timings with the emulator results. (plot: cycles, of order 10^4, as a function of n for Kepler, Maxwell and the emulator)

Fig. 8. Difference between the actual and predicted number of cycles for the double loop kernel on the Kepler (GTX 770) architecture. (plot: cycle difference, 0-80, as a function of n)
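On the host, per-iteration costs can then be recovered as differences of consecutive timestamps; a sketch (our code, for thread 0 only, matching the timer layout of Listing 5):

    typedef long long llong;

    // timer[i*32 + tid] layout of Listing 5 for tid = 0: index 0 is the
    // pre-loop stamp, (i+1)*32 the stamp after iteration i, and 33*32 the
    // stamp after __syncthreads(); delta must have room for 33 entries.
    void per_iteration_cycles(const llong* timer, llong* delta)
    {
        for (int i = 0; i < 33; ++i)
            delta[i] = timer[(i + 1) * 32] - timer[i * 32];
    }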

According to the walk-through in the previous section, the timing instructions inside the loop should be executed before the unwinding of the stack, while the last timing instruction, on line 21, should be executed after.

At first we did not include the synchronization command on line 20, as we deemed it unnecessary within one warp. The results were however not compatible with the previous ones. Inspection of the disassembly in Listing 6 revealed that actually all the timing instructions were executed before the stack unwinding triggered by the synchronization command on line 34.

1  /*0068*/ S2R R9, SR_CLOCKHI;
2  /*0070*/ IADD [Link], R8, R8;
3  /*0078*/ [Link] R5, R5, R9, R8;
4  /*0080*/ IADD.X R5, R5, R5;
5  /*0088*/ MOV32I R9, 0x8;
6  /*0090*/ IMAD R8.CC, R0, R9, c[0x0][0x30];
7  /*0098*/ IMAD.HI.X R9, R0, R9, c[0x0][0x34];
8  /*00a0*/ ST.E.64 [R8], R4;
9  /*00a8*/ S2R R5, SR_CLOCKHI;
10 /*00b0*/ S2R R8, SR_CLOCKLO;
11 /*00b8*/ S2R R9, SR_CLOCKHI;
12 /*00c0*/ IADD [Link], R8, R8;
13 /*00c8*/ [Link] R5, R5, R9, R8;
14 /*00d0*/ IADD.X R5, R5, R5;
15 /*00d8*/ ISETP.LT.AND P0, PT, R7, 0x1, PT;
16 /*00e0*/ MOV R10, RZ;
17 /*00e8*/ SSY 0x178;
18 /*00f0*/ @P0 BRA 0x170;
19 /*00f8*/ MOV32I R11, 0x8;
20 /*0100*/ IADD R10, R10, 0x1;
21 /*0108*/ FADD32I R6, R6, 1.333;
22 /*0110*/ ISCADD R9, R10, R0, 0x5;
23 /*0118*/ IMAD R8.CC, R9, R11, c[0x0][0x30];
24 /*0120*/ IMAD.HI.X R9, R9, R11, c[0x0][0x34];
25 /*0128*/ ST.E.64 [R8], R4;
26 /*0130*/ S2R R5, SR_CLOCKHI;
27 /*0138*/ S2R R8, SR_CLOCKLO;
28 /*0140*/ S2R R9, SR_CLOCKHI;
29 /*0148*/ IADD [Link], R8, R8;
30 /*0150*/ [Link] R9, R5, R9, R8;
31 /*0158*/ IADD.X R5, R9, R9;
32 /*0160*/ ISETP.LT.AND P0, PT, R10, R7, PT;
33 /*0168*/ @P0 BRA 0x100;
34 /*0170*/ IADD.S R0, R0, 0x420;

Listing 6. Disassembly of the single loop kernel with detailed timings, without synchronization.

Adding the __syncthreads() instruction on line 20 solved the problem: the overall results now matched those from the previous section. Of course the total times differed substantially, as the code now had extra instructions, but the differences in the number of cycles between the runs
with various values of n were identical for n ≤ 15 (no spills) and differed by a few cycles for n > 15.

We have gathered the timelines for each n and each thread in the warp, but we used only the timings for thread 0 – the one that always iterates 32 times through the loop. We found that when there were no spills (n ≤ 15) the whole cost was incurred in the last code segment: the unwinding of the stack. There was no cost associated with the predicated branch instruction, i.e. with the push operation. This may be explained by the fact that the latency of the push operation can be hidden, as the token pushed onto the stack is not needed by the next instruction; while unwinding the stack, however, the popped token is used immediately to set the active mask and the PC. Additionally, the carrier instruction has to be executed, and even if it is a NOP it still has to be fetched and decoded.

When n was greater than 15, i.e. the stack was spilled to memory, the cost was divided between the branch instruction, which now took around 40 cycles longer, and the stack unwinding, which now took around 42 cycles longer. This corresponds well with the overall 84-cycle penalty estimated in the previous section.

V. Summary

The Single Instruction Multiple Threads (SIMT) programming model introduced by NVIDIA with the CUDA technology gives the programmer the illusion of writing a multithreaded application. In reality the threads are executed in groups of 32 threads (warps) performing essentially vector operations. The illusion of multithreading is maintained by an elaborate mechanism for managing the divergent threads within a warp. This mechanism is completely hidden from the programmer, and no description of it is provided by NVIDIA in their programming guide. In this contribution we have presented a detailed study of two CUDA kernels leading to a large thread divergence. Following other sources, we have assumed that the management of the diverging threads is based on a synchronization stack and follows closely the idea described in reference [10], which is an NVIDIA-owned patent. That assumption allowed us to fit the observed execution times with an accuracy better than 1%.

We have estimated the cost of a diverging branch instruction to be exactly 32 cycles on the Kepler architecture, provided that the maximum stack length did not exceed 16. Beyond that, stack entries had to be spilled into memory, which carried an additional penalty of ∼84 cycles. On Maxwell the estimated cost of branch divergence was considerably lower, only 24 cycles, but the cost of spilling was much higher, ∼176 cycles. The more detailed timings suggest that, in the absence of spills, all of that cost can be attributed to the stack unwinding.

We also performed the measurements on the Fermi architecture; there the pattern seems to be slightly different, suggesting that the physical thread synchronization stack is shorter on this architecture.

We have analyzed only a very specific case, loops with different bounds across the threads, but it is a quite natural scenario. We would like to emphasize that in this case the additional costs reported by us are the costs of the divergence management mechanism alone, excluding the additional penalty of incomplete vectorization. Nevertheless, they implied an overhead of ∼20% in the case of 14 diverging threads for the double loop kernel and ∼25% for the single loop. In the more general case there would be additional costs associated with the serialization of the different branches.

Our work was confined to the NVIDIA CUDA technology; it would however be interesting to see how this corresponds to shader programming in both DirectX and OpenGL, and we are planning to extend our work in this direction in the future.

References

[1] P. Białas, J. Kowal, A. Strzelecki et al., "GPU accelerated image reconstruction in a two-strip J-PET tomograph," arXiv preprint arXiv:1502.07478, 2015.
[2] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[3] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU microarchitecture through microbenchmarking," in Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on. IEEE, 2010, pp. 235–246.
[4] N. Wilt, The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education, 2013.
[5] S. Collange, M. Daumas, D. Defour, and D. Parello, "Barra: A parallel functional simulator for GPGPU," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2010 IEEE International Symposium on. IEEE, 2010, pp. 351–360.
[6] ——, "Comparaison d'algorithmes de branchements pour le simulateur de processeur graphique Barra," in 13ème Symposium sur les Architectures Nouvelles de Machines, 2009, pp. 1–12.
[7] T. M. Aamodt et al., "GPGPU-Sim," 2012. [Online]. Available: [Link]
[8] A. Habermaier and A. Knapp, "On the correctness of the SIMT execution model of GPUs," in Programming Languages and Systems. Springer, 2012, pp. 316–335.
[9] B. W. Coon and J. E. Lindholm, "System and method for managing divergent threads in a SIMD architecture," US Patent US7 353 369 B1, Apr. 1, 2008. [Online]. Available: [Link]
[10] B. W. Coon, J. R. Nickolls, L. Nyland, P. C. Mills, and J. E. Lindholm, "Indirect function call instructions in a synchronous parallel thread processor," US Patent US8 312 254 B2, Nov. 13, 2012. [Online]. Available: https://[Link]/patents/US8312254
[11] NVIDIA, "CUPTI: CUDA toolkit documentation," 2014. [Online]. Available: [Link]
