Question 1(4.13)
Assume a GPU architecture that contains 10 SIMD processors.
Each SIMD instruction has a width of 32 and each SIMD processor contains 8
lanes for single-precision arithmetic and load/store instructions, meaning that each non-diverged SIMD
instruction can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that cause,
on average, 80% of the threads to be active. Assume that 70% of all SIMD instructions executed are single-
precision arithmetic and 20% are load/store. Since not all memory latencies
are covered, assume an average SIMD instruction issue rate of 0.85. Assume that
the GPU has a clock speed of 1.5 GHz.
a. Compute the throughput, in GFLOP/sec, for this kernel on this GPU.
b. Assume that you have the following choices:
(1) Increasing the number of single-precision lanes to 16
(2) Increasing the number of SIMD processors to 15 (assume this change
doesn't affect any other performance metrics and that the code scales to the additional
processors)
(3) Adding a cache that will effectively reduce memory latency by 40%,
which will increase the instruction issue rate to 0.95.
What is the speedup in throughput for each of these improvements?
Solution:
A. Given information:
Clock speed: 1.5 GHz
SIMD processors: 10, each with 8 single-precision lanes
Active threads (due to branch divergence): 80%
Instruction issue rate: 0.85
Single-precision arithmetic instructions: 70%
SIMD instruction width: 32 threads
Cycles for one non-diverged SIMD instruction to produce its 32 results on 8 lanes: 4
Keep in mind:
GFLOP/sec is a unit used to measure the performance of a computer's floating-point
unit, commonly referred to as the FPU. One GFLOP/sec is one billion (1,000,000,000)
floating-point operations per second.
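Putting the given numbers together (one common simplification: scale the peak lane
throughput by the utilization factors above):
Throughput = 10 SIMD processors * 8 lanes * 1.5 GHz * 0.85 issue rate * 0.80 active * 0.70 FP
           = 57.12 GFLOP/sec, i.e. roughly 57 GFLOP/sec for this kernel.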
B.
Option 1: if we increase the number of single-precision lanes to 16, each non-diverged
SIMD instruction (width 32) needs only 2 cycles instead of 4 to produce its 32 results.
Option 2: increase the number of SIMD processors from 10 to 15.
Option 3: increase instruction issue rate to 0.95
Option 3 is best.
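For reference, under the same multiplicative model as in part A and with everything else
held constant: option (2) scales throughput by 15/10 = 1.5x and option (3) by
0.95/0.85 ≈ 1.12x, while option (1) doubles the peak arithmetic rate, so its benefit is
at most 2x and depends on whether the 0.85 issue rate can keep 16 lanes busy.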
Question 2(4.14)
In this exercise, we will examine several loops and analyze
their potential for parallelization.
a. Does the following loop have a loop-carried dependency?
for (i=0;i<100;i++)
{
A[i] = B[2*i+4];
B[4*i+5] = A[i];
}
b. In the following loop, find all the true dependences, output dependences, and
antidependences. Eliminate the output dependences and antidependences by renaming.
for (i=0;i<100;i++)
{
A[i] = A[i] * B[i]; /* S1 */
B[i] = A[i] + c; /* S2 */
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
}
c. Consider the following loop:
for (i=0;i < 100;i++)
{
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
Are there dependences between S1 and S2? Is this loop parallel? If not, show how
to make it parallel.
Solution:
A. To check whether a loop-carried dependency exists between B[2*i+4] and
B[4*i+5], we apply the GCD (Greatest Common Divisor) test: a dependence can
exist only if GCD(2, 4) divides 5 - 4.
GCD(2, 4) = 2 and (5 - 4) mod 2 = 1.
GCD(2, 4) does not divide (5 - 4), so no loop-carried dependency exists through B.
The accesses to A use the same index i in both statements, so A does not create a
loop-carried dependency either.
Note: the answer in the book states that a loop-carried dependency does exist in
this case, which conflicts with the GCD-test result above.
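As a quick cross-check (a hypothetical brute-force sketch, not part of the original
solution), we can test directly that a store index 4*i+5 never equals a load index
2*j+4 for any pair of iterations:

#include <stdio.h>

int main(void) {
    int i, j, overlap = 0;
    for (i = 0; i < 100; i++)         /* store index: 4*i + 5 */
        for (j = 0; j < 100; j++)     /* load index:  2*j + 4 */
            if (4 * i + 5 == 2 * j + 4)
                overlap = 1;
    /* prints "no": 4*i + 5 is always odd while 2*j + 4 is always even */
    printf("overlap: %s\n", overlap ? "yes" : "no");
    return 0;
}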
B. Output dependences
S1 and S3 both write A[i], which is a write-after-write (output) dependence:
A[i] = A[i] * B[i]; /* S1 */
A[i] = C[i] * c; /* S3 */
Antidependences
S3 and S4 form an antidependence through C[i]: S3 reads C[i] and S4 then writes it,
a write-after-read dependence.
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
There are also antidependences from S1 to S2 through B[i] (S1 reads B[i], then S2
writes it) and from S2 to S3 through A[i] (S2 reads A[i], then S3 rewrites it). The
renaming below removes the A[i] case; removing the B[i] case as well would
additionally require renaming B in S2 (for example, writing B1[i]).
True dependences
S1 and S2 through A[i]:
A[i] = A[i] * B[i]; /* S1 */
B[i] = A[i] + c; /* S2 */
S2 reads the value of A[i] written by S1, a read-after-write dependence.
S3 and S4 through A[i]:
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
S4 reads the value of A[i] written by S3, another read-after-write dependence.
Re-written code (with renaming)
for (i=0;i<100;i++)
{
T[i] = A[i] * B[i]; /* S1: T[] is a temporary array introduced for renaming */
B[i] = T[i] + c; /* S2 */
A1[i] = C[i] * c; /* S3: A1[] and C1[] are renamed copies of A and C */
C1[i] = D[i] * A1[i]; /* S4 */
}
The output dependence and the antidependence on A[i] and C[i] are removed by
renaming: A and C are renamed because both were involved in those dependences,
while the true dependences (S1 to S2, S3 to S4) are preserved.
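As an optional check (a hypothetical sketch, not part of the original solution), the
renamed loop can be run next to the original one to confirm that B receives the same
values and that A1 and C1 end up holding the values the original loop leaves in A and C:

#include <stdio.h>
#include <string.h>
#define N 100

int main(void) {
    double A[N], B[N], C[N], D[N];      /* inputs for the original loop          */
    double Ar[N], Br[N], Cr[N];         /* identical inputs for the renamed loop */
    double T[N], A1[N], C1[N];
    double c = 3.0;
    int i, same = 1;
    for (i = 0; i < N; i++) { A[i] = i; B[i] = 2*i; C[i] = i + 1; D[i] = i + 2; }
    memcpy(Ar, A, sizeof A); memcpy(Br, B, sizeof B); memcpy(Cr, C, sizeof C);
    for (i = 0; i < N; i++) {           /* original loop */
        A[i] = A[i] * B[i];
        B[i] = A[i] + c;
        A[i] = C[i] * c;
        C[i] = D[i] * A[i];
    }
    for (i = 0; i < N; i++) {           /* renamed loop */
        T[i]  = Ar[i] * Br[i];
        Br[i] = T[i] + c;
        A1[i] = Cr[i] * c;
        C1[i] = D[i] * A1[i];
    }
    for (i = 0; i < N; i++)
        if (B[i] != Br[i] || A[i] != A1[i] || C[i] != C1[i]) same = 0;
    printf("renamed loop matches: %s\n", same ? "yes" : "no"); /* expected: yes */
    return 0;
}

Note that after the renamed loop, code that previously read A or C must read A1 or C1
instead (or the values must be copied back).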
C. There is a loop-carried (true) dependence on array B: S2 writes B[i+1] in iteration
i, and S1 reads that value in iteration i+1. The dependence is not circular, so the loop
can still be made parallel by restructuring it.
Original code
for (i=0;i < 100;i++)
{
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
The loop-carried dependence runs from S2 (which writes B[i+1]) to S1 of the next
iteration (which reads B[i+1]).
Re-written code: restructure the loop so that no dependence crosses an iteration boundary
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1)
{
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];
To make this kind of loop parallel, we interchange the two statements and peel the
first execution of S1 and the last execution of S2 out of the loop, so that each new
iteration computes B[i+1] and then immediately uses it. No value now crosses an
iteration boundary, so the remaining loop is parallel.
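A quick check (a hypothetical sketch, not part of the original solution) that the
restructured loop computes exactly the same A and B values as the original loop:

#include <stdio.h>
#include <string.h>
#define N 101   /* B is indexed up to B[100] */

int main(void) {
    double A[N], B[N], C[N], D[N], A2[N], B2[N];
    int i, same = 1;
    for (i = 0; i < N; i++) { A[i] = i; B[i] = i + 1; C[i] = 2*i; D[i] = 3*i; }
    memcpy(A2, A, sizeof A); memcpy(B2, B, sizeof B);
    for (i = 0; i < 100; i++) {       /* original loop */
        A[i] = A[i] + B[i];
        B[i + 1] = C[i] + D[i];
    }
    A2[0] = A2[0] + B2[0];            /* restructured loop */
    for (i = 0; i < 99; i++) {
        B2[i + 1] = C[i] + D[i];
        A2[i + 1] = A2[i + 1] + B2[i + 1];
    }
    B2[100] = C[99] + D[99];
    for (i = 0; i < N; i++)
        if (A[i] != A2[i] || B[i] != B2[i]) same = 0;
    printf("restructured loop matches: %s\n", same ? "yes" : "no"); /* expected: yes */
    return 0;
}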
Question 3(4.15)
List and describe at least four factors that influence the performance of
GPU kernels. In other words, which runtime behaviors that are caused by the
kernel code cause a reduction in resource utilization during kernel execution?
Solution:
Typically, the performance of a GPU kernel is limited by one of the following factors:
memory throughput
instruction throughput
latency
a combination of the above
A. Branch divergence: causes SIMD lanes to be masked off when threads within a
SIMD thread group follow different control paths
B. Covering memory latency: a sufficient number of active threads can hide
memory latency and increase instruction issue rate
C. Coalesced off-chip memory references: memory accesses should be organized
consecutively within SIMD thread groups
D. Use of on-chip memory: memory references with locality should take
advantage of on-chip memory, and references to on-chip memory within a SIMD
thread group should be organized to avoid bank conflicts
Question 4(4.16)
Assume a hypothetical GPU with the following characteristics:
■ Clock rate 1.5 GHz
■ Contains 16 SIMD processors, each containing 16 single-
precision floating point units
■ Has 100GB/sec off-chip memory bandwidth
Without considering memory bandwidth, what is the peak single-precision
floating-point throughput for this GPU in GFLOP/sec, assuming that all memory
latencies can be hidden? Is this throughput sustainable given the memory
bandwidth limitation?
Solution:
Given data:
Clock rate: 1.5 GHz
SIMD processors: 16, each containing 16 single-precision floating-point units (FPUs)
Off-chip memory bandwidth: 100 GB/sec
Peak single-precision throughput:
16 SIMD processors * 16 FPUs * 1.5 GHz = 384 GFLOP/sec
To sustain this rate, each FLOP needs, at a minimum, two 4-byte input operands and one
4-byte result, i.e. 12 bytes of memory traffic per FLOP. The total bandwidth needed at
the peak rate (assuming no temporal locality) is therefore:
12 bytes/FLOP * 384 GFLOP/sec = 4.6 TB/sec
This is far more than the 100 GB/sec of off-chip memory bandwidth available, so the
peak throughput is not sustainable. It can still be achieved in short bursts when the
operands come from on-chip memory.
Question 5(5.4)
For the following code sequences and the timing parameters for the two
implementations in Figure 5.36, compute the total stall cycles for the base MSI
protocol and the optimized MOSI protocol in Exercise 5.3. Assume that state
transitions that do not require bus transactions incur no additional stall cycles.
A.
P1: read 110
P3: read 110
P0: read 110
B.
P1: read 120
P3: read 120
P0: read 120
C.
P0: write 120 <-- 80
P3: read 120
P0: read 120
D.
P0: write 108 <-- 88
P3: read 108
P0: write 108 <-- 98
Additional Information from question(5.2)
The performance of a snooping cache-coherent multiprocessor depends on many
detailed implementation issues that determine how quickly a cache responds with data in an
exclusive or M state block. In some implementations, a CPU read miss to a cache block that is
exclusive in another processor’s cache is faster than a miss to a block in memory. This is
because caches are smaller, and thus faster, than main memory. Conversely, in some
implementations, misses satisfied by memory are faster than those satisfied by caches. This is
because caches are generally optimized for “front side” or CPU references, rather than “back
side” or snooping accesses. For the multiprocessor illustrated in Figure 5.35, consider the
execution of a sequence of operations on a single CPU where
■ CPU read and write hits generate no stall cycles.
■ CPU read and write misses generate Nmemory and Ncache stall cycles if satisfied
by memory and cache, respectively.
■ CPU write hits that generate an invalidate incur Ninvalidate stall cycles.
■ A write-back of a block, due to either a conflict or another processor’s
request to an exclusive block, incurs an additional Nwriteback stall cycles.
Consider two implementations with different performance characteristics summarized in Figure
5.36. Consider the following sequence of operations assuming the initial cache state in Figure
5.35. For simplicity, assume that the second operation begins after the first completes (even
though they are on different processors):
P1: read 110
P3: read 110
For Implementation 1, the first read generates 50 stall cycles because the read is satisfied by
P0’s cache. P1 stalls for 40 cycles while it waits for the block, and P0 stalls for 10 cycles while
it writes the block back to memory in response to P1’s request. Thus, the second read by P3
generates 100 stall cycles because its miss is satisfied by memory, and this sequence
generates a total of 150 stall cycles. For the following sequences of operations, how many stall
cycles are generated by each implementation?
Solution:
This question builds directly on Exercise 5.3, so review that exercise first.
MSI: Modified, Shared, Invalid (a dirty block must be written back to memory when it is
shared or replaced).
MOSI: adds an Owned state (O holds the dirty copy and supplies it to readers, so an
immediate write-back to memory is not required).
The stall counts below use the Implementation 1 parameters implied by the worked example
above and by part D: Nmemory = 100, Ncache = 40, Nwriteback = 10, Ninvalidate = 15.
A.
P1: read 110, read miss, satisfied from P0's cache (the first read always misses)
P3: read 110, read miss; MSI satisfies it from memory, MOSI satisfies it from
P0's cache (Owned)
P0: read 110, read hit
MSI: 40 + 10 + 100 + 0 = 150 stall cycles
MOSI: 40 + 10 + 40 + 10 + 0 = 100 stall cycles
Because MOSI does not write the block back to memory, P0 keeps the dirty copy in the
Owned state and satisfies P3's miss from its cache instead of memory, saving 50 stall cycles.
B.
P1: read 120, Read miss, satisfied in memory
P3: read 120, Read hit
P0: read 120, Read miss, satisfied in memory
Both protocols: 100 + 0 + 100 = 200 stall cycles
C.
P0: write 120 ←80, Write miss, invalidates P3
P3: read 120, Read miss, P0’s cache
P0: read 120, Read hit
Both protocols: 100 + 40 + 10 + 0 = 150 stall cycles
D.
P0: write 108 ← 88, Send invalidate, invalidate P3
P3: read 108, Read miss, P0’s cache
P0: write 108 ← 98, Send invalidate, invalidate P3
Both protocols: 15 + 40 + 10 + 15 = 80 stall cycles
Question 6(5.6)
Assume the cache contents of Figure 5.35 and the timing of Implementation 1 in
Figure 5.36. What are the total stall cycles for the following code sequences with
both the base protocol and the new MESI protocol in Exercise 5.5? Assume that
state transitions that do not require interconnect transactions incur no additional
stall cycles.
A.
P0: read 100
P0: write 100 <-- 40
B.
P0: read 120
P0: write 120 <-- 60
C.
P0: read 100
P0: read 120
D.
P0: read 100
P1: write 100 <-- 60
E.
P0: read 100
P0: write 100 <-- 60
P1: write 100 <-- 40
Figure 5.35: initial cache and memory contents (referenced by the solutions below)
Solution:
A.
P0: read 100, read miss, satisfied from memory; there are no sharers, so MSI
installs the block in S while MESI installs it in E (the only copy, so no broadcast
will be needed on the next write)
P0: write 100 ← 40, MSI: send invalidate; MESI: silent transition from E to M
MSI: 100 + 15 = 115 stall cycles
MESI: 100 + 0 = 100 stall cycles
B.
P0: read 120, read miss, satisfied from memory; the block is already shared, so
both copies end up in S
P0: write 120 ← 60, both protocols must send an invalidate
Both: 100 + 15 = 115 stall cycles
C.
P0: read 100, read miss, satisfied from memory; no sharers, so MSI: S, MESI: E
P0: read 120, read miss, satisfied from memory; the conflicting block is clean
(S or E) and is silently replaced
Both: 100 + 100 = 200 stall cycles (the replacement from S or E needs no bus transaction)
D.
P0: read 100, read miss, satisfied from memory; no sharers, so MSI: S, MESI: E
P1: write 100 ← 60, write miss, satisfied from memory regardless of protocol
(this MESI variant does not supply data from the E state, though some protocols do)
Both: 100 + 100 = 200 stall cycles
E.
P0: read 100, read miss, satisfied from memory; no sharers, so MSI: S, MESI: E
P0: write 100 ← 60, MSI: send invalidate; MESI: silent transition from E to M
P1: write 100 ← 40, write miss, satisfied from P0's cache, which must write the
dirty block back to memory
MSI: 100 + 15 + 40 + 10 = 165 stall cycles
MESI: 100 + 0 + 40 + 10 = 150 stall cycles
Question 7(5.10)
Directory protocols are more scalable than snooping protocols because
they send explicit request and invalidate messages to those nodes that have copies of a
block, while snooping protocols broadcast all requests and invalidates to all nodes.
Consider the eight-processor system illustrated in Figure 5.37 and assume that all
caches not shown have invalid blocks. For each of the sequences below, identify which
nodes (chip/processor) receive each request and invalidate.
a. P0,0: write 100 <-- 80
b. P0,0: write 108 <-- 88
c. P0,0: write 118 <-- 90
d. P1,0: write 128 <-- 98
Figure 5.37: Multichip and Multicore Processor
Solution:
Pi,j denotes processor i in chip j. Each processor has a single direct-mapped L1
cache that holds two blocks, each holding two words. Each chip has a single direct-
mapped L2 cache that holds two blocks, each holding two words. To simplify the
illustration, the cache address tags contain the full address and each word shows only two
hex characters, with the least significant word on the right. The L1 cache states are
denoted M, S, and I for Modified, Shared, and Invalid. Both the L2 caches and memories
have directories. The directory states are denoted DM, DS, and DI for Directory
Modified, Directory Shared, and Directory Invalid.
P#: <op> <address> [ <-- <value> ]
where P# designates the CPU (e.g., P0,0), <op> is the CPU operation (e.g., read or write),
<address> denotes the memory address, and <value> indicates the new word to be
assigned on a write operation.
A.
P0,0: write 100 ← 80, Write hit only seen by P0,0
B.
P0,0: write 108 ← 88, Write “upgrade” received by P0,0; invalidate
received by P3,1
C.
P0,0: write 118 ← 90, Write miss received by P0,0; invalidate received by
P1,0
D.
P1,0: write 128 ← 98, Write miss received by P1,0.
Question 8(5.11)
Exercise 5.3 asked you to add the Owned state to the simple MSI snooping
protocol. Repeat the question, but with the simple directory protocol above.
Solution:
Figure 2: Directory States
Question 9(5.19)
Assume that we have a function for an application of the form F(i, p), which gives
the fraction of time that exactly i processors are usable given that a
total of p processors is available. That means that
∑ (i = 1 to p) F(i, p) = 1
Assume that when i processors are in use, the application runs i times faster.
Rewrite Amdahl's law so that it gives the speedup as a function of p for this
application.
Solution:
The general form for Amdahl's Law is

Speedup = Execution time_old / Execution time_new

All that needs to be done to compute the formula for speedup in this
multiprocessor case is to derive the new execution time. The exercise states that
for the portion of the original execution time that can use i processors is given by
F(i,p). If we let Execution timeold be 1, then the relative time for the application
on p processors is given by summing the times required for each portion of the
execution time that can be sped up using i processors, where i is between 1 and p.
This yields

Execution time_new = ∑ (i = 1 to p) F(i, p) / i

Substituting this value for Execution time_new into the speedup equation makes
Amdahl's Law a function of the available processors, p:

Speedup(p) = 1 / ( ∑ (i = 1 to p) F(i, p) / i )
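As a quick illustration with hypothetical numbers (not taken from the exercise): for
p = 2 with F(1, 2) = 0.4 and F(2, 2) = 0.6, Execution time_new = 0.4/1 + 0.6/2 = 0.7,
so Speedup(2) = 1/0.7 ≈ 1.43.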
Question 10(5.21)
Show how the basic snooping protocol of Figure 5.7 can be
changed for a write-through cache. What is the major hardware functionality that
is not needed with a write-through cache compared with a write-back cache?
Solution:
To keep the figures from becoming cluttered, the coherence protocol is
split into two parts as was done in Figure 5.6 in the text. Figure S.34 presents the
CPU portion of the coherence protocol, and Figure S.35 presents the bus portion
of the protocol. In both of these figures, the arcs indicate transitions and the text
along each arc indicates the stimulus (in normal text) and bus action (in bold text)
that occurs during the transition between states. Finally, like the text, we assume a
write hit is handled as a write miss.
Figure S.34 presents the behavior of state
transitions caused by the CPU itself. In this case, a write to a block in either the
invalid or shared state causes us to broadcast a “write invalidate” to flush the
block from any other caches that hold the block and move to the exclusive state.
We can leave the exclusive state through either an invalidate from another
processor (which occurs on the bus side of the coherence protocol state diagram),
or a read miss generated by the CPU (which occurs when an exclusive block of
data is displaced from the cache by a second block). In the shared state only a
write by the CPU or an invalidate from another processor can move us out of this
state. In the case of transitions caused by events external to the CPU, the state
diagram is fairly simple, as shown in Figure S.35.
When another processor writes
a block that is resident in our cache, we unconditionally invalidate the
corresponding block in our cache. This ensures that the next time we read the
data, we will load the updated value of the block from memory. Also, whenever
the bus sees a read miss, it must change the state of an exclusive block to shared
as the block is no longer exclusive to a single cache. The major change introduced
in moving from a write-back to write-through cache is the elimination of the need
to access dirty blocks in another processor’s caches. As a result, in the write-
through protocol it is no longer necessary to provide the hardware to force write
back on read accesses or to abort pending memory accesses. As memory is
updated during any write on a write-through cache, a processor that generates a
read miss will always retrieve the correct information from memory. Basically, it
is not possible for valid cache blocks to be incoherent with respect to main
memory in a system with write-through caches.
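A minimal sketch in C of the write-through protocol just described, written as a
next-state function (the state and event names are invented for illustration; this is
not the textbook's Figure S.34/S.35, and a write hit is treated as a write miss as
assumed above):

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum {
    CPU_READ_HIT,    /* CPU reads a block it already holds                         */
    CPU_READ_MISS,   /* CPU read allocates the block (possibly displacing another) */
    CPU_WRITE,       /* CPU write: broadcast a write invalidate on the bus         */
    BUS_WRITE_INV,   /* another processor broadcast a write invalidate             */
    BUS_READ_MISS    /* another processor had a read miss for this block           */
} Event;

BlockState next_state(BlockState s, Event e) {
    switch (e) {
    case CPU_WRITE:                     /* from Invalid or Shared, after the invalidate */
        return EXCLUSIVE;
    case CPU_READ_MISS:                 /* memory is always up to date (write-through)  */
        return SHARED;
    case CPU_READ_HIT:
        return s;                       /* no state change */
    case BUS_WRITE_INV:
        return INVALID;                 /* unconditionally invalidate our copy */
    case BUS_READ_MISS:
        return (s == EXCLUSIVE) ? SHARED : s;   /* block is no longer exclusive */
    }
    return s;
}

Because the cache is write-through, no transition ever has to force a write-back of
dirty data from another cache, which is exactly the hardware the prose above notes
can be removed.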
Question 11(5.22)
Add a clean exclusive state to the basic snooping cache coherence
protocol (Figure 5.7). Show the protocol in the format of Figure 5.7?
Solution:
To augment the snooping protocol of Figure 5.7 with a Clean Exclusive state we
assume that the cache can distinguish a read miss that will allocate a block
destined to have the Clean Exclusive state from a read miss that will deliver a
Shared block. Without further discussion we assume that there is some
mechanism to do so.
The three states of Figure 5.7 and the transitions between
them are unchanged, with the possible clarifying exception of renaming the
Exclusive (read/write) state to Dirty Exclusive (read/write). The new Clean
Exclusive (read only) state should be added to the diagram along with the
following transitions.
From Clean Exclusive to Clean Exclusive in the event of a CPU read
hit on this block or a CPU read miss on a Dirty Exclusive block
From Clean Exclusive to Shared in the event of a CPU read miss on a
Shared block or on a Clean Exclusive block
From Clean Exclusive to Shared in the event of a read miss on the bus
for this block
From Clean Exclusive to Invalid in the event of a write miss on the bus
for this block
From Clean Exclusive to Dirty Exclusive in the event of a CPU write
hit on this block or a CPU write miss
From Dirty Exclusive to Clean Exclusive in the event of a CPU read
miss on a Dirty Exclusive block
From Invalid to Clean Exclusive in the event of a CPU read miss on a
Dirty Exclusive block
From Shared to Clean Exclusive in the event of a CPU read miss on a
Dirty Exclusive block
Several transitions from the original protocol must change to accommodate the
existence of the Clean Exclusive state. The following three transitions are those
that change.
From Dirty Exclusive to Shared, the label changes to CPU read miss
on a Shared block
From Invalid to Shared, the label changes to CPU miss on a Shared
block
From Shared to Shared, the miss transition label changes to CPU read
miss on a Shared block
Question 12(5.23)
One proposed solution for the problem of false sharing is to add a valid bit per
word. This would allow the protocol to invalidate a word without removing the
entire block, letting a processor keep a portion of a block in its cache while
another processor writes a different portion of the block. What extra
complications are introduced into the basic snooping cache coherence protocol
(Figure 5.7) if this capability is included? Remember to consider all possible
protocol actions.
Solution:
An obvious complication introduced by providing a valid bit per word is the
need to match not only the tag of the block but also the offset within the block
when snooping the bus. This is easy, involving just looking at a few more bits. In
addition, however, the cache must be changed to support write-back of partial
cache blocks. When writing back a block, only those words that are valid should
be written to memory because the contents of invalid words are not necessarily
coherent with the system. Finally, given that the state machine of Figure 5.7 is
applied at each cache block, there must be a way to allow this diagram to apply
when state can be different from word to word within a block. The easiest way to
do this would be to provide the state information of the figure for each word in the
block. Doing so would require much more than one valid bit per word, though.
Without replication of state information the only solution is to change the
coherence protocol slightly.
Question 13(5.26)
Assume a directory-based cache coherence protocol. The directory
currently has information that indicates that processor P1 has the data in
“exclusive” mode. If the directory now gets a request for the same cache block
from processor P1, what could this mean? What should the directory controller
do? (Such cases are called race conditions and are the reason why coherence
protocols are so difficult to design and verify.)
Solution:
The problem illustrates the complexity of cache coherence protocols. In this
case, this could mean that the processor P1 evicted that cache block from its cache
and immediately requested the block in subsequent instructions. Given that the
writeback message is longer than the request message, with networks that allow
out-of-order requests, the new request can arrive before the write back arrives at
the directory.
One solution to this problem would be to have the directory wait for
the write back and then respond to the request. Alternatively, the directory can
send out a negative acknowledgment (NACK). Note that these solutions need to
be thought out very carefully since they have potential to lead to deadlocks based
on the particular implementation details of the system. Formal methods are often
used to check for races and deadlocks.
Question 14(5.30)
One performance optimization commonly used is to pad
synchronization variables to not have any other useful data in the same cache line
as the synchronization variable. Construct a pathological example when not doing
this can hurt performance. Assume a snooping write invalidate protocol.
Solution:
Assume a cache line that has a synchronization variable and the data guarded
by that synchronization variable in the same cache line. Assume a two processor
system with one processor performing multiple writes on the data and the other
processor spinning on the synchronization variable. With an invalidate protocol,
false sharing will mean that every access to the cache line ends up being a miss
resulting in significant performance penalties.
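A minimal sketch of the padding fix implied by the question (the struct and field names
are hypothetical, and a 64-byte cache line with line-aligned allocation is assumed):

/* Problematic layout: the spin variable and the guarded data share a cache line,
   so every write to data[] invalidates the line the other processor is spinning on. */
struct unpadded {
    volatile int lock;
    int data[15];
};

/* Padded layout: assuming the structure starts at a cache-line boundary, the guarded
   data begins on the next 64-byte line, so writes to data[] no longer invalidate the
   line holding the synchronization variable. */
struct padded {
    volatile int lock;
    char pad[60];        /* 4-byte lock + 60-byte pad = one full 64-byte line */
    int data[15];
};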
Thanks for Reading