Question 1(4.13)
Assume a GPU architecture that contains 10 SIMD processors.
Each SIMD instruction has a width of 32 and each SIMD processor contains 8
lanes for single-precision arithmetic and load/store instructions, meaning that each non-diverged SIMD
instruction can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that cause,
on average, 80% of the threads to be active. Assume that 70% of all SIMD instructions executed are single-
precision arithmetic and 20% are load/store. Since not all memory latencies
are covered, assume an average SIMD instruction issue rate of 0.85. Assume that
the GPU has a clock speed of 1.5 GHz.
a. Compute the throughput, in GFLOP/sec, for this kernel on this GPU.
b. Assume that you have the following choices:
(1) Increasing the number of single-precision lanes to 16
(2) Increasing the number of SIMD processors to 15 (assume this change
doesn't affect any other performance metrics and that the code scales to the additional
processors)
(3) Adding a cache that will effectively reduce memory latency by 40%,
which will increase the instruction issue rate to 0.95.
What is the speedup in throughput for each of these improvements?
Solution:
A. Given information:
Clock speed: 1.5 GHz
SIMD processors: 10, each with 8 single-precision lanes
Active threads (due to branch divergence): 80%
Instruction issue rate: 0.85
Single-precision arithmetic instructions: 70%
SIMD instruction width: 32 threads
Cycles for one non-diverged SIMD instruction to produce its 32 results on 8 lanes: 4
Keep in mind:
GFLOP/sec is a unit used to measure the performance of a computer's floating-point
unit, commonly referred to as the FPU. One GFLOP/sec is one billion (1,000,000,000)
floating-point operations per second.
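Putting the given numbers together (one common simplification: scale the peak lane
throughput by the utilization factors above):
Throughput = 10 SIMD processors * 8 lanes * 1.5 GHz * 0.85 issue rate * 0.80 active * 0.70 FP
           = 57.12 GFLOP/sec, i.e. roughly 57 GFLOP/sec for this kernel.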
B.
Option 1: if we increase the number of single-precision lanes to 16, each non-diverged
SIMD instruction (width 32) needs only 2 cycles instead of 4 to produce its 32 results.
Option 2: increase the number of SIMD processors from 10 to 15.
Option 3: increase instruction issue rate to 0.95
Option 3 is best.
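For reference, under the same multiplicative model as in part A and with everything else
held constant: option (2) scales throughput by 15/10 = 1.5x and option (3) by
0.95/0.85 ≈ 1.12x, while option (1) doubles the peak arithmetic rate, so its benefit is
at most 2x and depends on whether the 0.85 issue rate can keep 16 lanes busy.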
Question 2(4.14)
In this exercise, we will examine several loops and analyze
their potential for parallelization.
a. Does the following loop have a loop-carried dependency?
for (i=0;i<100;i++)
{
A[i] = B[2*i+4];
B[4*i+5] = A[i];
}
b. In the following loop, find all the true dependences, output dependences, and
antidependences. Eliminate the output dependences and antidependences by renaming.
for (i=0;i<100;i++)
{
A[i] = A[i] * B[i]; /* S1 */
B[i] = A[i] + c; /* S2 */
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
}
c. Consider the following loop:
for (i=0;i < 100;i++)
{
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
Are there dependences between S1 and S2? Is this loop parallel? If not, show how
to make it parallel.
Solution:
A. To check whether a loop-carried dependency exists between B[2*i+4] and
B[4*i+5], we apply the GCD (Greatest Common Divisor) test: a dependence can
exist only if GCD(2, 4) divides 5 - 4.
GCD(2, 4) = 2 and (5 - 4) mod 2 = 1.
GCD(2, 4) does not divide (5 - 4), so no loop-carried dependency exists through B.
The accesses to A use the same index i in both statements, so A does not create a
loop-carried dependency either.
Note: the answer in the book states that a loop-carried dependency does exist in
this case, which conflicts with the GCD-test result above.
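As a quick cross-check (a hypothetical brute-force sketch, not part of the original
solution), we can test directly that a store index 4*i+5 never equals a load index
2*j+4 for any pair of iterations:

#include <stdio.h>

int main(void) {
    int i, j, overlap = 0;
    for (i = 0; i < 100; i++)         /* store index: 4*i + 5 */
        for (j = 0; j < 100; j++)     /* load index:  2*j + 4 */
            if (4 * i + 5 == 2 * j + 4)
                overlap = 1;
    /* prints "no": 4*i + 5 is always odd while 2*j + 4 is always even */
    printf("overlap: %s\n", overlap ? "yes" : "no");
    return 0;
}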
B. Output dependences
S1 and S3 both write A[i], which is a write-after-write (output) dependence:
A[i] = A[i] * B[i]; /* S1 */
A[i] = C[i] * c; /* S3 */
Antidependences
S3 and S4 form an antidependence through C[i]: S3 reads C[i] and S4 then writes it,
a write-after-read dependence.
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
There are also antidependences from S1 to S2 through B[i] (S1 reads B[i], then S2
writes it) and from S2 to S3 through A[i] (S2 reads A[i], then S3 rewrites it). The
renaming below removes the A[i] case; removing the B[i] case as well would
additionally require renaming B in S2 (for example, writing B1[i]).
True dependences
S1 and S2 through A[i]:
A[i] = A[i] * B[i]; /* S1 */
B[i] = A[i] + c; /* S2 */
S2 reads the value of A[i] written by S1, a read-after-write dependence.
S3 and S4 through A[i]:
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
S4 reads the value of A[i] written by S3, another read-after-write dependence.
Re-written code (with renaming)
for (i=0;i<100;i++)
{
T[i] = A[i] * B[i]; /* S1: T[] is a temporary array introduced for renaming */
B[i] = T[i] + c; /* S2 */
A1[i] = C[i] * c; /* S3: A1[] and C1[] are renamed copies of A and C */
C1[i] = D[i] * A1[i]; /* S4 */
}
The output dependence and the antidependence on A[i] and C[i] are removed by
renaming: A and C are renamed because both were involved in those dependences,
while the true dependences (S1 to S2, S3 to S4) are preserved.
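As an optional check (a hypothetical sketch, not part of the original solution), the
renamed loop can be run next to the original one to confirm that B receives the same
values and that A1 and C1 end up holding the values the original loop leaves in A and C:

#include <stdio.h>
#include <string.h>
#define N 100

int main(void) {
    double A[N], B[N], C[N], D[N];      /* inputs for the original loop          */
    double Ar[N], Br[N], Cr[N];         /* identical inputs for the renamed loop */
    double T[N], A1[N], C1[N];
    double c = 3.0;
    int i, same = 1;
    for (i = 0; i < N; i++) { A[i] = i; B[i] = 2*i; C[i] = i + 1; D[i] = i + 2; }
    memcpy(Ar, A, sizeof A); memcpy(Br, B, sizeof B); memcpy(Cr, C, sizeof C);
    for (i = 0; i < N; i++) {           /* original loop */
        A[i] = A[i] * B[i];
        B[i] = A[i] + c;
        A[i] = C[i] * c;
        C[i] = D[i] * A[i];
    }
    for (i = 0; i < N; i++) {           /* renamed loop */
        T[i]  = Ar[i] * Br[i];
        Br[i] = T[i] + c;
        A1[i] = Cr[i] * c;
        C1[i] = D[i] * A1[i];
    }
    for (i = 0; i < N; i++)
        if (B[i] != Br[i] || A[i] != A1[i] || C[i] != C1[i]) same = 0;
    printf("renamed loop matches: %s\n", same ? "yes" : "no"); /* expected: yes */
    return 0;
}

Note that after the renamed loop, code that previously read A or C must read A1 or C1
instead (or the values must be copied back).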
C. There is a loop-carried (true) dependence on array B: S2 writes B[i+1] in iteration
i, and S1 reads that value in iteration i+1. The dependence is not circular, so the loop
can still be made parallel by restructuring it.
Original code
for (i=0;i < 100;i++)
{
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
The loop-carried dependence runs from S2 (which writes B[i+1]) to S1 of the next
iteration (which reads B[i+1]).
Re-written code: restructure the loop so that no dependence crosses an iteration boundary
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1)
{
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];
To make this kind of loop parallel, we interchange the two statements and peel the
first execution of S1 and the last execution of S2 out of the loop, so that each new
iteration computes B[i+1] and then immediately uses it. No value now crosses an
iteration boundary, so the remaining loop is parallel.
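A quick check (a hypothetical sketch, not part of the original solution) that the
restructured loop computes exactly the same A and B values as the original loop:

#include <stdio.h>
#include <string.h>
#define N 101   /* B is indexed up to B[100] */

int main(void) {
    double A[N], B[N], C[N], D[N], A2[N], B2[N];
    int i, same = 1;
    for (i = 0; i < N; i++) { A[i] = i; B[i] = i + 1; C[i] = 2*i; D[i] = 3*i; }
    memcpy(A2, A, sizeof A); memcpy(B2, B, sizeof B);
    for (i = 0; i < 100; i++) {       /* original loop */
        A[i] = A[i] + B[i];
        B[i + 1] = C[i] + D[i];
    }
    A2[0] = A2[0] + B2[0];            /* restructured loop */
    for (i = 0; i < 99; i++) {
        B2[i + 1] = C[i] + D[i];
        A2[i + 1] = A2[i + 1] + B2[i + 1];
    }
    B2[100] = C[99] + D[99];
    for (i = 0; i < N; i++)
        if (A[i] != A2[i] || B[i] != B2[i]) same = 0;
    printf("restructured loop matches: %s\n", same ? "yes" : "no"); /* expected: yes */
    return 0;
}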
Question 3(4.15)
List and describe at least four factors that influence the performance of
GPU kernels. In other words, which runtime behaviors that are caused by the
kernel code cause a reduction in resource utilization during kernel execution?
Solution:
Typically, the performance of a GPU kernel is limited by one of the following factors:
memory throughput
instruction throughput
latency
a combination of the above
A. Branch divergence: causes SIMD lanes to be masked off when threads within a
SIMD thread group follow different control paths
B. Covering memory latency: a sufficient number of active threads can hide
memory latency and increase instruction issue rate
C. Coalesced off-chip memory references: memory accesses should be organized
consecutively within SIMD thread groups
D. Use of on-chip memory: memory references with locality should take
advantage of on-chip memory, and references to on-chip memory within a SIMD
thread group should be organized to avoid bank conflicts
Question 4(4.16)
Assume a hypothetical GPU with the following characteristics:
■ Clock rate 1.5 GHz
■ Contains 16 SIMD processors, each containing 16 single-
precision floating point units
■ Has 100GB/sec off-chip memory bandwidth
Without considering memory bandwidth, what is the peak single-precision
floating-point throughput for this GPU in GFLOP/sec, assuming that all memory
latencies can be hidden? Is this throughput sustainable given the memory
bandwidth limitation?
Solution:
Given data:
Clock rate: 1.5 GHz
SIMD processors: 16, each containing 16 single-precision floating-point units (FPUs)
Off-chip memory bandwidth: 100 GB/sec
Peak single-precision throughput:
16 SIMD processors * 16 FPUs * 1.5 GHz = 384 GFLOP/sec
To sustain this rate, each FLOP needs, at a minimum, two 4-byte input operands and one
4-byte result, i.e. 12 bytes of memory traffic per FLOP. The total bandwidth needed at
the peak rate (assuming no temporal locality) is therefore:
12 bytes/FLOP * 384 GFLOP/sec = 4.6 TB/sec
This is far more than the 100 GB/sec of off-chip memory bandwidth available, so the
peak throughput is not sustainable. It can still be achieved in short bursts when the
operands come from on-chip memory.
Question 5(5.4)
For the following code sequences and the timing parameters for the two
implementations in Figure 5.36, compute the total stall cycles for the base MSI
protocol and the optimized MOSI protocol in Exercise 5.3. Assume that state
transitions that do not require bus transactions incur no additional stall cycles.
A.
P1: read 110
P3: read 110
P0: read 110
B.
P1: read 120
P3: read 120
P0: read 120
C.
P0: write 120 <-- 80
P3: read 120
P0: read 120
D.
P0: write 108 <-- 88
P3: read 108
P0: write 108 <-- 98
Additional Information from question(5.2)
The performance of a snooping cache-coherent multiprocessor depends on many
detailed implementation issues that determine how quickly a cache responds with data in an
exclusive or M state block. In some implementations, a CPU read miss to a cache block that is
exclusive in another processor’s cache is faster than a miss to a block in memory. This is
because caches are smaller, and thus faster, than main memory. Conversely, in some
implementations, misses satisfied by memory are faster than those satisfied by caches. This is
because caches are generally optimized for “front side” or CPU references, rather than “back
side” or snooping accesses. For the multiprocessor illustrated in Figure 5.35, consider the
execution of a sequence of operations on a single CPU where
■ CPU read and write hits generate no stall cycles.
■ CPU read and write misses generate Nmemory and Ncache stall cycles if satisfied
by memory and cache, respectively.
■ CPU write hits that generate an invalidate incur Ninvalidate stall cycles.
■ A write-back of a block, due to either a conflict or another processor’s
request to an exclusive block, incurs an additional Nwriteback stall cycles.
Consider two implementations with different performance characteristics summarized in Figure
5.36. Consider the following sequence of operations assuming the initial cache state in Figure
5.35. For simplicity, assume that the second operation begins after the first completes (even
though they are on different processors):
P1: read 110
P3: read 110
For Implementation 1, the first read generates 50 stall cycles because the read is satisfied by
P0’s cache. P1 stalls for 40 cycles while it waits for the block, and P0 stalls for 10 cycles while
it writes the block back to memory in response to P1’s request. Thus, the second read by P3
generates 100 stall cycles because its miss is satisfied by memory, and this sequence
generates a total of 150 stall cycles. For the following sequences of operations, how many stall
cycles are generated by each implementation?
Solution:
This question builds directly on Exercise 5.3, so review that exercise first.
MSI: Modified, Shared, Invalid (a dirty block must be written back to memory when it is
shared or replaced).
MOSI: adds an Owned state (O holds the dirty copy and supplies it to readers, so an
immediate write-back to memory is not required).
The stall counts below use the Implementation 1 parameters implied by the worked example
above and by part D: Nmemory = 100, Ncache = 40, Nwriteback = 10, Ninvalidate = 15.
A.
P1: read 110, read miss, satisfied from P0's cache (the first read always misses)
P3: read 110, read miss; MSI satisfies it from memory, MOSI satisfies it from
P0's cache (Owned)
P0: read 110, read hit
MSI: 40 + 10 + 100 + 0 = 150 stall cycles
MOSI: 40 + 10 + 40 + 10 + 0 = 100 stall cycles
Because MOSI does not write the block back to memory, P0 keeps the dirty copy in the
Owned state and satisfies P3's miss from its cache instead of memory, saving 50 stall cycles.
B.
P1: read 120, Read miss, satisfied in memory
P3: read 120, Read hit
P0: read 120, Read miss, satisfied in memory
Both protocols: 100 + 0 + 100 = 200 stall cycles
C.
P0: write 120 ←80, Write miss, invalidates P3
P3: read 120, Read miss, P0’s cache
P0: read 120, Read hit
Both protocols: 100 + 40 + 10 + 0 = 150 stall cycles
D.
P0: write 108 ← 88, Send invalidate, invalidate P3
P3: read 108, Read miss, P0’s cache
P0: write 108 ← 98, Send invalidate, invalidate P3
Both protocols: 15 + 40 + 10 + 15 = 80 stall cycles
Question 6(5.6)
Assume the cache contents of Figure 5.35 and the timing of Implementation 1 in
Figure 5.36. What are the total stall cycles for the following code sequences with
both the base protocol and the new MESI protocol in Exercise 5.5? Assume that
state transitions that do not require interconnect transactions incur no additional
stall cycles.
A.
P0: read 100
P0: write 100 <-- 40
B.
P0: read 120
P0: write 120 <-- 60
C.
P0: read 100
P0: read 120
D.
P0: read 100
P1: write 100 <-- 60
E.
P0: read 100
P0: write 100 <-- 60
P1: write 100 <-- 40
Figure 5.35: initial cache and memory contents (referenced by the solutions below)
Solution:
A.
P0: read 100, read miss, satisfied from memory; there are no sharers, so MSI
installs the block in S while MESI installs it in E (the only copy, so no broadcast
will be needed on the next write)
P0: write 100 ← 40, MSI: send invalidate; MESI: silent transition from E to M
MSI: 100 + 15 = 115 stall cycles
MESI: 100 + 0 = 100 stall cycles
B.
P0: read 120, read miss, satisfied from memory; the block is already shared, so
both copies end up in S
P0: write 120 ← 60, both protocols must send an invalidate
Both: 100 + 15 = 115 stall cycles
C.
P0: read 100, read miss, satisfied from memory; no sharers, so MSI: S, MESI: E
P0: read 120, read miss, satisfied from memory; the conflicting block is clean
(S or E) and is silently replaced
Both: 100 + 100 = 200 stall cycles (the replacement from S or E needs no bus transaction)
D.
P0: read 100, read miss, satisfied from memory; no sharers, so MSI: S, MESI: E
P1: write 100 ← 60, write miss, satisfied from memory regardless of protocol
(this MESI variant does not supply data from the E state, though some protocols do)
Both: 100 + 100 = 200 stall cycles
E.
P0: read 100, read miss, satisfied from memory; no sharers, so MSI: S, MESI: E
P0: write 100 ← 60, MSI: send invalidate; MESI: silent transition from E to M
P1: write 100 ← 40, write miss, satisfied from P0's cache, which must write the
dirty block back to memory
MSI: 100 + 15 + 40 + 10 = 165 stall cycles
MESI: 100 + 0 + 40 + 10 = 150 stall cycles
Question 7(5.10)
Directory protocols are more scalable than snooping protocols because
they send explicit request and invalidate messages to those nodes that have copies of a
block, while snooping protocols broadcast all requests and invalidates to all nodes.
Consider the eight-processor system illustrated in Figure 5.37 and assume that all
caches not shown have invalid blocks. For each of the sequences below, identify which
nodes (chip/processor) receive each request and invalidate.
a. P0,0: write 100 <-- 80
b. P0,0: write 108 <-- 88
c. P0,0: write 118 <-- 90
d. P1,0: write 128 <-- 98
Figure 5.37: Multichip and Multicore Processor
Solution:
Pi,j denotes processor i in chip j. Each processor has a single direct-mapped L1
cache that holds two blocks, each holding two words. Each chip has a single direct-
mapped L2 cache that holds two blocks, each holding two words. To simplify the
illustration, the cache address tags contain the full address and each word shows only two
hex characters, with the least significant word on the right. The L1 cache states are
denoted M, S, and I for Modified, Shared, and Invalid. Both the L2 caches and memories
have directories. The directory states are denoted DM, DS, and DI for Directory
Modified, Directory Shared, and Directory Invalid.
P#: <op> <address> [ <-- <value> ]
where P# designates the CPU (e.g., P0,0), <op> is the CPU operation (e.g., read or write),
<address> denotes the memory address, and <value> indicates the new word to be
assigned on a write operation.
A.
P0,0: write 100 ← 80, Write hit only seen by P0,0
B.
P0,0: write 108 ← 88, Write “upgrade” received by P0,0; invalidate
received by P3,1
C.
P0,0: write 118 ← 90, Write miss received by P0,0; invalidate received by
P1,0
D.
P1,0: write 128 ← 98, Write miss received by P1,0.
Question 8(5.11)
Exercise 5.3 asked you to add the Owned state to the simple MSI snooping
protocol. Repeat the question, but with the simple directory protocol above.
Solution:
Figure 2: Directory States
Question 9(5.19)
Assume that we have a function for an application of the form F(i, p), which gives
the fraction of time that exactly i processors are usable given that a
total of p processors is available. That means that
∑ (i = 1 to p) F(i, p) = 1
Assume that when i processors are in use, the application runs i times faster.
Rewrite Amdahl's law so that it gives the speedup as a function of p for this
application.
Solution:
The general form for Amdahl's Law is

Speedup = Execution time_old / Execution time_new

All that needs to be done to compute the formula for speedup in this
multiprocessor case is to derive the new execution time. The exercise states that
for the portion of the original execution time that can use i processors is given by
F(i,p). If we let Execution timeold be 1, then the relative time for the application
on p processors is given by summing the times required for each portion of the
execution time that can be sped up using i processors, where i is between 1 and p.
This yields

Execution time_new = ∑ (i = 1 to p) F(i, p) / i

Substituting this value for Execution time_new into the speedup equation makes
Amdahl's Law a function of the available processors, p:

Speedup(p) = 1 / ( ∑ (i = 1 to p) F(i, p) / i )
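As a quick illustration with hypothetical numbers (not taken from the exercise): for
p = 2 with F(1, 2) = 0.4 and F(2, 2) = 0.6, Execution time_new = 0.4/1 + 0.6/2 = 0.7,
so Speedup(2) = 1/0.7 ≈ 1.43.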
Question 10(5.21)
Show how the basic snooping protocol of Figure 5.7 can be
changed for a write-through cache. What is the major hardware functionality that
is not needed with a write-through cache compared with a write-back cache?
Solution:
To keep the figures from becoming cluttered, the coherence protocol is
split into two parts as was done in Figure 5.6 in the text. Figure S.34 presents the
CPU portion of the coherence protocol, and Figure S.35 presents the bus portion
of the protocol. In both of these figures, the arcs indicate transitions and the text
along each arc indicates the stimulus (in normal text) and bus action (in bold text)
that occurs during the transition between states. Finally, like the text, we assume a
write hit is handled as a write miss.
Figure S.34 presents the behavior of state
transitions caused by the CPU itself. In this case, a write to a block in either the
invalid or shared state causes us to broadcast a “write invalidate” to flush the
block from any other caches that hold the block and move to the exclusive state.
We can leave the exclusive state through either an invalidate from another
processor (which occurs on the bus side of the coherence protocol state diagram),
or a read miss generated by the CPU (which occurs when an exclusive block of
data is displaced from the cache by a second block). In the shared state only a
write by the CPU or an invalidate from another processor can move us out of this
state. In the case of transitions caused by events external to the CPU, the state
diagram is fairly simple, as shown in Figure S.35.
When another processor writes
a block that is resident in our cache, we unconditionally invalidate the
corresponding block in our cache. This ensures that the next time we read the
data, we will load the updated value of the block from memory. Also, whenever
the bus sees a read miss, it must change the state of an exclusive block to shared
as the block is no longer exclusive to a single cache. The major change introduced
in moving from a write-back to write-through cache is the elimination of the need
to access dirty blocks in another processor’s caches. As a result, in the write-
through protocol it is no longer necessary to provide the hardware to force write
back on read accesses or to abort pending memory accesses. As memory is
updated during any write on a write-through cache, a processor that generates a
read miss will always retrieve the correct information from memory. Basically, it
is not possible for valid cache blocks to be incoherent with respect to main
memory in a system with write-through caches.
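A minimal sketch in C of the write-through protocol just described, written as a
next-state function (the state and event names are invented for illustration; this is
not the textbook's Figure S.34/S.35, and a write hit is treated as a write miss as
assumed above):

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum {
    CPU_READ_HIT,    /* CPU reads a block it already holds                         */
    CPU_READ_MISS,   /* CPU read allocates the block (possibly displacing another) */
    CPU_WRITE,       /* CPU write: broadcast a write invalidate on the bus         */
    BUS_WRITE_INV,   /* another processor broadcast a write invalidate             */
    BUS_READ_MISS    /* another processor had a read miss for this block           */
} Event;

BlockState next_state(BlockState s, Event e) {
    switch (e) {
    case CPU_WRITE:                     /* from Invalid or Shared, after the invalidate */
        return EXCLUSIVE;
    case CPU_READ_MISS:                 /* memory is always up to date (write-through)  */
        return SHARED;
    case CPU_READ_HIT:
        return s;                       /* no state change */
    case BUS_WRITE_INV:
        return INVALID;                 /* unconditionally invalidate our copy */
    case BUS_READ_MISS:
        return (s == EXCLUSIVE) ? SHARED : s;   /* block is no longer exclusive */
    }
    return s;
}

Because the cache is write-through, no transition ever has to force a write-back of
dirty data from another cache, which is exactly the hardware the prose above notes
can be removed.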
Question 11(5.22)
Add a clean exclusive state to the basic snooping cache coherence
protocol (Figure 5.7). Show the protocol in the format of Figure 5.7?
Solution:
To augment the snooping protocol of Figure 5.7 with a Clean Exclusive state we
assume that the cache can distinguish a read miss that will allocate a block
destined to have the Clean Exclusive state from a read miss that will deliver a
Shared block. Without further discussion we assume that there is some
mechanism to do so.
The three states of Figure 5.7 and the transitions between
them are unchanged, with the possible clarifying exception of renaming the
Exclusive (read/write) state to Dirty Exclusive (read/write). The new Clean
Exclusive (read only) state should be added to the diagram along with the
following transitions.
From Clean Exclusive to Clean Exclusive in the event of a CPU read
hit on this block or a CPU read miss on a Dirty Exclusive block
From Clean Exclusive to Shared in the event of a CPU read miss on a
Shared block or on a Clean Exclusive block
From Clean Exclusive to Shared in the event of a read miss on the bus
for this block
From Clean Exclusive to Invalid in the event of a write miss on the bus
for this block
From Clean Exclusive to Dirty Exclusive in the event of a CPU write
hit on this block or a CPU write miss
From Dirty Exclusive to Clean Exclusive in the event of a CPU read
miss on a Dirty Exclusive block
From Invalid to Clean Exclusive in the event of a CPU read miss on a
Dirty Exclusive block
From Shared to Clean Exclusive in the event of a CPU read miss on a
Dirty Exclusive block
Several transitions from the original protocol must change to accommodate the
existence of the Clean Exclusive state. The following three transitions are those
that change.
From Dirty Exclusive to Shared, the label changes to CPU read miss
on a Shared block
From Invalid to Shared, the label changes to CPU miss on a Shared
block
From Shared to Shared, the miss transition label changes to CPU read
miss on a Shared block
Question 12(5.23)
One proposed solution for the problem of false sharing is to add a valid bit per
word. This would allow the protocol to invalidate a word without removing the
entire block, letting a processor keep a portion of a block in its cache while
another processor writes a different portion of the block. What extra
complications are introduced into the basic snooping cache coherence protocol
(Figure 5.7) if this capability is included? Remember to consider all possible
protocol actions.
Solution:
An obvious complication introduced by providing a valid bit per word is the
need to match not only the tag of the block but also the offset within the block
when snooping the bus. This is easy, involving just looking at a few more bits. In
addition, however, the cache must be changed to support write-back of partial
cache blocks. When writing back a block, only those words that are valid should
be written to memory because the contents of invalid words are not necessarily
coherent with the system. Finally, given that the state machine of Figure 5.7 is
applied at each cache block, there must be a way to allow this diagram to apply
when state can be different from word to word within a block. The easiest way to
do this would be to provide the state information of the figure for each word in the
block. Doing so would require much more than one valid bit per word, though.
Without replication of state information the only solution is to change the
coherence protocol slightly.
Question 13(5.26)
Assume a directory-based cache coherence protocol. The directory
currently has information that indicates that processor P1 has the data in
“exclusive” mode. If the directory now gets a request for the same cache block
from processor P1, what could this mean? What should the directory controller
do? (Such cases are called race conditions and are the reason why coherence
protocols are so difficult to design and verify.)
Solution:
The problem illustrates the complexity of cache coherence protocols. In this
case, this could mean that the processor P1 evicted that cache block from its cache
and immediately requested the block in subsequent instructions. Given that the
writeback message is longer than the request message, with networks that allow
out-of-order requests, the new request can arrive before the write back arrives at
the directory.
One solution to this problem would be to have the directory wait for
the write back and then respond to the request. Alternatively, the directory can
send out a negative acknowledgment (NACK). Note that these solutions need to
be thought out very carefully since they have potential to lead to deadlocks based
on the particular implementation details of the system. Formal methods are often
used to check for races and deadlocks.
Question 14(5.30)
One performance optimization commonly used is to pad
synchronization variables to not have any other useful data in the same cache line
as the synchronization variable. Construct a pathological example when not doing
this can hurt performance. Assume a snooping write invalidate protocol.
Solution:
Assume a cache line that has a synchronization variable and the data guarded
by that synchronization variable in the same cache line. Assume a two processor
system with one processor performing multiple writes on the data and the other
processor spinning on the synchronization variable. With an invalidate protocol,
false sharing will mean that every access to the cache line ends up being a miss
resulting in significant performance penalties.
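A minimal sketch of the padding fix implied by the question (the struct and field names
are hypothetical, and a 64-byte cache line with line-aligned allocation is assumed):

/* Problematic layout: the spin variable and the guarded data share a cache line,
   so every write to data[] invalidates the line the other processor is spinning on. */
struct unpadded {
    volatile int lock;
    int data[15];
};

/* Padded layout: assuming the structure starts at a cache-line boundary, the guarded
   data begins on the next 64-byte line, so writes to data[] no longer invalidate the
   line holding the synchronization variable. */
struct padded {
    volatile int lock;
    char pad[60];        /* 4-byte lock + 60-byte pad = one full 64-byte line */
    int data[15];
};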
Thanks for Reading