
Module 5: Software and Parallel Programming

Subject Code: 17CS72


Credits: 4-0-0
Text Book: Kai Hwang & Naresh Jotwani,
“Advanced Computer Architecture”,
Parallelism, Scalability and Performance
Module 5: Software and Parallel
Programming

In this module, we will look into the concepts of
Parallel Computers in terms of software and
programming requirements.

Following Chapters are included in this module:

Parallel Models, Languages and Compilers

Parallel Program Development and Environments

Instruction Level Parallelism
Parallel Models, Languages and Compilers


In this chapter, we consider the programming and compiler aspects of parallel and vector computers. Looking beyond architectural capabilities, we study basic parallel programming and the compiler optimizations aimed specifically at achieving parallelism.


Following are the topics for our understanding:
 Parallel Programming Models
 Parallel Language and Compilers
 Dependence Analysis of Data Arrays
Legends

SIMD : Single Instruction stream, Multiple Data streams

I/O: Input/Output

IPC :Inter Process Communications

PE : Processing Element (processor)

OOP: Object Oriented Programming

COOP: Concurrent OOP
10.1 Parallel Programming Models

Parallel programming models are designed for parallel computers such as multiprocessors, multicomputers and SIMD/vector machines. The following five models are characterized:
1) Shared Variable Model
2) Message Passing Model
3) Data Parallel Model
4) Object-Oriented Model
5) Functional & Logic Model

Most programming systems consider processors as active resources, and memory and I/O devices as passive resources.

The basic computational units in a parallel program are processes; a program is a collection of processes.

The classification of the first two models is based on how interprocess communication (IPC) is implemented: through shared variables or through message passing.

The fundamental issues in parallel programming with these models are the specification, creation, suspension, reactivation, migration, termination and synchronization of concurrent processes residing on the same or different processors.
1) Shared Variable Model

Multiprocessor programming is typically based on the use of shared variables in a common memory for IPC, as shown in the diagram below.

The model demands the use of shared memory and mutual exclusion among multiple processes accessing the same set of variables.

Following are the issues in using this model:
Protected access of critical sections
Memory consistency
Atomicity of memory operations
Fast synchronization
Shared data structures
Fast data movement techniques
Contd....

Protected access of critical sections
  Race conditions, locks, binary & counting semaphores, etc.

Memory consistency
  Multiprogramming, multiprocessing, multitasking, multithreading

Atomicity of memory operations
  Partitioning & replication

Fast synchronization
  Scheduling & synchronization

Shared data structures & fast data movement techniques
  Cache coherence & protection
2) Message Passing Model

As depicted in the diagram, two processes residing at different processor nodes communicate with each other by message passing (for example, a socket program in networking).

The two schemes used in message-passing programming are synchronous message passing (also termed the blocking communication scheme) and asynchronous message passing (also termed the non-blocking communication scheme).

Benefits and issues of the two schemes:
  Synchronous (blocking): no buffers are required, but sender and receiver must synchronize—both wait until the transfer completes.
  Asynchronous (non-blocking): no synchronization is required and buffers decouple sender and receiver, but only finite buffer space is available.
  In both schemes there is no need for mutual exclusion on shared variables, and program code and data must be distributed or duplicated across the processor nodes.
3) Data-Parallel Model

Most SIMD computers operate in lockstep.

Synchronization is enforced explicitly by the hardware; this, together with explicit flow control, allows data-parallel code to be written and debugged easily.

Data-parallel languages are obtained directly by modifying standard serial programming languages.

Data-parallel programs require pre-distributed data sets, and interconnected data structures are used for data-exchange operations.

Here we consider fine-grain data parallelism with synchronous control, as in SIMD computers.

Benefits:
 Data parallelism leads to a high degree of parallelism at the data level, compared to the lower degree of parallelism at the instruction level.
 Synchronization of data-parallel operations is done at compile time rather than at run time (in contrast with the previous models).
 Hardware synchronization is enforced by the control unit (to carry out lockstep operations in SIMD computers).

Need of the hour for data parallelism:
 In early SIMD computers, the difficulty was matching the problem size with the fixed machine size. Large arrays (say, satellite data for weather forecasting) or matrices had to be partitioned into 64-element segments before they could be effectively processed by the 64 processors (PEs).
 Later SIMD computers offered bit-slice, fine-grain data parallelism using 16,384 PEs concurrently in a single array configuration, demanding a lower degree of segmentation and offering higher flexibility in programming.
 With synchronous SIMD computers, mutual exclusion and synchronization problems are avoided.
 With inter-PE communication controlled directly by hardware, lockstep operation is carried out in computation as well as data communication, making SIMD computers efficient at spatial parallelism over large arrays, grids, meshes, etc.
 Features like broadcasting, masking and data routing also help the data-parallel model make better use of SIMD computers.

Which are the array languages?
 Array extensions in this model are represented by high-level data types.
 An SIMD programming language should provide a global address space along with explicit routing of data between PEs.
 Examples of array-processing languages are CFD, DAP Fortran, C* and MPF, targeted at various computer architectures.

How does the compiler help this model?
 To support the data-parallel model, array language expressions and their optimizing compilers are used together. This unifies the program execution model, facilitates precise control over massively parallel hardware, and enables incremental migration to data-parallel execution.
 Compiler-optimized control allows the programmer to use the PE array transparently.
 The compiler technology also supports array extensions, optimizes data placement and minimizes data movement; the compiler itself generates the data-parallel code.
 Array sectioning, vector-valued subscripts and gather/scatter operations are introduced in the compiler to help the programmer deal with data-parallel models.
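
A minimal illustration of the idea in C, with OpenMP used as a stand-in for a data-parallel compiler (array and function names are illustrative, not from the text): the elementwise loop below is the kind of code that a data-parallel compiler can map onto PE-array or vector hardware in lockstep, because every element operation is independent.

#include <stddef.h>

/* C = A + B elementwise: all iterations are independent, so a data-parallel
 * compiler can execute the elements in lockstep across the PEs.            */
void vector_add(const float *a, const float *b, float *c, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}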
4) Object- Oriented Model

An object-oriented model is characterized as follows:
 Objects are dynamically created and manipulated.
 Processing is performed by sending and receiving messages among objects.
 Concurrent programming models are built up from low-level objects such as processes, queues and semaphores into high-level objects like monitors and program modules.

What was the Need for Concurrent OOP?
 Objects are program entities which encapsulate data and operations into single computational units.
 Inheritance follows as a natural consequence of the concept of objects.
 The following demands led to the exploration of concurrent OOP:

Increased use of interacting processes by individual users.

Cost-effectiveness of resource sharing and distributed problem solving.

Advances in multiprocessor technology, bringing supercomputing power at a fraction of the traditional cost.
 The development of COOP provides an alternative model for concurrent computing on multiprocessors and multicomputers.

What is the framework for COOP?
 An actor model is used as a framework for COOP, supporting patterns of reuse and classification (for example, inheritance).
 Here, actors are self-contained, interactive, independent components of a computing system that communicate by asynchronous message passing.
 The basic actor primitives are:

Create

Send-to

Become
 State changes are specified by a behavior-replacement mechanism, which allows one to aggregate changes and to avoid unnecessary control-flow dependences.

How does parallelism work in COOP?
 Three common patterns of parallelism are observed in the practice of COOP:

Pipeline Concurrency

Divide & Conquer Concurrency

Cooperative Problem Solving
5) Functional & Logic Models

Two language-oriented programming models for achieving parallel processing are:
 Model using Functional Programming
Languages

Ex: LISP, SISAL & Strand 88.
 Model based on Logic Programming
Languages

Ex: Concurrent Prolog, Parlog

Functional Programming Model
 Here the emphasis is on the functionality of a program; a functional program has no side effects after execution.
 The concepts of storage, assignment and branching are avoided.
 The lack of side effects opens up much more opportunity for parallelism.
 Precedence restrictions arise only where the evaluation of one function needs the result of another, so the evaluation order in a functional program is inherently dynamic.
 Single-assignment and dataflow language features allow the functional programming model to be applied easily to data-driven multiprocessors.
 The functional model emphasizes fine-grain MIMD parallelism.

Logic Programming Model.
 Based on predicate logic (an AI topic), the logic programming model is ideal for knowledge processing dealing with large databases.
 In this model, an implicit search strategy is adopted, which in turn supports parallelism.
 The processes of unification and matching can also be parallelized.
 Clauses in logic programs can be transformed into dataflow graphs.

Both the functional and logic programming models have been used in artificial intelligence applications, where parallel processing is very much in demand.
10.2 Parallel Language and
Compilers

The programming environment required for parallel computers is much more demanding than that for sequential computers.

A programming environment is always a collection of software tools and system support (hardware).

Users should not have to spend a lot of time on low-level programming details; the programmer should focus on expressing parallelism using high-level abstractions.

Thus, to understand this part we will study following topics:
 Language and Features for Parallelism.
 Parallel Language Constructs.
 Optimizing Compilers for Parallelism.
Language Features for Parallelism

Language features supporting parallel programming are classified into six categories according to functionality:
 Optimization features
 Availability features
 Synchronization/communication features
 Control of parallelism
 Data parallelism features
 Process management features
 (OASCDP)

Optimization Features
 These features are used for program restructuring and compilation directives when converting sequentially coded programs into parallel form.
 Their purpose is to match the software parallelism with the hardware parallelism of the target machine.

Automated parallelizer

Semi-automated parallelizer

Interactive restructuring support

Availability Features
 These features enhance user-friendliness, make the language portable to a large class of parallel computers, and expand the applicability of software libraries.

Scalability

Compatibility

Portability

Synchronization Features
 Following are the features which are desirable for
Synchronization/Communication Support

Single-assignment languages

Shared Variables(Locks) for IPC

Logically shared memory

Send/Receive for message passing

Remote Procedure Call

Dataflow Languages

Barriers, Mailbox, Semaphores, Monitors

Control of Parallelism
 These features involve control constructs for specifying parallelism:

Coarse, Medium or Fine Grain

Explicit vs Implicit parallelism

Global parallelism in the entire program

Loop Parallelism in Iterations

Task-Split Parallelism

Shared Task Queue

Divide & Conquer paradigm

Shared Abstract Data Types

Task Dependency Specification

Data parallelism Features
 These features specify how data are accessed and distributed, in either SIMD or MIMD computers:

Run-Time Automatic Decomposition

Mapping Specification

Virtual Processor Support

Direct Access to Shared Data

SPMD (Single Program, Multiple Data) support

Process Management Features
 These features support the efficient creation of parallel processes, the implementation of multithreading or multitasking, program partitioning and replication, and dynamic load balancing at run time:

Dynamic Process creation at run-time

Lightweight Process(Threads)

Replicated Workers

Partitioned networks

Automatic Load Balancing
Parallel Language Constructs

To exploit parallelism in programs, we first look at the array notation used in Fortran 90 and then at the constructs used in parallel programs:
 Fortran 90 Array Notations:

A multidimensional data array is represented by an array name indexed by a sequence of subscript triplets, one for each dimension.

Triplets for different dimensions are separated by commas:

Examples are

e1 : e2 : e3

e1: e2

e1 : * : e3

e1 : *

*
 where each ei is an arithmetic expression that must produce a scalar integer value.

Contd....

The first expression e1 is a lower bound, the second expression e2 an upper bound, and the third e3 an increment (stride).

For Example, B(1:4:3 , 6:8:2 , 3) represents four elements
B(1,6,3),B(4,6,3),B(1,8,3) & B(4,8,3)

When the third expression in a triplet is missing, (e1: e2), a unit
stride is assumed.

The * notation in the second expression (e1 : * : e3) indicates
all elements in that dimension starting from e1, or the entire
dimension if e1 is also omitted.

When both e2 & e3 are ommited, the e1 alone represnts a
single element in that dimension, example A(5) represents the
fifth element in the array A(3:7:2).
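
A small illustrative sketch in C (not part of the text) that enumerates the index combinations selected by the section B(1:4:3, 6:8:2, 3), confirming the four elements listed earlier:

#include <stdio.h>

/* Enumerate the indices selected by the triplets (1:4:3, 6:8:2, 3). */
int main(void)
{
    for (int i = 1; i <= 4; i += 3)           /* first dimension: 1:4:3 -> 1, 4  */
        for (int j = 6; j <= 8; j += 2)       /* second dimension: 6:8:2 -> 6, 8 */
            printf("B(%d,%d,%d)\n", i, j, 3); /* third dimension: single index 3 */
    return 0;
}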

Contd...
 Parallel Flow Controls:

The conventional Fortran Do loop declares that all scalar instructions within the (Do, Enddo) pair are executed sequentially.

To declare parallel activities, we use the (Doall, Endall) pair; all iterations of a Doall loop are totally independent of each other.

When successive iterations depend on each other, we use the (Doacross, Endacross) pair to declare parallelism with loop-carried dependences.

Ex:

Doacross I = 2, N

Do J = 2, N
S1: A(I,J) = (A(I,(J-1))+A(I,(J+1)))/2
Enddo
Endacross
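
As a rough modern analogue (a sketch assuming C with OpenMP, not from the text; C uses 0-based indexing, unlike the Fortran example): for the statement A(I,J) = (A(I,J-1) + A(I,J+1))/2, each outer iteration touches only its own row, so the outer loop can run Doall-style in parallel, while the inner loop carries a dependence through A(I,J-1) and must stay sequential.

#define N 1024
float A[N][N];

void update(void)
{
    #pragma omp parallel for             /* Doall-like: independent row updates     */
    for (int i = 2; i < N; i++)
        for (int j = 2; j < N - 1; j++)  /* sequential: loop-carried dependence in j */
            A[i][j] = (A[i][j - 1] + A[i][j + 1]) / 2.0f;
}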
 Contd...

Another program construct is the (Cobegin, Coend) pair. All computations specified within the block may be executed in parallel.

Synchronization among the concurrent processes within the pair is implied.

Eg:

Cobegin

P1

P2

P3

 Pn

Coend

Similarly we have (Forall, Endall), (Pardo, Parend), (Parbegin, Parend),
Fork and Join Commands to achieve parallelism.
Optimizing Compilers for Parallelism


Since high-level languages are now used almost exclusively to write programs, optimizing compilers have become a necessity in modern computers.

The main role of the compiler is to remove the burden of program optimization and code generation from the programmer.

A parallelizing compiler consists of the following phases:
 Flow analysis
 Optimization
 Code generation, as shown in the figure.
 Flow Analysis

This phase reveals the program flow patterns in order to determine the data and control dependences in the source code.

Depending on the machine structure, the granularity of parallelism to exploit is decided.

Flow analysis is therefore conducted at different execution levels on different parallel computers.
 Program Optimization

Its a transformation from user programs in order to explore the hardware
capabilities as much as possible.

Transformation are usually conducted at the Loop Level, Locality Level or the
prefetching level, leading to global optimization.

Most transformations are machine independent which is required to maximize
speed of code execution.

The optimization technique includes:
 Vectorization using pipelined hardware
 Parallelization using multiple processors
 Parallel Code Generation

Code generation usually invloves transformation from one
representation to another, called as Internediate Form.

A Code model will be chosen from Intermediate form.

Parallel codes are even more demanding because parallel
constructs must be included.

Code Generation is linked with the Instruction Scheduling policies
used.

Basic blocks linked by control-flow commands are often optimized to
encourage a high degree of parallelism.

Special data strucutures are needed to represent instruction blocks.
10.3 Dependence Analysis of Data
Arrays

Here we learn about dependence testing across successive iterations over multidimensional data arrays.

To understand this, we first look into examples of hardware parallelism and software parallelism.

Later we look into dependence analysis in three subsections:
 Iteration space & dependence analysis
 Subscript separability & partitioning
 Categorized dependence tests
Iteration Spaces & Dependence Analysis


Precise and efficient dependence tests are essential to the effectiveness of parallelizing compilers.

The process of computing all the data dependences in a program is called dependence analysis.
 Dependence Testing
 Iteration Space
 Dependence Equations
 Distance and Direction Vectors
Subscript Separability and Partitioning


Dependence testing has two goals:
 It tries to disprove dependence between pairs of subscripted references to the same array variable.
 If a dependence exists, it tries to characterize it in some manner, usually as a minimum complete set of distance and direction vectors.

To Understand this we will look into:
 Subscript Categories
 Subscript Separability
 Subscript Partitioning
Categorized Dependence Tests

The goal (end result) of dependence testing is to construct the complete set of distance and direction vectors representing potential dependences between an arbitrary pair of subscripted references to the same array variable.

Here we look into:
 Testing Algorithm
 Test Categories;

ZIV Test

SIV Test

Weak-Zero SIV Test

Weak-Crossing SIV Test

MIV Tests
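
As a hedged sketch of the two simplest cases (not the textbook's full algorithm; names are illustrative), in C: the ZIV test compares two constant subscripts, and the strong SIV test checks whether subscripts of the form a*i + c1 and a*i + c2 can refer to the same element for loop indices within the bounds, producing the dependence distance when they can.

#include <stdbool.h>
#include <stdlib.h>

/* ZIV test: both subscripts are constants; dependence only if they are equal. */
bool ziv_dependent(long c1, long c2)
{
    return c1 == c2;
}

/* Strong SIV test: subscripts a*i + c1 (write) and a*i + c2 (read), loop index
 * running from lo to hi.  A dependence exists iff d = (c1 - c2)/a is an integer
 * whose magnitude does not exceed the iteration range; d is the distance.       */
bool strong_siv_dependent(long a, long c1, long c2, long lo, long hi, long *dist)
{
    if (a == 0)
        return ziv_dependent(c1, c2);      /* degenerates to the ZIV case */
    long diff = c1 - c2;
    if (diff % a != 0)
        return false;                      /* distance is not an integer  */
    long d = diff / a;
    if (labs(d) > hi - lo)
        return false;                      /* distance exceeds loop range */
    *dist = d;
    return true;
}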
Module 5: Software and Parallel
Programming

In this module, we will look into the concepts of
Parallel Computers in terms of software and
programming requirements.

Following Chapters are included in this module:

Parallel Models, Languages and Compilers

Parallel Program Development and Environments

Instruction Level Parallelism
Instruction Level Parallelism

Module 5 : Chapter 12
Topics
• Computer Architecture
• Basic Design Issues
Computer Architecture and Computer
Organization
• Architecture describes what the computer does.
• Organization describes how it does it.
Example:
• Say you are constructing a house: the overall design and high-level decisions come under computer architecture, while building it brick by brick, connecting everything together while keeping the basic design in mind, comes under computer organization.
12.1 Computer Architecture
• Computer architecture defined as the
arrangement by which the various system
building blocks – processors, functional units,
main memory, cache, data paths and so on –
are interconnected and inter-operated to
achieve desired system performance.

Processor Design and Processor Performance
• System performance is the key benchmark in
the study of computer architecture.
– Criteria: Scalability, price, usability, reliability, power
consumption, physical size
• A basic rule of system design is that
– no performance bottlenecks in the system.
• Typically, a performance bottleneck arises when
one part of the system cannot keep up with the
overall throughput requirements of the system.
This system exhibits a performance mismatch between the processors, main
memory and the processor-memory bus.
summary
• Processor design is the central element of computer
system design.
• Since system design can only be carried out with specific
target application loads in mind, it follows that processor
design should also be tailored for target application loads.
• To satisfy overall system performance criteria, various
elements of the system must be balanced in terms of
their performance.
– No element of the system should become a performance
bottleneck.
12.2 Basic Design Issues
• designs which aim to maximize the
exploitation of instruction level parallelism
need deeper pipelines;
– such designs may support higher clock rates.
• Instruction-level parallelism (ILP) is a measure
of how many of the instructions in a 
computer program can be executed
simultaneously.
• Let us examine the trade-off involved in this context in a simplified way:
total chip area = number of cores X chip area per core
or
total transistor count = number of cores X transistor count per core
• At a given time, VLSI technology limits the left hand side in the above
equations, while the designer must select the two factors on the right.
Aggressive exploitation of instruction level parallelism, with multiple
functional units and more complex control logic, increases the chip area
—and transistor count—per processor core. Alternatively, for a different
category of target applications, the designer may select simpler cores,
and thereby place a larger number of them on a single chip.
• Within a processor, a set of instructions are in
various stages of execution at a given time—
within the pipeline stages, functional units,
operation buffers, reservation stations, and so on.
• Therefore machine instructions are not in general
executed in the order in which they are stored in
memory, and all instructions under execution
must be seen as ‘work in progress’.
• To maintain the work flow of instructions within
the processor, a superscalar processor makes
use of branch prediction—i.e. the result of a
conditional branch instruction is predicted even
before the instruction executes—so that
instructions from the predicted branch can
continue to be processed, without causing
pipeline stalls. The strategy works provided fairly
good branch prediction accuracy is maintained.
• we shall assume that instructions are committed in order.
Here committing an instruction means that the instruction
is no longer ‘under execution’—the processor state and
program state reflect the completion of all operations
specified in the instruction.
• Thus we assume that, at any time, the set of committed
instructions correspond with the program order of
instructions and the conditional branches actually taken.
Any hardware exceptions generated within the processor
must reflect the processor and program state resulting
from instructions which have already committed.
Summary
One of the main processor design trade-offs faced in this
context is this:
• Should the processor be designed to squeeze the
maximum possible parallelism from a single thread, or
should processor hardware support multiple independent
threads, with less aggressive exploitation of instruction
level parallelism within each thread?
• In this chapter, we will study the various standard
techniques for exploiting instruction level parallelism, and
also discuss some of the related design issues and trade-
offs.
Topics…
• Problem Definition
Data Dependences
Control Dependences
Resource Dependences
• Model of a Typical Processor
12.3 Problem Definition
• The execution of machine instructions from a single
sequential stream.
– The instructions are stored in main memory in program
order, from where they must be fetched into the
processor, decoded, executed, and then committed in
program order.
– We assume that the processor has a load-store type of
instruction set, which means that all arithmetic and
logical operations are carried out on operands which are
present in programmable registers.
opcode operand-1 operand-2 result
• Data transfer instructions have only two
operands—source and destination registers;
load and store instructions to/from main
memory specify one operand in the form of a
memory address, using an available
addressing mode.
• Effective address for load and store is
calculated at the time of instruction execution.
• Conditional branch instructions need to be treated as a special
category, since each such branch presents two possible
continuations of the instruction stream.
• Branch decision is made only when the instruction executes; at
that time, if instructions from the branch-not-taken are in the
pipeline, they must be flushed. But pipeline flushes are costly in
terms of lost processor clock cycles.
• The payoff of branch prediction lies in the fact that correctly
predicted branches allow the detection of parallelism to stretch
across two or more basic blocks of the program, without pipeline
stalls. It is for this reason that branch prediction becomes an
essential technique in exploiting instruction level parallelism.
• Limits to detecting and exploiting instruction level parallelism
are imposed by dependences between instructions.
– If N instructions are completely independent of each other, they can
be executed in parallel on N functional units—if N functional units
are available—and they may even be executed in arbitrary order.
• But in fact dependences amongst instructions are a central
and essential part of program logic. A dependence specifies
that instruction Ik must wait for instruction Ij to complete.
– Within the instruction pipeline, such a dependence may create a
hazard or stall—i.e. lost processor clock cycles while Ik waits for Ij to
complete.
Data Dependences
• Assume that instruction Ik follows instruction
Ij in the program. Data dependence between Ij
and Ik means that both access a common
operand.
• let us assume that the common operand of Ij and
Ik is in a programmable register. Since each
instruction either reads or writes an operand
value, accesses by Ij and Ik to the common
register can occur in one of four possible ways:
1. Read by Ik after read by Ij (RAR)
2. Read by Ik after write by Ij (RAW)
3. Write by Ik after read by Ij (WAR)
4. Write by Ik after write by Ij (WAW)
• Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Hazards cause delays in the pipeline. There are mainly three types of data hazards:
1) RAW (Read after Write) [flow/true data dependency]
2) WAR (Write after Read) [anti-dependency]
3) WAW (Write after Write) [output dependency]
(WAR and WAW hazards can be removed by register renaming.)
• Let there be two instructions I and J, such that J follows I. Then:
• RAW hazard occurs when instruction J tries to read data before instruction I writes it.
Eg:
I: R2 <- R1 + R3
J: R4 <- R2 + R3
• WAR hazard occurs when instruction J tries to write data before instruction I reads it.
Eg:
I: R2 <- R1 + R3
J: R3 <- R4 + R5
• WAW hazard occurs when instruction J tries to write output before instruction I
writes it.
Eg:
I: R2 <- R1 + R3
J: R2 <- R4 + R5
• WAR and WAW hazards occur during the out-of-order execution of the instructions.
Part (a) of Fig. 12.3 shows a basic block of six instructions,
denoted I1 through I6 in program order. Entry to the basic
block may be from one of multiple points within the
program; continuation after the basic block would be at
one of several points, depending on the outcome of
conditional branch instruction at the end of the block.

Part (b) of the figure shows a possible pattern of dependences as they may exist amongst these six instructions. For simplicity, we have not shown the type of each dependence, e.g. RAW(R3), etc. In the partial order, we see that several pairs of instructions—such as (I1, I2) and (I3, I4)—are not related by any dependence. Therefore, amongst each of these pairs, the instructions may be executed in any order, or in parallel.
Control Dependences
• Assume that instruction Ij is a conditional
branch and that, whether another instruction
Ik executes or not depends on the outcome of
the conditional branch instruction Ij. In such a
case, we say that there is a control
dependence of instruction Ik on instruction Ij.
• How can the processor designer mitigate the
adverse impact of such control dependences in
a program?
• Answer: Using some form of branch and jump
prediction—i.e. predicting early and correctly
(most of the time) the results of conditional
branches, indirect jumps, and procedure
returns.
Example 12.2 Impact of successful branch prediction

• Assume that we have attained 93% accuracy in branch prediction in a processor with eight pipeline stages. Assume also that the mis-prediction penalty is 4 processor clock cycles to flush the instruction pipeline. What is the performance gain from such a branch prediction strategy?
Answer

• In our case, the probability of a correct branch prediction is 0.93 and the probability of a wrong prediction is 0.07.
• Thus the expected cost of a conditional branch instruction is 0.07 × 4 = 0.28 clock cycle.
• As a primitive form of branch prediction, the processor designer could assume that a conditional branch is always taken, and continue processing the instructions which follow at the target address.
• Let us assume that this simple strategy works 80% of the time; then the expected cost of a conditional branch is 0.2 × 4 = 0.8 clock cycles.
• Considering that 15% to 20% of the instructions in a program are branches and jumps, the difference in cost between 0.28 clock cycle and 4 clock cycles per branch instruction is huge, underlining the importance of branch prediction in a superscalar processor.
Resource (Structural) Dependences

• This dependence arises due to resource conflicts in the pipeline. A resource conflict is a situation in which more than one instruction tries to access the same resource in the same cycle. A resource can be a register, memory, or the ALU.
Summary
• Design a superscalar processor to detect and
exploit the maximum degree of parallelism
available in the instruction stream
– i.e. execute the instructions in the smallest
possible number of processor clock cycles
– by handling correctly the data dependences,
control dependences and resource dependences
within the instruction stream.
12.4 Model of A Typical Processor
Design Issues and trade-offs faced on processor
design
• To support parallel access to instructions and
data at the level of the fastest cache,
– L1 cache is divided into instruction cache and data
cache, and that this split L1 cache supports single
cycle access for instructions as well as data.
– Some processors may have an instruction buffer in
place of L1 instruction cache;
Terms to remember in specific design techniques
• The first three pipeline stages on our prototype processor are fetch,
decode and issue.
– Issue, the operation is handed over to the functional unit for execution.
• Following these are the various functional units of the processor,
which include integer unit(s), floating point unit(s), load/store unit(s)
• Let us assume that our superscalar processor is designed for k
instruction issues in every processor clock cycle. Clearly then the
fetch, decode and issue pipeline stages, as well as the other
elements of the processor, must all be designed to process k
instructions in every clock cycle.
• The process of issuing instructions to functional units also involves
instruction scheduling.
• For example, if instruction Ij cannot be issued because the required
functional unit is not free, then it may still be possible to issue the next
instruction Ij+1—provided that no dependence between the two prohibits
issuing instruction Ij+1. (out-of-order execution)
• Static Scheduling
– Compiler techniques for scheduling
• Dynamic Scheduling
– Uses hardware to rearrange instructions to reduce stalls
– Done at run time
• the basic aim in both types of scheduling—static as well as dynamic—is to
maximize the instruction level parallelism which is exploited in the
executing sequence of instructions.
The significance of processor state, program state and committing an
executed instruction
• At one time multiple instructions are in various stages of
execution within the processor.
– But processor state and program state need to be maintained which
are consistent with the program order of completed instructions.
– This is important from the point of view of preserving the semantics
of the program.
– Therefore, even with multiple instructions executing in parallel, the
processor must arrange the results of completed instructions so that
their sequence reflects program order.
• One way to achieve this is by using a reorder buffer, shown in
Fig.12.4, which allows instructions to be committed in
program order, even if they execute in a different order;
A re-order buffer (ROB) is used in a Tomasulo algorithm for out-of-order instruction
execution. It allows instructions to be committed in-order.
Normally, there are three stages of instructions: "Issue", "Execute", "Write Result". In
Tomasulo's algorithm, there is an additional stage "Commit". In the "Write Result"
stage, the results are just put in the re-order buffer then the state changes to
commit. All contents in this buffer can then be used when executing other
instructions depending on these.[wiki]
• Functional units are provided with reservation stations, which accept operations issued by the issue stage of the instruction pipeline.
• A functional unit performs an operation when the required
operands for it are available in the reservation station.
• Such designs usually also make use of operand forwarding
over a common data bus (CDB), with tags to identify the
source of data on the bus.
• Such a design also implies register renaming, which
resolves RAW and WAW dependences.
Topics…
• Compiler-detected Instruction Level
Parallelism
• Operand Forwarding
• Reorder Buffer
12.5 Compiler-detected Instruction Level
Parallelism
• In the process of translating a sequential source program into
machine language, the compiler performs extensive syntactic
and semantic analysis of the source program.
• The compiler can contribute to the exploitation of implicit
instruction level parallelism.
• One simple technique which the compiler can employ is known
as loop unrolling,
– by which independent instructions from multiple successive iterations
of a loop can be made to execute in parallel.
– Unrolling means that the body of the loop is repeated n times for n
successive values of the control variable—so that one iteration of the
transformed loop performs the work of n iterations of the original loop.
Example 12.4 Loop unrolling (the 'before' and 'after' versions of the loop are shown in the figure).
• Note that loop unrolling by the compiler does
not in itself involve the detection of
instruction level parallelism. But loop unrolling
makes it possible for the compiler or the
processor hardware to exploit a greater
degree of instruction level parallelism.
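
Since the slide's before/after code is not reproduced here, the following is a hedged sketch in C of what unrolling by a factor of 4 typically looks like (array and function names are illustrative):

/* Original loop */
void scale(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 2.0f;
}

/* Unrolled by a factor of 4: one iteration of the transformed loop does the
 * work of four original iterations, exposing independent operations that the
 * compiler or the hardware can schedule in parallel.                        */
void scale_unrolled(float *a, const float *b, int n)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        a[i]     = b[i]     * 2.0f;
        a[i + 1] = b[i + 1] * 2.0f;
        a[i + 2] = b[i + 2] * 2.0f;
        a[i + 3] = b[i + 3] * 2.0f;
    }
    for (; i < n; i++)                    /* clean-up loop for leftover iterations */
        a[i] = b[i] * 2.0f;
}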
Example 12.5 Dependence across loop
iterations

Consider the following loop in a source program, which has a crucial new dependence built into it (the loop code is shown in the figure):
Now the value calculated in the ith iteration of the loop makes use of the
value c[i – 1] calculated in the previous iteration.
This does not mean that the modified loop cannot be unrolled, but only
that extra care should be taken to account for the dependence.
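
The loop itself is not reproduced in these slides; a hedged sketch in C of the kind of loop being described (an assumed form in which each iteration uses c[i-1] from the previous iteration) is:

/* Assumed form of the loop in Example 12.5: iteration i uses the value c[i-1]
 * computed by the previous iteration, i.e. a loop-carried (RAW) dependence.  */
void running_sum(const float *a, float *c, int n)
{
    c[0] = a[0];
    for (int i = 1; i < n; i++)
        c[i] = c[i - 1] + a[i];
}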
With static instruction scheduling by the compiler, does the processor designer still need to provide for dynamic scheduling in hardware?
• Dependences amongst references to simple variables, or amongst
array elements whose index values are known at compile time (as in
the two examples seen above), can be analyzed relatively easily at
compile time.
• But when pointers are used to refer to locations in memory, or when
array index values are known only at run-time, then clearly
dependence analysis is not possible at compile time.
• Therefore processor hardware must provide support at run time for alias analysis (memory disambiguation).
• Cache misses, I/O interrupts, and hardware exceptions cannot be predicted at compile time. Therefore, apart from alias analysis, compiler-detected instruction level parallelism also requires dynamic scheduling support within the processor.
A further step in the direction of compiler-detected instruction level parallelism and static scheduling can be the following:

• Suppose each machine instruction specifies multiple operations—to be carried out in parallel within the processor, on multiple functional units.
– Multi-operation instructions would require a larger number of bits to encode. Therefore processors with this type of instruction word are said to have a very long instruction word (VLIW).
• The machine language program produced by the compiler then
consists of such multi-operation instructions, and their scheduling
takes into account all the dependences amongst instructions.
• explicitly parallel instruction computer (EPIC) instruction format can
be more flexible than the fixed format of multi-operation VLIW
instruction; for example, it may allow the compiler to encode
explicitly dependences between operations.
• Another possibility is that of having predicated instructions in
the instruction set, whereby an instruction is executed only if
the hardware condition (predicate) specified with it holds
true.
– Such instructions would result in reduced number of conditional
branch instructions in the program, and could thereby lower the
number of pipeline flushes.
• The aim behind VLIW and EPIC processor architecture is to
assign to the compiler primary responsibility for the parallel
exploitation of plentiful hardware resources of the processor
– provide a third alternative to the RISC and CISC styles of processor
architecture.
12.6 Operand Forwarding

• Operand forwarding helps in reducing the impact of true data dependences in the instruction stream.
• Consider the following simple sequence of two instructions in a running program:
ADD R1, R2, R3
SHIFTR #4, R3, R4
• There is a RAW dependence between the two instructions—the output of the first is required as an input operand of the second.
• In clock cycle Tk, ALU output is transferred to
R3 over an internal data path.
• In the next clock cycle Tk + 1, the content of
R3 is transferred to ALU input for the right
shift. When carried out in this order, clearly
the two data transfer operations take two
clock cycles.
• But the required two transfers of data can be
achieved in only one clock cycle if ALU output is
sent to both R3 and ALU input in the same clock
cycle
• This type of an operation within a processor is known as operand
forwarding.
• Basically this means that, instead of performing two or more data
transfers from a common source one after the other, we perform them in
parallel.
• This can be seen as parallelism at the level of elementary data transfer
operations within the processor.
• To achieve this aim, the processor hardware must be designed to detect
and exploit on the fly all such opportunities for saving clock cycles.
• Benefit: the wait within a functional unit for an operand becomes shorter because, as soon as it is available, the operand is sent in one clock cycle, over the common data bus, to every destination where it is needed.
Example
ADD R1, R2, R3
SUB R5, R6, R7
SHIFTR #4, R3, R4

If SUB is executed in program order, then even without operand forwarding between
ADD and SHIFTR, no processor clock cycle is lost, since SHIFTR does not directly
follow ADD. But now suppose SUB is executed either before ADD, or after SHIFTR. In
both these cases, SHIFTR directly follows ADD, and therefore operand forwarding
proves useful in saving a processor cycle, as we have seen above.
• But why should SUB be executed in any order other than
program order?
• The answer can only be this: to achieve better utilization of
processor clock cycles.
• Therefore the processor must make on the fly decisions
such as
(i) transferring ALU output in parallel to both R3 and ALU input,
and/or
(ii) out of order execution of the instruction SUB.
• This is what is needed to implement dynamic scheduling of machine instructions within the processor.
12.7 Reorder Buffer
• The Reorder Buffer (ROB) tracks the state of all
inflight instructions in the pipeline.
• The role of the ROB is to provide the illusion to the
programmer that his program executes in-order.
• Since instructions execute in parallel on multiple
functional units, the reorder buffer serves the
function of bringing completed instructions back
into an order which is consistent with program
order.
Note that instructions may complete in an order which is not related to
program order, but must be committed in program order.
ROB Data Structure
1. instruction identifier
2. value computed
3. program-specified destination of the value
computed
4. flag indicating whether the instruction has
completed (i.e. the computed value is
available).

• The instruction at the head of the queue (instr[i]) is the one that would be committed next—if it has completed execution.
• When this instruction commits, its result value is copied to its destination, and the instruction is then removed from the reorder buffer.
• The next instruction to be issued in the issue stage of the instruction pipeline then joins the reorder buffer at its tail.
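
A minimal C sketch of the ROB entry fields listed above and of in-order commit from the head of a circular queue (field and function names are illustrative, not from the text):

#include <stdbool.h>

#define ROB_SIZE 64

struct rob_entry {
    int  instr_id;       /* 1. instruction identifier                     */
    long value;          /* 2. value computed                             */
    int  dest_reg;       /* 3. program-specified destination of the value */
    bool completed;      /* 4. flag: computed value is available          */
};

struct rob {
    struct rob_entry entries[ROB_SIZE];
    int head, tail, count;    /* circular queue: commit at head, issue at tail */
};

long reg_file[32];            /* architectural register file (illustrative)    */

/* Commit instructions strictly in program order from the head of the ROB. */
void rob_commit(struct rob *rb)
{
    while (rb->count > 0 && rb->entries[rb->head].completed) {
        struct rob_entry *e = &rb->entries[rb->head];
        reg_file[e->dest_reg] = e->value;       /* copy result to its destination */
        rb->head = (rb->head + 1) % ROB_SIZE;   /* remove committed entry         */
        rb->count--;
    }
}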
How the use of the reorder buffer addresses the various types of dependences in the program:

1. Data Dependences: A RAW dependence—i.e. true data dependence—will hold up the execution of the dependent instruction if the result value required as its input operand is not available.
– operand forwarding can be added to this scheme to speed
up the supply of the needed input operand as soon as its
value has been computed.
– WAR and WAW dependences—i.e. anti-dependence and
output dependence, respectively— also hold up the
execution of the dependent instruction and create a
possible pipeline stall.
2. Control Dependences: Suppose the instruction(s) in
the reorder buffer belong to a branch in the program
which should not have been taken—i.e. there has
been a mis-predicted branch.
– Clearly then the reorder buffer should be flushed along with
other elements of the pipeline.
– Therefore the performance impact of control dependences
in the running program is determined by the accuracy of
branch prediction technique employed.
– The reorder buffer plays no direct role in the handling of
control dependences.
3. Resource Dependences: If an instruction
needs a functional unit to execute, but the
unit is not free, then the instruction must
wait for the unit to become free
– If a subsequent instruction needs to use another
functional unit which is free, then the subsequent
instruction can be executed out of order.
Topics…
• Register Renaming
• Tomasulo’s Algorithm
12.8 Register Renaming

Register renaming (in hardware)
• change register names to eliminate WAR/WAW hazards
• one of the most elegant concepts in computer architecture

Before renaming:          After renaming:
DIVD R0,R2,R4             DIVD R0,R2,R4
ADDD R6,R0,R8             ADDD S,R0,R8
SD   R6,0(R1)             SD   S,0(R1)
SUBD R8,R10,R14           SUBD T,R10,R14
MULD R6,R10,R8            MULD R6,R10,T
• For example:
FADD R1, R2, R5
FSUB R3, R4, R5
• there would be occurrences of RAW, WAR and WAW dependences on programmable registers.
• RAW is true data dependence—since a value written by one instruction is used as an input operand by another. But a WAR or WAW dependence can be avoided if we have more registers to work with.
SOLUTION
• Both these instructions are writing to register R5, creating thereby
a WAW dependence—i.e. output dependence—on register R5.
• we have mapped—or renamed—R5 to X, for the purpose of
storing the result of FSUB, and thereby removed the WAW
dependence from the instruction stream.
• The processor is to make the additional registers invisible (X) to
machine language instructions
NOTE:
• But we must also assume that the instruction set architecture (ISA)
of the processor is fixed—i.e. we cannot change it to allow access
to a larger number of programmable registers.
Example: Register renaming and WAR dependence

FADD R6, R7, R2
FADD R2, R3, R5
FSUB R1, R3, R2

Solution:
FADD R6, R7, R2
FADD R2, R3, R5
FSUB R1, R3, Xn

The WAR dependence between the second FADD and FSUB is removed. (Xn is an invisible register.)
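
A hedged C sketch of the renaming idea (table sizes and names are illustrative): each architectural destination register is remapped to a fresh program-invisible physical register, and later source reads look the name up in the map, which removes WAR/WAW name conflicts.

#define NUM_ARCH_REGS 32
#define NUM_PHYS_REGS 128

static int rename_map[NUM_ARCH_REGS];    /* architectural -> physical register     */
static int next_free = NUM_ARCH_REGS;    /* naive allocator for illustration only;
                                            a real design recycles freed registers */

void rename_init(void)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        rename_map[i] = i;               /* initially, identity mapping            */
}

/* Source operands read the current mapping. */
int rename_src(int arch_reg)
{
    return rename_map[arch_reg];
}

/* A destination gets a fresh physical register, removing WAR/WAW conflicts with
 * earlier instructions that still refer to the old physical name.               */
int rename_dst(int arch_reg)
{
    int phys = next_free++;
    rename_map[arch_reg] = phys;
    return phys;
}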
12.9 Tomasulo’s Algorithm
• Tomasulo’s algorithm is a computer architecture hardware 
algorithm for dynamic scheduling of instructions that
allows out-of-order execution and enables more efficient
use of multiple execution units. It was developed by 
Robert Tomasulo at IBM in 1967 and was first implemented
in the IBM System/360 Model 91’s floating point unit.
• The major innovations of Tomasulo’s algorithm include 
register renaming in hardware, reservation stations for all
execution units, and a common data bus (CDB) on which
computed values are broadcast to all reservation stations that may need them.
• For register renaming, we need a set of program invisible registers to
which programmable registers are re-mapped.
• Tomasulo’s algorithm requires these program invisible registers to be
provided with reservation stations of functional units.
working
• At the time of instruction issue, the reservation station is filled out with the
operation code (op).
• If an operand value is available, for example in a programmable register, it
is transferred to the corresponding source operand field in the reservation
station.
• If the operand value is not available at the time of issue, the corresponding
source tag (t1 and/or t2) is copied into the reservation station.
• The source tag identifies the source of the required operand.
• As soon as the required operand value is available at its source the data
value is forwarded over the common data bus, along with the source tag.
• This value is copied into all the reservation station operand slots which
have the matching tag. (operand forwarding)
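
A minimal C sketch (illustrative field names, not from the text) of a reservation station entry and of the tag matching that happens when a result is broadcast on the common data bus:

#include <stdbool.h>

struct rs_entry {
    int  op;               /* operation code                                  */
    bool busy;
    long src1, src2;       /* source operand values, when available           */
    int  t1, t2;           /* source tags; 0 means the operand value is ready */
};

#define NUM_RS 16
static struct rs_entry rs[NUM_RS];

/* When a functional unit finishes, its result and source tag are broadcast on
 * the CDB; every reservation station slot with a matching tag copies the value
 * (operand forwarding).                                                       */
void cdb_broadcast(int tag, long value)
{
    for (int i = 0; i < NUM_RS; i++) {
        if (!rs[i].busy)
            continue;
        if (rs[i].t1 == tag) { rs[i].src1 = value; rs[i].t1 = 0; }
        if (rs[i].t2 == tag) { rs[i].src2 = value; rs[i].t2 = 0; }
    }
}

/* An entry may be dispatched to its functional unit once both tags are clear. */
bool ready_to_execute(const struct rs_entry *e)
{
    return e->busy && e->t1 == 0 && e->t2 == 0;
}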
Example: Tomasulo's algorithm and RAW
dependence
• Assume that instruction I1 is to write its result
into R4, and that two subsequent instructions
I2 and I3 are to read result value. Thus
instructions I2 and I3 are truly data dependent
(RAW dependent) on instruction I1.
Solution

• Assume that I1 has not even started executing when I2 and I3 are issued.
• When I2 and I3 are issued, they are parked in the reservation stations of the
appropriate functional units.
• Since the required result value from I1 is not available, these reservation
station entries of I2 and I3 get source tag corresponding to the output of I1.
• When the result of I1 becomes available at its functional unit, it is sent over
the common data bus along with the tag value of its source.
• At this point, programmable register R4 as well as the reservation stations
assigned to I2 and I3 have the matching source tag—since they are waiting
for the same result value, which is being computed by I1.
• Thus, through the use of source tags and the common data bus, in one clock
cycle, three destination registers receive the value produced by I1—
programmable register R4, and the operand registers in the reservation
stations assigned to I2 and I3.
Example: Combination of RAW and WAR
dependence
• Assume that instruction I1 is to write its result into R4, a subsequent
instructions I2 is to read that result value, and a latter subsequent
instruction I3 is then to write its result into R4.
• Thus instruction I2 is truly data dependent (RAW dependent) on
instruction I1, but I3 is anti-dependent (WAR dependent) on I2.
• When I2 is issued, it is parked in the reservation station
of the appropriate functional unit.
• Since the required result value from I1 is not available,
the reservation station entry of I2 also gets the source
tag corresponding to the output of I1
• Can I3 be issued even before I1 completes and I2 starts
execution?
• The answer is that, with register renaming I3 can be
issued even before I2 starts execution.
• The dependence between I1 and I2 is resolved as in the previous example.
• But suppose I3 is issued even before the output of I1 is available; register R4 is then set up (tagged) to be overwritten by the output of I3.
• When the output of I1 (finally) becomes available, it goes to the input of I2, but not to register R4, since this register's source tag now refers to I3. When the output of I3 becomes available, it goes correctly to R4 because of the matching source tag.
Example: Calculation of processor clock
cycles
1 LOAD mem-a, R4
2 FSUB R7, R4, R4
3 STORE mem-a, R4
4 FADD R4, R3, R7
5 STORE mem-b, R7
• We shall assume that (a) one instruction is issued per
clock cycle, (b) floating point operations take two clock
cycles each to execute, and (c) memory operations take
one clock cycle each when there is L1 cache hit.
• We get the total as 1+2+1+2+1 = 7.
solution
• The RAW dependences on registers R4 and R7 will cost three additional clock cycles
– i.e. between (i) instructions 1 and 2, (ii) instructions 2 and 3, and (iii) instructions 4 and 5.
• We get the total as 1+2+1+2+1+3 = 10.
• With operand forwarding—which is built into
Tomasulo’s algorithm—one clock cycle is saved
on account of each RAW dependence
• We get the total as 1+2+1+2+1 = 7.
Topics…
• Branch Prediction
• Limitations in Exploiting Instruction Level
Parallelism
• Thread Level Parallelism
12.10 Branch Prediction
Example: Predicting the outcome of a tossed coin
• Can one predict the result of a single coin toss?
• Prior knowledge: assume that a coin comes up heads in its first two tosses. Then simple conditional probability theory predicts that the third toss of the coin has a higher probability of coming up heads than of coming up tails.
• Like tossed coins, outcomes of conditional branches in computer programs also have yes and no answers—i.e. a branch is either taken or not taken.
two-bit predictor
• Two-bit counter is maintained for every conditional
branch instruction in the program.
• The two-bit counter has four possible states;
– When the counter state is 0 or 1, the respective branch is
predicted as taken;
– when the counter state is 2 or 3, the branch is predicted
as not taken.
• When two successive predictions come out wrong,
the prediction is changed from branch taken to
branch not taken, and vice versa.
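
A hedged C sketch of a two-bit saturating counter predictor consistent with the description above (states 0-1 predict taken, states 2-3 predict not taken; the table size and function names are illustrative):

#include <stdbool.h>

#define NUM_BRANCHES 4096
static unsigned char counter[NUM_BRANCHES];   /* one two-bit counter (0..3) per branch */

bool predict_taken(int branch_id)
{
    return counter[branch_id] <= 1;           /* states 0 and 1 predict taken          */
}

/* Two successive mis-predictions move the counter past the midpoint
 * and flip the prediction, matching the behavior described above.   */
void update_predictor(int branch_id, bool actually_taken)
{
    unsigned char *c = &counter[branch_id];
    if (actually_taken) {
        if (*c > 0) (*c)--;                   /* move toward the 'taken' states     */
    } else {
        if (*c < 3) (*c)++;                   /* move toward the 'not taken' states */
    }
}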
correlated predictors
• different branches may be correlated
– outcome of branch depends on outcome of other branches
– makes intuitive sense (programs are written this way)
• Example:
if (d == 0)
    d = 1;
if (d == 1)
    { ... }
B2 is correlated with B1: if B1 is not taken, then B2 will not be taken.
• Can branch prediction be carried out even before the
instruction is decoded
– —i.e. at the instruction fetch stage?
– Yes, if a so-called branch target buffer is provided which
has a history of recently executed conditional branches.
• Branch prediction based on the earlier history of the
same branch is known as local prediction,
• while prediction based on the history of other
branches in the program is known as global
prediction.
tournament predictor (Hybrid)
• A tournament predictor uses (i) a global
predictor, (ii) a local predictor, and (iii) a selector
which selects one of the two predictors for
prediction at a given branch instruction.
– The selector uses a two bit counter per conditional
branch to choose between the global and local
predictors for the branch.
– Two successive mis-predictions cause a switch from
the local predictor to the global predictor, and vice
versa
jump prediction
• which is applicable in indirect jumps
– computed go to statements (used in FORTRAN)
– switch statements (used in C and C++)
• The address associated with a procedure
return is obtained from the runtime procedure
stack in main memory;
Speculative Execution
• Instructions executed on the basis of a predicted branch,
before the actual branch result is known, are said to
involve speculative execution.
• If a branch prediction turns out to be correct, the
corresponding speculatively executed instructions must
be committed.
• If the prediction turns out to be wrong, the effects of
corresponding speculative operations carried out within
the processor must be cleaned up, and instructions from
another branch of the program must instead be executed.
Multiple Issue
– the processor tries to issue more than one instruction per cycle
(superscalar architectures)
• The instruction window (or simply window) is a special buffer provided downstream of the fetch unit in the processor; it holds fetched instructions from which the processor selects instructions to issue each cycle.
Conclusion
• For the targeted processor performance, the processor
designers must integrate and balance the hardware
techniques of branch prediction, dynamic scheduling,
speculative execution, internal data paths, functional units,
and an instruction window of appropriate size.
12.11 Limitations in Exploiting Instruction Level
Parallelism
• We have learned many CPU design techniques to
optimize performance
– Pipelining, superscalar, speculative execution, out-of-
order execution
• The common goal is to execute instructions in parallel (e.g. via pipelining), as many instructions as possible.
• Instruction-level parallelism (ILP) is a measurement
of how many of the instructions in a computer
program can be executed simultaneously
The Limitations of ILP
• Applications (algorithms)
 Different applications (algorithms) have different numbers of instructions
that can run simultaneously.
• Compiler sophistication
 Good compilers can generate and/or schedule instructions that run in
parallel
– E.g., VLIW compilers (in some degree)
• Hardware sophistication
 Complex hardware usually can find more instructions to run
 E.g., superscalar, speculation
 E.g., SIMD instructions
• In this lecture, we focus on the hardware limitations and the
application limitations.
Hardware Limitations of ILP
• The number of registers for renaming
 Essentially the size of the ROB
 The more ROB entries the more instructions can be examined for parallel
execution
• Branch (outcome and branch target) prediction accuracy
 Better branch prediction => fewer stalls and pipeline flushes
• Memory address alias analysis (disambiguation)
 Better memory aliasing analysis => more accurate dependencies
detection => fewer pipeline flushes from incorrect speculation,
fewer instructions stalled due to falsely/conservatively assumed
dependencies
• Memory/cache latency
 Faster memory/cache => fewer stalls due to slow memory accesses
The range of techniques which was explored in the study to detect and exploit
instruction level parallelism is summarized briefly below:
• Register renaming—with (a) infinite number of registers as renaming targets, (b)
finite number of registers, and (c) no register renaming.
• Alias analysis—with (a) perfect alias analysis, (b) two intermediate levels of alias
analysis, and (c) no alias analysis.
• Branch prediction—with (a) perfect branch prediction, (b) three hardware-based
branch prediction schemes, (c) three profile-based branch prediction schemes, and
(d) no branch prediction.
• Indirect jump prediction—with (a) perfect prediction, (b) intermediate level of
prediction, and (c) no indirect jump prediction.
• Window size—for some of the processor models, different window sizes from an
upper limit of 2048 instructions down to 4 instructions were used.
• Cycle width, i.e. the maximum number of instructions
which can be issued in one cycle—(a) 64,(b) 128, and (c)
bounded only by window size. Note that, from a practical
point of view, cycle widths of both 64 and 128 are on the
high side.
• Latencies of processor operations—five different latency
models were used, specifying latencies (in number of clock
cycles) for various processor operations.
• Loop unrolling—was carried out in some of the programs.
• Misprediction penalty—values of 0 to 10 clock cycles were
used
12.12 Thread Level Parallelism
• There can be much higher natural parallelism in some applications (e.g.,
Database or Scientific codes)
• Explicit Thread Level Parallelism or Data Level Parallelism
• Thread Level Parallelism (TLP): Execute the instructions from multiple
threads at the same time.
– Threads may be from one process, or they may be from
independent processes.
– Each thread has all the state (instructions, data, PC,
register state, and so on) necessary to allow it to execute
• Data Level Parallelism (DLP): Perform identical operations on data, and lots
of data
– i.e., SIMD.
• ILP exploits implicit parallel operations within
a loop or straight-line code segment
• TLP explicitly represented by the use of
multiple threads of execution that are
inherently parallel
• Goal: Use multiple instruction streams to
improve
– Throughput of computers that run many programs
– Execution time of multi-threaded programs
• TLP could be more cost-effective to exploit
Multi-Threaded Execution in One CPU
• Multithreading (MT): multiple threads to share the
functional units of one processor via overlapping
– Processor must duplicate resources to track the states
of each thread, e.g., separate copies of register file,
separate PCs, and for running independent programs,
separate page tables
– Memory shared through the virtual memory
mechanisms, which already support multiple processes
– HW support for fast thread switching, which is much faster than a full process switch (≈100s to 1000s of clock cycles)
• Depending on the specific strategy adopted
for switching between threads, hardware
support for multi-threading may be classified
as one of the following:
1. Coarse-grain multi-threading:
2. Fine-grain multi-threading:
3. Simultaneous multi-threading:
Fine-Grained Multi-threading
• Switches between threads on each instruction, causing the
execution of multiples threads to be interleaved
– programs are broken into large number of small tasks.
• Usually done in a round-robin fashion, skipping any stalled threads
• CPU must be able to switch threads every clock
• Advantage is it can hide both short and long stalls, since
instructions from other threads executed when one thread stalls
• Disadvantage is it slows down execution of individual threads,
since a thread ready to execute without stalls will be delayed by
instructions from other threads
• Used on Sun’s Niagara.
Coarse-Grained Multi-threading
• Switches threads only on costly stalls, such as L2 cache misses
– programs are broken into small number of large task.
• Used in IBM AS/400
• Advantages
 Relieves the need to have very fast thread-switching
 Does not slow down one thread, since instructions from other threads
issued only when the thread encounters a costly stall
• Disadvantage is hard to overcome throughput losses from shorter
stalls, due to pipeline start-up costs
 Since CPU issues instructions from 1 thread, when a stall occurs, the
pipeline must be emptied or frozen
 New thread must fill pipeline before instructions can complete
Simultaneous Multi-Threading
• Simultaneous Multi-Threading (SMT) is a variant of fine-grained MT.
• SMT is based on the observation that dynamic scheduling already
has the HW mechanisms to support
 Large set of ROB registers that can be used to hold the register sets of
independent threads
 Register renaming provides unique register identifiers, so instructions from
multiple threads can be mixed in datapath without confusing sources and
destinations across threads
 Out-of-order completion allows the threads to execute out of order, and get
better utilization of the HW
• Just add a per thread renaming table and keep separate PCs
 Independent commits can be supported by logically keeping a separate
reorder buffer for each thread
Module 5: Software and Parallel
Programming

In this module, we will look into the concepts of Parallel
Computers in terms of software and programming
requirements.

Following Chapters are included in this module:

Parallel Models, Languages and Compilers

Parallel Program Development and Environments

Instruction Level Parallelism

Refer Text Book


Thank you
