Module 5
In this chapter we consider programming and compilers for parallel and vector
computers. Going beyond architectural capabilities, we look into basic parallel
programming and the compiler optimizations used to achieve parallelism.
The topics covered are:
Parallel Programming Models
Parallel Language and Compilers
Dependence Analysis of Data Arrays
Legends
SIMD: Single Instruction stream, Multiple Data streams
I/O: Input/Output
IPC: Inter-Process Communication
PE: Processing Element (Processor)
OOP: Object-Oriented Programming
COOP: Concurrent OOP
10.1 Parallel Programming Models
Parallel programming models are designed for various parallel
computers, such as multiprocessors, multicomputers, and
vector/SIMD machines. The following five models are characterized:
1) Shared Variable Model
2) Message Passing Model
3) Data Parallel Model
4) Object-Oriented Model
5) Functional & Logic Model
Most programming systems regard processors as active resources,
and memory and I/O devices as passive resources.
The basic computational units in a parallel program are processes;
as we know, a program is a collection of processes.
The first two models here are classified by how interprocess
communication (IPC) is implemented.
The fundamental issues in parallel programming with these models are
the specification, creation, suspension, reactivation, migration,
termination and synchronization of concurrent processes residing on
the same or different processors.
1) Shared Variable Model
Multiprocessor programming is typically based on the use of shared
variables in a common memory for IPC, as shown in the diagram below.
The model demands the use of shared memory and mutual exclusion
among multiple processes accessing the same set of variables.
The issues in using this model are:
Protected access to critical sections
Memory consistency
Atomicity of memory operations
Fast synchronization
Shared data structures &
Fast data movement techniques
Contd....
Protected access to critical sections
  Race conditions, locks, binary & counting semaphores, etc.
Memory consistency
  Multiprogramming, multiprocessing, multitasking, multithreading
Atomicity of memory operations
  Partitioning & replication
Fast synchronization
  Scheduling & synchronization
Shared data structures & fast data movement techniques
  Cache coherence & protection
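As a concrete illustration of protected access to a critical section via shared
variables, here is a minimal POSIX-threads sketch in C (the shared counter,
thread count and iteration count are illustrative choices, not from the text):

#include <pthread.h>

/* Shared variable and the lock that guards it; the counter stands in
   for any shared data structure.                                      */
static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section       */
        shared_counter++;             /* protected access, no race    */
        pthread_mutex_unlock(&lock);  /* leave critical section       */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}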
2) Message Passing Model
As depicted in the diagram, two processes residing at different processor nodes
communicate with each other by message passing (for example, socket
programming in networks).
The message-passing programming model takes two forms:
Synchronous message passing (also termed the blocking communication scheme)
  Benefits: no buffers required; the message transfer itself synchronizes the
  two processes, so no separate synchronization is needed.
  Issues: sender and receiver must wait for each other.
Asynchronous message passing (also termed the non-blocking communication scheme)
  Benefits: the sender does not wait; buffers decouple sender and receiver.
  Issues: only finite buffer space is available.
In either form there are no shared variables, so no mutual exclusion is
required; on the other hand, program codes and data may have to be
distributed or duplicated across the processor nodes.
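Message passing is not tied to any particular library, but as a hedged sketch,
the two schemes can be expressed in C with MPI (MPI itself is not covered in
this chapter) roughly as follows:

#include <mpi.h>

/* Minimal sketch: process 0 sends an integer to process 1, first with a
   blocking send, then with a non-blocking send.                         */
int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Blocking send: returns only when the buffer may be reused.   */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Non-blocking send: returns immediately; overlap work, then wait. */
        MPI_Request req;
        MPI_Isend(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        /* ... other useful computation could go here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}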
3) Data-Parallel Model
Most SIMD computers operate in lockstep: synchronization and flow
control are handled by the hardware, which makes data-parallel code
easier to write and debug.
Data-parallel languages are obtained by directly modifying standard serial
programming languages.
Data-parallel programs require the use of pre-distributed data sets
and interconnected data structures for data-exchange operations.
Here we consider fine-grain data parallelism with synchronous
control for SIMD computers.
Benefits:
Data parallelism leads to a high degree of parallelism at the
data level, compared with the lower degree of parallelism
available at the instruction level.
Synchronization of data-parallel operations is done at
compile time rather than at run time (in contrast to the
previous models).
Hardware synchronization is enforced by the control unit
(to carry out lockstep operations in SIMD computers).
Why data parallelism is needed:
In early SIMD computers the difficulty was matching the problem size to a fixed
machine size. Large arrays (say, satellite data for weather forecasting) or
matrices had to be partitioned into 64-element segments before they could be
processed effectively by the 64 processing elements (PEs).
Later SIMD computers offered bit-slice, fine-grain data parallelism using
16,384 PEs concurrently in a single array configuration, demanding a
lower degree of segmentation and offering higher flexibility in programming.
With synchronous SIMD computers, mutual exclusion and synchronization
problems are avoided.
With inter-PE communication controlled directly by hardware, lockstep operation
is carried out in computation as well as in data communication, making
SIMD computers efficient at spatial parallelism over large arrays, grids, meshes,
etc.
Features such as broadcasting, masking and data routing also help the data-parallel
model exploit SIMD computers.
What do array language extensions look like?
Array extensions in this model are represented by
high-level data types.
An SIMD programming language should provide a
global address space along with explicit routing of
data between PEs.
Examples of array processing languages are CFD,
DAP Fortran, C* and MPF, each for a particular computer
architecture.
How does the compiler help this model?
To support the data-parallel model, array language expressions
and their optimizing compilers are embedded in the system. This
design unifies the program execution model, facilitates precise
control over massively parallel hardware, and enables incremental
migration to data-parallel execution.
The optimizing compiler lets the programmer drive the PE
array transparently.
Compiler technology also supports array extensions, optimizes
data placement, minimizes data movement, and generates the
data-parallel code itself.
Array sectioning, vector-valued subscripts and gather/scatter
operations are introduced in the compiler to help programmers
work with the data-parallel model.
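For instance, a gather operation packs elements selected by a vector-valued
subscript into a dense array, and a scatter writes them back; a minimal scalar
C sketch of their meaning (array and function names are illustrative):

/* Gather: dst(i) = src(idx(i)); scatter: dst(idx(i)) = src(i). */
void gather(double dst[], const double src[], const int idx[], int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

void scatter(double dst[], const double src[], const int idx[], int n)
{
    for (int i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}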
4) Object-Oriented Model
An object-oriented model for parallel computing is
characterized as follows:
Objects are dynamically created and manipulated.
Processing is performed by sending and receiving
messages among objects.
Concurrent programming models are built up from
low-level objects such as processes, queues and
semaphores into high-level objects like monitors and
program modules.
What was the need for concurrent OOP?
Objects are program entities that encapsulate data and
operations into single computational units.
Concurrency is a natural consequence of the concept of objects.
The following demands led to the exploration of concurrent OOP:
Increased interaction among processes driven by individual users.
The cost-effectiveness of resource sharing and distributed problem
solving.
Advances in multiprocessor technology bringing supercomputing
power at a traditional cost.
The development of COOP provides an alternative model for
concurrent computing on multiprocessors and multicomputers.
What is the framework for COOP?
An actor model, which supports patterns of reuse and classification
(e.g., inheritance), is used as a framework for COOP.
Actors are self-contained, interactive, independent components
of a computing system that communicate by asynchronous message
passing (semantics).
The basic actor primitives are:
Create
Send-to
Become
State changes are specified by a behavior-replacement mechanism,
which allows one to aggregate changes and to avoid unnecessary
control-flow dependences.
How does parallelism in COOP work?
Three common patterns of parallelism are observed
in the practice of COOP:
Pipeline concurrency
Divide & conquer concurrency (a small sketch follows this list)
Cooperative problem solving
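As a rough illustration of divide & conquer concurrency, here is a sequential
C sketch (not from the text; names are illustrative) with comments marking
where independent tasks could be handed to separate objects or actors:

/* Divide-and-conquer sum: the two half-sums are independent, so in a
   concurrent object/actor setting each half could be delegated to a
   separate object and the partial results combined when both replies
   arrive. Shown here sequentially for clarity.                        */
static double dc_sum(const double a[], int lo, int hi)
{
    if (hi - lo <= 1)
        return (hi > lo) ? a[lo] : 0.0;
    int mid = lo + (hi - lo) / 2;
    double left  = dc_sum(a, lo, mid);   /* could run as one concurrent task */
    double right = dc_sum(a, mid, hi);   /* ... and this as another          */
    return left + right;
}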
5) Functional & Logic Models
Two language-oriented programming models for
achieving parallel processing are:
A model using functional programming languages
  Ex: LISP, SISAL & Strand 88.
A model based on logic programming languages
  Ex: Concurrent Prolog, Parlog
Functional Programming Model
Here the emphasis is on the functionality of a program, which
has no side effects after execution.
The concepts of storage, assignment and branching are avoided.
The lack of side effects opens up much more opportunity for
parallelism.
The only precedence restrictions come from function evaluation,
which gives the functional programming model its dynamic character.
Single-assignment and dataflow languages let functional
programming models be applied easily to data-driven
multiprocessors.
The functional model emphasizes fine-grain MIMD parallelism.
Logic Programming Model
Based on predicate logic (an AI topic), the logic programming
model is ideal for knowledge processing dealing with large
databases.
In this model an implicit search strategy is adopted, which
in turn supports parallelism.
The processes of unification and matching can also be parallelized.
Clauses in logic programs can be transformed into
dataflow graphs.
Both functional and logic programming models have
been used in artificial intelligence applications, where
parallel processing is very much in demand.
10.2 Parallel Language and Compilers
The environment required for parallel computers is much more
demanding than that for sequential computers.
A programming environment is a collection of software tools
and system (hardware) support.
Users should not have to spend a lot of time on programming details;
the programmer should be free to focus on expressing parallelism
using high-level abstractions.
To understand this, we study the following topics:
Language features for parallelism.
Parallel language constructs.
Optimizing compilers for parallelism.
Language Features for Parallelism
Language features supporting parallel programming are
classified into six categories according to functionality:
Optimization features
Availability features
Synchronization/communication features
Control-of-parallelism features
Data-parallelism features
Process-management features
(OASCDP)
Optimization Features
These features are used for program restructuring and as compilation
directives when converting sequentially coded programs into parallel form.
Their purpose is to match the software parallelism with the hardware
parallelism of the target machine.
Automated parallelizer
Semi-automated parallelizer
Interactive restructuring support
Availability Features
These features enhance user-friendliness (making the language portable to a
large class of parallel computers) and expand the applicability of software
libraries.
Scalability
Compatibility
Portability
Synchronization/Communication Features
The following features are desirable for synchronization/communication
support:
Single-assignment languages
Shared variables (locks) for IPC
Logically shared memory
Send/receive for message passing
Remote procedure call
Dataflow languages
Barriers, mailboxes, semaphores, monitors
Control of Parallelism
The following features appear in control constructs for specifying
parallelism:
Coarse, medium or fine grain
Explicit vs. implicit parallelism
Global parallelism in the entire program
Loop parallelism across iterations
Task-split parallelism
Shared task queue
Divide & conquer paradigm
Shared abstract data types
Task dependency specification
Data Parallelism Features
The following features specify how data are accessed and distributed on
either SIMD or MIMD computers:
Run-time automatic decomposition
Mapping specification
Virtual processor support
Direct access to shared data
SPMD (Single Program, Multiple Data) support
Process Management Features
The following features support the efficient creation of parallel processes,
the implementation of multithreading or multitasking, program partitioning
and replication, and dynamic load balancing at run time:
Dynamic process creation at run time
Lightweight processes (threads)
Replicated workers
Partitioned networks
Automatic load balancing
Parallel Language Constructs
To exploit parallelism in programs, we first look at the array notation
used in Fortran 90 and then at the parallel constructs used in parallel
programs.
Fortran 90 Array Notation:
A multidimensional data array is represented by an array name indexed by a
sequence of subscript triplets, one for each dimension.
Triplets for different dimensions are separated by commas. Examples are:
e1 : e2 : e3
e1 : e2
e1 : * : e3
e1 : *
*
where each ei is an arithmetic expression that must produce a scalar
integer value.
Contd....
In a triplet, the first expression e1 is a lower bound, the
second expression e2 an upper bound, and the third, e3, an
increment (stride).
For example, B(1:4:3 , 6:8:2 , 3) represents the four elements
B(1,6,3), B(4,6,3), B(1,8,3) & B(4,8,3).
When the third expression in a triplet is missing (e1:e2), a unit
stride is assumed.
The * notation in the second expression (e1 : * : e3) indicates
all elements in that dimension starting from e1, or the entire
dimension if e1 is also omitted.
When both e2 & e3 are omitted, e1 alone represents a
single element in that dimension; for example, A(5) represents the
fifth element in the array A(3:7:2).
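For readers more used to scalar loops, a small C sketch (illustrative, not part
of Fortran 90) enumerating the elements selected by B(1:4:3, 6:8:2, 3):

#include <stdio.h>

/* A triplet lb:ub:stride enumerates lb, lb+stride, ... while the value <= ub. */
int main(void)
{
    for (int i = 1; i <= 4; i += 3)          /* first dimension: 1, 4   */
        for (int j = 6; j <= 8; j += 2)      /* second dimension: 6, 8  */
            printf("B(%d,%d,3)\n", i, j);    /* third subscript fixed   */
    return 0;
}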
Contd...
Parallel Flow Control:
A conventional Fortran Do loop declares that all scalar instructions within
the (Do, Enddo) pair are executed sequentially.
To declare parallel activities, we use the (Doall, Endall) pair; all
iterations of a Doall loop are totally independent of each other.
When successive iterations depend on each other, we use the
(Doacross, Endacross) pair to declare parallelism with loop-carried
dependences.
Ex:
Doacross I = 2, N
  Do J = 2, N
S1:   A(I,J) = (A(I,J-1) + A(I,J+1))/2
  Enddo
Endacross
Contd...
Another program construct is the (Cobegin, Coend) pair. All
computations specified within the block may be executed in parallel.
Synchronization among the concurrent processes within the pair is implied.
Eg:
Cobegin
  P1
  P2
  P3
  ...
  Pn
Coend
Similarly, we have the (Forall, Endall), (Pardo, Parend), (Parbegin, Parend),
and Fork and Join commands to achieve parallelism.
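As a rough modern analogue (an assumption, not from the text), OpenMP in C
can express the Doall and Cobegin/Coend ideas as follows:

#include <omp.h>

void doall_example(int n, double a[], double b[])
{
    /* Doall/Endall analogue: iterations are independent, run in parallel. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

void cobegin_example(void)
{
    /* Cobegin/Coend analogue: each section may execute concurrently. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { /* P1 */ }
        #pragma omp section
        { /* P2 */ }
        #pragma omp section
        { /* Pn */ }
    }
}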
Optimizing Compilers for Parallelism
Since high-level languages are now used almost exclusively to
write programs, compilers have become a necessity in modern
computers.
The main role of a compiler is to remove the burden of
program optimization and code generation from the programmer.
A parallelizing compiler consists of the following
phases:
Flow analysis
Optimization
Code generation, as shown in the figure.
Flow Analysis
This phase reveals the program flow patterns in order to determine data and
control dependences in the source code.
Depending on the machine structure, the granularity of parallelism is
decided.
Flow analysis is therefore carried out at different execution levels
on different parallel computers.
Program Optimization
This is the transformation of user programs so as to exploit the hardware
capabilities as fully as possible.
Transformations are usually conducted at the loop level, the locality level or the
prefetching level, leading to global optimization.
Most transformations are machine independent; the aim is to maximize the
speed of code execution.
The optimization techniques include:
Vectorization using pipelined hardware
Parallelization using multiple processors
Parallel Code Generation
Code generation usually involves transformation from one
representation to another, called an intermediate form.
A code model is chosen from the intermediate form.
Generating parallel code is even more demanding because parallel
constructs must be included.
Code generation is closely linked with the instruction scheduling policies
used.
Basic blocks linked by control-flow commands are often optimized to
encourage a high degree of parallelism.
Special data structures are needed to represent instruction blocks.
10.3 Dependence Analysis of Data Arrays
Here we learn about dependence testing for
successive iterations over multidimensional data
arrays.
To understand this, we first look into examples
of hardware parallelism and software parallelism.
Later we examine dependence analysis in
three subsections:
Iteration space & dependence analysis
Subscript separability & partitioning
Categorized dependence tests
Iteration Spaces & Dependence Analysis
Precise and efficient dependence tests are
essential to the effectiveness of parallelizing
compilers.
The process of computing all the data
dependences in a program is called dependence
analysis.
Dependence testing
Iteration space
Dependence equations
Distance and direction vectors
Subscript Separability and Partitioning
Dependence testing has two goals:
It tries to disprove dependence between pairs of
subscripted references to the same array variable.
If a dependence exists, it tries to characterize it in some
manner, usually as a minimal complete set of
distance and direction vectors.
To understand this we look into:
Subscript categories
Subscript separability
Subscript partitioning
Categorized Dependence Tests
The goal (end result) of dependence testing is to
construct the complete set of distance (D) and
direction (d) vectors representing potential
dependences between an arbitrary pair of subscripted
references to the same array variable.
Here we look into:
Testing algorithm
Test categories:
ZIV test
SIV test
Weak-zero SIV test
Weak-crossing SIV test
MIV tests
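As a small worked illustration (the loop below is hypothetical, not from the
text), consider an SIV subscript pair and the distance/direction vector it
produces:

/* Hypothetical loop to illustrate a distance/direction vector. */
void siv_example(int n, double a[], double b[], double c[])
{
    for (int i = 1; i + 2 < n; i++) {
        a[i + 2] = b[i] + 1.0;  /* S1 writes a[i+2] */
        c[i]     = a[i] * 2.0;  /* S2 reads  a[i]   */
    }
    /* For i >= 3, S2 reads the value S1 wrote two iterations earlier, so the
       dependence distance is 2 and the direction is '<' (a flow/true
       dependence carried by the loop). A (strong) SIV test handles this pair
       because each subscript contains the single loop index i; a ZIV test
       would apply only if neither subscript contained a loop index at all.  */
}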
Module 5: Software and Parallel
Programming
In this module, we will look into the concepts of
Parallel Computers in terms of software and
programming requirements.
Following Chapters are included in this module:
Parallel Models, Languages and Compilers
Parallel Program Development and Environments
Instruction Level Parallelism
Instruction Level Parallelism
Module 5 : Chapter 12
Topics
• Computer Architecture
• Basic Design Issues
Computer Architecture and Computer
Organization
• Architecture describes what the computer does.
• Organization describes how it does it.
Example:
• Say you are constructing a house: the overall design and
plan correspond to computer architecture, while building it
brick by brick and connecting everything together, keeping
the basic design in mind, corresponds to computer
organization.
12.1 Computer Architecture
• Computer architecture defined as the
arrangement by which the various system
building blocks – processors, functional units,
main memory, cache, data paths and so on –
are interconnected and inter-operated to
achieve desired system performance.
Processor Design and Performance
• System performance is the key benchmark in
the study of computer architecture.
– Criteria: Scalability, price, usability, reliability, power
consumption, physical size
• A basic rule of system design is that
– there should be no performance bottlenecks in the system.
• Typically, a performance bottleneck arises when
one part of the system cannot keep up with the
overall throughput requirements of the system.
The system shown in the figure exhibits a performance mismatch between the
processors, main memory and the processor-memory bus.
Summary
• Processor design is the central element of computer
system design.
• Since system design can only be carried out with specific
target application loads in mind, it follows that processor
design should also be tailored for target application loads.
• To satisfy overall system performance criteria, various
elements of the system must be balanced in terms of
their performance.
– No element of the system should become a performance
bottleneck.
12.2 Basic Design Issues
• designs which aim to maximize the
exploitation of instruction level parallelism
need deeper pipelines;
– such designs may support higher clock rates.
• Instruction-level parallelism (ILP) is a measure
of how many of the instructions in a
computer program can be executed
simultaneously.
• Let us examine the trade-off involved in this context in a simplified way:
total chip area = number of cores × chip area per core
or
total transistor count = number of cores × transistor count per core
• At a given time, VLSI technology limits the left hand side in the above
equations, while the designer must select the two factors on the right.
Aggressive exploitation of instruction level parallelism, with multiple
functional units and more complex control logic, increases the chip area
—and transistor count—per processor core. Alternatively, for a different
category of target applications, the designer may select simpler cores,
and thereby place a larger number of them on a single chip.
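• For illustration only, with hypothetical numbers: given a budget of 2 billion
transistors, complex cores of 500 million transistors each allow about 4 cores
per chip, whereas simpler cores of 100 million transistors each allow about 20
cores on the same chip.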
• Within a processor, a set of instructions are in
various stages of execution at a given time—
within the pipeline stages, functional units,
operation buffers, reservation stations, and so on.
• Therefore machine instructions are not in general
executed in the order in which they are stored in
memory, and all instructions under execution
must be seen as ‘work in progress’.
• To maintain the work flow of instructions within
the processor, a superscalar processor makes
use of branch prediction—i.e. the result of a
conditional branch instruction is predicted even
before the instruction executes—so that
instructions from the predicted branch can
continue to be processed, without causing
pipeline stalls. The strategy works provided fairly
good branch prediction accuracy is maintained.
• we shall assume that instructions are committed in order.
Here committing an instruction means that the instruction
is no longer ‘under execution’—the processor state and
program state reflect the completion of all operations
specified in the instruction.
• Thus we assume that, at any time, the set of committed
instructions correspond with the program order of
instructions and the conditional branches actually taken.
Any hardware exceptions generated within the processor
must reflect the processor and program state resulting
from instructions which have already committed.
Summary
One of the main processor design trade-offs faced in this
context is this:
• Should the processor be designed to squeeze the
maximum possible parallelism from a single thread, or
should processor hardware support multiple independent
threads, with less aggressive exploitation of instruction
level parallelism within each thread?
• In this chapter, we will study the various standard
techniques for exploiting instruction level parallelism, and
also discuss some of the related design issues and trade-
offs.
Topics…
• Problem Definition
Data Dependences
Control Dependences
Resource Dependences
• Model of a Typical Processor
12.3 Problem Definition
• The execution of machine instructions from a single
sequential stream.
– The instructions are stored in main memory in program
order, from where they must be fetched into the
processor, decoded, executed, and then committed in
program order.
– We assume that the processor has a load-store type of
instruction set, which means that all arithmetic and
logical operations are carried out on operands which are
present in programmable registers.
opcode operand-1 operand-2 result
• Data transfer instructions have only two
operands—source and destination registers;
load and store instructions to/from main
memory specify one operand in the form of a
memory address, using an available
addressing mode.
• Effective address for load and store is
calculated at the time of instruction execution.
• Conditional branch instructions need to be treated as a special
category, since each such branch presents two possible
continuations of the instruction stream.
• Branch decision is made only when the instruction executes; at
that time, if instructions from the branch-not-taken are in the
pipeline, they must be flushed. But pipeline flushes are costly in
terms of lost processor clock cycles.
• The payoff of branch prediction lies in the fact that correctly
predicted branches allow the detection of parallelism to stretch
across two or more basic blocks of the program, without pipeline
stalls. It is for this reason that branch prediction becomes an
essential technique in exploiting instruction level parallelism.
• Limits to detecting and exploiting instruction level parallelism
are imposed by dependences between instructions.
– If N instructions are completely independent of each other, they can
be executed in parallel on N functional units—if N functional units
are available—and they may even be executed in arbitrary order.
• But in fact dependences amongst instructions are a central
and essential part of program logic. A dependence specifies
that instruction Ik must wait for instruction Ij to complete.
– Within the instruction pipeline, such a dependence may create a
hazard or stall—i.e. lost processor clock cycles while Ik waits for Ij to
complete.
Data Dependences
• Assume that instruction Ik follows instruction
Ij in the program. Data dependence between Ij
and Ik means that both access a common
operand.
• let us assume that the common operand of Ij and
Ik is in a programmable register. Since each
instruction either reads or writes an operand
value, accesses by Ij and Ik to the common
register can occur in one of four possible ways:
1. Read by Ik after read by Ij (RAR)
2. Read by Ik after write by Ij (RAW)
3. Write by Ik after read by Ij (WAR)
4. Write by Ik after write by Ij (WAW)
• Data hazards occur when instructions that exhibit data dependence modify data in
different stages of a pipeline. Hazards cause delays in the pipeline. There are mainly
three types of data hazards:
1) RAW (Read after Write) [flow/true data dependence]
2) WAR (Write after Read) [anti-dependence] – removable by register renaming
3) WAW (Write after Write) [output dependence] – removable by register renaming
(Figure: code sequence before and after register renaming.)
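A minimal C-level sketch of the three hazard types on a single register-like
variable r (hypothetical statements, not from the text):

/* Hypothetical straight-line code; r and x stand in for registers. */
void hazards(long a, long b, long c, long d, long e, long f)
{
    long r, x;
    r = a + b;    /* I1 writes r                                   */
    x = r * 2;    /* I2 reads r after I1's write  -> RAW (true)    */
    r = c - d;    /* I3 writes r after I2's read  -> WAR (anti)    */
    r = e + f;    /* I4 writes r after I3's write -> WAW (output)  */
    (void)x; (void)r;
    /* Renaming the later writes of r to a fresh register removes the WAR
       and WAW hazards; the RAW dependence is part of program logic.     */
}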
• Note that loop unrolling by the compiler does
not in itself involve the detection of
instruction level parallelism. But loop unrolling
makes it possible for the compiler or the
processor hardware to exploit a greater
degree of instruction level parallelism.
Example 12.5 Dependence across loop
iterations
Consider the following loop in a source program, which has a crucial new
dependence built into it:
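The loop itself is not reproduced in these notes; a representative C sketch
carrying the dependence described below (array names are illustrative) is:

/* Representative loop with a loop-carried dependence on c[]. */
void recurrence(int n, double a[], double b[], double c[])
{
    for (int i = 1; i < n; i++)
        c[i] = c[i - 1] + a[i] * b[i];  /* needs c[i-1] from the previous iteration */
}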
Now the value calculated in the ith iteration of the loop makes use of the
value c[i – 1] calculated in the previous iteration.
This does not mean that the modified loop cannot be unrolled, but only
that extra care should be taken to account for the dependence.
Even with static instruction scheduling by the compiler, the processor
designer may still need to provide for dynamic scheduling in hardware.
• Dependences amongst references to simple variables, or amongst
array elements whose index values are known at compile time (as in
the two examples seen above), can be analyzed relatively easily at
compile time.
• But when pointers are used to refer to locations in memory, or when
array index values are known only at run-time, then clearly
dependence analysis is not possible at compile time.
• Therefore processor hardware must provide run-time support for
alias analysis.
• Cache misses, I/O interrupts, and hardware exceptions cannot be
predicted at compile time. Therefore, apart from alias analysis,
compiler detected instruction level parallelism also requires dynamic
scheduling support within the processor.
A further step in the direction of compiler detected instruction level
parallelism and static scheduling can be the following:
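The instruction sequence referred to below is not reproduced in these notes;
a representative three-instruction sketch, written as C assignments over
register-like variables (names are illustrative), is:

/* Representative sequence: SUB sits between ADD and SHIFTR in program order. */
void sequence(long r1, long r2, long r4, long r5)
{
    long r3, r6, r7;
    r3 = r1 + r2;     /* ADD    : produces r3                        */
    r6 = r4 - r5;     /* SUB    : independent of ADD and SHIFTR      */
    r7 = r3 >> 2;     /* SHIFTR : consumes r3, the result of ADD     */
    (void)r6; (void)r7;
}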
If SUB is executed in program order, then even without operand forwarding between
ADD and SHIFTR, no processor clock cycle is lost, since SHIFTR does not directly
follow ADD. But now suppose SUB is executed either before ADD, or after SHIFTR. In
both these cases, SHIFTR directly follows ADD, and therefore operand forwarding
proves useful in saving a processor cycle, as we have seen above.
• But why should SUB be executed in any order other than
program order?
• The answer can only be this: to achieve better utilization of
processor clock cycles.
• Therefore the processor must make on the fly decisions
such as
(i) transferring ALU output in parallel to both R3 and ALU input,
and/or
(ii) out of order execution of the instruction SUB.
• This is what is needed to implement dynamic scheduling of
machine instructions within the processor.
12.7 Reorder Buffer
• The Reorder Buffer (ROB) tracks the state of all
in-flight instructions in the pipeline.
• The role of the ROB is to provide the illusion to the
programmer that the program executes in order.
• Since instructions execute in parallel on multiple
functional units, the reorder buffer serves the
function of bringing completed instructions back
into an order which is consistent with program
order.
Note that instructions may complete in an order which is not related to
program order, but must be committed in program order.
ROB Data Structure
1. instruction identifier
2. value computed
3. program-specified destination of the value
computed
4. flag indicating whether the instruction has
completed (i.e. the computed value is
available).
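A minimal C sketch of one ROB entry holding these four fields (field names and
types are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

/* One reorder-buffer entry (illustrative field names/types). */
struct rob_entry {
    uint32_t instr_id;   /* 1. instruction identifier                   */
    uint64_t value;      /* 2. value computed                           */
    uint8_t  dest_reg;   /* 3. program-specified destination register   */
    bool     completed;  /* 4. set when the computed value is available */
};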
• Assume that I1 has not even started executing when I2 and I3 are issued.
• When I2 and I3 are issued, they are parked in the reservation stations of the
appropriate functional units.
• Since the required result value from I1 is not available, these reservation
station entries of I2 and I3 get source tag corresponding to the output of I1.
• When the result of I1 becomes available at its functional unit, it is sent over
the common data bus along with the tag value of its source.
• At this point, programmable register R4 as well as the reservation stations
assigned to I2 and I3 have the matching source tag—since they are waiting
for the same result value, which is being computed by I1.
• Thus, through the use of source tags and the common data bus, in one clock
cycle, three destination registers receive the value produced by I1—
programmable register R4, and the operand registers in the reservation
stations assigned to I2 and I3.
Example: Combination of RAW and WAR
dependence
• Assume that instruction I1 is to write its result into R4, a subsequent
instructions I2 is to read that result value, and a latter subsequent
instruction I3 is then to write its result into R4.
• Thus instruction I2 is truly data dependent (RAW dependent) on
instruction I1, but I3 is anti-dependent (WAR dependent) on I2.
• When I2 is issued, it is parked in the reservation station
of the appropriate functional unit.
• Since the required result value from I1 is not available,
the reservation station entry of I2 also gets the source
tag corresponding to the output of I1
• Can I3 be issued even before I1 completes and I2 starts
execution?
• The answer is that, with register renaming I3 can be
issued even before I2 starts execution.
• I1 and I2 dependence resolved as previous example.
• But suppose I3 is issued even before the output of I1 is available;
register R4 is then re-tagged so that it will receive the output of I3
rather than that of I1.
• When the output of I1 (finally) becomes available,
it goes to the input of I2, but not to register R4,
since this register's source tag now refers to I3.
When the output of I3 becomes available, it goes
correctly to R4 because of the matching source tag.
Example: Calculation of processor clock
cycles
1 LOAD mem-a, R4
2 FSUB R7, R4, R4
3 STORE mem-a, R4
4 FADD R4, R3, R7
5 STORE mem-b, R7
• We shall assume that (a) one instruction is issued per
clock cycle, (b) floating point operations take two clock
cycles each to execute, and (c) memory operations take
one clock cycle each when there is L1 cache hit.
• We get the total as 1+2+1+2+1 = 7.
Solution
• The RAW dependences on registers R4 and R7
will cost three additional clock cycles
– i.e. between (i) instructions 1 and 2, (ii) instructions 2
and 3, and (iii) instructions 4 and 5.
• We get the total as 1+2+1+2+1+3 = 10.
• With operand forwarding—which is built into
Tomasulo’s algorithm—one clock cycle is saved
on account of each RAW dependence
• We get the total as 1+2+1+2+1 = 7.
Topics…
• Branch Prediction
• Limitations in Exploiting Instruction Level
Parallelism
• Thread Level Parallelism
12.10 Branch Prediction
Example: Predicting the outcome of a tossed
coin
• Can one predict the result of a single coin toss?
• Prior knowledge: assume that a coin comes up
heads in its first two tosses. Simple conditional
probability theory then predicts that the third toss of the
coin has a higher probability of coming up heads
than of coming up tails.
• Like tossed coins, outcomes of conditional
branches in computer programs also have yes and
no answers— i.e. a branch is either taken or not
taken.
two-bit predictor
• Two-bit counter is maintained for every conditional
branch instruction in the program.
• The two-bit counter has four possible states;
– When the counter state is 0 or 1, the respective branch is
predicted as taken;
– when the counter state is 2 or 3, the branch is predicted
as not taken.
• When two successive predictions come out wrong,
the prediction is changed from branch taken to
branch not taken, and vice versa.
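A small C sketch of such a two-bit saturating counter, following the state
convention above (function names are illustrative):

#include <stdbool.h>

/* Two-bit counter per branch: states 0,1 -> predict taken; 2,3 -> predict not taken. */
static bool predict_taken(unsigned char state)
{
    return state <= 1;
}

/* Update after the actual outcome is known: move one step toward the observed
   behaviour, saturating at 0 and 3, so a single misprediction does not flip
   the prediction but two successive mispredictions do.                      */
static unsigned char update(unsigned char state, bool taken)
{
    if (taken)
        return (state > 0) ? state - 1 : 0;
    else
        return (state < 3) ? state + 1 : 3;
}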
correlated predictors
• different branches may be correlated
– outcome of branch depends on outcome of other branches
– makes intuitive sense (programs are written this way)
• Example
If(d==0)
d=1;
If(d==1)
{……}
B2 is correlated with B1; if B1 is not taken, B2 will not be
taken.
• Can branch prediction be carried out even before the
instruction is decoded
– i.e. at the instruction fetch stage?
– Yes, if a so-called branch target buffer is provided which
has a history of recently executed conditional branches.
• Branch prediction based on the earlier history of the
same branch is known as local prediction,
• while prediction based on the history of other
branches in the program is known as global
prediction.
tournament predictor (Hybrid)
• A tournament predictor uses (i) a global
predictor, (ii) a local predictor, and (iii) a selector
which selects one of the two predictors for
prediction at a given branch instruction.
– The selector uses a two bit counter per conditional
branch to choose between the global and local
predictors for the branch.
– Two successive mis-predictions cause a switch from
the local predictor to the global predictor, and vice
versa
jump prediction
• Jump prediction is applicable to indirect jumps
– computed goto statements (used in FORTRAN)
– switch statements (used in C and C++)
• The address associated with a procedure
return is obtained from the run-time procedure
stack in main memory.
Speculative Execution
• Instructions executed on the basis of a predicted branch,
before the actual branch result is known, are said to
involve speculative execution.
• If a branch prediction turns out to be correct, the
corresponding speculatively executed instructions must
be committed.
• If the prediction turns out to be wrong, the effects of
corresponding speculative operations carried out within
the processor must be cleaned up, and instructions from
another branch of the program must instead be executed.
Multiple Issue
– the processor tries to issue more than one instruction per cycle
(superscalar architectures)
• The instruction window (or simply window) is a special buffer provided
downstream of the fetch unit in the processor, holding fetched
instructions until they are issued.
Conclusion
• For the targeted processor performance, the processor
designers must integrate and balance the hardware
techniques of branch prediction, dynamic scheduling,
speculative execution, internal data paths, functional units,
and an instruction window of appropriate size.
12.11 Limitations in Exploiting Instruction Level
Parallelism
• We have learned many CPU design techniques to
optimize performance
– pipelining, superscalar execution, speculative execution, out-of-
order execution
• The common goal is to execute instructions in
parallel (as in pipelining), as many instructions as possible
• Instruction-level parallelism (ILP) is a measure
of how many of the instructions in a computer
program can be executed simultaneously
The Limitations of ILP
• Applications (algorithms)
Different applications (algorithms) have different numbers of instructions
that can run simultaneously.
• Compiler sophistication
Good compilers can generate and/or schedule instructions that run in
parallel
– E.g., VLIW compilers (to some degree)
• Hardware sophistication
Complex hardware usually can find more instructions to run
E.g., superscalar, speculation
E.g., SIMD instructions
• In this lecture, we focus on the hardware limitations and the
application limitations.
Hardware Limitations of ILP
• The number of registers for renaming
Essentially the size of the ROB
The more ROB entries the more instructions can be examined for parallel
execution
• Branch (outcome and branch target) prediction accuracy
Better branch prediction => fewer stalls and pipeline flushes
• Memory address alias analysis (disambiguation)
Better memory aliasing analysis => more accurate dependencies
detection => fewer pipeline flushes from incorrect speculation,
fewer instructions stalled due to falsely/conservatively assumed
dependencies
• Memory/cache latency
Faster memory/cache => fewer stalls due to slow memory accesses
The range of techniques which was explored in the study to detect and exploit
instruction level parallelism is summarized briefly below:
• Register renaming—with (a) infinite number of registers as renaming targets, (b)
finite number of registers, and (c) no register renaming.
• Alias analysis—with (a) perfect alias analysis, (b) two intermediate levels of alias
analysis, and (c) no alias analysis.
• Branch prediction—with (a) perfect branch prediction, (b) three hardware-based
branch prediction schemes, (c) three profile-based branch prediction schemes, and
(d) no branch prediction.
• Indirect jump prediction—with (a) perfect prediction, (b) intermediate level of
prediction, and (c) no indirect jump prediction.
• Window size—for some of the processor models, different window sizes from an
upper limit of 2048 instructions down to 4 instructions were used.
• Cycle width, i.e. the maximum number of instructions
which can be issued in one cycle—(a) 64,(b) 128, and (c)
bounded only by window size. Note that, from a practical
point of view, cycle widths of both 64 and 128 are on the
high side.
• Latencies of processor operations—five different latency
models were used, specifying latencies (in number of clock
cycles) for various processor operations.
• Loop unrolling—was carried out in some of the programs.
• Misprediction penalty—values of 0 to 10 clock cycles were
used
12.12 Thread Level Parallelism
• There can be much higher natural parallelism in some applications (e.g.,
Database or Scientific codes)
• Explicit Thread Level Parallelism or Data Level Parallelism
• Thread Level Parallelism (TLP): Execute the instructions from multiple
threads at the same time.
– Threads may be from one process, or they may be from
independent processes.
– Each thread has all the state (instructions, data, PC,
register state, and so on) necessary to allow it to execute
• Data Level Parallelism (DLP): Perform identical operations on data, and lots
of data
– i.e., SIMD.
• ILP exploits implicit parallel operations within
a loop or straight-line code segment
• TLP explicitly represented by the use of
multiple threads of execution that are
inherently parallel
• Goal: Use multiple instruction streams to
improve
– Throughput of computers that run many programs
– Execution time of multi-threaded programs
• TLP could be more cost-effective to exploit
Multi-Threaded Execution in One CPU
• Multithreading (MT): multiple threads to share the
functional units of one processor via overlapping
– Processor must duplicate resources to track the states
of each thread, e.g., separate copies of register file,
separate PCs, and for running independent programs,
separate page tables
– Memory shared through the virtual memory
mechanisms, which already support multiple processes
– HW for fast thread switching; much faster than a full process
switch, which takes ≈100s to 1000s of clocks
• Depending on the specific strategy adopted
for switching between threads, hardware
support for multi-threading may be classified
as one of the following:
1. Coarse-grain multi-threading:
2. Fine-grain multi-threading:
3. Simultaneous multi-threading:
Fine-Grained Multi-threading
• Switches between threads on each instruction, causing the
execution of multiple threads to be interleaved
– programs are broken into a large number of small tasks.
• Usually done in a round-robin fashion, skipping any stalled threads
• CPU must be able to switch threads every clock
• Advantage is it can hide both short and long stalls, since
instructions from other threads executed when one thread stalls
• Disadvantage is it slows down execution of individual threads,
since a thread ready to execute without stalls will be delayed by
instructions from other threads
• Used on Sun’s Niagara.
Coarse-Grained Multi-threading
• Switches threads only on costly stalls, such as L2 cache misses
– programs are broken into a small number of large tasks.
• Used in IBM AS/400
• Advantages
Relieves the need to have very fast thread-switching
Does not slow down one thread, since instructions from other threads
issued only when the thread encounters a costly stall
• Disadvantage is hard to overcome throughput losses from shorter
stalls, due to pipeline start-up costs
Since CPU issues instructions from 1 thread, when a stall occurs, the
pipeline must be emptied or frozen
New thread must fill pipeline before instructions can complete
Simultaneous Multi-Threading
• Simultaneous Multi-Threading (SMT) is a variant of fine-grained MT.
• SMT is based on the observation that dynamic scheduling already
has the HW mechanisms to support
Large set of ROB registers that can be used to hold the register sets of
independent threads
Register renaming provides unique register identifiers, so instructions from
multiple threads can be mixed in datapath without confusing sources and
destinations across threads
Out-of-order completion allows the threads to execute out of order, and get
better utilization of the HW
• Just add a per thread renaming table and keep separate PCs
Independent commits can be supported by logically keeping a separate
reorder buffer for each thread