0% found this document useful (0 votes)

202 views21 pages

ILP Techniques: Trace Scheduling & Limits

1) The document summarizes techniques for increasing instruction level parallelism (ILP) such as superscalar, VLIW, loop unrolling, software pipelining, trace scheduling, conditional execution, speculation, and their limits. 2) It discusses how realistic hardware models with factors like branch prediction, register renaming, memory aliasing impact the maximum achievable ILP. 3) It compares the performance of an 8-way superscalar Power-2 processor to a 2-way superscalar Alpha processor, showing tradeoffs between ILP and clock frequency.

Uploaded by

Balasubramanian Jayaraman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

202 views21 pages

ILP Techniques: Trace Scheduling & Limits

Uploaded by

Balasubramanian Jayaraman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

Lecture 13: Trace Scheduling, Conditional Execution, Speculation, Limits of ILP

Professor Randy H. Katz Computer Science 252 Spring 1996

RHK.S96 1

Review: Getting CPI < 1 Multiple Instructions/Cycle

Two variations:
Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo) IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100 Very Long Instruction Words (VLIW): fixed number of instructions (16) scheduled by the compiler Joint HP/Intel agreement in 1998?

RHK.S96 2

Loop Unrolling in SuperScalar

Loop: Integer instruction LD F0,0(R1) LD F6,-8(R1) LD F10,-16(R1) LD F14,-24(R1) LD F18,-32(R1) SD 0(R1),F4 SD -8(R1),F8 SD -16(R1),F12 SD -24(R1),F16 SUBI R1,R1,#40 BNEZ R1,LOOP SD -32(R1),F20 FP instruction Clock cycle 1 2 3 4 5 6 7 8 9 10 11 12

ADDD F4,F0,F2 ADDD F8,F6,F2 ADDD F12,F10,F2 ADDD F16,F14,F2 ADDD F20,F18,F2

Unrolled 5 times to avoid delays (+1 due to SS) 12 clocks, or 2.4 clocks per iteration

RHK.S96 3

Loop Unrolling in VLIW

Memory reference 1 Memory reference 2 FP operation 1 FP op. 2 Int. op/ branch Clock
1 2 3 4 5 6 7 8 9 LD F0,0(R1) LD F6,-8(R1) LD F10,-16(R1) LD F14,-24(R1) LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 ADDD F20,F18,F2 ADDD F24,F22,F2 SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 SD -16(R1),F12 SD -24(R1),F16 SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 SD -0(R1),F28 BNEZ R1,LOOP

Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration Need more registers in VLIW

RHK.S96 4

Limits to Multi-Issue Machines

Inherent limitations of ILP
1 branch in 5: How to keep a 5-way VLIW busy? Latencies of units: many operations must be scheduled Need about Pipeline Depth x No. Functional Units of independent operations to keep machines busy

Difficulties in building HW
Duplicate FUs to get parallel execution Increase ports to Register File VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg Increase ports to memory Decoding SS and impact on clock rate, pipeline depth

RHK.S96 5

Limits to Multi-Issue Machines

Limitations specific to either SS or VLIW implementation
Decode issue in SS VLIW code size: unroll loops + wasted fields in VLIW VLIW lock step => 1 hazard & all instructions stall VLIW & binary compatibility is practical weakness

RHK.S96 6

Software Pipelining Example

Before: Unrolled 3 times 1 LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 10 SUBI R1,R1,#24 11 BNEZ R1,LOOP

After: Software Pipelined

1 2 3 4 5 SD ADDD LD SUBI BNEZ 0(R1),F4 ; Stores M[i] F4,F0,F2 ; Adds to M[i-1] F0,-16(R1); Loads M[i-2] R1,R1,#8 R1,LOOP

Symbolic Loop Unrolling

Less code space Fill & drain pipe only once vs. each iteration in loop unrolling
RHK.S96 7

Review: Summary
Branch Prediction
Branch History Table: 2 bits for loop accuracy Correlation: Recently executed branches correlated with next branch Branch Target Buffer: include branch address & prediction

SuperScalar and VLIW

CPI < 1 Dynamic issue vs. Static issue More instructions issue at same time, larger the penalty of hazards

SW Pipelining
Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead
RHK.S96 8

Trace Scheduling
Parallelism across IF branches vs. LOOP branches Two steps:
Trace Selection Find likely sequence of basic blocks (trace) of (statically predicted) long sequence of straight-line code Trace Compaction Squeeze trace into few VLIW instructions Need bookkeeping code in case prediction is wrong

RHK.S96 9

HW support for More ILP

Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP
If false, then neither store result or cause exception Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.

Drawbacks to conditional instructions

Still takes a clock even if annulled Stall if condition evaluated late Complex conditions reduce effectiveness; condition becomes known late in pipeline

RHK.S96 10

HW support for More ILP

Speculation: allow an instruction to issue that is dependent on branch predicted to be taken without any consequences (including exceptions) if branch is not actually taken (HW undo) Often try to combine with dynamic scheduling Tomasulo: separate speculative bypassing of results from real bypassing of results
When instruction no longer speculative, write results (instruction commit) execute out-of-order but commit in order

RHK.S96 11

HW support for More ILP

Need HW buffer for results of uncommitted instructions: reorder buffer
Reorder Buffer

Reorder buffer can be operand source FP Regs Once operand commits, result is found in register 3 fields: instr. type, destination, value Use reorder buffer number instead Res Stations Res Stations of reservation station FP Adder FP Adder Instructionscommit in order As a result, its easy to undo speculated instructions on Figure 4.34, page 311 mispredicted branches or on exceptions
RHK.S96 12

FP Op Queue

Four Steps of Speculative Tomasulo Algorithm

1. Issueget instruction from FP Op Queue
If reservation station or reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination.

2. Executionoperate on operands (EX)

When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute

3. Write resultfinish execution (WB)

Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4. Commitupdate register with reorder result

When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer.
RHK.S96 13

Limits to ILP
Conflicting studies of amount of parallelism available in late 1980s and early 1990s. Different assumptions about:
Benchmarks (vectorized Fortran FP vs. integer C programs) Hardware sophistication Compiler sophistication

RHK.S96 14

Limits to ILP
Initial HW Model here; MIPS compilers 1. Register renaminginfinite virtual registers and all WAW & WAR hazards are avoided 2. Branch predictionperfect; no mispredictions 3. Jump predictionall jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysisaddresses are known & a store can be moved before a load provided addresses not equal 1 cycle latency for all instructions
RHK.S96 15

Upper Limit to ILP

(Figure 4.38, page 319)
160 150.1 140 120 100 80 60 40 20 0 gcc espresso li fpppp doducd tomcatv
RHK.S96 16

Instruction Issues per cycle

118.7

75.2 54.8 62.6

17.9

Programs

More Realistic HW: Branch Impact

Figure 4.40, Page 323
60

50 Instruction issues per cycle

Change from Infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle
41 35

61 58

48 46 45 46 45 45

20
12

19 16 10 6 6 2 7 6 2 6 7 4 2 15 13 14

0 gcc espresso li Program Perfect Selective predictor Standard 2-bit Static None fpppp doducd tomcatv

RHK.S96 17

Perfect

Pick Cor. or BHT BHT (512)

Profile

More Realistic HW: Register Impact

Figure 4.42, Page 325
60
59

50 Instruction issues per cycle

Change 2000 instr window, 64 instr issue, 8K 2 level Prediction

54 49 45 44

20
15 15 13 11 10 10 9 5 4 10 5 4 12 12 12 11 6 5 5 4

15 11 5 7 5 5

0 gcc espresso li Program Infinite 256 128 64 32 None fpppp doducd tomcatv

Infinite

256

128

None

RHK.S96 18

More Realistic HW: Alias Impact

Figure 4.44, Page 328
50 45 40 Instruction issues per cycle 35 30 25 20 15
10 15 12 9 7 4 3 7 5 5 4 3 4 3 6 4 5 4 49 49

Change 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers
16 16

10 5 0

gcc

espresso

li Program

fpppp

doducd

tomcatv

Perfect

Global/stack Perfect

Inspection

None

Perfect

Global/Stack perf; Inspec. heap conflicts Assem.

None

RHK.S96 19

Realistic HW for 9X: Window Impact

(Figure 4.48, Page 332)
60
52 56

50 Instruction issues per cycle

Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window
15 15 10 10 10 9 13 10 8 6 4 3 8 6 4 2 12 12 11 11 9 6 4 3

47 45

22 17 16 14 8 5 3

15 12 9 7 4 3

14 9 6 3

0 gcc expresso li Program Infinite 256 128 64 32 16 8 4 fpppp doducd tomcatv

Infinite 256 128

RHK.S96 20

Braniac vs. Speed Demon

8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)
900 800 700 600

SPECMarks

500 400 300 200 100 0

espresso

eqntott

doduc

wave5

compress

tomcatv

hydro2d

nasa

gcc

swm256

mdljdp2

mdljsp2

su2cor

alvinn

Benchmark

RHK.S96 21

fpppp

spice

ora

ear

Expliting More ILP
No ratings yet
Expliting More ILP
20 pages
Lec9 Multiple Issue Processors
No ratings yet
Lec9 Multiple Issue Processors
33 pages
CompArch 17e ILP-1
No ratings yet
CompArch 17e ILP-1
15 pages
Software Pipelining Patterson 1996
No ratings yet
Software Pipelining Patterson 1996
60 pages
03ILP Speculation and Advanced Topics
No ratings yet
03ILP Speculation and Advanced Topics
48 pages
Computer Architecture 09-Superscalar
No ratings yet
Computer Architecture 09-Superscalar
83 pages
Lecture #2
No ratings yet
Lecture #2
11 pages
VLIW Processors Explained
No ratings yet
VLIW Processors Explained
17 pages
Αρχιτεκτονική Υπολογιστών: Παράλληλος Έλεγχος
No ratings yet
Αρχιτεκτονική Υπολογιστών: Παράλληλος Έλεγχος
34 pages
Superpipelining in Computer Architecture
No ratings yet
Superpipelining in Computer Architecture
7 pages
Speculative Execution in Dynamic ILP
No ratings yet
Speculative Execution in Dynamic ILP
21 pages
Vliw/Epic:: Statically Scheduled ILP
No ratings yet
Vliw/Epic:: Statically Scheduled ILP
34 pages
Chapter 2 ILP
No ratings yet
Chapter 2 ILP
89 pages
Cs2354 Advanced Computer Architecture 2 Marks
No ratings yet
Cs2354 Advanced Computer Architecture 2 Marks
10 pages
Lec 15
No ratings yet
Lec 15
15 pages
Advanced ILP Techniques for Developers
No ratings yet
Advanced ILP Techniques for Developers
104 pages
Computer Architecture Revision For Final Exam
No ratings yet
Computer Architecture Revision For Final Exam
60 pages
CAQA5e ch3
No ratings yet
CAQA5e ch3
45 pages
EE (CE) 6304 Computer Architecture Lecture #2 (8/28/13)
No ratings yet
EE (CE) 6304 Computer Architecture Lecture #2 (8/28/13)
35 pages
EE457Unit9a OoO
No ratings yet
EE457Unit9a OoO
77 pages
ILP vs. TLP: Performance Limits Explained
No ratings yet
ILP vs. TLP: Performance Limits Explained
51 pages
Instruction-Level Parallelism Techniques
0% (1)
Instruction-Level Parallelism Techniques
40 pages
Advanced Computer Architecture: BY Dr. Radwa M. Tawfeek
No ratings yet
Advanced Computer Architecture: BY Dr. Radwa M. Tawfeek
36 pages
VLIW vs. Superscalar Processors
No ratings yet
VLIW vs. Superscalar Processors
35 pages
Design of 32bit MIPS Processor
No ratings yet
Design of 32bit MIPS Processor
23 pages
05 Wideissue
No ratings yet
05 Wideissue
77 pages
Parallelism Via Instructions: Instruction-Level Parallelism (ILP)
No ratings yet
Parallelism Via Instructions: Instruction-Level Parallelism (ILP)
21 pages
Onur Digitaldesign - Comparch 2021 Lecture13 Pipelining Afterlecture
No ratings yet
Onur Digitaldesign - Comparch 2021 Lecture13 Pipelining Afterlecture
138 pages
CMP3010L05-Hazard Continue ILP
No ratings yet
CMP3010L05-Hazard Continue ILP
54 pages
VLIW Processors for CS Students
No ratings yet
VLIW Processors for CS Students
19 pages
Understanding Parallel Computing Architectures
No ratings yet
Understanding Parallel Computing Architectures
24 pages
Instruction Level Parallelism
No ratings yet
Instruction Level Parallelism
49 pages
Advantages of Tomasulo's Scheme in CPUs
No ratings yet
Advantages of Tomasulo's Scheme in CPUs
30 pages
Pipeline History
No ratings yet
Pipeline History
30 pages
Onur 447 Spring15 Lecture12 Ooo Execution Afterlecture
No ratings yet
Onur 447 Spring15 Lecture12 Ooo Execution Afterlecture
67 pages
M116C 1 M116C 1 Lec10-Pipeline-II
No ratings yet
M116C 1 M116C 1 Lec10-Pipeline-II
18 pages
Star Lion College of Engineering & Technology: Cs2354 Aca-2 Marks & 16 Marks
No ratings yet
Star Lion College of Engineering & Technology: Cs2354 Aca-2 Marks & 16 Marks
14 pages
MIPS Pipelining and Hazards Explained
No ratings yet
MIPS Pipelining and Hazards Explained
48 pages
Pipelining - Lec 3-Modified
No ratings yet
Pipelining - Lec 3-Modified
36 pages
Arch3 Pipelining Afterlecture
No ratings yet
Arch3 Pipelining Afterlecture
180 pages
Compiling For Vliws and Ilp: Profiling Region Formation Acyclic Scheduling Cyclic Scheduling
No ratings yet
Compiling For Vliws and Ilp: Profiling Region Formation Acyclic Scheduling Cyclic Scheduling
46 pages
CSE 431 Computer Architecture Fall 2005 Lecture 17: VLIW Processors
No ratings yet
CSE 431 Computer Architecture Fall 2005 Lecture 17: VLIW Processors
18 pages
Unit II
No ratings yet
Unit II
84 pages
Instruction Level Pipelining
100% (1)
Instruction Level Pipelining
113 pages
MIPS Pipeline & Dynamic Scheduling
No ratings yet
MIPS Pipeline & Dynamic Scheduling
5 pages
CS3350B Computer Architecture: Lecture 6.3: Instructional Level Parallelism: Advanced Techniques
No ratings yet
CS3350B Computer Architecture: Lecture 6.3: Instructional Level Parallelism: Advanced Techniques
24 pages
Pipe 4
No ratings yet
Pipe 4
50 pages
ILP Limitations
No ratings yet
ILP Limitations
31 pages
MIPS R4000 Pipelining Case Study
No ratings yet
MIPS R4000 Pipelining Case Study
23 pages
Module3
No ratings yet
Module3
49 pages
Unit I Instruction Level Parallelism Two Mark Questions: Dept of Cse G.SURESH. M.Tech, Asst Prof / CSE
No ratings yet
Unit I Instruction Level Parallelism Two Mark Questions: Dept of Cse G.SURESH. M.Tech, Asst Prof / CSE
12 pages
Instruction Level Parallelism Explained
No ratings yet
Instruction Level Parallelism Explained
45 pages
13) Ilp1 PDF
No ratings yet
13) Ilp1 PDF
85 pages
Speculation in Computer Architecture
No ratings yet
Speculation in Computer Architecture
22 pages
ILP-Architectures Part III
No ratings yet
ILP-Architectures Part III
49 pages
VLIW Processor Architecture Guide
No ratings yet
VLIW Processor Architecture Guide
23 pages
Instruction-Level Parallelism Overview
No ratings yet
Instruction-Level Parallelism Overview
170 pages
Error Detection / Correction: Computer Organization & Architecture
No ratings yet
Error Detection / Correction: Computer Organization & Architecture
18 pages
Clifford Gertz Thick Description Toward An Interpretive Theory of Culture
100% (1)
Clifford Gertz Thick Description Toward An Interpretive Theory of Culture
3 pages
Organizational Development Interventions
No ratings yet
Organizational Development Interventions
16 pages
LeNS Proceedings
No ratings yet
LeNS Proceedings
899 pages
Rural Communication Strategy
No ratings yet
Rural Communication Strategy
41 pages
RP-HPLC Method for Celecoxib Quantification
No ratings yet
RP-HPLC Method for Celecoxib Quantification
8 pages
Use of Coconut Shell As A Partial Replacement For Aggregate
No ratings yet
Use of Coconut Shell As A Partial Replacement For Aggregate
21 pages
Grade 5 - Math Standards Examples
No ratings yet
Grade 5 - Math Standards Examples
32 pages
LLB Semester 1 Timetable 2017-2018 Full Time
No ratings yet
LLB Semester 1 Timetable 2017-2018 Full Time
1 page
Spelling Strategies
No ratings yet
Spelling Strategies
8 pages
Table Maintenance With Create
No ratings yet
Table Maintenance With Create
4 pages
Recruitment and Selection Process Guide
No ratings yet
Recruitment and Selection Process Guide
15 pages
6 Kreiranje Baze Podataka PDF 11494
No ratings yet
6 Kreiranje Baze Podataka PDF 11494
8 pages
Ecen 248 Lab 10 Report
100% (3)
Ecen 248 Lab 10 Report
4 pages
Yamanouchi Quantum Mechanics Molecular Structures
50% (2)
Yamanouchi Quantum Mechanics Molecular Structures
276 pages
FK Show Led - 20141219test
No ratings yet
FK Show Led - 20141219test
215 pages
Field Work Practice Report: Muara Bungo University
No ratings yet
Field Work Practice Report: Muara Bungo University
10 pages
Dashboard CAPEX 2021
No ratings yet
Dashboard CAPEX 2021
62 pages
The Girl Effect - Laura Lora
No ratings yet
The Girl Effect - Laura Lora
6 pages
L'OCCITANE CSR Report FY2014 Summary
No ratings yet
L'OCCITANE CSR Report FY2014 Summary
44 pages
EAPP Notes Lesson 1 8
No ratings yet
EAPP Notes Lesson 1 8
13 pages
Understanding Global Governance Agents
No ratings yet
Understanding Global Governance Agents
25 pages
Piling Suite Tutorial Manual 2015 PDF
No ratings yet
Piling Suite Tutorial Manual 2015 PDF
74 pages
Linear Programming Course Syllabus
No ratings yet
Linear Programming Course Syllabus
3 pages
Region VI-Western Visayas Division of Negros Occidental Rizal ST., Brgy. III, Hinigaran, Negros Occidental
No ratings yet
Region VI-Western Visayas Division of Negros Occidental Rizal ST., Brgy. III, Hinigaran, Negros Occidental
1 page
Stories of Jedi and Sith Lucasfilm Press PDF Version
100% (2)
Stories of Jedi and Sith Lucasfilm Press PDF Version
137 pages
Software Project Management: Dr. R. Mall
No ratings yet
Software Project Management: Dr. R. Mall
87 pages
Coagulation & Flocculation Lecture
No ratings yet
Coagulation & Flocculation Lecture
53 pages
CSAB NEUT 2019 Admission Guidelines
No ratings yet
CSAB NEUT 2019 Admission Guidelines
118 pages
Rafiqul Islam - Resume
No ratings yet
Rafiqul Islam - Resume
2 pages

ILP Techniques: Trace Scheduling & Limits

Uploaded by

ILP Techniques: Trace Scheduling & Limits

Uploaded by

Lecture 13: Trace Scheduling, Conditional Execution, Speculation, Limits of ILP

Professor Randy H. Katz Computer Science 252 Spring 1996

Review: Getting CPI < 1 Multiple Instructions/Cycle

Loop Unrolling in SuperScalar

Loop Unrolling in VLIW

Limits to Multi-Issue Machines

Limits to Multi-Issue Machines

Software Pipelining Example

After: Software Pipelined

Symbolic Loop Unrolling

SuperScalar and VLIW

HW support for More ILP

Drawbacks to conditional instructions

HW support for More ILP

HW support for More ILP

Four Steps of Speculative Tomasulo Algorithm

2. Executionoperate on operands (EX)

3. Write resultfinish execution (WB)

4. Commitupdate register with reorder result

Upper Limit to ILP

Instruction Issues per cycle

75.2 54.8 62.6

More Realistic HW: Branch Impact

50 Instruction issues per cycle

Pick Cor. or BHT BHT (512)

More Realistic HW: Register Impact

50 Instruction issues per cycle

Change 2000 instr window, 64 instr issue, 8K 2 level Prediction

More Realistic HW: Alias Impact

Global/Stack perf; Inspec. heap conflicts Assem.

Realistic HW for 9X: Window Impact

50 Instruction issues per cycle

0 gcc expresso li Program Infinite 256 128 64 32 16 8 4 fpppp doducd tomcatv

Infinite 256 128

Braniac vs. Speed Demon

500 400 300 200 100 0

You might also like