Lecture 13: Trace Scheduling, Conditional Execution, Speculation, Limits of ILP
Professor Randy H. Katz Computer Science 252 Spring 1996
RHK.S96 1
Review: Getting CPI < 1 Multiple Instructions/Cycle
Two variations:
Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo) IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100 Very Long Instruction Words (VLIW): fixed number of instructions (16) scheduled by the compiler Joint HP/Intel agreement in 1998?
RHK.S96 2
Loop Unrolling in SuperScalar
Loop: Integer instruction LD F0,0(R1) LD F6,-8(R1) LD F10,-16(R1) LD F14,-24(R1) LD F18,-32(R1) SD 0(R1),F4 SD -8(R1),F8 SD -16(R1),F12 SD -24(R1),F16 SUBI R1,R1,#40 BNEZ R1,LOOP SD -32(R1),F20 FP instruction Clock cycle 1 2 3 4 5 6 7 8 9 10 11 12
ADDD F4,F0,F2 ADDD F8,F6,F2 ADDD F12,F10,F2 ADDD F16,F14,F2 ADDD F20,F18,F2
Unrolled 5 times to avoid delays (+1 due to SS) 12 clocks, or 2.4 clocks per iteration
RHK.S96 3
Loop Unrolling in VLIW
Memory reference 1 Memory reference 2 FP operation 1 FP op. 2 Int. op/ branch Clock
1 2 3 4 5 6 7 8 9 LD F0,0(R1) LD F6,-8(R1) LD F10,-16(R1) LD F14,-24(R1) LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 ADDD F20,F18,F2 ADDD F24,F22,F2 SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 SD -16(R1),F12 SD -24(R1),F16 SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 SD -0(R1),F28 BNEZ R1,LOOP
Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration Need more registers in VLIW
RHK.S96 4
Limits to Multi-Issue Machines
Inherent limitations of ILP
1 branch in 5: How to keep a 5-way VLIW busy? Latencies of units: many operations must be scheduled Need about Pipeline Depth x No. Functional Units of independent operations to keep machines busy
Difficulties in building HW
Duplicate FUs to get parallel execution Increase ports to Register File VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg Increase ports to memory Decoding SS and impact on clock rate, pipeline depth
RHK.S96 5
Limits to Multi-Issue Machines
Limitations specific to either SS or VLIW implementation
Decode issue in SS VLIW code size: unroll loops + wasted fields in VLIW VLIW lock step => 1 hazard & all instructions stall VLIW & binary compatibility is practical weakness
RHK.S96 6
Software Pipelining Example
Before: Unrolled 3 times 1 LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 10 SUBI R1,R1,#24 11 BNEZ R1,LOOP
After: Software Pipelined
1 2 3 4 5 SD ADDD LD SUBI BNEZ 0(R1),F4 ; Stores M[i] F4,F0,F2 ; Adds to M[i-1] F0,-16(R1); Loads M[i-2] R1,R1,#8 R1,LOOP
Symbolic Loop Unrolling
Less code space Fill & drain pipe only once vs. each iteration in loop unrolling
RHK.S96 7
Review: Summary
Branch Prediction
Branch History Table: 2 bits for loop accuracy Correlation: Recently executed branches correlated with next branch Branch Target Buffer: include branch address & prediction
SuperScalar and VLIW
CPI < 1 Dynamic issue vs. Static issue More instructions issue at same time, larger the penalty of hazards
SW Pipelining
Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead
RHK.S96 8
Trace Scheduling
Parallelism across IF branches vs. LOOP branches Two steps:
Trace Selection Find likely sequence of basic blocks (trace) of (statically predicted) long sequence of straight-line code Trace Compaction Squeeze trace into few VLIW instructions Need bookkeeping code in case prediction is wrong
RHK.S96 9
HW support for More ILP
Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP
If false, then neither store result or cause exception Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
Drawbacks to conditional instructions
Still takes a clock even if annulled Stall if condition evaluated late Complex conditions reduce effectiveness; condition becomes known late in pipeline
RHK.S96 10
HW support for More ILP
Speculation: allow an instruction to issue that is dependent on branch predicted to be taken without any consequences (including exceptions) if branch is not actually taken (HW undo) Often try to combine with dynamic scheduling Tomasulo: separate speculative bypassing of results from real bypassing of results
When instruction no longer speculative, write results (instruction commit) execute out-of-order but commit in order
RHK.S96 11
HW support for More ILP
Need HW buffer for results of uncommitted instructions: reorder buffer
Reorder Buffer
Reorder buffer can be operand source FP Regs Once operand commits, result is found in register 3 fields: instr. type, destination, value Use reorder buffer number instead Res Stations Res Stations of reservation station FP Adder FP Adder Instructionscommit in order As a result, its easy to undo speculated instructions on Figure 4.34, page 311 mispredicted branches or on exceptions
RHK.S96 12
FP Op Queue
Four Steps of Speculative Tomasulo Algorithm
1. Issueget instruction from FP Op Queue
If reservation station or reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination.
2. Executionoperate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute
3. Write resultfinish execution (WB)
Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commitupdate register with reorder result
When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer.
RHK.S96 13
Limits to ILP
Conflicting studies of amount of parallelism available in late 1980s and early 1990s. Different assumptions about:
Benchmarks (vectorized Fortran FP vs. integer C programs) Hardware sophistication Compiler sophistication
RHK.S96 14
Limits to ILP
Initial HW Model here; MIPS compilers 1. Register renaminginfinite virtual registers and all WAW & WAR hazards are avoided 2. Branch predictionperfect; no mispredictions 3. Jump predictionall jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysisaddresses are known & a store can be moved before a load provided addresses not equal 1 cycle latency for all instructions
RHK.S96 15
Upper Limit to ILP
(Figure 4.38, page 319)
160 150.1 140 120 100 80 60 40 20 0 gcc espresso li fpppp doducd tomcatv
RHK.S96 16
Instruction Issues per cycle
118.7
75.2 54.8 62.6
17.9
Programs
More Realistic HW: Branch Impact
Figure 4.40, Page 323
60
50 Instruction issues per cycle
Change from Infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle
41 35
61 58
60
48 46 45 46 45 45
40
30
29
20
12
19 16 10 6 6 2 7 6 2 6 7 4 2 15 13 14
10
0 gcc espresso li Program Perfect Selective predictor Standard 2-bit Static None fpppp doducd tomcatv
RHK.S96 17
Perfect
Pick Cor. or BHT BHT (512)
Profile
More Realistic HW: Register Impact
Figure 4.42, Page 325
60
59
50 Instruction issues per cycle
40
Change 2000 instr window, 64 instr issue, 8K 2 level Prediction
54 49 45 44
35
30
29
28
20
20
15 15 13 11 10 10 9 5 4 10 5 4 12 12 12 11 6 5 5 4
16
15 11 5 7 5 5
10
0 gcc espresso li Program Infinite 256 128 64 32 None fpppp doducd tomcatv
Infinite
256
128
64
32
None
RHK.S96 18
More Realistic HW: Alias Impact
Figure 4.44, Page 328
50 45 40 Instruction issues per cycle 35 30 25 20 15
10 15 12 9 7 4 3 7 5 5 4 3 4 3 6 4 5 4 49 49
Change 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers
16 16
45
45
10 5 0
gcc
espresso
li Program
fpppp
doducd
tomcatv
Perfect
Global/stack Perfect
Inspection
None
Perfect
Global/Stack perf; Inspec. heap conflicts Assem.
None
RHK.S96 19
Realistic HW for 9X: Window Impact
(Figure 4.48, Page 332)
60
52 56
50 Instruction issues per cycle
40
30
20
Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window
15 15 10 10 10 9 13 10 8 6 4 3 8 6 4 2 12 12 11 11 9 6 4 3
47 45
35
34
22 17 16 14 8 5 3
22
15 12 9 7 4 3
14 9 6 3
10
0 gcc expresso li Program Infinite 256 128 64 32 16 8 4 fpppp doducd tomcatv
Infinite 256 128
64
32
16
RHK.S96 20
Braniac vs. Speed Demon
8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)
900 800 700 600
SPECMarks
500 400 300 200 100 0
espresso
eqntott
doduc
wave5
compress
tomcatv
hydro2d
nasa
gcc
sc
swm256
mdljdp2
mdljsp2
su2cor
alvinn
Benchmark
RHK.S96 21
fpppp
spice
ora
ear
li