EEF011 Computer Architecture
Sections 3.2 and 3.3
Dynamic Scheduling
Tomasulos Algorithm
October 2004
A Dynamic Algorithm:
Tomasulos Algorithm
For IBM 360/91 (before caches!) 3 years after CDC
Goal: High Performance without special compilers
Small number of floating point registers (4 in 360)
prevented interesting compiler scheduling of operations
This led Tomasulo to try to figure out how to get more effective
registers renaming in hardware!
Why Study 1966 Computer?
The descendants of this have flourished!
Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604,
Example to eleminate WAR and WAW
by register renaming
Original
DIV.D
F0, F2, F4
ADD.D
F6, F0, F8
S.D
F6, 0(R1)
SUB.D
F8, F10, F14
MUL.D
F6, F10, F8
WAR between ADD.D and SUB.D, WAW between ADD.D and MUL.D
(Due to that DIV.D needs to take much longer cycles to get F0)
Register renaming
DIV.D
ADD.D
S.D
SUB.D
MUL.D
F0, F2, F4
S, F0, F8
S, 0(R1)
T, F10, F14
F6, F10, T
Tomasulo Algorithm
Register renaming provided
by reservation stations, which buffer the operands of
instructions waiting to issue
by the issue logic
Basic idea:
a reservation station fetches and buffers an operand as
soon as it is available, eliminating the need to get the
operand from a register (WAR)
pending instructions designate the reservation station that
will provide their input (RAW)
when successive writes to a register overlap in execution,
only the last one is actually used to update the register
(WAW)
As instructions are issued, the register specifiers for pending
operands are renamed to the names of the reservation
station, which provides register renaming
more reservation stations than real registers
Properties of Tomasulo Algorithm
1. Control & buffers distributed with Function Units (FU)
Hazard detection and execution control are distributed
FU buffers called reservation stations; have pending
operands
Registers in instructions replaced by values or pointers to
reservation stations(RS)
form of register renaming to avoids WAR, WAW hazards
2. Bypassing: Results passed directly to FU from RS, not
through registers, over Common Data Bus
that broadcasts results to all FUs, so allows all units waiting
for an operand to be loaded simultaneously
Load and Stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing
FP ops beyond basic block in FP queue
Figure 3.2 Basic structure of a MIPS
floating-point unit using Tomasulos algorithm
Load buffers:
1. hold components of the effected addr
2. track outstanding loads that are waiting
on the memory
3. hold the results of completed loads that
are waiting for the CDB
Store buffers:
1. hold components of the
effected addr
2. hold the destination
memory addresses of
outstanding stores that are
waiting for the data value
to store
3. hold the addr and value to
store until the memory unit
is available
Three Stages of Tomasulo Algorithm
1. Issueget instruction from the head of the instruction queue
If reservation station free (no structural hazard),
control issues instr with the operand values (renames registers).
No free RS => there is a structural hazard
If the operands are not in the registers, keep track of FU
This step renames registers, eliminating WAR and WAW hazards
2. Executeoperate on operands (EX)
When both operands ready (placed into RS), then execute;
if not ready, monitor Common Data Bus for result
By delaying EX until the operands are available, RAW hazards are avoided
3. Write resultfinish execution (WB)
Write on Common Data Bus to the registers and the RS of all awaiting units;
mark reservation station available
Normal data bus: data + destination (go to bus)
Common data bus: data + source (come from bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (produces result)
Does the broadcast
7 Components of Reservation Station
Op: Operation to perform in the unit (e.g., + or )
Qj, Qk: Reservation stations producing the
corresponding source operand
Note: Qj,Qk=0 => ready or unnessary
Store buffers only have Qi for RS producing result
Vj, Vk: Value of Source operands
Only one of V field or the Q field is valid
Store buffers has V field, result to be stored
A: used to hold information for the memory address
calculation for a load or a store
Busy: Indicates reservation station or FU is busy
Register result status QiIndicates which functional
unit will write each register, if one exists. Blank when
no pending instructions that will write that register.
Tomasulo Example
Instruction stream
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
Load1
Load2
Load3
Register result status:
Clock
0
No
No
No
3 Load/Buffers
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
FU count
Add3
No
down
Mult1 No
Mult2 No
Busy Address
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
3 FP Adder R.S.
2 FP Mult R.S.
F0
F2
F4
F6
F8
FU
Clock cycle
counter
Example speed: 3 clocks for FP +,-; 10 for * ; 40 clks for /
F10
F12
...
F30
Tomasulo Example Cycle 1
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Register result status:
Clock
1
FU
Busy Address
Load1
Load2
Load3
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F0
F2
F4
F6
F8
Load1
Yes
No
No
34+R2
F10
F12
...
F30
Tomasulo Example Cycle 2
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
Reservation Stations:
Time Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Register result status:
Clock
2
FU
Busy Address
Load1
Load2
Load3
Op
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F0
F2
F4
F6
F8
Load2
Yes
Yes
No
34+R2
45+R3
F10
F12
Load1
Note: Can have multiple loads outstanding
...
F30
Tomasulo Example Cycle 3
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
Reservation Stations:
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes MULTD
Mult2 No
Register result status:
Clock
3
FU
F0
Busy Address
S1
Vj
Load1
Load2
Load3
S2
Vk
RS
Qj
Yes
Yes
No
34+R2
45+R3
F10
F12
RS
Qk
R(F4) Load2
F2
Mult1 Load2
F4
F6
F8
...
Load1
Note: registers names are removed (renamed) in
Reservation Stations; MULT issued
Load1 completing; what is waiting for Load1?
F30
Tomasulo Example Cycle 4
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
Reservation Stations:
Busy Address
3
4
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
No
Yes
No
45+R3
F10
F12
Time Name Busy Op
Add1 Yes SUBD M(A1)
Load2
Add2
No
Add3
No
R(F4) Load2
Mult1 Yes MULTD
Mult2 No
Register result status:
Clock
4
FU
F0
Mult1 Load2
M(A1) Add1
Load2 completing; what is waiting for Load2?
...
F30
Tomasulo Example Cycle 5
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
2 Add1 Yes SUBD M(A1) M(A2)
Add2
No
Add3
No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
5
FU
F0
Mult1 M(A2)
No
No
No
F10
M(A1) Add1 Mult2
Timer starts down for Add1, Mult1
F12
...
F30
Tomasulo Example Cycle 6
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD
M(A2) Add1
Add3
No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
6
FU
F0
Mult1 M(A2)
Add2
No
No
No
F10
F12
...
Add1 Mult2
Issue ADDD here despite name dependency on F6?
F30
Tomasulo Example Cycle 7
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
Busy Address
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD
M(A2) Add1
Add3
No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
7
FU
F0
No
No
No
Mult1 M(A2)
Add2
F10
F12
Add1 Mult2
Add1 (SUBD) completing; what is waiting for it?
...
F30
Tomasulo Example Cycle 8
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
2 Add2 Yes ADDD (M-M) M(A2)
Add3
No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
8
FU
F0
Mult1 M(A2)
No
No
No
F10
Add2 (M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 9
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
1 Add2 Yes ADDD (M-M) M(A2)
Add3
No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
9
FU
F0
Mult1 M(A2)
No
No
No
F10
Add2 (M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 10
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
4
5
Busy Address
Load1
Load2
Load3
10
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
0 Add2 Yes ADDD (M-M) M(A2)
Add3
No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
10
FU
F0
No
No
No
Mult1 M(A2)
F10
F12
Add2 (M-M) Mult2
Add2 (ADDD) completing; what is waiting for it?
...
F30
Tomasulo Example Cycle 11
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
11
FU
F0
Mult1 M(A2)
No
No
No
F10
(M-M+M(M-M) Mult2
Write result of ADDD here?
All quick instructions complete in this cycle!
F12
...
F30
Tomasulo Example Cycle 12
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
12
FU
F0
Mult1 M(A2)
No
No
No
F10
(M-M+M(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 13
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
13
FU
F0
Mult1 M(A2)
No
No
No
F10
(M-M+M(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 14
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
4
5
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
14
FU
F0
Mult1 M(A2)
No
No
No
F10
(M-M+M(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 15
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
Busy Address
3
4
15
7
4
5
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1
Register result status:
Clock
15
FU
F0
Mult1 M(A2)
No
No
No
F10
F12
...
(M-M+M(M-M) Mult2
Mult1 (MULTD) completing; what is waiting for it?
F30
Tomasulo Example Cycle 16
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
4
5
16
8
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
16
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
(M-M+M(M-M) Mult2
Just waiting for Mult2 (DIVD) to complete
F12
...
F30
Tomasulo Example Cycle 55
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
4
5
16
8
Load1
Load2
Load3
10
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
55
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
(M-M+M(M-M) Mult2
F12
...
F30
Tomasulo Example Cycle 56
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
56
10
4
5
16
8
Load1
Load2
Load3
11
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
56
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
F12
...
(M-M+M(M-M) Mult2
Mult2 (DIVD) is completing; what is waiting for it?
F30
Tomasulo Example Cycle 57
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Exec Write
Issue Comp Result
1
2
3
4
5
6
Reservation Stations:
3
4
15
7
56
10
4
5
16
8
57
11
Load1
Load2
Load3
S1
Vj
S2
Vk
RS
Qj
RS
Qk
F2
F4
F6
F8
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock
56
FU
F0
Busy Address
M*F4 M(A2)
No
No
No
F10
F12
...
(M-M+M(M-M) Result
Once again: In-order issue, out-of-order execution
and out-of-order completion.
F30
Tomasulo Drawbacks
Complexity
delays of 360/91, MIPS 10000, Alpha 21264,
IBM PPC 620 in CA:AQA 2/e, but not in silicon!
Many associative stores (CDB) at high speed
Performance limited by Common Data Bus
Each CDB must go to multiple functional units
high capacitance, high wiring density
Number of functional units that can complete per cycle
limited to one!
Multiple CDBs more FU logic for parallel assoc stores
Non-precise interrupts!
We will address this later
Tomasulo Loop Example
Loop:LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
This time assume Multiply takes 4 clocks
Assume 1st load takes 8 clocks
(L1 cache miss), 2nd load takes 1 clock (hit)
To be clear, will show clocks for SUBI, BNEZ
Reality: integer instructions ahead of Fl. Pt. Instructions
Show 2 iterations
Loop Example
Instruction status:
ITER Instruction
1
1
1
Iter2
ation 2
Count 2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Op
Vj
Exec Write
Issue CompResult
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
S1
Vk
S2
Qj
RS
Qk
Code:
LD
MULTD
SD
SUBI
BNEZ
No
No
No
No
No
No
Added Store Buffers
F0
F4
F4
R1
R1
Register result status
Clock
0
F0
R1
80
F2
F4
F6
F8
Fu
F10 F12
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Instruction Loop
Fu
Value of Register used for address, iteration control
Loop Example Cycle 1
Instruction status:
ITER Instruction
1
LD
F0
R1
Vj
S1
Vk
Reservation Stations:
Time
Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No
Exec Write
Issue CompResult
Op
S2
Qj
RS
Qk
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
No
No
No
80
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
1
R1
80
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Loop Example Cycle 2
Instruction status:
ITER Instruction
1
1
LD
MULTD
F0
F4
0
F0
R1
F2
1
2
Vj
S1
Vk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
No
No
No
80
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
2
R1
80
F0
Fu Load1
F2
F4
Mult1
F6
F8
F10 F12
Loop Example Cycle 3
Instruction status:
ITER Instruction
1
1
1
LD
MULTD
SD
F0
F4
F4
0
F0
0
R1
F2
R1
Reservation Stations:
Time
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
Vj
Exec Write
Issue CompResult
1
2
3
S1
Vk
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
Yes
No
No
80
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
3
R1
80
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
Implicit renaming sets up data flow graph
Loop Example Cycle 4
Instruction status:
ITER Instruction
1
1
1
LD
MULTD
SD
F0
F4
F4
0
F0
0
R1
F2
R1
Reservation Stations:
Time
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
Vj
Exec Write
Issue CompResult
1
2
3
S1
Vk
S2
Qj
RS
Qk
R(F2) Load1
Fu
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
Yes
No
No
80
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
4
R1
80
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
Dispatching SUBI Instruction (not in FP queue)
Loop Example Cycle 5
Instruction status:
ITER Instruction
1
1
1
LD
MULTD
SD
F0
F4
F4
0
F0
0
R1
F2
R1
Reservation Stations:
Time
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
Vj
Exec Write
Issue CompResult
1
2
3
S1
Vk
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
No
Yes
No
No
80
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
5
R1
72
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
And, BNEZ instruction (not in FP queue)
Loop Example Cycle 6
Instruction status:
ITER Instruction
1
1
1
2
LD
MULTD
SD
LD
F0
F4
F4
F0
0
F0
0
0
R1
F2
R1
R1
1
2
3
6
Vj
S1
Vk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
S2
Qj
RS
Qk
R(F2) Load1
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
No
No
80
72
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
6
R1
72
F0
Fu Load2
F2
F4
F6
F8
F10 F12
Mult1
Notice that F0 never sees Load from location 80
Loop Example Cycle 7
Instruction status:
ITER Instruction
1
1
1
2
2
LD
MULTD
SD
LD
MULTD
F0
F4
F4
F0
F4
0
F0
0
0
F0
R1
F2
R1
R1
F2
1
2
3
6
7
Vj
S1
Vk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd
S2
Qj
RS
Qk
R(F2) Load1
R(F2) Load2
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
No
No
80
72
80
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
7
R1
72
F0
Fu Load2
F2
F4
F6
F8
F10 F12
Mult2
Register file completely detached from computation
First and Second iteration completely overlapped
Loop Example Cycle 8
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
Vj
S1
Vk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd
S2
Qj
RS
Qk
R(F2) Load1
R(F2) Load2
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
Yes
No
80
72
80
72
Mult1
Mult2
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
8
R1
72
F0
Fu Load2
F2
F4
Mult2
F6
F8
F10 F12
Loop Example Cycle 9
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
Vj
S1
Vk
S2
Qj
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd
RS
Qk
R(F2) Load1
R(F2) Load2
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
Yes
No
80
72
80
72
Mult1
Mult2
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
9
R1
72
F0
Fu Load2
F2
F4
F6
Mult2
Load1 completing: who is waiting?
Note: Dispatching SUBI
F8
F10 F12
Loop Example Cycle 10
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk
10
10
S2
Qj
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd
R(F2) Load2
RS
Qk
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
Yes
No
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
Fu
72
80
72
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
10
R1
64
F0
Fu Load2
F2
F4
F6
Mult2
Load2 completing: who is waiting?
Note: Dispatching BNEZ
F8
F10 F12
Loop Example Cycle 11
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
3
4
Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)
10
10
11
S2
Qj
RS
Qk
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
11
R1
64
F0
Fu Load3
F2
F4
Mult2
Next load in sequence
F6
F8
F10 F12
Loop Example Cycle 12
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
2
3
Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)
10
10
11
S2
Qj
RS
Qk
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
12
R1
64
F0
Fu Load3
F2
F4
F6
F8
Mult2
Why not issue third multiply?
F10 F12
Loop Example Cycle 13
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
1
2
Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)
10
10
11
S2
Qj
RS
Qk
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
13
R1
64
F0
Fu Load3
F2
F4
F6
Mult2
Why not issue third store?
F8
F10 F12
Loop Example Cycle 14
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
0
1
Exec Write
Issue CompResult
1
2
3
6
7
8
9
14
10
11
S1
Vk
S2
Qj
RS
Qk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)
10
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
Mult1
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
14
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult2
Mult1 completing. Who is waiting?
Loop Example Cycle 15
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
Reservation Stations:
Time
Exec Write
Issue CompResult
1
2
3
6
7
8
9
14
10
15
11
S1
Vk
S2
Qj
RS
Qk
Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 Yes Multd M[72] R(F2)
10
15
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
[80]*R2
Mult2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
15
R1
64
F0
Fu Load3
F2
F4
F6
F8
F10 F12
Mult2
Mult2 completing. Who is waiting?
Loop Example Cycle 16
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
10
15
11
16
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
10
15
R(F2) Load3
Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
No
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
64
80
72
Fu
[80]*R2
[72]*R2
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
16
R1
64
F0
Fu Load3
F2
F4
Mult1
F6
F8
F10 F12
Loop Example Cycle 17
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
10
15
11
16
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
10
15
R(F2) Load3
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
Yes
64
80
72
64
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
[80]*R2
[72]*R2
Mult1
Register result status
Clock
17
R1
64
F0
Fu Load3
F2
F4
Mult1
F6
F8
F10 F12
Loop Example Cycle 18
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
18
10
15
10
15
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
11
16
R(F2) Load3
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
Yes
Yes
Yes
64
80
72
64
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
[80]*R2
[72]*R2
Mult1
Register result status
Clock
18
R1
64
F0
Fu Load3
F2
F4
Mult1
F6
F8
F10 F12
Loop Example Cycle 19
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
18
10
15
19
10
15
19
11
16
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
R(F2) Load3
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
No
No
Yes
No
Yes
Yes
72
64
[72]*R2
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
64
Register result status
Clock
19
R1
56
F0
Fu Load3
F2
F4
Mult1
F6
F8
F10 F12
Loop Example Cycle 20
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
14
18
10
15
19
10
15
19
11
16
20
Vj
S1
Vk
S2
Qj
RS
Qk
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No
R(F2) Load3
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
No
Yes
No
No
Yes
56
64
Mult1
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
64
Register result status
Clock
20
R1
56
F0
Fu Load1
F2
F4
F6
F8
F10 F12
Mult1
Once again: In-order issue, out-of-order execution
and out-of-order completion.
Why can Tomasulo overlap iterations
of loops?
Register renaming
Multiple iterations use different physical destinations for
registers (dynamic loop unrolling).
Reservation stations
Permit instruction issue to advance past integer control flow
operations
Also buffer old values of registers - totally avoiding the WAR
stall that we saw in the scoreboard.
Other perspective: Tomasulo building data flow
dependency graph on the fly.
Tomasulos scheme offers 2 major
advantages
(1) the distribution of the hazard detection logic
distributed reservation stations and the CDB
If multiple instructions waiting on single result, & each
instruction has other operand, then instructions can be
released simultaneously by broadcast on CDB
If a centralized register file were used, the units
would have to read their results from the registers
when register buses are available.
(2) the elimination of stalls for WAW and WAR
hazards of scoreboard