EE476
VLSI
Lecture 6: Designing for Speed
CSE477 L10 Inverter, Dynamic.1 Irwin&Vijay, PSU, 2002
Cray was a legend in computers … said that he
liked to hire inexperienced engineers right out of
school, because they do not usually know what’s
supposed to be impossible.
The Soul of a New Machine, Kidder, pg. 77
Review: CMOS Inverter: Dynamic
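[Figure: CMOS inverter switch model with Vin = VDD; the NMOS on-resistance Rn discharges CL from Vout to ground]
$t_{pHL} = f(R_n, C_L)$
$t_{pHL} = 0.69\,R_{eqn} C_L = 0.69 \cdot \frac{3}{4}\,\frac{C_L V_{DD}}{I_{DSATn}} = \frac{0.52\,C_L}{(W/L)_n\,k'_n\,V_{DSATn}}$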
Review: Designing Inverters for Performance
Reduce CL
- internal diffusion capacitance of the gate itself
- interconnect capacitance
- fanout
Increase W/L ratio of the transistor
- the most powerful and effective performance optimization tool in the hands of the designer
- watch out for self-loading!
Increase VDD
- only minimal improvement in performance at the cost of increased energy dissipation
Slope engineering - keeping signal rise and fall times smaller than or equal to the gate propagation delays and of approximately equal values
- good for performance
- good for power consumption
Switch Delay Model
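[Figure: switch-level RC models of the inverter, NAND and NOR gates with inputs A and B. Each NMOS is replaced by a switch with on-resistance Rn and each PMOS by a switch with on-resistance Rp (Req for a generic switch); the output drives CL, and the internal node of each series stack carries Cint.]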
Input Pattern Effects on Delay
Delay is dependent on the pattern of inputs
[Figure: 2-input NAND switch model with two parallel Rp (inputs A and B), two series Rn, internal node capacitance Cint and load CL]
Low to high transition
  both inputs go low
  - delay is 0.69 (Rp/2) CL since two p-resistors are on in parallel
  one input goes low
  - delay is 0.69 Rp CL
High to low transition
  both inputs go high
  - delay is 0.69 (2Rn) CL
Adding transistors in series (without sizing) slows down the circuit
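The three first-order cases above can be tabulated with a minimal Python sketch. The resistance and capacitance values are illustrative assumptions (they are not taken from the slides), and the internal node capacitance Cint is ignored.

```python
# First-order switch-model delays of a 2-input NAND (Cint ignored).
# R_P, R_N and C_L are assumed, illustrative values, not lecture data.
R_P = 30e3    # ohms, on-resistance of one PMOS (assumed)
R_N = 15e3    # ohms, on-resistance of one NMOS (assumed)
C_L = 10e-15  # farads, load capacitance (assumed)


def rc_delay(r_eq, c_load):
    """Propagation delay of a first-order RC network: 0.69 * R * C."""
    return 0.69 * r_eq * c_load


delays = {
    # two PMOS devices conduct in parallel
    "low-to-high, both inputs go low": rc_delay(R_P / 2, C_L),
    # a single PMOS device conducts
    "low-to-high, one input goes low": rc_delay(R_P, C_L),
    # two NMOS devices conduct in series
    "high-to-low, both inputs go high": rc_delay(2 * R_N, C_L),
}

for pattern, t in delays.items():
    print(f"{pattern}: {t * 1e12:.1f} ps")
```

Whatever values are chosen, the single-PMOS case is twice as slow as the case where both PMOS devices pull up together.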
Delay Dependence on Input Patterns
2-input NAND with
  NMOS = 0.5 µm/0.25 µm
  PMOS = 0.75 µm/0.25 µm
  CL = 10 fF
[Figure: output voltage (V) versus time (psec, 0 to 400) for the input patterns A=B=1→0; A=1→0, B=1; and A=1, B=1→0]

Input Data Pattern    Delay (psec)
A=B=0→1               69
A=1, B=0→1            62
A=0→1, B=1            50
A=B=1→0               35
A=1, B=1→0            76
A=1→0, B=1            57
Fan-In Considerations
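[Figure: 4-input NAND with inputs A (nearest the output) to D (nearest ground); the output drives CL and the internal nodes of the pull-down stack carry C3, C2 and C1]
Distributed RC model (Elmore delay)
tpHL = 0.69 Reqn (C1 + 2C2 + 3C3 + 4CL)
Propagation delay deteriorates rapidly as a function of fan-in, quadratically in the worst case.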
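A short Python sketch of the Elmore expression, generalized to an N-transistor pull-down stack. It assumes every series NMOS has the same resistance Reqn; the function name and the normalized capacitance values in the loop are hypothetical.

```python
def nand_tphl_elmore(r_eqn, internal_caps, c_load):
    """tpHL = 0.69 * Reqn * (C1 + 2*C2 + ... + (N-1)*C_{N-1} + N*CL).

    internal_caps lists C1, C2, ... starting at the node nearest ground.
    Every series NMOS is assumed to have the same resistance r_eqn.
    """
    n = len(internal_caps) + 1  # number of series transistors
    weighted = sum(k * c for k, c in enumerate(internal_caps, start=1))
    return 0.69 * r_eqn * (weighted + n * c_load)


# Normalized example (assumed values): all capacitances equal to 1, Reqn = 1.
for fan_in in (2, 4, 8):
    t = nand_tphl_elmore(1.0, [1.0] * (fan_in - 1), 1.0)
    print(f"fan-in {fan_in}: tpHL = {t:.2f} (normalized)")
```

With equal capacitances the weighted sum is N(N+1)/2, which is where the quadratic growth with fan-in comes from.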
tp as a Function of Fan-In
[Figure: tp (psec, 0 to 1250) versus fan-in (2 to 16). tpHL is a quadratic function of fan-in; tpLH is a linear function of fan-in.]
Gates with a fan-in greater than 4 should be avoided.
Fast Complex Gates: Design Technique 1
Transistor sizing
- as long as fan-out capacitance dominates
Progressive sizing
[Figure: distributed RC line; series stack with M1 (input In1, node C1) nearest ground up to MN (input InN) at the output node with CL]
- M1 > M2 > M3 > … > MN (the FET closest to the output should be the smallest)
- Can reduce delay by more than 20%; decreasing gains as technology shrinks
While progressive sizing is easy in a schematic, in a real layout it may not pay off due to design-rule considerations that force the designer to push the transistors apart, increasing internal capacitance.
Fast Complex Gates: Design Technique 2
Input re-ordering
when not all inputs arrive at the same time
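[Figure: two 3-input series stacks, M3 at the output (CL) down to M1 at ground (C1), with the critical-path input switching 0→1 and the other two inputs at 1.
 Left: the critical-path input In1 drives M1; CL, C2 and C1 are all charged.
 Right: the critical-path input In1 drives M3; CL is charged, C2 and C1 are already discharged.]
Left: delay determined by time to discharge CL, C1 and C2
Right: delay determined by time to discharge CL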
Sizing and Ordering Effects
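[Figure: 4-input NAND; pull-up transistors A, B, C, D each labeled 3; pull-down chain labeled A 44, B 45, C 46, D 47, with internal nodes C3, C2 and C1; CL = 100 fF]
Progressive sizing in pull-down chain gives up to a 23% improvement.
Input ordering saves 5%
- critical path A: 23%
- critical path D: 17%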
Fast Complex Gates: Design Technique 3
Alternative logic structures
F = ABCDEFGH
Reduced fan-in -> deeper logic depth
Reduction in fan-in offsets, by far, the extra delay incurred by the
NOR gate (second configuration).
Only simulation will tell which of the last two configurations is
faster, lower power
Fast Complex Gates: Design Technique 4
Isolating fan-in from fan-out using buffer insertion
[Figure: a large fan-in gate driving CL directly, and the same gate driving CL through inserted buffers]
Real lesson is that optimizing the propagation delay of a
gate in isolation is misguided.
Reduce CL on large fan-in gates, especially for large CL, and size the inverters
progressively to handle the CL more effectively
Fast Networks: Design Technique 5 - Logical Effort
The optimum fan-out for a chain of N inverters driving a load CL is
$f = \sqrt[N]{C_L / C_{in}}$
so, if we can, keep the fan-out per stage around 4.
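As a rough illustration, the Python sketch below evaluates the per-stage fan-out for a given number of stages and picks a stage count that lands the fan-out near 4. The function names are hypothetical, and rounding log4(CL/Cin) is just one simple way to choose N, not a method stated in the lecture.

```python
import math


def stage_fanout(c_load, c_in, n_stages):
    """Per-stage fan-out f = (CL/Cin)**(1/N) for a chain of N inverters."""
    return (c_load / c_in) ** (1.0 / n_stages)


def stages_for_fanout_near_4(c_load, c_in):
    """Stage count N ~ log4(CL/Cin), rounded, with at least one stage."""
    return max(1, round(math.log(c_load / c_in) / math.log(4)))


# Assumed example: the load is 64 times the input capacitance.
n = stages_for_fanout_near_4(64.0, 1.0)
f = stage_fanout(64.0, 1.0, n)
print(n, round(f, 2))
```

For CL/Cin = 64 this gives three stages with a fan-out of 4 each.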
Can the same approach (logical effort) be used for any
combinational circuit?
For a complex gate, we expand the inverter equation
tp = tp0 (1 + Cext/(γ Cg)) = tp0 (1 + f/γ)
to
tp = tp0 (p + g f/γ)
- tp0 is the intrinsic delay of an inverter
- f is the effective fan-out (Cext/Cg), also called the electrical effort
- p is the ratio of the intrinsic (unloaded) delay of the complex gate to that of a simple inverter (a function of the gate topology and layout style)
- g is the logical effort
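A small Python sketch of the gate delay expression. The p and g values in the example are those this lecture tabulates for an inverter and a 2-input NAND; taking γ = 1 and tp0 = 1 (normalized) is an assumption made only for illustration.

```python
def gate_delay(tp0, p, g, f, gamma=1.0):
    """Gate delay tp = tp0 * (p + g * f / gamma)."""
    return tp0 * (p + g * f / gamma)


# Normalized comparison at an effective fan-out of 4 (gamma = 1 assumed).
inverter = gate_delay(1.0, p=1, g=1, f=4)
nand2 = gate_delay(1.0, p=2, g=4 / 3, f=4)
print(f"inverter: {inverter:.2f} tp0, 2-input NAND: {nand2:.2f} tp0")
```

At the same fan-out the NAND is slower on both counts: a larger intrinsic term p and a larger effort term g f/γ.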
Intrinsic Delay Term, p
The more involved the structure of the complex gate, the higher the intrinsic delay compared to an inverter

Gate Type       p
Inverter        1
n-input NAND    n
n-input NOR     n
n-way mux       2n
XOR, XNOR       n 2^(n-1)

(Ignoring second order effects such as internal node capacitances)
Logical Effort Term, g
g represents the fact that, for a given load, complex gates have to work harder than an inverter to produce a similar (speed) response
- the logical effort of a gate tells how much worse it is at producing an output current than an inverter (how much more input capacitance a gate presents to deliver the same output current)

Gate Type    g, by number of inputs
             1      2      3      n (general)
Inverter     1
NAND                4/3    5/3    (n+2)/3
NOR                 5/3    7/3    (2n+1)/3
mux                 2      2      2
XOR                 4      12
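The tabulated values can be captured in a small Python lookup. This is only a sketch: the function name is hypothetical, the NAND and NOR formulas are the general-n entries of the table, and XOR is limited to the two tabulated cases.

```python
from fractions import Fraction


def logical_effort(gate, n=1):
    """Logical effort g for the gate types tabulated above."""
    if gate == "inverter":
        return Fraction(1)
    if gate == "nand":
        return Fraction(n + 2, 3)
    if gate == "nor":
        return Fraction(2 * n + 1, 3)
    if gate == "mux":
        return Fraction(2)
    if gate == "xor" and n in (2, 3):
        return Fraction({2: 4, 3: 12}[n])  # only 2 and 3 inputs are tabulated
    raise ValueError(f"no tabulated logical effort for {gate!r} with n={n}")


for name, inputs in (("nand", 2), ("nand", 3), ("nor", 2), ("nor", 3)):
    print(name, inputs, logical_effort(name, inputs))
```

Exact fractions are used so the results print as 4/3, 5/3 and 7/3, matching the table entries.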
Example of Logical Effort
Assuming a pmos/nmos ratio of 2, the input capacitance
of a minimum-sized inverter is three times the gate
capacitance of a minimum-sized nmos (Cunit)
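[Figure: inverter, 2-input NAND and 2-input NOR sized for equal drive, with transistor widths as listed below]

Gate            PMOS widths    NMOS widths    Input capacitance (in units of Cunit)
Inverter        2              1              3
2-input NAND    2, 2           2, 2           4
2-input NOR     4, 4           1, 1           5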
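The logical effort of each gate follows from these widths as the ratio of its per-input capacitance to that of the inverter. A minimal Python sketch, assuming gate capacitance is proportional to transistor width (the function name is hypothetical):

```python
from fractions import Fraction


def logical_effort_from_widths(pmos_width, nmos_width, inverter_widths=(2, 1)):
    """Input capacitance of one gate input relative to the inverter input.

    Widths are in units of the minimum NMOS width; capacitance is assumed
    to be proportional to width, so one input loads pmos_width + nmos_width.
    """
    return Fraction(pmos_width + nmos_width, sum(inverter_widths))


g_nand2 = logical_effort_from_widths(pmos_width=2, nmos_width=2)
g_nor2 = logical_effort_from_widths(pmos_width=4, nmos_width=1)
print(g_nand2, g_nor2)
```

Each input of the NAND drives one PMOS of width 2 and one NMOS of width 2, giving 4/3; each NOR input drives widths 4 and 1, giving 5/3.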
Delay as a Function of Fan-Out
[Figure: normalized delay (0 to 7) versus fan-out f (0 to 5); the delay line is split into a constant intrinsic delay and an effort delay that grows with f]
The slope of the line is the logical effort of the gate
The y-axis intercept is the intrinsic delay
Can adjust the delay by adjusting the effective fan-out (by sizing) or by choosing a gate with a different logical effort
Gate effort: h = fg
Path Delay of Complex Logic Gate Network
Total path delay through a combinational logic block
tp = ∑ tp,j = tp0 ∑(pj + (fj gj)/γ )
The delay through the path is minimized when each stage bears the same gate effort
f1g1 = f2g2 = . . . = fNgN
Consider optimizing the delay through the logic network
[Figure: four-stage logic path with gate sizes 1, a, b and c, driving a load CL = 5]
how do we determine a, b, and c sizes?
Path Delay Equation Derivation
The path logical effort, G = ∏ gi
And the path effective fan-out (path electrical effort) is
F = CL/Cg1, where Cg1 is the input capacitance of the first gate
The branching effort accounts for fan-out to other gates
in the network
b = (Con-path + Coff-path)/Con-path
The path branching effort is then B = ∏ bi
And the total path effort is then H = GFB
So, the minimum delay through the path is
$D = t_{p0}\left(\sum_{j=1}^{N} p_j + \frac{N\,\sqrt[N]{H}}{\gamma}\right)$
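The derivation can be followed numerically with a short Python sketch. The function name and argument layout are hypothetical, and γ = 1 with tp0 = 1 (normalized) is assumed for the example.

```python
import math


def min_path_delay(tp0, p_list, g_list, path_fanout, b_list=None, gamma=1.0):
    """Minimum path delay D = tp0 * (sum(p_j) + N * H**(1/N) / gamma).

    H = G * F * B with G = prod(g_i), F = CL/Cg1 and B = prod(b_i).
    Omitting b_list means no branching (B = 1).
    """
    n = len(g_list)
    path_logical_effort = math.prod(g_list)
    path_branching = math.prod(b_list) if b_list else 1.0
    path_effort = path_logical_effort * path_fanout * path_branching
    return tp0 * (sum(p_list) + n * path_effort ** (1.0 / n) / gamma)


# Assumed example: three inverters (p = g = 1) driving F = 64, no branching.
d = min_path_delay(1.0, [1, 1, 1], [1, 1, 1], 64.0)
print(f"{d:.2f} tp0")
```

Here H = 64, the stage effort is 4, and the delay is 3 + 3 x 4 = 15 in units of tp0.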
Path Delay of Complex Logic Gates, con’t
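For gate i in the chain, its size is determined by
$s_i = \frac{g_1 s_1}{g_i} \prod_{j=1}^{i-1} \frac{f_j}{b_j}$
[Figure: four-stage logic path with gate sizes 1, a, b and c, driving a load CL = 5]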
For this network
F = CL/Cg1 = 5
G = 1 x 5/3 x 5/3 x 1 = 25/9
B = 1 (no branching)
H = GFB = 125/9, so the optimal stage effort is $\sqrt[4]{H} = 1.93$
- Fan-out factors are f1 = 1.93, f2 = 1.93 x 3/5 = 1.16, f3 = 1.16, f4 = 1.93
- So the gate sizes are a = f1g1/g2 = 1.16, b = f1f2g1/g3 = 1.34 and c = f1f2f3g1/g4 = 2.60
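The worked example can be reproduced with a few lines of Python. This sketch uses the numbers given above (g = 1, 5/3, 5/3, 1; F = 5; no branching); the variable names are illustrative.

```python
import math

g = [1.0, 5 / 3, 5 / 3, 1.0]   # logical efforts of the four stages
F = 5.0                        # path electrical effort CL/Cg1
B = 1.0                        # no branching
s1 = 1.0                       # size of the first gate

H = math.prod(g) * F * B       # path effort, 125/9
h = H ** (1.0 / len(g))        # optimal stage effort, fourth root of H
f = [h / g_i for g_i in g]     # per-stage fan-out, f_i = h / g_i

# s_i = (g1 * s1 / g_i) * product of f_j / b_j over j < i, with b_j = 1
sizes = []
upstream = 1.0
for g_i, f_i in zip(g, f):
    sizes.append(g[0] * s1 / g_i * upstream)
    upstream *= f_i

size_a, size_b, size_c = sizes[1:]
print(f"h = {h:.2f}, a = {size_a:.2f}, b = {size_b:.2f}, c = {size_c:.2f}")
```

Carrying full precision gives c ≈ 2.59 rather than 2.60; the small difference comes from multiplying the rounded factors 1.93 and 1.16. As a consistency check, c times the last stage's fan-out recovers CL = 5.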
Fast Complex Gates: Design Technique 6
Reducing the voltage swing
tpHL = 0.69 (3/4 (CL VDD)/ IDSATn )
becomes, with a reduced swing,
tpHL = 0.69 (3/4 (CL Vswing)/ IDSATn )
linear reduction in delay
also reduces power consumption
requires use of “sense amplifiers” on the receiving end to
restore the signal level (will look at their design when covering
memory design)
TG Logic Performance
Effective resistance of the TG is modeled as a parallel connection of Rp (= (VDD - Vout)/(-IDp)) and Rn (= (VDD - Vout)/IDn)
[Figure: Rn, Rp and Req = Rn || Rp (resistance in kΩ, 0 to 30) plotted against Vout for a TG passing 2.5 V to Vout, with 2.5 V on the NMOS gate and 0 V on the PMOS gate; W/Ln = W/Lp = 0.50/0.25]
So, the assumption that the TG switch has a constant
resistive value, Req, is acceptable
Delay of a TG Chain
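[Figure: chain of N transmission gates (gate controls 0 and 5) from Vin through nodes V1, ..., Vi, Vi+1, ..., VN, each node loaded by C, and its equivalent RC ladder with resistance Req per stage]
Delay of the RC chain (N TGs in series) is
$t_p(V_N) = 0.69 \sum_{k=1}^{N} k\,C R_{eq} = 0.69\,C R_{eq}\,\frac{N(N+1)}{2} \approx 0.35\,C R_{eq} N^2$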
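A quick Python check of the chain delay, comparing the exact sum with the quadratic approximation. C and Req are normalized to 1 here purely for illustration, and the function names are hypothetical.

```python
def tg_chain_delay(n, c, r_eq):
    """Elmore delay of N identical TG stages: 0.69 * C * Req * N*(N+1)/2."""
    return 0.69 * c * r_eq * n * (n + 1) / 2


def tg_chain_delay_approx(n, c, r_eq):
    """Large-N approximation: 0.35 * C * Req * N**2."""
    return 0.35 * c * r_eq * n ** 2


# Normalized example (C = Req = 1 assumed).
for n in (4, 16, 64):
    exact = tg_chain_delay(n, 1.0, 1.0)
    approx = tg_chain_delay_approx(n, 1.0, 1.0)
    print(f"N = {n}: exact {exact:.1f}, approx {approx:.1f}")
```

The approximation underestimates short chains noticeably and converges toward the exact sum as N grows.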
TG Delay Optimization
Can speed it up by inserting buffers every M switches
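[Figure: TG chain from Vin to VN with a buffer inserted after every M switches]
Delay of buffered chain (M TGs between buffers)
$t_p = 0.69\,\frac{N}{M}\,C R_{eq}\,\frac{M(M+1)}{2} + \left(\frac{N}{M} - 1\right) t_{pbuf}$
$M_{opt} = 1.7\,\sqrt{t_{pbuf} / (C R_{eq})} \approx 3 \text{ or } 4$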
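The trade-off can be explored with a minimal Python sketch that evaluates the buffered-chain delay for several values of M and compares the best one with the closed-form optimum. All numbers are normalized assumptions (C Req = 1, tpbuf = 4, N = 48), and the function names are hypothetical.

```python
import math


def buffered_tg_delay(n, m, c, r_eq, tp_buf):
    """tp = 0.69*(N/M)*C*Req*M*(M+1)/2 + (N/M - 1)*tpbuf."""
    return 0.69 * (n / m) * c * r_eq * m * (m + 1) / 2 + (n / m - 1) * tp_buf


def optimal_m(c, r_eq, tp_buf):
    """Closed-form optimum Mopt = 1.7 * sqrt(tpbuf / (C * Req))."""
    return 1.7 * math.sqrt(tp_buf / (c * r_eq))


# Assumed, normalized values: C*Req = 1 and tpbuf = 4 in the same time unit.
N_SWITCHES, C, R_EQ, TP_BUF = 48, 1.0, 1.0, 4.0
candidates = [1, 2, 3, 4, 6, 8, 12, 16, 24, 48]  # divisors of N_SWITCHES
best_m = min(
    candidates,
    key=lambda m: buffered_tg_delay(N_SWITCHES, m, C, R_EQ, TP_BUF),
)
print(best_m, round(optimal_m(C, R_EQ, TP_BUF), 2))
```

With these values the closed form gives about 3.4, and the best integer choice among the candidates is M = 3, with M = 4 nearly as fast.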