Lecture 5
Kaushik Roy
Electrical & Computer Engineering
Purdue University
Exponential Increase in Leakage
[Figure: Technology scaling timeline, 1970 to 2020, feature size from 5 µm down to 10 nm. Single-gate devices: bulk CMOS, PD/SOI (floating body over buried oxide, BOX), FD/SOI (fully depleted body over BOX). Multi-gate devices: DGMOS, FinFET, Trigate. Nano devices: carbon nanotube, III-V devices, nanowires, spintronics. Terminal labels: VS, VG, VD, Vback; source, gate, drain, substrate.]
[Figure: Normalized frequency (0.9 to 1.3) vs. normalized leakage (Isb, 1 to 5) for 130nm dies: 30% frequency spread, 5X leakage spread. Source: Intel]
Channel length
Delay and Leakage Spread
[Figure: Number of dopant atoms (log scale, 10 to 10000) vs. technology node (1000 nm down to 32 nm). Source: Intel]
Inter- and intra-die variations
Random dopant fluctuation
Device parameters are no longer deterministic
Reliability
Temporal degradation of performance -- NBTI
[Figure: Failure probability vs. time across technology generations]
[Figure: Histogram of dies vs. normalized IOFF (0 to 7), with the worst-case corner marked]
Substantial variation in leakage across dies
4X variation between nominal and worst-case leakage
Performance determined at nominal leakage
Robustness determined at worst-case leakage
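Intra-die (local) variation: $\delta V_{t-\mathrm{LOCAL}}$, with standard deviation $\sigma_{\mathrm{LOCAL}}$
Inter-die (global) variation: $\Delta V_{t-\mathrm{GLOBAL}}$, with standard deviation $\sigma_{\mathrm{GLOBAL}}$
$$\delta V_t = \Delta V_{t-\mathrm{GLOBAL}} + \delta V_{t-\mathrm{LOCAL}}$$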
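The decomposition above can be illustrated with a small Monte Carlo sketch: every cell on a die shares one inter-die shift and adds its own intra-die shift. The sigma values, the Gaussian distributions, and the independence of the two components are assumptions chosen for illustration; they are not taken from the slides.

```python
import math
import random
import statistics

def sample_vt_shifts(n_dies, cells_per_die, sigma_global, sigma_local, seed=0):
    """Return one list of per-cell Vt shifts (in volts) for each die."""
    rng = random.Random(seed)
    dies = []
    for _ in range(n_dies):
        # One inter-die (global) shift shared by every cell on this die.
        global_shift = rng.gauss(0.0, sigma_global)
        # Each cell adds its own independent intra-die (local) shift.
        dies.append([global_shift + rng.gauss(0.0, sigma_local)
                     for _ in range(cells_per_die)])
    return dies

SIGMA_GLOBAL = 0.030  # hypothetical inter-die sigma, volts
SIGMA_LOCAL = 0.020   # hypothetical intra-die sigma, volts

dies = sample_vt_shifts(400, 100, SIGMA_GLOBAL, SIGMA_LOCAL)

# Spread over all cells of all dies: both components contribute.
all_shifts = [shift for die in dies for shift in die]
total_sigma = statistics.pstdev(all_shifts)
expected_sigma = math.hypot(SIGMA_GLOBAL, SIGMA_LOCAL)

# Spread inside a single die: only the local component remains.
within_die_sigma = statistics.fmean([statistics.pstdev(die) for die in dies])

print(f"total sigma      = {total_sigma:.4f} V (independent sum: {expected_sigma:.4f} V)")
print(f"within-die sigma = {within_die_sigma:.4f} V (local sigma: {SIGMA_LOCAL:.4f} V)")
```

Under these assumptions the total spread approaches the root-sum-square of the two sigmas, while the spread measured inside one die reflects only the local component.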
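[Figure: 6T SRAM cell (transistors PL, PR, NL, NR, AXL, AXR; bitlines BL, BR; wordline WL; storage nodes VL, VR) with Vt shifts of +Δ and -Δ marked on the transistors, and voltage vs. time waveforms of WL, VL and VR during a read with VL = '1' and VR = '0': VR rises to VREAD; the trip point is VTRIPRD.]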
Read failure => Flipping of Cell Data while Reading
[Figure: 6T SRAM cell storing '1' and '0' (PL, PR, AXL, AXR, NL, NR, BL, BR); High-Vt and Low-Vt cases]
Test & Repair using Redundancy
Parametric failures
– Read Failures
– Write Failures
– Access Failures
– Hold Failures
Faulty chips Working chips
Parametric failures can degrade SRAM yield
[Figure: Histogram over the number of faulty cells (NFaulty-Cells), x-axis 0 to 1049, y-axis up to 250]
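[Figure: Inter-die distribution (σ GLOBAL) with the intra-die spread (σ LOCAL) shown at both ends. Reverse body bias (RBB) is applied on one side to reduce read and hold failures (RF & HF), forward body bias (FBB) on the other side to reduce access and write failures (AF & WF), and zero body bias (ZBB) in between.]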
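[Figure: SRAM cells between VDD and GND on bitlines BL, BR]
$$Y = \sum_{i=1}^{N} X_i \;\Rightarrow\; \frac{\sigma_Y}{\mu_Y} = \frac{1}{\sqrt{N}}\,\frac{\sigma_X}{\mu_X}$$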
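A quick numerical check of this averaging effect, as a sketch: summing the leakage of N cells shrinks the relative spread by a factor of 1/sqrt(N), provided the per-cell leakages are independent and identically distributed. The log-normal cell leakage model and its parameters below are assumptions for illustration only.

```python
import math
import random
import statistics

def relative_spread(samples):
    """sigma / mu of a list of samples."""
    return statistics.pstdev(samples) / statistics.fmean(samples)

def array_leakage(n_cells, rng):
    # Hypothetical model: each cell leakage X_i is log-normal and independent.
    return sum(rng.lognormvariate(0.0, 0.5) for _ in range(n_cells))

rng = random.Random(1)
N_CELLS = 100

cell_samples = [rng.lognormvariate(0.0, 0.5) for _ in range(20000)]
array_samples = [array_leakage(N_CELLS, rng) for _ in range(2000)]

cell_ratio = relative_spread(cell_samples)             # sigma_X / mu_X
array_ratio = relative_spread(array_samples)           # sigma_Y / mu_Y
predicted_ratio = cell_ratio / math.sqrt(N_CELLS)      # (1/sqrt(N)) * sigma_X / mu_X

print(f"sigma_X/mu_X          = {cell_ratio:.3f}")
print(f"sigma_Y/mu_Y measured = {array_ratio:.4f}")
print(f"sigma_Y/mu_Y from law = {predicted_ratio:.4f}")
```

Because only the independent (intra-die) part averages out, the leakage of a large array mostly tracks the shared inter-die shift.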
[Figure: Block diagram with labels: SRAM array, body bias, body-bias selection, VCO, 16 KB block, 64 KB array, isolated cell, LVT, sensor + ref. gen., BB gen. Circuit detail (shown twice): clk, LBL0, LBL1, N0, RS0 ... RS7, D0 ... D7, widths Ws, 2Ws, 4Ws.]
[Figure: Histogram of dies vs. normalized DC robustness (0.7 to 1.2), with the saved dies marked]
[Figure: Histogram of dies vs. normalized delay (0.8 to 1.2): 10% opportunistic speedup]
On-Die Leakage Sensor For Measuring
Process Variation
[Figure: Sensor layout, 83 μm × 73 μm: current reference, VBIAS gen., current mirrors, comparators, NMOS device, test interface]
Fault statistics
[Figure: Faulty chips and saved chips (0 to 300) vs. NFaulty-Cells (0 to 1049), comparing chips saved by the proposed architecture + redundancy (R=8, r=3) with chips saved by ECC + redundancy (R=16). Annotations: more chips saved compared to ECC; ECC fails to save any chips.]
The proposed architecture can handle more faulty cells than ECC, as many as 890 faulty cells
It saves more chips than ECC for a given NFaulty-Cells
CPU Performance Loss
[Figure: CPU performance loss (0.0 to 2.5) vs. NFaulty-Cells (0 to 839)]
[Figure: Flip-flop (D, Q, CLK) with a shadow latch on a delayed clock and output E]
RAZOR: Dan Ernst et al., MICRO 2003.
• Robustness:
– Increase supply voltage
• Power dissipation increases
– Upsize the gates
• Switching capacitance increases
[Figure: Path delays relative to CLK at VDD = 1V and at VDD < 1V, distinguishing critical-path activation from non-critical-path activation]
• Important points:
– Scale down the supply while making delay failures
predictable
– Avoid the failures by adaptive clock stretching
– Ensure that critical paths are activated rarely
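Why rare activation matters can be seen from a back-of-the-envelope sketch. It assumes that a normal operation takes one clock cycle, that a predicted critical-path activation is stretched to two cycles, and that prediction is perfect; the function names and the probabilities are hypothetical, not values from the slides.

```python
def average_cycles(p_critical, stretch_cycles=2):
    """Average cycles per operation when critical-path activations are stretched."""
    if not 0.0 <= p_critical <= 1.0:
        raise ValueError("p_critical must be a probability")
    # Non-critical operations take 1 cycle; critical ones take stretch_cycles.
    return (1.0 - p_critical) + p_critical * stretch_cycles

def throughput_penalty(p_critical, stretch_cycles=2):
    """Fractional throughput loss relative to one operation per cycle."""
    return 1.0 - 1.0 / average_cycles(p_critical, stretch_cycles)

for p in (0.01, 0.05, 0.20):
    print(f"p_critical={p:.2f}: "
          f"avg cycles={average_cycles(p):.3f}, "
          f"throughput penalty={100 * throughput_penalty(p):.1f}%")
```

The penalty grows roughly linearly with the activation probability, which is why the supply can be scaled down with little throughput loss only when critical paths fire rarely.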
[Figure: Path delay distributions against the clock period Tc for Design A (VDD = 1V), Design B (VDD = 1V) and Design B (VDD < 1V); path sets labeled S1, S2, S3 and S2+S3]
• Interesting features:
– Single critical path (activated by P0P1P2P3=1 & Ci,0=1)
– Low activation probability of critical path
[Figure: Adder example at VDD = 1V and at VDD = 0.8V, both with TCLK = 260ps: blocks A2 to A5 for Ci0 = 0 (output Co0) and for Ci1 = 1 (output Co1), multiplexers M1 to M10, Cin = 0, Cout. Array of full adders (FA) and a half adder (HA) with a latency predictor block; Stage 2, Stage 3.]
[Figure: Bar charts of % power savings, % area overhead and % throughput penalty vs. bit width (6 to 32 bits); iso-yield = 96%]
Random Logic: Shannon’s Expansion
$$f(x_1,\ldots,x_i,\ldots,x_n) = x_i \cdot f(x_1,\ldots,x_i=1,\ldots,x_n) + x_i' \cdot f(x_1,\ldots,x_i=0,\ldots,x_n) = x_i \cdot CF_1 + x_i' \cdot CF_2$$
$$CF_1 = f(x_1,\ldots,x_i=1,\ldots,x_n); \quad CF_2 = f(x_1,\ldots,x_i=0,\ldots,x_n)$$
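The expansion can be checked exhaustively on a small function. The sketch below builds the two cofactors for a chosen control variable and confirms that muxing them on that variable reproduces the original function; the 3-input majority function and the helper names are illustrative choices, not part of the lecture.

```python
from itertools import product

def cofactor(f, i, value):
    """Return f with input x_i fixed to `value` (0 or 1)."""
    def fixed(*xs):
        xs = list(xs)
        xs[i] = value
        return f(*xs)
    return fixed

def shannon_expand(f, i):
    """Rebuild f as x_i * CF1 + x_i' * CF2."""
    cf1 = cofactor(f, i, 1)  # CF1 = f(..., x_i = 1, ...)
    cf2 = cofactor(f, i, 0)  # CF2 = f(..., x_i = 0, ...)
    def expanded(*xs):
        return (xs[i] & cf1(*xs)) | ((1 - xs[i]) & cf2(*xs))
    return expanded

def majority(a, b, c):
    """Illustrative 3-input function."""
    return (a & b) | (b & c) | (a & c)

expanded = shannon_expand(majority, 0)
equivalent = all(majority(*v) == expanded(*v)
                 for v in product((0, 1), repeat=3))
print("expansion matches original:", equivalent)
```

With `a` as the control variable, CF1 reduces to `b | c` and CF2 to `b & c`; only one of the two is selected for any given input, which is what allows the inactive cofactor to be treated differently.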
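[Figure: Cofactors CF1 (xi = 1) and CF2 (xi = 0) of the inputs, each active with probability 50%, selected by a MUX on xi to produce f; a further expansion on xj gives cofactors CF11 and CF12, each active with probability 25%, selected by a second MUX]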
Activation probability of cofactors can be reduced
How to choose Control Variable ?
[Figure: Original circuit (inputs to PO) next to its expanded form: cofactor blocks CF11, CF21, CF32, CF42, CF53, CF63 fed by the inputs and combined by a MUX network]
[Figure: Bar charts of power and area (×10³ µm²) for the benchmarks cht, sct, pcle, mux, decod, cm150a, x2, alu2, count]
• Average power saving = ~50%
• Average area overhead = 18%
• Average performance penalty = 5.9% (with 4 control variables) for signal probability = 0.5
Two-Stage Pipeline with Test Logic
Low Power Robust Pipeline
[Figure: Block diagram and GDS layout. Regular pipeline and proposed pipeline, each with a carry-look-ahead adder, comparator, mux (4:1, 2:1) and SFFs; test logic with LFSR, fixed vectors and pre-decoders; clock generator; stalling logic (CLK, gclk, freeze); TM1, TM2; VDDm; outputs.]
~40% power saving with ~13% performance penalty
VDD Scaling, Process Variation, and Quality
Trade-off: DCT
512×512 image
$$Z = T \cdot X \cdot T'$$
X → 1D DCT → W → Transpose Memory → Y → 1D DCT → Z
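The row-column structure above can be sketched in plain Python for one 8×8 block: a 1D DCT, a transpose, and a second 1D DCT together give Z = T·X·T'. The orthonormal DCT-II scaling of T, the 8×8 block size, and the final transpose that restores the orientation are assumptions of this sketch, not details given on the slide.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix T (assumed scaling)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct_2d(x):
    """2D DCT as two 1D DCT passes around a transpose."""
    t = dct_matrix(len(x))
    w = matmul(t, x)                 # first 1D DCT pass: W = T X
    y = transpose(w)                 # transpose memory:  Y = W'
    return transpose(matmul(t, y))   # second pass, re-oriented: Z = T X T'

# A flat block has only a DC component: every other coefficient is ~0.
flat_block = [[1.0] * 8 for _ in range(8)]
z_flat = dct_2d(flat_block)
print(f"DC coefficient of a flat block: {z_flat[0][0]:.3f}")

# Cross-check against the direct matrix form Z = T X T' on a ramp block.
ramp_block = [[float(r + 2 * c) for c in range(8)] for r in range(8)]
t8 = dct_matrix(8)
z_direct = matmul(matmul(t8, ramp_block), transpose(t8))
z_rowcol = dct_2d(ramp_block)
max_diff = max(abs(z_direct[i][j] - z_rowcol[i][j])
               for i in range(8) for j in range(8))
print(f"max difference between the two forms: {max_diff:.2e}")
```

For smooth image blocks most of the energy lands in the low-frequency corner of Z, which is the basis for treating some coefficients as more important than others.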
3 5 8 14 17 27 30 43
4 9 13 18 26 31 42 44
10 12 19 25 32 41 45 54
11 20 24 33 40 46 53 55
21 23 34 39 47 52 56 61
22 35 38 48 51 57 60 62
36 37 49 50 58 59 63 64
[Figure: 1D-DCT unit with outputs y0 to y7 and z0 to z4, marking faster and slower computations. Datapath built from shifts (<<, >>), adders and subtractors over terms such as (x0 - x7)·f, (x1 - x6)·a, (x2 - x5)·a, (x2 - x5)·f, (x3 - x4)·a, (x3 - x4)·e, (x3 - x4)·f; annotated path depth: 4 adders delay.]
Proposed DCT under Vdd scaling
Proposed design with high/low delay paths (outputs w0 to w7, @Vdd1): important computations have delay = D1; the remaining paths have longer delays.
Scaled Vdd: paths get longer under Vdd scaling, with Vdd3 < Vdd2 < Vdd1 (nominal).
@Vdd2: the important computations complete within D1; the longer paths (D2 > D1) are not computed.
@Vdd3: only the DC component completes within D1; the other paths (D3 > D1, D4 > D1) are not computed.
1D-DCT Path Delay Comparisons
[Figure: Bar chart of delay (ns, 0 to 4) per computation path, Path1 (w0) through Path8 (w7), for the conventional DCT and the proposed DCT]
Effect of Vdd Scaling
[Figure: Output images for different architectures at nominal voltage and at 0.8 V; two panels are marked FAILS]

              Proposed DCT, Vdd=0.9V    Proposed DCT, Vdd=0.8V
Power (mW)    17.53 (41.2%)             11.09 (62.8%)
PSNR (dB)     29                        23.41
[Figure: Log-log plot of VTh shift (10^-3 to 10^-2 V) vs. stress time (10^0 to 10^4 s)]
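$$\frac{k_F}{k_R}\,[N_0 - N_{IT}] = N_{IT} \cdot N_H^{(0)}$$
Conservation of hydrogen:
$$N_{IT}(t) = \int_0^{\sqrt{D_H t}} N_H^{(0)}(y,t)\,dy$$
[Figure: Hydrogen concentration vs. distance into oxide y, from 0 to $\sqrt{D_H t}$]
$$N_{IT} = \frac{1}{2}\,N_H^{(0)}\,\sqrt{D_H t}$$
$$\Delta V_T = \frac{q \cdot \Delta N_{IT}}{C_{OX}}, \qquad N_{IT}(t) = \sqrt{\frac{k_F N_0}{2 k_R}}\,(D_H t)^{1/4}$$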
$$\Delta V_T = (1 + m)\,\frac{q\,\chi\,E_{ox}\,e^{E_{ox}/E_0}}{C_{OX}} \cdot t^{0.25}$$
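As a sanity check on the reaction-diffusion relations above, the sketch below evaluates the closed-form interface-trap density, verifies that it satisfies the reaction balance in the limit N_IT << N_0, and measures the slope on a log-log scale. All parameter values are hypothetical and in arbitrary units; only the functional form is taken from the slides.

```python
import math

def n_it(t, k_f, k_r, n0, d_h):
    """Closed form: N_IT(t) = sqrt(k_F * N_0 / (2 * k_R)) * (D_H * t) ** (1/4)."""
    return math.sqrt(k_f * n0 / (2.0 * k_r)) * (d_h * t) ** 0.25

def interface_hydrogen(t, k_f, k_r, n0, d_h):
    """N_H(0) from the conservation relation N_IT = (1/2) * N_H(0) * sqrt(D_H * t)."""
    return 2.0 * n_it(t, k_f, k_r, n0, d_h) / math.sqrt(d_h * t)

# Hypothetical parameters in arbitrary units, chosen so that N_IT << N_0.
RD = dict(k_f=1.0, k_r=1.0e3, n0=1.0e12, d_h=1.0e-13)

t_stress = 1.0e4
lhs = RD["k_f"] / RD["k_r"] * RD["n0"]                          # (k_F / k_R) * N_0
rhs = n_it(t_stress, **RD) * interface_hydrogen(t_stress, **RD)  # N_IT * N_H(0)
balance_error = abs(lhs - rhs) / lhs

# Slope of log(N_IT) versus log(t): the time exponent of the degradation.
slope = (math.log(n_it(1.0e4, **RD)) - math.log(n_it(1.0, **RD))) / math.log(1.0e4)

print(f"relative reaction-balance error: {balance_error:.1e}")
print(f"time exponent of N_IT:           {slope:.3f}")
```

Since the threshold shift is proportional to the trap density, the same exponent appears as the slope of the threshold-voltage shift against stress time on a log-log plot.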
Impact of NBTI on circuit
performance
Circuit Performance Degradation
[Figure: Log-log plots of % change vs. time (s): Vt, inverter delay and ROSC delay over 10^0 to 10^9 s (% change from 10^-2 to 10^2); a second plot over 10^0 to 10^8 s (10^-2 to 10^0)]
• Delay degradation in ISCAS c432
• Activity factor (switching activity) has little effect on the delay degradation
– In practice, activity factors are balanced during normal operation
Design method considering the
NBTI degradation
[Figure: Maximum circuit delay vs. time against the delay constraint, up to the required lifetime of the design, for the NBTI-aware design method and NBTI-aware over-design]
NBTI-aware Design
[Figure: Example gate network with inputs i1 to i3, gates 1 to 8 and outputs o1, o2; SRAM cell with bitlines BL, BLB and nodes VL, VR (CR = 1.33, PR = 0.67)]
[Figure: Log-log plot vs. time (10^4 to 10^8 s) with a 1/6 trend line, PTM 60nm, 125 °C; CDF of write margin (0.43 to 0.49 V) at t = 0, 10^6, 10^7 and 10^8: WM improves with time]
Area Saving
[Figure: Area (550 to 700) vs. delay (580 to 720 ps) for INIT, TR-based and Cell-based sizing, showing the area saving between Cell-based and TR-based sizing]
Synthesized in PTM 65nm
1/6 VTh degradation model
125 °C stress temperature
50% signal probability at PIs
Gate Sizing applied to guarantee lifetime functionality of design
11.7% overhead for Cell-based sizing
6.13% overhead for TR-based sizing
45% improvement in area overhead
Runtime complexity for TR-based sizing is identical to that of Cell-based sizing
IDDQ based NBTI Characterization
[Figure: Layout and microphotograph; inverter chain of 1000 stages between VDD and GND, driven by Vin, with IDDQ measurement]
Technology: CMOS 130nm
Die size: 20 mm2
I/O pins: 209
Tox: 1.6 nm
VDD: 1.2 V
• Test circuit fabricated
• 1000-stage INV chain
• DC stress signal @Vin
• IDDQ measurement @GND
Correlation between IDDQ & fMAX
[Figure: IDDQ (Vin = 0.0) and IDDQ (Vin = 1.7) under Vstress = 1.7V @150 °C; linear relationship between IDDQ and fMAX for c2670 and c5315]
Design phase
Initial Characterization
• Compute Rleak, RfMAX
• Compute KfMAX = Rleak / RfMAX
Post-silicon phase
Reliability Extrapolation
• For each IDDQ measurement sample, Rleak is computed
• RfMAX = Rleak / KfMAX
• Estimate fMAX degradation
Lifetime Projection
• Project IDDQ using KfMAX

• Circuit-level NBTI reliability characterization
– IDDQ test is used
– Expensive fMAX testing is avoided (or minimized)
• Accurate circuit-level performance degradation can be predicted
• IC-specific burn-in to qualify the target product
• Efficient way of field monitoring: dynamic local signature of product usage
• Possible use for other reliability sources, e.g. HCI
Temp. Dependency