Unit 8: Special Net Routing & Performance Optimization
Course contents:
Clock net routing
Power/ground routing
Performance optimization
Readings
Unit 8
W&C&C: Chapter 13
S&Y: Chapter 7
Y.-W. Chang
The Clock Routing Problem
Digital systems
Synchronous systems: Highly precise clock achieves
communication and timing.
Asynchronous systems: Handshake protocol achieves the
timing requirements of the system.
Clock skew: the difference in the minimum and the maximum
arrival times of the clock.
Clock routing: Routing clock nets such that
1.
2.
Unit 8
clock signals arrive simultaneously
clock delay is minimized
Other issues: total wirelength, power consumption
Y.-W. Chang
Clock Routing
Given the routing plane and a set of points
P = {p1, p2, , pn} within the plane and clock entry
point p0 on the boundary of the plane, the Clock
Routing Problem is to interconnect each pi P such
that maxi, j P|t(0, i) - t(0, j)| and maxi P t(0, i) are both
minimized.
p4
p1
p0
p0
p2
p6
p5
p3
Clock-tree synthesis (CTS): make the clock nets a tree
Unit 8
Y.-W. Chang
Clock Routing Algorithms
Pathlength-based Clock-Tree Synthesis (CTS)
1. H-tree: Dhar, Franklin, Wang, ICCD-84; Fisher & Kung, 1982.
2. Methods of means & medians (MMM): Jackson, Srinivasan,
Kuh, DAC-90.
3. Geometric matching: Cong, Kahng, Robins, DAC-91.
RC-delay based CTS
1. Exact zero skew: Tsay, ICCAD-91.
2. Deferred-merge embedding (DME) algorithm: Boese & Kahng,
ASICON-92; Chao & Hsu & Ho, DAC-92; Edahiro, NEC R&D, 1991.
3. Lagrangian relaxation: Chen, Chang, Wong, DAC-96.
Simulation-based CTS
ISPD-09 CTS contest (ASP-DAC-10, DATE-10)
Timing-model independent CTS
Shih & Chang, DAC-10; Shih et al., ICCAD-10.
Mesh-based & tree-link-based clock routing
Unit 8
Y.-W. Chang
H-Tree Based Algorithm
H-tree: Dhar, Franklin, Wang, Reduction of clock
delays in VLSI structure, ICCD-1984.
Similar topology: X-tree
Unit 8
Y.-W. Chang
The MMM Algorithm
Jackson, Sirinivasan, Kuh, Clock routing for high-performance
ICs, DAC-1990.
Each block pin is represented as a point in the region, S.
The region is partitioned into two subregions, SL and SR.
The center of mass is computed for each subregion.
The center of mass of the region S is connected to each of the
centers of mass of subregion SL and SR.
The subregions SL and SR are then recursively split in Y-direction.
Steps 2--5 are repeated with alternate splitting in X- and Ydirection.
Time complexity: O(n log n).
Unit 8
Y.-W. Chang
The Geometric Matching Algorithm
Cong, Kahng, Robins, Matching based models for highperformance clock routing, IEEE TCAD, 1993.
Clock pins are represented as n nodes in the clock tree (n = 2k).
Each node is a tree itself with clock entry point being node itself.
The minimum cost matching on n points yields n/2 segments.
The clock entry point in each subtree of two nodes is the point on
the segment such that length of both sides is same.
Above steps are repeated for each segment.
Apply H-flipping to further reduce clock skew (and to handle edges
intersection).
Time complexity: O(n2 log n).
Unit 8
Y.-W. Chang
Elmore Delay: Nonlinear Delay Model
Parasitic resistance and capacitance dominate delay in deep
submicron wires.
Resistor ri must charge all downstream capacitors.
Elmore delay: Delay can be approximated as sum of sections:
resistance downstream capacitance.
Delay grows as square of wire length.
Cannot apply to the delay with inductance consideration, which is
important in high-performance design.
Unit 8
Y.-W. Chang
Wire Models
Lumped circuit approximations for distributed RC lines: -model
(most popular), T-model, L-model.
-model: If no capacitive loads for C and D,
A to B: AB = r1 (c1/2 + c2 + c3);
B to C: BC = r2 (c2/2);
B to D: BD = r3 (c3/2).
Unit 8
Y.-W. Chang
Example Elmore Delay Computation
0.18 m technology: unit resistance = 0.075 / m; unit
capacitance = 0.118 fF/m.
Assume CC = 2 fF, CD = 4 fF.
BC = rBC (cBC / 2 + CC) = 0.075 150 (17.7/2 + 2) = 120 fs
BD = rBD (cBD / 2 + CD) = 0.075 200 (23.6/2 + 4) = 240 fs
AB = rAB (cAB/2 + CB) = 0.075 100 (11.8/2 + 17.7 + 2 + 23.6
+ 4) = 400 fs
Critical path delay: AB + BD = 640 fs.
Unit 8
Y.-W. Chang
10
Exact Zero Skew Algorithm
Tsay, Exact zero skew algorithm, ICCAD-91.
To ensure the delay from the tapping point to leaf nodes of subtrees T1
and T2 being equal, it requires that
r1 (c1/2 + C1) + t1 = r2 (c2/2 + C2) + t2.
Solving the above equation, we have
where and are the per unit values of resistance and capacitance,
l the length of the interconnecting wire, r1 = xl, c1 = xl, r2 = (1 - x)l,
c2 = (1 - x)l.
Unit 8
Y.-W. Chang
11
Zero-Skew Computation
Balance delays: r1(c1/2 + C1) + t1 = r2 (c2/2 + C2) + t2.
(): per
Compute tapping points:
unit values of resistance (capacitance); l: length of the wire;
r1 = xl, c1 = x l; r2 = (1 - x) l, c2 = (1 - x) l.
If x [0, 1], we need snaking to find the tapping point.
Exp: = 0.1 /unit, = 0.2 F/unit (tapping points: E, F, G)
merging segment
Unit 8
merging segment
Y.-W. Chang
12
Deferred Merge Embedding (DME)
Boese & Kahng, ASICON-92; Chao & Hsu & Ho, DAC92; Edahiro, NEC R&D, 1991
Consists of two stages: bottom-up + top-down
Bottom-up: Build the potential embedding locations of
clock sinks (i.e., a segment for potential tapping points)
Top-down: Determine exact locations for the embedding
Unit 8
Y.-W. Chang
13
Delay Computation for Buffered Wires
Wire: = 0.068 /m, = 0.118 fF/ m2; buffer: ' = 180 / unit
size, ' = 23.4 fF/unit size; driver resistance Rd = 180 ; unit-sized
wire, buffer.
Unit 8
Y.-W. Chang
14
Buffering and Wire Sizing for Skew Minimization
Discrete wire/buffer sizes: dynamic programming
Chung & Cheng, Skew sensitivity minimization of buffered
clock tree, ICCAD-94.
Continuous wire/buffer sizes: mathematical
programming (e.g., Lagrangian relaxation)
Unit 8
Chen, Chang, Wong, Fast performance-driven optimization for
buffered clock trees based on Lagrangian relaxation, DAC-96.
Considers clock skew, area, delay, power, clock-skew
sensitivity simultaneously.
Y.-W. Chang
15
Clock Meshes
More alternative paths to clock sinks
Good for high-performance circuits with stringent skew and
variation constraints
Drive mesh from the boundary or from grid points
H-tree is a good candidate to drive mesh
Alpha 21264 processor [Bailey et al. 1998]
Unit 8
IBM Power4 processor [Anderson et al. 2001]
Y.-W. Chang
16
Power Integrity: IR (Voltage) Drop
Power consumption and rail parasitics cause actual supply
voltage to be lower than ideal
Metal width tends to decrease with length increasing in
nanometer design
Effects of IR drop
Reducing voltage supply reduces circuit speed (5% IR drop =>
15% delay increase)
Reduced noise margin may cause functional failures
2.23V
1.46V
1.8V
violation
HM1
HM1
SM1
HM2
HM3
SM2
SM2
Unit 8
HM1
SM1
HM2
2.89V
SM2 HM3
3V
HM3
SM2
HM2
SM1
2.95V
Y.-W. Chang
17
Power/Ground (P/G) Routing
Are usually laid out entirely on metal layers for
smaller parasitics.
Two steps:
1. Construction of interconnection topology: non-crossing
power, ground trees.
2. Determination of wire widths: prevent metal migration,
keep voltage (IR) drop small, widen wires for more powerconsuming modules and higher density current (1 mA / m2
at 25 oC for 0.18 m technology). (So area metric?)
Unit 8
Y.-W. Chang
18
Power/Ground Network Optimization
Use the minimum amount of chip area for wiring P/G networks
while avoiding potential reliability failures due to electromigration
and excessive IR drops.
Tan and Shi, Fast power/ground network optimization based on
equivalent circuit modeling, DAC-2001.
Build the equivalent models for series resistors and apply a
sequence of the linear programming (SLP) method to solve the
problem.
Size wire segments assuming the topologies of P/G networks
to be fixed.
Wu and Chang, Efficient power/ground network analysis for power
integrity driven design methodology, DAC-2004.
Liu and Chang, Floorplan and power/ground co-synthesis for fast
design convergence, ISPD-06 (TCAD-07).
Unit 8
Y.-W. Chang
19
Problem Formulation
Let G = {N, B} be a P/G network with n nodes N = {1, , n} and b
branches B = {1, , b}; branch i connects two nodes: i1 and i2 with
current flowing from i1 to i2.
Let li and wi be the length and width of branch i, respectively. Let
be the sheet resistivity. Then the resistance ri of branch i
is ri
Vi1 vi 2
li
.
Ii
wi
Total P/G routing area is as follows:
i2
i1
wi
li
P/G network optimization is to minimize f(V, I) subject to the
constraints listed in the next slide.
Relax the nonlinear objective function and then translate the
constrained nonlinear programming problem into a SLP problem.
Unit 8
Y.-W. Chang
20
Constraints
The voltage IR drop constraints.
Vi Vh, min for power networks.
Vi Vl , max for ground networks.
The minimum width constraints: wi
liIi
wi , min
Vi1 Vi 2
The electro-migration constraints: Ii/wi => Vi Vi li
1
is a constant for a particular routing layer with a fixed
thickness.
Equal width constraints: wi wj or vi1 vi 2 vj1 vj 2
liIi
ljIj
Kirchoff s current law (KCL):
I 0
i
iB ( j )
For each node j = {1, , n}, B(j) is the set of indices of
branches connecting to node j.
Unit 8
Y.-W. Chang
21
Reducing the Problem Size with Equivalent Circuits
Consider a series resistor chain commonly seen in the P/G
network below.
Equivalent circuit
Series resistor chain
The equivalent
resistor Rs is just the sum of all the resistors in
n 1
series, Rs
R.
i
i 1
By superposition, the equivalent currents Ie1, and Ien can be
computed as follows:
Unit 8
Y.-W. Chang
22
Equivalent Circuit (contd)
The voltages at the intermediate nodes are calculated
based on superposition as follows:
Ri
Vi 1 Vi Vs RiIei
Rs
Iei 1 Iei Ii
Equivalent circuit
Series resistor chain
Unit 8
Y.-W. Chang
23
Equivalent Circuit Example
Unit 8
Y.-W. Chang
24
Design Methodology Evolution
IR-drop aware design methodology for faster design
convergence
Floorplanning
Floorplanning
IR-drop Analysis
P&R
no
iterative loop
RC Extraction
Simulation
SI Analysis
Traditional flow
Unit 8
IR-drop Analysis
OK
yes
P&R
P&R
RC Extraction
RC Extraction
Simulation
Simulation
SI Analysis
no
OK
iterative loop
yes
Floorplanning
SI Analysis
DAC-04 flow
(Wu & Chang)
Y.-W. Chang
ISPD-06 (TCAD-07) flow
(Liu & Chang)
25
Ideal Scaling of MOS Transistors
Feature size scales down by S times:
Unit 8
Y.-W. Chang
26
Ideal Scaling of Interconnections
Feature size scales down by S times:
Unit 8
Y.-W. Chang
27
Techniques for Higher Performance
In very deep submicron technology, interconnect delay
dominates circuit performance.
Techniques for higher performance
Unit 8
SOI: lower gate delay.
Copper interconnect: lower resistance.
Dielectric with lower permittivity: lower capacitance.
Buffering: Insert (and size) buffers to break a long
interconnection into shorter ones.
Wire sizing: Widen wires to reduce resistance (careful for
capacitance increase).
Shielding: Add/order wires to reduce capacitive and inductive
coupling.
Spacing: Widen wire spacing to reduce coupling.
Others: padding, track permutation, net ordering, etc.
Y.-W. Chang
28
Interconnect Dominates Circuit Performance!!
70
Worst-case
interconnect
delay due
to crosstalk
60
Delay (ps)
50
40
30
Interconnect
delay
20
10
Gate delay
650
500
350
250
180
150
100
70 (nm)
Technology Node
In 0.18m wire-to-wire
capacitance dominates (CW>>CS)
Unit 8
Y.-W. Chang
CS
CW
29
Optimal Buffer Sizing w/o Considering Interconnects
Delay through each stage is tmin, where tmin is the average delay
through any inverter driving an identically sized inverter.
n = CL/Cg n = ln (CL/Cg)/ln , where CL is the capacitive load
and Cg the capacitance of the minimum size inverter.
Total delay
Optimal stage ratio:
Optimal delay:
Buffer sizes are exponentially tapered ( = e).
Unit 8
Y.-W. Chang
30
Wire Sizing
Wire length is determined by layout architecture, but we can
choose wire width to minimize delay.
Wire width can vary with distance from driver to adjust the
resistance which drives downstream capacitance.
Wire with minimum delay has an exponential taper.
Can approximate optimal tapering with segments of a few
widths.
Recent research claims that buffering is more effective than
wire sizing for optimizing delay, and two wire widths are
sufficient for area/delay trade-off.
Unit 8
Y.-W. Chang
31
Optimal Wire-Sizing Function
Suppose a wire of length L is partitioned into n equal-length wire segments,
each of length x = L/n; unit resistance and capacitance: ,
.
The respective resistance and capacitance of i-th wire segment can be
approximated by x / f(xi) and x f(xi), where f(xi) is the width at
position xi.
Elmore delay:
As n , Dn D:
Optimal wire sizing function f(x) = ae-bx, where
Unit 8
x
Y.-W. Chang
32
Simultaneous Wire & Buffer Sizing
Input: Wire length L, driver resistance Rd, load capacitance CL,
unit wire area capacitance c0, unit wire fringing capacitance cf, unitsized wire resistance r0, unit-size capacitance of a buffer cb, unitsize buffer resistance rb, intrinsic buffer delay Tin, and the number
of buffers N.
Objective: Determine the stage ratio for buffer sizes and the
stage ratio for wire widths such that the wire delay is minimized.
Unit 8
Y.-W. Chang
33
Wire/Buffer Size Ratios for Delay Optimization
Chang, Chang, Jiang, ISQED-2002.
In practice, the delay of a wire DN(, ) is a convex function of the
stage ratio for practical buffer sizes and the stage ratio for
practical wire widths.
Can apply efficient search techniques (e.g., binary search) to find
the optimum ratios.
Unit 8
Y.-W. Chang
34
Performance Optimization: A Sizing Problem
Minimize the maximum delay Dmax by changing w1,,wn
Minimize Dmax
subject to Di (
w) D
max
, i 1..m
L wi U , i 1..n
w9
w7
w4
Unit 8
w11
D1<Dmax
w5
w10
b
w1
w8
w6
w3
Y.-W. Chang
w2
D2<Dmax
35
Popular Sizing Works
Algorithmic approaches: faster, non-optimal for general problems
TILOS (Fishburn, Dunlop, ICCAD-85)
Weighted Delay Optimization (Cong et al., ICCAD-95)
Traditional mathematical programming: often slower, optimal
Geometric Programming (TILOS)
Augmented Lagrangian (Marple et al., 86)
Sequential Linear Programming (Sapatnekar et al.)
Interior Point Method (Sapatnekar et al., TCAD-93)
Sequential Quadratic Programming (Menezes et al., DAC-95)
Augmented Lagrangian + Adjoin Sensitivity (Visweswariah, et
al., ICCAD-96, ICCAD-97)
Lagrangian relaxation based mathematical
programming: (Chen, Chang, Wong, DAC-96; Jiang,
Chang, Jou, DAC-99 [TCAD, Sept. 2000]; and many
more)
Unit 8
Fast and optimal
Y.-W. Chang
36
TILOS: Heuristic Approach
Finds sensitivities associated with each gate
Up-sizes the gate with the maximum sensitivity
Minimizes the objective function
Minimize Dmax
w9
w7
w4
Unit 8
w11
D1<Dmax
w5
w10
b
w1
w8
w6
w3
Y.-W. Chang
w2
D2<Dmax
37
Weighted Delay Optimization
Cong, et. al., ICCAD-95
Sizes one wire at a time in the DFS order
Minimize the weighted delay
Best weights?
Minimize 1D1 2D2
Driver
w1
w2
Loads
w3
1 D1
2 D2
w4
Unit 8
Y.-W. Chang
w5
38
From Mathematical Prog. to Lagrangian Relaxation
min
st
cx
Axb
xX
Mathematical
formulation
Unit 8
Posynomial
forms
Positive coefficient
polynomials
Y.-W. Chang
min
st
L()=cx + (Ax-b)
xX
Lagrange multipliers
39
Mathematical Programming
Formulation:
Minimize f ( x)
subject to g i ( x) 0, i 1..m
m
Lagrangian: L( ) f ( x) g ( x), where 0
i i
i
i 1
Optimality (Necessary) Condition (Kuhn-Tucker theorem):
m
L( )
0 f ( x ) g ( x ) 0
i i
xi
i 1
g ( x) 0 (Complemen tary Condition)
i i
g ( x) 0, 0 (Feasibility Condition)
i
i
Unit 8
Y.-W. Chang
40
Lagrangian Relaxation
Minimize f ( x)
LRS
Minimize f ( x) i gi ( x)
i 1
subject to g i ( x) 0, i 1..n
subject to gi ( x) 0, i n 1..m
g i ( x) 0, i n 1..m
LRS (Lagrangian Relaxation Subproblem)
There exist Lagrangian multipliersthat lead LRS to
the optimal solution for convex programming
When f(x), gi(x)s are all positive polynomials
(posynomials)
The optimal solution for any LRS is a lower bound
of the original problem
Unit 8
Y.-W. Chang
41
Lagrangian Relaxation
Minimize Dmax
subject to Di (
w) D
max
, i 1..m
L wi U , i 1..n
Lagrangian Relaxation
m
Minimize Dmax i ( Di (
w) D
i 1
max
subject to L wi U , i 1..n
m
L
By
0 , we have i 1
Dmax
i 1
Minimize
D (w )
i 1
subject to L wi U , i 1..n
Unit 8
Y.-W. Chang
42
Lagrangian Relaxation
Lagrangian
Relaxation
Augmented
Lagrangian
Weighted
Delay
SQP
TILOS
SLP
Algorithmic
approaches
Unit 8
Mathematical
Programming
Y.-W. Chang
43
Lagrangian Relaxation Framework
Update Multipliers
Weighted Delay
Optimization
Converge?
No
Yes
done
Unit 8
Y.-W. Chang
44
Lagrangian Relaxation Framework
More Critical -> More Resource -> Larger Weight
D1
D2
1 2
1 2
Dmax
Unit 8
D1 D2
Dmax
Y.-W. Chang
D1 D2
45
Weighted Minimization
Traverse the circuit in the topological order
Resize each component to minimize Lagrangian during
visit
Minimize 1D1 2D2
w1
D1
D2
b
w2
Unit 8
w3
Y.-W. Chang
46
Multiplier Adjustment: A Subgradient Approach
Step 1 :
new
i
old
i
k ( Di Dmax ),
where lim k 0, k
k
k 1
Step 2 : Project to the nearest feasible solution
Subgradient: An extension definition of gradient for
non-smooth functions.
Experience: Simple heuristic implementation can
achieve a very good convergence rate.
Unit 8
Y.-W. Chang
47
Convergence Sequence
Minimize
Max Delay
i Di ( w )
i 1
subject to L wi U , i 1..n
Any Feasible Maximum Delay =
Upper Bound
Optimal Solution
Lagrangian = Lower Bound
Weighted Delay <= Maximum Delay
# Iterations
Unit 8
Y.-W. Chang
48
Path Delay Formulation
d1
d2
Aa
Ab
D1
d3
D2
Ac
Aa d1 d 2 D1
Ab d1 d 2 D1
Ab d1 d 3 D2
Ac d 3 D2
Unit 8
Exponential growth
More accurate
Can exclude false paths
Y.-W. Chang
49
Stage Delay Formulation
d1
Aa
Ab
Ae
d2
D1
d3
D2
Ac
Aa d1 Ae
Ab d1 Ae
Ae d 2 D1
Ae d 3 D2
Polynomial size
Less accurate
Contains false paths
Ac d 3 D2
Unit 8
Y.-W. Chang
50
Both Multipliers Satisfy KCL (Flow Conservation)
Stage Based
43
Path Based
1
31
53
32
jinput ( i )
Unit 8
ji
ik
koutput ( i )
3
5
jinput ( i )
Y.-W. Chang
ji
ik
koutput ( i )
51
42
52
3,in 3,out
43 5331 32
41
51
Appendix A:
Shih and Chang
Fast timing-model independent clock-tree synthesis
DAC-10, TCAD-12
Unit 8
Y.-W. Chang
52
Introduction
Skew-minimized buffered clock-tree synthesis plays an
important role in VLSI designs for synchronous circuits
Due to the insufficient accuracy of timing models,
embedding simulation into synthesis becomes
inevitable
Runtime becomes prohibitively huge as design
complexity grows
merging
timing
timing
vdd
vss
insufficient accuracy
time-consuming
solution?
Unit 8
Y.-W. Chang
53
Symmetrical Structure
Skew is minimized by structural optimization
Buffering and wiring of all paths are almost the same
Is timing-model independent
Do not need simulation information
n4 (1,57)
n2
n1 (12,26)
(3,29)
n6 (33,29)
n3 (33,19)
n5 (5,1)
n7 (33,
9)
0ps skew (Elmore delay)
0.123ps skew (simulation)
Unit 8
n4
n1
n2
n5
snaking
n6
n'3 (21,23)
n7
0ps skew (Elmore delay)
0ps skew (simulation)
Y.-W. Chang
54
Problem Formulation
Problem: Buffered Clock-Tree Synthesis (BCTS)
Instance
Given a set of clock sinks, a slew-rate constraint,
and a library of buffers
Question
Unit 8
Construct a buffered clock tree to minimize its skew,
subject to no slew-rate violation
Y.-W. Chang
55
Symmetrical Clock Tree Synthesis
Specification
Number of branches, wirelength and inserted buffers are the
same at each level
Flow
Input
Branch-Number Planning
Tree Construction
Buffer Insertion
Assign specific branch numbers to
each tree level
Cluster sub-trees level by level bottomup
Lengthen shorter connection by
snaking
Insert identical buffers along trees
Output
Unit 8
Y.-W. Chang
56
Branch-Number Planning
Observation
Total branch number of some level equals the number of
preceding level times its branch number
The multiplication sequence forms a factorization
prime
total number of primes
Planning
Branch-Number Plan (BNP) is arranged in non-increasing order
level-1 branch number
Unit 8
Y.-W. Chang
57
Branch-Number Planning
Factorization may result in a big branch number,
implying a large fan-out size that could not be driven
Pseudo sinks are added to increase the total sink
number until all branch numbers are feasible
BNP = B(216) = < 3, 3, 3, 2, 2, 2 >
branches
3
Unit 8
BNP = B(212+3)
B(212+1)
B(212+2)
B(212)
B(212+4)
= <=53,
< 3,
71,
107,
43,
2,3,2
33,
5
2>>2, 2, 2 >
branches
3
3
3
2
2
3
3
2
pseudo sinks
sinks
Y.-W. Chang
2
2
58
Tree Construction
Achieve identical wirelength in this stage
Cluster sub-trees level by level bottom-up
Lengthen shorter connection by snaking
Flow
Partitioning
Embedding-Region
Construction
Root?
Y
Node Embedding
Unit 8
Divide sub-trees into desired clusters
Apply a common connection length to
each cluster, and locate potential
embedding positions to which snaked
wires can reach
Repeat the two stages till the
embedding region of the root is built
Find exact physical locations for nodes
and route wires top-down
Y.-W. Chang
59
Tilted Rectangular Region (TRR)
Represents potential embedding positions (embedding
region)
Is a 45- or 135-degree rectangular region
core: a 45- or 135-degree line segment
radius: the Manhattan distances from the core to the region
boundaries
Operation
Definition
Configuration
TRRi
extended TRR
core
Manhattan
distance
radius
Unit 8
extended
radius
TRRj
Y.-W. Chang
60
Partitioning
The objective is to minimize cluster diameter
Cluster diameter: the maximum distance among sub-trees
within the same cluster
Maximum cluster diameter is the upper bound of the common
connection length
Sub-trees are divided recursively along the BNP in a
top-down manner
Non-binary tree can also be handled by this technique
maximum
cluster diameter
Unit 8
Y.-W. Chang
61
Dividing: Cake Cutting
Borrow the idea of cake cutting, i.e., slicing a cake into
pieces from the center of the cake
Sort the polar angles of sub-trees relative to the
geometric center of the cluster
Apply dynamic programming to find the minimum cluster
diameter by restricting the dividing on this sorted order
Input Sinks
Polar-Angles Sorting
center point
Unit 8
Y.-W. Chang
Divided Result
cluster diameter
62
Recursive Dividing
For i-th level partitioning along the given BNP
<b1, b2, , bq>, dividing is performed recursively until
b1 x b2 x x bi-1 clusters are derived
Desired cluster diameter could be obtained since global
sub-tree distribution is considered throughout the whole
process
Recursive Dividing
Unit 8
Final Divided Result
Y.-W. Chang
Corresponding Clusters
63
Embedding-Region Construction
Assign the common connection length (CCL) as the half
length of the maximum cluster diameter
Extend the TRRs of children nodes and make
intersection to construct the embedding region of their
parents
Given Divided Result
Region Extension/Intersection
Resulting Regions
CCL
CCL
embedding
region
Unit 8
Y.-W. Chang
64
Node Embedding
Set the tree root as the closest position of the embedding
region w.r.t. the clock source
Propagate embedding information level by level top-down
Perform snaking to meet the uniform length, if necessary
Root Embedding
Level-1 Embedding
Level-2 Embedding
clock source
tree root
level -1
CCL
the closest
position
Unit 8
level -2
CCL
snaking
Y.-W. Chang
65
Pseudo-Sink Handling
For partitioning
Relax the sizes of clusters in a partition which can
differ by at most one for the first recursion
For embedding-region construction
Construct no embedding regions for pseudo sinks to
reserve the flexibility of snaking
For node embedding
Let the embedding regions of pseudo sinks cover
entire chip
Dangling wires can be identified and attached to proper
sub-trees successfully
Unit 8
Y.-W. Chang
66
Buffer Insertion
Align buffer distribution on the symmetrical tree
topology
Insert identical buffers level by level top-down
First-Time Insertion
Unit 8
Second-Time Insertion
Y.-W. Chang
Third-Time Insertion
67
Experimental Results on IBM Benchmarks
Our approach can obtain much smaller skews in much
shorter runtime than the state of the art, with marginal
overheads of snaking for symmetry
Shih et al.
Shih et al.
[ASPDAC10]
[ASPDAC10]
w/o simulation
w/ simulation
Circuit # sinks
skew usage runtime skew usage runtime
(ps)
(fF)
(s)
(ps)
(fF)
(s)
Ours
skew
(ps)
usage runtime
(fF)
(s)
r1
267
14.005 14001
5.012 15229
5126
1.510 13829
0.070
r2
598
16.012 28011
11
6.421 29234
7374
1.770 31056
0.280
r3
862
16.532 39123
26
5.611 41431
12739
2.310 44188
1.050
r4
1903 17.792 89312
165
5.418 91015
17871
2.540 98450
3.350
r5
3101 21.557 149875 498 7.028 156854 26045 3.010 171228 5.560
avg.
7.93 0.92 46.29 2.77 0.96 24343.13 1.00 1.00 1.00
comparison
Unit 8
More than 80000X faster than the ISPD-09 contest winners
(simulation-based methods)
Y.-W. Chang
68
Resulting Clock Tree: ispd09f22
Unit 8
Y.-W. Chang
69
Appendix B:
Liu and Chang
Floorplan and power/ground network co-synthesis
for fast design convergence
ISPD-06 (TCAD-07)
B
A
Unit 8
Y.-W. Chang
70
Floorplan & P/G Network Co-Synthesis
Liu and Chang, Floorplan and power/ground network
co-synthesis for fast design convergence, ISPD-06
(TCAD-07).
Apply the B*-tree floorplan representation and
simulated annealing (SA)
Analyze the P/G network (typical flow)
Circuit modeling
Global P/G network construction
P/G network modeling/reduction
P/G network evaluation (IR-drop computation)
Reduce floorplan solution space
Unit 8
Y.-W. Chang
71
Implementation of the Design Flow
Data preparation
Power profile
Power consumption data of the
modules generated by
PrimePower
Hierarchical circuit partition
Organize the design into hard
modules and soft modules
according to the hierarchy
Post-layout verification
AstroRail
Unit 8
Static cell-level P/G analysis
Y.-W. Chang
72
Simulated Annealing Process
Non-zero probability for up-hill
climbing:
Initialize B*-tree
and temperature T
T
p min 1, e
Perturb B*-tree
Perturbations (neighboring solutions) Pack B*-tree
Op1: Rotate a block
Construct
Update T
Op2: Move a node/block to
P/G network
another place
Update P/G
Evaluate cost
pitch Dpitch
Op3: Swap two nodes/blocks
Op4: Resize a soft block
N
Better ?
The cost function is based
Accept?
Y
N
on the floorplan cost and P/G
Y
Keep solution
network cost
Recover last
T is decreased every n cycles,
solution
Cool/Good
where n is proportional to the
enough?
N
number of blocks
Y
Unit 8
Y.-W. Chang
73
Cost Function
Cost function:
Wirelength
Area P/G cost
P/G Density
W A
A
2
pitch
D
W: Wirelength
A : Area
: P/G network cost (penalty of power integrity violation)
Dpitch: pitch of P/G network
Unit 8
Increasing power mesh density (reducing Dpitch) reduces
/
Update Dpitch by multiplying
avg
avg : Average P/G network cost at a temperature
: 0
1, a factor for adjusting the density of P/G networks
for higher P/G density and larger one for lower P/G
Smaller
density
Y.-W. Chang
74
Pitch Updating: An Example
0.02
At the beginning of SA, Dpitch = 2 and
During SA process, D pitch / avg D pitch
1000
100
/ avg 10
avg
1
0.1
Dpitch
Dpitch
1
0
SA process
-1
0.01
1
0.1
0.01
0.001
0.0001
0.00001
Temperature
Unit 8
/ avg
converges to 1 while temperature cools down
Y.-W. Chang
75
P/G Network Cost
: P/G network cost
EM cost
Bem
B
IR-drop cost
(1 )
pvi Pv
PiP
v pvi
Vlim, pi
, 0 1
Bem: set of branches violating electromigration constraints
B : total branches of the P/G mesh
vpvi: amount of the violation at the pin pvi
P : set of all P/G pins
Pv : set of violating P/G pins
Vlim,pi : IR-drop constraint of the P/G pin pi
Unit 8
Y.-W. Chang
76
P/G Network Construction
For each floorplan, we construct a uniform global P/G
network according to Dpitch
The number of trunks is defined by
round[width/Dpitch]+1 & round[height/Dpitch]+1
2X4 uniform P/G network is constructed
1+1 =2
Height
Floorplan
3+1 =4
Width
Calculate the P/G network dimension
Unit 8
Y.-W. Chang
77
P/G Network Modeling
Apply static analysis for fast P/G network evaluation
Use resistive P/G Model
Model P/G pins by current sources
Current value: maximum current drawn from P/G pins
Reduce circuit size
Connect current sources to nearest global trunk nodes
Power pad
Module
Global trunk
node
Power
trunk
Power strap
Unit 8
Power pin
Y.-W. Chang
78
P/G Network Modeling
Apply static analysis for fast P/G network evaluation
Use resistive P/G Model
Model P/G pins by current sources
Current value: maximum current drawn from P/G pins
Reduce circuit size
Connect current sources to nearest global trunk nodes
Power pad
Module
Reduced circuit
Power
trunk
Power strap
Unit 8
Power pin
Y.-W. Chang
79
Macro Current Modeling
Divide the floorplan into regions
For hard macros
Connect P/G pins to the nearest global trunk nodes
For soft macros (worst-case scenario)
Collect the largest current drawn by standard cells in the
overlapping area of the region and the soft macro
Hard module
The border line of the region is
defined by the center of the global
trunk nodes
Soft module
d/2 d d/2
Unit 8
Y.-W. Chang
80
Macro Current Modeling
Divide the floorplan into regions
For hard macros
Connect P/G pins to the nearest global trunk nodes
For soft macros (worst-case scenario)
Collect the largest current drawn by standard cells in the
overlapping area of the region and the soft macro
Assign current to the global
trunk nodes of the regions
Overlapping Area
d/2 d d/2
Unit 8
Y.-W. Chang
81
Soft Macro Modeling
Derive the largest current drawn by standard cells of the
overlapping area
Maximize the current of the overlapping area
Constraint: total standard cell area < the overlapping area
The problem is known as 0-1 Knapsack Problem (NP-complete)
Approximate it by Fractional Knapsack Algorithm
Assume standard cells can be broken into arbitrary smaller pieces
Rank cells by current to area ratio
Apply a greedy algorithm (complexity O(n lg n))
Standard Cells of the soft module
1mA 1mA
3mA
1mA 1mA
5mA
Overlapping
Area
Unit 8
4mA
Y.-W. Chang
82
Evaluation of P/G Network
The static analysis of a P/G network is formulated as
the following modified nodal analysis (MNA) formula:
Gx = i
G: conductance matrix (sparse positive definite matrix)
x: vector of node voltages
i: vector of current loads and voltage sources
Dimensions of G, i and x are equal to the number of nodes in
the P/G network
Solve the linear equation
Unit 8
Apply Preconditioned Conjugated Gradient (PCG) method
The time complexity is linear
Y.-W. Chang
83
Idea of Solution Space Reduction
The IR-drop of a P/G pin is proportional to the effective
resistance between the P/G pin and the power pad
The closer the P/G pin is placed to the power pad, the smaller
the IR-drop
A technique to reduce solution space
Place the modules consuming larger current (power-hungry
modules) near the boundary of the floorplan
Place power pads close to them
10 3.4V
1.9V
0.6V
3.4V
2.8V
2.5V
2.4V
100mA
30mA
20mA
10mA
5V
5V
10mA
Unit 8
10
-0.4V
20mA
30mA 100mA
Y.-W. Chang
84
B*tree Boundary Properties
Bottom boundary modules: the leftmost branch
Left-boundary condition
Left boundary modules: the rightmost branch
Right-boundary condition
Right boundary modules: the bottom-left branch
Top-boundary condition
Unit 8
Top boundary modules: bottom-right branch
Y.-W. Chang
85
Power-Hungry Modules Handling
Power-Hungry Modules
Are clustered and restricted to satisfy the boundary
property during B*-tree perturbation
P/G pads are placed near these modules
Clustered modules
Unit 8
Y.-W. Chang
86
Results on OpenRISC1200
Improve on runtime and max IR-drop with little overheads
on delay & wirelength (UMC 0.18 um technology)
OpenRISC1200
*Astro
Flow
*Astro w/
IR-drop
Driven
Placement
Our
Improv.
vs.
Astro w/
IR-drop
Our
Flow
Die Area (mm2)
3.86
3.86
3.33
15.9%
Utilization (%)
62
62
72
13.9%
Wirelength (m)
1655463 1539125
1540172 -0.1%
Avg. Delay (ns)
8.62
8.54
8.55
-0.1%
Max IR-drop (mv) 80.18
78.20
55.14
41.8%
CPU Runtime (s) 505
346
135
2.56X
Iterations
*Need iterative and manual P/G network fix
Unit 8
Y.-W. Chang
87
Resulting Voltage Map
Astro design flow
Our design flow
Power-hungry blocks (register files A&B)
are placed far away from the power pad
Power-hungry blocks are
placed beside the power pad
B
A
Unit 8
Y.-W. Chang
88