0% found this document useful (0 votes)

757 views88 pages

Physical Design Essentials

The document discusses special net routing and performance optimization in integrated circuits. It covers topics like clock net routing, power/ground routing, and performance optimization. For clock net routing, it describes algorithms like H-tree, MMM, geometric matching, exact zero skew, and deferred merge embedding that aim to minimize clock skew. It also discusses Elmore delay model and techniques for buffering and sizing wires to optimize skew. For power routing, the document outlines constructing power/ground trees and sizing wires to prevent IR drops and electromigration failures. It formulates the power routing problem and describes using equivalent circuit models and linear programming to optimize the design.

Uploaded by

entercode

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

757 views88 pages

Physical Design Essentials

Uploaded by

entercode

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 88

Unit 8: Special Net Routing & Performance Optimization

Course contents:

Clock net routing

Power/ground routing
Performance optimization

Readings

Unit 8

W&C&C: Chapter 13
S&Y: Chapter 7

Y.-W. Chang

The Clock Routing Problem

Digital systems
Synchronous systems: Highly precise clock achieves
communication and timing.
Asynchronous systems: Handshake protocol achieves the
timing requirements of the system.
Clock skew: the difference in the minimum and the maximum
arrival times of the clock.

Clock routing: Routing clock nets such that

1.
2.

Unit 8

clock signals arrive simultaneously

clock delay is minimized
Other issues: total wirelength, power consumption
Y.-W. Chang

Clock Routing

Given the routing plane and a set of points

P = {p1, p2, , pn} within the plane and clock entry
point p0 on the boundary of the plane, the Clock
Routing Problem is to interconnect each pi P such
that maxi, j P|t(0, i) - t(0, j)| and maxi P t(0, i) are both
minimized.
p4
p1
p0

p6
p5

p3
Clock-tree synthesis (CTS): make the clock nets a tree
Unit 8

Y.-W. Chang

Clock Routing Algorithms

Pathlength-based Clock-Tree Synthesis (CTS)

1. H-tree: Dhar, Franklin, Wang, ICCD-84; Fisher & Kung, 1982.
2. Methods of means & medians (MMM): Jackson, Srinivasan,
Kuh, DAC-90.
3. Geometric matching: Cong, Kahng, Robins, DAC-91.

RC-delay based CTS

1. Exact zero skew: Tsay, ICCAD-91.
2. Deferred-merge embedding (DME) algorithm: Boese & Kahng,

ASICON-92; Chao & Hsu & Ho, DAC-92; Edahiro, NEC R&D, 1991.

3. Lagrangian relaxation: Chen, Chang, Wong, DAC-96.

Simulation-based CTS

ISPD-09 CTS contest (ASP-DAC-10, DATE-10)

Timing-model independent CTS

Shih & Chang, DAC-10; Shih et al., ICCAD-10.

Mesh-based & tree-link-based clock routing

Unit 8

Y.-W. Chang

H-Tree Based Algorithm

H-tree: Dhar, Franklin, Wang, Reduction of clock

delays in VLSI structure, ICCD-1984.

Similar topology: X-tree

Unit 8

Y.-W. Chang

The MMM Algorithm

Jackson, Sirinivasan, Kuh, Clock routing for high-performance
ICs, DAC-1990.
Each block pin is represented as a point in the region, S.
The region is partitioned into two subregions, SL and SR.
The center of mass is computed for each subregion.
The center of mass of the region S is connected to each of the
centers of mass of subregion SL and SR.
The subregions SL and SR are then recursively split in Y-direction.
Steps 2--5 are repeated with alternate splitting in X- and Ydirection.
Time complexity: O(n log n).

Unit 8

Y.-W. Chang

The Geometric Matching Algorithm

Cong, Kahng, Robins, Matching based models for highperformance clock routing, IEEE TCAD, 1993.
Clock pins are represented as n nodes in the clock tree (n = 2k).
Each node is a tree itself with clock entry point being node itself.
The minimum cost matching on n points yields n/2 segments.
The clock entry point in each subtree of two nodes is the point on
the segment such that length of both sides is same.
Above steps are repeated for each segment.
Apply H-flipping to further reduce clock skew (and to handle edges
intersection).
Time complexity: O(n2 log n).

Unit 8

Y.-W. Chang

Elmore Delay: Nonlinear Delay Model

Parasitic resistance and capacitance dominate delay in deep

submicron wires.
Resistor ri must charge all downstream capacitors.
Elmore delay: Delay can be approximated as sum of sections:
resistance downstream capacitance.

Delay grows as square of wire length.

Cannot apply to the delay with inductance consideration, which is
important in high-performance design.

Unit 8

Y.-W. Chang

Wire Models
Lumped circuit approximations for distributed RC lines: -model
(most popular), T-model, L-model.

-model: If no capacitive loads for C and D,

A to B: AB = r1 (c1/2 + c2 + c3);
B to C: BC = r2 (c2/2);
B to D: BD = r3 (c3/2).

Unit 8

Y.-W. Chang

Example Elmore Delay Computation

0.18 m technology: unit resistance = 0.075 / m; unit

capacitance = 0.118 fF/m.

Assume CC = 2 fF, CD = 4 fF.
BC = rBC (cBC / 2 + CC) = 0.075 150 (17.7/2 + 2) = 120 fs
BD = rBD (cBD / 2 + CD) = 0.075 200 (23.6/2 + 4) = 240 fs
AB = rAB (cAB/2 + CB) = 0.075 100 (11.8/2 + 17.7 + 2 + 23.6
+ 4) = 400 fs
Critical path delay: AB + BD = 640 fs.

Unit 8

Y.-W. Chang

Exact Zero Skew Algorithm

Tsay, Exact zero skew algorithm, ICCAD-91.
To ensure the delay from the tapping point to leaf nodes of subtrees T1
and T2 being equal, it requires that
r1 (c1/2 + C1) + t1 = r2 (c2/2 + C2) + t2.
Solving the above equation, we have

where and are the per unit values of resistance and capacitance,
l the length of the interconnecting wire, r1 = xl, c1 = xl, r2 = (1 - x)l,
c2 = (1 - x)l.

Unit 8

Y.-W. Chang

Zero-Skew Computation
Balance delays: r1(c1/2 + C1) + t1 = r2 (c2/2 + C2) + t2.
(): per

Compute tapping points:

unit values of resistance (capacitance); l: length of the wire;

r1 = xl, c1 = x l; r2 = (1 - x) l, c2 = (1 - x) l.

If x [0, 1], we need snaking to find the tapping point.

Exp: = 0.1 /unit, = 0.2 F/unit (tapping points: E, F, G)
merging segment

Unit 8

merging segment
Y.-W. Chang

Deferred Merge Embedding (DME)

Boese & Kahng, ASICON-92; Chao & Hsu & Ho, DAC92; Edahiro, NEC R&D, 1991
Consists of two stages: bottom-up + top-down
Bottom-up: Build the potential embedding locations of
clock sinks (i.e., a segment for potential tapping points)
Top-down: Determine exact locations for the embedding

Unit 8

Y.-W. Chang

Delay Computation for Buffered Wires

Wire: = 0.068 /m, = 0.118 fF/ m2; buffer: ' = 180 / unit
size, ' = 23.4 fF/unit size; driver resistance Rd = 180 ; unit-sized
wire, buffer.

Unit 8

Y.-W. Chang

Buffering and Wire Sizing for Skew Minimization

Discrete wire/buffer sizes: dynamic programming

Chung & Cheng, Skew sensitivity minimization of buffered

clock tree, ICCAD-94.

Continuous wire/buffer sizes: mathematical

programming (e.g., Lagrangian relaxation)

Unit 8

Chen, Chang, Wong, Fast performance-driven optimization for

buffered clock trees based on Lagrangian relaxation, DAC-96.
Considers clock skew, area, delay, power, clock-skew
sensitivity simultaneously.

Y.-W. Chang

Clock Meshes

More alternative paths to clock sinks

Good for high-performance circuits with stringent skew and

variation constraints

Drive mesh from the boundary or from grid points

H-tree is a good candidate to drive mesh

Alpha 21264 processor [Bailey et al. 1998]

Unit 8

IBM Power4 processor [Anderson et al. 2001]

Y.-W. Chang

Power Integrity: IR (Voltage) Drop

Power consumption and rail parasitics cause actual supply

voltage to be lower than ideal
Metal width tends to decrease with length increasing in
nanometer design

Effects of IR drop

Reducing voltage supply reduces circuit speed (5% IR drop =>

15% delay increase)
Reduced noise margin may cause functional failures
2.23V

1.46V

1.8V

violation
HM1

HM1
SM1

HM2

HM3

SM2

Unit 8

HM1

SM1
HM2

2.89V

SM2 HM3

HM3

SM2

HM2

SM1

2.95V
Y.-W. Chang

Power/Ground (P/G) Routing

Are usually laid out entirely on metal layers for

smaller parasitics.
Two steps:
1. Construction of interconnection topology: non-crossing
power, ground trees.
2. Determination of wire widths: prevent metal migration,
keep voltage (IR) drop small, widen wires for more powerconsuming modules and higher density current (1 mA / m2
at 25 oC for 0.18 m technology). (So area metric?)

Unit 8

Y.-W. Chang

Power/Ground Network Optimization

Use the minimum amount of chip area for wiring P/G networks
while avoiding potential reliability failures due to electromigration
and excessive IR drops.
Tan and Shi, Fast power/ground network optimization based on
equivalent circuit modeling, DAC-2001.
Build the equivalent models for series resistors and apply a
sequence of the linear programming (SLP) method to solve the
problem.
Size wire segments assuming the topologies of P/G networks
to be fixed.
Wu and Chang, Efficient power/ground network analysis for power
integrity driven design methodology, DAC-2004.
Liu and Chang, Floorplan and power/ground co-synthesis for fast
design convergence, ISPD-06 (TCAD-07).

Unit 8

Y.-W. Chang

Problem Formulation
Let G = {N, B} be a P/G network with n nodes N = {1, , n} and b
branches B = {1, , b}; branch i connects two nodes: i1 and i2 with
current flowing from i1 to i2.

Let li and wi be the length and width of branch i, respectively. Let

be the sheet resistivity. Then the resistance ri of branch i
is ri

Vi1 vi 2
li
.
Ii
wi

Total P/G routing area is as follows:

i1
wi
li

P/G network optimization is to minimize f(V, I) subject to the

constraints listed in the next slide.

Relax the nonlinear objective function and then translate the

constrained nonlinear programming problem into a SLP problem.
Unit 8

Y.-W. Chang

Constraints
The voltage IR drop constraints.
Vi Vh, min for power networks.
Vi Vl , max for ground networks.
The minimum width constraints: wi

liIi
wi , min
Vi1 Vi 2

The electro-migration constraints: Ii/wi => Vi Vi li

is a constant for a particular routing layer with a fixed

thickness.

Equal width constraints: wi wj or vi1 vi 2 vj1 vj 2

liIi
ljIj
Kirchoff s current law (KCL):

I 0
i

iB ( j )

For each node j = {1, , n}, B(j) is the set of indices of

branches connecting to node j.

Unit 8

Y.-W. Chang

Reducing the Problem Size with Equivalent Circuits

Consider a series resistor chain commonly seen in the P/G
network below.

Equivalent circuit

Series resistor chain

The equivalent
resistor Rs is just the sum of all the resistors in
n 1
series, Rs

R.
i

i 1

By superposition, the equivalent currents Ie1, and Ien can be

computed as follows:

Unit 8

Y.-W. Chang

Equivalent Circuit (contd)

The voltages at the intermediate nodes are calculated

based on superposition as follows:

Ri
Vi 1 Vi Vs RiIei
Rs
Iei 1 Iei Ii

Equivalent circuit

Series resistor chain

Unit 8

Y.-W. Chang

Equivalent Circuit Example

Unit 8

Y.-W. Chang

Design Methodology Evolution

IR-drop aware design methodology for faster design
convergence
Floorplanning
Floorplanning
IR-drop Analysis
P&R

no
iterative loop

RC Extraction
Simulation
SI Analysis

Traditional flow
Unit 8

IR-drop Analysis

OK
yes

P&R

P&R
RC Extraction
RC Extraction
Simulation
Simulation
SI Analysis

OK
iterative loop
yes

Floorplanning

SI Analysis

DAC-04 flow
(Wu & Chang)
Y.-W. Chang

ISPD-06 (TCAD-07) flow

(Liu & Chang)
25

Ideal Scaling of MOS Transistors

Feature size scales down by S times:

Unit 8

Y.-W. Chang

Ideal Scaling of Interconnections

Feature size scales down by S times:

Unit 8

Y.-W. Chang

Techniques for Higher Performance

In very deep submicron technology, interconnect delay

dominates circuit performance.
Techniques for higher performance

Unit 8

SOI: lower gate delay.

Copper interconnect: lower resistance.
Dielectric with lower permittivity: lower capacitance.
Buffering: Insert (and size) buffers to break a long
interconnection into shorter ones.
Wire sizing: Widen wires to reduce resistance (careful for
capacitance increase).
Shielding: Add/order wires to reduce capacitive and inductive
coupling.
Spacing: Widen wire spacing to reduce coupling.
Others: padding, track permutation, net ordering, etc.

Y.-W. Chang

Interconnect Dominates Circuit Performance!!

Worst-case
interconnect
delay due
to crosstalk

Delay (ps)

50
40
30

Interconnect
delay

20
10

Gate delay
650

500

350

250

180

150

100

70 (nm)

Technology Node

In 0.18m wire-to-wire
capacitance dominates (CW>>CS)
Unit 8

Y.-W. Chang

CW
29

Optimal Buffer Sizing w/o Considering Interconnects

Delay through each stage is tmin, where tmin is the average delay
through any inverter driving an identically sized inverter.

n = CL/Cg n = ln (CL/Cg)/ln , where CL is the capacitive load

and Cg the capacitance of the minimum size inverter.

Total delay

Optimal stage ratio:

Optimal delay:
Buffer sizes are exponentially tapered ( = e).

Unit 8

Y.-W. Chang

Wire Sizing
Wire length is determined by layout architecture, but we can
choose wire width to minimize delay.
Wire width can vary with distance from driver to adjust the
resistance which drives downstream capacitance.
Wire with minimum delay has an exponential taper.
Can approximate optimal tapering with segments of a few
widths.
Recent research claims that buffering is more effective than
wire sizing for optimizing delay, and two wire widths are
sufficient for area/delay trade-off.

Unit 8

Y.-W. Chang

Optimal Wire-Sizing Function

Suppose a wire of length L is partitioned into n equal-length wire segments,
each of length x = L/n; unit resistance and capacitance: ,
.
The respective resistance and capacitance of i-th wire segment can be
approximated by x / f(xi) and x f(xi), where f(xi) is the width at
position xi.
Elmore delay:

As n , Dn D:
Optimal wire sizing function f(x) = ae-bx, where

Unit 8

x
Y.-W. Chang

Simultaneous Wire & Buffer Sizing

Input: Wire length L, driver resistance Rd, load capacitance CL,

unit wire area capacitance c0, unit wire fringing capacitance cf, unitsized wire resistance r0, unit-size capacitance of a buffer cb, unitsize buffer resistance rb, intrinsic buffer delay Tin, and the number
of buffers N.
Objective: Determine the stage ratio for buffer sizes and the
stage ratio for wire widths such that the wire delay is minimized.

Unit 8

Y.-W. Chang

Wire/Buffer Size Ratios for Delay Optimization

Chang, Chang, Jiang, ISQED-2002.

In practice, the delay of a wire DN(, ) is a convex function of the

stage ratio for practical buffer sizes and the stage ratio for
practical wire widths.
Can apply efficient search techniques (e.g., binary search) to find
the optimum ratios.

Unit 8

Y.-W. Chang

Performance Optimization: A Sizing Problem

Minimize the maximum delay Dmax by changing w1,,wn

Minimize Dmax
subject to Di (

w) D

max

, i 1..m

L wi U , i 1..n

Unit 8

w11

D1<Dmax

w10
b

w3
Y.-W. Chang

D2<Dmax

Popular Sizing Works

Algorithmic approaches: faster, non-optimal for general problems
TILOS (Fishburn, Dunlop, ICCAD-85)

Weighted Delay Optimization (Cong et al., ICCAD-95)

Traditional mathematical programming: often slower, optimal

Geometric Programming (TILOS)

Augmented Lagrangian (Marple et al., 86)

Sequential Linear Programming (Sapatnekar et al.)

Interior Point Method (Sapatnekar et al., TCAD-93)

Sequential Quadratic Programming (Menezes et al., DAC-95)

Augmented Lagrangian + Adjoin Sensitivity (Visweswariah, et

al., ICCAD-96, ICCAD-97)

Lagrangian relaxation based mathematical

programming: (Chen, Chang, Wong, DAC-96; Jiang,

Chang, Jou, DAC-99 [TCAD, Sept. 2000]; and many
more)

Unit 8

Fast and optimal

Y.-W. Chang

TILOS: Heuristic Approach

Finds sensitivities associated with each gate

Up-sizes the gate with the maximum sensitivity
Minimizes the objective function
Minimize Dmax
w9

Unit 8

w11

D1<Dmax

w10
b

w3
Y.-W. Chang

D2<Dmax

Weighted Delay Optimization

Cong, et. al., ICCAD-95

Sizes one wire at a time in the DFS order
Minimize the weighted delay
Best weights?
Minimize 1D1 2D2
Driver

Loads

1 D1

2 D2

w4
Unit 8

Y.-W. Chang

w5
38

From Mathematical Prog. to Lagrangian Relaxation

min
st

cx
Axb
xX

Mathematical

formulation

Unit 8

Posynomial
forms

Positive coefficient
polynomials

Y.-W. Chang

min
st

L()=cx + (Ax-b)
xX

Lagrange multipliers

Mathematical Programming
Formulation:

Minimize f ( x)
subject to g i ( x) 0, i 1..m

m
Lagrangian: L( ) f ( x) g ( x), where 0
i i
i
i 1

Optimality (Necessary) Condition (Kuhn-Tucker theorem):

m
L( )
0 f ( x ) g ( x ) 0
i i
xi
i 1

g ( x) 0 (Complemen tary Condition)

i i
g ( x) 0, 0 (Feasibility Condition)
i
i

Unit 8

Y.-W. Chang

Lagrangian Relaxation
Minimize f ( x)

LRS

Minimize f ( x) i gi ( x)
i 1

subject to g i ( x) 0, i 1..n

subject to gi ( x) 0, i n 1..m

g i ( x) 0, i n 1..m

LRS (Lagrangian Relaxation Subproblem)

There exist Lagrangian multipliersthat lead LRS to
the optimal solution for convex programming

When f(x), gi(x)s are all positive polynomials

(posynomials)

The optimal solution for any LRS is a lower bound

of the original problem

Unit 8

Y.-W. Chang

Lagrangian Relaxation
Minimize Dmax
subject to Di (

w) D

max

, i 1..m

L wi U , i 1..n
Lagrangian Relaxation
m

Minimize Dmax i ( Di (

w) D

i 1

max

subject to L wi U , i 1..n
m
L
By
0 , we have i 1
Dmax
i 1

Minimize

D (w )
i 1

subject to L wi U , i 1..n
Unit 8

Y.-W. Chang

Lagrangian Relaxation
Lagrangian
Relaxation
Augmented
Lagrangian

Weighted
Delay

SQP
TILOS
SLP
Algorithmic
approaches
Unit 8

Mathematical
Programming
Y.-W. Chang

Lagrangian Relaxation Framework

Update Multipliers
Weighted Delay
Optimization

Converge?

Yes

done
Unit 8

Y.-W. Chang

Lagrangian Relaxation Framework

More Critical -> More Resource -> Larger Weight

D1
D2

1 2

1 2
Dmax
Unit 8

D1 D2

Dmax
Y.-W. Chang

D1 D2

Weighted Minimization

Traverse the circuit in the topological order

Resize each component to minimize Lagrangian during
visit

Minimize 1D1 2D2

b
w2
Unit 8

w3
Y.-W. Chang

Multiplier Adjustment: A Subgradient Approach

Step 1 :

new
i

old
i

k ( Di Dmax ),

where lim k 0, k
k

k 1

Step 2 : Project to the nearest feasible solution

Subgradient: An extension definition of gradient for
non-smooth functions.
Experience: Simple heuristic implementation can
achieve a very good convergence rate.

Unit 8

Y.-W. Chang

Convergence Sequence
Minimize
Max Delay

i Di ( w )

i 1

subject to L wi U , i 1..n
Any Feasible Maximum Delay =
Upper Bound
Optimal Solution
Lagrangian = Lower Bound
Weighted Delay <= Maximum Delay
# Iterations

Unit 8

Y.-W. Chang

Path Delay Formulation

Aa
Ab

D1
d3
D2

Ac
Aa d1 d 2 D1
Ab d1 d 2 D1
Ab d1 d 3 D2
Ac d 3 D2

Unit 8

Exponential growth
More accurate
Can exclude false paths

Y.-W. Chang

Stage Delay Formulation

d1
Aa
Ab

d2
D1
d3
D2

Ac
Aa d1 Ae
Ab d1 Ae
Ae d 2 D1
Ae d 3 D2

Polynomial size
Less accurate
Contains false paths

Ac d 3 D2
Unit 8

Y.-W. Chang

Both Multipliers Satisfy KCL (Flow Conservation)

Stage Based
43

Path Based
1

jinput ( i )

Unit 8

ik
koutput ( i )

3
5

jinput ( i )

Y.-W. Chang

ik
koutput ( i )

51
42
52

3,in 3,out

43 5331 32

Appendix A:
Shih and Chang
Fast timing-model independent clock-tree synthesis
DAC-10, TCAD-12

Unit 8

Y.-W. Chang

Introduction

Skew-minimized buffered clock-tree synthesis plays an

important role in VLSI designs for synchronous circuits
Due to the insufficient accuracy of timing models,
embedding simulation into synthesis becomes
inevitable
Runtime becomes prohibitively huge as design
complexity grows
merging
timing

timing

vdd

vss
insufficient accuracy

time-consuming

solution?
Unit 8

Y.-W. Chang

Symmetrical Structure

Skew is minimized by structural optimization

Buffering and wiring of all paths are almost the same

Is timing-model independent
Do not need simulation information
n4 (1,57)

n1 (12,26)

(3,29)

n6 (33,29)
n3 (33,19)

n5 (5,1)

n7 (33,

0ps skew (Elmore delay)

0.123ps skew (simulation)
Unit 8

n4
n1

snaking
n6
n'3 (21,23)
n7

0ps skew (Elmore delay)

0ps skew (simulation)

Y.-W. Chang

Problem Formulation

Problem: Buffered Clock-Tree Synthesis (BCTS)

Instance

Given a set of clock sinks, a slew-rate constraint,

and a library of buffers

Question

Unit 8

Construct a buffered clock tree to minimize its skew,

subject to no slew-rate violation

Y.-W. Chang

Symmetrical Clock Tree Synthesis

Specification

Number of branches, wirelength and inserted buffers are the

same at each level

Flow
Input

Branch-Number Planning

Tree Construction

Buffer Insertion

Assign specific branch numbers to

each tree level
Cluster sub-trees level by level bottomup
Lengthen shorter connection by
snaking
Insert identical buffers along trees

Output

Unit 8

Y.-W. Chang

Branch-Number Planning

Observation

Total branch number of some level equals the number of

preceding level times its branch number
The multiplication sequence forms a factorization

prime

total number of primes

Planning

Branch-Number Plan (BNP) is arranged in non-increasing order

level-1 branch number

Unit 8

Y.-W. Chang

Branch-Number Planning

Factorization may result in a big branch number,

implying a large fan-out size that could not be driven
Pseudo sinks are added to increase the total sink
number until all branch numbers are feasible
BNP = B(216) = < 3, 3, 3, 2, 2, 2 >
branches
3

Unit 8

BNP = B(212+3)
B(212+1)
B(212+2)
B(212)
B(212+4)
= <=53,
< 3,
71,
107,
43,
2,3,2
33,
5
2>>2, 2, 2 >
branches
3

3
3
2
2

3
3
2

pseudo sinks

sinks
Y.-W. Chang

2
2

Tree Construction

Achieve identical wirelength in this stage

Cluster sub-trees level by level bottom-up

Lengthen shorter connection by snaking

Flow
Partitioning

Embedding-Region
Construction
Root?
Y
Node Embedding

Unit 8

Divide sub-trees into desired clusters

Apply a common connection length to
each cluster, and locate potential
embedding positions to which snaked
wires can reach
Repeat the two stages till the
embedding region of the root is built
Find exact physical locations for nodes
and route wires top-down

Y.-W. Chang

Tilted Rectangular Region (TRR)

Represents potential embedding positions (embedding

region)
Is a 45- or 135-degree rectangular region
core: a 45- or 135-degree line segment
radius: the Manhattan distances from the core to the region
boundaries
Operation
Definition
Configuration

TRRi

extended TRR
core

Manhattan
distance
radius

Unit 8

extended
radius

TRRj
Y.-W. Chang

Partitioning

The objective is to minimize cluster diameter

Cluster diameter: the maximum distance among sub-trees

within the same cluster
Maximum cluster diameter is the upper bound of the common
connection length

Sub-trees are divided recursively along the BNP in a

top-down manner
Non-binary tree can also be handled by this technique
maximum
cluster diameter

Unit 8

Y.-W. Chang

Dividing: Cake Cutting

Borrow the idea of cake cutting, i.e., slicing a cake into

pieces from the center of the cake
Sort the polar angles of sub-trees relative to the
geometric center of the cluster
Apply dynamic programming to find the minimum cluster
diameter by restricting the dividing on this sorted order
Input Sinks

Polar-Angles Sorting

center point
Unit 8

Y.-W. Chang

Divided Result

cluster diameter
62

Recursive Dividing

For i-th level partitioning along the given BNP

<b1, b2, , bq>, dividing is performed recursively until
b1 x b2 x x bi-1 clusters are derived
Desired cluster diameter could be obtained since global
sub-tree distribution is considered throughout the whole
process
Recursive Dividing

Unit 8

Final Divided Result

Y.-W. Chang

Corresponding Clusters

Embedding-Region Construction

Assign the common connection length (CCL) as the half

length of the maximum cluster diameter
Extend the TRRs of children nodes and make
intersection to construct the embedding region of their
parents
Given Divided Result

Region Extension/Intersection

Resulting Regions

CCL

embedding
region
Unit 8

Y.-W. Chang

Node Embedding

Set the tree root as the closest position of the embedding

region w.r.t. the clock source
Propagate embedding information level by level top-down
Perform snaking to meet the uniform length, if necessary

Root Embedding

Level-1 Embedding

Level-2 Embedding

clock source
tree root

level -1
CCL

the closest
position
Unit 8

level -2
CCL
snaking

Y.-W. Chang

Pseudo-Sink Handling

For partitioning
Relax the sizes of clusters in a partition which can
differ by at most one for the first recursion
For embedding-region construction
Construct no embedding regions for pseudo sinks to
reserve the flexibility of snaking
For node embedding
Let the embedding regions of pseudo sinks cover
entire chip
Dangling wires can be identified and attached to proper
sub-trees successfully

Unit 8

Y.-W. Chang

Buffer Insertion

Align buffer distribution on the symmetrical tree

topology
Insert identical buffers level by level top-down
First-Time Insertion

Unit 8

Second-Time Insertion

Y.-W. Chang

Third-Time Insertion

Experimental Results on IBM Benchmarks

Our approach can obtain much smaller skews in much

shorter runtime than the state of the art, with marginal
overheads of snaking for symmetry
Shih et al.
Shih et al.
[ASPDAC10]
[ASPDAC10]
w/o simulation
w/ simulation
Circuit # sinks
skew usage runtime skew usage runtime
(ps)

(fF)

(s)

(ps)

(fF)

(s)

Ours
skew
(ps)

usage runtime
(fF)

(s)

267

14.005 14001

5.012 15229

5126

1.510 13829

0.070

598

16.012 28011

6.421 29234

7374

1.770 31056

0.280

862

16.532 39123

5.611 41431

12739

2.310 44188

1.050

1903 17.792 89312

165

5.418 91015

17871

2.540 98450

3.350

3101 21.557 149875 498 7.028 156854 26045 3.010 171228 5.560
avg.
7.93 0.92 46.29 2.77 0.96 24343.13 1.00 1.00 1.00
comparison

Unit 8

More than 80000X faster than the ISPD-09 contest winners

(simulation-based methods)
Y.-W. Chang

Resulting Clock Tree: ispd09f22

Unit 8

Y.-W. Chang

Appendix B:
Liu and Chang
Floorplan and power/ground network co-synthesis
for fast design convergence
ISPD-06 (TCAD-07)

B
A

Unit 8

Y.-W. Chang

Floorplan & P/G Network Co-Synthesis

Liu and Chang, Floorplan and power/ground network

co-synthesis for fast design convergence, ISPD-06
(TCAD-07).
Apply the B*-tree floorplan representation and
simulated annealing (SA)
Analyze the P/G network (typical flow)
Circuit modeling
Global P/G network construction
P/G network modeling/reduction
P/G network evaluation (IR-drop computation)
Reduce floorplan solution space

Unit 8

Y.-W. Chang

Implementation of the Design Flow

Data preparation
Power profile

Power consumption data of the

modules generated by
PrimePower

Hierarchical circuit partition

Organize the design into hard

modules and soft modules
according to the hierarchy

Post-layout verification
AstroRail

Unit 8

Static cell-level P/G analysis

Y.-W. Chang

Simulated Annealing Process

Non-zero probability for up-hill
climbing:

Initialize B*-tree

and temperature T
T

p min 1, e

Perturb B*-tree
Perturbations (neighboring solutions) Pack B*-tree
Op1: Rotate a block
Construct
Update T
Op2: Move a node/block to
P/G network
another place
Update P/G
Evaluate cost
pitch Dpitch
Op3: Swap two nodes/blocks
Op4: Resize a soft block
N
Better ?
The cost function is based
Accept?
Y
N
on the floorplan cost and P/G
Y
Keep solution
network cost
Recover last
T is decreased every n cycles,
solution
Cool/Good
where n is proportional to the
enough?
N
number of blocks

Y
Unit 8

Y.-W. Chang

Cost Function
Cost function:
Wirelength

Area P/G cost

P/G Density

W A

A
2
pitch

D
W: Wirelength
A : Area
: P/G network cost (penalty of power integrity violation)
Dpitch: pitch of P/G network

Unit 8

Increasing power mesh density (reducing Dpitch) reduces

/
Update Dpitch by multiplying
avg
avg : Average P/G network cost at a temperature
: 0
1, a factor for adjusting the density of P/G networks

for higher P/G density and larger one for lower P/G
Smaller
density

Y.-W. Chang

Pitch Updating: An Example

0.02
At the beginning of SA, Dpitch = 2 and
During SA process, D pitch / avg D pitch
1000

100

/ avg 10

avg

1
0.1

Dpitch

1
0

SA process

-1

0.01
1

0.1

0.01

0.001

0.0001

0.00001

Temperature
Unit 8

/ avg

converges to 1 while temperature cools down

Y.-W. Chang

P/G Network Cost

: P/G network cost

EM cost

Bem
B

IR-drop cost

(1 )

pvi Pv

PiP

v pvi

Vlim, pi

, 0 1

Bem: set of branches violating electromigration constraints

B : total branches of the P/G mesh
vpvi: amount of the violation at the pin pvi
P : set of all P/G pins
Pv : set of violating P/G pins
Vlim,pi : IR-drop constraint of the P/G pin pi
Unit 8

Y.-W. Chang

P/G Network Construction

For each floorplan, we construct a uniform global P/G

network according to Dpitch

The number of trunks is defined by

round[width/Dpitch]+1 & round[height/Dpitch]+1
2X4 uniform P/G network is constructed
1+1 =2

Height

Floorplan

3+1 =4

Width

Calculate the P/G network dimension

Unit 8

Y.-W. Chang

P/G Network Modeling

Apply static analysis for fast P/G network evaluation
Use resistive P/G Model
Model P/G pins by current sources

Current value: maximum current drawn from P/G pins

Reduce circuit size

Connect current sources to nearest global trunk nodes

Power pad

Module

Global trunk
node

Power
trunk
Power strap
Unit 8

Power pin
Y.-W. Chang

P/G Network Modeling

Apply static analysis for fast P/G network evaluation
Use resistive P/G Model
Model P/G pins by current sources

Current value: maximum current drawn from P/G pins

Reduce circuit size

Connect current sources to nearest global trunk nodes

Power pad

Module

Reduced circuit

Power
trunk
Power strap
Unit 8

Power pin

Y.-W. Chang

Macro Current Modeling

Divide the floorplan into regions

For hard macros

Connect P/G pins to the nearest global trunk nodes

For soft macros (worst-case scenario)

Collect the largest current drawn by standard cells in the

overlapping area of the region and the soft macro
Hard module
The border line of the region is
defined by the center of the global
trunk nodes
Soft module

d/2 d d/2
Unit 8

Y.-W. Chang

Macro Current Modeling

Divide the floorplan into regions

For hard macros

Connect P/G pins to the nearest global trunk nodes

For soft macros (worst-case scenario)

Collect the largest current drawn by standard cells in the

overlapping area of the region and the soft macro
Assign current to the global
trunk nodes of the regions

Overlapping Area
d/2 d d/2
Unit 8

Y.-W. Chang

Soft Macro Modeling

Derive the largest current drawn by standard cells of the

overlapping area

Maximize the current of the overlapping area

Constraint: total standard cell area < the overlapping area
The problem is known as 0-1 Knapsack Problem (NP-complete)

Approximate it by Fractional Knapsack Algorithm

Assume standard cells can be broken into arbitrary smaller pieces

Rank cells by current to area ratio
Apply a greedy algorithm (complexity O(n lg n))
Standard Cells of the soft module
1mA 1mA

3mA

1mA 1mA
5mA

Overlapping
Area
Unit 8

4mA
Y.-W. Chang

Evaluation of P/G Network

The static analysis of a P/G network is formulated as

the following modified nodal analysis (MNA) formula:

Gx = i

G: conductance matrix (sparse positive definite matrix)

x: vector of node voltages
i: vector of current loads and voltage sources
Dimensions of G, i and x are equal to the number of nodes in
the P/G network

Solve the linear equation

Unit 8

Apply Preconditioned Conjugated Gradient (PCG) method

The time complexity is linear

Y.-W. Chang

Idea of Solution Space Reduction

The IR-drop of a P/G pin is proportional to the effective

resistance between the P/G pin and the power pad

The closer the P/G pin is placed to the power pad, the smaller
the IR-drop

A technique to reduce solution space

Place the modules consuming larger current (power-hungry

modules) near the boundary of the floorplan
Place power pads close to them
10 3.4V

1.9V

0.6V

3.4V

2.8V

2.5V

2.4V

100mA

30mA

20mA

10mA

5V
10mA

Unit 8

-0.4V

20mA

30mA 100mA

Y.-W. Chang

B*tree Boundary Properties

Bottom boundary modules: the leftmost branch
Left-boundary condition

Left boundary modules: the rightmost branch

Right-boundary condition

Right boundary modules: the bottom-left branch

Top-boundary condition

Unit 8

Top boundary modules: bottom-right branch

Y.-W. Chang

Power-Hungry Modules Handling

Power-Hungry Modules

Are clustered and restricted to satisfy the boundary

property during B*-tree perturbation
P/G pads are placed near these modules
Clustered modules

Unit 8

Y.-W. Chang

Results on OpenRISC1200

Improve on runtime and max IR-drop with little overheads

on delay & wirelength (UMC 0.18 um technology)

OpenRISC1200

*Astro
Flow

*Astro w/
IR-drop
Driven
Placement

Our
Improv.
vs.
Astro w/
IR-drop

Our
Flow

Die Area (mm2)

3.86

3.33

15.9%

Utilization (%)

13.9%

Wirelength (m)

1655463 1539125

1540172 -0.1%

Avg. Delay (ns)

8.62

8.54

8.55

-0.1%

Max IR-drop (mv) 80.18

78.20

55.14

41.8%

CPU Runtime (s) 505

346

135

2.56X

Iterations

*Need iterative and manual P/G network fix

Unit 8

Y.-W. Chang

Resulting Voltage Map

Astro design flow

Our design flow

Power-hungry blocks (register files A&B)

are placed far away from the power pad

Power-hungry blocks are

placed beside the power pad

B
A

Unit 8

Y.-W. Chang

Clock Tree Synthesis Overview and Techniques
No ratings yet
Clock Tree Synthesis Overview and Techniques
16 pages
Standard Cell Library Overview
No ratings yet
Standard Cell Library Overview
13 pages
Physical Design
No ratings yet
Physical Design
96 pages
Low Power Design in VLSI
100% (1)
Low Power Design in VLSI
67 pages
Physical Design Concepts in IC Layout
No ratings yet
Physical Design Concepts in IC Layout
5 pages
VLSI Routing and Timing Analysis
No ratings yet
VLSI Routing and Timing Analysis
8 pages
Post-CTS Optimization Guide
No ratings yet
Post-CTS Optimization Guide
5 pages
VLSI Design Process Overview
No ratings yet
VLSI Design Process Overview
86 pages
Physical Design Questions
No ratings yet
Physical Design Questions
13 pages
Floor Plan
No ratings yet
Floor Plan
8 pages
VLSI Physical Design Course Details
No ratings yet
VLSI Physical Design Course Details
3 pages
Timing Analysis in Physical Design
100% (2)
Timing Analysis in Physical Design
32 pages
Eco
No ratings yet
Eco
3 pages
VLSI Power Planning Essentials
No ratings yet
VLSI Power Planning Essentials
20 pages
PD Interview Questions
No ratings yet
PD Interview Questions
16 pages
PNR Flow
No ratings yet
PNR Flow
23 pages
Common Clock Path Pessimism Removal CPPR
No ratings yet
Common Clock Path Pessimism Removal CPPR
21 pages
PD Flow I - Floorplan
No ratings yet
PD Flow I - Floorplan
17 pages
Timing Analysis: Setup & Hold Requirements
No ratings yet
Timing Analysis: Setup & Hold Requirements
75 pages
Clock Tree Synthesis for VLSI Designs
No ratings yet
Clock Tree Synthesis for VLSI Designs
5 pages
Age, Lda: Lntroduction and Lab
No ratings yet
Age, Lda: Lntroduction and Lab
36 pages
Scripting TCL/Perl
No ratings yet
Scripting TCL/Perl
5 pages
Legalization in Standard Cell Placement
No ratings yet
Legalization in Standard Cell Placement
13 pages
ASIC Physical Design Guide
No ratings yet
ASIC Physical Design Guide
272 pages
L16 CTS 4
No ratings yet
L16 CTS 4
13 pages
PD Interview Questions
No ratings yet
PD Interview Questions
54 pages
VLSI Logic Synthesis Part 3
No ratings yet
VLSI Logic Synthesis Part 3
19 pages
Floor Plan
No ratings yet
Floor Plan
6 pages
Multiple IO Ring
No ratings yet
Multiple IO Ring
27 pages
Clock Gating
No ratings yet
Clock Gating
7 pages
STA Temp
No ratings yet
STA Temp
34 pages
Project Report
100% (1)
Project Report
15 pages
ASIC Design Types
No ratings yet
ASIC Design Types
28 pages
Physical Design and Synthesis Guide
No ratings yet
Physical Design and Synthesis Guide
3 pages
Physical Design
No ratings yet
Physical Design
45 pages
Shanghai Training Placement
No ratings yet
Shanghai Training Placement
25 pages
Power Gating - Power Management Technique: VLSI Basics and Interview Questions
No ratings yet
Power Gating - Power Management Technique: VLSI Basics and Interview Questions
12 pages
Floorplan
No ratings yet
Floorplan
29 pages
Digital Design Constraints and Power Planning
No ratings yet
Digital Design Constraints and Power Planning
8 pages
Post Layout Synthesis Guide
100% (1)
Post Layout Synthesis Guide
4 pages
Team VLSI
No ratings yet
Team VLSI
3 pages
Physical Only Cells
No ratings yet
Physical Only Cells
2 pages
ASIC Interview Questions
No ratings yet
ASIC Interview Questions
6 pages
TCL Tutorial
No ratings yet
TCL Tutorial
123 pages
Static Timing Analysis
100% (1)
Static Timing Analysis
3 pages
Design Planning For Large SoC Implementation at 40nm - Part 1
No ratings yet
Design Planning For Large SoC Implementation at 40nm - Part 1
11 pages
Setup and Hold Time Analysis Guide
No ratings yet
Setup and Hold Time Analysis Guide
14 pages
VLSI Physical Design - STA Interview Questions Part 1
No ratings yet
VLSI Physical Design - STA Interview Questions Part 1
2 pages
Interview Questions
No ratings yet
Interview Questions
10 pages
Fusion Compiler Place and Route Flow
No ratings yet
Fusion Compiler Place and Route Flow
2 pages
Clock and Power Routing Optimization
No ratings yet
Clock and Power Routing Optimization
23 pages
Clock Routing For High-Performance Ics: CP 1 DL +
No ratings yet
Clock Routing For High-Performance Ics: CP 1 DL +
7 pages
Lecture24 Clockpower Routing
No ratings yet
Lecture24 Clockpower Routing
30 pages
VLSI Clock & Power Routing Guide
100% (1)
VLSI Clock & Power Routing Guide
30 pages
Clocking
No ratings yet
Clocking
22 pages
VLSI Clock & Power Routing Guide
No ratings yet
VLSI Clock & Power Routing Guide
30 pages
CTS Algorithms
No ratings yet
CTS Algorithms
2 pages
10 CTS
No ratings yet
10 CTS
21 pages
Clock Network Synthesis: Prof. Shiyan Hu Shiyan@mtu - Edu Office: EREC 731
No ratings yet
Clock Network Synthesis: Prof. Shiyan Hu Shiyan@mtu - Edu Office: EREC 731
22 pages
Chip Interconnect Delay Optimization
No ratings yet
Chip Interconnect Delay Optimization
47 pages
Startasl Asl1 Workbook
No ratings yet
Startasl Asl1 Workbook
20 pages
UPSC ESE Electrical Engineering Syllabus
No ratings yet
UPSC ESE Electrical Engineering Syllabus
4 pages
EL-12 (March 24, 2025) Ode To The West Wind
No ratings yet
EL-12 (March 24, 2025) Ode To The West Wind
11 pages
How To Make Big Decisions Wisely A Biblical and Scientific Guide To Healthier Habits, Less Stress, A Better Career, and Much More
100% (19)
How To Make Big Decisions Wisely A Biblical and Scientific Guide To Healthier Habits, Less Stress, A Better Career, and Much More
14 pages
Gender Communication Gaps
No ratings yet
Gender Communication Gaps
3 pages
The Psychodynamics of Shame
100% (6)
The Psychodynamics of Shame
21 pages
Bells University of Technology, Ota: Bus 311-Entrepreneurship Studies
No ratings yet
Bells University of Technology, Ota: Bus 311-Entrepreneurship Studies
8 pages
Workshop There Is - There Are
No ratings yet
Workshop There Is - There Are
5 pages
Confessional Poetry: Truth and Body Politics
No ratings yet
Confessional Poetry: Truth and Body Politics
145 pages
Johann Lukas von Hildebrandt: Architect Biography
No ratings yet
Johann Lukas von Hildebrandt: Architect Biography
3 pages
XII A Physics Project 2023-24
No ratings yet
XII A Physics Project 2023-24
11 pages
The Quran and Modern Science (By DR Zakir Naik)
50% (2)
The Quran and Modern Science (By DR Zakir Naik)
2 pages
(1984) Doctrine Commission Report - 21-82 The Remarriage of Divorced Persons
No ratings yet
(1984) Doctrine Commission Report - 21-82 The Remarriage of Divorced Persons
9 pages
Types of Response
No ratings yet
Types of Response
15 pages
Chloe Visits The Fertile Valley
No ratings yet
Chloe Visits The Fertile Valley
37 pages
Full Download Transforming Adults Through Coaching New Directions For Adult and Continuing Education Number 148 1st Edition James P. Pappas PDF
100% (12)
Full Download Transforming Adults Through Coaching New Directions For Adult and Continuing Education Number 148 1st Edition James P. Pappas PDF
51 pages
Position Paper
No ratings yet
Position Paper
12 pages
Atma and Atmakaraka
100% (2)
Atma and Atmakaraka
4 pages
Encyclopedia of Forensic & Legal Medicine - Vol 3
100% (1)
Encyclopedia of Forensic & Legal Medicine - Vol 3
576 pages
Marcos V Republic
No ratings yet
Marcos V Republic
4 pages
Elements of Contemporary Philippine Art
No ratings yet
Elements of Contemporary Philippine Art
11 pages
BDPR3103 Final Exam Answer
No ratings yet
BDPR3103 Final Exam Answer
5 pages
Advanced Nursing Practice Overview
100% (1)
Advanced Nursing Practice Overview
12 pages
Nonviolent Communication - A Literature Review (2023)
No ratings yet
Nonviolent Communication - A Literature Review (2023)
13 pages
Ordinarium Bukas Palad-Palan Reyes
No ratings yet
Ordinarium Bukas Palad-Palan Reyes
12 pages
The King of The Sea by Dina Zaman
No ratings yet
The King of The Sea by Dina Zaman
3 pages
Direct Memory Access (DMA)
50% (4)
Direct Memory Access (DMA)
15 pages
PFM's Dracula: Iconology in Rock Opera
No ratings yet
PFM's Dracula: Iconology in Rock Opera
10 pages
ISO 281 Waelzlager en
No ratings yet
ISO 281 Waelzlager en
16 pages
Robbery-Homicide Case Verdict 1964
67% (3)
Robbery-Homicide Case Verdict 1964
2 pages