Gate Sizing
Gate Sizing
CONTENTS
10.1 Introduction 246
10.1.1 Static Timing 247
10.1.2 Delay Models 247
10.1.2.1 The RC Delay Model 248
10.1.2.2 Model-Independent Delay Characteristics 249
10.1.3 Impact on Power and Area 249
10.1.3.1 Area 250
10.1.4 Problem Formulations 250
10.2 Theoretical Background 251
10.2.1 Discrete Sizing 251
10.2.2 Continuous Sizing 251
10.2.3 Lagrangian Relaxation 252
10.2.3.1 Solving the Lagrangian Subproblem 253
10.2.3.2 Solving the Lagrangian Dual Problem 253
10.3 Practical Algorithms 254
10.3.1 Sensitivity-Based Heuristics 254
10.3.2 Dedicated Worst-Slack Maximization 255
10.3.3 Continuous Optimization with Self-Snapping 255
10.3.4 Practical Gate Sizing by Lagrangian Relaxation 255
245
10.1 INTRODUCTION
Gate sizing is a central step for achieving timing closure and minimizing the power consumption
of integrated circuits. It refers to determining the widths of the transistors inside a logic gate,
where wider transistors imply a faster charging and discharging of the output capacitance. As a
drawback, larger transistors increase the input capacitances and thus the load, delays, and transition
times at the upstream gates. An increased transition time slows down sibling gates driven by
common predecessors.
The delay characteristics of a gate are also affected by its threshold voltage Vt. By varying the
oxide thickness or materials in the gate, the threshold voltage can be lowered to speed it up. As a
drawback, the leakage current Ileak grows exponentially with a falling threshold voltage:
I leak = ae - bVt
for constants a, b Î + (see Section 3.1.3). As every alternative threshold voltage on a design
requires a separate production step with additional masks, usually no more than four threshold
voltages are available.
Further techniques for optimizing gate delays have a smaller impact on the end result compared
to sizing and Vt-assignment and are, therefore, seldom used. First, the ratio of the p-type and
n-type transistor sizes (β-ratio) is adjusted, trading off the rising and falling signal delay. Second,
the sizes of serially connected transistors inside a gate are modified to balance the delays from
different inputs to the output pin (tapering).
Most integrated circuits are designed using a circuit library B that contains a variety of
predesigned physical layouts implementing different logic functions and are of varying sizes and
threshold voltages. Instead of implementing each gate in the netlist individually, each gate is
mapped to one of the library circuits. Let G denote the set of all gates in the netlist. The implementation
of all gates is described by a vector x ÎB G , where x g ÎB specifies the physical implementation
of g ÎG . Of course, the logical function of a gate restricts the set of implementations to which
it can be assigned. By X Í B G we denote the set of assignments, where each gate is assigned to a
logically equivalent library circuit. In discrete gate sizing, B and thus X are finite sets. In continuous
gate sizing, logic gates can take any size between a lower- and upper-size bound, extending the
solution space compared to discrete sizing.
A key reason for using discrete gate libraries is that preprocessed timing models for each
library gate significantly speed up timing analysis compared to the numerical simulation
of continuously designed custom gates. Thus, they allow the processing of multimillion gate
instances by physical design tools. More recently, FinFET transistors introduced additional
discreteness in standard cell design. FinFETs are sized by adding multiple fins of unit size.
In microprocessor design, chips used to be partitioned into small custom circuits of a few
thousand gates, where the layout was essentially defined manually. Here, numerical timing
simulation is possible, and transistors can be sized continuously for optimum sizes, β-ratios, and
taperings. For algorithms designed for transistor sizing of custom circuits, we refer the reader
to the optimization methods in [7,42] that combine timing simulation with gradient descent
or interior point methods. The benefit of the flexibility of transistor sizing is hampered by the
restrictions imposed by partitioning the design into small blocks. Thus, transistor sizing on small
blocks has mostly been replaced by discrete gate sizing on larger units, even in microprocessor
design.
In this chapter, we first present theoretical results on discrete and continuous gate sizing
in Section 10.2. Then in Section 10.3, we focus on practical algorithms for discrete gate sizing.
Readers that are mostly interested in practical algorithms can skip the theoretical part in
Section 10.2. We begin with the basic concepts of gate sizing.
av ³0 "v Î PI È RO,
(10.1) av + delay v ,w ( x) £ aw "(v , w) Î E(G ),
av £T "v Î PO È RI ,
where
E(G) is the edge set of the graph
PI, PO are the sets of primary inputs and outputs
RI, RO are register inputs and outputs
T is the design cycle time
The solution vector x ∈ X determines the gate sizes and threshold voltages. The edge delays
delayv,w(x) are functions of x, as we will describe in Section 10.1.2.
By introducing a super-vertex v⋆ representing the time zero (av = 0) and edges with constant
delay zero or −T, the three constraint types can be written in an uniform way:
Traditionally, the shape of a signal transition is approximated by (1) its arrival time, which is the
point where the voltage is half the supply voltage Vdd, and (2) its slew, which is the time of the
transition between 10% and 90% Vdd, as indicated in Figure 10.2. The 10/90 interval is chosen
arbitrarily. It should reflect the region where the voltage changes almost linearly, and sometimes,
the slew is defined by the 20/80 or 30/70 intervals.
The delay between an input and an output signal is the difference of their arrival times. For
higher accuracy, the signal transition can also be approximated by piecewise linear functions
with multiple pieces. The most accurate delay analysis is based on numerical circuit simulation
using the SPICE software [26]. Due to the high running times of simulation, industry standard
static timing sign-off tools, such as Synopsys PrimeTime or Cadence Tempus, use discrete gate
V V
Vdd Vdd
0.9Vdd 0.9Vdd
0.5Vdd 0.5Vdd
0.1Vdd 0.1Vdd
V0 V0
at t at t
FIGURE 10.2 A (rising) signal transition (a) and its approximation (b).
libraries with preprocessed delay characteristics to analyze multi-million gate designs, providing
SPICE simulation as an option for the accurate analysis of the most critical paths.
Traditionally, the gate delays were approximated by two nondecreasing functions: one for the
delay and the other for the output slew. Both of these were taking the load capacitance and input
slew as arguments and were computed by interpolated table lookup. Wire delays were computed
as Elmore delays [10] or using asymptotic waveform emulation to account for resistive shield-
ing [32]. Due to the growing importance of capacitive coupling between wires, the current source
model has become the most popular delay model [8], usually analyzing stages from the input pins
of a gate to the inputs of the downstream gates.
The delay rules are characterized only within some upper bounds for the load capacitance
and input slews. Meaningful delays are only given if the capacitance and slew limits are met.
Thus, capacitance and slew constraints have a higher priority than the delay constraints in
Equation 10.2.
A delay model that is often used in the literature is the RC delay model [4,11]. Here, the Elmore
delay formula is also used for modeling gate delays. The delay through a gate is proportional
to the product of its resistance and load capacitance. While the resistance of a gate is inversely
proportional to the gate size, the input pin capacitance of a gate is proportional to the size. Let
us assume that xg reflects the size of a gate g ÎG, that is, x g Î . Now, the RC delay through g
coincides for all timing arcs to the output pin. It is defined as
rg æ ö
delay g ( x) = ç C wire +
xg ç
è
å
g ¢Î fanout ( g )
c g ¢ x g ¢ + c intrinsic
g ÷,
÷
ø
EDA for . . .
160
100
120
50
100
80
0 10 20 0 200 400
(a) Load capacitance (fF) (b) Input slew (ps)
FIGURE 10.3 A typical gate delay as a (concave) function of load capacitance (a) and input slew (b),
given a fixed input slew (left) and load capacitance (right).
function in the load capacitance, while in reality it is rather a concave function, as shown in
Figure 10.3. The nonlinearity is further amplified by the slew dependency of the delay. Therefore,
these two models are used at most to guess initial sizes before placement but not during
subsequent optimization.
Delay and slew are increasing functions of load capacitance and input slew as in Figure 10.3.
Changing the size of a gate influences the delays and slews at the gate itself as well as all its
neighboring gates and all gates in their downstream cone. Assuming a constant input slew, an
increasing size reduces the delay through the gate and the output slews and thereby decreases
delays and slews at the downstream gates. On the other hand, it increases the load capacitance of
the upstream gates, their delays and output slews, and thereby the delays of the other sink gates
in the input nets as well as their successors. So, in fact, the input slew increases with the gate size.
If the upstream gate is very load sensitive, a strongly increasing input slew may degrade the delay
through the upsized gate instead of reducing it.
For details on power analysis, see Chapter 3 or [44]. Here, we emphasize the factors relevant for
gate sizing. The power consumption P(g) of a logic gate g ÎG is composed of the dynamic power
Pdynamic(g) that occurs when a gate changes its logic state and static power Pstatic(g) that occurs
while the gate is in a steady state:
P( g ) = Pdynamic ( g ) + Pstatic ( g ).
The dynamic power is, in turn, composed of the switching power and the short-circuit power:
where the Pswitching measures the power for charging and discharging its input pin capacitances.
Wire capacitances are usually considered constant during gate sizing as interconnects can
stay in place after minor adjustments to ensure pin access. For the purpose of gate sizing, we
assume that the dynamic power of a gate is more or less proportional to its size. Other factors
that influence the dynamic power are the supply voltage that is not sensitive to gate sizing
and the switching behavior, which is hard to predict and usually estimated by a switching
frequency per net.
Modeling becomes more complicated for the short-circuit power, which is observed under
simulation. It is dissipated at low-Vt gates during an input signal transition. If the threshold volt-
age is below 50% of the supply voltage, there is a period where both p-type and n-type transistors
in the gate are turned on, forming a short circuit. The short-circuit current increases with a larger
input slew, a lower threshold voltage, and a larger gate size. If a gate has steep input slews or a high
threshold voltage, the short-circuit current is usually negligible. As low-Vt gates are used mostly
on critical parts, small delays and slews are targeted anyway. But a designer or a tool should be
aware that downsizing a gate to save power will increase the short-circuit power at downstream
gates with a low threshold voltage. Often, the short-circuit power is not considered directly but
mitigated by enforcing low slew limits for low-Vt gates.
While dynamic and short-circuit power dissipation occurs while a gate is switching, the leak-
age current is also present at a steady state. There is a permanent small current leaking through
the transistor gates. This current is higher for lower threshold voltages and wider gates, with an
exponential dependency on the threshold voltage. In particular, for mobile devices, it is critical
to keep the static power small. Among many physical factors, the leakage power depends on the
current state of the input signals. These state patterns are not affected by gate sizing. We assume
that the leakage power of a gate is defined by its physical realization and not by the realization of
other gates.
For simplicity, gate sizing algorithms estimate power by separable functions, where the
contribution of one gate does not affect the contributions of other gates:
This is justifiable for the switching power and the leakage power. The short-circuit power can and
usually is added as a small slew-independent constant to the dynamic power after imposing low
upper bounds on the input slews of low-Vt gates.
10.1.3.1 AREA
The form factor of a gate is also an important aspect of gate sizing. If the chip gets locally over-
filled, it can become difficult or even infeasible to legalize the gates. This can, of course, also
be adjusted by changing the floorplan, but at some point in the design cycle, gate sizing should
match local placement density targets to avoid disturbance by legalization. At the very end, it
should avoid gate overlaps. In the literature, placement density is a rarely considered constraint
for gate sizing. One approach is proposed in [6]. Here, a single global density violation is punished by
a weighted quadratic penalty in the objective function during gate sizing. Gate sizing is repeated
with an increasing penalty weight until the density constraint is met. Production tools, such as
those used in [41], partition the chip into small placement bins and meet the placement density
target inside each bin.
Timing, power, and area can be considered as objectives or constraints in various combinations.
A typical objective in gate sizing is to minimize the power consumption such that the timing
constraints in Equation 10.2 are met:
min P total ( x)
(10.4) s.t . av + delay v ,w ( x) £ aw "(v , w) Î E(G )
xÎX
In practice, area constraints must be fulfilled as well. Early in the design cycle, when global gate
sizing is performed, timing constraints are usually unsatisfiable. Therefore, minimizing the cycle
time T or the absolute total negative slack (TNS) is usually the prevalent goal during most parts
of the design cycle, where the TNS is the sum of negative slacks at the endpoints of signal paths.
This can be done in linear combination with power and area or imposing area constraints. But
one could also ask for the minimum area under timing and power constraints.
The discrete and nonlinear nature of the problem makes gate sizing practically and mathemati-
cally difficult. Assuming well-behaved delay functions and continuous sizability, gate sizing
becomes tractable to some extent. In this section, we give an overview on the most important
theoretical results for discrete and continuous gate sizing.
Even under mild assumptions on the delay model, discrete gate sizing is a hard problem, as
the connection to the discrete time-cost trade-off problem shows. Here, we are given an acyclic
directed graph G = (V, E), and for each edge e ∈ E, there is a finite set T e = {te1 ,¼, tek } Ì R of edge
delays and a nonincreasing cost function ce : T e ® R + . In the deadline problem, we wish to choose
a delay for each edge such that the length of each path stays within a given deadline D minimiz-
ing the total cost. This problem cannot be approximated within a factor of 1.36 unless P = NP,
because it is as hard as the vertex cover problem [9,14]. Assuming the unique games conjecture,
which is a stronger assumption than P ≠ NP, there is even no constant-factor approximation
algorithm [39]. Theoretically, polynomial time algorithms or approximation algorithms with
better guarantees may exist for special delay functions. But in gate sizing, the delay functions
appear rather more involved due to their nonlinearity and nonseparability. The nonseparability
means that the delay of an edge in the timing graph depends on the sizes of multiple gates.
For netlists with a tree structure and certain delay models, pseudo-polynomial dynamic
programming (DP) algorithms have been used successfully. One of the first was presented in [3].
They can be turned into a fully polynomial time approximation scheme as a consequence of the
approximation scheme for buffering by [17]. For netlists with bounded width, that is, with a path-
like structure, an approximation scheme for delay minimization obeying an upper bound on the
power was developed in [22].
When nonlinear delay models with slew propagation come into play, the only known algo-
rithms for general discrete gate sizing that provide a size guarantee while meeting the timing
constraints are based on complete enumeration, potentially sped up by branch and bound [45].
Such techniques can be used to postoptimize small groups of gates [20].
Mathematical tractability improves the plain gate sizing problem (without discrete Vt changes)
if one assumes continuously sizable gates and posynomial delay functions, which are defined as
follows. For each gate g ÎG, the realization xg can be chosen continuously within a lower and an
upper bound lg ≤ xg ≤ ug, where l g , u g Î . Now the set X of feasible assignments consists of gate
sizing solutions x Î X = { x¢ Î RG : l g £ x g ¢ £ u g for all g Î G }.
A posynomial function f : n ® is of the form
K
(10.5) f ( x) = åc x
k =1
b1 k
k 1 x2b2 k xnbnk ,
where ck > 0 and bik Î . Posynomial delay functions occur, for instance, if gates are modeled as
RC circuits with bij ∈ {−1, 0, 1} [4,11].
The posynomial formulation for the circuit/transistor sizing problem was introduced by
Fishburn and Dunlop [11]. A globally optimum solution to the transistor sizing problem based on
geometric programming was first given in [35]. The nice property of geometric programs is that
they can be turned into convex programs by the variable transformation yi = logxi. Therefore, a
local optimum is always a global optimum.
General-purpose geometric programming solvers mostly implement an interior point method
for convex programs with polynomial running time. The drawback of interior point methods is
that their iterations require a significant memory and running time. Solvable instance sizes are
limited to approximately 100,000 gates according to [1,2], which also give a detailed tutorial on
geometric programming.
A prominent practical approach to gate sizing is based on Lagrangian relaxation (LR), which was
analyzed in detail in [4] and later in [43]. As this is also the motivation for many discrete gate
sizing heuristics, we will describe this approach in detail.
In the LR approach, the timing constraints are moved to the objective weighted by Lagrange
multipliers l Î E³(0G ) that penalize constraint violations. Applied to Equation 10.4, this leads to
the following Lagrange function (called Lagrangian):
(10.7) {
L (l) : = inf L(l, x, a) : x Î X , a Î RV (G ) }
is a lower bound for the objective value in Equation 10.4. To get the best possible lower bound,
we have to solve
If the gate sizing problem in Equation 10.4 has a strictly feasible solution, Equations 10.4 and 10.8
have the same value and the Lagrangian duality gap is zero.
If no strictly feasible solutions exists, Equation 10.8 may not have a finite solution. Note that
for a minimum cycle time T in Equation 10.1 the problem is not strictly feasible, because the
constraints on the critical path can only be satisfied with equality. However, Wang et al. [43]
specify regularity conditions for the objective function and the delay functions so that the duality
gap between the gate sizing problems (Equation 10.4) and the Lagrangian dual problem (Equation
10.8) is zero. But the supremum in Equation 10.8 is not attained by a finite value, and instead, the
multipliers of the tight constraints tend toward infinity.
The crucial observation in [4] is that the Lagrangian can be separated into two parts, one
depending only on the size vector x and the other only on the arrival time vector a:
with
and
L2 (l, a) = å l v , w (a v - a w )
( v ,w )ÎE (G )
æ ö
å å å
1
(10.11) = av ç l v ,w - l u ,v ÷ .
2 vÎV (G ) ç (v ,w )ÎE (G ) ÷
è ( u ,v )ÎE (G ) ø
Now Equation 10.11 shows that the Lagrange multipliers must obey network flow conservation
constraints:
å
( v ,w )ÎE (G )
l v ,w = å
( u ,v )ÎE (G )
l u ,v .
Thus, to solve the Lagrangian dual problem in Equation 10.8, the Lagrange multipliers must be
restricted to the space N of nonnegative network flows, as can also be derived from the Karush–
Kuhn–Tucker optimality conditions [4]. In this case, L2(l, a) = 0 for all a and L(l, x, a) = L1(l, x).
The Lagrangian dual problem (Equation 10.8) is usually solved by a subgradient method and
involves solving the Lagrangian subproblem (Equation 10.7) in every iteration.
The Lagrangian subproblem (Equation 10.7) is again a geometric program with the box con-
straints x ∈ X. It can be solved by a simple descent algorithm. In [4], convergence was proven
for a simple greedy algorithm if gates are modeled as RC circuits. In fact, it converges with a
linear convergence rate [5,19,34]. These theoretical statements rely on a specific gate delay model.
Empirically, linear running times were also observed for general delay models. They give reasons
to apply LR in gate sizing [4,12,21,23,24,28,40].
By the Lagrangian duality theory, L is a concave function in the domain N of nonnegative net-
work flows. It is continuous but not everywhere differentiable in N . A prominent method for
solving the Lagrangian dual problem (Equation 10.8) is the projected subgradient method [4].
Starting with some initial vector of multipliers λ 0 ≥ 0, it determines in iteration k ≥ 0, a
subgradient sk at L (l k ) in ascent direction and updates the multipliers by
(10.12) l k +1 = p N (l k + a k sk ),
where
αk > 0 is the step length
p N is the projection to the set N of nonnegative network flows
For the plain subgradient method without projection and Lipschitz continuous functions,
convergence rates have been established. However, L⋆ is not Lipschitz continuous at the transi-
E (G )
tions between N and R ³0 N . Polyak [29] showed that the solution sequence (l k )kÎ in the
projected subgradient method contains a subsequence that converges to the optimum, that is,
k ®¥
the limit superior lim k ®¥L(l k ) converges. To prove this, the series αk must fulfill a k ® 0 and
å ( )
¥
a k = ¥. The projected subgradient steps are iterated until the duality gap P total ( xk ) - L(l k )
k =0
and the maximum or total delay constraint violation are within some positive error bound. Note
that the optimum solution x⋆ will likely be located on the boundary of the feasible set and most
iteration values xk will not satisfy all delay constraints.
Computing the exact projection p N is a quadratic minimum-cost flow problem that can be
solved in polynomial time [25]. However, this would still be too time consuming in practice.
Therefore, essentially, all papers employing the subgradient method for gate sizing resort to a
heuristic [40] that maps the multipliers into N , node by node in a backward topological order.
Convergence results similar to that in [29] are lacking for most of the employed heuristics. In [33],
convergence was shown when projecting with the method of alternating projections. This implies
that one can proceed as in [40] and replace their heuristic local mapping into N by a projection.
The gate sizing problem is so complicated and involves so many details that theoretically elegant
approaches are inadequate. Many practical techniques have been developed and shown to be
effective.
If one iteratively sizes a single gate at a time, it is natural to choose a gate whose change achieves
the largest benefit at the lowest cost. The TILOS algorithm in [11] was the first to systematically
use this idea. Since then, it became the rationale behind many sensitivity-based gate sizing
heuristics. The idea is intuitive yet can be very effective if implemented well. One recent example
is [15], which will be described in this section. An earlier work [38] will also be introduced to
illustrate different flavors of this approach.
A sensitivity-based gate sizing heuristic usually consists of two phases—one for timing
improvement and the other for power reduction. It typically starts with the timing phase
followed by the power phase. Sometimes, the two phases can be interleaved and applied
iteratively.
The sensitivity in the timing phase points to changes with the largest timing improvement and
the least power overhead. In [15], it is defined by
DTNS
(10.13) Sensitivity timing =
Dpower power _ exponent
where
TNS is the total negative slack
Δ indicates change
Sizing based on this sensitivity aims to minimize the absolute TNS. Since TNS is expensive to
compute precisely, it is approximated by
where mik means changing gate g i ÎG to implementation k. The set Ni is the neighbors of gi and
consists of gates that share fanin nodes with gi. The delay change Ddelay kj of g j ∈ Ni is due to the
size or Vt (threshold voltage) change at gi. The notation NPathsj tells the total number of fanin and
fanout gates of gate g j that is affected by the delay change at g j. The power exponent is between
0 and 3. Extensions to this approach including an integration with a complex yet relatively slow
sign-off timer can be found in [20].
In [38], the timing phase sensitivity is defined as
å slack
1 -Ddelay
Sensitivity timing =
(10.15) Dpower arcs arc - slack min + K
where
Δpower is the power increase
Δdelay is the delay change when upsizing a gate
K is a constant
The arcs are (falling and rising) input arcs to a gate. In the timing recovery phase, one or several
gates with the maximum sensitivity are selected to be sized up or assigned to a lower Vt level.
This is repeated until the timing constraints are satisfied or no more improvement can be found.
Upon each gate size or Vt change, an incremental timing analysis is performed to obtain timing
updates.
A power reduction phase starts when the overall timing slack is nonnegative. In [15], the
power-phase sensitivity is defined as
-Dpower × slack
Sensitivity power =
(10.16) Ddelay × # paths
where
slack is at the output pin of the gate
#paths is the total fanin and fanout of the gate
å Ddelay
slack arc
Sensitivity power = -Dpower
(10.17) arcs
In the power reduction phase, gates with large sensitivity are selected for downsizing or for
increasing the Vt level. This is repeated until the timing slack reaches minimal acceptable values.
Specialized heuristics for power minimization maintaining timing feasibility are given in [31].
Here, not only one gate is powered down at a time, but an independent set of gates is chosen based
on its total sensitivity.
To improve the worst slack near the end of gate sizing, the focus of local search can be adjusted
by changing the objectives in Equations 10.13 and 10.15: Instead of taking the sum of sensitivities
among all neighbors or input arcs, the worst slack over all upstream driver gates is maximized.
Under this objective, a gate on the critical path is sized for minimum delay. Furthermore, the
size of a gate with a more critical upstream gate cannot increase because this would increase the
capacitance and worsen the slack at the upstream gate. Instead, its size rather decreases to reduce
the load capacitance and delay of the upstream gate.
For the second reason, it is important to consider not only the gates on the critical path but
also their less critical downstream gates in the local search [18,20] regardless of the local objec-
tive. In this setting, it can be beneficial to proceed in a forward topological order because an
increased gate size on the critical path improves the input slews in the fanout and might allow
a more aggressive downsizing. With such an extended gate selection and the local worst-slack
objective explained earlier, close to optimum global worst slacks are achievable [18].
With guidance from the function gradient, continuous optimization is usually more capable of
handling large problems than discrete optimization. However, Vt levels are highly discrete, and
its assignment is very difficult to be cast into a continuous optimization problem. The work by
Shah et al. [36] overcomes this difficulty by implementing each gate with two parallel parts of dif-
ferent Vt levels (Figure 10.4) and by sizing the two parts continuously. Since gate sizes are available
at a finer granularity than Vt levels, the discretization of a continuous sizing solutions is relatively
manageable [16]. Interestingly, it is observed in [36] that solving this new sizing problem through
convex programming usually results in integer Vt assignment solutions. More specifically, usually
one part (either low Vt or high Vt) of a parallel gate has size 0. It is further mathematically proved
that Vt assignment always self-snaps to integer solutions if there is no constraint on the gate sizes.
LR is a popular approach in solving gate sizing problems [4,12,21,23,24,28,40] mainly due to its
theoretical advantage on the trade-off between conflicting objectives, such as performance and
power. The foundation framework of LR-based gate sizing is introduced in the pioneer work [4]
and summarized in Section 10.2.3.
A B B
A
Out
A Low Vt A High Vt
B B
FIGURE 10.4 Implementing a NAND2 gate by a parallel connection of two logic gates with differ-
ent Vt levels.
Later on, many practical techniques have been proposed to improve or extend this framework.
The Lagrangian dual problem is usually solved by subgradient methods [4]. In [40], it is observed
that the practical convergence of subgradient methods is very sensitive to the initial solution and
step size. The convergence is faster if the initial solution satisfies timing constraints. The work
by Tennakoon and Sechen [40] suggests invoking the iterative greedy algorithm [4] at the begin-
ning so that a timing-feasible initial solution is obtained. It also shows how to estimate Lagrange
multiplier values corresponding to the initial solution. With such an initial solution, the LR can
focus more on area minimization subject to timing constraints rather than delay reduction.
Tennakoon and Sechen [40] also propose the following Lagrange multiplier update method that
does not rely on a constant step size:
ì ai
ïl ji A if i Î PO, j Î fanin(i)
ï 0
ï aj
(10.18) l ji ,new = íl ji if j Î fanin(i)
ï ai - Di
ïl Di if i Î PI , j Î fanin(i)
ïî ji ai
where
λji is the multiplier for timing arc (j, i)
ai is the arrival time at node i
A0 is the required arrival time at the primary output (PO)
Di is the delay of gate i
Please note that the primary input (PI) nodes of a combinational logic circuit are driven by
ip-flops and, therefore, they have fanin nodes.
fl
Note that the multipliers are updated multiplicatively in Equation 10.18 and not in subgradient
direction. In practice, multiplicative updates often lead to faster convergence than the subgradi-
ent update shown in Equation 10.12, as demonstrated by the results discussed in [12] on ISPD
2012 and 2013 gate sizing benchmarks.
LR is also combined with DP-like solution search [23] so that a discrete solution is obtained
directly without the error-prone rounding procedure. DP has been quite successful for problems
with tree topologies, such as buffer insertion. Since the topology of combinational logic circuits is
a directed acyclic graph, the solution propagation of DP faces the challenge of history consistency
constraint. When two candidate solutions are merged at a node, the choices on their common
ancestor nodes should be identical. This constraint entails either large memory space or long
computation time as a large volume of history information needs to be stored or traced. Moreover,
it complicates pruning, which is a key component of the DP. The work of Liu and Hu [23] suggests
an interesting idea to solve this challenge. It first runs DP with the history consistency constraint
relaxed. Although the solution may be illegal, it provides a timing bound for each node. Next, it
applies a greedy-like heuristic to obtain a history consistent solution. Since the heuristic utilizes
the timing bound obtained from the DP, it is not myopic as a stand-alone greedy algorithm. Then,
the algorithm continues solution search in a manner that is between DP and a greedy heuristic.
Overall, it strives to gain the greatest possible benefit from DP, while avoiding excessive enumera-
tion for resolving inconsistencies.
The algorithm in [28] is also based on LR and DP. Unlike most previous works that constrain
only the longest path slack to be nonnegative, the problem formulation of [28] attempts to
minimize the absolute TNS and includes it as a part of the objective function. For cases without
timing-feasible solutions, the result from [28] is more usable. The algorithm of this work is built
upon a graph model as illustrated in Figure 10.5. In this model, gate delay and power are captured
in node and edge weight. Then, the Lagrangian subproblem, which optimizes a weighted sum
of gate delay and power, is solved by DP. In order to avoid the troublesome history consistency
problem, the entire circuit is partitioned into a set of disjoint trees where DP is conducted on each
individual tree.
In [24], the Lagrangian subproblem is solved by a discrete greedy heuristic. The algorithm
proceeds in circuit traversals. When a gate is encountered during traversal, its implementation
options are enumerated and the one that minimizes the overall cost function is selected.
This heuristic is very simple yet effective.
Another approach to handle the trade-off between circuit timing and power is through timing
budgeting [27]. First, a minimum delay sizing solution is obtained using a certain simple heuristic.
Next, the timing slack is allocated to each gate. Last, each gate is sized down or changed to higher
A B
(a) (b)
FIGURE 10.5 (a) Circuit and its (b) graph model assuming each gate has only two implementation
options.
Vt level according to its slack budget. The key component of this approach is a systematic alloca-
tion of timing slack to each node, which can be formulated as follows:
(10.19) Max å vi ÎV
di × xi
(10.21) a j £ T "v j Î PO
(10.22) 0 £ xi £ U i "vi ÎV
where
ai, di, and xi are the arrival time, delay, and slack budget for node vi
T is the required arrival time at the PO
Ui is an upper bound for the budget at node vi
δi is the power reduction per delay increase
The budgeting problem can be solved by linear programming [27] or network flow algorithms [13].
Noticing the correlation between gate delay and signal slew rate, the work of [18] proposes
to size gates through slew budgeting instead of delay budgeting. The slew target for each gate is
initialized with the design rules. Then, in a reverse topological order traversal of the circuit, a size
is found for each gate to meet its slew target. When sizing a gate, its fanout gate sizes have already
been updated, and therefore, the load capacitance is known. Its input slew can be estimated
according to the slew target of its fanin gates. After one circuit traversal, slew targets are adjusted
according to timing criticality. Timing critical gates are assigned with tighter slew constraints,
and the slew-driven sizing is repeated. Since slew estimation is much faster than timing analysis,
this approach can easily deal with very large circuits using a relatively slow sign-off timer.
Some commercial tools use global gate sizing algorithms based on LR (Section 10.3.4) or timing
budgeting (Section 10.3.5) early in the design flow to get good starting solutions. Such global
solutions are improved by local search, where sensitivity-based heuristics (Section 10.3.1) are
used predominantly.
Gate sizing is usually not performed as an isolated task but incrementally alternated or integrated
with interconnect optimization, timing-driven detailed placement, or logic restructuring.
Again sensitivity-based heuristics are mostly used in the interplay with other operations [30].
As LR does not guarantee fast convergence, there is always the danger of a timing degradation,
when applied late in the flow. Again sensitivity-based heuristics that do not introduce new path delay
violations are prevalent for timing final power reduction [31].
Gate sizing is an important and challenging problem in very-large-scale integration design. The
continuous gate sizing problem without Vt assignment can be solved in polynomial time for
posynomial delay functions.
In practice, discrete gate libraries are used, and no polynomial time algorithm with a perfor-
mance guarantee is known, except for trees or path-like netlists. On the other hand, there has
been substantial practical improvement, in particular, after the ISPD gate sizing contests 2012
and 2013 [12,15,20,21,23,24]. It can be expected that several of these new heuristics will find their
way into electronic design automation tools. Thereby, the major challenge is the interaction with
sign-off timing engines that, due to the support of many features, tend to be much slower than the
slim static timing implementations used in research papers. It is not clear how far the currently
best solutions are away from the optimum. Even if LR is used, the Lagrangian subproblems are
not solved to optimality. A fast computation of good lower bounds for the power consumption is
an important open problem.
On the theoretical side, discrete gate sizing algorithms with good performance guarantees are
lacking. As explained in Section 10.2.1, it aims high to achieve a constant-factor approximation
algorithm. However, providing a polynomial-time algorithm with a logarithmic performance
guarantee would be a significant result.
REFERENCES
1. S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming.
Optimization and Engineering, 8(1), 67–127, 2007.
2. S. P. Boyd, S.-J. Kim, D. D. Patil, and M. A. Horowitz. Digital circuit optimization via geometric
programming. Operations Research, 53(6), 899–932, 2005.
3. P. K. Chan. Algorithms for library-specific sizing of combinational logic. Proceedings of the 27th
ACM/IEEE Design Automation Conference, Orlando, FL, 1990, pp. 353–356.
4. C. P. Chen, C. C.-N. Chu, and D. F. Wong. Fast and exact simultaneous gate and wire sizing by
Lagrangian relaxation. IEEE Transactions on Computer-Aided Design, 18(7), 1014–1025, 1999.
5. C. Chu and D. F. Wong. VLSI circuit performance optimization by geometric programming. Annals
of Operations Research, 105, 37–60, 2001.
6. J. Cong, J. Lee, and G. Luo. A unified optimization framework for simultaneous gate sizing and
placement under density constraints. Proceedings of 2011 IEEE International Symposium on
Circuits and Systems (ISCAS), Rio De Janeiro, Brazil, 2011, pp. 1207–1210.
7. A. R. Conn, I. M. Elfadel, W. W. Molzen, Jr., P. R. O’Brien, P. N. Strenski, C. Visweswariah, and
C. B. Whan. Gradient-based optimization of custom circuits using a static-timing formulation.
Design Automation Conference (DAC), New Orleans, LA, 1999, pp. 452–459.
8. J. F. Croix and D. F. Wong. Blade and razor: Cell and interconnect delay analysis using current-
based model. Proceedings of 40th ACM/IEEE Design Automation Conference, Anaheim, CA, 2003,
pp. 386–389.
9. I. Dinur and S. Safra. On the hardness of approximating minimum vertex cover. Annals of
Mathematics, 162, 439–485, 2005.
10. W. C. Elmore. The transient response of damped linear networks with particular regard to wide-band
amplifiers. Journal of Applied Physics, 19(1), 55–64, 1948.
11. J. P. Fishburn and A. E. Dunlop. TILOS: A posynomial programming approach to transis-
tor sizing. Proceedings of the IEEE/ACM International Conference on Computer-Aided Design,
Santa Clara, CA, 1985, pp. 326–328.
12. G. Flach, T. Reimann, G. Posser, M. de O. Johann, R. Reis. Effective method for simultaneous gate
sizing and Vth assignment using Lagrangian relaxation. IEEE Transactions on Computer Aided
Design of Integrated Circuits and Systems, 33(4), 546–557, 2014.
13. S. Ghiasi, E. Bozorgzadeh, P.-K. Huang, R. Jafari, and M. Sarrafzadeh. A unified theory of timing
budget management. IEEE Transactions on Computer-Aided Design, 25(11), 2364–2375, 2006.
14. A. Grigoriev and G. J. Woeginger. Project scheduling with irregular costs: Complexity, approximability,
and algorithms. Acta Informatica, 41, 83–97, 2004.
15. J. Hu, A. B. Kahng, S. Kang, M.-C. Kim, and I. L. Markov. Sensitivity-guided metaheuristics for accu-
rate discrete gate sizing. Proceedings of the IEEE/ACM International Conference on Computer-Aided
Design, San Jose, CA, 2012, pp. 233–239.
16. S. Hu, M. Ketkar, and J. Hu. Gate sizing for cell-library-based designs. IEEE Transactions on Computer-
Aided Design, 28(6), 818–825, 2009.
17. S. Hu, Z. Li, and C. J. Alpert. A fully polynomial time approximation scheme for timing driven
minimum cost buffer insertion. Proceedings of the 46th Annual Conference on Design Automation,
San Francisco, CA, 2009, pp. 424–429.
18. S. Held. Gate sizing for large cell-based designs. Proceedings of Design, Automation and Test in Europe
Conference, Nice, France, 2009, pp. 827–832.
19. S. Joshi and S. Boyd. An efficient method for large-scale gate sizing. IEEE Transactions on Circuits and
Systems, 55(9), 2760–2773, 2008.
20. A. B. Kahng, S. Kang, H. Lee, I. L. Markov, and P. Thapar. High-performance gate sizing with a signoff
timer. Proceedings of the International Conference on Computer-Aided Design, San Jose, CA, 2013,
pp. 450–457.
21. L. Li, P. Kang, Y. Lu, and H. Zhou. An efficient algorithm for library-based cell-type selection in
high-performance low-power designs. Proceedings of the IEEE/ACM International Conference on
Computer-Aided Design, San Jose, CA, 2012, pp. 226–232.
22. C. Liao and S. Hu. Approximation scheme for restricted discrete gate sizing targeting delay minimi-
zation. Journal of Combinatorial Optimization, 21(4), 497–510, 2011.
23. Y. Liu and J. Hu. A new algorithm for simultaneous gate sizing and threshold voltage assignment.
IEEE Transactions on Computer-Aided Design, 29(2), 223–234, 2010.
24. V. S. Livramento, C. Guth, J. L. Güntzel, and M. de O. Johann. Fast and efficient Lagrangian
relaxation-based discrete gate sizing. ACM Transactions on Design Automation of Electronic
Systems, 19(4), Article No. 40, 2014.
25. M. Minoux. A polynomial algorithm for minimum quadratic cost flow problems. European Journal
of Operational Research, 18(3), 377–387, 1984.
26. L. W. Nagel and D. O. Pederson. SPICE (Simulation Program with Integrated Circuit Emphasis).
Technical Report M382, EECS Department, University of California, Berkeley, CA, 1973.
27. D. Nguyen, A. Davare, M. Orshansky, D. Chinnery, B. Thompson, and K. Keutzer. Minimization of
dynamic and static power through joint assignment of threshold voltages and sizing optimization.
Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, Seoul,
Korea, 2003, pp. 158–163.
28. M. Ozdal, S. Burns, and J. Hu. Algorithms for gate sizing and device parameter selection for high-
performance designs. IEEE Transactions on Computer-Aided Design, 31(10), 1558–1571, 2012.
29. B. T. Polyak. A general method for solving extemum problems. Doklady Akademii Nauk SSSR, 174(1),
33–36, 1967. Translation in Soviet Mathematics Doklady, 8(3), 593–597, 1967.
30. R. Puri, M. Choudhury, H. Qian, and M. Ziegler. Bridging high performance and low power in processor
design. Proceedings of the 2014 International Symposium on Low Power Electronics and Design, La Jolla,
CA, 2014, pp. 183–188.
31. H. Qian and E. Acar. Timing-aware power minimization via extended timing graph methods. ASP
Journal of Low Power Electronics, 3(3), 318–326, 2007.
32. C. Ratzlaff and L. T. Pillage. RICE: Rapid interconnect circuit evaluation using asymptotic waveform
evaluation. IEEE Transactions on Computer-Aided Design, 13(6), 763–776, 1994.
33. D. Rautenbach and C. Szegedy. A subgradient method using alternating projections. Technical Report
04940, Research Institute for Discrete Mathematics, University of Bonn, Bonn, Germany, 2004.
34. D. Rautenbach and C. Szegedy. A class of problems for which cyclic relaxation converges linearly.
Computational Optimization and Applications, 41, 52–60, 2008.
35. S. S. Sapatnekar, V. B. Rao, P. M. Vaidya, and S.-M. Kang. An exact solution to the transis-
tor sizing problem for CMOS circuits using convex programming. IEEE Transactions on Computer-
Aided Design, 12(11), 1621–1634, 1993.
36. S. Shah, A. Srivastava, D. Sharma, D. Sylvester, D. Blaauw, and V. Zolotov. Discrete vt assignment and
gate sizing using a self-snapping continuous formulation. Proceedings of the IEEE/ACM International
Conference on Computer-Aided Design, 2005, pp. 705–712.
37. R. F. Sproull and I. E. Sutherland. Logical effort: Designing for speed on the back of an envelope. IEEE
Advanced Research in VLSI C, Sequin, ed., MIT Press, Cambridge, MA, 1991, pp. 1–16.
38. A. Srivastava, D. Sylvester, and D. Blaauw. Power minimization using simultaneous gate sizing, mini-
mization using simultaneous GatDual-Vdd and Dual-Vth assignment. ACM/IEEE Design Automation
Conference, San Diego, CA, 2004, pp. 783–787.
39. O. Svensson. Hardness of vertex deletion and project scheduling. Theory of Computing, 9(24),
759–781, 2013.
40. H. Tennakoon and C. Sechen. Gate sizing using Lagrangian relaxation combined with a fast
gradient-based pre-processing step. Proceedings of the IEEE/ACM International Conference on
Computer-Aided Design, San Jose, CA, 2002, pp. 395–402.
41. L. Trevillyan, D. S. Kung, R. Puri, L. N. Reddy, and M. A. Kazda. An integrated environment for
technology closure of deep-submicron IC designs. IEEE Design & Test of Computers, 21(1), 14–22,
2004.
42. A. Wächter, C. Visweswariah, and A. R. Conn. Large-scale nonlinear optimization in circuit tuning.
Future Generation Computer Systems, 21(8), 1251–1262, 2005.
43. J. Wang, D. Das, and H. Zhou. Gate sizing by Lagrangian relaxation revisited. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 28(7), 1071–1084, 2009.
44. N. H. E. Weste and D. Harris. CMOS VLSI Design: A Circuits and Systems Perspective, 4th edn.,
Addison-Wesley Publishing Company, Boston, MA, 2010.
45. T.-H. Wu and A. Davoodi PaRS: Parallel and near-optimal grid-based cell sizing for library-based
design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(11),
1666–1678, 2009.