
Module 2

This document discusses the delay model based on logical effort, incorporating factors such as parasitic capacitance and nonideal delays in CMOS technology. It presents equations for calculating delays in logic cells, emphasizing the importance of logical and electrical efforts in determining performance. The document also explores logical area and efficiency, comparing different logic cells based on their transistor sizes and logical efforts.

Uploaded by

H M BRUNDA
Copyright
© © All Rights Reserved

3.3 Logical Effort
In this section we explore a delay model based on logical effort, a term coined by
Ivan Sutherland and Robert Sproull [1991], that has as its basis the time-constant
analysis of Carver Mead, Chuck Seitz, and others.
We add a catch-all nonideal component of delay, t_q, to Eq. 3.2 that includes:
(1) delay due to internal parasitic capacitance; (2) the time for the input to reach
the switching threshold of the cell; and (3) the dependence of the delay on the
slew rate of the input waveform. With these assumptions we can express the
delay as follows:
$$t_{PD} = R\,(C_{out} + C_p) + t_q \qquad (3.10)$$

(The input capacitance of the logic cell is C, but we do not need it yet.)
We will use a standard-cell library for a 3.3 V, 0.5 µm (0.6 µm drawn)
technology (from Compass) to illustrate our model. We call this technology C5;
it is almost identical to the G5 process from Section 2.1 (the Compass library
uses a more accurate and more complicated SPICE model than the generic
process). The equation for the delay of a 1X drive, two-input NAND cell is in the
form of Eq. 3.10 (C_out is in pF):
$$t_{PD} = (0.07 + 1.46\,C_{out} + 0.15)\ \text{ns} \qquad (3.11)$$

The delay due to the intrinsic output capacitance (0.07 ns, equal to RC_p) and the
nonideal delay (t_q = 0.15 ns) are specified separately. The nonideal delay is a
considerable fraction of the total delay, so we cannot ignore it. If data books
do not specify these components of delay separately, we have to estimate the
fractions of the constant part of a delay equation to assign to RC_p and t_q (here
the ratio t_q / RC_p is approximately 2).

The data book tells us the input trip point is 0.5 and the output trip points are 0.35
and 0.65. We can use Eq. 3.11 to estimate the pull resistance for this cell as
R ≈ 1.46 ns·pF^-1, or about 1.5 kΩ. Equation 3.11 is for the falling delay; the data
book equation for the rising delay gives slightly different values (but within 10
percent of the falling delay values).
We can scale any logic cell by a scaling factor s (transistor gates become s times
wider, but the gate lengths stay the same), and as a result the pull resistance R
will decrease to R/s and the parasitic capacitance C_p will increase to sC_p.
Since t_q is nonideal, by definition it is hard to predict how it will scale. We shall
assume that t_q scales linearly with s for all cells. The total cell delay then scales
as follows:
$$t_{PD} = (R/s) \cdot (C_{out} + s\,C_p) + s\,t_q \qquad (3.12)$$

For example, the delay equation for a 2X drive (s = 2), two-input NAND cell is
$$t_{PD} = (0.03 + 0.75\,C_{out} + 0.51)\ \text{ns} \qquad (3.13)$$

Compared to the 1X version (Eq. 3.11), the output parasitic delay has decreased
to 0.03 ns (from 0.07 ns), whereas we predicted it would remain constant (the
difference is because of the layout); the pull resistance has decreased by a factor
of 2 from 1.5 kΩ to 0.75 kΩ, as we would expect; and the nonideal delay has
increased to 0.51 ns (from 0.15 ns). The differences between our predictions and
the actual values give us a measure of the model accuracy.
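
As a rough numerical check of Eqs. 3.10 to 3.13, the following Python sketch evaluates the scaled delay model for the two-input NAND cell and compares it with the 2X data-book equation. The function name cell_delay_ns and the 0.1 pF load are illustrative assumptions; only the coefficients come from Eqs. 3.11 and 3.13.

```python
def cell_delay_ns(c_out_pf, r_kohm, rc_p_ns, t_q_ns, s=1.0):
    """Delay from Eq. 3.12: (R/s)(C_out + s*C_p) + s*t_q.

    R*C_p is passed in directly (in ns) because the data book quotes it that way.
    kOhm * pF = ns, so no unit conversion is needed.
    """
    return (r_kohm / s) * c_out_pf + rc_p_ns + s * t_q_ns


# 1X two-input NAND coefficients from Eq. 3.11.
nand2_1x = dict(r_kohm=1.46, rc_p_ns=0.07, t_q_ns=0.15)
load_pf = 0.1  # hypothetical load, chosen only for illustration

t_1x = cell_delay_ns(load_pf, **nand2_1x)
t_2x_model = cell_delay_ns(load_pf, s=2.0, **nand2_1x)   # prediction of Eq. 3.12
t_2x_databook = 0.03 + 0.75 * load_pf + 0.51             # Eq. 3.13

print(f"1X delay:            {t_1x:.3f} ns")
print(f"2X delay (model):    {t_2x_model:.3f} ns")
print(f"2X delay (databook): {t_2x_databook:.3f} ns")
```

The gap between the last two numbers comes almost entirely from the nonideal term, which the model assumes scales linearly with s.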
We rewrite Eq. 3.12 using the input capacitance of the scaled logic cell, C_in = sC,
$$t_{PD} = RC\,\frac{C_{out}}{C_{in}} + RC_p + s\,t_q \qquad (3.14)$$

Finally we normalize the delay using the time constant formed from the pull
resistance R_inv and the input capacitance C_inv of a minimum-size inverter:
$$d = \frac{(RC)(C_{out}/C_{in}) + RC_p + s\,t_q}{\tau} = f + p + q \qquad (3.15)$$

The time constant tau,

$$\tau = R_{inv} C_{inv} \qquad (3.16)$$

is a basic property of any CMOS technology. We shall measure delays in terms
of τ.
The delay equation for a 1X (minimum-size) inverter in the C5 library is
$$t_{PDf} = R_{pd} (C_{out} + C_p) \ln (1/0.35) \approx R_{pd} (C_{out} + C_p) \qquad (3.17)$$

Thus t_q,inv = 0.1 ns and R_inv = 1.60 kΩ. The input capacitance of the 1X
inverter (the standard load for this library) is specified in the data book as C_inv =
0.036 pF; thus τ = (0.036 pF)(1.60 kΩ) = 0.06 ns for the C5 technology.
The use of logical effort consists of rearranging and understanding the meaning
of the various terms in Eq. 3.15 . The delay equation is the sum of three terms,
d = f + p + q . (3.18)

We give these terms special names as follows:


delay = effort delay + parasitic delay + nonideal delay . (3.19)

The effort delay f we write as a product of logical effort, g, and electrical effort, h:
$$f = gh \qquad (3.20)$$

So we can further partition delay into the following terms:

delay = logical effort × electrical effort + parasitic delay + nonideal delay . (3.21)

The logical effort g is a function of the type of logic cell,

$$g = RC/\tau \qquad (3.22)$$

What size of logic cell do the R and C refer to? It does not matter because the R
and C will change as we scale a logic cell, but the RC product stays the same: the
logical effort is independent of the size of a logic cell. We can find the logical
effort by scaling down the logic cell so that it has the same drive capability as the
1X minimum-size inverter. Then the logical effort, g, is the ratio of the input
capacitance, C_in, of the 1X version of the logic cell to C_inv (see Figure 3.8).

FIGURE 3.8 Logical effort. (a) The input capacitance, C_inv, looking into the
input of a minimum-size inverter in terms of the gate capacitance of a
minimum-size device. (b) Sizing a logic cell to have the same drive strength as a
minimum-size inverter (assuming a logic ratio of 2). The input capacitance
looking into one of the logic-cell terminals is then C_in. (c) The logical effort of
a cell is C_in / C_inv. For a two-input NAND cell, the logical effort, g = 4/3.

The electrical effort h depends only on the load capacitance C_out connected to
the output of the logic cell and the input capacitance of the logic cell, C_in; thus
$$h = C_{out}/C_{in} \qquad (3.23)$$

The parasitic delay p depends on the intrinsic parasitic capacitance C_p of the
logic cell, so that
$$p = RC_p/\tau \qquad (3.24)$$

Table 3.2 shows the logical efforts for single-stage logic cells. Suppose the
minimum-size inverter has an n-channel transistor with W/L = 1 and a
p-channel transistor with W/L = 2 (logic ratio, r, of 2). Then each two-input
NAND logic cell input is connected to an n-channel transistor with W/L = 2 and
a p-channel transistor with W/L = 2. The input capacitance of the two-input
NAND logic cell divided by that of the inverter is thus 4/3. This is the logical
effort of a two-input NAND when r = 2. Logical effort depends on the ratio of the
logic. For an n-input NAND cell with ratio r, the p-channel transistors are W/L
= r/1, and the n-channel transistors are W/L = n/1. For a NOR cell the
n-channel transistors are 1/1 and the p-channel transistors are nr/1.

TABLE 3.2 Cell effort, parasitic delay, and nonideal delay (in units of τ) for
single-stage CMOS cells.

Cell         | Cell effort (logic ratio = 2) | Cell effort (logic ratio = r) | Parasitic delay/τ         | Nonideal delay/τ
inverter     | 1 (by definition)             | 1 (by definition)             | p_inv (by definition) [1] | q_inv (by definition) [1]
n-input NAND | (n + 2)/3                     | (n + r)/(r + 1)               | n p_inv                   | n q_inv
n-input NOR  | (2n + 1)/3                    | (nr + 1)/(r + 1)              | n p_inv                   | n q_inv

The parasitic delay arises from parasitic capacitance at the output node of a
single-stage logic cell and most (but not all) of this is due to the source and drain
capacitance. The parasitic delay of a minimum-size inverter is
$$p_{inv} = C_p/C_{inv} \qquad (3.25)$$

The parasitic delay is a constant, for any technology. For our C5 technology we
know RC_p = 0.06 ns and, using Eq. 3.17 for a minimum-size inverter, we can
calculate p_inv = RC_p / τ = 0.06/0.06 = 1 (this is purely a coincidence). Thus C_p
is about equal to C_inv and is approximately 0.036 pF. There is a large error in
calculating p_inv from extracted delay values that are so small. Often we can
calculate p_inv more accurately from estimating the parasitic capacitance from
layout.
Because RC_p is constant, the parasitic delay is equal to the ratio of parasitic
capacitance of a logic cell to the parasitic capacitance of a minimum-size
inverter. In practice this ratio is very difficult to calculate; it depends on the
layout. We can approximate the parasitic delay by assuming it is proportional to
the sum of the widths of the n-channel and p-channel transistors connected to
the output. Table 3.2 shows the parasitic delay for different cells in terms of p_inv.
The nonideal delay q is hard to predict and depends mainly on the physical size
of the logic cell (proportional to the cell area in general, or width in the case of a
standard cell or a gate-array macro),
$$q = s\,t_q/\tau \qquad (3.26)$$

We define q_inv in the same way we defined p_inv. An n-input cell is
approximately n times larger than an inverter, giving the values for nonideal
delay shown in Table 3.2. For our C5 technology, from Eq. 3.17, q_inv = t_q,inv /
τ = 0.1 ns/0.06 ns = 1.7.
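
The cell-effort formulas of Table 3.2 are easy to tabulate. The sketch below is only an illustration (the function names g_nand and g_nor are hypothetical); it uses exact fractions so that values such as 4/3 appear exactly as in the text.

```python
from fractions import Fraction


def g_nand(n, r=Fraction(2)):
    """Logical effort of an n-input NAND cell with logic ratio r: (n + r)/(r + 1)."""
    return Fraction(n + r, r + 1)


def g_nor(n, r=Fraction(2)):
    """Logical effort of an n-input NOR cell with logic ratio r: (n*r + 1)/(r + 1)."""
    return Fraction(n * r + 1, r + 1)


# Logic ratio r = 2 reproduces the first effort column of Table 3.2.
for n in (2, 3, 4):
    print(f"n = {n}: NAND g = {g_nand(n)}, NOR g = {g_nor(n)}")

# A non-integer ratio works too, e.g. r = 1.5 for a three-input NOR.
print("NOR3, r = 1.5:", float(g_nor(3, Fraction(3, 2))))
```

Setting n = 1 in either formula gives 1, the logical effort of the inverter.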

3.3.1 Predicting Delay


As an example, let us predict the delay of a three-input NOR logic cell with 2X
drive, driving a net with a fanout of four, with a total load capacitance
(comprising the input capacitance of the four cells we are driving plus the
interconnect) of 0.3 pF.
From Table 3.2 we see p = 3 p_inv and q = 3 q_inv for this cell. We can calculate
C_in from the fact that the input gate capacitance of a 1X drive, three-input NOR
logic cell is equal to gC_inv, and for a 2X logic cell, C_in = 2gC_inv. Thus,
$$gh = g\,\frac{C_{out}}{C_{in}} = \frac{g \cdot (0.3\ \text{pF})}{2\,g\,C_{inv}} = \frac{0.3\ \text{pF}}{(2) \cdot (0.036\ \text{pF})} \qquad (3.27)$$

(Notice that g cancels out in this equation; we shall discuss this in the next
section.)
The delay of the NOR logic cell, in units of τ, is thus
$$d = gh + p + q = \frac{0.3 \times 10^{-12}}{(2) \cdot (0.036 \times 10^{-12})} + (3) \cdot (1) + (3) \cdot (1.7) = 4.1666667 + 3 + 5.1 = 12.266667\,\tau \qquad (3.28)$$

equivalent to an absolute delay, t_PD ≈ 12.3 × 0.06 ns = 0.74 ns.

The delay for a 2X drive, three-input NOR logic cell in the C5 library is
$$t_{PD} = (0.03 + 0.72\,C_{out} + 0.60)\ \text{ns} \qquad (3.29)$$

With C_out = 0.3 pF,

$$t_{PD} = 0.03 + (0.72) \cdot (0.3) + 0.60 = 0.846\ \text{ns} \qquad (3.30)$$

compared to our prediction of 0.74 ns. Almost all of the error here comes from
the inaccuracy in predicting the nonideal delay. Logical effort gives us a method
to examine relative delays and not accurately calculate absolute delays. More
important is that logical effort gives us an insight into why logic has the delay it
does.
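
The prediction of Eqs. 3.27 to 3.30 can be reproduced in a few lines. This is a minimal sketch using the C5 constants quoted in this section; the function name predicted_delay_tau is an assumption made for illustration.

```python
TAU_NS = 0.06      # time constant of the C5 technology
C_INV_PF = 0.036   # input capacitance of the 1X inverter (one standard load)
P_INV = 1.0        # parasitic delay of the inverter, in units of tau
Q_INV = 1.7        # nonideal delay of the inverter, in units of tau


def predicted_delay_tau(c_out_pf, n_inputs, drive):
    """Delay d = gh + p + q in units of tau for an n-input cell.

    g cancels because C_in = drive * g * C_inv, so gh = C_out / (drive * C_inv).
    """
    effort_delay = c_out_pf / (drive * C_INV_PF)
    return effort_delay + n_inputs * P_INV + n_inputs * Q_INV


d = predicted_delay_tau(0.3, n_inputs=3, drive=2)   # three-input NOR, 2X drive
t_predicted_ns = d * TAU_NS
t_databook_ns = 0.03 + 0.72 * 0.3 + 0.60            # Eq. 3.29 with C_out = 0.3 pF

print(f"d = {d:.2f} tau, predicted {t_predicted_ns:.2f} ns, "
      f"data book {t_databook_ns:.3f} ns")
```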

3.3.2 Logical Area and Logical Efficiency


Figure 3.9 shows a single-stage OR-AND-INVERT cell that has different logical
efforts at each input. The logical effort for the OAI221 is the logical-effort vector
g = (7/3, 7/3, 5/3). For example, the first element of this vector, 7/3, is the logical
effort of inputs A and B in Figure 3.9 .

FIGURE 3.9 An OAI221 logic cell with different logical efforts at each input. In
this case g = (7/3, 7/3, 5/3). The logical effort for inputs A and B is 7/3, the
logical effort for inputs C and D is also 7/3, and for input E the logical effort is
5/3. The logical area is the sum of the transistor areas, 33 logical squares.

We can calculate the area of the transistors in a logic cell (ignoring the routing
area, drain area, and source area) in units of a minimum-size n-channel transistor;
we call these units logical squares. We call the transistor area the logical area.
For example, the logical area of a 1X drive cell, OAI221X1, is calculated as
follows:
● n-channel transistor sizes: 3/1 + 4 × (3/1)

● p-channel transistor sizes: 2/1 + 4 × (4/1)

● total logical area = 2 + (4 × 4) + (5 × 3) = 33 logical squares

Figure 3.10 shows a single-stage AOI221 cell, with g = (8/3, 8/3, 7/3). The
calculation of the logical area (for an AOI221X1) is as follows:
● n-channel transistor sizes: 1/1 + 4 × (2/1)
● p-channel transistor sizes: 6/1 + 4 × (6/1)
● logical area = 1 + (4 × 2) + (5 × 6) = 39 logical squares

FIGURE 3.10 An AND-OR-INVERT cell, an AOI221, with logical-effort vector,
g = (8/3, 8/3, 7/3). The logical area is 39 logical squares.

These calculations show us that the single-stage OAI221, with an area of 33
logical squares and logical effort of (7/3, 7/3, 5/3), is more logically efficient than
the single-stage AOI221 logic cell with a larger area of 39 logical squares and
larger logical effort of (8/3, 8/3, 7/3).
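
The logical-area arithmetic can be checked with a short script. The sketch below assumes the transistor W/L values listed above for the 1X cells with r = 2; the helper names are hypothetical.

```python
from fractions import Fraction

INVERTER_INPUT = 1 + 2  # n-channel W/L = 1 plus p-channel W/L = 2 (r = 2)


def logical_area(n_widths, p_widths):
    """Sum of transistor widths in units of a minimum-size n-channel device."""
    return sum(n_widths) + sum(p_widths)


def input_effort(n_width, p_width):
    """Logical effort of one input: its gate capacitance relative to the inverter."""
    return Fraction(n_width + p_width, INVERTER_INPUT)


# OAI221X1: five 3/1 n-channel devices; p-channel devices 2/1 (input E) and four 4/1.
oai221_area = logical_area([3] * 5, [2] + [4] * 4)
oai221_g = (input_effort(3, 4), input_effort(3, 4), input_effort(3, 2))

# AOI221X1: n-channel devices 1/1 (single input) and four 2/1; five 6/1 p-channel devices.
aoi221_area = logical_area([1] + [2] * 4, [6] * 5)
aoi221_g = (input_effort(2, 6), input_effort(2, 6), input_effort(1, 6))

print("OAI221:", oai221_area, "logical squares, g =", [str(g) for g in oai221_g])
print("AOI221:", aoi221_area, "logical squares, g =", [str(g) for g in aoi221_g])
```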

3.3.3 Logical Paths


When we calculated the delay of the NOR logic cell in Section 3.3.1, the answer
did not depend on the logical effort of the cell, g (it cancelled out in Eqs. 3.27
and 3.28). This is because g is a measure of the input capacitance of a 1X drive
logic cell. Since we were not driving the NOR logic cell with another logic cell,
the input capacitance of the NOR logic cell had no effect on the delay. This is
what we do in a data book: we measure logic-cell delay using an ideal input
waveform that is the same no matter what the input capacitance of the cell.
Instead let us calculate the delay of a logic cell when it is driven by a
minimum-size inverter. To do this we need to extend the notion of logical effort.
So far we have only considered a single-stage logic cell, but we can extend the
idea of logical effort to a chain of logic cells or logical path. Consider the logic
path when we use a minimum-size inverter (g_0 = 1, p_0 = 1, q_0 = 1.7) to drive
one input of a 2X drive, three-input NOR logic cell with g_1 = (nr + 1)/(r + 1),
p_1 = 3 p_inv = 3, q_1 = 3 q_inv = 5.1, and a load equal to four standard loads
(0.3 pF in total, as in Section 3.3.1). If the logic ratio is r = 1.5,
then g_1 = 5.5/2.5 = 2.2.

The delay of the inverter is

$$d_0 = g_0 h_0 + p_0 + q_0 = (1) \cdot (2 g_1) \cdot (C_{inv}/C_{inv}) + 1 + 1.7 = (1)(2)(2.2) + 1 + 1.7 = 7.1 \qquad (3.31)$$
Of this 7.1 τ delay we can attribute 4.4 τ to the loading of the NOR logic cell input
capacitance, which is 2 g_1 C_inv. The delay of the NOR logic cell is, as before,
d_1 = g_1 h_1 + p_1 + q_1 = 12.3, making the total delay 7.1 + 12.3 = 19.4, so the
absolute delay is (19.4)(0.06 ns) = 1.164 ns, or about 1.2 ns.
We can see that the path delay D is the sum of the effort delay, parasitic delay,
and nonideal delay at each stage. In general, we can write the path delay as

$$D = \sum_{i \in \text{path}} g_i h_i + \sum_{i \in \text{path}} (p_i + q_i) \qquad (3.32)$$
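
Equation 3.32 maps directly onto a sum over stages. The following sketch recomputes the inverter-plus-NOR example; the tuple layout (g, h, p, q) and the function names are assumptions made for illustration, and the 0.3 pF load is taken from Section 3.3.1.

```python
TAU_NS = 0.06
C_INV_PF = 0.036


def stage_delay(g, h, p, q):
    """Delay of one stage in units of tau: effort delay plus parasitic and nonideal delay."""
    return g * h + p + q


def path_delay(stages):
    """Path delay D: the sum of the stage delays along the path."""
    return sum(stage_delay(*stage) for stage in stages)


r = 1.5
g_nor3 = (3 * r + 1) / (r + 1)         # logical effort of a three-input NOR, 2.2
c_in_nor_pf = 2 * g_nor3 * C_INV_PF    # input capacitance of the 2X NOR cell

inverter = (1.0, c_in_nor_pf / C_INV_PF, 1.0, 1.7)        # 1X inverter driving the NOR
nor3_2x = (g_nor3, 0.3 / c_in_nor_pf, 3 * 1.0, 3 * 1.7)   # NOR driving 0.3 pF

total_tau = path_delay([inverter, nor3_2x])
print(f"D = {total_tau:.1f} tau = {total_tau * TAU_NS:.2f} ns")
```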

3.3.4 Multistage Cells


Consider the following function (a multistage AOI221 logic cell):
ZN(A1, A2, B1, B2, C)
= NOT(NAND(NAND(A1, A2), AOI21(B1, B2, C)))
= (((A1·A2)' · (B1·B2 + C)')')'
= (A1·A2 + B1·B2 + C)'
= AOI221(A1, A2, B1, B2, C) . (3.33)

Figure 3.11 (a) shows this implementation with each input driven by a
minimum-size inverter so we can measure the effect of the cell input capacitance.

FIGURE 3.11 Logical paths. (a) An AOI221 logic cell constructed as a
multistage cell from smaller cells. (b) A single-stage AOI221 logic cell.

The logical efforts of each of the logic cells in Figure 3.11 (a) are as follows:
g_0 = g_4 = g (NOT) = 1 ,
g_1 = g (AOI21) = (2, (2r + 1)/(r + 1)) = (2, 4/2.5) = (2, 1.6) ,
g_2 = g_3 = g (NAND2) = (r + 2)/(r + 1) = (3.5)/(2.5) = 1.4 . (3.34)

Each of the logic cells in Figure 3.11 has a 1X drive strength. This means that
the input capacitance of each logic cell is given, as shown in the figure, by gC_inv.
Using Eq. 3.32 we can calculate the delay from the input of the inverter driving
A1 to the output ZN as
d_1 = (1)·(1.4) + 1 + 1.7 + (1.4)·(1) + 2 + 3.4
+ (1.4)·(0.7) + 2 + 3.4 + (1)·C_L + 1 + 1.7
= (20 + C_L) . (3.35)

In Eq. 3.35 we have normalized the output load, C_L, by dividing it by a
standard load (equal to C_inv). We can calculate the delays of the other paths
similarly.
More interesting is to compare the multistage implementation with the
single-stage version. In our C5 technology, with a logic ratio, r = 1.5, we can
calculate the logical effort for a single-stage AOI221 logic cell as
g (AOI221) = ((3r + 2)/(r + 1), (3r + 2)/(r + 1), (3r + 1)/(r + 1))
= (6.5/2.5, 6.5/2.5, 5.5/2.5)
= (2.6, 2.6, 2.2) . (3.36)

This gives the delay from an inverter driving the A input to the output ZN of the
single-stage logic cell as
d_1 = ((1)·(2.6) + 1 + 1.7 + (1)·C_L + 5 + 8.5)
= 18.8 + C_L . (3.37)

The single-stage delay is very close to the delay for the multistage version of this
logic cell. In some ASIC libraries the AOI221 is implemented as a multistage
logic cell instead of using a single stage. It raises the question: Can we make the
multistage logic cell any faster by adjusting the scale of the intermediate logic
cells?
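
To compare the two implementations for any load, Eqs. 3.35 and 3.37 can be wrapped in two small functions. This is a sketch under the assumptions of this section (r = 1.5, p_inv = 1, q_inv = 1.7, an n-input cell having p = n p_inv and q = n q_inv); the function names are hypothetical and c_load is in standard loads.

```python
P_INV, Q_INV = 1.0, 1.7


def stages_delay(stages):
    """Sum g*h + n*(p_inv + q_inv) over stages given as (g, h, n_inputs)."""
    return sum(g * h + n * (P_INV + Q_INV) for g, h, n in stages)


def multistage_delay(c_load):
    """Path inverter -> NAND2 -> NAND2 -> inverter of Figure 3.11(a), Eq. 3.35."""
    return stages_delay([
        (1.0, 1.4, 1),        # input inverter drives a NAND2 (C_in = 1.4)
        (1.4, 1.0, 2),        # NAND2 (cell 2) drives NAND2 (cell 3)
        (1.4, 1.0 / 1.4, 2),  # NAND2 (cell 3) drives the output inverter
        (1.0, c_load, 1),     # output inverter drives the load
    ])


def single_stage_delay(c_load):
    """Path inverter -> single-stage AOI221 (five inputs, g = 2.6), Eq. 3.37."""
    return stages_delay([
        (1.0, 2.6, 1),
        (2.6, c_load / 2.6, 5),
    ])


for c_load in (0, 1, 4):
    print(c_load, round(multistage_delay(c_load), 1), round(single_stage_delay(c_load), 1))
```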

3.3.5 Optimum Delay


Before we can attack the question of how to optimize delay in a logic path, we
shall need some more definitions. The path logical effort G is the product of
logical efforts on a path:
$$G = \prod_{i \in \text{path}} g_i \qquad (3.38)$$

The path electrical effort H is the product of the electrical efforts on the path,
$$H = \prod_{i \in \text{path}} h_i = \frac{C_{out}}{C_{in}} \qquad (3.39)$$

where C_out is the last output capacitance on the path (the load) and C_in is the
first input capacitance on the path.
The path effort F is the product of the path electrical effort and logical efforts,
$$F = GH \qquad (3.40)$$

The optimum effort delay for each stage is found by minimizing the path delay D
by varying the electrical efforts of each stage h_i, while keeping H, the path
electrical effort, fixed. The optimum effort delay is achieved when each stage
operates with equal effort,
$$\hat{f}_i = g_i h_i = F^{1/N} \qquad (3.41)$$

This is a useful result. The optimum path delay is then

$$\hat{D} = N F^{1/N} + P + Q = N (GH)^{1/N} + P + Q \qquad (3.42)$$

where P + Q is the sum of path parasitic delay and nonideal delay,

$$P + Q = \sum_{i \in \text{path}} (p_i + q_i) \qquad (3.43)$$

We can use these results to improve the AOI221 multistage implementation of
Figure 3.11 (a). Assume that we need a 1X cell, so the output inverter (cell 4)
must have 1X drive strength. This fixes the capacitance we must drive as C_out =
C_inv (the capacitance at the input of this inverter). The input inverters are
included to measure the effect of the cell input capacitance, so we cannot cheat
by altering these. This fixes the input capacitance as C_in = C_inv. In this case
H = 1.
The logic cells that we can scale on the path from the A input to the output are
NAND logic cells labeled as 2 and 3. In this case
$$G = g_0 \times g_2 \times g_3 = 1 \times 1.4 \times 1.4 = 1.96 \qquad (3.44)$$

Thus F = GH = 1.96 and the optimum stage effort is 1.96^(1/3) = 1.25, so that the
optimum delay N F^(1/N) = 3.75. From Figure 3.11 (a) we see that
$$g_0 h_0 + g_2 h_2 + g_3 h_3 = 1.4 + 1.4 + 1 = 3.8 \qquad (3.45)$$

This means that even if we scale the sizes of the cells to their optimum values, we
only save a fraction of a τ (3.8 − 3.75 = 0.05). This is a useful result (and one that
is true in general): the delay is not very sensitive to the scale of the cells. In this
case it means that we can reduce the size of the two NAND cells in the multicell
implementation of an AOI221 without sacrificing speed. We can use logical
effort to predict what the change in delay will be for any given cell sizes.
We can use logical effort in the design of logic cells and in the design of logic
that uses logic cells. If we do have the flexibility to continuously size each logic
cell (which in ASIC design we normally do not; we usually have to choose from
1X, 2X, 4X drive strengths), each logic stage can be sized using the equation for
the individual stage electrical efforts,
$$\hat{h}_i = \frac{F^{1/N}}{g_i} \qquad (3.46)$$

For example, even though we know that it will not improve the delay by much,
let us size the cells in Figure 3.11 (a). We shall work backward starting at the
fixed load capacitance at the input of the last inverter.
For NAND cell 3, gh = 1.25; thus (since g = 1.4), h = C_out / C_in = 0.893. The
output capacitance, C_out, for this NAND cell is the input capacitance of the
inverter, fixed as 1 standard load, C_inv. This fixes the input capacitance, C_in, of
NAND cell 3 at 1/0.893 = 1.12 standard loads. Thus, the scale of NAND cell 3 is
1.12/1.4 or 0.8X.
Now for NAND cell 2, gh = 1.25; C_out for NAND cell 2 is the C_in of NAND
cell 3. Thus C_in for NAND cell 2 is 1.12/0.893 = 1.254 standard loads. This
means the scale of NAND cell 2 is 1.254/1.4 or 0.9X.
The optimum sizes of the NAND cells are not very different from 1X in this case
because H = 1 and we are only driving a load no bigger than the input
capacitance. This raises the question: What is the optimum stage effort if we have
to drive a large load, H >> 1? Notice that, so far, we have only calculated the
optimum stage effort when we have a fixed number of stages, N . We have said
nothing about the situation in which we are free to choose, N , the number of
stages.
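
The backward sizing procedure can be written as a short loop. The sketch below is one possible formulation, not a library routine: the function name optimum_sizes is hypothetical, capacitances are in standard loads, and the path is inverter, NAND cell 2, NAND cell 3 with H = 1.

```python
import math


def optimum_sizes(g, c_in, c_out):
    """Equal-effort sizing of a path.

    g      -- logical efforts along the path, first stage first
    c_in   -- fixed input capacitance of the first stage
    c_out  -- fixed load capacitance on the last stage
    Returns the stage effort and the input capacitance of every stage.
    """
    n = len(g)
    path_effort = math.prod(g) * (c_out / c_in)   # F = G * H
    f_hat = path_effort ** (1.0 / n)              # optimum stage effort, F^(1/N)
    caps, load = [], c_out
    for g_i in reversed(g):                       # work backward from the load
        h_i = f_hat / g_i                         # optimum electrical effort
        load = load / h_i                         # C_in = C_out / h
        caps.append(load)
    return f_hat, caps[::-1]


g_path = [1.0, 1.4, 1.4]                          # inverter, NAND cell 2, NAND cell 3
f_hat, caps = optimum_sizes(g_path, c_in=1.0, c_out=1.0)
scales = [c / g_i for c, g_i in zip(caps, g_path)]  # drive strength relative to 1X

print(f"stage effort = {f_hat:.2f}")
print("scales:", [f"{s:.1f}X" for s in scales])
```

The first entry of caps comes out equal to c_in, which is a useful consistency check on the procedure.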

3.3.6 Optimum Number of Stages


Suppose we have a chain of N inverters each with equal stage effort, f = gh.
Neglecting parasitic and nonideal delay, the total path delay is Nf = Ngh = Nh,
since g = 1 for an inverter. Suppose we need to drive a path electrical effort H;
then h^N = H, or N ln h = ln H. Thus the delay, Nh = h ln H / ln h. Since ln H is
fixed, we can only vary h / ln h. Figure 3.12 shows that this is a very shallow
function with a minimum at h = e ≈ 2.718. At this point ln h = 1 and the total
delay is Ne = e ln H. This result is particularly useful in driving large loads
either on-chip (the clock, for example) or off-chip (I/O pad drivers, for example).

FIGURE 3.12 Stage effort.

h h/(ln h)
1.5 3.7
2 2.9
2.7 2.7
3 2.7
4 2.9
5 3.1
10 4.3

Figure 3.12 shows us how to minimize delay regardless of area or power and
neglecting parasitic and nonideal delays. More complicated equations can be
derived, including nonideal effects, when we wish to trade off delay for smaller
area or reduced power.
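
The values listed for Figure 3.12 follow directly from h / ln h, as the short sketch below shows. The helper optimum_stage_count, which rounds ln H to the nearest whole number of stages, is an illustrative assumption and not part of the text's derivation.

```python
import math


def delay_factor(h):
    """Relative path delay h / ln(h) for a chain of inverters with stage effort h."""
    return h / math.log(h)


def optimum_stage_count(path_electrical_effort):
    """Number of stages closest to N = ln(H), the value that makes h = e (assumed rounding)."""
    return max(1, round(math.log(path_electrical_effort)))


for h in (1.5, 2, 2.7, 3, 4, 5, 10):
    print(f"h = {h:<4} h/ln(h) = {delay_factor(h):.1f}")

print("stages for H = 1000:", optimum_stage_count(1000))
```

Because the function is so shallow near its minimum, stage efforts anywhere between about 2 and 4 cost less than 10 percent in delay.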

1. For the Compass 0.5 µm technology (C5): p_inv = 1.0, q_inv = 1.7, R_inv = 1.5
kΩ, C_inv = 0.036 pF.
3.4 Library-Cell Design
The optimum cell layout for each process generation changes because the design
rules for each ASIC vendor's process are always slightly different, even for the
same generation of technology. For example, two companies may have very
similar 0.35 µm CMOS process technologies, but the third-level metal spacing
might be slightly different. If a cell library is to be used with both processes, we
could construct the library by adopting the most stringent rules from each
process. A library constructed in this fashion may not be competitive with one
that is constructed specifically for each process. Even though ASIC vendors prize
their design rules as secret, it turns out that they are similar, except for a few
details. Unfortunately, it is the details that stop us moving designs from one
process to another. Unless we are a very large customer it is difficult to have an
ASIC vendor change or waive design rules for us. We would like all vendors to
agree on a common set of design rules. This is, in fact, easier than it sounds. The
reason that most vendors have similar rules is that most vendors use the same
manufacturing equipment and a similar process. It is possible to construct a
highest common denominator library that extracts the most from the current
manufacturing capability. Some library companies and the large Japanese ASIC
vendors are adopting this approach.
Layout of library cells is either hand-crafted or uses some form of symbolic
layout . Symbolic layout is usually performed in one of two ways: using either
interactive graphics or a text layout language. Shapes are represented by simple
lines or rectangles, known as sticks or logs , in symbolic layout. The actual
dimensions of the sticks or logs are determined after layout is completed in a
postprocessing step. An alternative to graphical symbolic layout uses a text
layout language, similar to a programming language such as C, that directs a
program to assemble layout. The spacing and dimensions of the layout shapes are
defined in terms of variables rather than constants. These variables can be
changed after symbolic layout is complete to adjust the layout spacing to a
specific process.
Mapping symbolic layout to a specific process technology uses 10 to 20 percent
more area than hand-crafted layout (though this can then be further reduced to 5
to 10 percent with compaction). Most symbolic layout systems do not allow 45°
layout and this introduces a further area penalty (my experience shows this is
about 5 to 15 percent). As libraries get larger, and the capability to quickly move
libraries and ASIC designs between different generations of process technologies
becomes more important, the advantages of symbolic layout may outweigh the
disadvantages.
Last Edited by SP 14112004

PROGRAMMABLE ASIC LOGIC CELLS
All programmable ASICs or FPGAs contain a basic logic cell replicated in a
regular array across the chip (analogous to a base cell in an MGA). There are the
following three different types of basic logic cells: (1) multiplexer based, (2)
look-up table based, and (3) programmable array logic. The choice among these
depends on the programming technology. We shall see examples of each in this
chapter.
5.1 Actel ACT
The basic logic cells in the Actel ACT family of FPGAs are called Logic
Modules . The ACT 1 family uses just one type of Logic Module and the ACT 2
and ACT 3 FPGA families both use two different types of Logic Module.

5.1.1 ACT 1 Logic Module


The functional behavior of the Actel ACT 1 Logic Module is shown in Figure 5.1
(b). Figure 5.1 (c) represents a possible circuit-level implementation. We can
build a logic function using an Actel Logic Module by connecting logic signals to
some or all of the Logic Module inputs, and by connecting any remaining Logic
Module inputs to VDD or GND. As an example, Figure 5.1 (d) shows the
connections to implement the function F = A · B + B' · C + D. How did we know
what connections to make? To understand how the Actel Logic Module works,
we take a detour via multiplexer logic and some theory.

FIGURE 5.1 The Actel ACT architecture. (a) Organization of the basic logic
cells. (b) The ACT 1 Logic Module. (c) An implementation using pass
transistors (without any buffering). (d) An example logic macro. (Source: Actel.)
5.1.2 Shannon's Expansion Theorem
In logic design we often have to deal with functions of many variables. We need
a method to break down these large functions into smaller pieces. Using the
Shannon expansion theorem, we can expand a Boolean logic function F in terms
of (or with respect to) a Boolean variable A,
F = A · F (A = '1') + A' · F (A = '0'), (5.1)
where F (A = '1') represents the function F evaluated with A set equal to '1'.
For example, we can expand the following function F with respect to (I shall use
the abbreviation wrt) A,
F = A' · B + A · B · C' + A' · B' · C
= A · (B · C') + A' · (B + B' · C). (5.2)
We have split F into two smaller functions. We call F (A = '1') = B · C' the
cofactor of F wrt A in Eq. 5.2. I shall sometimes write the cofactor of F wrt A as
F_A (the cofactor of F wrt A' is F_A'). We may expand a function wrt any of its
variables. For example, if we expand F wrt B instead of A,
F = A' · B + A · B · C' + A' · B' · C
= B · (A' + A · C') + B' · (A' · C). (5.3)
We can continue to expand a function as many times as it has variables until we
reach the canonical form (a unique representation for any Boolean function that
uses only minterms. A minterm is a product term that contains all the variables of
F, such as A · B' · C). Expanding Eq. 5.3 again, this time wrt C, gives

F = C · (A' · B + A' · B') + C' · (A · B + A' · B). (5.4)
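
A truth-table check is a quick way to confirm that each expansion is the same function. The sketch below compares F with the right-hand sides of Eqs. 5.2, 5.3, and 5.4 over all eight input combinations; the Python function names are chosen only for this illustration.

```python
from itertools import product


def f(a, b, c):
    """F = A'.B + A.B.C' + A'.B'.C"""
    return bool((not a and b) or (a and b and not c) or (not a and not b and c))


def wrt_a(a, b, c):
    """Eq. 5.2: A.(B.C') + A'.(B + B'.C)"""
    return bool((a and (b and not c)) or (not a and (b or (not b and c))))


def wrt_b(a, b, c):
    """Eq. 5.3: B.(A' + A.C') + B'.(A'.C)"""
    return bool((b and (not a or (a and not c))) or (not b and (not a and c)))


def wrt_c(a, b, c):
    """Eq. 5.4: C.(A'.B + A'.B') + C'.(A.B + A'.B)"""
    return bool((c and ((not a and b) or (not a and not b)))
                or (not c and ((a and b) or (not a and b))))


# Compare all four forms on every row of the truth table.
expansions_agree = all(
    f(*v) == wrt_a(*v) == wrt_b(*v) == wrt_c(*v)
    for v in product((0, 1), repeat=3)
)
print("all expansions agree with F:", expansions_agree)
```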


As another example, we will use the Shannon expansion theorem to implement
the following function using the ACT 1 Logic Module:
F = (A · B) + (B' · C) + D. (5.5)
First we expand F wrt B:
F = B · (A + D) + B' · (C + D)
= B · F2 + B' · F1. (5.6)
Equation 5.6 describes a 2:1 MUX, with B selecting between two inputs: F (B =
'1') and F (B = '0'). In fact Eq. 5.6 also describes the output of the ACT 1 Logic
Module in Figure 5.1! Now we need to split up F1 and F2 in Eq. 5.6. Suppose
we expand F2 = F_B wrt A, and F1 = F_B' wrt C:

F2 = A + D = (A · 1) + (A' · D), (5.7)
F1 = C + D = (C · 1) + (C' · D). (5.8)
From Eqs. 5.6 to 5.8 we see that we may implement F by arranging for A, B, C to
appear on the select lines and '1' and D to be the data inputs of the MUXes in the
ACT 1 Logic Module. This is the implementation shown in Figure 5.1 (d), with
connections: A0 = D, A1 = '1', B0 = D, B1 = '1', SA = C, SB = A, S0 = '0', and S1
= B.
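
These connections can be checked by modeling the Logic Module as two 2:1 MUXes feeding a third whose select is S0 OR S1. The sketch below assumes that structure and a MUX that passes its first data input when the select is '0'; the function names mux and act1 are illustrative.

```python
from itertools import product


def mux(d0, d1, sel):
    """2:1 MUX: returns d0 when sel is 0 and d1 when sel is 1."""
    return d1 if sel else d0


def act1(a0, a1, sa, b0, b1, sb, s0, s1):
    """Behavioral model of the ACT 1 Logic Module (assumed structure)."""
    return mux(mux(a0, a1, sa), mux(b0, b1, sb), s0 or s1)


def target(a, b, c, d):
    """F = A.B + B'.C + D"""
    return bool((a and b) or (not b and c) or d)


# Connections from the text: A0 = D, A1 = '1', SA = C, B0 = D, B1 = '1', SB = A,
# S0 = '0', S1 = B.
wiring_ok = all(
    bool(act1(d, 1, c, d, 1, a, 0, b)) == target(a, b, c, d)
    for a, b, c, d in product((0, 1), repeat=4)
)
print("ACT 1 wiring implements F:", wiring_ok)
```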
Now that we know that we can implement Boolean functions using MUXes, how
do we know which functions we can implement and how to implement them?

5.1.3 Multiplexer Logic as Function Generators
Figure 5.2 illustrates the 16 different ways to arrange 1s on a Karnaugh map
corresponding to the 16 logic functions, F (A, B), of two variables. Two of these
functions are not very interesting (F = '0', and F = '1'). Of the 16 functions,
Table 5.1 shows the 10 that we can implement using just one 2:1 MUX. Of these
10 functions, the following six are useful:
● INV. The MUX acts as an inverter for one input only.

● BUF. The MUX just passes one of the MUX inputs directly to the output.

● AND. A two-input AND.

● OR. A two-input OR.

● AND1-1. A two-input AND gate with inverted input, equivalent to a NOR-11.

● NOR1-1. A two-input NOR gate with inverted input, equivalent to an AND-11.

FIGURE 5.2 The logic functions of two variables.

TABLE 5.1 Boolean functions using a 2:1 MUX.

No. | Function, F  | F =       | Canonical form                    | Minterms [1] | Minterm code [2] | Function number [3] | M1 [4]: A0, A1, SA
1   | '0'          | '0'       | '0'                               | none         | 0000             | 0                   | 0, 0, 0
2   | NOR1-1(A, B) | (A + B')' | A' · B                            | 1            | 0010             | 2                   | B, 0, A
3   | NOT(A)       | A'        | A' · B' + A' · B                  | 0, 1         | 0011             | 3                   | 1, 0, A
4   | AND1-1(A, B) | A · B'    | A · B'                            | 2            | 0100             | 4                   | A, 0, B
5   | NOT(B)       | B'        | A' · B' + A · B'                  | 0, 2         | 0101             | 5                   | 1, 0, B
6   | BUF(B)       | B         | A' · B + A · B                    | 1, 3         | 1010             | 10                  | 0, B, 1
7   | AND(A, B)    | A · B     | A · B                             | 3            | 1000             | 8                   | 0, B, A
8   | BUF(A)       | A         | A · B' + A · B                    | 2, 3         | 1100             | 12                  | 0, A, 1
9   | OR(A, B)     | A + B     | A' · B + A · B' + A · B           | 1, 2, 3      | 1110             | 14                  | B, 1, A
10  | '1'          | '1'       | A' · B' + A' · B + A · B' + A · B | 0, 1, 2, 3   | 1111             | 15                  | 1, 1, 1

Figure 5.3 (a) shows how we might view a 2:1 MUX as a function wheel , a
three-input black box that can generate any one of the six functions of two-input
variables: BUF, INV, AND-11, AND1-1, OR, AND. We can write the output of
a function wheel as
F1 = WHEEL1 (A, B). (5.9)
where I define the wheel function as follows:
WHEEL1 (A, B) = MUX (A0, A1, SA). (5.10)
The MUX function is not unique; we shall define it as
MUX (A0, A1, SA) = A0 · SA' + A1 · SA. (5.11)
The inputs (A0, A1, SA) are described using the notation
A0, A1, SA = {A, B, '0', '1'} (5.12)
to mean that each of the inputs (A0, A1, and SA) may be any of the values: A, B,
'0', or '1'. I chose the name of the wheel function because it is rather like a dial
that you set to your choice of function. Figure 5.3 (b) shows that the ACT 1
Logic Module is a function generator built from two function wheels, a 2:1
MUX, and a two-input OR gate.
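
A function wheel is small enough to check exhaustively. The sketch below applies the MUX definition of Eq. 5.11 to a handful of input assignments and compares each with the intended function; the dictionary layout and the helper name wheel are assumptions, and the inverter wiring ('1', '0', A) is the one that Eq. 5.11 requires.

```python
from itertools import product


def mux(a0, a1, sa):
    """Eq. 5.11: MUX(A0, A1, SA) = A0.SA' + A1.SA"""
    return (a0 & (1 - sa)) | (a1 & sa)


def wheel(assignment, a, b):
    """Evaluate a wheel whose inputs (A0, A1, SA) are each one of 'A', 'B', '0', '1'."""
    value = {"A": a, "B": b, "0": 0, "1": 1}
    a0, a1, sa = (value[name] for name in assignment)
    return mux(a0, a1, sa)


# (A0, A1, SA) assignment and the function it should produce.
WHEEL_FUNCTIONS = {
    "AND(A, B)":    (("0", "B", "A"), lambda a, b: a & b),
    "OR(A, B)":     (("B", "1", "A"), lambda a, b: a | b),
    "NOR1-1(A, B)": (("B", "0", "A"), lambda a, b: (1 - a) & b),
    "AND1-1(A, B)": (("A", "0", "B"), lambda a, b: a & (1 - b)),
    "BUF(A)":       (("0", "A", "1"), lambda a, b: a),
    "NOT(A)":       (("1", "0", "A"), lambda a, b: 1 - a),  # wiring implied by Eq. 5.11
}

results = {
    name: all(wheel(assignment, a, b) == func(a, b)
              for a, b in product((0, 1), repeat=2))
    for name, (assignment, func) in WHEEL_FUNCTIONS.items()
}
for name, ok in results.items():
    print(f"{name:14s} {'ok' if ok else 'MISMATCH'}")
```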
FIGURE 5.3 The ACT 1 Logic Module as a Boolean function generator. (a) A
2:1 MUX viewed as a function wheel. (b) The ACT 1 Logic Module viewed as
two function wheels, an OR gate, and a 2:1 MUX.
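As a quick check of Eqs. 5.10 to 5.12, the following Python sketch models a function wheel by evaluating the MUX of Eq. 5.11 for a chosen setting of (A0, A1, SA). The names mux, wheel, and truth_table are illustrative only and are not part of any Actel software; logic values are represented as the integers 0 and 1.

```python
from itertools import product

def mux(a0, a1, sa):
    """2:1 MUX of Eq. 5.11: F = A0 . SA' + A1 . SA (all values 0 or 1)."""
    return (a0 & (1 - sa)) | (a1 & sa)

def wheel(a0, a1, sa, a, b):
    """Evaluate WHEEL1(A, B) for one setting of the wheel.

    a0, a1, sa are each one of the strings 'A', 'B', '0', '1' (Eq. 5.12);
    a and b are the logic values (0 or 1) applied to the wheel.
    """
    value = {'A': a, 'B': b, '0': 0, '1': 1}
    return mux(value[a0], value[a1], value[sa])

def truth_table(a0, a1, sa):
    """Wheel outputs for (A, B) = 00, 01, 10, 11, in that order."""
    return tuple(wheel(a0, a1, sa, a, b) for a, b in product((0, 1), repeat=2))

# A few wheel settings and the functions they select.
print(truth_table('0', 'B', 'A'))   # AND(A, B)    = A . B
print(truth_table('B', '1', 'A'))   # OR(A, B)     = A + B
print(truth_table('A', '0', 'B'))   # AND1-1(A, B) = A . B'
print(truth_table('0', 'A', '1'))   # BUF(A)       = A
```

Each setting corresponds to turning the dial of the wheel: only the three connections change, never the circuit.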

We can describe the ACT 1 Logic Module in terms of two WHEEL functions:
F = MUX [ WHEEL1, WHEEL2, OR (S0, S1) ](5.13)
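Now, for example, to implement a two-input NAND gate, F = NAND (A, B) =
(A · B)', using an ACT 1 Logic Module we first express F as the output of a 2:1
MUX. To split up F we expand it with respect to A (or with respect to B, since F
is symmetric in A and B):
F = A · (B') + A' · ('1') (5.14)
Thus to make a two-input NAND gate we need one wheel to implement INV (B)
and the other to implement '1'. With the MUX convention of Eq. 5.11, WHEEL1
reaches the output when the select is '0' and WHEEL2 when the select is '1', so
we assign WHEEL1 to implement '1' and WHEEL2 to implement INV (B). We
must also set the select input to the MUX connecting WHEEL1 and WHEEL2,
S0 + S1 = A; we can do this with S0 = A, S1 = '0' (setting S1 = '1' would hold
the select at '1').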
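The NAND construction can be checked exhaustively. The sketch below assumes the MUX convention of Eq. 5.11 throughout (a select of '0' passes the first data input), so the wheel seen when A = '0' must supply '1', the wheel seen when A = '1' must supply B', and the OR gate must be fed S0 = A and S1 = '0' so that S0 + S1 = A. The function names are illustrative only.

```python
from itertools import product

def mux(a0, a1, sa):
    """2:1 MUX of Eq. 5.11: F = A0 . SA' + A1 . SA."""
    return (a0 & (1 - sa)) | (a1 & sa)

def act1_module(wheel1, wheel2, s0, s1):
    """Eq. 5.13: F = MUX[WHEEL1, WHEEL2, OR(S0, S1)]."""
    return mux(wheel1, wheel2, s0 | s1)

def nand_on_act1(a, b):
    """Two-input NAND built from one ACT 1 Logic Module model."""
    wheel1 = mux(1, 1, 1)   # passed when the select is '0': constant '1'
    wheel2 = mux(1, 0, b)   # passed when the select is '1': INV(B)
    return act1_module(wheel1, wheel2, s0=a, s1=0)   # S0 + S1 = A

for a, b in product((0, 1), repeat=2):
    assert nand_on_act1(a, b) == 1 - (a & b)
    print(a, b, nand_on_act1(a, b))
```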
Before we get too carried away, we need to realize that we do not have to worry
about how to use Logic Modules to construct combinational logic functions; this
has already been done for us. For example, if we need a two-input NAND gate,
we just use a NAND gate symbol and software takes care of connecting the
inputs in the right way to the Logic Module.
How did Actel design its Logic Modules? One of Actel's engineers wrote a
program that calculates how many functions of two, three, and four variables a
given circuit would provide. The engineers tested many different circuits and
chose the best one: a small, logically efficient circuit that implemented many
functions. For example, the ACT 1 Logic Module can implement all two-input
functions, most functions with three inputs, and many with four inputs.
Apart from being able to implement a wide variety of combinational logic
functions, the ACT 1 module can implement sequential logic cells in a flexible
and efficient manner. For example, you can use one ACT 1 Logic Module for a
transparent latch or two Logic Modules for a flip-flop. The use of latches rather
than flip-flops does require a shift to a two-phase clocking scheme using two
nonoverlapping clocks and two clock trees. Two-phase synchronous design using
latches is efficient and fast, but handling the timing complexities of two clocks
requires changes to synthesis and simulation software that have not occurred.
This means that most people still use flip-flops in their designs, and these require
two Logic Modules.

5.1.4 ACT 2 and ACT 3 Logic Modules


Using two ACT 1 Logic Modules for a flip-flop also requires added interconnect
and associated parasitic capacitance to connect the two Logic Modules. To
produce an efficient two-module flip-flop macro we could use extra antifuses in
the Logic Module to cut down on the parasitic connections. However, the extra
antifuses would have an adverse impact on the performance of the Logic Module
in other macros. The alternative is to use a separate flip-flop module, reducing
flexibility and increasing layout complexity. In the ACT 1 family Actel chose to
use just one type of Logic Module. The ACT 2 and ACT 3 architectures use two
different types of Logic Modules, and one of them does include the equivalent of
a D flip-flop.
Figure 5.4 shows the ACT 2 and ACT 3 Logic Modules. The ACT 2 C-Module is
similar to the ACT 1 Logic Module but is capable of implementing five-input
logic functions. Actel calls its C-module a combinatorial module even though the
module implements combinational logic. John Wakerly blames MMI for the
introduction of the term combinatorial [Wakerly, 1994, p. 404].
The use of MUXes in the Actel Logic Modules (and in other places) can cause
confusion in using and creating logic macros. For the Actel library, setting S = '0'
selects input A of a two-input MUX. For other libraries setting S = '1' selects
input A. This can lead to some very hard to find errors when moving schematics
between libraries. Similar problems arise in flip-flops and latches with MUX
inputs. A safer way to label the inputs of a two-input MUX is with '0' and '1',
corresponding to the input selected when the select input is '0' or '1'. This notation
can be extended to bigger MUXes, but in Figure 5.4, does the input combination
S0 = '1' and S1 = '0' select input D10 or input D01? These problems are not
caused by Actel, but by failure to use the IEEE standard symbols in this area.
The S-Module ( sequential module ) contains the same combinational function
capability as the C-Module together with a sequential element that can be
configured as a flip-flop. Figure 5.4 (d) shows the sequential element
implementation in the ACT 2 and ACT 3 architectures.
FIGURE 5.4 The Actel ACT 2 and ACT 3 Logic Modules. (a) The C-Module
for combinational logic. (b) The ACT 2 S-Module. (c) The ACT 3 S-Module.
(d) The equivalent circuit (without buffering) of the SE (sequential element).
(e) The sequential element configured as a positive-edge-triggered D flip-flop.
(Source: Actel.)

5.1.5 Timing Model and Critical Path


Figure 5.5 (a) shows the timing model for the ACT family. 5 This is a simple
timing model since it deals only with logic buried inside a chip and allows us
only to estimate delays. We cannot predict the exact delays on an Actel chip until
we have performed the place-and-route step and know how much delay is
contributed by the interconnect. Since we cannot determine the exact delay
before physical layout is complete, we call the Actel architecture
nondeterministic .
Even though we cannot determine the preroute delays exactly, it is still important
to estimate the delay on a logic path. For example, Figure 5.5 (a) shows a typical
situation deep inside an ASIC. Internal signal I1 may be from the output of a
register (flip-flop). We then pass through some combinational logic, C1, through
a register, S1, and then another register, S2. The register-to-register delay
consists of a clock-to-Q delay, plus any combinational delay between registers, and
the setup time for the next flip-flop. The speed of our system will depend on the
slowest register-to-register delay, or critical path, between registers. We cannot make
our clock period any shorter than this or the signal will not reach the second
register in time to be clocked.
Figure 5.5 (a) shows an internal logic signal, I1, that is an input to a C-module,
C1. C1 is drawn in Figure 5.5 (a) as a box with a symbol comprising the
overlapping letters C and L (borrowed from carpenters who use this symbol to
mark the centerline on a piece of wood). We use this symbol to describe
combinational logic. For the standard-speed grade ACT 3 (we shall look at speed
grading in Section 5.1.6 ) the delay between the input of a C-module and the
output is specified in the data book as a parameter, t PD , with a maximum value
of 3.0 ns.
The output of C1 is an input to an S-Module, S1, configured to implement
combinational logic and a D flip-flop. The Actel data book specifies the
minimum setup time for this D flip-flop as t SUD = 0.8 ns. This means we need to
get the data to the input of S1 at least 0.8 ns before the rising clock edge (for a
positive-edge-triggered flip-flop). If we do this, then there is still enough time for
the data to go through the combinational logic inside S1 and reach the input of
the flip-flop inside S1 in time to be clocked. We can guarantee that this will work
because the combinational logic delay inside S1 is fixed.
FIGURE 5.5 The Actel ACT timing model. (a) Timing parameters for a 'Std'
speed grade ACT 3. (Source: Actel.) (b) Flip-flop timing. (c) An example of
flip-flop timing based on ACT 3 parameters.

The S-Module seems like good value: we get all the combinational logic functions
of a C-module (with delay t PD of 3 ns) as well as the setup time for a flip-flop for
only 0.8 ns? Not really. Next I will explain why not.
Figure 5.5 (b) shows what is happening inside an S-Module. The setup and hold
times, as measured inside (not outside) the S-Module, of the flip-flop are t' SUD
and t' H (a prime denotes parameters that are measured inside the S-Module). The
clock-to-Q propagation delay is t' CO . The parameters t' SUD , t' H , and t' CO are
measured using the internal clock signal CLKi. The propagation delay of the
combinational logic inside the S-Module is t' PD . The delay of the combinational
logic that drives the flip-flop clock signal ( Figure 5.4 d) is t' CLKD .
From outside the S-Module, with reference to the outside clock signal CLK1:
t SUD = t' SUD + (t' PD − t' CLKD ),

t H = t' H + (t' PD − t' CLKD ),

t CO = t' CO + t' CLKD . (5.15)

Figure 5.5 (c) shows an example of flip-flop timing. We have no way of knowing
what the internal flip-flop parameters t' SUD , t' H , and t' CO actually are, but we
can assume some reasonable values (just for illustration purposes):
t' SUD = 0.4 ns, t' H = 0.1 ns, t' CO = 0.4 ns. (5.16)

We do know the delay, t' PD , of the combinational logic inside the S-Module. It
is exactly the same as the C-Module delay, so t' PD = 3 ns for the ACT 3. We do
not know t' CLKD ; we shall assume a reasonable value of t' CLKD = 2.6 ns (the
exact value does not matter in the following argument).
Next we calculate the external S-Module parameters from Eq. 5.15 as follows:

t SUD = 0.8 ns, t H = 0.5 ns, t CO = 3.0 ns. (5.17)

These are the same as the ACT 3 S-Module parameters shown in Figure 5.5 (a),
and I chose t' CLKD and the values in Eq. 5.16 so that they would be the same. So
now we see where the combinational logic delay of 3.0 ns has gone: 0.4 ns went
into increasing the setup time and 2.6 ns went into increasing the clock-to-output
delay, t CO .
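The arithmetic behind Eqs. 5.15 to 5.17 is easy to reproduce. This Python sketch applies Eq. 5.15 to the illustrative internal values of Eq. 5.16, with t' PD = 3.0 ns and the assumed t' CLKD = 2.6 ns; the function and key names are mine, and the internal values remain assumptions, as in the text.

```python
def external_s_module_timing(t_sud_i, t_h_i, t_co_i, t_pd_i, t_clkd_i):
    """Apply Eq. 5.15; all arguments are internal (primed) values in ns."""
    # Extra data-path delay seen from outside, relative to the clock path.
    data_minus_clock = t_pd_i - t_clkd_i
    return {
        "t_SUD": t_sud_i + data_minus_clock,
        "t_H": t_h_i + data_minus_clock,
        "t_CO": t_co_i + t_clkd_i,
    }

# Internal values from Eq. 5.16, plus t'PD = 3.0 ns and t'CLKD = 2.6 ns.
timing = external_s_module_timing(0.4, 0.1, 0.4, 3.0, 2.6)
for name, value in timing.items():
    print(f"{name} = {value:.1f} ns")
```

Changing t' CLKD in this sketch shows how the 3.0 ns of combinational delay is shared between the setup time and the clock-to-output delay.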

From the outside we can say that the combinational logic delay is buried in the
flip-flop setup time. FPGA vendors will point this out as an advantage that they
have. Of course, we are not getting something for nothing here. It is like
borrowing money: you have to pay it back.

5.1.6 Speed Grading


Most FPGA vendors sort chips according to their speed ( the sorting is known as
speed grading or speed binning , because parts are automatically sorted into
plastic bins by the production tester). You pay more for the faster parts. In the
case of the ACT family of FPGAs, Actel measures performance with a special
binning circuit , included on every chip, that consists of an input buffer driving a
string of buffers or inverters followed by an output buffer. The parts are sorted
from measurements on the binning circuit according to Logic Module
propagation delay. The propagation delay, t PD , is defined as the average of the
rising ( t PLH ) and falling ( t PHL ) propagation delays of a Logic Module:

t PD = ( t PLH + t PHL )/2. (5.18)


Since the transistor properties match so well across a chip, measurements on the
binning circuit closely correlate with the speed of the rest of the Logic Modules
on the die. Since the speeds of die on the same wafer also match well, most of the
good die on a wafer fall into the same speed bin. Actel speed grades are: a 'Std'
speed grade, a '1' speed grade that is approximately 15 percent faster, a '2' speed
grade that is approximately 25 percent faster than 'Std', and a '3' speed grade that
is approximately 35 percent faster than 'Std'.

5.1.7 Worst-Case Timing


If you use fully synchronous design techniques you only have to worry about
how slow your circuit may be, not how fast. Designers thus need to know the
maximum delays they may encounter, which we call the worst-case timing.
Maximum delays in CMOS logic occur when operating under minimum voltage,
maximum temperature, and slow-slow process conditions. (A slow-slow process
refers to a process variation, or process corner, which results in slow p-channel
transistors and slow n-channel transistors; we can also have fast-fast, slow-fast,
and fast-slow process corners.)
Electronic equipment has to survive in a variety of environments and ASIC
manufacturers offer several classes of qualification for different applications:
● Commercial. VDD = 5 V ± 5 %, T A (ambient) = 0 to +70 °C.

● Industrial. VDD = 5 V ± 10 %, T A (ambient) = −40 to +85 °C.

● Military. VDD = 5 V ± 10 %, T C (case) = −55 to +125 °C.

● Military. Standard MIL-STD-883C Class B.

● Military extended. Unmanned spacecraft.
ASICs for commercial application are cheapest; ASICs for the Cruise missile are
very, very expensive. Notice that commercial and industrial application parts are
specified with respect to the ambient temperature T A (room temperature or the
temperature inside the box containing the ASIC). Military specifications are
relative to the package case temperature , T C . What is really important is the
temperature of the transistors on the chip, the junction temperature , T J , which is
always higher than T A (unless we dissipate zero power). For most applications
that dissipate a few hundred mW, T J is only 5 to 10 °C higher than T A . To
calculate the value of T J we need to know the power dissipated by the chip and
the thermal properties of the package; we shall return to this in Section 6.6.1,
Power Dissipation.
Manufacturers have to specify their operating conditions with respect to T J and
not T A , since they have no idea how much power purchasers will dissipate in
their designs or which package they will use. Actel used to specify timing under
nominal operating conditions: VDD = 5.0 V, and T J = 25 °C. Actel and most
other manufacturers now specify parameters under worst-case commercial
conditions: VDD = 4.75 V, and T J = +70 °C.

Table 5.2 shows the ACT 3 commercial worst-case timing. 6 In this table Actel
has included some estimates of the variable routing delay shown in Figure 5.5
(a). These delay estimates depend on the number of gates connected to a gate
output (the fanout).
When you design microelectronic systems (or design anything ) you must use
worst-case figures ( just as you would design a bridge for the worst-case load).
To convert nominal or typical timing figures to the worst case (or best case), we
use measured, or empirically derived, constants called derating factors that are
expressed either as a table or a graph. For example, Table 5.3 shows the ACT 3
derating factors from commercial worst-case to industrial worst-case and military
worst-case conditions (assuming T J = T A ). The ACT 1 and ACT 2 derating
factors are approximately the same. 7
TABLE 5.2 ACT 3 timing parameters. 8

                                       Fanout
Family                  Delay 9        1      2      3      4      8
ACT 3-3 (data book)     t PD           2.9    3.2    3.4    3.7    4.8
ACT 3-2 (calculated)    t PD /0.85     3.41   3.76   4.00   4.35   5.65
ACT 3-1 (calculated)    t PD /0.75     3.87   4.27   4.53   4.93   6.40
ACT 3-Std (calculated)  t PD /0.65     4.46   4.92   5.23   5.69   7.38
Source: Actel.
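The calculated rows of Table 5.2 follow directly from the data-book row. The Python sketch below reproduces them by dividing the ACT 3-3 delays by the factors shown in the Delay column; the dictionary and function names are illustrative, and the divisors are simply those printed in the table.

```python
# Data-book t_PD for the ACT 3-3 speed grade, in ns, indexed by fanout.
DATA_BOOK_TPD = {1: 2.9, 2: 3.2, 3: 3.4, 4: 3.7, 8: 4.8}

# Divisors taken from the "Delay" column of Table 5.2.
DIVISORS = {"ACT 3-2": 0.85, "ACT 3-1": 0.75, "ACT 3-Std": 0.65}

def scaled_delays(divisor):
    """Return {fanout: delay in ns}, rounded to two places as in the table."""
    return {fanout: round(t_pd / divisor, 2) for fanout, t_pd in DATA_BOOK_TPD.items()}

for grade, divisor in DIVISORS.items():
    row = scaled_delays(divisor)
    print(f"{grade:10s}", "  ".join(f"{row[f]:.2f}" for f in sorted(row)))
```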
TABLE 5.3 ACT 3 derating factors. 10

              Temperature T J (junction) / °C
V DD / V      −55     −40     0       25      70      85      125
4.5           0.72    0.76    0.85    0.90    1.04    1.07    1.17
4.75          0.70    0.73    0.82    0.87    1.00    1.03    1.12
5.00          0.68    0.71    0.79    0.84    0.97    1.00    1.09
5.25          0.66    0.69    0.77    0.82    0.94    0.97    1.06
5.5           0.63    0.66    0.74    0.79    0.90    0.93    1.01
Source: Actel.

As an example of a timing calculation, suppose we have a Logic Module on a
'Std' speed grade A1415A (an ACT 3 part) that drives four other Logic Modules
and we wish to estimate the delay under worst-case industrial conditions. From
the data in Table 5.2 we see that the Logic Module delay for an ACT 3 'Std' part
with a fanout of four is t PD = 5.7 ns (commercial worst-case conditions,
assuming T J = T A ).

If this were the slowest path between flip-flops (very unlikely since we have only
one stage of combinational logic in this path), our estimated critical path delay
between registers, t CRIT , would be the combinational logic delay plus the
flip-flop setup time plus the clock-to-output delay:
t CRIT (w-c commercial) = t PD + t SUD + t CO

= 5.7 ns + 0.8 ns + 3.0 ns = 9.5 ns. (5.19)


(I use w-c as an abbreviation for worst-case.) Next we need to adjust the timing
to worst-case industrial conditions. The appropriate derating factor is 1.07 (from
Table 5.3 ); so the estimated delay is

t CRIT (w-c industrial) = 1.07 × 9.5 ns = 10.2 ns. (5.20)

Let us jump ahead a little and assume that we can calculate that T J = T A + 20 °C
= 105 °C in our application. To find the derating factor at 105 °C we linearly
interpolate between the values for 85 °C (1.07) and 125 °C (1.17) from Table 5.3.
The interpolated derating factor is 1.12 and thus
t CRIT (w-c industrial, T J = 105 °C) = 1.12 × 9.5 ns = 10.6 ns, (5.21)

giving us an operating frequency of just less than 100 MHz.
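The derating calculation of Eqs. 5.19 to 5.21 can be written out as a short Python sketch. It uses only the numbers quoted in the text (the 4.5 V derating factors at 85 °C and 125 °C from Table 5.3); the helper interpolate and the variable names are illustrative.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Worst-case commercial values for a 'Std' ACT 3 part (ns), as in Eq. 5.19.
t_pd, t_sud, t_co = 5.7, 0.8, 3.0
t_crit_commercial = t_pd + t_sud + t_co

# Derating factors at V_DD = 4.5 V, taken from Table 5.3.
k_85, k_125 = 1.07, 1.17
k_105 = interpolate(105, 85, k_85, 125, k_125)

t_crit_industrial = k_85 * t_crit_commercial   # Eq. 5.20
t_crit_105 = k_105 * t_crit_commercial         # Eq. 5.21
f_max_mhz = 1000.0 / t_crit_105                # period in ns -> frequency in MHz

print(f"t_CRIT (w-c commercial)       = {t_crit_commercial:.1f} ns")
print(f"t_CRIT (w-c industrial)       = {t_crit_industrial:.1f} ns")
print(f"derating factor at 105 C      = {k_105:.2f}")
print(f"t_CRIT (w-c industrial, 105C) = {t_crit_105:.1f} ns")
print(f"maximum clock frequency       = {f_max_mhz:.0f} MHz")
```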


It may seem unfair to calculate the worst-case performance for the slowest speed
grade under the harshest industrial conditions, but the examples in the data books
are always for the fastest speed grades under less stringent commercial
conditions. If we want to illustrate the use of derating, then the delays can only
get worse than the data book values! The ultimate word on logic delays for all
FPGAs is the timing analysis provided by the FPGA design tools. However, you
should be able to calculate whether or not the answer that you get from such a
tool is reasonable.

5.1.8 Actel Logic Module Analysis


The sizes of the ACT family Logic Modules are close to the size of the base cell
of an MGA. We say that the Actel ACT FPGAs use a fine-grain architecture . An
advantage of a fine-grain architecture is that, whatever the mix of combinational
logic to flip-flops in your application, you can probably still use 90 percent of an
Actel FPGA. Another advantage is that synthesis software has an easier time
mapping logic efficiently to the simple Actel modules.
The physical symmetry of the ACT Logic Modules greatly simplifies the
place-and-route step. In many cases the router can swap equivalent pins on
opposite sides of the module to ease channel routing. The design of the Actel
Logic Modules is a balance between efficiency of implementation and efficiency
of utilization. A simple Logic Module may reduce performance in some areas, as I
have pointed out, but allows the use of fast and robust place-and-route software.
Fast, robust routing is an important part of Actel FPGAs (see Section 7.1, Actel
ACT).

1. The minterm numbers are formed from the product terms of the canonical
form. For example, A · B' = 10 = 2.
2. The minterm code is formed from the minterms. A '1' denotes the presence of
that minterm.
3. The function number is the decimal version of the minterm code.
4. Connections to a two-input MUX: A0 and A1 are the data inputs and SA is the
select input (see Eq. 5.11 ).

5. 1994 data book, p. 1-101.


6. ACT 3: May 1995 data sheet, p. 1-173. ACT 2: 1994 data book, p. 1-51.
7. 1994 data book, p. 1-12 (ACT 1), p. 1-52 (ACT 2), May 1995 data sheet,
p. 1-174 (ACT 3).
8. V DD = 4.75 V, T J ( junction) = 70 °C. Logic module plus routing delay. All
propagation delays in nanoseconds.
9. The Actel '1' speed grade is 15 % faster than 'Std'; '2' is 25 % faster than 'Std';
'3' is 35 % faster than 'Std'.
10. Worst-case commercial: V DD = 4.75 V, T A (ambient) = +70 °C.
Commercial: V DD = 5 V ± 5 %, T A (ambient) = 0 to +70 °C. Industrial: V DD =
5 V ± 10 %, T A (ambient) = −40 to +85 °C. Military: V DD = 5 V ± 10 %, T C
(case) = −55 to +125 °C.
5.2 Xilinx LCA
Xilinx LCA (a trademark, denoting logic cell array) basic logic cells,
configurable logic blocks or CLBs , are bigger and more complex than the Actel
or QuickLogic cells. The Xilinx LCA basic logic cell is an example of a
coarse-grain architecture . The Xilinx CLBs contain both combinational logic and
flip-flops.

5.2.1 XC3000 CLB


The XC3000 CLB, shown in Figure 5.6, has five logic inputs (A–E), a common
clock input (K), an asynchronous direct-reset input (RD), and an enable (EC).
Using programmable MUXes connected to the SRAM programming cells, you
can independently connect each of the two CLB outputs (X and Y) to the output
of the flip-flops (QX and QY) or to the output of the combinational logic (F and
G).

FIGURE 5.6 The Xilinx XC3000 CLB (configurable logic block). (Source:
Xilinx.)
A 32-bit look-up table ( LUT ), stored in 32 bits of SRAM, provides the ability to
implement combinational logic. Suppose you need to implement the function F =
A · B · C · D · E (a five-input AND). You set the contents of LUT cell number 31
(with address '11111') in the 32-bit SRAM to a '1'; all the other SRAM cells are
set to '0'. When you apply the input variables as an address to the 32-bit SRAM,
only when ABCDE = '11111' will the output F be a '1'. This means that the CLB
propagation delay is fixed, equal to the LUT access time, and independent of the
logic function you implement.
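The five-input AND example can be modeled in a few lines of Python. In this sketch the LUT is simply a 32-entry list addressed by the input values, with A assumed to be the most significant address bit; that bit ordering and the names make_lut and read_lut are assumptions for illustration, not Xilinx definitions.

```python
def make_lut(function, n_inputs=5):
    """Fill a 2**n_inputs entry table by evaluating `function` at every address."""
    lut = []
    for address in range(2 ** n_inputs):
        # The most significant address bit is the first input (A), bit 0 the last (E).
        bits = [(address >> shift) & 1 for shift in reversed(range(n_inputs))]
        lut.append(function(*bits))
    return lut

def read_lut(lut, *inputs):
    """Use the input values as an address into the table."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return lut[address]

and5 = make_lut(lambda a, b, c, d, e: a & b & c & d & e)
print(and5.index(1), sum(and5))        # position and number of stored '1's
print(read_lut(and5, 1, 1, 1, 1, 1))   # address '11111'
print(read_lut(and5, 1, 1, 0, 1, 1))   # any other address
```

Whatever function is stored, read_lut does the same single table access, which is the reason the delay does not depend on the function.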
There are seven inputs for the combinational logic in the XC3000 CLB: the five
CLB inputs (A–E), and the flip-flop outputs (QX and QY). There are two outputs
from the LUT (F and G). Since a 32-bit LUT requires only five variables to form
a unique address (32 = 2^5), there are several ways to use the LUT:
● You can use five of the seven possible inputs (A–E, QX, QY) with the
entire 32-bit LUT. The CLB outputs (F and G) are then identical.
● You can split the 32-bit LUT in half to implement two functions of four
variables each. You can choose four input variables from the seven inputs
(A–E, QX, QY). You have to choose two of the inputs from the five CLB
inputs (A–E); then one function output connects to F and the other output
connects to G.
● You can split the 32-bit LUT in half, using one of the seven input variables
as a select input to a 2:1 MUX that switches between F and G. This allows
you to implement some functions of six and seven variables.

5.2.2 XC4000 Logic Block


Figure 5.7 shows the CLB used in the XC4000 series of Xilinx FPGAs. This is a
fairly complicated basic logic cell containing 2 four-input LUTs that feed a
three-input LUT. The XC4000 CLB also has special fast carry logic hard-wired
between CLBs. MUX control logic maps four control inputs (C1–C4) into the
four inputs: LUT input H1, direct in (DIN), enable clock (EC), and a set/reset
control (S/R) for the flip-flops. The control inputs (C1–C4) can also be used to
control the use of the F' and G' LUTs as 32 bits of SRAM.
FIGURE 5.7 The Xilinx XC4000 family CLB (configurable logic block). (
Source: Xilinx.)

5.2.3 XC5200 Logic Block


Figure 5.8 shows the basic logic cell, a Logic Cell or LC, used in the XC5200
family of Xilinx LCA FPGAs. 1 The LC is similar to the CLBs in the
XC2000/3000/4000 families, but simpler. Xilinx retained the term CLB in the
XC5200 to mean a group of four LCs (LC0–LC3).
The XC5200 LC contains a four-input LUT, a flip-flop, and MUXes to handle
signal switching. The arithmetic carry logic is separate from the LUTs. A limited
capability to cascade functions is provided (using the MUX labeled F5_MUX in
logic cells LC0 and LC2 in Figure 5.8 ) to gang two LCs in parallel to provide the
equivalent of a five-input LUT.
FIGURE 5.8 The Xilinx XC5200 family LC (Logic Cell) and CLB
(configurable logic block). (Source: Xilinx.)
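To see why two four-input LUTs and a 2:1 MUX give the equivalent of a five-input LUT, the following Python sketch splits a five-input function on its fifth input (a Shannon expansion). This is a behavioral illustration only: the fifth input E, the function names, and the choice of parity as the example are all assumptions, and the sketch does not model the actual F5_MUX wiring.

```python
from itertools import product

def make_lut4(function):
    """A 16-entry LUT holding `function` of four inputs (address = ABCD)."""
    return [function((n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1)
            for n in range(16)]

def read_lut4(lut, a, b, c, d):
    return lut[(a << 3) | (b << 2) | (c << 1) | d]

def five_input_function(lut_e0, lut_e1, a, b, c, d, e):
    """Both LUTs see A to D in parallel; a 2:1 MUX driven by E picks one result."""
    return read_lut4(lut_e1, a, b, c, d) if e else read_lut4(lut_e0, a, b, c, d)

def parity5(a, b, c, d, e):
    """Example target function: five-input odd parity."""
    return a ^ b ^ c ^ d ^ e

# One LUT holds the function with E = 0, the other holds it with E = 1.
lut_e0 = make_lut4(lambda a, b, c, d: parity5(a, b, c, d, 0))
lut_e1 = make_lut4(lambda a, b, c, d: parity5(a, b, c, d, 1))

matches = all(
    five_input_function(lut_e0, lut_e1, *bits) == parity5(*bits)
    for bits in product((0, 1), repeat=5)
)
print(matches)
```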

5.2.4 Xilinx CLB Analysis


The use of a LUT in a Xilinx CLB to implement combinational logic is both an
advantage and a disadvantage. It means, for example, that an inverter is as slow
as a five-input NAND. On the other hand a LUT simplifies timing of
synchronous logic, simplifies the basic logic cell, and matches the Xilinx SRAM
programming technology well. A LUT also provides the possibility, used in the
XC4000, of using the LUT directly as SRAM. You can configure the XC4000
CLB as a memory, either two 16 × 1 SRAMs or a 32 × 1 SRAM, but this is
expensive RAM.
Figure 5.9 shows the timing model for Xilinx LCA FPGAs. 2 Xilinx uses two
speed-grade systems. The first uses the maximum guaranteed toggle rate of a
CLB flip-flop measured in MHz as a suffix, so higher is faster. For example a
Xilinx XC3020-125 has a toggle frequency of 125 MHz. The other Xilinx
naming system (which supersedes the old scheme, since toggle frequency is
rather meaningless) uses the approximate delay time of the combinational logic
in a CLB in nanoseconds, so lower is faster in this case. Thus, for example, an
XC4010-6 has t ILO = 6.0 ns (the correspondence between speed grade and t ILO
is fairly accurate for the XC2000, XC4000, and XC5200 but is less accurate for
the XC3000).
FIGURE 5.9 The Xilinx LCA timing model. The paths show different uses of
CLBs (configurable logic blocks). The parameters shown are for an XC5210-6.
(Source: Xilinx.)

The inclusion of flip-flops and combinational logic inside the basic logic cell
leads to efficient implementation of state machines, for example. The
coarse-grain architecture of the Xilinx CLBs maximizes performance given the
size of the SRAM programming technology element. As a result of the increased
complexity of the basic logic cell we shall see (in Section 7.2, Xilinx LCA) that
the routing between cells is more complex than other FPGAs that use a simpler
basic logic cell.

1. Xilinx decided to use Logic Cell as a trademark in 1995 rather as if IBM were
to use Computer as a trademark today. Thus we should now only talk of a Xilinx
Logic Cell (with capital letters) and not Xilinx logic cells.
2. October 1995 (Version 3.0) data sheet.
5.3 Altera FLEX
Figure 5.10 shows the basic logic cell, a Logic Element ( LE ), that Altera uses in
its FLEX 8000 series of FPGAs. Apart from the cascade logic (which is slightly
simpler in the FLEX LE) the FLEX cell resembles the XC5200 LC architecture
shown in Figure 5.8 . This is not surprising since both architectures are based on
the same SRAM programming technology. The FLEX LE uses a four-input LUT,
a flip-flop, cascade logic, and carry logic. Eight LEs are stacked to form a Logic
Array Block (the same term as used in the MAX series, but with a different
meaning).

FIGURE 5.10 The Altera FLEX architecture. (a) Chip floorplan. (b) LAB
(Logic Array Block). (c) Details of the LE (Logic Element). ( Source: Altera
(adapted with permission).)
5.4 Altera MAX
Suppose we have a simple two-level logic circuit that implements a sum of products
as shown in Figure 5.11 (a). We may redraw any two-level circuit using a regular
structure ( Figure 5.11 b): a vector of buffers, followed by a vector of AND gates
(which construct the product terms) that feed OR gates (which form the sums of the
product terms). We can simplify this representation still further ( Figure 5.11 c), by
drawing the input lines to a multiple-input AND gate as if they were one horizontal
wire, which we call a product-term line . A structure such as Figure 5.11 (c) is called
programmable array logic , first introduced by Monolithic Memories as the PAL
series of devices.

FIGURE 5.11 Logic arrays. (a) Two-level logic. (b) Organized sum of products.
(c) A programmable-AND plane. (d) EPROM logic array. (e) Wired logic.

Because the arrangement of Figure 5.11 (c) is very similar to a ROM, we sometimes
call a horizontal product-term line, which would be the bit output from a ROM, the bit
line . The vertical input line is the word line . Figure 5.11 (d) and (e) show how to
build the programmable-AND array (or product-term array) from EPROM transistors.
The horizontal product-term lines connect to the vertical input lines using the EPROM
transistors as pull-downs at each possible connection. Applying a '1' to the gate of an
unprogrammed EPROM transistor pulls the product-term line low to a '0'. A
programmed n -channel transistor has a threshold voltage higher than V DD and is
therefore always off . Thus a programmed transistor has no effect on the product-term
line.
Notice that connecting the n-channel EPROM transistors to a pull-up resistor as
shown in Figure 5.11 (e) produces a wired-logic function: the output is high only if all
of the outputs are high, resulting in a wired-AND function of the outputs. The
product-term line is low when any of the inputs are high. Thus, to convert the
wired-logic array into a programmable-AND array, we need to invert the sense of the
inputs. We often conveniently omit these details when we draw the schematics of
logic arrays, usually implemented as NOR-NOR arrays (so we need to invert the
outputs as well). They are not minor details when you implement the layout, however.
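The behavior of one wired-logic product-term line can be captured in a short Python model. The sketch below assumes ideal switches (a programmed transistor never conducts, an unprogrammed one conducts whenever its gate is '1') and shows how driving the line with inverted literals turns the wired NOR into an AND; the function names are illustrative.

```python
def product_term_line(inputs, programmed):
    """Value of one wired-logic product-term line (Figure 5.11 e).

    inputs[i] is the value on vertical input line i; programmed[i] is True when
    the EPROM transistor at that crossing is programmed (always off). The pull-up
    holds the line at '1' unless an unprogrammed transistor with a '1' on its
    gate pulls the line down to '0'.
    """
    pulled_low = any(x == 1 and not p for x, p in zip(inputs, programmed))
    return 0 if pulled_low else 1

def a_and_not_b(a, b):
    """Form the product term A . B' by inverting the sense of the inputs."""
    # Vertical input lines carry both polarities: A', A, B', B.
    lines = [1 - a, a, 1 - b, b]
    # Leave the transistors on the A' and B lines unprogrammed (connected);
    # program the other two so that they never conduct.
    programmed = [False, True, True, False]
    return product_term_line(lines, programmed)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, a_and_not_b(a, b))
```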
Figure 5.12 shows how a programmable-AND array can be combined with other logic
into a macrocell that contains a flip-flop. For example, the widely used 22V10 PLD,
also called a registered PAL, essentially contains 10 of the macrocells shown in
Figure 5.12 . The part number, 22V10, denotes that there are 22 inputs (44 vertical
input lines for both true and complement forms of the inputs) to the programmable
AND array and 10 macrocells. The PLD or registered PAL shown in Figure 5.12 has
a 2i × jk programmable-AND array.

FIGURE 5.12 A registered PAL with i inputs, j product terms, and k macrocells.

5.4.1 Logic Expanders


The basic logic cell for the Altera MAX architecture, a macrocell, is a descendant of
the PAL. Using the logic expander , shown in Figure 5.13 to generate extra logic
terms, it is possible to implement functions that require more product terms than are
available in a simple PAL macrocell. As an example, consider the following function:
F = A' · C · D + B' · C · D + A · B + B · C'. (5.22)
This function has four product terms and thus we cannot implement F using a
macrocell that has only a three-wide OR array (such as the one shown in Figure 5.13).
If we rewrite F as a sum of (products of products) like this:
F = (A' + B') · C · D + (A + C') · B
= (A · B)' · (C · D) + (A' · C)' · B ; (5.23)
we can use logic expanders to form the expander terms (A · B)' and (A' · C)' (see
Figure 5.13 ). We can even share these extra product terms with other macrocells if
we need to. We call the extra logic gates that form these shareable product terms a
shared logic expander , or just shared expander .
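A brute-force check confirms that the expander form of Eq. 5.23 is the same function as the four-term sum of products of Eq. 5.22. The Python sketch below compares the two over all 16 input combinations; the function names are illustrative.

```python
from itertools import product

def f_sum_of_products(a, b, c, d):
    """Eq. 5.22: four product terms."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (na & c & d) | (nb & c & d) | (a & b) | (b & nc)

def f_with_expanders(a, b, c, d):
    """Eq. 5.23: two product terms that use two expander terms."""
    exp1 = 1 - (a & b)          # expander term (A . B)'
    exp2 = 1 - ((1 - a) & c)    # expander term (A' . C)'
    return (exp1 & c & d) | (exp2 & b)

equivalent = all(
    f_sum_of_products(*bits) == f_with_expanders(*bits)
    for bits in product((0, 1), repeat=4)
)
print(equivalent)
```

The rewritten form needs only two terms in the OR array, at the cost of a second pass through the product-term array to form the expander terms.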

FIGURE 5.13 Expander logic and programmable inversion. An expander increases
the number of product terms available and programmable inversion allows you to
reduce the number of product terms you need.

The disadvantage of the shared expanders is the extra logic delay incurred because of
the second pass that you need to take through the product-term array. We usually do
not know before the logic tools assign logic to macrocells ( logic assignment )
whether we need to use the logic expanders. Since we cannot predict the exact timing
the Altera MAX architecture is not strictly deterministic . However, once we do know
whether a signal has to go through the array once or twice, we can simply and
accurately predict the delay. This is a very important and useful feature of the Altera
MAX architecture.
The expander terms are sometimes called helper terms when you use a PAL. If you
use helper terms in a 22V10, for example, you have to go out to the chip I/O pad and
then back into the programmable array again, using two-pass logic .
FIGURE 5.14 Use of programmed inversion to simplify logic: (a) The function F =
A · B' + A · C' + A · D' + A' · C · D requires four product terms (P1–P4) to implement
while (b) the complement, F ' = A · B · C · D + A' · D' + A' · C' requires only three
product terms (P1–P3).

Another common feature in complex PLDs, also used in some PLDs, is shown in
Figure 5.13. Programming one input of the XOR gate at the macrocell output allows
you to choose whether or not to invert the output (a '1' for inversion or a '0' for no
inversion). This programmable inversion can reduce the required number of product
terms by using a de Morgan equivalent representation instead of a conventional
sum-of-products form, as shown in Figure 5.14.

As an example of using programmable inversion, consider the function
F = A · B' + A · C' + A · D' + A' · C · D , (5.24)
which requires four product terms, one too many for a three-wide OR array.
If we generate the complement of F instead,
F ' = A · B · C · D + A' · D' + A' · C' , (5.25)
this has only three product terms. To create F we invert F ', using programmable
inversion.
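The claim that inverting the three-term F ' of Eq. 5.25 recovers the four-term F of Eq. 5.24 can be verified exhaustively. In this Python sketch the XOR gate at the macrocell output is modeled by macrocell_output; all names are illustrative.

```python
from itertools import product

def f_direct(a, b, c, d):
    """Eq. 5.24: F as a sum of four product terms."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    return (a & nb) | (a & nc) | (a & nd) | (na & c & d)

def f_complement(a, b, c, d):
    """Eq. 5.25: F' as a sum of three product terms."""
    na, nc, nd = 1 - a, 1 - c, 1 - d
    return (a & b & c & d) | (na & nd) | (na & nc)

def macrocell_output(or_array_output, invert):
    """XOR gate at the macrocell output: invert = 1 inverts, invert = 0 passes."""
    return or_array_output ^ invert

same = all(
    macrocell_output(f_complement(*bits), invert=1) == f_direct(*bits)
    for bits in product((0, 1), repeat=4)
)
print(same)
```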
Figure 5.15 shows an Altera MAX macrocell and illustrates the architectures of
several different product families. The implementation details vary among the
families, but the basic features (wide programmable-AND array, narrow fixed-OR
array, logic expanders, and programmable inversion) are very similar. 1 Each family
has the following individual characteristics:
● A typical MAX 5000 chip has: 8 dedicated inputs (with both true and
complement forms); 24 inputs from the chipwide interconnect (true and
complement); and either 32 or 64 shared expander terms (single polarity). The
MAX 5000 LAB looks like a 32V16 PLD (ignoring the expander terms).
● The MAX 7000 LAB has 36 inputs from the chipwide interconnect and 16
shared expander terms; the MAX 7000 LAB looks like a 36V16 PLD.
● The MAX 9000 LAB has 33 inputs from the chipwide interconnect and 16 local
feedback inputs (as well as 16 shared expander terms); the MAX 9000 LAB
looks like a 49V16 PLD.
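The PLD-style names in the list above follow directly from the input counts. The sketch below reproduces them; the helper names are illustrative. The AND-array width calculation is an assumption, not something the text states: it supposes every LAB input is available in true and complement form and every shared expander in a single polarity. Under that assumption the MAX 9000 figure comes out at 114, which agrees with the "114-wide AND gate" quoted in the power dissipation discussion.

```python
def pld_equivalent(inputs, macrocells=16):
    # "NVM" naming: N array inputs, M macrocell outputs (16 per LAB in the text)
    return f"{inputs}V{macrocells}"


def and_array_width(inputs, shared_expanders):
    # Assumption: true + complement for each input, single polarity per expander
    return 2 * inputs + shared_expanders


lab_inputs = {
    "MAX 5000": 8 + 24,   # dedicated inputs + chipwide interconnect
    "MAX 7000": 36,       # chipwide interconnect
    "MAX 9000": 33 + 16,  # chipwide interconnect + local feedback
}

for family, n in lab_inputs.items():
    print(family, pld_equivalent(n))

print(and_array_width(lab_inputs["MAX 9000"], 16))  # 114
```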
FIGURE 5.15 The Altera MAX architecture. (a) Organization of logic and
interconnect. (b) A MAX family LAB (Logic Array Block). (c) A MAX family
macrocell. The macrocell details vary between the MAX families; the functions
shown here are closest to those of the MAX 9000 family macrocells.
FIGURE 5.16 The timing model for the Altera MAX architecture. (a) A direct
path through the logic array and a register. (b) Timing for the direct path.
(c) Using a parallel expander. (d) Parallel expander timing. (e) Making two
passes through the logic array to use a shared expander. (f) Timing for the
shared expander (there is no register in this path). All timing values are in
nanoseconds for the MAX 9000 series, '15' speed grade. (Source: Altera.)

5.4.2 Timing Model


Figure 5.16 shows the Altera MAX timing model for local signals. 2 For example, in
Figure 5.16(a) an internal signal, I1, enters the local array (the LAB interconnect with
a fixed delay t 1 = t LOCAL = 0.5 ns), passes through the AND array (delay t 2 = t LAD
= 4.0 ns), and to the macrocell flip-flop (with setup time, t 3 = t SU = 3.0 ns, and
clock-to-Q or register delay, t 4 = t RD = 1.0 ns). The path delay is thus:
0.5 + 4 + 3 + 1 = 8.5 ns.
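Because every segment of the path has a fixed delay, the path delay is a plain sum. A minimal sketch using the values quoted above (MAX 9000, '15' speed grade); the constant and function names are illustrative.

```python
# Fixed delays quoted in the text, in nanoseconds
T_LOCAL = 0.5  # local array (LAB interconnect)
T_LAD = 4.0    # AND array
T_SU = 3.0     # flip-flop setup time
T_RD = 1.0     # register (clock-to-Q) delay


def direct_path_delay():
    # One pass through the logic array and a register
    return T_LOCAL + T_LAD + T_SU + T_RD


print(direct_path_delay())  # 8.5
```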
Figure 5.16(c) illustrates the use of a parallel logic expander. This is different from
the case of the shared expander (Figure 5.13), which required two passes in series
through the product-term array. Using a parallel logic expander, the extra product term
is generated in an adjacent macrocell in parallel with other product terms (not in series
as in a shared expander).
We can illustrate the difference between a parallel expander and a shared expander
using an example function that we have used before (Eq. 5.22),

F = A' · C · D + B' · C · D + A · B + B · C' . (5.26)


This time we shall use macrocell M1 in Figure 5.16(d) to implement F1 equal to the
sum of the first three product terms in Eq. 5.26. We use F1 (using the parallel
expander connection between adjacent macrocells shown in Figure 5.15) as an input
to macrocell M2. Now we can form F = F1 + B · C' without using more than three
inputs of an OR gate (the MAX 5000 has a three-wide OR array in the macrocell; the
MAX 9000, as shown in Figure 5.15, is capable of handling five product terms in one
macrocell, but the principle is the same). The total delay is the same as before, except
that we add the delay of a parallel expander, t PEXP = 1.0 ns. Total delay is then 8.5 +
1 = 9.5 ns.
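The split of Eq. 5.26 across two macrocells can be checked with a truth table. In this sketch, m1 and m2 are illustrative stand-ins for macrocells M1 and M2, each limited to a three-wide OR, and the delay figures are the ones quoted above.

```python
from itertools import product

T_DIRECT = 8.5  # direct path delay from the text, ns
T_PEXP = 1.0    # parallel expander delay from the text, ns


def f(a, b, c, d):
    # Eq. 5.26: four product terms
    return (not a and c and d) or (not b and c and d) or (a and b) or (b and not c)


def m1(a, b, c, d):
    # Macrocell M1: F1 = sum of the first three product terms
    return (not a and c and d) or (not b and c and d) or (a and b)


def m2(f1, b, c):
    # Macrocell M2: F = F1 + B.C', with F1 arriving on the parallel expander
    return f1 or (b and not c)


def split_matches():
    return all(
        bool(f(a, b, c, d)) == bool(m2(m1(a, b, c, d), b, c))
        for a, b, c, d in product([False, True], repeat=4)
    )


def parallel_expander_delay():
    return T_DIRECT + T_PEXP


print(split_matches())            # True
print(parallel_expander_delay())  # 9.5
```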
Figure 5.16(e) and (f) show the use of a shared expander, similar to Figure 5.13.

The Altera MAX macrocell is more like a PLD than the other FPGA architectures
discussed here; that is why Altera calls the MAX architecture a complex PLD. This
means that the MAX architecture works well in applications for which PLDs are most
useful: simple, fast logic with many inputs or variables.

5.4.3 Power Dissipation in Complex PLDs


A programmable-AND array in any PLD built using EPROM or EEPROM transistors
uses a passive pull-up (a resistor or current source), and these macrocells consume
static power. Altera uses a switch called the Turbo Bit to control the current in the
programmable-AND array in each macrocell. For the MAX 7000, static current varies
between 1.4 mA and 2.2 mA per macrocell in high-power mode (the current depends
on the part; generally, but not always, the larger 7000 parts have lower operating
currents) and between 0.6 mA and 0.8 mA in low-power mode. For the MAX 9000,
the static current is 0.6 mA per macrocell in high-current mode and 0.3 mA in
low-power mode, independent of the part size. 3 Since there are 16 macrocells in a
LAB and up to 35 LABs on the largest MAX 9000 chip (16 × 35 = 560 macrocells),
just the static power dissipation in low-power mode can be substantial (560 × 0.3 mA
× 5 V = 840 mW). If all the macrocells are in high-power mode, the static power will
double. This is the price you pay for having an (up to) 114-wide AND gate delay of a
few nanoseconds (t LAD = 4.0 ns) in the MAX 9000. For any MAX 9000 macrocell in
the low-power mode it is necessary to add a delay of between 15 ns and 20 ns to any
signal path through the local interconnect and logic array (including t LAD and t PEXP).
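The static power estimate is a product of macrocell count, per-macrocell current, and supply voltage. A minimal sketch using the MAX 9000 figures quoted above; the function and variable names are illustrative.

```python
def static_power_mw(macrocells, current_ma_per_macrocell, supply_v):
    # mA x V gives mW
    return macrocells * current_ma_per_macrocell * supply_v


MACROCELLS = 16 * 35  # 16 macrocells per LAB, up to 35 LABs
SUPPLY_V = 5.0

low_power = static_power_mw(MACROCELLS, 0.3, SUPPLY_V)
high_power = static_power_mw(MACROCELLS, 0.6, SUPPLY_V)

print(f"{low_power:.0f} mW")   # 840 mW
print(f"{high_power:.0f} mW")  # 1680 mW
```

The second figure is twice the first, which matches the statement that the static power doubles when all macrocells are in high-power mode.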

1. 1995 data book p. 274 (5000), p. 160 (7000), p. 126 (9000).


2. March 1995 data sheet, v2.
