Lec05 (DSD)
Lec05 (DSD)
Lecture 5
Agenda
Synchronous Digital Logic Review
Performance, Delay, Timing
FSM Review
Digital System Design Process
RTL Design
A Design Example
2
Digital Computers are Everywhere
Transistor (MOSFET)
Cell
Scheme for
Representing
Information Gate
4
Levels of Design Representation/Interpretation
High Level Language Program temp = v[k];
(e.g., C) v[k] = v[k+1];
v[k+1] = temp;
Compiler
lw $t0, 0($2)
Assembly Language lw $t1, 4($2)
Program sw $t1, 0($2)
sw $t0, 4($2)
Assembler
0000 1001 1100 0110 1010 1111 0101 1000
Machine Language Program 1010 1111 0101 1000 0000 1001 1100 0110
(Binary Code) 1100 0110 1010 1111 0101 1000 0000 1001
0101 1000 0000 1001 1100 0110 1010 1111
Machine Interpretation Anything can be
represented as a number,
Hardware Architecture Description i.e., data or instructions
(e.g., Block Diagrams)
Architecture
Implementation
Logic Circuit Description
(Gate-level Netlist)
RTL Design
6
Design of Digital Integrated Systems
System Level
Gate Level
Transistor Level
Layout Level
Mask Level
always
begin
ir <= m[pc];
if(ir[15:13] == 3b’000)
pc <= m[ir[12:0]];
else if (ir[15:13] == 3’b010)
acc <= -m[ir[12:0]];
...
end
endmodule
8
Gate Level
Model on finite-state machine level
Models function in Boolean logic using registers and gates
Various delay models for gates and wires
No
1ns delay?
3ns 4ns
5ns
Transistor Level
Model on CMOS transistor level
Depending on application function modeled as resistive
switches
Used in functional equivalence checking
Or full differential equations for circuit simulation
Used in detailed timing analysis
10
Layout Level
Transistors and wires are laid out as polygons in different
technology layers such as diffusion, poly-silicon, metal,
etc.
11
HDL behavior
12
General Model of Synchronous Circuit
input
clock
output
All wires, except clock, may be multiple Combinational Logic Blocks (CL)
bits wide no internal state
Registers (reg) output only a function of inputs
collections of flip-flops Particular inputs/outputs are optional
clock Optional Feedback
distributed to all flip-flops
typical rate?
13
output
14
Limitations on Clock Rate
Logic Gate Delay Delays in flip-flops
input D
D Q
output
clk
clk Q
setup time clock-to-Q delay
(F/F delay)
What are typical delay values? Both times contribute to limiting the
clock period.
15
b
D Q a D Q
clk clk
𝑇 ≥ 𝑡𝑖𝑚𝑒 𝑐𝑙𝑘𝑄 + 𝑡𝑖𝑚𝑒 𝑚𝑢𝑥 + 𝑡𝑖𝑚𝑒 𝑠𝑒𝑡𝑢𝑝
Clk-to-Q mux
𝑇 ≥ 𝜏 +𝜏 +𝜏
𝑇
clk
a
clk−to−Q
setup
b
mux
16
General Model of Synchronous Circuit
clock input
output
• In general, for correct operation:
𝑇 ≥ 𝑡𝑖𝑚𝑒 (𝑐𝑙𝑘 → 𝑄) + 𝑡𝑖𝑚𝑒 (𝐶𝐿) + 𝑡𝑖𝑚𝑒 (𝑠𝑒𝑡𝑢𝑝)
𝑇 ≥ 𝜏 +𝜏 +𝜏 for all paths
17
𝑀-input
NAND
gate
A NAND gate with fan-in 𝑀
A driving gate
(with fan-out 𝑁 ) N
18
Gate Delay due to Fan-out
Fan-out: 2
1
Each logic cell (inverter)
contributes capacitances
3
𝑉
1 load
2 loads
𝑡
The delay of a gate is proportional to its output capacitance.
Because gates #2 and 3 turn on/off later. (It takes longer for the
output of gate #1 to reach the switching threshold of gates #2
and 3 as we add more output capacitance.)
19
Delay in Flip-flops
clk clk
D Q D Q
clk clk clk clk
clk
clk clk
clk == 0
Sense D, but Q D Q
outputs old value setup
clk 0 1
Capture D, pass
D Q
value to Q
hold clk-to-Q
20
Review: Two Kinds of FSMs
• Moore Machine vs. Mealy Machine
𝑶𝒖𝒕𝒑𝒖𝒕 (𝒕) = 𝑮( 𝒔𝒕𝒂𝒕𝒆(𝒕)) 𝑶𝒖𝒕𝒑𝒖𝒕 (𝒕) = 𝑮( 𝒔𝒕𝒂𝒕𝒆(𝒕), 𝑰𝒏𝒑𝒖𝒕 )
Input Output
Combinational Input
Logic
Output
Combinational
state
state
Combinational Logic
Logic
21
In = 0
100 110
In = 1
Sequential Logic
Sequences through a series of states
Based on sequence of values on input signals
Clock period defines elements of sequence
22
Review: Formal Design Process
Logic equations from table:
OUT = PS • Review of Design Steps:
NS = PS xor IN
1. Circuit functional specification
2. State Graph
Circuit Diagram: 3. Symbolic State Table
4. Encoded State Table
5. Derive Logic Equations
6. Circuit Diagram
ps ns
FFs for state
Combinational logic for NS and OUT
23
IN = 0
EVEN
OUT = 0
IN = 1
“State Graph” IN = 1
24
Formal Design Process
• State Table:
IN = 0
ODD
• Invent a code to represent states: OUT = 1
Let 0 = EVEN state, 1 = ODD state
IN = 0
25
Pgm Language
Asm / Machine Lang
HDL
FlipFlops
Gates
Circuits
Devices
Transistor Physics
Transfer Function
26
A Standard High-level Organization
Controller
Datapath Memory
(FSM)
Memory
Controller
Optional block used for long term storage of Datapath Memory
(FSM)
data structures
27
signals
Control Points
28
Datapath Bit-Sliced Organization
Control Flow
Bit 3
Pipeline Register
Pipeline Register
Pipeline Register
Pipeline Register
Register File
Multiplexer
Multiplexer
Bit 2
Shifter
Adder
From
I$
Bit 1
Bit 0
decoder
29
Personal style:
input use “;” to separate transfers that occur on
clock separate cycles
Use “,” to separate transfers that occur
input CL reg CL reg output on the same cycle
Example (2 cycles):
regA ← regB, regB ← 0;
output
regC ← regA;
30
Register Transfer Operation
Two-character names denote registers, such as R0, R1, DR, or SA.
Arrows indicate data transfers. To copy the contents of the source
register R2 into the destination register R1 in one clock cycle:
R1 ← R2
31
Logical operations are applied bitwise. AND and OR are denoted with special
symbols, to prevent confusion with arithmetic operations.
Lastly, we can shift registers. Here, the source register R1 is not modified, and
we assume that the shift input is just 0.
R2 ← sl R1; // left shift
R2 ← sr R1; // right shift
32
RTL Abstraction
Increases productivity by allowing designers to focus on
behavior rather than gate-level logic
Design components can be specified w/ concise and modular
code in HDL (Verilog, VHDL)
Synthesis tools understand RTL design
33
Interface Requirements
What signals control the operations?
What order these signals are in?
State-machine design
What are the outputs in each state?
Look for concurrency in the design
34
Register Transfer – Multiple Busses
One transfer per bus
Each set of wires can carry one value
State Elements
Registers
Register files MUX MUX MUX MUX
Memory
R1 R2 R3 R4
Combinational Elements
Busses
ALUs
Memory (read)
35
Registers
Selectively loaded – EN or LD input
Output enable – OE input (when using Tri-state bus)
Multiple registers – group 4 or 8 in parallel
36
Register Files
Collections of registers in one package
Two-dimensional array of FFs
Address used as index to a particular word
Separate read and write addresses so can do both at same time
37
Memories
Larger Collections of Storage Elements
Implemented not as FFs but as much more area-efficient cells
High-density memories use 1-6 switches (transistors) per bit
Ex: Static RAM – 1024 words each 4 bits wide
Once written, memory holds forever (not true for denser dynamic RAM)
Address lines to select word (10 lines for 1024 words)
Read enable
Same as output enable RD
38
ALU
Block Diagram
Input: data and operation to perform
Add, Sub, AND, OR, NOT, XOR, …
Output: result of operation and status information
A B
16 16
Operation
16
N S Z
39
Datapath (Hierarchy)
Arithmetic circuits
constructed in Cin
A
S
B
40
Example Data Path (ALU + Registers)
Accumulator
Special register
One of the inputs to ALU
Output of ALU stored back in 16
accumulator
REG AC
One-input Operation
16 16
Other operand and destination
OP
is accumulator register
AC AC op REG
N 16
“Single address instructions”
Z
AC AC op Mem[addr]
41
Datapath (Bit-slice)
Bit-slice concept: iterate to build -bit wide datapaths
Data bit busses run through the slice
AC AC AC
R0 R0 R0
R1 R1 R1
R1 R1 R1
R2 R2 R2
42
Interconnect Layout in Datapath
Control signals
Data track
<63>
<i>
<1>
<0>
43
44
High Performance Devices
Mixture of full custom, standard cells and macro’s
Full custom for special blocks: Adder (data path), etc.
Macros for standard blocks: RAM, ROM, etc.
Standard cells for non-critical digital blocks
45
0
1 It becomes the high-level
S1 0 1 specification for the controller.
clock
clock R1 Design of the FSM controller
S3
follows directly from the RTL
0 sequence.
1
FSM controls movement of data
by controlling the multiplexor/tri-
ACC ← ACC + R0, R1 ← R0; state control signals.
ACC ← ACC + R1, R0 ← R1;
R0 ← ACC;
•
•
•
46
Example of Using RTL
RTL often used as a starting point for designing both the datapath
and the control:
Example:
regA IN;
regB IN;
regC regA + regB;
regB regC;
47
Announcements
Homework #3 (Due: Next Week)
Textbook problems
Page 56~57: 1.31, 1.32
Reports must be written in hand
48
A Step-by-Step Design Example
49
50
1. Problem Specification
Design a circuit that forms the sum of all the 2's complements integers
stored in a linked-list structure starting at memory address 0:
All integers and pointers are 8-bit. The link-list is stored in a memory
block with an 8-bit address port and 8-bit data port, as shown below.
The pointer from the last element in the list is 0. At least one node in
list.
I/Os:
START resets to head of list
and starts addition process.
DONE signals completion.
R, Bus that holds the final
result.
51
1. Other Specifications
Design Constraints:
Usually, the design specification puts a restriction on cost, performance,
power or all. We will leave this unspecified for now and return to it later.
Component Library:
Component Delay
Simple logic gates 0.5 ns
clk-to-Q = 0.5 ns
𝑛-bit register
setup = 0.5 ns (data & load signal)
𝑛-bit 2x1 multiplexor 1 ns
𝑛-bit adder (2 log(𝑛) + 2) ns
memory 10 ns read (synchronous)
Zero compare 0.5 log(𝑛)
52
2. Algorithm Specification
In this case, the memory only allows one access per cycle, so the
algorithm is limited to sequential execution.
If in another case more input data is available at once, then a more
parallel solution may be possible.
Assume datapath state registers NEXT and SUM
NEXT holds a pointer to the node in memory
SUM holds the result of adding the node values to this point
53
3. Architecture #1
Direct implementation of RTL description: Datapath
D
0
1 0 A_SEL
NEXT_SEL
+ Memory
0
0
LD_NEXT NEXT A
1 0 1
SUM_SEL
1
LD_SUM SUM
+
==0
Controller
NEXT_ZERO
54
54
4. Analysis of Cost, Performance, and Power
Skip Power for now.
Cost:
How do we measure it? # of transistors? # of gates?
Depends on implementation technology. Usually, we are
interested in comparing the relative cost of two competing
implementations. (Save this for later)
Performance:
2 clock cycles per number added
What is the minimum clock period?
The controller might be on the critical path. Therefore, we need
to know the implementation, and controller input and output
delay.
55
COMP SUM_SEL
NEXT_ZERO A_SEL
SUM
LD_NEXT
GET NEXT_SEL
START NEXT
DONE DONE
START
56
Critical Path…
Longest path from any reg out to any reg input
57
4. Analysis of Performance
COMPUTE_SUM state
NEXT
.5 8 1 10 10 1 .5
31ns
GET_NEXT state
MUX MUX
NEXT_ZERO
.5 1 10 1 1.5 1.5
15.5ns
58
Critical paths
Identify bottlenecks in design
Share/schedule resources to improve performance
COMPUTE_SUM GET_NEXT
D
0
1 0
NEXT_SEL A_SEL
+ Memory
0
0
LD_NEXT NEXT A
SUM_SEL 1 0 1
SUM
LD_SUM
+
==0
NEXT_ZERO
59
4. Analysis of Performance
Detailed timing:
clock period (clock period for each state)
Observation:
COMPUTE_SUM state does most of the work. Most of the
components are inactive in GET_NEXT state.
GET_NEXT does: Memory access + …
COMPUTE_SUM does: 8-bit add, memory access, 15-bit add +
…
Conclusion:
Move one of the adds to GET_NEXT
60
5. Optimization
Add new register named NUMA, for address of number
to add
Update RTL to reflect our change (note still 2 cycles per
iteration):
61
5. Optimization
Architecture #2: Architecture #1
Cycle split by
register slicing
(pipelining)
62
5. Optimization
Architecture #2:
D
0
1 0
NEXT_SEL
+ A_SEL Memory
0
0
LD_NEXT NEXT A
SUM_SEL 1 0 1
LD_SUM SUM +
==0 1
1 0
NEXT_SEL
If (START == 1) NEXT0, SUM 0,
NUMA 1; NEXT_ZERO
63
5. Optimization, Architecture #2
New timing:
COMPUTE_SUM state Clock Period (𝑇) = max (clock period
for each state)
MUX
15-bit add setup
𝑇 > 23ns, 𝐹 < 43MHz
CLK memory
CLK-Q MUX
NUMA
Is this worth the extra cost?
.5 1 10 10 1 .5 Can we lower the cost?
23ns
GET_NEXT state
21ns
64
5. Optimization, Architecture #3
1
ADD_SEL 1 0
D
0
1 0
+
NEXT_SEL A_SEL Memory
0
0 1
LD_NEXT NEXT 1
A
SUM_SEL 1 0 1 0 NEXT_SEL
==0
LD_SUM SUM NUMA LD_NEXT
NEXT_ZERO
Incremental cost:
Addition of another mux and control. Removal of an 8-bit adder.
Performance:
mux adds 1ns to cycle time. 24ns, 41.67MHz.
Is the cost savings worth the performance degradation?
65
66
Design Gap
Keeping up with Moore's Law requires the
implementation of disruptive design technology every few
years
A common theme of advancing design technology is the
continuing move to higher design abstraction levels
67