0% found this document useful (0 votes)
21 views34 pages

Lec05 (DSD)

Uploaded by

chiyeon0607
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views34 pages

Lec05 (DSD)

Uploaded by

chiyeon0607
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

SSE3061: Digital System Design

Lecture 5

Tae Hee Han: [email protected]


Semiconductor Systems Engineering
Sungkyunkwan University

Agenda
 Synchronous Digital Logic Review
 Performance, Delay, Timing
 FSM Review
 Digital System Design Process
 RTL Design
 A Design Example

2
Digital Computers are Everywhere

How Do We Build a Complex Digital System?

Circuit Board (PCB) Integrated Circuit (IC)


Digital Device (Tablet) IP or Functional Block

Transistor (MOSFET)

Cell
Scheme for
Representing
Information Gate

4
Levels of Design Representation/Interpretation
High Level Language Program temp = v[k];
(e.g., C) v[k] = v[k+1];
v[k+1] = temp;
Compiler
lw $t0, 0($2)
Assembly Language lw $t1, 4($2)
Program sw $t1, 0($2)
sw $t0, 4($2)
Assembler
0000 1001 1100 0110 1010 1111 0101 1000
Machine Language Program 1010 1111 0101 1000 0000 1001 1100 0110
(Binary Code) 1100 0110 1010 1111 0101 1000 0000 1001
0101 1000 0000 1001 1100 0110 1010 1111
Machine Interpretation Anything can be
represented as a number,
Hardware Architecture Description i.e., data or instructions
(e.g., Block Diagrams)
Architecture
Implementation
Logic Circuit Description
(Gate-level Netlist)

RTL Design

6
Design of Digital Integrated Systems

System Level

Register Transfer Level

Gate Level

Transistor Level

Layout Level

Mask Level

Register Transfer Level (RTL)


 Cycle accurate model “close” to the hardware implementation
 Bit-vector data types and operations as abstraction from bit-level
implementation
 Sequential constructs (e.g., if - then - else, while loops) to support
modeling of complex control flow
module mark1;
reg [31:0] m[0:8192];
reg [12:0] pc;
reg [31:0] acc;
reg [15:0] ir;

always
begin
ir <= m[pc];
if(ir[15:13] == 3b’000)
pc <= m[ir[12:0]];
else if (ir[15:13] == 3’b010)
acc <= -m[ir[12:0]];
...
end
endmodule

8
Gate Level
 Model on finite-state machine level
 Models function in Boolean logic using registers and gates
 Various delay models for gates and wires

No
1ns delay?

3ns 4ns

5ns

Transistor Level
 Model on CMOS transistor level
 Depending on application function modeled as resistive
switches
 Used in functional equivalence checking
 Or full differential equations for circuit simulation
 Used in detailed timing analysis

10
Layout Level
 Transistors and wires are laid out as polygons in different
technology layers such as diffusion, poly-silicon, metal,
etc.

11

Digital System Design


System specification (in words)
ART ART

Datapath specification Controller specification

Comb. logic operations FSM generation

HDL dataflow SG / ST / Encoding

Gates / LUTs Logic: nextstate/outputs

HDL behavior

ART : A skill that is attained by study, practice, or observation Gates / LUTs / FF

12
General Model of Synchronous Circuit
input
clock

input CL reg CL reg output

output

 All wires, except clock, may be multiple  Combinational Logic Blocks (CL)
bits wide  no internal state
 Registers (reg)  output only a function of inputs
 collections of flip-flops  Particular inputs/outputs are optional
 clock  Optional Feedback
 distributed to all flip-flops
 typical rate?

13

General Model of Synchronous Circuit


input
clock

input CL reg CL reg output

output

 How do we measure performance?


 operations/sec?
 cycles/sec?
 What limits the clock rate?
 What happens as we increase the clock rate?

14
Limitations on Clock Rate
 Logic Gate Delay  Delays in flip-flops

input D
D Q
output
clk

clk Q
setup time clock-to-Q delay
(F/F delay)

 What are typical delay values? Both times contribute to limiting the
clock period.

• What must happen in one clock cycle for correct operation?


• Assuming perfect clock distribution (all flip-flops see the clock at the same time):
– All signals must be ready and “setup” before rising edge of clock

15

Example: Parallel-Serial Converter

b
D Q a D Q

clk clk
𝑇 ≥ 𝑡𝑖𝑚𝑒 𝑐𝑙𝑘𝑄 + 𝑡𝑖𝑚𝑒 𝑚𝑢𝑥 + 𝑡𝑖𝑚𝑒 𝑠𝑒𝑡𝑢𝑝
Clk-to-Q mux

𝑇 ≥ 𝜏 +𝜏 +𝜏
𝑇

clk

a
clk−to−Q

setup
b
mux

16
General Model of Synchronous Circuit
clock input

input CL reg CL reg output

output
• In general, for correct operation:
𝑇 ≥ 𝑡𝑖𝑚𝑒 (𝑐𝑙𝑘 → 𝑄) + 𝑡𝑖𝑚𝑒 (𝐶𝐿) + 𝑡𝑖𝑚𝑒 (𝑠𝑒𝑡𝑢𝑝)
𝑇 ≥ 𝜏 +𝜏 +𝜏 for all paths

• How do we enumerate all paths?


– Any circuit input or register output to any register input or circuit
output.
– “setup time” for circuit outputs depends on what it connects to
– “clk-Q time” for circuit inputs depends on from where it comes.

17

Fan-In and Fan-Out


 Fan-in of a gate: the number of inputs to the gate.
 Gates with large fan-in are bigger and slower

𝑀-input
NAND
gate
A NAND gate with fan-in 𝑀

 Fan-out of a gate: the number of load gates that are connected to


the output of the driving gate.
 Gates with large fan-out are slower
1

A driving gate
(with fan-out 𝑁 ) N

18
Gate Delay due to Fan-out
 Fan-out: 2
1
Each logic cell (inverter)
contributes capacitances

3
𝑉
1 load

2 loads

𝑡
 The delay of a gate is proportional to its output capacitance.
Because gates #2 and 3 turn on/off later. (It takes longer for the
output of gate #1 to reach the switching threshold of gates #2
and 3 as we add more output capacitance.)

19

Delay in Flip-flops
clk clk

D Q D Q
clk clk clk clk

clk
clk clk

clk == 0
Sense D, but Q D Q
outputs old value setup

clk 0  1
Capture D, pass
D Q
value to Q
hold clk-to-Q

20
Review: Two Kinds of FSMs
• Moore Machine vs. Mealy Machine
𝑶𝒖𝒕𝒑𝒖𝒕 (𝒕) = 𝑮( 𝒔𝒕𝒂𝒕𝒆(𝒕)) 𝑶𝒖𝒕𝒑𝒖𝒕 (𝒕) = 𝑮( 𝒔𝒕𝒂𝒕𝒆(𝒕), 𝑰𝒏𝒑𝒖𝒕 )

Input Output
Combinational Input
Logic

Output
Combinational
state

state
Combinational Logic
Logic

𝒔𝒕𝒂𝒕𝒆(𝒕 + 𝟏) = 𝑭 ( 𝒔𝒕𝒂𝒕𝒆(𝒕), 𝒊𝒏𝒑𝒖𝒕(𝒕)) 𝒔𝒕𝒂𝒕𝒆(𝒕 + 𝟏) = 𝑭 ( 𝒔𝒕𝒂𝒕𝒆(𝒕), 𝒊𝒏𝒑𝒖𝒕)

Input Input / Output


State / output State

21

Review: Finite State Machine Representations


 States: determined by possible values in sequential storage
elements
 Transitions: change of state
 Clock: controls when state can change by controlling storage
elements
001 010 111
In = 1 In = 0

In = 0
100 110
In = 1
 Sequential Logic
 Sequences through a series of states
 Based on sequence of values on input signals
 Clock period defines elements of sequence

22
Review: Formal Design Process
Logic equations from table:
OUT = PS • Review of Design Steps:
NS = PS xor IN
1. Circuit functional specification
2. State Graph
 Circuit Diagram: 3. Symbolic State Table
4. Encoded State Table
5. Derive Logic Equations
6. Circuit Diagram
ps ns
FFs for state
Combinational logic for NS and OUT

 XOR gate for ns calculation


 DFF to hold present state
 no logic needed for output

23

Formal Design Process

IN = 0

EVEN
OUT = 0
IN = 1
 “State Graph” IN = 1

 circuit is in one of two states


 transition on each cycle with each new ODD
input, over exactly one arc (edge) OUT = 1

 Output depends on which state the


circuit is in IN = 0

24
Formal Design Process
• State Table:
IN = 0

Present Next State


Out EVEN
State IN = 0 IN = 1
OUT = 0
EVEN EVEN ODD 0
IN = 1
ODD ODD EVEN 1 IN = 1

ODD
• Invent a code to represent states: OUT = 1
Let 0 = EVEN state, 1 = ODD state

IN = 0

Present Next State


Out
State IN = 0 IN = 1 Derive logic equations from table (how?):
0 0 1 0 OUT = PS
1 1 0 1 NS = PS xor IN

25

Levels of Design Representation

Pgm Language
Asm / Machine Lang

Instruction Set Arch


Machine Organization

HDL

FlipFlops

Gates

Circuits

Devices

Transistor Physics

Transfer Function

26
A Standard High-level Organization
Controller
Datapath Memory
(FSM)

Inputs Outputs Inputs Outputs

 Controller  Standard model for CPUs, micro-


controllers, many other digital sub-
 Accepts external and control input, generates
control and external output and sequences systems
the movement of data in the datapath  Usually not nested
 Datapath  Often cascaded:
 Responsible for data manipulation.
Controller
Datapath Memory
 Usually includes a limited amount of storage (FSM)

 Memory
Controller
 Optional block used for long term storage of Datapath Memory
(FSM)
data structures

Inputs Outputs Inputs Outputs

27

Datapath vs. Control


Datapath Controller

signals

Control Points

 Datapath: Storage, FU, interconnect sufficient to perform the desired functions


 Inputs are Control Points
 Outputs might be the input signals of the controller

 Controller: State machine to orchestrate operation on the data path


 Based on desired function and signals

28
Datapath Bit-Sliced Organization
Control Flow

Bit 3

Pipeline Register

Pipeline Register

Pipeline Register
Pipeline Register

Register File

Multiplexer
Multiplexer
Bit 2

Shifter
Adder
From
I$
Bit 1

Bit 0

decoder

Data Flow To/From D$

Tile identical bit-slice elements

29

Register Transfer Level Descriptions


 A standard high-level representation  RTL comprises a set of register
for describing systems transfers with optional operators as
part of the transfer.
 It follows from the fact that all
synchronous digital system can be  Example:
described as a set of state elements regA ← regB;
connected by combination logic (CL) regC ← regA + regB;
blocks: if (start==1) regA ← regC;

 Personal style:
input  use “;” to separate transfers that occur on
clock separate cycles
 Use “,” to separate transfers that occur
input CL reg CL reg output on the same cycle

 Example (2 cycles):
regA ← regB, regB ← 0;
output
regC ← regA;

30
Register Transfer Operation
 Two-character names denote registers, such as R0, R1, DR, or SA.
 Arrows indicate data transfers. To copy the contents of the source
register R2 into the destination register R1 in one clock cycle:
R1 ← R2

 A conditional transfer is performed only if the Boolean condition in


front of the colon is true. To transfer R3 to R2 when K = 1:
K: R2 ← R3

 Multiple transfers on the same clock cycle are separated by


commas
R1 ← R2, K: R2 ← R3

31

Register Transfer Operations


 We can apply arithmetic operations to registers.
R1 ← R2 + R3;
R3 ← R1 − 1;

 Logical operations are applied bitwise. AND and OR are denoted with special
symbols, to prevent confusion with arithmetic operations.

R2 ← R1 & R2; // bitwise AND


R3 ← R0 | R1; // bitwise OR

 Lastly, we can shift registers. Here, the source register R1 is not modified, and
we assume that the shift input is just 0.
R2 ← sl R1; // left shift
R2 ← sr R1; // right shift

32
RTL Abstraction
 Increases productivity by allowing designers to focus on
behavior rather than gate-level logic
 Design components can be specified w/ concise and modular
code in HDL (Verilog, VHDL)
 Synthesis tools understand RTL design

 Think of design in terms of Control and Datapath

 Designers are still very close to hardware. They can think


of and optimize architectures, timing (cycle-level), and
other design trade-offs (power, speed, area..)

33

RTL Design Process


 Data-path Requirements
 How many registers do you need?
 What transformations/operations are needed?

 Interface Requirements
 What signals control the operations?
 What order these signals are in?

 State-machine design
 What are the outputs in each state?
 Look for concurrency in the design

34
Register Transfer – Multiple Busses
 One transfer per bus
 Each set of wires can carry one value
 State Elements
 Registers
 Register files MUX MUX MUX MUX

 Memory
R1 R2 R3 R4
 Combinational Elements
 Busses
 ALUs
 Memory (read)

35

Registers
 Selectively loaded – EN or LD input
 Output enable – OE input (when using Tri-state bus)
 Multiple registers – group 4 or 8 in parallel

OE asserted causes FF state to be


LD OE
connected to output pins; otherwise, they
D7 Q7
D6 Q6 are left unconnected (high impedance)
D5 Q5
D4 Q4
D3 Q3
D2 Q2 LD asserted during a lo-to-hi clock
D1 Q1
D0 CLK Q0 transition loads new data into FFs

36
Register Files
 Collections of registers in one package
 Two-dimensional array of FFs
 Address used as index to a particular word
 Separate read and write addresses so can do both at same time

 Eg.: 4 by 4 register file RE


RB
 16 D-FFs RA
Q3
 Organized as four words of four bits each WE
WB Q2
WA Q1
 Write-enable (load) Q0
D3
 Read-enable (output enable) D2
D1
D0

37

Memories
 Larger Collections of Storage Elements
 Implemented not as FFs but as much more area-efficient cells
 High-density memories use 1-6 switches (transistors) per bit
 Ex: Static RAM – 1024 words each 4 bits wide
 Once written, memory holds forever (not true for denser dynamic RAM)
 Address lines to select word (10 lines for 1024 words)
 Read enable
 Same as output enable RD

 Often called chip select WR

 Permits connection of many A9


A8 IO3
chips into larger array A7 IO2
A6 IO1
 Write enable (same as load enable) A5 IO0
A4
 Bi-directional data lines A3
A2
 output when reading, input when writing A2
A1
A0

38
ALU
 Block Diagram
 Input: data and operation to perform
 Add, Sub, AND, OR, NOT, XOR, …
 Output: result of operation and status information

A B

16 16

Operation

16

N S Z

39

Datapath (Hierarchy)
 Arithmetic circuits
constructed in Cin

hierarchical and iterative Ain


FA Sum
fashion Bin

 each bit in datapath is Cout


functionally identical Ain Sum
 4-bit, 8-bit, 16-bit, Bin
HA
Cout
32-bit datapaths Cin
HA

A
S
B

40
Example Data Path (ALU + Registers)

 Accumulator
 Special register
 One of the inputs to ALU
 Output of ALU stored back in 16
accumulator
REG AC
 One-input Operation
16 16
 Other operand and destination
OP
is accumulator register
 AC  AC op REG
N 16
 “Single address instructions”
Z
 AC  AC op Mem[addr]

41

Datapath (Bit-slice)
 Bit-slice concept: iterate to build -bit wide datapaths
 Data bit busses run through the slice

CO ALU CI CO ALU ALU CI

AC AC AC

R0 R0 R0

R1 R1 R1

R1 R1 R1

R2 R2 R2

From memory From memory From memory

1 bit wide 2 bits wide

42
Interconnect Layout in Datapath

Control signals
Data track

<63>

<i>

<1>
<0>

43

Chip Implementation Strategies


One could just start drawing transistors, but this quickly becomes hopeless with a chip of any size.
Instead, we’ll use one of the following strategies:

44
High Performance Devices
 Mixture of full custom, standard cells and macro’s
 Full custom for special blocks: Adder (data path), etc.
 Macros for standard blocks: RAM, ROM, etc.
 Standard cells for non-critical digital blocks

45

Example of Using RTL

S0 0 1  RTL description is used to


S2 sequence the operations on the
clock R0 datapath.
+
ACC

0
1  It becomes the high-level
S1 0 1 specification for the controller.
clock
clock R1  Design of the FSM controller
S3
follows directly from the RTL
0 sequence.
1
 FSM controls movement of data
by controlling the multiplexor/tri-
ACC ← ACC + R0, R1 ← R0; state control signals.
ACC ← ACC + R1, R0 ← R1;
R0 ← ACC;


46
Example of Using RTL
 RTL often used as a starting point for designing both the datapath
and the control:
 Example:
regA  IN;
regB  IN;
regC  regA + regB;
regB  regC;

 From this we can deduce:


 IN must fanout to both regA and regB
 regA and regB must output to an adder
 The adder must output to regC
 regB must take its input from a mux that selects between IN and regC

47

Announcements
 Homework #3 (Due: Next Week)
 Textbook problems
 Page 56~57: 1.31, 1.32
 Reports must be written in hand

48
A Step-by-Step Design Example

49

List Processor Example


 RTL gives us a framework for making high-level
optimizations
 Fixed function unit
 Approach extends to instruction interpreters

 General design procedure outline:


1. Problem, Constraints, and Component Library Spec.
2. “Algorithm” Selection
3. Micro-architecture (hardware organization) Specification
4. Analysis of Cost, Performance, Power
5. Optimizations, Variations
6. Detailed Design

50
1. Problem Specification
 Design a circuit that forms the sum of all the 2's complements integers
stored in a linked-list structure starting at memory address 0:

 All integers and pointers are 8-bit. The link-list is stored in a memory
block with an 8-bit address port and 8-bit data port, as shown below.
The pointer from the last element in the list is 0. At least one node in
list.

I/Os:
 START resets to head of list
and starts addition process.
 DONE signals completion.
 R, Bus that holds the final
result.

51

1. Other Specifications
 Design Constraints:
 Usually, the design specification puts a restriction on cost, performance,
power or all. We will leave this unspecified for now and return to it later.
 Component Library:
Component Delay
Simple logic gates 0.5 ns
clk-to-Q = 0.5 ns
𝑛-bit register
setup = 0.5 ns (data & load signal)
𝑛-bit 2x1 multiplexor 1 ns
𝑛-bit adder (2 log(𝑛) + 2) ns
memory 10 ns read (synchronous)
Zero compare 0.5 log(𝑛)

Are these reasonable?

52
2. Algorithm Specification
 In this case, the memory only allows one access per cycle, so the
algorithm is limited to sequential execution.
 If in another case more input data is available at once, then a more
parallel solution may be possible.
 Assume datapath state registers NEXT and SUM
 NEXT holds a pointer to the node in memory
 SUM holds the result of adding the node values to this point

If (START==1) NEXT  0, SUM  0;


repeat {
SUM  SUM + Memory[NEXT+1];
NEXT  Memory[NEXT];
} until (NEXT == 0);
R  SUM, DONE  1;

53

3. Architecture #1
Direct implementation of RTL description: Datapath

D
0
1 0 A_SEL
NEXT_SEL
+ Memory
0
0
LD_NEXT NEXT A
1 0 1
SUM_SEL
1

LD_SUM SUM
+
==0
Controller

NEXT_ZERO

If (START == 1) NEXT  0, SUM  0;


repeat {
SUM  SUM + Memory[NEXT+1];
NEXT  Memory[NEXT];
} until (NEXT == 0);
R  SUM, DONE  1;

54

54
4. Analysis of Cost, Performance, and Power
 Skip Power for now.
 Cost:
 How do we measure it? # of transistors? # of gates?
 Depends on implementation technology. Usually, we are
interested in comparing the relative cost of two competing
implementations. (Save this for later)
 Performance:
 2 clock cycles per number added
 What is the minimum clock period?
 The controller might be on the critical path. Therefore, we need
to know the implementation, and controller input and output
delay.

55

Possible Controller Implementation

START START LD_SUM

COMP SUM_SEL
NEXT_ZERO A_SEL
SUM

LD_NEXT

GET NEXT_SEL
START NEXT

DONE DONE
START

 Based on this, what is the controller input and output delay?

56
Critical Path…
 Longest path from any reg out to any reg input

57

4. Analysis of Performance
COMPUTE_SUM state

8-bit add memory 15-bit add setup


CLK
CLK-Q MUX MUX

NEXT

.5 8 1 10 10 1 .5

31ns

GET_NEXT state

CLK control output delay


==0

MUX MUX

memory control input delay


A_SEL

NEXT_ZERO

.5 1 10 1 1.5 1.5

15.5ns

58
Critical paths
 Identify bottlenecks in design
 Share/schedule resources to improve performance

COMPUTE_SUM GET_NEXT

D
0
1 0
NEXT_SEL A_SEL
+ Memory
0
0
LD_NEXT NEXT A
SUM_SEL 1 0 1

SUM
LD_SUM
+
==0

NEXT_ZERO

59

4. Analysis of Performance
 Detailed timing:
 clock period (clock period for each state)

 Observation:
 COMPUTE_SUM state does most of the work. Most of the
components are inactive in GET_NEXT state.
 GET_NEXT does: Memory access + …
 COMPUTE_SUM does: 8-bit add, memory access, 15-bit add +

 Conclusion:
 Move one of the adds to GET_NEXT

60
5. Optimization
 Add new register named NUMA, for address of number
to add
 Update RTL to reflect our change (note still 2 cycles per
iteration):

If (START==1) NEXT  0, SUM  0, NUMA  1;


repeat {
SUM  SUM + Memory[NUMA];
NUMA  Memory[NEXT] + 1,
NEXT  Memory[NEXT];
} until (NEXT==0);
R  SUM, DONE  1;

61

5. Optimization
 Architecture #2: Architecture #1

Cycle split by
register slicing
(pipelining)

If (START==1) NEXT0, SUM0, NUMA1;


repeat {
SUM  SUM + Memory[NUMA];
NUMA  Memory[NEXT] + 1, NEXT
 Memory[NEXT] ;
} until (NEXT==0);
RSUM, DONE1;

 Incremental cost: addition of another register and mux.

62
5. Optimization
 Architecture #2:
D
0
1 0
NEXT_SEL
+ A_SEL Memory
0
0
LD_NEXT NEXT A
SUM_SEL 1 0 1

LD_SUM SUM +
==0 1
1 0
NEXT_SEL
If (START == 1) NEXT0, SUM  0,
NUMA  1; NEXT_ZERO

repeat { LD_NEXT NUMA


SUM  SUM + Memory[NUMA];
NUMA  Memory[NEXT] + 1,
NEXT  Memory[NEXT] ;
} until (NEXT == 0);
R  SUM, DONE  1;

 Incremental cost: addition of another register and mux.

63

5. Optimization, Architecture #2
 New timing:
COMPUTE_SUM state Clock Period (𝑇) = max (clock period
for each state)
MUX
15-bit add setup
𝑇 > 23ns, 𝐹 < 43MHz
CLK memory
CLK-Q MUX

NUMA
 Is this worth the extra cost?
.5 1 10 10 1 .5  Can we lower the cost?

23ns

GET_NEXT state

CLK control output delay NUMA reg setup

MUX MUX  Notice that the circuit now only


memory 8-bit add performs one add on every cycle.
A_SEL Why not share the adder for both
.5 1 10 8 1 .5 cycles?

21ns

64
5. Optimization, Architecture #3
1
ADD_SEL 1 0
D
0
1 0

+
NEXT_SEL A_SEL Memory
0

0 1
LD_NEXT NEXT 1
A
SUM_SEL 1 0 1 0 NEXT_SEL

==0
LD_SUM SUM NUMA LD_NEXT

NEXT_ZERO

 Incremental cost:
 Addition of another mux and control. Removal of an 8-bit adder.
 Performance:
 mux adds 1ns to cycle time. 24ns, 41.67MHz.
 Is the cost savings worth the performance degradation?
65

Design Complexity & Productivity Gap


 Design gap is accelerating with advances in processing
technology
 RTL Designers must identify downstream problems —
timing, signal integrity, reliability, and others — prior to
synthesis and be able to implement design fixes where
they will have a more significant impact on chip
performance.
 The key to a successful design is design closure. The
various performance specifications comprising timing,
power, and reliability, along with chip cost, are all closely
coupled.

66
Design Gap
 Keeping up with Moore's Law requires the
implementation of disruptive design technology every few
years
 A common theme of advancing design technology is the
continuing move to higher design abstraction levels

67

You might also like