Lecture 9:
Digital Signal Processors:
Applications and Architectures
Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from:
Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI;
Prof. Bob Brodersen, Prof. David Patterson
1
Kurt Keutzer
Processor Applications
(Figure axes: cost increases toward the top of this list, volume toward the bottom)
General Purpose - high performance
Pentiums, Alphas, SPARC
Used for general purpose software
Heavy weight OS - UNIX, NT
Workstations, PCs
Embedded processors and processor cores
ARM, 486SX, Hitachi SH7000, NEC V800
Single program
Lightweight, often realtime OS
DSP support
Cellular phones, consumer electronics (e.g. CD players)
Microcontrollers
Extremely cost sensitive
Small word size - 8 bit common
Highest volume processors by far
Automobiles, toasters, thermostats, ...
Processor Markets
Total market: $30B
32-bit micro: $5.2B / 17%
DSP: $10B / 33%
32-bit DSP: $1.2B / 4%
16-bit micro: $5.7B / 19%
8-bit micro: $9.3B / 31%
The Processor Design Space
(Figure: performance plotted against cost)
Microprocessors: performance is everything & software rules
Embedded processors: application specific architectures for performance
Microcontrollers: cost is everything
Market for DSP Products
(Figure: market segments shown: Mixed Signal/Analog and DSP)
DSP is the fastest growing segment of the semiconductor market
DSP Applications
Audio applications
• MPEG Audio
• Portable audio
Digital cameras
Wireless
• Cellular telephones
• Base station
Networking
• Cable modems
• ADSL
• VDSL
Another Look at DSP Applications
(Figure axes: cost increases toward the high end, volume toward the low end)
High-end
Wireless Base Station - TMS320C6000
Cable modem gateways
Mid-end
Cellular phone - TMS320C540
Fax/voice server
Low end
Storage products - TMS320C27
Digital camera - TMS320C5000
Portable phones
Wireless headsets
Consumer audio
Automobiles, toasters, thermostats, ...
Serving a range of applications
World’s Cellular Subscribers
(Figure: cellular subscribers in millions, 0 to 700, by year from 1993 to 2001, split into analog and digital)
Will provide a ubiquitous infrastructure for wireless data as well as voice
Source: Ericsson Radio Systems, Inc.
CELLULAR TELEPHONE SYSTEM
(Block diagram: keypad and number display, controller, RF modem, physical layer processing, baseband converter, A/D, speech encode, speech decode, DAC)
HW/SW/IC PARTITIONING
(Block diagram: the cellular telephone blocks, i.e. controller, physical layer processing, baseband converter, RF modem, A/D, speech encode, speech decode and DAC, partitioned among a microcontroller, an ASIC, a DSP and an analog IC)
Mapping onto a system on a chip
(Block diagram labels: phone book, keypad intfc, S/P, DMA, control, protocol, RAM, µC, speech quality enhancement, voice recognition, ASIC logic, DSP core, RPE-LTP speech decoder, de-intl & decoder, demodulator and synchronizer, Viterbi equalizer)
Example Wireless Phone Organization
C540
ARM7
Multimedia I/O Architecture
(Block diagram labels: radio modem, embedded processor, Sched, ECC, Pact interface, low power bus, FB, two Fifos, video decomp, SRAM, pen, data flow, graphics, audio, video)
Multimedia System on a Chip
E.g. Multimedia terminal electronics
Graphics Out
Uplink Radio
Downlink Radio
Video I/O
Voice I/O
Pen In
Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O
(Chip figure labels: µP, Video Unit, custom, Coms, Memory, DSP)
Requirements of the Embedded Processors
Optimized for a single program - code often in on-chip ROM or off-chip EPROM
Minimum code size (one of the motivations initially for Java)
Performance obtained by optimizing datapath
Low cost
Lowest possible area
Technology behind the leading edge
High level of integration of peripherals (reduces system cost)
Fast time to market
Compatible architectures (e.g. ARM) allow reusable code
Customizable core
Low power if application requires portability
Area of processor cores = Cost
Nintendo processor Cellular phones
Another figure of merit
Computation per unit area
??? Nintendo processor Cellular phones
Code size
If a majority of the chip is the program stored in ROM, then code size is a critical issue
The Piranha has 3 instruction sizes - basic 2 byte, and 2 byte plus a 16 or 32 bit immediate
BENCHMARKS - DSPstone
ZIVOJNOVIC, VELARDE, SCHLAGER: UNIVERSITY OF AACHEN
APPLICATION BENCHMARKS
ADPCM TRANSCODER - CCITT G.721
REAL_UPDATE
COMPLEX_UPDATES
DOT_PRODUCT
MATRIX_1X3
CONVOLUTION
FIR
FIR2DIM
IIR_ONE_BIQUAD
LMS
Evolution of GP and DSP
General Purpose Microprocessor traces roots back to Eckert, Mauchly, Von Neumann (ENIAC)
DSP evolved from Analog Signal Processors, using analog hardware to transform physical signals (classical electrical engineering)
ASP to DSP because
DSP insensitive to environment (e.g., same response in snow or desert if it works at all)
DSP performance identical even with variations in components; the behavior of 2 analog systems varies even if built with the same components with 1% variation
Different history and different applications led to different terms, different metrics, some new inventions
Convergence of markets will lead to architectural showdown
Embedded Systems vs. General Purpose Computing - 1
Embedded System
Runs a few applications often known at design time
Not end-user programmable
Operates in fixed run-time constraints, additional performance may not be useful/valuable
General purpose computing
Intended to run a fully general set of applications
End-user programmable
Faster is always better
Embedded Systems vs. General Purpose Computing - 2
Embedded System
Differentiating features:
power
cost
speed (must be predictable)
General purpose computing
Differentiating features:
speed (need not be fully predictable)
speed
did we mention speed?
cost (largest component power)
DSP vs. General Purpose MPU
DSPs tend to run 1 program, not many programs.
Hence OSes are much simpler, there is no virtual memory or protection, ...
DSPs sometimes run hard real-time apps
You must account for anything that could happen in a time slot
All possible interrupts or exceptions must be accounted for and their collective time be subtracted from the time interval.
Therefore, exceptions are BAD!
DSPs have an infinite continuous data stream
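The time-slot accounting described on this slide can be sketched as a simple cycle budget. This is only an illustration: the clock rate, sample rate and interrupt costs are made-up numbers, and `cycles_left_per_sample` is a hypothetical helper, not something from the lecture.

```python
def cycles_left_per_sample(clock_hz, sample_rate_hz, worst_case_interrupt_cycles):
    """Cycles left for signal processing in one sample period.

    Every interrupt or exception that could fire in the period is charged
    at its worst case cost, and the total is subtracted from the period.
    """
    cycles_per_period = clock_hz // sample_rate_hz
    overhead = sum(worst_case_interrupt_cycles)
    return cycles_per_period - overhead


# Assumed numbers: 40 MHz DSP, 8 kHz speech sampling, two interrupt sources.
budget = cycles_left_per_sample(40_000_000, 8_000, [120, 80])
print(budget)  # 4800
```

If the inner loop needs more cycles than this budget, the deadline can be missed, which is why every possible exception has to be counted.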
DSP vs. General Purpose MPU
The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate (MAC).
DSPs are judged by whether they can keep the multipliers busy 100% of the time.
The "SPEC" of DSPs is 4 algorithms:
Infinite Impulse Response (IIR) filters
Finite Impulse Response (FIR) filters
FFT, and
convolvers
In DSPs, algorithms are king!
Binary compatibility not an issue
Software is not (yet) king in DSPs.
People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
TYPES OF DSP PROCESSORS
DSP Multiprocessors on a die
TMS320C80
TMS320C6000
32-BIT FLOATING POINT
TI TMS320C4X
MOTOROLA 96000
AT&T DSP32C
ANALOG DEVICES ADSP21000
16-BIT FIXED POINT
TI TMS320C2X
MOTOROLA 56000
AT&T DSP16
ANALOG DEVICES ADSP2100
Note of Caution on DSP Architectures
Successful DSP architectures have two aspects:
Key architectural and micro-architectural
features that enabled product success in key
parameters
Speed
Code density
Low power
Architectural and micro-architectural features
that are artifacts of the era in which they were
designed
• We will focus on the former !
Architectural Features of DSPs
Data path configured for DSP
Fixed-point arithmetic
MAC- Multiply-accumulate
Multiple memory banks and buses -
Harvard Architecture
Multiple data memories
Specialized addressing modes
Bit-reversed addressing
Circular buffers
Specialized instruction set and execution control
Zero-overhead loops
Support for MAC
Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE
DESIGN!!!
DSP Data Path: Arithmetic
DSPs dealing with numbers representing real world
=> Want “reals”/ fractions
DSPs dealing with numbers for addresses
=> Want integers
Support “fixed point” as well as integers
Fraction: sign bit S, radix point directly after S: $-1 \le x < 1$
Integer: sign bit S, radix point after the last bit: $-2^{N-1} \le x < 2^{N-1}$
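A minimal sketch of the fractional format, assuming a 16-bit word with one sign bit and 15 fraction bits (commonly called Q15). The helper names `to_q15`, `from_q15` and `q15_mul` are illustrative, not from the lecture, and the multiply does not saturate.

```python
Q15_SCALE = 1 << 15  # 32768: one sign bit, 15 fraction bits (assumed format)


def to_q15(x):
    """Convert a real number to a 16-bit fraction, clamping to [-1, 1)."""
    n = int(round(x * Q15_SCALE))
    return max(-Q15_SCALE, min(Q15_SCALE - 1, n))


def from_q15(n):
    """Convert a 16-bit fraction back to a Python float."""
    return n / Q15_SCALE


def q15_mul(a, b):
    """Multiply two fractions: the product has 30 fraction bits, shift back to 15."""
    return (a * b) >> 15


half = to_q15(0.5)
print(half)                           # 16384
print(from_q15(q15_mul(half, half)))  # 0.25
print(to_q15(1.0))                    # 32767, since +1 itself is not representable
```

The same 16 bits read as an integer cover -32768 to 32767, which is the second range on the slide with N = 16.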
DSP Data Path: Precision
Word size affects precision of fixed point numbers
DSPs have 16-bit, 20-bit, or 24-bit data words
Floating Point DSPs cost 2X - 4X vs. fixed point, slower than fixed point
DSP programmers will scale values inside code
SW Libraries
Separate explicit exponent
“Blocked Floating Point” - single exponent for a group of fractions
Floating point support simplifies development
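The “blocked floating point” idea can be sketched as follows: one exponent is chosen from the largest magnitude in the block, and every value is stored as a fixed-point fraction relative to it. The function names and the 16-bit mantissa width are assumptions for illustration.

```python
import math


def block_float_encode(values, bits=16):
    """Encode a block of reals as integer mantissas sharing one exponent."""
    max_mag = max(abs(v) for v in values)
    if max_mag == 0:
        return [0] * len(values), 0
    # max_mag = m * 2**exponent with 0.5 <= m < 1
    _, exponent = math.frexp(max_mag)
    scale = 1 << (bits - 1)
    mantissas = []
    for v in values:
        frac = math.ldexp(v, -exponent)          # now -1 < frac < 1
        m = int(round(frac * scale))
        mantissas.append(max(-scale, min(scale - 1, m)))
    return mantissas, exponent


def block_float_decode(mantissas, exponent, bits=16):
    """Recover approximate real values from mantissas and the shared exponent."""
    scale = 1 << (bits - 1)
    return [math.ldexp(m / scale, exponent) for m in mantissas]


mantissas, exponent = block_float_encode([3.0, -1.5, 0.75])
print(mantissas, exponent)                      # [24576, -12288, 6144] 2
print(block_float_decode(mantissas, exponent))  # [3.0, -1.5, 0.75]
```

Small values in a block dominated by one large value lose precision, which is the price of sharing the exponent.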
DSP Data Path: Overflow?
DSPs are descended from analog:
what should happen to the output when you “peg” an input?
(e.g., turn up volume control knob on stereo)
Modulo Arithmetic???
Set to most positive ($2^{N-1}-1$) or most negative value ($-2^{N-1}$): “saturation”
Many algorithms were developed in this model
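The difference between modulo arithmetic and saturation can be sketched with two add routines. The 16-bit default width and the names `wrap_add` and `sat_add` are assumptions for illustration.

```python
def wrap_add(a, b, bits=16):
    """Modulo (two's complement wraparound) addition."""
    mask = (1 << bits) - 1
    s = (a + b) & mask
    return s - (1 << bits) if s >= (1 << (bits - 1)) else s


def sat_add(a, b, bits=16):
    """Saturating addition: clamp to the most positive or most negative value."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, a + b))


print(wrap_add(30000, 10000))   # -25536: a loud signal turns into a large negative one
print(sat_add(30000, 10000))    # 32767: pegged at the most positive value
print(sat_add(-30000, -10000))  # -32768: pegged at the most negative value
```

Saturation mimics an analog stage that clips, while wraparound flips the sign of the output.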
DSP Data Path: Multiplier
Specialized hardware performs all key arithmetic
operations in 1 cycle
50% of instructions can involve multiplier
=> single cycle latency multiplier
Need to perform multiply-accumulate (MAC)
n-bit multiplier => 2n-bit product
DSP Data Path: Accumulator
Don’t want overflow or have to scale accumulator
Option 1: accumulator wider than product: “guard bits”
Motorola DSP:
24b x 24b => 48b product, 56b Accumulator
Option 2: shift right and round product before adder
(Figure, Option 1: Multiplier -> ALU -> Accumulator with guard bits G)
(Figure, Option 2: Multiplier -> Shift -> ALU -> Accumulator)
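A small sketch of why guard bits matter, using the 24b x 24b => 48b product and 56b accumulator widths from the slide. The helpers `wrap` and `mac` are hypothetical; they model an accumulator of a given width that wraps on overflow.

```python
def wrap(value, bits):
    """Interpret value as a two's complement number of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def mac(xs, ys, acc_bits):
    """Multiply-accumulate xs[i] * ys[i] into an accumulator of acc_bits bits."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = wrap(acc + x * y, acc_bits)
    return acc


big = (1 << 23) - 1            # largest positive 24-bit operand
xs = [big] * 4
ys = [big] * 4
exact = sum(x * y for x, y in zip(xs, ys))

print(mac(xs, ys, 56) == exact)  # True: 8 guard bits absorb the growth
print(mac(xs, ys, 48) == exact)  # False: a 48-bit accumulator overflows
```

Each product fits in 48 bits, but the running sum does not, so the extra accumulator bits avoid having to scale inside the loop.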
DSP Data Path: Rounding
Even with guard bits, will need to round when storing the accumulator into memory
3 DSP standard options
Truncation: chop results
=> biases results down (two’s complement truncation always rounds toward the more negative value)
Round to nearest:
< 1/2 round down, >= 1/2 round up (more positive)
=> smaller bias
Convergent:
< 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0)
=> no bias
IEEE 754 calls this round to nearest even
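The three options can be compared with a short sketch that drops `shift` low bits from a two's complement accumulator value. The function names are illustrative, and the total error is measured over one small symmetric range of inputs rather than proven in general.

```python
def truncate(acc, shift):
    """Drop the low bits. On two's complement values this rounds toward minus infinity."""
    return acc >> shift


def round_nearest(acc, shift):
    """Add half an lsb, then drop the low bits: ties round up (more positive)."""
    return (acc + (1 << (shift - 1))) >> shift


def round_convergent(acc, shift):
    """Round to nearest, but ties go to the result whose lsb is zero."""
    half = 1 << (shift - 1)
    low = acc & ((1 << shift) - 1)
    result = acc >> shift
    if low > half or (low == half and (result & 1)):
        result += 1
    return result


def total_error(rounder, shift=2):
    """Sum of (rounded - exact) over acc = -8 .. 7, in units of the acc lsb."""
    return sum((rounder(acc, shift) << shift) - acc for acc in range(-8, 8))


print(total_error(truncate))          # -24
print(total_error(round_nearest))     # 8
print(total_error(round_convergent))  # 0
```

Over this range the truncation error is always of one sign, round to nearest leaves a smaller residual from the ties, and convergent rounding cancels it.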
Data Path
DSP Processor
Specialized hardware performs all key arithmetic operations in 1 cycle.
Hardware support for managing numeric fidelity:
Shifters
Guard bits
Saturation
General-Purpose Processor
Multiplies often take >1 cycle
Shifts often take >1 cycle
Other operations (e.g., saturation, rounding) typically take multiple cycles.
320C54x DSP Functional Block Diagram
FIR Filtering:
A Motivating Problem
M most recent samples in the delay line (Xi)
New sample moves data down delay line
“Tap” is a multiply-add
Each tap (M+1 taps total) nominally requires:
Two data fetches
Multiply
Accumulate
Memory write-back to update delay line
Goal: 1 FIR Tap / DSP instruction cycle
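The per-tap work listed on this slide can be sketched in a few lines. This is a plain software model, not DSP code; `fir_step` is a hypothetical name, and the delay line is a Python list that is shifted on every sample.

```python
def fir_step(delay_line, coeffs, new_sample):
    """Process one input sample through an FIR filter and return one output."""
    # Memory write-back: move data down the delay line, newest sample first.
    delay_line.insert(0, new_sample)
    delay_line.pop()
    acc = 0
    for c, x in zip(coeffs, delay_line):   # one tap per iteration
        acc += c * x                        # two data fetches, multiply, accumulate
    return acc


coeffs = [1, 2, 3]
delay_line = [0, 0, 0]
outputs = [fir_step(delay_line, coeffs, s) for s in [1, 0, 0, 0]]
print(outputs)  # [1, 2, 3, 0]: the impulse response reproduces the coefficients
```

On a general-purpose processor each line of the loop body is at least one instruction. The DSP goal is to fold all of them into a single instruction cycle per tap.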
BENCHMARKS - FIR FILTER
FINITE-IMPULSE RESPONSE FILTER
(Figure: tapped delay line with unit delays $z^{-1}$ and tap coefficients $C_1, C_2, \ldots, C_{N-1}, C_N$)
Micro-architectural impact - MAC
$y(n) = \sum_{m=0}^{N-1} h(m)\,x(n-m)$
element of finite-impulse response filter computation
(Figure: MAC datapath with operands X and Y, MPY, ADD/SUB, and ACC REG)
Mapping of the filter onto a DSP execution unit
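(Figure: filter signal-flow graph with signals Xn and Yn, delay elements D, and operations numbered 1 to 6, shown next to the X and Y inputs of the execution unit)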
The critical hardware unit in a DSP is the multiplier - much of the architecture is organized around allowing use of the multiplier on every cycle
This means providing two operands on every cycle, through multiple data and address buses, multiple address units and local accumulator feedback
MAC Eg. - 320C54x DSP Functional Block Diagram
DSP Memory
FIR Tap implies multiple memory accesses
DSPs want multiple data ports
Some DSPs have ad hoc techniques to reduce memory bandwidth demand
Instruction repeat buffer: do 1 instruction 256 times
Often disables interrupts, thereby increasing interrupt response time
Some recent DSPs have instruction caches
Even then may allow programmer to “lock in” instructions into cache
Option to turn cache into fast program memory
No DSPs have data caches
May have multiple data memories
Conventional “Von Neumann” memory
HARVARD ARCHITECTURE in DSP
(Figure: program memory, X memory and Y memory, connected by buses labeled GLOBAL, P DATA, X DATA and Y DATA)
Memory Architecture
DSP Processor
Harvard architecture
2-4 memory accesses/cycle
No caches - on-chip SRAM
(Figure: processor connected to separate program memory and data memory)
General-Purpose Processor
Von Neumann architecture
Typically 1 access/cycle
May use caches
(Figure: processor connected to a single memory)
E.g. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
Eg. 320C62x/67x DSP
DSP Addressing
Have standard addressing modes: immediate, displacement, register indirect
Want to keep MAC datapath busy
Assumption: any extra instructions imply clock cycles of overhead in inner loop
=> complex addressing is good
=> don’t use datapath to calculate fancy address
Autoincrement/Autodecrement register indirect
lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2+1
Option to do it before addressing, positive or negative
DSP Addressing: FFT
FFTs start or end with data in weird butterfly order
0 (000) => 0 (000)
1 (001) => 4 (100)
2 (010) => 2 (010)
3 (011) => 6 (110)
4 (100) => 1 (001)
5 (101) => 5 (101)
6 (110) => 3 (011)
7 (111) => 7 (111)
What can we do to avoid the overhead of address checking instructions for FFT?
Have an optional “bit reverse” addressing mode for use with autoincrement addressing
Many DSPs have “bit reverse” addressing for radix-2 FFT
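The index mapping in the table on this slide is what a bit-reverse addressing mode computes in hardware. A software sketch, with `bit_reverse` as an illustrative helper name:

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

In software this costs a loop per address, so without hardware support the reordering adds overhead to every FFT.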
BIT REVERSED ADDRESSING
Index (binary)  Input  Output
000             x(0)   F(0)
100             x(4)   F(1)
010             x(2)   F(2)
110             x(6)   F(3)
001             x(1)   F(4)
101             x(5)   F(5)
011             x(3)   F(6)
111             x(7)   F(7)
Stages: four 2-point DFTs, two 4-point DFTs, one 8-point DFT
Data flow in the radix-2 decimation-in-time FFT algorithm
DSP Addressing: Buffers
DSPs dealing with continuous I/O
Often interact with an I/O buffer (delay lines)
To save memory, buffer often organized as circular buffer
What can we do to avoid the overhead of address checking instructions for a circular buffer?
Option 1: Keep a start register and an end register per address register for use with autoincrement addressing, reset to start when the end of the buffer is reached
Option 2: Keep a buffer length register, assuming the buffer starts on an aligned address, reset to start when the end is reached
Every DSP has “modulo” or “circular” addressing
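Both options amount to a post-increment that wraps inside the buffer. The sketch below models the address update only. The function names are hypothetical, and the aligned variant assumes a power-of-two buffer length so the base address can be taken from the high address bits.

```python
def circular_next(addr, step, start, length):
    """Post-increment an address and wrap it inside [start, start + length)."""
    return start + (addr - start + step) % length


def circular_next_aligned(addr, step, length):
    """Option 2 sketch: length is a power of two and the buffer is aligned to it."""
    mask = length - 1
    return (addr & ~mask) | ((addr + step) & mask)


addr = 0x100
visited = []
for _ in range(6):
    visited.append(addr)
    addr = circular_next(addr, 1, start=0x100, length=4)
print(visited)  # [256, 257, 258, 259, 256, 257]
```

With the wrap done by the address unit, the inner loop needs no compare-and-branch to check for the end of the buffer.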
CIRCULAR BUFFERS
Instructions accommodate three elements:
• buffer address
• buffer size
• increment
Allows for cycling through:
• delay elements
• coefficients in data memory
Addressing
DSP Processor
• Dedicated address generation units
• Specialized addressing modes; e.g.:
Autoincrement
Modulo (circular)
Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor
• Often, no separate address generation unit
• General-purpose addressing modes
Address calculation unit for DSP
Supports modulo and bit reversal arithmetic
Often duplicated to calculate multiple addresses per cycle
DSP Instructions and Execution
May specify multiple operations in a single instruction
Must support Multiply-Accumulate (MAC)
Need parallel move support
Usually have special loop support to reduce branch overhead
Loop an instruction or sequence
0 value in register usually means loop the maximum number of times
If the loop count is calculated, must be sure that a count of 0 is not meant as 0 iterations
May have saturating shift left arithmetic
May have conditional execution to reduce branches
ADSP 2100: ZERO-OVERHEAD LOOP
“DO <addr> UNTIL condition”
DO X
...
X
Address Generation:
PCS = PC + 1
if (PC == X && !condition)
    PC = PCS
else
    PC = PC + 1
• Eliminates a few instructions in loops
• Important in loops with small bodies
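The saving can be illustrated by counting instruction fetches. This is a rough model under stated assumptions, not a description of the ADSP 2100 itself: the hardware loop is charged one DO instruction plus the body, and the software loop is assumed to spend 2 extra instructions per iteration on a decrement and a branch.

```python
def fetches_with_hardware_loop(body_len, iterations):
    """Count instruction fetches when the address generator handles the loop."""
    fetched = 1                 # the DO instruction executes once
    pc = 0                      # address of the DO instruction
    pcs = pc + 1                # PCS = PC + 1: first instruction of the body
    end = pc + body_len         # X: last instruction of the body
    pc = pcs
    remaining = iterations
    while True:
        fetched += 1            # fetch one body instruction
        if pc == end:
            remaining -= 1
            if remaining > 0:   # condition not yet true: PC = PCS
                pc = pcs
                continue
            break
        pc += 1                 # PC = PC + 1
    return fetched


def fetches_with_software_loop(body_len, iterations, overhead=2):
    """Same loop with an explicit decrement and branch per iteration (assumed cost)."""
    return iterations * (body_len + overhead)


print(fetches_with_hardware_loop(1, 256))   # 257
print(fetches_with_software_loop(1, 256))   # 768
```

With a one-instruction body such as a MAC, the assumed loop overhead would triple the instruction count, which is why small loop bodies benefit most.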
Instruction Set
DSP Processor
Specialized, complex instructions
Multiple operations per instruction
mac x0,y0,a x:(r0)+,x0 y:(r4)+,y0
General-Purpose Processor
General-purpose instructions
Typically only one operation per instruction
mov *r0,x0
mov *r1,y0
mpy x0,y0,a
add a,b
mov y0,*r2
inc r0
inc r1
Specialized Peripherals for DSPs
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip DMA controller
• On-chip A/D, D/A converters
• Clock generators
• Host ports
• Bit I/O ports
• On-chip peripherals often designed for “background” operation, even when core is powered down.
Specialized peripherals
TMS320C203/LC203 BLOCK DIAGRAM DSP Core Approach - 1995
Summary of Architectural Features of DSPs
Data path configured for DSP
Fixed-point arithmetic
MAC- Multiply-accumulate
Multiple memory banks and buses -
Harvard Architecture
Multiple data memories
Specialized addressing modes
Bit-reversed addressing
Circular buffers
Specialized instruction set and execution control
Zero-overhead loops
Support for MAC
Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE
DESIGN!!!