DSP Cores vs. Chips

Digital Signal Processor (DSP) Architecture
• Classification of Processor Applications

• Requirements of Embedded Processors
• DSP vs. General Purpose CPUs
• DSP Cores vs. Chips
• Classification of DSP Applications
• DSP Algorithm Format
• DSP Benchmarks
• Basic Architectural Features of DSPs
• DSP Software Development Considerations
• Classification of Current DSP Architectures and example DSPs:
– Conventional DSPs: TI TMS320C54xx
– Enhanced Conventional DSPs: TI TMS320C55xx
– VLIW DSPs: TI TMS320C62xx, TMS320C64xx
– Superscalar DSPs: LSI Logic ZSP400 DSP core
EECC722 - Shaaban
#1 lec # 8 Fall 2003 10-8-2003
Processor Applications
(Cost increases toward the top of this list; volume increases toward the bottom.)
• General Purpose Processors (GPPs) - high performance.
– Alphas, SPARC, MIPS ...
– Used for general purpose software
– Heavyweight OS - UNIX, Windows
– Workstations, PCs, clusters
• Embedded processors and processor cores
– ARM, 486SX, Hitachi SH7000, NEC V800 ...
– Often require digital signal processing (DSP) support.
– Single program
– Lightweight, often real-time OS
– Cellular phones, consumer electronics (e.g. CD players)
• Microcontrollers
– Extremely cost sensitive
– Small word size - 8 bit common
– Highest volume processors by far
– Control systems, automobiles, toasters, thermostats, ...
$30B Processor Markets
• 32-bit micro: $5.2B / 17%
• 32-bit DSP: $1.2B / 4%
• DSP: $10B / 33%
• 16-bit micro: $5.7B / 19%
• 8-bit micro: $9.3B / 31%
The Processor Design Space
[Figure: the processor design space plotted as performance versus cost. Regions: microcontrollers ("cost is everything"), embedded processors (application-specific architectures for performance), and microprocessors ("performance is everything and software rules").]
Requirements of Embedded Processors
• Optimized for a single program - code often in on-chip ROM
or off chip EPROM
• Minimum code size (one of the motivations initially for Java)
• Performance obtained by optimizing datapath
• Low cost
– Lowest possible area
– Technology behind the leading edge
– High level of integration of peripherals (reduces system cost)
• Fast time to market
– Compatible architectures (e.g. ARM) allow reusable code
– Customizable cores (System-on-Chip, SoC).
• Low power if application requires portability
Area of processor cores = Cost
[Figure: processor core area comparison; labels: Nintendo processor, cellular phones.]
Another figure of merit: Computation per unit area

Code size
• If a majority of the chip is the program stored in ROM, then code size is a critical issue
• The Piranha has three instruction sizes: a basic 2-byte form, and 2 bytes plus a 16-bit or 32-bit immediate
Embedded Systems vs. General Purpose Computing
Embedded System:
• Runs a few applications, often known at design time
• Not end-user programmable
• Operates in fixed run-time constraints that must be met; additional performance may not be useful/valuable
• Differentiating features:
– Application-specific capability (e.g. DSP)
– power
– cost
– speed (must be predictable)
General purpose computing:
• Intended to run a fully general set of applications
• End-user programmable
• Faster is always better
• Differentiating features:
– speed (need not be fully predictable)
– cost (largest component power)
Evolution of GPPs and DSPs
• General Purpose Processors (GPPs) trace roots back to Eckert, Mauchly,
Von Neumann (ENIAC)
• DSP processors are microprocessors designed for efficient mathematical
manipulation of digital signals.
– DSP evolved from Analog Signal Processors (ASPs), using analog
hardware to transform physical signals (classical electrical
engineering)
– ASP to DSP because:
• DSP is insensitive to its environment (e.g., same response in snow or desert, if it works at all)
• DSP behavior is identical despite component variations, whereas two analog systems behave differently even when built from the same components with 1% variation
• Different history and different applications led to different terms,
different metrics, some new inventions.
DSP vs. General Purpose CPUs
• DSPs tend to run one program, not many programs.
– Hence OSes are much simpler, there is no virtual memory or
protection, ...
• DSPs usually run applications with hard real-time
constraints:
– You must account for anything that could happen in a time
slot
– All possible interrupts or exceptions must be accounted for
and their collective time be subtracted from the time interval.
– Therefore, exceptions are BAD.
• DSPs usually process infinite continuous data streams.
• The design of DSP architectures and ISAs is driven by the requirements of DSP algorithms.
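The hard real-time bullet above can be made concrete with a small budget calculation. This is only a sketch: the clock rate, sample rate, and interrupt costs are hypothetical numbers chosen for illustration, and the function names are not from any toolchain.

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Instruction cycles available between two consecutive input samples."""
    return clock_hz // sample_rate_hz


def usable_budget(clock_hz, sample_rate_hz, worst_case_overheads):
    """Cycles left for the algorithm once the worst-case cost of every
    possible interrupt or exception has been subtracted from the time slot."""
    return cycles_per_sample(clock_hz, sample_rate_hz) - sum(worst_case_overheads)


# Hypothetical system: 100 MHz DSP, 48 kHz audio, two interrupt sources
# costing at most 120 and 300 cycles per time slot.
budget = usable_budget(100_000_000, 48_000, [120, 300])
print(budget)
```

If the algorithm's worst-case cycle count exceeds this budget, the deadline can be missed, which is why unpredictable exceptions are treated as unacceptable.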
DSP vs. GPP
• The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate (MAC).
– MAC is common in DSP algorithms that involve computing a vector dot product, such as
digital filters, correlation, and Fourier transforms.
– DSPs are judged by whether they can keep the multipliers busy 100% of the time and by how many MACs are performed in each cycle.
• The "SPEC" of DSPs is 4 algorithms:
– Infinite Impulse Response (IIR) filters
– Finite Impulse Response (FIR) filters
– FFT, and
– convolvers
• In DSPs, target algorithms are important:
– Binary compatibility is not a major issue
• High-level software is not (yet) very important in DSPs.
– People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
TYPES OF DSP PROCESSORS
• 32-BIT FLOATING POINT (5% of market):
– TI TMS320C3X, TMS320C67xx
– AT&T DSP32C
– ANALOG DEVICES ADSP21xxx
– Hitachi SH-4
• 16-BIT FIXED POINT (95% of market):
– TI TMS320C2X, TMS320C62xx
– Infineon TC1xxx (TriCore1)
– MOTOROLA DSP568xx, MSC810x
– ANALOG DEVICES ADSP21xx
– Agere Systems DSP16xxx, Starpro2000
– LSI Logic LSI140x (ZSP400)
– Hitachi SH3-DSP
– StarCore SC110, SC140
DSP Cores vs. Chips
DSPs are usually available as synthesizable cores or off-the-shelf chips
• Synthesizable Cores:
– Map into chosen fabrication process
• Speed, power, and size vary
– Choice of peripherals, etc. (SoC)
– Requires extensive hardware development effort.
• Off-the-shelf chips:
– Highly optimized for speed, energy efficiency, and/or cost.
– Limited performance, integration options.
– Tools, 3rd-party support often more mature
DSP ARCHITECTURE
Enabling Technologies
Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970s | Discrete logic | Non-real-time processing, simulation | Bipolar SSI, MSI; FFT algorithm
Late 1970s | Building block | Military radars, digital comm. | Single-chip bipolar multiplier; flash A/D
Early 1980s | Single-chip DSP µP | Telecom, control | µP architectures; NMOS/CMOS
Late 1980s | Function/application-specific chips | Computers, communication | Vector processing; parallel processing
Early 1990s | Multiprocessing | Video/image processing | Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990s | Single-chip multiprocessing | Wireless telephony, Internet related | Low-power single-chip DSP; multiprocessing
Texas Instruments TMS320 Family
Multiple DSP µP Generations
Device | First Sample | Bit Size | Clock speed (MHz) | Instruction Throughput | MAC execution (ns) | MOPS | Device density (# of transistors)
Uniprocessor Based (Harvard Architecture)
TMS32010 | 1982 | 16 integer | 20 | 5 MIPS | 400 | 5 | 58,000 (3)
TMS320C25 | 1985 | 16 integer | 40 | 10 MIPS | 100 | 20 | 160,000 (2)
TMS320C30 | 1988 | 32 flt. pt. | 33 | 17 MIPS | 60 | 33 | 695,000 (1)
TMS320C50 | 1991 | 16 integer | 57 | 29 MIPS | 35 | 60 | 1,000,000 (0.5)
TMS320C2XXX | 1995 | 16 integer | - | 40 MIPS | 25 | 80 | -
Multiprocessor Based
TMS320C80 | 1996 | 32 integer/flt. | - | 2 GOPS, 120 MFLOP | - | - | MIMD
TMS320C62XX | 1997 | 16 integer | - | 1600 MIPS | 5 | 20 GOPS | VLIW
TMS320C67XX | 1997 | 32 flt. pt. | - | - | 5 | 1 GFLOP | VLIW
DSP Applications
• Digital audio applications
– MPEG Audio
– Portable audio
• Digital cameras
• Cellular telephones
• Wearable medical appliances
• Storage products:
– disk drive servo control
• Military applications:
– radar
– sonar
• Industrial control
• Seismic exploration
• Networking:
– Wireless
– Base station
– Cable modems
– ADSL
– VDSL
DSP Applications
DSP Algorithm: System Application
• Speech Coding: Digital cellular telephones, personal communications systems, digital cordless telephones, multimedia computers, secure communications.
• Speech Encryption: secure communications.
• Speech Recognition: Advanced user interfaces, multimedia workstations, robotics, automotive applications, cellular telephones, personal communications systems.
• Speech Synthesis: Advanced user interfaces, robotics
• Speaker Identification: Security, multimedia workstations, advanced user interfaces
• High-fidelity Audio: Consumer audio, consumer video, digital audio broadcast, professional audio, multimedia computers
• Modems: digital audio broadcast, digital signaling on cable TV, multimedia computers, wireless computing, navigation, data/fax
• Noise cancellation: Professional audio, advanced vehicular audio, industrial applications
• Audio Equalization: Consumer audio, professional audio, advanced vehicular audio, music
• Ambient Acoustics Emulation: Consumer audio, professional audio, advanced vehicular audio, music
• Audio Mixing/Editing: Professional audio, music, multimedia computers
• Sound Synthesis: Professional audio, music, multimedia computers, advanced user interfaces
• Vision: Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
• Image Compression: Digital photography, digital video, multimedia computers, videoconferencing
• Image Compositing: Multimedia computers, consumer video, advanced user interfaces, navigation
• Beamforming: Navigation, medical imaging, radar/sonar, signals intelligence
• Echo cancellation: Speakerphones, hands-free cellular telephones
• Spectral Estimation: Signals intelligence, radar/sonar, professional audio, music
Another Look at DSP Applications
(Cost increases toward the top of this list; volume increases toward the bottom.)
• High-end
– Military applications
– Wireless Base Station - TMS320C6000
– Cable modem
– gateways
• Mid-end
– Industrial control
– Cellular phone - TMS320C540
– Fax/voice server
• Low end
– Storage products - TMS320C27
– Digital camera - TMS320C5000
– Portable phones
– Wireless headsets
– Consumer audio
– Automobiles, toasters, thermostats, ...
DSP range of applications
CELLULAR TELEPHONE SYSTEM
[Figure: cellular telephone block diagram. Blocks: keypad and number display, controller, physical layer processing, baseband converter, RF modem, A/D, speech encode, speech decode, DAC.]
HW/SW/IC PARTITIONING
[Figure: the cellular telephone blocks (controller, physical layer processing, baseband converter, RF modem, A/D, speech encode, speech decode, DAC) partitioned among a microcontroller, an ASIC, a DSP, and an analog IC.]
Mapping Onto System-on-Chip (SoC)
[Figure: cellular phone functions mapped onto one chip. Labeled blocks: keypad interface, phone book, control, protocol, S/P, DMA, RAM, µC, speech quality enhancement, voice recognition, ASIC logic, DSP core, RPE-LTP speech decoder, de-interleaver and decoder, demodulator and synchronizer, Viterbi equalizer.]
Example Wireless Phone Organization
[Figure: example wireless phone organization; labeled processors: C540, ARM7.]
Multimedia I/O Architecture
[Figure: multimedia I/O architecture around a low-power bus. Labeled blocks: radio modem, embedded processor, Sched, ECC, Pact, interface, FB, FIFOs, video decompression, SRAM, pen, graphics, audio, video; data flow indicated.]
Multimedia System-on-Chip (SoC)
E.g. Multimedia terminal electronics
• Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O
[Figure: multimedia terminal chip with I/O (graphics out, uplink radio, downlink radio, video I/O, voice I/O, pen in) and on-chip blocks: µP, video unit, custom, coms, memory, DSP.]
DSP Algorithm Format
• DSP culture has a graphical format to represent formulas.
• Like a flowchart for formulas and inner loops, not programs.
• Some seem natural:
Σ is add, X is multiply
• Others are obtuse:
z^-1 means take variable from earlier iteration.
• These graphs are trivial to decode
DSP Algorithm Notation
• Uses "flowchart" notation instead of equations
• Multiply is a node labeled X
• Add is a node labeled + or Σ
• Delay/Storage is a node labeled Delay, z^-1, or D
Typical DSP Algorithm:
Finite-Impulse Response (FIR) Filter
• Filters reduce signal noise and enhance image or signal quality by
removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute:
\[ y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) = h(n) * x(n) \]
where
– x is the input sequence
– y is the output sequence
– h is the impulse response (filter coefficients)
– N is the number of taps (coefficients) in the filter
• Output sequence depends only on input sequence and impulse
response.
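The FIR sum can be sketched directly in code. This is a minimal illustration, assuming samples before the start of the input are zero; the function name `fir_filter` is chosen here for illustration and does not come from the slides.

```python
def fir_filter(h, x):
    """Direct-form FIR: y(i) = sum over k of h[k] * x[i - k].

    Samples with a negative index are treated as zero (an assumption
    of this sketch about how the filter starts up).
    """
    n_taps = len(h)
    y = []
    for i in range(len(x)):
        acc = 0.0  # accumulator for the multiply-accumulate chain
        for k in range(n_taps):
            if i - k >= 0:
                acc += h[k] * x[i - k]  # one tap = one multiply-accumulate
        y.append(acc)
    return y


# 3-tap moving-sum filter applied to a short ramp
print(fir_filter([1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0]))  # [1.0, 3.0, 6.0, 9.0]
```

Feeding a unit impulse returns the coefficients themselves, which is why h is called the impulse response.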
Finite-impulse Response (FIR) Filter
• N most recent samples in the delay line (Xi)
• New sample moves data down delay line
• “Tap” is a multiply-add
• Each tap (N taps total) nominally requires:
– Two data fetches
– Multiply
– Accumulate
– Memory write-back to update delay line
• Goal: at least 1 FIR Tap / DSP instruction cycle
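FINITE-IMPULSE RESPONSE (FIR) FILTER
[Figure: direct-form FIR structure. The input x passes through a chain of z^-1 delays; each delayed sample is multiplied by one coefficient h0, h1, ..., hN-2, hN-1 (one tap each) and the products are summed.]
\[ y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) \]
Goal: at least 1 FIR Tap / DSP instruction cycle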
Sample Computational Rates
for FIR Filtering
Signal type | Frequency | # taps | Performance
Speech | 8 kHz | N = 128 | 2 MOPs
Music | 48 kHz | N = 256 | 24 MOPs
Video phone | 6.75 MHz | N*N = 81 | 1,090 MOPs
TV | 27 MHz | N*N = 81 | 4,370 MOPs
HDTV | 144 MHz | N*N = 81 | 23,300 MOPs
A 1-D FIR has n_op = 2N operations per sample and a 2-D FIR has n_op = 2N^2.
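The rule of two operations (one multiply, one add) per tap per sample can be evaluated for each row. A small sketch, where `fir_mops` is an illustrative helper name and the tap counts are those listed in the table:

```python
def fir_mops(sample_rate_hz, taps):
    """Millions of operations per second for an FIR filter:
    2 operations (multiply + add) per tap per sample."""
    return 2 * taps * sample_rate_hz / 1e6


rates = {
    "Speech": fir_mops(8e3, 128),
    "Music": fir_mops(48e3, 256),
    "Video phone": fir_mops(6.75e6, 81),  # 2-D filter with N*N = 81 taps
    "TV": fir_mops(27e6, 81),
    "HDTV": fir_mops(144e6, 81),
}
for name, mops in rates.items():
    print(f"{name}: {mops:.1f} MOPs")
```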
FIR filter on (simple)
General Purpose Processor
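loop:
lw x0, 0(r0) ; load first operand
lw y0, 0(r1) ; load second operand
mul a, x0,y0 ; a = x0 * y0
add y0,a,b ; y0 = a + b
sw y0,(r2) ; store result
inc r0 ; advance first operand pointer
inc r1 ; advance second operand pointer
inc r2 ; advance result pointer
dec ctr ; decrement loop counter
tst ctr ; test loop counter
jnz loop ; repeat while counter is nonzero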
• Problems: Bus / memory bandwidth bottleneck, control code
overhead
Infinite-Impulse Response (IIR) Filter
• Infinite Impulse Response (IIR) filters compute:
\[ y(i) = \sum_{k=1}^{M-1} a(k)\, y(i-k) + \sum_{k=0}^{N-1} b(k)\, x(i-k) \]
• Output sequence depends on input sequence, previous
outputs, and impulse response.
• Both FIR and IIR filters
– Require dot product (multiply-accumulate) operations
– Use fixed coefficients
• Adaptive filters update their coefficients to minimize the
distance between the filter output and the desired signal.
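The IIR recurrence differs from the FIR only by the feedback sum over earlier outputs. A minimal sketch, assuming zero initial conditions and feedback terms added with a plus sign; `iir_filter` is an illustrative name and `a[0]` is an unused placeholder so indices match the formula.

```python
def iir_filter(a, b, x):
    """y(i) = sum_{k>=1} a[k] * y(i-k) + sum_{k>=0} b[k] * x(i-k).

    a[0] is unused. Outputs and inputs before index 0 are taken as zero
    (an assumption of this sketch).
    """
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(len(b)):      # feed-forward taps on the input
            if i - k >= 0:
                acc += b[k] * x[i - k]
        for k in range(1, len(a)):   # feedback taps on earlier outputs
            if i - k >= 0:
                acc += a[k] * y[i - k]
        y.append(acc)
    return y


# One-pole smoother y(i) = 0.5*y(i-1) + 0.5*x(i), driven by a unit step
print(iir_filter([0.0, 0.5], [0.5], [1.0, 1.0, 1.0, 1.0]))  # [0.5, 0.75, 0.875, 0.9375]
```

Both loops are multiply-accumulate chains, so the same MAC hardware serves FIR and IIR filters.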
Discrete Fourier Transform
• The Discrete Fourier Transform (DFT) allows for spectral analysis in the frequency domain.
• It is computed as
\[ y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \quad k = 0, 1, \ldots, N-1 \]
where \( W_N = e^{-2\pi j / N} \), \( j = \sqrt{-1} \), and
– x is the input sequence in the time domain
– y is an output sequence in the frequency domain
• The Inverse Discrete Fourier Transform is computed as
\[ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} W_N^{-nk}\, y(k), \quad n = 0, 1, \ldots, N-1 \]
• The Fast Fourier Transform (FFT) provides an efficient method for computing the DFT.
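A direct evaluation of the DFT sums takes N^2 complex multiply-accumulates, which is the cost the FFT reduces. The sketch below uses the forward kernel W_N = e^(-2πj/N) and includes the 1/N scaling on the inverse that a forward/inverse round trip requires; the function names are illustrative.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT: y(k) = sum over n of W^(n*k) * x(n)."""
    n_pts = len(x)
    w = cmath.exp(-2j * cmath.pi / n_pts)
    return [sum(w ** (n * k) * x[n] for n in range(n_pts)) for k in range(n_pts)]


def idft(y):
    """Inverse DFT: x(n) = (1/N) * sum over k of W^(-n*k) * y(k)."""
    n_pts = len(y)
    w = cmath.exp(-2j * cmath.pi / n_pts)
    return [sum(w ** (-n * k) * y[k] for k in range(n_pts)) / n_pts
            for n in range(n_pts)]


signal = [1.0, 2.0, 3.0, 4.0]
spectrum = dft(signal)
recovered = idft(spectrum)
print([round(abs(v), 6) for v in recovered])
```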
Discrete Cosine Transform (DCT)
• The Discrete Cosine Transform (DCT) is frequently used in
video compression (e.g., MPEG-2).
• The DCT and Inverse DCT (IDCT) are computed as:
\[ y(k) = e(k) \sum_{n=0}^{N-1} \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] x(n), \quad k = 0, 1, \ldots, N-1 \]
\[ x(n) = \frac{2}{N} \sum_{k=0}^{N-1} e(k) \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] y(k), \quad n = 0, 1, \ldots, N-1 \]
where e(k) = 1/sqrt(2) if k = 0; otherwise e(k) = 1.
• An N-point 1D DCT requires N^2 MAC operations.
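The DCT/IDCT pair can be checked numerically with a direct N^2 evaluation. A sketch under the definitions on this slide, with e(0) = 1/sqrt(2), e(k) = 1 otherwise, and a 2/N factor on the inverse; the function names are illustrative.

```python
import math


def e(k):
    """Scale factor: 1/sqrt(2) for k = 0, otherwise 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """y(k) = e(k) * sum over n of cos((2n+1)*pi*k / 2N) * x(n)."""
    n_pts = len(x)
    return [e(k) * sum(math.cos((2 * n + 1) * math.pi * k / (2 * n_pts)) * x[n]
                       for n in range(n_pts))
            for k in range(n_pts)]


def idct(y):
    """x(n) = (2/N) * sum over k of e(k) * cos((2n+1)*pi*k / 2N) * y(k)."""
    n_pts = len(y)
    return [(2 / n_pts) * sum(e(k) * math.cos((2 * n + 1) * math.pi * k / (2 * n_pts)) * y[k]
                              for k in range(n_pts))
            for n in range(n_pts)]


samples = [1.0, 2.0, 3.0, 4.0]
coeffs = dct(samples)
restored = idct(coeffs)
print([round(v, 6) for v in restored])
```

Each output coefficient is an N-term dot product, which is where the N^2 MAC count comes from.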
DSP BENCHMARKS
• DSPstone: University of Aachen, application benchmarks
– ADPCM TRANSCODER - CCITT G.721, REAL_UPDATE, COMPLEX_UPDATES
– DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
– FIR, FIR2DIM, HR_ONE_BIQUAD
– LMS, FFT_INPUT_SCALED
• BDTImark2000: Berkeley Design Technology Inc.
– 12 DSP kernels in hand-optimized assembly language
– Returns single number (higher means faster) per processor
– Uses only on-chip memory (memory bandwidth is the major bottleneck in performance of embedded applications).
• EEMBC (pronounced "embassy"): EDN Embedded Microprocessor Benchmark Consortium
– Consortium of 30 companies formed by Electrical Design News (EDN)
– Benchmark evaluates compiled C code on a variety of embedded processors (microcontrollers, DSPs, etc.)
– Application domains: automotive-industrial, consumer, office automation, networking and telecommunications
Basic Architectural Features of DSPs
• Data path configured for DSP
– Fixed-point arithmetic
– MAC- Multiply-accumulate
• Multiple memory banks and buses -
– Harvard Architecture
– Multiple data memories
• Specialized addressing modes
– Bit-reversed addressing
– Circular buffers
• Specialized instruction set and execution control
– Zero-overhead loops
– Support for fast MAC
– Fast Interrupt Handling
• Specialized peripherals for DSP
DSP Data Path: Arithmetic
• DSPs dealing with numbers representing real world
=> Want “reals”/ fractions
• DSPs dealing with numbers for addresses
=> Want integers
• Support “fixed point” as well as integers
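– Fractional format: sign bit S, radix point immediately after the sign bit: -1 ≤ x < 1
– Integer format: sign bit S, radix point at the far right: -2^(N-1) ≤ x < 2^(N-1)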
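The fractional fixed-point format can be sketched with 16-bit words, where the stored integer equals the real value times 2^15 (commonly called Q15, a name not used on the slide). The helper names below are illustrative, and the multiply helper does not saturate.

```python
def float_to_q15(value):
    """Convert a real number in [-1, 1) to a 16-bit fractional word,
    clamping values outside the representable range."""
    scaled = int(round(value * 32768))
    return max(-32768, min(32767, scaled))


def q15_to_float(word):
    """Interpret a 16-bit fractional word as a real number."""
    return word / 32768


def q15_mul(a, b):
    """Fractional multiply: the 32-bit product is shifted right by 15
    to return to the 16-bit fractional scale (no saturation here)."""
    return (a * b) >> 15


half = float_to_q15(0.5)
print(half, q15_to_float(q15_mul(half, half)))  # 16384 0.25
```

The same integer multiplier serves both formats; only the placement of the radix point, and hence the shift after the multiply, differs.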
DSP Data Path: Precision
• Word size affects precision of fixed point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words
• Floating Point DSPs cost 2X - 4X vs. fixed point, slower than
fixed point
• DSP programmers will scale values inside code
– SW Libraries
– Separate explicit exponent
• "Block floating point": single exponent for a group of fractions
• Floating point support simplifies development
DSP Data Path: Overflow
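• DSPs are descended from analog systems, where an over-range signal clips instead of wrapping around:
– Modulo (wrap-around) arithmetic is therefore undesirable.
• Instead, set to most positive (2^(N–1) – 1) or most negative value (–2^(N–1)): "saturation"
• Many DSP algorithms were developed in this model.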
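The difference between modulo and saturating overflow shows up as soon as a sum leaves the representable range. A minimal sketch for N-bit two's-complement values; the function names are illustrative.

```python
def wrap_add(a, b, bits=16):
    """Two's-complement add with modulo (wrap-around) overflow."""
    mask = (1 << bits) - 1
    total = (a + b) & mask
    # Reinterpret the masked value as a signed number
    return total - (1 << bits) if total >= (1 << (bits - 1)) else total


def sat_add(a, b, bits=16):
    """Add that clamps to the most positive or most negative value."""
    most_positive = (1 << (bits - 1)) - 1
    most_negative = -(1 << (bits - 1))
    return max(most_negative, min(most_positive, a + b))


print(wrap_add(30000, 10000))  # -25536: a loud signal flips sign
print(sat_add(30000, 10000))   # 32767: clipped at full scale
```

Wrap-around turns a slightly too-large positive value into a large negative one, while saturation limits the error to the amount of the overshoot.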
DSP Data Path: Multiplier
• Specialized hardware performs all key arithmetic operations in 1 cycle
• 50% of instructions can involve multiplier
=> single cycle latency multiplier
• Need to perform multiply-accumulate (MAC)
• n-bit multiplier => 2n-bit product
DSP Data Path: Accumulator
• Don't want overflow or to have to scale the accumulator
• Option 1: accumulator wider than product: "guard bits"
– Motorola DSP: 24b x 24b => 48b product, 56b accumulator
• Option 2: shift right and round product before adder
[Figure: two datapaths. Option 1: multiplier -> ALU -> accumulator extended with guard bits (G). Option 2: multiplier -> shift -> ALU -> accumulator.]
DSP Data Path: Rounding
• Even with guard bits, will need to round when store accumulator into
memory
• 3 DSP standard options
• Truncation: chop results
=> biases results down (toward more negative values in two's complement)
• Round to nearest:
< 1/2 round down, ≥ 1/2 round up (more positive)
=> smaller bias
• Convergent:
< 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0)
=> no bias
IEEE 754 calls this round to nearest even
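The three rounding options can be compared on integers by discarding the low `shift` bits, as happens when a wide accumulator is stored to a narrower memory word. A sketch with illustrative function names; values are two's-complement integers and Python's `>>` on negative numbers is an arithmetic shift.

```python
def truncate(value, shift):
    """Chop the low bits (rounds toward minus infinity in two's complement)."""
    return value >> shift


def round_nearest(value, shift):
    """Add half an LSB, then chop: exact ties round up (more positive)."""
    return (value + (1 << (shift - 1))) >> shift


def round_convergent(value, shift):
    """Round to nearest; exact ties go to the result with an even LSB."""
    half = 1 << (shift - 1)
    floor = value >> shift
    remainder = value & ((1 << shift) - 1)
    if remainder > half or (remainder == half and (floor & 1)):
        return floor + 1
    return floor


# 24/16 = 1.5 and 40/16 = 2.5 are both exact ties when dropping 4 bits
for v in (24, 40):
    print(v / 16, truncate(v, 4), round_nearest(v, 4), round_convergent(v, 4))
```

Ties alternate between rounding up and down under the convergent rule, so their errors cancel on average instead of accumulating.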
Data Path Comparison
DSP Processor:
• Specialized hardware performs all key arithmetic operations in 1 cycle.
• Hardware support for managing numeric fidelity:
– Shifters
– Guard bits
– Saturation
General-Purpose Processor:
• Multiplies often take >1 cycle
• Shifts often take >1 cycle
• Other operations (e.g., saturation, rounding) typically take multiple cycles.
TI 320C54x DSP (1995) Functional Block Diagram
First Commercial DSP (1982): Texas Instruments TMS32010
• 16-bit fixed-point arithmetic
• Introduced at 5 MHz (200 ns) instruction cycle.
• "Harvard architecture"
– separate instruction, data memories
• Accumulator
• Specialized instruction set
– Load and Accumulate
• Two-cycle (400 ns) Multiply-Accumulate (MAC) time.
[Figure: processor with separate instruction memory and data memory. Datapath blocks: Mem, T-Register, Multiplier, P-Register, ALU, Accumulator.]
First Generation DSP µP
Texas Instruments TMS32010 - 1982
Features
• 200 ns instruction cycle (5 MIPS)

• 144 words (16 bit) on-chip data RAM
• 1.5K words (16 bit) on-chip program ROM - TMS32010
• External program memory expansion to a total of 4K words at full speed
• 16-bit instruction/data word
• single cycle 32-bit ALU/accumulator
• Single cycle 16 x 16-bit multiply in 200 ns
• Two cycle MAC (5 MOPS)
• Zero to 15-bit barrel shifter
• Eight input and eight output channels
TMS32010 BLOCK DIAGRAM
TMS32010 FIR Filter Code
• Here X4, H4, ... are direct (absolute) memory addresses:
LT X4 ; Load T with x(n-4)
MPY H4 ; P = H4*X4
LTD X3 ; Load T with x(n-3); x(n-4) = x(n-3);
; Acc = Acc + P
MPY H3 ; P = H3*X3
LTD X2 ; Load T with x(n-2); x(n-3) = x(n-2);
; Acc = Acc + P
MPY H2 ; P = H2*X2
...
• Two instructions per tap, but requires unrolling
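Micro-architectural impact - MAC
\[ y(n) = \sum_{m=0}^{N-1} h(m)\, x(n-m) \]
(element of finite-impulse response filter computation)
[Figure: MAC datapath. Operands X and Y feed the multiplier (MPY), followed by an adder/subtractor (ADD/SUB) that accumulates into the accumulator register (ACC REG).]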
Mapping of the filter onto a DSP execution unit
[Figure: signal-flow graph of a filter with input Xn, output Yn, and delay (D) elements, with its numbered operations (1 to 6) mapped onto the multiplier and adder of a DSP execution unit.]
• The critical hardware unit in a DSP is the multiplier - much of the architecture is organized around allowing use of the multiplier on every cycle
• This means providing two operands on every cycle, through multiple data and address busses, multiple address units and local accumulator feedback
MAC Eg. - 320C54x DSP Functional Block Diagram
DSP Memory
• FIR Tap implies multiple memory accesses
• DSPs require multiple data ports
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
– Instruction repeat buffer: do 1 instruction 256 times
– Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
– Even then may allow programmer to “lock in” instructions into cache
– Option to turn cache into fast program memory
• No DSPs have data caches.
• May have multiple data memories
Conventional "Von Neumann" memory
HARVARD MEMORY ARCHITECTURE in DSP
[Figure: separate program memory, X memory, and Y memory, connected by the P data, X data, Y data, and global buses.]
Memory Architecture Comparison
Harvard architecture:
• 2-4 memory accesses/cycle
• No caches: on-chip SRAM
[Figure: processor connected to separate program memory and data memory.]
Von Neumann architecture:
• Typically 1 access/cycle
• Use caches
[Figure: processor connected to a single memory.]
E.g. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
Eg. TI 320C62x/67x DSP (1997)
DSP Addressing
• Have standard addressing modes: immediate,
displacement, register indirect
• Want to keep MAC datapath busy
• Assumption: any extra instructions imply clock cycles
of overhead in inner loop
=> complex addressing is good
=> don’t use datapath to calculate fancy address
• Autoincrement/Autodecrement register indirect
– lw r1,0(r2)+ => r1 <- M[r2]; r2<-r2+1
– Option to do it before addressing, positive or negative
DSP Addressing: FFT
• FFTs start or end with data in bit-reversed (butterfly) order
0 (000) => 0 (000)
1 (001) => 4 (100)
2 (010) => 2 (010)
3 (011) => 6 (110)
4 (100) => 1 (001)
5 (101) => 5 (101)
6 (110) => 3 (011)
7 (111) => 7 (111)
• What can be done to avoid the overhead of address-checking instructions for the FFT?
• Have an optional "bit reverse" addressing mode for use with autoincrement addressing
• Many DSPs have "bit reverse" addressing for radix-2 FFT
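The index mapping that bit-reversed addressing performs in hardware can be written out in a few lines. A sketch with an illustrative function name; `bits` is log2 of the FFT size.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index, e.g. 001 -> 100 for bits=3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit to the top
        index >>= 1
    return result


# Reproduces the 8-point ordering listed above
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

In software this costs a loop per address; the addressing mode produces the same sequence as a side effect of the pointer update, with no extra instructions in the inner loop.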
BIT REVERSED ADDRESSING
Input address (binary) | Input sample | Output
000 | x(0) | F(0)
100 | x(4) | F(1)
010 | x(2) | F(2)
110 | x(6) | F(3)
001 | x(1) | F(4)
101 | x(5) | F(5)
011 | x(3) | F(6)
111 | x(7) | F(7)
[Figure: data flow in the radix-2 decimation-in-time FFT algorithm, in three stages: four 2-point DFTs, two 4-point DFTs, one 8-point DFT.]
DSP Addressing: Buffers
• DSPs dealing with continuous I/O
• Often interact with an I/O buffer (delay lines)
• To save memory, buffers often organized as circular buffers
• What can be done to avoid the overhead of address-checking instructions for a circular buffer?
• Option 1: Keep a start register and an end register per address register for use with autoincrement addressing; reset to start when the end of the buffer is reached
• Option 2: Keep a buffer length register, assuming buffers start on an aligned address; reset to start when the end is reached
• Every DSP has "modulo" or "circular" addressing
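A software model of a circular delay line shows what modulo addressing replaces: the explicit wrap of the pointer after every access. This is a sketch; the class and method names are illustrative.

```python
class CircularDelayLine:
    """Fixed-length delay line stored as a circular buffer."""

    def __init__(self, length):
        self.data = [0.0] * length
        self.pos = 0  # models an address register with modulo update

    def push(self, sample):
        """Overwrite the oldest sample and advance the pointer with wrap."""
        self.data[self.pos] = sample
        self.pos = (self.pos + 1) % len(self.data)

    def recent(self, k):
        """Return the sample pushed k steps ago (k = 0 is the newest)."""
        return self.data[(self.pos - 1 - k) % len(self.data)]


line = CircularDelayLine(4)
for s in [1.0, 2.0, 3.0, 4.0, 5.0]:
    line.push(s)
print(line.recent(0), line.recent(3))  # 5.0 2.0
```

No samples are copied when a new one arrives; only the pointer moves, which is why circular buffers save the write-back traffic of a shifting delay line.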
CIRCULAR BUFFERS
Instructions accommodate three elements:
• buffer address
• buffer size
• increment
Allows for cycling through:
• delay elements
• coefficients in data memory
Addressing Comparison
DSP Processor:
• Dedicated address generation units
• Specialized addressing modes; e.g.:
– Autoincrement
– Modulo (circular)
– Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor:
• Often, no separate address generation unit
• General-purpose addressing modes
Address calculation unit for DSPs
• Supports modulo and bit reversal arithmetic
• Often duplicated to calculate multiple addresses per cycle
DSP Instructions and Execution
• May specify multiple operations in a single instruction
• Must support Multiply-Accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead
– Loop an instruction or sequence
– A value of 0 in the loop count register usually means loop the maximum number of times
– If the loop count is calculated at run time, make sure a result of 0 is handled, since it does not mean zero iterations
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches
ADSP 2100: ZERO-OVERHEAD LOOP
DO <addr> UNTIL condition
Address generation for "DO X UNTIL condition", where X is the address of the last instruction in the loop body:
PCS = PC + 1
if (PC == X && !condition)
    PC = PCS
else
    PC = PC + 1
• Eliminates a few instructions in loops
• Important in loops with small bodies
Instruction Set Comparison
DSP Processor:
• Specialized, complex instructions
• Multiple operations per instruction
mac x0,y0,a x:(r0)+,x0 y:(r4)+,y0
General-Purpose Processor:
• General-purpose instructions
• Typically only one operation per instruction
mov *r0,x0
mov *r1,y0
mpy x0, y0, a
add a, b
mov y0, *r2
inc r0
inc r1
Specialized Peripherals for DSPs
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
• On-chip peripherals often designed for "background" operation, even when core is powered down.
[Figure: DSP core with instruction memory, data memory, serial ports, A/D converter, and D/A converter.]
Specialized DSP peripherals
TI TMS320C203/LC203 BLOCK DIAGRAM
DSP Core Approach - 1995
Summary of Architectural Features of DSPs
– Fixed-point arithmetic
– MAC- Multiply-accumulate
– Harvard Architecture
• Specialized addressing modes
– Bit-reversed addressing
– Support for MAC
• THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN.
DSP Software Development Considerations
• Different from general-purpose software development:
– Resource-hungry, complex algorithms.
– Specialized and/or complex processor architectures.
– Severe cost/storage limitations.
– Hard real-time constraints.
– Optimization is essential.
– Increased testing challenges.
• Essential tools:
– Assembler, linker.
– Instruction set simulator.
– HLL Code generation: C compiler.
– Debugging and profiling tools.
• Increasingly important:
– Software libraries.
– Real-time operating systems.
Classification of Current DSP Architectures
• Modern Conventional DSPs:

– Similar to the original DSPs of the early 1980s
– Single instruction/cycle. Example: TI TMS320C54x
• Enhanced Conventional DSPs:
– Add parallel execution units: SIMD operation
– Complex, compound instructions. Example: TI TMS320C55x
• Multiple-Issue DSPs:
– VLIW. Example: TI TMS320C62xx, TMS320C64xx
– Superscalar. Example: LSI Logic ZSP400
A Conventional DSP:
TI TMS320C54xx
• 16-bit fixed-point DSP.
• Issues one 16-bit instruction/cycle
• Modified Harvard memory architecture
• Peripherals typical of conventional DSPs:
– 2-3 synch. Serial ports, parallel port
– Bit I/O, Timer, DMA
• Inexpensive (100 MHz ~$5 qty 10K).
• Low power (60 mW @ 1.8V, 100 MHz).
A Current Conventional DSP:
TI TMS320C54xx
An Enhanced Conventional DSP:
TI TMS320C55xx
• The TMS320C55xx is based on Texas Instruments' earlier TMS320C54xx
family, but adds significant enhancements to the architecture and instruction
set, including:
– Two instructions/cycle
• Instructions are scheduled for parallel execution by the assembly programmer or
compiler.
– Two MAC units.
• Complex, compound instructions:
– Assembly source code compatible with C54xx
– Mixed-width instructions: 8 to 48 bits.
– 200 MHz @ 1.5 V, ~130 mW , $17 qty 10k
• Poor compiler target.
TI TMS320C55xx
16-bit Fixed-Point VLIW DSP:
TI TMS320C6201 Revision 2 (1997)
The TMS320C62xx is the first fixed-point DSP processor from Texas Instruments that is based on a VLIW-like architecture which allows it to execute up to eight 32-bit RISC-like instructions per clock cycle.
[Figure: C6201 block diagram. Program cache / program memory (32-bit address, 256-bit data, 512K bits RAM); C6201 CPU megamodule with program fetch, instruction dispatch, instruction decode, control registers, control logic, test, emulation, and interrupts; Data Path 1 (A register file; units L1, S1, M1, D1) and Data Path 2 (B register file; units D2, M2, S2, L2); data memory (32-bit address; 8-, 16-, 32-bit data; 512K bits RAM); peripherals: power down, host port interface, 4-DMA, external memory interface, 2 timers, 2 multichannel buffered serial ports (T1/E1).]
C6201 Internal Memory Architecture
• Separate Internal Program and Data Spaces
• Program
– 16K 32-bit instructions (2K Fetch Packets)
– 256-bit Fetch Width
– Configurable as either
• Direct mapped cache or memory mapped program memory
• Data
– 32K x 16
– Single Ported Accessible by Both CPU Data Buses
– 4 x 8K 16-bit Banks
• 2 Possible Simultaneous Memory Accesses (4 Banks)
• 4-Way Interleave, Banks and Interleave Minimize Access Conflicts
C62x Datapaths
[Figure: C62x datapaths. Register file A (A0 - A15) feeds units L1, S1, M1, D1; register file B (B0 - B15) feeds units D2, M2, S2, L2. Cross paths 1X and 2X connect the two sides. Data buses: DDATA_I1 / DDATA_I2 (load data), DDATA_O1 / DDATA_O2 (store data), DADR1 / DADR2 (address). Legend: cross paths; 40-bit write paths (8 MSBs); 40-bit read paths/store paths.]
C62x Functional Units
• L-Unit (L1, L2)
– 40-bit Integer ALU, Comparisons
– Bit Counting, Normalization
• S-Unit (S1, S2)

– 32-bit ALU, 40-bit Shifter
– Bitfield Operations, Branching
• M-Unit (M1, M2)

– 16 x 16 -> 32
• D-Unit (D1, D2)

– 32-bit Add/Subtract
– Address Calculations
C62x Instruction Packing
Instruction Packing: Advanced VLIW
• Fetch Packet
– CPU fetches 8 instructions/cycle
• Execute Packet
– CPU executes 1 to 8 instructions/cycle
– Fetch packets can contain multiple execute packets
• Parallelism determined at compile / assembly time
• Examples
– 1) 8 parallel instructions
– 2) 8 serial instructions
– 3) Mixed Serial/Parallel Groups
• A // B
• C
• D
• E // F // G // H
• Reduces Codesize, Number of Program Fetches, Power Consumption
[Figure: one fetch packet A B C D E F G H split three ways. Example 1: all eight instructions in one execute packet. Example 2: eight execute packets of one instruction each. Example 3: execute packets A B, C, D, E F G H.]
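The grouping of a fetch packet into execute packets can be sketched in a few lines. This is a minimal illustration, assuming the C6000 convention that each instruction carries a "p-bit" that is 1 when the next instruction issues in parallel with it; the function name and list-based encoding are hypothetical and only model the idea.

```python
def split_execute_packets(instructions, p_bits):
    """Group one fetch packet into execute packets.

    p_bits[i] == 1 is assumed to mean: instruction i+1 issues in
    parallel with instruction i (hypothetical list encoding).
    """
    packets, current = [], []
    for name, p in zip(instructions, p_bits):
        current.append(name)
        if p == 0:              # parallel chain ends, close the execute packet
            packets.append(current)
            current = []
    if current:                 # a trailing chain is closed at the fetch-packet boundary
        packets.append(current)
    return packets


fetch_packet = list("ABCDEFGH")
# Mixed serial/parallel grouping: A // B, C, D, E // F // G // H
print(split_execute_packets(fetch_packet, [1, 0, 0, 0, 1, 1, 1, 0]))
```

With all p-bits set except the last, the whole fetch packet issues in one cycle; with all p-bits clear, it takes eight cycles.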
C62x Pipeline Operation
Pipeline Phases: Fetch (PG PS PW PR), Decode (DP DC), Execute (E1 E2 E3 E4 E5)
• Single-Cycle Throughput
• Operate in Lock Step
• Fetch
– PG Program Address Generate
– PS Program Address Send
– PW Program Access Ready Wait
– PR Program Fetch Packet Receive
• Decode
– DP Instruction Dispatch
– DC Instruction Decode
• Execute
– E1 - E5 Execute 1 through Execute 5
[Figure: successive execute packets, each passing through PG PS PW PR DP DC E1 E2 E3 E4 E5, overlapped in the pipeline.]
Delay Slots
• Delay Slots: number of extra cycles until a result is:
– written to the register file
– available for use by subsequent instructions
• A multi-cycle NOP instruction can fill delay slots while minimizing code size impact
Instruction type | Execute stages | Delay slots
Most instructions | E1 | 0
Integer multiply | E1 E2 | 1
Loads | E1 E2 E3 E4 E5 | 4
Branches | E1 (the branch target then passes through PG PS PW PR DP DC before its E1) | 5
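The scheduling consequence of delay slots can be sketched as simple arithmetic. This is an illustrative model only: the dictionary keys and function names are hypothetical, and the delay-slot counts are the ones listed on the slide.

```python
# Delay slots per instruction class, as listed on the slide.
DELAY_SLOTS = {"single_cycle": 0, "multiply": 1, "load": 4, "branch": 5}


def first_use_cycle(issue_cycle, kind):
    """Earliest cycle in which a dependent instruction can issue."""
    return issue_cycle + DELAY_SLOTS[kind] + 1


def nop_cycles_needed(kind, independent_instructions=0):
    """Idle cycles left after filling delay slots with independent work."""
    return max(0, DELAY_SLOTS[kind] - independent_instructions)


print(first_use_cycle(0, "load"))        # load issued in cycle 0
print(nop_cycles_needed("load", 2))      # two useful instructions placed in the slots
```

A compiler or assembly programmer tries to drive `nop_cycles_needed` to zero by moving independent instructions into the slots.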
C6000 Instruction Set Features
Conditional Instructions
• All Instructions can be Conditional

– A1, A2, B0, B1, B2 can be used as Conditions
– Based on Zero or Non-Zero Value
– Compare Instructions can allow other Conditions (<, >,
etc)
• Reduces Branching
• Increases Parallelism
C6000 Instruction Set Addressing
Features
• Load-Store Architecture
• Two Addressing Units (D1, D2)
• Orthogonal
– Any Register can be used for Addressing or Indexing
• Signed/Unsigned Byte, Half-Word, Word, Double-
Word Addressable
– Indexes are Scaled by Type
• Register or 5-Bit Unsigned Constant Index
C6000 Instruction Set Addressing Features (continued)
• Indirect Addressing Modes
– Pre-Increment *++R[index]
– Post-Increment *R++[index]
– Pre-Decrement *--R[index]
– Post-Decrement *R--[index]
– Positive Offset *+R[index]
– Negative Offset *-R[index]
• 15-bit Positive/Negative Constant Offset from Either B14 or
B15
• Circular Addressing
– Fast and Low Cost: Power of 2 Sizes and Alignment
– Up to 8 Different Pointers/Buffers, Up to 2 Different Buffer Sizes
• Dual Endian Support
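Two of the addressing features above reduce to short address calculations. The sketch below is a software model under stated assumptions, not a description of the C6000 hardware: indexes are scaled by the access size, and a circular pointer wraps inside an aligned power-of-2 block by keeping the upper address bits and updating only the lower ones. Function names are hypothetical.

```python
def effective_address(base, index, elem_bytes):
    """Index scaled by access type: 1 (byte), 2 (half-word), 4 (word), 8 (double-word)."""
    return base + index * elem_bytes


def circular_advance(ptr, step_bytes, block_bytes):
    """Advance ptr inside an aligned power-of-2 block, wrapping at either end."""
    assert block_bytes & (block_bytes - 1) == 0, "block size must be a power of 2"
    mask = block_bytes - 1
    # Upper bits select the block and stay fixed; lower bits wrap modulo the block size.
    return (ptr & ~mask) | ((ptr + step_bytes) & mask)


print(hex(effective_address(0x1000, 3, 4)))     # word access, index 3
print(hex(circular_advance(0x1006, 4, 8)))      # wraps past the end of an 8-byte block
```

The power-of-2 restriction is what makes the wrap cheap: it needs only a mask, with no compare or divide.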
TI TMS320C64xx
• Announced in February 2000, the TMS320C64xx is an extension of
Texas Instruments' earlier TMS320C62xx architecture.
• The TMS320C64xx has 64 32-bit general-purpose registers, twice as
many as the TMS320C62xx.
• The TMS320C64xx instruction set is a superset of that used in the
TMS320C62xx, and, among other enhancements, adds significant
SIMD processing capabilities:
– 8-bit operations for image/video processing.
• 600 MHz clock speed, but:
– 11-stage pipeline with long latencies
– Dynamic caches.
• $100 qty 10k.
• The only DSP family with compatible fixed and floating-point versions.
Superscalar DSP:
LSI Logic ZSP400
• A 4-way superscalar dynamically scheduled 16-bit fixed-
point DSP core.
• 16-bit RISC-like instructions
• Separate on-chip caches for instructions and data
• Two MAC units, two ALU/shifter units
– Limited SIMD support.
– MACs can be combined for 32-bit operations.
• Disadvantage:
– Dynamic behavior complicates DSP software development:
• Ensuring real-time behavior
• Optimizing code.
Computing Engine Choices
• General Purpose Processors (GPPs): Intended for general purpose computing
(desktops, servers, clusters ...). General Purpose ISAs (RISC or CISC).
• Application-Specific Processors (ASPs): Processors with ISAs and architectural
features tailored towards specific application domains. Special Purpose ISAs.
– e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics
Processing Units (GPUs), Vector Processors??? ...
• Co-Processors: A hardware (hardwired) implementation of specific algorithms with
limited programming interface (augment GPPs or ASPs)
• Configurable Hardware:
– Field Programmable Gate Arrays (FPGAs)
– Configurable array of simple processing elements
• Application Specific Integrated Circuits (ASICs): A custom VLSI hardware solution
for a specific computational task
• The choice of one or more depends on a number of factors including:
- Type and complexity of computational algorithm
(general purpose vs. Specialized)
- Desired level of flexibility and programmability
- Performance requirements
- Desired level of computational efficiency
(e.g Computations per watt or computations per chip area)
- Power requirements - Real-time constraints
- Development time and cost - System cost
EECC722 - Shaaban
#96 lec # 8 Fall 2006 10-16-2006
Computing Engine Choices
[Figure: spectrum of computing engines, from most programmable/flexible to most specialized: General Purpose Processors (GPPs), Application-Specific Processors (ASPs), Configurable Hardware, Co-Processors, Application Specific Integrated Circuits (ASICs). Programmability/flexibility increases toward GPPs; specialization, development cost/time, and performance/chip area/watt (computational efficiency) increase toward ASICs.]
ASP examples: Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Physics Processor ….
Processor = Programmable computing element that runs programs written using a pre-defined set of instructions
Selection Factors:
- Type and complexity of computational algorithm (general purpose vs. specialized)
- Desired level of flexibility and programmability
- Performance requirements
- Desired level of computational efficiency
- Power requirements - Real-time constraints
- Development time and cost - System cost
Why Application-Specific Processors (ASPs)?
Computing Element Choices Observation

• Generality and efficiency are in some sense inversely related
to one another:
– The more general-purpose a computing element is and thus the greater the
number of tasks it can perform, the less efficient (e.g. Computations per chip
area /watt) it will be in performing any of those specific tasks.
– Design decisions are therefore almost always compromises; designers identify
key features or requirements of applications that must be met and make
compromises on other less important features.
• To counter the problem of computationally intense and
specialized problems for which general purpose machines
cannot achieve the necessary performance/other
requirements:
– Special-purpose processors (or Application-Specific Processors, ASPs) ,
attached processors, and coprocessors have been designed/built for many
years, for specific application domains, such as image or digital signal
processing (for which many of the computational tasks are specialized and
can be very well defined).
Generality = Flexibility = Programmability ?
Efficiency = Computations per watt or chip area
Digital Signal Processor (DSP) Architecture
• Classification of Main Processor Types/Applications
• Requirements of Embedded Processors
• DSP vs. General Purpose CPUs
• DSP Cores vs. Chips
• Classification of DSP Applications
• DSP Algorithm Format
• DSP Benchmarks
• Basic Architectural Features of DSPs
• DSP Software Development Considerations
• Classification of Current DSP Architectures and example DSPs:
– Conventional DSPs: TI TMSC54xx
– Enhanced Conventional DSPs: TI TMSC55xx
– Multiple-Issue DSPs:
• VLIW DSPs: TI TMS320C62xx, TMS320C64xx
• Superscalar DSPs: LSI Logic ZSP400/500 DSP core
Main Processor Types/Applications
(Cost/complexity increases toward GPPs; volume increases toward microcontrollers.)
• General Purpose Computing & General Purpose Processors (GPPs)
– High performance: In general, faster is always better.
– RISC or CISC: Intel P4, IBM Power4, SPARC, PowerPC, MIPS ...
– Used for general purpose software
– End-user programmable
– Real-time performance may not be fully predictable (due to dynamic arch. features)
– Heavy weight, multi-tasking OS - Windows, UNIX
– Normally, low cost and power not a requirement (changing)
– Servers, Workstations, Desktops (PC’s), Notebooks, Clusters …
• Embedded Processing: Embedded processors and processor cores
– Cost, power, code-size and real-time requirements and constraints
– Once real-time constraints are met, a faster processor may not be better
– e.g.: Intel XScale, ARM, 486SX, Hitachi SH7000, NEC V800...
– Often require Digital signal processing (DSP) support or other
application-specific support (e.g. network, media processing); these are examples of Application-Specific Processors (ASPs)
– Single or few specialized programs – known at system design time
– Not end-user programmable
– Real-time performance must be fully predictable (avoid dynamic arch. features)
– Lightweight, often realtime OS or no OS
– Examples: Cellular phones, consumer electronics …
• Microcontrollers
– Extremely code size/cost/power sensitive
– Single program
– Small word size - 8 bit common
– Usually no OS
– Highest volume processors by far
– Examples: Control systems, Automobiles, industrial control, thermostats, ...
The Processor Design Space
[Figure: performance versus processor cost (chip area, power, complexity). Microcontrollers: cost is everything. Embedded processors: application specific architectures for performance; real-time constraints, specialized applications, low power/cost constraints. Microprocessors (GPPs): performance is everything and software rules.]
Requirements of Embedded Processors
• Usually must meet strict real-time constraints:
– Real-time performance must be fully predictable:
• Avoid dynamic processor architectural features that make real-time performance
harder to predict ( e.g cache, dynamic scheduling, hardware speculation …)
– Once real-time constraints are met, a faster processor is not desirable (overkill)
due to increased cost/power requirements.
• Optimized for a single (or few) program(s) - code often in on-chip ROM or on/off-chip
EPROM/flash memory.
• Minimum code size (one of the motivations initially for Java)
• Performance obtained by optimizing datapath
• Low cost
– Lowest possible area
• High computational efficiency: Computation per unit area
– VLSI implementation technology usually behind the leading edge
– High level of integration of peripherals (System-on-Chip -SoC- approach reduces system
cost/power)
• Fast time to market
– Compatible architectures (e.g. ARM family) allow reusable code
– Customizable cores (System-on-Chip, SoC).
• Low power if application requires portability
Embedded Processors
Area of processor cores = Cost
(and Power requirements)

Embedded Processors
Another figure of merit: Computation per unit area
(Computational Efficiency)

Embedded Processors
Code size: smaller is better
• If a majority of the chip is the program stored in ROM, then minimizing code size is a critical issue
• Common embedded processor ISA features to minimize code size:
– Variable length instruction encoding common:
• e.g. the Piranha has 3 instruction sizes: basic 2 bytes, and 2 bytes plus a 16- or 32-bit immediate
– Complex/specialized instructions
– Complex addressing modes
Embedded Systems vs. General Purpose Computing
Embedded Systems (and embedded processors) | General Purpose Computing Systems (and processors, GPPs)
Run a single or few specialized applications, often known at system design time | Used for general purpose software: intended to run a fully general set of applications that may not be known at design time
May require application-specific capability (e.g. DSP) | No application-specific capability required
Not end-user programmable | End-user programmable
Minimum code size is highly desirable | Minimizing code size is not an issue
Lightweight, often real-time OS or no OS | Heavy weight, multi-tasking OS - Windows, UNIX
Low power and cost constraints/requirements | Higher power and cost constraints/requirements
Usually must meet strict real-time constraints (e.g. real-time sampling rate) | In general, no real-time constraints
Real-time performance must be fully predictable: avoid dynamic processor architectural features that make real-time performance harder to predict | Real-time performance may not be fully predictable (due to dynamic processor architectural features): superscalar, dynamic scheduling, hardware speculation, branch prediction, cache
Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements | Faster (higher-performance) is always better
Evolution of GPPs and DSPs
• General Purpose Processors (GPPs) trace roots back to Eckert, Mauchly, Von
Neumann (ENIAC) + EDSAC
• Digital Signal Processors (DSPs) are microprocessors designed for efficient
mathematical manipulation of digital signals utilizing digital signal processing
algorithms.
– DSPs usually process infinite continuous sampled data streams (signals)
while meeting real-time and power constraints.
– DSPs evolved from Analog Signal Processors (ASPs) that utilize analog
hardware to transform physical signals (classical electrical engineering)
– ASP to DSP because:
• DSP insensitive to environment (e.g., same response in snow or desert if it
works at all)
• DSP performance is identical even with variations in components; the behavior of
two analog systems varies even if they are built with the same components with 1% variation
• Different history and different applications requirements led to different
terms, different metrics, architectures, some new inventions.
DSP vs. General Purpose CPUs
• DSPs tend to run one (or few) program(s), not many programs.
– Hence OSes (if any) are much simpler, there is no virtual memory or protection, ...
• DSPs usually run applications with hard real-time constraints:
– DSP must meet application signal sampling rate computational requirements:
• A faster DSP is overkill (higher DSP cost, power..)
– You must account for anything that could happen in a time slot (DSP algorithm
inner-loop, data sampling rate)
– All possible interrupts or exceptions must be accounted for and their collective
time be subtracted from the time interval.
• Therefore, exceptions are BAD.
• DSPs usually process infinite continuous data streams:
– Requires high memory bandwidth (with predictable latency, e.g no data cache) for
streaming real-time data samples and predictable processing time on the data
samples
• The design of DSP ISAs and processor architectures is driven by the
requirements of DSP algorithms.
– Thus DSPs are application-specific processors
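The time-slot accounting described above can be written out as a small budget calculation. This is a sketch with hypothetical function names and made-up numbers (a 200 MHz clock, a 48 kHz sampling rate, and two interrupt handlers with assumed worst-case costs); it only illustrates the bookkeeping.

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Cycles available between two consecutive input samples."""
    return clock_hz // sample_rate_hz


def usable_cycles(clock_hz, sample_rate_hz, worst_case_interrupt_cycles):
    """Budget left after subtracting every interrupt that could hit the slot."""
    return cycles_per_sample(clock_hz, sample_rate_hz) - sum(worst_case_interrupt_cycles)


def meets_deadline(inner_loop_cycles, clock_hz, sample_rate_hz, interrupts):
    return inner_loop_cycles <= usable_cycles(clock_hz, sample_rate_hz, interrupts)


# Hypothetical system: 200 MHz DSP, 48 kHz samples, two interrupt sources.
print(cycles_per_sample(200_000_000, 48_000))
print(usable_cycles(200_000_000, 48_000, [120, 300]))
print(meets_deadline(3_000, 200_000_000, 48_000, [120, 300]))
```

If the inner loop fits the budget with the worst-case interrupts included, a faster clock buys nothing for this application except higher cost and power.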
DSP vs. GPP
• The “MIPS/MFLOPS” of DSPs is the speed of Multiply-Accumulate (MAC), i.e. the main performance measure of DSPs is MAC speed.
– Why? MAC is common in DSP algorithms that involve computing a vector dot product, such as
digital filters, correlation, and Fourier transforms.
– DSPs are judged by whether they can keep the multipliers busy 100% of the time and by how
many MACs are performed in each cycle.
• The "SPEC" of DSPs is 4 algorithms:
– Infinite Impulse Response (IIR) filters
– Finite Impulse Response (FIR) filters
– FFT, and
– convolvers
• In DSPs, target algorithms are important:
– Binary compatibility not a major issue
• High-level Software is not as important in DSPs as in GPPs.
– People still write in assembly language for a product to minimize the die area for
ROM in the DSP chip.
Note: While this is still mostly true, programming for DSPs in high
level languages (HLLs) has been gaining more acceptance due to the
development of more efficient HLL DSP compilers in recent years.
Types of DSP Processors
According to type of Arithmetic/operand Size Supported
• 32-BIT FLOATING POINT (5% of DSP market):

– TI TMS320C3X, TMS320C67xx (VLIW)
– AT&T DSP32C
– ANALOG DEVICES ADSP21xxx
– Hitachi SH-4
• 16-BIT FIXED POINT (95% of DSP market):

– TI TMS320C2X, TMS320C62xx (VLIW)
– Infineon TC1xxx (TriCore1) (VLIW)
– MOTOROLA DSP568xx, MSC810x (VLIW)
– ANALOG DEVICES ADSP21xx
– Agere Systems DSP16xxx, Starpro2000
– LSI Logic LSI140x (ZSP400) superscalar
– Hitachi SH3-DSP
– StarCore SC110, SC140 (VLIW)
DSP Cores vs. Chips
DSPs are usually available as synthesizable cores or off-the-shelf packaged chips
• Synthesizable Cores:
– Map into chosen fabrication process
• Speed, power, and size vary
– Choice of peripherals, etc. (SoC = System on Chip)
– Requires extensive hardware development effort,
resulting in more development time and cost
• Off-the-shelf packaged chips:

– Highly optimized for speed, energy efficiency, and/or cost.
– Limited performance, integration options.
– Tools, 3rd-party support often more mature
DSP ARCHITECTURE
Enabling Technologies
Generation | Time Frame | Approach | Primary Application | Enabling Technologies
- | Early 1970’s | Discrete logic | Non-real time processing; Simulation | Bipolar SSI, MSI; FFT algorithm
- | Late 1970’s | Building block | Military radars; Digital Comm. | Single chip bipolar multiplier; Flash A/D
1 | Early 1980’s | Single Chip DSP µP | Telecom; Control | µP architectures; NMOS/CMOS
2 | Late 1980’s | Function/Application specific chips | Computers; Communication | Vector processing; Parallel processing
3 | Early 1990’s | Multiprocessing | Video/Image Processing | Advanced multiprocessing; VLIW, MIMD, etc.
4 | Late 1990’s | Single-chip multiprocessing | Wireless telephony; Internet related | Low power single-chip DSP; VLIW/Multiprocessing
First microprocessor DSP: TI TMS 32010
Generations 1 to 4 are the generations of single-chip (microprocessor) DSPs
Texas Instruments TMS320 Family
Multiple DSP µP Generations
Generation | Device | First Sample | Bit Size | Clock speed (MHz) | Instruction Throughput | MAC execution (ns) | MOPS | Device density (# of transistors)
Uniprocessor Based (Harvard Architecture)
1 | TMS32010 | 1982 | 16 integer | 20 | 5 MIPS | 400 | 5 | 58,000 (3)
2 | TMS320C25 | 1985 | 16 integer | 40 | 10 MIPS | 100 | 20 | 160,000 (2)
  | TMS320C30 | 1988 | 32 flt.pt. | 33 | 17 MIPS | 60 | 33 | 695,000 (1)
3 | TMS320C50 | 1991 | 16 integer | 57 | 29 MIPS | 35 | 60 | 1,000,000 (0.5)
  | TMS320C2XXX | 1995 | 16 integer | | 40 MIPS | 25 | 80 |
Multiprocessor (VLIW) Based
  | TMS320C80 | 1996 | 32 integer/flt. | | | | 2 GOPS, 120 MFLOP | MIMD
4 | TMS320C62XX | 1997 | 16 integer | | 1600 MIPS | 5 | 20 GOPS | VLIW
  | TMS320C67XX | 1997 | 32 flt. pt. | | | 5 | 1 GFLOP | VLIW
Generations of single-chip (microprocessor) DSPs
DSP Applications
• Digital audio applications
– MPEG Audio
– Portable audio
• Digital cameras
• Cellular telephones
• Wearable medical appliances
• Storage products:
– disk drive servo control
• Military applications:
– radar
– sonar
• Industrial control
• Seismic exploration
• Networking: (Telecom infrastructure)
– Wireless
– Base station
– Cable modems
– ADSL
– VDSL
– …...
Current DSP Killer Applications: Cell phones and telecom infrastructure
DSP Algorithms & Applications
DSP Algorithm: System Application
Speech Coding: multimedia computers, secure communications.
Speech Encryption: secure communications.
Speech Recognition: Advanced user interfaces, multimedia workstations, robotics, automotive applications, cellular telephones, personal communications systems.
Speech Synthesis: Advanced user interfaces, robotics
Speaker Identification: Security, multimedia workstations, advanced user interfaces
High-fidelity Audio: Consumer audio, consumer video, digital audio broadcast, professional audio, multimedia computers
Modems: digital audio broadcast, digital signaling on cable TV, multimedia computers, wireless computing, navigation, data/fax
Noise cancellation: Professional audio, advanced vehicular audio, industrial applications
Audio Equalization: Consumer audio, professional audio, advanced vehicular audio, music
Ambient Acoustics Emulation: Consumer audio, professional audio, advanced vehicular audio, music
Audio Mixing/Editing: Professional audio, music, multimedia computers
Sound Synthesis: Professional audio, music, multimedia computers, advanced user interfaces
Vision: Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
Image Compression: Digital photography, digital video, multimedia computers, videoconferencing
Image Compositing: Multimedia computers, consumer video, advanced user interfaces, navigation
Beamforming: Navigation, medical imaging, radar/sonar, signals intelligence
Echo cancellation: Speakerphones, hands-free cellular telephones
Spectral Estimation: Signals intelligence, radar/sonar, professional audio, music
Another Look at DSP Applications
(Cost increases toward the high end; volume increases toward the low end.)
• High-end:
– Military applications (e.g. radar/sonar)
– Wireless Base Station - TMS320C6000
– Cable modem
– Gateways
• Mid-range:
– Industrial control
– Cellular phone - TMS320C540
– Fax/ voice server
• Low end:
– Storage products - TMS320C27 (hard drive controllers)
– Digital camera - TMS320C5000
– Portable phones
– Wireless headsets
– Consumer audio
– Automobiles, thermostats, ...
DSP range of applications
& Possible Target DSPs
Cellular Phone System
Example DSP Application
[Figure: cellular phone block diagram. Blocks: keypad and number display, controller, physical layer processing, baseband converter, RF modem, A/D, speech encode, speech decode, DAC.]
Cellular Phone: HW/SW/IC Partitioning
[Figure: the cellular phone blocks (keypad and number display, controller, physical layer processing, baseband converter, RF modem, A/D, speech encode, speech decode, DAC) partitioned among a microcontroller, an ASIC, a DSP, and an analog IC.]
Mapping Onto System-on-Chip (SoC)
(Cellular Phone)
[Figure: cellular phone SoC floorplan. Hardware blocks: µC, DSP core, ASIC logic, RAM, DMA, S/P. Functions: phone book, keypad intfc, control, protocol, speech quality enhancement, voice recognition, RPE-LTP speech decoder, de-intl & decoder, demodulator and synchronizer, Viterbi equalizer.]
Example Cellular Phone Organization
[Figure: cellular phone organization built around a C540 (DSP) and an ARM7 (µC).]
Multimedia System-on-Chip (SoC)
e.g. Multimedia terminal electronics
• Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O
[Figure: multimedia terminal SoC. I/O: Graphics Out, Video I/O, Voice I/O, Pen In, Uplink Radio, Downlink Radio. On-chip blocks: µP, DSP, Memory, Video Unit (ASIC), custom Coms, ASIC, Co-processor or ASP.]
DSP Algorithm Format
• DSP culture has a graphical format to represent formulas.
• Like a flowchart for formulas, inner loops,
not programs.
• Some seem natural:
Σ is add, X is multiply
• Others are obtuse:
z^-1 means take variable from earlier iteration (delay).
• These graphs are trivial to decode
DSP Algorithm Notation
• Uses “flowchart” notation instead of equations
• Multiply is drawn as X
• Add is drawn as + or Σ
• Delay/Storage is drawn as Delay, z^-1, or D
Finite-Impulse Response (FIR) Filter
• Filters reduce signal noise and enhance image or signal quality by
removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute:
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k) = h(n) * x(n)$$
where
– x is the input sequence
– y is the output sequence
– h is the impulse response (filter coefficients)
– N is the number of taps (coefficients) in the filter
• Output sequence depends only on input sequence and impulse
response.
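The FIR sum translates directly into a doubly nested loop. The sketch below is a plain reference version, assuming samples before the start of the input are zero; the function name `fir` is chosen here for illustration.

```python
def fir(x, h):
    """Direct-form FIR: y(i) = sum over k of h(k) * x(i - k)."""
    y = []
    for i in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if i - k >= 0:              # assume x is zero before the first sample
                acc += h[k] * x[i - k]  # one multiply-accumulate (MAC) per tap
        y.append(acc)
    return y


# A unit impulse at the input reproduces the filter coefficients at the output.
print(fir([1, 0, 0, 0], [3, 2, 1]))
```

Feeding an impulse is a quick sanity check: the output is the impulse response h followed by zeros, which is why the response is called finite.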
Typical DSP Algorithms:
Finite-impulse Response (FIR) Filter
• N most recent samples in the delay line (Xi)
• New sample moves data down delay line
• Filter “Tap” is a multiply-add
(Multiply And Accumulate, MAC)
• Each tap (N taps total) nominally requires:
– Two data fetches
– Multiply
– Accumulate
– Memory write-back to update delay line
• Requires real-time data sample streaming:
– Predictable data bandwidth/latency
– Special addressing modes (e.g. modulo)
• Repetitive computations, multiply and accumulate (MAC):
– Requires efficient MAC support
• Goal: At least 1 FIR Tap / DSP instruction cycle
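The delay line and the modulo addressing mentioned above can be modeled as a circular buffer, so that a new sample overwrites the oldest one instead of shifting every element. This is a software sketch with a hypothetical class name; on a DSP the modulo index update would be done by the address generation hardware rather than by a `%` operation.

```python
class StreamingFIR:
    """Sample-by-sample FIR using a circular delay line (illustrative model)."""

    def __init__(self, h):
        self.h = list(h)
        self.delay = [0] * len(self.h)  # the N most recent samples
        self.newest = 0                 # index of the most recent sample

    def step(self, sample):
        n = len(self.h)
        self.newest = (self.newest + 1) % n   # modulo (circular) addressing
        self.delay[self.newest] = sample      # overwrite the oldest sample
        acc = 0
        for k in range(n):                    # one MAC per tap
            acc += self.h[k] * self.delay[(self.newest - k) % n]
        return acc


f = StreamingFIR([3, 2, 1])
print([f.step(v) for v in [1, 0, 0, 0]])
```

Each call to `step` performs N coefficient fetches, N sample fetches, N MACs, and a single write, which is the per-tap workload listed on the slide without the cost of physically moving the delay line.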
FINITE-IMPULSE RESPONSE (FIR) FILTER
[Figure: FIR filter structure. The input X passes through a chain of z^-1 delay elements; the delayed samples are multiplied by the coefficients h0, h1, ..., hN-2, hN-1 and summed into an accumulator register. One multiplier plus adder stage is one FIR filter tap.]
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$$
i.e. vector dot product
Performance Goal: at least 1 FIR Tap / DSP instruction cycle
DSP must meet application signal sampling rate computational requirements:
A faster DSP is overkill (more cost/power than really needed)
Sample Computational Rates for FIR Filtering
FIR Type | Signal type | Frequency | # taps | Performance
1-D | Speech | 8 kHz | N = 128 | 2 MOPs
1-D | Music | 48 kHz | N = 256 | 24 MOPs
2-D | Video phone | 6.75 MHz | N*N = 81 | 1,090 MOPs
2-D | TV | 27 MHz | N*N = 81 | 4,370 MOPs (4.37 GOPs)
2-D | HDTV | 144 MHz | N*N = 81 | 23,300 MOPs (23.3 GOPs)
A 1-D FIR has n_op = 2N operations per sample and a 2-D FIR has n_op = 2N^2. OP = Operation
DSP must meet application signal sampling rate computational requirements:
• A faster DSP is overkill (higher DSP cost, power..)
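The performance column follows from the operation counts: each tap costs one multiply and one add, so the required rate is the sampling frequency times twice the number of taps. The helper below is a hypothetical sketch of that arithmetic; `taps` is N for a 1-D filter and N*N for a 2-D filter. Under this formula the 8 kHz, 128-tap speech case works out to roughly 2 MOPs.

```python
def fir_mops(sample_rate_hz, taps):
    """Required millions of operations per second for an FIR filter.

    taps is N for a 1-D filter or N*N for a 2-D filter; each tap is
    counted as 2 operations (multiply + add).
    """
    return sample_rate_hz * 2 * taps / 1e6


for name, rate, taps in [
    ("Speech", 8_000, 128),
    ("Music", 48_000, 256),
    ("Video phone", 6_750_000, 81),
    ("TV", 27_000_000, 81),
    ("HDTV", 144_000_000, 81),
]:
    print(f"{name}: {fir_mops(rate, taps):.1f} MOPs")
```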
FIR filter on (simple) General Purpose Processor
loop:
    lw  x0, 0(r0)    ; fetch first operand
    lw  y0, 0(r1)    ; fetch second operand
    mul a, x0, y0    ; multiply
    add y0, a, b     ; accumulate
    sw  y0, (r2)     ; write result back to memory
    inc r0           ; advance the three pointers
    inc r1
    inc r2
    dec ctr          ; loop counter bookkeeping
    tst ctr
    jnz loop
• Problems:
• Bus / memory bandwidth bottleneck,
• control/loop code overhead
• No suitable addressing modes, instructions -
– e.g. multiply and accumulate (MAC) instruction
+ GPP real-time performance (to meet signal sampling rate) may not be fully predictable (due to dynamic processor architectural features):
• Superscalar: dynamic scheduling, hardware speculation, branch prediction, cache.
Infinite-Impulse Response (IIR) Filter
• Infinite Impulse Response (IIR) filters compute:
$$y(i) = \sum_{k=1}^{M-1} a(k)\, y(i-k) + \sum_{k=0}^{N-1} b(k)\, x(i-k)$$
• Output sequence depends on input sequence, previous
outputs, and impulse response.
• Both FIR and IIR filters
– Require vector dot product (multiply-accumulate, MAC) operations
– Use fixed coefficients, i.e. filter coefficients a(k), b(k)
• Adaptive filters update their coefficients to minimize the
distance between the filter output and the desired signal.
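The IIR recurrence differs from the FIR sum only by the feedback term over previous outputs. A minimal reference sketch, assuming zero initial conditions and using the convention that `a[0]` is an unused placeholder so that `a[k]` lines up with a(k) in the formula:

```python
def iir(x, a, b):
    """y(i) = sum_{k>=1} a(k) y(i-k) + sum_{k>=0} b(k) x(i-k).

    a[0] is an unused placeholder (assumption made here so list
    indexes match the formula); inputs and outputs before i = 0 are zero.
    """
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(len(b)):          # feed-forward MACs on inputs
            if i - k >= 0:
                acc += b[k] * x[i - k]
        for k in range(1, len(a)):       # feedback MACs on previous outputs
            if i - k >= 0:
                acc += a[k] * y[i - k]
        y.append(acc)
    return y


# One feedback tap of 0.5: the impulse response never reaches exactly zero.
print(iir([1, 0, 0, 0], [0, 0.5], [1]))
```

The impulse response halves at every step but never ends, which is the "infinite" in the name.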
Discrete Fourier Transform (DFT)
• The Discrete Fourier Transform (DFT) allows for spectral
analysis in the frequency domain.
• It is computed as
$$y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \quad k = 0, 1, \ldots, N-1$$
where
$$W_N = e^{-j 2\pi / N}, \quad j = \sqrt{-1}$$
– x is the input sequence in the time domain
– y is an output sequence in the frequency domain
• The Inverse Discrete Fourier Transform is computed as
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} W_N^{-nk}\, y(k), \quad n = 0, 1, \ldots, N-1$$
• The Fast Fourier Transform (FFT) provides an efficient method
for computing the DFT.
Discrete Cosine Transform (DCT)
• The Discrete Cosine Transform (DCT) is frequently used
in image & video compression (e.g. JPEG, MPEG-2).
• The DCT and Inverse DCT (IDCT) are computed as:
$$y(k) = e(k) \sum_{n=0}^{N-1} \cos\left[\frac{(2n+1)\pi k}{2N}\right] x(n), \quad k = 0, 1, \ldots, N-1$$
$$x(n) = \frac{2}{N} \sum_{k=0}^{N-1} e(k) \cos\left[\frac{(2n+1)\pi k}{2N}\right] y(k), \quad n = 0, 1, \ldots, N-1$$
where e(k) = 1/sqrt(2) if k = 0; otherwise e(k) = 1.
• An N-point 1D-DCT requires N^2 MAC operations.
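A direct implementation makes the N^2 MAC count visible and lets the transform pair be checked by a round trip. This is an unoptimized sketch following the definitions with e(0) = 1/sqrt(2) and a 2/N scale on the inverse; the function names are illustrative.

```python
import math


def e(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """Forward DCT: N outputs, each a dot product over N inputs (N^2 MACs)."""
    n_pts = len(x)
    return [e(k) * sum(math.cos((2 * n + 1) * math.pi * k / (2 * n_pts)) * x[n]
                       for n in range(n_pts))
            for k in range(n_pts)]


def idct(y):
    """Inverse DCT with the 2/N scale factor."""
    n_pts = len(y)
    return [(2 / n_pts) * sum(e(k) * math.cos((2 * n + 1) * math.pi * k / (2 * n_pts)) * y[k]
                              for k in range(n_pts))
            for n in range(n_pts)]


x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 6) for v in idct(dct(x))])    # round trip back to x
```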
DSP BENCHMARKS
• DSPstone: University of Aachen, application benchmarks
– ADPCM TRANSCODER - CCITT G.721, REAL_UPDATE, COMPLEX_UPDATES
– DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
– FIR, FIR2DIM, HR_ONE_BIQUAD
– LMS, FFT_INPUT_SCALED
• BDTImark2000: Berkeley Design Technology Inc BDTI

– 12 DSP kernels in hand-optimized assembly language:
• FIR, IIR, Vector dot product, Vector add, Vector maximum, FFT ….
– Returns single number (higher means faster) per processor
– Use only on-chip memory (memory bandwidth is the major bottleneck in
performance of embedded applications).
• EEMBC (pronounced “embassy”): EDN Embedded

Microprocessor Benchmark Consortium
– 30 companies formed by Electronic Data News (EDN)
– Benchmark evaluates compiled C code on a variety of embedded processors
(microcontrollers, DSPs, etc.)
– Application domains: automotive-industrial, consumer, office automation,
networking and telecommunications
[Figure: DSP performance across the 1st, 2nd, 3rd, and 4th generations, annotated "> 800x faster than first generation".]
Basic ISA/Architectural Features of DSPs
• Data path configured for DSP algorithms
– Fixed-point arithmetic (most DSPs) (DSP ISA feature)
• Saturation arithmetic to handle overflow (instead of modulo wrap-around)
– MAC - Multiply-accumulate unit(s) (DSP architectural feature)
– Hardware rounding support
• Harvard Architecture, usually with no data cache, for predictable fast data sample
streaming (DSP architectural feature)
• Specialized addressing modes (DSP ISA feature)
– Bit-reversed addressing
– Dedicated address generation units are usually used (DSP architectural feature)
• Specialized instruction set and execution control (DSP ISA feature)
– Support for fast MAC
– Fast Interrupt Handling
– To meet real-time signal sampling/processing constraints
• System on Chip (SoC) style integration (DSP architectural feature)
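Bit-reversed addressing, listed among the specialized addressing modes, produces the access order that radix-2 FFTs use to reorder their data. The sketch below models in software what a DSP address generation unit does without extra instructions; the function name is hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the current LSB into the result
        index >>= 1
    return result


# Access order for an 8-point (3-bit) buffer.
print([bit_reverse(i, 3) for i in range(8)])
```

Doing this in a loop per access on a GPP costs several instructions per sample, which is why DSPs provide it as an addressing mode.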
DSP ISA Features
DSP Data Path: Arithmetic
• DSPs dealing with numbers representing real world signals
=> Want “reals”/ fractions
• DSPs dealing with numbers for addresses
=> Want integers
• DSP ISA (and DSP) must support “fixed point” as well as
integers
– Fraction format: sign bit S followed directly by the radix point, -1 ≤ x < 1
– Integer format: sign bit S with the radix point after the last bit, -2^(N-1) ≤ x < 2^(N-1)
– Usually 16-bit
DSP ISA Feature: In DSP ISAs, fixed-point arithmetic must be supported; floating point support is
optional and is much less common
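The 16-bit fraction format with the radix point right after the sign bit is commonly called Q15. The sketch below shows how fractions in [-1, 1) map to 16-bit integers and how a fractional multiply reduces to an integer multiply followed by a shift. The helper names are hypothetical, and the multiply deliberately omits the saturation a real DSP would apply to the single overflowing case (-1 times -1).

```python
Q = 15                      # number of fraction bits in a 16-bit word
Q_MAX = (1 << Q) - 1        # 32767, just under +1.0
Q_MIN = -(1 << Q)           # -32768, exactly -1.0


def to_q15(x):
    """Convert a real number to Q15, clamping to the representable range."""
    v = int(round(x * (1 << Q)))
    return max(Q_MIN, min(Q_MAX, v))


def from_q15(v):
    return v / (1 << Q)


def q15_mul(a, b):
    """16 x 16 -> 32-bit product, shifted back to Q15 (no saturation here)."""
    return (a * b) >> Q


half = to_q15(0.5)
print(half, from_q15(q15_mul(half, half)))
```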
DSP ISA Features
DSP Data Path: Precision
• Word size affects precision of fixed point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words (16-bit most common)
• Floating Point DSPs cost 2X - 4X vs. fixed point, slower than
fixed point
– In DSP ISAs: fixed-point arithmetic must be supported; floating point
support is optional and is much less common
• DSP programmers will scale values inside code
– SW Libraries
– Separate explicit exponent
• “Blocked Floating Point”: single exponent for a group of fractions
• Floating point support simplifies development for high-end DSP
applications.
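Blocked floating point can be illustrated by scaling a whole block of values by one shared power of two and storing only fixed-point fractions. This is a simplified sketch with hypothetical function names; the shared exponent is chosen from the largest magnitude in the block so that every scaled value falls inside [-1, 1).

```python
import math


def bfp_encode(values, mantissa_bits=16):
    """Return (fractions, exponent): one exponent shared by the whole block."""
    peak = max(abs(v) for v in values)
    # Smallest power-of-two exponent that brings the peak below 1.0.
    exp = 0 if peak == 0 else math.floor(math.log2(peak)) + 1
    scale = 1 << (mantissa_bits - 1)
    fractions = [max(-scale, min(scale - 1, round(v / 2 ** exp * scale)))
                 for v in values]
    return fractions, exp


def bfp_decode(fractions, exp, mantissa_bits=16):
    scale = 1 << (mantissa_bits - 1)
    return [m / scale * 2 ** exp for m in fractions]


block, shared_exp = bfp_encode([0.5, -3.0, 1.25])
print(block, shared_exp)
print(bfp_decode(block, shared_exp))
```

Small values in a block with one large value lose precision, which is the price paid for storing a single exponent.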
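DSP ISA Feature
DSP Data Path: Overflow
• DSPs are descended from analog signal processing:
– Modulo arithmetic would wrap an overflowing result around.
– Instead, set the result to the most positive (2^(N-1) - 1) or
most negative (-2^(N-1)) value: “saturation”
• Many DSP algorithms were developed in this
model.
• Why support saturation? Due to the physical nature of signals.
[Figure: transfer curve that clips (saturates) at 2^(N-1) - 1 and at -2^(N-1).]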
EECC722 - Shaaban
#138 lec # 8 Fall 2006 10-16-2006
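The difference between modulo and saturating overflow can be seen in a short sketch, assuming a 16-bit two's-complement word; `wrap16` and `sat16` are illustrative names.

```python
def wrap16(x: int) -> int:
    """Modulo (wraparound) arithmetic: keep only the low 16 bits."""
    return ((x + 0x8000) & 0xFFFF) - 0x8000


def sat16(x: int) -> int:
    """Saturating arithmetic: clamp to the most positive/negative value."""
    return max(-0x8000, min(0x7FFF, x))


a, b = 30000, 10000
print(wrap16(a + b))  # -25536: a large positive sum flips sign
print(sat16(a + b))   # 32767: the sum pegs at the most positive value
```

For a signal, pegging at full scale is a much smaller error than a sign flip, which matches the behavior of an overdriven analog circuit.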
DSP Data Path: Specialized Hardware

• Specialized hardware functional units perform all key
arithmetic operations in 1 cycle, including:
– Shifters
– Saturation
– Guard bits
– Rounding modes
– Multiplication/addition (MAC)
• 50% of instructions can involve multiplier
=> single cycle latency multiplier
• Need to perform multiply-accumulate (MAC) fast
• n-bit multiplier => 2n-bit product
DSP Data Path: Accumulator
• Don’t want overflow or to have to scale the accumulator
• Option 1: accumulator wider than product: “guard bits”
– Motorola DSP: 24b x 24b => 48b product, 56b accumulator
• Option 2: shift right and round product before adder
[Figure: two MAC unit datapaths. Option 1: Multiplier -> ALU -> Accumulator with guard bits (G). Option 2: Multiplier -> Shift -> ALU -> Accumulator.]
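A quick way to see what guard bits buy is to count how many worst-case products can be accumulated before a signed accumulator overflows. The sketch below uses the 24-bit operand and 48/56-bit accumulator widths from the Motorola example; the function names are illustrative.

```python
def fits(value: int, bits: int) -> bool:
    """True if value is representable as a signed two's-complement number."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def macs_before_overflow(acc_bits: int, operand_bits: int = 24) -> int:
    """Count worst-case products that can be summed without overflow."""
    worst = (-(1 << (operand_bits - 1))) ** 2  # (-2^23) * (-2^23) = 2^46
    acc, count = 0, 0
    while fits(acc + worst, acc_bits):
        acc += worst
        count += 1
    return count


print(macs_before_overflow(48))  # 1: no guard bits, the second product overflows
print(macs_before_overflow(56))  # 511: 8 guard bits allow hundreds of MACs
```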
DSP Data Path: Rounding Modes
• Even with guard bits, will need to round when storing accumulator into memory
• 3 DSP standard options (supported in hardware)
• Truncation: chop results
=> biases results down (toward more negative values in two’s complement)
• Round to nearest:
< 1/2 round down, ≥ 1/2 round up (more positive)
=> smaller bias
• Convergent:
< 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0)
=> no bias
IEEE 754 calls this round to nearest even
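The three rounding options can be compared with a sketch that drops the k low-order bits of a two's-complement integer and then sums the rounding error over a symmetric range of inputs. Function names are illustrative; the code models integer arithmetic only, not any specific DSP's rounding hardware.

```python
def truncate(x: int, k: int) -> int:
    """Chop the k low bits (arithmetic shift rounds toward minus infinity)."""
    return x >> k


def round_nearest(x: int, k: int) -> int:
    """Add half an LSB, then chop: ties always round up (more positive)."""
    return (x + (1 << (k - 1))) >> k


def round_convergent(x: int, k: int) -> int:
    """Round to nearest; exact ties go to the value whose LSB is zero."""
    q = x >> k
    rem = x & ((1 << k) - 1)
    half = 1 << (k - 1)
    if rem > half or (rem == half and (q & 1)):
        q += 1
    return q


def total_error(fn, k: int = 4) -> int:
    """Sum of (rounded - exact), in units of the discarded LSB."""
    return sum(fn(x, k) * (1 << k) - x for x in range(-1024, 1024))


for fn in (truncate, round_nearest, round_convergent):
    print(fn.__name__, total_error(fn))
```

Under these assumptions the truncation total is negative, the round-to-nearest total is small and positive, and the convergent total is exactly zero.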
Data Path Comparison
DSP Processor:
• Specialized hardware performs all key arithmetic operations in 1 cycle.
– e.g. MAC
• Hardware support for managing numeric fidelity:
– Shifters
– Guard bits
– Saturation
General-Purpose Processor:
• Multiplies often take >1 cycle
• Shifts often take >1 cycle
• Other operations (e.g., saturation, rounding) typically take multiple cycles.
TI 320C54x DSP (1995) Functional Block Diagram
Multiple memory
banks and buses
MAC
Unit
Hardware support for rounding/saturation

First Commercial DSP (1982): Texas Instruments TMS32010
• 16-bit fixed-point arithmetic
• Introduced at 5 MHz (200 ns) instruction cycle.
• “Harvard architecture”
– separate instruction, data memories
• Accumulator
• Specialized instruction set
– Load and Accumulate
• Two-cycle (400 ns) Multiply-Accumulate (MAC) time.
[Figure: processor with separate instruction memory and data memory; datapath: Mem, T-Register, Multiplier, P-Register, ALU, Accumulator]
First Generation DSP P
Texas Instruments TMS32010 - 1982
Features
• 200 ns instruction cycle (5 MIPS)

• 144 words (16 bit) on-chip data RAM
• 1.5K words (16 bit) on-chip program ROM - TMS32010
• External program memory expansion to a total of 4K words at full speed
• 16-bit instruction/data word
• single cycle 32-bit ALU/accumulator
• Single cycle 16 x 16-bit multiply in 200 ns
• Two cycle MAC (5 MOPS)
• Zero to 15-bit barrel shifter
• Eight input and eight output channels
First Generation DSP P TI TMS32010
Block Diagram
Program Memory
(ROM/EPROM)
MAC
Unit
TMS32010 FIR Filter Code
• Here X4, H4, ... are direct (absolute) memory addresses:
LT X4 ; Load T with x(n-4)
MPY H4 ; P = H4*X4
LTD X3 ; Load T with x(n-3); x(n-4) = x(n-3);
; Acc = Acc + P
MPY H3 ; P = H3*X3
LTD X2 ; Load T with x(n-2); x(n-3) = x(n-2);
; Acc = Acc + P (LTD is the "Load and Accumulate" instruction)
MPY H2 ; P = H2*X2
...
• Two instructions per tap, but requires unrolling
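For readers less familiar with the assembly above, the following sketch models what the unrolled sequence computes: each tap is one multiply-accumulate, and the delay line ages by one sample per output. It is a behavioral model only, with illustrative names, and does not reproduce TMS32010 register semantics.

```python
def fir_step(x_new, delay, coeffs):
    """Compute one output sample and age the delay line by one sample.

    On entry, delay holds the previous len(coeffs) inputs, newest first.
    """
    delay.insert(0, x_new)  # newest sample becomes x(n)
    delay.pop()             # oldest sample falls off the end
    acc = 0
    for h, x in zip(coeffs, delay):
        acc += h * x        # one multiply-accumulate per tap
    return acc


coeffs = [1, 2, 3]
delay = [0, 0, 0]
# Feeding a unit impulse returns the coefficients one by one.
print([fir_step(x, delay, coeffs) for x in [1, 0, 0, 0]])  # [1, 2, 3, 0]
```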
DSP Memory
• FIR Tap implies multiple memory accesses
• DSPs require multiple data ports
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
– Instruction repeat buffer: do 1 instruction 256 times
– Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
– Even then may allow programmer to “lock in” instructions into cache
– Option to turn cache into fast program memory
• Usually DSPs have no data caches, for better real-time performance predictability.
• May have multiple data memories
– e.g. one for signal data samples and one for filter coefficients
Conventional “Von Neumann” memory
HARVARD MEMORY ARCHITECTURE in DSP
• Multiple memory banks and buses
– e.g. one for signal data samples and one for filter coefficients
[Figure: program memory (ROM/EPROM/FLASH?) and data memory banks (SRAM): X memory and Y memory; buses labeled GLOBAL, P DATA, X DATA, Y DATA]
Memory Architecture Comparison
DSP Processor:
• Harvard architecture
• 2-4 memory accesses/cycle
• No caches: on-chip SRAM
– For real-time performance predictability
General-Purpose Processor:
• Von Neumann architecture
• Typically 1 access/cycle
• Use caches
– Makes real-time performance harder to predict
[Figure: DSP processor connected to separate program and data memories; general-purpose processor connected to a single memory]
TI TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
[Figure labels: Instruction Cache, Data, Data, Program; multiple memory banks and buses]
TI 320C62x/67x DSP (1997) – (Fourth Generation DSP)
DSP ISA Features
DSP Addressing Modes
• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep MAC datapath busy
• Assumption: any extra instructions imply clock cycles of overhead in inner loop
=> complex addressing is good
• Autoincrement/Autodecrement register indirect
– lw r1,0(r2)+ => r1 <- M[r2]; r2<-r2+1
– Option to do it before addressing, positive or negative
– Matches data access patterns in DSP algorithms and reduces the number of instructions (code size)
• “Bit reverse” addressing mode.
• “Modulo” or “circular” addressing
=> don’t use normal datapath to calculate fancy addressing modes:
– Use dedicated address generation units (related DSP architectural feature)
DSP ISA Features
DSP Addressing: FFT
• FFTs start or end with data in butterfly (bit-reversed) order:
0 (000) => 0 (000)
1 (001) => 4 (100)
2 (010) => 2 (010)
3 (011) => 6 (110)
4 (100) => 1 (001)
5 (101) => 5 (101)
6 (110) => 3 (011)
7 (111) => 7 (111)
• How to avoid overhead of address checking instructions for FFT?
• Have an optional “bit reverse” addressing mode for use with autoincrement addressing
• Thus most DSPs have “bit reverse” addressing for radix-2 FFT
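The index mapping in the table above is what a software loop would have to compute if the hardware did not provide it. A minimal sketch, with an illustrative function name:

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the low `bits` bits of index (e.g. 001 -> 100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit to the top
        index >>= 1
    return result


print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Doing this per access costs several instructions, which is the overhead a bit-reversed addressing mode removes from the FFT inner loop.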
DSP ISA Features
Bit Reversed Addressing
Input address (binary), input sample, output:
000 x(0) F(0)
100 x(4) F(1)
010 x(2) F(2)
110 x(6) F(3)
001 x(1) F(4)
101 x(5) F(5)
011 x(3) F(6)
111 x(7) F(7)
Stages: four 2-point DFTs, then two 4-point DFTs, then one 8-point DFT
Data flow in the radix-2 decimation-in-time FFT algorithm
DSP Addressing: Circular Buffers
• DSPs deal with continuous I/O
• Often interact with an I/O buffer (delay lines)
• To save memory, buffers are often organized as circular buffers
• What can be done to avoid the overhead of address checking instructions for a circular buffer?
• Option 1: Keep a start register and an end register per address register for use with autoincrement addressing; reset to start when the end of the buffer is reached
• Option 2: Keep a buffer length register, assuming the buffer starts on an aligned address; reset to start when the end is reached
• Every DSP has “modulo” or “circular” addressing
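The address update that a circular addressing mode performs can be sketched in software as follows. Both functions are illustrative: the first takes an explicit base and length, the second assumes a power-of-two length and an aligned base so that only a mask is needed, in the spirit of Option 2.

```python
def circ_next(addr: int, step: int, base: int, length: int) -> int:
    """Advance addr by step, wrapping inside [base, base + length)."""
    return base + (addr - base + step) % length


def circ_next_aligned(addr: int, step: int, length: int) -> int:
    """Same update, assuming length is a power of two and base is aligned."""
    mask = length - 1
    return (addr & ~mask) | ((addr + step) & mask)


base, length = 0x100, 4
print(hex(circ_next(0x103, 1, base, length)))      # 0x100: wrapped past the end
print(hex(circ_next(0x100, -1, base, length)))     # 0x103: wrapped before the start
print(hex(circ_next_aligned(0x103, 1, length)))    # 0x100
```

In hardware this update happens in the address generation unit alongside the data access, so the loop body needs no compare or branch for the wrap.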
DSP ISA Features
Circular Buffers Addressing Support
• Every DSP has a “modulo” or “circular” addressing mode
• Instructions accommodate three elements:
– Buffer address
– Buffer size
– Increment
• Why? Allows for cycling through:
– Delay elements (signal samples)
– Filter coefficients in data memory
Address calculation for DSPs
• Dedicated address
generation units
• Supports modulo and bit
reversal arithmetic
• Often duplicated to calculate
multiple addresses per cycle
Addressing Comparison
DSP Processor:
• Dedicated address generation units
• Specialized addressing modes (DSP ISA feature); e.g.:
– Autoincrement
– Modulo (circular)
– Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor:
• Often, no separate address generation units
• General-purpose addressing modes (GPP ISA feature)
DSP ISA Features
DSP Instructions and Execution
• May specify multiple operations in a single instruction, to reduce the number of instructions and code size
– e.g. a compound instruction may perform: multiply + add + load + modify address register
• Must support Multiply-Accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead
– Loop an instruction or sequence
– A 0 value in the count register usually means loop the maximum number of times
– If the loop count is calculated, must remember that 0 does not mean zero iterations
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches (in 4th generation VLIW DSPs)
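DSP ISA Features
DSP Low/Zero Overhead Loops Examples
Example FIR inner loop on TI TMS320C54xx:
[Figure: repeat instruction whose count is the number of filter taps, with address generation]
In ADSP 2100: “DO <addr> UNTIL condition”
DO X ...
PCS = PC + 1 ; at the DO: save the address of the first loop instruction
if (PC == X && !condition)
PC = PCS ; at the end of the loop and not done: go back to the loop start
else
PC = PC + 1 ; otherwise continue with the next instruction
X (address of the last instruction in the loop)
• Lowers loop overhead
• Eliminates a few instructions in loops
• Important in loops with small bodies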
Instruction Set Comparison
DSP Processor ISA:
• Specialized, complex instructions (e.g. MAC)
• Multiple operations per instruction
• Zero or reduced overhead loops.
mac x0,y0,a x:(r0)+,x0 y:(r4)+,y0
Code Size = 16 bits
General-Purpose Processor ISA:
• General-purpose instructions, less complex
• Typically only one operation per instruction
• No zero or reduced overhead loops support
mov *r0,x0
mov *r1,y0
mpy x0,y0,a
add a,b
mov y0,*r2
inc r0
inc r1
Code Size = 7 x 32 = 224 bits (14X)
Specialized Peripherals for DSPs
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
• On-chip peripherals often designed for “background” operation, even when core is powered down.
[Figure: System on Chip (SoC): DSP core with instruction memory, data memory, serial ports, A/D converter, D/A converter]
TI TMS320C203/LC203 Block Diagram
DSP Core Approach - 1995
Integrated
DSP Peripherals
Summary of Architectural Features of DSPs
• Data path configured for DSP algorithms
– Fixed-point arithmetic (most common: 95% of all DSPs)
– Fast MAC: multiply-accumulate
• Multiple memory banks and buses
– Harvard architecture
– Multiple data memories
– Dedicated address generation units
• Specialized addressing modes
– Bit-reversed addressing
• Specialized instruction set and execution control
– Support for MAC
• Specialized peripherals for DSP (SoC)
• Avoiding dynamic processor architectural features that make real-time performance harder to predict (e.g. dynamic scheduling, hardware speculation, branch prediction, cache).
– Why? To achieve predictable real-time performance
• THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN (or algorithm driven, DSP algorithms in this case).
DSP Software Development Considerations
• Different from general-purpose software development:
– Resource-hungry, complex algorithms.
– Specialized and/or complex processor architectures.
– Severe cost/storage limitations.
– Hard real-time constraints.
– Optimization is essential (program in DSP assembly).
– Increased testing challenges.
• Essential tools:
– Assembler, linker.
– Instruction set simulator.
– HLL code generation: C compiler (HLL/tools becoming more mature and gaining popularity).
– Debugging and profiling tools.
• Increasingly important:
– DSP software libraries.
– Real-time operating systems.
Classification of Current DSP Architectures
(Cost, power, and performance increase from the second to the fourth generation.)
• Modern Conventional DSPs (second generation; lower cost/power):
– Similar to the original DSPs of the early 1980s
– Single instruction/cycle. Example: TI TMS320C54x
– Complex instructions/not compiler friendly
– Usually one MAC unit
• Enhanced Conventional DSPs (third generation):
– Add parallel execution units: SIMD operation
– Complex, compound instructions.
– Example: TI TMS320C55x
– Not compiler friendly
• Multiple-Issue DSPs (fourth generation; higher cost/power):
– VLIW. Example: TI TMS320C62xx, TMS320C64xx
• Usually more than one MAC unit
• Simpler (RISC-like, fixed-width) instructions than conventional DSPs; more instructions and instruction bandwidth needed
• More compiler friendly; higher cost/power
• SIMD instruction support added to recent DSPs of this class
– Superscalar. Example: LSI Logic ZSP400, ZSP500
DSPs from all three of these generations are still available today
A Conventional DSP: TI TMS320C54xx (Second Generation DSP)
• 16-bit fixed-point DSP.
• Issues one 16-bit instruction/cycle
• Modified Harvard memory architecture
• Peripherals typical of conventional DSPs:
– 2-3 synch. serial ports, parallel port
– Bit I/O, timer, DMA
• Inexpensive (100 MHz ~$5 qty 10K).
• Low power (60 mW @ 1.8V, 100 MHz).
• Has one MAC unit
A Current Conventional DSP: TI TMS320C54xx (Second Generation DSP)
[Figure: block diagram; one MAC unit]
TI TMS320C55xx (Third Generation DSP)
• The TMS320C55xx is based on Texas Instruments' earlier TMS320C54xx family, but adds significant enhancements to the architecture and instruction set, including:
– Two instructions/cycle (limited VLIW?)
• Instructions are scheduled for parallel execution by the assembly programmer or compiler.
– Two MAC units.
• Complex, compound instructions:
– Assembly source code compatible with C54xx
– Mixed-width instructions: 8 to 48 bits.
– 200 MHz @ 1.5 V, ~130 mW, $17 qty 10k
• Poor compiler target.
TI TMS320C55xx (Third Generation DSP)
[Figure: block diagram; 2 MAC units]
Multiple-Issue DSPs: 16-bit Fixed-Point VLIW DSP
TI TMS320C6201 Revision 2 (1997) (Fourth Generation DSP)
• The TMS320C62xx is the first fixed-point DSP processor from Texas Instruments that is based on a VLIW-like architecture, which allows it to execute up to eight 32-bit RISC-like instructions per clock cycle.
• TMS320C67xx: floating-point version
• More compiler friendly
• Higher cost/power
• SIMD instructions support added to recent DSPs of this class (TMS320C64xx)
[Figure: C6201 block diagram]
– Program cache / program memory: 32-bit address, 256-bit data, 512K bits RAM
– C6201 CPU megamodule: program fetch, instruction dispatch, instruction decode; data path 1 (A register file; units L1, S1, M1, D1) and data path 2 (B register file; units D2, M2, S2, L2); control registers, control logic, test, emulation, interrupts
– Data memory: 32-bit address, 8-, 16-, 32-bit data, 512K bits RAM
– Peripherals: power down, host port interface, 4-DMA, external memory interface, 2 timers, 2 multi-channel buffered serial ports (T1/E1)
TI TMS320C62xx Internal Memory Architecture
• Separate Internal Program and Data Spaces
• Program
– 16K 32-bit instructions (2K Fetch Packets)
– 256-bit Fetch Width
– Configurable as either direct-mapped cache or memory-mapped program memory
• Data
– 32K x 16
– Single Ported, Accessible by Both CPU Data Buses
– 4 x 8K 16-bit Banks
• 2 Possible Simultaneous Memory Accesses (4 Banks)
• 4-Way Interleave; Banks and Interleave Minimize Access Conflicts
TI TMS320C62xx Datapaths (Fourth Generation DSP)
[Figure: two register files (A0 - A15 and B0 - B15) linked by cross paths 1X and 2X, feeding functional units L1, S1, M1, D1 and D2, M2, S2, L2; load data buses DDATA_I1 and DDATA_I2, store data buses DDATA_O1 and DDATA_O2, address buses DADR1 and DADR2; 40-bit write paths (8 MSBs) and 40-bit read/store paths]
TI TMS320C62xx Functional Units
• L-Unit (L1, L2)
– 40-bit Integer ALU, Comparisons
– Bit Counting, Normalization
• S-Unit (S1, S2)

– 32-bit ALU, 40-bit Shifter
– Bitfield Operations, Branching
• M-Unit (M1, M2)

– 16 x 16 -> 32
• D-Unit (D1, D2)

– 32-bit Add/Subtract
– Address Calculations
(Statically Scheduled)
TI TMS320C62xx Instruction Packing
Instruction Packing: Advanced 8-way VLIW (Statically Scheduled VLIW)
• Fetch Packet
– CPU fetches 8 instructions/cycle
• Execute Packet
– CPU executes 1 to 8 instructions/cycle
– Fetch packets can contain multiple execute packets
• Parallelism determined at compile / assembly time
• Examples (a fetch packet holding instructions A to H)
– 1) 8 parallel instructions: A // B // C // D // E // F // G // H
– 2) 8 serial instructions: A, then B, then C, and so on to H
– 3) Mixed serial/parallel groups:
• A // B
• C
• D
• E // F // G // H
• Reduces Codesize, Number of Program Fetches, Power Consumption
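The grouping of a fetch packet into execute packets can be sketched as below. The representation is an assumption made for illustration: each instruction carries a flag saying whether it executes in parallel with the next one (the `//` in the examples above). The slides do not describe how the hardware encodes this.

```python
def split_execute_packets(fetch_packet):
    """Group (name, parallel_with_next) pairs into execute packets."""
    packets, current = [], []
    for name, parallel_with_next in fetch_packet:
        current.append(name)
        if not parallel_with_next:  # this instruction closes the packet
            packets.append(current)
            current = []
    if current:                     # trailing group with no closing flag
        packets.append(current)
    return packets


# Example 3 from the slide: A // B, C, D, E // F // G // H
example3 = [("A", True), ("B", False), ("C", False), ("D", False),
            ("E", True), ("F", True), ("G", True), ("H", False)]
print(split_execute_packets(example3))
# [['A', 'B'], ['C'], ['D'], ['E', 'F', 'G', 'H']]
```

Because the grouping is fixed at compile or assembly time, one fetch packet can hold anywhere from one to eight execute packets without padding, which is where the code size saving comes from.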
TI TMS320C62xx Pipeline Operation
Pipeline Phases: Fetch, Decode, Execute
• Single-Cycle Throughput
• Operate in Lock Step
• Fetch
– PG Program Address Generate
– PS Program Address Send
– PW Program Access Ready Wait
– PR Program Fetch Packet Receive
• Decode
– DP Instruction Dispatch
– DC Instruction Decode
• Execute
– E1 - E5 Execute 1 through Execute 5
Delay Slots
• Delay Slots: number of extra cycles until result is:
– written to register file
– available for use by subsequent instructions
• Multi-cycle NOP instruction can fill delay slots while minimizing code size impact
Instruction type     Execute phases          Delay slots
Most Instructions    E1                      0
Integer Multiply     E1 E2                   1
Loads                E1 E2 E3 E4 E5          4
Branches             E1 (branch target: PG PS PW PR DP DC E1)   5
(Statically Scheduled VLIW, for better real-time performance predictability)
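Since the pipeline is statically scheduled, the programmer or compiler has to account for these delay slots. The sketch below turns the table into a lookup and computes when a result can first be consumed and how many NOP cycles remain after filling some slots with independent work; the dictionary keys and function names are illustrative.

```python
# Delay slots per instruction type, taken from the table above.
DELAY_SLOTS = {"most": 0, "multiply": 1, "load": 4, "branch": 5}


def first_use_cycle(issue_cycle: int, kind: str) -> int:
    """Earliest cycle at which a later instruction can use the result."""
    return issue_cycle + 1 + DELAY_SLOTS[kind]


def nops_needed(kind: str, independent_cycles: int = 0) -> int:
    """NOP cycles left after filling some slots with independent work."""
    return max(0, DELAY_SLOTS[kind] - independent_cycles)


print(first_use_cycle(0, "load"))   # 5: a load issued in cycle 0 is usable in cycle 5
print(nops_needed("load", 1))       # 3: one useful cycle fills one of four slots
print(nops_needed("multiply", 3))   # 0: the single slot is already covered
```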
C6000 Instruction Set Features
Conditional Instruction Execution
• All Instructions can be Conditional (similar to Intel IA-64)

– A1, A2, B0, B1, B2 can be used as Conditions
– Based on Zero or Non-Zero Value
– Compare Instructions can allow other Conditions (<, >, etc)
• Reduces Branching
• Increases Parallelism
Features
• Load-Store Architecture
• Two Addressing Units (D1, D2)
• Orthogonal
– Any Register can be used for Addressing or Indexing
• Signed/Unsigned Byte, Half-Word, Word, Double-Word Addressable
– Indexes are Scaled by Type
• Register or 5-Bit Unsigned Constant Index
Features
• Indirect Addressing Modes
– Pre-Increment *++R[index]
– Post-Increment *R++[index]
– Pre-Decrement *--R[index]
– Post-Decrement *R--[index]
– Positive Offset *+R[index]
– Negative Offset *-R[index]
• 15-bit Positive/Negative Constant Offset from Either B14 or B15
• Circular Addressing
– Fast and Low Cost: Power of 2 Sizes and Alignment
– Up to 8 Different Pointers/Buffers, Up to 2 Different Buffer Sizes
• Bit-reversal Addressing
• Dual Endian Support
FIR Filter On TMS320C54xx vs. TMS320C62xx
[Figure: FIR filter code for the 2nd generation conventional DSP beside the 4th generation VLIW DSP; annotations: “VLIW: larger code size”, “Two filter taps”]
TI TMS320C64xx
• Announced in February 2000, the TMS320C64xx is an extension of
Texas Instruments' earlier TMS320C62xx architecture.
• The TMS320C64xx has 64 32-bit general-purpose registers, twice as
many as the TMS320C62xx.
• The TMS320C64xx instruction set is a superset of that used in the
TMS320C62xx, and, among other enhancements, adds significant
SIMD/media processing capabilities:
– 8-bit operations for image/video processing (SIMD media processing).
• Introduced at 600 MHz clock speed (1 GHz now), but:
– 11-stage pipeline with long latencies
– Dynamic caches.
• $100 qty 10k.
• The only DSP family with compatible fixed and floating-point versions.
[Figure annotations]
• C64xx (also C62xx and C67xx): VLIW DSPs have higher memory use due to simpler (RISC-like, fixed-width) instructions than conventional DSPs; more instructions and instruction bandwidth are needed.
• Also VLIW, but with variable-length instruction encoding (16-32 bits): less memory use than C64xx.
Multiple-Issue 4th Generation DSPs
Superscalar DSP: LSI Logic ZSP400

• A 4-way superscalar dynamically scheduled 16-bit fixed-
point DSP core.
• 16-bit RISC-like instructions
• Separate on-chip caches for instructions and data
• Two MAC units, two ALU/shifter units
– Limited SIMD support.
– MACs can be combined for 32-bit operations.
• Possible Disadvantage:
– Dynamic behavior complicates DSP software development:
• Ensuring real-time behavior
• Optimizing code.
TI not actively improving their flagship
FP DSP (fixed-point more important!)