PDSP Architecture
Processor Applications
[Figure: processor applications arranged by increasing volume and increasing cost]
EECC722 - Shaaban
#2 lec # 8
Processor Markets (total: $30B)
Segment | Revenue | Share
32-bit micro | $5.2B | 17%
32-bit DSP | $1.2B | 4%
DSP | $10B | 33%
16-bit micro | $5.7B | 19%
8-bit micro | $9.3B | 31%
[Figure: performance versus cost. Microprocessors: performance is everything, and software rules. Microcontrollers: cost is everything.]
Nintendo processor
Cellular phones
Code size
TI TMS320C3X, TMS320C67xx
AT&T DSP32C
Analog Devices ADSP21xxx
Hitachi SH-4
TI TMS320C2X, TMS320C62xx
Infineon TC1xxx (TriCore1)
Motorola DSP568xx, MSC810x
Analog Devices ADSP21xx
Agere Systems DSP16xxx, Starpro2000
LSI Logic LSI140x (ZPS400)
Hitachi SH3-DSP
Off-the-shelf chips:
Highly optimized for speed, energy efficiency, and/or cost.
Limited performance, integration options.
Tools, 3rd-party support often more mature
DSP ARCHITECTURE
Enabling Technologies
Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970s | Discrete logic | Non-real-time processing, simulation | -
Late 1970s | Building block | Military radars, digital comm. | -
Early 1980s | - | Telecom, control | µP architectures, NMOS/CMOS
Late 1980s | Function/application-specific chips | Computers, communication | Vector processing, parallel processing
Early 1990s | Multiprocessing | Video/image processing | Advanced multiprocessing (VLIW, MIMD, etc.)
Late 1990s | Single-chip multiprocessing | Wireless telephony, Internet related | Low-power single-chip DSP, multiprocessing
Uniprocessor Based (Harvard Architecture)
Device | Year | Bit Size | Clock speed (MHz) | Instruction Throughput | MAC execution (ns) | MOPS | Device density (# of transistors, feature size)
TMS32010 | 1982 | 16 integer | 20 | 5 MIPS | 400 | - | 58,000 (3 µm)
TMS320C25 | 1985 | 16 integer | 40 | 10 MIPS | 100 | 20 | 160,000 (2 µm)
TMS320C30 | 1988 | 32 flt. pt. | 33 | 17 MIPS | 60 | 33 | 695,000 (1 µm)
TMS320C50 | 1991 | 16 integer | 57 | 29 MIPS | 35 | 60 | 1,000,000 (0.5 µm)
TMS320C2XXX | 1995 | 16 integer | - | 40 MIPS | 25 | 80 | -
Multiprocessor Based
Device | Year | Bit Size | Architecture | Throughput
TMS320C80 | 1996 | 32 integer/flt. | MIMD | 2 GOPS, 120 MFLOP
TMS320C62XX | 1997 | 16 integer | VLIW | 1600 MIPS
TMS320C67XX | 1997 | 32 flt. pt. | VLIW | 1 GFLOP
(20 GOPS is also listed in this group; its row is unclear.)
DSP Applications
Industrial control
Seismic exploration
Networking:
Wireless
Base station
Cable modems
ADSL
VDSL
DSP Applications
DSP Algorithm | System Application
Speech Coding | Digital cellular telephones, personal communications systems, digital cordless telephones, multimedia computers, secure communications
Speech Encryption | Digital cellular telephones, personal communications systems, digital cordless telephones, secure communications
Speech Recognition | Advanced user interfaces, multimedia workstations, robotics, automotive applications, cellular telephones, personal communications systems
Speech Synthesis | Advanced user interfaces, robotics
Speaker Identification | Security, multimedia workstations, advanced user interfaces
High-fidelity Audio | Consumer audio, consumer video, digital audio broadcast, professional audio, multimedia computers
Modems | Digital cellular telephones, personal communications systems, digital cordless telephones, digital audio broadcast, digital signaling on cable TV, multimedia computers, wireless computing, navigation, data/fax
Noise cancellation | Professional audio, advanced vehicular audio, industrial applications
Audio Equalization | Consumer audio, professional audio, advanced vehicular audio, music
Ambient Acoustics Emulation | Consumer audio, professional audio, advanced vehicular audio, music
Audio Mixing/Editing | Professional audio, music, multimedia computers
Sound Synthesis | Professional audio, music, multimedia computers, advanced user interfaces
Vision | Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
Image Compression | Digital photography, digital video, multimedia computers, videoconferencing
Image Compositing | Multimedia computers, consumer video, advanced user interfaces, navigation
Beamforming | Navigation, medical imaging, radar/sonar, signals intelligence
Echo cancellation | Speakerphones, hands-free cellular telephones
Spectral Estimation | Signals intelligence, radar/sonar, professional audio, music
[Figure: DSP application range; volume increases toward the low end and cost increases toward the high end]
High-end:
Military applications
Wireless Base Station - TMS320C6000
Cable modem
gateways
Mid-end:
Industrial control
Cellular phone - TMS320C540
Fax/voice server
Low end:
Storage products - TMS320C27
Digital camera - TMS320C5000
Portable phones
Wireless headsets
Consumer audio
Automobiles, toasters, thermostats, ...
[Figure: cellular phone block diagram: controller, A/D, speech encode, speech decode, DAC, physical layer processing, baseband converter, RF modem]
HW/SW/IC PARTITIONING
[Figure: the cellular phone blocks (keypad, controller, A/D, speech encode, speech decode, DAC, physical layer processing, baseband converter, RF modem) partitioned among a microcontroller, an ASIC, a DSP, and an analog IC]
[Figure: single-chip phone partitioning with a DSP core, ASIC logic, two RAMs, DMA, and S/P blocks; functions shown include phone book, keypad intfc, control protocol, speech quality enhancement, voice recognition, RPE-LTP speech decoder, de-intl & decoder, demodulator and synchronizer, and Viterbi equalizer]
C540
ARM7
[Figure: embedded processor interface on a low-power bus, with FB FIFO, video decomp, pen, SRAM, data flow, FIFO, graphics, audio, and video blocks]
[Figure: video I/O, downlink radio, voice I/O, and pen in, connected to video unit, memory, coms, and custom DSP blocks]
Delay/Storage is drawn as Delay or $z^{-1}$
[Figure: FIR filter structure with coefficients $h_0, h_1, \ldots, h_{N-2}, h_{N-1}$; each multiply-add stage is a tap]
Application | Frequency | # taps | Performance
Speech | 8 kHz | N = 128 | 2 MOPs
Music | 48 kHz | N = 256 | 24 MOPs
TV | 27 MHz | - | -
HDTV | 144 MHz | - | -
A 1-D FIR has $n_{op} = 2N$ operations per sample and a 2-D FIR has $n_{op} = 2N^2$.
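The operation counts above can be turned into a throughput estimate by multiplying by the sample rate. The sketch below assumes that performance is simply sample rate times operations per sample; the function name `fir_ops_per_second` is made up for illustration.

```python
def fir_ops_per_second(sample_rate_hz, taps, two_d=False):
    """Operations per second for an FIR filter.

    Assumes 2N operations per sample for a 1-D filter and 2N^2 for a 2-D filter.
    """
    ops_per_sample = 2 * taps * taps if two_d else 2 * taps
    return sample_rate_hz * ops_per_sample


speech = fir_ops_per_second(8_000, 128)    # 8 kHz, N = 128
music = fir_ops_per_second(48_000, 256)    # 48 kHz, N = 256
print(speech / 1e6, "MOPs for speech")
print(music / 1e6, "MOPs for music")
```

Under this rule speech comes out at roughly 2 MOPs and music at roughly 24.6 MOPs.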
$$y(i) = \sum_{k=1}^{M-1} a(k)\, y(i-k) + \sum_{k=0}^{N-1} b(k)\, x(i-k)$$
$$y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \qquad k = 0, \ldots, N-1$$
$$W_N = e^{-j 2\pi / N}, \qquad j = \sqrt{-1}$$
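As a minimal sketch of the transform defined by $W_N$, the direct form can be evaluated with a double loop. This is the plain $O(N^2)$ definition, not a fast algorithm, and the function name `dft` is chosen here for illustration.

```python
import cmath


def dft(x):
    """Direct DFT: y(k) = sum over n of W_N^(n*k) * x(n), with W_N = exp(-2*pi*j/N)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)  # the twiddle factor W_N
    return [
        sum(w ** (n * k) * x[n] for n in range(n_points))
        for k in range(n_points)
    ]


# A unit impulse has a flat spectrum: every output bin equals 1.
spectrum = dft([1, 0, 0, 0])
print([round(abs(v), 6) for v in spectrum])
```

Each output needs N complex multiply-accumulates, which is why the transform is a natural benchmark for MAC-oriented hardware.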
DSP BENCHMARKS
DSPstone: University of Aachen, application benchmarks
$-1 \le x < 1$
[Figure: fixed-point word formats showing the position of the radix point]
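To make the fractional range concrete, here is a small sketch of a 16-bit fractional format with the radix point just after the sign bit. The 16-bit width, the name Q15, and the helper names are assumptions for illustration; the slides do not fix a word size at this point.

```python
Q15_SCALE = 1 << 15  # 15 fractional bits, so 1.0 would map to 32768


def float_to_q15(x):
    """Quantize x to a 16-bit two's-complement fraction, saturating out-of-range input."""
    raw = int(round(x * Q15_SCALE))
    return max(-32768, min(32767, raw))


def q15_to_float(q):
    """Interpret a 16-bit integer as a fraction in [-1, 1)."""
    return q / Q15_SCALE


def q15_mul(a, b):
    """Multiply two Q15 values; the Q30 product is shifted back to Q15 (truncating).

    The corner case (-1) * (-1) overflows and is not handled in this sketch.
    """
    return (a * b) >> 15


half = float_to_q15(0.5)
quarter = q15_mul(half, half)
print(half, quarter, q15_to_float(quarter))
```

The value +1.0 is not representable, so it saturates to the largest code, while -1.0 is exact. That asymmetry is what the half-open range expresses.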
ALU
Accumulator
General-Purpose Processor
Multiplies often take >1 cycle
Shifts often take >1 cycle
Other operations (e.g., saturation, rounding) typically take multiple cycles.
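The sketch below spells out what saturation and rounding involve when there is no hardware support, which is why they cost several instructions on a general-purpose processor. The 16-bit word, the 32-bit accumulator, and the function names are assumptions for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap16(value):
    """Two's-complement wraparound: what a plain 16-bit add produces on overflow."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def sat_add16(a, b):
    """Saturating add: clamp to the 16-bit range instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def round_acc_to_16(acc):
    """Round a 32-bit accumulator to its upper 16 bits.

    Adds half an LSB of the result before shifting, then saturates.
    """
    return max(INT16_MIN, min(INT16_MAX, (acc + (1 << 15)) >> 16))


print(wrap16(30000 + 10000))     # wraps to a negative value
print(sat_add16(30000, 10000))   # clamps at the positive limit
print(round_acc_to_16(0x00018000))
```

In software, each clamp is a compare-and-select pair and the rounding is an extra add plus a shift.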
Instruction
Memory
Processor
Data
Memory
Datapath:
Mem
T-Register
Accumulator
Specialized instruction set
Load and Accumulate
Multiplier
ALU
P-Register
Accumulator
Features
$$y(n) = \sum_{m=0}^{N-1} h(m)\, x(n-m)$$
Element of finite-impulse response filter computation
[Figure: multiply-accumulate datapath: X, MPY, ADD/SUB, ACC REG]
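A minimal sketch of the multiply-accumulate loop behind one FIR output sample follows. It assumes samples before x(0) are zero, and the function name `fir_output` is chosen for illustration.

```python
def fir_output(h, x, n):
    """Compute y(n) = sum over m of h(m) * x(n - m) with one MAC per tap."""
    acc = 0                               # plays the role of the accumulator register
    for m in range(len(h)):
        if 0 <= n - m < len(x):           # samples outside the input count as zero
            acc += h[m] * x[n - m]        # multiply, then add into the accumulator
    return acc


h = [1, 2, 3]          # filter coefficients (taps)
x = [1, 0, 0, 4]       # input samples
y = [fir_output(h, x, n) for n in range(len(x))]
print(y)
```

The loop body is a single multiply followed by an add into a running sum, which is the operation the MPY, ADD/SUB, and accumulator path is built to complete once per cycle.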
Xn X
2
n-1
Yn
DSP Memory
FIR Tap implies multiple memory accesses
DSPs require multiple data ports
Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
Instruction repeat buffer: do 1 instruction 256 times
Often disables interrupts, thereby increasing interrupt response time
X MEMORY
Y MEMORY
GLOBAL
P DATA
X DATA
Y DATA
DSP Processor
Harvard architecture
2-4 memory accesses/cycle
No caches: on-chip SRAM
General-Purpose Processor
Von Neumann architecture
Typically 1 access/cycle
Use caches
[Figure: DSP processor with separate program and data memories; general-purpose processor with a single memory]
DSP Addressing
Have standard addressing modes: immediate, displacement, register indirect
Want to keep MAC datapath busy
Assumption: any extra instructions imply clock cycles of overhead in inner loop
=> complex addressing is good
=> don't use datapath to calculate fancy address
Autoincrement/Autodecrement register indirect
lw r1,0(r2)+ => r1 <- M[r2]; r2 <- r2+1
Option to do it before addressing, positive or negative
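Binary index | Input | Output
000 | x(0) | F(0)
100 | x(4) | F(1)
010 | x(2) | F(2)
110 | x(6) | F(3)
001 | x(1) | F(4)
101 | x(5) | F(5)
011 | x(3) | F(6)
111 | x(7) | F(7)
[Figure: 8-point FFT flow graph built from four 2-point DFTs and two 4-point DFTs]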
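The input ordering x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) is what results from reversing the bits of each position. A short sketch, with the helper name `bit_reverse` chosen for illustration:

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index, e.g. 001 -> 100 for bits=3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current low bit into the result
        index >>= 1
    return result


# Input sample index needed at each position of an 8-point FFT.
order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

A DSP with a bit-reversed addressing mode produces this sequence in its address generator, so the reordering costs no extra instructions in the loop.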
CIRCULAR BUFFERS
Instructions accommodate three elements:
buffer address
buffer size
increment
Allows for cycling through:
delay elements
coefficients in data memory
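The three elements map directly onto a modulo address update. The sketch below is a software model only; the function name `circular_next` and the example addresses are illustrative, not taken from any particular instruction set.

```python
def circular_next(address, base, size, increment):
    """Next address inside the buffer [base, base + size), wrapping at either end."""
    return base + (address - base + increment) % size


# Walk a 4-entry buffer that starts at address 100, one step at a time.
addr = 100
visited = []
for _ in range(6):
    visited.append(addr)
    addr = circular_next(addr, 100, 4, 1)
print(visited)

# A negative increment wraps from the first entry back to the last.
print(circular_next(100, 100, 4, -1))
```

With this update done by the address generator, a delay line never has to be shifted in memory: only the pointer moves.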
Addressing Comparison
DSP Processor
Dedicated address generation units
Specialized addressing modes; e.g.:
Autoincrement
Modulo (circular)
Bit-reversed (for FFT)
Good immediate data support
General-Purpose Processor
Often, no separate address generation unit
General-purpose addressing modes
DO X ...
Address Generation
PCS = PC + 1
if (PC == X && !condition)
    PC = PCS
else
    PC = PC + 1
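The address-generation rule can be traced in software to show that the loop needs no branch instruction. This sketch assumes the exit condition is a repeat counter reaching zero and that the body runs at least once; the function and parameter names are illustrative.

```python
def trace_do_loop(do_pc, end_pc, count):
    """Return the sequence of PC values for a DO loop ending at end_pc, run count times."""
    pcs = do_pc + 1              # PCS = PC + 1: address of the first loop instruction
    pc = pcs
    remaining = count
    trace = []
    while pc <= end_pc:
        trace.append(pc)
        if pc == end_pc:
            remaining -= 1       # one more pass through the body has finished
        if pc == end_pc and remaining > 0:
            pc = pcs             # exit condition not met: jump back to loop start
        else:
            pc = pc + 1          # ordinary sequential fetch
    return trace


# DO at address 10, two-instruction body at 11..12, repeated three times.
print(trace_do_loop(10, 12, 3))
```

Every fetched address belongs to the loop body, so no cycles are spent on a compare-and-branch at the end of each pass.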
General-Purpose Processor
General-purpose instructions
Typically only one operation per instruction
mov *r0,x0
mov *r1,y0
mpy x0, y0, a
add a, b
mov y0, *r2
inc r0
inc r1
[Figure: DSP with instruction memory, data memory, A/D converter, and D/A converter]
Serial Ports
Synchronous serial ports
Parallel ports
Timers
On-chip A/D, D/A converters
Host ports
Bit I/O ports
On-chip DMA controller
Clock generators
Essential tools:
Assembler, linker.
Instruction set simulator.
HLL Code generation: C compiler.
Debugging and profiling tools.
Increasingly important:
Software libraries.
Real-time operating systems.
Multiple-Issue DSPs:
VLIW Example: TI TMS320C62xx, TMS320C64xx
Superscalar Example: LSI Logic ZPS400
A Conventional DSP:
TI TMS320C54xx
[Figure: processor block diagram: program fetch, instruction dispatch, and instruction decode feeding Data Path 1 (A register file; units L1, S1, M1, D1) and Data Path 2 (B register file; units D2, M2, S2, L2); control registers, control logic, test, emulation, interrupts; power down; host port interface; 4 DMA; external memory interface; 2 timers; 2 multichannel buffered serial ports (T1/E1)]
Data Memory
32-Bit address, 8-, 16-, 32-Bit data
512K Bits RAM
C62x Datapaths
Registers A0 - A15
Registers B0 - B15
[Figure: functional units L1, S1, M1, D1 on the A side and D2, M2, S2, L2 on the B side, linked by cross paths 1X and 2X; data buses DDATA_I1 and DDATA_I2 (load data), DDATA_O1 and DDATA_O2 (store data), DADR1 and DADR2 (address)]
Cross Paths
40-bit Write Paths (8 MSBs)
40-bit Read Paths/Store Paths
Fetch Packet
CPU fetches 8 instructions/cycle
Execute Packet
CPU executes 1 to 8 instructions/cycle
Fetch packets can contain multiple execute packets
Parallelism determined at compile / assembly time
Examples
1) 8 parallel instructions:
A // B // C // D // E // F // G // H
2) 8 serial instructions:
A
B
C
D
E
F
G
H
3) Mixed Serial/Parallel Groups:
A // B
C
D
E // F // G // H
Reduces Codesize, Number of Program Fetches, Power Consumption
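The grouping of a fetch packet into execute packets can be sketched as follows. The sketch assumes each instruction carries a one-bit parallel flag (called `p_bit` here) that says whether the next instruction issues in the same cycle; that encoding and the function name are assumptions for illustration, since the text above only says parallelism is fixed at compile or assembly time.

```python
def split_execute_packets(fetch_packet):
    """Group a fetch packet into execute packets.

    fetch_packet is a list of (name, p_bit) pairs; p_bit == 1 means the next
    instruction executes in parallel with this one.
    """
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:           # chain ends here: close the execute packet
            packets.append(current)
            current = []
    if current:                  # tolerate a trailing instruction with p_bit set
        packets.append(current)
    return packets


# Example 3 above: A // B, then C, then D, then E // F // G // H.
mixed = [("A", 1), ("B", 0), ("C", 0), ("D", 0),
         ("E", 1), ("F", 1), ("G", 1), ("H", 0)]
print(split_execute_packets(mixed))
```

Because the grouping is encoded in the instructions themselves, serial code needs no padding NOPs, which is where the code size saving comes from.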
Pipeline stages: PG PS PW PR DP DC E1 E2 E3 E4 E5
Fetch
PG: Program Address Generate
PS: Program Address Send
PW: Program Access Ready Wait
PR: Program Fetch Packet Receive
Decode
DP: Instruction Dispatch
DC: Instruction Decode
Execute
E1 - E5: Execute 1 through Execute 5
Single-Cycle Throughput
Operate in Lock Step
[Figure: pipeline diagram with Execute Packets 1 through 7 entering on successive cycles, each one stage behind the previous packet]
E1 E2 1 Delay Slots
E1 E2 E3 E4 E5 4 Delay Slots
E1
Reduces Branching
Increases Parallelism
Pre-Increment: *++R[index]
Post-Increment: *R++[index]
Pre-Decrement: *--R[index]
Post-Decrement: *R--[index]
Positive Offset: *+R[index]
Negative Offset: *-R[index]
TI TMS320C64xx
Announced in February 2000, the TMS320C64xx is an extension
of Texas Instruments' earlier TMS320C62xx architecture.
The TMS320C64xx has 64 32-bit general-purpose registers, twice
as many as the TMS320C62xx.
The TMS320C64xx instruction set is a superset of that used in the
TMS320C62xx, and, among other enhancements, adds significant
SIMD processing capabilities:
8-bit operations for image/video processing.
600 MHz clock speed, but:
11-stage pipeline with long latencies
Dynamic caches.
$100 qty 10k.
The only DSP family with compatible fixed and floating-point
versions.
Superscalar DSP:
Disadvantage:
Dynamic behavior complicates DSP software development:
Ensuring real-time behavior
Optimizing code.