DSP Architecture
Processors (DSPs)
Dr J Ravi Kumar
Asst Professor
ECE Dept
NIT Warangal
Outline/objectives
Identify the most important DSP processor
architecture features and how they relate
to DSP applications
Understand the types of code appropriate
for DSP implementation
Advantages of DSPs vs. Analog Circuits
Can implement complex linear and non-linear algorithms.
The application can be modified simply by changing the code.
Highly reliable.
Manufacturing is fairly easy.
What is a DSP?
A specialized microprocessor for real-time DSP applications:
Digital filtering (FIR and IIR)
FFT
Convolution, Matrix Multiplication etc
ANALOG INPUT -> ADC -> DIGITAL INPUT -> DSP -> DIGITAL OUTPUT -> DAC -> ANALOG OUTPUT
$30B Processor Markets
DSP: $10B / 33%
8-bit micro: $9.3B / 31%
16-bit micro: $5.7B / 19%
32-bit micro: $5.2B / 17%
32-bit DSP: $1.2B / 4%
DSP Applications
Digital audio applications
  MPEG Audio
  Portable audio
Digital cameras
Cellular telephones
Wearable medical appliances
Storage products:
  disk drive servo control
Military applications:
  radar
  sonar
Industrial control
Seismic exploration
Networking:
  Wireless
  Base station
  Cable modems
  ADSL
  VDSL
DSP Applications
DSP Algorithm: System Application
Speech Coding: Digital cellular telephones, personal communications systems, digital cordless telephones, multimedia computers, secure communications
Speech Encryption: Digital cellular telephones, personal communications systems, digital cordless telephones, secure communications
Speech Recognition: Advanced user interfaces, multimedia workstations, robotics, automotive applications, cellular telephones, personal communications systems
Speech Synthesis: Advanced user interfaces, robotics
Speaker Identification: Security, multimedia workstations, advanced user interfaces
High-fidelity Audio: Consumer audio, consumer video, digital audio broadcast, professional audio, multimedia computers
Modems: Digital cellular telephones, personal communications systems, digital cordless telephones, digital audio broadcast, digital signaling on cable TV, multimedia computers, wireless computing, navigation, data/fax
Noise cancellation: Professional audio, advanced vehicular audio, industrial applications
Audio Equalization: Consumer audio, professional audio, advanced vehicular audio, music
Ambient Acoustics Emulation: Consumer audio, professional audio, advanced vehicular audio, music
Audio Mixing/Editing: Professional audio, music, multimedia computers
Sound Synthesis: Professional audio, music, multimedia computers, advanced user interfaces
Vision: Security, multimedia computers, advanced user interfaces, instrumentation, robotics, navigation
Image Compression: Digital photography, digital video, multimedia computers, videoconferencing
Image Compositing: Multimedia computers, consumer video, advanced user interfaces, navigation
Beamforming: Navigation, medical imaging, radar/sonar, signals intelligence
Echo cancellation: Speakerphones, hands-free cellular telephones
Spectral Estimation: Signals intelligence, radar/sonar, professional audio, music
Another Look at DSP Applications
(Cost increases toward the high end; volume increases toward the low end.)
High-end
  Military applications
  Wireless Base Station - TMS320C6000
  Cable modem gateways
Mid-end
  Industrial control
  Cellular phone - TMS320C540
  Fax/voice server
Low end
  Storage products - TMS320C27
  Digital camera - TMS320C5000
  Portable phones
  Wireless headsets
  Consumer audio
  Automobiles, toasters, thermostats, ...
Hardware used in DSP
ASIC, FPGA, GPP, DSP
Example Wireless Phone Organization
C540
ARM7
TYPES OF DSP PROCESSORS
32-BIT FLOATING POINT (5% of market):
TI TMS320C3X, TMS320C67xx
AT&T DSP32C
ANALOG DEVICES ADSP21xxx
Hitachi SH-4
16-BIT FIXED POINT (95% of market):
TI TMS320C2X, TMS320C62xx
Infineon TC1xxx (TriCore1)
MOTOROLA DSP568xx, MSC810x
ANALOG DEVICES ADSP21xx
Agere Systems DSP16xxx, Starpro2000
LSI Logic LSI140x (ZPS400)
Hitachi SH3-DSP
StarCore SC110, SC140
Fixed vs. Floating Point DSPs
Fixed Point DSPs (Modems, Controllers, Phones)
Cheaper and consume less power
Floating Point DSPs
More expensive.
HARVARD MEMORY ARCHITECTURE in DSP
[Figure: separate program memory, X memory and Y memory, connected by the global P DATA, X DATA and Y DATA buses]
EECC72 - Shaaban
#57 lec # 8 Fall 2003 10-8-2003
ATmega16
Internal
Architecture
Memory Architecture Comparison
DSP Processor                  General-Purpose Processor
Harvard architecture           Von Neumann architecture
2-4 memory accesses/cycle      Typically 1 access/cycle
No caches; on-chip SRAM        Use caches
TMS320C6701 DSP
Block Diagram
Program Cache/Program Memory
32-bit address, 256-Bit data
512K Bits RAM
TMS320C6701
Memory /Peripherals
Same as C6201
External interface supports
SDRAM, SRAM, SBSRAM
4-channel bootloading DMA
16-bit host port interface
1Mbit on-chip SRAM
2 multichannel buffered serial ports (T1/E1)
Pin compatible with C6201
TMS320C67x CPU Core
C67x Floating-Point CPU Core
[Figure: Program Fetch, Instruction Dispatch and Instruction Decode feed two data paths. Data Path 1 has the A Register File and units L1, S1, M1, D1; Data Path 2 has the B Register File and units D2, M2, S2, L2. Other blocks: Control Registers, Control Logic, Test, Emulation, Interrupts. Legend: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit, Floating-Point Capabilities.]
DSP Units
.M Multiplication Unit:
16-bit x 16-bit, 32-bit x 32-bit, 64-bit x 64-bit
.L Logic Unit:
Arithmetic, comparisons and logic operations.
.S Shifter Unit:
Bit manipulation (set, get, shift, rotate).
.D Data Unit:
Load/Store to/from memory (exclusively)
Performs addition and pointer arithmetic.
C6700 DSP Simplified
Architecture
[Figure: the CPU contains register files A0-A15 and B0-B15, functional units .D1/.D2, .M1/.M2, .L1/.L2 and .S1/.S2, and the control registers. Internal address and data buses connect it to Program RAM or Cache, Data Cache, External Memory (sync and async), DMA, Serial Port, Host Port, Boot Load, Timers and Pwr Down.]
C6713 DSP Features
8 functional units are divided into 2 sets each with
4 different units and their own 16 general purpose
registers (A0-A15 and B0-B15).
There is a single data bus connecting the 2 sides.
Register files support data ranging in size from 16-bit
through 40-bit fixed point & 64-bit floating point.
Register access using the register file across the
CPU supports one read and one write per cycle.
C67x Datapaths
2 Data Paths
8 Functional Units
  Orthogonal/Independent
  2 Floating Point Multipliers
  2 Floating Point Arithmetic
  2 Floating Point Auxiliary
Control
  Independent
  Up to 8 32-bit Instructions
Registers
  2 Files
  32, 32-bit registers total
Cross paths (1X, 2X)
L-Unit (L1, L2)
  Floating-Point, 40-bit Integer ALU
  Bit Counting, Normalization
S-Unit (S1, S2)
  Floating Point Auxiliary Unit
  32-bit ALU/40-bit shifter
  Bitfield Operations, Branching
M-Unit (M1, M2)
  Multiplier: Integer & Floating-Point
D-Unit (D1, D2)
  32-bit add/subtract
  Addr Calculations
C6713 DSP Features
Instruction Set Features
Hardware support for IEEE 754 standard on single and
double precision floating-point operations.
8, 16 or 32-bit addressable load/store instructions.
L1/L2 Memory Architecture (2 level cache).
Can be configured to support Little Endian or Big
Endian.
16-bit HPI to allow other processor access to
memory space.
OMAP-L138 Functional Block Diagram
TMS320C6748 Megamodule Block Diagram
6748
The internal program memory is structured
so that a total of eight instructions can be
fetched every cycle.
With a clock rate of 375 MHz, the C6748 is
capable of fetching eight 32-bit instructions
every 1/(375 MHz), or about 2.67 ns.
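The arithmetic above can be checked with a short Python sketch. The function name `fetch_stats` is introduced here only for illustration, and the result is a peak figure that assumes one fetch packet is delivered every cycle with no stalls.

```python
def fetch_stats(clock_hz, instr_per_fetch=8, instr_bits=32):
    """Return (cycle time in ns, peak instructions per second, fetch width in bits)."""
    cycle_ns = 1e9 / clock_hz                  # one clock period in nanoseconds
    peak_ips = clock_hz * instr_per_fetch      # assumes a full fetch every cycle
    fetch_bits = instr_per_fetch * instr_bits  # width of one fetch packet
    return cycle_ns, peak_ips, fetch_bits


cycle_ns, peak_ips, fetch_bits = fetch_stats(375e6)
print(f"{cycle_ns:.2f} ns per cycle")
print(f"{peak_ips / 1e6:.0f} million instructions/s (peak)")
print(f"{fetch_bits}-bit fetch packet")
```

Real code rarely sustains the peak rate, since not every cycle can fill all eight instruction slots.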
Memory 6748
Features of the C6748 include 320 kB of
internal memory
(32 kB of L1P program RAM/cache,
32 kB of L1D data RAM/cache, and
256 kB of L2 RAM/cache)
OMAP Memory
The OMAP-L138 features 128 kB of on-chip RAM
shared by its C6748 and ARM9 processor cores
It also has an external memory interface
addressing 256 MB of 16-bit mDDR SDRAM
Eight functional (execution) units:
six ALUs and two multiplier units
DMA Controller
DMA Controller (C6701 DSP only) transfers data between address ranges in the
memory map without intervention by the CPU. The DMA controller has four
programmable channels and a fifth auxiliary channel.
EDMA Controller performs the same functions as the DMA controller. The EDMA has
16 programmable channels, as well as a RAM space to hold multiple configurations
for future transfers.
HPI is a parallel port through which a host processor can directly access the CPU's
memory space. The host device has ease of access because it is the master of the
interface. The host and the CPU can exchange information via internal or external
memory. In addition, the host has direct access to memory-mapped peripherals.
Power-down logic allows reduced clocking to reduce power consumption.
Most of the operating power of CMOS logic dissipates during circuit
switching from one logic state to another. By preventing some or all of the
chip's logic from switching, you can realize significant power savings
without losing any data or operational context.
DSP Features
375/456-MHz Fixed/Floating-Point Load-Store Architecture
with VLIW architecture.
10/100 Mb/s Ethernet MAC (EMAC)
USB2.0 OTG, USB1.1 OHCI interface
Two inter-integrated circuit (I2C) bus interfaces
Two multichannel audio serial ports (McASP)
Two multichannel buffered serial ports (McBSP) with
FIFO buffers
Two SPI interfaces with multiple chip selects.
Four 64-bit general-purpose timers.
DSP Features (Contd.)
C674x Two Level Cache Memory Architecture
32K-Byte L1P Program RAM/Cache
32K-Byte L1D Data RAM/Cache
256K-Byte L2 Unified Mapped RAM/Cache
Flexible RAM/Cache Partition (L1 and L2)
C6748 Floating-Point VLIW DSP Core
Load-Store Architecture With Non-Aligned VLIW
DSP Support
Supports TI's Basic Secure Boot
64 General-Purpose Registers (32 Bit)
Six ALU (32-/40-Bit) Functional Units
Supports 32-Bit Integer, SP (IEEE Single
Precision/32-Bit) and DP (IEEE Double Precision/64-Bit)
Floating Point
Supports up to Four SP Additions Per Clock, Four DP
Additions Every 2 Clocks.
C6748 Floating-Point VLIW DSP Core
(Contd.)
Two Multiply Functional Units
Mixed-Precision IEEE Floating Point Multiply Supported up to:
2 SP x SP -> SP Per Clock
2 SP x SP -> DP Every Two Clocks
2 SP x DP -> DP Every Three Clocks
2 DP x DP -> DP Every Four Clocks
Fixed Point Multiply Supports Two 32 x 32-Bit Multiplies,
Four 16 x 16-Bit Multiplies, or Eight 8 x 8-Bit Multiplies per
Clock Cycle, and Complex Multiplies
Memory Mapping
C6713 DSK Memory Map
TMS320C6713: 256 kB Internal Program / Data
C6713 DSK: 8 MB SDRAM
Address space: 0000_0000 to FFFF_FFFF
C67xx Instruction Set
Parallel Operations
Instruction word for each functional unit is 32-bits.
Instructions are fetched 8 at a time in 256 bit
packets called fetch packets.
Up to 8 instructions can be executed in parallel,
one in each functional unit.
Bit 0 of 32-bit instruction indicates if next
instruction belongs to same execute packet.
A fetch packet is not the same as an execute packet
(an execute packet need not be aligned to a fetch packet,
so it can span more than one fetch packet).
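The role of bit 0 can be sketched in Python as follows. This is a minimal illustration, assuming that bit 0 set to 1 means "the next instruction is in the same execute packet"; the function name `split_execute_packets` and the instruction words are made up, and only bit 0 of each word matters here.

```python
def split_execute_packets(fetch_packet):
    """Group the 32-bit words of one fetch packet into execute packets.

    Assumption: bit 0 == 1 means the next instruction runs in parallel
    with this one; bit 0 == 0 ends the current execute packet.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # chain ends here
            packets.append(current)
            current = []
    if current:                    # chain continues past this fetch packet
        packets.append(current)
    return packets


# Eight hypothetical instruction words; only their lowest bit is meaningful.
fetch_packet = [0x11, 0x21, 0x30, 0x40, 0x51, 0x60, 0x70, 0x80]
sizes = [len(p) for p in split_execute_packets(fetch_packet)]
print(sizes)
```

A fetch packet whose first seven words all have bit 0 set forms a single execute packet of eight instructions; one with bit 0 clear everywhere executes serially.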
Single-Cycle MAC unit
The multiplier forms each product a_i x_i; the adder adds it to the
running sum (a_{i-1} x_{i-1} + ...) held in a register, so the unit accumulates
$\sum_{i=0}^{n} a_i x_i$
Can compute a sum of n products in n cycles
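The multiply-accumulate loop can be sketched in Python as follows. The function name `mac_sum` is hypothetical, and the cycle count is a simple model that charges one cycle per product, as a single-cycle MAC unit would.

```python
def mac_sum(a, x):
    """Accumulate sum(a[i] * x[i]) and count one modelled cycle per product."""
    acc = 0       # plays the role of the accumulator register
    cycles = 0
    for ai, xi in zip(a, x):
        acc += ai * xi   # one multiply-accumulate step
        cycles += 1
    return acc, cycles


result, cycles = mac_sum([1, 2, 3], [4, 5, 6])
print(result, cycles)
```

On a processor without a MAC unit, each step would cost a separate multiply and add, so the same loop would take more cycles per product.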
Single Instruction - Multiple Data
(SIMD)
A technique for data-level parallelism by
employing a number of processing
elements working in parallel
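As a small illustration of the idea, the sketch below treats one 32-bit word as two independent 16-bit lanes and adds both lanes in one operation, similar in spirit to packed-halfword add instructions. The function name `add2` and the test values are chosen for this example only.

```python
def add2(a, b):
    """Add two 32-bit words as two independent 16-bit lanes (wrap-around per lane)."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF                  # low lane, carry discarded
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF  # high lane
    return (hi << 16) | lo


packed = add2(0x0001FFFF, 0x00010001)
plain = (0x0001FFFF + 0x00010001) & 0xFFFFFFFF
print(hex(packed), hex(plain))
```

The packed result differs from the plain 32-bit sum because the carry out of the low lane is not allowed to spill into the high lane.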
Very Long Instruction Word (VLIW)
A technique for instruction-level parallelism by executing
instructions without dependencies (known at compile-time) in parallel
Example of a single VLIW instruction:
F=a+b   c=e/g   d=x&y   w=z*h
[Figure: each operation is issued to its own processing unit (PU); for example a and b produce F, e and g produce c, x and y produce d]
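How a compiler might group independent operations into one long instruction word can be sketched as below. This is a simplified greedy, in-order packer written for illustration: the function name `bundle`, the tuple format `(dest, src1, src2)` and the `width` parameter are assumptions, and the sketch ignores which functional unit each operation needs.

```python
def bundle(ops, width=4):
    """Pack operations in program order into VLIW words.

    ops is a list of (dest, src1, src2). A new word is started when an
    operation reads or overwrites a register written earlier in the
    current word, or when the word is full.
    """
    words, current, written = [], [], set()
    for dest, s1, s2 in ops:
        depends = s1 in written or s2 in written or dest in written
        if depends or len(current) == width:
            words.append(current)
            current, written = [], set()
        current.append((dest, s1, s2))
        written.add(dest)
    if current:
        words.append(current)
    return words


independent = [("F", "a", "b"), ("c", "e", "g"), ("d", "x", "y"), ("w", "z", "h")]
dependent = [("t", "a", "b"), ("u", "t", "c")]
print(len(bundle(independent)), len(bundle(dependent)))
```

The four operations from the example fit into a single word because none of them uses a result produced by another; the second list needs two words because `u` depends on `t`.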
CISC vs. RISC vs. VLIW
Pipelining
DSPs commonly feature deep pipelines
TMS320C6x processors have 3 pipeline stages,
each with a number of phases (cycles):
Fetch
  Program Address Generate (PG)
  Program Address Send (PS)
  Program ready wait (PW)
  Program receive (PR)
Decode
  Dispatch (DP)
  Decode (DC)
Execute
  6 to 10 phases
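The effect of pipeline depth on execution time can be estimated with the sketch below. It is an idealized model that assumes a new execute packet enters the pipeline every cycle and that nothing stalls; the function name `total_cycles` and its parameters are introduced here for illustration.

```python
FETCH_PHASES = ["PG", "PS", "PW", "PR"]
DECODE_PHASES = ["DP", "DC"]


def total_cycles(n_packets, execute_phases):
    """Cycles until the last of n_packets leaves an ideal, stall-free pipeline."""
    if n_packets <= 0:
        return 0
    depth = len(FETCH_PHASES) + len(DECODE_PHASES) + execute_phases
    # The first packet needs `depth` cycles; each later one finishes a cycle after
    # its predecessor.
    return depth + (n_packets - 1)


print(total_cycles(1, 6), total_cycles(8, 1))
```

The model shows why deep pipelines favor long, regular loops: the fill cost is paid once and then amortized over many packets.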
Saturation Arithmetic
Operations such as addition and multiplication are limited
to a fixed range
Instead of wrapping around, overflow and underflow produce the
maximum and minimum allowed value, respectively
Associativity and distributivity no longer apply
One-byte signed saturation arithmetic examples:
64 + 69 = 127
-127 - 5 = -128
(64 + 70) - 25 = 102, but 64 + (70 - 25) = 109
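One-byte signed saturation can be modelled in Python as below. The helper names `sat8`, `sat_add` and `sat_sub` are made up for this sketch; a DSP performs the clamping in hardware.

```python
INT8_MIN, INT8_MAX = -128, 127


def sat8(value):
    """Clamp a value to the signed one-byte range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def sat_add(a, b):
    return sat8(a + b)


def sat_sub(a, b):
    return sat8(a - b)


print(sat_add(64, 69))               # clamps at the maximum
print(sat_sub(-127, 5))              # clamps at the minimum
print(sat_sub(sat_add(64, 70), 25))  # intermediate result saturates first
print(sat_add(64, sat_sub(70, 25)))  # no intermediate saturation
```

The last two lines group the same three numbers differently and give different results, which is why associativity cannot be assumed.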
Examples
Perform the following operations using
one-byte saturation arithmetic
0x77 + 0x99 =
0x4*0x42=
0x3*0x51=
Zero Overhead Looping
Hardware support for loops with a
constant number of iterations using
hardware loop counters and loop buffers
No branching
No loop overhead
No pipeline stalls or branch prediction
No need for loop unrolling
Hardware Circular Addressing
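A circular buffer is a data structure that implements a queue.
Requires at least 2 pointers (head and tail)
Extensively used in digital filtering:
$y[n] = a_0 x[n] + a_1 x[n-1] + \dots + a_k x[n-k]$
[Figure: circular buffer holding samples such as x[n-2] and x[n-3], with head and tail pointers]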
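A software version of the filter with a circular buffer might look like the sketch below. The class name `CircularFIR` is hypothetical, and the modulo operations stand in for the pointer wrap-around that hardware circular addressing performs without extra instructions.

```python
class CircularFIR:
    """FIR filter y[n] = a0*x[n] + a1*x[n-1] + ... + ak*x[n-k] on a circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.buf = [0.0] * len(self.coeffs)  # the last k+1 input samples
        self.head = 0                        # slot where the next sample is written

    def step(self, x):
        n = len(self.buf)
        self.buf[self.head] = x              # overwrite the oldest sample
        acc = 0.0
        idx = self.head
        for a in self.coeffs:                # walk backwards: x[n], x[n-1], ...
            acc += a * self.buf[idx]
            idx = (idx - 1) % n              # wrap-around done in software here
        self.head = (self.head + 1) % n
        return acc


fir = CircularFIR([1.0, 2.0, 3.0])
impulse_response = [fir.step(x) for x in [1.0, 0.0, 0.0, 0.0]]
print(impulse_response)
```

Feeding a unit impulse returns the coefficients in order, which is a quick check that the buffer indexing is correct. No samples are ever moved in memory; only the head index advances.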
Direct Memory Access (DMA)
The feature that allows peripherals to access
main memory without the intervention of the
CPU
Typically, the CPU initiates DMA transfer, does
other operations while the transfer is in
progress, and receives an interrupt from the
DMA controller once the operation is complete.
Can create cache coherency problems (the data
in the cache may be different from the data in
the external memory after DMA)
Requires a DMA controller
Cache memory
Separate instruction and data L1 caches
(Harvard architecture)
Cache coherence protocols required,
since most systems use DMA
DSP vs. Microcontroller
DSP                                     Microcontroller
Harvard architecture                    Mostly von Neumann architecture
VLIW/SIMD (parallel execution units)    Single execution unit
No bit-level operations                 Flexible bit-level operations
Hardware MACs                           No hardware MACs
DSP applications                        Control applications
Examples
Estimate how long the following code fragment
will take to execute on:
A general-purpose processor with a 1 GHz operating
frequency, five-stage pipelining, 5 cycles required
for multiplication and 1 cycle for addition
A DSP running at 500 MHz with zero-overhead looping,
6 independent ALUs and 2 independent single-cycle
MAC units
Review Questions
Which of the following is not a typical DSP
feature?
Dedicated multiplier/MAC
Von Neumann memory architecture
Pipelining
Saturation arithmetic
Which implementation would you choose for
lowest power consumption?
ASIC
FPGA
General-Purpose Processor
DSP
Examples
How many VLIW instructions does the following program
fragment require if there are two independent data paths
(a,b), with 3 ALUs and 1 MAC available in each and 8
instructions/word? How many cycles will it take to
execute if they are the first instructions in the program
and all instructions require 1 cycle, assuming the
pipelining architecture of slide 10 with 6 phases of
execution?
ADD a1,a2,a3 ;a3 = a1+a2
SUB b1,b3,b4 ;b4 = b1-b3
MUL a2,a3,a5 ;a5 = a2*a3
MUL b3,b4,b2 ;b2 = b3*b4
AND a7,a0,a1 ;a1 = a7 AND a0
MUL a3,a4,a5 ;a5 = a3*a4
OR a6,a3,a2 ;a2 = a6 OR a3