DSP Unit-5 Final
UNIT V
Digital Signal Processors: Introduction to programmable DSP processors – Von-Neumann architecture- Harvard
architecture- VLIW architecture – MAC unit- pipelining.- Special addressing modes in P-DSPs- On chip
peripherals, PDSPs with RISC and CISC- Architecture and addressing modes of TMS320C50 and TMS320C6X.
Digital signal processing (DSP) is the technique of performing mathematical operations on signals in the digital domain. Digital Signal
Processors (DSPs) are microprocessors with the following characteristics:
a) Real-time digital signal processing capabilities. DSPs typically have to process data in real time, i.e.,
the correctness of the operation depends heavily on the time when the data
processing is completed.
b) High throughput. DSPs can sustain processing of high-speed streaming data, such as audio and
multimedia data processing.
c) Deterministic operation. The execution time of DSP programs can be foreseen accurately, thus
guaranteeing a repeatable, desired performance.
d) Re-programmability by software. Different system behaviour might be obtained by
re-coding the algorithm executed by the DSP instead of by hardware modifications.
DSPs appeared on the market in the early 1980s. Over the last 15 years they have been the key enabling technology
for many electronics products in fields such as communication systems, multimedia, automotive, instrumentation
and military. Fig. 1 gives an overview of the evolution of DSP features together with the first year of marketing for
some DSP families.
Fig. 1. Evolution of DSP features from their early days until now.
unit-5 DSP Processors
Table 1 gives an overview of some of these fields and of the corresponding typical DSP applications.
Programmable digital signal processors (PDSPs) are general-purpose microprocessors designed specifically for
digital signal processing (DSP) applications. They contain special instructions and special architecture supports so as
to execute computation-intensive DSP algorithms more efficiently. PDSPs are designed mainly for embedded DSP
applications. As such, the user may never realize the existence of a PDSP in an information appliance. Important
applications of PDSPs include
modem
hard drive controller
cellular phone data pump
set-top box, etc.
PDSPs fall between the general-purpose microprocessor and the custom-designed, dedicated chip set. The former has
the advantage of ease of programming and development, but it often suffers from disappointing performance in DSP
applications because of overheads incurred in both the architecture and the instruction set. Dedicated chip sets, on
the other hand, lack programming flexibility, and the time-to-market delay due to chip development may be longer
than the time needed to code a program for a programmable device.
A programmable DSP device should provide instructions similar to a conventional microprocessor. The instruction
set of a typical DSP device should include the following,
Each computational block of the DSP should be optimized for functionality and speed and in the meanwhile the
design should be sufficiently general so that it can be easily integrated with other blocks to implement overall DSP
systems.
o Multipliers
o Parallel Multipliers
o Multipliers for Signed Numbers
o Speed
o Bus Widths
o Shifters
o Barrel Shifters
1.3 Multiply and Accumulate Unit
o Overflow and Underflow
Shifters
Guard bits
Saturation Logic
1.4 Arithmetic and Logic Unit
o Status Flags
o Overflow Management
o Register File
1.5 Bus Architecture and Memory
o On-chip Memories
Speed
Size
o Organization of On-chip Memories
1.6 Data Addressing Capabilities
o Immediate Addressing Mode
o Register Addressing Mode
o Direct Addressing Mode
o Indirect Addressing Mode
1.7 Special Addressing Modes
o Circular Addressing Mode
o Bit Reversed Addressing Mode
1.8 Address Generation Unit
1.9 Programmability and program Execution
o Program Control
o Program Sequencer
1. Introduction
DSP architecture has been shaped by the requirements of predictable and accurate real-time digital signal
processing. An example is the Finite Impulse Response (FIR) filter, with the corresponding mathematical equation
(1), where y is the filter output, x is the input data and a is a vector of filter coefficients. Depending on the
application, there might be just a few filter coefficients or many hundreds or more.
y(n) = \sum_{k=0}^{M-1} a_k \, x(n-k)        (1)
Here M denotes the number of filter coefficients.
As shown in Eq. (1), the main component of a filter algorithm is the ‘multiply and accumulate’
operation, typically referred to as MAC. Coefficient data has to be retrieved from the memory and the whole
operation must be executed in a predictable and fast way, so as to sustain a high throughput rate. Finally, high
accuracy should typically be guaranteed. These requirements are common to many other algorithms performed in
digital signal processing, such as Infinite Impulse Response (IIR) filters and Fourier Transforms. Table 2 shows a
selection of processing requirements together with the main DSP hardware features satisfying them.
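The MAC structure of Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the function name `fir_filter` is hypothetical, and the input is assumed to be zero before the first sample.

```python
def fir_filter(x, a):
    """Direct-form FIR: y[n] = sum over k of a[k] * x[n - k].

    Assumption: x[n] = 0 for n < 0 (zero initial conditions).
    """
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator, cleared before each output sample
        for k in range(len(a)):
            if n - k >= 0:
                acc += a[k] * x[n - k]  # one multiply-accumulate (MAC)
        y.append(acc)
    return y


# Two coefficients equal to 1 give a running sum of two samples.
print(fir_filter([1, 2, 3, 4], [1, 1]))  # [1, 3, 5, 7]
```

Each output sample costs one MAC per coefficient, which is why the hardware features discussed next concentrate on making that single operation fast.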
Table 2. Main requirements and corresponding DSP hardware implementations for predictable and accurate
real-time digital signal processing.
2. Fast computation: MAC-centred architecture, pipelining, parallel architectures (VLIW, SIMD)
• Traditional general-purpose microprocessors are based upon the Von Neumann architecture.
• Disadv:
– only one memory access per instruction cycle is possible
Super-Harvard architecture
• The Harvard architecture can be improved by adding to the DSP core a small bank of fast memory, called
‘instruction cache’, and allowing data to be stored in the program memory.
• The last-executed program instructions are relocated at run time in the instruction cache.
• Recent improvement of the Harvard architecture is the presence of a ‘data cache’, namely a fast memory
located close to the DSP core which is dynamically loaded with data.
• The L1 cache comprises 8 kbyte of memory divided into 4 kbyte of program cache and 4 kbyte of data
cache.
• The L2 cache comprises 256 kbyte of memory divided into 192 kbyte mapped-SRAM memory and 64
kbyte dual cache memory.
• The latter can be configured as mapped memory, cache or a combination of the two.
• Adv:
• the fact of having the cache memory very close to the DSP allows clocking it at high speed, as
routing wire delays are short.
• cache memories improve the average system performance.
• Drawback: cache hits are not fully predictable.
• A cache miss occurs, for instance, when the program flow changes because of a branch instruction.
• A hierarchical memory allows one to take advantage of both the speed and the capacity of different
memory types.
– Registers are banks of very fast internal memory, typically with single-cycle access time. They are
a precious DSP resource used for temporary storage of coefficients and intermediate processing
values.
– The L1 cache is typically high-speed static RAM whose cells are made of five or six transistors each.
The amount of L1 cache available thus depends directly on the available chip space.
– An L2 cache typically needs a smaller number of transistors and can hence be present in larger
quantities inside the DSP. Recent years have also seen the integration of DRAM memory blocks
into the DSP chip, thus guaranteeing larger internal memories with relatively short access times.
– The Level 3 (L3) memory is rarely present in DSPs while the external memory is typically
available. This is often a large memory with long access times.
• In doing so the DMA controller frees the DSP core for other processing tasks.
3. Fast computation
• MAC-centred
• Pipelining
• Parallel architectures (VLIW, SIMD)
3.1. MAC-centred
The basic DSP arithmetic processing blocks are
a) many registers;
b) one or more multipliers;
c) one or more Arithmetic Logic Units (ALUs);
d) one or more shifters.
These blocks work in parallel during the same clock cycle thus optimizing MAC as well as other
arithmetic operations.
a) Registers:
– these are banks of very fast memory used to store intermediate processing results. Very often they
are wider than the normal DSP word width, so as to provide a higher resolution during the
processing.
b) Multiplier:
– it can carry out single-cycle multiplications and very often it includes very wide accumulator
registers to reduce round-off or truncation errors.
– As a consequence, truncation and round-off errors will happen only at the end of the data
processing, when the data is stored onto memory.
– Sometimes an adder is integrated in the multiplier unit.
c) ALU:
- it carries out arithmetic and logical operations.
d) Shifters:
- a shifter shifts the input value by one or more bits, left or right. When it can shift by several bits
in a single cycle, it is called a barrel shifter; this is especially useful in the implementation of
floating point add and subtract operations.
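The benefit of a wide accumulator can be illustrated with a small fixed-point sketch. It assumes 16-bit operands in Q15 format (value = integer / 32768); the function names `mac_wide` and `mac_narrow` are hypothetical and do not correspond to any particular DSP.

```python
def mac_wide(xs, cs):
    """Accumulate full-precision products; truncate once when storing."""
    acc = 0
    for x, c in zip(xs, cs):
        acc += x * c          # full Q30 product kept in the wide accumulator
    return acc >> 15          # single truncation back to Q15 at the end


def mac_narrow(xs, cs):
    """Truncate every product to Q15 before accumulating."""
    acc = 0
    for x, c in zip(xs, cs):
        acc += (x * c) >> 15  # low-order bits are discarded at every step
    return acc


samples = [1, 1, 1, 1]        # four very small Q15 samples
coeffs = [16384] * 4          # 0.5 in Q15
print(mac_wide(samples, coeffs), mac_narrow(samples, coeffs))  # 2 0
```

With the wide accumulator the small contributions add up to a non-zero result, whereas truncating after each multiplication loses all of them.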
1. Fetch. The DSP calculates the address of the next instruction to execute and retrieves the op-code, i.e., the binary
word containing the operands and the operation to be carried out on them.
2. Decode. The op-code is interpreted and sent to the corresponding functional unit, and the operands are
retrieved.
3. Execute. The instruction is executed and the results are written to the registers.
Instruction execution and processing time gain of a pipelined CPU (plot b) with respect to
a non-pipelined one (plot a)
• A pipeline is called fully-loaded if all stages are executed at the same time; this corresponds to the
maximum possible instruction throughput.
• The depth of the pipeline, i.e., the number of stages into which an instruction is divided, can vary from one
processor to another.
• Generally speaking a deeper pipeline allows the processor to execute faster, hence many processors sub-
divide pipeline stages into smaller steps, each one executed at each clock cycle.
• The smaller the step, the faster the processor clock speed can be.
• An example of deep pipeline is the TI TMS320C6713 DSP, which includes four fetch stages, two decode
stages, and up to ten execution stages.
• Drawback:
– Hardware and programming complexity required
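The processing-time gain of a pipeline can be estimated with a simple cycle count. The sketch below assumes that every stage takes exactly one clock cycle and that the pipeline never stalls; both function names are hypothetical.

```python
def cycles_non_pipelined(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """The first instruction fills the pipeline; afterwards one instruction
    completes per cycle (fully-loaded pipeline, no stalls assumed)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


# 100 instructions on the three-stage fetch/decode/execute pipeline:
print(cycles_non_pipelined(100, 3), cycles_pipelined(100, 3))  # 300 102
```

For long instruction sequences the gain approaches the number of stages, which is why deeper pipelines allow higher throughput as long as they stay fully loaded.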
3.3 Parallel architectures
• The DSP performance can be increased by an increased parallelism in the instructions execution.
• Parallel-enhanced DSP architectures started to appear on the market in the mid 1990s and were based on
o Very Long Instruction Word (VLIW), instruction-level parallelism
o Single-Instruction Multiple-Data (SIMD), data-level parallelism
o or a combination of both.
• VLIW architectures are based upon instruction level parallelism, i.e., many instructions are issued at the
same time and are executed in parallel by multiple execution units.
• As a consequence, DSPs based on this architecture are also called ‘multi-issue’ DSP.
• This is an innovative architecture that was first used in the TI TMS320C62xx DSP family.
• In this family, eight 32-bit instructions are packed together in a 256-bit wide instruction which is fed to
eight separate execution units.
• Characteristics of VLIW architectures include simple and regular instruction sets.
• Instruction scheduling is done at compile-time and not at run-time so as to guarantee a deterministic
behaviour.
• Adv:
• it can increase the DSP performance for a wide range of algorithms.
• Additionally, the architecture is potentially scalable, i.e., more execution units could be added to
allow a higher number of instructions to be executed in parallel.
• Disadv:
• high memory use and power consumption required by this architecture.
• From a programmer’s viewpoint, writing assembly code for VLIW architecture is very complex
and the optimization is often better left to the compiler.
SIMD architecture
o only one instruction is issued at a time but the same operation specified by the instruction is performed on
multiple data sets.
• In a typical example, two 32-bit input registers provide four 16-bit data inputs.
• They are processed in parallel by two separate execution units that carry out the same operation.
• Adv:
– SIMD can be combined with other architectures; an example is the ADI TigerSHARC DSP, which comprises
both VLIW and SIMD characteristics.
• Drawbacks:
– not useful for algorithms that process data serially or that contain tight feedback loops.
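The SIMD idea of one instruction acting on several packed data items can be sketched as follows. The example assumes two 16-bit lanes per 32-bit register and wrap-around (modulo 2^16) arithmetic in each lane; the helper names `pack`, `unpack` and `simd_add16` are hypothetical.

```python
MASK16 = 0xFFFF


def pack(hi, lo):
    """Pack two 16-bit values into one 32-bit register image."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack(reg):
    """Split a 32-bit register image into its (high, low) 16-bit lanes."""
    return (reg >> 16) & MASK16, reg & MASK16


def simd_add16(reg_a, reg_b):
    """One 'instruction': the same addition is applied to both lanes.
    Each lane wraps modulo 2**16; no carry crosses between lanes."""
    a_hi, a_lo = unpack(reg_a)
    b_hi, b_lo = unpack(reg_b)
    return pack(a_hi + b_hi, a_lo + b_lo)


result = simd_add16(pack(1, 2), pack(10, 20))
print(unpack(result))  # (11, 22)
```

A single call processes four 16-bit inputs, which mirrors the two-register, two-execution-unit arrangement described above.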
4. Numerical fidelity
• Arithmetic operations such as additions and multiplications are the heart of DSP systems.
• It is thus essential that the numerical fidelity be maximized, i.e., that errors due to the finite number of bits
used in the number representation and in the arithmetic operations be minimized.
• DSPs have many ways to obtain this, ranging from the numeric representation to dedicated hardware
features.
5. Fast-execution control
• Two important examples show how DSPs can execute control instructions quickly.
• The first example is the zero-overhead hardware loop and refers to the program flow control in loops.
• The second example refers to how DSPs react to interrupts.
1. zero-overhead hardware loop
TMS320C5x Key Features
- Power
    - 3.3-V and 5-V static CMOS technology with two power-down modes
    - Power consumption control with IDLE1 and IDLE2 instructions for
      power-down modes
- Memory
- Program control
- Instruction set
- On-chip peripherals
- Test/Emulation
- Packages
Chapter 2
Architectural Overview
All ’C5x DSPs have the same CPU structure; however, they have different
on-chip memory configurations and on-chip peripherals.
On-chip memory of the ’C5x devices, as given in the functional block diagram:

Device    Program ROM    Data/Program SARAM
’C50      2K             9K
’C51      8K             1K
’C52      4K             none
’C53      16K            3K
’LC56     32K            6K
’C57S     2K             6K
’LC57     32K            6K

DARAM blocks: B0 (512 × 16), B1 (512 × 16), B2 (32 × 16).

[Figure: ’C5x functional block diagram. Blocks shown: program controller (program counter, status/control
registers, hardware stack, address generation logic, instruction register); CPU with CALU (multiplier,
accumulator, ACC buffer, shifters, arithmetic logic unit), parallel logic unit (PLU), auxiliary register
arithmetic unit (ARAU) and memory-mapped registers; memory control, multiprocessing, interrupts,
initialization and oscillator/timer; peripherals (serial port 1, serial port 2, TDM serial port, buffered
serial port, timer, host port interface, test/emulation); program bus and data bus.]
Bus Structure
The program address bus (PAB) provides addresses to program memory space for both reads and
writes. The program bus (PB) carries the instruction code and immediate operands from
program memory space to the CPU. The data bus (DB) interconnects various elements
of the CPU to data memory space. The program and data buses can work
together to transfer data from on-chip data memory and internal or external
program memory to the multiplier for single-cycle multiply/accumulate operations.
The ’C5x CPU maintains source-code compatibility with the ’C1x and ’C2x
generations while achieving high performance and greater versatility. Im-
provements include a 32-bit accumulator buffer, additional scaling capabili-
ties, and a host of new instructions. The instruction set exploits the additional
hardware features and is flexible in a wide range of applications. Data man-
agement has been improved through the use of new block move instructions
and memory-mapped register instructions. See Chapter 3, Central Processing
Unit (CPU).
For information on the CALU, see Section 3.2, Central Arithmetic Logic Unit
(CALU), on page 3-7.
The program controller consists of the following elements:
- Program counter
- Status and control registers
- Hardware stack
- Address generation logic
- Instruction register
The ’C5x has a total address range of 224K words × 16 bits. The memory
space is divided into four individually selectable memory segments: 64K-word
program memory space, 64K-word local data memory space, 64K-word input/
output ports, and 32K-word global data memory space. For information on the
memory organization, see Chapter 8, Memory.
The on-chip ROM may be configured with or without boot loader code. Howev-
er, the on-chip ROM is intended for your specific program. Once the program
is in its final form, you can submit the ROM code to Texas Instruments for
implementation into your device. For details on how to submit code to Texas
Instruments to program your ROM, see Appendix F, Submitting ROM Codes
to TI.
On-Chip Memory
- 544 words × 16 bits configured as data memory and 512 words × 16 bits
configured as program memory
DARAM improves the operational speed of the ’C5x CPU. The CPU operates
with a 4-deep pipeline. In this pipeline, the CPU reads data on the third stage
and writes data on the fourth stage. Hence, for a given instruction sequence,
the second instruction could be reading data at the same time the first instruc-
tion is writing data. The dual data buses (DB and DAB) allow the CPU to read
from and write to DARAM in the same machine cycle. For information on
DARAM, see Section 8.3, Local Data Memory, on page 8-15.
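The overlap between the write of one instruction and the read of the next can be checked with a small timing sketch. It assumes that one instruction enters the 4-deep pipeline per machine cycle and that no stalls occur; the function names are hypothetical.

```python
READ_STAGE = 3    # data is read in the third pipeline stage
WRITE_STAGE = 4   # data is written in the fourth pipeline stage


def cycle_of(instruction_index, stage):
    """Machine cycle (0-based) in which an instruction occupies a stage,
    assuming one instruction enters the pipeline per cycle, no stalls."""
    return instruction_index + (stage - 1)


def simultaneous_accesses(n_instructions):
    """Pairs (writer, reader) whose write and read fall in the same cycle."""
    return [(w, r)
            for w in range(n_instructions)
            for r in range(n_instructions)
            if cycle_of(w, WRITE_STAGE) == cycle_of(r, READ_STAGE)]


print(simultaneous_accesses(3))  # [(0, 1), (1, 2)]
```

Every instruction's write coincides with the read of its successor, so a memory that can serve one read and one write per machine cycle avoids a stall on each instruction.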
The SARAM is divided into 1K- and/or 2K-word blocks contiguous in address
memory space. All ’C5x CPUs support parallel accesses to these SARAM
blocks. However, one SARAM block can be accessed only once per machine
cycle. In other words, the CPU can read from or write to one SARAM block
while accessing another SARAM block. When the CPU requests multiple
accesses, the SARAM schedules the accesses by providing a not-ready
condition to the CPU and executing the multiple accesses one cycle at a time.
SARAM supports more flexible address mapping than DARAM because
SARAM can be mapped to both program and data memory space simulta-
neously. However, because of simultaneous program and data mapping, an
instruction fetch and data fetch that could be performed in one machine cycle
with DARAM may take two machine cycles with SARAM. For information on
SARAM, see Section 8.3, Local Data Memory, on page 8-15.
All ’C5x DSPs have the same CPU structure; however, they have different on-
chip peripherals connected to their CPUs. The ’C5x DSP on-chip peripherals
available are:
- Clock generator
- Hardware timer
- Software-programmable wait-state generators
- Parallel I/O ports
- Host port interface (HPI)
- Serial port
- Buffered serial port (BSP)
- Time-division multiplexed (TDM) serial port
- User-maskable interrupts
A total of 64K I/O ports are available; sixteen of these ports are
memory-mapped in data memory space. Each of the I/O ports can be addressed
by the IN or the OUT instruction. The memory-mapped I/O ports can
be accessed with any instruction that reads from or writes to data memory. The
IS signal indicates a read or write operation through an I/O port. The ’C5x can
easily interface with external I/O devices through the I/O ports while requiring
minimal off-chip address decoding circuits. For information, see Section 9.6,
Parallel I/O Ports, on page 9-22.
Table 2–1 lists the number and type of parallel ports available in ’C5x DSPs
with various package types.
The HPI available on the ’C57S and ’LC57 is an 8-bit parallel I/O port that pro-
vides an interface to a host processor. Information is exchanged between the
DSP and the host processor through on-chip memory that is accessible to both
the host processor and the ’C57. For information, see Section 9.10, Host Port
Interface, on page 9-87.
Table 2–1. Number of Serial/Parallel Ports Available in Different ’C5x Package Types

TMS320 Device    Package ID†    High-Speed     TDM            Buffered       Host Port
                                Serial Port    Serial Port    Serial Port    (Parallel)
’C50/’LC50       PQ             1              1              –              –
’C51/’LC51       PQ/PZ          1              1              –              –
’C52/’LC52       PJ/PZ          1              –              –              –
’C53/’LC53       PQ             1              1              –              –
’C53S/’LC53S     PZ             2              –              –              –
’LC56            PZ             1              –              1              –
’C57S/’LC57S     PGE            1              –              1              1
’LC57            PBK            1              –              1              1
Central Arithmetic Logic Unit (CALU)
The four product shift modes (PM) at the PREG output are useful for perform-
ing multiply/accumulate operations and fractional arithmetic and for justifying
fractional products. The PM field of status register ST1 specifies the PM shift
mode of the p-scaler:
- If PM = 00₂, the PREG 32-bit output is not shifted when transferred into the
ALU or stored.
- If PM = 01₂, the PREG output is left-shifted 1 bit when transferred into the
ALU or stored, and the LSB is zero filled. This shift mode compensates for
the extra sign bit gained when multiplying two 16-bit 2s-complement numbers.
- If PM = 10₂, the PREG output is left-shifted 4 bits when transferred into the
ALU or stored, and the 4 LSBs are zero filled. This shift mode is used in
conjunction with the MPY instruction with a short immediate value (13 bits
or less) to eliminate the four extra sign bits gained when multiplying a 16-bit
number times a 13-bit number.
- If PM = 11₂, the PREG output is right-shifted 6 bits and sign extended when
transferred into the ALU or stored, and the 6 LSBs are lost. This shift mode
allows up to 128 consecutive multiply/accumulate operations without overflow.

[Figure 3–2: block diagram of the CALU. Blocks shown: TREG0, TREG1(5), multiplier, PREG(32), p-scaler
(–6, 0, 1, 4), prescalers SFL(0–16) and SFR(0–16), multiplexers, ALU(32), carry bit C(1) of ST1,
ACCH and ACCL, ACCB(32), postscaler (0–7), program bus and data bus.]
Notes: All registers and data lines are 16 bits wide unless otherwise specified.
The PM shifts also occur when the PREG contents are stored to data memory.
The PREG contents remain unchanged during the shifts.
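The effect of the product shift modes can be sketched in Python. The left shifts of 0, 1 and 4 bits follow the text; the right shift of 6 bits for PM = 11 is taken from the (–6, 0, 1, 4) label of the p-scaler and should be read as an assumption about the fourth mode. The function names are hypothetical.

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a 2s-complement number."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def p_scaler(preg, pm):
    """Return the shifted copy of a signed 32-bit PREG value.
    PREG itself is not modified by the shift."""
    if pm == 0b00:
        return to_signed32(preg)        # no shift
    if pm == 0b01:
        return to_signed32(preg << 1)   # removes the extra sign bit
    if pm == 0b10:
        return to_signed32(preg << 4)   # removes four extra sign bits
    if pm == 0b11:
        return preg >> 6                # assumed: arithmetic right shift by 6
    raise ValueError("PM is a 2-bit field")


# 0.5 * 0.5 in Q15: 4000h * 4000h = 1000 0000h (Q30 product).
product = 0x4000 * 0x4000
print(hex(p_scaler(product, 0b01) >> 16))  # 0x2000, i.e. 0.25 in Q15
```

The 1-bit left shift turns the Q30 product into Q31, so the upper 16 bits can be stored directly as a Q15 result.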
The LT (load TREG0) instruction loads TREG0, from the data bus, with the first
operand; the MPY instruction provides the second operand for multiplication
operations. To perform a multiplication with a short or long immediate operand,
use the MPY instruction with an immediate operand. A product can be obtained
every two cycles except when a long immediate operand is used.
For example, consider multiplying the row of one matrix times the column of
a second matrix: there are 10 × 10 matrices, MTRX1 points to the beginning
of the first matrix, INDX = 10, and the current AR points to the beginning of the
second matrix:
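The assembly listing for this example is not reproduced in these notes. As a rough Python sketch of the same addressing pattern, assume both matrices are stored row by row in flat arrays, so that stepping along a row advances the address by 1 and stepping down a column advances it by INDX. The function name and the array arguments are hypothetical.

```python
INDX = 10  # row length; used as the address step between column elements


def row_times_column(mtrx1, row_start, mtrx2, col_start, n=INDX):
    """Sum of products of one row of mtrx1 and one column of mtrx2.
    Both matrices are flat lists stored row by row (assumption)."""
    acc = 0
    for i in range(n):
        # row element: address step 1; column element: address step INDX
        acc += mtrx1[row_start + i] * mtrx2[col_start + i * INDX]
    return acc


first = list(range(100))                                 # 10 x 10 test matrix
identity = [1 if r == c else 0 for r in range(10) for c in range(10)]
print(row_times_column(first, 20, identity, 3))          # 23
```

Multiplying row 2 of the test matrix by column 3 of the identity matrix simply picks out element (2, 3), which makes the stride logic easy to verify.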
The MAC and MACD instructions obtain their coefficient pointer from a long
immediate address and are, therefore, 2-word instructions. The MADS and
MADD instructions obtain their coefficient pointer from the BMAR and are,
therefore, 1-word instructions. When you use the BMAR as a source to the co-
efficient table, one block of code can support multiple applications, and you
can change the long immediate address without modifying executable code.
The MACD and MADD instructions include a data move (DMOV) operation
that, in conjunction with the fetch of the data multiplicand, writes the data value
to the next higher data address.
The MACD and MADD instructions, when repeated, support filter constructs
(weighted running averages) so that as the sum-of-products operation is ex-
ecuted, the sample data is shifted in memory to make room for the next sample
and to throw away the oldest sample. Circular addressing with MAC and
MADS instructions can also be used to support filter implementation.
In the next example, the current AR points to the oldest of the samples; BMAR
points to the coefficient table. In addition to initiating the repeat operation, the
RPTZ instruction also clears the ACC and the PREG. In this example, the PC
is stored in a temporary register while the repeated operation is executed.
Next, the PC is loaded with the value stored in BMAR. The program bus is used
to address the coefficients and, as the MADD instruction is repeatedly ex-
ecuted, the PC increments to step through the coefficient table. The ARAU
generates the address of the sample data.
Indirect addressing with decrement steps through the sample data, starting
with the oldest data. As the data is fetched, it is also written to the next higher
location in data memory. This operation aligns the data for the next execution
of the filter by moving the oldest sample out past the end of the sample array
and making room for the new sample at the beginning of the sample array. The
previous product of the PREG is added to the ACC, while the two fetched val-
ues are multiplied and the new product value is loaded into the PREG. Note
that the DMOV portion of the MACD and MADD instructions does not function
with external data memory addresses.
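The behaviour described above (multiply, accumulate the previous product, and move each sample to the next higher address) can be modelled in Python. This is a sketch of the described data flow, not the original assembly example; the function name `fir_with_data_move` is hypothetical, and the final addition of the last product after the loop is an assumption, since the text does not show how the last PREG value reaches the ACC.

```python
def fir_with_data_move(samples, coeffs):
    """samples[0] is the newest sample, samples[len(coeffs) - 1] the oldest,
    and samples[len(coeffs)] is a spare location past the end of the array.
    coeffs[0] multiplies the oldest sample."""
    acc, preg = 0, 0                       # the repeat setup clears ACC and PREG
    addr = len(coeffs) - 1                 # current AR points to the oldest sample
    for c in coeffs:                       # coefficient pointer increments
        acc += preg                        # previous product is added to ACC
        preg = c * samples[addr]           # new product is loaded into PREG
        samples[addr + 1] = samples[addr]  # data move to next higher address
        addr -= 1                          # indirect addressing with decrement
    acc += preg                            # assumed final accumulation
    return acc


buffer = [3, 2, 1, 0]                      # newest = 3, oldest = 1, spare slot
print(fir_with_data_move(buffer, [10, 20, 30]))  # 10*1 + 20*2 + 30*3 = 140
print(buffer)                              # [3, 3, 2, 1]
```

After the call, every sample has moved up by one location, so the next input sample can be written at index 0 before the filter runs again.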
After the multiplication of two 16-bit numbers, this 32-bit product is loaded into
PREG. The product from the PREG can be transferred to the ALU or to data
memory via the store product high (SPH) and store product low (SPL) instruc-
tions.
The 32-bit general-purpose ALU and ACC implement a wide range of arithme-
tic and logical functions, the majority of which execute in a single clock cycle.
Once an operation is performed in the ALU, the result is transferred to the
ACC, where additional operations, such as shifting, can occur. Data that is in-
put to the ALU can be scaled by the prescaler.
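A typical ALU instruction is carried out in three steps: 1) data is fetched from
memory on the data bus, 2) data is passed through the prescaler and the ALU,
where the arithmetic is performed, and 3) the result is moved into the ACC.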
The ALU operates on 16-bit words taken from data memory or derived from
immediate instructions. In addition to the usual arithmetic instructions, the ALU
can perform Boolean operations, thereby facilitating the bit manipulation abil-
ity required of a high-speed controller. One input to the ALU is always supplied
by the ACC. The other input can be transferred from the PREG of the multiplier,
the ACCB, or the output of the prescaler (that has been read from data memory
or from the ACC). After the ALU has performed the arithmetic or logical opera-
tion, the result is stored in the ACC. For the following example, assume that
ACC = 0, PREG = 0022 2200h, PM = 00₂, and ACCB = 0033 3300h:
The 32-bit ACC can be split into two 16-bit segments (ACCH and ACCL) for
storage in data memory (see Figure 3–2). A postscaler at the output of the
ACC provides a left shift of 0 to 7 places. This shift is performed while the data
is being transferred to the data bus for storage. The contents of the ACC re-
main unchanged. When the postscaler is used on the high word of the ACC
(bits 16 – 31), the MSBs are lost and the LSBs are filled with bits shifted in from
the low word (bits 0 – 15). When the postscaler is used on the low word, the
LSBs are zero filled. For the following example, assume that
ACC = FF23 4567h:
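The original listing for this example is not reproduced here. The postscaler behaviour described above can nevertheless be sketched in Python; the function names are hypothetical, and the shift count of 7 is only an illustrative choice.

```python
def store_high(acc, shift):
    """Value placed on the data bus when the high word is stored with a
    left shift of 0 to 7: MSBs are lost, LSBs come from the low word."""
    return ((acc << shift) & 0xFFFFFFFF) >> 16


def store_low(acc, shift):
    """Value placed on the data bus when the low word is stored with a
    left shift of 0 to 7: the LSBs are zero filled."""
    return (acc << shift) & 0xFFFF


acc = 0xFF234567
print(hex(store_high(acc, 0)), hex(store_low(acc, 0)))  # 0xff23 0x4567
print(hex(store_high(acc, 7)), hex(store_low(acc, 7)))  # 0x91a2 0xb380
print(hex(acc))                                         # ACC unchanged
```

Because the shift is applied only to the value on its way to memory, the accumulator still holds FF23 4567h after both stores.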
The single-cycle 1-bit to 16-bit right shift of the ACC can efficiently align its con-
tents. This shift, coupled with the 32-bit temporary buffer on the ACC, en-
hances the effectiveness of the CALU in extended-precision arithmetic. The
ACCB provides a temporary storage place for a fast save of the ACC. The
ACCB can also be used as an input to the ALU. The minimum or maximum
value in a string of numbers can be found by comparing the contents of the
ACCB with the contents of the ACC. The minimum or maximum value is placed
in both registers, and, if the condition is met, the carry bit (C) is set. The mini-
mum and maximum functions are executed by the CRLT and CRGT instruc-
tions, respectively. These operations are signed arithmetic operations. In the
next example, assume that ACC = 1234 5678h and ACCB = 7654 3210h:
CRLT ;ACC = ACCB = 1234 5678h. C = 1.
CRGT ;ACC = ACCB = 7654 3210h. C = 0.
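The two comparisons can be modelled as follows, treating each of the two lines above as an independent example that starts from the same initial register values. The function names mirror the instruction mnemonics, but the code is only a sketch of the described behaviour on signed Python integers.

```python
def crlt(acc, accb):
    """Minimum search: returns (ACC, ACCB, C).
    C = 1 when ACC < ACCB; the smaller value ends up in both registers."""
    if acc < accb:
        return acc, acc, 1
    return accb, accb, 0


def crgt(acc, accb):
    """Maximum search: returns (ACC, ACCB, C).
    C = 1 when ACC > ACCB; the larger value ends up in both registers."""
    if acc > accb:
        return acc, acc, 1
    return accb, accb, 0


acc, accb = 0x12345678, 0x76543210
print([hex(v) for v in crlt(acc, accb)])  # ['0x12345678', '0x12345678', '0x1']
print([hex(v) for v in crgt(acc, accb)])  # ['0x76543210', '0x76543210', '0x0']
```

Scanning a list with repeated calls to one of these functions leaves the running minimum or maximum in both registers, with C indicating whether the latest value replaced the previous extreme.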
The ACC overflow saturation mode can be enabled by setting and disabled by
clearing the overflow mode (OVM) bit of ST0. When the ACC is in the overflow
saturation mode and an overflow occurs, the overflow flag is set and the ACC
is loaded with either the most positive or the most negative value represent-
able in the ACC, depending upon the direction of the overflow. The value of
the ACC upon saturation is 7FFF FFFFh (positive) or 8000 0000h (negative).
If the OVM bit is cleared and an overflow occurs, the overflowed results are
loaded into the ACC without modification. Note that logical operations cannot
result in overflow.
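A minimal model of the two overflow behaviours is sketched below. The function name `acc_add` is hypothetical, and setting the overflow flag when OVM is cleared is an assumption, since the text only states it for the saturation mode.

```python
ACC_MAX = 0x7FFFFFFF     # most positive 32-bit value
ACC_MIN = -0x80000000    # most negative 32-bit value (8000 0000h)


def acc_add(acc, value, ovm):
    """Signed 32-bit accumulation; returns (new_acc, overflow_flag)."""
    total = acc + value
    if ACC_MIN <= total <= ACC_MAX:
        return total, 0
    if ovm:
        # saturation mode: clamp in the direction of the overflow
        return (ACC_MAX if total > 0 else ACC_MIN), 1
    # OVM cleared: the overflowed (wrapped) result is kept unmodified;
    # flagging the overflow here is an assumption
    wrapped = (total + 0x80000000) % (1 << 32) - 0x80000000
    return wrapped, 1


print(acc_add(0x7FFFFFFF, 1, ovm=1))  # (2147483647, 1)  -> 7FFF FFFFh
print(acc_add(0x7FFFFFFF, 1, ovm=0))  # (-2147483648, 1) -> 8000 0000h
```

Saturation keeps an overflowing filter output at the nearest representable extreme instead of letting it jump to the opposite sign.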
The ’C5x can execute a variety of branch instructions that depend on the status
of the ALU and the ACC. For example, execution of the instruction BCND can
depend on a variety of conditions in the ALU and the ACC. The BACC instruc-
tion allows branching to an address stored in the ACC. The bit test instructions
(BITT and BIT) facilitate branching on the condition of a specified bit in data
memory.
The ACC has an associated carry bit that is set or cleared, depending on vari-
ous operations within the ’C5x. The carry bit allows more efficient computation
of extended-precision products and additions or subtractions; it is also useful
in overflow management. The carry bit is affected by most arithmetic instruc-
tions as well as the single-bit shift and rotate instructions. The carry bit is not
affected by loading the ACC, logical operations, or other nonarithmetic or con-
trol instructions. Examples of carry bit operations are shown in Figure 3–3.
The value added to or subtracted from the ACC can come from the prescaler,
ACCB, or PREG. The carry bit is set if the result of an addition or accumulation
process generates a carry; it is cleared if the result of a subtraction generates
a borrow. Otherwise, it is cleared after an addition or set after a subtraction.
The add to ACC with carry (ADDC) and add ACCB to ACC with carry (ADCB)
instructions use the previous value of carry in their addition operation. The
subtract from ACC with borrow (SUBB) and subtract ACCB from ACC with bor-
row (SBBB) instructions use the logical inversion of the previous value of carry.
The one exception to the operation of the carry bit is in the use of ADD with
a shift count of 16 (add to ACCH) and SUB with a shift count of 16 (subtract
from ACCH). These instructions can generate a carry or a borrow, but they will
not clear a carry or borrow, as is normally the case if a carry or borrow is not
generated. This feature is useful for extended-precision arithmetic.
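The role of the carry bit in extended-precision arithmetic can be illustrated with a 64-bit addition built from two 32-bit additions. The sketch uses unsigned 32-bit words and hypothetical function names; it models only the general add and add-with-carry behaviour, not the special case of a shift count of 16.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, carry_in=0):
    """32-bit addition; returns (result, carry_out).
    carry_out is 1 if a carry is generated and 0 otherwise."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit addition from two 32-bit steps; returns (hi, lo, carry)."""
    lo, carry = add32(a_lo, b_lo)          # plain add sets or clears carry
    hi, carry = add32(a_hi, b_hi, carry)   # add with carry uses previous carry
    return hi, lo, carry


hi, lo, carry = add64(0x00000000, 0xFFFFFFFF, 0x00000000, 0x00000001)
print(hex(hi), hex(lo), carry)  # 0x1 0x0 0
```

The carry produced by the low-word addition is what links the two halves, so any instruction that cleared it in between would corrupt the high word.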
Two conditional operands, C and NC, are provided for branching, calling, re-
turning, and conditionally executing according to the status of the carry bit. The
CLRC, LST #1, and SETC instructions can be used to load the carry bit. The
carry bit is set on a reset.
The 1-bit shift to the left (SFL) or right (SFR) and the rotate to the left (ROL)
or right (ROR) instructions shift or rotate the contents of the ACC through the
carry bit. The SXM bit affects the definition of the shift accumulator right (SFR)
instruction. When SXM = 1, SFR performs an arithmetic right shift, maintaining
the sign of the ACC data. When SXM = 0, SFR performs a logical shift, shifting
out the LSBs and shifting in a 0 for the MSB. The shift accumulator left (SFL)
instruction is not affected by the SXM bit and behaves the same in both cases,
shifting out the MSB and shifting in a 0. The RPT and RPTZ instructions can
be used with the shift and rotate instructions for multiple-bit shifts.
The SFLB, SFRB, RORB, and ROLB instructions can shift or rotate the 65-bit
combination of the ACC, ACCB, and carry bit as described above.
The ACC can also be shifted 0–31 bits right in two instruction cycles or 1–16
bits right in one cycle. The bits shifted out are lost, and the bits shifted in are
either 0s or copies of the original sign bit, depending on the value of the SXM
bit. A shift count of 1 to 16 is embedded in the instruction word of the BSAR
instruction. For example, let ACC = 1234 5678h:
BSAR 7 ;ACC = 0024 68ACh.
The right shift can also be controlled via TREG1. The SATL instruction shifts
the ACC by 0–15 bits, as defined by bits 0–3 of TREG1. The SATH instruction
shifts the ACC 16 bits to the right if bit 4 of TREG1 is a 1. The following code
sequence executes a 0- to 31-bit right shift of the ACC, depending on the shift
count stored at SHIFT. For example, consider the value stored at
SHIFT = 01Bh and ACC = 1234 5678h:
LMMR TREG1,SHIFT ;TREG1 = shift count 0 – 31. TREG1 = 1B
SATH ;If shift count > 15, then ACC >> 16
;ACC = 00001234
SATL ;ACC >> shift count. ACC = 0000 0002
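The two-step shift can be checked with a small model. The following Python sketch is only an illustration, not TI code: the function names sath and satl and the sxm parameter are assumptions chosen to mirror the instruction names and the SXM bit, and the ACC is modeled as a 32-bit unsigned integer.

```python
MASK32 = 0xFFFFFFFF

def shift_right(acc, count, sxm):
    """Shift a 32-bit ACC right by count bits; copy the sign bit in when sxm == 1."""
    acc &= MASK32
    if sxm and (acc & 0x80000000):
        acc -= 1 << 32              # reinterpret as a negative two's-complement value
    return (acc >> count) & MASK32

def sath(acc, treg1, sxm=0):
    """Model of SATH: shift right by 16 if bit 4 of TREG1 is 1."""
    return shift_right(acc, 16, sxm) if treg1 & 0x10 else acc & MASK32

def satl(acc, treg1, sxm=0):
    """Model of SATL: shift right by the count held in bits 0-3 of TREG1."""
    return shift_right(acc, treg1 & 0x0F, sxm)

treg1 = 0x1B                        # shift count 27, as in the example above
acc = sath(0x12345678, treg1)
print(f"after SATH: {acc:08X}h")
acc = satl(acc, treg1)
print(f"after SATL: {acc:08X}h")
```

Splitting the count this way means SATH handles the upper bit of a 5-bit count and SATL the lower four bits, so the pair together covers every shift from 0 to 31.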
The p-scaler and postscaler make it possible for the CALU to perform numeri-
cal scaling, bit extraction, extended-precision arithmetic, and overflow preven-
tion. These shifters are connected to the output of the PREG and the ACC (see
Figure 3–2 on page 3-8).
Parallel Logic Unit (PLU)
[Figure: Parallel logic unit block diagram. Blocks shown: DBMR, MUX, PLU, and the program bus.]
Note: All registers and data lines are 16 bits wide unless otherwise specified.
The PLU makes it possible to directly manipulate bits in any location in data
memory space by ANDing, ORing, exclusive-ORing, or loading a 16-bit long
immediate value to a data location. For example, to use AR1 for circular buffer
1 and AR2 for circular buffer 2 but not enable the circular buffers, initialize the
circular buffer control register (CBCR) by executing the following code:
SPLK #021h,CBCR ;Store peripheral long immediate
;(DP = 0).
Next, enable circular buffers 1 and 2 by executing the code:
OPL #088h,CBCR ;Set bit 7 and bit 3 in CBCR.
To test for individual bits in a specific register or data word, use the BIT instruc-
tion; however, to test for a pattern of bits, use the compare parallel long imme-
diate (CPL) instruction. If the data value is equal to the long immediate value,
then the test/control (TC) bit in ST1 is set. The TC bit is set if the result of any
PLU instruction is 0.
The set, clear, and toggle functions can be executed with a 16-bit dynamic reg-
ister value instead of the long immediate value. This is done with the following
three instructions: AND DBMR register to data (APL), OR DBMR register to
data (OPL), and exclusive-OR DBMR register to data (XPL).
The TC bit is also set by the APL, OPL, and XPL instructions if the result of the
PLU operation (value written back into data memory) is 0. This allows bits to
be tested and cleared simultaneously. For example,
APL #0FF00h,TEMP ;Clear low byte and check for
;bits set in high byte.
BCND HIGH_BITS_SET,NTC ;If bits active in high byte,
;then branch.
or
In the first example, the low byte of a flag word is cleared while the high byte
is checked for any active flags (bits = 1). If none of the flags in the high byte
is set, then the resulting APL operation yields a 0 to TEMP and the TC bit is
set. If any of the flags in the high byte are set, then the resulting APL operation
yields a nonzero value to TEMP and the TC bit is cleared. Therefore, the condi-
tional branch (BCND) following the APL instruction branches if any of the bits
in the high byte are nonzero. The second example tests the flag. If the flag is
low, the flag is set high; if the flag is high, the flag is cleared and the branch is
taken. The PLU instructions can operate anywhere in data address space, so
they can operate with flags stored in RAM locations as well as in control regis-
ters for both on- and off-chip peripherals. The PLU instructions are listed in
Table 6–6 on page 6-14.
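The effect of the PLU instructions on memory and on the TC bit can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the device's implementation: the function plu, the dictionary used as data memory, and the string operation names are hypothetical, and the operand stands for either the long immediate value or the DBMR content.

```python
def plu(op, mem, addr, operand):
    """Apply one PLU operation to mem[addr] and return the resulting TC bit."""
    value = mem[addr] & 0xFFFF
    operand &= 0xFFFF
    if op == "CPL":
        # Compare only: memory is not modified, TC = 1 when the values match.
        return 1 if value == operand else 0
    if op == "APL":
        result = value & operand
    elif op == "OPL":
        result = value | operand
    elif op == "XPL":
        result = value ^ operand
    else:
        raise ValueError(f"unknown PLU operation: {op}")
    mem[addr] = result              # read-modify-write back to data memory
    return 1 if result == 0 else 0  # TC = 1 when the written value is 0

mem = {"CBCR": 0x21, "TEMP": 0x12FF}
plu("OPL", mem, "CBCR", 0x88)       # set bit 7 and bit 3
tc = plu("APL", mem, "TEMP", 0xFF00)
print(f"CBCR = {mem['CBCR']:02X}h, TEMP = {mem['TEMP']:04X}h, TC = {tc}")
```

With TEMP = 12FFh, the APL clears the low byte and leaves TC at 0 because bits remain set in the high byte, which is the case in which the NTC branch of the first example is taken.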
Auxiliary Register Arithmetic Unit (ARAU)
[Figure: Indirect auxiliary register addressing example. AR0 = 0537h, AR1 = 5150h, AR2 = 0E9FCh, AR3 = 0FF3Ah, AR4 = 103Bh, AR5 = 26B1h, AR6 = 0008h, AR7 = 843Dh. The auxiliary register pointer (ARP, in ST0) holds 011b and selects AR3, which points to data memory location 0FF3Ah; that location contains 3121h.]
[Figure: ARAU block diagram. Registers shown: AR0–AR7, ARP (3 bits, in ST0), ARB (3 bits, in ST1), CBCR (8 bits), CBSR1, CBSR2, CBER1, CBER2, INDX, ARCR, IREG, and DRB, connected through multiplexers to the program bus, the 16-bit data bus, the address lines A15–A0, and the ARAU.]
Notes: All registers and data lines are 16 bits wide unless otherwise specified.
The ARAU updates the ARs during the decode phase (second stage)
of the pipeline, while the CALU writes during the execution phase
(fourth stage). Therefore, the two instructions that immediately follow
the CALU write to an AR should not use the same AR for address
generation. See Chapter 7, Pipeline, for more details.
Function: Current AR + INDX → Current AR
  Description: Index the current AR by adding an unsigned 16-bit integer contained in INDX. Example: ADD *0+
Function: Current AR + IR(7–0) → Current AR
  Description: Add an 8-bit immediate value to the current AR. Example: ADRK #55h
Function: Current AR – IR(7–0) → Current AR
  Description: Subtract an 8-bit immediate value from the current AR. Example: SBRK #55h
Function: If (Current AR) = (ARCR), then TC = 1
          If (Current AR) < (ARCR), then TC = 1
          If (Current AR) > (ARCR), then TC = 1
          If (Current AR) ≠ (ARCR), then TC = 1
  Description: Compare the current AR to ARCR and, if the condition is true, set the TC bit of the status register ST1. If false, clear the TC bit. Example: CMPR 3
Function: If (Current AR) = (CBER), then Current AR = CBSR
  Description: If the current AR is at the end of the circular buffer, reload the start address. The test for this condition is performed before the execution of the AR modification. Example: ADD *+
The INDX can be added to or subtracted from the current AR on any AR update
cycle. The INDX can be used to increment or decrement the address in steps
larger than 1; this is useful for operations such as addressing down a matrix
column. The ARCR limits blocks of data and supports logical comparisons be-
tween the current AR and ARCR in conjunction with the CMPR instruction.
Note that the ’C2x uses AR0 for this implementation. After reset, you can use
the load auxiliary register (LAR) instruction to load AR0; if the enable extra in-
dex register (NDX) bit in the PMST is set, LAR also loads INDX and ARCR to
maintain compatibility with the ’C2x.
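The comparison performed by CMPR can be modeled as follows. This is a sketch under an explicit assumption: the constants 0 through 3 are taken to select the four conditions in the order in which the ARAU function table lists them (=, <, >, ≠), and the function name cmpr is hypothetical. The comparison is unsigned, in line with the ARAU's unsigned 16-bit arithmetic.

```python
def cmpr(cm, current_ar, arcr):
    """Return the TC bit produced by comparing the current AR with ARCR."""
    a = current_ar & 0xFFFF          # ARs are treated as unsigned 16-bit values
    b = arcr & 0xFFFF
    conditions = (a == b, a < b, a > b, a != b)
    return int(conditions[cm])       # TC = 1 if the selected condition holds

# Loop-limit style check: has the pointer reached the end of a data block?
print(cmpr(3, 0x0205, 0x0210))       # pointer differs from the limit
print(cmpr(0, 0x0210, 0x0210))       # pointer equals the limit
```

A conditional branch on TC after the comparison is what lets ARCR mark the end of a block of data.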
Because the ARs are memory-mapped, the CALU can act directly upon them
and use more advanced indirect addressing techniques. For example, the
multiplier can calculate the addresses of 3-dimensional matrices. After a
CALU load of the AR, there is, however, a 2-instruction-cycle delay before the
ARs can be used for address generation. The INDX and ARCR are accessible
via the CALU, regardless of the condition of the NDX bit (that is, SAMM ARCR
writes only to the ARCR).
The 3-bit auxiliary register pointer buffer (ARB), shown in Figure 3–6, stores
the ARP on subroutine calls when the automatic context switch feature of the
’C5x is not used.
Two circular buffers can operate at a given time and are controlled via the cir-
cular buffer control register (CBCR). Upon reset (rising edge of RS), both circu-
lar buffers are disabled. To define a circular buffer, load CBSR1 or CBSR2 with
the start address of the buffer and CBER1 or CBER2 with the end address;
then load the AR to be used with the circular buffer with an address between
the start and end addresses. Finally, load CBCR with the appropriate AR num-
ber and set the enable (CENB1 or CENB2) bit.
As the address is stepping through the circular buffer, the AR value is com-
pared against the value contained in CBER prior to the update to the AR value.
If the current AR value and the CBER are equal and an AR modification occurs,
the value contained in CBSR is automatically loaded into the AR. If the values
in the CBER and the AR are not equal, the AR is modified as specified.
Summary of Registers
The 16-bit block repeat counter register (BRCR) holds the count value for the
block repeat feature. This value is loaded before a block repeat operation is
initiated. The value can be changed while a block repeat is in progress; howev-
er, take care to avoid infinite loops. The block repeat program address start
register (PASR) indicates the 16-bit address where the repeated block of code
starts. The block repeat program address end register (PAER) indicates the
16-bit address where the repeated block of code ends. The PASR and PAER
are loaded by the RPTB instruction. Block repeats are described in Section
4.7, Block Repeat Function, on page 4-31.
3.5.5 Buffered Serial Port Registers (ARR, AXR, BKR, BKX, SPCE)
The buffered serial port (BSP) is available on ’C56 and ’C57 devices. The BSP
comprises a full-duplex, double-buffered serial port interface and an autobuf-
fering unit (ABU). The BSP has a 2K-word buffer, which resides in the ’C5x
internal memory. Five registers control and operate the BSP. The 16-bit BSP
control extension register (SPCE) contains the mode control and status bits
of the BSP. The 11-bit BSP address receive register (ARR) and 11-bit BSP
receive buffer size register (BKR) support address generation for writing to the
data receive register (DRR) in the ’C5x internal memory. The 11-bit BSP
address transmit register (AXR) and 11-bit BSP transmit buffer size register
(BKX) support address generation for reading a word from the ’C5x internal
memory to the data transmit register (DXR). The BSP is described in Section
9.8, Buffered Serial port (BSP) Interface, on page 9-53.
3.5.16 Serial Port Interface Registers (SPC, DRR, DXR, XSR, RSR)
Five registers control and operate the serial port interface. The 16-bit serial
port control register (SPC) contains the mode control and status bits of the seri-
al port. The 16-bit data receive register (DRR) holds the incoming serial data,
and the 16-bit data transmit register (DXR) holds the outgoing serial data. The
16-bit data transmit shift register (XSR) controls the shifting of the data from
the DXR to the output pin. The 16-bit data receive shift register (RSR) controls
the storing of the data from the input pin to the DRR. The serial port is de-
scribed in Section 9.7, Serial Port Interface, on page 9-23.
Software compatibility can be maintained with the ’C2x by clearing the enable
multiple TREGs (TRM) bit in the PMST. This causes any ’C2x instruction that
loads TREG0 to write to all three TREGs, maintaining ’C5x object-code com-
patibility with the ’C2x.
3.5.21 TDM Serial Port Registers (TRCV, TDXR, TSPC, TCSR, TRTA, TRAD, TRSR)
The time-division-multiplexed (TDM) serial port interface is a feature superset
of the serial port interface and supports applications that require serial communication
in a multiprocessing environment. Seven registers control and operate
the TDM serial port interface. The 16-bit TDM serial port control register
(TSPC) contains the mode control and status bits of the TDM serial port inter-
face. The 16-bit TDM data receive register (TRCV) holds the incoming TDM
serial data, and the 16-bit TDM data transmit register (TDXR) holds the outgo-
ing TDM serial data. The 16-bit TDM data receive shift register (TRSR) con-
trols the storing of the data, from the input pin, to the TRCV. The 16-bit TDM
channel select register (TCSR) specifies in which time slot(s) each ’C5x device
is to transmit. The 16-bit TDM receive/transmit address register (TRTA) speci-
fies in the eight LSBs (RA0–RA7) the receive address of the ’C5x device and
in the eight MSBs (TA0–TA7) the transmit address of the ’C5x device. The
16-bit TDM receive address register (TRAD) contains various information re-
garding the status of the TDM address line (TADD). See Section 9.9, Time-Di-
vision Multiplexed (TDM) Serial Port Interface, on page 9-74.
Chapter 5
Addressing Modes
This chapter describes each of the following addressing modes and gives the
opcode formats and some examples.
- Direct addressing
- Indirect addressing
- Immediate addressing
- Dedicated-register addressing
- Memory-mapped register addressing
- Circular addressing
Figure 5–1 illustrates how the 16-bit data memory address is formed.
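[Figure 5–1. Direct addressing. The 9-bit DP (bits 15–7) selects one of the 512 data pages, page 0 through page 511; page 0 holds the memory-mapped registers and DARAM B2. The 7-bit dma (bits 6–0) selects a word within the 128-word page. Together they form the 16-bit data memory address placed on the DAB.]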
Direct Addressing
Note:
The DP is not initialized by reset and, therefore, is undefined after power-up.
The ’C5x development tools, however, use default values for many parameters,
including the DP. Because of this, programs that do not explicitly initialize the
DP may execute improperly, depending on whether they are executed on a
’C5x device or with a development tool. Thus, it is critical that all programs
initialize the DP in software.
Figure 5–2 illustrates the direct addressing mode. Bits 15 through 8 contain
the opcode. Bit 7, with a value of 0, defines the addressing mode as direct, and
bits 6 through 0 contain the dma.
DP      = 1 1 0 0 1 1 1 0 1
DAB     = 1 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0
Operand = Data(DAB)
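The address formation in direct addressing amounts to concatenating two bit fields. The Python sketch below illustrates this with the DP and DAB values shown in the figure; the function name direct_address is an assumption made for the example.

```python
def direct_address(dp, dma):
    """Concatenate the 9-bit DP with the 7-bit dma into a 16-bit data address."""
    if not 0 <= dp <= 0x1FF:
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= dma <= 0x7F:
        raise ValueError("dma must fit in 7 bits")
    return (dp << 7) | dma          # DP supplies bits 15-7, dma supplies bits 6-0

addr = direct_address(0b110011101, 0b0010000)
print(f"DAB = {addr:016b}b = {addr:04X}h")
print(f"page {addr >> 7}, offset {addr & 0x7F} within the 128-word page")
```

Because the DP supplies the upper 9 bits on every direct access, the same instruction word reaches a different location whenever the DP changes, which is why the DP must be initialized in software.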
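[Figure: Indirect addressing block diagram. The 3-bit ARP (ARP = 2 in the example) selects one of the 16-bit auxiliary registers AR0–AR7; the 3-bit ARB buffers the ARP; the selected AR supplies the data memory address and is updated by the ARAU.]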
To select a specific AR, load the auxiliary register pointer (ARP) with a value
from 0 through 7, designating AR0 through AR7, respectively. The register
pointed to by the ARP is referred to as the current auxiliary register (current
AR). You can load the address into the AR using the LAR instruction and you
can change the content of the AR by the:
- ADRK instruction
- MAR instruction
- SBRK instruction
- Indirect addressing field of any instruction supporting indirect addressing.
The content of the current AR is used as the address of the data memory oper-
and. After the instruction uses the data value, the content of the current AR can
be incremented or decremented by the auxiliary register arithmetic unit
(ARAU), which implements unsigned 16-bit arithmetic.
Indirect Addressing
The contents of the current AR are used as the address of the data memory
operand. Then, the ARAU performs the specified mathematical operation on
the indicated AR. Additionally, the ARP can be loaded with a new value. All
indexing operations are performed on the current AR in the same cycle as the
original instruction decode phase of the pipeline.
The bit-reversed addressing mode (see subsection 5.2.3 on page 5-12) helps
you achieve efficient I/O by resequencing the data points in a radix-2 fast
Fourier transform (FFT) program. The direction of carry propagation in the
ARAU is reversed when bit-reversed addressing is selected, and INDX is added
to or subtracted from the current AR. Normally, this addressing mode requires that
INDX first be set to a value corresponding to one-half of the array’s size, and
that the current AR be set to the base address of the data (the first data point).
The following indirect-addressing symbols are used in the ’C5x assembly language
instructions:
Table 5–3 on page 5-9 shows the instruction field bit values, notation, and op-
eration used for indirect addressing. Example 5–1 through Example 5–8 illus-
trate the indirect addressing formats. Example 5–9 shows an indirect address-
ing routine.
Figure 5–4. Indirect Addressing Opcode Format Diagram
Bit:    15–8     7    6     5     4     3    2–0
Field:  Opcode   I    IDV   INC   DEC   N    NAR
Bit 7, I: Addressing mode bit. This 1-bit field determines the addressing mode
    (I = 0 selects direct addressing; I = 1 selects indirect addressing).
Bit 6, IDV: Determines whether the INDX is used in the arithmetic operation.
    IDV = 0  The INDX is not used in the arithmetic operation. An increment or decrement
             (if any) by 1 occurs to the current AR.
    IDV = 1  The INDX is used in the arithmetic operation. An increment or decrement (if
             any) by the contents of INDX or by reverse carry propagation occurs to the
             current AR.
Bit 5, INC: Auxiliary register increment bit. This 1-bit field determines whether the current AR is
    incremented. The INC bit works in conjunction with the IDV and DEC bits to determine the
    arithmetic operation.
Bit 4, DEC: Auxiliary register decrement bit. This 1-bit field determines whether the current AR is
    decremented. The DEC bit works in conjunction with the IDV and INC bits to determine the
    arithmetic operation. See Table 5–2 for specific arithmetic operations.
Bit 3, N: Next auxiliary register indicator bit. This 1-bit field determines whether the instruction will
    change the ARP value.
    N = 0  The ARP value is not changed.
    N = 1  The content of NAR is loaded into the ARP, and the old ARP value is
           loaded into the auxiliary register buffer (ARB) of status register ST1.
Bits 2–0, NAR: Next auxiliary register value bits. This 3-bit field contains the value of the next auxiliary
    register. If the N bit is set, NAR is loaded into the ARP.
Bit values
IDV INC DEC Arithmetic Operation Performed on Current AR
0 0 0 No operation on current AR
0 0 1 (Current AR) – 1 → current AR
0 1 0 (Current AR) + 1 → current AR
0 1 1 Reserved
1 0 0 (Current AR) – INDX [reverse carry propagation] → current AR
1 0 1 (Current AR) – INDX → current AR
1 1 0 (Current AR) + INDX → current AR
1 1 1 (Current AR) + INDX [reverse carry propagation] → current AR
Table 5–3. Instruction Field Bit Values for Indirect Addressing (Continued)
Instruction Field Bit Values
15–8 7 6 5 4 3 2–0 Notation Operation
← Opcode → 1 1 1 1 0 ← NAR → *BR0+ (Current AR) + rcINDX → current AR
ADD *,8
0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0
In Example 5–1, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The current
AR is not changed. The instruction word is 2880h.
ADD *-,8
0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0
In Example 5–2, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The current
AR is decremented by 1. The instruction word is 2890h.
ADD *+,8
0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0
In Example 5–3, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The current
AR is incremented by 1. The instruction word is 28A0h.
ADD *+,8,AR3
0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1
In Example 5–4, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The current
AR is incremented by 1. The auxiliary register pointer (ARP) is loaded with the
value 3 for subsequent instructions. The instruction word is 28ABh.
ADD *0-,8
0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0
In Example 5–5, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The content
of INDX is subtracted from the current AR. The instruction word is 28D0h.
ADD *0+,8
0 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0
In Example 5–6, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The content
of INDX is added to the current AR. The instruction word is 28E0h.
Example 5–7. Indirect Addressing With INDX Subtracted from AR With Reverse Carry
ADD *BR0-,8
0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0
In Example 5–7, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The content
of INDX with reverse carry propagation is subtracted from the current AR. The
instruction word is 28C0h.
Example 5–8. Indirect Addressing With INDX Added to AR With Reverse Carry
ADD *BR0+,8
0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0
In Example 5–8, the content of the data memory address, defined by the content
of the current AR, is shifted left 8 bits and added to the ACC. The content
of INDX with reverse carry propagation is added to the current AR. The instruction
word is 28F0h.
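The instruction words quoted in Examples 5–1 through 5–8 follow directly from the opcode format. The Python sketch below rebuilds them from the field definitions; it is illustrative only. The names MODES, indirect_low_byte, and add_shift8 are assumptions, and the upper byte 28h is simply taken from the examples (ADD with a shift of 8) rather than derived from the full instruction set encoding.

```python
# Bits 6-4 (IDV, INC, DEC) for each operand notation, following the table of
# arithmetic operations performed on the current AR.
MODES = {
    "*":     0b000,   # no operation on the current AR
    "*-":    0b001,   # current AR - 1
    "*+":    0b010,   # current AR + 1
    "*BR0-": 0b100,   # current AR - INDX with reverse carry propagation
    "*0-":   0b101,   # current AR - INDX
    "*0+":   0b110,   # current AR + INDX
    "*BR0+": 0b111,   # current AR + INDX with reverse carry propagation
}

def indirect_low_byte(mode, nar=None):
    """Build bits 7-0 of an instruction word that uses indirect addressing."""
    byte = 0x80 | (MODES[mode] << 4)      # bit 7 (I) = 1 selects indirect
    if nar is not None:
        if not 0 <= nar <= 7:
            raise ValueError("NAR must be 0..7")
        byte |= 0x08 | nar                # N = 1, NAR = next ARP value
    return byte

def add_shift8(mode, nar=None):
    """Instruction word for ADD <mode>,8 (upper byte 28h taken from the examples)."""
    return (0x28 << 8) | indirect_low_byte(mode, nar)

for mode, nar in [("*", None), ("*-", None), ("*+", None), ("*+", 3),
                  ("*0-", None), ("*0+", None), ("*BR0-", None), ("*BR0+", None)]:
    print(f"{mode:6} NAR={nar}: {add_shift8(mode, nar):04X}h")
```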
Assume that the auxiliary registers are eight bits long, that AR2 represents the
base address of the data in memory (0110 0000b), and that INDX contains the
value 0000 1000b. Example 5–10 shows a sequence of modifications to AR2
and the resulting values of AR2. Table 5–4 shows the relationship of the bit pattern
of the index steps and the four LSBs of AR2, which contain the bit-reversed
address.
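Reverse carry propagation can be emulated in software by bit-reversing both operands, adding them normally, and bit-reversing the sum. The Python sketch below applies this to the 8-bit assumption stated above (AR2 starting at 0110 0000b, INDX = 0000 1000b). The helper names are assumptions, and the hardware does this in the ARAU adder itself rather than by reversing bits.

```python
def reverse_bits(value, width):
    """Return value with its lowest `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def add_reverse_carry(ar, indx, width=8):
    """Add indx to ar with carries propagating from the MSB toward the LSB."""
    mask = (1 << width) - 1
    total = (reverse_bits(ar, width) + reverse_bits(indx, width)) & mask
    return reverse_bits(total, width)

ar2 = 0b01100000                     # base address of the data
indx = 0b00001000                    # one-half of a 16-point array
for step in range(8):
    print(f"step {step}: AR2 = {ar2:08b}b, four LSBs = {ar2 & 0x0F}")
    ar2 = add_reverse_carry(ar2, indx)   # models the *BR0+ update
```

The four LSBs visit 0, 8, 4, 12, 2, 10, 6, 14, which is the bit-reversed order of the counter values 0 through 7.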
Machine Code 1 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1
Operand 1 1 1 1 1 1 1 1
Immediate Addressing
- One-operand instructions
- Two-operand instructions
Machine Code 1 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0
Operand 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0
01234h
The long immediate addressing also could apply for a second data memory
access for the execution of the instruction. The prefetch counter (PFC) is
pushed onto the microcall stack (MCS), and the long immediate value is loaded
into the PFC. The program address/data bus is then used for the operand fetch
or write. At the completion of the instruction, the MCS is popped back to the PFC,
the program counter (PC) is incremented by two, and execution continues. The
PFC is used so that when the instruction is repeated, the address generated can
be autoincremented.
Figure 5–7 shows an example of long immediate addressing with two oper-
ands. In Figure 5–7, the source address (OPERAND1) is fetched via PAB, and
the destination address (OPERAND2) uses the direct addressing mode. Bits
15 through 8 of machine code1 contain the opcode. Bit 7, with a value of 0,
defines the addressing mode as direct, and bits 6 through 0 contain the dma.
DP 1 1 0 0 1 1 1 0 1
DAB 1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 0
02345h
Machine Code2 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1
PC 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1
Dedicated-Register Addressing
- Exclude the immediate value from a parallel logic unit (PLU) instruction:
Figure 5–8 shows how the BMAR is used in the dedicated-register addressing
mode. Bits 15 through 8 of the machine code contain the opcode. Bit 7, with
a value of 0, defines the addressing mode as direct, and bits 6 through 0 con-
tain the dma.
DP 1 1 0 0 1 1 1 0 1
DAB 1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 0
BMAR PFC
APL 010h
15 8 7 6 0
Machine Code 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0
DP 1 1 0 0 1 1 1 0 1
DAB 1 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0
Operand1 Data(DAB)
Operand2 DBMR
Note: DAB is the 16-bit internal data memory address bus.
Memory-Mapped Register Addressing
Figure 5–10 illustrates how this is done by forcing the 9 MSBs of the data
memory address to 0, regardless of the current value of the DP when direct
addressing is used or of the current AR value when indirect addressing is used.
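[Figure 5–10. Memory-mapped register addressing. The 9 MSBs (bits 15–7) of the 16-bit memory-mapped register address are forced to 0 and the 7 LSBs (bits 6–0) carry the dma, so the address on the DAB always falls in the 128-word data page 0, which holds the memory-mapped registers and DARAM B2.]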
In Example 5–11, assume that ARP = 3 and AR3 = FF07h. The content of the
ACC is stored to the PMST (address 07h) pointed at by the 7 LSBs of AR3.
In Example 5–12, assume that DP = 0184h and TEMP1 = 8060h. The content
of memory location 07h (PMST) is loaded into the ACC. Figure 5–11 illustrates
memory-mapped register addressing in the direct addressing mode.
Value 0 0 0 0 0 0 0 0 0
DAB 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
Operand Data(DAB)
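Forcing the 9 MSBs to 0 is a simple mask of the generated address. The Python sketch below shows the effect for the two cases described above; the function name mmr_address is an assumption made for illustration.

```python
def mmr_address(address):
    """Force the 9 MSBs of a 16-bit data address to 0, leaving the 7 LSBs."""
    return address & 0x007F

# Indirect case: AR3 = FF07h still reaches the PMST at address 07h.
print(f"{mmr_address(0xFF07):04X}h")

# Direct case: DP = 0184h with dma = 07h; the DP is ignored.
print(f"{mmr_address((0x0184 << 7) | 0x07):04X}h")
```

Either way the access lands in data page 0, so a memory-mapped register can be reached without first changing the DP or the AR.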
Circular Addressing
The 8-bit CBCR enables and disables the circular buffer operation and is
defined in subsection 4.4.1, Circular Buffer Control Register (CBCR), on
page 4-6.
To define circular buffers, you first load the start and end addresses into the
corresponding buffer registers; next, load a value between the start and end
registers for the circular buffer into an AR. Load the proper AR value, and set
the corresponding circular buffer enable bit in the CBCR. Note that you must
not enable the same AR for both circular buffers; if you do, unexpected results
occur. The algorithm for circular buffer addressing below shows that the test
of the AR value is performed before any modifications:
mar *,ar6
ldp #0
splk #200h,CBSR1 ; Circular buffer start register
splk #203h,CBER1 ; Circular buffer end register
splk #0Eh,CBCR ; Enable AR6 pointing to buffer 1
lar ar6,#200h ; Case 1
lacc * ; AR6 = 200h
lar ar6,#203h ; Case 2
lacc * ; AR6 = 203h
lar ar6,#200h ; Case 3
lacc *+ ; AR6 = 201h
lar ar6,#203h ; Case 4
lacc *+ ; AR6 = 200h
lar ar6,#200h ; Case 5
lacc *- ; AR6 = 1FFh
lar ar6,#203h ; Case 6
lacc *- ; AR6 = 200h
lar ar6,#202h ; Case 7
adrk 2 ; AR6 = 204h
lar ar6,#203h ; Case 8
adrk 2 ; AR6 = 200h
In circular addressing, the step is the quantity that is being added to or sub-
tracted from the specified AR. Take care when using a step of greater than 1
to modify the AR pointing to an element of the circular buffer. If an update to
an AR generates an address outside the range of the circular buffer, the ARAU
does not detect this situation, and the buffer does not wrap around. AR up-
dates are performed as described in Section 5.2, Indirect Addressing.
Because of the pipeline, there is a two-cycle latency between configuring the
CBCR and performing AR modifications.
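The wrap-around rule can be restated as a short Python model and checked against the eight cases in the listing. This is a sketch of the rule as described in the text (test for equality with CBER first, then either reload CBSR or apply the step); the function name update_ar and its parameters are assumptions, and pipeline latency is not modeled.

```python
def update_ar(ar, step, cbsr, cber, enabled=True):
    """Return the AR value after one access with the given signed step.

    step == 0 models an access that does not modify the AR (operand '*').
    """
    if step == 0:
        return ar
    if enabled and ar == cber:
        return cbsr                   # at the end address: reload the start
    return (ar + step) & 0xFFFF       # otherwise modify the AR as specified

CBSR1, CBER1 = 0x200, 0x203
cases = [(0x200, 0), (0x203, 0), (0x200, +1), (0x203, +1),
         (0x200, -1), (0x203, -1), (0x202, +2), (0x203, +2)]
for number, (ar6, step) in enumerate(cases, start=1):
    result = update_ar(ar6, step, CBSR1, CBER1)
    print(f"Case {number}: AR6 = {result:03X}h")
```

Cases 5 and 7 show the caveat from the text: because only equality with CBER is tested, a decrement from the start address or a step of 2 from 202h leaves the buffer range without wrapping.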
DSP Applications Using C and the TMS320C6x DSK. Rulph Chassaing
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-20754-3 (Hardback); 0-471-22112-0 (Electronic)
3
Architecture and Instruction Set
of the C6x Processor
3.1 INTRODUCTION
11.72 mV. With an 8-bit ADC, 2^8 or 256 different levels can represent the input signal.
An ADC with a larger word length such as a 16-bit ADC (currently very common)
can reduce the quantization error, yielding a higher resolution. The more bits that
an ADC has, the better it can represent an input signal.
The TMS320C30 floating-point processor was introduced in the late 1980s. The
C31, C32, and the more recent C33 are all members of the C3x family of floating-
point processors [2,3]. The C4x floating-point processors, introduced subsequently,
are code-compatible with the C3x processors and are based on the modified
Harvard architecture [4].
The TMS320C6201 (C62x), announced in 1997, is the first member of the C6x
family of fixed-point digital signal processors. Unlike the previous fixed-point
processors, C1x, C2x, and C5x, the C62x is based on a very-long-instruction-word
(VLIW) architecture, still using separate memory spaces for instructions and data
as with the Harvard architecture. The VLIW architecture has simpler instructions,
but more are needed for a task than with a conventional DSP architecture.
The C62x is not code-compatible with the previous generation of fixed-point
processors. Subsequently, the TMS320C6701 (C67x) floating-point processor was
introduced as another member of the C6x family of processors. The instruction
set of the C62x fixed-point processor is a subset of the instruction set of the
C67x processor. Appendix A contains a list of instructions available on the C6x
processors. A recent addition to the family of the C6x processors is the fixed-point
C64x.
An application-specific integrated circuit (ASIC) has a DSP core with customized
circuitry for a specific application. A C6x processor can be used as a standard
general-purpose digital signal processor programmed for a specific application.
Specific-purpose digital signal processors include the modem, the echo canceler, and others.
A fixed-point processor is better for devices that use batteries, such as cellular
phones, since it uses less power than does an equivalent floating-point processor.
The fixed-point processors, C1x, C2x, and C5x are 16-bit processors with limited
dynamic range and precision. The C6x fixed-point processor is a 32-bit processor
with improved dynamic range and precision. In a fixed-point processor, it is neces-
sary to scale the data. Overflow, which occurs when an operation such as the addi-
tion of two numbers produces a result with more bits than can fit within a processor’s
register, becomes a concern.
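The need for scaling can be seen with two 16-bit numbers. The Python sketch below is a generic illustration, not a model of any particular C5x or C6x instruction: it assumes that an overflowing result simply wraps around in two's complement (a processor may instead saturate), and the helper name wrap16 is hypothetical.

```python
def wrap16(value):
    """Keep the low 16 bits and reinterpret them as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

a, b = 20000, 20000
print(wrap16(a + b))                  # the true sum 40000 does not fit in 16 bits
print(wrap16((a >> 1) + (b >> 1)))    # inputs scaled by 1/2 before the addition
```

Scaling the inputs keeps the sum inside the 16-bit range at the cost of one bit of precision, which is the trade-off between dynamic range and precision that a fixed-point design has to manage.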
A floating-point processor is generally more expensive since it has more “real
estate” or is a larger chip because of additional circuitry necessary to handle integer
as well as floating-point arithmetic. Several factors, such as cost, power consump-
tion, and speed, come into play when choosing a specific digital signal processor.
The C6x processors are particularly useful for applications requiring intensive com-
putations. Family members of the C6x include both fixed-point (e.g., C62x, C64x)
and floating-point processors (e.g., C67x). Other digital signal processors are also
available, from companies such as Motorola and Analog Devices [5].
Other architectures include the superscalar architecture, which requires special hardware to
determine which instructions are executed in parallel. The burden is then more on the
processor than on the programmer, in contrast to the VLIW architecture. A superscalar
processor does not necessarily execute the same group of instructions, and as a result, its
execution is difficult to time. Thus, it is rarely used in DSP.
independent buses. Since internal memory is organized into memory banks, two
load or two store instructions can be performed in parallel. No conflict results if
the data accessed are in different memory banks. Separate buses for program, data,
and direct memory access (DMA) allow the C6x to perform concurrent program
fetches, data read and write, and DMA operations. With data and instructions
residing in separate memory spaces, concurrent memory accesses are possible. The
C6x has a byte-addressable memory space. Internal memory is organized as sepa-
rate program and data memory spaces, with two 32-bit internal ports (two 64-bit
ports with the C64x) to access internal memory.
The C6711 on the DSK includes 72 kB of internal memory, which starts at
0x00000000, and 16 MB of external SDRAM, mapped through CE0 starting at
0x80000000. The DSK also includes 128 kB of Flash memory onboard, starting at
0x90000000. A two-level internal memory block diagram is shown in Figure 3.2,
included with CCS [7]. Table 3.1 shows the memory map. A schematic diagram of
the DSK is included with CCS (C6711dsk_schematics.pdf).
With a clock of 150 MHz onboard the DSK, one can ideally achieve two multiplies
and accumulates per cycle, for a total of 300 million multiplies and accumulates
(MACs) per second. With six of the eight functional units in Figure 3.1 (not
the .D units described below) capable of handling floating-point operations, it is
possible to perform 900 million floating-point operations per second (MFLOPS).
Operating at 150 MHz, this translates to 1200 million instructions per second (MIPS)
with a 6.67-ns instruction cycle time.
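These peak figures are straightforward products of the clock rate and the number of units working per cycle. The Python sketch below reproduces the arithmetic; the function name and parameter names are assumptions, and the results are ideal peak rates, not measured performance.

```python
def c6x_peak_rates(clock_mhz, macs_per_cycle=2, fp_units=6, units=8):
    """Return ideal peak rates for a given clock, in millions per second."""
    return {
        "MMACS": clock_mhz * macs_per_cycle,   # one MAC per .M unit per cycle
        "MFLOPS": clock_mhz * fp_units,        # six units handle floating point
        "MIPS": clock_mhz * units,             # eight instructions per cycle
        "cycle_ns": 1000.0 / clock_mhz,        # instruction cycle time
    }

rates = c6x_peak_rates(150)
print(rates["MMACS"], rates["MFLOPS"], rates["MIPS"], round(rates["cycle_ns"], 2))
```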
The CPU consists of eight independent functional units divided into two data paths
A and B, as shown in Figure 3.1. Each path has a unit for multiply operations (.M),
for logical and arithmetic operations (.L), for branch, bit manipulation, and
arithmetic operations (.S), and for loading/storing and arithmetic operations (.D).
The .S and .L units are for arithmetic, logical, and branch instructions. All data
transfers make use of the .D units.
The arithmetic operations, such as subtract or add (SUB or ADD), can be per-
formed by all the units except the .M units (one from each data path). The eight
functional units consist of four floating/fixed-point ALUs (two .L and two .S), two
fixed-point ALUs (.D units), and two floating/fixed-point multipliers (.M units).
Each functional unit can read directly from or write directly to the register file
within its own path. Each path includes a set of sixteen 32-bit registers, A0 through
A15 and B0 through B15. Units ending in 1 write to register file A, and units ending
in 2 write to register file B.
Two cross-paths (1x and 2x) allow the functional units on one data path to read
a 32-bit operand from the register file on the opposite side. There can be a maximum
of two cross-path source reads per cycle. There are 32 general-purpose registers,
but some of them are reserved for specific addressing or are used for conditional
instructions.
The VELOCITI architecture, introduced by TI, is derived from the VLIW architecture.
An execute packet (EP) consists of a group of instructions that can be
executed in parallel within the same cycle time. The number of EPs within a fetch
packet (FP) can vary from one (with eight parallel instructions) to eight (with no
parallel instructions). The VLIW architecture was modified to allow more than one
EP to be included within an FP.
The least significant bit of every 32-bit instruction is used to determine if the next
or subsequent instruction belongs in the same EP (if 1) or is part of the next EP
(if 0). Consider an FP with three EPs: EP1, with two parallel instructions, and EP2
and EP3, each with three parallel instructions, as follows:
   Instruction A
|| Instruction B
   Instruction C
|| Instruction D
|| Instruction E
   Instruction F
|| Instruction G
|| Instruction H
EP1 contains the two parallel instructions A and B; EP2 contains the three parallel
instructions C, D, and E; and EP3 contains the three parallel instructions F, G,
and H. The FP would be as shown in Figure 3.3. Bit 0 (LSB) of each 32-bit
instruction is the “p” bit, which signals whether the instruction is in parallel
with the subsequent instruction. For example, the “p” bit of instruction B is zero,
denoting that it is not within the same EP as the subsequent instruction C.
Similarly, the “p” bit of instruction E is zero, so E is not within the same EP as
instruction F.
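The grouping rule can be sketched in a few lines. The example below is only an illustration of the p-bit rule described above, not TI code: each instruction is modelled as a hypothetical (name, p_bit) pair, and a p bit of 0 closes the current execute packet.

```python
def split_into_execute_packets(fetch_packet):
    """Group (name, p_bit) pairs into execute packets.

    p_bit == 1 means the NEXT instruction runs in parallel with this one;
    p_bit == 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:                # defensive: last p bit was 1
        packets.append(current)
    return packets


# The fetch packet from the text: A||B, C||D||E, F||G||H.
fp = [("A", 1), ("B", 0), ("C", 1), ("D", 1), ("E", 0),
      ("F", 1), ("G", 1), ("H", 0)]
print(split_into_execute_packets(fp))
# [['A', 'B'], ['C', 'D', 'E'], ['F', 'G', 'H']]
```

With all eight p bits at 0 the same function returns eight single-instruction EPs, and with the first seven at 1 it returns one EP of eight instructions, matching the two extremes mentioned earlier.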
FIGURE 3.3. One fetch packet with three execute packets, showing the “p” bit of each
instruction.
3.5 PIPELINING
Table 3.2 shows the pipeline phases, and Table 3.3 shows the pipelining effects.
The first row in Table 3.3 represents cycles 1, 2, . . . , 12. Each subsequent row
represents an FP, and its entries PG, PS, . . . show the phases that FP passes
through. The program generate (PG) phase of the first FP starts in cycle 1, the PG
of the second FP starts in cycle 2, and so on. Each FP takes four phases for program
fetch and two phases for decoding. However, the execution phase can take from 1
to 10 phases (not all execution phases are shown in Table 3.3). We are assuming that
each FP contains one execute packet (EP).
For example, at cycle 7, while the instructions in the first FP are in the first
execution phase E1 (which may be the only one), the instructions in the second FP are
in the decoding phase (DC), the instructions in the third FP are in the dispatching
phase (DP), and so on. All seven FPs are proceeding through the various phases.
Therefore, at cycle 7, “the pipeline is full.”
Cycle  1  2  3  4  5  6  7  8  9 10 11 12
FP 1  PG PS PW PR DP DC E1 E2 E3 E4 E5 E6
FP 2     PG PS PW PR DP DC E1 E2 E3 E4 E5
FP 3        PG PS PW PR DP DC E1 E2 E3 E4
FP 4           PG PS PW PR DP DC E1 E2 E3
FP 5              PG PS PW PR DP DC E1 E2
FP 6                 PG PS PW PR DP DC E1
FP 7                    PG PS PW PR DP DC
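The staggered pattern of the pipeline table can be generated from one rule: each FP enters PG one cycle after the previous one and then moves one phase per cycle. The sketch below assumes one EP per FP and no stalls, as the text does; the function names are illustrative.

```python
FETCH_DECODE = ["PG", "PS", "PW", "PR", "DP", "DC"]


def phase(fp_index, cycle):
    """Phase occupied by fetch packet fp_index (0-based) in a given cycle (1-based).

    Assumes one execute packet per fetch packet and no stalls.
    Returns None before the fetch packet has entered the pipeline.
    """
    step = cycle - 1 - fp_index          # FP k enters PG in cycle k + 1
    if step < 0:
        return None
    if step < len(FETCH_DECODE):
        return FETCH_DECODE[step]
    return f"E{step - len(FETCH_DECODE) + 1}"


def pipeline_table(num_fps=7, num_cycles=12):
    """Rows of phase names, one row per fetch packet, blank before entry."""
    return [[phase(k, c) or "" for c in range(1, num_cycles + 1)]
            for k in range(num_fps)]


for row in pipeline_table():
    print(" ".join(f"{p:>2}" for p in row))
```

Reading column 7 of the output from top to bottom gives E1, DC, DP, PR, PW, PS, PG, which is the "pipeline full" condition.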
Most instructions have one execute phase. Instructions such as multiply (MPY),
load (LDH/LDW), and branch (B) take two, five, and six phases, respectively. Addi-
tional execute phases are associated with floating-point and double-precision types
of instructions, which can take up to 10 phases. For example, the double-precision
multiply operation (MPYDP), available on the C67x, has nine delay slots, so that the
execution phase takes a total of 10 phases.
The functional unit latency, which represents the number of cycles that an instruction
ties up a functional unit, is 1 for all instructions except double-precision
instructions, available with the floating-point C67x. Functional unit latency is
different from a delay slot. For example, the instruction MPYDP has a functional unit
latency of four but nine delay slots. This implies that no other instruction can use
the associated multiply functional unit for four cycles. A store has no delay slot but
finishes its execution in the third execution phase of the pipeline.
If the outcome of a multiply instruction such as MPY is used by the instruction that
immediately follows it, a NOP (no operation) must be inserted after the MPY
instruction for the pipelining to operate properly. Four or five NOPs are to be
inserted if the following instruction uses the outcome of a load or a branch
instruction, respectively.
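The NOP counts above are simply the delay slots, that is, the number of execute phases beyond the first. A minimal sketch using the phase counts quoted in this section (ADD stands in for "most instructions"; the table name is illustrative):

```python
# Execute-phase counts quoted in the text (C67x).
EXECUTE_PHASES = {"ADD": 1, "MPY": 2, "LDW": 5, "B": 6, "MPYDP": 10}


def delay_slots(mnemonic):
    """Delay slots = execute phases beyond the first.

    This is also the number of NOP cycles needed when the very next
    instruction consumes the result.
    """
    return EXECUTE_PHASES[mnemonic] - 1


for m in EXECUTE_PHASES:
    print(f"{m:6s} delay slots = {delay_slots(m)}")
```

In practice the delay slots are filled with useful independent instructions where possible, and NOPs are used only when nothing else can be scheduled.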
3.6 REGISTERS
Two sets of register files, each set with 16 registers, are available: register file A (A0
through A15) and register file B (B0 through B15). Registers A0, A1, B0, B1, and
B2 are used as conditional registers. Registers A4 through A7 and B4 through B7
are used for circular addressing. Registers A0 through A9 and B0 through B9
(except B3) are temporary registers. Any of the registers A10 through A15 and
B10 through B15 used are saved and later restored before returning from a
subroutine.
A 40-bit data value can be contained across a register pair. The 32 least signifi-
cant bits (LSBs) are stored in the even register (e.g., A2) and the remaining 8 bits
are stored in the 8 LSBs of the next-upper (odd) register (A3). A similar scheme is
used to hold a 64-bit double-precision value within a pair of registers (even
and odd).
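The register-pair layout for a 40-bit value can be illustrated as follows. This is a sketch only: the value is treated as an unsigned integer, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def split_40bit(value):
    """Split a 40-bit value into (odd_reg, even_reg) 32-bit register contents."""
    if not 0 <= value < (1 << 40):
        raise ValueError("value does not fit in 40 bits")
    even = value & MASK32            # 32 LSBs -> even register (e.g. A2)
    odd = (value >> 32) & 0xFF       # remaining 8 bits -> 8 LSBs of odd register (A3)
    return odd, even


def join_40bit(odd, even):
    """Rebuild the 40-bit value from the register pair."""
    return ((odd & 0xFF) << 32) | (even & MASK32)


odd, even = split_40bit(0x123456789A)
print(hex(odd), hex(even))           # 0x12 0x3456789a
```

A 64-bit double-precision value is split the same way, except that the odd register then holds the full upper 32 bits.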
These 32 registers are considered as general-purpose registers. Several special-
purpose registers are also available for control and interrupts: for example, the
address mode register (AMR) used for circular addressing and interrupt control
registers, as shown in Appendix B.
Addressing modes determine how one accesses memory. They specify how data are
accessed, such as retrieving an operand indirectly from a memory location. Both
linear and circular modes of addressing are supported. The most commonly used
mode is the indirect addressing of memory.
1. *R. Register R contains the address of a memory location where a data value
is stored.
2. *R++(d). Register R contains the memory address (location). After the
memory address is used, R is postincremented (modified), such that the new
address is the current address offset by the displacement value d. If d = 1 (by
default), the new address is R + 1, or R is incremented to the next-higher
address in memory. A double minus (--) instead of a double plus would
update or postdecrement the address to R - d.
3. *++R(d). The address is preincremented or offset by d, such that the current
address is R + d. A double minus would predecrement the memory address
so that the current address is R - d.
4. *+R(d). The address is offset by d, such that the current address is
R + d (as with the preceding case). However, in this case, R itself is not
updated or modified; it keeps its original value after the access.
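The difference between these forms is easiest to see by tracking two things for each access: the address actually used and the value left in R afterwards. The sketch below models only that bookkeeping; addresses are abstract units, and any scaling of d by the data size is ignored as a simplifying assumption.

```python
def access(mode, r, d=1):
    """Return (address_used, new_value_of_R) for one indirect access.

    mode is one of the forms listed above, written without operands:
    "*R", "*R++", "*R--", "*++R", "*--R", "*+R".
    """
    if mode == "*R":
        return r, r
    if mode == "*R++":
        return r, r + d          # use R, then postincrement
    if mode == "*R--":
        return r, r - d          # use R, then postdecrement
    if mode == "*++R":
        return r + d, r + d      # preincrement, R updated
    if mode == "*--R":
        return r - d, r - d      # predecrement, R updated
    if mode == "*+R":
        return r + d, r          # offset only, R unchanged
    raise ValueError(f"unknown mode {mode!r}")


for mode in ["*R", "*R++", "*R--", "*++R", "*--R", "*+R"]:
    print(f"{mode:5s}", access(mode, 100, 4))
```

Note that *++R(d) and *+R(d) use the same address; they differ only in whether R is modified.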
Circular addressing is used to create a circular buffer. This buffer is created in hard-
ware and is very useful in several DSP algorithms, such as in digital filtering or
correlation algorithms where data need to be updated. An example in Chapter 4
illustrates the implementation of a digital filter using a circular buffer to update the
“delay” samples.
The C6x has dedicated hardware to allow a circular type of addressing. This
addressing mode can be used in conjunction with a circular buffer to update samples
by shifting data without the overhead created by shifting data directly. As a pointer
reaches the end or “bottom” location of a circular buffer that contains the last
element in the buffer, and is then incremented, the pointer is automatically wrapped
around or points to the beginning or “top” location of the buffer that contains the
first element.
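The wraparound behaviour can be modelled in software with modulo arithmetic. The following sketch only imitates what the addressing hardware does automatically; the base address and buffer size are hypothetical, and the buffer is assumed to start on a boundary that is a multiple of its size.

```python
def advance(pointer, step, base, size):
    """Advance a pointer inside a circular buffer of `size` bytes starting at `base`.

    Positive or negative steps wrap around the block, so the pointer never
    leaves the range base .. base + size - 1.
    """
    return base + (pointer - base + step) % size


BASE, SIZE = 0x1000, 8           # hypothetical 8-byte buffer
p = BASE + 6
p = advance(p, 2, BASE, SIZE)    # steps past the bottom, wraps to the top
print(hex(p))                    # 0x1000
```

Because only the pointer moves, the newest sample can overwrite the oldest one in place and no data has to be shifted.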
Two independent circular buffers are available using BK0 and BK1 within the
address mode register (AMR), as shown in Appendix B. The eight registers A4
through A7 and B4 through B7, in conjunction with the two .D units, can be used
as pointers (all registers can be used for linear addressing). The following code
segment illustrates the use of a circular buffer using register B2 (only side B can be
used) to set the appropriate values within AMR:
MVK   .S2 0x0004,B2 ;lower 16 bits to B2, select A5 as pointer
MVKLH .S2 0x0005,B2 ;upper 16 bits to B2, select BK0 with N = 5
MVC   .S2 B2,AMR    ;move 32 bits of B2 to AMR
The two move instructions MVK and MVKLH (using the .S unit) move 0x0004
into the 16 LSBs of register B2 and 0x0005 into the 16 MSBs of B2. The MVC
instruction (move between the control register file and the register file) is the
only instruction that can access the AMR and the other control registers (shown in
Appendix B); it executes only on the B side, using the functional units and
registers on side B. The 32-bit value created in B2 is then transferred to AMR
with the instruction MVC [6].
The value 0x0004 = (0100)b loaded into the 16 LSBs of AMR sets bit 2 (the third bit)
to 1 and all other bits to zero. This sets the mode field of A5 to 01, which selects
register A5 as the pointer to a circular buffer using block BK0.
Table 3.4 shows the modes associated with registers A4 through A7 and B4
through B7. The value 0x0005 = (0101)b loaded into the 16 MSBs of AMR sets bits 16
and 18 to 1 (other bits to zero). This corresponds to the value N = 5, which selects
the size of the buffer as 2^(N+1) = 64 bytes using BK0. For example, if a buffer size
of 128 bytes is desired using BK0, the upper 16 bits of AMR are set to (0110)b = 0x0006.
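The AMR value used in this example can be assembled field by field. The sketch below assumes the usual C6x AMR layout (two mode bits per pointer register in the order A4 to A7, B4 to B7, starting at bit 0; BK0 in bits 20..16; BK1 in bits 25..21), which should be checked against Table 3.4 and Appendix B. The helper names are illustrative.

```python
POINTER_REGS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]
MODE_LINEAR, MODE_CIRC_BK0, MODE_CIRC_BK1 = 0b00, 0b01, 0b10


def amr_value(modes, bk0_n=0, bk1_n=0):
    """Build a 32-bit AMR value.

    modes:  dict such as {"A5": MODE_CIRC_BK0}; unlisted registers stay linear.
    bk0_n, bk1_n: N fields (0..31); the block size is 2**(N + 1) bytes.
    """
    value = 0
    for reg, mode in modes.items():
        value |= mode << (2 * POINTER_REGS.index(reg))   # 2 mode bits per register
    value |= (bk0_n & 0x1F) << 16                        # BK0 field (assumed bits 20..16)
    value |= (bk1_n & 0x1F) << 21                        # BK1 field (assumed bits 25..21)
    return value


def block_size_bytes(n):
    """Buffer size selected by an N field."""
    return 2 ** (n + 1)


amr = amr_value({"A5": MODE_CIRC_BK0}, bk0_n=5)
print(hex(amr), block_size_bytes(5))     # 0x50004 64
```

The lower half of the result is the 0x0004 loaded by MVK and the upper half is the 0x0005 loaded by MVKLH; passing bk0_n=6 gives the 128-byte case.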
If assembly code is used for the circular buffer, as execution returns to a calling C
function, AMR needs to be reinitialized to the default linear mode. Hence the
pointer’s address must be saved.
.word value
reserves 32 bits in memory and fills them with the specified value. A mnemonic is an
actual instruction that executes at run time. The instruction (mnemonic or assembler
directive) cannot start in column 1. The Unit field, which can be one of the
eight CPU units, is optional. Comments starting in column 1 can begin with either
an asterisk or a semicolon, whereas comments starting in any other column must
begin with a semicolon.
Code for the floating-point processors C3x/C4x is not compatible with code for
the fixed-point processors C1x, C2x, and C5x/C54x. However, the code for the fixed-
point C62x is compatible with the code for the floating-point C67x. C62x code is
actually a subset of C67x code. Additional instructions to handle double-precision
and floating-point operations are available only on the C67x processor (some addi-
tional instructions are also available on the fixed-point C64x processor).
General [i]
1. The dominant architecture in the PC market, the Intel IA-32, belongs to the
Complex Instruction Set Computer (CISC) design. The obvious reason for this
classification is the “complex” nature of its Instruction Set Architecture (ISA). The
motivation for designing such complex instruction sets is to provide an instruction set
that closely supports the operations and data structures used by Higher-Level
Languages (HLLs). However, the side effects of this design effort are far too serious
to ignore.
Evolution of RISC [ii]
3. For these and other reasons, in the early 1980s designers started looking at
simple ISAs. Because these ISAs tend to produce instruction sets with far fewer
instructions, they coined the term Reduced Instruction Set Computer (RISC). Even
though the main goal was not to reduce the number of instructions, but the
complexity, the term has stuck.
4. There is no precise definition of what constitutes a RISC design. However, we
can identify certain characteristics that are present in most RISC systems.
Definition of RISC [iii]
MIPS, and Berkeley RISC 1 and 2 were all designed with a similar
philosophy which has become known as RISC. Certain design features
have been characteristic of most RISC processors:
(1) One Cycle Execution Time. RISC processors have a CPI
(clocks per instruction) of one cycle. This is due to the
optimization of each instruction on the CPU and a technique
called pipelining;
(2) Pipelining. A technique that allows for simultaneous execution
of parts, or stages, of instructions to more efficiently process
instructions;
(3) Large Number of Registers. The RISC design philosophy
generally incorporates a larger number of registers to prevent
large numbers of interactions with memory.
7. Another general goal was to provide every possible addressing mode for
every instruction, known as orthogonality, to ease compiler implementation.
Arithmetic operations could therefore often have results as well as operands directly
in memory (in addition to register or immediate).
8. The attitude at the time was that hardware design was more mature than
compiler design so this was in itself also a reason to implement parts of the
functionality in hardware and/or microcode rather than in a memory constrained
compiler (or its generated code) alone. This design philosophy became retroactively
termed Complex Instruction Set Computer (CISC) after the RISC philosophy came
onto the scene.
9. An important force encouraging complexity was very limited main memories
(on the order of kilobytes). It was therefore advantageous for the density of
information held in computer programs to be high, leading to features such as highly
encoded, variable-length instructions that do data loading as well as calculation.
These issues were of higher priority than the ease of decoding such instructions.
10. An equally important reason was that main memories were quite slow (a
common type was ferrite core memory); by using dense information packing, one
could reduce the frequency with which the CPU had to access this slow resource.
Modern computers face similar limiting factors: main memories are slow compared to
the CPU and the fast cache memories employed to overcome this are instead limited
in size. This may partly explain why highly encoded instruction sets have proven to
be as useful as RISC designs in modern computers.
12. The simplest way to examine the advantages and disadvantages of RISC
architecture is by contrasting it with its predecessor, CISC (Complex Instruction Set
Computers) architecture.
13. Multiplying Two Numbers in Memory. The main memory is divided into
locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4. The execution
unit is responsible for carrying out all computations. However, the execution unit can
only operate on data that has been loaded into one of the six registers (A, B, C, D, E,
or F). Let's say we want to find the product of two numbers - one stored in location
2:3 and another stored in location 5:2 - and then store the product back in
location 2:3.
14. The CISC Approach. The primary goal of CISC architecture is to complete a
task in as few lines of assembly as possible. This is achieved by building processor
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
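To make the four-instruction sequence concrete, the sketch below interprets it on a toy machine. Everything here is illustrative: memory is a dictionary keyed by "row:column" strings, the operand values 6 and 7 are made up, and only the three instruction types used above are modelled.

```python
def run(program, memory):
    """Execute LOAD/PROD/STORE instructions on a dict-based memory.

    memory maps "row:col" strings to integers; registers are created on demand.
    Returns the register file after the last instruction.
    """
    regs = {}
    for op, dst, src in program:
        if op == "LOAD":
            regs[dst] = memory[src]              # memory -> register
        elif op == "PROD":
            regs[dst] = regs[dst] * regs[src]    # register * register
        elif op == "STORE":
            memory[dst] = regs[src]              # register -> memory
        else:
            raise ValueError(f"unknown instruction {op}")
    return regs


memory = {"2:3": 6, "5:2": 7}        # hypothetical operand values
program = [("LOAD", "A", "2:3"),
           ("LOAD", "B", "5:2"),
           ("PROD", "A", "B"),
           ("STORE", "2:3", "A")]
regs = run(program, memory)
print(memory["2:3"], regs)           # 42 {'A': 42, 'B': 7}
```

Each step touches either memory or the execution unit, never both, and the second operand is still held in register B when the sequence finishes.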
17. Analysis. At first, this may seem like a much less efficient way of completing
the operation. Because there are more lines of code, more RAM is needed to store
the assembly level instructions. The compiler must also perform more work to
convert a high-level language statement into code of this form.
a. Advantage of RISC. However, the RISC strategy also brings some very
important advantages. Because each instruction requires only one
clock cycle to execute, the entire program will execute in approximately
the same amount of time as the multi-cycle "MUL" command. These
RISC "reduced instructions" require fewer transistors of hardware space
than the complex instructions, leaving more room for general-purpose
registers. Because all of the instructions execute in a uniform amount
of time (i.e. one clock), pipelining is possible.
(1) Separating the "LOAD" and "STORE" instructions actually
reduces the amount of work that the computer must perform.
(2) After a CISC-style "MUL" command is executed, the processor
automatically erases the registers. If one of the operands needs
to be used for another computation, the processor must re-load
the data from the memory bank into a register. In RISC, the
operand will remain in the register until another value is loaded
in its place.
b. The following table contrasts the two architectures; the overall
advantage is then discussed on the basis of this comparison.
CISC                                      | RISC
------------------------------------------+------------------------------------------
Emphasis on hardware                      | Emphasis on software
Includes multi-clock complex instructions | Single-clock, reduced instruction only
Memory-to-memory: "LOAD" and "STORE"      | Register to register: "LOAD" and "STORE"
incorporated in instructions              | are independent instructions
Small code sizes, high cycles per second  | Low cycles per second, large code sizes
Transistors used for storing complex      | Spends more transistors on memory
instructions                              | registers
18. The Performance Equation. The following equation is commonly used for
expressing a computer's performance ability:
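A commonly quoted form of the performance equation is: time per program = (instructions per program) x (cycles per instruction) x (time per cycle). The sketch below evaluates that form for two made-up cases; the clock period and the instruction and cycle counts are hypothetical and chosen only to show how the factors trade off.

```python
def execution_time(instructions, cycles_per_instruction, cycle_time_s):
    """time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cycles_per_instruction * cycle_time_s


CYCLE = 10e-9    # hypothetical 100 MHz clock for both machines

# Hypothetical figures: one complex multi-cycle instruction versus four
# single-cycle instructions for the same multiply-in-memory task.
cisc = execution_time(instructions=1, cycles_per_instruction=4, cycle_time_s=CYCLE)
risc = execution_time(instructions=4, cycles_per_instruction=1, cycle_time_s=CYCLE)
print(cisc, risc)
```

In these terms a CISC design tries to shrink the instruction count at the cost of more cycles per instruction, whereas a RISC design accepts more instructions in exchange for fewer cycles per instruction.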
that the RISC use of RAM and emphasis on software has become
ideal.
21. Advanced Micro Devices (AMD) 29000 [x]. The AMD 29000, often simply 29k,
was a popular family of 32-bit RISC microprocessors and microcontrollers developed
and fabricated by Advanced Micro Devices (AMD).
a. They were, for a time, the most popular RISC chips on the market,
widely used in laser printers from a variety of manufacturers.
b. In late 1995 AMD dropped development of the 29k because the design
team was transferred to support the PC side of the business and was
realigned towards the embedded 186 family of 80186 derivatives.
c. The majority of AMD's resources were then concentrated on their high-
performance, desktop x86 clones, using many of the ideas and
individual parts of the latest 29k to produce the AMD K5.
22. Advanced RISC Machine (ARM). The ARM is a 32-bit reduced instruction
set computer (RISC) instruction set architecture (ISA) developed by ARM Holdings.
It was known as the Advanced RISC Machine, and before that as the Acorn RISC
Machine.
a. The ARM architecture is the most widely used 32-bit ISA in terms of
numbers produced.
b. It was originally conceived as a processor for desktop personal
computers by Acorn Computers, a market now dominated by the x86
family used by IBM PC compatible computers.
c. The relative simplicity of ARM processors made them suitable for low
power applications.
d. This has made them dominant in the mobile and embedded electronics
market as relatively low cost and small microprocessors and
microcontrollers.
23. Atmel AVR [xi]. The AVR is a modified Harvard architecture 8-bit RISC single-chip
microcontroller (µC) which was developed by Atmel in 1996.
a. The AVR was one of the first microcontroller families to use on-chip
flash memory for program storage, as opposed to One-Time
Programmable ROM, EPROM, or EEPROM used by other
microcontrollers at the time.
a. The early MIPS architectures were 32-bit, and later versions were 64-
bit.
b. Multiple revisions of the MIPS instruction set exist, including MIPS I,
MIPS II, MIPS III, MIPS IV, MIPS V, MIPS32, and MIPS64.
c. The current revisions are MIPS32 (for 32-bit implementations) and
MIPS64 (for 64-bit implementations).
d. MIPS32 and MIPS64 define a control register set as well as the
instruction set.
27. SuperH [xv]. SuperH (SH) is a 32-bit reduced instruction set computer (RISC)
instruction set architecture (ISA) developed by Hitachi. It is implemented by
Conclusion
References
[i] "Microprocessors From the Programmer's Perspective" by Andrew Schulman, 1990.
[ii] http://cse.stanford.edu/class/sophomore-college/projects-00/risc/whatis/index.html
[iii] Stanford sophomore students defined RISC as “a type of microprocessor
architecture that utilizes a small, highly-optimized set of instructions, rather than a
more specialized set of instructions often found in other types of architectures”.
[iv] "Guide to RISC Processors for Programmers and Engineers", Chapter 3: "RISC
Principles", by Sivarama P. Dandamudi, 2005, ISBN 978-0-387-21017-9: "the main
goal was not to reduce the number of instructions, but the complexity".
[v] http://www.cpushack.net/CPU/cpuAppendA.html
[vi] "CISC, RISC, and DSP Microprocessors" by Douglas L. Jones, 2000.
[vii] http://cse.stanford.edu/class/sophomore-college/projects-00/risc/risccisc/
[viii] http://en.wikipedia.org/wiki/RISC
[ix] http://en.wikipedia.org/wiki/DEC_Alpha
[x] http://en.wikipedia.org/wiki/AMD_29k
[xi] http://en.wikipedia.org/wiki/Atmel_AVR
[xii] http://en.wikipedia.org/wiki/MIPS_architecture
[xiii] http://en.wikipedia.org/wiki/PA-RISC
[xiv] http://en.wikipedia.org/wiki/Power_Architecture
[xv] http://en.wikipedia.org/wiki/SuperH
[xvi] http://en.wikipedia.org/wiki/SPARC