Introduction to Digital Signal Processing
Introduction
DSP is a technique of performing the mathematical operations on the
signals in digital domain
As real time signals are analog in nature we need to first convert the
analog signal to digital, then we have to process the signal in digital
domain and again converting back to analog domain
Thus ADC is required at the input side whereas a DAC is required at
the output end
A typical DSP system is as shown in Fig1.1
Fig 1.1: A Typical DSP System
Need for DSP
Analog signal Processing has the following drawbacks:
They are sensitive to environmental changes
Aging
Uncertain performance in production units
Variation in performance of units
Cost of the system will be high
Scalability
Advantages and Disadvantages of DSP
Greater accuracy
Cheaper
Flexibility in configuration, as reconfiguration is possible by
changing the program, whereas an analog system reconfiguration
would involve, redesigning the hardware
Easy data storage on magnetic media without loss of fidelity
Implementation of better algorithms for processing
Adaptable to low frequency signal processing, where analog
processors would require very large passive elements like inductors
and capacitors
Feasibility of time sharing of a signal processor among a number of
signals
However DSP suffers from few limitations:
System complexity is increased because of the need to convert real-
life analog signals to digital and the processed digital signal back to
analog
Limited bandwidth due to constraint on sampling rate
Power consumption
Applications of DSP
Consumer electronic appliances - TV, music synthesizers, stereos,
recorders etc
Image processing techniques like image compression, image
enhancement, image analysis and recognition
Process control
Robotics
Medical application like CT scans, X-ray scanning, MRI, ECG, EEG
etc
Speech processing
Earth quake monitoring
Communication application like modems, cellular phones, FAX etc
A Digital Signal Processing System
A computer or a processor is used for digital signal processing
Antialiasing filter is a LPF which passes signal with frequency less
than or equal to half the sampling frequency in order to avoid
Aliasing effect
Similarly at the other end, reconstruction filter is used to reconstruct
the samples from the staircase output of the DAC (Fig1.2)
Fig 1.2 The Block Diagram of a DSP System
Signals that occur in a typical DSP are as shown in Fig 1.3.
Fig 1.3: (a) Continuous time signal (b) Sampled Signal
(c) Sampled Data Signal (d) Quantized Signal (e) DAC Output
The Sampling Process
ADC process involves sampling the signal and then quantizing the
same to a digital value
In order to avoid Aliasing effect, the signal has to be sampled at a
rate at least equal to the Nyquist rate
The condition for Nyquist Criterion is as given below,
fs≥2 fmax
where , fs is the sampling frequency
fmax is the maximum frequency component in the message signal
If the sampling of the signal is carried out with a rate less than the
Nyquist rate, the higher frequency components of the signal cannot
be reconstructed properly
Example:
If fs=20kHz, then fmax should be ≤10kHz. If the input signal has a
frequency components greater than fmax, then these components are
converted into low frequency ones, resulting in a noise called
aliasing noise
Problem 1: Determine the Nyquist rate for the analog signal
x(t)=3cos100t-6sin400t+4cos60t
Solution:
The frequencies present in x(t) are
f1=50Hz, f2=200Hz, f3=30Hz
The max frequency component fmax=f2=200Hz. Hence as per
sampling theorem Nyquist rate is fs=2fmax= 400Hz
Problem 2: The signal is sampled at a rate of 300Hz. After sampling
the signal is reconstructed from the samples. Determine the
frequency of the reconstructed analog signal. comment on the
aliasing error
Solution: The signal in problem1 has frequencies 50Hz, 200Hz, 30Hz
and is sampled at 300Hz
30Hz and 50Hz are less than fs/2 (i.e, 150Hz) and hence not affected
by aliasing and hence are
The 200Hz component of x(t) is above fs/2 and is changed by
aliasing effect. In the reconstructed analog filter the 200Hz filter will
be aliased into 200-fs=200-300=-100Hz (minus sign represents a
phase shift 0f 1800)
In summary the reconstructed analog signal has the frequency
components 30Hz, 50Hz and 100Hz
Problem 3: Find the minimum sampling rate for y(t) such that no
aliasing results. Find suitable antialiasing filter (type of cut-off
frequency) if the sampling rate must be 6KHz
Solution:
The maximum frequency component of y(t) is 5kHz, since beyond
this the amplitude is very low. Hence the minimum sampling rate for
y(t)=2fmax =10kHz i.e, fs=10kHz
Using a sampling rate of 6 kHz results in aliasing error which can be
avoided using anti-aliasing filter
For 6kHz sampling rate, the maximum frequency component in y(t)
should be less than 6kHz/2=3kHz. Hence use a low pass filter with
cut-off of 3kHz as anti-aliasing filter
Discrete Time Sequences
Consider an analog signal x(t) given by,
x(t)= A cos (2ft)
If this signal is sampled at a Sampling Interval T, in the above
equation replacing t by nT we get,
x(nT) = A cos (2fnT) where n= 0,1, 2,..etc
For simplicity denote x (nT) as x (n)
x(n) = A cos (2fnT) where n= 0,1, 2,..etc
We have fs=1/T also θ = 2fnT
x(n) = A cos (2fnT)= A cos (2fn/fs) = A cos θn
The quantity θ is called as digital frequency
θ = 2fT = 2f/fs radians
Fig 1.5 A Cosine Waveform
A sequence that repeats itself after every period N is called a
periodic sequence
Consider a periodic sequence x (n) with period N
x(n)=x (n+N), n=……..,-1,0,1,2,……..
Frequency response gives the frequency domain equivalent of a
discrete time sequence
It is denoted as X(e jθ), i.e,
Frequency response of a discrete sequence involves both magnitude
response and phase response
Problem 4: A signal x(t) has the frequency components of 50Hz,
250Hz and 600Hz. It is sampled with 4kHz sampling rate. Determine
the resulting digital frequencies (in radians)
Solution:
Digital frequency θn=2fn/fs
θ1=2x50/4k=0.025rad
θ2=2x250/4k=0.125rad
θ3=2x600/4k=0.3rad
Problem 5: A signal is sampled with 8kHz sampling frequency.
Determine the analog frequencies (in Hz and rad/s) for the digital
frequencies 0, /4, /3, /2, and 3/4 radians.
Solution:
Discrete Fourier Transform (DFT) and Fast Fourier Transform
(FFT)
DFT Pair
DFT is used to transform a time domain sequence x(n) to a frequency
domain sequence X(k)
The equations that relate the time domain sequence x(n) and the
corresponding frequency domain sequence X(k) are called DFT Pair
and is given by,
DFT(FFT):
where (k=0,1,.......,N-
1)
IDFT(IFFT):
The Relationship between DFT and Frequency Response
We have
Also
, k = 0, 1, 2,...(N-1)
From the above expression it is clear that we can use DFT to find the
Frequency response of a discrete signal
Spacing between the elements of X(k) is given as
f=fs/N=1/NT=1/T0
Where T0 is the signal record length
It is clear from the expression of f that, in order to minimize the
spacing between the samples N has to be a large value
Although DFT is an efficient technique of obtaining the frequency
response of a sequence, it requires more number of complex
operations like additions and multiplications
Thus many improvements over DFT were proposed
One such technique is to use the periodicity property of the twiddle
factor e-j2/N
Those algorithms were called as Fast Fourier Transform
Algorithms
The following table depicts the complexity involved in the
computation using DFT algorithms
Table 1.1 Complexity in DFT algorithm
Operations Number of Computations
Complex Multiplications N2
Complex Additions N (N-1)
Real Multiplications 4N2
Real Additions 2N (2N-1)
Trigonometric Functions 2N2
FFT algorithms are classified into two categories viz
1. Decimation in Time FFT
2. Decimation in Frequency FFT
In decimation in time FFT the sequence is divided in time domain
successively till we reach the sequences of length 2
Whereas in Decimation in Frequency FFT, the sequence X(k) is
divided successively
The complexity of computation will get reduced considerably in case
of FFT algorithms
Problem 6: Compute the 5-point DFT of the sequence x[n]=[1 2 4] and
hence plot its magnitude and phase spectra
Solution: x[n]=[1 2 4] is a 3-point sequence and hence 5-point DFT
can be applied
The 5-point DFT is computed as
where k=0, 1, 2, 3, 4 and N=5
x[0]=1; x[1]=2; x[2]=4
X[k]=x(0)e-j20k/N+ x(1)e-j2k/N+ x(2)e-j22k/N
=1+2xe-j2k/5+4x e-j4k/5
For k=0, X(0)=7
Similarly,
X(1)=4.55-1.93
X(2)=2.71.3399
X(3)=2.7-1.3399
X(4)=4.551.93
Magnitude of X(k)=X(k)=7,4.55,2.7,2.7,4.55
Phase of X(k)=X(k)= 0,-1.93,1.3399,-1.3399,1.93
Problem 7: A signal x(t) is filtered and sampled using a sampling rate
of 8kHz. If 1024 samples of this signal are used to compute the
discrete Fourier transform X(k), find the frequency spacing between
adjacent X(k) elements. What is the analog frequency corresponding
to k=12, 30, 64, 120, 200, 320, 400
Solution: The frequency spacing,
f=fs/N=8KHz/1024=7.8125Hz
Analog frequency corresponding to k=12 is
fxk=7.8125x12=93.75 Hz
Similarly the analog frequencies corresponding to k=30, 64, 120,
200, 320 and 400 are 234.375Hz, 500Hz, 937.5Hz, 1562.5Hz,
2500Hz and 3125Hz respectively
Fast Fourier Transform (FFT)
The direct computation of DFT and IDFT requires a large number of
complex multiplies
A number of algorithms have been developed to efficiently compute
DFT and IDFT
These algorithms use power of 2 points and exploit the periodic
nature of the complex exponential ej2nk/N occurring in the DFT and
IDFT equations
Table below compares the complex multiples needed to compute
DFT directly by using an FFT algorithm called radix-2 algorithm
The radix-2 algorithm uses N that is an integer power of 2, such as 2,
4, 8, 16, etc
It is possible to show that the DFT requires N2 complex multiples
and radix-2 FFT algorithm requires N/2log 2N
This produces computational savings for larger value of N
Table: Computations complexity of FFT algorithm
N Direct FFT computation Amount of saving Ratio of computational
point computation (radix2)N/2log2N of complex complexity direct:FFT
N2 multiplications N2/(N/2)log2N=2N2/log2N
2 4 1 3 4
4 16 4 12 4
8 64 12 52 5.33
16 256 32 244 8
32 1024 80 944 12.8
64 4096 192 3904 21.8
256 65536 1024 64512 64
1024 1048576106 5120 106 204.8
Linear Time Invariant Systems
A system which satisfies superposition theorem is called as a linear
system and a system that has same input output relation at all times
is called a Time Invariant System
Systems, which satisfy both the properties, are called LTI systems
An LTI System
LTI systems are characterized by its impulse response or unit sample
response in time domain whereas it is characterized by the system
function in frequency domain
Convolution
Convolution is the operation that related the input output of an LTI
system, to its unit sample response
The output of the system y(n) for the input x(n) and the impulse
response of the system being h (n) is given as
where, x(n) is the input of the system
h(n) is the impulse response of the system
y(n) is the output of the system
Problem 8: Find the convolution of the sequence x(n)=[1 2 3 4] and
h(n)=[3 2 1]
Solution:
y(n)= x(n) * h(n)=[3 8 14 20 11 4]
Problem 9: Compute and plot the sequence x(n) * h(n), where x[n]= [3
4 2 11 0 7 -1 10] and h=[1.2 3.7 4.8]. Determine the length of the
computed sequence
Solution:
y(n)= x(n) * h(n)
y(n)= 3.6000 15.9000 31.6000 39.8000 50.3000 61.2000
24.7000 41.9000 32.2000 48.0000
Length of the computed sequence y(n)=lenx(n)+lenh(n)=8+3-1=10
plot of sequence x(n) * h(n)
Matlab code is:
x=[3 4 2 11 0 7 -1 10];
h=[1.2 3.7 4.8];
y=conv(x,h);
figure(1);
stem(y);
l=length(y);
Problem 10: Find the sequence X(k) and H(k) using FFT. Next
multiply two sequences to generate the sequence Y(k)=X(k).H(k).
Use IFFT to compute y(n)
Solution :
x=[3 4 2 11 0 7 -1 10];
h=[1.2 3.7 4.8];
xlen=length(x);
hlen=length(h);
ylen=xlen+hlen-1;
N=ylen;
Xk=fft(x,N);
Hk=fft(h,N);
Yk=Xk.*Hk;
y=ifft(Yk)
figure(1);
stem(y);
X(k)= 36.0000,-5.8262 - 5.7921i, -7.6803 - 3.4410i, 9.8262 -
2.9919i, 14.6803 - 0.8123i, -28.0000, 14.6803 + 0.8123i, 9.8262 +
2.9919i, -7.6803 + 3.4410i, -5.8262 + 5.7921i
H(k)= 9.7000,5.6766 - 6.7399i, -1.5399 - 6.3403i,-3.8266 -
0.6975i,-0.3101 + 2.3903i, 2.3000, -0.3101 - 2.3903i,-3.8266 +
0.6975i,-1.5399 + 6.3403i, 5.6766 + 6.7399i
Y(k)= 102x3.4920,-0.7211 + 0.0639i, -0.0999 + 0.5399i,-0.3969 +
0.0459i, -0.0261 + 0.3534i,-0.6440, -0.0261 - 0.3534i, -0.3969 -
0.0459i,-0.0999 - 0.5399i, -0.7211 - 0.0639i
y={ 3.6000 , 15.9000, 31.6000, 39.8000, 50.3000, 61.2000 ,
24.7000 , 41.9000, 32.2000, 48.0000
The result is same as the first problem
Problem 11: For the above problem use 8-point DFT and IDFT.
Compare these results with that obtained in above example
Solution:
x=[3 4 2 11 0 7 -1 10];
h=[1.2 3.7 4.8];
Xk=fft(x,8);
Hk=fft(h,8);
Yk=Xk.*Hk;
y=ifft(Yk)
y ={ 35.8000 63.9000 31.6000 39.8000 50.3000 61.2000
24.7000 41.9000}
The last two samples of y(n)=conv(x,h) i.e, 32.8 and 48 are added
with first 2 samples 3.6 and 15.9 respectively to result 35.8 and 63.9
of circular convolution output
Z Transformation
Z Transformations are used to find the frequency response of the
system
The Z Transform for a discrete sequence x(n) is given by,
The System Function
An LTI system is characterized by its System function or the
transfer function
The system function of a system is the ratio of the Z transformation
of its output to that of its input
It is denoted as H(z) and is given by
H(z) = Y(z)/ X(z)
The magnitude and phase of the transfer function H(z) gives the
frequency response of the system
From the transfer function we can also get the poles and zeros of the
system by solving its numerator and denominator respectively
Digital Filters
Filters are used to remove the unwanted components in the sequence
They are characterized by the impulse response h(n)
The general difference equation for an Nth order filter is given by,
A typical digital filter structure is as shown in figure below
Structure of a Digital Filter
Values of the filter coefficients vary with respect to the type of the
filter
Design of a digital filter involves determining the filter coefficients
Based on the length of the impulse response, digital filters are
classified into two categories viz Finite Impulse Response (FIR)
Filters and Infinite Impulse Response (IIR) Filters
FIR Filters
FIR filters have impulse responses of finite lengths
In FIR filters the present output depends only on the past and present
values of the input sequence but not on the previous output
sequences
Thus they are non recursive hence they are inherently stable
FIR filters possess linear phase response and hence they are very
much applicable for the applications requiring linear phase response
The difference equation of an FIR filter is represented as
The frequency response of an FIR filter is given as
Also
The major drawback of FIR filters is, they require more number of
filter coefficients to realize a desired response as compared to IIR
filters
Thus the computational time required will also be more
Problem 12: Find the magnitude and phase response of an FIR filter
represented by the difference equation
y(n)= 0.5 x(n) + 0.5 x(n-1)
Solution:
As, y(n)= 0.5 x(n) + 0.5 x(n-1)
h(n)= 0.5 δ(n) + 0.5 δ(n-1) = [0.5 0.5]
H(z)= 0.5+0.5z-1
H(ejθ)= 0.5+0.5 e -jθ
= 0.5+0.5 cosθ -j0.5 sinθ
=0.5 (1+ cosθ) -j0.5 sinθ
= [0.5*2* cos2 (θ/2)]-j[0.5*2* sin (θ/2)* cos (θ/2)]
= cos2 (θ/2) -j[sin (θ/2)* cos (θ/2)]
mag (H(ejθ)) = sqrt (cos4 (θ/2) +sin2 (θ/2) cos2 (θ/2))
= sqrt [cos2 (θ/2)( cos2 (θ/2)+ sin2 (θ/2))]
= cos (θ/2)
Similarly,
Phase (H(ejθ)) = tan –1[-(sin (θ/2) cos (θ/2))/ cos2 (θ/2)]
= tan –1[-tan (θ/2)]
= - (θ/2)
The magnitude and phase response curves of the designed FIR filter
are as shown in figure below
FIR Filter Design
Frequency response of an FIR filter is given by the following
expression,
Design procedure of an FIR filter involves the determination of the
filter coefficients bk
IIR Filters
Unlike FIR filters, IIR filters have infinite number of impulse response
samples
They are recursive filters as the output depends not only on the past
and present inputs but also on the past outputs
They generally do not have linear phase characteristics
Typical system function of such filters is given by,
Stability of IIR filters depends on the number and the values of the
filter coefficients
The major advantage of IIR filters over FIR is that, they require lesser
coefficients compared to FIR filters for the same desired response,
thus requiring less computation time
Problem 13: Obtain the transfer function of the IIR filter whose
difference equation is given by, y(n)= 0.9y(n-1)+0.1x(n)
Solution:
y(n)= 0.9y(n-1)+0.1x(n)
Taking Z transformation both sides
Y(z)= 0.9 z-1 Y(z) + 0.1 X(z)
Y(z) [ 1- 0.9 z-1] = 0.1 X(z)
The transfer function of the system is given by the expression,
H(z)= Y(z)/X(z)
= 0.1/ [ 1- 0.9 z-1]
Realization of the IIR filter with the above difference equation is as
shown in figure below
IIR Filter Structure
IIR Filter Design
IIR filters can be designed using two methods viz using analog filter
design techniques and direct method
In this approach, a digital filter can be designed based on its
equivalent analog filter
An analog filter is designed first for the equivalent analog
specifications for the given digital specifications
Then using appropriate frequency transformations, a digital filter
can be obtained
The filter specifications consist of passband and stopband ripples in
dB and Passband and Stopband frequencies in rad/sec
Low pass Filter Specifications
Direct IIR filter design methods are based on least squares fit to a
desired frequency response
These methods allow arbitrary frequency response specifications
Decimation and Interpolation
Decimation and Interpolation are two techniques used to alter the
sampling rate of a sequence
Decimation involves decreasing the sampling rate without
violating the sampling theorem
Interpolation increases the sampling rate of a sequence
appropriately by considering its neighbouring samples
Decimation
Decimation is a process of dropping the samples without violating
sampling theorem
The factor by which the signal is decimated is called as decimation
factor and it is denoted by M
It is given by,
where
Decimation Process
Problem 14: Let x(n)=[3 2 2 4 1 0 –3 –2 –1 0 2 3] be decimated with a
factor of 2. Let the filtered sequence be w(n)=[2.1 2 3.9 1.5 0.1 –2.9 –2 –
1.1 0.1 1.9 2.9]. Obtain the decimated sequence y(m)
Sequence y(m) can be obtained by dropping every alternative sample of
w (n)
y(m) = [2 1.5 -2.9 -1.1 1.9]
Problem 15: Let x(n)=[1 0 2 4.1 5 6 7 3 5.2] and filter sequence h(n)=[0.2
1.2 -0.3 0.4] be decimated by a factor of 3. Obtain the decimated
sequence
Solution:
The output sequence after filter is w(n)=[0.2 1.2 0.1 3.62 5.2 6.77 8.74
9.2 4.94 8.14 -0.36 2.08]
The decimated sequence is y(m)=[0.2 3.62 8.74 8.14]
Interpolation
Interpolation is a process of increasing the sampling rate by
inserting new samples in between
The input output relation for the interpolation, where the sampling
rate is increased by a factor L, is given as,
where w(n)= x(m/L), m=0, ±L, ±2L……
0 for other values of m
Interpolation Process
Problem 16: Let x(n)= [0 3 6 9 12] be interpolated with L=3. If the
filter coefficients of the filters are bk=[1/3 2/3 1 2/3 1/3], obtain the
interpolated sequence
Solution:
After inserting zeros,
w(m) = [0 0 0 3 0 0 6 0 0 9 0 0 12]
bk=[1/3 2/3
↑ 1 2/3 1/3]
We have,
y(m)= Σ bk w(m-k) = b-2 w(m+2)+ b-1 w(m+1)+ b0 w(m)+ b1
w(m-1)+ b2 w(m-2)
Substituting the values of m, we get
y(0)= b-2 w(2)+ b-1 w(1)+ b0 w(0)+ b1 w(-1)+ b2 w(-2)= 0
y(1)= b-2 w(3)+ b-1 w(2)+ b0 w(1)+ b1 w(0)+ b2 w(-1)=1
y(2)= b-2 w(4)+ b-1 w(3)+ b0 w(2)+ b1 w(1)+ b2 w(0)=2
Similarly we get the remaining samples as,
y(n) = [ 0 1 2 3 4 5 6 7 8 9 10 11 12]
Computational Accuracy in DSP Implementations
Introduction
The issues related to computational accuracy of algorithms when
implemented using programmable digital signal processors are
discussed
Various formats of number representation and their effect on the
dynamic range and precision of signals represented using these
formats are discussed
The various sources of errors in the implementation of DSP
algorithms and how to control these errors while designing DSP
systems are discussed
Following topics to be discussed
1. Number formats for signals and coefficients in DSP systems
2. Dynamic range and precision
3. Sources of error in DSP implementations
4. A/D conversion errors
5. DSP computational errors
6. D/A conversion errors
Number formats for signals and coefficients in DSP systems
In a digital signal processor, as in any other digital system, signals
are represented as numbers right from the input, through different
stages of processing, to the input
The DSP structures, such as filters, also require numbers to specify
coefficients
There are various ways of representing these numbers, depending on
the range and precision of signals and coefficients to be represented,
hardware complexity, and speed requirements
The typical formats used for numbers to represent signals and
coefficients in DSP systems
In a DSP, signals are represented as discrete sets of numbers from the
input stage along through intermediate processing stages to the
output
Even DSP structures such as filters require numbers to specify
coefficients for operation
Two typical formats for these numbers are :
1. Fixed-point format
2. Floating-point format
Fixed point format
The simplest scheme of number representation is the format in which
the number is represented as an integer or fraction using a fixed
number of bits
An n-bit fixed-point signed integer −2n-1 ≤ x ≤ 2n-1 −1 is represented
as:
x = -s. 2n-1 + bn-2 .2n-2 + bn-3 .2n-3 + … + b1 .21 + b0 .20 ---- (1)
where s represents the sign of the number (s = 0 for positive and s =
1 for negative)
Fig 1.12 (a) Fixed point format to represent signed integers
Fig 1.12 (b) Fixed point format to represent signed fractions
Example 1: What is the range of numbers that can be
represented in a fixed point format using 16 bits if the numbers
are treated as (a) signed integers (b) signed fractions?
Solution: (a). Using 16 bits, the range of integers that can be
represented is determined by substituting n=16 in equation 1 and is
given as
-215 to +215 -1
i.e, -32,768 to +32,767
(b). The range of fractions, as determined from equation 1 using n=16,
is given as
-1 to +(1-2-15)
i.e, -1 to +.999969482
In DSP implementations, multiplication of integers produces
numbers that may require more bits to represent, and in the event of
a fixed number of available bits, it may create wraparound
The wraparound generates the most negative number after the most
positive number, and vice versa
The problem can be tackled by using fractional representation
When two fractions are multiplied, the result is still a fraction
The resulting fraction may use the same number of bits as the
original fractions by discarding the less significant bits
Double Precision Fixed point format
To increase the range of numbers that can be represented in fixed
point format, one obvious approach is to increase its size
If the size is doubled, the range of numbers increases substantially
Simply doubling the size and still using the fixed point format
creates a double precision fixed point format
But such a format requires double the storage for the same data and
may need double the number of accesses for the same size of data
bus of the DSP device
Floating point format
For DSP applications, if an algorithm involves summation of a large
number of products, it requires a large number of bits to represent the
signal to allow for adequate signal growth over the summation
However, since a processor architecture will not allow for an
unlimited number of bits, some processor choose a floating point
format for signal processing computations
A floating point number is made up of a mantissa M x and an
exponent Ex such that its value x is represented as
x= Mx2Ex
If two floating point numbers x and y are multiplied, the product xy
is given by
xy= MxMy2Ex+ Ey
Implementation of a floating point multiplier must contain a
multiplier for the mantissa and an adder for the exponent
An addition of floating point numbers requires normalization of the
numbers to be added so that they have the same components
A commonly used single-precision floating-point representation is
the IEEE 754-1985 format given as:
IEEE-754 format for floating point numbers
F represents the magnitude fraction of the mantissa, and the exponent
E is an integer
Further in determining the mantissa, an implied 1 is placed
immediately before the binary point of the fraction
The sign bit provides the sign of the fractional part of the number
That is to say, with n bits for F, the range of fractional numbers that
can be represented in the mantissa is –(2-2 -n) to +(2-2-n)
The bias depends upon the bits reserved for the component
In Fig, the bias is 127, the largest positive number represented by 8
bits
The value of E can be from 0 to 255
In double precision representation, the exponent uses 11 bits, making
the bias value as 1023
Example 2: Find the decimal equivalent of the floating point
binary number 1011000011100. Assume a format similar to
IEEE-754 in which the MSB is the sign bit followed by 4
exponent bits followed by 8 bits for the fractional part
Solution: The number is negative, as the sign bit is 1
F= 2-4+2-5+2-6 = 0.109375
E=21+22=6
Thus the value of the number is
x= -1.109375x2(6-7) = -0.5546875
Floating point format, used to increase the range of numbers that can
be represented, suffers from the problem of speed reduction for DSP
computation
More steps are required to complete a floating point computation
compared to fixed point computation
Dynamic range and Precision
The dynamic range of a signal is the ratio of the maximum value to
the minimum value that the signal can take in the given number
representation scheme
The dynamic range of a signal is proportional to the number of bits
used to represent it and increases by 6dB for every additional bit
used for the representation
The number of bits used to represent a signal also determines the
resolution or the precision with which the signal can be represented
However, the time taken for certain operations such as the A/D
conversion increases with the increase in the number of bits
Resolution is the minimum value that can be represented using a
number representation format
For instance, if N bits are used to represent a number from 0 to 1, the
smallest value it can take is the resolution and is given as
Resolution = 1/2N for large N
Resolution of a number representation format is normally expressed
as number of bits used in the representation
At times, it is also expressed as a percentage
Precision is an issue related to the speed of DSP implementation
In general, techniques to improve the precision of an implementation
reduce its speed
Larger word size improves the precision but may pose a problem
with the speed of the processor, especially if its bus width is limited
For example, if the 32 bit product of a 16x16 multiplication has to be
preserved without loss of precision, two memory accesses are
required to store and recall the product using a 16 bit bus
Example 3: Calculate the dynamic range and precision of each
of the following number representation formats
1. 24 bit, single precision, fixed point format
2. 48 bit, double precision, fixed point format
3. A floating point format with a 16 bit mantissa and an 8 bit
exponent
Solution: a. Since each bit gives a dynamic range of 6dB, the total
dynamic range is 24x6 =144dB
Percentage resolution is (1/224)x100 = 6x10-6
b. Since each bit gives a dynamic range of 6dB, the total dynamic
range is 48x6 =288dB
Percentage resolution is (1/248)x100 = 4x10-13
For floating point representation, the dynamic range is determined by
the number of bits in the exponent. Since there are 8 exponent bits, the
dynamic range is (28-1)x6= 1530dB
The percentage resolution depends on the number of bits in the
mantissa. Since there are 16 bits in the mantissa, the resolution is
(1/216)x100 = 1.5x10-3 %
Sources of error in DSP implementations
A DSP system consists of an A/D converter, a DSP device, and a D/A
converter
The accuracy of a DSP implementation depends upon a number of
factors contributed by the A/D and D/A conversions and how the
calculations are performed in the DSP device
The error in the A/D and D/A in the representation of analog signals
by a limited number of bits is called the quantization error
The quantization error decreases with the increase in the number of
bits used to represent signals in A/D and D/A converters
The errors in the DSP calculations are due to the limited word length
used
These errors depend upon how the algorithm is implemented in a
given DSP architecture
These errors can be reduced by using a larger word length for data
and by using rounding, instead of truncation, in calculations
A/D Conversion Errors
Consider an A/D converter, shown in Fig, with b bits used to
represent an unsigned signal value
Its digital representation is of the form …xxx….x, where there are b
bits after the assumed binary point
In this kind of binary representation, the value of the least significant
bit is given by
Δ=2-b
The maximum error due to quantization depends on b
The quantization error for a given conversion as shown in the model
of Fig is given by
ε = xq – x
where x is the input and xq is the quantized output
This error is called the truncation error if the signal value above the
largest integral multiple of Δ is simply dropped
It is called the rounding error if the value is rounded to the nearest
integral multiple of Δ. This way the rounding limits the error to ±Δ/2
The signal to noise ratio (SNR) is a measure that is used to evaluate
the performance of the A/D converter
DSP Computational Errors
The DSP computations involve using the digitized signal values and
DSP structures represented by coefficients
These numbers are typically represented in the signed fractional 2’s
complement form
The computation almost always involve multiplications or multiply
and accumulate (MAC) operations
D/A Conversion Errors
A source of error in a D/A converter is due to the fact that, typically,
a D/A converter uses fewer bits in conversion than the number of bits
required by the computed result, produced by the DSP device
This is equivalent to the truncation or the rounding off error in the
A/D converter and can be handled