Floating Point to Fixed Point Conversion
Audio DSP Ksharing: Rajesh Sharma
www.rajeshsharma.co.in
[email protected]Numbers Real Numbers
Rational, irrational, positive, negative, zero An infinite decimal representation such
2.487154145934798874502350989093. as
Any point on a infinitely long number line
Integer Numberswhole numbers, no fractional part Discrete points on number line e.g. -10, -4, 0, 7, 9
www.rajeshsharma.co.in 2
7/17/2010
Number Representation Computer
Finite Word Length is a major issue Processor Register Length: 8Bit, 16Bit, 24Bit, 32Bit, 64Bit.. A limitation on the possible representations in a register
Floating Point Numbers:
IEEE754 Standard Pseudo Float
Fixed Point Numbers: No Defined Standard
Fractional Numbers: Q Format Integer Numbers
www.rajeshsharma.co.in 3
7/17/2010
Format- Representing numbers in binary system. e.g. for signed number we have
Float: Mantissa Bits & Exponent Bits Fixed: Integer Bits & Fractional Bits Total Bits = Mantissa / Integer + Exponent /fractional + sign bit
Few Terms
Precision - Effective Word Length = N Accuracy Maximum error between actual number and its
representation.
e.g. representation is accurate till 8th decimal place e.g. representation has maximum of 0.001% error
Resolution- smallest non zero magnitude representable with given
format www.rajeshsharma.co.in 4
7/17/2010
Dynamic Range
While Processing/Storing variables: we need to find
Var Var
max
absolute maximum value a variable can have absolute minimum value a variable can have
min
Range of a Variable: : -Var to +Var Resolution of a Variable: : Var
max res
max
Resolution can be determined from..
Absolute Minimum value a variable can have: Var How precisely two consecutive values of a variable may differ what is the accuracy required (e.g. till 8th decimal place)
min
www.rajeshsharma.co.in 5
7/17/2010
Dynamic Range
The ratio of the largest and smallest possible values of the
variable quantity
Dynamic Range is more often expressed as Ratio 5000:1 or as dB 75 dB Or in Bits 12 Bits
For Fixed point, resolution
of the representation becomes the smallest value of the variable.
Varmax Varmax Dynamic Range ( Ratio) = = Varres Dynamic Range ( Bits ) = log 2 ( Varmax ) = N Bits Varres
Bits for signed representation = B = N + 1 Bits for unsigned representation = B = N
www.rajeshsharma.co.in 6
7/17/2010
Dynamic Range: Audio system
Audio system specification uses unsigned representation
The ratio of amplitude of loudest possible undistorted sine wave to rms
noise amplitude
e.g. of microphone, loudspeaker, ADC, DAC etc.. For display devices it is often called contrast ratio
Varmax Dynamic Range (dB) = 20 log10 Dynamic Range (dB) = 6.02 * N + 1.76
For Fixed point conversion normally Signed representation method is used to calculate dynamic range
www.rajeshsharma.co.in 7
7/17/2010
Computers- P, DSP, Controller
N
Finite Word Length
Store results in Registers having finite word length N Bit storage can have only 2 possibilities
How to represent a variable ?
Floating point format Fixed point format
Choice is governed by
Dynamic Range of Variable (Most Imp) Power consumption by specific type of ALU Cost of implementation of ALU.. Other possibilities. www.rajeshsharma.co.in 8
7/17/2010
Floating Point
The decimal point or binary point is floating Economical for very large/small number storage FPU silicon implementation is costly Usually represented as
Two ways of finding values.
r = (1) 2
s (eB)
0. f
or
r = (1) 2
s (eB)
1. f
S : sign Bit value;
B- Bias ; e exponent; f - Mantissa
IEEE754, Single Precision: s- 1; Bit; B-127; e-8Bits; f-23Bits
www.rajeshsharma.co.in 9
7/17/2010
Floating Point Variable
The Range of variable is very high Still there are only 2 distinct values for N Bits The accuracy of variable varies with values Spacing between numbers is not constant IEEE 754 supports four precisions
N
Half precision, 16 Bit; Single precision, 32 Bit Double precision, 64 Bit; Quadruple precision, 128 Bit
www.rajeshsharma.co.in 10
Single Precision example
7/17/2010
Floating Point Arithmetic
Floating point arithmetic is very easy to use Floating point arithmetic: problems
It may not be distributive (processor implementation)*
a * (b + c) != a*b + a*c
Same mathematical operations on different processors
may not produce exactly same result*
Operations on large numbers has more error. Mantissa limits the resolution Exponent limits the largest possible number
*PS: Different floating point implementations of one DSP algorithm may not be Bit Exact.
www.rajeshsharma.co.in 11
7/17/2010
Number line
Double Precision: 64 bit ;
Mantissa = 52 Bits; exponent 11 Bits
Variable spacing example: 4 digit representation
Number 1.7687*107
Next Number 1.7688*107
Difference 0.0001*107 =1000
1.7687*10-2 1.7688*10-2
0.0001*10-2
7/17/2010
=0.000001 www.rajeshsharma.co.in 12
Number line
2 24 values/decade
Microsoft Equation 3.0
This is for decimal Representation For binary representation it is 224 values/octave
www.rajeshsharma.co.in 13
7/17/2010
Float to Fix: Motivation
To achieve higher speed of operation at lower cost To reduce silicon area of hardware -> cost Power saving during code execution Fixed point implementations may be directly used in
many DSPs
Floating point implementation on different processor
may not be bit-exact
Depending on given input range
it may be beneficial to use fixed point.
www.rajeshsharma.co.in 14
7/17/2010
Fixed Point Numbers
The binary point is fixed for a given format
Integer numbers Fractional numbers, Q format
fractional format
Range is directly limited by the number of Bits The spacing between numbers is constant Q format :written as MQN format or M.N
K bits for Integer part N bits for fractional part M = K + Sign bit = K + 1
www.rajeshsharma.co.in 15
7/17/2010
Fractional Format
signed Q format 13.3
13Q3
1Q3 number
www.rajeshsharma.co.in 16
7/17/2010
Fractional Formats: Range (example)
www.rajeshsharma.co.in 17
7/17/2010
How to Convert
A fractional number can be converted to fixed point number for a
given signed Q format MQN
Set all the N LSB bits to 1 and M Bits to 0; call it maxN e.g. for 1Q31 format, max31 = 0X7FFFFFFF For 1Q15->0X7FFF = 32767; for 3Q13->0X1FFF= 8191 Now multiply the given fractional number with this number
Number -----------Format 1Q15 HEX/Dec 5Q11 0.375 -0.625 1 1.25 2.005 0.0003
2FFF 12287 0BFF
B001 -20479 EC01
7FFF 32767 1FFF
Over Flow 27FE
Over Flow 4026
0009 9 2 2
HEX/Dec 3071 -5119 8191 www.rajeshsharma.co.in 18
10238 16422
7/17/2010
Setting Q Format
The variable to be converted is explored for Range of the variable value Absolute Maximum / Minimum values of variable Resolution of variable: smallest non zero number Accuracy: Maximum error for a representation
Accuracy required highly depends on the arithmetic involved. It also depends on the reference for final comparison Calculate accuracy from tolerable %error in representation.
www.rajeshsharma.co.in 19
7/17/2010
Exploration Process
If the variable is a part of code
Run the code for various test cases Find the previously described values (range, resolution) Range & Accuracy can be also found from arithmetic involved Find the Resolution
If the variable value comes from RO table then
These values values can be found directly
www.rajeshsharma.co.in 20
7/17/2010
Setting Q Format: MQN
Now calculate the Dynamic range required
Varmax Dynamic Range ( Ratio) = Varres Varmax Dynamic Range ( Bits ) = log 2 ( ) = B bits Varres
from MaxInt of the Range
Determine the K Integer bits
MaxInt = ceil (Varmax ) K = ceil (log 2 ( MaxInt ))
DR bits = B = K + N Total Bits = B + 1(S) => 1(S) + K + N; => M + N S is for Sign Bit
www.rajeshsharma.co.in 21
7/17/2010
RO Table Example
Const float Table[5]={ (5.0979557966e-001), (6.0134489352e-001), We will set the Q format (8.9997625993e-001), For this RO table and compare two choices of (2.5629159405e+000) resolution (-4.4628159401e+000) } Range: 4.4628159401e+000 (signed representation) Resolution: 5.0979557966e-001 = Variable minimum value Resolution = Table[1]-Table[0] = 0.09154931386 MaxInt = 5
www.rajeshsharma.co.in 22
7/17/2010
Example cont
DR = (4.4628159401) / 0.09154931386 = 48.74 DR Bits B = 6 ; K = 3 Bits (as MaxInt = 5) M = 4 Bits, N = 3 Bits; Total Bits = 6(DR) + 1(S) => 1(S) + 3(K) + 3(N) The conversion format MQN is 4Q3 K = 4 Bits ; N = 3 Bits; max3 = 7; 5.0979557966e-001 * max3 = 3 ; 3 / max3 is equivalent to = 0.42857142857. Representation Error = 15.93% (very huge error)
This error is not acceptable
www.rajeshsharma.co.in 23
7/17/2010
The best would be to correctly represent the last digit of
0.50979557966 (accuracy till 11 i.e. Resolution = 0.00000000006
th
Example cont
digit) =
DR = (4.4628159401) / 0.00000000006 = 74380265668 DR Bits = 36 bits Total Bits = 36(DR) + 1(S) => 1(S) + 3(K) + 33(N) max33 = 0x1FFFFFFFF = 8589934591 5.0979557966e-001 * max33 = 4379110684 ; 4379110684 /max33 = 0.509795579652976. Representation Error = 1.37e-9% Q Format is set as 4Q33
www.rajeshsharma.co.in 24
7/17/2010
Final Table in 4Q33 Format
PolyPhase4ptDCT4toDCT3[]={
4379110684, Error = 1.37e-9% The 37 bits representation error 5165513301, Error = 1.87e-8% is very small and may not be needed for our application. 7730737206, Error = 3.25e-9% We may try with 32 Bits and check -38335297017, Error = 1.53e-9% } if it fits into application or not??
This is very good representation, But still many things matters e.g.
Precision of arithmetic involved Bits available for representation, e.g.
www.rajeshsharma.co.in 25 32 Bits
7/17/2010
Shifting Binary Point
Right Shift to a fractional number in MQN format
Binary point shift to right MQN >> n (M + n)Q( N - n)
Left Shift to a fractional number in MQN format
Binary point shift to left MQN << n (M - n)Q( N + n)
For example:
2Q30 << 1 is 1Q31
Binary point shifting only indicate the change in the maximum/minimum value representable with resulting format
www.rajeshsharma.co.in 26
7/17/2010
Exercise
If array num[] contains 2Q13 format data
for(i=0;i<200;i++){ sum = sum + num[i] }
What is the output format of sum ??
www.rajeshsharma.co.in 27
7/17/2010
Fixed Point Arithmetic
Overflow Vs Carry Flag
Carry Flag is used for unsigned numbers It is set when
Addition/subtraction of two unsigned operands doesnt fit into
the given Bits
Overflow flag is for signed arithmetic It is set when
When Add/sub/mult of two numbers does not give the expected
sign bit at the result
www.rajeshsharma.co.in 29
7/17/2010
Fractional Addition
Two fixed point numbers a & b can be added directly if
Binary point at same place for both numbers, a & b
Resulting format of fractional addition will have
Binary point at same place for both numbers, input & output
www.rajeshsharma.co.in 30
7/17/2010
Fractional Addition
Result format of output is same as that of input
i.e. Binary point at same place for both numbers, input & output
There is a possibility of growth of Integer Part
Example accumulation of an array with data in MQN format for( i=0; I < L; i++ )
sum = sum + ar[i];
m = ceil (log 2 ( L))
Sum will have output format of (M+m)QN
www.rajeshsharma.co.in 31
7/17/2010
Headroom/Guard Bits
Overflow during a addition/subtraction can be handled
with headroom
example addition
for( i=0; i<256; i++ ) sum = sum + ar[i]; There is possibility of 8 bit overflow
Variable sum should have 8 bit headroom in MSBs to
accommodate the overflow
www.rajeshsharma.co.in 32
7/17/2010
2s Complement Multiplication
www.rajeshsharma.co.in 33
7/17/2010
Fractional Multiplication
The product of two N bit numbers is 2N bits Example product of MQN and KQL formats Output format is: (M+K)Q(N+L) format
www.rajeshsharma.co.in 34
7/17/2010
Fractional Multiplication
1Q3 Numbers
2Q6 Numbers
www.rajeshsharma.co.in 35
7/17/2010
Fractional Multiplication Cont..
www.rajeshsharma.co.in 36
7/17/2010
Normalizing Fractions
A product of 2 twos complement numbers has
bits.
two sign
result can be left shifted by 1 bit. If one of the format is 1QN i.e. N+1 Bits, then
Result of left shift will have the format of other operand If other operand is 5Q11 then result of mult is 6Q(11+N) After 1 bit left shift, result is 5Q(11+N) i.e. 5Q11 + (N+1)LSBs
To make most out of available bits,
it is good to left shift the product
www.rajeshsharma.co.in 37
7/17/2010
Rounding
Why Rounding
To replace a Numerical value with approx. equal & shorter i.e. low precision representation
e.g. replacing 45.6782 with 45.68 Rounding is necessary evil... (..remember quantization..)
Many times it is unavoidable It introduces round-off errors Required in float to fixed conversion and.. floating & fixed point arithmetic and.. for function approximation on fixed point processors
www.rajeshsharma.co.in 38
7/17/2010
Rounding Methods
When rounding a number x to q an integer
Round to nearest : q is integer close to x Round towards zero (truncate): q is the integer part of x Round down (floor): q is largest integer that does not exceed x Round up (ceiling): q is smallest integer that is not less than x
Tie Break during rounding
When fractional part to be rounded is exactly half boundary e.g. rounding of 23.5 requires tie breaking
www.rajeshsharma.co.in 39
7/17/2010
Tie Breaking
Round half up: i.e. q = x + 0.5 (x is positive or negative )
It is asymmetric rounding i.e. biased (with round to nearest)
Round half away from zero:
q = x + 0.5 ; x is positive ; q = x 0.5 ; x is negative This method is free of bias
Round half to even
q is the even integer nearest to x This method is more unbiased +33.5 +34 ; +32.5 +32 ; -32.5 -32 ; -33.5 -34 Known as: unbiased rounding , convergent rounding,
statistician's rounding, Dutch rounding, Gaussian rounding, or bankers' rounding www.rajeshsharma.co.in 7/17/2010 40
Dithering
Stochastic Rounding:
Choose q randomly between x+0.5 and x-0.5 It is also bias free because of random component
Dithering: when to use this method
when the signal being rounded / quantized is slowly varying All rounding methods produce monotonous round-off errors Moreover monotonous round-off error because of particular
rounding methods being used
All of them introduces non-linear response in a system Harmonics in the filter response because of rounding method
www.rajeshsharma.co.in 41
7/17/2010
Block Floating Point Processing
When Dynamic Range of a signal is very high When the input is fluctuating between very high and very
small values.
When very small values are as important as larger
values
A block of data values share a common radix point for processing
Instruction like CLZ & NORM are helpful.
www.rajeshsharma.co.in 42
7/17/2010
Assuming you are working on a 32 bit DSP
Convert following table into fixed point.
const float gain_table[8] ={ (8.2387952833e-002), (9.2387952833e-001), (7.0710676573e-001), (3.8268340208e-001) (1.9615705587e+000), (1.6629392064e+000), (1.1111404206e+000), (3.9018056901e+001) };
www.rajeshsharma.co.in 43
7/17/2010
Exercise1
Exercise2
Assume that inp[ ] array contains 16 Bit PCM data
Find the o/p format of out[ ] ? Convert following loop into fixed point. for (i=0;i<8;i++){ out[i] = inp[i] * gain_table[i]; }
www.rajeshsharma.co.in 44
7/17/2010
Exercise3
Now given below is the DCTIII code, with gain applied at end Convert this into a fixed point code
www.rajeshsharma.co.in 45
7/17/2010
Exercise4
Following IIR filter is used for filtering 16 bit PCM ip Convert it into fixed point and ensure that the filter
response is linear.
x[n] + Z-1 x[n-1] Z-1 x[n-2] y[n-2] b2 a2 b1 a1 Z-1 b0 Z-1 y[n-1] y[n]
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
www.rajeshsharma.co.in 46
7/17/2010
Exercise4: C code
www.rajeshsharma.co.in 47
7/17/2010
Exercise4: C code Filter Loop
www.rajeshsharma.co.in 48
7/17/2010
References
Fixed Point Arithmetic: an Introduction
By Randy Yates
Application Note 33
Fixed point arithmetic on the ARM
Document number ARM DAI 0033A
Wiki
www.rajeshsharma.co.in 49
7/17/2010
THANK YOU