FPGA IMPLEMENTATION OF FFT ALGORITHMS USING FLOATING
POINT NUMBERS
Hilal Kaptan [1], Ali Tangel [1], Suhap Sahin [2]
[1] Kocaeli University, College of Engineering, Department of Electronics and Communication Engineering, Veziroglu Yerleskesi, 41040, Izmit, Turkey
[2] Kocaeli University, College of Engineering, Department of Computer Engineering, Veziroglu Yerleskesi, 41040, Izmit, Turkey
e-mail: [email protected], [email protected], [email protected]
Keywords: FPGA, Fast Fourier Transform, Floating Point Arithmetic, DSP
ABSTRACT
In this paper, it is shown that FFT algorithms using floating point numbers can be implemented on an FPGA. A custom VHDL library called Fp_Lib was formed for general-purpose computer arithmetic FPGA implementations. This library includes addition, subtraction, multiplication and division modules based on the 32-bit single precision IEEE 754 format. The arithmetic algorithms were implemented on a Xilinx Spartan-2 evaluation board, while the FFT algorithms were realized on a Xilinx Virtex-2 FPGA (XUP-V2Pro board). Execution times of 0.6 µs and 0.72 µs were obtained for the FFT and IFFT algorithms, respectively. The study is also compared with a similar one in the literature from structural and performance points of view.

1. INTRODUCTION
The Fast Fourier Transform (FFT) plays an important role in signal and image processing, data analysis for vibration sensors, frequency measurement of earthquakes, and telecommunication systems such as WiMAX technology, which offers both wide bandwidth and wireless solutions [1-2].

In real-time applications, it is necessary to acquire and process the input data as fast as possible, so that the result is available almost immediately. Although ASIC solutions always offer the fastest and lowest-power options for real-time applications, they are unique designs for a specific application. Redesigning an ASIC for a new application therefore costs much more money and time than working with field-programmable chips.

FPGA solutions also provide flexible design, low cost, and faster time-to-market, besides allowing parallel implementations [4]. Owing to this parallel processing property, they are much faster than traditional microprocessor-based solutions [5].

Floating point numbers can represent real numbers with good approximation and a wide dynamic range, so floating point algorithms are frequently used in modern applications that require millions of calculations per second, such as image processing and speech recognition [3].

In this study, firstly, the algorithms realized for the arithmetic operations required by the FFT implementation are presented. Next, these design blocks are used to realize the mathematical expression of the FFT. Finally, the study is compared with similar ones in the literature from structural and performance points of view.

2. FLOATING POINT ARITHMETIC ALGORITHMS
Many technological systems prefer floating point arithmetic because of its capacity for a dynamic and precise representation of numbers [7]. Using an FPGA for floating point implementations, rather than microprocessor-based structures, is the better choice owing to parallel processing capability, re-programmability, and higher speed. Despite these advantages, floating point operators consume a large amount of resources and time even for an ordinary (low-resolution) implementation [6].

2.1. Floating point addition and subtraction
Figure 1 shows the flow chart of the floating point addition and subtraction algorithm implemented. These algorithms are similar to the ones realized in many processors. Let F1 and F2 represent two floating point numbers, let Ftop represent their sum, and let Fminus = F1 - F2. Since Fminus can be re-written as Fminus = F1 + (-F2), the subtraction is converted to addition by inverting the sign bit of F2. For this reason, only the addition algorithm is elaborated here. Addition and subtraction
algorithms are realized in three steps. Fi represents a floating point number; si is its sign, ei is its exponent and fi is its fraction part. Let us define the inputs as F1 = (s1, e1, f1) and F2 = (s2, e2, f2). The result is represented as Fans = (sans, eans, fans) = F1 + F2 or F1 + (-F2).

The algorithm steps are as follows:
1. Step:
If the absolute value of F1 is smaller than that of F2, F1 and F2 are interchanged. The right-shift amount for f2 is calculated by subtracting e2 from e1, and the implicit "1" is placed in front of the fraction bits, yielding the significands (1.f1) and (1.f2).
2. Step:
(1.f2) is shifted to the right by (e1 - e2). If the sign bits are equal, (1.f1) and the shifted (1.f2) are added; if not, the shifted (1.f2) is subtracted from (1.f1). The sign of the resulting number, sans, is the sign of the number with the larger magnitude.
3. Step:
fans is shifted to the left until its first bit becomes "1", and the amount of the shift is calculated. eans is obtained by subtracting the shift amount from e1.

Figure 1. The 32-bit floating point addition and subtraction algorithm implemented on an FPGA

2.2. Floating point multiplication
Floating point multiplication, shown in Figure 2, is similar to integer multiplication; it is therefore simpler than the FP addition and subtraction algorithm. It is realized in three steps as well.

For simplicity, the algorithm never tests for illegal numbers or negative zero. The inputs are the same as before, F1 = (s1, e1, f1) and F2 = (s2, e2, f2). The result will be Fans = (sans, eans, fans) = F1 * F2. The algorithm steps are as follows:
1. Step:
The exponent parts e1 and e2 are added, and the resulting number is appointed as eans. "1" is added in front of f1 and f2, yielding (1.f1) and (1.f2).
2. Step:
(1.f1) and (1.f2) are multiplied, and the 23 MSBs of the resulting product are appointed as the fraction of the final result, fans. The sign bit of the final number, sans, is obtained by EXOR'ing the sign bits of the two numbers.
3. Step:
fans is shifted to the left until its first bit becomes "1", the amount of the shift is calculated, and eans is adjusted by that amount.

Figure 2. The algorithm flow chart of the 32-bit floating point multiplication implemented on the FPGA

2.3. Floating point division
Assume F1 and F2 are two floating point numbers and Fans is their quotient. For simplicity, negative zero and illegal numbers are neglected here as well. The inputs are represented as F1 = (s1, e1, f1) and F2 = (s2, e2, f2), where si is the sign bit, ei the exponent bits and fi the fraction part of the FP number Fi. The result becomes Fans = (sans, eans, fans) = F1 / F2. The algorithm shown in Figure 3 can be explained as follows:
1. Step:
The exponent parts are subtracted from each other, and "1" is added in front of the fraction parts, yielding (1.f1) and (1.f2).
2. Step:
(1.f1) is divided by (1.f2), and the 23 MSBs of the result are appointed as fans.
3. Step:
fans is shifted to the left until its first bit becomes "1", and the amount of the shift is calculated. The sign bits of F1 and F2 are EXORed and appointed as sans.

Figure 3. The algorithm flow chart of the division implemented on the FPGA

3. FPGA IMPLEMENTATIONS OF FLOATING POINT ALGORITHMS
In this section, VHDL codes are not given; instead, the results of the hardware applications of the algorithms above are presented. A Digilent evaluation board is used for the demonstration. This board includes a Xilinx Spartan-2 with 2352 slices and 14 RAM blocks, running from a 50 MHz clock [8].

A custom VHDL library called fp_lib was formed for the hardware applications. The fp_lib contains the designs listed in Table-1. The addition/subtraction and multiplication synthesis results are summarized in Table-2.

Table-1. Summary of the custom arithmetic VHDL designs

Design    HDL Description
Fp_lib    IEEE 32-bit single precision floating point library
Fp_mul    IEEE 32-bit single precision floating point pipelined parallel multiplier
Fp_add    IEEE 32-bit single precision floating point pipelined parallel adder/subtractor

Table-2. 32-bit floating point addition/subtraction and multiplication results on the selected FPGA (Spartan-2)

                             Add./Sub.               Multiplication
Number of Slices             387 out of 2352 (13%)   326 out of 2352 (13%)
Number of Slice Flip Flops   106 out of 4704 (2%)    65 out of 4704 (1%)
Number of 4-Input LUTs       903 out of 4704 (15%)   642 out of 4704 (13%)
Number of Bonded IOBs        103 out of 146 (70%)    103 out of 146 (70%)

4. FAST FOURIER AND INVERSE FOURIER TRANSFORM METHODS

4.1. Fast Fourier Transform (FFT)
The discrete-time Fourier transform provides a frequency domain representation of a signal, and the FFT is an important algorithm for calculating the Discrete Fourier Transform (DFT). The DFT of a signal is directly calculated from:

X[k] = Σ_{n=0}^{N-1} x[n] · e^{-j(2π/N)kn},   k = 0, 1, ..., N-1

Here, the phase factor is defined by W_N = e^{-j(2π/N)}. The FFT provides a fast calculation strategy by using the symmetry and periodicity properties of the phase factor. As the calculation method, decimation in time is used; this means a remarkable saving over direct computation of the DFT.

4.2. Decimation in time
The N-point radix-2 DFT is defined as:

X[k] = Σ_{n=0}^{N-1} x[n] · W_N^{kn},   k = 0, 1, ..., N-1

The decimation-in-time algorithm rearranges the DFT equation into two parts: a sum over the even-numbered discrete-time indices n = [0, 2, 4, ..., N-2] and a sum over the odd-numbered indices n = [1, 3, 5, ..., N-1]. These subsequences are used to calculate radix-2 DFTs.

These transforms are combined according to Equations (4.1) and (4.2), which lead to the calculation of the upper-level DFT. When the sample indexes are handled in binary format, the index bits are first reversed before grouping the samples in pairs, as shown below. Next, these pairs are inserted into the DFT process. Let us assume the input x[n] = [1 3 0 2]; then:

In-order index         Bit-reversed index
Decimal   Binary       Binary   Decimal
0         00           00       0
1         01           10       2
2         10           01       1
3         11           11       3

xt[n] = [1 0] and xc[n] = [3 2]. By applying the DFT definition to these subsequences, their transforms are obtained as Xt[k] = [1 1] and Xc[k] = [5 1].
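This 4-point decimation can be checked with a short numerical model. The sketch below (a Python model written for this discussion; the paper's implementation is in VHDL) performs the bit reversal, the 2-point DFTs and the combining butterflies:

```python
def dit_fft4(x):
    """4-point decimation-in-time DFT, following the steps in the text."""
    # Bit reversal for N = 4 maps indices [0, 1, 2, 3] -> [0, 2, 1, 3]
    xt = [x[0], x[2]]                      # even-indexed samples, xt[n]
    xc = [x[1], x[3]]                      # odd-indexed samples, xc[n]
    # The 2-point DFTs are just a sum and a difference
    Xt = [xt[0] + xt[1], xt[0] - xt[1]]
    Xc = [xc[0] + xc[1], xc[0] - xc[1]]
    # Butterfly stage: X[k] = Xt[k] + W4^k Xc[k], X[k+2] = Xt[k] - W4^k Xc[k]
    W4 = [1, -1j]                          # phase factors e^(-j(2*pi/4)k), k = 0, 1
    X = [0] * 4
    for k in range(2):
        X[k] = Xt[k] + W4[k] * Xc[k]
        X[k + 2] = Xt[k] - W4[k] * Xc[k]
    return X

# x[n] = [1, 3, 0, 2] gives X[k] = [6, 1-j, -4, 1+j], matching the hand calculation
print(dit_fft4([1, 3, 0, 2]))
```

The intermediate values Xt = [1, 1] and Xc = [5, 1] are exactly the 2-point transforms derived above.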
In the next step, the N = 4 DFT of the 4-sample x[n] is obtained by exploiting the symmetry and periodicity of the phase factor. As a result, the following equations are obtained [9]:

X[k] = Xt[k] + W_N^k · Xc[k]          (4.1)
X[k+N/2] = Xt[k] - W_N^k · Xc[k]      (4.2)

In this case, X[k] = [6  1-j  -4  1+j] is calculated. As a result, this simple reorganization and reuse has reduced the total computation by almost a factor of two over direct DFT computation [9].

4.3. Inverse FFT
The inverse Fourier transform maps the signal back from the frequency domain into the time domain. Recall that the equations for an 8-point DFT and inverse FFT are as follows:

Xt[k] = (X[k] + X[k+4]) / 2,                  k = 0, 1, 2, 3
Xc[k] = (X[k] - X[k+4]) / (2·W_N^k),          k = 0, 1, 2, 3
Xtt[k] = (Xt[k] + Xt[k+2]) / 2,               k = 0, 1
Xtc[k] = (Xt[k] - Xt[k+2]) / (2·W_{N/2}^k),   k = 0, 1
Xct[k] = (Xc[k] + Xc[k+2]) / 2,               k = 0, 1
Xcc[k] = (Xc[k] - Xc[k+2]) / (2·W_{N/2}^k),   k = 0, 1

4.4. FFT Implementation Process
The 32-bit inputs, which carry both real and imaginary parts, together with the phase factor values calculated using MATLAB, are loaded into the registers in the fft.vhd module as shown in Figure 4.

Figure 4. Architectural model of the FFT implemented on the FPGA

The inputs placed into the registers are first separated into even and odd samples to be used in the decimation process. Next, these data are sent to the addition module, where their sum and difference values are calculated to obtain the 2-point DFTs. Thirdly, the processed data are serially sent to the multiplication and addition modules, scheduled by a counter. Referring to the equation X[k] = Xt[k] + W_N^k · Xc[k], the Xc[k] data and the phase factor are multiplied, and the result is added to the Xt[k] data. The final result is latched into the output registers of the fft.vhd module. A package module is used to introduce the sub-modules to the main program. Table-3 summarizes the hardware synthesis results on the Xilinx Virtex II-Pro evaluation board; a photograph of the experimental setup is shown in Figure 5.

Table-3. FFT synthesis results on the FPGA (selected device: Virtex II-Pro)

Number of Slices             1989 out of 13696   (14%)
Number of Slice Flip Flops   896 out of 27392    (3%)
Number of 4-Input LUTs       3627 out of 27392   (13%)
Number of Bonded IOBs        33 out of 556       (5%)
Number of MULT18x18s         16 out of 136       (11%)
Number of GCLKs              1 out of 16         (6%)

CONCLUSION
This paper shows that floating point algorithms can be implemented on an FPGA. The processing steps of the required algorithms are explained in detail. As an application of these algorithms, the FFT and inverse FFT methods are examined. For the representation of the digital numbers, the IEEE 754 single precision floating point format was used, and the algorithms were realized based on this format.
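As an illustration of this IEEE 754 single precision format and of the three-step addition procedure of Section 2.1, the following Python sketch (an illustrative software model written for this text, not the paper's VHDL; it handles normal numbers only and truncates during alignment instead of rounding) unpacks the (s, e, f) fields and emulates the add path:

```python
import struct

def fields(x):
    """Unpack a float into IEEE 754 single precision (s, e, f) fields."""
    (w,) = struct.unpack(">I", struct.pack(">f", x))
    return w >> 31, (w >> 23) & 0xFF, w & 0x7FFFFF

def fp_add(a, b):
    """Software model of the three-step addition of Section 2.1."""
    s1, e1, f1 = fields(a)
    s2, e2, f2 = fields(b)
    # Step 1: interchange so that |F1| >= |F2|, and restore the hidden bits
    if (e1, f1) < (e2, f2):
        (s1, e1, f1), (s2, e2, f2) = (s2, e2, f2), (s1, e1, f1)
    m1 = (1 << 23) | f1                  # significand (1.f1)
    m2 = ((1 << 23) | f2) >> (e1 - e2)   # (1.f2), aligned by e1 - e2 (truncating)
    # Step 2: add on equal signs, subtract otherwise; the sign of the larger wins
    m = m1 + m2 if s1 == s2 else m1 - m2
    s, e = s1, e1
    if m == 0:                           # complete cancellation gives +0
        s, e = 0, 0
    # Step 3: normalize until the leading bit is 1, adjusting the exponent
    while m >= (1 << 24):
        m >>= 1
        e += 1
    while m and m < (1 << 23):
        m <<= 1
        e -= 1
    w = (s << 31) | (e << 23) | (m & 0x7FFFFF)
    (r,) = struct.unpack(">f", struct.pack(">I", w))
    return r

print(fp_add(2.5, 1.25))   # 3.75
print(fp_add(1.5, -0.5))   # 1.0
```

Subtraction is obtained by flipping the sign bit of the second operand, exactly as described in Section 2.1.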
This situation inevitably brings about a large amount of FPGA resource usage. Hence, some parts of the algorithms had to be embedded into the FPGA in a serial (sequential) manner, which means sacrificing the fully parallel advantage of FPGAs. Despite this mandatory serial nature, better performance was still observed when compared with traditional processor-based solutions, and real-time operation was nearly achieved: execution times of 0.6 µs and 0.72 µs were obtained for the FFT and IFFT algorithms, respectively, whereas in traditional processor solutions this figure goes up to millisecond levels.

Figure 5. The Virtex-II Pro experimental setup

A similar study has been found in the literature, by Haibing et al. [10]; Table-4 compares the two designs.

Table-4. Comparison table

                  Haibing et al. (2006)            Proposed solution
Architecture      Hybrid design (DSP+FPGA),        Full FPGA, subtractive
                  LUT-based division algorithm     division algorithm
Selected device   Altera Cyclone                   Xilinx Spartan-2 and Virtex-II Pro
Source usage      Adder = 883                      Adder = 903
(logic elements,  Multiplication = 995             Multiplication = 642
LUTs)             Division = 1505                  FFT = 3627
Performance       40 ns for adder, multiplier      Adder = 50 ns
                  and division (faster due to      Multiply = 50 ns
                  LUT usage for the division       Division = 150 ns
                  algorithm)                       FFT = 600 ns
                                                   IFFT = 720 ns

Moreover, in this study the earlier VHDL library used for implementing the FFT algorithm [11], [12] has been modified: the addition, subtraction, multiplication and division algorithms of the former library were re-designed. The improvement in resource usage can easily be seen when compared with the figures declared in [11], [12]. In fact, faster floating point algorithms can be found in the literature; however, they consume much more hardware resources. Therefore, optimization of resource usage was the primary aim of this study.

REFERENCES
[1] E. O. Brigham, The Fast Fourier Transform and Its Applications, Prentice Hall, 1988.
[2] J. G. Proakis, Digital Signal Processing: Principles, Algorithms, and Applications, Prentice-Hall International, 1996.
[3] W. B. Ligon, S. McMillan, G. Monn, F. Stivers, and K. D. Underwood, "A Re-evaluation of the Practicality of Floating Point Operations on FPGAs", Proceedings, IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, pp. 206-215, Apr. 1998.
[4] J. Zhu, B. K. Gunther, "Towards an FPGA Based Reconfigurable Computing Environment for Neural Network Implementations", Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN'99), 1999.
[5] M. Poliac, J. Zanetti, D. Salerno, "Performance Measurements of Seismocardiogram Interpretation Using Neural Networks", Computers in Cardiology, IEEE Computer Society, pp. 573-576, 1993.
[6] S. Şahin, A. Kavak, "Implementation of Floating Point Arithmetic Using an FPGA", in Mathematical Methods in Engineering, K. Tas, J. A. T. Machado and D. Baleanu (Eds.), Springer, 2007.
[7] B. Fagin and C. Renard, "Field Programmable Gate Arrays and Floating Point Arithmetic", IEEE Transactions on VLSI, Vol. 2, No. 3, pp. 365-367, September 1994.
[8] Xilinx Inc., The Programmable Logic Data Book, San Jose, California, 1993.
[9] S. Erturk, İşaret İşleme (Signal Processing), Birsen Yayınevi, 2005.
[10] H. Hu, T. Jin, X. Zhang, Z. Lu, Z. Qian, "A Floating-point Coprocessor Configured by a FPGA in a Digital Platform Based on Fixed-point DSP for Power Electronics", IEEE IPEMC'2006.
[11] I. Az, S. Sahin, C. Karakuzu, M. A. Cavuslu, "Implementation of FFT and IFFT Algorithms in FPGA", ISEECE-2006, pp. 7-10.
[12] M. A. Cavuslu, S. Dikmese, S. Sahin, K. Kucuk, A. Kavak, "FPGA-Based Implementation of Smart Antenna Algorithms in the IEEE 754 Floating Point Format and Performance Analysis" (in Turkish), III. URSI-Türkiye'2006, pp. 610-612.