Implementation of the CA-CFAR Algorithm for Pulsed-Doppler Radar on a GPU Architecture
Abstract — The Cell-Averaging Constant False-Alarm Rate (CA-CFAR) algorithm was implemented and optimized in software on the NVIDIA Tesla C1060 GPU architecture for application in pulsed-Doppler radar signal processors. A systematic approach was followed to gradually explore opportunities for parallel execution and optimization by implementing the algorithm first in MATLAB (CPU), followed by native C (CPU) and finally NVIDIA CUDA (GPU) environments. Three techniques for implementing the CA-CFAR in software were identified and implemented, namely a naïve technique, a sliding window technique and a new variant which employs the Summed-Area Table (SAT) algorithm. The naïve technique performed best on the GPU architecture. The SAT technique shows potential, especially for cases where very large CFAR windows are required. However, the results do not justify using the GPU architecture instead of the CPU architecture for this application when data transfer to and from the GPU is taken into consideration.

Keywords — CFAR; GPU; Doppler; radar; signal processing

The authors would like to express their gratitude to the National Program for Electronics, Communications and Photonics at the King Abdulaziz City for Science and Technology for supporting this work.

I. INTRODUCTION

A Constant False-Alarm Rate (CFAR) detector in a radar system detects targets in varying interference by adjusting detection thresholds dynamically. The computational workload of calculating the interference estimate for every radar data sample in order to adjust the detection threshold can be very high. High throughput and low latency are required for radar applications, where the latter is especially important in tracking systems which need to maintain a tight tracking loop.

CFAR processing in modern radar signal processors is typically performed with Field-Programmable Gate Arrays (FPGAs) or Digital Signal Processors (DSPs) due to the processing and I/O performance that they provide. Current generation many-core GPUs can deliver massive computational performance on a single chip. The memory bandwidth between a GPU chip and its own high-speed memory is also very high as a result of very wide data bus widths and high memory clock rates. Although GPUs were originally designed to perform graphics processing, recent generations make provision for General Purpose Computing on the GPU (GPGPU). GPUs could therefore potentially be used to perform CFAR processing in pulsed-Doppler radar systems.

An algorithmic optimization of the Cell-Averaging CFAR (CA-CFAR) algorithm was performed in an attempt to find an efficient algorithmic structure for software implementation on the GPU architecture. Three techniques were identified, implemented, optimized and evaluated using a systematic approach to move from high-level computing languages to the target GPU architecture. The GPU results were compared to results of our own implementations in native C, as well as another CA-CFAR software reference implementation.

The results show that, in terms of kernel throughput and latency, the GPU implementations generally perform better than comparative CPU implementations for larger input sizes. However, in terms of total throughput and latency, which includes data transfer to and from the GPU, the performance results do not justify using the GPU architecture for the range of input sizes that were evaluated.

II. CA-CFAR FOR PULSED-DOPPLER RADAR

A 2D data matrix is generally used to represent the sampled data for pulsed-Doppler radar systems with a single receiver front-end element. The 2D data matrix is called a burst and it has range and Doppler dimensions. The range dimension contains "fast-time" samples of the echo of a single received pulse. The Doppler dimension contains "slow-time" samples which represent the echo received from successive transmitted pulses in the burst.

A Constant False-Alarm Rate (CFAR) detector adjusts the detector threshold in real-time in order to maintain a constant, design-specific false-alarm rate with corresponding probability $P_{fa}$. The threshold is determined by estimating the interference around the target. A Cell-Averaging CFAR (CA-CFAR) detector estimates the interference power for a Cell Under Test (CUT) by averaging the sample values of adjacent cells, under the assumption that interference statistics are homogeneous in a localized area.

A. CA-CFAR Algorithm

The maximum likelihood estimate for the interference power, $\widehat{\beta^2}$, is obtained from the average of $N$ samples in the vicinity of the CUT [1] as shown in equation 1.

$\widehat{\beta^2} = \frac{1}{N} \sum_{i=1}^{N} x_i$   (1)
When target energy extends into the cells immediately adjacent to the CUT, those cells do not represent interference alone and will cause the detector threshold to be raised artificially if included in the interference estimate. Guard cells immediately adjacent to the CUT are therefore typically excluded from the CFAR window.

The detection threshold $T_b$ is obtained by scaling the interference estimate with the CFAR constant $\alpha$, which is determined by the design false-alarm probability $P_{fa}$ and the number of reference cells $N$:

$T_b = \alpha \widehat{\beta^2}$   (2)

$\alpha = N \left( P_{fa}^{-1/N} - 1 \right)$   (3)

[Figure 1. Calculation of $\widehat{\beta^2}_{CUT}$ from the CFAR window for the naïve technique (range-Doppler map with the CFAR window cells surrounding the CUT).]
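To make equations (1) to (3) concrete, the following sketch in C (our own illustration, not code from the paper; cfar_alpha and cfar_threshold are hypothetical names) computes the detection threshold for a single CUT from its $N$ reference cells:

    #include <math.h>

    /* CFAR constant alpha for N reference cells and a design
       false-alarm probability Pfa, per equation (3). */
    static double cfar_alpha(int n, double pfa)
    {
        return n * (pow(pfa, -1.0 / n) - 1.0);
    }

    /* CA-CFAR threshold for one CUT: estimate the interference power
       as the average of the N reference-cell values (equation 1) and
       scale it by alpha (equation 2). */
    static double cfar_threshold(const float *ref_cells, int n, double pfa)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += ref_cells[i];
        double beta2_hat = sum / n;   /* interference power estimate */
        return cfar_alpha(n, pfa) * beta2_hat;
    }

For example, $N = 40$ reference cells and a design $P_{fa}$ of $10^{-6}$ give $\alpha \approx 16.5$, placing the threshold roughly 12 dB above the estimated interference power.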
B. Existing CA-CFAR Software Implementations

The High Performance Embedded Computing (HPEC) Challenge Benchmark Suite [2] includes a CFAR benchmark for CPUs. The assessment in [3] compared their CA-CFAR GPU implementation to the HPEC CFAR benchmark CPU implementation. The implementation in [3] favours parallelism by minimizing communication and synchronization between elements. The problem effectively turns into an embarrassingly parallel problem using this approach.

A speedup factor of between 2.6 and 166 was achieved with the GPU implementation in [3] over the HPEC CPU reference implementation [4]. The strict kernel latency metric, as described in the HPEC Challenge benchmark guidelines [2], is used. The conclusion in [3] is that the overall run time is limited by global memory bandwidth.

The HPEC Challenge benchmark code for CPUs is publicly available, which allowed us to compile and run the code on the same platforms that were used to obtain our own results. No code is available for the other implementations that were found and we could therefore not do comparisons with these implementations on the target hardware platforms.

III. HIGH-LEVEL ALGORITHMIC OPTIMIZATION

Performance optimization is typically performed on different levels, ranging from high-level algorithmic optimization, to medium-level implementation optimization, down to low-level instruction and other architecture-specific optimizations. The structure of an algorithm that is implemented in software can have a major impact on its performance, based on how well-suited the chosen structure is for the target hardware architecture.

A process of optimizing the algorithmic structure of the CA-CFAR algorithm for software implementation on the GPU architecture was followed. The aim was to optimize the computational performance on the GPU architecture in order to increase throughput and reduce latency. Three techniques with different algorithmic structures were identified, namely naïve, sliding window and Summed-Area Table (SAT).

A. Naïve Technique

One of the simplest and most intuitive ways of implementing the CA-CFAR is to simply sum the values for all cells in the CFAR window for each CUT, given by

$\widehat{\beta^2}(i,j) = \frac{1}{N_{cfarWin}} \sum_{k=1}^{N_{cfarWin}} CW(i,j,k),$   (4)

where $CW(i,j,k)$ is the $k$th CFAR window cell of cell $(i,j)$ as shown in figure 1. We refer to this technique as the naïve technique due to the potential drawbacks associated with using such a simple, brute-force approach. Redundant summation will be performed, since there is a large overlap in the CFAR window cells for adjacent cells under test in the range-Doppler map. The computational complexity increases as the dimensions of either the input data or the CFAR window increase. The naïve technique provides an embarrassingly parallel structure, which means that each cell can be evaluated independently without requiring synchronization with any other cells.

The time complexity for the naïve technique with a range-Doppler map of $N_r$ range bins by $N_d$ Doppler bins and with leading and lagging CFAR windows of $N_{rwin}$ by $N_{dwin}$ cells each, is given by

$O(N_{rd} \times N_{cfarWin}),$   (5)

where $N_{rd} = N_r \times N_d$ and $N_{cfarWin} = 2 \times N_{rwin} \times N_{dwin}$.
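To illustrate the embarrassingly parallel structure, a naïve CA-CFAR kernel can assign one CUDA thread per CUT. The sketch below is our own simplified illustration, not the paper's kernel: naive_cacfar_kernel is a hypothetical name, window edges are clipped rather than wrapped, a square guard region around the CUT is assumed, and the map is stored Doppler-major.

    // One thread per cell under test; each thread sums its own CFAR
    // window from global memory, so no inter-thread synchronization
    // is needed.
    __global__ void naive_cacfar_kernel(const float *rd_map, float *threshold,
                                        int n_range, int n_doppler,
                                        int win_r, int win_d, int guard,
                                        float alpha)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;  // range index
        int d = blockIdx.y * blockDim.y + threadIdx.y;  // Doppler index
        if (r >= n_range || d >= n_doppler) return;

        float sum = 0.0f;
        int count = 0;
        for (int dd = -win_d; dd <= win_d; dd++) {
            for (int dr = -win_r; dr <= win_r; dr++) {
                if (abs(dr) <= guard && abs(dd) <= guard)
                    continue;                      // skip CUT and guard cells
                int rr = r + dr;
                int dx = d + dd;
                if (rr < 0 || rr >= n_range || dx < 0 || dx >= n_doppler)
                    continue;                      // clip at the map edges
                sum += rd_map[dx * n_range + rr];  // Doppler-major layout
                count++;
            }
        }
        threshold[d * n_range + r] = alpha * (sum / count);
    }

Each thread re-reads its whole window from global memory, which is what makes the basic naïve implementation sensitive to the CFAR window size, as discussed in the results.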
B. Sliding Window Technique

The CA-CFAR can also be implemented in software using a sliding window technique, which capitalizes on the redundancy in the calculation of the local noise estimate for adjacent cells [2], as given by

$\widehat{\beta^2}(i+1,j) = \widehat{\beta^2}(i,j) + \Delta,$   (6)

where $\Delta$ is calculated as shown in figure 2. It is more computationally efficient to slide the CFAR window by one cell at a time and only account for the difference in the local interference estimate, instead of naïvely performing redundant summation. Computational complexity is reduced at the cost of reducing the parallelism of the algorithm, as a result of the inherent data dependencies that are created.
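The delta update of equation (6) can be illustrated in one dimension. This is our own sketch under simplified assumptions (a single range line, no guard cells, interior CUTs only):

    /* Running-sum CA-CFAR estimate along one range line. half_win
       reference cells are taken on each side of the CUT. */
    void sliding_cacfar_line(const float *line, float *est, int n, int half_win)
    {
        int n_win = 2 * half_win;
        float sum = 0.0f;
        /* Build the first full window around CUT index half_win. */
        for (int k = 0; k < half_win; k++)
            sum += line[k] + line[half_win + 1 + k];
        est[half_win] = sum / n_win;
        /* Slide one cell at a time: apply the delta instead of
           re-summing the whole window (equation 6). */
        for (int i = half_win; i + 1 < n - half_win; i++) {
            sum += line[i] + line[i + 1 + half_win]   /* cells entering */
                 - line[i - half_win] - line[i + 1];  /* cells leaving  */
            est[i + 1] = sum / n_win;
        }
    }

The serial dependence of each sum on the previous one is exactly the data dependency that reduces parallelism relative to the naïve technique.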
[Figure 2. Calculation of $\Delta = [B + D - A - C] / N_{cfarWin}$ for the sliding window technique on the range-Doppler map (range bins vs. Doppler bins), showing the old and new CFAR window positions around the CUT and the entering and leaving cell groups C1, D1, C2 and D2.]

The time complexity for the sliding window technique with a range-Doppler map of $N_r$ range bins by $N_d$ Doppler bins and with leading and lagging CFAR windows of $N_{rwin}$ by $N_{dwin}$ cells each, is given by

$O(N_{rd} \times \sqrt{N_{cfarWin}}),$   (7)

where $N_{rd} = N_r \times N_d$ and $N_{cfarWin} = 2 \times N_{rwin} \times N_{dwin}$.
C. Summed-Area Table Technique

A Summed-Area Table (SAT) [5] or Integral Image [6], once generated, can be used to obtain the sum of values inside any sub-grid of the original input data matrix by doing lookup and summation of 4 data elements. This technique can be applied to CA-CFAR by generating a SAT from the input range-Doppler map sample values. Once generated, the SAT can be used to compute the local interference estimate efficiently, by optimizing the summation calculation as shown in figure 3. The summation is performed in constant time, regardless of the size of the CFAR window. No evidence was found of this technique being used for CFAR in radar applications.

[Figure 3. Calculation of $\widehat{\beta^2}_{CUT}$ using a Summed-Area Table generated from the range-Doppler map.]

The time complexity for the SAT technique with a range-Doppler map of $N_r$ range bins by $N_d$ Doppler bins and an arbitrary CFAR window size, is given by

$O(N_{rd}),$   (8)

where $N_{rd} = N_r \times N_d$.
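As an illustration of the constant-time lookup (our own sketch, not the paper's implementation; sat_rect_sum is a hypothetical name), the sum over any rectangle of the range-Doppler map follows from four SAT entries, independent of the rectangle size:

    /* Inclusive rectangle sum over [r0..r1] x [d0..d1] from a summed-area
       table where sat[d * n_range + r] holds the sum of all cells with
       range index <= r and Doppler index <= d. Four lookups suffice,
       regardless of the CFAR window size. */
    float sat_rect_sum(const float *sat, int n_range,
                       int r0, int r1, int d0, int d1)
    {
        float sum = sat[d1 * n_range + r1];
        if (r0 > 0)            sum -= sat[d1 * n_range + (r0 - 1)];
        if (d0 > 0)            sum -= sat[(d0 - 1) * n_range + r1];
        if (r0 > 0 && d0 > 0)  sum += sat[(d0 - 1) * n_range + (r0 - 1)];
        return sum;
    }

Composing the CFAR window (excluding the CUT and any guard cells) from a small, fixed number of such rectangle sums is what makes the per-cell cost independent of the window size; only the one-off cost of generating the SAT depends on the input size.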
IV. LOW-LEVEL OPTIMIZATION FOR GPU ARCHITECTURE

Following the high-level algorithmic optimization of the CA-CFAR algorithm, additional lower-level optimizations were performed for the target GPU architecture. The optimizations that were performed are discussed in this section.

When Doppler ambiguities are present, the cells around the edges of the Doppler dimension of a range-Doppler map are adjacent in terms of velocity. As a result, CFAR algorithms often implement wrapping instead of clipping of the CFAR window in the Doppler dimension around the edges of the range-Doppler map. Given that GPU performance can degrade when conditional branches diverge, the need to check for all cases where wrapping can occur is a potential problem. A standard technique used for GPUs to avoid branching in similar situations is to pad an image with an apron of pixels. For all GPU implementations an apron of redundant cells, which mirrors the opposite edge, is replicated around the edge of the range-Doppler map to allow for wrapping without divergent branching.
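A sketch of the apron idea follows. This is our own illustration of the described padding; the function name, host-side construction and pad size (which must not exceed the number of Doppler bins) are assumptions:

    /* Build a padded copy of the range-Doppler map with an apron of
       `pad` Doppler rows on each side, replicated from the opposite
       edge (Doppler wrap-around). Kernels can then read the CFAR
       window with a plain offset and no divergent wrap branches. */
    void pad_doppler_apron(const float *rd, float *padded,
                           int n_range, int n_doppler, int pad)
    {
        for (int d = -pad; d < n_doppler + pad; d++) {
            int src = (d + n_doppler) % n_doppler;   /* wrapped source row */
            for (int r = 0; r < n_range; r++)
                padded[(d + pad) * n_range + r] = rd[src * n_range + r];
        }
    }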
C. Shared Memory Implementation

The target GPU architecture provides a large off-chip memory which is referred to as device memory or global memory. A small amount of on-chip shared memory is also available, which needs to be explicitly loaded with data by the programmer. The shared memory is shared by, and limited to the scope of, the processor cores in a particular streaming multiprocessor unit. Once data has been loaded into shared memory, the read latency is typically in the order of 100 times less than reading from global memory.

The basic naïve implementation reads all cells in the CFAR window of a particular CUT from global memory. A variation of the naïve technique was also implemented which uses shared memory to reduce the read latency for redundant data reads. The shared memory implementation also segments the summation into two stages, where rows and columns are summed separately, allowing for more optimal use of the limited shared memory that is available on the target GPU. With this two-stage process the overlap only increases linearly with increases in the range and Doppler dimensions of the CFAR window, instead of in two dimensions simultaneously as with the global memory implementation.
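The two-stage structure can be sketched as a separable summation. The pair of kernels below is our own simplified illustration using global memory and clipped edges (the paper's version additionally stages data through shared memory, and the full window here includes the CUT and guard cells for brevity):

    // Stage 1: for every cell, sum the window along the range dimension.
    __global__ void row_sum_kernel(const float *rd, float *rowsum,
                                   int n_range, int n_doppler, int win_r)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        int d = blockIdx.y * blockDim.y + threadIdx.y;
        if (r >= n_range || d >= n_doppler) return;
        float s = 0.0f;
        for (int dr = -win_r; dr <= win_r; dr++) {
            int rr = r + dr;
            if (rr >= 0 && rr < n_range)
                s += rd[d * n_range + rr];
        }
        rowsum[d * n_range + r] = s;
    }

    // Stage 2: sum the partial row sums along the Doppler dimension.
    // The overlap between neighbouring windows now grows linearly in
    // each dimension instead of in both dimensions at once.
    __global__ void col_sum_kernel(const float *rowsum, float *winsum,
                                   int n_range, int n_doppler, int win_d)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        int d = blockIdx.y * blockDim.y + threadIdx.y;
        if (r >= n_range || d >= n_doppler) return;
        float s = 0.0f;
        for (int dd = -win_d; dd <= win_d; dd++) {
            int dx = d + dd;
            if (dx >= 0 && dx < n_doppler)
                s += rowsum[dx * n_range + r];
        }
        winsum[d * n_range + r] = s;
    }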
The shared memory implementation has a limitation in terms of the maximum CFAR window size as a result of the limited amount of shared memory available on the target GPU. The CFAR window size in both the range and Doppler dimensions is limited to a maximum of 33 cells.

V. IMPLEMENTATION

The optimization of the CA-CFAR algorithm was an iterative process which included implementation and evaluation cycles. The process, environment and methods that were used to develop, verify and benchmark the implementations are discussed in this section.

A. Development Process

A systematic approach was followed by moving gradually from initial implementations in high-level computing environments to native implementations on the target GPU architecture.

Implementations were performed and verified in MATLAB [7], which was used as a rapid prototyping platform. Native C implementations for the CPU were then performed as a stepping stone to the final GPU target architecture. The initial native C implementations were single-core CPU implementations. Multi-core optimizations were then made for the native C implementations on the CPU architecture. The techniques were then implemented and optimized on the target NVIDIA GPU architecture.

A comparable level of optimization was performed for the native C and GPU implementations on the respective architectures for each technique.

B. Environment

A GPGPU system with an NVIDIA Tesla [8] C1060 GPU was used as the development, test and benchmarking platform for all the GPU implementations. The CPU implementations were benchmarked on a system with dual Intel Xeon X5355 quad-core CPUs, which yields a total of 8 processor cores.

OpenMP [9] was utilized to optimize the native C code in order to use multiple cores of the processor. Restructuring of the code was necessary in some cases in order to obtain maximum benefit from using OpenMP. Opportunities for exploiting parallelism in the algorithms were exposed using this approach, and it also served as a stepping stone towards an efficient parallel implementation for the GPU.
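For the naïve technique the outer loops over cells under test can be parallelized directly. A minimal sketch of this kind of OpenMP usage (our illustration, not the paper's code; compiled with -fopenmp):

    /* Evaluate all CUTs in parallel across the available CPU cores;
       each (r, d) cell is independent in the naive technique, so the
       loops need no synchronization. Edges are clipped for brevity. */
    void cacfar_naive_omp(const float *rd, float *thr,
                          int n_range, int n_doppler,
                          int win_r, int win_d, float alpha)
    {
        #pragma omp parallel for collapse(2) schedule(static)
        for (int d = 0; d < n_doppler; d++) {
            for (int r = 0; r < n_range; r++) {
                float sum = 0.0f;
                int count = 0;
                for (int dd = -win_d; dd <= win_d; dd++) {
                    for (int dr = -win_r; dr <= win_r; dr++) {
                        int rr = r + dr, dx = d + dd;
                        if ((dr == 0 && dd == 0) ||        /* skip the CUT */
                            rr < 0 || rr >= n_range ||
                            dx < 0 || dx >= n_doppler)
                            continue;
                        sum += rd[dx * n_range + rr];
                        count++;
                    }
                }
                thr[d * n_range + r] = alpha * (sum / count);
            }
        }
    }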
NVIDIA Compute Unified Device Architecture (CUDA) [10] was utilized for all the GPU implementations. The CUDA Data Parallel Primitives (CUDPP) [11] library was used to generate the SAT for the GPU implementation that uses the SAT technique.

C. Verification

The output produced by the MATLAB implementation for a particular test data set or test burst was used as a baseline to verify the other implementations against.

NVIDIA provides a CUDA Visual Profiler [10] (computeprof) as part of the standard CUDA Toolkit release. The profiler utilizes hardware counters that are built into the NVIDIA chip sets for debugging and profiling purposes. The GPU kernels were profiled and analyzed during development using this profiler.

CUDA also provides access to hardware timers via GPU events in the API. These events can be used to do accurate timing of GPU code relative to the internal GPU clock. These GPU events were utilized to obtain all the timing information for the results.
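The event-based timing follows this pattern (a minimal sketch of the standard CUDA event API; error checking omitted, and the launch callback is a placeholder for the actual CA-CFAR kernel launches):

    #include <cuda_runtime.h>

    /* Measure the GPU-side latency of the enqueued work in milliseconds. */
    float time_kernel_ms(void (*launch)(void))
    {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);     /* timestamp on the default stream */
        launch();                      /* enqueue the CA-CFAR kernel(s)   */
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);    /* wait for the GPU to pass 'stop' */
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

Bracketing the cudaMemcpy transfers with the same events gives the total latency, from which the kernel and total throughput figures are derived.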
All implementations process a single burst of synthetic radar data which is placed in host (CPU) memory before timing begins. During execution this burst of data is transferred to the GPU, processed and transferred back to the host.

VI. RESULTS

The results show the throughput that was achieved, which is derived from the measured latency of processing a single burst and the data size of the burst. For the GPU implementations, both the kernel and total latency were measured and used to derive the kernel and total throughput statistics. Total latency is the time required to transfer the input data from the host memory to the GPU memory, perform the CA-CFAR processing on the GPU and transfer the output data back from the GPU memory to the host memory. Kernel latency excludes the data transfer to and from the GPU memory in order to isolate processing and I/O performance results to some extent. Speedup graphs are not shown since they only show relative performance, which is of limited use for the radar application, where actual throughput and latency are of primary concern.

A. GPU Implementations

A CFAR window size of N = 40 cells was used as a typical value for the initial GPU implementation benchmarks. A CFAR window size of around N = 32 cells is suggested as being typical by [12] and [13]. The value of N = 40 was used instead to allow for a symmetrical CFAR window around the CUT, which is required for our implementations. The results for all three techniques on the GPU architecture are shown in figure 4.

1) Effect of input size: The sliding window technique performs the worst over the entire range of parameters, which is expected since this technique reduces parallelism in exchange for reduced computational complexity. The SAT technique performs better than the sliding window technique, but is still outperformed by the naïve technique in general for typical CFAR window sizes. The SAT technique has increased complexity and more overhead since it first needs to generate the SAT before it can subsequently benefit from the constant-time lookup.

2) Effect of CFAR window size: The basic naïve implementation, which uses global memory (GMEM), degrades markedly as the CFAR window size increases. The degradation can be attributed to the growth of the CFAR window in two dimensions as N increases and the fact that both dimensions are summed simultaneously in a single phase. The overlap of CFAR windows for neighboring cells increases in both dimensions, which introduces significant read contention.
[Figure 4. Throughput (MiB/s) of the GPU implementations versus the number of range bins ($10^2$ to $10^4$) and versus the CFAR window size N ($10^1$ to $10^2$).]
This effect is still present in the shared memory (SMEM) implementation, even though it segments the processing into rows and columns. However, the effect only manifests itself in a single dimension at a time. For larger N the performance trend for this implementation is better than for the naïve implementation which uses global memory. Again, note that this version is limited to a maximum CFAR window size of 33 cells in both the range and Doppler dimensions.

The result for the Summed-Area Table implementation shows that it is essentially unaffected by the CFAR window size. Generating the SAT requires a constant time that depends only on the input size of the range-Doppler map. The SAT technique can therefore be particularly useful where window sizes become very large.

B. Performance Comparison with C Implementations

The performance of our GPU and CPU implementations is compared with the HPEC CFAR Benchmark [2] in figure 5. The HPEC CFAR Benchmark only uses a 1D CFAR window, which acts only in the range dimension. Our implementations are optimized for a 2D CFAR window, but can be run in a 1D CFAR window mode by setting the Doppler dimension of the CFAR window to 1. The results were obtained using this configuration. The standard datasets for the HPEC CFAR Benchmark also use a data cube, which has range, Doppler and beam dimensions. Multiple beams can be obtained by using multiple receiver elements in a radar system. Our implementations are designed for a data matrix which has range and Doppler dimensions only. Datasets were therefore generated for the HPEC CFAR Benchmark where the number of beams is set to 1. The HPEC CFAR Benchmark CPU code was also compiled and run on the target CPU platform.

C. Data Throughput

For the GPU implementations the throughput generally decreased as the input burst size decreased, which can be attributed to some extent to the fixed overheads associated with transferring data between the host and the device. The suboptimal occupancy, in terms of the parallel units on the GPU that are utilized, for input burst sizes below a certain threshold also contributes to the trend that is observed for the throughput.

VII. CONCLUSION

The results showed that optimized GPU implementations that favour reduced algorithmic complexity and increased parallelism over reduced computational complexity performed the best. The GPU implementation of the naïve technique which uses shared memory performed better, in terms of kernel throughput and latency, than comparable CPU reference implementations for larger input burst sizes (greater than 8K cells) and typical CFAR window sizes. Even though the SAT technique is outperformed by the optimized naïve techniques for typical CFAR window sizes, there is a crossover point where the SAT technique will perform better with increasing CFAR window size.
[Figure 5. Performance comparison for N = 20 and Doppler bins = 32 using a 1D CFAR window: throughput (MiB/s) versus the number of range bins ($10^2$ to $10^4$), shown in two panels.]