Local binary pattern (LBP) is a computationally inexpensive feature descriptor popularly used for detecting and classifying images. However, the directional attributes associated with conventional LBP lead to the generation of undesirable shades and noisy textures. Consequently, it impairs the features corresponding to edges and intensity gradients in an image. To address this problem, in this article, we propose a novel LBP-Like (LBP-L) transform, which can be used as an efficient alternative to the conventional LBP. The LBP-L of an image does not change when the image is flipped horizontally or vertically, or when it is rotated through 90°, 180°, or 270°. Besides, it maps the uniform-intensity zones and edges in images to the same feature space irrespective of orientation. It has significant potential to boost the detection and classification of brain tumors in magnetic resonance (MR) images. The performance of LBP-L has been tested on two benchmark datasets having 3064 and 7023 MR images of the brain with tumors of three different types, namely, meningioma, glioma, and pituitary, along with images without tumors. The detection and classification accuracies of LBP-L on the Kaggle and Figshare datasets are higher than those of the conventional LBP and its variants reported in the literature. It is shown that LBP-L has computational complexity comparable to that of traditional LBP. Moreover, the proposed LBP-L is found to have a lower computation time than the well-known LBP variants. Therefore, it could be used as a more efficient substitute for the latter.
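For reference, the conventional LBP that the proposed transform replaces can be sketched in a few lines. The bit ordering below (clockwise from the top-left neighbour) is one common convention, not the only one in the literature, and the LBP-L transform itself is not reproduced since its definition is not given in this abstract:

```python
def lbp_code(patch):
    """Conventional 3x3 LBP: threshold the 8 neighbours against the
    centre pixel and pack the resulting bits into one byte."""
    c = patch[1][1]
    # neighbours listed clockwise starting from the top-left
    nb = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
          patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for i, p in enumerate(nb):
        if p >= c:
            code |= 1 << i
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch))  # → 241
```

Note how a single comparison per neighbour makes the descriptor cheap to compute; the directional sensitivity criticized above comes from the fixed bit ordering, which changes under rotation and flipping.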
IEEE Transactions on Circuits and Systems I: Regular Papers, Nov 1, 2015
Sign-extension of operands in the shift-add network of multiple constant multiplication (MCM) results in a significant overhead in terms of hardware complexity as well as computation time. This paper presents an efficient approach to minimize that overhead. In the proposed method, the shift-add network of an MCM block is partitioned into three types of sub-networks based on the types of fundamentals and interconnections they involve. For each type of sub-network, a scheme that takes the best advantage of the redundancy in the computation of the sign-extension part is proposed to minimize the overhead. Moreover, we also propose a technique to avoid the additions pertaining to the most significant bits (MSBs) of the fundamentals. Experimental results show that the proposed method always leads to implementations of MCM blocks with the lowest critical path delay. The existing methods for the minimization of sign-extension overhead are designed particularly for single multiplications or the MCM blocks of FIR filters, but the proposed method can be used to reduce the sign-extension overhead for MCM blocks of any application. In the case of FIR filters, the proposed method outperforms other competing methods in terms of critical path delay, area-delay product (ADP), and power-delay product (PDP) as well.
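The shift-add networks with shared fundamentals discussed above can be illustrated with a small behavioural example. The constants 23 and 81 and the shared fundamental 3x are hypothetical choices for illustration; plain Python integers hide the fixed-point sign-extension overhead that the paper actually targets:

```python
def mcm_23_81(x):
    """Shift-add network multiplying one input x by the constants
    23 and 81, sharing the fundamental 3x between both outputs."""
    f3 = (x << 1) + x        # fundamental: 3x (one shift, one add)
    y23 = (f3 << 3) - x      # 24x - x  = 23x
    f27 = (f3 << 3) + f3     # 24x + 3x = 27x
    y81 = (f27 << 1) + f27   # 54x + 27x = 81x
    return y23, y81

print(mcm_23_81(5))  # → (115, 405)
```

In a fixed-point hardware realization, each of these adders must sign-extend its shifted operand to the full output width; the redundancy among those sign-extension bits is what the paper's partitioning scheme exploits.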
In this paper, we propose a new hardware architecture for a very high-speed finite impulse response (FIR) filter using fine-grained seamless pipelining. The proposed full-parallel pipelined FIR filter can produce an output sample in a few gate delays by placing the pipeline registers not only between components but also across them. A precise critical path analysis at the gate level makes it possible to devise an appropriate pipelining strategy for a given throughput requirement. This paper also presents two alternative architectures, each offering different trade-offs in terms of area and throughput rate. The proposed FIR filters are synthesized to measure the maximum throughput and the balance between complexity and speed. The synthesis results show that the proposed fully pipelined FIR filter supports a throughput of up to 1.8 giga-samples per second and offers 73.5% less area-delay product (ADP) than the existing systolic designs. Also, the proposed single multiplier-accumulator (MAC) based FIR filter has 3 times higher throughput and 26.0% less area with 75.8% less ADP compared to the existing design.
Index Terms: Finite impulse response filters, digital filters, pipeline, Wallace tree, critical path, Booth multiplier.
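Among the index terms, the Booth multiplier underlies the multiplier datapaths being pipelined above. A minimal sketch of radix-4 (modified) Booth recoding, which halves the number of partial products a multiplier must accumulate, is given below; it assumes the input is a width-bit two's-complement number with an even width, and is only an illustration of the recoding step, not of the paper's pipeline:

```python
def booth4_digits(n, width):
    """Recode a width-bit two's-complement integer (width even) into
    radix-4 Booth digits in {-2, -1, 0, 1, 2}, least-significant first.
    Each digit is derived from an overlapping 3-bit group of the input,
    with an implicit 0 appended to the right of the LSB."""
    b = [0] + [(n >> i) & 1 for i in range(width)]
    return [b[i] + b[i + 1] - 2 * b[i + 2] for i in range(0, width, 2)]

# A width-bit multiplier then sums one shifted multiplicand multiple
# (0, ±1x, ±2x) per digit instead of one per bit.
print(booth4_digits(6, 4))  # → [-2, 2], i.e. 6 = -2·1 + 2·4
```

Halving the partial-product count is exactly what makes Wallace-tree reduction (the other index term) shallow enough to pipeline at the gate level.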
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, May 1, 2020
We present an innovative region-growing-based technique that improves the surface displacement time-series retrieval capability of the two-scale Small BAseline Subset (SBAS) Differential Interferometric Synthetic Aperture Radar (DInSAR) approach in medium-to-low coherence regions. Starting from a sequence of multitemporal differential SAR interferograms, computed at the full spatial resolution scale, the developed method "propagates" the deformation information relevant to a set of highly coherent SAR pixels [referred to as source pixels (SPs)], for which SBAS-DInSAR deformation measurements have previously been estimated, to their less coherent neighbouring pixels. In this framework, a minimum-norm constrained optimization problem, relying on the use of constrained Delaunay triangulations (CDTs), is solved, where the constraints represent the displacement values at the SP locations. This DInSAR processing scheme, referred to as Constrained-Network Propagation (C-NetP), is easy to implement and, although specifically developed to work within the two-scale SBAS framework, can be extended to wider DInSAR scenarios. The validity of the method has been investigated by processing a SAR dataset acquired over the city of Rome (Italy) by the Cosmo-SkyMed constellation from July 2010 to October 2012. The achieved results demonstrate that the proposed C-NetP method can significantly increase the spatial density of the SBAS-DInSAR measurements, reaching an improvement of about 250%. Such an improvement makes it possible to reveal deformation patterns that are partially or completely hidden when the conventional two-scale SBAS processing is applied.
This is particularly relevant in urban areas, where the assessment and management of the risk associated with the deformation affecting infrastructures is strategic for decision makers and local authorities.
IEEE Transactions on Circuits and Systems I: Regular Papers, Oct 1, 2016
Most of the research on the implementation of finite impulse response (FIR) filters so far focuses on the optimization of the multiple constant multiplication (MCM) block. But it is observed that the product-accumulation section often contributes the major part of the critical path, such that the timing optimization of the MCM block does not significantly impact the overall speedup of the FIR filter. In this paper, a precise analysis and optimization of the critical path for transposed direct form (TDF) FIR filters is proposed. The delay increment introduced by the structural adders is estimated by comparing the delay of a tap with the corresponding delay of the coefficient multiplication only. Based on that, a novel implementation of the product-accumulation section of FIR filters is proposed by retiming the existing delays into the structural adders. It is also shown that the structural adders can be integrated with the MCM block and retimed together for further reduction of the critical path. By using the proposed method, the increment of delay caused by the structural adders can be either completely eliminated or significantly reduced. Experimental results show that the critical path delay can be significantly reduced at the cost of a very small area overhead. The overall area-delay and power-delay performance of the proposed method is superior to that of the existing methods.
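The transposed direct form analysed above can be modelled behaviourally as follows. This cycle-accurate sketch only shows where the structural adders and their registers sit in the datapath; the retiming that the paper proposes on this structure is not reproduced here:

```python
def tdf_fir(coeffs, samples):
    """Transposed direct-form FIR filter: every input sample feeds all
    coefficient multipliers at once, and the structural adders
    accumulate the partial products through a register (delay) line."""
    h = list(coeffs)
    delay = [0] * (len(h) - 1)   # registers between structural adders
    out = []
    for x in samples:
        p = [hk * x for hk in h]             # MCM block: all products
        out.append(p[0] + (delay[0] if delay else 0))
        for i in range(len(delay) - 1):
            delay[i] = p[i + 1] + delay[i + 1]   # structural adder + reg
        if delay:
            delay[-1] = p[-1]
        # each structural adder above adds to the critical path unless
        # the registers are retimed into it, as the paper proposes
    return out

print(tdf_fir([1, 2, 1], [1, 0, 0, 0]))  # impulse response → [1, 2, 1, 0]
```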
IEEE Transactions on Circuits and Systems I: Regular Papers, Dec 1, 2018
Single constant multiplication (SCM) and multiple constant multiplication (MCM) are among the most popular schemes used for low-complexity shift-add implementation of finite impulse response (FIR) filters. While SCM is used in the direct form realization of FIR filters, MCM is used in the transposed direct form structures. Very often, hybrid form FIR filters, where the subsections are implemented by fixed-size MCM blocks, provide better area, time, and power efficiency than traditional MCM and SCM based implementations. To obtain an efficient hybrid form filter, in this paper, we have performed a detailed complexity analysis in terms of the hardware and time consumed by the hybrid form structures. We find that the existing hybrid form structures lead to an undesirable increase of complexity in the structural-adder block. Therefore, to have a more efficient implementation, a variable size partitioning approach is proposed in this paper. It is shown that the proposed approach consumes less area and provides nearly 11% reduction of critical path delay, 40% reduction of power consumption, 15% reduction of area-delay product, 52% reduction of energy-delay product, and 42% reduction of power-area product, on average, over the state-of-the-art methods.
Index Terms: Finite impulse response (FIR) filter, hybrid form FIR filters, constant multiplication schemes, coefficient partitioning approach, low-power designs.
In this paper, we propose a low-latency scaling-free CORDIC algorithm. The reduction of latency is achieved by a suitable combination of multiple angles in different pipeline stages. The efficient use of trigonometric identities, along with an augmented Taylor series approximation and modified nano-rotation in different stages, makes the CORDIC algorithm completely scaling-free. The comparison of FPGA implementation results with other CORDIC architectures on a Xilinx SPARTAN 3E FPGA shows that the proposed architecture has a significantly lower slice-delay product with similar accuracy.
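For context, the classic rotation-mode CORDIC that the proposed algorithm improves upon is sketched below. It needs an explicit scaling step to undo the gain K accumulated by the micro-rotations; removing that step (and shortening the iteration chain) is precisely the paper's contribution, which is not reproduced here. The iteration count of 24 is an arbitrary illustrative choice:

```python
import math

def cordic_rotate(theta, iters=24):
    """Baseline rotation-mode CORDIC: rotate (1, 0) towards angle theta
    (|theta| ≲ 1.74 rad) by micro-rotations of arctan(2^-i), then divide
    out the known constant gain K.  Returns ≈ (cos theta, sin theta)."""
    x, y, z = 1.0, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0          # steer z towards 0
        x, y, z = (x - d * y * 2.0**-i,      # shift-add micro-rotation
                   y + d * x * 2.0**-i,
                   z - d * math.atan(2.0**-i))
    K = 1.0                                  # accumulated CORDIC gain
    for i in range(iters):
        K *= math.sqrt(1.0 + 2.0**(-2 * i))
    return x / K, y / K
```

In hardware, the two multiplications by 2^-i are plain wire shifts; only the final division by K costs real multipliers, which is why scaling-free variants matter.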
IEEE Transactions on Circuits and Systems for Video Technology, Dec 1, 2018
High-speed corner detection is an essential step in many real-time computer vision applications, e.g., object recognition, motion analysis, and stereo matching. Hardware implementation of corner detection algorithms such as the Harris corner detector (HCD) has become a viable solution for meeting the real-time requirements of these applications. A major challenge lies in the design of power-, energy-, and area-efficient architectures that can be deployed in tightly constrained embedded systems while still meeting real-time requirements. In this paper, we propose a bit-width optimization strategy for designing a hardware-efficient HCD that exploits the thresholding step in the algorithm to determine interest points from the corner responses. The proposed strategy relies on the threshold as a guide to truncate the bit-widths of the operators at various stages of the HCD pipeline with only a marginal loss of accuracy. Synthesis results based on 65-nm CMOS technology show that the proposed strategy leads to a power-delay reduction of 35.2% and an area reduction of 35.4% over the baseline implementation. In addition, through careful retiming, the proposed implementation achieves over 2.2 times increase in maximum frequency while achieving an area reduction of 35.1% and a power-delay reduction of 35.7% over the baseline implementation. Finally, we performed repeatability tests to show that the optimized HCD architecture achieves accuracy comparable to the baseline implementation (the average decrease of repeatability is less than 0.6%).
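The algorithm being accelerated computes, per pixel, the Harris response R = det(M) − k·tr(M)² from a windowed matrix of gradient products. The sketch below is only a reference model of that computation; the 3×3 box window and k = 0.04 are common but hypothetical choices, and the bit-width-optimized pipeline itself is not reproduced:

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response per pixel.  M is the 2x2 matrix of
    windowed gradient products; here the window is a naive 3x3 box sum
    (with wrap-around at the borders, acceptable for interior pixels)."""
    img = img.astype(float)
    Ix = np.gradient(img, axis=1)            # horizontal gradient
    Iy = np.gradient(img, axis=0)            # vertical gradient
    box = lambda a: sum(np.roll(np.roll(a, dy, 0), dx, 1)
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    det = Sxx * Syy - Sxy * Sxy
    tr = Sxx + Syy
    return det - k * tr * tr

# Usage: a bright quadrant has a strong response at its corner and
# zero response in flat regions.
img = np.zeros((9, 9))
img[4:, 4:] = 10.0
R = harris_response(img)
```

The thresholding step that follows (keeping only R above a threshold) is what the paper's strategy exploits: operand bits that cannot change the above/below-threshold decision can be truncated.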
IEEE Transactions on Circuits and Systems I: Regular Papers, Aug 1, 2015
It is well known that the (a,b)-way Karatsuba algorithm (KA) with a ≠ b is used for efficient digit-serial multiplication with subquadratic space-complexity architectures. In this paper, based on the (a,b)-way KA decomposition, we derive a novel k-way block recombination KA (BRKA) decomposition for digit-serial multiplication. The proposed k-way BRKA is formed by a power-of-2 polynomial decomposition. Theoretical analysis shows that the k-way BRKA can provide the necessary trade-off between space and time complexity. Using the (4,2)-way KA to construct the proposed k-way BRKA architecture in GF(2^409), it is shown that the proposed 2-way BRKA approach requires less area, and the proposed 8-way BRKA approach requires less computation time and a lower area-time product compared to the existing (a,b)-way KA decomposition.
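The recursion that all KA variants build on is the classic 2-way Karatsuba split: three half-size multiplications replace four. The integer version below only illustrates that recursion; the paper's (a,b)-way decompositions and block recombination operate on GF(2^m) polynomials, not integers:

```python
def karatsuba(x, y):
    """2-way Karatsuba multiplication of non-negative integers:
    x*y = p1·2^(2n) + p3·2^n + p2 with only three recursive products,
    where p3 = (xh+xl)(yh+yl) - p1 - p2 reuses p1 and p2."""
    if x < 16 or y < 16:                     # small base case
        return x * y
    n = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> n, x & ((1 << n) - 1)      # split into high/low halves
    yh, yl = y >> n, y & ((1 << n) - 1)
    p1 = karatsuba(xh, yh)
    p2 = karatsuba(xl, yl)
    p3 = karatsuba(xh + xl, yh + yl) - p1 - p2
    return (p1 << 2 * n) + (p3 << n) + p2
```

Trading one multiplication for extra additions per level is the source of the subquadratic space complexity mentioned above, and choosing how the recombination additions are scheduled is what gives the k-way BRKA its space/time trade-off.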
Cognitive radio is an emerging technology in wireless communications for dynamically accessing under-utilized spectrum resources. In order to maximize network utilization, vacant channels are assigned to cognitive users without interference to primary users. This is performed in the spectrum allocation (SA) module of the cognitive radio cycle. Spectrum allocation is an NP-hard problem, so the algorithmic time complexity increases with the cognitive radio network parameters. This paper addresses this by solving the SA problem using the Differential Evolution (DE) algorithm and compares its quality of solution and time complexity with those of the Particle Swarm Optimization (PSO) and Firefly algorithms. In addition, an Intellectual Property (IP) core of the DE-based SA algorithm is developed and interfaced with the PowerPC440 processor of a Xilinx Virtex-5 FPGA via the Auxiliary Processor Unit (APU) to accelerate the execution of the spectrum allocation task. The acceleration of this coprocessor is compared with equivalent floating- and fixed-point implementations of the algorithm on the PowerPC440 processor. The simulation results show that the DE algorithm improves the quality of solution and time complexity by 29.9% and 242.32% compared to PSO, and by 19.04% and 46.3% compared to the Firefly algorithm. Furthermore, the implementation results show that the coprocessor accelerates the SA task by 76.79–105× and 5.19–6.91× compared to the floating- and fixed-point implementations of the algorithm on the PowerPC processor, respectively. It is also observed that the power consumption of the coprocessor is 26.5 mW.
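The DE algorithm used above follows a mutate-crossover-select loop. A minimal DE/rand/1/bin sketch on a toy sphere function is shown below; all parameters (population 20, F = 0.5, CR = 0.9) and the objective are illustrative defaults, not the SA formulation or settings from the paper:

```python
import random

def de_minimize(f, bounds, pop_size=20, F=0.5, CR=0.9, gens=100, seed=1):
    """DE/rand/1/bin: for each target vector, build a mutant from three
    distinct random peers, binomially cross it with the target, and keep
    the trial only if it is no worse."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(pop_size)]
    fit = [f(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            jrand = rng.randrange(dim)       # force at least one gene over
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if (rng.random() < CR or j == jrand) else pop[i][j]
                     for j in range(dim)]
            ft = f(trial)
            if ft <= fit[i]:                 # greedy selection
                pop[i], fit[i] = trial, ft
    k = min(range(pop_size), key=fit.__getitem__)
    return pop[k], fit[k]

best, val = de_minimize(lambda x: sum(v * v for v in x), [(-5, 5)] * 3)
```

For the SA problem, the objective would instead score a channel assignment by network utilization under interference constraints; the loop structure is unchanged, which is what makes it amenable to a fixed hardware IP core.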
IEEE Transactions on Circuits and Systems for Video Technology, Aug 1, 2017
An approximate kernel for the discrete cosine transform (DCT) of length 4 is derived from the 4-point DCT defined by the High Efficiency Video Coding (HEVC) standard and used for the computation of DCTs and inverse DCTs (IDCTs) of power-of-2 lengths. There are two reasons to consider the DCT of length 4 as the basic module. First, it allows the computation of DCTs of lengths 4, 8, 16, and 32 prescribed by HEVC. Moreover, the DCTs generated from the 4-point DCT not only involve lower complexity but also offer better compression performance. Full-parallel and area-constrained architectures for the proposed approximate DCT are presented to provide a flexible trade-off between area and time complexities. Also, a reconfigurable architecture is proposed where an 8-point DCT can be used to compute a pair of 4-point DCTs. Using the same reconfiguration scheme, the 32-point DCT can be configured for the parallel computation of two 16-point DCTs, four 8-point DCTs, or eight 4-point DCTs. The proposed reconfigurable design can support real-time coding for high-definition video sequences in the 8K UHDTV format (7680 × 4320 @ 30 fps). A unified forward and inverse transform architecture is also proposed, where the hardware complexity is reduced by sharing hardware between the DCT and IDCT computations. The proposed approximation has nearly the same arithmetic complexity and hardware requirement as those of recently proposed related methods, but involves significantly less error energy and offers better PSNR than the others when DCTs of larger lengths are used. Detailed comparisons of the complexity, energy efficiency, and compression performance of different DCT approximation schemes for video coding are also presented. It is shown that the proposed approximation provides better compressed-image quality than the other approximate DCTs.
The proposed method can perform HEVC-compliant video coding with marginal degradation of video quality and a slight increase in bit rate at a fraction of the computational complexity of the standard transform.
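The basic module referred to above is the exact 4-point forward transform kernel defined by the HEVC standard, shown below as a reference model. The paper's approximate kernel is derived from this matrix but is not reproduced in the abstract, so only the exact version is sketched:

```python
# HEVC 4-point forward core transform matrix (integer-scaled DCT-II)
H4 = [[64,  64,  64,  64],
      [83,  36, -36, -83],
      [64, -64, -64,  64],
      [36, -83,  83, -36]]

def dct4(x):
    """Exact HEVC 4-point forward transform of a length-4 sample
    vector: one matrix-vector product (scaling/rounding of the real
    codec pipeline omitted)."""
    return [sum(h * v for h, v in zip(row, x)) for row in H4]

print(dct4([1, 1, 1, 1]))  # flat input → all energy in DC: [256, 0, 0, 0]
```

An approximation replaces the odd-row constants (83, 36) with cheaper shift-add values, which is what lets the derived 8-, 16-, and 32-point transforms be built without multipliers.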
We are very pleased to welcome you to the second annual edition of the International Symposium on Electronics System Design (ISED 2011) in the port city of Kochi, India. This builds on the successful debut of ISED last year, and we are confident you will find this year's program compelling. The goal of ISED is to provide a venue for researchers, educators, students, and industry personnel worldwide to present, learn about, explore, and collaborate on the latest developments in electronic system design research and education. The ISED 2011 program is spread over three days during 19-21 December 2011, with the main conference research tracks and talks scheduled during the first two days, followed by the Workshop for Engineering Scholars (WES) on the last day. This year, ISED has a co-located event immediately following it on December 22-23, 2011: the International Workshop on Embedded Computing & Communication Systems (IWECC) at Rajagiri School of Engineering & Technology in Kochi.
Distributed arithmetic (DA) based architectures are popularly used for inner-product computation in various applications. The existing literature shows that the use of approximate DA architectures in error-resilient applications provides a significant improvement in the overall efficiency of the system. Based on a precise error analysis, we find that the existing methods introduce a large truncation error in the computation of the final inner product. Therefore, to achieve a suitable trade-off between the overall hardware complexity and the truncation error, a weight-dependent truncation approach is proposed in this paper. The overall efficiency of the structure is further enhanced by incorporating an input truncation strategy in the proposed method. It is observed that the area, time, and energy efficiency of the proposed designs are superior to those of the existing designs, with significantly lower truncation error. An evaluation on a noisy-image-smoothing application is also presented in this paper.
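The exact DA scheme that the approximate variants truncate can be modelled as follows: precompute a LUT of all partial sums of the fixed coefficients, then accumulate one LUT lookup per input bit plane. This sketch is the exact (untruncated) baseline for unsigned inputs; the paper's weight-dependent truncation of the accumulation is not reproduced:

```python
def da_inner_product(A, X, width):
    """Distributed-arithmetic inner product sum(A[k]*X[k]) for
    unsigned width-bit inputs X and fixed coefficients A.
    One shift-accumulate step per bit plane, MSB first."""
    K = len(A)
    # LUT: for each K-bit address, the sum of the selected coefficients
    lut = [sum(A[k] for k in range(K) if addr >> k & 1)
           for addr in range(1 << K)]
    acc = 0
    for b in range(width - 1, -1, -1):
        addr = 0
        for k in range(K):                   # gather bit b of every input
            addr |= ((X[k] >> b) & 1) << k
        acc = (acc << 1) + lut[addr]         # shift-accumulate
    return acc

print(da_inner_product([1, 2, 3], [3, 1, 2], 2))  # 1·3 + 2·1 + 3·2 → 11
```

Because the low-order bit planes contribute least to the result, dropping some of their shift-accumulate steps (the truncation discussed above) trades a bounded error for fewer cycles and smaller adders.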
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Apr 1, 2016
Retiming of digital circuits is conventionally based on estimates of the propagation delays across different paths in the data-flow graph (DFG) obtained by a discrete component timing model, which implicitly assumes that the operation of a node can begin only after the completion of the operation(s) of its preceding node(s) to obey the data-dependence requirement. Such a discrete component timing model very often gives much higher estimates of the propagation delays than the actual values, particularly when the computations in the DFG nodes correspond to fixed-point arithmetic operations like additions and multiplications. On the other hand, it is very often imperative to deal with DFGs of such higher granularity at the architecture-level abstraction of digital system design for mapping an algorithm to the desired architecture, where the overestimation of propagation delay leads to unwanted pipelining and an undesirable increase in pipeline overheads. In this paper, we propose the connected component timing model to easily obtain adequately precise estimates of the propagation delays across different combinational paths in a DFG, for efficient cutset retiming that reduces the critical path substantially without a significant increase in register complexity and latency. Apart from that, we propose novel node-splitting and node-merging techniques that can be used in combination with the existing retiming methods to reduce the critical path to a fraction of that of the original DFG with a small increase in overall register complexity.
In this paper, a quadral-duty digital pulse width modulation (QDPWM) technique-based low-cost hardware architecture for a brushless DC (BLDC) motor drive is proposed. The proposed architecture incorporates an efficient speed-calculation and commutation circuitry to achieve compactness of the total architecture. The speed-calculation circuit is designed to perform edge detection of the rotor position signal with resistance to external noise and glitches. The proposed architecture is implemented on field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) platforms using the TSMC 180 nm technology library. The FPGA implementation is compared with existing architectures to validate the resource utilization of the proposed architecture. The ASIC implementation illustrates that the proposed architecture, operating at 50 MHz, reduces the gate count and power dissipation to approximately half and one-third, respectively, compared to a standard PI-controller-based PWM control architecture. Experimental validation of the FPGA-based architecture is also performed using a laboratory prototype of the BLDC motor drive hardware setup. The performance of the drive is examined for various speed commands and loading conditions. Extensive experimental analysis has been carried out to validate the performance of the proposed architecture-based drive under dynamic load and speed-command variation. The ability of the proposed circuit to tolerate noise in the Hall position sensor signals is verified by adding intentional glitches to the signal.
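The glitch-resistant edge detection described above can be illustrated with a simple behavioural model: an edge on the sensed signal is accepted only after n consecutive consistent samples, so isolated glitches are ignored. This is a generic hypothetical sketch of the idea (the parameter n and the sampling model are illustrative), not the paper's circuit:

```python
def debounced_edges(samples, n=3):
    """Glitch-resistant edge detector for a sampled two-level signal
    (e.g. a Hall position sensor bit): the accepted level changes only
    after n consecutive samples at the new value.  Returns a list of
    (sample_index, old_level, new_level) accepted edges, with the edge
    reported at the sample that completes the run of n."""
    level = samples[0]       # currently accepted level
    cand = level             # candidate new level being counted
    count = 0
    edges = []
    for t, s in enumerate(samples):
        if s == level:
            count = 0                        # agreement resets the count
        else:
            count = count + 1 if s == cand else 1
            cand = s
            if count >= n:                   # persistent change: accept
                edges.append((t, level, s))
                level, count = s, 0
    return edges

# the single-sample glitch at index 2 is rejected; the sustained
# transition is accepted once three consecutive 1s have been seen
print(debounced_edges([0, 0, 1, 0, 0, 1, 1, 1, 1, 0]))  # → [(7, 0, 1)]
```

In the drive, the interval between such accepted edges of the rotor position signal is what the speed-calculation circuit measures.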
Systolic designs are considered suitable candidates for high-speed VLSI realization for their inherent advantages of simplicity, regularity, modularity, and local interconnections. During the past few decades, several systolic designs of finite field multipliers have been proposed in the literature. They are popularly used to achieve very high throughput rates without any centralized control. But all these designs incorporate heavy systolic penalties in terms of register complexity and latency of computation. We have analyzed here the hidden systolic penalties in those multipliers and proposed a digit-level systolic-like structure and a super-systolic-like structure for finite field multiplication. We have shown that the key issues in obtaining such designs are the choice of design layout and digit size, which substantially affect the register complexity, critical path, and latency. We have determined the optimal digit size and design layout to reduce the systolic penalties and, at the same time, achieve a lower critical path, a higher throughput rate, and lower latency with less register complexity and lower overall area complexity.
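The computation that these systolic arrays implement is multiplication in GF(2^m) modulo an irreducible polynomial. A bit-serial (MSB-first) reference model is sketched below; each loop iteration corresponds to one cycle of the kind of serial datapath that the digit-level structures above unroll by the chosen digit size. The AES field polynomial is used here purely as a familiar example, not as the field from the papers:

```python
def gf2m_mul(a, b, poly=0x11B, m=8):
    """Bit-serial MSB-first multiplication in GF(2^m) modulo the
    irreducible polynomial 'poly' (default: x^8+x^4+x^3+x+1, the AES
    field).  Per cycle: multiply the running result by x, reduce if it
    overflows m bits, and conditionally add (XOR) the multiplicand."""
    r = 0
    for i in range(m - 1, -1, -1):
        r <<= 1                  # r = r * x
        if r >> m & 1:
            r ^= poly            # modular reduction by the field poly
        if b >> i & 1:
            r ^= a               # accumulate a * x^i
    return r

print(hex(gf2m_mul(0x57, 0x83)))  # FIPS-197 worked example → 0xc1
```

A digit-serial design processes d of these iterations per cycle; choosing d (the digit size) is exactly the layout/digit-size trade-off analyzed above.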
Local binary pattern (LBP) is a computationally inexpensive feature descriptor popularly used for... more Local binary pattern (LBP) is a computationally inexpensive feature descriptor popularly used for detecting and classifying images. However, the directional attributes associated with conventional LBP lead to the generation of undesirable shades and noisy textures. Consequently, it impairs the features corresponding to edges and intensity gradients in an image. To address this problem, in this article, we have proposed a novel LBP-Like (LBP-L) transform, which could be used as an efficient alternative to the conventional LBP. The LBP-L of an image does not change due to horizontal and vertical flipping of images or when the image is rotated through 90°, 180°, or 270°. Besides, it maps the uniform intensity zones and edges in images to the same feature space irrespective of the orientation. It has significant potential to boost the detection and classification of brain tumors in magnetic resonance (MR) images. The performance of LBP-L has been tested in two benchmark datasets having 3064 and 7023 MR images of the brain with tumors of three different types, namely, meningioma, glioma, and pituitary, along with images without tumors. The detection and classification accuracies of LBP-L on the Kaggle and Figshare datasets are higher than those of the conventional LBP and its variants reported in the literature. It is shown that LBP-L has computational complexity comparable to that of traditional LBP. Moreover, the proposed LBP-L is found to have a lower computation time compared to the well-known LBP variants. Therefore, it could be used as a more efficient substitute for the latter.
IEEE Transactions on Circuits and Systems I-regular Papers, Nov 1, 2015
Sign-extension of operands in the shift-add network of multiple constant multiplication (MCM) res... more Sign-extension of operands in the shift-add network of multiple constant multiplication (MCM) results in a significant overhead in terms of hardware complexity as well as computation time. This paper presents an efficient approach to minimize that overhead. In the proposed method, the shift-add network of an MCM block is partitioned into three types of sub-networks based on the types of fundamentals and interconnections they involve. For each type of sub-network, a scheme which takes the best advantage of the redundancy in the computation of sign-extension part is proposed to minimize the overhead. Moreover, we also propose a technique to avoid the additions pertaining to the most significant bits (MSBs) of the fundamentals. Experimental results show that the proposed method always leads to implementations of MCM blocks with the lowest critical path delay. The existing methods for the minimization of sign-extension overhead are designed particularly for single multiplication or MCM blocks of FIR filter, but the proposed method can be used to reduce the overhead of sign-extension for MCM blocks of any application. In the case of FIR filters, the proposed method outperforms other competing methods in terms of critical path delay, area-delay product (ADP), and power-delay product (PDP), as well.
In this paper, we propose a new hardware architecture of a very high-speed finite impulse respons... more In this paper, we propose a new hardware architecture of a very high-speed finite impulse response (FIR) filter using fine-grained seamless pipelining. The proposed full-parallel pipeline FIR filter can produce an output sample in a few gate delays by placing the pipeline registers not only in between components, but also across the components. A precise critical path analysis at the gate level allows to create an appropriate pipelining strategy depending on the throughput requirement. This paper also presents two alternative architectures, each offering different trade-offs in terms of area and throughput rate. The proposed FIR filters are synthesized to measure the maximum throughput and the balance between complexity and speed. The synthesis results show that the proposed fully pipelined FIR filter supports up to throughput of 1.8 Giga samples per second and offers 73.5% less area-delay product (ADP) than the existing systolic designs. Also, the proposed single multiplier-accumulator (MAC) based FIR filter has 3 times higher throughput and 26.0% less area with 75.8% less ADP compared to the existing design. INDEX TERMS Finite impulse response filters, digital filters, pipeline, Wallace tree, critical path, Booth multiplier.
IEEE Transactions on Very Large Scale Integration Systems, May 1, 2020
We present an innovative region-growing-based technique that permits to improve the surface displ... more We present an innovative region-growing-based technique that permits to improve the surface displacement timeseries retrieval capability of the two-scale Small BAseline Subset (SBAS) Differential Interferometric Synthetic Aperture Radar (DInSAR) approach in medium-to-low coherence regions. Starting from a sequence of multitemporal differential SAR interferograms, computed at the full spatial resolution scale, the developed method "propagates" the information on the deformation relevant to a set of high coherent SAR pixels [referred to as source pixels (SPs)], in correspondence to which SBAS-DInSAR deformation measurements have previously been estimated, to their less coherent neighbouring ones. In this framework, a minimum-norm constrained optimization problem, relying on the use of constrained Delaunay triangulations (CDTs), is solved, where the constraints represent the displacement values at the SP locations. Such DInSAR processing scheme, referred to as Constrained-Network Propagation (C-NetP), is easy to implement and, although specifically developed to work within the two-scale SBAS framework, it can be extended to wider DInSAR scenarios. The validity of the method has been investigated by processing a SAR dataset acquired over the city of Rome (Italy) by the Cosmo-SkyMed constellation from July 2010 to October 2012. The achieved results demonstrate that the proposed C-NetP method is capable to significantly increase the spatial density of the SBAS-DInSAR measurements, reaching an improvement of about 250%. Such an improvement allows revealing deformation patterns that are partially or completely hidden, by applying the conventional twoscale SBAS processing. 
This is particularly relevant in urban areas where the assessment and management of the risk associated to the deformation affecting infrastructures is strategic for decision makers and local authorities.
IEEE Transactions on Circuits and Systems I-regular Papers, Oct 1, 2016
Most of the research on the implementation of finite impulse response (FIR) filter so far focuses... more Most of the research on the implementation of finite impulse response (FIR) filter so far focuses on the optimization of the multiple constant multiplication (MCM) block. But it is observed that the product-accumulation section often contributes the major part of the critical path, such that the timing optimization of MCM block does not impact significantly on the overall speedup of the FIR filters. In this paper, a precise analysis and optimization of critical path for transposed direct from (TDF) FIR filters is proposed. The delay increment introduced by structural adders is estimated by comparing the delay of a tap and the corresponding delay of the coefficient multiplication only. Based on that, a novel implementation of the product-accumulation section of FIR filters is proposed by retiming the existing delays into the structural adders. It is also shown that the structural adders can be integrated with the MCM block and retimed together for further reduction of critical path. By using the proposed method, the increment of delay caused by the structural adders can be either completely eliminated or significantly reduced. Experimental results show that the critical path delay can be significantly reduced at the cost of very small area overhead. The overall area-delay performance and power-delay performance of the proposed method are superior to the existing methods.
IEEE Transactions on Circuits and Systems I-regular Papers, Dec 1, 2018
Single constant multiplication (SCM) and multiple constant multiplication (MCM) are among the most popular schemes used for low-complexity shift-add implementation of finite impulse response (FIR) filters. While SCM is used in the direct-form realization of FIR filters, MCM is used in the transposed direct-form structures. Very often, hybrid-form FIR filters, where the subsections are implemented by fixed-size MCM blocks, provide better area, time, and power efficiency than traditional MCM- and SCM-based implementations. To obtain an efficient hybrid-form filter, in this paper we perform a detailed analysis of the hardware and time complexity of the hybrid-form structures. We find that the existing hybrid-form structures lead to an undesirable increase of complexity in the structural-adder block. Therefore, to obtain a more efficient implementation, a variable-size partitioning approach is proposed in this paper. It is shown that the proposed approach consumes less area and provides, on average, nearly 11% reduction of critical path delay, 40% reduction of power consumption, 15% reduction of area-delay product, 52% reduction of energy-delay product, and 42% reduction of power-area product over the state-of-the-art methods. Index Terms: Finite impulse response (FIR) filter, hybrid form FIR filters, constant multiplication schemes, coefficient partitioning approach, low-power designs.
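The partitioning idea can be modelled in software: the coefficient set is split into sub-blocks (of variable sizes, as in the proposed approach), each sub-block acts as a small sub-filter, and the delayed sub-filter outputs are accumulated. The sketch below is only a functional model of the decomposition; the partition sizes, function name, and data types are assumptions, and the hardware mapping to MCM blocks is not reproduced.

```python
def partitioned_fir(samples, coeffs, part_sizes):
    """Functional model of a partitioned (hybrid-form) FIR filter: each
    coefficient sub-block is a small sub-filter whose contribution is
    delayed by the block's offset and accumulated into the output."""
    assert sum(part_sizes) == len(coeffs)
    n = len(samples)
    y = [0] * n
    offset = 0
    for size in part_sizes:
        block = coeffs[offset:offset + size]
        for t in range(n):
            for k, c in enumerate(block):
                if t - offset - k >= 0:        # causal: skip pre-start samples
                    y[t] += c * samples[t - offset - k]
        offset += size
    return y
```

Any partition of the coefficients yields the same output as the unpartitioned filter, which is the invariant the variable-size partitioning exploits while redistributing hardware cost.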
In this paper, we propose a low-latency scaling-free CORDIC algorithm. The reduction of latency is achieved by a suitable combination of multiple angles in different pipeline stages. The efficient use of trigonometric identities, along with an augmented Taylor series approximation and modified nano-rotation in different stages, makes the CORDIC algorithm completely scaling-free. A comparison of FPGA implementation results with other CORDIC architectures on a Xilinx SPARTAN 3E FPGA shows that the proposed architecture has a significantly lower slice-delay product with similar accuracy.
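For context, the conventional rotation-mode CORDIC that such scaling-free variants improve on can be sketched in a few lines: the target angle is decomposed into micro-rotations of atan(2^-i), and the accumulated gain is compensated at the end (the division that scaling-free designs eliminate). This is the textbook baseline, not the proposed algorithm; the iteration count and function name are assumptions.

```python
import math

def cordic_rotate(angle, iterations=24):
    """Classic rotation-mode CORDIC: returns (cos, sin) of `angle`
    (convergence requires |angle| <~ 1.743 rad).  The gain factor is
    divided out at the end; scaling-free designs avoid this step."""
    x, y, z = 1.0, 0.0, angle
    gain = 1.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0            # rotate toward zero residual
        x, y = x - d * y * 2**-i, y + d * x * 2**-i
        z -= d * math.atan(2**-i)
        gain *= math.sqrt(1 + 2**(-2 * i))
    return x / gain, y / gain
```

With 24 iterations the result agrees with the library cosine/sine to roughly single-precision accuracy.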
IEEE Transactions on Circuits and Systems for Video Technology, Dec 1, 2018
High-speed corner detection is an essential step in many real-time computer vision applications, e.g., object recognition, motion analysis, and stereo matching. Hardware implementation of corner detection algorithms such as the Harris corner detector (HCD) has become a viable solution for meeting the real-time requirements of these applications. A major challenge lies in the design of power-, energy-, and area-efficient architectures that can be deployed in tightly constrained embedded systems while still meeting real-time requirements. In this paper, we propose a bit-width optimization strategy for designing a hardware-efficient HCD that exploits the thresholding step of the algorithm, which determines interest points from the corner responses. The proposed strategy relies on the threshold as a guide to truncate the bit-widths of the operators at various stages of the HCD pipeline with only marginal loss of accuracy. Synthesis results based on 65-nm CMOS technology show that the proposed strategy leads to a power-delay reduction of 35.2% and an area reduction of 35.4% over the baseline implementation. In addition, through careful retiming, the proposed implementation achieves over a 2.2-times increase in maximum frequency while achieving an area reduction of 35.1% and a power-delay reduction of 35.7% over the baseline implementation. Finally, we performed repeatability tests to show that the optimized HCD architecture achieves accuracy comparable to the baseline implementation (the average decrease in repeatability is less than 0.6%).
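A minimal software baseline of the HCD pipeline helps locate where the bit-width truncation applies: the gradient products A, B, C accumulated over a window feed the corner response R, and only pixels with R above the threshold become interest points, so low-order bits that cannot flip that comparison are candidates for truncation. The sketch below is the unoptimized baseline with assumed parameters (central differences, unweighted 3x3 window, k = 0.05), not the optimized architecture.

```python
def harris_response(img, k=0.05):
    """Baseline Harris corner response: central-difference gradients, an
    unweighted 3x3 window, and R = det(M) - k * trace(M)^2.  Pixels with
    R above a threshold are reported as corners; the bit-widths of the
    intermediate sums A, B, C are what threshold-guided truncation trims."""
    h, w = len(img), len(img[0])
    Ix = [[0] * w for _ in range(h)]
    Iy = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            Ix[r][c] = img[r][c + 1] - img[r][c - 1]
            Iy[r][c] = img[r + 1][c] - img[r - 1][c]
    R = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            A = B = C = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    ix, iy = Ix[r + dr][c + dc], Iy[r + dr][c + dc]
                    A, B, C = A + ix * ix, B + iy * iy, C + ix * iy
            R[r][c] = (A * B - C * C) - k * (A + B) ** 2
    return R

# Test image with one bright quadrant: the quadrant's corner produces a
# strongly positive response, while its straight edges produce negative ones.
img = [[10 if r >= 4 and c >= 4 else 0 for c in range(9)] for r in range(9)]
R = harris_response(img)
```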
IEEE Transactions on Circuits and Systems I-regular Papers, Aug 1, 2015
It is well known that the (a,b)-way Karatsuba algorithm (KA) with a ≠ b is used for efficient digit-serial multiplication with subquadratic space-complexity architectures. In this paper, based on the (a,b)-way KA decomposition, we derive a novel k-way block recombination KA (BRKA) decomposition for digit-serial multiplication. The proposed k-way BRKA is formed by a power-of-2 polynomial decomposition. Theoretical analysis shows that the k-way BRKA can provide the necessary trade-off between space and time complexity. Using the (4,2)-way KA to construct the proposed k-way BRKA architecture in GF(2^409), it is shown that the proposed 2-way BRKA approach requires less area, and the proposed 8-way BRKA approach requires less computation time and a smaller area-time product, compared to the existing (a,b)-way KA decomposition.
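The underlying decomposition can be made concrete with the simplest case, a 2-way Karatsuba split over GF(2)[x] (where addition is XOR): each n-bit multiplication is replaced by three half-size multiplications plus recombination shifts. This software sketch shows only the recursion; it is not the (a,b)-way digit-serial hardware, and the base-case size of 8 bits is an arbitrary assumption.

```python
def clmul(a, b):
    """Schoolbook carry-less (GF(2)[x]) multiplication of bit-polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, n):
    """2-way Karatsuba over GF(2): split the n-bit operands into halves
    and use three half-size products instead of four.  In GF(2) both the
    'add' and the 'subtract' of the recombination step are XOR."""
    if n <= 8:
        return clmul(a, b)
    m = n // 2
    mask = (1 << m) - 1
    a0, a1 = a & mask, a >> m
    b0, b1 = b & mask, b >> m
    low = karatsuba_gf2(a0, b0, m)
    high = karatsuba_gf2(a1, b1, n - m)
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, n - m)
    return low ^ (high << (2 * m)) ^ ((mid ^ low ^ high) << m)
```

The three-multiplication recombination is exactly what gives KA-based multipliers their subquadratic space complexity; the (a,b)-way and k-way BRKA schemes generalize this split to unequal and recombined blocks.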
Cognitive radio is an emerging technology in wireless communications for dynamically accessing under-utilized spectrum resources. In order to maximize network utilization, vacant channels are assigned to cognitive users without interference to primary users. This is performed in the spectrum allocation (SA) module of the cognitive radio cycle. Spectrum allocation is an NP-hard problem, so the algorithmic time complexity increases with the cognitive radio network parameters. This paper addresses this by solving the SA problem using the Differential Evolution (DE) algorithm and comparing its quality of solution and time complexity with those of the Particle Swarm Optimization (PSO) and Firefly algorithms. In addition, an Intellectual Property (IP) core of the DE-based SA algorithm is developed and interfaced with the PowerPC440 processor of a Xilinx Virtex-5 FPGA via the Auxiliary Processor Unit (APU) to accelerate the execution of the spectrum allocation task. The acceleration of this coprocessor is compared with equivalent floating- and fixed-point implementations of the algorithm on the PowerPC440 processor. The simulation results show that the DE algorithm improves the quality of solution and time complexity by 29.9% and 242.32% compared to PSO, and by 19.04% and 46.3% compared to the Firefly algorithm. Furthermore, the implementation results show that the coprocessor accelerates the SA task by 76.79-105× and 5.19-6.91× compared to the floating- and fixed-point implementations on the PowerPC processor, respectively. It is also observed that the power consumption of the coprocessor is 26.5 mW.
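The DE algorithm itself is compact enough to sketch. The version below is a generic DE/rand/1/bin optimizer minimizing an arbitrary fitness function; the spectrum-allocation objective and interference constraints from the paper are not reproduced, and all parameter values (population size, F, CR, generations) are illustrative assumptions.

```python
import random

def differential_evolution(fitness, bounds, pop_size=20, F=0.7, CR=0.9,
                           gens=200, seed=1):
    """Minimal DE/rand/1/bin: mutate with a scaled difference of two
    random members, crossover per dimension, keep the trial if it is no
    worse.  Bounds are enforced by clamping."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [fitness(ind) for ind in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            jr = rng.randrange(dim)           # at least one mutated dimension
            trial = [
                min(max(pop[a][j] + F * (pop[b][j] - pop[c][j]),
                        bounds[j][0]), bounds[j][1])
                if (rng.random() < CR or j == jr) else pop[i][j]
                for j in range(dim)
            ]
            tc = fitness(trial)
            if tc <= cost[i]:
                pop[i], cost[i] = trial, tc
    best = min(range(pop_size), key=lambda i: cost[i])
    return pop[best], cost[best]

# Sanity check on the 3-D sphere function.
sol, val = differential_evolution(lambda v: sum(x * x for x in v), [(-5, 5)] * 3)
```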
IEEE Transactions on Circuits and Systems for Video Technology, Aug 1, 2017
An approximate kernel for the discrete cosine transform (DCT) of length 4 is derived from the 4-point DCT defined by the High Efficiency Video Coding (HEVC) standard and is used for the computation of the DCT and inverse DCT (IDCT) of power-of-2 lengths. There are two reasons to consider the DCT of length 4 as the basic module. First, it allows computing the DCTs of lengths 4, 8, 16, and 32 prescribed by HEVC. Moreover, the DCTs generated from the 4-point DCT not only involve lower complexity but also offer better compression performance. Full-parallel and area-constrained architectures for the proposed approximate DCT are presented to provide a flexible trade-off between area and time complexity. Also, a reconfigurable architecture is proposed where the 8-point DCT unit can be used for a pair of 4-point DCTs. Using the same reconfiguration scheme, the 32-point DCT could be configured for parallel computation of two 16-point DCTs, four 8-point DCTs, or eight 4-point DCTs. The proposed reconfigurable design can support real-time coding for high-definition video sequences in the 8K UHDTV format (7680 × 4320 @ 30 fps). A unified forward and inverse transform architecture is also proposed, where the hardware complexity is reduced by sharing hardware between the DCT and IDCT computations. The proposed approximation has nearly the same arithmetic complexity and hardware requirement as those of recently proposed related methods, but involves significantly less error energy and offers better PSNR than the others when DCTs of larger lengths are used. Detailed comparisons of the complexity, energy efficiency, and compression performance of different DCT approximation schemes for video coding are also presented. It is shown that the proposed approximation provides better compressed-image quality than the other approximate DCTs. The proposed method can perform HEVC-compliant video coding with marginal degradation of video quality and a slight increase in bit-rate, at a fraction of the computational complexity of the exact transform.
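The 4-point basic module can be made concrete. The first matrix below is the exact 4-point integer DCT kernel defined by the HEVC standard; the second is a hypothetical shift-only approximation (entries restricted to powers of two, so every multiplication becomes a shift), shown only to illustrate the multiplier-free idea. It is not the kernel derived in the paper.

```python
# Exact 4-point integer DCT kernel from the HEVC standard.
HEVC_DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

# Hypothetical shift-only approximation (64 = 1 << 6, 32 = 1 << 5):
# NOT the paper's kernel, just an illustration of the low-complexity idea.
APPROX_DCT4 = [
    [64,  64,  64,  64],
    [64,  32, -32, -64],
    [64, -64, -64,  64],
    [32, -64,  64, -32],
]

def transform4(kernel, x):
    """Apply a 4-point transform kernel to a length-4 sample vector."""
    return [sum(kernel[i][j] * x[j] for j in range(4)) for i in range(4)]
```

Both kernels map a constant input entirely into the DC coefficient, which is the property video-coding approximations must preserve exactly; the differences appear only in the odd-frequency rows.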
We are very pleased to welcome you to the second annual edition of the International Symposium on Electronics System Design (ISED 2011) in the port city of Kochi, India. This builds on the successful debut of ISED last year and we are confident you will find this year's program compelling. The goal of ISED is to provide a venue for researchers, educators, students, and industry personnel worldwide to present, learn about, explore, and collaborate in the latest developments in electronic system design research and education. The ISED 2011 program is spread over three days during 19-21 December 2011, with the main conference research tracks and talks scheduled during the first two days, followed by the Workshop for Engineering Scholars (WES) on the last day. This year, ISED has a co-located event immediately following it on December 22-23, 2011: the International Workshop on Embedded Computing & Communication Systems (IWECC) at Rajagiri School of Engineering & Technology in Kochi.
Distributed arithmetic (DA) based architectures are popularly used for inner-product computation in various applications. The existing literature shows that the use of approximate DA architectures in error-resilient applications provides a significant improvement in the overall efficiency of the system. Based on a precise error analysis, we find that the existing methods introduce a large truncation error in the computation of the final inner-product. Therefore, to obtain a suitable trade-off between the overall hardware complexity and the truncation error, a weight-dependent truncation approach is proposed in this paper. The overall efficiency of the structure is further enhanced by incorporating an input truncation strategy into the proposed method. It is observed that the area, time, and energy efficiency of the proposed designs are superior to those of the existing designs, with significantly lower truncation error. An evaluation on a noisy-image smoothing application is also presented.
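The exact DA computation that such truncation schemes start from is easy to model: all partial sums of the coefficients are precomputed into a look-up table, and the inner product is accumulated bit-serially over the bit planes of the inputs. The sketch below is the untruncated baseline with unsigned inputs for simplicity (real DA designs handle two's-complement inputs); the function name and word length are assumptions.

```python
def da_inner_product(inputs, coeffs, bits=8):
    """Exact distributed-arithmetic inner product: a LUT holds every
    partial sum of the coefficients, and one LUT entry per input bit
    plane is shift-accumulated.  Inputs are unsigned `bits`-bit values."""
    n = len(coeffs)
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << n)]
    acc = 0
    for b in range(bits):
        # Address = the b-th bit of every input, packed into n bits.
        addr = sum(((x >> b) & 1) << k for k, x in enumerate(inputs))
        acc += lut[addr] << b
    return acc
```

Truncation-based approximations drop the low-order shift-accumulate terms (or low-order input bits); the weight-dependent idea is that how much can safely be dropped depends on the coefficient magnitudes in the LUT.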
IEEE Transactions on Very Large Scale Integration Systems, Apr 1, 2016
Retiming of digital circuits is conventionally based on estimates of the propagation delays across different paths in the data-flow graph (DFG) obtained by the discrete component timing model, which implicitly assumes that the operation of a node can begin only after the completion of the operation(s) of its preceding node(s), to obey the data-dependence requirement. Such a discrete component timing model very often gives much higher estimates of the propagation delays than the actual values, particularly when the computations in the DFG nodes correspond to fixed-point arithmetic operations like additions and multiplications. On the other hand, it is often imperative to deal with DFGs of such high granularity at the architecture-level abstraction of digital system design when mapping an algorithm to the desired architecture, where the overestimation of propagation delay leads to unwanted pipelining and an undesirable increase in pipeline overheads. In this paper, we propose the connected component timing model to easily obtain adequately precise estimates of the propagation delays across different combinational paths in a DFG, for efficient cutset retiming that reduces the critical path substantially without a significant increase in register complexity and latency. In addition, we propose novel node-splitting and node-merging techniques that can be used in combination with existing retiming methods to reduce the critical path to a fraction of that of the original DFG with a small increase in overall register complexity.
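The quantity being estimated can be sketched directly: under the discrete component timing model, the critical path of a DFG is the longest register-free chain of node delays, and retiming moves registers to shorten that chain. The toy below computes this conservative estimate (the one the connected component model tightens); the graph encoding and delay values are illustrative assumptions.

```python
from functools import lru_cache

def critical_path(dfg, node_delay):
    """Discrete-model critical path: the longest combinational
    (register-free) chain of node delays.  `dfg` maps each node to a
    list of (successor, registers_on_edge) pairs."""
    nodes = set(dfg) | {v for succs in dfg.values() for v, _ in succs}

    @lru_cache(maxsize=None)
    def longest_from(u):
        tail = max((longest_from(v) for v, regs in dfg.get(u, ())
                    if regs == 0), default=0)
        return node_delay[u] + tail

    return max(longest_from(u) for u in nodes)

# Toy DFG: two multipliers (delay 4) feed an adder chain (delay 2 each),
# with one pipeline register between the two adders.
delays = {'m1': 4, 'm2': 4, 'a1': 2, 'a2': 2}
dfg = {'m1': [('a1', 0)], 'm2': [('a1', 0)], 'a1': [('a2', 1)], 'a2': []}
```

Moving the register from the adder-to-adder edge onto the multiplier outputs shortens the longest register-free chain from multiplier-plus-adder to a single multiplier, which is the effect cutset retiming formalizes.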
In this paper, a quadral-duty digital pulse width modulation (QDPWM) technique-based low-cost hardware architecture for a brushless DC (BLDC) motor drive is proposed. The proposed architecture incorporates efficient speed-calculation and commutation circuitry to achieve compactness of the overall architecture. The speed-calculation circuit is designed to perform edge detection of the rotor position signal with resistance to external noise and glitches. The proposed architecture is implemented on field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) platforms using the TSMC 180 nm technology library. The FPGA implementation is compared with existing architectures to validate the resource utilization of the proposed architecture. The ASIC implementation shows that the proposed architecture, operating at 50 MHz, reduces the gate count and power dissipation to approximately half and one-third, respectively, compared to a standard PI-controller-based PWM control architecture. Experimental validation of the FPGA-based architecture is also performed using a laboratory prototype of the BLDC motor drive hardware setup. The performance of the drive is examined for various speed commands and loading conditions. Extensive experimental analysis has been carried out to validate the performance of the proposed architecture-based drive under dynamic load and speed-command variation. The ability of the proposed circuit to tolerate noise in the Hall position sensor signals is verified by adding intentional glitches to the signal.
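The glitch-tolerant edge detection can be modelled with a simple persistence filter: a level change on the sampled Hall line is accepted as an edge only after it has held for a fixed number of consecutive samples, so isolated glitches are ignored. This is a generic software model of the idea, not the proposed circuit; the window length and signal encoding are assumptions.

```python
def debounced_edges(samples, window=3):
    """Persistence-filtered edge detector for a sampled binary line: the
    state changes (and an edge is reported) only after the new level has
    held for `window` consecutive samples, rejecting short glitches."""
    state = samples[0]
    run = 0
    edges = []                        # (sample_index, 'rise' or 'fall')
    for i, s in enumerate(samples):
        if s != state:
            run += 1
            if run >= window:
                edges.append((i, 'rise' if s else 'fall'))
                state, run = s, 0
        else:
            run = 0                   # level returned: it was a glitch
    return edges
```

On a trace containing a one-sample glitch, only the sustained transitions are reported, which is the behaviour needed for reliable speed calculation from noisy Hall sensors.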
Systolic designs are considered suitable candidates for high-speed VLSI realization for their inherent advantages of simplicity, regularity, modularity, and local interconnections. During the past few decades, several systolic designs of finite field multipliers have been proposed in the literature. They are popularly used to achieve very high throughput rates without any centralized control. However, all these designs incur heavy systolic penalties in terms of register complexity and latency of computation. We analyze here the hidden systolic penalties in those multipliers and propose a digit-level systolic-like structure and a super-systolic-like structure for finite field multiplication. We show that the key issues in obtaining such designs are the choice of design layout and digit size, which substantially affect the register complexity, critical path, and latency. We determine the optimal digit size and design layout to reduce the systolic penalties and, at the same time, achieve a shorter critical path, higher throughput rate, and lower latency with less register complexity and lower overall area complexity.
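The operation these multipliers compute can be stated in a few lines: bit-serial multiplication in GF(2^m) with interleaved modular reduction, where each loop iteration corresponds to one cycle (or one digit, in digit-serial designs) of the hardware. The sketch below uses GF(2^8) with the AES reduction polynomial as a concrete example field; this is a functional reference, not a systolic architecture.

```python
def gf2m_mul(a, b, poly=0x11B, m=8):
    """Bit-serial GF(2^m) multiplication: shift-and-add (XOR) with modular
    reduction whenever the running operand reaches degree m.  Default
    field: GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a                # add current multiple of a
        b >>= 1
        a <<= 1
        if a >> m:                # degree reached m: reduce mod poly
            a ^= poly
    return r
```

A digit-serial design unrolls d of these iterations per cycle; the digit size d is exactly the parameter whose choice is shown above to govern the trade-off between register complexity, critical path, and latency.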
Papers by Pramod Meher