Papers by Alberto Nannarelli
Low-power radix-4 combined division and square root
Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040), 1999
Because of the similarities in the algorithm it is quite common to implement division and square root in the same unit. The purpose of this work is to implement a low-power combined radix-4 division and square root floating-point double precision unit and to compare its performance and energy consumption with a radix-4 division only unit. Previous work has been
Tradeoffs between residue number system and traditional FIR filters
ISCAS 2001. The 2001 IEEE International Symposium on Circuits and Systems (Cat. No.01CH37196), 2001
In this work, a study on the implementation of FIR filters in the Residue Number System (RNS) is carried out. For different configurations, RNS filters are compared with filters realized in the traditional two's complement system (TCS) in terms of delay, area and power ...
Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems (Cat.No.CH37144), 2000
The aim of this work is to reduce the power dissipated in high order Finite Impulse Response (FIR) filters, while maintaining the delay unchanged. We compare in terms of performance, area, and power dissipation the implementation of a traditional FIR filter with a Residue Number System (RNS) based one. The resulting implementations, designed to work at the same clock rate, show that the RNS filter is smaller and consumes less power than the traditional one for a number of taps larger than eight.
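As a rough illustration of the RNS idea behind these filter papers, the sketch below computes an FIR filter with one small, independent modular channel per modulus and reconstructs each output with the Chinese Remainder Theorem. The moduli set, the function names, and the sample data are assumptions chosen for the example and are not taken from the papers; the sketch models arithmetic only, not delay, area, or power.

```python
from math import prod

# Hypothetical pairwise-coprime moduli (not from the papers); dynamic range M = 15015.
MODULI = (7, 11, 13, 15)


def from_rns(residues, moduli=MODULI):
    """Rebuild a signed integer from its residues using the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(x, -1, m) is the modular inverse
    value = total % M
    # Map the upper half of [0, M) to negative values.
    return value - M if value > M // 2 else value


def fir_rns(samples, taps, moduli=MODULI):
    """FIR filter where every multiply-accumulate is done modulo each modulus."""
    outputs = []
    for n in range(len(samples)):
        acc = [0] * len(moduli)
        for k, h in enumerate(taps):
            if n - k < 0:
                break
            x = samples[n - k]
            for i, m in enumerate(moduli):
                # Each channel only ever sees values smaller than its modulus.
                acc[i] = (acc[i] + (h % m) * (x % m)) % m
        outputs.append(from_rns(acc, moduli))
    return outputs


def fir_direct(samples, taps):
    """Reference FIR filter in ordinary integer arithmetic."""
    return [
        sum(h * samples[n - k] for k, h in enumerate(taps) if n - k >= 0)
        for n in range(len(samples))
    ]
```

The two filters agree as long as every output fits in the signed dynamic range of the moduli set (here roughly plus or minus 7507); a real design would choose the moduli to cover the required word length.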

Fast radix-4 retimed division with selection by comparisons
Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors, 2002
Since a large portion of the critical path in an implementation of radix-4 division corresponds to the delay of the quotient-digit selection module, it is of interest to reduce this delay. The proposal of this paper extends the approach presented recently of prestoring the selection constants corresponding to the actual value of the divisor and to perform the determination of the quotient digit by carry-free subtraction and sign detection. This extension consists in advancing the subtraction so that it is outside of the critical path. This advancement also provides the possibility of placing the registers so as to minimize the cycle time. We present the method and report results of synthesis using a family of standard cells. We conclude that the extension results in a speedup of 1.35 with respect to the basic implementation and of 1.3 with respect to the previously mentioned approach. We estimate that the areas of all three units are about the same.
17th IEEE Symposium on Computer Arithmetic (ARITH'05), 2005
Code compression architecture for cache energy minimisation in embedded systems
IEE Proceedings - Computers and Digital Techniques, 2002
Proceedings of 1996 International Symposium on Low Power Electronics and Design, 1996
Low-power radix-8 divider
Proceedings International Conference on Computer Design. VLSI in Computers and Processors (Cat. No.98CB36273), 1998
This work describes the design of a double-precision radix-8 divider. Low-power techniques are applied in the design of the unit, and energy-delay tradeoffs considered. The energy dissipation in the divider can be reduced by up to 70% with respect to a standard implementation not optimized for energy, without penalizing the latency. The radix-8 divider is compared with the one obtained
Cached-code compression for energy minimization in embedded processors
Proceedings of the 2001 international symposium on Low power electronics and design - ISLPED '01, 2001
This paper contributes a novel approach for reducing static code size and instruction fetch energy for cache-based core processors running embedded applications. Our implementation of the decompression unit guarantees fast and low-energy, on-the-fly instruction decompression at each cache lookup. The decompressor is placed outside the core boundaries; therefore, processor architecture does not need any modification, making the proposed compression approach
Temperature aware power optimization for multicore floating-point units
2010 Conference Record of the Forty Fourth Asilomar Conference on Signals, Systems and Computers, 2010
Power and Thermal Efficient Numerical Processing
Handbook on Data Centers, 2015
Division Unit for Binary Integer Decimals
2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009

IEEE Transactions on Computers, 2007
In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm. The previous decimal division designs do not include recent developments in the theory and practice of this type of algorithm, which were developed for radix-2^k dividers. In addition to the adaptation of these features, the radix-10 quotient digit is decomposed into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required in the recurrence. Moreover, the most significant slice of the recurrence, which includes the selection function, is implemented in radix-2, avoiding the additional delay introduced by the radix-10 carry-save additions and allowing the balancing of the paths to reduce the cycle delay. The results of the implementation of the proposed radix-10 division unit show that its latency is close to that of radix-16 division units (comparable dynamic range of significands) and it has a shorter latency than a radix-10 unit based on the Newton-Raphson approximation.
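The digit decomposition mentioned in this abstract can be illustrated with a small sketch. It assumes a signed quotient digit q in [-7, 7] written as q = 5*q_h + q_l with q_h in {-1, 0, 1} and q_l in {-2, ..., 2}; these ranges and the function names are assumptions made for the example, since the abstract only states that the digit is split into a radix-2 part and a radix-5 part and that only the multiples 5d and 2d are required. Plain Python integers stand in for the decimal datapath.

```python
def split_digit(q):
    """Split q in [-7, 7] into (q_h, q_l) such that q == 5 * q_h + q_l.

    Assumed ranges (not stated in the abstract): q_h in {-1, 0, 1}, q_l in {-2, ..., 2}.
    """
    if not -7 <= q <= 7:
        raise ValueError("quotient digit out of the assumed range [-7, 7]")
    if q > 2:
        q_h = 1
    elif q < -2:
        q_h = -1
    else:
        q_h = 0
    return q_h, q - 5 * q_h


def digit_times_divisor(q, d):
    """Form q * d by selecting among 0, d, 2d and 5d, with sign, instead of multiplying."""
    q_h, q_l = split_digit(q)
    five_d = 5 * d  # precomputed once per division in this sketch
    two_d = 2 * d   # precomputed once per division in this sketch

    high = {-1: -five_d, 0: 0, 1: five_d}[q_h]
    low_magnitude = {0: 0, 1: d, 2: two_d}[abs(q_l)]
    low = -low_magnitude if q_l < 0 else low_magnitude
    return high + low
```

For example, q = 7 splits into (1, 2), so 7d is obtained as 5d + 2d, and q = -4 splits into (-1, 1), giving -5d + d.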
A Radix-10 Combinational Multiplier
2006 Fortieth Asilomar Conference on Signals, Systems and Computers, 2006
... Fig. 5. m-digit radix-10 counter. The adder can be divided into three stages. ...
Low-power division: comparison among implementations of radix 4, 8 and 16
Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336), 1999
Although division is less frequent than addition and multiplication, because of its longer latency it dissipates a substantial part of the energy in floating-point units. In this paper we explore the relation between the radix and the energy dissipated. Previous work has been done on radix-4 and radix-8 division. Here we extend this study to a radix-
A variant of a radix-10 combinational multiplier
2008 IEEE International Symposium on Circuits and Systems, 2008
We consider the problem of adding the partial products in the combinational decimal multiplier presented by Lang and Nannarelli. In the original paper this addition is done with a tree of decimal carry-save adders. In this paper, we treat the problem using the multi-operand decimal addition previously published by Dadda, where the sum of each column of the partial product
A Hybrid RNS Adaptive Filter for Channel Equalization
2006 Fortieth Asilomar Conference on Signals, Systems and Computers, 2006
In this work a hybrid residue number system (RNS) implementation of an adaptive FIR filter is presented. The adaptation algorithm used is least mean squares (LMS). The filter has been designed to meet the constraints of a specific class of applications. In fact, it is suitable for applications requiring a large number of taps where a serial updating of the

Combined Radix-10 and Radix-16 Division Unit
2007 Conference Record of the Forty-First Asilomar Conference on Signals, Systems and Computers, 2007
In this work we extend a previously proposed digit-recurrence radix-10 division unit so that it can also perform radix-16 division. The extension is simplified by the fact that in the radix-10 implementation the quotient digit is decomposed into two parts and that this decomposition is also appropriate for the radix-16 case. Moreover, to reduce the latency in radix-10, the most-significant portion of the datapath, including the selection function, has been implemented in radix-2, so that the modifications of that part to include radix-16 consist mainly in combining the two modules to obtain the selection constants. The rest of the modifications relate to the generation of multiples, to the carry-save adder, to the carry-propagate adder, and to the on-the-fly conversion and rounding. The implementation results show that the delay of an iteration is similar to that of the radix-10 case and that the area is about thirty percent larger.
Conference Record of the Thirty-Fourth Asilomar Conference on Signals, Systems and Computers (Cat. No.00CH37154), 2000
The aim of this work is to compare in terms of performance, area and power dissipation, a complex FIR filter realized in the traditional two's complement system with a Quadratic Residue Number System (QRNS) based one. The resulting implementations, designed to work at the same clock rate, show that the QRNS filter is almost half the size of the traditional one, and dissipates about one third of the energy.
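One reason QRNS is attractive for complex filters is that a complex multiplication reduces to two independent modular multiplications. The sketch below shows this for a single modulus; the prime P = 13, the constant J = 5 (with J*J congruent to -1 modulo P), and the function names are illustrative assumptions, not values from the paper. A practical QRNS design would use several such moduli to obtain a usable dynamic range.

```python
P = 13  # hypothetical prime modulus with P % 4 == 1, so -1 has a square root mod P
J = 5   # J * J % P == P - 1, i.e. J squared is congruent to -1 (mod P)


def to_qrns(a, b, p=P, j=J):
    """Map the complex residue a + i*b to the pair (a + J*b, a - J*b) mod p."""
    return ((a + j * b) % p, (a - j * b) % p)


def qrns_mul(u, v, p=P):
    """Complex multiplication in QRNS: two independent modular multiplications."""
    return ((u[0] * v[0]) % p, (u[1] * v[1]) % p)


def from_qrns(z, p=P, j=J):
    """Recover (a, b) mod p from the QRNS pair."""
    inv_2 = pow(2, -1, p)       # modular inverse of 2
    inv_2j = pow(2 * j, -1, p)  # modular inverse of 2*J
    a = ((z[0] + z[1]) * inv_2) % p
    b = ((z[0] - z[1]) * inv_2j) % p
    return a, b
```

For instance, (2 + 3i)(1 + 4i) = -10 + 11i, which is (3, 11) modulo 13, and `from_qrns(qrns_mul(to_qrns(2, 3), to_qrns(1, 4)))` yields the same pair. A conventional complex multiplier needs four real multiplications (or three with extra additions), which is the kind of saving a QRNS filter exploits.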
2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No.04CH37512), 2004
The aim of this work is the reduction of the power dissipated in digital filters, while maintaining the timing unchanged. A polyphase filter bank in the Quadratic Residue Number System (QRNS) has been implemented and then compared, in terms of performance, area, and power dissipation to the implementation of a polyphase filter bank in the traditional two's complement system (TCS). The resulting implementations, designed to have the same clock rates, show that the QRNS filter is smaller and consumes less power than the TCS one.