Skip to main content

M. Hassan Najafi

University of Minnesota - Twin Cities, Electrical and computer engineering, Graduate Student

Followers

47

Following

20

Co-authors

2

Public Views

University of California, San Diego

Elaheh Sadredini

Dmitri Rachkovskij

International Research and Training Center for Information Technologies and Systems of Ukraine and Ministry of Education of Ukraine

The University of Manchester

Independent Researcher

Sharadhi Ramakrishna

University of California, San Diego

Massachusetts Institute of Technology (MIT)

Interests

Uploads

Papers by M. Hassan Najafi

EventHD: Robust and efficient hyperdimensional learning with neuromorphic sensor

Frontiers in Neuroscience

Brain-inspired computing models have shown great potential to outperform today's deep learnin... more Brain-inspired computing models have shown great potential to outperform today's deep learning solutions in terms of robustness and energy efficiency. Particularly, Hyper-Dimensional Computing (HDC) has shown promising results in enabling efficient and robust cognitive learning. In this study, we exploit HDC as an alternative computational model that mimics important brain functionalities toward high-efficiency and noise-tolerant neuromorphic computing. We present EventHD, an end-to-end learning framework based on HDC for robust, efficient learning from neuromorphic sensors. We first introduce a spatial and temporal encoding scheme to map event-based neuromorphic data into high-dimensional space. Then, we leverage HDC mathematics to support learning and cognitive tasks over encoded data, such as information association and memorization. EventHD also provides a notion of confidence for each prediction, thus enabling self-learning from unlabeled data. We evaluate EventHD efficienc...

Energy-Efficient Near-Sensor Convolution using Pulsed Unary Processing

2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Near-sensor convolution engines have many applications in Internet-of-Things. Pulsed unary proces... more Near-sensor convolution engines have many applications in Internet-of-Things. Pulsed unary processing has been recently proposed for high-performance and energy-efficient processing of data using simple digital logic. In this work, we propose a low-cost, high-performance, and energy-efficient near-sensor convolution engine based on pulsed unary processing.

Stochastic Computing in Beyond Von-Neumann Era: Processing Bit-Streams in Memristive Memory

IEEE Transactions on Circuits and Systems II: Express Briefs

Robust In-Memory Computing with Hyperdimensional Stochastic Representation

2021 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), 2021

Brain-inspired HyperDimensional Computing (HDC) is an alternative computation model working based... more Brain-inspired HyperDimensional Computing (HDC) is an alternative computation model working based on the observation that the human brain operates on highdimensional representations of data. Existing HDC solutions rely on expensive pre-processing algorithms for feature extraction. In this paper, we propose StocHD, a novel end-to-end hyperdimensional system that supports accurate, efficient, and robust learning over raw data. StocHD expands HDC functionality to the computing area by mathematically defining stochastic arithmetic over HDC hypervectors. StocHD enables an entire learning application (including feature extractor) to process using HDC data representation, enabling uniform, efficient, robust, and highly parallel computation. We also propose a novel fully digital and scalable Processing In-Memory (PIM) architecture that exploits the HDC memory-centric nature to support extensively parallel computation.

Exact In-Memory Multiplication Based on Deterministic Stochastic Computing

2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020

Memristors offer the ability to both store and process data in memory, eliminating the overhead o... more Memristors offer the ability to both store and process data in memory, eliminating the overhead of data transfer between memory and processing unit. For data-intensive applications, developing efficient in-memory computing methods is under investigation. Stochastic computing (SC), a paradigm offering simple execution of complex operations, has been used for reliable and efficient multiplication of data in-memory. Current SC-based in-memory methods are incapable of producing accurate results. This work, to the best of our knowledge, develops the first accurate SC-based in-memory multiplier. For logical operations, we use Memristor-Aided Logic (MAGIC), and to generate bit-streams, we propose a novel method, which takes advantage of the intrinsic properties of memristors. The proposed design improves the speed and reduces the memory usage and energy consumption compared to the State-of-the-Art (SoA) accurate in-memory fixed-point and off-memory SC multipliers.

SkippyNN: An Embedded Stochastic-Computing Accelerator for Convolutional Neural Networks

2019 56th ACM/IEEE Design Automation Conference (DAC), 2019

Employing convolutional neural networks (CNNs) in embedded devices seeks novel low-cost and energ... more Employing convolutional neural networks (CNNs) in embedded devices seeks novel low-cost and energy efficient CNN accelerators. Stochastic computing (SC) is a promising low-cost alternative to conventional binary implementations of CNNs. Despite the low-cost advantage, SC-based arithmetic units suffer from prohibitive execution time due to processing long bit-streams. In particular, multiplication as the main operation in convolution computation, is an extremely time-consuming operation which hampers employing SC methods in designing embedded CNNs.In this work, we propose a novel architecture, called SkippyNN, that reduces the computation time of SC-based multiplications in the convolutional layers of CNNs. Each convolution in a CNN is composed of numerous multiplications where each input value is multiplied by a weight vector. Producing the result of the first multiplication, the following multiplications can be performed by multiplying the input and the differences of the successiv...

Stochastic-binary convolutional neural networks with deterministic bit-streams

Hardware Architectures for Deep Learning, 2020

In this chapter, we proposed a low-cost and energy -efficient design for hardware implementation ... more In this chapter, we proposed a low-cost and energy -efficient design for hardware implementation of CNNs. LD deterministic bit -streams and simple standard AND gates are used to perform fast and accurate multiplication operations in the first layer of the NN. Compared to prior random bit -stream -based designs, the proposed design achieves a lower misclassification rate for the same processing time. Evaluating LeNet5 NN with MINIST dataset as the input, the proposed design achieved the same classification rate as the conventional fixed-point binary design with 70% saving in the energy consumption of the first convolutional layer. If accepting slight inaccuracies, higher energy savings are also feasible by processing shorter bit -streams.

Polysynchronous Clocking: Exploiting the Skew Tolerance of Stochastic Circuits

IEEE Transactions on Computers, 2017

In the paradigm of stochastic computing, arithmetic functions are computed on randomized bit stre... more In the paradigm of stochastic computing, arithmetic functions are computed on randomized bit streams. The method naturally and effectively tolerates very high clock skew. Exploiting this advantage, this paper introduces polysynchronous clocking, a design strategy in which clock domains are split at a very fine level. Each domain is synchronized by an inexpensive local clock. Alternatively, the skew requirements for a global clock distribution network can be relaxed. This allows for a higher working frequency and so lower latency. The benefits of both approaches are quantified. Polysynchronous clocking results in significant latency, area, and energy savings for wide variety of applications.

Fuzzy-Logic using Unary Bit-Stream Processing

2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020

There is a growing attention to the theory of fuzzy-logic and its applications. Efficient hardwar... more There is a growing attention to the theory of fuzzy-logic and its applications. Efficient hardware design of the fuzzy-inference engine has become necessary for high-performance applications. Considering the facts that fuzzy-logic variables have truth values in the [0, 1] interval and fuzzy controllers include minimum and maximum operations, this work proposes to apply the concept of unary processing to the platform of fuzzy-logic. In unary processing, data in the [0, 1] interval is encoded as bitstream with the value defined by the frequency of 1s. Operations such as minimum and maximum functions can be implemented using simple logic gates. Latency, however, has been an important issue in the unary designs. To mitigate the latency, the proposed design processes right-aligned bit-streams. A one-hot decoder is used for fast detection of the bit-stream with maximum value. Implementing a fuzzy-inference engine with 81 fuzzy-inference rules, the proposed architecture provides 82%, 46%, and 67% saving in the hardware area, power and energy consumption, respectively, and 94% reduction in the number of used LUTs compared to conventional binary implementation.

Sorting in Memristive Memory

ACM Journal on Emerging Technologies in Computing Systems

Sorting data is needed in many application domains. Traditionally, the data is read from memory a... more Sorting data is needed in many application domains. Traditionally, the data is read from memory and sent to a general-purpose processor or application-specific hardware for sorting. The sorted data is then written back to the memory. Reading/writing data from/to memory and transferring data between memory and processing unit incur significant latency and energy overhead. In this work, we develop the first architectures for in-memory sorting of data to the best of our knowledge. We propose two architectures. The first architecture is applicable to the conventional format of representing data, i.e., weighted binary radix. The second architecture is proposed for developing unary processing systems, where data is encoded as uniform unary bit-streams. As we present, each of the two architectures has different advantages and disadvantages, making one or the other more suitable for a specific application. However, the common property of both is a significant reduction in the processing tim...

High-Accuracy Multiply-Accumulate (MAC) Technique for Unary Stochastic Computing

IEEE Transactions on Computers

Multiply-accumulate (MAC) operations are common in data processing and machine learning but costl... more Multiply-accumulate (MAC) operations are common in data processing and machine learning but costly in terms of hardware usage. Stochastic Computing (SC) is a promising approach for low-cost hardware design of complex arithmetic operations such as multiplication. Computing with deterministic unary bit-streams (defined as bit-streams with all 1s grouped together at the beginning or end of a bit-stream) has been recently suggested to improve the accuracy of SC. Conventionally, SC designs use multiplexer (MUX) units or OR gates to accumulate data in the stochastic domain. MUX-based addition suffers from scaling of data and OR-based addition from inaccuracy. This work proposes a novel technique for MAC operation on unary bit-streams that allows exact, non-scaled addition of multiplication results. By introducing a relative delay between the products, we control correlation between bit-streams and eliminate OR-based addition error. We evaluate the accuracy of the proposed technique compared to the state-of-the-art MAC designs. After quantization, the proposed technique demonstrates at least 37% and up to 100% decrease of the mean absolute error for uniformly distributed random input values, compared to traditional OR-based MAC designs. Further, we demonstrate that the proposed technique is practical and evaluate area, power and energy of three possible implementations.

Using Residue Number Systems to Accelerate Deterministic Bit-stream Multiplication

2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2019

Inaccuracy of computations is an important challenge with Stochastic Computing (SC). Deterministi... more Inaccuracy of computations is an important challenge with Stochastic Computing (SC). Deterministic approaches are proposed to produce completely accurate results with SC circuits. Current deterministic methods need a large number of clock cycles to produce exact result. This directly translates to a very high energy consumption. We propose a method based on the Residue Number Systems (RNS) to mitigate the high processing time of the deterministic methods. Compared to the state-of-the-art deterministic methods of SC, our approach delivers 760x and 170x improvement in terms of processing time and energy consumption.

TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training

2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020

Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to be able to interact ... more Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to be able to interact with the realworld environment. This interaction comes with the ability to retrain DNNs, since environmental conditions change continuously in time. Stochastic Gradient Descent (SGD) is a widely used algorithm to train DNNs by optimizing the parameters over the training data iteratively. In this work, first we present a novel approach to add the training ability to a baseline DNN accelerator (inference only) by splitting the SGD algorithm into simple computational elements. Then, based on this heuristic approach we propose TaxoNN, a lightweight accelerator for DNN training. TaxoNN can easily tune the DNN weights by reusing the hardware resources used in the inference process using a time-multiplexing approach and low-bitwidth units. Our experimental results show that TaxoNN delivers, on average, 0.97% higher misclassification rate compared to a full-precision implementation. Moreover, TaxoNN provides 2.1× power saving and 1.65× area reduction over the state-of-the-art DNN training accelerator.

Using Residue Number Systems to Accelerate Deterministic Bit-stream Multiplication

2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2019

Inaccuracy of computations is an important challenge with Stochastic Computing (SC). Deterministi... more Inaccuracy of computations is an important challenge with Stochastic Computing (SC). Deterministic approaches are proposed to produce completely accurate results with SC circuits. Current deterministic methods need a large number of clock cycles to produce exact result. This directly translates to a very high energy consumption. We propose a method based on the Residue Number Systems (RNS) to mitigate the high processing time of the deterministic methods. Compared to the state-of-the-art deterministic methods of SC, our approach delivers 760x and 170x improvement in terms of processing time and energy consumption.

High-Performance Deterministic Stochastic Computing Using Residue Number System

IEEE Design & Test, 2021

This article discusses how to reduce the latency of stochastic computations. The authors represen... more This article discusses how to reduce the latency of stochastic computations. The authors represent an integer number as a set of remainders with respect to a set of relatively prime moduli. Operations such as multiplication, implemented using a deterministic version of stochastic computing, work directly on the remainders, thus yielding a partitioning of the original computation and a significant decrease in the number of clock cycles required.-Vincent T. Lee, Facebook Reality Labs Research  StochaStic computing (Sc) [1], [2], an unconventional computing paradigm processing bit streams, has recently gained considerable attention in the hardware design and computer architecture communities. Higher tolerance to noise and lower hardware cost compared to conventional binary-radix-based designs are the most appealing advantages of SC. The inaccuracy of computation and long processing time, however, are the major weaknesses of SC. The inaccuracy is mainly because of random fluctuations in generating bit streams and correlation (independence) between bit streams [1]. The common method to improve accuracy

TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training

2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020

Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to be able to interact ... more Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to be able to interact with the realworld environment. This interaction comes with the ability to retrain DNNs, since environmental conditions change continuously in time. Stochastic Gradient Descent (SGD) is a widely used algorithm to train DNNs by optimizing the parameters over the training data iteratively. In this work, first we present a novel approach to add the training ability to a baseline DNN accelerator (inference only) by splitting the SGD algorithm into simple computational elements. Then, based on this heuristic approach we propose TaxoNN, a lightweight accelerator for DNN training. TaxoNN can easily tune the DNN weights by reusing the hardware resources used in the inference process using a time-multiplexing approach and low-bitwidth units. Our experimental results show that TaxoNN delivers, on average, 0.97% higher misclassification rate compared to a full-precision implementation. Moreover, TaxoNN provides 2.1× power saving and 1.65× area reduction over the state-of-the-art DNN training accelerator.

Exact Stochastic Computing Multiplication in Memristive Memory

IEEE Design & Test, 2021

High-Performance Deterministic Stochastic Computing Using Residue Number System

IEEE Design & Test, 2021

This article discusses how to reduce the latency of stochastic computations. The authors represen... more This article discusses how to reduce the latency of stochastic computations. The authors represent an integer number as a set of remainders with respect to a set of relatively prime moduli. Operations such as multiplication, implemented using a deterministic version of stochastic computing, work directly on the remainders, thus yielding a partitioning of the original computation and a significant decrease in the number of clock cycles required.-Vincent T. Lee, Facebook Reality Labs Research  StochaStic computing (Sc) [1], [2], an unconventional computing paradigm processing bit streams, has recently gained considerable attention in the hardware design and computer architecture communities. Higher tolerance to noise and lower hardware cost compared to conventional binary-radix-based designs are the most appealing advantages of SC. The inaccuracy of computation and long processing time, however, are the major weaknesses of SC. The inaccuracy is mainly because of random fluctuations in generating bit streams and correlation (independence) between bit streams [1]. The common method to improve accuracy

Energy-Efficient Convolutional Neural Network based on Cellular Neural Network Using Beyond-CMOS Technologies

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, 2019

In this paper, we perform a uniform benchmarking for the convolution neural network (CoNN) based ... more In this paper, we perform a uniform benchmarking for the convolution neural network (CoNN) based on the cellular neural network (CeNN) using a variety of beyond-CMOS technologies. Representative charge-based and spintronic device technologies are implemented to enable energy-efficient CeNN related computations. To alleviate the delay and energy overheads of the fully-connect layer, a hybrid spintronic CeNN-based CoNN system is proposed. It is shown that low-power FETs and spintronic devices are promising candidates to implement energy efficient CoNNs based on CeNNs. Specifically, more than 10× improvement in energy-delay product (EDP) is demonstrated for the systems using spin diffusion based devices and tunneling FETs compared to their conventional CMOS counterparts.

Exploring the design space for area-efficient embedded VLIW packet processing engine

2013 21st Iranian Conference on Electrical Engineering (ICEE), 2013

Today area-efficiency is an important factor in designing embedded systems. This paper presents a... more Today area-efficiency is an important factor in designing embedded systems. This paper presents a design space exploration based on an embedded VLIW processor for finding out the optimum architecture for ever increasing demands of embedded packet-processing applications. In our exploration, we use the VEX toolchain for exploring the effects of memory hierarchy, different architectural configurations, and compiler optimizations on both performance and area. Exploration results will find out the best architecture and compiler optimizations for VLIW embedded packetprocessing engines to have area-efficiency as well as timeefficiency.

EventHD: Robust and efficient hyperdimensional learning with neuromorphic sensor

Frontiers in Neuroscience

Brain-inspired computing models have shown great potential to outperform today's deep learnin... more Brain-inspired computing models have shown great potential to outperform today's deep learning solutions in terms of robustness and energy efficiency. Particularly, Hyper-Dimensional Computing (HDC) has shown promising results in enabling efficient and robust cognitive learning. In this study, we exploit HDC as an alternative computational model that mimics important brain functionalities toward high-efficiency and noise-tolerant neuromorphic computing. We present EventHD, an end-to-end learning framework based on HDC for robust, efficient learning from neuromorphic sensors. We first introduce a spatial and temporal encoding scheme to map event-based neuromorphic data into high-dimensional space. Then, we leverage HDC mathematics to support learning and cognitive tasks over encoded data, such as information association and memorization. EventHD also provides a notion of confidence for each prediction, thus enabling self-learning from unlabeled data. We evaluate EventHD efficienc...

Energy-Efficient Near-Sensor Convolution using Pulsed Unary Processing

2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Near-sensor convolution engines have many applications in Internet-of-Things. Pulsed unary proces... more Near-sensor convolution engines have many applications in Internet-of-Things. Pulsed unary processing has been recently proposed for high-performance and energy-efficient processing of data using simple digital logic. In this work, we propose a low-cost, high-performance, and energy-efficient near-sensor convolution engine based on pulsed unary processing.

Stochastic Computing in Beyond Von-Neumann Era: Processing Bit-Streams in Memristive Memory

IEEE Transactions on Circuits and Systems II: Express Briefs

Robust In-Memory Computing with Hyperdimensional Stochastic Representation

2021 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), 2021

Brain-inspired HyperDimensional Computing (HDC) is an alternative computation model working based... more Brain-inspired HyperDimensional Computing (HDC) is an alternative computation model working based on the observation that the human brain operates on highdimensional representations of data. Existing HDC solutions rely on expensive pre-processing algorithms for feature extraction. In this paper, we propose StocHD, a novel end-to-end hyperdimensional system that supports accurate, efficient, and robust learning over raw data. StocHD expands HDC functionality to the computing area by mathematically defining stochastic arithmetic over HDC hypervectors. StocHD enables an entire learning application (including feature extractor) to process using HDC data representation, enabling uniform, efficient, robust, and highly parallel computation. We also propose a novel fully digital and scalable Processing In-Memory (PIM) architecture that exploits the HDC memory-centric nature to support extensively parallel computation.

Exact In-Memory Multiplication Based on Deterministic Stochastic Computing

2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020

Memristors offer the ability to both store and process data in memory, eliminating the overhead o... more Memristors offer the ability to both store and process data in memory, eliminating the overhead of data transfer between memory and processing unit. For data-intensive applications, developing efficient in-memory computing methods is under investigation. Stochastic computing (SC), a paradigm offering simple execution of complex operations, has been used for reliable and efficient multiplication of data in-memory. Current SC-based in-memory methods are incapable of producing accurate results. This work, to the best of our knowledge, develops the first accurate SC-based in-memory multiplier. For logical operations, we use Memristor-Aided Logic (MAGIC), and to generate bit-streams, we propose a novel method, which takes advantage of the intrinsic properties of memristors. The proposed design improves the speed and reduces the memory usage and energy consumption compared to the State-of-the-Art (SoA) accurate in-memory fixed-point and off-memory SC multipliers.

SkippyNN: An Embedded Stochastic-Computing Accelerator for Convolutional Neural Networks

2019 56th ACM/IEEE Design Automation Conference (DAC), 2019

Employing convolutional neural networks (CNNs) in embedded devices seeks novel low-cost and energ... more Employing convolutional neural networks (CNNs) in embedded devices seeks novel low-cost and energy efficient CNN accelerators. Stochastic computing (SC) is a promising low-cost alternative to conventional binary implementations of CNNs. Despite the low-cost advantage, SC-based arithmetic units suffer from prohibitive execution time due to processing long bit-streams. In particular, multiplication as the main operation in convolution computation, is an extremely time-consuming operation which hampers employing SC methods in designing embedded CNNs.In this work, we propose a novel architecture, called SkippyNN, that reduces the computation time of SC-based multiplications in the convolutional layers of CNNs. Each convolution in a CNN is composed of numerous multiplications where each input value is multiplied by a weight vector. Producing the result of the first multiplication, the following multiplications can be performed by multiplying the input and the differences of the successiv...

Stochastic-binary convolutional neural networks with deterministic bit-streams

Hardware Architectures for Deep Learning, 2020

In this chapter, we proposed a low-cost and energy -efficient design for hardware implementation ... more In this chapter, we proposed a low-cost and energy -efficient design for hardware implementation of CNNs. LD deterministic bit -streams and simple standard AND gates are used to perform fast and accurate multiplication operations in the first layer of the NN. Compared to prior random bit -stream -based designs, the proposed design achieves a lower misclassification rate for the same processing time. Evaluating LeNet5 NN with MINIST dataset as the input, the proposed design achieved the same classification rate as the conventional fixed-point binary design with 70% saving in the energy consumption of the first convolutional layer. If accepting slight inaccuracies, higher energy savings are also feasible by processing shorter bit -streams.

Polysynchronous Clocking: Exploiting the Skew Tolerance of Stochastic Circuits

IEEE Transactions on Computers, 2017

In the paradigm of stochastic computing, arithmetic functions are computed on randomized bit stre... more In the paradigm of stochastic computing, arithmetic functions are computed on randomized bit streams. The method naturally and effectively tolerates very high clock skew. Exploiting this advantage, this paper introduces polysynchronous clocking, a design strategy in which clock domains are split at a very fine level. Each domain is synchronized by an inexpensive local clock. Alternatively, the skew requirements for a global clock distribution network can be relaxed. This allows for a higher working frequency and so lower latency. The benefits of both approaches are quantified. Polysynchronous clocking results in significant latency, area, and energy savings for wide variety of applications.

Fuzzy-Logic using Unary Bit-Stream Processing

2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020

There is a growing attention to the theory of fuzzy-logic and its applications. Efficient hardwar... more There is a growing attention to the theory of fuzzy-logic and its applications. Efficient hardware design of the fuzzy-inference engine has become necessary for high-performance applications. Considering the facts that fuzzy-logic variables have truth values in the [0, 1] interval and fuzzy controllers include minimum and maximum operations, this work proposes to apply the concept of unary processing to the platform of fuzzy-logic. In unary processing, data in the [0, 1] interval is encoded as bitstream with the value defined by the frequency of 1s. Operations such as minimum and maximum functions can be implemented using simple logic gates. Latency, however, has been an important issue in the unary designs. To mitigate the latency, the proposed design processes right-aligned bit-streams. A one-hot decoder is used for fast detection of the bit-stream with maximum value. Implementing a fuzzy-inference engine with 81 fuzzy-inference rules, the proposed architecture provides 82%, 46%, and 67% saving in the hardware area, power and energy consumption, respectively, and 94% reduction in the number of used LUTs compared to conventional binary implementation.

Sorting in Memristive Memory

ACM Journal on Emerging Technologies in Computing Systems

Sorting data is needed in many application domains. Traditionally, the data is read from memory a... more Sorting data is needed in many application domains. Traditionally, the data is read from memory and sent to a general-purpose processor or application-specific hardware for sorting. The sorted data is then written back to the memory. Reading/writing data from/to memory and transferring data between memory and processing unit incur significant latency and energy overhead. In this work, we develop the first architectures for in-memory sorting of data to the best of our knowledge. We propose two architectures. The first architecture is applicable to the conventional format of representing data, i.e., weighted binary radix. The second architecture is proposed for developing unary processing systems, where data is encoded as uniform unary bit-streams. As we present, each of the two architectures has different advantages and disadvantages, making one or the other more suitable for a specific application. However, the common property of both is a significant reduction in the processing tim...

High-Accuracy Multiply-Accumulate (MAC) Technique for Unary Stochastic Computing

IEEE Transactions on Computers

Multiply-accumulate (MAC) operations are common in data processing and machine learning but costl... more Multiply-accumulate (MAC) operations are common in data processing and machine learning but costly in terms of hardware usage. Stochastic Computing (SC) is a promising approach for low-cost hardware design of complex arithmetic operations such as multiplication. Computing with deterministic unary bit-streams (defined as bit-streams with all 1s grouped together at the beginning or end of a bit-stream) has been recently suggested to improve the accuracy of SC. Conventionally, SC designs use multiplexer (MUX) units or OR gates to accumulate data in the stochastic domain. MUX-based addition suffers from scaling of data and OR-based addition from inaccuracy. This work proposes a novel technique for MAC operation on unary bit-streams that allows exact, non-scaled addition of multiplication results. By introducing a relative delay between the products, we control correlation between bit-streams and eliminate OR-based addition error. We evaluate the accuracy of the proposed technique compared to the state-of-the-art MAC designs. After quantization, the proposed technique demonstrates at least 37% and up to 100% decrease of the mean absolute error for uniformly distributed random input values, compared to traditional OR-based MAC designs. Further, we demonstrate that the proposed technique is practical and evaluate area, power and energy of three possible implementations.

Using Residue Number Systems to Accelerate Deterministic Bit-stream Multiplication

2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2019

Inaccuracy of computations is an important challenge with Stochastic Computing (SC). Deterministi... more Inaccuracy of computations is an important challenge with Stochastic Computing (SC). Deterministic approaches are proposed to produce completely accurate results with SC circuits. Current deterministic methods need a large number of clock cycles to produce exact result. This directly translates to a very high energy consumption. We propose a method based on the Residue Number Systems (RNS) to mitigate the high processing time of the deterministic methods. Compared to the state-of-the-art deterministic methods of SC, our approach delivers 760x and 170x improvement in terms of processing time and energy consumption.

TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training

2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020

Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to be able to interact ... more Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to be able to interact with the realworld environment. This interaction comes with the ability to retrain DNNs, since environmental conditions change continuously in time. Stochastic Gradient Descent (SGD) is a widely used algorithm to train DNNs by optimizing the parameters over the training data iteratively. In this work, first we present a novel approach to add the training ability to a baseline DNN accelerator (inference only) by splitting the SGD algorithm into simple computational elements. Then, based on this heuristic approach we propose TaxoNN, a lightweight accelerator for DNN training. TaxoNN can easily tune the DNN weights by reusing the hardware resources used in the inference process using a time-multiplexing approach and low-bitwidth units. Our experimental results show that TaxoNN delivers, on average, 0.97% higher misclassification rate compared to a full-precision implementation. Moreover, TaxoNN provides 2.1× power saving and 1.65× area reduction over the state-of-the-art DNN training accelerator.

Using Residue Number Systems to Accelerate Deterministic Bit-stream Multiplication

2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2019

Inaccuracy of computations is an important challenge with Stochastic Computing (SC). Deterministi... more Inaccuracy of computations is an important challenge with Stochastic Computing (SC). Deterministic approaches are proposed to produce completely accurate results with SC circuits. Current deterministic methods need a large number of clock cycles to produce exact result. This directly translates to a very high energy consumption. We propose a method based on the Residue Number Systems (RNS) to mitigate the high processing time of the deterministic methods. Compared to the state-of-the-art deterministic methods of SC, our approach delivers 760x and 170x improvement in terms of processing time and energy consumption.

High-Performance Deterministic Stochastic Computing Using Residue Number System

IEEE Design & Test, 2021

This article discusses how to reduce the latency of stochastic computations. The authors represen... more This article discusses how to reduce the latency of stochastic computations. The authors represent an integer number as a set of remainders with respect to a set of relatively prime moduli. Operations such as multiplication, implemented using a deterministic version of stochastic computing, work directly on the remainders, thus yielding a partitioning of the original computation and a significant decrease in the number of clock cycles required.-Vincent T. Lee, Facebook Reality Labs Research  StochaStic computing (Sc) [1], [2], an unconventional computing paradigm processing bit streams, has recently gained considerable attention in the hardware design and computer architecture communities. Higher tolerance to noise and lower hardware cost compared to conventional binary-radix-based designs are the most appealing advantages of SC. The inaccuracy of computation and long processing time, however, are the major weaknesses of SC. The inaccuracy is mainly because of random fluctuations in generating bit streams and correlation (independence) between bit streams [1]. The common method to improve accuracy

TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training

2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020

Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to be able to interact ... more Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to be able to interact with the realworld environment. This interaction comes with the ability to retrain DNNs, since environmental conditions change continuously in time. Stochastic Gradient Descent (SGD) is a widely used algorithm to train DNNs by optimizing the parameters over the training data iteratively. In this work, first we present a novel approach to add the training ability to a baseline DNN accelerator (inference only) by splitting the SGD algorithm into simple computational elements. Then, based on this heuristic approach we propose TaxoNN, a lightweight accelerator for DNN training. TaxoNN can easily tune the DNN weights by reusing the hardware resources used in the inference process using a time-multiplexing approach and low-bitwidth units. Our experimental results show that TaxoNN delivers, on average, 0.97% higher misclassification rate compared to a full-precision implementation. Moreover, TaxoNN provides 2.1× power saving and 1.65× area reduction over the state-of-the-art DNN training accelerator.

Exact Stochastic Computing Multiplication in Memristive Memory

IEEE Design & Test, 2021

High-Performance Deterministic Stochastic Computing Using Residue Number System

IEEE Design & Test, 2021

This article discusses how to reduce the latency of stochastic computations. The authors represen... more This article discusses how to reduce the latency of stochastic computations. The authors represent an integer number as a set of remainders with respect to a set of relatively prime moduli. Operations such as multiplication, implemented using a deterministic version of stochastic computing, work directly on the remainders, thus yielding a partitioning of the original computation and a significant decrease in the number of clock cycles required.-Vincent T. Lee, Facebook Reality Labs Research  StochaStic computing (Sc) [1], [2], an unconventional computing paradigm processing bit streams, has recently gained considerable attention in the hardware design and computer architecture communities. Higher tolerance to noise and lower hardware cost compared to conventional binary-radix-based designs are the most appealing advantages of SC. The inaccuracy of computation and long processing time, however, are the major weaknesses of SC. The inaccuracy is mainly because of random fluctuations in generating bit streams and correlation (independence) between bit streams [1]. The common method to improve accuracy

Energy-Efficient Convolutional Neural Network based on Cellular Neural Network Using Beyond-CMOS Technologies

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, 2019

In this paper, we perform a uniform benchmarking for the convolution neural network (CoNN) based ... more In this paper, we perform a uniform benchmarking for the convolution neural network (CoNN) based on the cellular neural network (CeNN) using a variety of beyond-CMOS technologies. Representative charge-based and spintronic device technologies are implemented to enable energy-efficient CeNN related computations. To alleviate the delay and energy overheads of the fully-connect layer, a hybrid spintronic CeNN-based CoNN system is proposed. It is shown that low-power FETs and spintronic devices are promising candidates to implement energy efficient CoNNs based on CeNNs. Specifically, more than 10× improvement in energy-delay product (EDP) is demonstrated for the systems using spin diffusion based devices and tunneling FETs compared to their conventional CMOS counterparts.

Exploring the design space for area-efficient embedded VLIW packet processing engine

2013 21st Iranian Conference on Electrical Engineering (ICEE), 2013

Today area-efficiency is an important factor in designing embedded systems. This paper presents a... more Today area-efficiency is an important factor in designing embedded systems. This paper presents a design space exploration based on an embedded VLIW processor for finding out the optimum architecture for ever increasing demands of embedded packet-processing applications. In our exploration, we use the VEX toolchain for exploring the effects of memory hierarchy, different architectural configurations, and compiler optimizations on both performance and area. Exploration results will find out the best architecture and compiler optimizations for VLIW embedded packetprocessing engines to have area-efficiency as well as timeefficiency.