Asynchronous and GALS Circuits by Matheus Trevisan Moreira

Symposium on VLSI, …, Jan 1, 2008
The evolution of deep submicron technologies allows the development of increasingly complex Systems on a Chip (SoCs). However, this evolution is rendering some well-established design practices less viable. Examples are the use of multi-point communication architectures (e.g. buses) and the design of fully synchronous systems. In addition, power dissipation is becoming one of the main design concerns, due e.g. to the increasing use of mobile products. An alternative to overcome such problems is adopting Network on Chip (NoC) communication architectures that support globally asynchronous locally synchronous (GALS) system design. This work proposes a GALS router with associated power control techniques, which enables low power SoC design. This is in contrast with previous works, which focused instead on reducing the power of SoC processing elements. The paper describes the asynchronous communication interface and the employed power control mechanism. Results obtained from RTL simulation with timing show that, even when subjected to high traffic injection rates, the proposed NoC displays a significant reduction in switching activity and consequently in power dissipation.

Mutual exclusion elements (MUTEXes) are fundamental components of asynchronous arbiters and are particularly critical to ensure metastable signals are properly filtered before reaching the arbiter outputs. However, despite their importance, the testing of these circuits is typically limited to functional testing. This paper discusses why this is not sufficient and addresses testability issues in both full-custom and standard-cell implementations. In particular, it proposes two new testable implementations that not only ensure improved coverage for single stuck-at faults but also enable testing the filtering of metastable signals. Additionally, this article quantifies the cost of the testable designs by comparing them to similar traditional designs in terms of area, power and metastability resolution time. Results show the proposed optimizations do increase area and power but have a small impact on performance.

Resilient designs offer the promise to remove increasingly large margins due to process, voltage, and temperature variations and take advantage of average-case data. However, proposed synchronous resilient schemes either suffer from metastability or require modifying the architecture to add replay-based logic that recovers from timing errors, which leads to high timing error penalties and poses a design challenge in modern processors. This paper presents an asynchronous bundled-data resilient template called Blade that is robust to metastability issues, requires no replay-based logic, and has low timing error penalties. The template is supported by an automated design flow that synthesizes synchronous RTL designs to gate-level asynchronous Blade designs. The benefits of this flow are illustrated on Plasma, a 3-stage OpenCore MIPS CPU. Our results demonstrate that a nominal area overhead of the asynchronous template of less than 10% leads to a 19% performance boost over the synchronous design due to average-case data and a 30-40% improvement when synchronous PVT margins are considered.
As manufacturing processes continue to shrink and supply voltages drop, timing margins due to increased process, temperature, and voltage variability become a significant portion of the clock period. An asynchronous bundled-data resilient template called Blade has recently been proposed to curb these margins and thereby outperform synchronous alternatives. This paper proposes a model to analyze the performance of Blade designs and an approach to optimize it. We validate the model against gate-level simulations of a resilient 3-stage MIPS CPU implemented with Blade and use it to compare the optimal performance of Blade designs with synchronous alternatives. The results show that Blade offers up to 44% higher performance than traditional designs and 23% higher performance than Bubble Razor, the synchronous resiliency strategy with the highest reported performance.

Despite its substantial power savings, voltage scaling increases the concern about sensitivity to variations in manufacturing process and operating conditions. These can induce significant delay changes in fabricated circuits. An elegant approach to cope with these issues is to employ quasi-delay-insensitive asynchronous design styles, which allow relaxing timing assumptions, enabling simpler timing closure when compared to clocked solutions. This work explores the effects of supply voltage scaling on a specific class of quasi-delay-insensitive circuits called spatially distributed dual spacer null convention logic (SDDS-NCL). It first analyzes basic SDDS-NCL gates from a 65 nm cell library. The analysis explores the effects of supply voltage scaling on isolated cells, encompassing static power, energy and delay trade-offs. Next, it shows the results of a similar analysis applied to a 324-cell case study circuit. Results indicate that the evaluated class of circuits can significantly benefit from sub- and near-threshold operation to trade off energy efficiency and performance.
We present the design and analysis of three commonly used types of programmable delay elements suitable for use in 2-phase bundled-data asynchronous circuits. Our objective is to minimize power consumption and delay margins needed for correct operation under voltage scaling. We propose both circuit design and transistor sizing strategies to optimize these elements and discuss the relative trade-offs observed in a 65 nm bulk CMOS technology.
LASCAS 2015
Networks on chip (NoCs) are efficient infrastructures to enable communication among the large number of IPs that compose modern systems on chip (SoCs). However, even if recent technologies allow the construction of such complex systems, they increase the cost and effort of designing a correct and efficient chip-wide clock delivery network for a fully synchronous system. Asynchronous NoCs are an attractive alternative, as they allow each IP to operate in an independent clock domain, eliminating the need for a global clock network and producing globally asynchronous locally synchronous (GALS) systems. This paper details the design of a low power NoC router that is asynchronous and employs a transition-signaling bundled-data design style. Results show reductions of 27% in energy consumption when compared to a similar synchronous router.
LASCAS 2015
The evolution of technology into deep submicron domains leads to increasingly complex timing closure problems in the design of multiprocessor systems. One natural alternative is to resort to the globally asynchronous, locally synchronous (GALS) paradigm. This work proposes a generic architecture for local clock generators (LCGs) with very low power and area overhead, which drive individual modules of a multiprocessor, e.g. network on chip routers and other elements. As its main original contribution, it details the design of a digitally controlled oscillator (DCO), the core of the clock generator architecture. This DCO can produce at least 16 distinct frequencies between 117 MHz and 1 GHz and supports clock gating and glitch-free frequency changes. Its design is robust to PVT variations and takes less than 1,000 µm².

Increasing process variations and sensitivity to operating conditions are making the design of traditional synchronous circuits a challenging task. Correct operation of these circuits relies on timing margins, which have an undesirably high cost in performance and power. One approach to mitigate this cost that is gaining substantial interest is the use of timing resilient microarchitectures that utilize error detecting sequential circuits. We evaluate the sensitivity of the transition detector with time borrowing error detecting latch to timing violations, including violations caused by glitches. Results show that the classic design is more constrained than previously believed and does not guarantee safe operation, i.e. does not guarantee that all timing violations will be captured. To overcome this limitation, we propose transistor level optimizations that enable safe operation, guaranteeing that all timing violations are captured, at a cost of 3 extra transistors and increases of 30% in leakage power and 8% in energy.
Papers by Matheus Trevisan Moreira

The traditional synthesis flow for designing ASICs adopts a standard cell approach to generate VLSI circuits. As a consequence, cell-level layouts are not fully optimized, due to the restricted number of cells present in the library. To solve this problem, ASTRAN, an open source automatic synthesis tool, was developed. This tool generates layouts with unrestricted cell structures and obtains results with density similar to that of standard cells. A key step in the ASTRAN flow is transistor folding, which consists of breaking up the transistors that exceed the height limit defined in the design rules. In ASTRAN, this step is applied only to single transistors. This paper addresses this issue and introduces a new folding methodology that identifies all stacks of series transistors and applies folding to each of these arrangements. The results obtained with this new folding technique show reductions in cell area.
This paper presents HF-RISC, a 32-bit RISC processor, along with its associated programming toolchain. The instruction set architecture of the processor is based on MIPS I and its hardware organization comprises three pipeline stages. The processor was synthesized in four different technology nodes for maximum frequency and simulated using CoreMark, an industry-standard performance evaluation benchmark. Using data obtained from synthesis and benchmarking, we analyze the processor performance and compare it to similar commercial products. The results indicate that HF-RISC is a good option for embedded design, as it presents performance figures similar to state-of-the-art ARM processors. Furthermore, its partially reconfigurable hardware organization allows the designer to explore performance and area trade-offs.
Lecture Notes in Computer Science, 2011
This work presents the architecture and ASIC implementation of Hermes-A, an asynchronous network on chip router. Hermes-A is coupled to a network interface that enables communication between the router and synchronous processing elements. The ASIC implementation of the router employed standard CAD tools and a specific library of components. Area and timing characteristics for a 180 nm technology attest to the quality of the design, which displays a maximum throughput of 3.6 Gbits/s.
This work presents and compares two design flows for implementing asynchronous ASICs. The first flow is fully automated and generates circuits from Balsa descriptions, a high level language dedicated to the description of asynchronous circuits. Balsa descriptions are synthesized through Teak, a synthesis tool that accepts inputs in the Balsa language. The other flow uses conventional EDA tools to synthesize asynchronous circuits initially captured in some typical HDL, such as Verilog or VHDL, where each asynchronous component is ...

2011 18th IEEE International Conference on Electronics, Circuits, and Systems, 2011
Interest in the non-synchronous design of digital circuits is growing due to technology scaling into deep submicron transistor geometries and to the problems this scaling creates for keeping synchronous design advantageous. To enable most non-synchronous styles, the C-element is a fundamental device that has to be available as a logic primitive. A recently proposed design flow improved a standard cell library by adding to it a set of typical asynchronous cells. However, the original flow did not address low power cells explicitly, which is a requirement in many modern applications. This paper proposes an extension of the flow so that it can expand the cell set with low power components. To achieve this, the paper adds a new degree of freedom to cell design. The new standard cell set encompasses over 500 different C-element implementations. The cell set employs a 65 nm commercial CMOS process and is fully compliant with the foundry standard cell library. A fully asynchronous RSA crypto core was designed with the new cells, producing savings of more than 35% in total power and more than 69% in leakage power.
Symposium on VLSI, …, Apr 7, 2008
The evolution of deep submicron technologies allows the development of increasingly complex Systems on a Chip (SoCs). However, this evolution is rendering some well-established design practices less viable. Examples are the use of multi-point communication architectures (e.g. buses) and the design of fully synchronous systems. In addition, power dissipation is becoming one of the main design concerns, due e.g. to the increasing use of mobile products. An alternative to overcome such problems is adopting ...
Hermes-A – An Asynchronous NoC Router with Distributed Routing
This work presents the architecture and ASIC implementation of Hermes-A, an asynchronous network on chip router. Hermes-A is coupled to a network interface that enables communication between the router and synchronous processing elements. The ASIC implementation of the router employed standard CAD tools and a specific library of components. Area and timing characteristics for a 180 nm technology attest to the quality of the design, which displays a maximum throughput of 3.6 Gbits/s.
23rd IEEE International SOC Conference, 2010
A 65nm standard cell set and flow dedicated to automated asynchronous circuits design
2011 IEEE International SOC Conference, 2011
This work proposes a new design flow for rapid creation and characterization of standard cell sets for asynchronous design. The flow is fully automated except for the cell layout generation step. It has been applied to the design of a standard cell set supporting the Teak asynchronous synthesis tool. Cells use a 65 nm gate length commercial CMOS process. An asynchronous RSA cryptography circuit provides the design flow validation.
Fifteenth International Symposium on Quality Electronic Design, 2014
Classically, quasi-delay-insensitive asynchronous circuits based on weak-conditioned half-buffers employ the return-to-zero, 4-phase handshake protocol. This work scrutinizes the alternative return-to-one protocol and analyzes the effects of using it in practical circuits. A pipelined shift-and-add multiplier serves as a case study. Return-to-one and return-to-zero versions of the circuit provide ground for extensive comparison. Experimental results point to reductions in static power and in forward propagation delay of up to 35% and 12%, respectively, when using return-to-one. Also, results indicate that mixing return-to-zero and return-to-one leads to dynamic power savings.

2008 IEEE Computer Society Annual Symposium on VLSI, 2008
The evolution of deep submicron technologies allows the development of increasingly complex Systems on a Chip (SoCs). However, this evolution is rendering some well-established design practices less viable. Examples are the use of multi-point communication architectures (e.g. buses) and the design of fully synchronous systems. In addition, power dissipation is becoming one of the main design concerns, due e.g. to the increasing use of mobile products. An alternative to overcome such problems is adopting Network on Chip (NoC) communication architectures that support globally asynchronous locally synchronous (GALS) system design. This work proposes a GALS router with associated power control techniques, which enables low power SoC design. This is in contrast with previous works, which focused instead on reducing the power of SoC processing elements. The paper describes the asynchronous communication interface and the employed power control mechanism. Results obtained from RTL simulation with timing show that, even when subjected to high traffic injection rates, the proposed NoC displays a significant reduction in switching activity and consequently in power dissipation.