2008, 2008 14th IEEE International Symposium on Asynchronous Circuits and Systems
We present a technique to automatically synthesize heterogeneous asynchronous pipelines by combining two different latching styles: normally open D-latches for high performance and self-resetting D-latches for low power. The former is fast but results in high power consumption due to data glitches that leak through the latch when it is open. The latter is normally closed and is opened just before data stabilizes. Thus, it is more power-efficient but slower than normally open D-latches.
Proceedings Second International Symposium on Advanced Research in Asynchronous Circuits and Systems, 1996
This paper presents design and simulation results of two high-performance asynchronous pipeline circuits. The first circuit is a two-phase micropipeline but uses pseudo-static Svensson-style double edge-triggered D-flip-flops (DETDFF) for data storage in place of traditional transmission gate latches or Sutherland's capture-pass latches. The second circuit is a four-phase micropipeline with burst-mode control circuits.
IEEE Design & Test of Computers, 2011
One of the foundations of high-performance digital system design is the use of pipelining. In synchronous systems, for several decades, pipelining has been the fundamental technique used to increase parallelism and hence boost system throughput, whether for high-performance processors, multimedia and graphics units, or signal processors. This article provides an overview of pipelining in asynchronous, or clockless, digital systems. We do not attempt an exhaustive coverage, but rather introduce the basics of several leading representative styles. These pipelines naturally fall into two classes: those that use static logic versus those that use dynamic logic for the data path. Each class tends to use a distinct approach for its control and data storage. For static logic, we introduce the classic micropipeline of Sutherland [1], along with two high-performance variants: Mousetrap [2] (which uses a standard cell design) and GasP [3] (which uses a custom design). For dynamic logic, we present the classic PS0 pipeline of Williams and Horowitz [4,5], along with two high-performance variants: the precharge half-buffer (PCHB) pipeline [6] (which provides greater timing robustness) and the high-capacity (HC) pipeline [7] (which provides double the storage capacity). We also briefly discuss design tradeoffs, performance evaluation, system-level analysis and optimization techniques, CAD tool support, testing, and recent industrial and academic applications.
2000
This paper introduces several new asynchronous pipeline designs which offer high throughput as well as low latency. The designs target dynamic datapaths, both dual-rail as well as single-rail. The new pipelines are latch-free and therefore are particularly well-suited for fine-grain pipelining, i.e., where each pipeline stage is only a single gate deep. The pipelines employ new control structures and protocols aimed at reducing the handshaking delay, the principal impediment to achieving high throughput in asynchronous pipelines.
ACM Journal on Emerging Technologies in Computing Systems, 2011
We present two novel energy-efficient pipeline templates for high throughput asynchronous circuits. The proposed templates, called N-P and N-Inverter pipelines, use a single-track handshake protocol. There are multiple stages of logic within each pipeline. The proposed techniques minimize handshake overheads associated with input tokens and intermediate logic nodes within a pipeline template. Each template can pack a significant amount of logic in a single stage, while still maintaining a fast cycle time of only 18 transitions. Noise and timing robustness constraints of our pipelined circuits are quantified across all process corners. We present a completion detection scheme based on wide NOR gates, which results in significant latency and energy savings, especially as the number of outputs increases. To fully quantify all design trade-offs, three separate pipeline implementations of an 8x8-bit Booth-encoded array multiplier are presented. Compared to a standard QDI pipeline implementat...
Asynchronous Design Methodologies
Among the claims made concerning the advantages of asynchronous logic is that circuits can take advantage of average-case (data-dependent) speed rather than worst-case speed. Whilst this argument can easily be sustained for a single logic stage, its extension to systems consisting of many logic stages has not been widely investigated. This paper reports on investigations into the throughput of asynchronous and synchronous pipelines consisting of alternate latches and logic stages where the data-dependent delay is a two-valued random variable. The extent to which an average-case speed of a single stage that is lower than worst case can be translated into higher throughput in an asynchronous pipeline, as compared to a synchronous pipeline, is found to be restricted by the coefficient of variation of the distribution of data-dependent delay, the length of the pipeline, the number of latches used between each logic stage, and the number of data items in a loop.
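The comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's model: it assumes a linear pipeline with unbounded buffering between stages (so no back-pressure from latches), an always-available input, and hypothetical names (`two_valued_delays`, `async_total_time`, `sync_total_time`) and parameter values chosen only for illustration. The synchronous pipeline is clocked at the worst-case stage delay, while each asynchronous stage starts as soon as its input item and the stage itself are both free.

```python
import random


def two_valued_delays(n_items, n_stages, fast, slow, p_slow, seed=0):
    """Draw a delay for every (item, stage) pair from a two-valued distribution."""
    rng = random.Random(seed)
    return [
        [slow if rng.random() < p_slow else fast for _ in range(n_stages)]
        for _ in range(n_items)
    ]


def async_total_time(delays):
    """Time at which the last item leaves the last stage.

    Assumption: a stage begins work on an item once the previous stage has
    produced it and the stage has finished the previous item.
    """
    n_stages = len(delays[0])
    stage_free_at = [0.0] * n_stages
    for item_delays in delays:
        item_ready_at = 0.0  # input is assumed to be always available
        for i, d in enumerate(item_delays):
            start = max(item_ready_at, stage_free_at[i])
            item_ready_at = start + d
            stage_free_at[i] = item_ready_at
    return stage_free_at[-1]


def sync_total_time(n_items, n_stages, worst):
    """A clocked pipeline advances once per worst-case period."""
    return (n_items + n_stages - 1) * worst


if __name__ == "__main__":
    delays = two_valued_delays(1000, 8, fast=1.0, slow=3.0, p_slow=0.25)
    print("async:", async_total_time(delays))
    print("sync: ", sync_total_time(1000, 8, worst=3.0))
```

Under these assumptions the asynchronous total time always lies between the all-fast and all-slow cases; how close it gets to the average-case figure depends on the delay distribution and pipeline length, which is the effect the abstract reports.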
2002
Exploiting instruction-level parallelism (ILP) is extremely important for achieving high performance in application specific instruction set processors (ASIPs) and embedded processors. Unlike conventional general purpose processors, ASIPs and embedded processors typically run a single application and hence must be optimized extensively for this in order to extract maximum performance. Further, low power and low cost requirements of ASIPs may demand reuse of pipeline stages causing pipelines with complex structural hazards. In such architectures, exploiting higher ILP is a major challenge to the designer.
2001
This paper describes a novel approach to high-level synthesis of complex pipelined circuits, including pipelined circuits with feedback. This approach combines a high-level, modular specification language with an efficient implementation. In our system, the designer specifies the circuit as a set of independent modules connected by conceptually unbounded queues. Our synthesis algorithm automatically transforms this modular, asynchronous specification into a tightly coupled, fully synchronous implementation in synthesizable Verilog.
Approach: It presents a new approach to high-level synthesis. This approach combines the best of both worlds: a modular, asynchronous specification language and an automatically generated synchronous, fully pipelined implementation.
Algorithms: It presents a relaxation algorithm for decreasing the clock cycle time and a coordinated global scheduling algorithm for mapping the individual operations of the modules into clock cycles. The latter is the enabling technology for efficient pipelining, as it allows the data to move together across the circuit even when the pipeline buffers are full.
IEEE Access
Ongoing miniaturization in VLSI circuits continues to pose challenges to the conventionally used synchronous design style in microprocessors. These include clock distribution in the GHz range, robustness to delay variations, reduction of electromagnetic interference, and energy conservation, to name a few. Asynchronous logic has been known for its ability to address the aforementioned challenges by means of closed-loop handshake protocols, instead of notorious clock signals. Because of these advantages, there have been numerous attempts at building general and special purpose microprocessors during the last three decades. Still, however, commercially available asynchronous processors are scarce, mainly due to insufficient electronic design automation (EDA) tool support, an ambiguous design flow and testing mechanisms for asynchronous logic, and, most importantly, the absence of a forum to look for relevant works explaining the design steps and tools for such microprocessors. This work is intended to bridge this gap by 1) reviewing the design principles of asynchronous logic, including classification, signaling conventions, and pipelining approaches, 2) presenting the complete design flow and available EDA tools, 3) developing an encyclopedia of the various general and special purpose microprocessors proposed so far, and 4) presenting an evaluation of those works in terms of area on the die and performance metrics. This work will also serve as a set of guidelines for asynchronous microprocessor design and implementation in all phases from specification to tape-out.
Index terms: Asynchronous logic, electronic design and automation, microprocessor.
International Journal of Engineering Research and, 2015
Asynchronous designs have interesting features because of the absence of clock signals, and they are another option when designing digital systems. By overcoming the system timing overhead, asynchronous designs work at high speed, provide high throughput, utilize dynamic power, and also provide elasticity. Out of several design styles, the most suitable for FPGA platforms is the bundled-data micropipeline style, because of the simplicity of its control. In this project, we propose a pipeline architecture in a bundled-data micropipeline style to implement asynchronous digital systems targeting FPGA devices. The execution of a program or software is achieved on the FPGA through parallel processing. The design summary of each architecture shows that the architecture has better throughput by providing a larger number of input/output bonds, less delay by reducing the number of look-up tables (LUTs), and less area by reducing the number of flip-flops. The designed architecture is applied to a UART design in order to increase throughput and also to an FIR filter design to reach high efficiency.
Circuits and Systems
The objective of the work is to design a new clock-gated flip-flop for pipelining architectures. In computing and consumer products, the major share of dynamic power is consumed in the system's clock signal, typically about 30% to 70% of the total dynamic (switching) power consumption. Several techniques to reduce the dynamic power have been developed, of which clock gating is predominant. In this work, a new methodology is applied for gating the flip-flop, by which the power is reduced. Clock gating is applied to the pipeline-stage flip-flop, which is active only when valid data has arrived. The methodology used in the project, named Selective Look-Ahead Clock Gating, computes the clock enabling signals of each FF one cycle ahead of time, based on the present cycle data of those FFs on which it depends. Similar to data-driven gating, it is capable of stopping the majority of redundant clock pulses. In this work, the circuit implementation of the various blocks of data-driven clock gating is done and the results are observed. The proposed work is used for pipelining stages in microprocessor and DSP architectures. The proposed method is simulated using Quartus for the Cyclone 3 kit.
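As a rough behavioral illustration of the data-driven gating that this abstract compares against (not the Selective Look-Ahead scheme itself, and not the authors' circuit), the Python sketch below counts how many clock pulses a single flip-flop actually needs. It assumes the common formulation in which the clock is enabled only when the input D differs from the stored value Q; the function name `count_clock_pulses` and the sample bit stream are hypothetical.

```python
def count_clock_pulses(d_values, q_init=0):
    """Compare ungated and data-driven-gated clocking of one flip-flop.

    d_values: sequence of 0/1 inputs, one per clock cycle.
    Returns (ungated_pulses, gated_pulses, final_q).
    Assumption: the gated clock fires only in cycles where D != Q.
    """
    q = q_init
    ungated = 0
    gated = 0
    for d in d_values:
        ungated += 1          # a free-running clock toggles every cycle
        enable = d ^ q        # XOR of input and state: 1 only if the state must change
        if enable:
            gated += 1        # clock pulse is let through
            q = d             # flip-flop captures the new value
    return ungated, gated, q


if __name__ == "__main__":
    stream = [0, 0, 1, 1, 1, 0]
    print(count_clock_pulses(stream))
```

For the sample stream the stored value changes only twice, so only two of the six clock pulses are needed; the rest are the redundant pulses that gating suppresses.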
2016
Techniques to speed up the execution of sequential applications considering the multicore synergies provided by contemporary architectures, such as the ones possible to implement using Field-Programmable Gate Arrays (FPGAs), are increasingly important. One of the techniques is task-level pipelining, seen as a suitable technique for multicore based systems, especially when dealing with applications consisting of producer/consumer (P/C) tasks. In order to provide task-level pipelining, efficient data communication and synchronization schemes between producers and consumers are key. The traditional mechanisms to provide data communication and synchronization between P/C pairs, such as FIFO-channels and shared memory based empty/full flag schemes, may not be feasible and/or efficient for all types of applications. This thesis proposes an approach for pipelining tasks able to deal with in-order and out-of-order communication patterns between P/C pairs. In order to provide e...
Though asynchronous design has been studied for decades, some objective design difficulties made the synchronous design style more suitable for commercial applications. The promise of no clock skew, a higher degree of modularity, and low power consumption has generated a resurgence of interest in asynchronous logic within the scientific community.
… Workshop on Logic …, 2004
We present a complete toolflow that translates ANSI-C programs into asynchronous circuits. The toolflow is built around a compiler that converts C into a functional dataflow intermediate representation, exposing instruction-level, pipeline and memory parallelism. The compiler performs optimizations and converts the intermediate representation into pipelined asynchronous circuits, with no centralized controllers. In the resulting circuits, control is distributed, communication is achieved through local wires, and arbitration for datapath resources is unnecessary. Circuits automatically synthesized from Mediabench kernels exhibit excellent energy-delay.
IEICE Electronics Express
Voltage scaling is an effective technique for ultra-low-power applications. However, PVT variation severely degrades the robustness of traditional synchronous pipelines when voltage scales into the sub-threshold region. In this paper, we propose a register-based bundled-data asynchronous pipeline, called Snake, that can operate robustly in sub-threshold. By looping the match delay line, Snake halves the design overhead compared to other asynchronous pipelines. We also propose a practical asynchronous design methodology which is compatible with commercial EDA and needs only a few modifications to the synchronous design flow. Monte-Carlo SPICE simulation shows that the pipelined multiplier applying the proposed techniques operates stably at 0.2 V and achieves a minimum power of 1.3 nW at 0.2 V and a minimum energy of 1.07 pJ per cycle at 0.3 V. It provides 6.7 times superiority over the synchronous baseline design with 22% area overhead. Comparison with other state-of-the-art works shows the proposed techniques are quite competitive.
2013
In recent years, there has been increasing interest in using task-level pipelining to accelerate the overall execution of applications mainly consisting of Producer-Consumer tasks. This paper proposes an approach to achieve pipelined execution of Producer-Consumer pairs of tasks in FPGA-based multi-core architectures. Our approach is able to speed up the overall execution of successive, data-dependent tasks by using multiple cores and specific customization features provided by FPGAs. An important component of our approach is the use of customized inter-stage buffer schemes to communicate data and to synchronize the cores associated with the Producer-Consumer tasks. In order to improve performance, we propose a technique to optimize out-of-order Producer-Consumer pairs where the consumer uses each produced data element more than once, a behavior present in many applications (e.g., in image processing). All the schemes and optimizations proposed in this paper were evaluated with FPGA implementations. The experimental results show the feasibility of the approach in both in-order and out-of-order Producer-Consumer tasks. Furthermore, the results using our approach to task-level pipelining and a multi-core architecture reveal noticeable performance improvements for a number of benchmarks over a single core implementation without using task-level pipelining.
1994
The clock frequency of a synchronous circuit can be increased at the expense of increased system latency, area, and power using synchronous optimization techniques such as pipelining and retiming. Pipelining is a well-developed methodology, having been applied to almost every computer architecture from microprocessors to supercomputers. Retiming, on the other hand, has only recently become popular, with practical application areas currently being developed. Both pipelining and retiming are reviewed in this paper. In order to make retiming more generally useful, low-level circuit delay components inherent to ICs must be incorporated into the retiming process. These include variable register delay, clock skew, and interconnect delay. An algorithm is presented by the authors for incorporating variable register delays, interconnect delay, and clock skew into retiming. The algorithm identifies and eliminates path-dependent race conditions in synchronous circuits. The results of applying the algorithm to MCNC benchmarks are presented, and both performance and reliability improvements are observed.
In this paper, a performance comparison of several proposed asynchronous pipeline styles is presented. The asynchronous styles include GasP, MOUSETRAP, IPCMOS, LP SR 2/1, HC, STFB, LDA, LP2/1, RSPCFB, and NCL. Both 4-bit and 16-bit 4-stage FIFO circuits are designed and simulated utilizing HSPICE. The simulation results are then used to compare the styles in terms of throughput, latency, power dissipation, transistor count, and datapath width. In addition, two figures of merit which relate the energy and the delay of the circuit are utilized in the comparison of the styles. To estimate the throughput and the latency of the circuits, a simple analytical model for the transistor delay is also proposed. The predictions of the analytical model for the throughput and the latency are compared to the simulation results to assess the accuracy of the model.
IEEE Design & Test of Computers, 2011
2001
Wave-steering is a new design methodology that realizes high throughput circuits by embedding layout friendly synthesized structures in silicon. In the wave-steering design methodology, circuits inherently utilize latches. Inside the synthesized structures they are used for signal skewing, and on the interconnects to guarantee the correct arrival times at the inputs. Recently, we proposed a novel high-throughput FPGA architecture based on the wave-steering design principle to handle throughput-intensive applications. Previously our work was focussed mainly on the Logic Block (LB) design. In this paper we discuss a pipelined interconnect scheme to support the strict timing requirements that are necessitated by the wave-steered design style. We characterize designs that best fit the new architecture and show that as technology scales down towards deep submicron (DSM), this FPGA fabric shows an increasing throughput performance.
2011
This paper proposes a technique for dynamic power reduction of pipelined processors. It is based on eliminating unnecessary transitions that are generated during the execution of NOP instructions. The approach includes the elimination of unnecessary changes in pipe register contents and the limitation of boundary movement of transitions caused by inevitable changes in pipe register contents due to insertion of a NOP into a pipelined processor. To assess its efficiency, the proposed technique is applied to MIPS, DLX, and PAYEH processors considering a number of benchmarks. The experimental results show that the techniques can lead to up to 10% reduction in the dynamic power consumption at a cost of negligible (almost zero) speed and (about 0.2%) area overheads.