2003, Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
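The C-slowing half of this transformation can be illustrated with a small software model. This is a minimal sketch, not the authors' tool: it assumes a hypothetical feedback function `step`, replaces the single state register with C registers, and interleaves C independent input streams so that each stream reaches the loop logic once every C cycles. The retiming half, which would then redistribute those extra registers through the logic to shorten the critical path, is not modeled.

```python
from collections import deque


def step(state, x):
    # Hypothetical feedback-loop logic (an assumption for illustration):
    # a running checksum whose next value depends on the previous one.
    return (state * 31 + x) % 65521


def run_single(inputs, init=0):
    """Original design: one state register, one input stream."""
    state = init
    outputs = []
    for x in inputs:
        state = step(state, x)
        outputs.append(state)
    return outputs


def run_c_slow(streams, init=0):
    """C-slowed design: the state register becomes a chain of C registers.

    `streams` holds C equal-length input streams. On each cycle the oldest
    register value (belonging to the stream whose turn it is) feeds the
    loop logic, and the result re-enters the chain at the other end.
    """
    c = len(streams)
    regs = deque([init] * c)
    outputs = [[] for _ in range(c)]
    for cycle in range(c * len(streams[0])):
        s = cycle % c                    # stream served this cycle
        x = streams[s][cycle // c]       # that stream's next input
        new_state = step(regs.popleft(), x)
        regs.append(new_state)
        outputs[s].append(new_state)
    return outputs


streams = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
interleaved = run_c_slow(streams)
independent = [run_single(s) for s in streams]
print(interleaved == independent)
```

Each interleaved stream produces the same results as if it had run alone through the original design, which is why the extra registers can later be used for pipelining without changing per-stream behavior.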
Journal of Universal …, 2007
This article presents an architecture that encrypts data with the AES algorithm. This architecture can be implemented on the Xilinx Virtex II FPGA family, by applying pipelining and dynamic total reconfiguration (DTR). The originality of our implementation is that it computes sequentially in the FPGA the Key and Cipher part of the AES algorithm. This dynamic reconfiguration implementation allows a good optimization of logic resources with a high throughput. This architecture employs only 11619 slices allowing a considerable economy of the resources and reaching a maximum throughput of 44 Gbps.
An implementation of the Advanced Encryption Standard (AES) using efficient code optimization and partial reconfiguration techniques is presented in this paper. A 128-bit block size and cipher key are used for this AES implementation. The Rijndael algorithm, also referred to as AES, is mainly used to secure transmission channels. Xilinx design tool 13.3 and Xilinx Project Navigator are used for synthesis and simulation. The design is coded in VHDL. The pipelined design is implemented on a Virtex 6 FPGA device and achieves a throughput of 49.3 Gbits/s at a frequency of 384.793 MHz.
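The reported throughput is consistent with a simple relation for a fully pipelined block cipher: throughput equals block size times clock frequency. The sketch below assumes the design accepts one 128-bit block per clock cycle, which the abstract does not state explicitly; the function name is hypothetical.

```python
def pipelined_throughput_gbps(block_bits, clock_mhz):
    """Throughput in Gbit/s, assuming one block is accepted every clock cycle."""
    bits_per_second = block_bits * clock_mhz * 1e6
    return bits_per_second / 1e9


# 128-bit AES block at the reported 384.793 MHz
aes = pipelined_throughput_gbps(128, 384.793)
print(round(aes, 1))
```

Under that assumption the result rounds to the 49.3 Gbit/s quoted in the abstract.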
Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays - FPGA '03, 2003
In this paper, we propose a new mathematical DES description that allows us to achieve optimized implementations in terms of the Throughput/Area ratio. First, we get an unrolled DES implementation that works at data rates of 21.3 Gbps (333 MHz), using Virtex-II technology. In this design, the plaintext, the key and the mode (encryption/decryption) can be changed on a cycle-by-cycle basis with no dead cycles. In addition, we also propose sequential DES and triple-DES designs that are currently the most efficient ones in terms of resources used as well as in terms of throughput. Based on our DES and triple-DES results, we also set up conclusions for optimized FPGA design choices and possible improvement of cipher implementations with a modified structure description.
International Design and Test Workshop, 2007
The Advanced Encryption Standard (AES) is the latest standard for cryptography and has gained wide support as a means to secure digital data. In this paper, tradeoffs of speed vs. area that are inherent in the design of a security processor are explored. Two implementations of the AES on Xilinx Virtex 4 FPGA are introduced, the first design is called optimized
Proceedings of the 2002 …, 2002
Execution runtime is usually a headache for designers mapping applications onto reconfigurable architectures. In this article we propose a methodology, as well as the supporting toolset, that aims to provide fast application implementation onto reconfigurable architectures using a Just-In-Time (JIT) compilation framework. Experimental results demonstrate the efficiency of the introduced framework, as we reduce the execution runtime compared to the state-of-the-art approach on average by 53.5×. Additionally, the derived solutions achieve operating frequencies that are 1.17× higher, while they also exhibit significantly lower fragmentation ratios of hardware resources.
Speed and area reduction are among the major issues in VLSI applications. An implementation of the Advanced Encryption Standard (AES) algorithm is presented in this paper. The design uses a looping method to reduce area and increase speed. It uses an encrypted round and pipelining for speed and an isomorphic mapping method for area. This algorithm achieves efficiency and high throughput.
Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays - FPGA '03, 2003
Reprogrammable devices such as Field Programmable Gate Arrays (FPGAs) are highly attractive options for hardware implementations of encryption algorithms, and this report investigates a methodology to efficiently implement block ciphers in CLB-based FPGAs. Our methodology is applied to the new Advanced Encryption Standard RIJNDAEL and the resulting designs offer better performance than previously published in the literature. We propose designs that unroll the 10 AES rounds and pipeline them in order to optimize the frequency and throughput results. In addition, we implemented solutions that allow the plaintext and the key to be changed on a cycle-by-cycle basis with no dead cycles. Another strong focus is placed on low area circuits and we propose sequential designs with very low area requirements. Finally we demonstrate that RAM-based implementations imply different constraints but our methodology still holds.
Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021
Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past: they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable primitives. We describe Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. We show how to use a standard instruction selection approach to lower intermediate programs to assembly programs, which can be both faster and more effective than the complex metaheuristics that existing FPGA toolchains use. We use Reticle to implement linear algebra operators and coroutines and find that Reticle compilation runs up to 100 times faster than current approaches while producing comparable or better run-time and utilization.
2005
We present an automatic logic synthesis method targeted for high-performance asynchronous FPGA (AFPGA) architectures. Our method transforms sequential programs as well as high-level descriptions of asynchronous circuits into fine-grain asynchronous process netlists suitable for an AFPGA. The resulting circuits are inherently pipelined, and can be physically mapped onto our AFPGA with standard partitioning and place-and-route algorithms. For a wide variety of benchmarks, our automatic synthesis method not only yields comparable logic densities and performance to those achieved by hand placement, but also attains a throughput close to the peak performance of the FPGA.
Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186)
2008
Dynamic hardware generation reduces the number of FPGA resources needed and speeds up the application by optimizing the configuration for the exact problem at hand at run-time. If the problem changes, the system needs to be reconfigured. When this occurs too often, the total reconfiguration overhead is too high and the benefit of using dynamic hardware generation vanishes. Hence, it is important to minimize the number of reconfigurations. We propose a novel technique to reduce the number of reconfigurations by using loop transformations. Our approach is similar to temporal data locality optimizations. By applying our technique, we can drastically reduce the number of reconfigurations, as indicated by the matrix multiplication example. After applying the loop transformations, the number of reconfigurations decreases by an order of magnitude. Combined with a dynamic hardware generation technique with a very low overhead, our technique obtains a significant speedup over generic circuits.
Indonesian journal of electrical engineering and computer science, 2024
Current technical advancements centered on cryptography, such as advanced encryption standard (AES) hardware architectures, are gaining momentum with respect to improving speed and area optimizations. In this paper, we propose a novel architecture to implement AES on reconfigurable hardware, i.e., field programmable gate arrays (FPGA). The controller in the AES algorithm is responsible for generating the signals that drive the operations producing the 128-bit ciphertext. The proposed controller uses a multiplexer and synchronous register-based approach to obtain area and speed efficiency on the FPGA hardware. The entire architecture of AES with the proposed controller is implemented on Virtex 5, Virtex 6, and Virtex 7 series using Xilinx ISE 14.7 and tested for critical path delay, frequency, slices, efficiency and throughput. It is observed that all the parameters are improved compared to existing architectures, achieving throughputs of 32.29, 40.01, and 43.01 Gbps respectively. The key benefit of this approach is the high level of parallelism it displays in a quick and efficient manner.
2007 IEEE Northeast Workshop on Circuits and Systems, 2007
Xilinx VirtexII Pro FPGAs support dynamic reconfiguration. To benefit from this functionality, Xilinx proposes a modular and differential development flow, which consists in precompiling all possible configurations and switching from one to another in real time. The precompilation process is too slow and static. Xilinx also supplies JBits, but this tool does not support the VirtexII Pro FPGA and later devices. We aim to dynamically produce digital circuits. Unfortunately, since Xilinx does not entirely document the format of the FPGA bitstreams, it is in principle impossible to produce bitstreams without using their tools. This paper presents the methodology we have used to determine the Xilinx bitstream format in order to quickly produce valid configurations on the fly using only our tools. Our synthesis approach translates a simple expression language into a dataflow graph of predefined tiles which are placed and interconnected using the bitstream format information we gathered.
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays - FPGA '14, 2014
Achievable frequency (fmax) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGA), because of its impact on design latency and throughput. fmax is limited by critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement and routing. However, for high-level synthesis (HLS) design flows, it is challenging to evaluate the real critical delay at the behavioral level. Current HLS flows typically use module pre-characterization for delay estimates. However, we will demonstrate that such delay estimates are not sufficient to obtain high fmax and also minimize total execution latency. In this paper, we introduce a new HLS flow that integrates with Altera's Quartus synthesis and fast placement and routing (PAR) tool to obtain realistic post-PAR delay estimates. This integration enables an iterative flow that improves the performance of the design with both behavioral-level and circuit-level optimizations using realistic delay information. We demonstrate our HLS flow produces up to 24% (on average 20%) improvement in fmax and up to 22% (on average 20%) improvement in execution latency. Furthermore, results demonstrate that our flow is able to achieve from 65% to 91% of the theoretical fmax on Stratix IV devices (550 MHz).
Lecture Notes in Computer Science, 2003
Performance evaluation of the Advanced Encryption Standard candidates has led to intensive study of both hardware and software implementations. However, although plentiful papers present various implementation results, it seems that efficiency could still be greatly improved by applying good design rules adapted to devices and algorithms. This paper addresses various approaches for efficient FPGA implementations of the Advanced Encryption Standard algorithm. As different applications of the AES algorithm may require different speed/area tradeoffs, we propose a rigorous study of the possible implementation schemes, but also discuss design methodology and algorithmic optimization in order to improve previously reported results. We propose heuristics to evaluate hardware efficiency at different steps of the design process. We also define an optimal pipeline that takes the place and route constraints into account. Resulting circuits significantly improve previously reported results: throughput is up to 18.5 Gbits/sec and area requirements can be limited to 542 slices and 10 RAM blocks with a ratio throughput/area improved by at least 25% of the best-known designs in the Xilinx Virtex-E technology.
Proceedings of the 34th Design Automation Conference
In this paper, we present a new algorithm, named TurboSYN, for FPGA synthesis with retiming and pipelining to minimize the clock period for sequential circuits. For a target clock period, since pipelining can eliminate all critical I/O paths, but not critical loops, we concentrate on FPGA synthesis to eliminate the critical loops. We combine the combinational functional decomposition technique with retiming to perform the sequential functional decomposition, and incorporate it in the label computation of TurboMap [11] to eliminate all critical loops. The results show a significant improvement over the state-of-the-art FPGA mapping and resynthesis algorithms (1.72 times reduction on the clock period). Moreover, we develop a novel approach for positive loop detection which leads to over 10 to 50 times speedup of the algorithm. As a result, TurboSYN can optimize sequential circuits of over 10^4 gates and 10^3 flip-flops in reasonable time.
2001
Although run-time reconfigurable systems have been shown to achieve very high performance, the speedups over traditional microprocessor systems are limited by the cost of configuration of the hardware. Current reconfigurable systems suffer from a significant overhead due to the time it takes to reconfigure their hardware. In order to deal with this overhead, and increase the compute power of reconfigurable systems, it is important to develop hardware and software systems to reduce or eliminate this delay. In this paper, we explore the idea of configuration compression and develop algorithms for reconfigurable systems. These algorithms, targeted to Xilinx Virtex series FPGAs with minimal modification of hardware, can significantly reduce the amount of data that must be transferred during configuration. In this work we have extensively researched current compression techniques, including Huffman coding, arithmetic coding and LZ coding. We have also developed different algorithms targeting different hardware structures. Our readback algorithm allows certain frames to be reused as a dictionary and makes good use of the regularities within the configuration bitstream. In addition, we have developed frame reordering techniques that better use the regularities by shuffling the sequence of the configuration. We have also developed the wildcard approach that can be used for true partial reconfiguration. The simulation results demonstrate that a factor of 4 compression ratio can be achieved.
Field Programmable Logic and Applications (FPL), 2015 25th International Conference on
Partial reconfiguration is a technique used to increase the flexibility of an FPGA-based system by reprogramming parts of the system dynamically without interrupting the operation of the other modules. Despite the runtime benefits offered by partially reconfigurable (PR) systems, creating and storing partial bitstreams (PBs) are becoming major concerns for system architects when the numbers of reconfigurable partitions (RPs) and PR modules (PRMs) increase. It takes significant amount of time to generate the PBs for PR systems with large number of RPs and PRMs. More importantly, when the mapping relationship between PRMs and RPs is many-to-many, several almost-identical PBs of one PRM must be stored separately which leads to inefficient utilization of the memory storage. Therefore, bitstream relocation is drawing interests from the research community as a viable solution. Yet almost none of the works are able to demonstrate a coherent method to not only create relocatable PBs for complex and large PRMs in variable-size RPs but also how to do that automatically to free the designer from the tedious and error prone manual processes. In this paper, we propose a new technique to fill that gap. The method is successfully developed for Xilinx Virtex 7 devices using Vivado design tool flow.