2018, IEEE Transactions on Parallel and Distributed Systems
Field Programmable Gate Arrays (FPGAs) are reconfigurable architectures able to provide a good balance between energy efficiency and flexibility with respect to CPUs and ASICs. The main drawback of using FPGAs, however, is their time-consuming routing process, which significantly hinders designer productivity. An emerging solution to this problem is to accelerate the routing by parallelization. Existing attempts at parallelizing FPGA routing either do not fully exploit the parallelism or suffer from an excessive quality loss. Massive parallelism using GPUs has the potential to solve this issue but faces non-trivial challenges. To cope with these challenges, this paper explores a GPU-accelerated routing approach for FPGAs. We leverage the idea of problem size reduction by limiting single-net routing to a small subgraph rather than the entire graph, which further enables a GPU-friendly shortest path algorithm to be used in FPGA routing. We maintain convergence after problem size reduction through dynamic expansion of the routing resource subgraph, where the routing region of the subgraph is progressively expanded until a feasible solution is found for each net. In addition, we use a GPU platform to explore fine-grained single-net parallel routing in three ways, and we propose a hybrid approach combining static and dynamic parallelization for better speedup in FPGA routing. To exploit coarse-grained multi-net parallelization, we propose a dynamic programming-based partitioning algorithm that parallelizes the routing of multiple nets while generating routing results equivalent to the original single-net routing. Experimental results show that our proposed approach can provide an average speedup of about 21.53× on a single GPU with a tolerable loss in routing quality, and that it maintains a scalable speedup on large-scale routing resource graphs.
To our knowledge, this is the first work to demonstrate the effectiveness of GPU-accelerated routing for FPGAs.
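The subgraph-with-dynamic-expansion idea described above can be illustrated with a small sketch. The grid model, function names, and expansion policy below are illustrative assumptions, not the paper's implementation: each net is routed with a shortest-path search confined to a bounding box around its terminals, and the box is enlarged whenever no path is found.

```python
import heapq

def dijkstra_in_box(blocked, src, dst, x_lo, x_hi, y_lo, y_hi):
    """Shortest path on a unit-cost grid, restricted to a bounding-box subgraph."""
    dist = {src: 0}
    prev = {}
    pq = [(0, src)]
    while pq:
        d, (x, y) = heapq.heappop(pq)
        if (x, y) == dst:
            path, node = [dst], dst
            while node != src:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist.get((x, y), float("inf")):
            continue
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (x_lo <= nxt[0] <= x_hi and y_lo <= nxt[1] <= y_hi):
                continue  # neighbor lies outside the current subgraph
            if nxt in blocked or d + 1 >= dist.get(nxt, float("inf")):
                continue
            dist[nxt] = d + 1
            prev[nxt] = (x, y)
            heapq.heappush(pq, (d + 1, nxt))
    return None  # no path inside this box

def route_net(grid_w, grid_h, blocked, src, dst, margin=1, max_expand=3):
    """Route one net inside a box around its terminals, expanding on failure."""
    for step in range(max_expand + 1):
        m = margin + step  # progressively enlarge the routing region
        x_lo = max(0, min(src[0], dst[0]) - m)
        x_hi = min(grid_w - 1, max(src[0], dst[0]) + m)
        y_lo = max(0, min(src[1], dst[1]) - m)
        y_hi = min(grid_h - 1, max(src[1], dst[1]) + m)
        path = dijkstra_in_box(blocked, src, dst, x_lo, x_hi, y_lo, y_hi)
        if path is not None:
            return path
    return None  # net stays unrouted even after maximum expansion
```

Restricting the search to a subgraph bounds the work per net, which is what makes a GPU-friendly shortest-path kernel practical in the paper's setting; the serial Dijkstra here only mirrors the control flow.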
2010 International Conference on Field-Programmable Technology, 2010
We consider coarse and fine-grained techniques for parallel FPGA routing on modern multi-core processors. In the coarse-grained approach, sets of design signals are assigned to different processor cores and routed concurrently. Communication between cores is through the MPI (message passing interface) communications protocol. In the fine-grained approach, the task of routing an individual load pin on a signal is parallelized using threads. Specifically, as FPGA routing resources are traversed during maze expansion, delay calculation, costing and priority queue insertion for these resources execute concurrently. The proposed techniques provide deterministic/repeatable results. Moreover, the coarse and fine-grained approaches are not mutually exclusive and can be used in tandem. Results show that on a 4-core processor, the techniques improve router run-time by ∼2.1×, on average, with no significant impact on circuit speed performance or interconnect resource usage.
2017 IEEE International Conference on Computer Design (ICCD), 2017
Quantitative effects of Moore's Law have driven qualitative changes in FPGA architecture, applications, and tools. As a consequence, existing EDA tools take several hours or even days to implement applications on FPGAs. Routing is typically a very time-consuming process in the EDA design flow. While several attempts have accelerated this process through parallelization, they still do not provide a strong parallel scheme for FPGA routing. In this paper we introduce a dependency-aware parallel approach, named Bamboo, to accelerate routing for FPGAs. Using dependency detection, Bamboo partitions the nets into multiple subsets, where the nets in the same subset are independent and dependencies exist only between different subsets. Specifically, the independent nets in the same subset are routed in parallel, and the subsets are processed serially according to the original routing order. The partitioning problem is solved optimally using dynamic programming, and the parallelization is implemented via speculative parallelism on a single GPU. Experimental results show that our approach achieves an average speedup of 15.13× with negligible influence on routing quality. Most importantly, it effectively maintains deterministic results and always produces the same results as the serial version.
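The dependency-aware partitioning can be sketched as a small dynamic program. The representation below (a net as the set of routing resources it may touch, and independence as disjointness of those sets) is an assumption for illustration; the paper's actual cost model is not reproduced. The DP splits the ordered net list into the fewest contiguous batches whose nets are pairwise independent, so each batch routes internally in parallel while batches run serially, preserving the original routing order.

```python
def partition_nets(nets):
    """Split an ordered list of nets (each a set of resource ids) into the
    fewest contiguous batches of pairwise-independent nets, via DP over
    prefixes: dp[i] = min batches covering nets[0..i-1]."""
    n = len(nets)
    INF = float("inf")
    dp = [INF] * (n + 1)
    cut = [0] * (n + 1)   # cut[i]: start index of the last batch ending at i
    dp[0] = 0
    for i in range(1, n + 1):
        used = set()
        # grow a candidate batch nets[j..i-1] backwards while it stays independent
        for j in range(i - 1, -1, -1):
            if nets[j] & used:
                break  # net j conflicts with a later net in the batch
            used |= nets[j]
            if dp[j] + 1 < dp[i]:
                dp[i] = dp[j] + 1
                cut[i] = j
    # reconstruct the batches from the cut points
    batches, i = [], n
    while i > 0:
        batches.append(list(range(cut[i], i)))
        i = cut[i]
    return batches[::-1]
```

For example, nets `[{1,2}, {3,4}, {2,5}, {6}]` yield two batches, `[0,1]` and `[2,3]`, because net 2 shares resource 2 with net 0 and must wait for the first batch.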
IEICE Electronics Express, 2011
VLSI physical design algorithms are generally non-polynomial algorithms with very long runtimes. In this paper, we parallelize the Pathfinder global routing algorithm (a widely used FPGA routing algorithm) to run on multi-core systems and improve the runtime of the routing process. Our experimental results show that the runtime of the proposed multi-threaded global routing is reduced by 47.8% and 70.9% (on average) with 2 and 4 concurrent threads, respectively, on a quad-core processor without any quality degradation.
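For context, the Pathfinder algorithm that this work parallelizes negotiates congestion through a per-node cost of the standard form cost(n) = (b(n) + h(n)) · p(n). The sketch below shows that formulation; the parameter names and the exact shape of the present-congestion term are common textbook choices, not necessarily those used in this paper.

```python
def pathfinder_cost(base_cost, history_cost, occupancy, capacity, pres_fac):
    """Negotiated-congestion node cost, cost(n) = (b(n) + h(n)) * p(n).
    occupancy counts the nets currently using the node; the present penalty
    p kicks in when adding one more net would exceed the node's capacity."""
    over_use = max(0, occupancy + 1 - capacity)
    p = 1.0 + over_use * pres_fac  # present penalty, typically raised each iteration
    return (base_cost + history_cost) * p
```

Raising `pres_fac` across iterations gradually forces nets off shared resources, while `history_cost` accumulates on chronically congested nodes so the router learns to avoid them.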
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000
Multi-FPGA systems (MFSs) are used as custom computing machines, logic emulators and rapid prototyping vehicles. A key aspect of these systems is their programmable routing architecture, which is the manner in which wires, FPGAs and Field-Programmable Interconnect Devices (FPIDs) are connected. Several routing architectures for MFSs have been proposed, and previous research has shown that the partial crossbar is one of the best existing architectures [Kim96] [Khal97]. In this paper we propose a new routing architecture, called the Hybrid Complete-Graph and Partial-Crossbar (HCGP), which has superior speed and cost compared to a partial crossbar. The new architecture uses both hardwired and programmable connections between the FPGAs. We compare the performance and cost of the HCGP and partial crossbar architectures experimentally, by mapping a set of 15 large benchmark circuits into each architecture. A customized set of partitioning and inter-chip routing tools was developed, with particular attention paid to architecture-appropriate inter-chip routing algorithms. We show that the cost of the partial crossbar (as measured by the number of pins on all FPGAs and FPIDs required to fit a design) is on average 20% more than that of the new HCGP architecture, and as much as 25% more. Furthermore, the critical path delay for designs implemented on the partial crossbar was on average 20% more than on the HCGP architecture, and up to 43% more. Using our experimental approach, we also explore a key architecture parameter associated with the HCGP architecture, the proportion of hardwired connections versus programmable connections, to determine its best value.
2012
Parallelization of VLSI routing algorithms is one of the challenging problems in VLSI physical design, due to the large number of nets as well as the shared routing resources that create data dependencies among concurrent tasks. In this paper, VLSI maze routing on GPGPUs is proposed to improve runtime performance. We report up to a 3× performance gain, with an average runtime improvement of 25% over CPU-based maze routing. Routing quality, including wirelength and overflow, is better across all benchmarks compared with the CPU baseline. The solutions also scale well as the problem size increases.
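Maze (Lee-style) routing expands a breadth-first wavefront from the source until the sink is reached, and it is the per-wavefront expansion that maps naturally onto a GPU: every frontier cell can be processed by its own thread. The serial sketch below, with an illustrative grid model of my own choosing, makes that wavefront structure explicit.

```python
def maze_route(grid_w, grid_h, blocked, src, dst):
    """Lee-style maze routing: expand a BFS wavefront until the sink is hit.
    Cells within one wavefront are independent, which a GPU can exploit."""
    prev = {src: None}   # also serves as the visited set
    frontier = [src]
    while frontier:
        next_frontier = []
        for x, y in frontier:  # one wavefront; data-parallel on a GPU
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if not (0 <= nxt[0] < grid_w and 0 <= nxt[1] < grid_h):
                    continue
                if nxt in blocked or nxt in prev:
                    continue
                prev[nxt] = (x, y)
                if nxt == dst:  # backtrace the shortest path
                    path, node = [], dst
                    while node is not None:
                        path.append(node)
                        node = prev[node]
                    return path[::-1]
                next_frontier.append(nxt)
        frontier = next_frontier
    return None  # sink unreachable
```

Because the expansion is breadth-first on a unit-cost grid, the first path found is a shortest one; a GPU version would replace the inner loop with one thread per frontier cell and atomic marking of `prev`.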
1997
Multi-FPGA systems (MFSs) are used as custom computing machines, logic emulators and rapid prototyping vehicles. A key aspect of these systems is their programmable routing architecture, the manner in which wires, FPGAs and Field-Programmable Interconnect Devices (FPIDs) are connected. In this paper we present an experimental study for evaluating and comparing two commonly used routing architectures for multi-FPGA systems: the 8-way mesh and the partial crossbar. A set of 15 large benchmark circuits is mapped into these architectures, using a customized set of partitioning, placement and inter-chip routing tools. Particular attention was paid to the development of appropriate inter-chip routing algorithms for each architecture. The architectures are compared on the basis of cost (the total number of pins required in the system) and speed (determined by post inter-chip routing critical path delay). The results show that the 8-way mesh architecture has high cost, poor routability and low speed, while the partial crossbar architecture gives relatively low cost, good routability and speed. Using our experimental approach, we also explore a key architecture parameter associated with the partial crossbar architecture, and its impact on the routability and speed of the architecture. We briefly describe an inter-chip router for the partial crossbar architecture, called PCROUTE, that gives excellent routability and speed results for real benchmark circuits.
Proceedings of the 2022 ACM/IEEE Workshop on Machine Learning for CAD
Field Programmable Gate Array (FPGA) routing is one of the most time-consuming tasks within the FPGA design flow, requiring hours and even days to complete for some large industrial designs. This is becoming a major concern for FPGA users and tool developers. This paper proposes a simple, yet effective, framework that reduces the runtime of PathFinder-based routers. A supervised Machine Learning (ML) algorithm is developed to forecast costs (from the placement phase) associated with possible congestion and hot spot creation in the routing phase. These predicted costs are used to guide the router to avoid highly congested regions while routing nets, thus reducing the total number of iterations and rip-up and reroute operations involved. Results obtained indicate that the proposed ML approach achieves on average a 43% reduction in the number of routing iterations and a 28.6% reduction in runtime when implemented in the state-of-the-art enhanced PathFinder algorithm.
2013 International Conference on Field-Programmable Technology (FPT), 2013
The FPGA's interconnection network not only occupies a larger portion of the total silicon area than the logic available on the FPGA, it also contributes the majority of the delay and power consumption. It is therefore essential that routing algorithms be as efficient as possible. In this work the connection router is introduced. It is capable of partially ripping up and rerouting the routing trees of nets. To achieve this, the main congestion loop rips up and reroutes connections instead of nets, which allows the connection router to converge much faster to a solution. The connection router is compared with the VPR directed search router on the VTR benchmarks on a modern commercial FPGA architecture. It is able to find routing solutions 4.4% faster for a relaxed routing problem and 84.3% faster for hard instances of the routing problem. Given the same amount of time as the VPR directed search, the connection router is able to find routing solutions with 5.8% fewer tracks per channel.
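The connection router's key move, ripping up individual source-sink connections rather than whole nets, can be sketched in a few lines. The data layout (a connection as a dict with a path) and the `route_fn` callback are illustrative assumptions, not the paper's code.

```python
def rip_up_and_reroute(connections, congested, route_fn):
    """One pass of connection-based rip-up: only source-sink connections whose
    path crosses a congested resource are rerouted; all other connections,
    including other connections of the same net, keep their existing routing."""
    ripped = 0
    for conn in connections:
        if any(node in congested for node in conn["path"]):
            conn["path"] = route_fn(conn["src"], conn["dst"])  # reroute this one
            ripped += 1
    return ripped  # number of connections torn up this pass
```

Compared with net-based rip-up, less routing is discarded per congestion iteration, which is the intuition behind the faster convergence reported above.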
ArXiv, 2020
Routing of the nets in the Field Programmable Gate Array (FPGA) design flow is one of the most time-consuming steps. Although Versatile Place and Route (VPR), a commonly used algorithm for this purpose, routes effectively, it is slow in execution. One way to accelerate this design flow is parallelization. Since VPR is intrinsically sequential, a set of parallel algorithms has recently been proposed for this purpose (ParaLaR and ParaLarPD). These algorithms formulate the routing process as a Linear Program (LP) and solve it using Lagrange relaxation, the sub-gradient method, and the Steiner tree algorithm. Of the many metrics available to check the effectiveness of routing, ParaLarPD, an improved version of ParaLaR, suffers from large violations in the constraints of the LP problem (which relate to the minimum channel width metric), as well as an easily measurable critical path delay metric that can be improved further. In this paper, we introduce a s...
ACM Transactions on Design Automation of Electronic Systems, 2002
Incremental physical CAD is encountered frequently in the so-called engineering change order (ECO) process, in which design changes are made, typically late in the design process, in order to correct logical and/or technological problems in the circuit. Incremental routing is a significant part of an incremental physical design methodology. Typically, after an ECO process a small portion of the circuit netlist is changed, and in order to capitalize on the enormous resources and time already spent on routing the circuit it is desirable to reroute only the ECO-affected portion of the circuit, while minimizing any routing changes in the much larger unaffected part. Incremental rerouting also needs to be fast and to use available routing resources effectively. In this article, we develop a complete incremental routing methodology for FPGAs using a novel approach called bump and refit (B&R). The basic B&R idea in our algorithms (originally proposed in the much simpler context of extending some nets by a segment for the purpose of fault tolerance) is to rearrange portions of some existing nets on other tracks within their current channels in order to find valid routings for the new/modified nets, without requiring any extra routing resources and with little effect on the electrical properties of existing nets. Here we significantly extend the B&R concept to global and detailed incremental routing for FPGAs with complex switchboxes (SBox's) such as those in Lucent's ORCA and Xilinx's Virtex series. We introduce new concepts such as a B&R cost in global routing and the optimal subnet set to relocate for each bumped net (determined using an efficient dynamic programming formulation).
We developed optimal and near-optimal algorithms (called Subsec B&R and Subnet B&R, respectively) to find incremental routing solutions using the B&R paradigm in complex FPGAs (e.g., Lucent's ORCA FPGA) with i-to-j SBox's, as well as an optimal version, Fullnet B&R, for the VPR architecture from the University of Toronto using the simpler i-to-i SBox's. We compared our algorithms (simply called B&R when no distinction needs to be made between our versions) to two recent incremental routing techniques, Standard (Std) and Rip-up&Reroute (R&R), and to Lucent's A PAR routing tool and the University of Toronto's VPR router used in complete rerouting modes. Experimental results for the ORCA show that B&R is 10 to 20 times faster than complete rerouting using A PAR, and that B&R is also nearly 27% faster and yields new nets with nearly 10% smaller lengths compared to previous incremental routers. Furthermore, B&R routers do not change either the lengths or topologies of existing nets, a significant advantage in ECO applications, in contrast to R&R, which increases the length of ripped-up nets by an average of 8.75 to 13.6%. Experimental results for the VPR architecture are dominated by the significantly larger (in many cases, orders of magnitude greater) number of nets left unrouted by Std and R&R compared to B&R, which highlights the much greater efficacy of B&R-based incremental routing. However, B&R is significantly slower than the other two incremental routers, although on an absolute scale it is quite fast for two of the four cases we simulated; in one case, it is about 25 times faster than VPR used in full rerouting mode.
The relative slowness of B&R for the VPR architecture arises from the fact that we used i-to-i SBox's, which force each net to be routed on the same track, thus causing significantly more bumpings and searches for rearranged solutions compared to i-to-j SBox's, where a net can be routed on different interconnected tracks to minimize the amount of bumping (as we did for the ORCA). Since modern FPGAs generally have the latter type of SBox's, B&R would be fast as well as very effective on them.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2000
In this paper, the routing problem for two-dimensional (2-D) field programmable gate arrays of a Xilinx-like architecture is studied. We first propose an efficient one-step router that makes use of the main characteristics of the architecture. We then propose an improved approach coupling two greedy heuristics designed to avoid an undesired decaying effect: dramatically degraded router performance in the near-completion stages. This phenomenon is commonly observed in results produced by conventional deterministic routing strategies using a single optimization cost function. Consequently, our results are significantly improved in both the number of routing tracks and the number of routing segments, by applying only low-complexity algorithms. On the tested MCNC and industrial benchmarks, the total number of tracks used by the best known two-step global/detailed router is 28% more than that used by our proposed method.
2005
The A* algorithm is a well-known path-finding technique that is used to speed up FPGA routing. Previously published A*-based techniques are either targeted to a class of architecturally similar devices, or require prohibitive amounts of memory to preserve architecture adaptability. This work presents architecture-adaptive A* techniques that require significantly less memory than previously published work. Our techniques are able to produce routing runtimes that are within 7% (on an island-style architecture) and 9% better (on a hierarchical architecture) than targeted heuristic techniques. Memory improvements range between 30× (island-style) and 140× (hierarchical architecture).
Proceedings of the …, 1994
We propose a general framework for FPGA routing, which allows simultaneous optimization of multiple competing objectives under a smooth designer-controlled tradeoff. Our approach is based on a new multi-weighted graph formulation, enabling a theoretical performance characterization as well as a practical implementation. Our FPGA router is architecture-independent, computationally efficient, and performs well on industrial benchmarks.
2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017
Routing is a time-consuming process in the FPGA design flow, and parallelization is a promising direction for accelerating it. While synchronous parallelization can converge to a feasible solution, the ideal speedup is rarely achieved due to excessive communication overheads. Asynchronous parallelization can provide an almost linear speedup, but it is difficult to converge within a limited number of iterations due to net dependencies. In this paper we propose SAPRoute, which coordinates synchronous and asynchronous parallelism in a distributed multiprocessing environment to accelerate routing for FPGAs. The objective is to maximize the speedup of the parallel routing algorithm while satisfying the convergence requirement. To the best of our knowledge, this is the first work to study the impact of synchronization and asynchronization during parallelization. Experimental results show that our approach has negligible explicit synchronization overhead and achieves significant speedup improvement over a set of commonly used benchmarks. Notably, SAPRoute produces a speedup of 24.27× on average compared to the default serial solution.
2011 21st International Conference on Field Programmable Logic and Applications, 2011
We propose a new FPGA routing approach that, when combined with a low-cost architecture change, results in a 34% reduction in router run-time, at the cost of a 3% area overhead, with no increase in critical path delay. Our approach begins with traditional PathFinder-style routing, which we run on a coarsened representation of the routing architecture. This leads to fast generation of a partial routing solution where signals are assigned to groups of wire segments rather than individual wire segments. A Boolean satisfiability (SAT)-based stage follows, generating a legal routing solution from the partial solution. Our approach points to a new research direction: reducing FPGA CAD run-time by exploring FPGA architectures and algorithms together.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1997
This paper presents a new performance and routability driven router for symmetrical array based field programmable gate arrays (FPGA's). The objectives of our proposed routing algorithm are twofold: 1) improving the routability of the design (i.e., minimizing the maximum required routing channel density) and 2) improving the overall performance of the design (i.e., minimizing the overall path delay). Initially, nets are routed sequentially according to their criticalities and routabilities. The nets/paths violating the routing-resource and timing constraints are then resolved iteratively by a rip-up-and-rerouter, which is guided by a simulated evolution based optimization technique. The proposed algorithm considers the path delays and routability throughout the entire routing process. Experimental results show that our router can significantly improve routability and reduce delay over many existing routing algorithms.
Proceedings. 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2002
To fully realize the benefits of partial and rapid reconfiguration of field-programmable devices, we often need to dynamically schedule computing tasks and generate instance-specific configurations-new graphs which must be routed during program execution. Consequently, route time can be a significant overhead cost reducing the achievable net benefits of dynamic configuration generation. By adding hardware to accelerate routing, we show that it is possible to compute routes in one thousandth the time of a traditional, software router and achieve routes that are within 5% of the state-of-the-art offline routing algorithms for a sample set of application netlists and within 25% for a set of difficult synthetic benchmarks. We further outline how strategic use of parallelism can allow the total route time to scale substantially less than linearly in graph size. We detail the source of the benefits in our approach and survey a range of options for hardware assistance that vary from a speedup of over 10× with modest hardware overhead to speedups in excess of 1000×.
2004
We have developed a hop-based complete detailed router, ROAD-HOP, that uses the Bump & Refit (B&R) ...
2009 International Conference on Field-Programmable Technology, 2009
This paper optimizes the routing structure for hybrid FPGAs, in which high I/O density coarse-grained units are embedded within fine-grained logic. This significantly increases the routing resource requirement between elements. We investigate the routing demand of hybrid FPGAs over a set of domain-specific applications. The trade-offs in delay, area and routability of the separation distance between coarse-grained blocks are studied. The effects of adding routing switches to the coarse-grained blocks and using wider channels near them to meet extra routing demand are examined. Our optimized architectures are compared to the existing column-based architecture. The results show that (1) track usage at the edge of the embedded blocks is 44%, (2) both separating the embedded blocks and adding switches to them can improve area and delay performance by 48.4% compared to the column-based FPGA architecture, and (3) a wider channel width reduces the area of a highly congested system by 34.9%, but it cannot further improve a system that already has separated embedded blocks and additional switches on them.
2010
Throughput and programmability have always been the central, but generally conflicting, concerns in modern IP router design. Current high-performance routers depend on proprietary hardware solutions, which make it difficult to adapt to ever-changing network protocols. On the other hand, software routers offer the best flexibility and programmability, but can only achieve a throughput one order of magnitude lower. Modern GPUs offer significant computing power, and their data-parallel computing model matches well the typical patterns of packet processing on routers. Accordingly, in this research we investigate the potential of CUDA-enabled GPUs for IP routing applications. As a first step toward exploring the architecture of a GPU-based software router, we developed GPU solutions for a series of core IP routing applications such as IP routing table lookup and pattern matching. For the deep packet inspection application, we implemented both a Bloom-filter-based string matching algorithm and a finite-automata-based regular expression matching algorithm. A GPU-based routing table lookup solution is also proposed in this work. Experimental results proved that GPUs can accelerate routing processing by one order of magnitude. Our work suggests that, with proper architectural modifications, GPU-based software routers could deliver significantly higher throughput than previous CPU-based solutions.