2007
Multiprocessors enable parallel execution of a single large application to achieve a performance improvement. An application is split at the instruction, data, or task level (depending on the granularity), such that the overhead of partitioning is minimal. Parallelization for multiprocessors is mostly restricted to a fixed granularity. Reconfiguration enables architectural variations that allow multiple granularities of operation within a multiprocessor. This adaptability improves resource utilization compared to a fixed organization.
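As an illustration of the granularity trade-off described above, here is a minimal sketch (not taken from the paper; all names and numbers are invented) that splits a workload at the data level across a fixed number of cores and estimates the resulting speedup once a per-partition overhead is charged.

```python
# Illustrative sketch only: estimates the speedup of data-level partitioning
# when each extra partition adds a fixed splitting/merging overhead.

def data_level_speedup(total_work, n_cores, overhead_per_partition):
    """Ideal parallel time plus partitioning overhead, vs. sequential time."""
    parallel_time = total_work / n_cores + n_cores * overhead_per_partition
    return total_work / parallel_time

if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        s = data_level_speedup(total_work=1000.0, n_cores=cores,
                               overhead_per_partition=5.0)
        print(f"{cores} cores -> estimated speedup {s:.2f}")
```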
ACM Transactions on Reconfigurable Technology and Systems, 2010
In multiprocessors, performance improvement is typically achieved by exploring parallelism with fixed granularities, such as instruction-level, task-level, or data-level parallelism. We introduce a new reconfiguration mechanism that facilitates variations in these granularities in order to optimize resource utilization in addition to performance improvements. Our reconfigurable multiprocessor QuadroCore combines the advantages of reconfigurability and parallel processing. In this paper, a unified hardware-software approach for the design of our QuadroCore is presented. This design-flow is enabled via compiler-driven reconfiguration, which matches application-specific characteristics to a fixed set of architectural variations. A special reconfiguration mechanism has been developed that alters the architecture within a single clock cycle.
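The compiler-driven reconfiguration idea can be pictured roughly as follows; this is a hypothetical sketch, not QuadroCore's actual mode set or selection heuristic, and the mode names and thresholds are invented.

```python
# Hypothetical sketch of compiler-driven mode selection: the compiler profiles
# an application region and picks one of a small, fixed set of architectural
# variations. Mode names and thresholds are illustrative only.

MODES = ("lockstep_simd", "coupled_mimd", "independent_mimd")

def select_mode(data_parallel_fraction, sync_per_1000_insns):
    if data_parallel_fraction > 0.8:
        return "lockstep_simd"        # fine-grained, identical work per core
    if sync_per_1000_insns > 10:
        return "coupled_mimd"         # frequent synchronisation between tasks
    return "independent_mimd"         # coarse-grained, loosely coupled tasks

print(select_mode(data_parallel_fraction=0.9, sync_per_1000_insns=2))
```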
2010
The advantage of multiprocessors is the performance speedup obtained through processor-level parallelism. Similarly, the advantage of reconfigurable architectures is the flexibility for application-specific adaptability. To benefit from both, we present a reconfigurable multiprocessor template that combines the parallelism of multiprocessors with the flexibility of reconfigurable architectures. A fast, single-cycle, resource-efficient, run-time reconfiguration scheme accelerates customisations in the reconfigurable multiprocessor template. Based on this methodology, a four-core multiprocessor called QuadroCore has been implemented on UMC's 90 nm standard cells and on Xilinx's FPGA. QuadroCore is customisable and adapts to variations in the granularity of parallelism, the amount of communication between tasks, and the frequency of synchronisation. To validate the advantages of this approach, a diverse set of applications has been mapped onto the QuadroCore multiprocessor. Experimental results show speedups in the range of 3 to 11 in comparison to a single processor. In addition, energy savings of up to 30% were noted on account of reconfiguration. Furthermore, to steer application mapping based on power considerations, an instruction-level power model has been developed. Using this model, power-driven instruction selection introduces energy savings of up to 70% in the QuadroCore multiprocessor.
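The instruction-level power model mentioned above can be thought of along these lines; the sketch below is purely illustrative, with an invented per-opcode energy table rather than the model characterised in the paper.

```python
# Illustrative instruction-level energy model: the energy of a sequence is the
# sum of per-opcode costs. The cost table below is invented for the example; a
# real model would be characterised from measurements on the target technology.

ENERGY_PJ = {"add": 1.0, "mul": 3.5, "shift": 0.8, "load": 6.0, "store": 6.5}

def sequence_energy(instructions):
    return sum(ENERGY_PJ[op] for op in instructions)

def pick_lower_energy(*candidates):
    """Power-driven selection: choose the candidate sequence costing the least energy."""
    return min(candidates, key=sequence_energy)

# e.g. multiply-by-4 implemented as a multiply vs. as a shift
print(pick_lower_energy(["mul"], ["shift"]))
```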
International Journal of …, 2008
The coarse-grained reconfigurable architecture ADRES (Architecture for Dynamically Reconfigurable Embedded Systems) and its compiler offer high instruction-level parallelism (ILP) to applications by means of a sparsely interconnected array of functional units and register files. As high-ILP architectures achieve only low parallelism when executing partially sequential code segments, a consequence of Amdahl's law, this paper proposes to extend ADRES to MT-ADRES (Multi-Threaded ADRES) to also exploit thread-level parallelism. On MT-ADRES architectures, the array can be partitioned into multiple smaller arrays that can execute threads in parallel. Because the partitioning can be changed dynamically, this extension provides more flexibility than a multi-core approach. This article presents details of the enhanced architecture and results obtained from an MPEG-2 decoder implementation that exploits a mix of thread-level and instruction-level parallelism.
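For reference, the Amdahl's-law bound the authors invoke is easy to state; the small sketch below evaluates it for an illustrative case (20% sequential code on a 16-slot array), with numbers chosen only for the example.

```python
# Amdahl's law: with a fraction p of the work parallelisable over n units,
# the overall speedup is bounded by 1 / ((1 - p) + p / n). This is the effect
# that partitioning the array into thread-level slices is meant to mitigate.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 16 issue slots, 20% sequential code limits the speedup to 4x.
print(round(amdahl_speedup(p=0.8, n=16), 2))
```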
Microprocessors and Microsystems, 2012
In this paper, we address the problem of organization and management of threads on a multithreading custom computing machine composed of a General Purpose Processor (GPP) and Reconfigurable Coprocessors. We target higher portability, flexibility, and performance of the prospective design solutions by means of a strictly architectural approach. Our proposal to improve overall system performance is twofold. First, we provide architectural mechanisms to accelerate applications by supporting computationally intensive kernels with reconfigurable hardware accelerators. Second, we propose an infrastructure capable of facilitating thread management. Besides the architectural and microarchitectural extensions of the reconfigurable computing system, we also propose a hierarchical programming model. The model supports balanced and performance-efficient SW/HW co-execution of multithreading applications. We demonstrate that our approach provides better performance-portability and performance-flexibility trade-off characteristics compared to other state-of-the-art proposals. The experimental results, based on real applications, suggest average system speedups between 1.2 and 19.6. On a single-threaded synthetic benchmark, we achieve average speedups between 8.5 and 129; on a multithreaded synthetic benchmark, the achieved average speedup is between 1.3 and 7.3.
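A rough way to picture the SW/HW co-execution described above is shown below; this is a minimal sketch assuming a hypothetical accelerator interface (configured_accelerators, execute), not the authors' actual infrastructure or programming model.

```python
# Minimal sketch of SW/HW co-execution on a GPP plus reconfigurable
# co-processors: each kernel is dispatched to a hardware accelerator when one
# has been configured for it, otherwise it runs as an ordinary software thread.
# The accelerator interface used here is hypothetical.

from concurrent.futures import ThreadPoolExecutor

configured_accelerators = {}          # kernel name -> accelerator object

def dispatch(pool, kernel_name, sw_func, data):
    acc = configured_accelerators.get(kernel_name)
    if acc is not None:
        return pool.submit(acc.execute, data)      # HW path (co-processor)
    return pool.submit(sw_func, data)              # SW fallback on the GPP

with ThreadPoolExecutor(max_workers=4) as pool:
    fut = dispatch(pool, "fir_filter", sw_func=sum, data=[1, 2, 3])
    print(fut.result())
```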
Lecture Notes in Computer Science, 2011
In this paper, we address the organization and management of threads on a multithreading custom computing machine composed of a General Purpose Processor (GPP) and Reconfigurable Co-Processors. Our proposal to improve overall system performance is twofold. First, we provide architectural mechanisms to accelerate applications by supporting computationally intensive kernels with reconfigurable hardware accelerators. Second, we propose an infrastructure capable of facilitating thread management. The latter can be employed by, e.g., RTOS kernel services. Besides the architectural and microarchitectural extensions of the reconfigurable computing system, we also propose a hierarchical programming model. The model supports balanced and performance-efficient SW/HW co-execution of multithreading applications. Our experimental results based on real applications suggest average system speedups between 1.2 and 19.6 times; based on synthetic benchmarks, the achieved speedups are between 1.3 and 29.8 times compared to software-only implementations.
2009
We present an efficient framework for dynamic reconfiguration of application-specific custom instructions. A key component of this framework is an iterative algorithm for temporal and spatial partitioning of the loop kernels. Our algorithm maximizes the performance gain of an application while taking into consideration the dynamic reconfiguration cost. It selects the appropriate custom instructions for the loops and groups them into one or more configurations. We model the temporal partitioning problem as a k-way graph partitioning problem. A dynamic-programming-based solution is used for the spatial partitioning. Comprehensive experimental results indicate that our iterative partitioning algorithm is highly scalable while producing optimal or near-optimal (99% of the optimal) performance gain.

1 Introduction
Current-generation embedded system designs are characterized by the increasing demand for higher performance under stringent time-to-market constraints. In this context, application-specific customizable processor cores strike the right balance between performance and design effort. A customizable processor is, in general, configurable with respect to its micro-architectural parameters. More importantly, a customizable processor may support application-specific extensions of the core instruction set. Custom instructions encapsulate the frequently occurring computation patterns in an application. They are implemented as custom functional units (CFUs) in the datapath of the existing processor core. CFUs improve performance and energy consumption through parallelization and chaining of operations. Some examples of commercial customizable processors include Lx [14], the ARC core [2], Xtensa [15], and Stretch S5 [3].
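One way to picture the spatial-partitioning step is as a knapsack-style selection of custom instructions under an area budget; the sketch below is a generic dynamic-programming formulation with invented areas and gains, not the authors' algorithm.

```python
# Sketch of spatial partitioning cast as a 0/1 knapsack: choose the subset of
# candidate custom instructions whose total area fits the fabric and whose
# summed performance gain is maximal. Areas and gains are illustrative.

def best_subset(candidates, area_budget):
    """candidates: list of (name, area, gain). Returns (best_gain, chosen_names)."""
    dp = [(0, [])] * (area_budget + 1)        # dp[a] = best (gain, chosen) within area a
    for name, area, gain in candidates:
        for a in range(area_budget, area - 1, -1):
            g, chosen = dp[a - area]
            if g + gain > dp[a][0]:
                dp[a] = (g + gain, chosen + [name])
    return dp[area_budget]

print(best_subset([("ci_mac", 40, 12), ("ci_sad", 70, 20), ("ci_crc", 30, 9)],
                  area_budget=100))
```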
2011
We present a run-time system for a multi-grained reconfigurable processor in order to provide a dynamic trade-off between performance and available area budgets for both fine- as well as coarse-grained reconfigurable fabrics as part of one reconfigurable processor. Our run-time system is the first implementation of its kind that dynamically selects and steers a performance-maximizing multi-grained instruction set under run-time varying constraints. It achieves a performance improvement of more than 2× compared to state-of-the-art run-time systems for multi-grained architectures. To elaborate the benefits of our approach further, we also compare it with offline- and online-optimal instruction-set selection schemes.
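A run-time re-selection of accelerated instructions under a varying area budget might look roughly like the following; this greedy sketch (with invented candidates) is only illustrative and is simpler than the paper's selection scheme.

```python
# Hypothetical run-time selection sketch: when the available area budget
# changes, re-pick accelerated instructions greedily by gain per unit area.
# A real run-time system would weigh further constraints; numbers are invented.

def reselect(candidates, area_budget):
    chosen, used = [], 0
    for name, area, gain in sorted(candidates, key=lambda c: c[2] / c[1], reverse=True):
        if used + area <= area_budget:
            chosen.append(name)
            used += area
    return chosen

candidates = [("ci_fft", 60, 30), ("ci_aes", 50, 20), ("ci_sad", 30, 18)]
for budget in (140, 80):              # budget shrinks while the system runs
    print(budget, reselect(candidates, budget))
```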
Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems - CASES '01, 2001
The rapid growth of silicon densities has made it feasible to deploy reconfigurable hardware as a highly parallel computing platform. However, in most cases the application needs to be programmed in hardware description or assembly languages, whereas most application programmers are familiar with the algorithmic programming paradigm. SA-C has been proposed as an expression-oriented language designed to implicitly express data-parallel operations. Morphosys is a reconfigurable system-on-chip architecture that supports a data-parallel, SIMD computational model. This paper describes a compiler framework to analyze SA-C programs, perform optimizations, and map the application onto the Morphosys architecture. The mapping process involves operation scheduling, resource allocation and binding, and register allocation in the context of the Morphosys architecture. The execution times of some compiled image-processing kernels achieve up to a 42x speed-up over an 800 MHz Pentium III machine.
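The mapping steps listed above (scheduling under resource limits) can be illustrated with a generic list-scheduling pass; the sketch below is not the SA-C/Morphosys mapper, just a minimal example of scheduling a small dataflow graph onto a limited number of functional units.

```python
# Generic list-scheduling sketch (not the actual SA-C/Morphosys mapper):
# schedule a dataflow graph onto a limited number of functional units, one
# cycle at a time, respecting data dependences.

def list_schedule(deps, num_units):
    """deps: {op: set of ops it depends on}. Returns {op: cycle}."""
    scheduled, cycle = {}, 0
    while len(scheduled) < len(deps):
        ready = [op for op in deps
                 if op not in scheduled and deps[op] <= set(scheduled)]
        for op in ready[:num_units]:          # at most num_units ops per cycle
            scheduled[op] = cycle
        cycle += 1
    return scheduled

deps = {"load_a": set(), "load_b": set(),
        "mul": {"load_a", "load_b"}, "add": {"mul"}, "store": {"add"}}
print(list_schedule(deps, num_units=2))
```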
International Journal of Electronics, 2007
In this paper, we target a Reconfigurable Instruction Set Processor (RISP) that tightly couples a coarse-grain Reconfigurable Functional Unit (RFU) to a RISC processor. Furthermore, the architecture is supported by a flexible development framework. By allowing the definition of alternative architectural parameters, the framework can be used to explore the design space and fine-tune the architecture at design time. Initially, two architectural enhancements, namely partial predicated execution and virtual opcode, are proposed, and the extensions made to the architecture and the framework to support them are presented. To evaluate these enhancements, kernels from the multimedia domain are considered and an exploration is performed to derive an appropriate instance of the architecture. The efficiency of the derived instance and the proposed enhancements is evaluated using an MPEG-2 encoder application.
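Partial predicated execution, one of the two enhancements above, can be illustrated in a few lines; the sketch below simply shows an if/else rewritten as a select on a predicate, with all names invented.

```python
# Illustrative view of (partial) predicated execution: instead of branching,
# both sides of a short if/else are evaluated and a select picks the result
# based on the predicate, which keeps a reconfigurable datapath busy.

def select(pred, a, b):
    return a if pred else b            # maps onto a conditional-move/select op

def saturate_branchy(x, limit):
    if x > limit:
        return limit
    return x

def saturate_predicated(x, limit):
    return select(x > limit, limit, x)

assert saturate_branchy(300, 255) == saturate_predicated(300, 255) == 255
```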
Architecture of Computing Systems – ARCS 2016, 2016
2008 Canadian Conference on Electrical and Computer Engineering, 2008
International Journal of Reconfigurable Computing, 2012
2011 International Conference on Reconfigurable Computing and FPGAs, 2011
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010
IFIP – The International Federation for Information Processing, 2009
Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012, 2012
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015, 2015
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008