2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
8 pages
In this paper, we present the design and implementation of an open-source reconfigurable very long instruction word (VLIW) multiprocessor system. This processor is implemented as a softcore on a field-programmable gate array (FPGA) and its instruction set architecture (ISA) is based on the Lx/ST200 ISA. This multiprocessor design is based on our earlier ρ-VEX processor design. Since the ρ-VEX processor is a parameterized processor, our multiprocessor design is also parameterized. By utilizing a freely available compiler and simulator in our development framework, we are able to optimize our design and map any application written in C to our multiprocessor system. This VLIW multiprocessor can exploit the data-level as well as instruction-level parallelism inherent in an application to make its execution faster. More importantly, we achieve our results while saving expensive FPGA area through the sharing of resources. The results show that, for applications having data-level and instruction-level parallelism, our dual-processor system (with shared resources) achieves two times better performance than a uni-processor system or a 2-cluster processor system.
IEEE Journal of Solid-state Circuits, 2003
This paper describes a new architecture for embedded reconfigurable computing, based on a very long instruction word (VLIW) processor enhanced with an additional run-time configurable datapath. The reconfigurable unit is tightly coupled with the processor, featuring an application-specific instruction-set extension. Mapping computation-intensive algorithmic portions on the reconfigurable unit allows a more efficient elaboration, thus leading to an improvement in both timing performance and power consumption. A test chip has been implemented in a standard 0.18-μm CMOS technology. The test of a signal processing algorithmic benchmark showed speedups ranging from 4.3 to 13.5 and energy consumption reduced by up to 92%.
2005
The X4CP32 is an architecture that combines the parallel and reconfigurable paradigms. It consists of a grid of Reconfigurable and Programming Units (RPUs), each one containing 4 Cells (including a microprocessor in each Cell), responsible for all the processing and program flow. This paper presents architectural modifications in the X4CP32 in order to increase its performance. The RPU was implemented according to the VLIW (Very Long Instruction Word) methodology, and the Cells were redesigned with a pipelined implementation. These improvements raised the maximum IPC of the RPU from 0.5 to 4 with an area overhead of 26%. To evaluate the new architecture, versions of the 2D Discrete Cosine Transform, Montgomery Modular Multiplication and Color Space Conversion were mapped, using the baseline architecture and the pipelined VLIW architecture.
Proceedings Tenth International Conference on VLSI Design
Back-end processors have conventionally been used to speed up only a specific set of compute-intensive functions. Such co-processors are generally "hardwired" and cannot be used for a new function. In this paper, we discuss the design considerations and parameters of a general-purpose reconfigurable co-processor. We also propose an architecture for such a co-processor and discuss its implementation issues. The concept of a reconfigurable co-processor has become feasible because of the availability of static RAM based FPGAs. The key architectural features of our system are: scalable topology, shared memory space between the main processor and co-processor, and efficient reconfigurability. A small prototype of the system has been implemented. We have demonstrated a speedup of two orders of magnitude using our system over pure software solutions for a set of compute-intensive applications.
2010
The advantage in multiprocessors is the performance speedup obtained with processor-level parallelism. Similarly, the flexibility for application-specific adaptability is the advantage in reconfigurable architectures. To benefit from both these architectures, we present a reconfigurable multiprocessor template that combines parallelism in multiprocessors and flexibility in reconfigurable architectures. A fast, single cycle, resource-efficient, run-time reconfiguration scheme accelerates customisations in the reconfigurable multiprocessor template. Based on this methodology, a four-core multiprocessor called QuadroCore has been implemented on UMC's 90nm standard cells and on Xilinx's FPGA. QuadroCore is customisable and adapts to variations in the granularity of parallelism, the amount of communication between tasks, and the frequency of synchronisation. To validate the advantages of this approach, a diverse set of applications has been mapped onto the QuadroCore multiprocessor. Experimental results show speedups in the range of 3 to 11 in comparison to a single processor. In addition, energy savings of up to 30% were noted on account of reconfiguration. Furthermore, to steer application mapping based on power considerations, an instruction-level power model has been developed. Using this model, power-driven instruction selection introduces energy savings of up to 70% in the QuadroCore multiprocessor.
2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), 2010
This paper presents dynamic reconfiguration of a register file of a Very Long Instruction Word (VLIW) processor implemented on an FPGA. We developed an open-source reconfigurable and parameterizable VLIW processor core based on the VLIW Example (VEX) Instruction Set Architecture (ISA), capable of supporting reconfigurable operations as well. The VEX architecture supports up to 64 multiported shared registers in a register file for a single cluster VLIW processor. This register file accounts for a considerable amount of area in terms of slices when the VLIW processor is implemented on an FPGA. Our processor design supports dynamic partial reconfiguration, allowing the creation of dedicated register file sizes for different applications. Therefore, valuable area can be freed and utilized for other implementations running on the same FPGA when the full register file size is not needed. Our design requires 924 slices on a Xilinx Virtex-II Pro device for dynamically placing a chunk of 8 registers, and places registers in multiples of 8 to simplify the design. Consequently, when 64 registers are not needed at all times, the area utilization can be reduced during run-time.
2001
Standard microprocessors are generally designed to deal efficiently with different types of tasks; their general purpose architecture can lead to misuse of resources, creating a large gap between the computational efficiency of microprocessors and custom silicon.
Electronic devices such as smartphones and game consoles now run heterogeneous tasks that must execute in real time. To meet this challenging requirement, heterogeneous multi-core processors are the likely platforms to host the different application tasks. However, heterogeneous multi-core processor design poses many hardware/software challenges, including resource sharing, task balancing, throughput, communication, and scheduling. This paper presents a review of heterogeneous reconfigurable multiprocessor design using Field Programmable Gate Array (FPGA) hardware platforms. The FPGA-based hardware design represents a middleware structure that manages the communication between cores in a specific processor by physically varying hardware blocks in the FPGA platform. The FPGA-based design provides an efficient platform for developing hardware/software multi-core systems for reconfigurable heterogeneous multiprocessors. Finally, the reconfigurable design improves heterogeneous multiprocessor performance in terms of reduced communication latency, power consumption, and execution time.
International Journal of Electronics, 2007
In this paper, we target a Reconfigurable Instruction Set Processor (RISP), which tightly couples a coarse-grain Reconfigurable Functional Unit (RFU) to a RISC processor. Furthermore, the architecture is supported by a flexible development framework. By allowing the definition of alternate architectural parameters, the framework can be used to explore the design space and fine-tune the architecture at design time. Initially, two architectural enhancements, namely partial predicated execution and virtual opcode, are proposed, and the extensions made to the architecture and the framework to support them are presented. To evaluate these issues, kernels from the multimedia domain are considered and an exploration to derive an appropriate instance of the architecture is performed. The efficiency of the derived instance and the proposed enhancements is evaluated using an MPEG-2 encoder application.
2011
We present a run-time system for a multi-grained reconfigurable processor in order to provide a dynamic trade-off between performance and available area budgets for both fine- as well as coarse-grained reconfigurable fabrics as part of one reconfigurable processor. Our run-time system is the first implementation of its kind that dynamically selects and steers a performance-maximizing multi-grained instruction set under run-time varying constraints. It achieves a performance improvement of more than 2× compared to state-of-the-art run-time systems for multi-grained architectures. To elaborate the benefits of our approach further, we also compare it with offline- and online-optimal instruction-set selection schemes.
2000
This paper introduces the notion of a Flexible Instruction Processor (FIP) for systematic customisation of instruction processor design and implementation. The features of our approach include: (a) a modular framework based on "processor templates" that capture various instruction processor styles, such as stack-based or register-based styles; (b) enhancements of this framework to improve functionality and performance, such as hybrid processor templates and superscalar operation; (c) compilation strategies involving standard compilers and FIP-specific compilers, and the associated design flow; (d) technology-independent and technology-specific optimisations, such as techniques for efficient resource sharing in FPGA implementations. Our current implementation of the FIP framework is based on a high-level parallel language called Handel-C, which can be compiled into hardware. Various customised Java Virtual Machines and MIPS style processors have been developed using existing FPGAs to evaluate the effectiveness and promise of this approach.
Microprocessors and Microsystems, 2012
2010 International Conference on Field-Programmable Technology, 2010
Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems - CASES '01, 2001