Efficient utilization of available resources is a key concept in embedded systems. This paper focuses on supporting the management of dynamic reconfiguration of computing resources in the programming model. We present an approach to map occam-pi programs to a manycore architecture, Platform 2012 (P2012). We describe the techniques used to translate the salient features of the occam-pi language to the native programming model of the P2012 architecture. We present the initial results from a case study of matrix multiplication. Our results illustrate the simplicity of the occam-pi program through a sixfold reduction in lines of code.
International Journal of Reconfigurable Computing, 2012
Massively parallel reconfigurable architectures, which couple massive parallelism with the capability of run-time reconfiguration, are gaining attention as a way to meet the increased computational demands of high-performance embedded systems. We propose that the occam-pi language be used for programming this category of massively parallel reconfigurable architectures. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions in the occam-pi language and a backend was developed to target the Ambric array of processors. We present two case studies: a DCT implementation exploiting the reconfigurability feature of occam-pi and a significantly large autofocus criterion calculation based on the dynamic parallelism capability of the occam-pi language. The results of the implemented case studies suggest that the occam-pi-language-based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits.
2011
Recently we proposed occam-pi as a high-level language for programming coarse-grained reconfigurable architectures. The constructs of occam-pi combine ideas from CSP and pi-calculus to facilitate expressing parallelism, communication, and reconfigurability. The feasibility of this approach was illustrated by developing a compiler framework to compile occam-pi implementations to the Ambric architecture. In this paper, we demonstrate the applicability of occam-pi for programming an array of functional units, the extreme Processing Platform (XPP). This is made possible by extending the compiler framework to target the XPP architecture, including automatic floating-point to fixed-point conversion. Different implementations of a FIR filter and a DCT algorithm were developed and evaluated on the basis of performance and resource consumption. The reported results suggest that using occam-pi to program the category of coarse-grained reconfigurable architectures is promising. The resulting implementations are generally much superior to those programmed in C and comparable to those hand-coded in the low-level native language NML.
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
With the advent of manycore architectures comprising hundreds of processing elements, fault management has become a major challenge. We present an approach that uses the occam-pi language to manage the fault recovery mechanism on a new manycore architecture, the Platform 2012 (P2012). The approach is made possible by extending our previously developed compiler framework to compile occam-pi implementations to the P2012 architecture. We describe the techniques used to translate the salient features of the occam-pi language to the native programming model of the P2012 architecture. We demonstrate the applicability of the approach by an experimental case study, in which the DCT algorithm is implemented on a set of four processing elements. During runtime, some of the tasks are then relocated from assumed faulty processing elements to the faultless ones by means of dynamic reconfiguration of the hardware. The working of the demonstrator and the simulation results illustrate not only the feasibility of the approach but also how the use of higher-level abstractions simplifies the fault handling.
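The relocation step described in this abstract can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' mechanism: the function `relocate_tasks`, the task names, and the least-loaded placement policy are all hypothetical, and the real system performs the move through dynamic reconfiguration on P2012 rather than by editing a dictionary.

```python
def relocate_tasks(placement, pes, faulty):
    """Return a new task -> PE mapping with every task moved off faulty PEs.

    placement: dict mapping a task name to the PE it currently runs on
    pes:       list of all processing element ids
    faulty:    set of PE ids assumed to be faulty
    The least-loaded policy is an assumption made for this sketch.
    """
    healthy = [pe for pe in pes if pe not in faulty]
    if not healthy:
        raise RuntimeError("no fault-free processing element left")

    # Count how many tasks each healthy PE already hosts.
    load = {pe: 0 for pe in healthy}
    for pe in placement.values():
        if pe in load:
            load[pe] += 1

    new_placement = {}
    for task, pe in placement.items():
        if pe in faulty:
            # Pick the least-loaded healthy PE; ties go to the lowest id.
            target = min(healthy, key=lambda p: (load[p], p))
            load[target] += 1
            new_placement[task] = target
        else:
            new_placement[task] = pe
    return new_placement


# Hypothetical four-stage DCT pipeline, one stage per processing element.
placement = {"dct_rows": 0, "transpose": 1, "dct_cols": 2, "output": 3}
remapped = relocate_tasks(placement, pes=[0, 1, 2, 3], faulty={1, 3})
print(remapped)
```

With PEs 1 and 3 marked faulty, the two affected stages are spread over the remaining PEs 0 and 2, while the unaffected stages stay where they were.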
Manycore architectures are gaining attention as a means to meet the performance and power demands of high-performance embedded systems. However, their widespread adoption is sometimes constrained by the need for mastering proprietary programming languages that are low-level and hinder portability.
2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, 2011
Building an effective programming model for manycore processors is challenging. On the one hand, the increasing variety of platforms and their specific programming models force users to take a hardware-centric approach not only for implementing parallel applications, but also for designing them. This approach diminishes portability and, eventually, limits performance. On the other hand, to effectively cope with the increased number of large-scale workloads that require parallelization, a portable, application-centric programming model is desirable. Such a model enables programmers to focus first on extracting and exploiting parallelism from their applications, as opposed to generating parallelism for specific hardware, and only second on platform-specific implementation and optimizations.
Many-core architectures are now a reality and programming them is still a challenge.
2011
Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and not providing scalability guarantees for future generations of the hardware. This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressive scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model and the architecture design is focused on reducing inter-task communication and synchronization overheads and providing an easy-to-program parallel model. This thesis builds upon the existing XMT infrastructure to improve support for efficient execution with a focus on ease-of-programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving the overall performance. 
More concretely, we: (1) present a work-flow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP, an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures; we demonstrate performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs. a serial processor and 8.10x vs. parallel code optimized for an existing many-core (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking and research interest in the parallel algorithms community. We demonstrate better speed-ups compared to a best serial solution than previous attempts on other parallel platforms.
2016
This book, “Multi-Core Architectures and Programming”, gives a conceptual introduction to multicore processors, their architecture, and their programming with the OpenMP API. It outlines the multicore architecture and its functional blocks, such as intercommunication, cache, and memory. It explains how process scheduling by the operating system is performed on a multicore processor. It also discusses memory programming on multicore processors using the OpenMP API and its libraries for the C language.
Lecture Notes in Computer Science, 2011
We present ΣC, a programming model and language for high performance embedded manycores. The programming model is based on process networks with non determinism extensions and process behavior specifications. The language itself extends C, with parallelism, composition and process abstractions. It is intended to support architecture independent, high-level parallel programming on embedded manycores, and allows for both low execution overhead and strong execution guarantees. ΣC is being developed as part of an industry-grade tool chain for a high performance embedded manycore architecture.
IEEE, 2024
Innovative, customizable computer architectures are driving significant advancements in performance for specific computation classes, such as dense matrix operations. However, compiling code for these new accelerators poses challenges, often requiring substantial engineering investment, particularly for tailored optimizations. In response, we are developing a compiler for a reconfigurable manycore architecture, leveraging program synthesis to distribute computation across a grid of simple processor cores. This approach eliminates the need for manual optimizations, streamlining spatial mapping. Our results show a remarkable 3.3-5.0X speedup in estimated cycle counts on microbenchmarks compared to single-core execution. Although scalability limitations are a concern, we have successfully compiled our largest microbenchmark, a 152-line code, within an hour, resulting in a non-optimal yet 1.7X faster partitioning. Our objective is to enable developers to harness efficient hardware accelerators without delving into low-level coding or waiting for the conventional development of optimized compilers. Moreover, this synthesis-based compiler technology is poised to offer enhanced flexibility for future, distinct yet related architectures.
2017
Parallelism has been used since the early days of computing to enhance performance. From the first computers to the most modern sequential processors (also called uniprocessors), the main concepts introduced by von Neumann [20] are still in use. However, the ever-increasing demand for computing performance has pushed computer architects toward implementing different techniques of parallelism. The von Neumann architecture was initially a sequential machine operating on scalar data with bit-serial operations [20]. Word-parallel operations were made possible by using more complex logic that could perform binary operations in parallel on all the bits in a computer word, and it was just the start of an adventure of innovations in parallel computer architectures.
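The step from bit-serial to word-parallel operation can be illustrated with a small sketch. It is a software model only, with an assumed 8-bit word and hypothetical function names: the bit-serial version handles one bit position per step and carries a single bit between steps, while the word-parallel version produces all bits of the result in one operation.

```python
def bit_serial_add(a, b, width=8):
    """Add two unsigned integers one bit per step, as a bit-serial ALU would."""
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        # Full adder for a single bit position.
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        result |= s << i
    return result  # the final carry is dropped, so the sum wraps at 2**width


def word_parallel_add(a, b, width=8):
    """Add all bits of the word at once and wrap to the word width."""
    return (a + b) & ((1 << width) - 1)


print(bit_serial_add(200, 100), word_parallel_add(200, 100))
```

Both functions return the same wrapped sum; the difference is that the first needs `width` sequential steps where the second needs one.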
Lecture Notes in Electrical Engineering, 2011
The 2PARMA project focuses on the development of parallel programming models and run-time resource management techniques to exploit the features of many-core processor architectures.
2012
With the dawn of the multi-core era, programmers are being challenged to write code that performs well on an increasingly diverse array of architectures. A single program or library may be used on systems ranging in power from large servers with dozens or hundreds of cores to small single-core netbooks or mobile phones. A program may need to run efficiently both on architectures with many simple cores and on those with fewer monolithic cores. Some of the systems a program encounters might have GPU coprocessors, while others might not. Looking forward, processor designs such as asymmetric multi-core [3], with different types of cores on a single chip, will present an even greater challenge for programmers to utilize effectively. Programmers often find they must make algorithmic changes to their program in order to get performance when moving between these different types
Scalable Computing: Practice and Experience, 2016
Many-Task Computing (MTC) is a common scenario on many parallel systems, such as clusters, grids, clouds, and supercomputers, but it is less popular on shared-memory parallel processors. Given the spectacular growth in performance and in the number of cores integrated in many-core architectures, the study of MTC on such architectures is becoming increasingly relevant. In this paper, the authors present the programming mechanisms that take advantage of such massively parallel features for the particular target of MTC. The hardware features of the two dominant many-core platforms (NVIDIA's GPUs and Intel Xeon Phi) are also analyzed for our specific framework. Given the important differences in hardware and software between our two many-core platforms, we have considered different strategies based on CUDA (for GPUs) and OpenMP (for Intel Xeon Phi). We carried out several test cases based on an appropriate and widely studied benchmarking problem, matrix multiplication. Essentially, this study consisted of comparing the time consumed in computing several tasks in parallel one by one (all computational resources are used to compute just one task at a time) with the time consumed in computing the same set of tasks simultaneously (all computational resources are used to compute the set of tasks at the very same time). Finally, we compared both software-hardware scenarios to identify the most relevant computer features in each of our many-core architectures.
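The two schedules compared in this abstract can be sketched structurally as follows. This is not the authors' CUDA or OpenMP code: it is a minimal Python sketch with hypothetical function names, using a thread pool as a stand-in for the cores. Because of CPython's global interpreter lock, the threads here do not speed up pure-Python arithmetic, so the sketch shows only how the work is divided, not the timings.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_rows(a, b, rows):
    """Compute the given rows of the product a @ b (lists of lists)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(a[i], col)) for col in cols] for i in rows]


def one_task_at_a_time(tasks, workers=4):
    """Schedule A: all workers cooperate on one multiplication, then the next."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for a, b in tasks:
            n = len(a)
            step = max(1, -(-n // workers))  # ceiling division, at least 1
            chunks = [range(s, min(s + step, n)) for s in range(0, n, step)]
            parts = list(pool.map(lambda rows: matmul_rows(a, b, rows), chunks))
            # Chunks are contiguous and ordered, so concatenation restores the rows.
            results.append([row for part in parts for row in part])
    return results


def all_tasks_simultaneously(tasks, workers=4):
    """Schedule B: each worker takes a whole multiplication from the task set."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: matmul_rows(t[0], t[1], range(len(t[0]))), tasks))


tasks = [
    ([[1, 2], [3, 4]], [[5, 6], [7, 8]]),
    ([[1, 0], [0, 1]], [[9, 8], [7, 6]]),
]
print(one_task_at_a_time(tasks) == all_tasks_simultaneously(tasks))
```

Both schedules produce identical products; what differs is whether the parallelism lies inside each task or across the set of tasks.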
2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012
We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and Offload compiler to generate platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrated programming framework for heterogeneous multicore systems.