2010, IBM Journal of Research and Development
This paper describes a recent system-level trend toward the use of massive on-chip parallelism combined with efficient hardware accelerators and integrated networking to enable new classes of applications and computing-systems functionality. This system transition is driven by semiconductor physics and emerging network-application requirements. In contrast to general-purpose approaches, workload and network-optimized computing provides significant cost, performance, and power advantages relative to historical frequency-scaling approaches in a serial computational model. We highlight the advantages of on-chip network optimization that enables efficient computation and new services at the network edge of the data center. Software and application development challenges are presented, and a service-oriented architecture application example is shown that characterizes the power and performance advantages for these systems. We also discuss a roadmap for next-generation systems that proportionally scale with future networking bandwidth growth rates and employ 3-D chip integration methods for design flexibility and modularity.
Journal of Systems Architecture, 2011
The era of single-core processors is over. Today's processors already integrate around ten cores, and we are moving toward integration levels of 1,000-core chips. The main obstacles are power and energy consumption. Building such a multicore chip requires rethinking the whole compute stack from the ground up for energy and power efficiency. Controlling power consumption is essential to reach the milestone of extreme-scale computing. It is also important to operate the processor at low voltage, since that is the point of maximum energy efficiency; unfortunately, in such an environment we must cope with substantial process variation. It is therefore important to design voltage regulation efficiently, so that each region of the chip can operate at the most suitable voltage and frequency levels. At the architecture level, we need simple cores organized in a hierarchy of clusters. Furthermore, we need techniques that reduce the leakage of on-chip memories and lower the voltage guard-bands of logic. Finally, we also need to minimize data movement through both hardware and software techniques. Overall, the required energy efficiencies can be attained with an integrated approach that cuts across multiple layers of the computing stack.
IEEE Micro, 2004
With its throughput computing strategy, Sun Microsystems seeks to reverse long-standing trends toward increasingly elaborate processor designs by focusing instead on simple, replicated designs that effectively exploit threaded workloads and thus enable radically higher levels of performance scaling. Throughput computing is based on chip multithreading (CMT) processor design technology. In CMT technology, performance is defined by maximizing the amount of work accomplished per unit of time or other relevant resource, rather than by minimizing the time needed to complete a given task or set of tasks. Similarly, a CMT processor might seek to use power more efficiently by clocking at a lower frequency, if this tradeoff lets it generate more work per watt of expended power. By CMT standards, the best processor accomplishes the most work per second of time, per watt of expended power, per square millimeter of die area, and so on (that is, it operates most efficiently).
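The CMT figures of merit above can be made concrete with a small sketch. The numbers below are invented for illustration, not taken from the paper; they simply show how a slower but replicated design can win on throughput-per-watt and per-mm² metrics.

```python
# Illustrative comparison of two hypothetical chips by throughput-computing
# figures of merit (work/s, work/s/W, work/s/mm^2). All numbers are made up.

def efficiency_metrics(work_per_sec, watts, die_mm2):
    """Return the throughput-oriented figures of merit for one design."""
    return {
        "work/s": work_per_sec,
        "work/s/W": work_per_sec / watts,
        "work/s/mm^2": work_per_sec / die_mm2,
    }

# A fast, elaborate single-threaded core vs. a lower-clocked CMT design
# that exploits threaded workloads.
complex_core = efficiency_metrics(work_per_sec=100, watts=90, die_mm2=200)
cmt_chip     = efficiency_metrics(work_per_sec=160, watts=60, die_mm2=220)

# By CMT standards the second design is better: more work per watt
# and more work per square millimeter, despite a lower clock.
```
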
IEEE Access
Minimizing execution time, energy consumption, and network load through scheduling algorithms is challenging for multiprocessor system-on-chip (MPSoC) based network-on-chip (NoC) systems. MPSoC-based systems are prevalent in high-performance computing. As the capabilities of computing hardware have grown, application requirements have increased manyfold, particularly for real-world scientific applications. Scheduling large scientific workflows consisting of hundreds or thousands of tasks consumes a significant amount of time and resources. In this article, energy-aware parallel scheduling techniques are presented, aimed primarily at reducing algorithm execution time while accounting for network load. Experimental results reveal that the proposed parallel scheduling algorithms achieve a significant reduction in execution time. INDEX TERMS Network-on-chip (NoC), multiprocessor system-on-chip (MPSoC), task scheduling, parallel scheduling.
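The kind of tradeoff described above can be sketched with a simple greedy list scheduler. This is a hypothetical illustration, not the algorithm from the article: it maps each task to the core minimizing a weighted sum of the core's finish time and the NoC traffic added by placing the task away from its predecessor.

```python
# Hypothetical sketch of a greedy, network-aware list scheduler for an
# NoC-based MPSoC. The cost weighting (alpha) and all numbers are
# illustrative assumptions, not from the article.

def schedule(tasks, n_cores, comm_cost, alpha=0.5):
    """Map tasks (id, work, predecessor) to cores, trading off each
    core's current finish time against the communication cost incurred
    when a task lands on a different core than its predecessor.
    Tasks must arrive in topological order. Returns {task_id: core}."""
    finish = [0.0] * n_cores                  # earliest free time per core
    placement = {}
    for tid, work, pred in tasks:
        best_core, best_cost = 0, float("inf")
        for c in range(n_cores):
            # network penalty if the predecessor ran elsewhere
            hop = comm_cost if (pred is not None and placement[pred] != c) else 0.0
            cost = finish[c] + work + alpha * hop
            if cost < best_cost:
                best_core, best_cost = c, cost
        hop = comm_cost if (pred is not None and placement[pred] != best_core) else 0.0
        placement[tid] = best_core
        finish[best_core] += work + hop
    return placement

# Tiny example: "b" stays with its predecessor "a"; "c" spills to core 1.
tasks = [("a", 2.0, None), ("b", 1.0, "a"), ("c", 1.0, "a")]
mapping = schedule(tasks, n_cores=2, comm_cost=4.0)
```
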
Proceedings of the IEEE, 2008
The recent trends in commodity processor architectures exploit multiple cores to achieve higher performance. Some examples include multicore processors that replicate identical serial CPU cores on a single chip, e.g., quad-core CPU chips available from Intel and AMD in 2007. The current trend seems to indicate that the number of cores is growing at a rate governed by Moore's law. An ongoing and contrasting trend has been the development of heterogeneous processor architectures that combine fine-grain and coarse-grain parallelism using tens or hundreds of disparate processing cores. Examples of such processors include the Cell BE processor, which is used as a CPU in workstations and game consoles, and manycore accelerators (e.g., GPUs), which are designed with the goal of achieving higher parallel-code performance for a class of applications.
Sensors
The rapid evolution of Cloud-based services and the growing interest in deep learning (DL)-based applications are putting increasing pressure on hyperscalers and general-purpose hardware designers to provide more efficient and scalable systems. Cloud-based infrastructures must consist of more energy-efficient components. The evolution must take place from the core of the infrastructure (i.e., data centers (DCs)) to the edges (Edge computing) to adequately support new/future applications. Adaptability/elasticity is one of the features required to increase performance-to-power ratios. Hardware-based mechanisms have been proposed to support system reconfiguration mostly at the processing-element level, while fewer studies have been carried out regarding scalable, modular interconnected sub-systems. In this paper, we propose a scalable Software Defined Network-on-Chip (SDNoC)-based architecture. Our solution can easily be adapted to support devices ranging from low-power computing n...
Left unchecked, the fundamental drive to increase peak performance using tens of thousands of power-hungry components will lead to intolerable operating costs and failure rates. High-performance, power-aware distributed computing reduces the power and energy consumption of distributed applications and systems without sacrificing performance. Recent work has shown that the application characteristics of single-processor, memory-bound, non-interactive codes and of distributed, interactive web services can be exploited to conserve power and energy with minimal performance impact. Our novel approach is to exploit the parallel performance inefficiencies characteristic of non-interactive, distributed scientific applications, conserving energy using DVS (dynamic voltage scaling) without significantly impacting time-to-solution (TTS), thereby reducing cost and improving reliability. We present a software framework to analyze and optimize distributed power-performance using DVS, implemented on a 16-node Centrino-based cluster. We use our framework to quantify and compare the power-performance efficiency of parallel Fourier transform and matrix transpose codes. Using various DVS strategies we achieve application-dependent overall system energy savings as large as 25% with as little as 2% performance impact.
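The intuition behind these DVS savings can be shown with a back-of-the-envelope model. Dynamic CMOS power scales roughly as C·V²·f, and supply voltage can drop with frequency, so power falls roughly cubically with f while the runtime of a slack phase grows only linearly. The specific numbers below are illustrative assumptions, not measurements from the paper.

```python
# Rough model of DVS energy savings on a phase with performance slack.
# Assumes V scales linearly with f, so dynamic power P ~ f^3 and the
# time to finish a fixed amount of work scales as 1/f. Energy ~ f^2.

def dynamic_energy(freq_ghz, seconds_of_work_at_1ghz):
    """Energy (arbitrary units) to finish a fixed workload at freq_ghz."""
    power = freq_ghz ** 3                      # P ~ V^2 * f with V ~ f
    time = seconds_of_work_at_1ghz / freq_ghz  # slower clock, longer phase
    return power * time

full = dynamic_energy(1.0, 10.0)   # run the slack phase at full speed
slow = dynamic_energy(0.8, 10.0)   # scale down to 80% during the slack
savings = 1 - slow / full          # ~36% energy saved on this phase
```

Because energy scales roughly quadratically with frequency under this model, even a modest 20% frequency reduction during communication slack yields a large per-phase energy saving, which is why application-level savings of 25% with only a 2% TTS impact are plausible.
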
Journal of Advances in Computer Networks, 2013
Modern on-chip multicore design will continue Moore's law and facilitate platforms for wired and wireless communications. It has been predicted that the future computing platform will be a tightly integrated, complex system that can process "big data" with swift speed and high quality. However, it is not clear how current multicore systems will react to large volumes of data, or how data volume will shape the interconnect network design and architecture of future computing platforms. The goal of this paper is to raise these questions and provide some answers. In particular, this paper provides a series of cost models and a new optimization scheme, "Interconnect Communication Cost Minimization" (ICCM), to manage tasks and their data. Task flows and their partitions are considered together with the amount of data created and consumed. The resulting partitions are mapped virtually onto the multicore system through a data and task scheduling optimizer. Through experimental results, we demonstrate an average 50% reduction in communication cost, an average 23.1% improvement in throughput, and a 35% reduction in dynamic power. Index Terms: Multi-core, data-centric design, interconnect communication cost minimization, accuracy adaptive adder. I. INTRODUCTION Modern on-chip multicore design will continue Moore's law and facilitate platforms for wired and wireless communications with "big data". It has been predicted that these multicore architectures will be the future computing platforms with swift speed and high quality. Previous successes from IBM, Larrabee, and Intel demonstrated 32-core, 64-core, and 80-core examples, and the recent success of Tilera's 100-core system [1], [2] has confirmed the trend toward thousands of cores on a single die. Nevertheless, putting many cores on a single chip is not simple. The most challenging part is not how many cores we can pack on a single die. It is how to keep these cores supplied with resources: the cores need to be powered and fed with data streams. Otherwise, even with thousands of cores on a single die, we can only activate a small portion of them. A number of researchers [1]-[3] have identified that the success of today's Tilera lies in routing. By replacing the long wires of a multicore system with routed networks, and by distributing the gigantic 50-ported register file, the 16-way ALU clump, and the gigantic 50-ported mongo cache, the
Microprocessors and Microsystems, 2015
The dramatic environmental and economic impact of the ever-increasing power and energy consumption of modern computing devices in data centers is now a critical challenge. On the one hand, designers use technology scaling as one of the methods to confront the phenomenon called dark silicon (only segments of a chip can function concurrently due to power restrictions). On the other hand, designers use extreme-scale systems such as teradevices to meet the performance needs of their applications, which in turn increases the power consumption of the platform. To overcome these challenges, we need novel computing paradigms that address energy efficiency. One promising solution is to incorporate parallel distributed methodologies at different abstraction levels. The FP7 project ParaDIME focuses on this objective, providing different distributed methodologies (software-hardware techniques) at different abstraction levels to attack the power-wall problem. In particular, the ParaDIME framework will utilize: circuit and architecture operation below safe voltage limits for drastic energy savings, specialized energy-aware computing accelerators, heterogeneous computing, an energy-aware runtime, approximate computing, and power-aware message passing. The major outcome of the project will be a novel processor architecture for a heterogeneous distributed system that utilizes future device characteristics, a runtime, and a programming model for drastic energy savings in data centers. Wherever possible, ParaDIME will adopt multidisciplinary techniques, such as hardware support for message passing, runtime energy optimization utilizing new hardware energy performance counters, the use of accelerators for error recovery from sub-safe voltage operation, and approximate computing through annotated code.
Furthermore, we will establish and investigate the theoretical limits of energy savings at the device, circuit, architecture, runtime, and programming-model levels of the computing stack, and quantify the actual energy savings achieved by the ParaDIME approach across the complete computing stack in a real environment.
Lecture Notes in Computer Science, 2014
In this paper, we propose a holistic approach for the analysis of parallel applications on a high-performance, low-energy computer (called the HAEC platform). The HAEC platform is currently under design and refers to an architecture in which multiple 3-D stacked massively parallel processor chips are optically interconnected on a single board and multiple parallel boards are interconnected using short-range high-speed wireless links. Although not exclusively targeting high-performance computing (HPC), the HAEC platform aims to deliver high performance at low energy cost, essential features for future HPC platforms. At the core of the proposed approach is a trace-driven simulator called haec sim, which we developed to simulate the behavior of parallel applications running on this hardware. We investigate several mapping layouts to assign parallel applications to the HAEC platform, concentrating on the analysis of its communication performance. The simulator can employ two communication models: dimension order routing (DOR) and practical network coding (PNC). As a first example of the usefulness of the proposed holistic analysis approach, we present simulation results using these communication models on a communication-intensive parallel benchmark. These results highlight the potential of the mapping strategies and communication models for analyzing the performance of various types of parallel applications on the HAEC platform. This work constitutes the first step towards more complex simulations and analyses of performance and energy scenarios than those presented herein.
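Of the two communication models named above, dimension order routing is simple enough to sketch directly. The following is a minimal illustration of XY routing on a 2D mesh (the mesh shape and coordinates are assumptions for the example, not details of the HAEC platform or its simulator).

```python
# Minimal sketch of dimension-order (XY) routing on a 2D mesh NoC:
# a packet fully resolves the X dimension before turning into Y,
# which makes the routing deterministic and deadlock-free on a mesh.

def dor_route(src, dst):
    """Return the list of router coordinates visited from src to dst."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                 # travel along X first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then along Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Routing from (0, 0) to (2, 1): two X hops, then one Y hop.
path = dor_route((0, 0), (2, 1))
```
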