Papers by Grigorios Magklis
Compressing address communications between processors
Processor and system with, and method for, a frequency and voltage scaling architecture

Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35), 2002
We describe the design, analysis, and performance of an on-line algorithm to dynamically control the frequency/voltage of a Multiple Clock Domain (MCD) microarchitecture. The MCD microarchitecture allows the frequency/voltage of microprocessor regions to be adjusted independently and dynamically, allowing energy savings when the frequency of some regions can be reduced without significantly impacting performance. Our algorithm achieves on average a 19.0% reduction in Energy Per Instruction (EPI), a 3.2% increase in Cycles Per Instruction (CPI), a 16.7% improvement in Energy-Delay Product, and a Power Savings to Performance Degradation ratio of 4.6. Traditional frequency/voltage scaling techniques, which apply reductions globally to a fully synchronous processor, achieve a Power Savings to Performance Degradation ratio of only 2-3. Our Energy-Delay Product improvement is 85.5% of what has been achieved using an off-line algorithm. These results were achieved across a broad range of applications from the MediaBench, Olden, and SPEC2000 benchmark suites, with an algorithm we show to require minimal hardware resources.
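A minimal sketch of how an on-line per-domain controller of this kind might look, assuming queue occupancy is the utilization signal being monitored; the thresholds, frequency steps, and domain names are illustrative placeholders, not the paper's actual parameters:

```python
# Hypothetical sketch of an online per-domain DVFS controller for an MCD core.
# Assumes each domain exposes a utilization signal (e.g. issue-queue occupancy);
# thresholds and frequency steps here are illustrative, not the paper's values.

FREQ_STEPS = [0.50, 0.65, 0.80, 1.00]    # normalized frequency levels per domain

class DomainController:
    def __init__(self, low=0.3, high=0.7):
        self.low, self.high = low, high       # occupancy thresholds
        self.level = len(FREQ_STEPS) - 1      # start at full frequency

    def update(self, occupancy):
        """Called once per control interval with average queue occupancy in [0, 1]."""
        if occupancy > self.high and self.level < len(FREQ_STEPS) - 1:
            self.level += 1                   # domain is a bottleneck: speed it up
        elif occupancy < self.low and self.level > 0:
            self.level -= 1                   # domain has slack: slow it down, save energy
        return FREQ_STEPS[self.level]

# Example: three domains driven by per-interval occupancy samples.
controllers = {d: DomainController() for d in ("integer", "fp", "load_store")}
sample = {"integer": 0.85, "fp": 0.10, "load_store": 0.55}
print({d: c.update(sample[d]) for d, c in controllers.items()})
```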
Double Rounded Combined Floating-Point Multiply and Add
Mechanism for Facilitating Dynamic and Efficient Fusion of Computing Instructions in Software Programs
Profiling Asynchronous Events Resulting from the Execution of Software at Code Region Granularity

Proceedings of the 17th IEEE/ACM International Symposium on Low Power Electronics and Design, 2011
In recent years, multi-core systems have become mainstream in the computer industry. The design of multi-cores takes advantage of thread-level parallelism in emerging applications that are computationally intensive and highly parallel. Energy efficiency is one of the biggest challenges in the design of multi-core systems, and workload imbalance among parallel threads is one of the sources of energy inefficiency. Many techniques based on dynamic voltage and frequency scaling (DVFS) have been proposed to reduce energy consumption on multi-cores, but all of them assume that each core in a multi-core system contains only one hardware context and that only one thread can execute on a core at a time. However, mainstream multi-core systems are moving toward simultaneous multithreading (SMT) support in cores, and existing DVFS-based techniques are not effective at achieving maximum energy savings. In this paper, we present a novel technique called thread shuffling, which combines thread migration and DVFS to achieve maximum energy savings while maintaining performance on a multi-core system supporting SMT. Thread shuffling is implemented and simulated in a cycle-accurate x86 multi-core system. The experiments show that it achieves up to 56% energy savings without performance penalty for selected Recognition, Mining and Synthesis (RMS) applications from Intel Labs.
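A hedged sketch of the thread-shuffling idea: threads with similar slack are co-located on the same SMT core, and each core's frequency is then chosen to just cover its most critical thread. The slack metric, frequency levels, and function name are assumptions for illustration:

```python
# Illustrative sketch: group threads with similar slack onto the same SMT core,
# then pick a per-core DVFS level that just meets the core's most critical thread.
# The slack metric and frequency levels are assumptions, not the paper's policy.

def shuffle_and_scale(thread_slack, contexts_per_core=2, f_levels=(0.5, 0.75, 1.0)):
    # Sort threads by slack so threads with similar slack end up sharing a core.
    ordered = sorted(thread_slack.items(), key=lambda kv: kv[1])
    cores = [ordered[i:i + contexts_per_core]
             for i in range(0, len(ordered), contexts_per_core)]

    plan = []
    for core_threads in cores:
        # The least-slack (most critical) thread on the core dictates its frequency.
        min_slack = min(s for _, s in core_threads)
        # Pick the lowest frequency level that still absorbs the available slack.
        freq = next((f for f in f_levels if f >= 1.0 - min_slack), f_levels[-1])
        plan.append(([t for t, _ in core_threads], freq))
    return plan

# slack = fraction of the barrier interval a thread spends waiting (0 = critical)
print(shuffle_and_scale({"t0": 0.0, "t1": 0.05, "t2": 0.4, "t3": 0.45}))
```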
Thread Migration to Improve Power Efficiency in a Parallel Processing Environment

Workshop on Application Specific Processors, 2003
Microprocessors are traditionally designed to provide "best overall" performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by "downsizing" hardware resources that are underutilized by particular applications. We explore the converse: "upsizing" hardware resources in order to improve performance relative to an aggressively clocked baseline processor. Our proposal depends critically on the ability to change frequencies independently in separate domains of a globally asynchronous, locally synchronous (GALS) microprocessor. We use a variant of our multiple clock domain (MCD) processor, with four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Measuring across a broad suite of application benchmarks, we find that configuring just once per application increases performance by an average of 17.6% compared to the best fully synchronous design. When adapting to application phases, performance improves by over 20%.
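The upsize-on-demand trade-off can be illustrated with a small selection sketch: enlarging a domain's key structure lowers that domain's clock, so a configuration is chosen only when the predicted IPC gain outweighs the frequency loss. The sizes, frequencies, and IPC figures below are invented for illustration:

```python
# Hedged sketch of the "upsize on demand" trade-off for one hypothetical domain.
# (structure_size, relative_frequency) pairs and IPC values are illustrative only.

CONFIGS = [(32, 1.00), (64, 0.90), (128, 0.80)]   # bigger structure -> slower clock

def pick_config(ipc_at_size):
    """ipc_at_size: measured or estimated IPC for each candidate structure size."""
    # Performance ~ IPC * frequency; choose the configuration maximizing the product.
    return max(CONFIGS, key=lambda cfg: ipc_at_size[cfg[0]] * cfg[1])

# Example: an application that benefits from a moderately larger structure.
print(pick_config({32: 1.0, 64: 1.2, 128: 1.25}))   # -> (64, 0.9)
```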
Replacement policy for hot code detection
Multiple clock domain microprocessor
Managed Instruction Cache Prefetching
On using reliable network RAM in networks of workstations
File systems and databases usually make several synchronous disk write accesses in order to make sure that the disk always has a consistent view of their data, and that data can be recovered in the case of a system crash. Since synchronous disk operations are slow, some systems choose to employ asynchronous disk write operations, at the cost of low reliability: in case of a system crash, all data that have not yet been written to disk are lost. In this paper we describe a software-based approach to using the network memory in a ...
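A minimal sketch of the underlying idea, under the assumption of a mirrored-memory write path: a "synchronous" write returns once the block has been copied into a peer workstation's memory, and the slow disk write is deferred to a background flusher. The RemoteMemory stand-in and file path are hypothetical:

```python
# Sketch of reliable network RAM: acknowledge a write once it is mirrored in a
# peer node's memory, flush to disk asynchronously. All names are stand-ins.

import os, queue, tempfile, threading

class RemoteMemory:
    """Stand-in for memory on a peer node reached over a fast network."""
    def __init__(self):
        self.blocks = {}
    def mirror(self, block_id, data):
        self.blocks[block_id] = data          # in reality: a low-latency network send

class NetworkRamWriter:
    def __init__(self, remote):
        self.remote = remote
        self.pending = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def sync_write(self, block_id, data):
        self.remote.mirror(block_id, data)    # data now survives a local crash
        self.pending.put((block_id, data))    # the disk write is deferred
        # returning here gives the caller synchronous-write semantics

    def _flusher(self):
        while True:                           # drains pending blocks to stable storage
            block_id, data = self.pending.get()
            path = os.path.join(tempfile.gettempdir(), f"block_{block_id}")
            with open(path, "wb") as f:
                f.write(data)

writer = NetworkRamWriter(RemoteMemory())
writer.sync_write(7, b"committed record")     # returns without waiting for the disk
```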
Cashmere is a software distributed shared memory (SDSM) system designed for today's high performance cluster architectures. These clusters typically consist of symmetric multiprocessors (SMPs) connected by a low-latency system area network. Cashmere introduces several novel techniques for delegating intra-node sharing to the hardware coherence mechanism available within the SMPs, and also for leveraging advanced network features such as remote memory access. The efficacy of ...

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture, 2008
Transactional Memory is a promising parallel programming model that addresses the programmability issues of lock-based applications using mechanisms that are transparent to developers. Hardware Transactional Memory (HTM) implements these mechanisms in silicon to obtain better results than fine-grain locking solutions. One of these mechanisms is data version management, which decides how and where the modifications introduced by transactions are stored to guarantee their atomicity and durability. In this paper, we show that aborts are frequent, especially for applications with coarse-grain transactions and many threads, and that this severely restricts the scalability of log-based HTMs. To address this issue, we propose the use of a gated store buffer to accelerate eager version management for log-based HTM. Moreover, we propose a novel design where the store buffer is used to perform lazy version management (similar to Rock [12]) but overflowed transactions execute with a fallback log-based HTM that uses eager version management. Assuming an infinite store buffer, we show that lazy version management is better suited to applications with fine-grain transactions while eager version management is better suited to applications with coarse-grain transactions. Limiting the buffer size to 32 entries, we obtain a 20.1% average improvement over log-based HTM for applications with fine-grain transactions (using lazy version management) and 54.7% for applications with coarse-grain transactions (using eager version management).
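A simplified sketch contrasting the two version-management policies the paper combines: lazy management buffers new values in a bounded store buffer and discards them on abort, while eager management writes in place and keeps old values in an undo log that is replayed on abort. The data structures and the 32-entry limit shown are illustrative simplifications, not the hardware design:

```python
# Toy contrast of lazy vs. eager version management for transactional writes.

memory = {"x": 0, "y": 0}

class LazyTx:
    def __init__(self, buf_size=32):
        self.buf, self.buf_size = {}, buf_size
    def write(self, addr, val):
        if addr not in self.buf and len(self.buf) >= self.buf_size:
            raise OverflowError("store buffer overflow: fall back to log-based HTM")
        self.buf[addr] = val                        # memory untouched until commit
    def commit(self):
        memory.update(self.buf)                     # drain the buffer at commit
    def abort(self):
        self.buf.clear()                            # abort is cheap: nothing to undo

class EagerTx:
    def __init__(self):
        self.undo_log = []
    def write(self, addr, val):
        self.undo_log.append((addr, memory[addr]))  # log the old value first
        memory[addr] = val                          # then write in place
    def commit(self):
        self.undo_log.clear()                       # commit is cheap
    def abort(self):
        for addr, old in reversed(self.undo_log):   # abort replays the log backwards
            memory[addr] = old
        self.undo_log.clear()

eager = EagerTx(); eager.write("x", 5); eager.abort(); print(memory["x"])  # -> 0
lazy = LazyTx();   lazy.write("y", 9);  lazy.commit();  print(memory["y"])  # -> 9
```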
Interweave
ACM SIGOPS Operating Systems Review, 2000

Proceedings of the 10th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronics Systems (ITHERM 2006)
With chip temperature being a major hurdle in microprocessor design, techniques to recover the performance loss due to thermal emergency mechanisms are crucial in order to sustain performance growth. Many techniques for power reduction in the past, and some on thermal management more recently, have contributed to alleviating this problem. Probably the most important thermal control technique is dynamic voltage and frequency scaling (DVS), which allows for an almost cubic reduction in power with a worst-case performance penalty that is only linear. So far, DVS techniques for temperature control have been studied at the chip level. Finer-grain DVS is feasible if a Globally-Asynchronous Locally-Synchronous (GALS) design style is employed. GALS, also known as Multiple-Clock Domain (MCD), allows for independent voltage and frequency control for each one of the clock domains that are part of the chip. There are several studies on DVS for GALS that aim to improve energy and power efficiency, but not temperature. This paper proposes and analyzes the use of DVS at the domain level to control temperature in a clustered MCD microarchitecture, with the goal of improving the performance of applications that do not meet the thermal constraints imposed by the designers.
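A hedged sketch of a per-domain thermal controller along these lines: only the overheating domain is scaled down, and it recovers once it cools below a release threshold. The temperature thresholds, voltage/frequency pairs, and domain names are assumptions, not values from the paper:

```python
# Illustrative per-domain thermal controller: throttle only the hot clock domain
# instead of the whole chip. Thresholds and V/f levels are placeholder values.

V_F_LEVELS = [(0.9, 0.6), (1.0, 0.8), (1.1, 1.0)]   # (voltage, normalized frequency)

class ThermalController:
    def __init__(self, trigger=85.0, release=80.0):
        self.trigger, self.release = trigger, release   # degrees C, with hysteresis
        self.level = {}                                 # per-domain index into V_F_LEVELS

    def step(self, temps):
        """temps: {domain: current temperature}. Returns {domain: (V, f)}."""
        out = {}
        for dom, t in temps.items():
            lvl = self.level.get(dom, len(V_F_LEVELS) - 1)
            if t > self.trigger and lvl > 0:
                lvl -= 1                 # thermal emergency: scale this domain down
            elif t < self.release and lvl < len(V_F_LEVELS) - 1:
                lvl += 1                 # cooled off: recover performance
            self.level[dom] = lvl
            out[dom] = V_F_LEVELS[lvl]
        return out

tc = ThermalController()
print(tc.step({"integer": 88.0, "load_store": 72.0}))
```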

Proceedings of the Eighth International Symposium on High Performance Computer Architecture (HPCA-8), 2002
As clock frequency increases and feature size decreases, clock distribution and wire delays present a growing challenge to the designers of singly-clocked, globally synchronous systems. We describe an alternative approach, which we call a Multiple Clock Domain (MCD) processor, in which the chip is divided into several (coarse-grained) clock domains, within which independent voltage and frequency scaling can be performed. Boundaries between domains are chosen to exploit existing queues, thereby minimizing inter-domain synchronization costs. We propose four clock domains, corresponding to the front end (including the L1 instruction cache), integer units, floating point units, and load-store units (including the L1 data cache and L2 cache). We evaluate this design using a simulation infrastructure based on SimpleScalar and Wattch. In an attempt to quantify potential energy savings independent of any particular on-line control strategy, we use off-line analysis of traces from a single-speed run of each of our benchmark applications to identify profitable reconfiguration points for a subsequent dynamic scaling run. Dynamic runs incorporate a detailed model of inter-domain synchronization delays, with latencies for intra-domain scaling similar to the whole-chip scaling latencies of Intel XScale and Transmeta LongRun technologies. Using applications from the MediaBench, Olden, and SPEC2000 benchmark suites, we obtain an average energy-delay product improvement of 20% with MCD, compared to a modest 3% savings from voltage scaling a single clock and voltage system.
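A toy sketch of the boundary idea: domains exchange work only through existing queues, so a crossing pays a small synchronization delay rather than requiring a global clock. The cycle accounting below is a deliberate simplification for illustration:

```python
# Simplified inter-domain queue: producer and consumer run on independent clocks,
# and consuming across the boundary adds a fixed synchronization penalty.

class InterDomainQueue:
    """FIFO between two clock domains; crossing the boundary pays a sync delay."""
    def __init__(self, sync_penalty_cycles=1):
        self.items = []
        self.sync_penalty = sync_penalty_cycles

    def push(self, item):
        self.items.append(item)          # producer enqueues in its own clock domain

    def pop(self, consumer_cycle):
        if not self.items:
            return None, consumer_cycle
        # Crossing the boundary costs a small synchronization penalty, but the
        # two domains never need a shared global clock.
        return self.items.pop(0), consumer_cycle + self.sync_penalty

q = InterDomainQueue()
q.push("decoded-uop")
print(q.pop(consumer_cycle=4))   # -> ('decoded-uop', 5)
```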