Hardware acceleration is a widely accepted route to performant, energy-efficient computation because it removes hardware needed only for general-purpose computation while delivering exceptional performance via specialized control paths and execution units. The spectrum of accelerators available today ranges from coarse-grain off-load engines such as GPUs to fine-grain instruction set extensions such as SSE. This research explores the benefits and challenges of managing memory at the data-structure level and exposing those operations directly to the ISA. We call these instructions Abstract Datatype Instructions (ADIs). This paper quantifies the performance and energy impact of ADIs on the instruction and data cache hierarchies. For instruction fetch, our measurements indicate that ADIs can yield 21-48% and 16-27% reductions in instruction fetch time and energy, respectively. For data delivery, we observe a 22-40% reduction in total data read/write time and a 9-30% reduction in total data read/write energy.
2005
Instruction caches typically consume 27% of the total power in modern high-end embedded systems. We propose a compiler-managed instruction store architecture (K-store) that places computation-intensive loops in a scratchpad-like SRAM memory and allocates the remaining instructions to a regular instruction cache. At runtime, execution switches dynamically between the instructions in the traditional instruction cache and those in the K-store by means of inserted jump instructions. The necessary jump instructions add on average 0.038% to the total dynamic instruction count. We compare the performance and energy consumption of the K-store with that of a conventional instruction cache of equal size. When used in lieu of an 8 KB, 4-way set-associative instruction cache, the K-store provides a 32% reduction in energy and a 7% reduction in execution time. Unlike loop caches, the K-store maps the frequent code into a reserved address space and hence can switch between the kernel memory and the instruction cache without any noticeable performance penalty.
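As a rough illustration of the K-store allocation step, the sketch below greedily picks profile-hot loops for a fixed-size kernel store; the loop names, sizes, profile counts and the 4 KB capacity are invented for the example, and the selection heuristic is only an assumption, not the paper's compiler algorithm.

```cpp
// Illustrative sketch (not the paper's algorithm): profile-guided greedy
// selection of hot loops into a fixed-size kernel store (K-store).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Loop {
    std::string name;
    uint32_t    size_bytes;   // static code size of the loop body
    uint64_t    exec_count;   // dynamic instruction count from profiling
};

int main() {
    const uint32_t kstore_capacity = 4 * 1024;  // assumed 4 KB SRAM kernel store

    std::vector<Loop> loops = {
        {"fir_filter", 512, 90'000'000},
        {"huffman_decode", 1536, 40'000'000},
        {"crc32", 256, 25'000'000},
        {"init_tables", 2048, 10'000},
    };

    // Rank loops by dynamic instructions per byte of K-store they would occupy.
    std::sort(loops.begin(), loops.end(), [](const Loop& a, const Loop& b) {
        return a.exec_count * b.size_bytes > b.exec_count * a.size_bytes;
    });

    uint32_t used = 0;
    uint64_t inserted_jumps = 0;  // static jumps added: one to enter, one to return
    for (const Loop& l : loops) {
        if (used + l.size_bytes > kstore_capacity) continue;
        used += l.size_bytes;
        inserted_jumps += 2;
        std::cout << "place " << l.name << " in K-store (" << l.size_bytes << " B)\n";
    }
    std::cout << "K-store bytes used: " << used << " / " << kstore_capacity
              << ", static jumps inserted: " << inserted_jumps << "\n";
    return 0;
}
```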
2003 Design, Automation and Test in Europe Conference and Exhibition, 2003
Modern embedded processors use data caches with ever higher degrees of associativity in order to increase performance. A set-associative data cache consumes a significant fraction of the total power budget in such embedded processors. This paper describes a technique for reducing D-cache power consumption and shows its impact on the power and performance of an embedded processor. The technique uses cache-line address locality to determine (rather than predict) the cache way prior to the cache access. It thus allows only the desired way to be accessed, for both tags and data. The proposed mechanism is shown to reduce average L1 data cache power consumption when running the MiBench embedded benchmark suite for 8-, 16- and 32-way set-associative caches by an average of 66%, 72% and 76%, respectively. The absolute power savings from this technique increase significantly with associativity. The design has no impact on performance and, since it involves no mis-prediction penalties, it does not introduce any new non-deterministic behavior in program execution.
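A minimal software model of the way-determination idea follows; an idealized, unbounded table keyed by cache-line address stands in for the paper's bounded hardware structure, and all sizes and the workload are assumptions.

```cpp
// Illustrative sketch (assumptions, not the paper's design): a set-associative
// cache model in which a table keyed by cache-line address remembers the way
// a line was placed in, so a repeated access probes only that one way.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

constexpr int kWays = 8;
constexpr int kSets = 64;
constexpr int kLineBytes = 32;

struct Set { uint64_t tag[kWays] = {0}; bool valid[kWays] = {false}; int next = 0; };

int main() {
    std::vector<Set> cache(kSets);
    std::unordered_map<uint64_t, int> way_memo;  // line address -> way (idealized, unbounded)
    uint64_t single_way_probes = 0, full_probes = 0;

    auto access = [&](uint64_t addr) {
        uint64_t line = addr / kLineBytes;
        int set_idx  = line % kSets;
        uint64_t tag = line / kSets;
        Set& s = cache[set_idx];

        auto memo = way_memo.find(line);
        if (memo != way_memo.end() && s.valid[memo->second] && s.tag[memo->second] == tag) {
            ++single_way_probes;          // way determined in advance: one tag+data read
            return;
        }
        ++full_probes;                    // fall back to probing every way
        for (int w = 0; w < kWays; ++w)
            if (s.valid[w] && s.tag[w] == tag) { way_memo[line] = w; return; }
        int victim = s.next; s.next = (s.next + 1) % kWays;   // simple round-robin fill
        s.tag[victim] = tag; s.valid[victim] = true; way_memo[line] = victim;
    };

    for (int rep = 0; rep < 4; ++rep)     // a small strided loop as a stand-in workload
        for (uint64_t a = 0; a < 64 * kLineBytes; a += 8) access(a);

    std::cout << "single-way probes: " << single_way_probes
              << "  full probes: " << full_probes << "\n";
    return 0;
}
```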
We propose the Dynamic Associative Cache (DAC), a low-complexity design to improve the energy efficiency of data caches with negligible performance overhead. The key idea of DAC is to adapt cache associativity dynamically, switching the cache operation between direct-mapped and set-associative regimes during program execution. To monitor the program's needs in terms of cache associativity, the DAC design employs a subset of shadow tags: when the main cache operates in the set-associative mode, the shadow tags operate in the direct-mapped mode, and vice versa. The difference in hit rates between the main tags and the shadow tags is used as the indicator for cache mode switching. We show that DAC performs most of its accesses in the direct-mapped mode, resulting in significant energy savings while maintaining performance close to that of a set-associative L1 D-cache.
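The mode-switch decision described above might look roughly like the controller sketched below; the interval length, switching threshold and synthetic hit pattern are assumptions, not values from the paper.

```cpp
// Illustrative sketch (assumed interval/threshold values): a DAC-style
// controller compares main and shadow hit counters at the end of each
// interval and switches regime if the other mode's tags are doing better.
#include <cstdint>
#include <iostream>

enum class Mode { DirectMapped, SetAssociative };

class DacController {
public:
    void record(bool main_hit, bool shadow_hit) {
        main_hits_   += main_hit;
        shadow_hits_ += shadow_hit;
        if (++accesses_ == kInterval) decide();
    }
    Mode mode() const { return mode_; }

private:
    void decide() {
        // Switch only if the shadow (other-mode) tags out-hit the main tags
        // by more than the threshold, to avoid thrashing between modes.
        if (shadow_hits_ > main_hits_ + kThreshold) {
            mode_ = (mode_ == Mode::DirectMapped) ? Mode::SetAssociative
                                                  : Mode::DirectMapped;
            std::cout << "switching mode\n";
        }
        main_hits_ = shadow_hits_ = accesses_ = 0;
    }

    static constexpr uint32_t kInterval  = 100000;  // accesses per decision window (assumed)
    static constexpr uint32_t kThreshold = 512;     // hit-count margin before switching (assumed)
    Mode     mode_ = Mode::DirectMapped;            // start in the low-energy mode
    uint32_t main_hits_ = 0, shadow_hits_ = 0, accesses_ = 0;
};

int main() {
    DacController dac;
    // Feed a synthetic phase where the set-associative shadow tags hit more often.
    for (int i = 0; i < 100000; ++i) dac.record(/*main_hit=*/i % 2 == 0, /*shadow_hit=*/true);
    std::cout << (dac.mode() == Mode::SetAssociative ? "now set-associative\n"
                                                     : "still direct-mapped\n");
    return 0;
}
```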
2005
The memory subsystem has always been a performance bottleneck as well as a significant power contributor in memory-intensive applications. Many researchers have presented multilayered memory hierarchies as a means of designing energy- and performance-efficient systems. However, most previous work does not explore the trade-offs systematically. We fill this gap by proposing a formalized two-step technique: in the first step, our technique takes into account data reuse, the limited lifetimes of an application's arrays, and application-specific prefetching opportunities, and performs a thorough trade-off exploration for different data memory layer sizes. In the second step, we optimize the instruction memory and achieve further energy and performance gains. Following this approach, we have been able to reduce execution time by up to 60% and energy consumption by up to 70% for nine real-life applications.
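The first exploration step can be pictured with the toy sketch below, which sweeps candidate data-layer sizes through an invented cost model and keeps the Pareto-optimal (time, energy) points; none of the coefficients come from the paper.

```cpp
// Illustrative sketch (toy cost model, invented coefficients): exhaustive
// exploration of data-layer sizes, reporting only Pareto-optimal points.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct Point { uint32_t layer1_kb; double time; double energy; };

int main() {
    const double reuse_fraction_per_kb = 0.04;   // assumed: extra hit fraction per KB of layer
    const uint64_t accesses = 50'000'000;

    std::vector<Point> candidates;
    for (uint32_t kb = 1; kb <= 32; kb *= 2) {
        double hit    = std::min(0.95, reuse_fraction_per_kb * kb + 0.3);  // toy reuse curve
        double e_hit  = 0.2 + 0.02 * kb;         // bigger layer costs more energy per access
        double time   = accesses * (hit * 1.0 + (1 - hit) * 20.0);         // cycles
        double energy = accesses * (hit * e_hit + (1 - hit) * 10.0);       // arbitrary units
        candidates.push_back({kb, time, energy});
    }

    // Keep only Pareto-optimal points (no other point is better in both metrics).
    for (const Point& p : candidates) {
        bool dominated = false;
        for (const Point& q : candidates)
            if (q.time < p.time && q.energy < p.energy) { dominated = true; break; }
        if (!dominated)
            std::cout << p.layer1_kb << " KB layer: time " << p.time
                      << ", energy " << p.energy << "\n";
    }
    return 0;
}
```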
Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design - ISLPED '12, 2012
To access a set-associative L1 cache in a high-performance processor, all ways of the selected set are searched and fetched in parallel using physical address bits. Such a cache is oblivious to the software semantics of memory references, such as the stack-heap bifurcation of the memory space and user-kernel ring levels. This wastes energy since, for example, a user-mode instruction fetch will never hit a cache block that contains kernel code. Similarly, a stack access will not hit a cache line that contains heap data. We propose to exploit software semantics in cache design to avoid unnecessary associative searches, thus reducing dynamic power consumption. Specifically, we utilize virtual memory region properties to optimize the data cache and ring-level information to optimize the instruction cache. Our design does not impact performance and incurs very small hardware cost. Simulation results using SPEC CPU and SPECjapps indicate that the proposed designs reduce cache block fetches from the DL1 and IL1 by 27% and 57% respectively, resulting in average savings of 15% of DL1 power and more than 30% of IL1 power compared to an aggressively clock-gated baseline.
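A simplified model of region-based way filtering is sketched below, reduced to a single stack/heap bit per cache way; sizes, the replacement policy and the synthetic workload are assumptions.

```cpp
// Illustrative sketch (one stack/heap bit, toy policy): skip ways whose
// resident line cannot match the reference's memory region, and count how
// many way probes that avoids versus a conventional parallel lookup.
#include <cstdint>
#include <iostream>
#include <vector>

constexpr int kWays = 8, kSets = 64, kLine = 64;
enum Region : uint8_t { kHeap = 0, kStack = 1 };

struct Way { uint64_t tag = 0; bool valid = false; Region region = kHeap; };

int main() {
    std::vector<std::vector<Way>> sets(kSets, std::vector<Way>(kWays));
    uint64_t probes_conventional = 0, probes_filtered = 0;
    int rr = 0;  // shared round-robin victim pointer (toy replacement policy)

    auto access = [&](uint64_t addr, Region r) {
        uint64_t line = addr / kLine;
        auto& set = sets[line % kSets];
        uint64_t tag = line / kSets;
        probes_conventional += kWays;                 // baseline: every way read in parallel
        bool hit = false;
        for (auto& w : set) {
            if (w.valid && w.region != r) continue;   // region mismatch: way not probed
            ++probes_filtered;                        // probe this way (tag + data)
            if (w.valid && w.tag == tag) hit = true;
        }
        if (!hit) {
            int victim = rr; rr = (rr + 1) % kWays;
            for (int i = 0; i < kWays; ++i) if (!set[i].valid) { victim = i; break; }
            set[victim].tag = tag; set[victim].valid = true; set[victim].region = r;
        }
    };

    // Synthetic workload: a heap stream mixed with a small, hot stack frame.
    for (int i = 0; i < 20000; ++i) {
        access(0x100000 + (i % 4096) * 8, kHeap);
        access(0x7fff0000 + (i % 16) * 8, kStack);
    }
    std::cout << "way probes, conventional: " << probes_conventional << "\n"
              << "way probes, region-filtered: " << probes_filtered << "\n";
    return 0;
}
```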
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000
We introduce the Micro-Operation Cache (Uop Cache, UC), designed to reduce the processor's front-end power and energy consumption without performance degradation. The UC caches basic blocks of instructions pre-decoded into micro-operations (uops) and fetches a single basic block's worth of uops per cycle. Fetching complete pre-decoded basic blocks eliminates the need to repeatedly decode variable-length instructions and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. The UC design enables even a small structure to be quite effective. Results: a moderate-sized UC eliminates about 75% of instruction decodes across a broad range of benchmarks, and over 90% in multimedia applications and high-power tests. For existing Intel P6 family processors, the eliminated work may save about 10% of full-chip power consumption with no performance degradation.
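The decode-elimination effect can be approximated with the toy front-end model below; the basic-block trace, uop-cache capacity and one-uop-per-instruction assumption are invented, not taken from the UC design.

```cpp
// Illustrative sketch (toy frontend model, invented parameters): a uop cache
// keyed by basic-block start address. A hit supplies pre-decoded uops for the
// whole block, so the legacy decoders are bypassed for those instructions.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct BasicBlock { uint64_t start; int num_insts; };

int main() {
    // A tiny loop nest expressed as a repeating basic-block trace.
    std::vector<BasicBlock> blocks = {{0x400000, 6}, {0x400020, 3}, {0x400040, 8}};
    std::unordered_map<uint64_t, int> uop_cache;   // block start -> cached uop count
    const size_t uc_capacity_blocks = 64;          // assumed capacity, in basic blocks

    uint64_t decoded = 0, delivered_from_uc = 0;
    for (int iter = 0; iter < 10000; ++iter) {
        for (const auto& bb : blocks) {
            auto it = uop_cache.find(bb.start);
            if (it != uop_cache.end()) {
                delivered_from_uc += it->second;   // pre-decoded uops, decoders stay idle
            } else {
                decoded += bb.num_insts;           // legacy path: fetch + decode
                if (uop_cache.size() < uc_capacity_blocks)
                    uop_cache[bb.start] = bb.num_insts;  // assume one uop per instruction
            }
        }
    }
    uint64_t total = decoded + delivered_from_uc;
    std::cout << "decodes eliminated: " << 100.0 * delivered_from_uc / total << "%\n";
    return 0;
}
```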
Small filter caches (L0 caches) can be used to obtain significantly reduced energy consumption for embedded systems, but this benefit comes at the cost of increased execution time due to frequent L0 cache misses. The Instruction Register File (IRF) is an architectural extension for providing improved access to frequently occurring instructions. An optimizing compiler can exploit an IRF by packing an application's instructions, resulting in decreased code size, reduced energy consumption and improved execution time, primarily due to a smaller footprint in the instruction cache. The nature of the IRF also allows the execution of packed instructions to overlap with instruction fetch, thus providing a means for tolerating increased fetch latencies. This paper explores the use of an L0 cache enhanced with an IRF to provide even further reduced energy consumption with improved execution time. The results indicate that the IRF is an effective means for offsetting execution time ...
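A back-of-envelope sketch of IRF-style packing follows; the 32-entry IRF, the five-slots-per-packed-word limit and the synthetic trace are assumptions, and packing is estimated over a dynamic trace rather than performed statically by a compiler as in the paper.

```cpp
// Illustrative sketch (assumed 32-entry IRF, up to 5 slots per packed word):
// promote the most frequent instruction encodings into an instruction
// register file, then estimate how many fetches packing would remove.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <vector>

int main() {
    // Dynamic trace of (fake) instruction encodings; real input would come from profiling.
    std::vector<uint32_t> trace;
    for (int i = 0; i < 5000; ++i) {
        trace.push_back(0xA0 + i % 8);                   // a hot 8-instruction loop body
        if (i % 50 == 0) trace.push_back(0xF00 + i);     // colder, more varied code
    }

    // Promote the 32 most frequent encodings into the IRF.
    std::map<uint32_t, uint64_t> freq;
    for (uint32_t inst : trace) ++freq[inst];
    std::vector<std::pair<uint64_t, uint32_t>> ranked;
    for (auto& [inst, n] : freq) ranked.push_back({n, inst});
    std::sort(ranked.rbegin(), ranked.rend());
    std::set<uint32_t> irf;
    for (size_t i = 0; i < ranked.size() && i < 32; ++i) irf.insert(ranked[i].second);

    // Pack runs of IRF-resident instructions: up to 5 IRF indices per fetched word.
    uint64_t fetched_words = 0;
    for (size_t i = 0; i < trace.size();) {
        if (irf.count(trace[i])) {
            size_t run = 0;
            while (i < trace.size() && irf.count(trace[i]) && run < 5) { ++i; ++run; }
            ++fetched_words;              // the whole run is one packed instruction
        } else {
            ++i; ++fetched_words;         // ordinary instruction: one fetch each
        }
    }
    std::cout << "fetches: " << fetched_words << " instead of " << trace.size() << "\n";
    return 0;
}
```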
2006
In this paper we show that cache memories for embedded applications can be designed to both increase performance and reduce energy consumption. We show that using separate (data) caches for indexed or stream data and for scalar data items can lead to substantial improvements in terms of cache misses. The sizes of the various cache structures should be customized to meet the application's needs. We show that reconfigurable split data caches can be designed to meet wide-ranging embedded applications' performance, energy and silicon area budgets. The optimal cache organizations lead on average to 62% and 49% reductions in overall cache size, 37% and 21% reductions in cache access time, and 47% and 52% reductions in power consumption for the instruction and data cache respectively, when compared to an 8 KB instruction cache and an 8 KB unified data cache for media benchmarks from the MiBench suite.
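The pollution argument for split data caches can be seen in the toy comparison below, which routes compiler-classified stream and scalar accesses to separate direct-mapped caches of the same combined size as a unified cache; cache sizes and the synthetic access pattern are invented.

```cpp
// Illustrative sketch (toy direct-mapped models, invented sizes): compare a
// unified data cache with a split scalar/stream organization of equal total size.
#include <cstdint>
#include <iostream>
#include <vector>

struct DirectMappedCache {
    explicit DirectMappedCache(int lines) : tags(lines, ~0ull) {}
    bool access(uint64_t addr) {                  // returns true on hit
        uint64_t line = addr / 32;
        uint64_t& slot = tags[line % tags.size()];
        if (slot == line) return true;
        slot = line;                              // fill on miss
        return false;
    }
    std::vector<uint64_t> tags;
};

int main() {
    DirectMappedCache unified(256);                  // 8 KB unified (256 x 32 B lines)
    DirectMappedCache scalar_c(64), stream_c(192);   // split: 2 KB scalar + 6 KB stream

    uint64_t unified_misses = 0, split_misses = 0;
    for (int i = 0; i < 100000; ++i) {
        uint64_t stream_addr = 0x10000 + (i % 1536) * 4;  // 6 KB array swept repeatedly
        uint64_t scalar_addr = 0x500 + (i % 32) * 4;      // small, hot set of scalars
        unified_misses += !unified.access(stream_addr);
        unified_misses += !unified.access(scalar_addr);
        split_misses   += !stream_c.access(stream_addr);  // classified as stream data
        split_misses   += !scalar_c.access(scalar_addr);  // classified as scalar data
    }
    std::cout << "unified misses: " << unified_misses
              << "  split misses: " << split_misses << "\n";
    return 0;
}
```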
2005
Horizontally partitioned data caches are a popular architectural feature in which the processor maintains two or more data caches at the same level of the hierarchy. Horizontally partitioned caches help reduce cache pollution and thereby improve performance. Consequently, most previous research has focused on exploiting horizontally partitioned data caches to improve performance, achieving energy reduction only as a byproduct of performance improvement. In contrast, in this paper we show that optimizing for performance trades off several opportunities for energy reduction. Our experiments on an HP iPAQ h4300-like memory subsystem demonstrate that optimizing for energy consumption results in up to 127% less memory subsystem energy consumption than the performance-optimal solution. Furthermore, we show that the energy-optimal solution incurs on average only a 1.7% performance penalty. Therefore, with energy consumption becoming a first-class design constraint, there is a need for compilation techniques aimed at energy reduction. To achieve these energy savings we propose and explore several low-complexity algorithms aimed at reducing energy consumption, and show that very simple greedy heuristics achieve 97% of the possible memory subsystem energy savings.
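One plausible shape for such a greedy heuristic is sketched below; the per-access energy numbers, object sizes and mini-cache capacity are invented stand-ins for the paper's detailed memory-subsystem energy model.

```cpp
// Illustrative sketch (invented per-access energy numbers): a simple greedy
// heuristic assigning program objects to the mini cache or the main cache of
// a horizontally partitioned hierarchy to lower a modeled energy cost.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Object {
    std::string name;
    uint32_t    size_bytes;
    uint64_t    accesses;     // profiled dynamic accesses
};

int main() {
    const uint32_t mini_capacity = 2 * 1024;      // 2 KB mini cache (assumed)
    const double   e_mini = 0.15, e_main = 1.0;   // relative energy per access (assumed)

    std::vector<Object> objs = {
        {"coeff_table", 512, 5'000'000},
        {"state", 256, 3'000'000},
        {"frame_buffer", 65536, 2'000'000},
        {"lut", 1024, 900'000},
        {"misc", 4096, 100'000},
    };

    // Greedy: take objects in order of accesses per byte while they fit.
    std::sort(objs.begin(), objs.end(), [](const Object& a, const Object& b) {
        return static_cast<double>(a.accesses) / a.size_bytes >
               static_cast<double>(b.accesses) / b.size_bytes;
    });

    uint32_t used = 0;
    double energy = 0.0, baseline = 0.0;          // per-access costs stand in for miss-rate models
    for (const Object& o : objs) {
        baseline += o.accesses * e_main;          // everything mapped to the main cache
        bool to_mini = used + o.size_bytes <= mini_capacity;
        if (to_mini) used += o.size_bytes;
        energy += o.accesses * (to_mini ? e_mini : e_main);
        std::cout << o.name << " -> " << (to_mini ? "mini" : "main") << " cache\n";
    }
    std::cout << "modeled energy: " << energy << " vs baseline " << baseline << "\n";
    return 0;
}
```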
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000
The memory hierarchy of high-performance and embedded processors has been shown to be one of the major energy consumers. For example, the Level-1 (L1) instruction cache (I-Cache) of the StrongARM processor accounts for 27% of the power dissipation of the whole chip, whereas the instruction fetch unit (IFU) and the I-Cache of Intel's Pentium Pro processor are the single most important power-consuming modules, with 14% of the total power dissipation [2]. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly large percentage of the total chip area. In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the I-Cache and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, can effectively eliminate the need for high utilization of the more expensive I-Cache. We propose, implement, and evaluate five techniques for dynamic analysis of program instruction access behavior, which is then used to proactively guide accesses to the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache, since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy trade-offs involved in these dynamic schemes. Results for these benchmarks indicate that more than 60% of the energy dissipated in the I-Cache subsystem can be saved.
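A possible dynamic frequency filter for the L0-Cache is sketched below; it is not one of the paper's five schemes, and the L0 size, hot threshold and trace are assumptions.

```cpp
// Illustrative sketch (one possible dynamic filter, assumed parameters):
// blocks are fetched from a tiny L0 cache only once they have proven hot,
// so cold code keeps bypassing the L0 and the L1 I-Cache serves hot loops
// far less often.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    const int l0_lines = 16;               // 16-line L0 (assumed)
    const int hot_threshold = 8;           // executions before a block counts as hot (assumed)

    std::unordered_map<uint64_t, int> exec_count;   // block address -> times seen
    std::unordered_map<uint64_t, bool> in_l0;       // blocks currently living in the L0
    uint64_t l0_fetches = 0, l1_fetches = 0;

    auto fetch = [&](uint64_t block) {
        if (in_l0.count(block)) { ++l0_fetches; return; }          // cheap L0 fetch
        ++l1_fetches;                                              // expensive I-Cache fetch
        if (++exec_count[block] >= hot_threshold &&
            static_cast<int>(in_l0.size()) < l0_lines)
            in_l0[block] = true;            // promote a proven-hot block into the L0
    };

    // Synthetic trace: a hot 4-block loop plus a long tail of cold blocks.
    std::vector<uint64_t> hot = {0x1000, 0x1040, 0x1080, 0x10c0};
    for (int i = 0; i < 50000; ++i) {
        for (uint64_t b : hot) fetch(b);
        fetch(0x200000 + (i % 10000) * 64);  // cold code: never reaches the threshold
    }
    std::cout << "L0 fetches: " << l0_fetches
              << "  L1 I-Cache fetches: " << l1_fetches << "\n";
    return 0;
}
```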