2007 25th International Conference on Computer Design, 2007
Several cache management techniques have been proposed that indirectly try to base their decisions on cache-line reuse distance, like Cache Decay, which is a postdiction of reuse distances: if a cache line has not been accessed for some "decay interval," we know that its reuse distance is at least as large as this decay interval. In this work, we propose to directly predict reuse distances via instruction-based (PC) prediction and to use this information for cache-level optimizations. In this paper, we choose as our optimization target the replacement policy of the L2 cache, because the gap between LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches, indicating that in many situations there is ample room for improvement. We evaluate our reuse-distance based replacement policy using a subset of the most memory-intensive SPEC2000 benchmarks, and our results show significant benefits across the board.
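To make the mechanism concrete, the following minimal sketch shows a PC-indexed reuse-distance predictor and a victim-selection routine that evicts the line whose predicted reuse lies furthest in the future; the table organization, field names, and last-value training rule are assumptions for illustration, not the authors' actual design.

```cpp
// Sketch: PC-indexed reuse-distance prediction driving victim selection (assumed structures).
#include <cstdint>
#include <unordered_map>
#include <vector>
#include <iostream>

struct Line {
    uint64_t tag = 0;
    uint64_t last_access = 0;   // global access count at the last touch
    uint32_t predicted_rd = 0;  // reuse distance predicted when the line was touched
    bool valid = false;
};

class PCReuseDistancePredictor {
    std::unordered_map<uint64_t, uint32_t> table_;   // PC -> last observed reuse distance
public:
    void train(uint64_t pc, uint32_t observed_rd) { table_[pc] = observed_rd; }
    uint32_t predict(uint64_t pc) const {
        auto it = table_.find(pc);
        return it == table_.end() ? 0 : it->second;
    }
};

// Evict the line whose predicted next reuse lies furthest in the future; a line that is
// already past its predicted reuse is treated as the best possible victim.
int select_victim(const std::vector<Line>& set, uint64_t now) {
    int victim = 0;
    int64_t best = -1;
    for (int i = 0; i < (int)set.size(); ++i) {
        if (!set[i].valid) return i;                               // fill a free way first
        int64_t elapsed   = (int64_t)(now - set[i].last_access);
        int64_t remaining = (int64_t)set[i].predicted_rd - elapsed;
        int64_t score = (remaining <= 0) ? INT64_MAX : remaining;
        if (score > best) { best = score; victim = i; }
    }
    return victim;
}

int main() {
    PCReuseDistancePredictor pred;
    pred.train(/*pc=*/0x400abc, /*observed_rd=*/64);
    std::vector<Line> set = {{0x1, 90, 200, true}, {0x2, 95, 10, true},
                             {0x3, 80, pred.predict(0x400abc), true}, {0x4, 99, 50, true}};
    std::cout << "victim way: " << select_victim(set, /*now=*/100) << "\n";
}
```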
2009 International Symposium on Systems, Architectures, Modeling, and Simulation, 2009
The effect of caching is fully determined by program locality or data reuse, and several cache management techniques try to base their decisions on the prediction of temporal locality in programs. However, prior work reports only rough techniques which either try to predict when a cache block loses its temporal locality or try to categorize cache items as highly or poorly temporal. In this work, we quantify the temporal characteristics of the cache block at run time by predicting the cache block reuse distances (measured in intervening cache accesses), based on the access patterns of the instructions (PCs) that touch the cache blocks. We show that an instruction-based reuse-distance predictor is very accurate and allows approximation of optimal replacement decisions, since we can "see" the future. We experimentally evaluate our prediction scheme in L2 caches of various sizes using a subset of the most memory-intensive SPEC2000 benchmarks. Our proposal obtains a significant IPC improvement over traditional LRU of up to 130.6% (17.2% on average) and also outperforms the previous state-of-the-art proposal (namely Dynamic Insertion Policy, or DIP) by up to 80.7% (15.8% on average).
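Measuring reuse distances in intervening cache accesses at run time needs only a global access counter plus, per block, the counter value and the PC of the last touch. A minimal sketch, with assumed names and a simple last-value training rule:

```cpp
// Sketch: per-block reuse distance (in intervening accesses) attributed to the last-touching PC.
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <iostream>

struct BlockMeta {
    uint64_t last_access_count = 0;  // global access counter value at the last touch
    uint64_t last_pc = 0;            // instruction that last touched the block
};

int main() {
    std::unordered_map<uint64_t, BlockMeta> meta;      // keyed by block address
    std::unordered_map<uint64_t, uint64_t> predictor;  // PC -> last observed reuse distance
    uint64_t access_count = 0;

    // (pc, block address) pairs standing in for an L2 access stream
    const std::pair<uint64_t, uint64_t> trace[] = {
        {0x400100, 0x1000}, {0x400200, 0x2000}, {0x400100, 0x1000}, {0x400200, 0x1000}};

    for (const auto& acc : trace) {
        ++access_count;
        auto it = meta.find(acc.second);
        if (it != meta.end()) {
            // reuse distance = number of intervening cache accesses since the last touch
            uint64_t rd = access_count - it->second.last_access_count - 1;
            predictor[it->second.last_pc] = rd;          // last-value training
            std::cout << "PC 0x" << std::hex << it->second.last_pc << std::dec
                      << " -> observed reuse distance " << rd << "\n";
        }
        meta[acc.second] = {access_count, acc.first};
    }
}
```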
Proceedings of the 42nd annual Southeast regional conference on - ACM-SE 42, 2004
Replacement policy, one of the key factors determining the effectiveness of a cache, becomes even more important with the latest technological trends toward highly associative caches. State-of-the-art processors employ various policies such as Random, Least Recently Used (LRU), Round-Robin, and PLRU (Pseudo LRU), indicating that there is no common wisdom about the best one. The optimal yet unattainable policy would replace the memory block whose next reference is farthest in the future, among all memory blocks present in the set. In our quest for a replacement policy as close to optimal as possible, we thoroughly explored the design space of existing replacement mechanisms using the SimpleScalar toolset and the SPEC CPU2000 benchmark suite, across a wide range of cache sizes and organizations. In order to better understand the behavior of different policies, we introduced new measures, such as the cumulative distribution of cache hits in the LRU stack. We also dynamically monitored the number of cache misses per 100,000 instructions. Our results show that the PLRU techniques can approximate and even outperform LRU with much lower complexity, for a wide range of cache organizations. However, a relatively large gap between LRU and the optimal replacement policy, of up to 50%, indicates that new research aimed at closing the gap is necessary. The cumulative distribution of cache hits in the LRU stack indicates a very good potential for way prediction using LRU information, since the percentage of hits to the bottom of the LRU stack is relatively high.
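The cumulative distribution of cache hits in the LRU stack can be gathered with very little bookkeeping. A single-set illustration with a toy trace and an assumed 4-way set:

```cpp
// Sketch: cumulative distribution of hits across LRU stack depths for one set (illustrative only).
#include <cstdint>
#include <vector>
#include <algorithm>
#include <iostream>

int main() {
    const int ways = 4;
    std::vector<uint64_t> stack;          // front = MRU, back = LRU
    std::vector<uint64_t> hits(ways, 0);
    uint64_t misses = 0;

    const uint64_t trace[] = {1, 2, 3, 1, 4, 2, 5, 1, 1, 3};
    for (uint64_t tag : trace) {
        auto it = std::find(stack.begin(), stack.end(), tag);
        if (it != stack.end()) {
            ++hits[it - stack.begin()];   // depth 0 = MRU hit
            stack.erase(it);
        } else {
            ++misses;
            if ((int)stack.size() == ways) stack.pop_back();  // evict the LRU entry
        }
        stack.insert(stack.begin(), tag); // promote to MRU
    }

    uint64_t cum = 0, total = misses;
    for (uint64_t h : hits) total += h;
    for (int d = 0; d < ways; ++d) {
        cum += hits[d];
        std::cout << "hits at depth <= " << d << ": "
                  << 100.0 * cum / total << "% of accesses\n";
    }
}
```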
2015
Modern processors use high-performance cache replacement policies that outperform traditional alternatives like leastrecently used (LRU). Unfortunately, current cache models use stack distances to predict LRU or its variants, and do not capture these high-performance policies. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. This model uses absolute reuse distances instead of stack distances, which makes it applicable to arbitrary age-based replacement policies. We thoroughly validate our model on several high-performance policies on synthetic and real benchmarks, where its median error is less than 1%. Finally, we present two case studies showing how to use the model to improve shared and single-stream cache performance.
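The notion of an age-based replacement policy can be pictured with a toy sketch (this is not the paper's probabilistic model): each line carries an age, a policy is just a ranking over ages, and the victim is the line whose rank is highest. LRU and MRU then differ only in the rank they assign.

```cpp
// Sketch: replacement policies viewed as ranking functions over line ages (hypothetical interface).
#include <cstdint>
#include <functional>
#include <vector>
#include <iostream>

struct Way { uint64_t age; };  // accesses since this line was last referenced

// Evict the way whose rank is highest under the given ranking function.
int victim(const std::vector<Way>& set, const std::function<double(uint64_t)>& rank) {
    int v = 0;
    for (int i = 1; i < (int)set.size(); ++i)
        if (rank(set[i].age) > rank(set[v].age)) v = i;
    return v;
}

int main() {
    std::vector<Way> set = {{3}, {12}, {7}, {1}};
    auto lru = [](uint64_t a) { return (double)a; };   // LRU: oldest line ranks highest
    auto mru = [](uint64_t a) { return -(double)a; };  // MRU: youngest line ranks highest
    std::cout << "LRU victim way: " << victim(set, lru) << "\n";
    std::cout << "MRU victim way: " << victim(set, mru) << "\n";
}
```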
Microprocessors and Microsystems, 2015
Cache memory plays a crucial role in determining the performance of processors, especially for embedded processors where area and power are tightly constrained. It is necessary to have effective management mechanisms, such as cache replacement policies, because modern embedded processors require not only efficient power consumption but also high performance. Practical cache replacement algorithms have focused on supporting the increasing data needs of processors. The commonly used Least Recently Used (LRU) replacement policy always predicts a near-immediate re-reference interval; hence, applications that exhibit a distant re-reference interval may perform poorly under LRU. In addition, recent studies have shown that the performance gap between LRU and theoretical optimal replacement (OPT) is large for highly associative caches. LRU is also susceptible to memory-intensive workloads whose working set is greater than the available cache size. These reasons motivate the design of alternative replacement algorithms to improve cache performance. This paper explores a low-overhead, high-performance cache replacement policy for embedded processors that builds on the mechanism of LRU replacement. Experiments indicate that the proposed policy yields significant improvements in performance and miss rate for large, highly associative last-level caches. The proposed policy is based on the tag-distance correlation among cache lines in a cache set. Rather than always replacing the LRU line, the victim is chosen by considering the LRU-behavior bit of the line combined with the correlation between the tags of the set's cache lines and the requested block's tag. By using the LRU-behavior bit, the LRU line is given a chance to reside longer in the set instead of being replaced immediately. Simulations with an out-of-order superscalar processor and memory-intensive benchmarks demonstrate that the proposed cache replacement algorithm can increase overall performance by 5.15% and reduce the miss rate by an average of 11.41%.
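A loose sketch of the described victim selection follows; the Hamming distance between tags stands in for the paper's tag-distance correlation and the second-chance handling is assumed, so this is purely illustrative.

```cpp
// Loose sketch: second chance for the LRU line via its behavior bit, with an assumed
// tag "distance" (Hamming distance) replacing the paper's correlation metric.
#include <cstdint>
#include <vector>
#include <bitset>
#include <iostream>

struct Line {
    uint64_t tag;
    bool lru_behavior;   // set when the LRU line deserves a second chance
};

int select_victim(std::vector<Line>& set, int lru_way, uint64_t req_tag) {
    if (!set[lru_way].lru_behavior)
        return lru_way;                          // plain LRU eviction
    // Give the LRU line a chance: evict the line whose tag is "farthest" from the request.
    set[lru_way].lru_behavior = false;
    int victim = 0;
    size_t worst = 0;
    for (int i = 0; i < (int)set.size(); ++i) {
        size_t dist = std::bitset<64>(set[i].tag ^ req_tag).count();
        if (dist >= worst) { worst = dist; victim = i; }
    }
    return victim;
}

int main() {
    std::vector<Line> set = {{0xABC0, false}, {0xABC8, true}, {0x0001, false}, {0xABCF, false}};
    std::cout << "victim way: " << select_victim(set, /*lru_way=*/1, /*req_tag=*/0xABCC) << "\n";
}
```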
2009 IEEE International Conference on Computer Design, 2009
The increasing speed gap between processor and memory and the limited memory bandwidth make last-level cache performance crucial for CMP architectures. Non-Uniform Cache Architectures (NUCA) have been introduced to deal with this problem. This memory organization divides the whole memory space into smaller pieces or banks, allowing nearer banks to have lower access latencies than farther banks. Moreover, an adaptive replacement policy that efficiently reduces misses in the last-level cache could boost performance, particularly if set associativity is adopted. Unfortunately, traditional replacement policies do not behave properly here, as they were designed for single-processor systems. This paper focuses on bank replacement. This policy involves three key decisions when there is a miss: where to place a data block within the cache set, which data to evict from the cache set, and finally, where to place the evicted data. We propose a novel replacement technique that enables more intelligent replacement decisions to be taken. This technique is based on the observation that some types of data are less commonly accessed depending on which bank they reside in. We call this technique LRU-PEA (Least Recently Used with a Priority Eviction Approach). We show that the proposed technique significantly reduces the requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and an Energy per Instruction (EPI) reduction of 5%.
American Journal of Embedded Systems and Applications, 2016
Cache is an important component in computer architecture and has great effects on system performance. Nowadays, the Least Recently Used (LRU) algorithm is one of the most commonly used ones because it is easy to implement and offers relatively good performance. However, in some particular cases, LRU is not a good choice. To provide references for computer architecture designers, this study proposed a new algorithm named Fully Replacement Policy (FRP), analyzed various factors that affect cache performance, and carried out simulation experiments of cache performance based on the SimpleScalar toolset and the SPEC2000 benchmark suite. The study compared the effects of FRP and LRU when set size, block size, associativity, and replacement method are changed separately. By experimentally analyzing the results, it was found that FRP outperforms LRU in some particular situations.
2017 First International Conference on Embedded & Distributed Systems (EDiS), 2017
The Last-Level Cache (LLC) is a critical component of the memory hierarchy which has a direct impact on performance. Whenever data requested by a processor core is not found in the cache, a transaction to the main memory is initiated, which results in both performance and energy penalties. Decreasing the LLC miss rate therefore lowers external memory transactions, which is beneficial both power- and performance-wise. The cache replacement policy has a direct impact on the miss rate: it is responsible for evicting cache lines whenever the cache runs full. Thus, a good policy should evict data that will be reused in the distant future and favour data that are likely to be accessed in the near future. The most common cache replacement policy is the Least-Recently Used (LRU) strategy. It has been used for years and is cheap in terms of hardware implementation. However, researchers have shown that LRU is not the most efficient policy from a performance point of view, and is furthermore largely sub-optimal compared to the best theoretical strategy. In this paper, we analyze a number of cache replacement policies that have been proposed over the last decade and carry out evaluations reporting performance and energy.
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016
Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models do not capture these high-performance policies as most use stack distances, which are inherently tied to LRU or its variants. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. It uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions. These innovations let us model arbitrary age-based replacement policies. Our model achieves median error of less than 1% across several high-performance policies on both synthetic and SPEC CPU2006 benchmarks. Finally, we present a case study showing how to use the model to improve shared cache performance.
IAEME, 2019
Modern multi-core systems allow concurrent execution of different applications on a single chip. Such multicores handle the large bandwidth requirement from the processing cores by employing multiple levels of caches, with one or two levels of private caches along with a shared last-level cache (LLC). In a shared LLC, when applications with varying access behavior compete with each other for space, conventional single-core cache replacement techniques can significantly degrade system performance. In such scenarios, we need an efficient replacement policy for reducing the off-chip memory traffic as well as contention for the memory bandwidth. This paper proposes a novel Application-aware Cache Replacement (ACR) policy for the shared LLC. The ACR policy considers the memory access behavior of the applications during the process of victim selection to prevent a low-access-rate application from being victimized by a high-access-rate application. It dynamically tracks the maximum lifetime of cache lines in the shared LLC for each concurrent application and helps in efficient utilization of the cache space. Experimental evaluation of the ACR policy for 4-core systems, with a 16-way set-associative 4MB LLC, using the SPEC CPU 2000 and 2006 benchmark suites shows a geometric mean speed-up of 8.7% over the least recently used (LRU) replacement policy. We show that the ACR policy performs better than the recently proposed thread-aware dynamic re-reference interval prediction (TA-DRRIP) and protecting distance based (PDP) techniques for various 2-core, 4-core and 8-core workloads.
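The lifetime-tracking part of the idea can be pictured with a toy sketch; the victim choice below is a placeholder and the lifetime unit (set accesses) is an assumption, so this only illustrates per-application bookkeeping, not ACR's actual policy.

```cpp
// Toy sketch: tracking the maximum observed cache-line lifetime per owning application
// (placeholder victim choice; lifetime measured in accesses is an assumption).
#include <cstdint>
#include <vector>
#include <utility>
#include <algorithm>
#include <iostream>

struct Line { uint64_t tag = 0; int app = -1; uint64_t inserted_at = 0; bool valid = false; };

int main() {
    const int num_apps = 2;
    std::vector<uint64_t> max_lifetime(num_apps, 0);
    std::vector<Line> set(4);
    uint64_t now = 0;

    // On eviction, record the longest lifetime seen for the line's owning application.
    auto on_evict = [&](const Line& l) {
        if (l.valid) max_lifetime[l.app] = std::max(max_lifetime[l.app], now - l.inserted_at);
    };

    // toy fills: (app, tag)
    const std::pair<int, uint64_t> fills[] = {{0, 1}, {1, 2}, {0, 3}, {1, 4}, {0, 5}, {1, 6}};
    for (const auto& f : fills) {
        ++now;
        int victim = (int)(now % set.size());   // placeholder victim choice, not ACR's rule
        on_evict(set[victim]);
        set[victim] = {f.second, f.first, now, true};
    }
    for (int a = 0; a < num_apps; ++a)
        std::cout << "app " << a << " max observed lifetime: " << max_lifetime[a] << "\n";
}
```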
2010
We propose SCORE, a novel adaptive cache replacement policy, which uses a score system to select a cache line to replace. Results show that SCORE offers low overall miss rates on SPEC CPU2006 benchmarks, and provides an average IPC that is 4.9% higher than LRU and 7.4% higher than LIP.
ACM SIGARCH Computer Architecture News, 2010
Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP) that is scan-resistant and Dynamic RRIP (DRRIP) that is both scan-resistant and thrash-resistant. Both RRIP policies require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that both SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10% respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9% respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
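The SRRIP mechanism can be sketched directly from the abstract: 2-bit re-reference prediction values (RRPV), hits promoted to near-immediate, insertions at a long (but not distant) interval, and eviction of a line predicted to be re-referenced in the distant future. A minimal single-set sketch (set size and trace are illustrative):

```cpp
// Minimal SRRIP sketch with 2-bit re-reference prediction values (RRPV).
#include <cstdint>
#include <vector>
#include <initializer_list>
#include <iostream>

constexpr uint8_t MAX_RRPV = 3;   // 2 bits per block

struct Line { uint64_t tag = 0; bool valid = false; uint8_t rrpv = MAX_RRPV; };

int find_victim(std::vector<Line>& set) {
    for (;;) {
        for (int i = 0; i < (int)set.size(); ++i)
            if (!set[i].valid || set[i].rrpv == MAX_RRPV) return i;   // distant re-reference
        for (auto& l : set) ++l.rrpv;          // age everyone and retry
    }
}

bool access(std::vector<Line>& set, uint64_t tag) {
    for (auto& l : set)
        if (l.valid && l.tag == tag) { l.rrpv = 0; return true; }     // hit: near-immediate
    int v = find_victim(set);
    set[v] = {tag, true, MAX_RRPV - 1};        // insert with a long (not distant) interval
    return false;
}

int main() {
    std::vector<Line> set(4);
    for (uint64_t t : {1, 2, 3, 4, 1, 5, 1})
        std::cout << "tag " << t << (access(set, t) ? " hit" : " miss") << "\n";
}
```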
International Journal for Research in Applied Science and Engineering Technology IJRASET, 2020
Modern processors have clock speeds in the range of GHz while main memory (DRAM) has read/write speeds in the range of MHz, so the processor needs to halt until the memory completes its request. Each halt period may seem very small, but on a broad scale most of the processor's time is wasted in these halt cycles. Cache memory is intended to reduce the speed gap between the fast processor and the slow memory. When a program needs to access data from the RAM (physical memory), it first checks inside the cache (SRAM). Replacement policies are the methods by which memory blocks are replaced in a full cache. Cache replacement policies play a significant role in the memory organization of a processor and dictate how fast a processor will receive the block it demands. Various replacement policies such as RRIP, ABRRIP, AIRRIP, etc. have been designed but, unlike LRU, have not been widely implemented; LRU is predominantly used in most systems. The ABRRIP policy has two levels of RRPV: one at the block level and one at the core level. We observe that the performance of the ABRRIP policy improves as we increase the number of instructions during the simulation.
WDDD held in conjunction with ISCA, 2007
Cache replacement policy is a major design parameter of any memory hierarchy. The efficiency of the replacement policy affects both the hit rate and the access latency of a cache system. The higher the associativity of the cache, the more vital the replacement ...
2015
Multicore architectures bring a tremendous amount of processing power. Processor speed is increasing at a very fast rate compared to the access latency of main memory. In order to satisfy the need for increasing processing speed, there have been significant changes in the design of modern computing systems. A multicore processor can run multiple instructions at the same time, increasing overall program speed. It provides several complete execution cores instead of one, each with an independent interface to the front-side bus. Normally, a level-1 cache (CL1) is dedicated to each core, whereas the level-2 cache (CL2) can be shared or distributed. To increase processor performance, the cache miss rate must be decreased. Various replacement policies have been designed to reduce the miss rate and improve cache performance. This paper presents a comparative study of various replacement policies with their merits and demerits, and among all of them the study found that only a few algorithms are ...
ACM Transactions on Architecture and Code Optimization, 2013
Optimization of the replacement policy used for Shared Last-Level Cache (SLLC) management in a Chip-MultiProcessor (CMP) is critical for avoiding off-chip accesses. Temporal locality, while being exploited by the first levels of private cache memories, is only slightly exhibited by the stream of references arriving at the SLLC. Thus, traditional replacement algorithms based on recency are bad choices for governing SLLC replacement. Recent proposals involve SLLC replacement policies that attempt to exploit reuse either by segmenting the replacement list or by improving the re-reference interval prediction. On the other hand, inclusive SLLCs are commonplace in the CMP market, but the interaction between replacement policy and the enforcement of inclusion has barely been discussed. After analyzing that interaction, this article introduces two simple replacement policies exploiting reuse locality and targeting inclusive SLLCs: Least Recently Reused (LRR) and Not Recently Reused (NRR). NRR has t...
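A rough sketch in the spirit of reuse-locality-driven victim selection follows, using a per-line reuse bit to protect lines that have hit in the SLLC; the bit-clearing and tie-breaking rules are assumptions, not the paper's exact LRR/NRR algorithms.

```cpp
// Rough sketch: prefer evicting lines that have never been reused in the LLC (assumed details).
#include <cstdint>
#include <vector>
#include <initializer_list>
#include <iostream>

struct Line { uint64_t tag = 0; bool valid = false; bool reused = false; };

int find_victim(std::vector<Line>& set) {
    for (int i = 0; i < (int)set.size(); ++i)
        if (!set[i].valid) return i;
    for (int i = 0; i < (int)set.size(); ++i)
        if (!set[i].reused) return i;          // prefer lines never reused in the LLC
    for (auto& l : set) l.reused = false;      // all reused: clear bits and fall back
    return 0;
}

bool access(std::vector<Line>& set, uint64_t tag) {
    for (auto& l : set)
        if (l.valid && l.tag == tag) { l.reused = true; return true; }  // a hit marks reuse
    int v = find_victim(set);
    set[v] = {tag, true, false};               // new lines start as not-reused
    return false;
}

int main() {
    std::vector<Line> set(4);
    for (uint64_t t : {1, 2, 1, 3, 4, 5, 1})
        std::cout << "tag " << t << (access(set, t) ? " hit" : " miss") << "\n";
}
```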
ACM Transactions on Programming Languages and Systems, 2004
The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed.
Real-Time Systems, 2007
Hard real-time systems must obey strict timing constraints. Therefore, one needs to derive guarantees on the worst-case execution times of a system's tasks. In this context, predictable behavior of system components is crucial for the derivation of tight and thus useful bounds. This paper presents results about the predictability of common cache replacement policies. To this end, we introduce three metrics, evict, fill, and mls that capture aspects of cache-state predictability. A thorough analysis of the LRU, FIFO, MRU, and PLRU policies yields the respective values under these metrics. To the best of our knowledge, this work presents the first quantitative, analytical results for the predictability of replacement policies. Our results support empirical evidence in static cache analysis.
International Journal of Advanced Computer Science and Applications, 2017
The recent advancement in the field of distributed computing highlights the need to develop highly associative and less expensive cache memories for state-of-the-art processors, e.g., Intel Core i6, i7, etc. Hence, various conventional studies have introduced cache replacement policies, which are one of the prominent key factors determining the effectiveness of a cache memory. Most conventional cache replacement algorithms are found to be not very efficient in terms of memory management and complexity. Therefore, a significant and thorough analysis is required to suggest a new optimal solution to the state-of-the-art cache replacement issues. The proposed study aims to conceptualize a theoretical model for optimal cache heap object replacement. The proposed model incorporates a tree-based and MRU (Most Recently Used) pseudo-LRU (Least Recently Used) mechanism and configures it with the JVM's garbage collector to replace old referenced objects in the heap cache lines. The performance analysis of the proposed system illustrates that it outperforms conventional state-of-the-art replacement policies with much lower cost and complexity. It also shows that the percentage of hits on the cache heap is relatively higher than with conventional technologies.
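The tree-based pseudo-LRU mechanism the proposal builds on is a standard structure; below is a generic textbook sketch for a power-of-two associativity (not the paper's JVM integration): one bit per internal tree node points toward the pseudo-LRU side, and an access flips the bits on its path away from the touched way.

```cpp
// Classic tree-based pseudo-LRU for a power-of-two associativity (generic sketch).
#include <cstdint>
#include <vector>
#include <iostream>

class TreePLRU {
    int ways_;
    std::vector<uint8_t> bits_;        // ways_-1 internal tree nodes, each 0 or 1
public:
    explicit TreePLRU(int ways) : ways_(ways), bits_(ways - 1, 0) {}

    // Walk the tree following the bits to find the pseudo-LRU way.
    int victim() const {
        int node = 0;
        while (node < ways_ - 1)
            node = 2 * node + 1 + bits_[node];
        return node - (ways_ - 1);
    }

    // On an access, flip the bits along the path so they point away from this way.
    void touch(int way) {
        int node = way + (ways_ - 1);
        while (node > 0) {
            int parent = (node - 1) / 2;
            bits_[parent] = (node == 2 * parent + 1) ? 1 : 0;  // point to the sibling
            node = parent;
        }
    }
};

int main() {
    TreePLRU plru(4);
    plru.touch(0); plru.touch(1); plru.touch(2); plru.touch(3);
    std::cout << "pseudo-LRU victim way: " << plru.victim() << "\n";  // way 0, the least recently touched
}
```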
2010
This paper presents a new cache replacement policy based on the Instruction-based Reuse Distance Prediction (IbRDP) Replacement Policy originally proposed by Keramidas, Petoumenos, and Kaxiras [5] and further optimized by Petoumenos et al. [6]. In these works we have shown that there is a strong correlation between the temporal characteristics of cache blocks and the access patterns of the instructions (PCs) that touch these cache blocks. Based on this observation we introduced a new class of instruction-based predictors which are able to directly predict, with high accuracy at run time, when a cache block is going to be accessed in the future, i.e., the reuse distance of a cache block. Being able to predict the reuse distances of cache blocks permits us to make near-optimal replacement decisions by "looking into the future." In this work, we employ an extension of the IbRDP Replacement policy [6]. We carefully re-design the organization as well as the functionality of the predictor and the corresponding replacement algorithm in order to fit into the tight area budget provided by the CRC committee [3]. Since our proposal naturally supports the ability to victimize the currently fetched blocks by not caching them at all (Selective Caching), we submit two versions for evaluation: the base IbRDP and IbRDP enhanced with Selective Caching (IbRDP+SC).
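The Selective Caching idea can be sketched as a simple bypass test; the rule below, which compares the incoming block's predicted reuse distance against those of the resident lines, is an illustrative assumption rather than the submitted design.

```cpp
// Sketch: bypass (do not cache) a fetched block whose predicted reuse distance is worse
// than that of every resident line in the target set (assumed bypass rule).
#include <cstdint>
#include <vector>
#include <iostream>

struct Line { uint64_t tag; uint32_t predicted_rd; };

// Returns true if the fetched block should bypass the set instead of evicting a line.
bool should_bypass(const std::vector<Line>& set, uint32_t incoming_rd) {
    for (const auto& l : set)
        if (l.predicted_rd > incoming_rd) return false;   // some resident line is a better victim
    return true;                                           // incoming block is the worst candidate
}

int main() {
    std::vector<Line> set = {{0x10, 40}, {0x20, 120}, {0x30, 75}, {0x40, 60}};
    std::cout << std::boolalpha
              << "bypass block with predicted rd 500: " << should_bypass(set, 500) << "\n"
              << "bypass block with predicted rd 50:  " << should_bypass(set, 50) << "\n";
}
```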