2010
We propose SCORE, a novel adaptive cache replacement policy, which uses a score system to select a cache line to replace. Results show that SCORE offers low overall miss rates on SPEC CPU2006 benchmarks, and provides an average IPC that is 4.9% higher than LRU and 7.4% higher than LIP.
Proceedings of the 42nd annual Southeast regional conference on - ACM-SE 42, 2004
Replacement policy, one of the key factors determining the effectiveness of a cache, becomes even more important with the latest technological trend toward highly associative caches. State-of-the-art processors employ various policies such as Random, Least Recently Used (LRU), Round-Robin, and PLRU (Pseudo LRU), indicating that there is no common wisdom about the best one. The optimal yet unattainable policy would replace the cache memory block whose next reference is farthest away in the future, among all memory blocks present in the set. In our quest for a replacement policy as close to optimal as possible, we thoroughly explored the design space of existing replacement mechanisms using the SimpleScalar toolset and the SPEC CPU2000 benchmark suite, across a wide range of cache sizes and organizations. In order to better understand the behavior of different policies, we introduced new measures, such as the cumulative distribution of cache hits in the LRU stack. We also dynamically monitored the number of cache misses per 100,000 instructions. Our results show that the PLRU techniques can approximate and even outperform LRU with much lower complexity, for a wide range of cache organizations. However, a relatively large gap between LRU and the optimal replacement policy, of up to 50%, indicates that new research aimed at closing the gap is necessary. The cumulative distribution of cache hits in the LRU stack indicates very good potential for way prediction using LRU information, since the percentage of hits to the bottom of the LRU stack is relatively high.
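As an aside, the tree-based PLRU mechanism compared in this study is simple enough to sketch. The snippet below is a generic, minimal illustration of tree-PLRU for an 8-way set, not the exact configuration evaluated in the paper; the struct and field names are illustrative.

#include <array>
#include <cstdint>

// Minimal tree-PLRU sketch for an 8-way set (illustrative only).
// One bit per internal node of a binary tree over the ways:
// 0 means the pseudo-LRU way lies in the left subtree, 1 in the right.
struct TreePLRUSet {
    std::array<uint8_t, 7> node{};   // node 0 is the root; children of i are 2i+1 and 2i+2

    // Follow the bits toward the pseudo-LRU way to pick a victim.
    int victim() const {
        int i = 0;
        while (i < 7) i = 2 * i + 1 + node[i];   // go left if bit==0, right if bit==1
        return i - 7;                            // leaves 7..14 map to ways 0..7
    }

    // On an access to 'way', flip the bits on the path so they point away from it.
    void touch(int way) {
        int i = way + 7;
        while (i > 0) {
            int parent = (i - 1) / 2;
            node[parent] = (i == 2 * parent + 1) ? 1 : 0;  // accessed left child -> LRU side is right
            i = parent;
        }
    }
};

With 7 bits per 8-way set instead of a full LRU stack, this is the lower-complexity approximation of LRU that the study finds competitive.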
2009 IEEE International Conference on Computer Design, 2009
The increasing speed gap between processor and memory and the limited memory bandwidth make last-level cache performance crucial for CMP architectures. Non-Uniform Cache Architectures (NUCA) have been introduced to deal with this problem. This memory organization divides the whole memory space into smaller pieces or banks, allowing nearer banks to have better access latencies than farther banks. Moreover, an adaptive replacement policy that efficiently reduces misses in the last-level cache could boost performance, particularly if set associativity is adopted. Unfortunately, traditional replacement policies do not behave properly, as they were designed for single processors. This paper focuses on bank replacement. This policy involves three key decisions when there is a miss: where to place a data block within the cache set, which data to evict from the cache set and, finally, where to place the evicted data. We propose a novel replacement technique that enables more intelligent replacement decisions to be taken. This technique is based on the observation that some types of data are less commonly accessed depending on which bank they reside in. We call this technique LRU-PEA (Least Recently Used with a Priority Eviction Approach). We show that the proposed technique significantly reduces the requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and into an Energy per Instruction (EPI) reduction of 5%.
IAEME, 2019
Modern multi-core systems allow concurrent execution of different applications on a single chip. Such multicores handle the large bandwidth requirement from the processing cores by employing multiple levels of caches, with one or two levels of private caches along with a shared last-level cache (LLC). In a shared LLC, when applications with varying access behavior compete with each other for space, conventional single-core cache replacement techniques can significantly degrade system performance. In such scenarios, we need an efficient replacement policy for reducing the off-chip memory traffic as well as contention for the memory bandwidth. This paper proposes a novel Application-aware Cache Replacement (ACR) policy for the shared LLC. The ACR policy considers the memory access behavior of the applications during victim selection to prevent a low-access-rate application from being victimized by a high-access-rate application. It dynamically tracks the maximum lifetime of cache lines in the shared LLC for each concurrent application and helps in efficient utilization of the cache space. Experimental evaluation of the ACR policy for 4-core systems, with a 16-way set-associative 4 MB LLC, using the SPEC CPU 2000 and 2006 benchmark suites shows a geometric mean speed-up of 8.7% over the least recently used (LRU) replacement policy. We show that the ACR policy performs better than the recently proposed thread-aware dynamic re-reference interval prediction (TA-DRRIP) and protecting distance based (PDP) techniques for various 2-core, 4-core and 8-core workloads.
International Journal for Research in Applied Science and Engineering Technology IJRASET, 2020
Modern processors have clock speeds in the range of GHz while the main memory (DRAM) has read/write speeds in the range of MHz, so the processor needs to halt until the memory completes its request. The halt period may seem very small, but on a broad scale we see that most of the processor's time is wasted in these halt cycles. Cache memory is intended to reduce the speed gap between the fast processor and the slow memory. When a program needs to access data from the RAM (physical memory), it first checks inside the cache (SRAM). Replacement policies are the methods by which memory blocks are replaced in a full cache. Cache replacement policies play a significant role in the memory organization of a processor and dictate how fast the processor will receive the block it demands. Various replacement policies such as RRIP, ABRRIP and AIRRIP have been designed but, unlike LRU, have not been widely implemented; LRU is predominantly used in most systems. The ABRRIP policy has two levels of RRPV: one at the block level and one at the core level. We observe that the performance of the ABRRIP policy improves as we increase the number of instructions during the simulation.
2017 First International Conference on Embedded & Distributed Systems (EDiS), 2017
The Last-Level Cache (LLC) is a critical component of the memory hierarchy which has a direct impact on performance. Whenever data requested by a processor core is not found in the cache, a transaction to the main memory is initiated, which results in both performance and energy penalties. Decreasing the LLC miss rate therefore lowers external memory transactions, which is beneficial both power- and performance-wise. The cache replacement policy has a direct impact on the miss rate. It is responsible for evicting cache lines whenever the cache runs full. Thus, a good policy should evict data that will be re-used in the distant future, and favour data that are likely to be accessed in the near future. The most common cache replacement policy is the Least-Recently Used (LRU) strategy. It has been used for years and is cheap in terms of hardware implementation. However, researchers have shown that LRU is not the most efficient policy from a performance point of view, and is furthermore largely sub-optimal compared to the best theoretical strategy. In this paper, we analyze a number of cache replacement policies that have been proposed over the last decade and carry out evaluations reporting performance and energy.
2015
Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models use stack distances to predict LRU or its variants, and do not capture these high-performance policies. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. This model uses absolute reuse distances instead of stack distances, which makes it applicable to arbitrary age-based replacement policies. We thoroughly validate our model on several high-performance policies on synthetic and real benchmarks, where its median error is less than 1%. Finally, we present two case studies showing how to use the model to improve shared and single-stream cache performance.
WDDD held in conjunction with ISCA, 2007
Cache replacement policy is a major design parameter of any memory hierarchy. The efficiency of the replacement policy affects both the hit rate and the access latency of a cache system. The higher the associativity of the cache, the more vital the replacement ...
2015
Multicore architectures bring a tremendous amount of processing speed. Processor speed is increasing at a very fast rate compared to the access latency of the main memory. In order to satisfy the need for ever-increasing processing speed, there have been significant changes in the design of modern computing systems. A multicore can run multiple instructions at the same time, increasing the overall speed of programs. It provides a few complete execution cores instead of one, each with an independent interface to the front-side bus. Normally, a level-1 cache (CL1) is dedicated to each core, while the level-2 cache (CL2) can be shared or distributed. To increase processor performance, the cache miss rate must be decreased. Various replacement policies have been designed to reduce the miss rate and improve cache performance. This paper presents a comparative study of various replacement policies with their merits and demerits, and among all those studied it is found that only a few algorithms are ...
Journal of Instruction-Level Parallelism
Classic cache replacement policies assume that miss costs are uniform. However, the correlation between miss rate and cache performance is not as straightforward as it used to be. Ultimately, the true performance cost of a miss should be its access penalty, i.e. the actual processing bandwidth lost because of the miss. Contrary to loads, the penalty of stores is mostly hidden in modern processors. To take advantage of this observation, we propose a simple scheme to replace load misses by store misses. We extend LRU (Least Recently Used) to reduce the aggregate miss penalty instead of the miss count. The new policy is called PS-LRU (Penalty-Sensitive LRU) and is deployed throughout most of this paper. PS-LRU systematically replaces first a block predicted to be accessed next with a store. This policy minimizes the number of load misses to the detriment of store misses. One key issue in this policy is to predict the next access type to a block, so that higher replacement priority is given to blocks that will be accessed next with a store. We introduce and evaluate various prediction schemes based on instructions and broadly inspired by branch predictors. To guide the design we run extensive trace-driven simulations on eight Spec95 benchmarks with a wide range of cache configurations and observe that PS-LRU yields positive load miss improvements over classic LRU across most of the benchmarks and cache configurations. In some cases the improvements are very large. Although the total number of load misses is minimized under our simple policy, the number of store misses and the amount of memory traffic both increase. Moreover, store misses are not totally "free". To evaluate this trade-off, we apply DCL and ACL (two previously proposed cost-sensitive LRU policies) to the problem of load/store miss penalty. These algorithms are more competitive than PS-LRU. Both DCL and ACL provide attractive trade-offs in which fewer load misses are saved than in PS-LRU, but the store miss traffic is reduced.
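A minimal sketch of the PS-LRU victim-selection idea described above follows. It assumes a hypothetical per-block flag (predictedStoreNext) filled in by some instruction-based predictor; the paper's actual predictor organization and names are not reproduced here.

#include <vector>

// Illustrative PS-LRU victim selection (hypothetical interface, not the paper's code).
// Among the blocks of a set, prefer to evict the least-recently-used block whose next
// access is predicted to be a store; fall back to plain LRU if no such block exists.
struct Block {
    int lruRank;              // 0 = MRU, larger = older
    bool predictedStoreNext;  // assumed to be set by an instruction-based predictor
};

int selectVictimPSLRU(const std::vector<Block>& set) {
    int victim = -1, fallback = -1;
    for (int i = 0; i < (int)set.size(); ++i) {
        // Track the globally oldest block as the LRU fallback.
        if (fallback < 0 || set[i].lruRank > set[fallback].lruRank) fallback = i;
        // Track the oldest block predicted to be accessed next by a store.
        if (set[i].predictedStoreNext &&
            (victim < 0 || set[i].lruRank > set[victim].lruRank)) victim = i;
    }
    return victim >= 0 ? victim : fallback;
}

Evicting store-next blocks first trades store misses (whose penalty is largely hidden) for fewer load misses, which is exactly the trade-off the abstract evaluates.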
Microprocessors and Microsystems, 2015
Cache memory plays a crucial role in determining the performance of processors, especially for embedded processors where area and power are tightly constrained. It is necessary to have effective management mechanisms, such as cache replacement policies, because modern embedded processors require not only efficient power consumption but also high performance. Practical cache replacement algorithms have focused on supporting the increasing data needs of processors. The commonly used Least Recently Used (LRU) replacement policy always predicts a near-immediate re-reference interval; hence, applications that exhibit a distant re-reference interval may perform poorly under the LRU replacement policy. In addition, recent studies have shown that the performance gap between LRU and the theoretical optimal replacement (OPT) is large for highly-associative caches. The LRU policy is also susceptible to memory-intensive workloads whose working set is greater than the available cache size. These reasons motivate the design of alternative replacement algorithms to improve cache performance. This paper explores a low-overhead, high-performance cache replacement policy for embedded processors that utilizes the mechanism of LRU replacement. Experiments indicate that the proposed policy can result in significant improvement of performance and miss rate for large, highly-associative last-level caches. The proposed policy is based on the tag-distance correlation among cache lines in a cache set. Rather than always replacing the LRU line, the victim is chosen by considering the LRU-behavior bit of the line combined with the correlation between the cache lines' tags of the set and the requested block's tag. By using the LRU-behavior bit, the LRU line is given a chance of residing longer in the set instead of being replaced immediately. Simulations with an out-of-order superscalar processor and memory-intensive benchmarks demonstrate that the proposed cache replacement algorithm can increase overall performance by 5.15% and reduce the miss rate by an average of 11.41%.
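The "second chance for the LRU line" idea can be sketched roughly as below. The tag-distance test (a Hamming-distance stand-in), the threshold parameter, and all names here are assumptions made for illustration; the paper's actual correlation measure is not reproduced.

#include <cstdint>
#include <vector>

// Illustrative sketch (assumed details). If the LRU line's tag looks correlated with the
// requested tag and its behavior bit is still clear, keep it once more and evict the
// next-oldest line instead; otherwise fall back to normal LRU.
struct Line {
    uint32_t tag;
    int lruRank;        // larger = older
    bool behaviorBit;   // set once the line has used its extra residency
};

int selectVictim(std::vector<Line>& set, uint32_t requestedTag, int threshold) {
    // Locate the LRU and the second-LRU lines.
    int lru = 0, second = -1;
    for (int i = 1; i < (int)set.size(); ++i)
        if (set[i].lruRank > set[lru].lruRank) lru = i;
    for (int i = 0; i < (int)set.size(); ++i)
        if (i != lru && (second < 0 || set[i].lruRank > set[second].lruRank)) second = i;

    int tagDistance = __builtin_popcount(set[lru].tag ^ requestedTag);  // stand-in metric
    if (!set[lru].behaviorBit && tagDistance <= threshold && second >= 0) {
        set[lru].behaviorBit = true;   // give the LRU line one extra residency
        return second;
    }
    return lru;
}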
ACM SIGARCH Computer Architecture News, 2010
Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working-set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP) that is scan-resistant and Dynamic RRIP (DRRIP) that is both scan-resistant and thrash-resistant. Both RRIP policies require only 2-bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that both SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10% respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9% respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
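The SRRIP mechanism with 2-bit re-reference prediction values (RRPVs) is compact enough to sketch. The snippet below follows the standard SRRIP description (insert with a "long" prediction, promote to "near-immediate" on a hit, evict a block predicted "distant", aging when none exists); it is an illustration of the technique, not the authors' implementation, and the class and method names are made up.

#include <cstdint>
#include <vector>

// Minimal SRRIP sketch with 2-bit RRPVs (M = 2).
// RRPV 0 = re-reference predicted near-immediate, RRPV 3 = predicted distant.
constexpr uint8_t RRPV_MAX = 3;

struct SRRIPSet {
    std::vector<uint8_t> rrpv;
    explicit SRRIPSet(int ways) : rrpv(ways, RRPV_MAX) {}

    void onHit(int way) { rrpv[way] = 0; }               // promote on hit

    int selectVictim() {
        for (;;) {
            for (int w = 0; w < (int)rrpv.size(); ++w)
                if (rrpv[w] == RRPV_MAX) return w;        // evict a "distant" block
            for (auto& r : rrpv) ++r;                     // age everyone and retry
        }
    }

    void onFill(int way) { rrpv[way] = RRPV_MAX - 1; }    // insert with a "long" prediction
};

Inserting new blocks with a long (rather than near-immediate) prediction is what makes the policy scan-resistant: streaming data ages out before it can displace the reused working set. DRRIP adds set dueling on top of this to also gain thrash resistance.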
2007 25th International Conference on Computer Design, 2007
Several cache management techniques have been proposed that indirectly try to base their decisions on cache-line reuse distance, like Cache Decay, which is a postdiction of reuse distances: if a cache line has not been accessed for some "decay interval" we know that its reuse distance is at least as large as this decay interval. In this work, we propose to directly predict reuse distances via instruction-based (PC) prediction and use this information for cache-level optimizations. In this paper, we choose as our target for optimization the replacement policy of the L2 cache, because the gap between LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches. This indicates that, in many situations, there is ample room for improvement. We evaluate our reuse-distance based replacement policy using a subset of the most memory-intensive SPEC2000 benchmarks, and our results show significant benefits across the board.
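A rough sketch of PC-indexed reuse-distance prediction driving victim selection is shown below. The table organization, the smoothing update, and the "evict the line whose predicted remaining reuse distance is largest" rule are assumptions made for illustration, not the paper's exact design.

#include <array>
#include <cstdint>
#include <vector>

// Illustrative PC-indexed reuse-distance predictor (assumed organization).
// Each entry keeps a running estimate of the reuse distance of blocks last touched by
// that PC; replacement then evicts the line expected to be reused farthest in the future.
struct ReusePredictor {
    std::array<uint32_t, 1024> table{};            // predicted reuse distance per PC hash

    static size_t index(uint64_t pc) { return (pc >> 2) & 1023; }

    void train(uint64_t pc, uint32_t observedDistance) {
        uint32_t& e = table[index(pc)];
        e = (3 * e + observedDistance) / 4;        // simple exponential smoothing
    }
    uint32_t predict(uint64_t pc) const { return table[index(pc)]; }
};

struct LineInfo {
    uint64_t lastPC;     // PC of the last access to this line
    uint32_t age;        // accesses to the set since that access
};

// Evict the line whose predicted remaining reuse distance is largest.
int selectVictim(const std::vector<LineInfo>& set, const ReusePredictor& rp) {
    int victim = 0;
    int64_t worst = -1;
    for (int i = 0; i < (int)set.size(); ++i) {
        int64_t remaining = (int64_t)rp.predict(set[i].lastPC) - set[i].age;
        if (remaining > worst) { worst = remaining; victim = i; }
    }
    return victim;
}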
American Journal of Embedded Systems and Applications, 2016
Cache is an important component in computer architecture and has great effects on system performance. Nowadays, the Least Recently Used (LRU) algorithm is one of the most commonly used ones because it is easy to implement and has relatively good performance. However, in some particular cases, LRU is not a good choice. To provide references for computer architecture designers, this study proposes a new algorithm named Fully Replacement Policy (FRP), analyzes various factors affecting cache performance, and carries out simulation experiments on cache performance based on the SimpleScalar toolset and the SPEC2000 benchmark suite. The study compares the effects of the Fully Replacement Policy with the Least Recently Used (LRU) algorithm when set size, block size, associativity and replacement methods are changed separately. By experimentally analyzing the results, it was found that FRP outperforms LRU in some particular situations.
2010 3rd International Conference on Computer Science and Information Technology, 2010
An Adaptive Subset Based Replacement Policy (ASRP) is proposed in this paper. In the ASRP policy, each set in the Last-Level Cache (LLC) is divided into multiple subsets; one subset is active and the others are inactive at any given time. The victim block for a miss is chosen only from the active subset, using the LRU policy. A counter in each set records the cache misses in the set during a period of time. When the value of the counter is greater than a threshold, the active subset is changed to the neighboring one. To adapt to changes in program behavior, set dueling is used to choose among different thresholds according to which one causes the fewest misses in the sampling sets. Using the framework given for this competition, our ASRP policy achieves a geometric-average improvement over LRU of 4.5% for 28 SPEC CPU 2006 programs, and some programs gain improvements of up to 50%.
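The subset-rotation mechanism described above can be sketched as follows. Subset size and threshold values are illustrative parameters, and the set-dueling machinery that picks the threshold is omitted; this is a sketch of the described idea, not the authors' code.

#include <vector>

// Illustrative sketch of subset-based replacement (details assumed). Victims come only
// from the currently active subset (LRU within it); a per-set miss counter rotates the
// active subset to its neighbour once it exceeds a threshold.
struct ASRPSet {
    int ways, subsetSize, activeSubset = 0, missCounter = 0, threshold;
    std::vector<int> lruRank;   // larger = older

    ASRPSet(int ways_, int subsetSize_, int threshold_)
        : ways(ways_), subsetSize(subsetSize_), threshold(threshold_), lruRank(ways_, 0) {}

    int selectVictim() {
        // Count the miss and rotate the active subset when the threshold is exceeded.
        if (++missCounter > threshold) {
            missCounter = 0;
            activeSubset = (activeSubset + 1) % (ways / subsetSize);
        }
        // LRU within the active subset only.
        int begin = activeSubset * subsetSize, victim = begin;
        for (int w = begin; w < begin + subsetSize; ++w)
            if (lruRank[w] > lruRank[victim]) victim = w;
        return victim;
    }
};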
2014
The cache replacement policy is one of the main deciding factors of system performance and efficiency in chip multi-core processors. It is imperative for any level of cache memory in a multi-core architecture to have a well-defined, dynamic replacement algorithm in place to ensure consistently superlative performance. Many existing cache replacement policies, such as the Least Recently Used (LRU) policy, have proved to work well in the shared Level 2 (L2) cache for most of the data set patterns generated by single-threaded applications. But when it comes to parallel multi-threaded applications that generate differing patterns of workload at different intervals, the conventional replacement policies can prove sub-optimal, as they generally do not exploit spatial and temporal locality and do not adapt dynamically to changes in the workload. This dissertation proposes three novel cache replacement policies for the L2 cache that are targeted towards such appl...
Rcc, 2003
In this work, we propose a cache memory system based on an adaptive replacement scheme, which would form part of the Virtual Memory Manager of an Operating System. We use a discrete-event simulator to compare our approach with previous work. Our adaptive replacement scheme is based on several properties of the system and of the applications, in order to estimate/choose the best replacement policy. We assign a replacement priority value to each block of the cache memory, according to the set of system and application properties, to select which block to evict. The objective is to provide effective use of the cache memory and good performance for the applications.
2011 24th Internatioal Conference on VLSI Design, 2011
Parameter variations in deep sub-micron integrated circuits cause chip characteristics to deviate during the semiconductor fabrication process. These variations are dominant in memory systems such as caches, and the delay spread due to process variation significantly impacts the performance of a cache-based system. In this paper, we propose two schemes to reduce the performance impact of variations in caches: i) a Latency-Aware Least Recently Used (LA-LRU) replacement policy which ensures that cache blocks that are affected by process variation are accessed less frequently, and ii) a Block Rearrangement scheme that distributes cache blocks with high latencies uniformly across all sets. We implemented our schemes on the Wattch SimpleScalar toolset for XScale, PowerPC and Alpha 21264-like processor configurations. Our experiments on SPEC 2000 benchmarks show that our scheme improves the average memory access time of caches by 11% to 22%, almost eliminating any performance degradation due to variations. We also synthesized the LA-LRU logic and found that this benefit can be obtained with a negligible increase in the power consumption of the cache.
2010
We consider the problem of on-chip L2 cache management and replacement policies. We propose a new adaptive cache replacement policy, called Dueling CLOCK (DC), that has several advantages over the Least Recently Used (LRU) cache replacement policy. LRU's strength is that it keeps track of the 'recency' information of memory accesses.
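The CLOCK building block behind this policy can be sketched as below. The dueling mechanism that adaptively selects between variants is omitted, and this is a generic single-set CLOCK illustration rather than the authors' exact design.

#include <vector>

// Generic CLOCK replacement for one set (illustrative; the paper's dueling logic is omitted).
// Each way has a reference bit set on a hit; the hand sweeps, clearing bits until it finds a
// way whose bit is already clear, which becomes the victim. This approximates LRU recency
// with one bit per line and no stack updates on hits.
struct ClockSet {
    std::vector<bool> ref;
    int hand = 0;
    explicit ClockSet(int ways) : ref(ways, false) {}

    void onHit(int way) { ref[way] = true; }

    int selectVictim() {
        for (;;) {
            if (!ref[hand]) {                       // found an un-referenced way: evict it
                int victim = hand;
                hand = (hand + 1) % (int)ref.size();
                return victim;
            }
            ref[hand] = false;                      // give a second chance and move on
            hand = (hand + 1) % (int)ref.size();
        }
    }
};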
The rapid development of multi-core systems and the increase in data-intensive applications in recent years call for larger main memories. Traditional DRAM can increase its capacity by reducing the feature size of the storage cell, but further scaling of DRAM now faces great challenges, and the frequent refresh operations of DRAM consume a lot of energy. As an emerging technology, Phase Change Memory (PCM) is a promising candidate for main memory. It draws wide attention due to its advantages of low power consumption, high density and non-volatility, while it suffers from finite endurance and relatively long write latency. To mitigate the write problem, optimizing the cache replacement policy to protect dirty cache blocks is an efficient approach. In this paper, we construct a systematic multilevel structure and, based on it, propose a novel cache replacement policy called MAC. MAC effectively reduces write traffic to PCM memory with low hardware overhead. We conduct simulation experiments on GEM5 to evaluate the performance of MAC and other related works. The results show that MAC performs best in reducing the amount of writes (by 25.12% on average) without increasing program execution time.
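The general "protect dirty blocks" idea that motivates MAC can be illustrated with a clean-first victim selection, sketched below. This is a simplification made for illustration; MAC's actual multilevel structure is not reproduced, and the names are made up.

#include <vector>

// Illustrative clean-first victim selection. Evicting a clean block avoids a write-back
// to PCM, so among the candidates we prefer the oldest clean block and fall back to the
// oldest dirty one only when the whole set is dirty.
struct CacheLine {
    bool dirty;
    int  lruRank;   // larger = older
};

int selectVictimCleanFirst(const std::vector<CacheLine>& set) {
    int cleanVictim = -1, anyVictim = 0;
    for (int i = 0; i < (int)set.size(); ++i) {
        if (set[i].lruRank > set[anyVictim].lruRank) anyVictim = i;
        if (!set[i].dirty &&
            (cleanVictim < 0 || set[i].lruRank > set[cleanVictim].lruRank)) cleanVictim = i;
    }
    return cleanVictim >= 0 ? cleanVictim : anyVictim;   // no write-back if a clean block exists
}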