2002, … of Conference on Parallel and Distributed …
The paper presents a strategy to reduce cache misses in programs by leveraging programmer-driven optimizations, complementing existing hardware and compiler techniques. It highlights the growing disparity between processor and memory speeds, emphasizing the need for improved data locality to alleviate performance degradation caused by cache misses, particularly capacity misses. By employing a visualization tool based on reuse distance metrics, programmers can identify and rectify locality issues, leading to significant performance improvements, as demonstrated through a case study on the MCF program.
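As a rough illustration of the reuse-distance metric such visualization tools are built on (a hedged sketch, not the paper's tool; the trace and sizes are hypothetical), the following C program computes LRU stack distances, i.e., the number of distinct addresses touched between two uses of the same address:

```c
#include <stdio.h>

/* Minimal reuse-distance (LRU stack distance) sketch.
 * For each access, the distance is the number of distinct addresses
 * touched since the previous access to the same address.
 * A simple O(n^2) recency list is enough for an illustration. */

#define MAX_DISTINCT 1024

static long stack[MAX_DISTINCT];      /* most recently used address first */
static int depth = 0;

/* Returns the reuse distance of addr, or -1 on first use,
 * and moves addr to the top of the recency list. */
static int reuse_distance(long addr)
{
    for (int i = 0; i < depth; i++) {
        if (stack[i] == addr) {
            for (int j = i; j > 0; j--)   /* promote to the top */
                stack[j] = stack[j - 1];
            stack[0] = addr;
            return i;                      /* distinct addresses in between */
        }
    }
    if (depth < MAX_DISTINCT) {            /* first reference to addr */
        for (int j = depth; j > 0; j--)
            stack[j] = stack[j - 1];
        stack[0] = addr;
        depth++;
    }
    return -1;
}

int main(void)
{
    long trace[] = { 1, 2, 3, 1, 2, 4, 1 };  /* hypothetical access trace */
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("addr %ld -> distance %d\n", trace[i], reuse_distance(trace[i]));
    return 0;
}
```

A long reuse distance relative to the cache size is exactly the signature of the capacity misses the paper's tool helps the programmer locate.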
… , 2001. Proceedings. Fifth …, 2002
ACM Transactions on Programming Languages and Systems, 2004
The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed.
2003 Design, Automation and Test in Europe Conference and Exhibition, 2003
The widening gap between processor and memory speeds renders data locality optimization a very important issue in data-intensive embedded applications. Over the years, hardware designers and compiler writers have focused on optimizing data cache locality using intelligent cache management mechanisms and program-level transformations, respectively. Until now, there has not been significant research investigating the interaction between these optimizations. In this work, we investigate this interaction and propose a selective hardware/compiler strategy to optimize cache locality for integer, numerical (array-intensive), and mixed codes. In our framework, the role of the compiler is to identify program regions that can be optimized at compile time using loop and data transformations and to mark (at compile time) the unoptimizable regions with special instructions that activate/deactivate a hardware optimization mechanism selectively at run time. Our results show that our technique can improve program performance by as much as 60% with respect to the base configuration and 17% with respect to a non-selective hardware/compiler approach.
Proceedings of the 1999 ACM/IEEE conference on Supercomputing, 1999
Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformations are unaffected because they improve reuse for all levels of the cache; however, some optimizations can be enhanced. Inter-variable padding can take advantage of modular arithmetic to eliminate conflict misses and preserve group reuse on multiple cache levels. Loop fusion can balance increasing group reuse for the L2 (secondary) cache at the expense of losing group reuse at the smaller L1 cache. Tiling for the L1 cache also exploits locality available in the L2 cache. Experiments show enhanced algorithms are able to reduce cache misses, but performance improvements are rarely significant. Our results indicate existing compiler optimizations are usually sufficient to achieve good performance for multi-level caches.
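A minimal sketch of the inter-variable padding idea mentioned above, assuming the arrays are laid out contiguously in declaration order and a direct-mapped 8 KB cache with 32-byte lines (all sizes are illustrative, not taken from the paper):

```c
/* Sketch of inter-variable padding.  Assumption: the linker places the
 * arrays contiguously in declaration order.  Without the pad, a[] and b[]
 * start a multiple of the 8 KB cache size apart, so every a[i]/b[i] pair
 * maps to the same set of a direct-mapped cache; a one-line pad breaks
 * that alignment and eliminates the ping-pong conflict misses. */

#define N 2048                      /* 2048 doubles = 16 KB per array */

double a[N];
double pad[4];                      /* 32-byte pad: shifts b by one cache line */
double b[N];

double dot(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)     /* a[i] and b[i] now map to different sets */
        s += a[i] * b[i];
    return s;
}
```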
IEEE Transactions on Computers, 1998
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to minimize cache interference by improving the layout of the basic blocks of the code. However, the performance impact of this technique has been reported for application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. It is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes, in detail, the locality patterns of operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. Based on our observations, we propose an algorithm to expose these localities and reduce interference in the cache. For a range of cache sizes, associativities, line sizes, and organizations, we show that we reduce total instruction miss rates by 31-86 percent, or up to 2.9 absolute points. Using a simple model, this corresponds to execution time reductions on the order of 10-25 percent. In addition, our optimized operating system combines well with optimized and unoptimized applications.
IEEE Transactions on Computers, 2000
Improving cache performance requires understanding cache behavior. However, measuring cache performance for one or two data input sets provides little insight into how cache behavior varies across all data input sets and all cache configurations. This paper uses locality analysis to generate a parameterized model of program cache behavior. Given a cache size and associativity, this model predicts the miss rate for arbitrary data input set sizes. This model also identifies critical data input sizes where cache behavior exhibits marked changes. Experiments show this technique predicts hit rates to within 2 percent for set-associative caches on a set of floating-point and integer programs using array- and pointer-based data structures. Building on the new model, this paper presents an interactive visualization tool that uses a three-dimensional plot to show miss-rate changes across program data sizes and cache sizes, and demonstrates its use in evaluating compiler transformations. Other uses of this visualization tool include assisting machine and benchmark-set design. The tool can be accessed on the Web at http://www.cs.rochester.edu/research/locality.
In this paper we propose an instruction to accelerate software caches. While DMAs are very efficient for predictable data sets that can be fetched before they are needed, they introduce a large latency overhead for computations with unpredictable access behavior. Software caches are advantageous when the data set is not predictable but exhibits locality. However, software caches also incur a large overhead. Because the main overhead is in the access function, we propose an instruction that replaces the look-up function of the software cache. This instruction is evaluated using the Multidimensional Software Cache and two multimedia kernels, GLCM and H.264 Motion Compensation. The results show that the proposed instruction accelerates the software cache access time by a factor of 2.6. This improvement translates to a 2.1x speedup for GLCM and a 1.28x speedup for MC, when compared with the IBM software cache.
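For context, the kind of look-up routine whose overhead dominates a software cache can be sketched as follows; this is a simplified direct-mapped design with hypothetical names and sizes, not the Multidimensional Software Cache or the IBM software cache:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

/* Sketch of the hot look-up routine of a direct-mapped software cache.
 * Names and sizes are illustrative; dma_fetch() is a stub standing in
 * for a real DMA transfer from off-chip memory. */

#define SC_LINES     64
#define SC_LINE_SIZE 128

struct sc_line {
    uint64_t tag;                        /* global address of the cached line */
    int      valid;
    uint8_t  data[SC_LINE_SIZE];
};

static struct sc_line sc[SC_LINES];
static uint8_t global_mem[1 << 16];      /* stand-in for off-chip memory */

static void dma_fetch(void *dst, uint64_t addr, size_t n)
{
    memcpy(dst, &global_mem[addr], n);   /* stub for a real DMA transfer */
}

void *sc_lookup(uint64_t addr)
{
    uint64_t line = addr & ~(uint64_t)(SC_LINE_SIZE - 1);
    unsigned idx  = (unsigned)((line / SC_LINE_SIZE) % SC_LINES);
    struct sc_line *l = &sc[idx];

    if (!l->valid || l->tag != line) {   /* miss: refill the software line */
        dma_fetch(l->data, line, SC_LINE_SIZE);
        l->tag   = line;
        l->valid = 1;
    }
    return &l->data[addr - line];        /* hit path: tag check + pointer math */
}

int main(void)
{
    for (unsigned i = 0; i < sizeof global_mem; i++)
        global_mem[i] = (uint8_t)i;
    printf("byte at 0x1234 = %u\n", *(uint8_t *)sc_lookup(0x1234));
    return 0;
}
```

The tag compare and index computation on the hit path are exactly the work a dedicated look-up instruction could collapse into a single operation.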
In embedded systems, cost, power consumption, and die size requirements push the designer to use small and simple cache memories. Such caches can deliver poor performance because of limited memory capacity and an inflexible placement policy. One way to increase performance is to adapt the program layout to the cache structure. This strategy requires solving an NP-complete problem and a very long processing time. We propose a strategy to find a near-optimum program layout within a reasonable time by means of smart heuristics. This solution does not add code and uses standard linker functionality to produce the new layout. Our approach is able to reduce misses by up to 70% in the case of a 2-Kbyte direct-mapped cache.
1999
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction and data cache performance for virtually indexed caches by mapping code and data with temporal locality to different cache blocks. In this ...
Cache hierarchies have long been utilized to minimize the latency of main memory accesses by caching frequently used data closer to the processor. Significant research has been done to identify the most crucial metrics of cache performance. Though the majority of research focuses on measuring cache hit rates and data movement as the major cache performance metrics, cache utilization can be equally important. In this work, we present cache utilization performance metrics that provide insight into application behavior. We define cache utilization in two forms: 1) the fraction of data bytes in a cache line that are actually accessed at least once before eviction from the cache, and 2) the access frequency of data bytes in a cache line. We discuss the relationship between the utilization measurement and two important application properties: 1) spatial locality, the use of data located near data that has already been accessed, and 2) temporal locality, the reuse of data over time. In addition to measuring cache line utilization, we present conventional performance metrics as well to provide a holistic understanding of cache behavior. To facilitate this work, we build a memory simulator incorporated into the Structural Simulation Toolkit (SST). We measure and analyze the performance of several scientific mini-applications from the Mantevo suite [1]. This work shows that caches are not necessarily the best on-chip solution for all types of applications due to the fixed cache line size.
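The first utilization metric, the fraction of bytes in a line touched before eviction, can be tracked with bookkeeping along these lines (a toy sketch with illustrative sizes and names, not the SST-based simulator used in the paper):

```c
#include <stdio.h>
#include <stdint.h>

/* Toy bookkeeping for cache-line byte utilization: on every access the
 * touched bytes are marked, and on eviction the fraction of the line
 * that was actually used is reported. */

#define LINE_SIZE 64

struct line_stats {
    uint8_t  touched[LINE_SIZE];     /* 1 if the byte was accessed since fill */
    uint64_t lines_evicted;
    uint64_t bytes_touched;
};

static void on_access(struct line_stats *s, unsigned offset, unsigned nbytes)
{
    for (unsigned i = offset; i < offset + nbytes && i < LINE_SIZE; i++)
        s->touched[i] = 1;
}

static void on_evict(struct line_stats *s)
{
    unsigned used = 0;
    for (unsigned i = 0; i < LINE_SIZE; i++) {
        used += s->touched[i];
        s->touched[i] = 0;           /* reset for the next occupant */
    }
    s->bytes_touched += used;
    s->lines_evicted++;
    printf("line utilization: %.1f%% (%u of %d bytes)\n",
           100.0 * used / LINE_SIZE, used, LINE_SIZE);
}

int main(void)
{
    struct line_stats s = { { 0 }, 0, 0 };
    on_access(&s, 0, 8);             /* e.g. one double at offset 0 */
    on_access(&s, 32, 8);            /* and one at offset 32 */
    on_evict(&s);                    /* reports 25% utilization */
    return 0;
}
```

Low utilization by this measure is a direct sign of poor spatial locality relative to the fixed line size.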
SIGPLAN Notices, 1994
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs.
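A minimal example of the kind of loop organization such a cost model selects: permuting a loop nest so the inner loop walks memory with stride 1 converts poor spatial locality into full use of each cache line (the nest below is illustrative, not taken from the paper):

```c
#define N 1024
static double x[N][N];

/* Before: the inner loop walks a column of a row-major C array, so
 * consecutive iterations touch different cache lines (stride N*8 bytes). */
void init_column_major(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 1.0;
}

/* After loop permutation: the inner loop walks a row with stride 1,
 * so every byte of each fetched cache line is used (spatial reuse). */
void init_row_major(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 1.0;
}

int main(void)
{
    init_column_major();   /* poor-locality version */
    init_row_major();      /* permuted, cache-friendly version */
    return 0;
}
```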
IEEE Transactions on Computers, 1999
Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. Our approach is very effective at reducing cache misses and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can accommodate this case and demonstrate how it can be used to optimize multiple loop nests. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.
Most if not all contemporary processors use caches to hide memory latency. In order to maintain a high clock speed, chip designers often resort to L1 caches that have no associativity, i.e., direct-mapped caches. Since most processors in the past were designed to run a variety of applications, these caches were also designed to perform well on a variety of applications. Currently, however, many processors are embedded into devices that perform a dedicated task. As a result, application-specific optimizations have become much more interesting than in the past. Memory addresses are mapped to cache lines by so-called indexing or placement functions. In traditional caches, this mapping is done by selecting a series of least-significant bits from the memory address. In this work, we investigate the possibilities of adapting cache placement functions to specific application behavior. More specifically, we aim to improve the performance of direct-mapped caches by using improved bit-selection placement functions. We evaluate several offline techniques to find these improved placement functions.
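As a rough sketch of what a placement function looks like, the traditional index simply selects low-order address bits, while an application-tuned variant might, for instance, XOR in higher-order bits; the XOR scheme and the sizes below are illustrative assumptions, not the functions evaluated in the paper:

```c
#include <stdint.h>
#include <stdio.h>

/* Direct-mapped cache with 256 lines of 64 bytes (hypothetical sizes). */
#define LINE_BITS  6
#define INDEX_BITS 8
#define NUM_LINES  (1u << INDEX_BITS)

/* Traditional bit-selection placement: index = address bits [13:6]. */
static unsigned index_bitselect(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) & (NUM_LINES - 1));
}

/* One possible alternative placement: XOR-fold higher address bits into
 * the index to spread otherwise-conflicting addresses over the cache. */
static unsigned index_xor(uint64_t addr)
{
    uint64_t low  = (addr >> LINE_BITS) & (NUM_LINES - 1);
    uint64_t high = (addr >> (LINE_BITS + INDEX_BITS)) & (NUM_LINES - 1);
    return (unsigned)(low ^ high);
}

int main(void)
{
    /* Two addresses exactly 16 KB apart conflict under bit selection
     * but map to different lines under the XOR placement. */
    uint64_t a = 0x10000, b = 0x14000;
    printf("bitselect: %u vs %u\n", index_bitselect(a), index_bitselect(b));
    printf("xor:       %u vs %u\n", index_xor(a), index_xor(b));
    return 0;
}
```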
Small and fast cache memories are designed to bridge the growing discrepancy between processor and memory speeds. However, they are only effective when programs exhibit sufficient data locality.
1997
In recent years, loop tiling has become an increasingly popular technique for increasing cache effectiveness. This is accomplished by transforming a loop nest so that the temporal and spatial locality can be better exploited for a given cache size. However, this optimization only targets the reduction of capacity misses. As recently demonstrated by several groups of researchers, conflict misses can still preclude effective cache utilization. Moreover, the severity of cache conflicts can vary greatly with slight variations in problem size and starting addresses, making performance difficult to even predict, let alone optimize. To reduce conflict misses, data copying has been proposed. With this technique, data layout in cache is adjusted by copying array tiles to temporary arrays that exhibit better cache behavior. Although copying has been proposed as the panacea to the problem of cache conflicts, this solution experiences a cost proportional to the amount of data being copied. To date, there has been no discussion regarding either this tradeoff or the problem of determining what and when to copy. In this paper, we present a compile-time technique for making this determination, and present a selective copying strategy based on this methodology. Preliminary experimental results demonstrate that, because of the sensitivity of cache conflicts to small changes in problem size and base addresses, selective copying can lead to better overall performance than either no copying, complete copying, or manually applied heuristics.
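A hedged sketch of tiling combined with data copying (not the paper's selective-copying algorithm; sizes are illustrative): each tile is first copied into a small contiguous buffer, so its cache footprint is conflict-free regardless of the array's leading dimension, at a copy cost proportional to the tile size:

```c
#include <string.h>

#define N 1024
#define T 32                              /* tile size, illustrative */

static double a[N][N], b[N][N];

/* Transpose b into a, one T x T tile at a time.  Each tile of b is first
 * copied into a small contiguous buffer, so the tile occupies distinct
 * cache lines no matter how the columns of b map into the cache. */
void transpose_tiled_copy(void)
{
    static double tile[T][T];

    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T) {
            /* copy phase: cost proportional to the tile size */
            for (int i = 0; i < T; i++)
                memcpy(tile[i], &b[ii + i][jj], T * sizeof(double));
            /* compute phase: all reuse comes from the copied tile */
            for (int j = 0; j < T; j++)
                for (int i = 0; i < T; i++)
                    a[jj + j][ii + i] = tile[i][j];
        }
}

int main(void)
{
    b[3][5] = 42.0;
    transpose_tiled_copy();
    return a[5][3] == 42.0 ? 0 : 1;       /* sanity check of the transpose */
}
```

Whether the copy phase pays off depends on how much reuse the compute phase extracts from the tile, which is precisely the tradeoff the paper's selective strategy weighs at compile time.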
IEEE Micro, 2000
Journal of Parallel and Distributed Computing, 1988
In this paper we describe a method for using data dependence analysis to estimate cache and local memory demand in highly iterative scientific codes. The estimates take the form of a family of "reference" windows for each variable that reflects the current set of elements that should be kept in cache. It is shown that, in important special cases, we can estimate the size of the window and predict a lower bound on the number of cache hits. If the machine has local memory or cache that can be managed by the compiler, these estimates can be used to guide the management of this resource. It is also shown that these estimates can be used to guide program transformations in an attempt to optimize cache performance.
2007 25th International Conference on Computer Design, 2007
Several cache management techniques have been proposed that indirectly try to base their decisions on cache-line reuse distance, like Cache Decay, which is a postdiction of reuse distances: if a cache line has not been accessed for some "decay interval," we know that its reuse distance is at least as large as this decay interval. In this work, we propose to directly predict reuse distances via instruction-based (PC) prediction and use this information for cache-level optimizations. In this paper, we choose as our target for optimization the replacement policy of the L2 cache, because the gap between LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches. This indicates that, in many situations, there is ample room for improvement. We evaluate our reuse-distance-based replacement policy using a subset of the most memory-intensive SPEC2000 benchmarks, and our results show significant benefits across the board.
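A minimal sketch of the decay-interval bookkeeping described above (field names and sizes are hypothetical): each line carries a small idle counter that is cleared on access and advanced on every decay tick; once it saturates, the line's reuse distance is known to be at least one decay interval:

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of the Cache Decay "postdiction": a per-line counter is cleared
 * on every access and advanced on every decay tick.  Once it saturates,
 * the line has not been touched for at least one decay interval, so its
 * reuse distance is at least that large. */

#define NUM_LINES   8
#define DECAY_TICKS 4

struct line {
    uint8_t valid;
    uint8_t idle_ticks;                  /* ticks since the last access */
};

static struct line cache[NUM_LINES];

static void on_access(unsigned idx)
{
    cache[idx].valid = 1;
    cache[idx].idle_ticks = 0;
}

static void on_decay_tick(void)          /* invoked once per decay period */
{
    for (unsigned i = 0; i < NUM_LINES; i++)
        if (cache[i].valid && cache[i].idle_ticks < DECAY_TICKS)
            cache[i].idle_ticks++;
}

static int is_decayed(unsigned idx)      /* reuse distance >= decay interval */
{
    return cache[idx].valid && cache[idx].idle_ticks >= DECAY_TICKS;
}

int main(void)
{
    on_access(0);
    on_access(1);
    for (int t = 0; t < DECAY_TICKS; t++) {
        on_access(1);                    /* line 1 stays live, line 0 decays */
        on_decay_tick();
    }
    printf("line 0 decayed: %d, line 1 decayed: %d\n",
           is_decayed(0), is_decayed(1));
    return 0;
}
```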
2009 International Symposium on Systems, Architectures, Modeling, and Simulation, 2009
The effect of caching is fully determined by program locality, or data reuse, and several cache management techniques try to base their decisions on the prediction of temporal locality in programs. However, prior work reports only rough techniques which either try to predict when a cache block loses its temporal locality or try to categorize cache items as highly or poorly temporal. In this work, we quantify the temporal characteristics of a cache block at run time by predicting the cache block's reuse distance (measured in intervening cache accesses), based on the access patterns of the instructions (PCs) that touch the cache blocks. We show that an instruction-based reuse-distance predictor is very accurate and allows approximation of optimal replacement decisions, since we can "see" the future. We experimentally evaluate our prediction scheme for various L2 cache sizes using a subset of the most memory-intensive SPEC2000 benchmarks. Our proposal obtains a significant improvement in terms of IPC over traditional LRU of up to 130.6% (17.2% on average), and it also outperforms the previous state-of-the-art proposal (namely the Dynamic Insertion Policy, or DIP) by up to 80.7% (15.8% on average).
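A rough sketch of an instruction-based (PC-indexed) reuse-distance predictor in the spirit of this work, with a last-value prediction policy and table sizes that are purely illustrative:

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of a PC-indexed reuse-distance predictor: a small table remembers,
 * per memory instruction, the last observed reuse distance (in intervening
 * cache accesses) and predicts that same distance for the instruction's
 * next access.  A replacement policy can then victimize the block predicted
 * to be reused furthest in the future. */

#define PRED_ENTRIES 1024

struct pred_entry {
    uint64_t pc;
    uint32_t last_distance;
    uint8_t  valid;
};

static struct pred_entry table[PRED_ENTRIES];

static unsigned hash_pc(uint64_t pc)
{
    return (unsigned)((pc >> 2) % PRED_ENTRIES);
}

/* Train when a block's actual reuse distance becomes known (on reuse). */
void pred_train(uint64_t pc, uint32_t observed_distance)
{
    struct pred_entry *e = &table[hash_pc(pc)];
    e->pc = pc;
    e->last_distance = observed_distance;
    e->valid = 1;
}

/* Predict the reuse distance of a block fetched by this instruction;
 * returns 0 when no prediction is available. */
uint32_t pred_lookup(uint64_t pc)
{
    struct pred_entry *e = &table[hash_pc(pc)];
    return (e->valid && e->pc == pc) ? e->last_distance : 0;
}

int main(void)
{
    pred_train(0x400a10, 37);        /* hypothetical load PC and distance */
    printf("predicted reuse distance: %u\n", pred_lookup(0x400a10));
    return 0;
}
```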
Proceedings of International Conference on Microelectronic Systems Education, 1997
Cache memory design in embedded systems can take advantage of an analysis of the software that runs on that system, which usually remains the same for its whole life. Programs can be characterized, with respect to the memory hierarchy, using locality analysis. We propose ...