1999
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction and data cache performance for virtually indexed caches by mapping code and data with temporal locality to different cache blocks.
2003 Design, Automation and Test in Europe Conference and Exhibition, 2003
The widening gap between processor and memory speeds renders data locality optimization a very important issue in data-intensive embedded applications. Over the years, hardware designers and compiler writers have focused on optimizing data cache locality using intelligent cache management mechanisms and program-level transformations, respectively. Until now, there has not been significant research investigating the interaction between these optimizations. In this work, we investigate this interaction and propose a selective hardware/compiler strategy to optimize cache locality for integer, numerical (array-intensive), and mixed codes. In our framework, the role of the compiler is to identify program regions that can be optimized at compile time using loop and data transformations and to mark (at compile time) the unoptimizable regions with special instructions that activate/deactivate a hardware optimization mechanism selectively at run-time. Our results show that our technique can improve program performance by as much as 60% with respect to the base configuration and 17% with respect to a non-selective hardware/compiler approach.
Most if not all contemporary processors use caches to hide memory latency. In order to maintain a high clock speed, chip designers often resort to L1 caches that have no associativity, i.e., direct-mapped caches. Since most processors in the past were designed to run a variety of applications, these caches were also designed to perform well on a variety of applications. Currently, however, many processors are embedded into devices that perform a dedicated task. As a result, application-specific optimizations have become much more interesting than in the past. Memory addresses are mapped to cache lines by so-called indexing or placement functions. In traditional caches, this mapping is done by selecting a series of least significant bits from the memory address. In this work, we investigate the possibilities of adapting cache placement functions to specific application behavior. More specifically, we aim to improve the performance of direct-mapped caches by using improved bit-selection placement functions. We evaluate several offline techniques to find these improved placement functions.
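To make the idea concrete, the following C sketch (not taken from the paper) contrasts the conventional indexing function with a bit-selection placement function built from an arbitrary, application-tuned subset of address bits; the cache geometry and the particular bit positions are purely illustrative.

#include <stdint.h>
#include <stdio.h>

/* Classic direct-mapped indexing: the low-order address bits just above
 * the block offset select the cache line. */
static unsigned classic_index(uint32_t addr, unsigned offset_bits,
                              unsigned index_bits)
{
    return (addr >> offset_bits) & ((1u << index_bits) - 1u);
}

/* Bit-selection indexing: gather an arbitrary subset of address bits,
 * chosen offline for one application, into the line index.
 * bit_positions[] lists the chosen address bits, index LSB first. */
static unsigned selected_index(uint32_t addr, const unsigned *bit_positions,
                               unsigned index_bits)
{
    unsigned idx = 0;
    for (unsigned i = 0; i < index_bits; i++)
        idx |= (unsigned)((addr >> bit_positions[i]) & 1u) << i;
    return idx;
}

int main(void)
{
    /* Hypothetical selection for a 256-line, 32-byte-block cache: the usual
     * bits 5..9 plus three higher-order bits that happen to separate two
     * conflicting data structures in the target application. */
    const unsigned bits[8] = {5, 6, 7, 8, 9, 14, 17, 20};
    uint32_t addr = 0x0012ABC0u;

    printf("classic index:  %u\n", classic_index(addr, 5, 8));
    printf("selected index: %u\n", selected_index(addr, bits, 8));
    return 0;
}

An offline search of the kind evaluated in the paper would score many candidate bit_positions arrays against an address trace of the application and keep the one that minimizes conflict misses.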
IEEE Transactions on Computers, 1998
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to minimize cache interference by improving the layout of the basic blocks of the code. However, the performance impact of this technique has been reported for application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. It is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes, in detail, the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: Rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. Based on our observations, we propose an algorithm to expose these localities and reduce interference in the cache. For a range of cache sizes, associativities, line sizes, and organizations, we show that we reduce total instruction miss rates by 31-86 percent, or up to 2.9 absolute points. Using a simple model, this corresponds to execution time reductions on the order of 10-25 percent. In addition, our optimized operating system combines well with optimized and unoptimized applications.
In this paper we propose an instruction to accelerate software caches. While DMAs are very efficient for predictable data sets that can be fetched before they are needed, they introduce a large latency overhead for computations with unpredictable access behavior. Software caches are advantageous when the data set is not predictable but exhibits locality. However, software caches also incur a large overhead. Because the main overhead is in the access function, we propose an instruction that replaces the look-up function of the software cache. This instruction is evaluated using the Multidimensional Software Cache and two multimedia kernels, GLCM and H.264 Motion Compensation. The results show that the proposed instruction accelerates the software cache access time by a factor of 2.6. This improvement translates to a 2.1 speedup for GLCM and 1.28 for MC, when compared with the IBM software cache.
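As a rough illustration of where that overhead comes from, here is a generic direct-mapped software cache lookup in C of the kind such an instruction would replace. This is a sketch, not the IBM software cache or the Multidimensional Software Cache: the line count, line size, and the fetch_line stand-in for a DMA transfer are assumptions made for the example.

#include <stdint.h>
#include <string.h>

#define SC_LINES      128                        /* number of cache lines */
#define SC_LINE_BYTES 128                        /* line size in bytes    */

typedef struct {
    uint64_t tag[SC_LINES];                      /* tag = aligned global address */
    uint8_t  valid[SC_LINES];
    uint8_t  data[SC_LINES][SC_LINE_BYTES];      /* local copy of each line      */
} sw_cache_t;

/* Bring one line into local storage; on a real local-store architecture
 * this would be a blocking DMA transfer rather than a memcpy. */
static void fetch_line(uint8_t *dst, uint64_t line_addr)
{
    memcpy(dst, (const void *)(uintptr_t)line_addr, SC_LINE_BYTES);
}

/* The access function whose index/tag arithmetic and hit test the proposed
 * instruction would collapse into a single operation. */
static uint8_t *sc_lookup(sw_cache_t *sc, uint64_t addr)
{
    uint64_t line_addr = addr & ~(uint64_t)(SC_LINE_BYTES - 1);
    unsigned line      = (unsigned)((line_addr / SC_LINE_BYTES) % SC_LINES);
    unsigned offset    = (unsigned)(addr & (SC_LINE_BYTES - 1));

    if (!sc->valid[line] || sc->tag[line] != line_addr) {    /* miss path */
        fetch_line(sc->data[line], line_addr);
        sc->tag[line]   = line_addr;
        sc->valid[line] = 1;
    }
    return &sc->data[line][offset];                          /* hit path  */
}

int main(void)
{
    static _Alignas(128) uint8_t global_buf[4096];  /* stands in for off-chip memory */
    static sw_cache_t sc;                           /* zero-initialized: all invalid */
    uint64_t base = (uint64_t)(uintptr_t)global_buf;
    unsigned sum = 0;

    for (unsigned i = 0; i < sizeof global_buf; i++)
        global_buf[i] = (uint8_t)i;
    for (unsigned i = 0; i < sizeof global_buf; i++)
        sum += *sc_lookup(&sc, base + i);
    return (int)(sum & 0xff);
}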
… of the 3rd international conference on High …, 2008
High performance embedded architectures will in some cases combine simple caches and multithreading, two techniques that increase energy efficiency and performance at the same time. However, that combination can produce high and unpredictable cache miss rates, even when the compiler optimizes the data layout of each program for the cache.
IEEE Computer Architecture Letters, 2019
Instruction cache misses are a significant source of performance degradation in server workloads because of their large instruction footprints and complex control flow. Due to the importance of reducing the number of instruction cache misses, there has been a myriad of proposals for hardware instruction prefetchers in the past two decades. While effectual, state-of-the-art hardware instruction prefetchers either impose considerable storage overhead or require significant changes in the frontend of a processor. Unlike hardware instruction prefetchers, code-layout optimization techniques profile a program and then reorder the code layout of the program to increase spatial locality, and hence, reduce the number of instruction cache misses. While an active area of research in the 1990s, code-layout optimization techniques have largely been neglected in the past decade. We evaluate the suitability of code-layout optimization techniques for modern server workloads and show that if we combine these techniques with a simple next-line prefetcher, they can significantly reduce the number of instruction cache misses. Moreover, we propose a new code-layout optimization algorithm and show that along with a next-line prefetcher, it offers the same performance improvement as the state-of-the-art hardware instruction prefetcher, but with almost no hardware overhead.
In embedded systems, cost, power consumption, and die size requirements push the designer to use small and simple cache memories. Such caches can provide low performance because of limited memory capacity and an inflexible placement policy. One way to increase performance is to adapt the program layout to the cache structure. This strategy requires solving an NP-complete problem and can demand a very long processing time. We propose a strategy that searches for a near-optimum program layout within a reasonable time by means of smart heuristics. This solution does not add code and uses the standard functionalities of a linker to produce the new layout. Our approach is able to reduce misses by up to 70% in the case of a 2-Kbyte direct-mapped cache.
ACM Transactions on Programming Languages and Systems, 2004
The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed.
Real-Time Systems, 2013
Caches are essential to bridge the gap between the high latency main memory and the fast processor pipeline. Standard processor architectures implement two first-level caches to avoid a structural hazard in the pipeline: an instruction cache and a data cache. For tight worst-case execution times it is important to classify memory accesses as either cache hit or cache miss. The addresses of instruction fetches are known statically and static cache hit/miss classification is possible for the instruction cache. The access to data that is cached in the data cache is harder to predict statically. Several different data areas, such as stack, global data, and heap allocated data, share the same cache. Some addresses are known statically, other addresses are only known at runtime. With a standard cache organization all those different data areas must be considered by worst-case execution time analysis. In this paper we propose to split the data cache for the different data areas. Data cache analysis can be performed individually for the different areas. Access to an unknown address in the heap does not destroy the abstract cache state for other data areas. Furthermore, we propose to use a small, highly associative cache for the heap area. We designed and implemented a static analysis for this cache, and integrated it into a worst-case execution time analysis tool.
Proceedings of the 1999 ACM/IEEE conference on Supercomputing, 1999
Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformations are unaffected because they improve reuse for all levels of the cache; however, some optimizations can be enhanced. Inter-variable padding can take advantage of modular arithmetic to eliminate conflict misses and preserve group reuse on multiple cache levels. Loop fusion can balance increasing group reuse for the L2 (secondary) cache at the expense of losing group reuse at the smaller L1 cache. Tiling for the L1 cache also exploits locality available in the L2 cache. Experiments show enhanced algorithms are able to reduce cache misses, but performance improvements are rarely significant. Our results indicate existing compiler optimizations are usually sufficient to achieve good performance for multi-level caches.
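For reference, the sketch below shows the tiling transformation discussed here applied to a matrix multiply in C; the problem size N and tile size TILE are illustrative and would normally be chosen so that a tile's working set fits in the L1 cache (which, per the results above, also captures most of the reuse available in L2).

#include <stddef.h>

#define N    512
#define TILE 32              /* chosen so a TILE x TILE working set fits in L1 */

/* Tiled matrix multiply; the caller is expected to zero-initialize C. */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                /* Work on one TILE x TILE block of each array at a time. */
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t k = kk; k < kk + TILE; k++) {
                        double a = A[i][k];
                        for (size_t j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}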
IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004.
The predictability of memory access patterns in embedded systems can be successfully exploited to devise effective application-specific cache optimizations. In this work, we propose an improved indexing scheme for direct-mapped caches, which drastically reduces the number of conflict misses by using application-specific information; the scheme is based on the selection of a subset of the address bits. With respect to similar approaches, our solution has two main strengths. First, it models the misses analytically by building a miss equation, and exploits a symbolic algorithm to compute the exact optimum solution (i.e., the subset of address bits to be used as cache index that minimizes conflict misses). Second, we designed a re-configurable bit selector, which can be programmed at run-time to fit the optimal cache indexing to a given application. Results show an average reduction of conflict misses of 24%, measured over a set of standard benchmarks, and for different cache configurations.
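The symbolic miss-equation approach above computes the optimal bit subset analytically; a much simpler trace-driven baseline, sketched below in C under assumed cache parameters, scores a candidate bit selection by simulating a direct-mapped cache over an address trace, and can serve to cross-check or brute-force small search spaces.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define OFFSET_BITS 5                 /* 32-byte lines         */
#define INDEX_BITS  6                 /* 64-line direct-mapped */
#define NLINES      (1u << INDEX_BITS)

/* Count the misses of a direct-mapped cache whose line index is built from
 * the address bits listed in sel[] (least-significant index bit first). */
static uint64_t count_misses(const uint32_t *trace, size_t n,
                             const unsigned sel[INDEX_BITS])
{
    uint32_t tag[NLINES];
    uint8_t  valid[NLINES];
    uint64_t misses = 0;

    memset(valid, 0, sizeof valid);
    for (size_t t = 0; t < n; t++) {
        uint32_t line = trace[t] >> OFFSET_BITS;
        unsigned idx  = 0;
        for (unsigned b = 0; b < INDEX_BITS; b++)
            idx |= (unsigned)((trace[t] >> sel[b]) & 1u) << b;
        if (!valid[idx] || tag[idx] != line) {
            misses++;
            tag[idx]   = line;
            valid[idx] = 1;
        }
    }
    return misses;
}

int main(void)
{
    /* Synthetic trace: two arrays placed 8 KiB apart, accessed alternately,
     * which conflict badly under classic low-order-bit indexing. */
    enum { NACC = 4096 };
    static uint32_t trace[2 * NACC];
    for (unsigned i = 0; i < NACC; i++) {
        trace[2 * i]     = 0x10000u + 4u * i;
        trace[2 * i + 1] = 0x12000u + 4u * i;
    }

    const unsigned classic[INDEX_BITS]   = {5, 6, 7, 8, 9, 10};
    const unsigned candidate[INDEX_BITS] = {5, 6, 7, 8, 9, 13};

    printf("classic bits:   %llu misses\n",
           (unsigned long long)count_misses(trace, 2 * NACC, classic));
    printf("candidate bits: %llu misses\n",
           (unsigned long long)count_misses(trace, 2 * NACC, candidate));
    return 0;
}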
The increasing number of threads inside the cores of a multicore processor, and competitive access to the shared cache memory, are the main reasons for an increased number of competitive cache misses and for performance decline. Inevitably, the development of modern processor architectures leads to an increased number of cache misses. In this paper, we attempt to implement a technique for decreasing the number of competitive cache misses in the first level of cache memory. This technique enables competitive access to the entire cache memory when there is a hit; on a miss, the data is placed (by the replacement policy) in a virtual partition assigned to the thread, so that competitive cache misses are avoided. Using a simulator tool, the results show a decrease in the number of cache misses and a performance increase of up to 15%. The conclusion of this research is that cache misses remain a real challenge for future processor designers seeking to hide memory latency.
1997
In recent years, loop tiling has become an increasingly popular technique for increasing cache effectiveness. This is accomplished by transforming a loop nest so that the temporal and spatial locality can be better exploited for a given cache size. However, this optimization only targets the reduction of capacity misses. As recently demonstrated by several groups of researchers, conflict misses can still preclude effective cache utilization. Moreover, the severity of cache conflicts can vary greatly with slight variations in problem size and starting addresses, making performance difficult to even predict, let alone optimize. To reduce conflict misses, data copying has been proposed. With this technique, data layout in cache is adjusted by copying array tiles to temporary arrays that exhibit better cache behavior. Although copying has been proposed as the panacea to the problem of cache conflicts, this solution experiences a cost proportional to the amount of data being copied. To date, there has been no discussion regarding either this tradeoff or the problem of determining what and when to copy. In this paper, we present a compile-time technique for making this determination, and present a selective copying strategy based on this methodology. Preliminary experimental results demonstrate that, because of the sensitivity of cache conflicts to small changes in problem size and base addresses, selective copying can lead to better overall performance than either no copying, complete copying, or manually applied heuristics.
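As a sketch of the copying technique being weighed here (reusing the tiled matrix-multiply shape from the earlier example, with illustrative sizes), the tile of B is copied into a small contiguous buffer before use so that its cache footprint cannot conflict with the other arrays; the cost of that copy loop is exactly what a selective strategy must trade against the conflict misses it removes.

#include <stddef.h>

#define N    512
#define TILE 32

/* Tiled matrix multiply with tile copying; the caller zero-initializes C. */
void matmul_tiled_copy(const double A[N][N], const double B[N][N],
                       double C[N][N])
{
    static double Bt[TILE][TILE];                /* contiguous temporary tile */

    for (size_t kk = 0; kk < N; kk += TILE)
        for (size_t jj = 0; jj < N; jj += TILE) {
            /* Copy the current tile of B into a conflict-free buffer. */
            for (size_t k = 0; k < TILE; k++)
                for (size_t j = 0; j < TILE; j++)
                    Bt[k][j] = B[kk + k][jj + j];

            /* Use the copied tile for every row of A. */
            for (size_t i = 0; i < N; i++)
                for (size_t k = 0; k < TILE; k++) {
                    double a = A[i][kk + k];
                    for (size_t j = 0; j < TILE; j++)
                        C[i][jj + j] += a * Bt[k][j];
                }
        }
}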
2009 IEEE 15th International Symposium on High Performance Computer Architecture, 2009
In future multi-cores, large amounts of delay and power will be spent accessing data in large L2/L3 caches. It has been recently shown that OS-based page coloring allows a non-uniform cache architecture (NUCA) to provide low latencies and not be hindered by complex data search mechanisms. In this work, we extend that concept with mechanisms that dynamically move data within caches. The key innovation is the use of a shadow address space to allow hardware control of data placement in the L2 cache while being largely transparent to the user application and off-chip world. These mechanisms allow the hardware and OS to dynamically manage cache capacity per thread as well as optimize placement of data shared by multiple threads. We show an average IPC improvement of 10-20% for multiprogrammed workloads with capacity allocation policies and an average IPC improvement of 8% for multi-threaded workloads with policies for shared page placement.
Lecture Notes in Computer Science, 2000
The performance of a traditional cache memory hierarchy can be improved by utilizing mechanisms such as a victim cache or a stream buffer (cache assists). The amount of on-chip memory for cache assist is typically limited for technological reasons. In addition, the cache assist size is limited in order to maintain a fast access time. Performance gains from using a stream buffer or a victim cache, or a combination of the two, vary from program to program as well as within a program. Therefore, given a limited amount of cache assist memory, there is a need and a potential for "adaptivity" of the cache assists, i.e., an ability to vary their relative size within the bounds of the cache assist memory size. We propose and study a compiler-driven adaptive cache assist organization and its effect on system performance. Several adaptivity mechanisms are proposed and investigated. The results show that a cache assist that is adaptive at loop level clearly improves the cache memory performance, has low overhead, and can be easily implemented.
International Journal of Computer Applications, 2014
Caching is a very important technique for improving computer system performance. It is employed to hide the latency gap between memory and the CPU by exploiting locality in memory accesses.
1997
This report presents observations and some analysis of performance tuning for cache-based systems. We point out several counterintuitive results that serve as a cautionary reminder that memory accesses are not the only factors that determine performance, and that within the class of cache-based systems, significant differences exist.
IEEE Transactions on Computers, 1999
Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. Our approach is very effective at reducing cache misses and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can accommodate this case and demonstrate how it can be used to optimize multiple loop nests. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.
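The toy nest below (in C; it is not the paper's algorithm) illustrates the two kinds of transformation the unified framework chooses between: a loop interchange when it is legal, and a transposed data layout when the loop order, or another array in the nest, fixes the traversal.

#define N 1024

/* Original nest: walks A column by column, so consecutive iterations touch
 * addresses N*sizeof(double) apart and get little spatial locality in
 * row-major C.  col_sum is assumed to be zero-initialized by the caller. */
void sum_cols_orig(const double A[N][N], double col_sum[N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            col_sum[j] += A[i][j];
}

/* Loop transformation: interchanging i and j makes the inner loop walk one
 * row of A, restoring stride-1 accesses without touching the layout. */
void sum_cols_interchanged(const double A[N][N], double col_sum[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            col_sum[j] += A[i][j];
}

/* Data layout transformation: if the loop order must stay fixed, storing the
 * array transposed (At[j][i] == A[i][j]) achieves the same stride-1 walks. */
void sum_cols_transposed(const double At[N][N], double col_sum[N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            col_sum[j] += At[j][i];
}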
2004
In the pursuit of raw computing horsepower, designers often try to improve throughput of designs through an increase in CPU clock frequency. But at a certain point, bus bandwidth and access speed to main memory become the limiting factors in overall processor throughput. This limitation may be overcome with faster, more expensive memory. However, at some point economic feasibility becomes an issue. A good cache memory design and implementation can help to minimize bus latency and improve overall processor throughput, all with modest economic cost.
2008
Due to the large contribution of the memory subsystem to total system power, the memory subsystem is highly amenable to customization for reduced power/energy and/or improved performance. Cache parameters such as total size, line size, and associativity can be specialized to the needs of an application for system optimization. In order to determine the best values for cache parameters, most methodologies utilize repetitious application execution to individually analyze each configuration explored. In this paper we propose a simplified yet efficient technique to accurately estimate the miss rate of many different cache configurations in just one single-pass of execution. The approach utilizes simple data structures in the form of a multi-layered table and elementary bitwise operations to capture the locality characteristics of an application's addressing behavior. The proposed technique intends to ease miss rate estimation and reduce cache exploration time.
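The sketch below is not the table-and-bitwise technique proposed here; it is the straightforward single-pass baseline it improves upon, written in C: the trace is walked once and every simulated direct-mapped configuration is updated per address, so no configuration requires its own execution. Line size, the set of configurations, and the synthetic trace are assumptions for the example.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5                          /* 32-byte lines                */
#define NCONFIG     4                          /* direct-mapped caches to test */

static const unsigned index_bits[NCONFIG] = {6, 7, 8, 9};   /* 64..512 lines   */

static uint64_t tags[NCONFIG][1u << 9];        /* sized for the largest config */
static uint8_t  valid[NCONFIG][1u << 9];
static uint64_t misses[NCONFIG], accesses;

/* Feed one address to all simulated configurations in the same pass. */
static void access_addr(uint64_t addr)
{
    uint64_t line = addr >> OFFSET_BITS;
    accesses++;
    for (int c = 0; c < NCONFIG; c++) {
        unsigned set = (unsigned)(line & ((1u << index_bits[c]) - 1u));
        if (!valid[c][set] || tags[c][set] != line) {
            misses[c]++;
            tags[c][set]  = line;
            valid[c][set] = 1;
        }
    }
}

int main(void)
{
    /* Synthetic trace: a 4 KiB hot region reused on every pass, interleaved
     * with a streaming walk that never revisits data. */
    for (uint64_t rep = 0; rep < 256; rep++) {
        for (uint64_t a = 0; a < 4096; a += 16)
            access_addr(a);
        for (uint64_t a = 0x100000u + rep * 1024; a < 0x100000u + (rep + 1) * 1024; a += 16)
            access_addr(a);
    }

    for (int c = 0; c < NCONFIG; c++)
        printf("%4u lines: miss rate %.3f\n", 1u << index_bits[c],
               (double)misses[c] / (double)accesses);
    return 0;
}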