1999, ACM SIGPLAN Notices
Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs.
Computer, 2000
To narrow the widening gap between processor and memory performance, the authors propose improving the cache locality of pointer-manipulating programs and bolstering performance by careful placement of structure elements.
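As a hedged illustration of what careful placement of structure elements can look like in practice (the field names below are invented for this sketch, not taken from the paper), one common tactic is to pack the fields touched on the hot path together and split rarely-used fields out behind a pointer, so a pointer-chasing traversal pulls fewer cache lines per node:

    /* Before: hot and cold fields interleaved. A list scan that only reads
       key and next still drags the cold bytes into the cache. */
    struct node_mixed {
        int                key;             /* hot  */
        char               debug_name[48];  /* cold: rarely read */
        struct node_mixed *next;            /* hot  */
        long               stats[6];        /* cold: rarely read */
    };

    /* After: hot fields packed into one cache line; cold fields split out
       and reached through one extra pointer only when actually needed. */
    struct node_cold { char debug_name[48]; long stats[6]; };
    struct node_hot {
        int               key;
        struct node_hot  *next;
        struct node_cold *cold;             /* indirection to cold section */
    };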
2003
As memory access times continue to be a bottleneck, further research is required to better understand memory access performance. Studies of cache-conscious allocation and software prefetching have recently sparked research into software optimizations for memory, as pointer-based data structures have previously eluded the available optimization techniques. Research on hardware prefetch mechanisms has in some cases shown improvements, but less analytical schemes have tended to degrade performance for pointer-based data structures.
Proceedings of the international …, 2011
Poor placement of data blocks in memory may negatively impact application performance because of an increase in the cache conflict miss rate. For dynamically allocated structures this placement is typically determined by the memory allocator. Cache index-oblivious allocators may inadvertently place blocks on a restricted fraction of the available cache indexes, artificially and needlessly increasing the conflict miss rate. While some allocators are less vulnerable to this phenomenon, no general-purpose malloc allocator is index-aware and methodically addresses this concern. We demonstrate that many existing state-of-the-art allocators are index-oblivious, admitting performance pathologies for certain block sizes. We show that a simple adjustment within the allocator to control the spacing of blocks can provide better index coverage, which in turn reduces the superfluous conflict miss rate in various applications, improving performance with no observed negative consequences. The result is an index-aware allocator. Our technique is general and can easily be applied to most memory allocators and to various processor architectures.
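As a loose sketch of the spacing idea (the cache geometry and stride here are assumptions for illustration, not the paper's allocator), the following C program shows how blocks handed out at a set-aligned stride all collide on a single cache set, while staggering each block by one extra cache line spreads them across sets:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical cache geometry: 64-byte lines, 512 sets
       (e.g. a 256 KiB 8-way cache). */
    #define LINE_SIZE 64
    #define NUM_SETS  512

    static unsigned cache_set(const void *p) {
        return (unsigned)(((uintptr_t)p / LINE_SIZE) % NUM_SETS);
    }

    int main(void) {
        /* An index-oblivious size-class allocator may hand out blocks at a
           fixed stride; if that stride is a multiple of LINE_SIZE * NUM_SETS,
           every block maps to the same cache set. */
        enum { N = 16, STRIDE = LINE_SIZE * NUM_SETS };
        char *arena = aligned_alloc(STRIDE, (size_t)N * STRIDE);
        if (arena == NULL) return 1;

        for (int i = 0; i < N; i++) {
            char *oblivious = arena + (size_t)i * STRIDE;        /* same set every time */
            char *aware     = oblivious + (size_t)i * LINE_SIZE; /* staggered spacing   */
            printf("block %2d: oblivious set %3u, index-aware set %3u\n",
                   i, cache_set(oblivious), cache_set(aware));
        }
        free(arena);
        return 0;
    }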
Lecture Notes in Computer Science, 1999
Recursive data structures (lists, trees, graphs, etc.) are used throughout scientific and commercial software. The common approach is to allocate storage to the individual nodes of such structures dynamically, maintaining the logical connection between them via pointers. Once such a data structure goes through a sequence of updates (inserts and deletes), it may get scattered all over memory, yielding poor spatial locality, which in turn introduces many cache misses. In this paper we present the new concept of Virtual Cache Lines (VCLs). Basically, the mechanism keeps groups of consecutive nodes in close proximity, forming virtual cache lines, while allowing the groups to be stored arbitrarily far away from each other. Virtual cache lines increase the spatial locality of the given data structure, resulting in better locality of reference. Furthermore, since the spatial locality is improved, software prefetching becomes much more attractive. Indeed, we also present a software prefetching algorithm that can be used when dealing with VCLs, resulting in even higher data cache performance. Our results show that the average performance of linked-list operations like scan, insert, and delete can be improved by more than 200%, even on architectures that do not support prefetching. Moreover, when using prefetching, one can gain an additional 100% improvement. We believe that, given a program that manipulates certain recursive data structures, compilers will be able to generate VCL-based code. Until this vision becomes reality, VCLs can be used to build more efficient user libraries, operating systems, and application programs.
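The VCL mechanism itself is described in the paper; the following is only a minimal sketch of the underlying idea, with the chunk capacity and names chosen arbitrarily: nodes are grouped into small contiguous chunks, so a scan dereferences one link per chunk instead of one per node, and updates stay local to a chunk:

    #include <stdlib.h>

    #define VCL_CAP 8                    /* nodes per virtual cache line (assumed) */

    struct vcl {
        int         data[VCL_CAP];       /* node payloads, stored contiguously */
        int         count;               /* nodes currently in this chunk */
        struct vcl *next;                /* link to the next virtual line */
    };

    /* Append a value, opening a new chunk only when the current one is full. */
    static struct vcl *vcl_append(struct vcl *tail, int value) {
        if (tail == NULL || tail->count == VCL_CAP) {
            struct vcl *c = calloc(1, sizeof *c);
            if (c == NULL) abort();      /* sketch: no real error handling */
            if (tail != NULL) tail->next = c;
            tail = c;
        }
        tail->data[tail->count++] = value;
        return tail;                     /* the first call returns the head */
    }

    /* Scan: one pointer dereference per VCL_CAP nodes instead of per node. */
    static long vcl_sum(const struct vcl *head) {
        long sum = 0;
        for (const struct vcl *c = head; c != NULL; c = c->next)
            for (int i = 0; i < c->count; i++)
                sum += c->data[i];
        return sum;
    }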
2003 Design, Automation and Test in Europe Conference and Exhibition, 2003
The widening gap between processor and memory speeds renders data locality optimization a very important issue in data-intensive embedded applications. Throughout the years, hardware designers and compiler writers have focused on optimizing data cache locality using intelligent cache management mechanisms and program-level transformations, respectively. Until now, there has not been significant research investigating the interaction between these optimizations. In this work, we investigate this interaction and propose a selective hardware/compiler strategy to optimize cache locality for integer, numerical (array-intensive), and mixed codes. In our framework, the role of the compiler is to identify program regions that can be optimized at compile time using loop and data transformations, and to mark (at compile time) the unoptimizable regions with special instructions that selectively activate or deactivate a hardware optimization mechanism at run time. Our results show that our technique can improve program performance by as much as 60% with respect to the base configuration and 17% with respect to a non-selective hardware/compiler approach.
Oceans 2002 Conference and Exhibition. Conference Proceedings (Cat. No.02CH37362)
This paper describes Compiler-Directed Content-Aware Prefetching (CDCAP), an integrated compiler and hardware approach for prefetching dynamic data structures. The approach utilizes compiler-inserted prefetch instructions to convey information about a dynamic data structure to a prefetching engine. The technique eliminates the need to transform the data structure, avoids issuing excessive prefetches, and does not require prior knowledge of data traversals. The approach also eliminates the need for large hardware structures and reduces unnecessary prefetches. For pointer-intensive programs, the CDCAP approach reduces memory stall time by up to 40% and outperforms previously proposed prefetching techniques.
IEEE Computer Architecture Letters, 2019
Instruction cache misses are a significant source of performance degradation in server workloads because of their large instruction footprints and complex control flow. Due to the importance of reducing the number of instruction cache misses, there has been a myriad of proposals for hardware instruction prefetchers in the past two decades. While effective, state-of-the-art hardware instruction prefetchers either impose considerable storage overhead or require significant changes in the frontend of a processor. Unlike hardware instruction prefetchers, code-layout optimization techniques profile a program and then reorder the code layout of the program to increase spatial locality and, hence, reduce the number of instruction cache misses. While an active area of research in the 1990s, code-layout optimization techniques have largely been neglected in the past decade. We evaluate the suitability of code-layout optimization techniques for modern server workloads and show that if we combine these techniques with a simple next-line prefetcher, they can significantly reduce the number of instruction cache misses. Moreover, we propose a new code-layout optimization algorithm and show that, along with a next-line prefetcher, it offers the same performance improvement as the state-of-the-art hardware instruction prefetcher, but with almost no hardware overhead.
Processor speed has been improving faster than memory latency for over two decades. Thus, an increasing portion of execution time is spent stalling on loads, waiting for data from the memory hierarchy. Prefetching is an effective mechanism for hiding memory latency in applications with low temporal locality. However, existing hardware prefetching techniques work well for array-based programs but are not effective for pointer-intensive applications. Existing software prefetching techniques require recompilation, or even modifications to application code, before running on a new system with different hardware parameters.
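A minimal example of the software side, assuming a GCC/Clang toolchain and an arbitrarily chosen one-node prefetch distance (exactly the kind of hardware-dependent tuning knob the paragraph above is complaining about):

    /* On LP64, seven long payload words plus the link pointer fill one
       64-byte cache line. */
    struct node { long payload[7]; struct node *next; };

    /* Greedy pointer-chase prefetching: request the next node's cache line
       while the current node's payload is still being consumed. */
    long sum_list(const struct node *n) {
        long sum = 0;
        while (n != NULL) {
            if (n->next != NULL)
                __builtin_prefetch(n->next, 0, 1);  /* read, low temporal reuse */
            sum += n->payload[0];
            n = n->next;
        }
        return sum;
    }

A longer prefetch distance hides more latency but needs jump pointers or address history to look further ahead, which is why such code typically has to be retuned or recompiled for each machine.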
IEEE Transactions on Computers, 1998
High instruction cache hit rates are key to high performance. One known technique for improving the hit rate of caches is to minimize cache interference by improving the layout of the basic blocks of the code. However, the performance impact of this technique has been reported for application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. It is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper, characterizing in detail the locality patterns of operating system code and showing that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. Based on our observations, we propose an algorithm to expose these localities and reduce interference in the cache. For a range of cache sizes, associativities, line sizes, and organizations, we show that we reduce total instruction miss rates by 31-86 percent, or up to 2.9 absolute points. Using a simple model, this corresponds to execution time reductions on the order of 10-25 percent. In addition, our optimized operating system combines well with optimized and unoptimized applications.
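One concrete way to act on the rarely-executed special-case observation, shown here with ordinary GCC/Clang hints rather than the paper's layout algorithm, is to push error handling out of line so the common-case code stays densely packed in the cache:

    #include <stdio.h>
    #include <stdlib.h>

    /* Cold path: marked so the compiler places it away from hot code and
       never inlines it into the fall-through path. */
    __attribute__((cold, noinline))
    static void report_and_exit(const char *msg) {
        fprintf(stderr, "fatal: %s\n", msg);
        exit(1);
    }

    int process(int fd) {
        /* __builtin_expect marks the error branch as unlikely, keeping the
           straight-line path free of the handler's code. */
        if (__builtin_expect(fd < 0, 0))
            report_and_exit("bad descriptor");
        return fd * 2;   /* stand-in for the common-case work */
    }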
Proceedings of the 35th …, 2002
Data prefetching effectively reduces the negative effects of long load latencies on the performance of modern processors. Hardware prefetchers employ hardware structures to predict future memory addresses based on previous patterns. Thread-based prefetchers use portions of the actual program code to determine future load addresses for prefetching. This paper proposes the use of a pointer cache, which tracks pointer transitions, to aid prefetching. The pointer cache provides, for a given pointer's effective address, the base address of the object pointed to by that pointer. We examine using the pointer cache in a wide-issue superscalar processor as a value predictor and to aid prefetching when a chain of pointers is being traversed. When a load misses in the L1 cache but hits in the pointer cache, the first two cache blocks of the pointed-to object are prefetched. In addition, the load's dependencies are broken by using the pointer cache hit as a value prediction.
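As a rough software model of that hardware structure (the table size, hash, and 64-byte block size are all assumptions for illustration), the pointer cache can be pictured as a direct-mapped table from a pointer's own address to the object base it last held:

    #include <stdint.h>

    #define PC_ENTRIES 1024
    struct pc_entry { uintptr_t ptr_addr; uintptr_t target_base; };
    static struct pc_entry pointer_cache[PC_ENTRIES];

    static unsigned pc_index(uintptr_t ptr_addr) {
        return (unsigned)((ptr_addr >> 3) % PC_ENTRIES);  /* direct-mapped */
    }

    /* Record a pointer transition when a pointer load completes. */
    static void pc_update(void **ptr_slot) {
        struct pc_entry *e = &pointer_cache[pc_index((uintptr_t)ptr_slot)];
        e->ptr_addr    = (uintptr_t)ptr_slot;
        e->target_base = (uintptr_t)*ptr_slot;
    }

    /* On an L1 miss, consult the table; on a hit, prefetch the first two
       cache blocks of the predicted target object. */
    static void pc_prefetch(void **ptr_slot) {
        struct pc_entry *e = &pointer_cache[pc_index((uintptr_t)ptr_slot)];
        if (e->ptr_addr == (uintptr_t)ptr_slot && e->target_base != 0) {
            __builtin_prefetch((void *)e->target_base, 0, 1);
            __builtin_prefetch((void *)(e->target_base + 64), 0, 1);
        }
    }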