2008 37th International Conference on Parallel Processing
Parallel applications can benefit greatly from massive computational capability, but their performance usually suffers from large latency in I/O accesses. Conventional I/O prefetching techniques are conservative and are limited by low accuracy and coverage. As processor performance has been increasing rapidly and computing power has become virtually free, we introduce a novel speculative approach for comprehensive and aggressive parallel I/O prefetching in this study. We present the design of our approach, as well as the challenges, solutions, and prototype implementation. Experiments have shown promising results in reducing I/O access latency.
The Journal of Supercomputing, 2008
Multiple memory models have been proposed to capture the effects of memory hierarchy, culminating in the I-O model of Aggarwal and Vitter [?]. More than a decade of architectural advancements have led to new features that are not captured in the I-O model, most notably the prefetching capability. We propose a relatively simple Prefetch model that incorporates data prefetching in the traditional I-O models and show how to design optimal algorithms that can attain close to peak memory bandwidth. Unlike (the inverse of) memory latency, memory bandwidth is much closer to the processing speed; intelligent use of prefetching can therefore considerably mitigate the I-O bottleneck. For some fundamental problems, our algorithms attain running times approaching that of the idealized Random Access Machines under reasonable assumptions. Our work also explains more precisely the significantly superior performance of I-O efficient algorithms in systems that support prefetching compared to ones that do not.
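The bandwidth-versus-latency argument above can be made concrete with a toy cost comparison: without prefetching every block access pays the full memory latency, while with enough outstanding prefetches transfers pipeline behind one startup latency. A minimal sketch follows; all constants are illustrative assumptions, not the paper's parameters.

# Toy cost model in the spirit of the Prefetch model: latency-bound
# access vs. pipelined prefetching at peak bandwidth.

def latency_bound_cost(n_blocks, latency):
    # Each block access stalls for the full memory latency.
    return n_blocks * latency

def prefetch_pipelined_cost(n_blocks, latency, bandwidth_blocks_per_cycle):
    # With enough outstanding prefetches the transfers overlap: one
    # startup latency, then blocks stream at peak bandwidth.
    return latency + n_blocks / bandwidth_blocks_per_cycle

if __name__ == "__main__":
    n, lat, bw = 1_000_000, 300, 0.25   # cycles and blocks/cycle, assumed
    print("no prefetch:", latency_bound_cost(n, lat))
    print("prefetched :", prefetch_pipelined_cost(n, lat, bw))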
… , Storage and Analysis, …, 2008
Parallel applications are usually able to achieve high computational performance but suffer from large latency in I/O accesses. I/O prefetching is an effective solution for masking this latency. Most existing I/O prefetching techniques, however, are conservative, and their effectiveness is limited by low accuracy and coverage. As the processor-I/O performance gap has been widening rapidly, data-access delay has become a dominant performance bottleneck. We argue that it is time to revisit the "I/O wall" problem and trade excessive computing power for data-access speed. We propose a novel pre-execution approach for masking I/O latency. We describe the pre-execution I/O prefetching framework, the pre-execution thread construction methodology, the underlying library support, and the prototype implementation in ROMIO, the MPI-IO implementation in MPICH2. Preliminary experiments show that the pre-execution approach is promising in reducing I/O access latency and has real potential.
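The core idea, a helper thread that replays only the I/O portion of the computation a bounded distance ahead of the main thread, can be sketched at user level. The following is an illustration of that idea only, not the paper's ROMIO/MPICH2 implementation; the file name is hypothetical and posix_fadvise stands in for the underlying library support.

# A pre-execution prefetching sketch: a helper thread issues page-cache
# hints a few requests ahead of the main thread's reads.
import os, threading

def pre_execute(fd, offsets, block, depth_sem):
    for off in offsets:
        depth_sem.acquire()            # stay a bounded distance ahead
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, off, block, os.POSIX_FADV_WILLNEED)

def main_compute(fd, offsets, block, depth_sem):
    for off in offsets:
        data = os.pread(fd, block, off)
        sum(data)                      # placeholder for real computation
        depth_sem.release()            # let the prefetcher advance

if __name__ == "__main__":
    path, block = "input.dat", 1 << 20          # hypothetical input file
    offsets = [i * block for i in range(64)]
    fd = os.open(path, os.O_RDONLY)
    ahead = threading.Semaphore(8)              # prefetch depth: 8 blocks
    t = threading.Thread(target=pre_execute, args=(fd, offsets, block, ahead))
    t.start(); main_compute(fd, offsets, block, ahead); t.join()
    os.close(fd)

The semaphore keeps the pre-execution thread at most eight requests ahead, bounding the extra buffer-cache pressure it can create.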
ISCA, 2001
This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.
Conference on Parallel and Distributed Information Systems, 1991
Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies that base decisions only on on-line reference history, and that can be implemented efficiently.
We present a simulation study of several prefetching policies to improve the I/O performance of external merging using parallel I/O. In particular, we consider traditional sequential prefetch, forecast-based greedy prefetching, and oblivious prefetching. In conjunction with the prefetching policies we evaluate the benefit of two different data placement strategies, run-level striping and block-random placement, in the presence of data skew. We show that I/O performance is greatly improved by using forecasting techniques. This method outperforms the other policies in achieving higher disk parallelism, and scales well with increasing numbers of disks and increasing data skew. Additionally, the performance of block-random data placement is shown to be uniformly good, independent of the data skew, while the performance of run-level striped placement degrades with increasing skew.
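Forecasting in external merging exploits a simple fact: the largest key in the block currently buffered for a run predicts when that run's next block will be needed, so prefetch effort should go to the run with the smallest such key. A minimal sketch of computing that consumption order, on synthetic runs rather than real disk reads, might look like this.

# Forecasting for external merge: yield blocks in the order the merge
# will consume them, which is the order a prefetcher should fetch them.
import heapq

def forecast_order(runs, block):
    """Yield (run_id, block_index) in predicted consumption order."""
    heap = []                      # (last key of current block, run, idx)
    for r, keys in enumerate(runs):
        heap.append((keys[min(block, len(keys)) - 1], r, 0))
    heapq.heapify(heap)
    while heap:
        last_key, r, i = heapq.heappop(heap)
        yield r, i
        nxt = (i + 1) * block
        if nxt < len(runs[r]):
            end = min(nxt + block, len(runs[r]))
            heapq.heappush(heap, (runs[r][end - 1], r, i + 1))

if __name__ == "__main__":
    runs = [sorted(range(k, 40, 3)) for k in range(3)]   # 3 sorted runs
    print(list(forecast_order(runs, block=4)))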
2007 IEEE 13th International Symposium on High Performance Computer Architecture, 2007
Speculative precomputation enables effective cache prefetching for even irregular memory access behavior, by using an alternate thread on a multithreaded or multi-core architecture. This paper describes a system that constructs and runs precomputation based prefetching threads via event-driven dynamic optimization. Precomputation threads are dynamically constructed by a runtime compiler from the program's frequently executed hot traces, and are adapted to the memory behavior automatically. Both construction and execution of the prefetching threads happen in another thread, imposing little overhead on the main thread. This paper also presents several techniques to accelerate the precomputation threads, including colocation of p-threads with hot traces, dynamic stride prediction, and automatic adaptation of runahead and jumpstart distance. The adaptive prefetching achieves 42% speedup, a 17% improvement over existing p-thread prefetching schemes.
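One of the acceleration techniques named above, dynamic stride prediction, is a small self-contained algorithm: if the last few addresses of a load are equally spaced, the next few can be predicted and prefetched. A minimal sketch follows; the history length and prediction depth are illustrative assumptions.

# A small dynamic stride predictor of the kind a runtime optimizer can
# attach to a hot load: detect a stable stride, then extrapolate.
from collections import deque

class StridePredictor:
    def __init__(self, history=4):
        self.addrs = deque(maxlen=history)

    def observe(self, addr):
        self.addrs.append(addr)

    def predict(self, count=4):
        if len(self.addrs) < self.addrs.maxlen:
            return []
        diffs = {b - a for a, b in zip(self.addrs, list(self.addrs)[1:])}
        if len(diffs) != 1:
            return []                       # no stable stride detected
        stride = diffs.pop()
        last = self.addrs[-1]
        return [last + stride * i for i in range(1, count + 1)]

if __name__ == "__main__":
    p = StridePredictor()
    for a in (0x1000, 0x1040, 0x1080, 0x10c0):
        p.observe(a)
    print([hex(a) for a in p.predict()])    # next 4 predicted addresses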
We have previously shown that the patterns in which files are accessed offer information that can accurately predict upcoming file accesses. Most modern caches ignore these patterns, thereby failing to use information that enables significant reductions in I/O latency. While prefetching heuristics that expect sequential accesses are often effective methods to reduce I/O latency, they cannot be applied across files, because the abstraction of a file has no intrinsic concept of a successor. This limits the ability of modern file systems to prefetch. Here we present our implementation of a predictive prefetching system that makes use of file access patterns to reduce I/O latency. Previously we developed a technique called Partitioned Context Modeling (PCM) [13] that efficiently models file accesses to reliably predict upcoming requests. We present our experiences in implementing predictive prefetching based on file access patterns. From the lessons learned we developed a new technique, Extended Partitioned Context Modeling (EPCM), which has even better performance. We have modified the Linux kernel to prefetch file data based on Partitioned Context Modeling and Extended Partitioned Context Modeling. With this implementation we examine how a prefetching policy that uses such models to predict upcoming accesses can result in large reductions in I/O latencies. We tested our implementation with four different application-based benchmarks and saw I/O latency reduced by 31% to 90% and elapsed time reduced by 11% to 16%.
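The essence of context modeling is a table from recent access sequences to successor counts, consulted longest-context-first. The simplified sketch below captures that structure only; the partitioning and trie-size bounds that make PCM efficient in the kernel are omitted, and the file names are invented.

# A simplified context-model predictor in the spirit of PCM.
from collections import defaultdict, deque, Counter

class ContextModel:
    def __init__(self, order=2):
        self.order = order
        self.ctx = deque(maxlen=order)
        self.table = defaultdict(Counter)   # context tuple -> successors

    def access(self, name):
        for k in range(1, len(self.ctx) + 1):
            self.table[tuple(list(self.ctx)[-k:])][name] += 1
        self.ctx.append(name)

    def predict(self):
        for k in range(len(self.ctx), 0, -1):   # longest context first
            c = self.table.get(tuple(list(self.ctx)[-k:]))
            if c:
                return c.most_common(1)[0][0]
        return None

if __name__ == "__main__":
    m = ContextModel()
    for f in ["main.c", "util.h", "main.c", "util.h", "main.c"]:
        m.access(f)
    print("prefetch candidate:", m.predict())  # likely "util.h"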
2000
As the gap between processor and memory speeds increases, memory latencies have become a critical bottleneck for computer performance. To reduce this bottleneck, designers have created methods to hide these latencies. One popular method is prefetching, which fetches data from the memory system before the processor asks for it, in the expectation that it will soon be referenced. An effective prefetching scheme reduces cache miss rates and therefore hides the memory latency.
Journal of Algorithms, 2000
We provide a competitive analysis framework for online prefetching and buffer management algorithms in parallel I/O systems, using a read-once model of block references. This has widespread applicability to key I/O-bound applications such as external merging and concurrent playback of multiple video streams. Two realistic lookahead models, global lookahead and local lookahead, are defined. Algorithms NOM and GREED based on these two forms of lookahead are analyzed for shared buffer and distributed buffer configurations, both of which occur frequently in existing systems. An important aspect of our work is that we show how to implement both models of lookahead in practice using the simple techniques of forecasting and flushing. Given a D-disk parallel I/O system and a globally shared I/O buffer that can hold up to M disk blocks, we derive a lower bound of Ω(√D) on the competitive ratio of any deterministic online prefetching algorithm with O(M) lookahead. NOM is shown to match the lower bound using global M-block lookahead. In contrast, using only local lookahead results in an Ω(D) competitive ratio. When the buffer is distributed into D portions of M/D blocks each, GREED is shown to be asymptotically optimal.
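Why global lookahead buys disk parallelism can be seen in a toy simulation: with a window of future references in hand, each parallel I/O step can fetch, from every disk, the block of its own that is needed earliest. This is a sketch of the mechanism only, not the paper's exact NOM algorithm or its buffer accounting.

# Toy global-lookahead prefetching: count parallel I/O steps needed to
# serve a read-once reference string over n_disks disks.
def parallel_ios(refs, n_disks, lookahead):
    """refs: list of (disk, block) in demand order; returns I/O steps."""
    buffered, steps, i = set(), 0, 0
    while i < len(refs):
        if refs[i] in buffered:
            i += 1
            continue
        window = refs[i:i + lookahead]
        steps += 1
        fetched_disks = set()
        for ref in window:                  # earliest-needed per disk
            d, _ = ref
            if d not in fetched_disks and ref not in buffered:
                buffered.add(ref)
                fetched_disks.add(d)
    return steps

if __name__ == "__main__":
    refs = [(i % 3, i) for i in range(30)]      # round-robin over 3 disks
    print("demand, no lookahead:", parallel_ios(refs, 3, 1))
    print("global lookahead 9  :", parallel_ios(refs, 3, 9))

On this round-robin trace the lookahead version keeps all three disks busy and needs a third of the I/O steps.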
2015
Aggressive hardware prefetching is extremely beneficial for single-threaded performance but can lead to significant slowdowns in multicore processors due to oversubscription of off-chip bandwidth and shared cache capacity. This work addresses this problem by adjusting prefetching on a per-application basis to improve overall system performance. Unfortunately, an exhaustive search of all possible per-application prefetching combinations for a multicore workload is prohibitively expensive, even for small processors with only four cores. In this work we develop Perf-Insight, a simple, scalable mechanism for understanding and predicting the impact of any available hardware/software prefetching choices on applications' bandwidth consumption and performance. Our model considers the overall system bandwidth, the bandwidth sensitivity of each co-running application, and how applications' bandwidth usage and performance vary with prefetching choices. This allows us to profile individual applications and efficiently predict total system bandwidth and throughput. To make this practical, we develop a low-overhead profiling approach that scales linearly, rather than exponentially, with the number of cores, and allows us to profile applications while running in the mix. With Perf-Insight we are able to achieve an average weighted speedup of 21% for 14 mixes of 4 applications on commodity hardware, with no mix experiencing a slowdown. This is significantly better than hardware prefetching, which only achieves an average speedup of 9%, with three mixes experiencing slowdowns. Perf-Insight delivers performance very close to the best possible prefetch settings (22%). Our approach is simple, low-overhead, applicable to any collection of prefetching options and performance metric, and suitable for dynamic runtime use on commodity multicore systems.
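The prediction step can be illustrated with a deliberately crude model: profile each application's bandwidth demand and standalone speedup per prefetch setting, then estimate combined throughput by scaling everyone down when total demand exceeds system bandwidth. Everything below, the applications, the numbers, and the contention model, is an invented illustration, not Perf-Insight itself (which notably avoids the exhaustive search used here for a four-application toy).

# Toy per-application prefetch selection from per-setting profiles.
from itertools import product

# profile[app][setting] = (bandwidth demand GB/s, standalone speedup)
profile = {
    "lbm":  {"off": (4.0, 1.00), "aggressive": (9.0, 1.35)},
    "mcf":  {"off": (3.0, 1.00), "aggressive": (6.0, 1.20)},
    "gcc":  {"off": (1.0, 1.00), "aggressive": (2.0, 1.05)},
    "namd": {"off": (0.5, 1.00), "aggressive": (0.8, 1.02)},
}
SYSTEM_BW = 16.0    # GB/s, assumed

def predicted_weighted_speedup(choice):
    demand = sum(profile[a][s][0] for a, s in choice.items())
    scale = min(1.0, SYSTEM_BW / demand)     # crude contention model
    return sum(profile[a][s][1] * scale
               for a, s in choice.items()) / len(choice)

best = max(
    (dict(zip(profile, combo)) for combo in
     product(["off", "aggressive"], repeat=len(profile))),
    key=predicted_weighted_speedup,
)
print(best, round(predicted_weighted_speedup(best), 3))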
Distributed and Parallel Databases, 1993
Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies that base decisions only on on-line reference history, and that can be implemented efficiently. We also test the ability of those policies across a range of architectural parameters.
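One of the simplest policies in this practical, history-only family is sequential readahead: remember the last block read per file and, once two consecutive blocks are seen, hint the next one. The sketch below uses OS readahead hints as the fetch mechanism; the file name is hypothetical and the papers' actual policies and parameters differ.

# A one-block-lookahead sequential prefetcher driven only by
# on-line reference history.
import os

class SequentialPrefetcher:
    def __init__(self, fd, block):
        self.fd, self.block, self.last = fd, block, None

    def read(self, block_no):
        data = os.pread(self.fd, self.block, block_no * self.block)
        if self.last is not None and block_no == self.last + 1:
            if hasattr(os, "posix_fadvise"):   # hint: next block soon
                os.posix_fadvise(self.fd, (block_no + 1) * self.block,
                                 self.block, os.POSIX_FADV_WILLNEED)
        self.last = block_no
        return data

if __name__ == "__main__":
    fd = os.open("data.bin", os.O_RDONLY)      # hypothetical file
    r = SequentialPrefetcher(fd, 1 << 16)
    for b in range(4):
        r.read(b)              # sequential reads trigger readahead hints
    os.close(fd)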
International Journal of Communications, Network and System Sciences, 2015
In this paper, we present a comparative study of informed and predictive prefetching mechanisms that were proposed to bridge the performance gap between I/O storage systems and the CPU. In particular, we focus on transparent informed prefetching (TIP) and predictive prefetching using the probability graph approach (PG). Our main objective is to show the main features, motivations, and implementation overview of each mechanism. We also present a performance evaluation that compares the performance of both mechanisms under different cache sizes.
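The probability graph side of the comparison reduces to a small data structure: count how often each file is followed, within a short window, by another, and prefetch successors whose estimated probability clears a threshold. The sketch below is a minimal rendering of that idea; the window, threshold, and file names are assumptions for illustration.

# A minimal probability-graph (PG-style) successor predictor.
from collections import defaultdict, Counter

class ProbabilityGraph:
    def __init__(self, window=2, threshold=0.5):
        self.window, self.threshold = window, threshold
        self.recent = []
        self.edges = defaultdict(Counter)       # file -> successor counts

    def access(self, name):
        for prev in self.recent[-self.window:]:
            self.edges[prev][name] += 1
        self.recent.append(name)

    def candidates(self, name):
        total = sum(self.edges[name].values())
        return [f for f, c in self.edges[name].items()
                if total and c / total >= self.threshold]

if __name__ == "__main__":
    g = ProbabilityGraph()
    for f in ["make", "cc1", "as", "make", "cc1", "as"]:
        g.access(f)
    print("after 'make', prefetch:", g.candidates("make"))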
2009
Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of the state-of-the-art prefetching techniques, and proposes a taxonomy that classifies various design concerns in developing a prefetching strategy, especially for multicore processors. We compare various existing methods through analysis as well.
2008
Data prefetching has been considered an effective way to mask data access latency caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been proposed in the last few years to reduce data access latency by taking advantage of multi-core architectures. In this paper, we propose a taxonomy that classifies various design concerns in developing a prefetching strategy. We discuss various prefetching strategies and issues that have to be considered in designing a prefetching strategy for multi-core processors.
An analysis of performance characteristics of modern disks finds that prefetching can improve the performance of nonsequential read access patterns by an order of magnitude or more, far more than demonstrated by prior work. Using this analysis, we design prefetching algorithms that make effective use of primary memory, and can sometimes gain additional speedups by reading unneeded data. We show when additional prefetching memory is most critical for performance. A contention controller automatically adjusts prefetching memory usage, preserving the benefits of prefetching while sharing available memory with other applications. When implemented in a library with some kernel changes, our prefetching system improves performance for some workloads of the GIMP image manipulation program and the SQLite database by factors of 4.9x to 20x.
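The memory-budget idea described above, stage a bounded number of future blocks and top up as the application consumes them, can be sketched for a known nonsequential offset list. This shows the sliding-budget mechanism only; the paper's library also reorders requests, sometimes reads unneeded data, and adapts the budget via a contention controller. The path and parameters are assumptions.

# Deep prefetching for a known nonsequential read pattern, bounded by a
# fixed budget of staged blocks.
import os

def prefetching_reader(path, offsets, block=1 << 16, budget=32):
    fd = os.open(path, os.O_RDONLY)
    try:
        staged = 0
        for i, off in enumerate(offsets):
            while staged < min(len(offsets) - i, budget):
                ahead = offsets[i + staged]
                if hasattr(os, "posix_fadvise"):
                    os.posix_fadvise(fd, ahead, block,
                                     os.POSIX_FADV_WILLNEED)
                staged += 1
            yield os.pread(fd, block, off)
            staged -= 1                 # one staged block consumed
    finally:
        os.close(fd)

# Usage: for chunk in prefetching_reader("db.bin", offsets): process(chunk)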
1996
The I/O performance of applications in multiple-disk systems can be improved by overlapping disk accesses. This requires the use of appropriate prefetching and buffer management algorithms that ensure the most useful blocks are accessed and retained in the buffer. In this paper, we answer several fundamental questions on prefetching and buffer management for distributed-buffer parallel I/O systems. First, we derive and prove the optimality of an algorithm, P-min, that minimizes the number of parallel I/Os. Second, we analyze P-con, an algorithm that always matches its replacement decisions with those of the well-known demand-paged MIN algorithm. We show that P-con can become fully sequential in the worst case. Third, we investigate the behavior of on-line algorithms for multiple-disk prefetching and buffer management. We define and analyze P-lru, a parallel version of the traditional LRU buffer management algorithm. Unexpectedly, we find that the competitive ratio of P-lru ...
2008
Parallel I/O prefetching is considered to be effective in improving I/O performance. However, the effectiveness depends on determining patterns among future I/O accesses swiftly and fetching data in time, which is difficult to achieve in general. In this study, we propose an I/O signature-based prefetching strategy. The idea is to use a predetermined I/O signature of an application to guide prefetching. To put this idea to work, we first derived a classification of patterns and introduced a simple and effective signature notation to represent patterns. We then developed a toolkit to trace and generate I/O signatures automatically. Finally, we designed and implemented a thread-based client-side collective prefetching cache layer for MPI-IO library to support prefetching. A prefetching thread reads I/O signatures of an application and adjusts them by observing I/O accesses at runtime. Experimental results show that the proposed prefetching method improves I/O performance significantly for applications with complex patterns.
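The signature idea lends itself to a compact sketch: a small record describing a pattern is expanded into future offsets at run time, and a check against observed accesses decides whether the signature still applies before prefetching continues. The notation below is invented for illustration and is not the paper's exact signature format or toolkit output.

# An I/O-signature sketch: expand a strided signature into offsets and
# validate it against run-time observations before prefetching ahead.
from dataclasses import dataclass

@dataclass
class IOSignature:
    kind: str        # e.g. "strided"
    start: int       # first offset in bytes
    stride: int      # distance between requests
    size: int        # bytes per request
    count: int       # number of requests

    def offsets(self):
        return [self.start + i * self.stride for i in range(self.count)]

    def matches(self, observed, tolerance=0.9):
        expected = set(self.offsets()[:len(observed)])
        hits = sum(1 for off in observed if off in expected)
        return bool(observed) and hits / len(observed) >= tolerance

if __name__ == "__main__":
    sig = IOSignature("strided", start=0, stride=1 << 20, size=1 << 16,
                      count=100)
    seen = [0, 1 << 20, 2 << 20]        # offsets observed at run time
    if sig.matches(seen):
        print("prefetch next:", sig.offsets()[len(seen):len(seen) + 4])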
Lecture Notes in Computer Science, 2006
External Memory models, most notable being the I-O Model [3], capture the effects of memory hierarchy and aid in algorithm design. More than a decade of architectural advancements have led to new features not captured in the I-O model, most notably the prefetching capability. We propose a relatively simple Prefetch model that incorporates data prefetching in the traditional I-O models and show how to design algorithms that can attain close to peak memory bandwidth. Unlike (the inverse of) memory latency, memory bandwidth is much closer to the processing speed; intelligent use of prefetching can therefore considerably mitigate the I-O bottleneck. For some fundamental problems, our algorithms attain running times approaching that of the idealized Random Access Machines under reasonable assumptions. Our work also explains the significantly superior performance of the I-O efficient algorithms in systems that support prefetching compared to ones that do not.
Proceedings of the 27th international ACM conference on International conference on supercomputing, 2013
Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, hardware prefetchers can track only a limited number of data streams due to finite hardware resources. In this paper, we introduce the term streaming concurrency to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm for understanding the streaming concurrency at any point in an application, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. Next, we examine the causes behind poor prefetching performance. We identify four prefetch-unfriendly conditions and show how to classify an application's memory references based on these conditions. We evaluate our analysis using the SPEC CPU2006 benchmark suite. We select two benchmarks with unfavorable access patterns and transform them to improve their prefetching effectiveness. Results show that making applications more prefetcher-friendly can yield meaningful performance gains.
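A toy version of the streaming-concurrency measurement: walk an address trace, extend a tracked stream when a reference continues it, start a new stream otherwise, and report how many streams were alive at once. The matching rule and window below are simplifications of the paper's simulation algorithm, included only to make the metric concrete.

# Count concurrent logical streams in an address trace.
def streaming_concurrency(trace, max_gap=2):
    streams = {}                 # stream id -> next expected address
    peak, next_id = 0, 0
    for addr in trace:
        for sid, expected in streams.items():
            if 0 <= addr - expected < max_gap:
                streams[sid] = addr + 1       # reference extends stream
                break
        else:
            streams[next_id] = addr + 1       # new logical stream
            next_id += 1
        peak = max(peak, len(streams))
    return peak

if __name__ == "__main__":
    # two interleaved sequential streams: A0, B0, A1, B1, ...
    trace = [a for pair in zip(range(0, 8), range(100, 108)) for a in pair]
    print("concurrent streams:", streaming_concurrency(trace))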
Proceedings of the 21st international conference on Parallel architectures and compilation techniques, 2012
Memory access latency is the primary performance bottleneck in modern computer systems. Prefetching data before it is needed by a processing core allows substantial performance gains by overlapping significant portions of memory latency with useful work. Prior work has investigated this technique and measured potential performance gains in a variety of scenarios. However, its use in speeding up Hardware Transactional Memory (HTM) has remained hitherto unexplored. In several HTM designs transactions invalidate speculatively updated cache lines when they abort. Such cache lines tend to have high locality and are likely to be accessed again when the transaction re-executes. Coarse grained transactions that update relatively large amounts of data are particularly susceptible to performance degradation even under moderate contention. However, such transactions show strong locality of reference, especially when contention is high. Prefetching cache lines with high locality can, therefore, improve overall concurrency by speeding up transactions and, thereby, narrowing the window of time in which such transactions persist and can cause contention. Such transactions are important since they are likely to form a common TM use-case. We note that traditional prefetch techniques may not be able to track such lines adequately or issue prefetches quickly enough. This paper investigates the use of prefetching in HTMs, proposing a simple design to identify and request prefetch candidates, and measures potential performance gains to be had for several representative TM workloads.