Proceedings of the 48th International Symposium on Microarchitecture, 2015
We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits. We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51-122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3-18×.
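To make the execution model concrete, the following is a minimal sequential sketch of timestamp-ordered tasks in the spirit of Swarm. The `TaskUnit`, `enqueue` signature, and toy example are illustrative assumptions, not Swarm's actual ISA; the hardware runs such tasks speculatively and out of order rather than strictly in timestamp order as done here.

```cpp
// Software stand-in for Swarm's programming model: short tasks carry
// programmer-assigned timestamps and may enqueue further timestamped tasks;
// this sketch simply executes them in timestamp order.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

using Task = std::function<void()>;

struct Entry {
    uint64_t ts;   // program-assigned timestamp
    Task work;
};
struct LaterFirst {  // makes the priority queue a min-heap on timestamp
    bool operator()(const Entry& a, const Entry& b) const { return a.ts > b.ts; }
};

struct TaskUnit {
    std::priority_queue<Entry, std::vector<Entry>, LaterFirst> pq;

    void enqueue(uint64_t ts, Task t) { pq.push({ts, std::move(t)}); }

    void run() {
        while (!pq.empty()) {
            Entry e = pq.top();  // copy out, then pop
            pq.pop();
            e.work();            // a task may enqueue new, later tasks
        }
    }
};

int main() {
    TaskUnit tu;
    // Each task does a little work and schedules a follow-up at a later
    // timestamp, mimicking the short ordered tasks Swarm targets.
    std::function<void(uint64_t)> step = [&](uint64_t ts) {
        std::printf("task at timestamp %llu\n", (unsigned long long)ts);
        if (ts < 5) tu.enqueue(ts + 1, [&, ts] { step(ts + 1); });
    };
    tu.enqueue(0, [&] { step(0); });
    tu.run();
}
```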
IEEE Micro, 2016
2016
Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement. While locality-aware parallelism is well studied in non-speculative systems, it has received little attention in speculative systems (e.g., HTM or TLS), which hinders their scalability. We present spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. A hint is an abstract integer, given when a speculative task is created, that denotes the data the task is likely to access. We show it is easy to modify programs to convey locality through hints. We design simple hardware techniques that allow a state-of-the-art, tiled speculative architecture to exploit hints by: (i) running tasks likely to access the same data on the same tile, (ii) serializing tasks likely to conflict, and (iii) balancing tasks across tiles in a locality-aware fashion. We also show that programs can often be restructured to make hints more effective. Together, these techniques make speculative parallelism practical on large-scale systems: at 256 cores, hints achieve near-linear scalability on nine challenging applications, improving performance over hint-oblivious scheduling by 3.3× (gmean) and by up to 16×. Hints also make speculation far more efficient, reducing wasted work by 6.4× and traffic by 3.5× on average.
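A small sketch of the software-visible side of this idea follows: the programmer attaches an abstract integer (e.g., the id of the vertex a task will update) at task creation, and a hint-aware scheduler maps equal hints to the same tile. The hashing scheme and `NUM_TILES` are illustrative assumptions, not the hardware mechanism described in the paper.

```cpp
// Spatial hints in software terms: same hint -> same tile (locality and
// serialization of likely-conflicting tasks); different hints spread across
// tiles. Placement policy and constants here are illustrative only.
#include <cstdint>
#include <functional>

constexpr unsigned NUM_TILES = 16;

// Simple 64-bit mix hash; any hash that spreads hints evenly would do.
inline uint64_t mix(uint64_t h) {
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
    return h ^ (h >> 33);
}

inline unsigned tile_for_hint(uint64_t hint) {
    return static_cast<unsigned>(mix(hint) % NUM_TILES);
}

// What the programmer supplies at task creation; in a real system this maps
// to the task-enqueue operation rather than a plain struct.
struct TaskDesc {
    uint64_t timestamp;
    uint64_t hint;                  // e.g., destination vertex id
    std::function<void()> work;
};

inline unsigned place(const TaskDesc& t) { return tile_for_hint(t.hint); }
```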
2007
Task-selection policies are critical to the performance of any architecture that uses speculation to extract parallel tasks from a sequential thread. This paper demonstrates that the immediate postdominators of conditional branches provide a larger set of parallel tasks than existing task-selection heuristics, which are limited to programming language constructs (such as loops or procedure calls). Our evaluation shows that postdominance-based task selection achieves, on average, more than double the speedup of the best individual heuristic, and 33% more speedup than the best combination of heuristics.
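As a small illustration of the heuristic (not output from the paper's selection algorithm): for a plain if/else, which loop- and call-based heuristics ignore, the immediate postdominator of the branch is the join point, and that is where a postdominance-based selector can cut a speculative task.

```cpp
// The comments mark control-flow roles; the task boundary shown is the
// general idea of postdominance-based selection on a branch.
#include <cstdio>

int classify(int x) {
    int result;
    if (x % 2 == 0) {            // conditional branch B
        result = x / 2;          //   then-block (control-dependent on B)
    } else {
        result = 3 * x + 1;      //   else-block (control-dependent on B)
    }
    // Immediate postdominator of B: every path from B reaches this point.
    // A postdominance-based selector can spawn the continuation from here as
    // a speculative task, overlapping it with the branch region above.
    std::printf("classified %d -> %d\n", x, result);
    return result;
}
```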
SIGPLAN Notices, 2007
Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and runtime speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications.
HotPar'10: Proceedings …, 2010
Multicore processors have forced mainstream programmers to rethink the way they design software. Parallelism will be the avenue for performance gains in these multicore processors but will require new tools and methodologies to gain full acceptance by everyday programmers. As a step towards improved parallelization tools, we propose a parallelization taxonomy that categorizes tools based on which of five fundamental stages of parallelization they assist with. Based on this taxonomy, we find that many popular parallelization tools focus on the final stages, leaving the programmer to perform the initial stages without assistance. In this paper we provide a preliminary description of pyrprof, a tool that helps the programmer locate parallel regions of code and decide which regions to parallelize first. pyrprof performs dynamic critical path analysis and utilizes the structure of programs to highlight exploitable forms of parallelism. A case study based on MPEG encoding is used to demonstrate pyrprof's effectiveness.
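The analysis pyrprof builds on is dynamic critical-path analysis. The sketch below shows the generic computation underneath it, the longest cost-weighted path through a DAG of program regions, and is not pyrprof's profiler machinery; node costs and the example graph are made up for illustration.

```cpp
// Critical path of a weighted region DAG: work off that path is parallelism
// a profiling tool can point the programmer at.
#include <cstdint>
#include <cstdio>
#include <queue>
#include <vector>
#include <algorithm>

struct Dag {
    std::vector<uint64_t> cost;          // cost[v] = work in region v
    std::vector<std::vector<int>> succ;  // succ[v] = regions that depend on v
};

uint64_t critical_path(const Dag& g) {
    const int n = static_cast<int>(g.cost.size());
    std::vector<int> indeg(n, 0);
    for (int v = 0; v < n; ++v)
        for (int s : g.succ[v]) ++indeg[s];

    // finish[v] = earliest completion time of v with unlimited parallelism.
    std::vector<uint64_t> finish(n, 0);
    std::queue<int> ready;
    for (int v = 0; v < n; ++v)
        if (indeg[v] == 0) { finish[v] = g.cost[v]; ready.push(v); }

    uint64_t longest = 0;
    while (!ready.empty()) {
        int v = ready.front(); ready.pop();
        longest = std::max(longest, finish[v]);
        for (int s : g.succ[v]) {
            finish[s] = std::max(finish[s], finish[v] + g.cost[s]);
            if (--indeg[s] == 0) ready.push(s);
        }
    }
    return longest;  // total work / critical path = available parallelism
}

int main() {
    // Diamond: 0 -> {1,2} -> 3. Critical path = 5 + 20 + 5 = 30; total work 40.
    Dag g{{5, 10, 20, 5}, {{1, 2}, {3}, {3}, {}}};
    std::printf("critical path = %llu\n", (unsigned long long)critical_path(g));
}
```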
2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2021
Enabling efficient fine-grained task parallelism is a significant challenge for hardware platforms with increasingly many cores. Existing techniques do not scale to hundreds of threads due to the high cost of synchronization in concurrent data structures. To overcome these limitations we present XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. We demonstrate the scalability of XQueue using microbenchmarks and show that XQueue can deliver concurrent operations with latencies as low as 110 cycles at scales of up to 192 cores (up to 6900× improvement compared to traditional synchronization mechanisms) across diverse hardware, including x86, ARM, and Power9. The reduced latency allows XQueue to provide orders of magnitude (3300×) better throughput than existing techniques. To evaluate the real-world benefits of XQueue, we integrated XQueue with LLVM OpenMP and evaluated five unmodified benchmarks from the Barcelona OpenMP Task Suite (BOTS) as well as a graph traversal benchmark from the GAP benchmark suite. We compared the XQueue-enabled LLVM OpenMP implementation with the native LLVM and GNU OpenMP versions. Using fine-grained task workloads, XQueue can deliver 4× to 6× speedup compared to native GNU OpenMP and LLVM OpenMP in many cases, with speedups as high as 116× in some cases.
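The sketch below only illustrates the relaxed-ordering idea such designs exploit: splitting one logical queue into many lanes so concurrent threads rarely touch the same lane, at the cost of strict FIFO order across lanes. XQueue achieves this lock-lessly; the per-lane mutexes here are purely to keep the sketch short and obviously correct.

```cpp
// A "relaxed FIFO": round-robin lane selection reduces contention, but items
// from different lanes may be dequeued out of global enqueue order.
#include <array>
#include <atomic>
#include <deque>
#include <mutex>
#include <optional>

template <typename T, size_t LANES = 64>
class RelaxedQueue {
    struct Lane {
        std::mutex m;
        std::deque<T> items;
    };
    std::array<Lane, LANES> lanes_;
    std::atomic<size_t> enq_{0}, deq_{0};  // round-robin lane pickers

public:
    void enqueue(T value) {
        Lane& lane = lanes_[enq_.fetch_add(1, std::memory_order_relaxed) % LANES];
        std::lock_guard<std::mutex> g(lane.m);
        lane.items.push_back(std::move(value));
    }

    // Returns an item from some lane, or nothing if all lanes are empty.
    std::optional<T> dequeue() {
        size_t start = deq_.fetch_add(1, std::memory_order_relaxed);
        for (size_t i = 0; i < LANES; ++i) {
            Lane& lane = lanes_[(start + i) % LANES];
            std::lock_guard<std::mutex> g(lane.m);
            if (!lane.items.empty()) {
                T v = std::move(lane.items.front());
                lane.items.pop_front();
                return v;
            }
        }
        return std::nullopt;
    }
};
```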
2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012
Many sparse or irregular scientific computations are memory bound and benefit from locality improving optimizations such as blocking or tiling. These optimizations result in asynchronous parallelism that can be represented by arbitrary task graphs. Unfortunately, most popular parallel programming models with the exception of Threading Building Blocks (TBB) do not directly execute arbitrary task graphs. In this paper, we compare the programming and execution of arbitrary task graphs qualitatively and quantitatively in TBB, the OpenMP doall model, the OpenMP 3.0 task model, and Cilk Plus. We present performance and scalability results for 8 and 40 core shared memory systems on a sparse matrix iterative solver and a molecular dynamics benchmark.
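Since the abstract's point is that TBB can execute arbitrary task graphs directly, a minimal diamond-shaped graph using TBB's flow graph interface is shown below. This is one way TBB expresses such graphs; the paper's solver and molecular dynamics benchmarks may use different node bodies or a different TBB mechanism, and the bodies here are placeholders.

```cpp
// A -> {B, C} -> D expressed directly as a TBB flow graph: B and C may run in
// parallel; D waits for both.
#include <tbb/flow_graph.h>
#include <cstdio>

int main() {
    using namespace tbb::flow;
    graph g;

    continue_node<continue_msg> a(g, [](const continue_msg&) -> continue_msg {
        std::puts("A"); return {}; });
    continue_node<continue_msg> b(g, [](const continue_msg&) -> continue_msg {
        std::puts("B"); return {}; });
    continue_node<continue_msg> c(g, [](const continue_msg&) -> continue_msg {
        std::puts("C"); return {}; });
    continue_node<continue_msg> d(g, [](const continue_msg&) -> continue_msg {
        std::puts("D"); return {}; });

    make_edge(a, b);
    make_edge(a, c);
    make_edge(b, d);
    make_edge(c, d);

    a.try_put(continue_msg());  // trigger the root
    g.wait_for_all();
    return 0;
}
```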
2009
Most client-side applications running on multicore processors are likely to be irregular programs that deal with complex, pointer-based data structures such as large sparse graphs and trees. However, we understand very little about the nature of parallelism in irregular algorithms, let alone how to exploit it effectively on multicore processors.
2010
Making efficient use of modern multi-core and future many-core CPUs is a major challenge. We describe a new compiler-based platform, Prospect, that supports the parallelization of sequential applications. The underlying approach generalizes an existing approach for parallelizing runtime checks. The basic idea is to generate two variants of the application: (1) a fast variant having bare-bone functionality, and (2) a slow variant with extra functionality. The fast variant is executed sequentially.
2000
Traditional parallel compilers do not effectively parallelize irregular applications because they contain little loop-level parallelism. We explore Speculative Task Parallelism (STP), where tasks are full procedures and entire natural loops. Through profiling and compiler analysis, we find tasks that are speculatively memory- and control-independent of their neighboring code. Via speculative futures, these tasks may be executed in parallel with preceding code when there is a high probability of independence.
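The sketch below shows only the execution shape STP targets, using an ordinary C++ future: a procedure from later in program order is launched early so it overlaps with the code that precedes it. STP does this speculatively, with profiling and a recovery path when independence is violated; `std::async` can only express the case where independence is already guaranteed, and the function and data here are made up for illustration.

```cpp
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// The "task": a full procedure, as in STP.
static long summarize(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}

int main() {
    std::vector<int> stats(1'000'000, 1);

    // Launch the procedure as a future *before* the preceding code runs...
    std::future<long> f =
        std::async(std::launch::async, summarize, std::cref(stats));

    // ...so this independent preceding work overlaps with it.
    long other_work = 0;
    for (int i = 0; i < 1'000'000; ++i) other_work += i % 7;

    // Original call site: consume the result here.
    std::printf("sum=%ld other=%ld\n", f.get(), other_work);
}
```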
Computer, 2006
Some emerging technologies try to exploit the parallel capabilities of modern processors.
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020
Proceedings 28th Annual International Symposium on Computer Architecture, 2001
Speculative thread-level parallelization is a promising way to speed up codes that compilers fail to parallelize. While several speculative parallelization schemes have been proposed for different machine sizes and types of codes, the results so far show that it is hard to deliver scalable speedups. Often, the problem is not true dependence violations, but sub-optimal architectural design. Consequently, we attempt to identify and eliminate major architectural bottlenecks that limit the scalability of speculative parallelization. The solutions that we propose are: low-complexity commit in constant time to eliminate the task commit bottleneck, a memory-based overflow area to eliminate stalls due to speculative buffer overflow, and exploiting high-level access patterns to minimize speculation-induced traffic. To show that the resulting system is truly scalable, we perform simulations with up to 128 processors. With our optimizations, the speedups for 128 and 64 processors reach 63 and 48, respectively. The average speedup for 64 processors is 32, nearly four times higher than without our optimizations.
IEEE Transactions on Computers, 2014
Speculative parallelization is a runtime technique that optimistically executes sequential code in parallel, checking that no dependence violations arise. In the case of a dependence violation, all mechanisms proposed so far either switch to sequential execution, or conservatively stop and restart the offending thread and all its successors, potentially discarding work that does not depend on this particular violation. In this work we systematically explore the design space of solutions for this problem, proposing a new mechanism that reduces the number of threads that must be restarted when a data dependence violation is found. Our new solution, called exclusive squashing, keeps track of inter-thread dependencies at runtime, selectively stopping and restarting offending threads, together with all threads that have consumed data from them. We have compared this new approach with existing solutions on a real system, executing different applications whose loops are not analyzable at compile time and exhibit inter-thread dependence violations in as much as 10% of cases at runtime. Our experimental results show a relative performance improvement of up to 14%, together with a reduction of one-third in the number of squashed threads. The speculative parallelization scheme and benchmarks described in this paper are available upon request.
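A short sketch of the bookkeeping behind selective squashing follows: record which threads consumed values produced by which other threads, and when a violation is detected, squash only the offender and the transitive set of its consumers. The data structures and names are illustrative, not the paper's runtime.

```cpp
#include <queue>
#include <set>
#include <unordered_map>
#include <unordered_set>

using ThreadId = int;

struct DependenceLog {
    // consumers[p] = threads that read data speculatively produced by p.
    std::unordered_map<ThreadId, std::unordered_set<ThreadId>> consumers;

    void record_forwarding(ThreadId producer, ThreadId consumer) {
        consumers[producer].insert(consumer);
    }

    // Threads to restart when `offender` violated a dependence: the offender
    // plus everything that (transitively) consumed its data; all other
    // threads keep running.
    std::set<ThreadId> squash_set(ThreadId offender) const {
        std::set<ThreadId> to_squash{offender};
        std::queue<ThreadId> frontier;
        frontier.push(offender);
        while (!frontier.empty()) {
            ThreadId t = frontier.front();
            frontier.pop();
            auto it = consumers.find(t);
            if (it == consumers.end()) continue;
            for (ThreadId c : it->second)
                if (to_squash.insert(c).second) frontier.push(c);
        }
        return to_squash;
    }
};
```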
Lecture Notes in Computer Science, 2015
Scheduling is one of the factors that most directly affect performance in Thread-Level Speculation (TLS). Since loops may present dependences that cannot be predicted before runtime, finding a good chunk size is not a simple task. The most used mechanism, Fixed-Size Chunking (FSC), requires many "dry-runs" to set the optimal chunk size. If the loop does not present dependence violations at runtime, scheduling only needs to deal with load balancing issues. For loops where the general pattern of dependences is known, as is the case with Randomized Incremental Algorithms, specialized mechanisms have been designed to maximize performance. To make TLS available to a wider community, a general scheduling algorithm is needed that requires neither a priori knowledge of the expected dependence pattern nor previous dry-runs to adjust any parameter. In this paper, we present an algorithm that estimates at runtime the best size of the next chunk to be scheduled. The algorithm builds on our previous experience in the design and testing of other scheduling mechanisms, and it has a solid mathematical basis. The result is a method that, using information from the execution of the previous chunks, decides the size of the next chunk to be scheduled. Our experimental results show that the proposed scheduling function matches or even exceeds the performance obtained with FSC, while greatly reducing the need for a costly and careful search for the best fixed chunk size.
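The abstract does not reproduce the paper's estimator, so the sketch below only shows the shape of such a runtime controller: after each chunk commits or is squashed, feed the outcome back into the size of the next chunk. The grow/shrink rule is a deliberately naive placeholder, not the mathematically derived function the paper proposes.

```cpp
#include <algorithm>
#include <cstddef>

struct ChunkSizer {
    std::size_t next = 16;  // initial chunk size (arbitrary starting point)

    // Called after each chunk finishes. `violated` = the chunk was squashed by
    // a dependence violation; `iters_done` = iterations it completed first.
    void feedback(bool violated, std::size_t iters_done) {
        if (violated)
            next = std::max<std::size_t>(1, iters_done / 2);   // back off
        else
            next = std::min<std::size_t>(4096, next * 2);      // grow while conflict-free
    }
};
```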
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2007
Implicit Parallelism with Ordered Transactions (IPOT) is an extension of sequential or explicitly parallel programming models to support speculative parallelization. The key idea is to specify opportunities for parallelization in a sequential program using annotations similar to transactions. Unlike explicit parallelism, IPOT annotations do not require the absence of data dependence, since the parallelization relies on runtime support for speculative execution. IPOT as a parallel programming model is determinate, i.e., program semantics are independent of the thread scheduling. For optimization, non-determinism can be introduced selectively. We describe the programming model of IPOT and an online tool that recommends boundaries of ordered transactions by observing a sequential execution. On three example HPC workloads we demonstrate that our method is effective in identifying opportunities for fine-grain parallelization. Using the automated task recommendation tool, we were able to perform the parallelization of each program within a few hours.
2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017
This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts pressure on speculative execution resources. These pathologies often squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy that addresses these pathologies. By coordinating instruction dispatch and conflict resolution priorities, SAM focuses execution resources on work that is more likely to commit, avoiding aborts and using speculation resources more efficiently. We design SAM variants for in-order and out-of-order cores. SAM is cheap to implement and makes multithreaded cores much more beneficial on speculative parallel programs. We evaluate SAM on systems with up to 64 SMT cores. With SAM, 8-threaded cores outperform single-threaded cores by 2.33× on average, while a speculation-oblivious policy yields a 1.85× speedup. SAM also reduces wasted work by 52%.
International Journal of Parallel Programming, 2015
The Single-Source Shortest Path (SSSP) problem arises in many different fields. In this paper, we present a GPU SSSP algorithm implementation. Our work significantly speeds up the computation of the SSSP, not only with respect to a CPU-based version, but also to other state-of-the-art GPU implementations based on Dijkstra. Both GPU implementations have been evaluated using the latest NVIDIA architectures. The graphs chosen as input sets vary in nature, size, and fan-out degree, in order to evaluate the behavior of the algorithms for different data classes. Additionally, we have enhanced our GPU algorithm implementation using two optimization techniques: a proper choice of thread-block size, and the modification of the GPU L1 cache memory state of NVIDIA devices. These optimizations lead to performance improvements of up to 23% with respect to the non-optimized versions. In addition, we compare several NVIDIA boards to determine which is best suited to each class of graphs, depending on their features. Finally, we compare our results with an optimized sequential implementation of Dijkstra's algorithm included in the reference Boost library, obtaining an improvement ratio of up to 19× for some graph families, using less memory space.
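For reference, a textbook binary-heap Dijkstra in plain C++ is sketched below as a stand-in for the sequential CPU baseline (the paper compares against Boost's optimized implementation, not this code). It only pins down what the GPU kernels compute: single-source distances over a weighted adjacency list.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<int, uint64_t>;                // (neighbor, weight)
constexpr uint64_t INF = std::numeric_limits<uint64_t>::max();

std::vector<uint64_t> dijkstra(const std::vector<std::vector<Edge>>& adj, int src) {
    std::vector<uint64_t> dist(adj.size(), INF);
    using State = std::pair<uint64_t, int>;           // (distance, vertex)
    std::priority_queue<State, std::vector<State>, std::greater<State>> pq;
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        auto [d, v] = pq.top();
        pq.pop();
        if (d != dist[v]) continue;                   // stale queue entry
        for (auto [u, w] : adj[v]) {
            if (d + w < dist[u]) {                    // relax edge (v, u)
                dist[u] = d + w;
                pq.push({dist[u], u});
            }
        }
    }
    return dist;
}
```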