Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2014
Modern enterprise applications represent an emergent application arena that requires the processing of queries and computations over massive amounts of data. Large-scale, multi-GPU cluster systems potentially present a vehicle for major improvements in throughput and consequently overall performance. However, throughput improvement using GPUs is challenged by the distinctive memory and computational characteristics of Relational Algebra (RA) operators that are central to queries for answering business questions. This paper introduces the design, implementation, and evaluation of Red Fox, a compiler and runtime infrastructure for executing relational queries on GPUs. Red Fox comprises i) a language front-end for LogiQL, a commercial query language, ii) an RA-to-GPU compiler, iii) optimized GPU implementations of RA operators, and iv) a supporting runtime. We report the performance on the full set of industry-standard TPC-H queries on a single-node GPU. Compared with a commercial LogiQL system implementation optimized for a state-of-the-art CPU machine, Red Fox is on average 6.48x faster, including PCIe transfer time. We point out key bottlenecks, propose potential solutions, and analyze the GPU implementation of these queries. To the best of our knowledge, this is the first reported end-to-end compilation and execution infrastructure that supports the full set of TPC-H queries on commodity GPUs.
A geodetic software analysis tool enables the user to analyze 2D crustal strain from geodetic ground motion, and create models of crustal deformation using a graphical interface. Users can use any geodetic measurements of ground motion and derive the 2D crustal strain interactively. This software also provides a forward-modeling tool that calculates a geodetic velocity and strain field for a given fault model, and lets the user compare the modeled strain field with the strain field obtained from the user's data. Users may change parameters on the fly and obtain a real-time recalculation of the resulting strain field. Four data products are computed: maximum shear, dilatation, shear angle, and principal components. The current view and data dependencies are processed first. The remaining data products and views are then computed in a round-robin fashion to anticipate view changes. When an analysis or display parameter is changed, the affected data products and views are invalidated a...
Concurrency and Computation: Practice and Experience, 2016
Suffix arrays are fundamental full-text index data structures of importance to a broad spectrum of applications in such fields as bioinformatics, Burrows-Wheeler Transform (BWT)-based lossless data compression, and information retrieval. In this work, we propose and implement two massively parallel approaches on the GPU based on two classes of suffix array construction algorithms. The first, parallel skew, makes algorithmic improvements to the previous work of Deo and Keely to achieve a speedup of 1.45x over their work. The second, a hybrid skew and prefix-doubling implementation, is the first of its kind on the GPU and achieves a speedup of 2.3-4.4x over Osipov's prefix-doubling and 2.4-7.9x over our skew implementation on large datasets. Our implementations rely on two efficient parallel primitives, a merge and a segmented sort. We theoretically analyze the two formulations of suffix array construction algorithms and show performance comparisons on a large variety of practical inputs. We conclude that, with the novel use of our efficient segmented sort, prefix-doubling is more competitive than skew on the GPU. We also demonstrate the effectiveness of our methods in our implementations of the Burrows-Wheeler transform and in a parallel FM-index for pattern searching.
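To make the prefix-doubling idea concrete, here is a minimal serial sketch in Python. It is not the GPU implementation described in the abstract: the function name `suffix_array_prefix_doubling` is an assumption for illustration, and it re-sorts all suffixes in every round with Python's built-in sort, whereas the abstract describes using a segmented sort.

```python
def suffix_array_prefix_doubling(text: str) -> list[int]:
    """Build a suffix array by prefix doubling (serial illustration only)."""
    n = len(text)
    if n == 0:
        return []
    # Round 0: rank each suffix by its first character.
    rank = [ord(c) for c in text]
    sa = list(range(n))
    k = 1  # suffixes are currently ranked by their first k characters
    while True:
        def key(i: int) -> tuple[int, int]:
            # Pair of ranks covering the first 2k characters of suffix i;
            # -1 marks "past the end", which sorts before any real rank.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Re-rank: suffixes with equal key pairs share a rank.
        new_rank = [0] * n
        for idx in range(1, n):
            prev, cur = sa[idx - 1], sa[idx]
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
        k *= 2
    return sa
```

Each round doubles the prefix length that the ranks account for, so at most about log2(n) rounds are needed. A segmented sort would restrict each round's sorting to the groups of suffixes that still share a rank, which this sketch does not attempt.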
We implement two classes of suffix array construction algorithms on the GPU. The first, skew, makes algorithmic improvements to the previous work of Deo and Keely to achieve a speedup of 1.45x over their work. The second, a hybrid skew and prefix-doubling implementation, is the first of its kind on the GPU and achieves a speedup of 2.3-4.4x over Osipov's prefix-doubling and 2.4-7.9x over our skew implementation on large datasets. Our implementations rely on two efficient parallel primitives, a merge and a segmented sort. We also demonstrate the effectiveness of our implementations in a Burrows-Wheeler transform and a parallel FM-index for pattern searching.
2015 IEEE International Parallel and Distributed Processing Symposium, 2015
Irregular computations on large workloads are a necessity in many areas of computational science. Mapping these computations to modern parallel architectures, such as GPUs, is particularly challenging because the performance often depends critically on the choice of data structure and algorithm. In this paper, we develop a parallel processing scheme, based on Merge Path partitioning, to compute segmented row-wise operations on sparse matrices that exposes parallelism at the granularity of individual nonzero entries. Our decomposition achieves competitive performance across many diverse problems while maintaining predictable behavior that depends only on the computational work, and it ameliorates the impact of irregularity. We evaluate the performance of three sparse kernels: SpMV, SpAdd and SpGEMM. We show that our processing scheme for each kernel yields comparable performance to other schemes in many cases, and that our performance is highly correlated (nearly 1) with the computational work, irrespective of the underlying structure of the matrices.
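As an illustration of how merge-path partitioning can balance a row-wise sparse operation, the following serial Python sketch applies it to SpMV on a CSR matrix. It is an assumption-laden sketch, not the paper's GPU code: the names `merge_path_search` and `spmv_merge_path` and the `num_workers` parameter are hypothetical, and the "workers" run one after another in a loop. The idea is to treat the computation as a merge of the row-end offsets with the nonzero indices, then give every worker an equal-length slice of that merge path.

```python
def merge_path_search(diagonal, row_end, nnz):
    """Find where the merge path crosses a diagonal.

    Returns (rows_consumed, nonzeros_consumed) with
    rows_consumed + nonzeros_consumed == diagonal.
    """
    lo = max(diagonal - nnz, 0)
    hi = min(diagonal, len(row_end))
    while lo < hi:
        mid = (lo + hi) // 2
        # Row `mid` ends before nonzero (diagonal - mid - 1) is reached,
        # so the path has already consumed that row end.
        if row_end[mid] <= diagonal - mid - 1:
            lo = mid + 1
        else:
            hi = mid
    return lo, diagonal - lo


def spmv_merge_path(row_offsets, cols, vals, x, num_workers):
    """Compute y = A @ x for a CSR matrix, splitting work by merge path."""
    num_rows = len(row_offsets) - 1
    nnz = len(vals)
    row_end = row_offsets[1:]
    path_len = num_rows + nnz               # one step per row end or nonzero
    per_worker = -(-path_len // num_workers)  # ceiling division
    y = [0.0] * num_rows
    carries = []  # (row, partial sum) for the row each worker left unfinished
    for w in range(num_workers):
        d0 = min(w * per_worker, path_len)
        d1 = min(d0 + per_worker, path_len)
        row, k = merge_path_search(d0, row_end, nnz)
        row_stop, k_stop = merge_path_search(d1, row_end, nnz)
        total = 0.0
        # Rows whose end falls inside this worker's slice are completed here.
        while row < row_stop:
            while k < row_end[row]:
                total += vals[k] * x[cols[k]]
                k += 1
            y[row] = total
            total = 0.0
            row += 1
        # Leftover nonzeros belong to a row that a later worker completes.
        while k < k_stop:
            total += vals[k] * x[cols[k]]
            k += 1
        carries.append((row_stop, total))
    # Fix-up pass: add each worker's partial sum to the row it spilled into.
    for row, partial in carries:
        if row < num_rows:
            y[row] += partial
    return y
```

Because every slice covers the same number of path steps, a single very long row is split across several workers and a run of empty rows costs one step each, so the work per worker does not depend on how the nonzeros are distributed among rows.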
2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014
Finding the shortest paths from a single source to all other vertices is a fundamental method used in a variety of higher-level graph algorithms. We present three parallel-friendly and work-efficient methods to solve this Single-Source Shortest Paths (SSSP) problem: Workfront Sweep, Near-Far and Bucketing. These methods choose different approaches to balance the tradeoff between saving work and organizational overhead. In practice, all of these methods do much less work than traditional Bellman-Ford methods, while adding only a modest amount of extra work over serial methods. These methods are designed to have a sufficient parallel workload to fill modern massively-parallel machines, and select reorganizational schemes that map well to these architectures. We show that in general our Near-Far method has the highest performance on modern GPUs, outperforming other parallel methods. We also explore a variety of parallel load-balanced graph traversal strategies and apply them towards our SSSP solver. Our work-saving methods always outperform a traditional GPU Bellman-Ford implementation, achieving rates up to 14x higher on low-degree graphs and 340x higher on scale-free graphs. We also see significant speedups (20-60x) when compared against a serial implementation on graphs with adequately high degree.
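The abstract does not spell out how Near-Far works, so the following serial Python sketch shows one common reading of a near/far two-pile scheme, under stated assumptions: edge weights are non-negative, `delta` is a positive threshold increment chosen by the caller, and the function name `near_far_sssp` and the adjacency-dict input format are hypothetical. Vertices whose tentative distance is below the current threshold go to a "near" pile and are relaxed immediately; the rest wait in a "far" pile until the threshold advances.

```python
import math


def near_far_sssp(adj, source, delta):
    """Single-source shortest paths with a near/far work split (serial sketch).

    adj maps each vertex to a list of (neighbor, weight) pairs.
    """
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    near, far = [source], []
    threshold = delta
    while near or far:
        # Relax the near pile until no vertex below the threshold improves.
        while near:
            next_near = []
            for u in near:
                for v, w in adj[u]:
                    cand = dist[u] + w
                    if cand < dist[v]:
                        dist[v] = cand
                        if cand < threshold:
                            next_near.append(v)
                        else:
                            far.append(v)
            near = next_near
        # Advance the threshold and split the far pile.
        old_threshold = threshold
        threshold += delta
        still_far = []
        for v in set(far):  # set() drops duplicate entries
            if dist[v] < old_threshold:
                continue  # stale: already handled through the near pile
            if dist[v] < threshold:
                near.append(v)
            else:
                still_far.append(v)
        far = still_far
    return dist
```

A small `delta` approaches Dijkstra-like ordering with little wasted relaxation but many threshold steps, while a large `delta` approaches Bellman-Ford behavior with more parallel work per step, which is the tradeoff between saving work and organizational overhead that the abstract refers to.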
2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 1, 2017
Existing GPU graph analytics frameworks are typically built from specialized, bottom-up implementations of graph operators that are customized to graph computation. In this work we describe Mini-Gunrock, a lightweight graph analytics framework on the GPU. Unlike existing frameworks, Mini-Gunrock is built from graph operators implemented with generic transform-based data-parallel primitives. Using this method to bridge the gap between programmability and high performance for GPU graph analytics, we demonstrate operator performance on scale-free graphs with an average 1.5x speedup compared to Gunrock's corresponding operator performance. Mini-Gunrock's graph operators, optimizations, and application code have 10x smaller code size and comparable overall performance vs. Gunrock.
Several common classes of model-free strain estimation techniques from geodetic deformation measurements were investigated to assess the systematic computational artifacts introduced into strain estimates from different parameterizations. It is demonstrated that highly structured artifacts, which may be impossible to distinguish from real variations in strain, persistently appear in the strain rate field at and above the spatial scale of the network that samples the deformation field. These computational artifacts are biased by the spatial sampling, and by the orientation of the sampling network with respect to the deformation field. While such aliased strain rate representations provide some gross representation of the underlying real strain rate field, they contain numerous small-scale artifacts. As a result, in the absence of a tectonic model, strain rates from heterogeneous networks have limited direct use for interpreting subtleties in the underlying driving mechanisms.
Papers by Sean Baxter