International Conference on Acoustics, Speech, and Signal Processing
Several numerical computations, including the Fast Fourier Transform (FFT), require that the data be ordered according to a bit-reversed permutation. In fact, for several standard FFT programs, this pre- or post-computation is claimed to take 10-50 percent of the computation time [1]. In this paper, a linear sequential bit-reversal algorithm is presented. This is an improvement by a factor of log2 n over the standard algorithms. Even at the register level (where additions and multiplications are not considered to be constant operations), the algorithm presented is shown to be linear with a low constant factor. The recursive method presented extends nicely to radix-r permutations; mixed-radix permutations are also discussed. Most importantly, however, the method is shown to provide an efficient vectorizable bit-reversal algorithm.
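To make the idea concrete, the small C++ program below builds a bit-reversal permutation table in linear time by repeatedly doubling a known prefix. It is a hedged sketch in the spirit of the abstract (linear work, vectorizable inner loop), not necessarily the paper's exact recursion; the array name rev and the size n are illustrative.

    #include <cstdio>
    #include <vector>

    // Hedged sketch: build the bit-reversal table for n = 2^m in O(n) work.
    // Once rev[i] is known for all i < k, the next block follows directly as
    // rev[i + k] = rev[i] + n/(2k); the inner loop is a simple, vectorizable copy-add.
    int main() {
        const int n = 16;                         // must be a power of two
        std::vector<int> rev(n, 0);
        for (int k = 1, half = n / 2; k < n; k <<= 1, half >>= 1)
            for (int i = 0; i < k; ++i)
                rev[i + k] = rev[i] + half;       // append the next reversed block

        for (int i = 0; i < n; ++i)
            printf("%2d -> %2d\n", i, rev[i]);    // e.g. 1 -> 8, 3 -> 12 for n = 16
        return 0;
    }

Applying the permutation afterwards costs one pass of swaps (swapping x[i] with x[rev[i]] whenever i < rev[i]), so the whole reordering stays linear.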
Hypercube algorithms are presented for distributed block-matrix operations. These algorithms are based entirely on an interconnection scheme which involves two orthogonal sets of binary trees. This switching topology makes use of all hypercube interconnection links in a synchronized manner. An efficient novel matrix-vector multiplication algorithm based on this technique is described. Also, matrix transpose operations that move just pointers rather than actual data have been implemented for some applications by taking advantage of the above tree structures. For the cases where actual physical vector and matrix transposes are needed, possible techniques, including extensions of the above scheme, are discussed. The algorithms support submatrix partitionings of the data, instead of being limited to row and/or column partitionings. This allows efficient use of nodal vector processors as well as shorter interprocessor communication packets. It also produces a favorable data distribution fo...
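As a loose illustration of the synchronized, dimension-wise link usage such hypercube schemes rely on (this is not the paper's orthogonal-tree algorithm), the MPI sketch below performs a recursive-doubling sum over a hypercube: in step k every node exchanges its partial result with the neighbor whose rank differs in bit k. The value partial stands in for a local block product in a matrix-vector multiply.

    #include <mpi.h>
    #include <cstdio>

    // Hedged sketch: recursive-doubling all-reduce over hypercube links.
    // Assumes the number of ranks is a power of two (a d-dimensional hypercube).
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double partial = (double)rank;            // stand-in for a local block product
        for (int bit = 1; bit < size; bit <<= 1) {
            int partner = rank ^ bit;             // neighbor along this hypercube dimension
            double incoming;
            MPI_Sendrecv(&partial, 1, MPI_DOUBLE, partner, 0,
                         &incoming, 1, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            partial += incoming;                  // every link direction is used once per step
        }
        printf("rank %d: global sum = %g\n", rank, partial);
        MPI_Finalize();
        return 0;
    }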
2010 IEEE International Conference On Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010
Heterogeneous computing, which combines multi-core CPUs with hardware accelerators such as GPUs, is needed to satisfy future computational needs and energy requirements. Cloud computing currently offers users whose computational needs vary greatly over time a cost-effective way to gain access to resources. While the current form of cloud-based systems is suitable for many scenarios, they have not yet fully evolved into truly heterogeneous computational environments. This paper describes THOR (Transparent Heterogeneous Open Resources), our framework for providing seamless access to HPC systems composed of heterogeneous resources. Our work focuses on the core module, in particular the policy engine. To validate our approach, THOR has been implemented on a scaled-down heterogeneous cluster within a cloud-based computational environment. Our testing includes an OpenCL encryption/decryption algorithm that was tested for several use cases. The corresponding computational benchmarks are provided to validate our approach and gain valuable knowledge for the policy database.
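As a small, hedged illustration of the heterogeneous resources such a framework has to schedule over (this snippet is not part of THOR), the standard OpenCL host API can enumerate the CPUs and GPUs visible on a node; a policy engine could draw on this kind of device inventory when deciding where to place work.

    #include <CL/cl.h>
    #include <cstdio>

    // Hedged sketch: list the OpenCL platforms and devices (CPUs, GPUs, ...) on a node.
    int main() {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);
        if (num_platforms > 8) num_platforms = 8;

        for (cl_uint p = 0; p < num_platforms; ++p) {
            cl_device_id devices[16];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);
            if (num_devices > 16) num_devices = 16;
            for (cl_uint d = 0; d < num_devices; ++d) {
                char name[256];
                cl_device_type type;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
                printf("device: %s (%s)\n", name,
                       (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "CPU/other");
            }
        }
        return 0;
    }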
Abstract. The focus on throughput and large data volumes separates Information Retrieval (IR) from scientific computing, since for IR it is critical to process large amounts of data efficiently, a task at which the GPU currently does not excel. Only recently has the IR community begun to explore the possibilities, and an implementation of a search engine for the GPU was published in April 2009. This paper analyzes how GPUs can be improved to better suit such large-data-volume applications. Current graphics cards have a bottleneck regarding the transfer of data between the host and the GPU. One approach to resolving this bottleneck is to include the host memory as part of the GPU's memory hierarchy. Benchmarks from the NVIDIA ION, 9800M and GTX 240 are included. Several suggestions for future GPU features are also presented.
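One concrete mechanism along these lines that already exists in CUDA is mapped (zero-copy) host memory, where a pinned host buffer is given a device-side alias so kernels read it directly across the bus instead of after an explicit copy. The sketch below shows the basic setup; buffer names and sizes are illustrative, and this is not the benchmark code used in the paper.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(const float *in, float *out, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a * in[i];            // reads/writes go to mapped host memory
    }

    int main() {
        const int n = 1 << 20;
        cudaSetDeviceFlags(cudaDeviceMapHost);    // allow mapped host allocations

        float *h_in, *h_out, *d_in, *d_out;
        cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);   // device-side aliases
        cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

        scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f); // no cudaMemcpy needed
        cudaDeviceSynchronize();                  // results are already visible on the host
        printf("h_out[0] = %f\n", h_out[0]);

        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return 0;
    }

On an integrated part such as the ION, where the GPU shares physical memory with the host, this style of access is particularly natural.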
Abstract. Due to the power and frequency walls, the trend is now to use multiple GPUs on a given system, much as multiple cores are found on CPU-based systems. However, deepening the resource hierarchy widens the spectrum of factors that may impact the performance of the system. The goal of this paper is to analyze such factors by investigating and benchmarking the NVIDIA Tesla S1070. This system combines four T10 GPUs, making available up to 4 TFLOPS of computational power. As a case study, we develop a red-black SOR PDE solver for Laplace equations with Dirichlet boundaries, well known for requiring constant communication in order to exchange neighboring data. To aid both design and analysis, we propose a model for multi-GPU systems targeting communication between the several GPUs. The main variables exposed by our benchmark application are: domain size and shape, kind of data partitioning, number of GPUs, width of the borders to exchange, kernels to use, and kind of...
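For reference, a single-GPU red-black SOR sweep for the 2-D Laplace equation can be written as the CUDA kernel below. This is a hedged sketch rather than the paper's multi-GPU solver; in the multi-GPU setting, the border rows or columns would be exchanged between the red and black sweeps.

    #include <cuda_runtime.h>

    // One red-black SOR sweep on an nx-by-ny grid with fixed (Dirichlet) boundary
    // values; only points whose color matches `color` are updated, so all updates
    // within a sweep are independent of each other.
    __global__ void sor_sweep(float *u, int nx, int ny, float omega, int color) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return;  // keep boundary fixed
        if (((i + j) & 1) != color) return;

        int idx = j * nx + i;
        float avg = 0.25f * (u[idx - 1] + u[idx + 1] + u[idx - nx] + u[idx + nx]);
        u[idx] = (1.0f - omega) * u[idx] + omega * avg;              // SOR relaxation
    }

    int main() {
        const int nx = 512, ny = 512;
        float *d_u;
        cudaMalloc((void **)&d_u, nx * ny * sizeof(float));
        cudaMemset(d_u, 0, nx * ny * sizeof(float)); // real code would load boundary data here
        dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
        for (int it = 0; it < 1000; ++it) {
            sor_sweep<<<grid, block>>>(d_u, nx, ny, 1.9f, 0);        // red points
            sor_sweep<<<grid, block>>>(d_u, nx, ny, 1.9f, 1);        // black points
        }
        cudaDeviceSynchronize();
        cudaFree(d_u);
        return 0;
    }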
Modern GPUs, with their several hundred cores and more accessible programming models, are becoming attractive devices for compute-intensive applications. They are particularly well suited for applications, such as image processing, where the end result is intended to be displayed via the graphics card. One of the more versatile and powerful graphics techniques is ray tracing. However, tracing each ray of light in a scene is very computationally expensive and has traditionally been preprocessed on CPUs over hours, if not days. In this paper, Nvidia's new OptiX ray tracing engine is used to show how the power of modern graphics cards, such as the Nvidia Quadro FX 5800, can be harnessed to ray trace several scenes that represent real-life applications at real-time speeds ranging from 20.63 to 67.15 fps. Near-perfect speedup is demonstrated on dual GPUs for scenes with complex geometries. The impact of the recently announced Nvidia Fermi processor on ray tracing is also discussed.
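The reason ray tracing maps so well to GPUs is that every ray can be traced independently. The toy CUDA kernel below (deliberately not OptiX, and far simpler than the scenes in the paper) assigns one thread per pixel and intersects its primary ray with a single sphere; the camera and geometry values are made up for illustration.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hedged sketch: one thread traces one primary ray against one sphere.
    __global__ void trace(unsigned char *img, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        // Primary ray from a pinhole camera at the origin, looking down -z.
        float u = 2.0f * x / w - 1.0f, v = 2.0f * y / h - 1.0f;
        float dx = u, dy = v, dz = -1.0f;
        float len = sqrtf(dx * dx + dy * dy + dz * dz);
        dx /= len; dy /= len; dz /= len;

        // Sphere of radius 0.5 centered at (0, 0, -3): solve |o + t d - c|^2 = r^2.
        float ocx = 0.0f, ocy = 0.0f, ocz = 3.0f, r = 0.5f;
        float b = 2.0f * (dx * ocx + dy * ocy + dz * ocz);
        float c = ocx * ocx + ocy * ocy + ocz * ocz - r * r;
        float disc = b * b - 4.0f * c;

        img[y * w + x] = (disc >= 0.0f) ? 255 : 0;   // white where the ray hits the sphere
    }

    int main() {
        const int w = 256, h = 256;
        unsigned char *d_img;
        unsigned char h_center;
        cudaMalloc((void **)&d_img, w * h);
        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        trace<<<grid, block>>>(d_img, w, h);
        cudaMemcpy(&h_center, d_img + (h / 2) * w + w / 2, 1, cudaMemcpyDeviceToHost);
        printf("center pixel: %d\n", (int)h_center);  // hits the sphere, so prints 255
        cudaFree(d_img);
        return 0;
    }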
Investigating how fluids flow inside the complicated geometries of porous rocks is an important problem in the petroleum industry. The lattice Boltzmann method (LBM) can be used to calculate the permeability of porous rocks. In this paper, we show how to implement this method efficiently on modern GPUs. Both a sequential CPU implementation and a parallelized GPU implementation are developed. Both ...
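To give a feel for the per-node work in an LBM solver, the fragment below performs the BGK collision step for a single D2Q9 lattice cell. This is a hedged, CPU-side sketch, not the paper's GPU implementation (which would target a 3-D lattice such as D3Q19); together with streaming the post-collision values to neighboring cells, this is the operation that gets parallelized per lattice node on the GPU.

    #include <cstdio>

    // D2Q9 lattice weights and discrete velocities.
    static const float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                                1.f/36, 1.f/36, 1.f/36, 1.f/36};
    static const int   ex[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static const int   ey[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

    // BGK collision for one cell: relax the distributions f toward local equilibrium.
    void bgk_collide(float f[9], float tau) {
        float rho = 0.f, ux = 0.f, uy = 0.f;
        for (int i = 0; i < 9; ++i) { rho += f[i]; ux += f[i] * ex[i]; uy += f[i] * ey[i]; }
        ux /= rho; uy /= rho;
        float usq = ux * ux + uy * uy;
        for (int i = 0; i < 9; ++i) {
            float eu  = ex[i] * ux + ey[i] * uy;
            float feq = w[i] * rho * (1.f + 3.f * eu + 4.5f * eu * eu - 1.5f * usq);
            f[i] += (feq - f[i]) / tau;           // relax toward equilibrium
        }
    }

    int main() {
        float f[9];
        for (int i = 0; i < 9; ++i) f[i] = w[i];  // start at rest-state equilibrium
        f[1] += 0.01f;                            // small perturbation
        bgk_collide(f, 1.0f);
        float rho = 0.f;
        for (int i = 0; i < 9; ++i) rho += f[i];
        printf("density after collision: %f\n", rho);  // mass is conserved by collision
        return 0;
    }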