International Conference on Acoustics, Speech, and Signal Processing
Several numerical computations, including the Fast Fourier Transform (FFT), require that the data be ordered according to a bit-reversed permutation. In fact, for several standard FFT programs, this pre- or post-computation is claimed to take 10-50 percent of the computation time [1]. In this paper, a linear sequential bit-reversal algorithm is presented. This is an improvement by a factor of log2 n over the standard algorithms. Even at the register level (where additions and multiplications are not considered to be constant operations), the algorithm presented is shown to be linear with a low constant factor. The recursive method presented extends nicely to radix-r permutations; mixed-radix permutations are also discussed. Most importantly, however, the method is shown to provide an efficient vectorizable bit-reversal algorithm.
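To make the idea concrete, the small C++ program below builds a bit-reversal permutation table in linear time by repeatedly doubling a known prefix. It is a hedged sketch in the spirit of the abstract (linear work, vectorizable inner loop), not necessarily the paper's exact recursion; the array name rev and the size n are illustrative.

    #include <cstdio>
    #include <vector>

    // Hedged sketch: build the bit-reversal table for n = 2^m in O(n) work.
    // Once rev[i] is known for all i < k, the next block follows directly as
    // rev[i + k] = rev[i] + n/(2k); the inner loop is a simple, vectorizable copy-add.
    int main() {
        const int n = 16;                         // must be a power of two
        std::vector<int> rev(n, 0);
        for (int k = 1, half = n / 2; k < n; k <<= 1, half >>= 1)
            for (int i = 0; i < k; ++i)
                rev[i + k] = rev[i] + half;       // append the next reversed block

        for (int i = 0; i < n; ++i)
            printf("%2d -> %2d\n", i, rev[i]);    // e.g. 1 -> 8, 3 -> 12 for n = 16
        return 0;
    }

Applying the permutation afterwards costs one pass of swaps (swapping x[i] with x[rev[i]] whenever i < rev[i]), so the whole reordering stays linear.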
Hypercube algorithms are presented for distributed block-matrix operations. These algorithms are based entirely on an interconnection scheme which involves two orthogonal sets of binary trees. This switching topology makes use of all hypercube interconnection links in a synchronized manner. An efficient novel matrix-vector multiplication algorithm based on this technique is described. Also, matrix transpose operations that move just pointers rather than actual data have been implemented for some applications by taking advantage of the above tree structures. For the cases where actual physical vector and matrix transposes are needed, possible techniques, including extensions of the above scheme, are discussed. The algorithms support submatrix partitionings of the data, instead of being limited to row and/or column partitionings. This allows efficient use of nodal vector processors as well as shorter interprocessor communication packets. It also produces a favorable data distribution fo...
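As a loose illustration of the synchronized, dimension-wise link usage such hypercube schemes rely on (this is not the paper's orthogonal-tree algorithm), the MPI sketch below performs a recursive-doubling sum over a hypercube: in step k every node exchanges its partial result with the neighbor whose rank differs in bit k. The value partial stands in for a local block product in a matrix-vector multiply.

    #include <mpi.h>
    #include <cstdio>

    // Hedged sketch: recursive-doubling all-reduce over hypercube links.
    // Assumes the number of ranks is a power of two (a d-dimensional hypercube).
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double partial = (double)rank;            // stand-in for a local block product
        for (int bit = 1; bit < size; bit <<= 1) {
            int partner = rank ^ bit;             // neighbor along this hypercube dimension
            double incoming;
            MPI_Sendrecv(&partial, 1, MPI_DOUBLE, partner, 0,
                         &incoming, 1, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            partial += incoming;                  // every link direction is used once per step
        }
        printf("rank %d: global sum = %g\n", rank, partial);
        MPI_Finalize();
        return 0;
    }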
2010 IEEE International Conference On Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010
Heterogeneous computing, which combines multi-core CPUs with hardware accelerators such as GPUs, is needed to satisfy future computational needs and energy requirements. Cloud computing currently offers users whose computational needs vary greatly over time a cost-effective way to gain access to resources. While the current form of cloud-based systems is suitable for many scenarios, they have not yet fully evolved into truly heterogeneous computational environments. This paper describes THOR (Transparent Heterogeneous Open Resources), our framework for providing seamless access to HPC systems composed of heterogeneous resources. Our work focuses on the core module, in particular the policy engine. To validate our approach, THOR has been implemented on a scaled-down heterogeneous cluster within a cloud-based computational environment. Our testing includes an OpenCL encryption/decryption algorithm that was tested for several use cases. The corresponding computational benchmarks are provided to validate our approach and gain valuable knowledge for the policy database.
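As a small, hedged illustration of the heterogeneous resources such a framework has to schedule over (this snippet is not part of THOR), the standard OpenCL host API can enumerate the CPUs and GPUs visible on a node; a policy engine could draw on this kind of device inventory when deciding where to place work.

    #include <CL/cl.h>
    #include <cstdio>

    // Hedged sketch: list the OpenCL platforms and devices (CPUs, GPUs, ...) on a node.
    int main() {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);
        if (num_platforms > 8) num_platforms = 8;

        for (cl_uint p = 0; p < num_platforms; ++p) {
            cl_device_id devices[16];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);
            if (num_devices > 16) num_devices = 16;
            for (cl_uint d = 0; d < num_devices; ++d) {
                char name[256];
                cl_device_type type;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
                printf("device: %s (%s)\n", name,
                       (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "CPU/other");
            }
        }
        return 0;
    }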
Abstract. The focus on throughput and large data volumes separates Information Retrieval (IR) from scientific computing, since for IR it is critical to process large amounts of data efficiently, a task at which the GPU currently does not excel. Only recently has the IR community begun to explore the possibilities, and an implementation of a search engine for the GPU was published in April 2009. This paper analyzes how GPUs can be improved to better suit such large-data-volume applications. Current graphics cards have a bottleneck regarding the transfer of data between the host and the GPU. One approach to resolving this bottleneck is to include the host memory as part of the GPU's memory hierarchy. Benchmarks from the NVIDIA ION, 9800M and GTX 240 are included. Several suggestions for future GPU features are also presented.
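One concrete mechanism along these lines that already exists in CUDA is mapped (zero-copy) host memory, where a pinned host buffer is given a device-side alias so kernels read it directly across the bus instead of after an explicit copy. The sketch below shows the basic setup; buffer names and sizes are illustrative, and this is not the benchmark code used in the paper.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(const float *in, float *out, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a * in[i];            // reads/writes go to mapped host memory
    }

    int main() {
        const int n = 1 << 20;
        cudaSetDeviceFlags(cudaDeviceMapHost);    // allow mapped host allocations

        float *h_in, *h_out, *d_in, *d_out;
        cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);   // device-side aliases
        cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

        scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f); // no cudaMemcpy needed
        cudaDeviceSynchronize();                  // results are already visible on the host
        printf("h_out[0] = %f\n", h_out[0]);

        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return 0;
    }

On an integrated part such as the ION, where the GPU shares physical memory with the host, this style of access is particularly natural.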
Abstract. Due to the power and frequency walls, the trend is now to use multiple GPUs on a given system, much as multiple cores are found on CPU-based systems. However, deepening the resource hierarchy widens the spectrum of factors that may impact the performance of the system. The goal of this paper is to analyze such factors by investigating and benchmarking the NVIDIA Tesla S1070. This system combines four T10 GPUs, making available up to 4 TFLOPS of computational power. As a case study, we develop a red-black SOR PDE solver for Laplace equations with Dirichlet boundaries, well known for requiring constant communication in order to exchange neighboring data. To aid both design and analysis, we propose a model for multi-GPU systems targeting communication between the several GPUs. The main variables exposed by our benchmark application are: domain size and shape, kind of data partitioning, number of GPUs, width of the borders to exchange, kernels to use, and kind of...
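For reference, a single-GPU red-black SOR sweep for the 2-D Laplace equation can be written as the CUDA kernel below. This is a hedged sketch rather than the paper's multi-GPU solver; in the multi-GPU setting, the border rows or columns would be exchanged between the red and black sweeps.

    #include <cuda_runtime.h>

    // One red-black SOR sweep on an nx-by-ny grid with fixed (Dirichlet) boundary
    // values; only points whose color matches `color` are updated, so all updates
    // within a sweep are independent of each other.
    __global__ void sor_sweep(float *u, int nx, int ny, float omega, int color) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return;  // keep boundary fixed
        if (((i + j) & 1) != color) return;

        int idx = j * nx + i;
        float avg = 0.25f * (u[idx - 1] + u[idx + 1] + u[idx - nx] + u[idx + nx]);
        u[idx] = (1.0f - omega) * u[idx] + omega * avg;              // SOR relaxation
    }

    int main() {
        const int nx = 512, ny = 512;
        float *d_u;
        cudaMalloc((void **)&d_u, nx * ny * sizeof(float));
        cudaMemset(d_u, 0, nx * ny * sizeof(float)); // real code would load boundary data here
        dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
        for (int it = 0; it < 1000; ++it) {
            sor_sweep<<<grid, block>>>(d_u, nx, ny, 1.9f, 0);        // red points
            sor_sweep<<<grid, block>>>(d_u, nx, ny, 1.9f, 1);        // black points
        }
        cudaDeviceSynchronize();
        cudaFree(d_u);
        return 0;
    }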
Modern GPUs, with their several hundred cores and more accessible programming models, are becoming attractive devices for compute-intensive applications. They are particularly well suited for applications, such as image processing, where the end result is intended to be displayed via the graphics card. One of the more versatile and powerful graphics techniques is ray tracing. However, tracing each ray of light in a scene is very computationally expensive and has traditionally been preprocessed on CPUs over hours, if not days. In this paper, Nvidia's new OptiX ray tracing engine is used to show how the power of modern graphics cards, such as the Nvidia Quadro FX 5800, can be harnessed to ray trace several scenes that represent real-life applications at real-time speeds ranging from 20.63 to 67.15 fps. Near-perfect speedup is demonstrated on dual GPUs for scenes with complex geometries. The impact of the recently announced Nvidia Fermi processor on ray tracing is also discussed.
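The reason ray tracing maps so well to GPUs is that every ray can be traced independently. The toy CUDA kernel below (deliberately not OptiX, and far simpler than the scenes in the paper) assigns one thread per pixel and intersects its primary ray with a single sphere; the camera and geometry values are made up for illustration.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hedged sketch: one thread traces one primary ray against one sphere.
    __global__ void trace(unsigned char *img, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        // Primary ray from a pinhole camera at the origin, looking down -z.
        float u = 2.0f * x / w - 1.0f, v = 2.0f * y / h - 1.0f;
        float dx = u, dy = v, dz = -1.0f;
        float len = sqrtf(dx * dx + dy * dy + dz * dz);
        dx /= len; dy /= len; dz /= len;

        // Sphere of radius 0.5 centered at (0, 0, -3): solve |o + t d - c|^2 = r^2.
        float ocx = 0.0f, ocy = 0.0f, ocz = 3.0f, r = 0.5f;
        float b = 2.0f * (dx * ocx + dy * ocy + dz * ocz);
        float c = ocx * ocx + ocy * ocy + ocz * ocz - r * r;
        float disc = b * b - 4.0f * c;

        img[y * w + x] = (disc >= 0.0f) ? 255 : 0;   // white where the ray hits the sphere
    }

    int main() {
        const int w = 256, h = 256;
        unsigned char *d_img;
        unsigned char h_center;
        cudaMalloc((void **)&d_img, w * h);
        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        trace<<<grid, block>>>(d_img, w, h);
        cudaMemcpy(&h_center, d_img + (h / 2) * w + w / 2, 1, cudaMemcpyDeviceToHost);
        printf("center pixel: %d\n", (int)h_center);  // hits the sphere, so prints 255
        cudaFree(d_img);
        return 0;
    }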
Investigating how fluids flow inside the complicated geometries of porous rocks is an important problem in the petroleum industry. The lattice Boltzmann method (LBM) can be used to calculate the permeability of porous rocks. In this paper, we show how to implement this method efficiently on modern GPUs. Both a sequential CPU implementation and a parallelized GPU implementation are developed. Both ...
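To give a feel for the per-node work in an LBM solver, the fragment below performs the BGK collision step for a single D2Q9 lattice cell. This is a hedged, CPU-side sketch, not the paper's GPU implementation (which would target a 3-D lattice such as D3Q19); together with streaming the post-collision values to neighboring cells, this is the operation that gets parallelized per lattice node on the GPU.

    #include <cstdio>

    // D2Q9 lattice weights and discrete velocities.
    static const float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                                1.f/36, 1.f/36, 1.f/36, 1.f/36};
    static const int   ex[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static const int   ey[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

    // BGK collision for one cell: relax the distributions f toward local equilibrium.
    void bgk_collide(float f[9], float tau) {
        float rho = 0.f, ux = 0.f, uy = 0.f;
        for (int i = 0; i < 9; ++i) { rho += f[i]; ux += f[i] * ex[i]; uy += f[i] * ey[i]; }
        ux /= rho; uy /= rho;
        float usq = ux * ux + uy * uy;
        for (int i = 0; i < 9; ++i) {
            float eu  = ex[i] * ux + ey[i] * uy;
            float feq = w[i] * rho * (1.f + 3.f * eu + 4.5f * eu * eu - 1.5f * usq);
            f[i] += (feq - f[i]) / tau;           // relax toward equilibrium
        }
    }

    int main() {
        float f[9];
        for (int i = 0; i < 9; ++i) f[i] = w[i];  // start at rest-state equilibrium
        f[1] += 0.01f;                            // small perturbation
        bgk_collide(f, 1.0f);
        float rho = 0.f;
        for (int i = 0; i < 9; ++i) rho += f[i];
        printf("density after collision: %f\n", rho);  // mass is conserved by collision
        return 0;
    }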