André Brodtkorb

Papers by André Brodtkorb

metno/VolcanicAshInversion: v1.0.0

Version 1.0.0 release. Tested with datasets available on https://doi.org/10.5281/zenodo.3818195 h...

metno/VolcanicAshInversion: v1.1.0

This release avoids the memory intensive representation of M (n_observations x n_emissions) by co...

Supplementary Material to Evaluation of Selected Finite-Difference and Finite-Volume Approaches to Rotational Shallow-Water Flow

This release represents the supplementary material for the paper "Evaluation of Selected Finite-Difference and Finite-Volume Approaches to Rotational Shallow-Water Flow" by Holm, Brodtkorb, Broström, Christensen and Sætra, and contains the numerical schemes and test cases used in the paper. All figures and results presented in the paper can be reproduced from the notebooks provided here.

Supplementary Software for Coastal Ocean Forecasting on the GPU using a Two-Dimensional Finite-Volume Scheme

This release represents the supplementary software for the paper "Coastal Ocean Forecasting on the GPU using a Two-Dimensional Finite-Volume Scheme" by André Rigland Brodtkorb and Håvard Heitlo Holm. It contains the software described in the paper, and Jupyter notebooks for setting up, running, and visualizing the numerical experiments. The folder gpu_ocean/papers/realisticSimulations contains the code specific to this work.

An Asynchronous API for Numerical Linear Algebra

Scalable Computing : Practice and Experience, 2008

We present a task-parallel asynchronous API for numerical linear algebra that utilizes multiple CPUs, multiple GPUs, or a combination of both. Furthermore, we present a wrapper of this interface for use in MATLAB. Our API imposes only small overheads, scales perfectly to two processor cores, and shows even better performance when utilizing computational resources on the GPU.

An Asynchronous API for Numerical Linear Algebra (© 2008 SCPE)

Abstract. We present a task-parallel asynchronous API for numerical linear algebra that utilizes multiple CPUs, multiple GPUs, or a combination of both. Furthermore, we present a wrapper of this interface for use in MATLAB. Our API imposes only small overheads, scales perfectly to two processor cores, and shows even better performance when utilizing computational resources on the GPU. Key words: asynchronous, multicore, GPU, MATLAB, CUBLAS, double precision

The graphics processor as a mathematical coprocessor

We present an interface to the graphics processing unit (GPU) from MATLAB, and four algorithms from numerical linear algebra available through this interface: matrix-matrix multiplication, Gauss-Jordan elimination, PLU factorization, and tridiagonal Gaussian elimination. In addition to being a high-level abstraction to the GPU, the interface offers background processing, enabling computations to be executed on the CPU simultaneously. The algorithms are shown to be up to 31 times faster than highly optimized CPU code. The algorithms have only been tested on single precision hardware, but will easily run on new double precision hardware.
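
For readers unfamiliar with the last of the four algorithms, tridiagonal Gaussian elimination can be sketched as follows. This is a minimal serial CPU illustration in Python of the textbook method (often called the Thomas algorithm), not the GPU implementation from the paper; the function name `solve_tridiagonal` and the storage convention for the three diagonals are assumptions made for this example, and no pivoting is performed.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    Hypothetical storage convention for this sketch: all four lists have
    length n; lower[i] is the sub-diagonal entry of row i (lower[0] unused)
    and upper[i] is the super-diagonal entry of row i (upper[n-1] unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution, starting from the last unknown.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Example: the 3x3 system with 2 on the diagonal and -1 off the diagonal.
solution = solve_tridiagonal(
    lower=[0.0, -1.0, -1.0],
    diag=[2.0, 2.0, 2.0],
    upper=[-1.0, -1.0, 0.0],
    rhs=[1.0, 0.0, 1.0],
)
```

Because there is no pivoting, the sketch is only reliable for well-conditioned systems such as diagonally dominant ones.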

Performance and Energy Efficiency of CUDA and OpenCL for GPU Computing Using Python

Simplified Ocean Models on GPUs

This paper describes the implementation of three different simplified ocean models on a GPU (graphics processing unit) using Python and PyOpenCL. The three models are all based on solving the shallow water equations on Cartesian grids, and our work is motivated by the aim of running very large ensembles of forecast models for fully nonlinear data assimilation. The models are the linearized shallow water equations, the non-linear shallow water equations, and the two-layer non-linear shallow water equations, respectively, and they contain progressively more physical properties of the ocean dynamics. We show how these models are discretized to run efficiently on a GPU, discuss how to implement them, and show some simulation results. The implementation is available online under an open source license, and may serve as a starting point for others to implement similar oceanographic models.

Real-World Oceanographic Simulations on the GPU using a Two-Dimensional Finite-Volume Scheme

ArXiv, 2019

In this work, we take a modern high-resolution finite-volume scheme for solving the rotational shallow-water equations and extend it with features required to run real-world ocean simulations. Our contributions include a spatially varying north vector and Coriolis term required for large scale domains, moving wet-dry fronts, a static land mask, bottom shear stress, wind forcing, boundary conditions for nesting in a global model, and an efficient model reformulation that makes it well-suited for massively parallel implementations. Our model order is verified using a grid convergence test, and we show numerical experiments using three different sections along the coast of Norway based on data originating from operational forecasts run at the Norwegian Meteorological Institute. Our simulation framework shows perfect weak scaling on a modern P100 GPU, and is capable of providing tidal wave forecasts that are very close to the operational model at a fraction of the cost. All source code ...

Data Assimilation for Ocean Drift Trajectories Using Massive Ensembles and GPUs

Finite Volumes for Complex Applications IX - Methods, Theoretical Aspects, Examples, 2020

In this work, we perform fully nonlinear data assimilation of ocean drift trajectories using multiple GPUs. We use an ensemble of up to 10,000 members and the sequential importance resampling algorithm to assimilate observations of drift trajectories into the underlying shallow-water simulation model. Our results show an improved drift trajectory forecast using data assimilation for a complex and realistic simulation scenario, and the implementation exhibits good weak and strong scaling.
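
The weighting and resampling step of a sequential importance resampling scheme can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the paper's implementation: each ensemble member is reduced to a single scalar, the observation error is assumed Gaussian, and the names `gaussian_weights` and `resample` are hypothetical. In the paper each member is a full shallow-water simulation running on GPUs.

```python
import math
import random


def gaussian_weights(predicted, observation, obs_std):
    """Normalized weights from a Gaussian likelihood (assumed for this sketch)."""
    log_w = [-0.5 * ((p - observation) / obs_std) ** 2 for p in predicted]
    max_log = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - max_log) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def resample(members, weights, rng):
    """Draw a new ensemble of the same size, with replacement,
    choosing each member with probability proportional to its weight."""
    return rng.choices(members, weights=weights, k=len(members))


rng = random.Random(0)
# Toy prior ensemble: 1000 scalar "states" scattered around 0.
ensemble = [rng.gauss(0.0, 2.0) for _ in range(1000)]
# Weight each member by how well it matches a single observation of 1.5.
weights = gaussian_weights(ensemble, observation=1.5, obs_std=0.5)
# Members close to the observation are duplicated, poor ones are dropped.
posterior = resample(ensemble, weights, rng)
```

After resampling, all members carry equal weight again and the ensemble is propagated forward to the next observation time.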

Comparison Between Algebraic Multigrid and Multilevel Multiscale Methods for Reservoir Simulation

ECMOR XVII, 2020

Multiscale methods for solving strongly heterogeneous systems in reservoirs have a long history, from the early ideas used on incompressible flow to the newly released version in commercial simulation. Much effort has been put into making the MsFV method work for fully unstructured multiphase problems. The MsRSB version is a newly developed version, which tackles most real-world problems. It is, to our knowledge, the only multiscale method that has been released in a commercial simulator. You can alternatively see the method as a variant of smoothed aggregation or as an iterative approach to AMG with energy-minimizing basis functions. This will be discussed in detail. So far, most work on comparing MsRSB with AMG methods has been on qualitative performance measures like iteration number rather than on pure runtime with a fair code implementation. We discuss the theoretical performance and show the practical performance for our implementation. Here, we compare performance of pure AMG, standard two-level MsRSB with pure AMG as coarse solver, as well as a new truly multilevel MsRSB scheme. Our implementation uses the DUNE-ISTL framework. To limit the scope of the discussion we restrict our assessment to AMG with aggregation and smoothed aggregation and the MsRSB method. These three methods are closely related and are primarily distinguished in a preconditioner setting by the coarsening factors used, and the degree of smoothing applied to the basis. We also compare with other state-of-the-art AMG implementations, but do not investigate combinations of them with the MsRSB method. For the MsRSB method, we also discuss practical considerations in different parallelization regimes including domain decomposition using MPI, shared memory using OpenMP, and GPU acceleration with CUDA.
All comparisons will focus on the setting in which many similar systems should be solved, e.g. during a large-scale, multiphase flow simulation. That is, our emphasis is on the performance of updating a preconditioner and on the apply time for the preconditioner relative to the convergence rate. Performance of the solvers will be tested for pure parabolic/elliptic problems that either arise as part of a sequential splitting procedure or as a pseudo-elliptic preconditioner/solver as a part of a CPR preconditioner for a multiphase system, for which block ILU0 is used as the outer smoother.

Real-time online camera synchronization for volume carving on GPU

2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2013

Volume carving is a well-known technique for reconstructing a 3D scene from a set of 2D images, using features detected in individual cameras, and camera parameters. Spatial calibration of the cameras is well understood, but the resulting carved volume is very sensitive to temporal offsets between the cameras. Automatic synchronization between the cameras is therefore desirable. In this paper, we present a highly efficient implementation of volume carving and synchronization on a heterogeneous system fitted with commodity GPUs using an improved version of the algorithm in [1]. An online, real-time synchronization system is described and evaluated on surveillance video of an indoor scene. Improvements to the state-of-the-art CPU-based algorithms are described.

PLU Factorization on a Cluster of GPUs Using Fast Ethernet

Two biologically active Schiff bases (imines) 4-5 were synthesized by the reaction of 2-aminophenol 1 with 4-chloroacetophenone 2 or 4-hydroxyacetophenone 3 in the presence of conc. H2SO4. The characterization of the Schiff bases was carried out using spectroscopic techniques including IR, 1H-NMR, and EI-MS, along with elemental analyses. Biological screening of the Schiff bases found the compound with the -OH group to be more biologically active than the compound with the halo (-X) group. The Schiff base 5 is a potent antioxidant agent as well as an α-glucosidase inhibitor. The Schiff bases 4-5 also have excellent antibacterial activity against the strains Bacillus subtilis, Staphylococcus aureus, and Escherichia coli, and moderate activity against Salmonella typhi and Pseudomonas aeruginosa, with gentamicin as the standard drug.

Supplementary Material for Test Cases for Rotational Shallow-Water Schemes

This software is provided as a supplement for the research paper "Test Cases for Rotational Shallow-Water Schemes" by Holm, Brodtkorb, Broström, Christensen and Sætra. The paper proposes test cases for validation of numerical schemes for solving the rotational shallow-water equations, with an emphasis on important physical properties as seen from an oceanographic viewpoint. Here, we provide Jupyter Notebooks with Python implementations of all test cases, presented so that all results and figures from the paper can be reproduced.

GPU Computing with Python: Performance, Energy Efficiency and Usability

Computation

In this work, we examine the performance, energy efficiency, and usability when using Python for developing high-performance computing codes running on the graphics processing unit (GPU). We investigate the portability of performance and energy efficiency between Compute Unified Device Architecture (CUDA) and Open Compute Language (OpenCL); between GPU generations; and between low-end, mid-range, and high-end GPUs. Our findings showed that the impact of using Python is negligible for our applications, and furthermore, CUDA and OpenCL applications tuned to an equivalent level can in many cases obtain the same computational performance. Our experiments showed that performance in general varies more between different GPUs than between using CUDA and OpenCL. We also show that tuning for performance is a good way of tuning for energy efficiency, but that specific tuning is needed to obtain optimal energy efficiency.

Visualization of Marine Sand Dune Displacements Utilizing Modern GPU Techniques

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2015

Quantifying and visualizing deformation and material fluxes is an indispensable tool for many geoscientific applications at different scales comprising for example global convective models (Burstedde et al., 2013), co-seismic slip (Leprince et al., 2007) or local slope deformation (Stumpf et al., 2014b). Within the European project IQmulus (…

The Graphics Processor as a Mathematical Coprocessor in MATLAB

2008 International Conference on Complex, Intelligent and Software Intensive Systems, 2008

We present an interface to the graphics processing unit (GPU) from MATLAB, and four algorithms from numerical linear algebra available through this interface: matrix-matrix multiplication, Gauss-Jordan elimination, PLU factorization, and tridiagonal Gaussian elimination. In addition to being a high-level abstraction to the GPU, the interface offers background processing, enabling computations to be executed on the CPU simultaneously. The algorithms are shown to be up to 31 times faster than highly optimized CPU code. The algorithms have only been tested on single precision hardware, but will easily run on new double precision hardware.

Efficient GPU-Implementation of Adaptive Mesh Refinement for the Shallow-Water Equations

Journal of Scientific Computing, 2014

The shallow-water equations model hydrostatic flow below a free surface for cases in which the ratio between the vertical and horizontal length scales is small and are used to describe waves in lakes, rivers, oceans, and the atmosphere. The equations admit discontinuous solutions, and numerical solutions are typically computed using high-resolution schemes. For many practical problems, there is a need to increase the grid resolution locally to capture complicated structures or steep gradients in the solution. An efficient method to this end is adaptive mesh refinement (AMR), which recursively refines the grid in parts of the domain and adaptively updates the refinement as the simulation progresses. Several authors have demonstrated that the explicit stencil computations of high-resolution schemes map particularly well to many-core architectures seen in hardware accelerators such as graphics processing units (GPUs). Herein, we present the first full GPU-implementation of a block-based AMR method for the second-order Kurganov-Petrova central scheme. We discuss implementation details, potential pitfalls, and key insights, and present a series of performance and accuracy tests. Although it is only presented for a particular case herein, we believe our approach to GPU-implementation of AMR is transferable to other hyperbolic conservation laws, numerical schemes, and architectures similar to the GPU.

State-of-the-art in Heterogeneous Computing

Scientific Programming, 2010

Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
