2010
Abstract. Due to the power and frequency walls, the trend is now to use multiple GPUs in a given system, much as multiple cores are found in CPU-based systems. However, deepening the resource hierarchy widens the spectrum of factors that may impact the performance of the system. The goal of this paper is to analyze such factors by investigating and benchmarking the NVIDIA Tesla S1070. This system combines four T10 GPUs, making up to 4 TFLOPS of computational power available. As a case study, we develop a red-black SOR PDE solver for Laplace equations with Dirichlet boundaries, well known for requiring constant communication to exchange neighboring data. To aid both design and analysis, we propose a model for multi-GPU systems targeting communication between the GPUs. The main variables exposed by our benchmark application are: domain size and shape, kind of data partitioning, number of GPUs, width of the borders to exchange, kernels to use, and kind of...
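For concreteness, here is a minimal single-GPU sketch (our own construction, not the paper's code) of one red-black SOR iteration for the 2D Laplace equation with Dirichlet boundaries. The kernel and parameter names are ours; in the paper's multi-GPU setting, a border exchange with neighboring GPUs would follow each colored sweep.

```cuda
// Minimal sketch: one colored half-sweep of red-black SOR for the 2D
// Laplace equation. Grid values live in a single N*N array u; boundary
// rows and columns hold the fixed Dirichlet data and are never written.
#include <cuda_runtime.h>

__global__ void sor_color_sweep(float* u, int N, float omega, int color)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y + 1;  // interior row
    int j = blockIdx.x * blockDim.x + threadIdx.x + 1;  // interior col
    if (i >= N - 1 || j >= N - 1) return;
    if (((i + j) & 1) != color) return;                 // update one color only

    // For Laplace's equation the Gauss-Seidel value is the 4-point average;
    // SOR over-relaxes the correction by omega (1 < omega < 2).
    float gs = 0.25f * (u[(i - 1) * N + j] + u[(i + 1) * N + j] +
                        u[i * N + (j - 1)] + u[i * N + (j + 1)]);
    u[i * N + j] += omega * (gs - u[i * N + j]);
}

// One full SOR iteration = red sweep, then black sweep. Red points depend
// only on black neighbors and vice versa, so each sweep is fully parallel.
void sor_iteration(float* d_u, int N, float omega)
{
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    sor_color_sweep<<<grid, block>>>(d_u, N, omega, 0);  // red
    sor_color_sweep<<<grid, block>>>(d_u, N, omega, 1);  // black
}
```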
Computing in Science & Engineering, 2012
Unsteady computational fluid dynamics simulations of turbulence are performed using up to 64 graphics processors. The results from two GPU clusters and a CPU cluster are compared. A second-order staggered-mesh spatial discretization is coupled with a low-storage three-step Runge-Kutta time advancement and a pressure projection at each substep. The pressure Poisson equation dominates the solution time and is solved with the preconditioned Conjugate Gradient method. The CFD algorithm is optimized to use the fast shared memory on the GPUs and to overlap communication with computation. Detailed timings reveal that the internal calculations now occur so efficiently that the operations related to communication are the scaling bottleneck at all but the very largest problem sizes that can fit on the hardware.

2. Implementation

2.1 Hardware

We will primarily discuss results computed on the Lincoln supercomputer housed at NCSA. This machine has 96 Tesla S1070s (384 GPUs total). Each GPU has 4 GB of memory and a theoretical bandwidth of 102 GB/s to that memory. Each GPU has a 4x PCIe Gen2 connection (2 GB/s) to its CPU host. In addition, we will also perform tests on Orion, an in-house GPU machine containing four NVIDIA GTX 295 cards (8 GPUs). On Orion, each GPU has 0.9 GB of memory and a theoretical bandwidth of 112 GB/s. Each GPU on Orion uses an 8x PCIe Gen2 connection (4 GB/s), and for simplicity communication still uses the MPI protocol. We also replaced the first and second GPUs with GTX 480 and Tesla C2070 cards in order to run some cases on these newer GPUs. These GPU results will be compared against the CPU cores on Lincoln (quad-core Intel 64 Harpertown).

The low-level CFD algorithm structure is dictated by two key features of the GPU hardware. First, the GPU reads and writes memory an order of magnitude faster when the memory is accessed linearly; random reads and writes are comparatively slow. In addition, each multiprocessor on the GPU has some very fast on-chip memory (shared memory) which serves essentially as an addressable, program-supervised cache. CFD, like most three-dimensional PDE solution applications, requires considerable random memory access (even when using structured meshes). Roughly 90% of these slow random memory accesses can be eliminated by: (1) linearly reading large chunks of data into the shared-memory space, which is fast for all access patterns, (2) operating on the data in shared memory, and then (3) writing the processed data back to the main GPU memory (global memory) linearly. This optimization is the key to the 45x speedup of the GPU over a CPU.
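A minimal sketch of that three-step shared-memory pattern, reduced to a 1D 3-point stencil for brevity (the paper's CFD kernels are 3D; all names here are ours, not theirs):

```cuda
// Step 1: each block reads a contiguous chunk plus halo points into shared
// memory with coalesced (linear) loads. Step 2: the "random" neighbor
// accesses hit fast shared memory. Step 3: results are written back linearly.
#include <cuda_runtime.h>

#define TILE 256

__global__ void stencil3(const float* __restrict__ in, float* out, int n)
{
    __shared__ float s[TILE + 2];              // tile plus one halo cell per side
    int g = blockIdx.x * TILE + threadIdx.x;   // global index
    int l = threadIdx.x + 1;                   // local index, shifted for halo

    if (g < n) s[l] = in[g];                   // step 1: coalesced load
    if (threadIdx.x == 0)                      // edge threads load the halos
        s[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == TILE - 1 || g == n - 1)
        s[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    if (g < n)                                 // steps 2-3: compute from shared
        out[g] = 0.25f * s[l - 1] + 0.5f * s[l] + 0.25f * s[l + 1];
}
```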
Concurrency and Computation: Practice and Experience, 2012
Graphics processing units (GPUs) are good data-parallel performance accelerators for solving regular-mesh partial differential equations (PDEs), whereby low-latency communications and high compute-to-communication ratios can yield very high levels of computational efficiency. Finite-difference time-domain methods still play an important role in many PDE applications. Iterative multi-grid and multilevel algorithms can converge faster than ordinary finite-difference methods but can be much more difficult to parallelize under GPU memory constraints. We report on some practical algorithmic and data-layout approaches and on performance data for a range of GPUs with CUDA. We focus on the use of multiple GPU devices with a single CPU host and the asynchronous CPU/GPU communication issues involved. We obtain more than two orders of magnitude of speedup over a comparable CPU core.
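The asynchronous single-host/multi-GPU pattern the abstract refers to might look like the following sketch (our own construction, not the paper's code): one CUDA stream per device, with asynchronous copies so the single host thread keeps every device busy concurrently.

```cuda
#include <cuda_runtime.h>

// Launch one relaxation sweep per device and stage each device's border
// layer back to pinned host memory without blocking the CPU, so all
// devices proceed concurrently. h_halo buffers must be pinned
// (cudaMallocHost) for cudaMemcpyAsync to be truly asynchronous.
void sweep_and_stage_halos(float** d_u, float** h_halo, int nDev,
                           size_t haloBytes)
{
    cudaStream_t* s = new cudaStream_t[nDev];
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&s[d]);
        // The device's share of the sweep would be enqueued here, e.g.
        // sweep_kernel<<<grid, block, 0, s[d]>>>(d_u[d], ...);
        cudaMemcpyAsync(h_halo[d], d_u[d], haloBytes,
                        cudaMemcpyDeviceToHost, s[d]);
    }
    for (int d = 0; d < nDev; ++d) {   // CPU work could overlap before this
        cudaSetDevice(d);
        cudaStreamSynchronize(s[d]);
        cudaStreamDestroy(s[d]);
    }
    delete[] s;
}
```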
Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015, 2015
Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed among programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU only accesses its own memory. This increases the complexity of the host code, as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines for programming multi-GPU systems like NUMA shared-memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. The programmability benefits of the shared-memory model on GPUs are shown using finite-difference and image-filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and has limited bandwidth compared to the local GPU memory, we show that the highly multithreaded GPU execution model can help reduce these costs. Evaluation of the finite-difference and image-filtering GPU-SM implementations shows close to linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of the finite-difference application).
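A minimal sketch of the peer-to-peer mechanism GPU-SM builds on (our own example, not GPU-SM code): once peer access is enabled, a kernel on one GPU can dereference a pointer allocated on another, and the hardware routes the loads over the interconnect.

```cuda
#include <cuda_runtime.h>

__global__ void axpy_remote(float a, const float* x_remote, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x_remote[i];   // x_remote lives on the peer GPU
}

int main()
{
    const int n = 1 << 20;
    float *x1, *y0;

    cudaSetDevice(1);
    cudaMalloc(&x1, n * sizeof(float));   // data resident on GPU 1

    cudaSetDevice(0);
    cudaMalloc(&y0, n * sizeof(float));
    cudaDeviceEnablePeerAccess(1, 0);     // let GPU 0 access GPU 1's memory

    // GPU 0 reads x1 remotely over PCIe; the high thread count helps hide
    // the long interconnect latency, as the paper's analysis discusses.
    axpy_remote<<<(n + 255) / 256, 256>>>(2.0f, x1, y0, n);
    cudaDeviceSynchronize();
    return 0;
}
```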
ArXiv, 2018
The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy these performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabrics, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework and benchmark suite for evaluating multi-GPU system design solutions. In this work, we present MGSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. We complement MGSim with MGMark, a suite of multi-GPU workloads that explores multi-GPU collaborative execution patterns. Our simulator is scalable and...
2010 39th International Conference on Parallel Processing, 2010
To exploit the full potential of GPGPUs for general-purpose computing, the DOACR parallelism abundant in scientific and engineering applications must be harnessed. However, the presence of cross-iteration data dependences in DOACR loops poses an obstacle to executing their computations concurrently using a massive number of fine-grained threads. This work focuses on iterative PDE solvers rich in DOACR parallelism to identify optimization principles and strategies that allow their efficient mapping to GPGPUs. Our main finding is that certain DOACR loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amenable to GPGPU parallelization, judiciously optimized (by the compiler), and carefully tuned (by a performance-tuning tool). We substantiate this finding with a case study presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained unconventionally, by starting from a K-layer SSOR method and then parallelizing it with a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a generalized loop tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy among domain experts, compiler optimizations, and performance tuning in maximizing the performance of applications, particularly PDE-based DOACR loops, on GPGPUs.
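To illustrate the obstacle the abstract names (this is not the paper's K-layer SSOR, just a plain host-side example of ours), consider a lexicographic SOR sweep: each iteration reads values written earlier in the same sweep, which is exactly the cross-iteration dependence that prevents a naive one-thread-per-point GPU mapping.

```cuda
// Why a lexicographic SOR sweep is a DOACR loop: u[i-1][j] and u[i][j-1]
// were already updated earlier in THIS sweep, so iterations cannot simply
// be mapped to independent GPU threads.
void sor_lexicographic(float* u, int N, float omega)
{
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j) {
            float gs = 0.25f * (u[(i - 1) * N + j] + u[(i + 1) * N + j] +
                                u[i * N + (j - 1)] + u[i * N + (j + 1)]);
            u[i * N + j] += omega * (gs - u[i * N + j]);
        }
}
// Red-black SOR removes the dependence by reordering updates (red points
// depend only on black points and vice versa); the paper's SSOR variant
// instead restructures the algorithm to regain the data reuse that
// red-black ordering sacrifices.
```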
Proceedings of the 2004 ACM/IEEE conference on …, 2004
Inspired by the attractive FLOPS/dollar ratio and the incredible growth in the speed of modern graphics processing units (GPUs), we propose to use a cluster of GPUs for high-performance scientific computing. As an example application, we have developed a parallel flow simulation using the lattice Boltzmann model (LBM) on a GPU cluster and have simulated the dispersion of airborne contaminants in the Times Square area of New York City. Using 30 GPU nodes, our simulation can compute a 480×400×80 LBM step in 0.31 seconds, a speed 4.6 times faster than that of our CPU cluster implementation. Besides the LBM, we also discuss other potential applications of the GPU cluster, such as cellular automata, PDE solvers, and FEM.
2014
Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with up to one trillion ($10^{12}$) unknowns, the code has to make efficient use of several million individual processor cores on large GPU clusters. We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic elliptic PDEs which are encountered in atmospheric modelling. In this application the condition number is large but independent of the grid resolution, and both methods are asymptotically optimal, albeit with different absolute performance. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. We achieve a performance of up to 0.78 PFLOPs when solving an equation with $0.55\cdot 10^{12}$ unknowns on 16384 GPUs; this corresponds to about $3\%$ of the theoretical peak performance of the machine, and we use more than $40\%$ of the peak memory bandwidth with a Conjugate Gradient (CG) solver. Although the other solver, a geometric multigrid algorithm, has slightly worse performance in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.
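For reference, the textbook (unpreconditioned) CG loop underlying the bandwidth analysis might be sketched as follows. This is our own sketch, not the authors' multi-GPU implementation: apply_A stands in for their anisotropic stencil operator, and cuBLAS supplies the bandwidth-bound vector updates.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Stencil/matvec kernel, assumed provided elsewhere: z = A * p.
void apply_A(const double* p, double* z, int n);

// x is zero-initialized, r holds b on entry; p and z are workspace,
// all device arrays of length n.
void cg_solve(double* x, double* r, double* p, double* z, int n,
              int maxIter, double tol)
{
    cublasHandle_t h;
    cublasCreate(&h);
    double rr, rrOld, pz, alpha, beta, one = 1.0;

    cublasDcopy(h, n, r, 1, p, 1);                 // p = r  (since x0 = 0)
    cublasDdot(h, n, r, 1, r, 1, &rr);
    for (int k = 0; k < maxIter && rr > tol * tol; ++k) {
        apply_A(p, z, n);                          // z = A p (stencil kernel)
        cublasDdot(h, n, p, 1, z, 1, &pz);
        alpha = rr / pz;
        cublasDaxpy(h, n, &alpha, p, 1, x, 1);     // x += alpha p
        alpha = -alpha;
        cublasDaxpy(h, n, &alpha, z, 1, r, 1);     // r -= alpha A p
        rrOld = rr;
        cublasDdot(h, n, r, 1, r, 1, &rr);
        beta = rr / rrOld;
        cublasDscal(h, n, &beta, p, 1);            // p = r + beta p
        cublasDaxpy(h, n, &one, r, 1, p, 1);
    }
    cublasDestroy(h);
}
```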
Computing in Science & Engineering, 2021
The increasingly diverse ecosystem of high-performance architectures and programming models presents a mounting challenge for programmers seeking to accelerate scientific computing applications. Code generation offers a promising solution, transforming a simple and general representation of computations into lower-level, hardware-specialized and -optimized code. We describe the philosophy set forth by the Python package Pystella, a high-performance framework built upon such tools to solve partial differential equations on structured grids. A hierarchy of abstractions provides increasingly expressive interfaces for specifying physical systems and algorithms. We present an example application from cosmology, using finite-difference methods to solve coupled hyperbolic partial differential equations on (multiple) GPUs. Such an approach may enable a broad range of domain scientists to make efficient use of increasingly heterogeneous computational resources while mitigating the drastic effort and expertise nominally required to do so.
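As an illustration of the kind of low-level kernel such a code generator emits from a high-level specification (hand-written here for illustration, not actual Pystella output), consider a second-order finite-difference Laplacian on a 3D structured grid, one thread per interior point:

```cuda
#include <cuda_runtime.h>

__global__ void laplacian3d(const float* __restrict__ f, float* out,
                            int nx, int ny, int nz, float inv_h2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 ||
        k < 1 || k >= nz - 1) return;

    long idx = ((long)k * ny + j) * nx + i;
    long sx = 1, sy = nx, sz = (long)nx * ny;   // strides per direction

    // 7-point second-order Laplacian stencil, scaled by 1/h^2.
    out[idx] = inv_h2 * (f[idx + sx] + f[idx - sx] +
                         f[idx + sy] + f[idx - sy] +
                         f[idx + sz] + f[idx - sz] - 6.0f * f[idx]);
}
```

A framework of this kind would generate many such kernels from a symbolic description of the coupled hyperbolic system, choosing tiling and loop structure per target architecture.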
ArXiv, 2014
Since the first idea of using GPUs for general-purpose computing, things have evolved over the years, and now there are several approaches to GPU programming. GPU computing practically began with the introduction of CUDA (Compute Unified Device Architecture) by NVIDIA and Stream by AMD. These are APIs designed by the GPU vendors to be used together with the hardware that they provide. An emerging standard, OpenCL (Open Computing Language), tries to unify the different GPU general-purpose computing API implementations and provides a framework for writing programs that execute across heterogeneous platforms consisting of both CPUs and GPUs. OpenCL provides parallel computing using task-based and data-based parallelism. In this paper we will focus on the CUDA parallel computing architecture and programming model introduced by NVIDIA. We will present the benefits of the CUDA programming model. We will also compare the two main approaches, CUDA and AMD APP (Stream), and the new framework, OpenCL, that tries...
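The canonical example of the CUDA programming model discussed here: the host allocates device memory, copies data across the bus, and launches a grid of lightweight threads, each of which handles one element.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_a, bytes, cudaMemcpyHostToDevice);  // b = a here

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // launch the grid
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %f\n", h_c[42]);                      // expect 84.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_c);
    return 0;
}
```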
Journal of Computational Science, 2019
Here we present the first multi-node/multi-GPU implementation of OpenCAL for grid-based high-performance numerical simulation. OpenCL and MPI have been adopted as low-level APIs for maximum portability, and performance is evaluated with respect to three different benchmarks, namely a Sobel edge-detection filter, a Julia fractal generator, and the SciddicaT Cellular Automata model for fluid-flow simulation. Different hardware configurations of a dual-node test cluster have been considered, allowing for executions on up to four GPUs. Optimal performance has been achieved in consideration of the compute- or memory-bound nature of the benchmarks and hardware configurations.
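For a flavor of the first benchmark, here is a generic Sobel edge-detection kernel. The paper's implementation goes through OpenCAL's OpenCL back end; this is our own independent CUDA sketch of the same stencil-style computation.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__global__ void sobel(const unsigned char* in, unsigned char* out,
                      int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= w - 1 || y < 1 || y >= h - 1) return;

    // 3x3 Sobel gradients in x and y.
    int gx = -in[(y-1)*w + x-1] + in[(y-1)*w + x+1]
             - 2*in[y*w + x-1]  + 2*in[y*w + x+1]
             - in[(y+1)*w + x-1] + in[(y+1)*w + x+1];
    int gy = -in[(y-1)*w + x-1] - 2*in[(y-1)*w + x] - in[(y-1)*w + x+1]
             + in[(y+1)*w + x-1] + 2*in[(y+1)*w + x] + in[(y+1)*w + x+1];

    float mag = sqrtf((float)(gx * gx + gy * gy));
    out[y*w + x] = (unsigned char)fminf(mag, 255.0f);
}
// In the multi-node setting, each MPI rank would own a horizontal strip of
// the image and exchange one-row halos with its neighbors, analogous to
// the border exchanges in the PDE solvers above.
```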