2010
Abstract. Due to the power and frequency walls, the trend is now to use multiple GPUs in a given system, much as multiple cores are found in CPU-based systems. However, deepening the resource hierarchy widens the spectrum of factors that may impact the performance of the system. The goal of this paper is to analyze such factors by investigating and benchmarking the NVIDIA Tesla S1070. This system combines four T10 GPUs, making up to 4 TFLOPS of computational power available. As a case study, we develop a red-black SOR PDE solver for Laplace equations with Dirichlet boundaries, well known for requiring constant communication to exchange neighboring data. To aid both design and analysis, we propose a model for multi-GPU systems targeting communication between the GPUs. The main variables exposed by our benchmark application are: domain size and shape, kind of data partitioning, number of GPUs, width of the borders to exchange, kernels to use, and kind of...
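For concreteness, here is a minimal single-GPU sketch (our own construction, not the paper's code) of one red-black SOR iteration for the 2D Laplace equation with Dirichlet boundaries. The kernel and parameter names are ours; in the paper's multi-GPU setting, a border exchange with neighboring GPUs would follow each colored sweep.

```cuda
// Minimal sketch: one colored half-sweep of red-black SOR for the 2D
// Laplace equation. Grid values live in a single N*N array u; boundary
// rows and columns hold the fixed Dirichlet data and are never written.
#include <cuda_runtime.h>

__global__ void sor_color_sweep(float* u, int N, float omega, int color)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y + 1;  // interior row
    int j = blockIdx.x * blockDim.x + threadIdx.x + 1;  // interior col
    if (i >= N - 1 || j >= N - 1) return;
    if (((i + j) & 1) != color) return;                 // update one color only

    // For Laplace's equation the Gauss-Seidel value is the 4-point average;
    // SOR over-relaxes the correction by omega (1 < omega < 2).
    float gs = 0.25f * (u[(i - 1) * N + j] + u[(i + 1) * N + j] +
                        u[i * N + (j - 1)] + u[i * N + (j + 1)]);
    u[i * N + j] += omega * (gs - u[i * N + j]);
}

// One full SOR iteration = red sweep, then black sweep. Red points depend
// only on black neighbors and vice versa, so each sweep is fully parallel.
void sor_iteration(float* d_u, int N, float omega)
{
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    sor_color_sweep<<<grid, block>>>(d_u, N, omega, 0);  // red
    sor_color_sweep<<<grid, block>>>(d_u, N, omega, 1);  // black
}
```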
Computing in Science & Engineering, 2012
Unsteady computational fluid dynamics simulations of turbulence are performed using up to 64 graphics processors. The results from two GPU clusters and a CPU cluster are compared. A second-order staggered-mesh spatial discretization is coupled with a low-storage three-step Runge-Kutta time advancement and a pressure projection at each substep. The pressure Poisson equation dominates the solution time and is solved with the preconditioned Conjugate Gradient method. The CFD algorithm is optimized to use the fast shared memory on the GPUs and to overlap communication with computation. Detailed timings reveal that the internal calculations now occur so efficiently that the operations related to communication are the scaling bottleneck at all but the very largest problem sizes that can fit on the hardware.

2. Implementation

2.1 Hardware

We will primarily discuss results computed on the Lincoln supercomputer housed at NCSA. This machine has 96 Tesla S1070s (384 GPUs total). Each GPU has 4 GB of memory and a theoretical bandwidth of 102 GB/s to that memory. Each GPU has a 4x PCIe Gen2 connection (2 GB/s) to its CPU host. In addition, we will also perform tests on Orion, an in-house GPU machine containing four NVIDIA GTX 295 cards (8 GPUs). On Orion, each GPU has 0.9 GB of memory and a theoretical bandwidth of 112 GB/s. Each GPU on Orion uses an 8x PCIe Gen2 connection (4 GB/s), and for simplicity communication still uses the MPI protocol. We also replaced the first and second GPUs with GTX 480 and Tesla C2070 cards in order to run some cases on these newer GPUs. These GPU results will be compared against the CPU cores on Lincoln (quad-core Intel 64 Harpertown).

The low-level CFD algorithm structure is dictated by two key features of the GPU hardware. First, the GPU reads and writes memory an order of magnitude faster when the memory is accessed linearly; random reads and writes are comparatively slow. In addition, each multiprocessor on the GPU has some very fast on-chip memory (shared memory) which serves essentially as an addressable, program-supervised cache. CFD, like most three-dimensional PDE solution applications, requires considerable random memory access (even when using structured meshes). Roughly 90% of these slow random memory accesses can be eliminated by: (1) linearly reading large chunks of data into the shared-memory space, which is fast for all access patterns, (2) operating on the data in shared memory, and then (3) writing the processed data back to the main GPU memory (global memory) linearly. This optimization is the key to the 45x speedup of the GPU over a CPU.
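A minimal sketch of that three-step shared-memory pattern, reduced to a 1D 3-point stencil for brevity (the paper's CFD kernels are 3D; all names here are ours, not theirs):

```cuda
// Step 1: each block reads a contiguous chunk plus halo points into shared
// memory with coalesced (linear) loads. Step 2: the "random" neighbor
// accesses hit fast shared memory. Step 3: results are written back linearly.
#include <cuda_runtime.h>

#define TILE 256

__global__ void stencil3(const float* __restrict__ in, float* out, int n)
{
    __shared__ float s[TILE + 2];              // tile plus one halo cell per side
    int g = blockIdx.x * TILE + threadIdx.x;   // global index
    int l = threadIdx.x + 1;                   // local index, shifted for halo

    if (g < n) s[l] = in[g];                   // step 1: coalesced load
    if (threadIdx.x == 0)                      // edge threads load the halos
        s[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == TILE - 1 || g == n - 1)
        s[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    if (g < n)                                 // steps 2-3: compute from shared
        out[g] = 0.25f * s[l - 1] + 0.5f * s[l] + 0.25f * s[l + 1];
}
```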
Concurrency and Computation: Practice and Experience, 2012
Graphics processing units (GPUs) are good data-parallel performance accelerators for solving regular-mesh partial differential equations (PDEs), whereby low-latency communications and high compute-to-communication ratios can yield very high levels of computational efficiency. Finite-difference time-domain methods still play an important role in many PDE applications. Iterative multi-grid and multilevel algorithms can converge faster than ordinary finite-difference methods but can be much more difficult to parallelize under GPU memory constraints. We report on some practical algorithmic and data-layout approaches and on performance data for a range of GPUs with CUDA. We focus on the use of multiple GPU devices with a single CPU host and the asynchronous CPU/GPU communication issues involved. We obtain more than two orders of magnitude of speedup over a comparable CPU core.
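The asynchronous single-host/multi-GPU pattern the abstract refers to might look like the following sketch (our own construction, not the paper's code): one CUDA stream per device, with asynchronous copies so the single host thread keeps every device busy concurrently.

```cuda
#include <cuda_runtime.h>

// Launch one relaxation sweep per device and stage each device's border
// layer back to pinned host memory without blocking the CPU, so all
// devices proceed concurrently. h_halo buffers must be pinned
// (cudaMallocHost) for cudaMemcpyAsync to be truly asynchronous.
void sweep_and_stage_halos(float** d_u, float** h_halo, int nDev,
                           size_t haloBytes)
{
    cudaStream_t* s = new cudaStream_t[nDev];
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&s[d]);
        // The device's share of the sweep would be enqueued here, e.g.
        // sweep_kernel<<<grid, block, 0, s[d]>>>(d_u[d], ...);
        cudaMemcpyAsync(h_halo[d], d_u[d], haloBytes,
                        cudaMemcpyDeviceToHost, s[d]);
    }
    for (int d = 0; d < nDev; ++d) {   // CPU work could overlap before this
        cudaSetDevice(d);
        cudaStreamSynchronize(s[d]);
        cudaStreamDestroy(s[d]);
    }
    delete[] s;
}
```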
Proceedings of the 8th Workshop on General Purpose Processing using GPUs - GPGPU 2015, 2015
Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed among programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU only accesses its own memory. This increases the complexity of the host code, as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines for programming multi-GPU systems like NUMA shared-memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. The programmability benefits of the shared-memory model on GPUs are shown using finite-difference and image-filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and has limited bandwidth compared to the local GPU memory, we show that the highly multithreaded GPU execution model can help reduce these costs. Evaluation of the finite-difference and image-filtering GPU-SM implementations shows close to linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of the finite-difference application).
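A minimal sketch of the peer-to-peer mechanism GPU-SM builds on (our own example, not GPU-SM code): once peer access is enabled, a kernel on one GPU can dereference a pointer allocated on another, and the hardware routes the loads over the interconnect.

```cuda
#include <cuda_runtime.h>

__global__ void axpy_remote(float a, const float* x_remote, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x_remote[i];   // x_remote lives on the peer GPU
}

int main()
{
    const int n = 1 << 20;
    float *x1, *y0;

    cudaSetDevice(1);
    cudaMalloc(&x1, n * sizeof(float));   // data resident on GPU 1

    cudaSetDevice(0);
    cudaMalloc(&y0, n * sizeof(float));
    cudaDeviceEnablePeerAccess(1, 0);     // let GPU 0 access GPU 1's memory

    // GPU 0 reads x1 remotely over PCIe; the high thread count helps hide
    // the long interconnect latency, as the paper's analysis discusses.
    axpy_remote<<<(n + 255) / 256, 256>>>(2.0f, x1, y0, n);
    cudaDeviceSynchronize();
    return 0;
}
```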
ArXiv, 2018
The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy these performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabrics, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework and benchmark suite for evaluating multi-GPU system design solutions. In this work, we present MGSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. We complement MGSim with MGMark, a suite of multi-GPU workloads that explores multi-GPU collaborative execution patterns. Our simulator is scalable and...
2010 39th International Conference on Parallel Processing, 2010
To exploit the full potential of GPGPUs for general-purpose computing, the DOACR parallelism abundant in scientific and engineering applications must be harnessed. However, the presence of cross-iteration data dependences in DOACR loops poses an obstacle to executing their computations concurrently using a massive number of fine-grained threads. This work focuses on iterative PDE solvers rich in DOACR parallelism to identify optimization principles and strategies that allow their efficient mapping to GPGPUs. Our main finding is that certain DOACR loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amenable to GPGPU parallelization, judiciously optimized (by the compiler), and carefully tuned (by a performance-tuning tool). We substantiate this finding with a case study presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained unconventionally, by starting from a K-layer SSOR method and then parallelizing it with a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a generalized loop tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy among domain experts, compiler optimizations, and performance tuning in maximizing the performance of applications, particularly PDE-based DOACR loops, on GPGPUs.
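To illustrate the obstacle the abstract names (this is not the paper's K-layer SSOR, just a plain host-side example of ours), consider a lexicographic SOR sweep: each iteration reads values written earlier in the same sweep, which is exactly the cross-iteration dependence that prevents a naive one-thread-per-point GPU mapping.

```cuda
// Why a lexicographic SOR sweep is a DOACR loop: u[i-1][j] and u[i][j-1]
// were already updated earlier in THIS sweep, so iterations cannot simply
// be mapped to independent GPU threads.
void sor_lexicographic(float* u, int N, float omega)
{
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j) {
            float gs = 0.25f * (u[(i - 1) * N + j] + u[(i + 1) * N + j] +
                                u[i * N + (j - 1)] + u[i * N + (j + 1)]);
            u[i * N + j] += omega * (gs - u[i * N + j]);
        }
}
// Red-black SOR removes the dependence by reordering updates (red points
// depend only on black points and vice versa); the paper's SSOR variant
// instead restructures the algorithm to regain the data reuse that
// red-black ordering sacrifices.
```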
Proceedings of the 2004 ACM/IEEE conference on …, 2004
Inspired by the attractive FLOPS/dollar ratio and the incredible growth in the speed of modern graphics processing units (GPUs), we propose to use a cluster of GPUs for high-performance scientific computing. As an example application, we have developed a parallel flow simulation using the lattice Boltzmann model (LBM) on a GPU cluster and have simulated the dispersion of airborne contaminants in the Times Square area of New York City. Using 30 GPU nodes, our simulation can compute a 480×400×80 LBM step in 0.31 seconds, a speed 4.6 times faster than that of our CPU cluster implementation. Besides the LBM, we also discuss other potential applications of the GPU cluster, such as cellular automata, PDE solvers, and FEM.
2014
Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with up to one trillion ($10^{12}$) unknowns, the code has to make efficient use of several million individual processor cores on large GPU clusters. We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic elliptic PDEs which are encountered in atmospheric modelling. In this application the condition number is large but independent of the grid resolution, and both methods are asymptotically optimal, albeit with different absolute performance. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. We achieve a performance of up to 0.78 PFLOPs when solving an equation with $0.55\cdot 10^{12}$ unknowns on 16384 GPUs; this corresponds to about $3\%$ of the theoretical peak performance of the machine, and we use more than $40\%$ of the peak memory bandwidth with a Conjugate Gradient (CG) solver. Although the other solver, a geometric multigrid algorithm, has slightly worse performance in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.
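For reference, the textbook (unpreconditioned) CG loop underlying the bandwidth analysis might be sketched as follows. This is our own sketch, not the authors' multi-GPU implementation: apply_A stands in for their anisotropic stencil operator, and cuBLAS supplies the bandwidth-bound vector updates.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Stencil/matvec kernel, assumed provided elsewhere: z = A * p.
void apply_A(const double* p, double* z, int n);

// x is zero-initialized, r holds b on entry; p and z are workspace,
// all device arrays of length n.
void cg_solve(double* x, double* r, double* p, double* z, int n,
              int maxIter, double tol)
{
    cublasHandle_t h;
    cublasCreate(&h);
    double rr, rrOld, pz, alpha, beta, one = 1.0;

    cublasDcopy(h, n, r, 1, p, 1);                 // p = r  (since x0 = 0)
    cublasDdot(h, n, r, 1, r, 1, &rr);
    for (int k = 0; k < maxIter && rr > tol * tol; ++k) {
        apply_A(p, z, n);                          // z = A p (stencil kernel)
        cublasDdot(h, n, p, 1, z, 1, &pz);
        alpha = rr / pz;
        cublasDaxpy(h, n, &alpha, p, 1, x, 1);     // x += alpha p
        alpha = -alpha;
        cublasDaxpy(h, n, &alpha, z, 1, r, 1);     // r -= alpha A p
        rrOld = rr;
        cublasDdot(h, n, r, 1, r, 1, &rr);
        beta = rr / rrOld;
        cublasDscal(h, n, &beta, p, 1);            // p = r + beta p
        cublasDaxpy(h, n, &one, r, 1, p, 1);
    }
    cublasDestroy(h);
}
```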
Computing in Science & Engineering, 2021
The increasingly diverse ecosystem of high-performance architectures and programming models presents a mounting challenge for programmers seeking to accelerate scientific computing applications. Code generation offers a promising solution, transforming a simple and general representation of computations into lower-level, hardware-specialized and -optimized code. We describe the philosophy set forth by the Python package Pystella, a high-performance framework built upon such tools to solve partial differential equations on structured grids. A hierarchy of abstractions provides increasingly expressive interfaces for specifying physical systems and algorithms. We present an example application from cosmology, using finite-difference methods to solve coupled hyperbolic partial differential equations on (multiple) GPUs. Such an approach may enable a broad range of domain scientists to make efficient use of increasingly heterogeneous computational resources while mitigating the drastic effort and expertise nominally required to do so.
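As an illustration of the kind of low-level kernel such a code generator emits from a high-level specification (hand-written here for illustration, not actual Pystella output), consider a second-order finite-difference Laplacian on a 3D structured grid, one thread per interior point:

```cuda
#include <cuda_runtime.h>

__global__ void laplacian3d(const float* __restrict__ f, float* out,
                            int nx, int ny, int nz, float inv_h2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 ||
        k < 1 || k >= nz - 1) return;

    long idx = ((long)k * ny + j) * nx + i;
    long sx = 1, sy = nx, sz = (long)nx * ny;   // strides per direction

    // 7-point second-order Laplacian stencil, scaled by 1/h^2.
    out[idx] = inv_h2 * (f[idx + sx] + f[idx - sx] +
                         f[idx + sy] + f[idx - sy] +
                         f[idx + sz] + f[idx - sz] - 6.0f * f[idx]);
}
```

A framework of this kind would generate many such kernels from a symbolic description of the coupled hyperbolic system, choosing tiling and loop structure per target architecture.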
ArXiv, 2014
Since the first idea of using GPUs for general-purpose computing, things have evolved over the years, and now there are several approaches to GPU programming. GPU computing practically began with the introduction of CUDA (Compute Unified Device Architecture) by NVIDIA and Stream by AMD. These are APIs designed by the GPU vendors to be used together with the hardware that they provide. An emerging standard, OpenCL (Open Computing Language), tries to unify the different GPU general-purpose computing API implementations and provides a framework for writing programs that execute across heterogeneous platforms consisting of both CPUs and GPUs. OpenCL provides parallel computing using task-based and data-based parallelism. In this paper we will focus on the CUDA parallel computing architecture and programming model introduced by NVIDIA. We will present the benefits of the CUDA programming model. We will also compare the two main approaches, CUDA and AMD APP (Stream), and the new framework, OpenCL, that tries...
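The canonical example of the CUDA programming model discussed here: the host allocates device memory, copies data across the bus, and launches a grid of lightweight threads, each of which handles one element.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_a, bytes, cudaMemcpyHostToDevice);  // b = a here

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // launch the grid
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %f\n", h_c[42]);                      // expect 84.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_c);
    return 0;
}
```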
Journal of Computational Science, 2019
Here we present the first multi-node/multi-GPU implementation of OpenCAL for grid-based high-performance numerical simulation. OpenCL and MPI have been adopted as low-level APIs for maximum portability, and performance is evaluated with respect to three different benchmarks, namely a Sobel edge-detection filter, a Julia fractal generator, and the SciddicaT Cellular Automata model for fluid-flow simulation. Different hardware configurations of a dual-node test cluster have been considered, allowing for executions on up to four GPUs. Optimal performance has been achieved in consideration of the compute- or memory-bound nature of the benchmarks and hardware configurations.
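For a flavor of the first benchmark, here is a generic Sobel edge-detection kernel. The paper's implementation goes through OpenCAL's OpenCL back end; this is our own independent CUDA sketch of the same stencil-style computation.

```cuda
#include <cuda_runtime.h>
#include <math.h>

__global__ void sobel(const unsigned char* in, unsigned char* out,
                      int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= w - 1 || y < 1 || y >= h - 1) return;

    // 3x3 Sobel gradients in x and y.
    int gx = -in[(y-1)*w + x-1] + in[(y-1)*w + x+1]
             - 2*in[y*w + x-1]  + 2*in[y*w + x+1]
             - in[(y+1)*w + x-1] + in[(y+1)*w + x+1];
    int gy = -in[(y-1)*w + x-1] - 2*in[(y-1)*w + x] - in[(y-1)*w + x+1]
             + in[(y+1)*w + x-1] + 2*in[(y+1)*w + x] + in[(y+1)*w + x+1];

    float mag = sqrtf((float)(gx * gx + gy * gy));
    out[y*w + x] = (unsigned char)fminf(mag, 255.0f);
}
// In the multi-node setting, each MPI rank would own a horizontal strip of
// the image and exchange one-row halos with its neighbors, analogous to
// the border exchanges in the PDE solvers above.
```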