2010
…
11 pages
Abstract. We present a task-parallel asynchronous API for numerical linear algebra that utilizes multiple CPUs, multiple GPUs, or a combination of both. Furthermore, we present a wrapper of this interface for use in MATLAB. Our API imposes only small overheads, scales perfectly to two processor cores, and shows even better performance when utilizing computational resources on the GPU. Key words: asynchronous, multicore, GPU, MATLAB, CUBLAS, double precision
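The paper's own interface is not reproduced here, but a minimal sketch of the asynchronous pattern such an API builds on, a double-precision GEMM enqueued on a CUDA stream through cuBLAS while the CPU continues with other work, could look as follows; the matrix size, data, and synchronization point are illustrative assumptions.

```cpp
// Minimal sketch (not the paper's API): enqueue a double-precision matrix
// multiply on a CUDA stream via cuBLAS so the CPU stays free until the
// result is actually needed. Sizes and data are illustrative.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0), hC(n * n);

    double *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(double));
    cudaMalloc(&dB, n * n * sizeof(double));
    cudaMalloc(&dC, n * n * sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, stream);  // cuBLAS calls below run asynchronously on 'stream'

    // Copies are asynchronous with respect to the host (pinned host memory
    // would be required for full overlap; omitted for brevity).
    cudaMemcpyAsync(dA, hA.data(), n * n * sizeof(double), cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dB, hB.data(), n * n * sizeof(double), cudaMemcpyHostToDevice, stream);

    // C = 1.0*A*B + 0.0*C is only enqueued here; the call returns immediately.
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpyAsync(hC.data(), dC, n * n * sizeof(double), cudaMemcpyDeviceToHost, stream);

    // ... other CPU work can overlap with the GPU task here ...

    cudaStreamSynchronize(stream);    // block only when the result is needed

    cublasDestroy(handle);
    cudaStreamDestroy(stream);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```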
Acta Numerica, 1993
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, the singular value decomposition, and generalizations of these to two matrices. We consider dense, band and sparse matrices.
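As a concrete instance of the data-reuse principle the survey illustrates with matrix multiplication, a cache-blocked loop nest of the following form loads each b-by-b block of A and B into fast memory once and reuses it for many multiply-adds. This is a generic sketch, not code from the survey, and the block size is an assumption to be tuned per machine.

```cpp
// Generic sketch of cache-blocked matrix multiplication (row-major, C assumed
// zero-initialized): each b-by-b block of A and B is brought into fast memory
// once and reused for b multiply-adds per element, the data-reuse principle
// behind efficient parallel and hierarchical-memory algorithms.
#include <vector>

void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n, int b) {
    for (int ii = 0; ii < n; ii += b)
        for (int kk = 0; kk < n; kk += b)
            for (int jj = 0; jj < n; jj += b)
                // Multiply the (ii,kk) block of A by the (kk,jj) block of B.
                for (int i = ii; i < ii + b && i < n; ++i)
                    for (int k = kk; k < kk + b && k < n; ++k) {
                        const double a_ik = A[i * n + k];
                        for (int j = jj; j < jj + b && j < n; ++j)
                            C[i * n + j] += a_ik * B[k * n + j];
                    }
}
```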
2008
We propose two high-level application programming interfaces (APIs) to use a graphics processing unit (GPU) as a coprocessor for dense linear algebra operations. Combined with an extension of the FLAME API and an implementation on top of NVIDIA CUBLAS, the result is an efficient and user-friendly tool to design, implement, and execute dense linear algebra operations on the current generation of NVIDIA graphics processors, of wide appeal to scientists and engineers. As an application of the developed APIs, we implement and evaluate the performance of three different variants of the Cholesky factorization.
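The FLAME/CUBLAS code itself is not shown here; as a point of reference, one unblocked right-looking variant of the Cholesky factorization is sketched below (column-major storage assumed). In the blocked variants, the column scaling becomes a TRSM and the trailing update a SYRK/GEMM, which are exactly the operations that map naturally onto CUBLAS.

```cpp
// Sketch of one unblocked right-looking Cholesky variant (lower triangular,
// column-major, leading dimension lda), overwriting A with L. In the blocked
// variants, the column scaling becomes a TRSM and the trailing update a
// SYRK/GEMM, the operations offloaded to CUBLAS.
#include <cmath>

void chol_right_looking(double* A, int n, int lda) {
    for (int k = 0; k < n; ++k) {
        A[k + k * lda] = std::sqrt(A[k + k * lda]);   // factor diagonal entry
        for (int i = k + 1; i < n; ++i)
            A[i + k * lda] /= A[k + k * lda];         // scale column k (TRSM in blocked form)
        for (int j = k + 1; j < n; ++j)               // rank-1 trailing update
            for (int i = j; i < n; ++i)               // (SYRK/GEMM in blocked form)
                A[i + j * lda] -= A[i + k * lda] * A[j + k * lda];
    }
}
```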
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and with an excellent FLOPS/watt ratio. There is strong desire to utilize such power, but it can be difficult to harness given the features and limitations of the platform. As such, libraries can be of great utility, allowing novices and experts alike to access high computational performance without knowledge of GPU programming. Some routines, such as FFTs, are quite common in scientific and numerical computing, and it would be wasteful for each user to implement routines that could instead be provided in a centralized library. Moreover, a library routine that is tuned by an expert will often outperform a user's own implementation and be more feature-robust, allowing the user to focus on their particular area of expertise. In this paper we present CULA, a library of linear algebra routines developed using a hybrid computation model employing both CPU and GPU power.
2005
In this chapter we discuss numerical software for linear algebra problems on parallel computers. We focus on some of the most common numerical operations: linear system solving and eigenvalue computations. Numerical operations such as linear system solving and eigenvalue calculations can be applied to two different kinds of matrix: dense and sparse. In dense systems, essentially every matrix element is nonzero. In sparse systems, a sufficiently large number of matrix elements is zero that a specialized storage scheme is warranted; for an introduction to sparse storage, see [3]. Because the two classes are so different, different numerical software usually applies to them. We discuss ScaLAPACK and PLAPACK as the choices for dense linear system solving (see Section 13.1). For solving sparse linear systems, there exist two classes of algorithms: direct methods and iterative methods. We will discuss SuperLU as an example of a direct solver (see Section 13.2.1) and PETSc as an example of itera...
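For readers unfamiliar with the sparse storage referred to above, a minimal compressed sparse row (CSR) layout and a matrix-vector product over it are sketched below; the struct and field names are illustrative, not the SuperLU or PETSc data structures.

```cpp
// Illustrative CSR layout: only nonzeros are stored, together with their
// column indices and per-row offsets into those arrays.
#include <vector>

struct CsrMatrix {
    int n;                       // number of rows
    std::vector<int> row_ptr;    // size n+1: start of each row in col_idx/values
    std::vector<int> col_idx;    // column index of each stored nonzero
    std::vector<double> values;  // the nonzero values themselves
};

// y = A * x, touching only the stored nonzeros; y must have size A.n.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < A.n; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}
```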
Acta Numerica
Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDL^T factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next-generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split int...
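A minimal sketch of that task-splitting style is given below, using OpenMP task dependences in place of the PLASMA/MAGMA runtime systems and simplified stand-ins for the tile kernels; the tile size, storage layout, and kernel names are assumptions for illustration only.

```cpp
// Sketch only: a tile Cholesky written as "sequential-like" loops, with OpenMP
// task dependences standing in for a lightweight runtime system. T points to
// nt*nt contiguous NB-by-NB tiles; tile (i,j) starts at T + (i + j*nt)*NB*NB.
// Kernel bodies are placeholders for LAPACK/BLAS/GPU calls.
#include <omp.h>

constexpr int NB = 128;  // tile size (illustrative)

void potrf_tile(double* Akk)                                       { /* factor diagonal tile */ }
void trsm_tile (const double* Akk, double* Aik)                    { /* triangular solve */ }
void syrk_tile (const double* Aik, double* Aii)                    { /* symmetric rank-NB update */ }
void gemm_tile (const double* Aik, const double* Ajk, double* Aij) { /* GEMM update */ }

void tile_cholesky(double* T, int nt) {
    const long tsz = (long)NB * NB;            // elements per tile
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; ++k) {
        double* Akk = T + (k + k * nt) * tsz;
        #pragma omp task depend(inout: Akk[0])
        potrf_tile(Akk);

        for (int i = k + 1; i < nt; ++i) {
            double* Aik = T + (i + k * nt) * tsz;
            #pragma omp task depend(in: Akk[0]) depend(inout: Aik[0])
            trsm_tile(Akk, Aik);
        }
        for (int i = k + 1; i < nt; ++i) {
            double* Aik = T + (i + k * nt) * tsz;
            double* Aii = T + (i + i * nt) * tsz;
            #pragma omp task depend(in: Aik[0]) depend(inout: Aii[0])
            syrk_tile(Aik, Aii);
            for (int j = k + 1; j < i; ++j) {
                double* Ajk = T + (j + k * nt) * tsz;
                double* Aij = T + (i + j * nt) * tsz;
                #pragma omp task depend(in: Aik[0], Ajk[0]) depend(inout: Aij[0])
                gemm_tile(Aik, Ajk, Aij);
            }
        }
    }
}
```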
ACM Transactions on Mathematical Software
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.
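The BBLAS interface itself is specified in the article; the sketch below shows the underlying "many small, uniformly sized problems in one call" pattern using cuBLAS's existing batched GEMM (cublasDgemmBatched) as a stand-in backend, with the matrix size and batch count chosen arbitrarily.

```cpp
// Sketch of the batched pattern: many independent small GEMMs grouped into one
// routine call. cuBLAS's cublasDgemmBatched stands in as the backend here.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int m = 16, batch = 1000;                 // 1000 independent 16x16 problems
    const size_t bytes = size_t(m) * m * sizeof(double);

    // One device allocation per small matrix; collect the device pointers on the host.
    std::vector<double*> hA(batch), hB(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        cudaMalloc(&hA[i], bytes);
        cudaMalloc(&hB[i], bytes);
        cudaMalloc(&hC[i], bytes);
        // ... fill hA[i] and hB[i] with data (omitted) ...
    }

    // The pointer arrays themselves must also live in device memory.
    double **dA, **dB, **dC;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMalloc(&dB, batch * sizeof(double*));
    cudaMalloc(&dC, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One call processes the whole group: C[i] = A[i]*B[i] for every i in the batch.
    const double alpha = 1.0, beta = 0.0;
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, m, m,
                       &alpha, dA, m, dB, m, &beta, dC, m, batch);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    return 0;    // device memory freed at process exit for brevity
}
```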
Numerical Computations with GPUs, 2014
This chapter presents the current best design and implementation practices for the acceleration of dense linear algebra (DLA) on GPUs. Examples are given with fundamental algorithms, from the matrix-matrix multiplication kernel written in CUDA to the higher-level algorithms for solving linear systems, eigenvalue and SVD problems. The implementations are available through the MAGMA library, a redesign for GPUs of the popular LAPACK. To generate the extreme level of parallelism needed for the efficient use of GPUs, algorithms of interest are redesigned and then split into well-chosen computational tasks. The execution of these tasks is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators using either static scheduling or a lightweight runtime system. The use of lightweight runtime systems keeps scheduling overhead low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows the exploration of the unique strengths of the various hardware components.
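As an illustration of the starting point the chapter describes, a basic shared-memory tiled CUDA matrix-matrix multiplication kernel is sketched below; it is a textbook version, not MAGMA's tuned kernel, and it assumes the matrix dimension is a multiple of the tile size.

```cpp
// Illustrative textbook kernel, not MAGMA's: each thread block stages TILE-by-TILE
// sub-blocks of A and B in shared memory, and each thread accumulates one element
// of C = A*B. Row-major storage; n is assumed to be a multiple of TILE.
#define TILE 16

__global__ void gemm_tiled(const double* A, const double* B, double* C, int n) {
    __shared__ double As[TILE][TILE];
    __shared__ double Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C this thread computes
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C this thread computes
    double acc = 0.0;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)           // multiply the two staged tiles
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```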
1997
Iterative algorithms for solving systems of linear equations require data exchanges at each iteration step due to data dependencies. When processed in parallel, this requirement forces frequent data exchanges among parallel sub-tasks, resulting in long execution times. These dependencies also prohibit effective parallel program generation by existing parallel compilers.
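A one-sweep Jacobi update, sketched below in plain host code, makes the dependency concrete: every entry of the new iterate reads the entire previous iterate, so parallel sub-tasks must exchange their freshly computed values before the next sweep can start. The routine and variable names are illustrative.

```cpp
// Illustrative sketch: one Jacobi sweep for A x = b (dense, row-major). Computing
// x_new[i] reads all of x_old, so distributed sub-tasks must exchange their parts
// of the previous iterate before every sweep.
#include <vector>

void jacobi_sweep(const std::vector<double>& A, const std::vector<double>& b,
                  const std::vector<double>& x_old, std::vector<double>& x_new, int n) {
    for (int i = 0; i < n; ++i) {
        double sigma = 0.0;
        for (int j = 0; j < n; ++j)
            if (j != i)
                sigma += A[i * n + j] * x_old[j];   // values owned by other sub-tasks
        x_new[i] = (b[i] - sigma) / A[i * n + i];
    }
}
```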
2012
Abstract. The broad introduction of multi-core platforms into computing has brought a great opportunity to develop computationally demanding applications such as matrix computations on parallel computing platforms. Basic matrix computations such as vector and matrix addition, dot product, outer product, matrix transpose, matrix-vector and matrix multiplication are very challenging computational kernels arising in scientific computing.
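As an example of why even these basic kernels are nontrivial on many-core hardware, the sketch below implements one of them, the dot product, as a CUDA kernel: unlike elementwise operations, it requires a parallel reduction. The kernel and launch parameters are illustrative assumptions (a block size of 256 threads, a power of two).

```cpp
// Illustrative sketch: dot product on the GPU. Each block reduces its threads'
// partial sums in shared memory; the per-block results in 'partial' are summed
// on the host. Assumes blockDim.x == 256 (a power of two).
__global__ void dot_kernel(const double* x, const double* y, double* partial, int n) {
    __shared__ double cache[256];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    double sum = 0.0;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x)
        sum += x[i] * y[i];                    // each thread accumulates a strided slice
    cache[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction within the block
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];        // one partial result per block
}
```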