1994, Proceedings of the 1994 ACM/IEEE conference on Supercomputing - Supercomputing '94
In this paper, we describe a scalable parallel algorithm for sparse Cholesky factorization, analyze its performance and scalability, and present experimental results of its implementation on a 1024-processor nCUBE2 parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm improves the state of the art in parallel direct solution of sparse linear systems by an order of magnitude, both in terms of speedups and the number of processors that can be utilized effectively for a given problem size. This algorithm incurs strictly less communication overhead and is more scalable than any known parallel formulation of sparse matrix factorization. We show that our algorithm is optimally scalable on hypercube and mesh architectures and that its asymptotic scalability is the same as that of dense matrix factorization for a wide class of sparse linear systems, including those arising in all two- and three-dimensional finite element problems.
IEEE Transactions on Parallel and Distributed Systems, 1995
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithm can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
IEEE Transactions on Parallel and Distributed Systems, 1997
In this paper, we describe scalable parallel algorithms for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
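As background for these abstracts, the kernel being parallelized can be stated compactly. The following is a minimal sequential sketch of Cholesky factorization, written for a dense matrix for clarity; the papers' contributions concern exploiting sparsity and parallelism on top of this computation:

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L * L^T.

    A must be symmetric positive definite; dense, column-by-column
    sketch for illustration only.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal entry: subtract squares of already-computed entries in row j
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        # entries below the diagonal in column j
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L
```

For example, `cholesky([[4.0, 2.0], [2.0, 3.0]])` yields `L = [[2, 0], [1, sqrt(2)]]`. In the sparse setting, most of the inner sums vanish, and the papers' algorithms organize the surviving work across processors.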
Proceedings of 1994 International Conference on Parallel and Distributed Systems, 1994
Though many parallel implementations of sparse Cholesky factorization, accompanied by experimental results, have been proposed, it is hard to evaluate the performance of these factorization methods theoretically because of the irregular structure of sparse matrices. This paper attempts such an evaluation. Using parallel computation and communication time as criteria, we evaluate four widely adopted Cholesky factorization methods: column-Cholesky, row-Cholesky, submatrix-Cholesky, and multifrontal. The results show that the multifrontal method is superior to the others.
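The variants compared here differ mainly in loop ordering. As an illustration, this is a sketch of the submatrix-Cholesky (right-looking) form, in which each factored column immediately applies its rank-1 update to the whole trailing submatrix; column-Cholesky instead defers those updates until each column is factored:

```python
import math

def submatrix_cholesky(A):
    """Right-looking (submatrix) Cholesky sketch: factor column j, then
    update the trailing submatrix before moving to column j + 1.
    Overwrites a copy of A with L; dense for illustration.
    """
    n = len(A)
    A = [row[:] for row in A]
    for j in range(n):
        A[j][j] = math.sqrt(A[j][j])
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]
        # rank-1 update of the trailing submatrix (lower triangle only)
        for i in range(j + 1, n):
            for k in range(j + 1, i + 1):
                A[i][k] -= A[i][j] * A[k][j]
    # clear the strict upper triangle so the result is exactly L
    for i in range(n):
        for k in range(i + 1, n):
            A[i][k] = 0.0
    return A
```

In a parallel sparse setting, the right-looking update pattern determines which processors must communicate after each column (or supernode) is factored, which is what the paper's computation/communication criteria measure.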
1994
This paper presents research into parallel block-diagonal-bordered sparse Choleski factorization algorithms developed with special consideration to irregular sparse matrices originating in the electrical power systems community. Direct block-diagonal-bordered sparse Choleski algorithms exhibit distinct advantages when compared to general direct parallel sparse Choleski algorithms. Task assignments for numerical factorization on distributed-memory multi-processors depend only on the assignment of data to blocks, ...
SIAM Journal on Scientific Computing, 2010
The rapid emergence of multicore machines has led to the need to design new algorithms that are efficient on these architectures. Here, we consider the solution of sparse symmetric positive-definite linear systems by Cholesky factorization. We were motivated by the dense case, where the computation has been successfully divided into tasks on blocks and a task manager used to exploit all the parallelism available between these tasks, whose dependencies may be represented by a directed acyclic graph (DAG). Our algorithm is built on the assembly tree and subdivides the work at each node into tasks on blocks, whose dependencies may again be represented by a DAG. To limit memory requirements, updates of blocks are performed directly. Our algorithm is implemented within a new solver HSL MA87. It is written in Fortran 95 plus OpenMP and is available as part of the software library HSL. Using problems arising from a range of practical applications, we present experimental results that support our design choices and demonstrate that HSL MA87 obtains good serial and parallel times on our 8-core test machines. Comparisons are made with existing modern solvers and show that HSL MA87 generally outperforms these solvers, particularly in the case of very large problems.
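The task-DAG structure such solvers build on can be illustrated on a simple block grid. This sketch (the task names "factor", "solve", and "update" are illustrative, not HSL MA87's) enumerates the block tasks of a right-looking blocked Cholesky on an nb x nb grid of blocks, together with the dependencies a task manager would respect:

```python
def block_cholesky_tasks(nb):
    """Enumerate the block tasks of a right-looking blocked Cholesky on an
    nb x nb grid of blocks and return the dependency DAG as a dict mapping
    each task to the set of tasks it waits on. Task names are illustrative.
    """
    deps = {}
    for k in range(nb):
        factor = ("factor", k, k)
        # the diagonal block can be factored once all earlier panels
        # have applied their updates to it
        deps[factor] = {("update", k, k, j) for j in range(k)}
        for i in range(k + 1, nb):
            # an off-diagonal solve needs the factored diagonal block
            # plus all earlier updates to block (i, k)
            deps[("solve", i, k)] = {factor} | {("update", i, k, j) for j in range(k)}
        for i in range(k + 1, nb):
            for c in range(k + 1, i + 1):
                # trailing update of block (i, c) by panel k needs both solves
                deps[("update", i, c, k)] = {("solve", i, k), ("solve", c, k)}
    return deps
```

Any topological order of this DAG is a valid serial schedule; a task manager can instead run all tasks whose dependencies are satisfied concurrently, which is the parallelism the paper exploits.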
SIAM Journal on Matrix Analysis and Applications, 2005
The height of the elimination tree has long acted as the only criterion in deriving a suitable fill-preserving sparse matrix ordering for parallel factorization. Although the deficiency in adopting height as the criterion for all circumstances was well recognized, no research has succeeded in alleviating this constraint. In this paper, we extend the unit-cost fill-preserving ordering into a generalized class that can adopt various aspects in parallel factorization, such as computation, communication and algorithmic diversity. We recognize and show that if any cost function satisfies two mandatory properties, called the independence and conservation properties, a greedy ordering scheme then generates an optimal ordering with minimum completion cost. We also present an efficient implementation of the proposed ordering algorithm. Incorporating various techniques, the complexity can be improved from O(n log n + e) to O(q log q + κ), where n denotes the number of nodes, e the number of edges, q the number of maximal cliques and κ the sum of all maximal clique sizes in the filled graph. Empirical results show that the proposed algorithm can significantly reduce the parallel factorization cost without sacrificing much in terms of time efficiency.
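The elimination tree that underlies such orderings can be computed directly from the matrix's nonzero pattern. The following is a sketch of the standard algorithm (Liu's, with path compression); the input format, one list of below-diagonal column indices per row, is an assumption made here for illustration:

```python
def elimination_tree(lower):
    """Compute the elimination tree of a symmetric sparse matrix.

    lower[j] lists the columns i < j with A[j][i] != 0.
    Returns parent[], with None marking a root. Sketch of Liu's
    algorithm using an `ancestor` array for path compression.
    """
    n = len(lower)
    parent = [None] * n
    ancestor = [None] * n
    for j in range(n):
        for i in lower[j]:
            # climb from i toward the root, redirecting traversed
            # ancestor links to j (path compression)
            while ancestor[i] is not None and ancestor[i] != j:
                nxt = ancestor[i]
                ancestor[i] = j
                i = nxt
            if ancestor[i] is None:
                ancestor[i] = j
                parent[i] = j
    return parent
```

For an "arrow" pattern with nonzeros at (2,0) and (2,1), both columns 0 and 1 get parent 2; the height of this tree is the classical parallel-ordering criterion that the paper generalizes to richer cost functions.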
2006
The sparse Cholesky factorization of some large matrices can require a two dimensional partitioning of the matrix. The sparse hypermatrix storage scheme produces a recursive 2D partitioning of a sparse matrix. The subblocks are stored as dense matrices so BLAS3 routines can be used. However, since we are dealing with sparse matrices some zeros may be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. In this paper we present an improvement to our sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure. We compare its performance with several codes and analyze the results.
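The trade-off this work addresses (dense subblocks enable BLAS3 kernels, but force explicit zeros to be stored and operated on) can be made concrete with a one-level blocking sketch. The hypermatrix scheme itself is recursive, so this flat version is a simplification of ours, not the paper's data structure:

```python
def blocked_storage(entries, b):
    """Group the nonzeros of a sparse matrix into dense b x b blocks,
    one level of a hypermatrix-style scheme (sketch).

    entries maps (i, j) -> value. Returns the dict of dense blocks plus
    the number of explicit zeros stored, i.e. the overhead the paper
    works to reduce.
    """
    blocks = {}
    for (i, j), v in entries.items():
        key = (i // b, j // b)
        blk = blocks.setdefault(key, [[0.0] * b for _ in range(b)])
        blk[i % b][j % b] = v
    stored_zeros = sum(row.count(0.0) for blk in blocks.values() for row in blk)
    return blocks, stored_zeros
```

With two isolated nonzeros and 2 x 2 blocks, two dense blocks are allocated and six explicit zeros are stored; larger block sizes improve BLAS3 efficiency but raise exactly this zero-padding overhead.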
Efficient execution of numerical algorithms requires adapting the code to the underlying execution platform. In this paper we show the process of fine tuning our sparse Hypermatrix Cholesky factorization in order to exploit efficiently two important machine resources: processor and memory. Using the techniques we presented in previous papers we tune our code on a different platform. Then, we extend our work in two directions: first, we experiment with a variation of the ordering algorithm, and second, we reduce the data submatrix storage to be able to use larger submatrix sizes.
Journal of Computer and System Sciences, 2005
This paper gives improved parallel methods for several exact factorizations of some classes of symmetric positive definite (SPD) matrices. Our factorizations also provide similarly efficient algorithms for exact computation of the solution of the corresponding linear systems (which need not be SPD), and for finding rank and determinant magnitude. We assume the input matrices have entries that are rational numbers expressed as a ratio of integers with at most a polynomial number of bits. We assume a parallel random access machine (PRAM) model of parallel computation, with unit-cost arithmetic operations, including division, over a finite field Z_p, where p is a prime number whose binary representation is linear in the size of the input matrix and is randomly chosen by the algorithm. We require only bit precision O(n(+ log n)), which is the asymptotically optimal bit precision for log n. Our algorithms are randomized, giving the outputs with high likelihood 1 - 1/n^Ω(1). We compute LU and QR factorizations for dense matrices, and LU factorizations of sparse matrices which are s(n)-separable, reducing the known parallel time bounds for these factorizations from O(log^3 n) to O(log^2 n), without an increase in processors (matching the best known work bounds of known parallel algorithms with polylog time bounds). Using the same parallel algorithm specialized to structured matrices, we compute LU factorizations for Toeplitz matrices and matrices of bounded displacement rank in time O(log^2 n) with n log log n processors, reducing by a nearly linear factor the best previous processor bounds for polylog times (however, these prior works did not generally
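The exact finite-field arithmetic assumed by the PRAM model above can be illustrated with a small sequential sketch of LU factorization over Z_p. The no-pivoting restriction (every leading minor invertible mod p) is an assumption of this sketch for brevity, not a property of the paper's algorithms:

```python
def lu_mod_p(A, p):
    """LU factorization over Z_p without pivoting (sketch).

    Assumes each leading minor of A is invertible mod p. Returns L and U
    packed in one matrix: strict lower triangle holds L (unit diagonal
    implied), upper triangle holds U. All arithmetic is exact, mod p.
    """
    n = len(A)
    A = [[x % p for x in row] for row in A]
    for k in range(n):
        # multiplicative inverse of the pivot via Fermat's little theorem
        inv = pow(A[k][k], p - 2, p)
        for i in range(k + 1, n):
            A[i][k] = A[i][k] * inv % p
            for j in range(k + 1, n):
                A[i][j] = (A[i][j] - A[i][k] * A[k][j]) % p
    return A
```

For A = [[2, 1], [4, 3]] and p = 7, the packed result is [[2, 1], [2, 1]], i.e. L = [[1, 0], [2, 1]] and U = [[2, 1], [0, 1]], and L * U recovers A exactly, with no rounding at any bit precision.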
1995
A few parallel algorithms for solving triangular systems resulting from parallel factorization of sparse linear systems have been proposed and implemented recently. We present a detailed analysis of parallel complexity and scalability of the best of these algorithms and the results of its implementation on up to 256 processors of the Cray T3D parallel computer. It has been a common belief that parallel sparse triangular solvers are quite unscalable due to a high communication to computation ratio. Our analysis and experiments show that, although not as scalable as the best parallel sparse Cholesky factorization algorithms, parallel sparse triangular solvers can yield reasonable speedups in runtime on hundreds of processors. We also show that for a wide class of problems, the sparse triangular solvers described in this paper are optimal and are asymptotically as scalable as a dense triangular solver.
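The kernel analyzed here, sparse forward substitution, is only a few lines of sequential code, which is why its low computation-to-communication ratio makes parallelization hard. A sketch follows; the column-wise input format is an assumption made for illustration:

```python
def sparse_lower_solve(cols, diag, b):
    """Solve L x = b for lower-triangular sparse L (sketch).

    diag[j] is L[j][j]; cols[j] lists the below-diagonal nonzeros of
    column j as (row, value) pairs. Column-oriented forward substitution.
    """
    x = b[:]
    n = len(b)
    for j in range(n):
        x[j] /= diag[j]
        # scatter column j's contribution to the remaining unknowns
        for i, v in cols[j]:
            x[i] -= v * x[j]
    return x
```

Each column contributes only a handful of multiply-adds, so on a distributed-memory machine the messages carrying x[j] between processors dominate, which is the effect the paper's scalability analysis quantifies.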