Academia.edu no longer supports Internet Explorer.
To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser.
1994, Proceedings of 1994 International Conference on Parallel and Distributed Systems
Though many parallel implementations of sparse Cholesky factorization with the experimental results accompanied have been proposed, it seems hard to evaluate the performance of these factorization methods theoretically because of the irregular structure of sparse matrices. This paper is an attempt to such research. On the basis of the criteria of parallel computation and communication time, we successfully evaluate four widely adopted Cholesky factorization methods, including column-Cholesky, row-Cholesky, submatrix-Cholesky and multifrontal. The results show that the multifrontal method is superior to the others.
Proceedings of the 1994 ACM/IEEE conference on Supercomputing - Supercomputing '94, 1994
In this paper, we describe a scalable parallel algorithm for sparse Cholesky factorization, analyze its performance and scalability, and present experimental results of its implementation on a 1024-processor nCUBE2 parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm improves the state of the art in parallel direct solution of sparse linear systems by an order of magnitude|both in terms of speedups and the number of processors that can be utilized e ectively for a given problem size. This algorithm incurs strictly less communication overhead and is more scalable than any known parallel formulation of sparse matrix factorization. We show that our algorithm is optimally scalable on hypercube and mesh architectures and that its asymptotic scalability is the same as that of dense matrix factorization for a wide class of sparse linear systems, including those arising in all two-and three-dimensional nite element problems.
SIAM Journal on Matrix Analysis and Applications, 2005
The height of the elimination tree has long acted as the only criterion in deriving a suitable fill-preserving sparse matrix ordering for parallel factorization. Although the deficiency in adopting height as the criterion for all circumstances was well recognized, no research has succeeded in alleviating this constraint. In this paper, we extend the unit-cost fill-preserving ordering into a generalized class that can adopt various aspects in parallel factorization, such as computation, communication and algorithmic diversity. We recognize and show that if any cost function satisfies two mandatory properties, called the independence and conservation properties, a greedy ordering scheme then generates an optimal ordering with minimum completion cost. We also present an efficient implementation of the proposed ordering algorithm. Incorporating various techniques, the complexity can be improved from O(n log n + e) to O(q log q + κ), where n denotes the number of nodes, e the number of edges, q the number of maximal cliques and κ the sum of all maximal clique sizes in the filled graph. Empirical results show that the proposed algorithm can significantly reduce the parallel factorization cost without sacrificing much in terms of time efficiency.
1994
This paper presents research into parallel block-diagonal-bordered sparse Choleski factorization algorithms developed with special consideration to irregular sparse matrices originating in the electrical power systems community. Direct block-diagonal-bordered sparse Choleski algorithms exhibit distinct advantages when compared to general direct parallel sparse Choleski algorithms. Task assignments for numerical factorization on distributed-memory multi-processors depend only on the assignment of data to blocks, ...
IEEE Transactions on Parallel and Distributed Systems, 1995
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems-both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two-and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
2006
The sparse Cholesky factorization of some large matrices can require a two dimensional partitioning of the matrix. The sparse hypermatrix storage scheme produces a recursive 2D partitioning of a sparse matrix. The subblocks are stored as dense matrices so BLAS3 routines can be used. However, since we are dealing with sparse matrices some zeros may be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. In this paper we present an improvement to our sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure. We compare its performance with several codes and analyze the results.
This paper shows how a sparse hypermatrix Cholesky factorization can be improved. This is accomplished by means of efficient codes which operate on very small dense matrices. Different matrix sizes or target platforms may require different codes to obtain good performance. We write a set of codes for each matrix operation using different loop orders and unroll factors. Then, for each matrix size, we automatically compile each code fixing matrix leading dimensions and loop sizes, run the resulting executable and keep its Mflops. The best combination is then used to produce the object introduced in a library. Thus, a routine for each desired matrix size is available from the library. The large overhead incurred by the hypermatrix Cholesky factorization of sparse matrices can therefore be lessened by reducing the block size when those routines are used. Using the routines, e.g. matrix multiplication, in our small matrix library produced important speed-ups in our sparse Cholesky code. This work was supported by the Ministerio de Ciencia y Tecnología of Spain and the EU FEDER funds (TIC2001-0995-C02-01)
The sparse Cholesky factorization of some large matrices can require a two dimensional partitioning of the matrix. The sparse hypermatrix storage scheme produces a recursive 2D partitioning of a sparse matrix. The subblocks are stored as dense matrices so BLAS3 routines can be used. However, since we are dealing with sparse matrices some zeros may be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. In this paper we present an improvement to our sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure. We compare its performance with several codes and analyze the results.
We present our work on the sparse Cholesky factorization using a hypermatrix data structure. First, we provide some background on the sparse Cholesky factorization and explain the hypermatrix data structure. Next, we present the matrix test suite used. Afterwards, we present the techniques we have developed in pursuit of performance improvements for the sparse hypermatrix Cholesky factorization of a symmetric positive definite matrix into a lower triangular factor L.
SIAM Journal on Scientific Computing, 2010
The rapid emergence of multicore machines has led to the need to design new algorithms that are efficient on these architectures. Here, we consider the solution of sparse symmetric positive-definite linear systems by Cholesky factorization. We were motivated by the successful division of the computation in the dense case into tasks on blocks and use of a task manager to exploit all the parallelism that is available between these tasks, whose dependencies may be represented by a directed acyclic graph (DAG). Our algorithm is built on the assembly tree and subdivides the work at each node into tasks on blocks, whose dependencies may again be represented by a DAG. To limit memory requirements, updates of blocks are performed directly. Our algorithm is implemented within a new solver HSL MA87. It is written in Fortran 95 plus OpenMP and is available as part of the software library HSL. Using problems arising from a range of practical applications, we present experimental results that support our design choices and demonstrate HSL MA87 obtains good serial and parallel times on our 8-core test machines. Comparisons are made with existing modern solvers and show that HSL MA87 generally outperforms these solvers, particularly in the case of very large problems.
The sparse hypermatrix storage scheme produces a recursive 2D partitioning of a sparse matrix. Data subblocks are stored as dense matrices. Since we are dealing with sparse matrices some zeros can be stored in those dense blocks. The overhead introduced by the operations on zeros can become really large and considerably degrade performance. In this paper, we present several techniques for reducing the operations on zeros in a sparse hypermatrix Cholesky factorization. By associating a bit to each column within a data submatrix we create a bit vector. We can avoid computations when the bitwise AND of their bit vectors is null. By keeping information about the actual space within a data submatrix which stores non-zeros (dense window) we can reduce both storage and computation.
IEEE Transactions on Parallel and Distributed Systems, 1997
In this paper, we describe scalable parallel algorithms for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems-both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two-and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.