Accelerating advanced preconditioning methods
on hybrid architectures
Ernesto Dufrechou
Montevideo – Uruguay
October 2018
Accelerating advanced preconditioning methods
on hybrid architectures
Ernesto Dufrechou
Thesis advisor:
Prof. Pablo Ezzatti
Co-advisor:
Prof. Enrique Quintana-Ortí
Academic advisor:
Prof. Pablo Ezzatti
Montevideo – Uruguay
October 2018
Dufrechou, Ernesto
Accelerating advanced preconditioning methods on hybrid
architectures / Ernesto Dufrechou. - Montevideo: Universidad de
la República, PEDECIBA, 2018.
XVIII, 149 p., 29.7 cm.
Thesis advisor:
Pablo Ezzatti
Co-advisor:
Enrique Quintana-Ortí
Academic advisor:
Pablo Ezzatti
Doctoral thesis – Universidad de la República, Programa de
Doctorado en Informática, 2018.
Bibliographic references: p. 101 – 149.
1. Sparse linear systems, 2. Preconditioners,
3. Graphics Processing Units (GPU), 4. Data
parallelism, 5. ILUPACK.
MEMBERS OF THE THESIS DEFENSE COMMITTEE
Montevideo – Uruguay
October 2018
Acknowledgments
ABSTRACT
Many problems, in diverse areas of science and engineering, involve the solution of large-
scale sparse systems of linear equations. In most of these scenarios, they are also a com-
putational bottleneck, and therefore their efficient solution on parallel architectures has
motivated a tremendous volume of research.
For many years, direct methods based on the well-known Gaussian Elimination procedure
have been the default choice to address these problems, but the dimension of the problems
being solved nowadays poses serious difficulties for most of these algorithms, regarding
memory requirements, time to solution and complexity of implementation.
A means to alleviate this situation is the use of High Performance Computing (HPC)
techniques. Scientific and domestic computing platforms have steadily evolved to enable the
parallel execution of more and more operations. A clear example of this evolution is the
widespread adoption of hybrid hardware platforms equipped with one or several multicore
processors and compute accelerators such as Graphics Processing Units (GPUs). These platforms have become extremely powerful parallel architectures that integrate thousands of cores at a reasonable price and energy consumption. For these reasons, GPUs are now
a hardware platform of paramount importance to science and engineering, and making an
efficient use of them is crucial to achieve high performance in most applications.
This dissertation targets the use of GPUs to enhance the performance of the solution of
sparse linear systems using iterative methods complemented with state-of-the-art preconditioning techniques. In particular, we study ILUPACK, a package for the solution of sparse
linear systems via Krylov subspace methods that relies on a modern inverse-based multilevel
ILU (incomplete LU) preconditioning technique.
We present new data-parallel versions of the preconditioner and the most important
solvers contained in the package that significantly improve its performance without affect-
ing its accuracy. Additionally, we enhance existing task-parallel versions of ILUPACK for
shared- and distributed-memory systems with the inclusion of GPU acceleration. The results
obtained show a considerable reduction in the runtime of the methods, as well as the possibility of addressing large-scale problems efficiently.
Keywords:
Sparse linear systems, Preconditioners, Graphics Processing Units (GPU), Data-
parallelism, ILUPACK.
Contents
1 Introduction 1
1.1 Graphics Processing Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Structure of the document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2.2 Matrix-vector products . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.3 Vector operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.4 Parallelization of SpMV and other kernels . . . . . . . . . . . . . . . . 59
3.2.5 Experimental evaluation of baseline parallel versions . . . . . . . . . . 60
3.3 Enhanced data-parallel variants . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.1 Coarse-grain parallel version of BiCG for dual-GPU systems . . . . . 65
3.3.2 Concurrent BiCG for single GPU platforms . . . . . . . . . . . . . . . 67
3.3.3 GMRES with Accelerated Data-Parallel MGSO . . . . . . . . . . . . . 69
3.3.4 BiCGStab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3.5 Experimental evaluation of advanced variants . . . . . . . . . . . . . . 71
5 Conclusions 93
5.1 Closing remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.1.1 Development of GPU-aware variants of ILUPACK . . . . . . . . . . . 94
5.1.2 Enhanced parallel variant of GMRES . . . . . . . . . . . . . . . . . . 95
5.1.3 Task-data-parallel variants of BiCG . . . . . . . . . . . . . . . . . . . 95
5.1.4 GPU variant of BiCGStab . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1.5 GPU version of ILUPACK for shared-memory platforms . . . . . . . 96
5.1.6 GPU version of ILUPACK for distributed-memory platforms . . . . 97
5.2 Open lines of research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Appendices 101
Appendix A Solution of sparse triangular linear systems . . . . . . . . . 103
A.1 Solution of sparse triangular linear systems . . . . . . . . . . . . . . . . . . . 104
A.1.1 The level-set strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
A.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A.2.1 Level-set based methods . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.2.2 Dynamically scheduled algorithms . . . . . . . . . . . . . . . . . . . . 108
A.3 Sync-free GPU triangular solver for CSR matrices . . . . . . . . . . . . . . . 109
A.4 A massively parallel level set analysis . . . . . . . . . . . . . . . . . . . . . . . 111
A.5 Combining the analysis and solution stages . . . . . . . . . . . . . . . . . . . 112
A.6 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
A.6.1 Test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
A.6.2 Hardware platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.6.3 Evaluation of the solver routine . . . . . . . . . . . . . . . . . . . . . . 114
A.6.4 Evaluation of the analysis routine . . . . . . . . . . . . . . . . . . . . 117
A.6.5 Evaluation of the combined routine . . . . . . . . . . . . . . . . . . . . 119
Appendix B Balancing energy and efficiency in ILUPACK . . . . . . . . 123
B.1 Characterizing the efficiency of hardware platforms using ILUPACK . . . . . 124
B.1.1 Hardware and software configurations . . . . . . . . . . . . . . . . . . 125
B.1.2 Optimizing energy and performance . . . . . . . . . . . . . . . . . . . 126
B.1.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 127
B.2 Adaptation of ILUPACK to low power devices . . . . . . . . . . . . . . . . . 131
Appendix C Description of the GPU architectures used in this work. . . . . 135
C.0.1 Fermi architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
C.0.2 Kepler architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
C.0.3 Maxwell architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
C.1 Summary of computing platforms . . . . . . . . . . . . . . . . . . . . . . . . . 139
Bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . 141
CHAPTER 1
Introduction
Partial Differential Equations (PDEs) are, without doubt, one of the most important tools for
modeling and understanding several aspects of our universe. There are countless examples
of natural phenomena and human activities that can be analyzed as processes governed by
PDEs, in areas that range from quantum mechanics to economics [25, 33].
Addressing this type of equation with an analytical approach is not possible in the
general case. Thus, in order to solve such equations numerically, a typical strategy is to
reduce the equations, which involve continuous quantities, to a set of equations with a
finite number of unknowns. This sort of procedure is known as discretization. A natural
consequence of the application of discretization techniques is the appearance of systems of
linear equations in the methods used to solve PDEs. In most cases, since the entries in
the coefficient matrix involved in such systems express local relations between discretized
fragments of the domain, these matrices tend to be large and sparse, that is, they have very
few nonzero entries.
The above elaboration should be enough to convince the reader that solving large and
sparse linear systems efficiently is of great importance to science and engineering, but there
are many other applications, such as circuit simulation or optimal control, that are not
necessarily governed by PDEs but also rely on the solution of sparse linear systems.
The discussion would end here if solving this sort of system were a trivial task, but it is not. A number of current real-world applications (for example, three-dimensional PDEs)
involve linear systems with millions of equations and unknowns. Direct solvers such as those
based on Gaussian Elimination [56], which apply a sequence of matrix transformations to
reach an equivalent but easier-to-solve system, once were the default choice to tackle this
sort of problems due to their robustness. Unfortunately, today most of these algorithms fall
short when solving large-scale problems because of their excessive memory requirements,
impractical time to solution and complexity of implementation.
In these cases, iterative methods pose an appealing alternative. Although direct methods
are also iterative (for example they generally iterate over the rows and columns of the
coefficient matrix), this type of algorithms earn their name because they work by iteratively
improving an initial guess of the solution. Furthermore, they normally present smaller computational requirements than their direct counterparts.
Although the derivation of each method can be analyzed from different angles, most
iterative methods can be interpreted as searching for the solution of the linear system inside
a predefined subspace, by taking one step in a given search direction at each iteration of the
solver. A large number of methods exist that differ from each other in how this subspace is
defined, and the criteria under which this search direction and step size are chosen.
Many times, there are optimal choices for the above criteria, which can be derived an-
alytically. In these cases, we say that the iterative method is optimal, and reaches the
solution of the linear system (provided exact arithmetic is used) in at most as many steps
as the dimension of the column span of the matrix. Nevertheless, taking that many steps to arrive at the solution is often as costly as solving the linear system using a direct method.
In general, it is desired that the iterative method rapidly improves the initial guess, with
the purpose of finding an acceptable solution to the system in a small number of steps.
The process of converging rapidly to an acceptable solution is often impaired in practice
by unavoidable rounding errors due to the use of finite-precision arithmetic. In many cases
these errors can even prevent the method from reaching a solution at all. How floating point
rounding errors affect the iterative method usually depends on the numerical properties of
the problem being solved.
To remedy these shortcomings, preconditioning techniques are applied. In a broad sense,
these techniques aim to transform the original linear system into an equivalent one that
presents better numerical properties, so that an iterative method can be applied and converge
faster to a solution. As an example, given A ∈ Rn×n and b ∈ Rn , consider the linear system
Ax = b. If a nonsingular matrix M that approximates A−1 in some sense is available, each iterate produced when the solver is applied to the transformed system

Âx = b̂ ≡ M Ax = M b
will hopefully be closer to the solution of the system, facilitating the convergence of an
iterative solver.
The development of effective preconditioners is an active field of research in applied
mathematics. There are many preconditioning techniques but, unfortunately, none of these
are effective in all cases. One outstanding class of preconditioners, however, is based on
Incomplete LU (ILU) factorizations.
ILU-type preconditioners are based on computing an LU factorization of the matrix where
some entries are dropped during the process to maintain the sparsity in the factors, which
will determine the memory footprint of the preconditioner, as well as the computational
effort required to apply it in the context of an iterative method.
This family of preconditioners is known to achieve good results for important classes
of problems, such as those that arise from the discretization of elliptic PDEs. This property
has motivated their massive use, and their inclusion in software packages. However, in order
to make this technique applicable to a wider spectrum of problems, an active line of research is devoted to making ILUs more efficient and reliable.
Among the most recent advances in this field, ILUPACK
stands out as a package for the solution of sparse linear systems via Krylov subspace methods
that relies on an inverse-based multilevel ILU (incomplete LU) preconditioning technique for
general as well as Hermitian positive definite/indefinite linear systems [30]. An outstanding
characteristic of ILUPACK is its unique control of the growth in the magnitude of the inverse
of the triangular factors during the approximate factorization process.
Unfortunately, the favorable numerical properties of ILUPACK's preconditioner in the context of an iterative solver come at the cost of expensive construction and application procedures, especially for large-scale sparse linear systems. This high computational cost motivated the development of variants of ILUPACK that can efficiently exploit High-Performance
Computing (HPC) platforms. Previous efforts include parallel variants of ILUPACK’s CG
method [87], for symmetric positive definite (SPD) systems, on shared-memory and message-
passing platforms [3, 4, 8]. These implementations showed remarkable results regarding
their performance and scalability in many large-scale problems, but have the potential dis-
advantage of slightly modifying the original ILUPACK preconditioner to expose task-level
parallelism, yielding distinct convergence rates (though not necessarily slower for the par-
allel versions). Additionally, the task-parallel variants usually require more floating-point
arithmetic operations (flops) than the original ILUPACK, with the overhead cost rapidly
growing with the degree of task-parallelism that is exposed [3]. This may imply a higher
cost of the preconditioner in terms of storage and energy consumption.
For more than two decades now, scientific and domestic computing platforms have
evolved to include multiple cores to enable the parallel execution of different operations,
mitigating in this way the physical limitations (mainly related to heat dissipation and errors in signal transmission) that impair the increase in clock frequency and transistor integration
dictated by Moore’s law [93]. In this sense, hybrid hardware platforms equipped with one
or several multi-core processors and compute accelerators have experienced an important
evolution. In particular, Graphics Processing Units (GPUs) have developed into extremely
powerful parallel architectures that integrate thousands of cores at a reasonable price and
energy consumption. For these reasons, GPUs have become a tool of paramount importance
in science and engineering, especially after the explosive advances in the field of machine
learning and deep neural networks in recent years.
GPUs are now ubiquitous, making their efficient use crucial in order to obtain good
performance on the most recent hardware for scientific computing. Even in sparse linear
algebra, where the computational intensity of the operations is generally low and does not allow taking full advantage of the computational power GPUs provide, the high memory
bandwidth of these devices offers significant acceleration opportunities.
Prior to this dissertation, no parallel versions of ILUPACK existed aside from the above-mentioned task-parallel implementations. It is therefore interesting to analyze the data-level
parallelism in ILUPACK to develop new efficient parallel versions that do not suffer from
the limitations of the previous variants on the one hand, and to enhance the performance of
these variants on the other. In this line, the use of hardware accelerators, and in particular
of GPUs, is a valuable tool, and a clear path to explore.

1.1 Graphics Processing Units

The term GPU (Graphics Processing Unit) was popularized by Nvidia in 1999, which
advertised the GeForce 256 as “the world’s first GPU” [76]. At that time, these devices
were dedicated to efficiently process polygons, lighting and textures in order to display 3D
images on the screen. As the processing of the geometry and the production of complex
lighting and shading effects are the computationally most expensive stages of this graphics
pipeline, the first GPU-architectures [82] included specialized hardware to compute them.
The operations available at these stages were configurable but not programmable. For
instance, in the fixed-function pipeline, the programmer was able to control the position and
color of the vertices and the point source of light, but not the lighting model that determined
their interaction. For this reason, the first attempts at solving general-purpose problems with GPUs implied mapping the solutions onto the operations of the fixed graphics pipeline.
Furthermore, this architecture presents a serious problem: scenes with complex geometries overload the vertex shader while under-utilizing the hardware of the pixel shader, whereas simple geometries with heavy lighting and texture effects overload the pixel shader and under-utilize the vertex shader. To overcome this issue,
around 2007 NVIDIA proposed a new architecture of GPUs that replaced the vertex and
pixel shaders by a generic stream processing unit capable of executing every step of the
graphic pipeline. In conjunction with the hardware, NVIDIA introduced a programming
framework for these devices, which made it possible to use this new kind of multiprocessor to solve problems not related to rendering graphics in a much more straightforward way, a practice that was coined General-Purpose Computing on GPU (GPGPU). The combination of the
hardware architecture with general purpose processors, the driver, a runtime, and C language
extensions to program the GPU were released under the name of CUDA, which stands for
Compute Unified Device Architecture.
More recently, the OpenCL standard (defined by the Khronos Group) has become broadly supported, although CUDA enjoys a more widespread adoption. Nowadays,
the CUDA platform is designed to work with programming languages such as C, C++ and
Fortran, including a great number of high performance libraries, application programming
interfaces (APIs) and tools, which makes it easier for specialists in parallel programming to
use GPU resources. Moreover, CUDA offers remarkable portability, as most CUDA codes remain compatible across GPU architecture generations, and the same code can be executed on a large variety of devices, ranging from mobile devices to HPC platforms, with few or no adjustments.
A more detailed description of the particularities of the different NVIDIA GPU archi-
tectures employed in this work can be found in Appendix C.
1.2 Objectives
Motivated by the need for robust and efficient preconditioners to enhance the convergence of
iterative linear system solvers in science and engineering applications, the main goal of this
thesis is to advance the state-of-the-art in the efficient parallel implementations of modern
preconditioning techniques. In particular, we are interested in the use of hardware accel-
erators to leverage the data-parallelism of iterative solvers working together with advanced
incomplete-factorization methods.
As discussed in the opening paragraphs of this dissertation, ILUPACK is a prominent
example of such solvers, but its remarkable numerical results come at the expense of a high
complexity, which implies costly construction and application procedures. Previous efforts
have provided task-parallel implementations of ILUPACK for shared-memory and message-
passing platforms [8, 3, 4] but, despite showing good performance and scalability results,
some shortcomings of these parallel variants exist.
In the first place, these variants of ILUPACK are limited to the solution of symmetric
and positive-definite (SPD) linear systems, which means that there are no parallel versions
of ILUPACK for non-symmetric or indefinite systems prior to this dissertation. Second, they slightly modify the preconditioner to exploit task-parallelism, and these modifications can impact its numerical properties and the convergence of the solver.
Therefore, to achieve our principal goal we have set the following specific objectives:
1.3 Structure of the document
of task and data parallelism in shared-memory platforms equipped with several GPUs, to
later leverage the computational power of GPUs distributed across several nodes of a cluster.
The main part of the dissertation concludes with Chapter 5, which offers some final
remarks and the discussion of future directions in which this work can be extended.
Appendix A gives details about a line of research developed in parallel with the thesis, devoted to the study of a new approach for the solution of triangular linear systems on GPUs,
and the construction of new routines to perform this operation following this strategy. Some
of the resulting routines were used in the single-GPU version of BiCG of Chapter 3.
Another secondary line of research is related to studying the energy efficiency of hardware and software. Some work along this line, involving the energy aspects of ILUPACK, is covered in Appendix B.
Finally, Appendix C gives details on the software and hardware platforms used in the
experiments included in the dissertation.
CHAPTER 2
Systems of Linear Equations
This chapter is devoted to providing a brief theoretical background that enables the reader to continue comfortably through the rest of the manuscript. It is not our purpose to go
deep into mathematical details when they are not essential to the understanding of this
thesis, and we will refer to the corresponding books and articles in those cases. The chapter
starts by revisiting the solution of linear systems by Gaussian Elimination, making a special
emphasis on the sparse case and direct methods such as the sparse LU decomposition. It
will later switch to the solution of these problems via iterative methods, presenting the most
common techniques. A section about preconditioners will follow, in which, after presenting
some general concepts, we will focus on the ILU-based class, presenting the most important
developments in this subject until arriving at the state-of-the-art "inverse-based" multilevel ILU preconditioner that gives rise to ILUPACK, the software package on which the work of this thesis builds. The chapter finishes with a review of related work.
beforehand. Furthermore, these methods are, in general, robust from the numerical perspective, often with the help of pivoting techniques. Specifically, they generally present a computational cost of O(n³) flops (floating-point operations), and techniques to improve their cache efficiency as well as their numerical stability have been well known for a long time.
For the sparse case, the performance of these methods is usually determined by the appearance of new nonzero elements during the transformation procedure, known as fill-in, which increases their cost and makes it difficult to estimate this cost a priori. Therefore,
the design of direct methods for sparse matrices is usually guided by the mitigation of this
phenomenon, which is attempted by applying techniques such as re-orderings of the rows
and columns of the sparse matrix, altering the order in which unknowns are eliminated,
and the partitioning of the system, to identify dense submatrices that can be dealt with using computationally efficient dense algebra kernels. However, it is not always possible to reduce the fill-in to satisfactory levels, especially in the non-SPD cases, where pivoting is required,
which can lead to difficulties when solving reasonably large systems of equations.
Although this thesis is mostly concerned with the solution of large and sparse linear
systems, which is precisely where direct methods fall short, a brief introduction to these algorithms is mandatory since, apart from their role in solving linear systems, they provide
the framework for the development of several types of preconditioners.
of the matrix during the factorization. The theory behind these algorithms is often expressed in terms of
the adjacency graph determined by the sparse matrix.
Apart from these good properties, variants of GE lead to other important algorithms, such as the QR factorization, if the elementary transformations used to reach the triangular form are
replaced by Householder reflections. It can also easily be adapted to exploit matrices of
special structure, like in the Thomas algorithm for tridiagonal matrices [100], or in the
factorization of band and Hessenberg matrices. Moreover, in the sparse case, GE also gives rise to an entire family of preconditioners for the iterative solution of linear systems.
GE applies successive elementary transformations to cancel the elements below the diag-
onal of each column, one column at a time. Concretely, starting from matrix A0 = A, and
letting Ak−1 ∈ Rn×n be the result of applying k − 1 of these elementary transformations,
the result of the k-th transformation can be expressed as follows
A_k = L_k^{-1} A_{k-1}, (2.1)

with

L_k^{-1} = I − (1 / a_{kk}^{(k-1)}) (0, a_{*k}^{(k-1)})^T e_k^T, (2.2)

and where

a_{*k}^{(k-1)} = (a_{k+1,k}^{(k-1)}, . . . , a_{n,k}^{(k-1)})^T (2.3)

is the vector in R^{n−k} formed by the sub-diagonal elements in the k-th column of A_{k-1}.
It should be noted that, with L_k^{-1} defined as in (2.2), the new entry a_{ik}^{(k)} (with i > k) will be calculated as

a_{ik}^{(k)} = −l_{ik} a_{kk}^{(k-1)} + a_{ik}^{(k-1)} = −(a_{ik}^{(k-1)} / a_{kk}^{(k-1)}) a_{kk}^{(k-1)} + a_{ik}^{(k-1)} = 0, (2.4)

and that

U = L_{n-1}^{-1} × · · · × L_1^{-1} × L_0^{-1} A, (2.5)

so that, defining L = L_0 L_1 · · · L_{n-1}, (2.6)

A = LU. (2.7)
The expressions (2.2) and (2.3) allow us to formulate an algorithm for GE known as the right-looking LU factorization, or KIJ variant, which is outlined in Algorithm 1. In this
variant of GE, one column of L and one row of U are computed in each iteration, and the
trailing (n − k) × (n − k) submatrix is modified using a rank-1 update.
It is evident that Algorithm 1 breaks down if a_{kk}^{(k-1)} is zero, and that numerical difficulties arise in case it becomes very small. Hence, the LU factorization requires a row pivoting strategy
to ensure numerical stability [56]. In this case, the decomposition takes the form P A = LU ,
where P ∈ Rn×n is a permutation matrix. In most implementations, the L and U factors are
1: for k = 0 to n − 1 do
2: for i = k + 1 to n − 1 do
3: lik := aik /akk
4: end for
5: for j = k + 1 to n − 1 do
6: ukj := akj
7: end for
8: for i = k + 1 to n − 1 do
9: for j = k + 1 to n − 1 do
10: aij := aij − lik ukj
11: end for
12: end for
13: end for
stored in the memory space of A (i.e. in-place strategy) to reduce the memory requirements,
the main diagonal of L does not need to be stored since L is unit lower triangular, and P is
implicitly stored as a permutation vector. We do not include pivoting in the outline of the algorithms for the sake of clarity.
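As an illustration of Algorithm 1 (the KIJ variant), the following self-contained C sketch factorizes a small dense matrix in place, without pivoting. It is a didactic toy written for this text (the routine name lu_kij is ours) and is not part of ILUPACK or any library; row pivoting, omitted here as in Algorithm 1, would be required for general matrices.

#include <stdio.h>

/* In-place right-looking (KIJ) LU factorization without pivoting.
 * On exit, the strict lower triangle of a holds L (its unit diagonal is
 * implicit) and the upper triangle, diagonal included, holds U.
 * The n x n matrix a is stored row-major. */
static void lu_kij(double *a, int n)
{
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++)          /* column k of L          */
            a[i*n + k] /= a[k*n + k];
        for (int i = k + 1; i < n; i++)          /* rank-1 update of the   */
            for (int j = k + 1; j < n; j++)      /* trailing submatrix     */
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
    }
}

int main(void)
{
    double a[3*3] = { 4.0, -1.0,  0.0,
                     -1.0,  4.0, -1.0,
                      0.0, -1.0,  4.0 };
    lu_kij(a, 3);
    for (int i = 0; i < 3; i++)
        printf("%8.4f %8.4f %8.4f\n", a[3*i + 0], a[3*i + 1], a[3*i + 2]);
    return 0;
}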
1: for i = 1 to n − 1 do
2: for k = 0 to i − 1 do
3: aik := aik /akk
4: for j = k + 1 to n − 1 do
5: aij := aij − aik akj
6: end for
7: end for
8: end for
heavily on linked lists, which allow inserting new nonzero elements easily. Linked lists
have been abandoned in more modern implementations, as static data structures are often
preferred. In particular, a widely used technique consists of allocating extra space so that these new nonzeros can be inserted, adjusting the structure accordingly as more fill-in
is introduced.
In summary, a typical sparse direct solver will present four phases. First, as the effect of fill-in depends on the nonzero pattern, a pre-ordering of the rows and columns of the sparse matrix is applied, aiming to obtain a matrix with a more convenient structure. Although the problem of finding the permutation that minimizes fill-in is NP-hard, there are many effective heuristics for this purpose. Second, a symbolic factorization is performed, which computes the structure of the factors without regard for the numerical values. Third, the actual
factors L and U are constructed in the numerical factorization phase. Finally, the forward
and backward triangular systems are solved by substitution.
Some important references in this field are [52, 38].
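To give a concrete flavor of the last phase, the following C sketch stores the strictly lower part of a unit lower triangular factor in CSR (Compressed Sparse Row) format, a typical static sparse structure, and solves Ly = b by forward substitution. The layout and names are ours, chosen for clarity rather than performance.

#include <stdio.h>

/* Forward substitution L*y = b, with L unit lower triangular and stored in
 * CSR format (row_ptr/col_idx/val hold only the strictly lower entries;
 * the unit diagonal is implicit).  Illustrative sketch, not tuned code. */
static void lower_solve_csr(int n, const int *row_ptr, const int *col_idx,
                            const double *val, const double *b, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int p = row_ptr[i]; p < row_ptr[i+1]; p++)
            s -= val[p] * y[col_idx[p]];   /* uses already computed y_j, j < i */
        y[i] = s;                          /* unit diagonal: no division       */
    }
}

int main(void)
{
    /* L = [1 0 0; 0.5 1 0; 0 0.25 1], strictly lower entries only */
    int    row_ptr[] = {0, 0, 1, 2};
    int    col_idx[] = {0, 1};
    double val[]     = {0.5, 0.25};
    double b[] = {1.0, 2.0, 3.0}, y[3];
    lower_solve_csr(3, row_ptr, col_idx, val, b, y);
    printf("y = (%g, %g, %g)\n", y[0], y[1], y[2]);
    return 0;
}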
x0 ∈ span{b}. (2.8)
Then, following the previous idea, we should replace the current solution x0 with a linear
combination of itself with Ax0 . In this case the current solution x1 belongs to the subspace
span{b, Ab}. After k iterations, it is clear that the current iterate will belong to

xk ∈ span{b, Ab, A^2 b, . . . , A^k b}. (2.9)

M −1 Ax = M −1 b (2.10)
Classical relaxation methods construct the next iterate by forcing selected components of the residual to approach zero.
Consider, for example, the Jacobi iteration. Zeroing out a component i of the residual
imposes
(b − Ax)(i) = 0, (2.11)
where y (i) denotes the i-th component of vector y. Expanding the matrix-vector product
yields
b^(i) = Σ_{j≠i} aij x^(j) + aii x^(i), (2.12)

and solving for the i-th component, using the values of the previous iterate for the remaining components, yields

x_{k+1}^(i) = (1/aii) ( b^(i) − Σ_{j≠i} aij x_k^(j) ). (2.13)
Here each component is updated independently, using information from the previous step,
and therefore the updates can be performed in parallel. If expressed in matrix form, Equa-
tion (2.13) becomes
xk+1 = D−1 (b − (L + U )xk ), (2.14)
where D is a diagonal matrix whose diagonal is equal to that of A, and L and U are the strict
lower and upper triangles of A. Adding and subtracting xk from the previous expression
obtains
xk+1 = D−1 (b − (L + U )xk ) + xk − xk
     = xk + D−1 (b − (L + U + D)xk ) (2.15)
     = xk + D−1 (b − Axk ).
Similar derivations can be done for the Gauss-Seidel and SOR methods. The main
difference between the Jacobi and Gauss-Seidel methods is that, while in the former the updates of the components of the current iterate xk are performed independently, in the latter
they are performed in ascending or descending numerical order, using the updated values
to calculate the following ones. This is equivalent to solving a triangular system in each
iteration. The SOR method is in turn similar to Gauss-Seidel but incorporates a relaxation
parameter ω to the diagonal.
In general, if xk is an approximate solution to the linear system, we can express all the
previous methods as an iteration of the form

xk+1 = xk + M −1 (b − Axk ). (2.16)

Equation (2.16) is sometimes referred to as a "simple iteration" [58], and can be interpreted as
correcting the current iterate by adding a transformation of the current residual. It is easy
to see that, if M is taken to be the diagonal of A, the method is the Jacobi iteration. If,
instead, M is equal to the lower triangle of the matrix A (that is, D + L), the iteration is known as the Gauss-Seidel method. Finally, for M = ω −1 D + L, where D is the diagonal of A, L is the strict lower triangle, and ω is a relaxation parameter, the iteration is the Successive Over-Relaxation (SOR) method.
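The following self-contained C sketch makes the simple iteration (2.16) concrete for the Jacobi choice M = D on a small dense system. It is only an illustration under our own naming, not production code.

#include <stdio.h>
#include <math.h>

#define N 3

/* Simple iteration x_{k+1} = x_k + M^{-1}(b - A x_k) with M = diag(A),
 * i.e. the Jacobi method, for a small dense system. */
int main(void)
{
    double A[N][N] = {{ 4.0, -1.0,  0.0},
                      {-1.0,  4.0, -1.0},
                      { 0.0, -1.0,  4.0}};
    double b[N] = {3.0, 2.0, 3.0};
    double x[N] = {0.0, 0.0, 0.0};

    for (int k = 0; k < 100; k++) {
        double r[N], norm = 0.0;
        for (int i = 0; i < N; i++) {          /* residual r = b - A x   */
            r[i] = b[i];
            for (int j = 0; j < N; j++)
                r[i] -= A[i][j] * x[j];
            norm += r[i] * r[i];
        }
        if (sqrt(norm) < 1e-10)                /* stop when ||r|| is small */
            break;
        for (int i = 0; i < N; i++)            /* x := x + D^{-1} r        */
            x[i] += r[i] / A[i][i];
    }
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}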
Considering the previous discussion, it is reasonable to ask if there is a better way of choosing
the iterates xk from the Krylov subspace generated by c and B̂.
Let Kk be the subspace for the k-th iterate, where k − 1 is its dimension. In order
to find an approximate solution, one starts from an initial guess which, of course, belongs
to K0 . Then, successive attempts to improve this initial solution should be able to expand
this subspace in dimension, and some constraints should be imposed on them so that they
result in good approximations to the solution of the system. One way of describing these
constraints is as orthogonality conditions. In other words, a typical means of finding good
approximations is forcing the residual vector b − Ax to be orthogonal to k linearly indepen-
dent vectors. This set of vectors form a basis to another subspace L, of dimension k, which
is called the subspace of constraints or left subspace [88]. This general idea, known as the
Petrov-Galerkin conditions, is common to all the solvers that will be presented next.
Considering again the iteration (2.16), one can say that it is static in the sense that the
information that will be employed to compute an approximate solution xk+1 is the same
regardless of the iteration. An attempt to improve this algorithm would be to introduce a
dynamically calculated parameter αk in order to optimize how far away from the current
solution the next one should be.
Without losing generality, assume that the system Ax = b denotes now the preconditioned linear system of (2.10). Then, introducing this new parameter yields

xk+1 = xk + αk (b − Axk ) = xk + αk rk . (2.19)
In order to determine the value of αk , it should be noted that the residual follows the
relation
rk+1 = rk − αk Ark . (2.20)
Thus, it is natural to choose αk so that rk+1 is minimized. Then, it is easy to prove that
this happens when the vectors rk − αk Ark and αk Ark are orthogonal, which yields
αk = ⟨rk , rk ⟩ / ⟨Ark , rk ⟩. (2.21)
With this choice of αk , the expression in (2.20) can be seen as subtracting from rk its
orthogonal projection onto the direction Ark .
When the matrix A is Hermitian and positive definite, the above choice of αk combined
with formula (2.19) gives place to the so called method of steepest descent. This method
belongs to the family of gradient descent methods, and earns its name from looking at the
linear system Ax = b as the optimization problem of finding the solution x that minimizes

f (x) = (1/2) x^T A x − x^T b, (2.22)

which is a quadratic form that determines a paraboloid with x = A−1 b as its minimum. For
a given point x = xk , −∇x f (xk ) = b − Axk points in the direction where f decreases more
quickly, i.e. its direction of steepest descent. Hence, the method can be viewed as iteratively
taking a step of size αk in the direction of steepest descent so that the next residual is
minimized. In general, one could write the previous method as
xk+1 = xk + αk pk , (2.23)
where pk = rk and αk = ⟨rk , rk ⟩/⟨Ark , rk ⟩, and devise different iterations by replacing the direction pk and the step size αk by other expressions.
Although the steepest descent algorithm chooses rk+1 so that it is orthogonal to rk , it is not necessarily orthogonal to the other k−1 residuals or, equivalently, the error ek+1 = xk+1 − x is not A-orthogonal to pk−1 . This causes the method to often take steps in previously explored
directions. This form of approaching x is, of course, not optimal.
To improve this strategy, one approach is to look for n linearly independent search
directions {p0 , . . . , pn }, and take one step in each one so that, in step n, the exact solution x
is reached. A convenient choice is to require the search directions pk to be A-orthogonal with
each other. This way if, like in steepest descent, one also requires ek+1 to be A-orthogonal
to pk , the following expression for the step αk can be derived
αk = ⟨pk , rk ⟩ / ⟨pk , Apk ⟩. (2.24)
This is referred to in the seminal article by Hestenes and Stiefel [62] as the method of conjugate
directions.
The description of the method of conjugate directions is not complete in the sense that
it does not define how the set of conjugate (or A-orthogonal) vectors {p0 , . . . , pn } should be
computed. There is, in fact, more than one way of finding such vectors, and some of these
strategies characterize particular methods themselves.
As the Gram-Schmidt procedure can be utilized to form an orthonormal basis of a given
subspace, a variation of this algorithm, called conjugate Gram-Schmidt, is one straight-
forward approach to form the set of conjugate directions. Starting from an initial set of
linearly independent vectors {u0 , . . . , un }, and setting p0 = u0 , the rest of the A-orthogonal
directions pk are calculated as follows

pk = uk − Σ_{j=0}^{k−1} ( ⟨uk , Apj ⟩ / ⟨pj , Apj ⟩ ) pj . (2.25)
This simply means taking a vector uk and subtracting each component of such vector that
is not A-orthogonal to the k − 1 conjugate directions previously computed.
The main disadvantages of this procedure are that, in order to compute the direction
pk , the other k − 1 directions have to be kept in memory, and it takes O(n3 ) operations to
compute the whole set. In fact, when the vectors {u0 , . . . , un } are taken to be the columns
of the identity matrix of size n × n, this process is equivalent to Gaussian Elimination.
The method of conjugate gradients [62] (CG) is a variant of the method of conjugate
directions that elegantly eliminates the problems of the conjugate Gram-Schmidt procedure
by taking the vectors {u0 , . . . , un } to be the residuals {r0 , . . . , rn }. Using the fact that the residual rk is orthogonal to all previous residuals, and that the residuals follow the recurrence

rk+1 = rk − αk Apk , (2.26)

it can be derived from (2.25) that the recurrence that determines the search directions is given by

pk+1 = rk+1 + ( ⟨rk+1 , rk+1 ⟩ / ⟨rk , rk ⟩ ) pk . (2.27)
As it is a variant of the conjugate directions method, the CG method finds the exact
solution of the linear system in, at most, n iterations. Moreover, it can be shown that, in
a given step k, the method minimizes the A-norm of the error ek over the affine subspace e0 + span{p0 , . . . , pk }. In other words, the error is as small as possible (under the A-norm) given the search directions visited so far. A proof of this result
can be found in [58].
Putting it all together, one possible formulation of the Conjugate Gradient method is
that on Algorithm 3. This formulation includes the application of left-preconditioning with
the matrix M . Of course, the matrix-vector product zk+1 := M −1 rk+1 can be replaced by
the solution of a linear system.
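As a concrete companion to Algorithm 3, the following C sketch implements the left-preconditioned CG iteration on a small dense SPD system, using a simple Jacobi (diagonal) preconditioner in place of ILUPACK's multilevel ILU. It is a toy reference under names of our choosing (dot, matvec), not code taken from the package.

#include <stdio.h>
#include <math.h>

#define N 3

static double dot(const double *u, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += u[i] * v[i];
    return s;
}

/* w := A*v for a small dense matrix. */
static void matvec(const double A[N][N], const double *v, double *w)
{
    for (int i = 0; i < N; i++) {
        w[i] = 0.0;
        for (int j = 0; j < N; j++) w[i] += A[i][j] * v[j];
    }
}

/* Preconditioned CG (Algorithm 3) with M = diag(A), a Jacobi preconditioner. */
int main(void)
{
    double A[N][N] = {{ 4.0, -1.0,  0.0},
                      {-1.0,  4.0, -1.0},
                      { 0.0, -1.0,  4.0}};
    double b[N] = {3.0, 2.0, 3.0};
    double x[N] = {0.0, 0.0, 0.0};
    double r[N], z[N], p[N], w[N];

    matvec(A, x, w);
    for (int i = 0; i < N; i++) r[i] = b[i] - w[i];     /* r0 = b - A x0   */
    for (int i = 0; i < N; i++) z[i] = r[i] / A[i][i];  /* z0 = M^{-1} r0  */
    for (int i = 0; i < N; i++) p[i] = z[i];
    double rho = dot(r, z);

    for (int k = 0; k < 100 && sqrt(dot(r, r)) > 1e-12; k++) {
        matvec(A, p, w);                                 /* w = A p         */
        double alpha = rho / dot(p, w);
        for (int i = 0; i < N; i++) x[i] += alpha * p[i];
        for (int i = 0; i < N; i++) r[i] -= alpha * w[i];
        for (int i = 0; i < N; i++) z[i] = r[i] / A[i][i];
        double rho_new = dot(r, z);
        double beta = rho_new / rho;
        rho = rho_new;
        for (int i = 0; i < N; i++) p[i] = z[i] + beta * p[i];
    }
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}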
When the matrix A is indefinite or non-symmetric, Algorithm 3 will fail if it finds a direction pk for which the denominator ⟨pk , Apk ⟩ = 0. In this sense, the Bi-conjugate
Gradient (BiCG) algorithm can be considered as a generalization of the CG method designed
1: x0 := 0
2: r0 := b − Ax0
3: z0 := M −1 r0
4: p0 := z0
5: ρ0 := r0T z0
6: τ0 := ‖r0 ‖2
7: k := 0
8: while (τk > τmax ) do
9: wk := Apk
10: αk := ρk /pTk wk
11: xk+1 := xk + αk pk
12: rk+1 := rk − αk wk
13: zk+1 := M −1 rk+1
14: ρk+1 := rk+1^T zk+1
15: βk := ρk+1 /ρk
16: pk+1 := zk+1 + βk pk
17: τk+1 := ‖rk+1 ‖
18: k := k + 1
19: end while
with

αk = ⟨r̂k , rk ⟩ / ⟨p̂k , Apk ⟩, (2.30)

and

pk+1 = rk − βk pk , (2.31)

where

βk = ⟨r̂k+1 , rk+1 ⟩ / ⟨r̂k , rk ⟩. (2.33)
As in the case of the CG, a nice property of the algorithm is that the above bi-orthogonality and bi-conjugacy conditions hold for any pairs of vectors (p̂i , pj ) and (r̂i , rj ) such that i ≠ j, without explicitly enforcing them. In turn, as the vectors pk+1 are nothing more than linear combinations of the vectors rk+1 to r0 , the previous orthogonality conditions also imply that

⟨p̂i , rj ⟩ = ⟨pi , r̂j ⟩ = 0. (2.36)
A more detailed derivation of the algorithm, as well as proofs of the aforementioned prop-
erties can be found in [47].
1: Initialize x0 , r0 , q0 , p0 , s0 , ρ0 , τ0 ; k := 0
2: while (τk > τmax ) do
3: αk := ρk /(qkT Apk )
4: xk := xk + αk pk
5: rk := rk − αk Apk
6: tk := M −1 rk
7: zk := M −T AT qk
8: sk+1 := sk − αk zk
9: ρk+1 := (sTk+1 rk )/ρk
10: pk+1 := tk + ρk+1 pk
11: qk+1 := sk+1 − ρk+1 qk
12: τk+1 := ‖rk ‖2
13: k := k + 1
14: end while
The Generalized Minimum Residual Method (GMRES) is a projection method for non-
symmetric and indefinite matrices that first constructs an orthogonal basis for the subspace
K(A, r0 ), and then extracts the approximate solution from this subspace so that the norm
of the residual is minimized.
In its most common formulation, this orthogonal basis is formed by the modified Gram-
Schmidt process that is outlined in Algorithm 5, which is usually referred to as Arnoldi
iteration [20].
If Qk is the matrix whose columns are the k orthogonal Arnoldi vectors, the following relation holds:

AQk = Qk+1 Hk+1,k , (2.37)
1: for j = 0 to k do
2: q̃j+1 := Aqj
3: for i = 0 to j do
4: hij := hq̃j+1 , qi i
5: q̃j+1 := q̃j+1 − hij qi
6: end for
7: hj+1,j := kq̃j+1 k
8: qj+1 := q̃j+1 /hj+1,j
9: end for
where Hk+1,k is a Hessenberg matrix (a matrix with all the elements below the first sub-
diagonal set to zero) that contains the hij coefficients of Algorithm 5.
Once this set of vectors is constructed, GMRES will compute the approximate solution xk such that it has the form
xk = x0 + Qk yk , (2.38)
for some vector yk . In other words, the iterate xk will be computed as the initial guess plus
some linear combination of the {q0 , . . . , qk−1 } Arnoldi vectors.
To obtain the approximation of the form (2.38) that minimizes the 2-norm of the residual ‖r0 − AQk yk ‖, GMRES solves the following least-squares problem

min_y ‖r0 − AQk y‖ = min_y ‖r0 − Qk+1 Hk+1,k y‖ = min_y ‖β e1 − Hk+1,k y‖, (2.39)

where β = ‖r0 ‖ and e1 is the first column of the identity matrix of size k + 1. The second step of the previous equality takes into account that Qk+1 e1 , the first column of matrix Qk+1 , is equal to r0 / ‖r0 ‖.
The solution to the above least-squares problem can be obtained inexpensively by com-
puting the QR factorization of Hk+1,k into a unitary matrix Fk and an upper-triangular
matrix Rk . Once this factorization is obtained, the solution y for (2.39) is given by the
linear system
R̃y = β(Fk e1 ), (2.40)
where R̃ denotes the k × k top-left submatrix of Rk , and Fk e1 are the top k entries of the
first column of F . This factorization is inexpensive, as it can be obtained by applying plane
rotations in order to annihilate the first sub-diagonal of Hk+1,k . Furthermore, provided that
the QR factorization of Hk+1,k is available, it is possible to compute the factorization of
Hk+2,k+1 with little overhead. Hence, the above process can be performed in a progressive manner, which makes it possible to obtain the residual norm inexpensively at each step of GMRES in order to check for convergence. For details we refer to [58, 88].
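To illustrate this progressive QR factorization, the following C sketch applies Givens rotations to one new column of the Hessenberg matrix and to the rotated right-hand side β e1, so that |g[k+1]| gives the current residual norm. It is a schematic helper written for this text under our own naming, not an excerpt of any GMRES implementation.

#include <stdio.h>
#include <math.h>

/* Progressive QR of the Hessenberg matrix via Givens rotations: process
 * column k (of length k+2), reusing the rotations cs/sn of the previous
 * columns and updating the rotated right-hand side g (initially beta*e1).
 * After the call, |g[k+1]| equals the GMRES residual norm at step k+1. */
static void givens_update(double *h, double *cs, double *sn, double *g, int k)
{
    for (int i = 0; i < k; i++) {          /* apply previous rotations      */
        double t = cs[i] * h[i] + sn[i] * h[i + 1];
        h[i + 1] = -sn[i] * h[i] + cs[i] * h[i + 1];
        h[i]     = t;
    }
    double r = hypot(h[k], h[k + 1]);      /* new rotation kills h[k+1]     */
    cs[k] = h[k] / r;
    sn[k] = h[k + 1] / r;
    h[k]     = r;
    h[k + 1] = 0.0;
    g[k + 1] = -sn[k] * g[k];              /* rotate the right-hand side    */
    g[k]     =  cs[k] * g[k];
}

int main(void)
{
    /* two columns of a small 3 x 2 Hessenberg matrix, processed in turn */
    double h0[3] = {3.0, 4.0, 0.0}, h1[3] = {1.0, 2.0, 2.0};
    double cs[2], sn[2], g[3] = {10.0, 0.0, 0.0};   /* beta = 10 */
    givens_update(h0, cs, sn, g, 0);
    givens_update(h1, cs, sn, g, 1);
    printf("residual norm after 2 steps: %g\n", fabs(g[2]));
    return 0;
}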
The GMRES algorithm becomes impractical when k is large, given that it is necessary
to store and operate with the whole set of qi vectors. One commonly-used strategy consists
of simply restarting GMRES after m iterations, where m is a user-defined parameter, and
use the current iterate as the initial guess for the restart. This is referred to as GMRES(m),
which is outlined in Algorithm 6.
Without going into details about the convergence properties of full and restarted GM-
RES, it is safe to say that small values of m will cause a slow convergence, and that large val-
ues of m can lead to considerable runtimes and memory requirements, so a trade-off between
these factors must be considered. Moreover, the restarted GMRES algorithm can stagnate,
i.e. cease to improve the approximation, when the matrix is not positive definite. This
drawback can be, in part, overcome by preconditioning techniques. The references [58, 88]
contain convergence analysis of these methods, so the interested reader should consult them
for further details.
1: Initialize x0 , r0 , q0 , p0 , . . .
2: k := 0, β := ‖r0 ‖, ξ1 := (1, 0, . . . , 0)T
3: while (k < m) do
4: zk+1 := M −1 (Aqk )
5: [H(:, k + 1), qk+1 ] := MGSO(Qk , zk+1 )
6: k := k + 1
7: end while
8: yk := arg miny ‖βξ1 − Hk y‖
9: xk := x0 + Qk yk
10: τk := ‖b − Axk ‖
11: if τk > τmax then
12: x0 := xk
13: restart GMRES
14: end if
Just as GMRES takes the matrix A into a Hessenberg matrix by means of an Arnoldi
iteration, QMR applies a double-sided or non-symmetric Lanczos process on the original
matrix that transforms it into a tridiagonal (also Hessenberg) matrix. This transformation
can be described in matrix form by the following recurrence

AVk = Vk+1 Tk+1,k , (2.41)
1: Initialize ω := ‖zk ‖^2 , τ := 0.99, i := 1
2: while (i <= k) do
3: H(i, k + 1) := qiT zk
4: zk := zk − H(i, k + 1)qi
5: if (|H(i, k + 1)|2 > ωτ ) then
6: β := qiT zk
7: H(i, k + 1) := H(i, k + 1) + β
8: zk := zk − βqi
9: end if
10: ω := ω − |H(i, k + 1)|2
11: if (ω < 0) then
12: ω := 0
13: end if
14: i := i + 1
15: end while
16: qk+1 := zk / ‖zk ‖
17: Hk,k+1 := ‖zk ‖
where Vk are the k Lanczos basis vectors and Tk+1 are the coefficients of the three-term
recurrence of the Lanczos iteration. Once this basis Vk is obtained, the same strategy is
applied, and the iterate xk is taken to be of the form
xk = x0 + Vk y. (2.42)
The main difference with GMRES is that the Lanczos process does not generate an
orthonormal basis. By taking v0 = r0 / kr0 k and performing the same derivation as in
GMRES to get y such that the residual r0 − AVk+1 y is minimal, one obtains
where β = kr0 k and e1 is the first column of the identity matrix. Unlike in GMRES, it is
not possible to eliminate Vk+1 from (2.43) so that the above least-squares problem can be
solved easily. However, it is reasonable to consider that if y minimizes kβe1 − Tk+1,k yk, this
will be good enough to converge properly to the solution of the system. Hence, the method
earns its qualification of “quasi-minimal” instead of minimal, and kβe1 − Tk+1,k yk is called
the “quasi-residual” norm.
Now, a result due to Freund and Zha [48] states that the Lanczos process can be simplified when a nonsingular matrix P is available such that

AT P = P A. (2.44)
In this case, the following relation can be established between the right Lanczos vectors wk
and the left Lanczos vectors vk
wk = P vk / ‖vk ‖. (2.45)
If the matrix-vector product P vk is easy to compute, as is the case when P is sparse, this relation makes it possible to obtain the Lanczos basis with almost half the storage and computing
effort.
Consider next a linear system with a symmetric indefinite matrix A ∈ Rn×n and a preconditioner M that can be factorized as

M = ML MR , (2.46)

so that the preconditioned system reads

Âx̂ = b̂, (2.47)

where

Â = ML−1 A MR−1 , b̂ = ML−1 b, and x̂ = MR x. (2.48)
Here it is not necessary that ML and MR are each other's transpose, and left- or right-preconditioned variants can be obtained by simply setting MR = I or ML = I, respectively. Choosing P = ML^T MR−1 one gets
When A is not symmetric, the simplification of the Lanczos process cannot be made, and thus the BiCG and QMR methods imply recursions with both A and AT. To avoid using the transpose and to improve the convergence of BiCG at a comparable computational cost, Sonneveld introduced the Conjugate Gradient Squared (CGS) algorithm [96]. The method is based on noting that the residual and direction vectors at step k can be expressed as
rk = φk (A)r0 , (2.50)
pk = πk (A)r0 , (2.51)
which indicates that, if it is possible to get a recursion for the vectors φk(A)^2 r0 and πk(A)^2 r0 , then computing αk and, similarly, βk causes no problem. Hence the idea of seeking an algorithm which would give a sequence of iterates whose residual vectors rk′ satisfy

rk′ = ψk (A) φk (A) r0 ,

where ψk (t) is a polynomial that aims to "stabilize" the convergence behavior of the original algorithm. Specifically, ψk (t) is defined by the recurrence

ψk+1 (t) = (1 − ωk t) ψk (t),
in which the scalar ωk can be selected such that it achieves a steepest descent step in
the residual direction obtained before multiplying the residual vector by (I − ωk A). In
other words, ωk is chosen to minimize the 2-norm of the vector (I − ωk A)ψk (A)φk+1 (A)r0 .
Algorithm 9 outlines one of the possible formulations of the procedure. A complete derivation
of the algorithm can be found in [58, 88, 103].
1: Initialize x0 , r0 , q0 , p0 , . . .
2: k := 0, β := kr0 k , ξ1 := (1, 0, . . . , 0)T
3: while (τk > τmax ) do
4: ρk+1 = (r̂0 , rk )
5: β = (ρk+1 /ρk )(α/ωk )
6: pk+1 = rk + β(pk − ωk vk )
7: vk+1 = M −1 Apk+1
8: α = ρk+1 /(r̂0 , vk+1 )
9: s = rk − αvk+1
10: t = M −1 As
11: ωk+1 = (t, s)/(t, t)
12: xk+1 = xk + αpk+1 + ωk+1 s
13: rk+1 = s − ωk+1 t
14: τk+1 := ‖rk+1 ‖2
15: k := k + 1
16: end while
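The following C sketch corresponds to the unpreconditioned case of Algorithm 9 (that is, M = I) applied to a small dense non-symmetric system. It is a minimal toy meant only to fix the structure of the method; the names and the convergence test are ours.

#include <stdio.h>
#include <math.h>

#define N 3

static double dot(const double *u, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += u[i] * v[i];
    return s;
}

static void matvec(const double A[N][N], const double *v, double *w)
{
    for (int i = 0; i < N; i++) {
        w[i] = 0.0;
        for (int j = 0; j < N; j++) w[i] += A[i][j] * v[j];
    }
}

/* Unpreconditioned BiCGStab for a small dense, non-symmetric system. */
int main(void)
{
    double A[N][N] = {{ 4.0, -1.0,  0.0},
                      {-2.0,  4.0, -1.0},
                      { 0.0, -1.0,  4.0}};
    double b[N] = {3.0, 1.0, 3.0};
    double x[N] = {0.0, 0.0, 0.0};
    double r[N], rhat[N], p[N] = {0}, v[N] = {0}, s[N], t[N];
    double rho = 1.0, alpha = 1.0, omega = 1.0;

    matvec(A, x, t);
    for (int i = 0; i < N; i++) r[i] = b[i] - t[i];
    for (int i = 0; i < N; i++) rhat[i] = r[i];            /* shadow residual */

    for (int k = 0; k < 100 && sqrt(dot(r, r)) > 1e-12; k++) {
        double rho_new = dot(rhat, r);
        double beta = (rho_new / rho) * (alpha / omega);
        rho = rho_new;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * (p[i] - omega * v[i]);
        matvec(A, p, v);
        alpha = rho / dot(rhat, v);
        for (int i = 0; i < N; i++) s[i] = r[i] - alpha * v[i];
        matvec(A, s, t);
        if (dot(t, t) == 0.0) {                            /* s already zero  */
            for (int i = 0; i < N; i++) x[i] += alpha * p[i];
            break;
        }
        omega = dot(t, s) / dot(t, t);
        for (int i = 0; i < N; i++) x[i] += alpha * p[i] + omega * s[i];
        for (int i = 0; i < N; i++) r[i] = s[i] - omega * t[i];
    }
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}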
2.3 Preconditioners
In general, iterative solvers exhibit a lack of robustness that severely limits their applicability in industrial contexts [88]. Preconditioners were introduced in Section 2.2 as a transformation of the original system into an alternative one that has the same solution, performed with the purpose of improving its condition number, which is the ratio between the largest and smallest eigenvalues of the coefficient matrix of the system. Roughly speaking, when this number is large, the unavoidable rounding errors due to the use of finite-precision arithmetic result in iterative methods having a hard time converging to an acceptable solution of the system.
This modification can be defined in a number of ways, and it is usually represented by a
matrix M which, in general, should meet some minimal requirements.
In the basic form of an iterative method introduced in Equation (2.16), one possibility is
to apply the transformation to the system before starting the solver. In many cases, this is
not convenient since M −1 A is much less sparse than A and M , which increases the storage and computation cost of the solver. The alternative is to either perform a matrix-vector product

zk = M −1 rk (2.57)

in each iteration k, where rk is the residual computed in the previous step, or to solve the equivalent linear system

M zk = rk . (2.58)

Therefore, the
first requirement any preconditioner M should meet is that the linear system (2.58) is easy
to solve or, at least, easy compared to the effort of solving Ax = b with a direct method. Of
course, M should also be non-singular so that (2.58) has a unique solution.
The second requirement is more complex, and it is related to the quality of the preconditioner. It is evident that any of the aforementioned preconditioned iterations converges in
one iteration if the preconditioner M is set to be the matrix A itself. It is also evident that
this is not practical since the effort would be equivalent to that of computing A−1 . However,
this leads to the intuitive idea that iterative methods can converge faster if the precondi-
tioner M resembles A in some sense. The exact sense in which M should approximate A
depends on the iterative method to be used and it is not always clear. For example, when we
described the fixed-point iterations, it was mentioned that their convergence is related to the largest eigenvalue (in magnitude) of I − M −1 A, so a preconditioner M such that ρ(I − M −1 A) ≪ 1 is
desirable in this case. For other Krylov subspace methods, there are results that characterize
the effectiveness of the preconditioner in terms of its eigenvalue distribution or that of the
preconditioned matrix, but this is much better understood for Hermitian matrices than it
is for non-Hermitian or indefinite ones [58].
There are three ways of applying a preconditioner. Left-preconditioning was already
presented in Equation (2.10). Similarly, right-preconditioning leads to the following system
AM −1 u = b, x = M −1 u. (2.59)
Here, a change of variables u = M x is made, and the system is solved for the unknown u, so
finding the solution x of the original system will imply solving M x = u at the end. Often,
the preconditioner can be expressed in factorized form M = ML MR , where typically ML
and MR are triangular matrices. In this case, the preconditioner can be applied as follows
Splitting the preconditioner in such a way is convenient in order to preserve symmetry when
the original matrix shows this property.
Strictly speaking, a preconditioner is any form of internal solver that aims to accelerate
the convergence of an iterative method. In some cases, preconditioners can be derived from
the intrinsic properties of the original physical phenomena that the linear system or PDE is
looking to solve. Most times however, preconditioners are constructed departing from the
original coefficient matrix, as useful physical properties are rarely available in general. In
this sense, preconditioner techniques can range from simply scaling all rows of the linear
system to make the diagonal elements equal to one, to a preconditioner M that reflects a
complicated mapping involving FFT transforms, integral calculations, and additional linear
system solutions [88].
One of the most popular ways of constructing a preconditioner is by means of a factorization of the original matrix A, such as the Cholesky or LU factorization. When A is sparse, the triangular factors L and U are, in general, much denser than the original matrix.
This leads to the so-called incomplete factorizations, which are decompositions of the form
A = L̃Ũ − R, where L̃ and Ũ are similar to L and U but present a much sparser nonzero
structure, and R is the residual or error of the factorization. Their importance and wide range of applications motivate us to dedicate the rest of this section to presenting some of their most popular variants, as well as the latest advances aimed at enhancing their robustness.
ILU0
A paramount example of the positional type of ILU is the ILU factorization with no fill-in,
frequently denoted as ILU(0) or ILU0. In this widely-applied incomplete factorization, the
L̃ and Ũ factors present the same sparsity pattern as the lower and upper parts of the matrix A, respectively.

Figure 2.1: Regular 2D grid and sparsity structure of the stiffness matrix. Extracted from [88]
The ILU0 factorization algorithm can be directly derived from the ILUP algorithm by
simply choosing P to be the zero pattern of A. However, the usefulness of this preconditioner
can be better appreciated by analyzing the particular case of systems derived from the
discretization of elliptic PDEs on regular grids. Considering (for simplicity) the case of
regular 2D grids, it can be noticed that in this type of problem the stiffness matrix
presents a structure that consists of the main diagonal, two adjacent diagonals and two
other diagonals that are at a distance from the main one that is equal to the width of the
mesh. This is illustrated in Figure 2.1.
In general, the product of any lower and upper triangular matrices L̃ and Ũ such that
these two matrices present the same sparsity pattern, respectively, of the lower and upper
parts of A, will not have the same sparsity pattern as that of A. Specifically, two diagonals of nonzero entries will appear in the product at the positions nx − 1 and −nx + 1. However, if these fill-in diagonals are ignored, it is possible to find matrices L̃ and Ũ such that

(L̃Ũ )ij = aij for all (i, j) such that aij ≠ 0. (2.62)
The product of those factors would produce a matrix à that differs from A only in these two
extra fill-in diagonals, so it can be expected that Ã reasonably resembles A, and therefore turns out to be a good preconditioner [88]. By analyzing the algorithm derived from ILUP,
where we replace P by the zero pattern of A, it can be observed that the above property
holds. Therefore, the ILU factorizations with no fill-in, or ILU0, can be defined broadly as
those factorizations such that the L̃ and Ũ factors have the same sparsity pattern as the
lower and upper parts of A, respectively, and such that Equation (2.62) holds. However, it
should be noted that this factorization is not unique, since, for example, it does not impose
restrictions on the magnitude of the fill-in elements generated by the product.
From the above discussion, it is natural to think that this type of preconditioner is
especially well suited to solve elliptic PDEs. In fact, there are cases where, if the stiffness
matrix is written as
A = LA + DA + UA , (2.63)
where LA and UA are, respectively, the strict lower and upper parts of A, and DA is the main diagonal, it is possible to find an ILU0 preconditioner M such that

M = (LA + D) D−1 (D + UA ). (2.64)
Here, D is a diagonal matrix that is derived from the terms LA and UA by imposing that M
meets the ILU criteria of Equation (2.62). The convenience of this approach is evident in the
case of a 5- (or 7)-point stencil, as then it is only necessary to store a single extra diagonal in
order to compute the preconditioner, and the entries of this diagonal can be calculated from
a simple recurrence. During the iterative solver, the application of the preconditioner to the
residual r consists in solving the systems (LA + D)y = r, D−1 z = y and (D + UA )x = z,
which can also be expressed in terms of simple relations between the coefficients of A and
D.
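As an illustration of the discussion above, the following sketch applies the IKJ elimination while discarding every entry that falls outside the original pattern of A. It is a minimal dense-storage example written for clarity only (no pivoting, no sparse data structures); it is not ILUPACK's implementation, which operates on compressed sparse storage.

```cpp
// Minimal dense-storage sketch of ILU(0): IKJ elimination restricted to the
// nonzero pattern of A. For illustration only; production codes work on CSR.
#include <vector>
#include <cstddef>

using Matrix = std::vector<std::vector<double>>;   // A[i][j], dense stand-in

// On return, the strict lower part of A holds the multipliers of L (unit
// diagonal implicit) and the upper part holds U, restricted to the pattern of A.
void ilu0(Matrix& A)
{
    const std::size_t n = A.size();
    std::vector<std::vector<bool>> pattern(n, std::vector<bool>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            pattern[i][j] = (A[i][j] != 0.0);       // remember the original pattern

    for (std::size_t i = 1; i < n; ++i) {           // IKJ variant: factor row i
        for (std::size_t k = 0; k < i; ++k) {
            if (!pattern[i][k]) continue;           // only positions of the pattern
            A[i][k] /= A[k][k];                     // multiplier l_ik
            for (std::size_t j = k + 1; j < n; ++j)
                if (pattern[i][j])                  // every fill-in position is dropped
                    A[i][j] -= A[i][k] * A[k][j];
        }
    }
}
```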
ILU(l)
When moving away from elliptic PDEs and considering unstructured matrices, the ILU0
approach becomes less stable and often does not lead to acceptable preconditioners.
An obvious strategy to cope with this limitation is to allow a certain degree of fill-in in
the L̃ and Ũ factors, looking for a trade-off between the accuracy of the preconditioner and
the storage required to hold the factors, versus the floating-point operations required
to solve the resulting triangular linear systems.
This alternative leads to a version of ILUP where the target pattern is either updated
during the GE process, or is determined prior to the numerical factorization by a symbolic
factorization phase. In this variant, a “level of fill” is assigned to each position of the sparse
matrix, so that an element is dropped if its level of fill is greater than a predefined threshold
l.
The following definition and explanation of the level of fill will take into account the
row-wise version of Gaussian Elimination. Nevertheless, all the concepts can be adapted to
the remaining variants of GE.
Recall Section 2.1.2, where the problem of fill-in was introduced. There we characterized
a fill-in entry aij, created during the inner loop of the row-wise or IKJ variant of GE,
as being generated by two other “generating entries” aik and akj. Most of the time, it
is reasonable to expect that these new non-zeros will be smaller in magnitude than their
generating entries.
To justify this assumption, a simple model is frequently used in the literature [88], in which
a size δ^k, with δ < 1, is assigned to any element whose level of fill is k. The update of the size of an
element during GE can be derived directly from line 5 of Algorithm 2: the updated entry has size approximately
\[
\delta^{lev_{ij}} - \delta^{lev_{ik}}\,\delta^{lev_{kj}} = \delta^{lev_{ij}} - \delta^{lev_{ik}+lev_{kj}}.
\]
The updated size of the entry will then depend on the relation between the exponents levij
and levik + levkj but, as δ < 1, the model assumes that this size will be close to either δ^{levij}
or δ^{levik+levkj}, so it defines the updated level of fill as
\[
lev_{ij} := \min\{\, lev_{ij},\; lev_{ik} + lev_{kj} \,\}.
\]
Now, if initial levels of fill are assigned to each matrix position such that
\[
lev_{ij} = \begin{cases} 1 & \text{if } a_{ij} \neq 0, \\ \infty & \text{if } a_{ij} = 0, \end{cases} \qquad (2.67)
\]
there are two considerations that should be made. The first one is that, if an entry aij is
initially a nonzero, it will remain assigned to level 1 during the entire GE process. The
second one is that, if a nonzero entry is generated during GE by generating entries aik
and akj, it will be assigned to level levik + levkj and, the higher the level of the generating
entries, the smaller the size of the new nonzero, so the level of fill can give an idea of the
importance of each fill-in entry generated during GE.
In order to be consistent with the definition of ILU0, the initial level assigned to each
entry is shifted by −1, and the update of the level of fill becomes
\[
lev_{ij} := \min\{\, lev_{ij},\; lev_{ik} + lev_{kj} + 1 \,\}.
\]
This is just a notational convention and does not change any of the aforementioned concepts.
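A minimal sketch of the symbolic phase implied by this rule is given below. It assumes a dense matrix of levels purely for illustration; practical implementations propagate the levels over the sparse pattern only, and the function name and data layout are hypothetical.

```cpp
// Sketch of the symbolic phase of ILU(l): propagate levels of fill with the shifted
// rule lev_ij = min(lev_ij, lev_ik + lev_kj + 1); positions with lev_ij <= l form
// the target pattern of the incomplete factors.
#include <vector>
#include <limits>
#include <algorithm>

using LevelMatrix = std::vector<std::vector<int>>;
constexpr int kInf = std::numeric_limits<int>::max() / 2;   // "infinite" level

LevelMatrix ilu_levels(const std::vector<std::vector<double>>& A, int l)
{
    const int n = static_cast<int>(A.size());
    LevelMatrix lev(n, std::vector<int>(n, kInf));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (A[i][j] != 0.0 || i == j) lev[i][j] = 0;    // original entries: level 0

    for (int i = 1; i < n; ++i)                             // IKJ sweep on the levels
        for (int k = 0; k < i; ++k) {
            if (lev[i][k] > l) continue;                    // this pivot entry is dropped
            for (int j = k + 1; j < n; ++j)
                lev[i][j] = std::min(lev[i][j], lev[i][k] + lev[k][j] + 1);
        }
    return lev;   // keep (i,j) in the ILU(l) pattern whenever lev[i][j] <= l
}
```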
Using the ILU(l) factorization as preconditioner can be very effective in accelerating
the convergence of iterative methods for many problems. However, it has some important
drawbacks. For example, one major problem is that, in general, it is impossible to predict
the storage required to hold an ILU(l) factorization. For this reason, the codes that obtain
this type of ILU are often composed of an initial symbolic factorization phase, in which the
storage requirements are calculated, and a numerical factorization phase, which computes the
actual ILU. In relation to this, allowing only a few levels of fill-in during the factorization
often results in factors very close to L and U. In this sense, there is a “maximum level
of fill”, which is defined as the lowest level of fill l for which the sparsity pattern of the
incomplete factors is equal to that of L and U.
An additional important issue is that the level of fill may not always be a good
indicator of the size of the elements that are being dropped. This can cause the dropping
of large elements, which may severely harm the accuracy of the factorization, probably leading to
an ineffective preconditioner.
MILU
It has been observed that when ILU techniques are applied to solve elliptic PDEs, the
asymptotic convergence rate of iterative solvers, as the mesh size becomes smaller, is only
marginally better than an approach with no preconditioner [41]. In order to improve this
situation, the Modified-ILU (MILU) aims to compensate the effect of the discarded entries
by adding a term to the update of the diagonal entry of the Ũ factor during a given step of
the factorization. This term can be problem dependent. For example, in problems derived
from second-order PDEs, it is recommended to add a term ch^2, where c ≠ 0 is a constant
and h is the mesh size. In general, a popular strategy is to add all the elements that have
been dropped during the update of a given row to the diagonal entry (of U) corresponding
to that row.
MILU has proven to be very effective in problems such as elliptic PDEs but, unfortu-
nately, standard ILUs often outperform their MILU counterparts. According to Van der
Vorst [102], this is related to a higher sensitivity of MILU to round-off errors. This has
motivated intermediate versions, obtained by introducing relaxation parameters.
ILUT
When dropping an entry, positional ILU strategies make assumptions about the importance
of the elements being dropped based on their position in the matrix. It turns out that in
many cases these assumptions are inaccurate, and lead to difficulties when addressing real
problems. To remedy this situation, it sounds reasonable to take a look at the magnitude
of a fill-in entry before taking the decision of dropping it. This gives rise to the so-called
threshold-based ILUs, or ILUT.
This type of ILU is based on the same variants of GE as ILU(l). The difference is that
now “dropping rules” are introduced at some point of the algorithm, which set an entry
to zero if it meets certain criteria. For example, a simple strategy would be to drop all
entries smaller than a certain threshold τ. Instead, one could be more clever and drop the
entry if its magnitude is smaller than a given fraction of the norm of the row it belongs to.
Other rules can be devised and, in fact, more than one rule can be applied. What is usually
known as ILU(p,τ ) in the literature applies two dropping rules. The first one drops any
entry of row i whose magnitude is less than τi , which is equal to τ multiplied by the norm
of row i in the original matrix A. The second rule sets a limit to the number of non-zeros
that can be kept in one row, dropping an element if it is not among the p largest values
(in magnitude) of the row. While the first rule aims to save computing effort during the
triangular solves in the preconditioner application by discarding entries that would (a priori)
not contribute significantly to the result, the second rule keeps the storage space needed to
hold the preconditioner within bounds.
The dropping criteria based on a relative tolerance are, in general, more reliable than
those based exclusively on an absolute tolerance, and they usually yield a good solution.
A common disadvantage of these criteria is the selection of a satisfactory value for τ. For
this selection, it is common to start from a representative subset of the linear systems to
be solved in the context of a specific application and, through experimentation, find a good
value for the systems which arise from the same application. In most cases, selecting values
of τ in the interval [10−4 , 10−2 ] yields good results, though the optimum value depends on
the concrete problem [86].
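The following sketch illustrates the dual dropping rule applied to a single computed row. The Entry type and function name are illustrative; in the actual ILUT procedure the diagonal entry is never dropped and the p-largest selection is applied separately to the L and U parts of the row.

```cpp
// Sketch of the dual dropping rule of ILU(p, tau) on one row:
// (1) drop entries smaller than tau_i = tau * ||row_i of A||,
// (2) keep only the p entries of largest magnitude.
#include <vector>
#include <cmath>
#include <cstddef>
#include <algorithm>

struct Entry { int col; double val; };

std::vector<Entry> drop_row(std::vector<Entry> row, double tau, std::size_t p,
                            double row_norm_A)   // e.g., 2-norm of row i of A
{
    const double tau_i = tau * row_norm_A;
    // Rule 1: relative threshold.
    row.erase(std::remove_if(row.begin(), row.end(),
                             [tau_i](const Entry& e) { return std::fabs(e.val) < tau_i; }),
              row.end());
    // Rule 2: keep only the p entries of largest magnitude.
    if (row.size() > p) {
        std::partial_sort(row.begin(), row.begin() + p, row.end(),
                          [](const Entry& a, const Entry& b)
                          { return std::fabs(a.val) > std::fabs(b.val); });
        row.resize(p);
    }
    return row;
}
```

The first rule bounds the work of the triangular solves, while the second bounds the memory needed to store the factors, exactly as discussed above.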
The ILU variants that have been described so far are mostly based on the so-called IKJ
variant of Gaussian Elimination. This variant is attractive from a computational point of
view because it proceeds row-wise, and it is therefore especially well suited for row-oriented
sparse storage formats such as the widespread CSR. Nevertheless, as is done with IKJ-based
ILUs, one can derive ILU factorization algorithms from any other variant. Although in exact
arithmetic all variants would produce the same factorizations, each variant relies on different
kernels and the computation patterns may lead to specialized techniques and storage formats
for certain computer architectures. In this sense, these variants present considerable practical
differences.
The Crout variant, which can be seen as a combination of the IKJ variant to compute
L̃ and a transposed version to compute Ũ , aims to solve one of the main drawbacks of the
IKJ-based ILU. During the update stage, in order to update or introduce a fill-in entry, the
structure that contains the factorization has to be accessed in increasing column order. This
implies a search through the vector that holds the current row, which is being dynamically
modified by the fill-in process. There are some strategies to perform this search efficiently,
such as linear searches or the use of data structures like binary trees. However, this issue remains
an important problem when high accuracy is required in the factorization and considerable
fill-in has to be allowed.
Another important advantage of this version of the ILU factorization, which will be
analyzed in the following section, is that it is well suited to the implementation of inverse-
based dropping strategies [26], which are an interesting breakthrough from the theoretical
perspective, and have shown to be effective in many scenarios.
1: for k = 1 to n do
2:   for i = 1 to k − 1 and if aki ≠ 0 do
3:     ak,k:n = ak,k:n − aki × ai,k:n
4:   end for
5:   for i = 1 to k − 1 and if aik ≠ 0 do
6:     ak+1:n,k = ak+1:n,k − aik × ak+1:n,i
7:   end for
8:   ak+1:n,k = ak+1:n,k /akk
9: end for
Figure 2.2: Parts of the matrix that are factored, read, and updated at a given step of the Crout variant.
Consider now the factorization A = LDU, where L ∈ Rn×n is unit lower triangular, U ∈ Rn×n is unit upper triangular and D ∈ Rn×n is a diagonal matrix.
This factorization is closely related to those presented previously, but is more adequate to
expose some important properties. In fact, if the LU factorization A = L̂Û exists, then
L = L̂ and DU = Û, so D = diag(Û).
To construct this factorization, one option consists in considering the partition
\[
A = \begin{bmatrix} \beta & d \\ c & E \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad (2.69)
\]
where β ∈ R, which determines the dimensions of the remaining blocks in (2.69). To factorize
this matrix, at step k one has to exploit the relation
\[
\begin{bmatrix} \beta & d \\ c & E \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ l & I \end{bmatrix}
\begin{bmatrix} \delta & 0 \\ 0 & S \end{bmatrix}
\begin{bmatrix} 1 & u \\ 0 & I \end{bmatrix}, \qquad (2.70)
\]
where S = E − lδu ∈ R^{(n−k)×(n−k)} is called the Schur complement. The process is completed by
recursively applying the same factorization to S, obtaining the exact A = LDU factorization
at step n.
In the context of preconditioners, it is not necessary to compute the exact factorization
so, as explained previously with other versions of ILU, the common strategy consists in
dropping elements of vectors l and u, and using the sparsified versions of these vectors to
compute an approximate Schur complement. The approach suggested by Bollhöfer in [26],
is to construct this block as
S = E − l̃ d − (c − l̃ β) ũ, \qquad (2.71)
where l̃ and ũ are the sparsified versions of l and u, respectively. The procedure to obtain
an incomplete factorization using this strategy is described in Algorithm 11. Pivoting is not
included for simplicity.
The justification for this choice lies in that this expression for S can be obtained from
Algorithm 11 ILDU.
Input: A
Output: L, D, U
1: Set L = U = D = I, S = A.
2: for i = 1 to n − 1 do
3: dii = sii
4: c = Si+1:n,i
5: d = Si,i+1:n
6: p = c/dii
7: q = d/dii
8: Apply dropping rule to p and q
9: Li+1:n,i = p
10: Ui,i+1:n = q
11: Ŝ = Si+1:n,i+1:n
12: Ŝ = Ŝ − p d − (c − p dii ) q
13: end for
14: dnn = snn
the lower-right block of L̃−1 AŨ −1 , which can be useful in the following sense. When we
apply LDU as a preconditioner in a Krylov subspace or some other iterative method, the
matrix for the preconditioned system will be L−1 AU −1 D−1 . This means that, when we
drop an entry from the L or U factors, even if it is small, we do not know the impact of that
dropping on the preconditioned system as the inverse factors L−1 and U −1 are, generally,
not available. It is therefore interesting to devise a way in which the inverse factors can
be monitored during the process, so the choice of the expression for the Schur complement
sounds reasonable.
The computation of L−1 and U −1 during the factorization is, of course, out of the
question. Instead, one alternative is to obtain information about the approximate inverses
of the factors. Incorporating this information in the dropping rules of the ILU could, in
principle, be expected to yield good results. In this sense, important developments about the
relation between ILU factorizations and approximate inverses were introduced by Bollhöfer
and Saad at the beginning of this century [29].
Approximate Inverse factorizations (AINV) are based on constructing the incomplete
factors of A−1 explicitly. If no dropping is applied, the idea is to find unit upper triangular
matrices W ∈ Rn×n and Z ∈ Rn×n , and a diagonal matrix D ∈ Rn×n , so that W T AZ = D
or, equivalently, A−1 = ZD−1 W T . Such a factorization can be obtained, for example, by
means of a bi-orthogonalization process in which W T A and Z T AT are transformed into
triangular matrices one row/column at a time. In a similar way as it is done with ILUs,
dropping can be applied to W and Z during the factorization, obtaining an incomplete
factorization A−1 ≈ ZD−1 W T . This process is described in Algorithm 12.
Although Algorithm 12 presents some difficulties for an efficient sparse implementation,
such as the use of rank-1 updates, it shows that it should be possible to
obtain an ILDU factorization such that L−1 ≈ W T and U −1 ≈ Z by including information
about the approximate inverse factors. In this sense, consider a modification of the ILDU
algorithm, supplementing it with a progressive inversion process, in which the inverses of
Algorithm 12 AINV.
1: p = q = (0, . . . , 0) ∈ Rn , Z = W = In .
2: for i = 1 to n do
3:   pi = eTi A Z ei
4:   qi = eTi W T A ei
5:   for j = i + 1 to n do
6:     pj = eTj A Z ei /pi
7:     qj = eTi W T A ej /qi
8:     W1:i,j = W1:i,j − W1:i,i pj
9:     Z1:i,j = Z1:i,j − Z1:i,i qj
10:   end for
11:   for all k ≤ i, l > i: drop wkl if |wkl | ≤ τ and drop zkl if |zkl | ≤ τ
12:   dii = pi
13: end for
the L and U factors are calculated on the fly. At iteration i − 1, the factor Ũi−1 should take
the form
\[
\begin{bmatrix} U_{11} & U_{12} & U_{13} \\ O & 1 & O \\ O & O & I \end{bmatrix}
=
\begin{bmatrix} A & b & C \\ O & v & w^T \\ O & O & Z \end{bmatrix}.
\]
The i-th step is responsible for calculating the entries of vector wT and adding them to Ui−1
to obtain Ui . If q̃ = (0, . . . , 0, q̃i+1 , . . . , q̃n ) is the row vector updated in the i-th iteration of
Algorithm 11, then
Ũi = Ũi−1 + ei q̃. (2.72)
Given the structure of both Ũi and q̃, the following relation between Ũi and Ũi−1 holds:
\[
\tilde U_i = (I + e_i \tilde q)\, \tilde U_{i-1}. \qquad (2.73)
\]
Note that q̃ = ei (Ũi − I) and ei q̃ = ei q̃ Ũi−1, so that Equation (2.73) simply implies that
the new matrix Ũi is obtained by replacing the entries from i+1 to n of its i-th row with the
corresponding entries of q̃. Inverting (2.73) and operating accordingly we obtain the following expression
for the partial inverse factor of Ũi:
\[
\tilde U_i^{-1} = \tilde U_{i-1}^{-1} (I + e_i \tilde q)^{-1} = \tilde U_{i-1}^{-1} (I - e_i \tilde q). \qquad (2.74)
\]
This result is important because it introduces a dropping rule for the factor L that is
based on the entries of the approximate inverse W , and sets a bound for the error obtained
in the incomplete factorization based on the approximate inverse. An analogous result can
be stated for the U and Z factors.
It can be expected that dropping small entries in the approximate inverse factors W and
Z has a much lower impact on the preconditioned system W T AZ than dropping small entries
of L and U . In fact, there are numerical results [28], which demonstrate that approximate
inverse preconditioners behave better than ILUs in highly ill-conditioned scenarios. The
problem with this result is that it is not practical to compute the approximate inverse
factors together with the ILU factorization, so in order to apply the previous dropping rule,
some compromise should be made.
As the rule depends on the maximum entry in magnitude of the i-th column of W, which
is equivalent to the i-th row of L^{-1}, one option is to devise an estimate for ‖e_i^T L^{-1}‖∞
(and ‖U^{-1} e_i‖∞) that can be computed inexpensively and avoids the computation of the
i-th column of W and Z. Such estimates can be derived from a condition estimator for
upper triangular matrices [35, 62]. This estimate is based on exploiting that
\[
| e_i^T L^{-1} b | \le \| e_i^T L^{-1} \|_\infty \, \| b \|_\infty
\]
to find a vector b such that ‖b‖∞ = 1 and that, for each step i = 1, . . . , n,
\[
| e_i^T L^{-1} b | \approx \| e_i^T L^{-1} \|_\infty . \qquad (2.76)
\]
The approach described in [27] consists in solving the linear system Lx = b and taking
|xi| = |e_i^T L^{-1} b| as the estimate, choosing the i-th entry of the vector b dynamically in each step
so that it (hopefully) maximizes |xi|. A common choice is to set the entries of the vector to
±1 as the solution via forward substitution advances, depending on which value maximizes
|xi|. The procedure described in Algorithm 13 also seeks to prevent, at step i, the growth
of the entries xk such that k > i.
Algorithm 13 Condition estimator.
1: v = x = (0, . . . , 0)T ∈ Rn
2: for i = 1 to n do
3:   µ+ = 1 − vi
4:   µ− = −1 − vi
5:   v+ = vi+1:n + Li+1:n,i µ+
6:   v− = vi+1:n + Li+1:n,i µ−
7:   if |µ+ | + ‖v+ ‖∞ > |µ− | + ‖v− ‖∞ then
8:     xi = µ+
9:     vi+1:n = v+
10:   else
11:     xi = µ−
12:     vi+1:n = v−
13:   end if
14: end for
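A dense sketch of this estimator is shown below, assuming a unit lower triangular factor L stored as a dense matrix; sparse implementations only visit the nonzeros of column i. The function name is illustrative.

```cpp
// Dense sketch of the estimator of Algorithm 13 for a unit lower triangular L:
// the right-hand side b is built with entries +/-1 on the fly while solving
// L x = b by forward substitution, so that |x_i| estimates ||e_i^T L^{-1}||_inf.
#include <vector>
#include <cmath>
#include <cstddef>
#include <algorithm>

using Matrix = std::vector<std::vector<double>>;

std::vector<double> condest_lower(const Matrix& L)
{
    const std::size_t n = L.size();
    std::vector<double> v(n, 0.0), x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        const double mu_p = 1.0 - v[i];
        const double mu_m = -1.0 - v[i];
        // Tentative infinity norms of the trailing part of v for both choices of b_i.
        double norm_p = 0.0, norm_m = 0.0;
        for (std::size_t k = i + 1; k < n; ++k) {
            norm_p = std::max(norm_p, std::fabs(v[k] + L[k][i] * mu_p));
            norm_m = std::max(norm_m, std::fabs(v[k] + L[k][i] * mu_m));
        }
        const double mu =
            (std::fabs(mu_p) + norm_p > std::fabs(mu_m) + norm_m) ? mu_p : mu_m;
        x[i] = mu;                                   // x_i ~ e_i^T L^{-1} b
        for (std::size_t k = i + 1; k < n; ++k)      // commit the chosen update of v
            v[k] += L[k][i] * mu;
    }
    return x;   // |x[i]| is the quantity used by the inverse-based dropping rule
}
```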
A common refinement of ILUT is ILUTP, which combines the dropping rules with column
pivoting. Although pivoting improves the robustness of ILUT preconditioners, it is prone to failures as a result of zero rows appearing during the
factorization due to the combination of permutations and dropping. If significant fill-in is
allowed, ILUTP can produce accurate preconditioners, but at a high computational cost.
In [43], pivoting is avoided during the factorization by reordering the rows of the matrix
beforehand in order to improve its diagonal dominance. The ordering strategy aims to bring
large pivots as close to the diagonal as possible, and is combined with a static post-ordering
to reduce fill-in. This type of reordering, combined with ILU preconditioners, has yielded
good results in several scenarios [23, 94].
A different strategy to improve the stability of ILUs consists in applying a reordering
to partition the rows/columns of the matrix into two sets. When solving or factorizing a
linear system with such a partition, the first set of unknowns or rows can be dealt with
immediately, while the solution or factorization of the second set will use the result of the
first one. These sets are often classified as “coarse” and “fine” because of the resemblance of
this strategy to Algebraic Multigrid methods.
As an example, one can split a linear system like
\[
Ax = b \;\rightarrow\;
\begin{bmatrix} B & F \\ E & C \end{bmatrix}
\begin{bmatrix} u \\ y \end{bmatrix}
=
\begin{bmatrix} f \\ g \end{bmatrix}, \qquad (2.77)
\]
seeking that the block B is easy to invert, as in the case of a diagonal matrix, or that it is
diagonally dominant so it can be factorized safely without pivoting. In the first case, the
unknown u is easy to express in terms of y as B can be inverted trivially, so the system can be
expressed in terms of the Schur complement SC = C − EB −1 F . In the second case, a partial
factorization of A can be computed by first factorizing B = LU and then completing the
procedure by factorizing the Schur complement SC = C − EU −1 L−1 F . Moreover, a similar
procedure can be recursively applied to factorize SC in what is often called a multilevel
procedure.
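The standard block LU identity behind this procedure, written here for a single level, makes the role of the Schur complement explicit:

```latex
% Block factorization underlying the two-level (fine/coarse) strategy:
% once B = LU is available, only the Schur complement S_C remains to be factorized.
\[
\begin{bmatrix} B & F \\ E & C \end{bmatrix}
=
\begin{bmatrix} L & 0 \\ E\,U^{-1} & I \end{bmatrix}
\begin{bmatrix} U & L^{-1}F \\ 0 & S_C \end{bmatrix},
\qquad
S_C = C - E\,U^{-1}\,L^{-1}F = C - (E\,U^{-1})(L^{-1}F).
\]
```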
Each level requires the construction of the L_l and U_l factors as well as the blocks E_l U_l^{-1} and
L_l^{-1} F_l, which can be done with any of the previously presented ILUT techniques, and the
construction of the Schur complement S_C. In this process, the Schur complement can suffer
substantial fill-in, which should be dealt with by applying some technique to drop small
elements of L_l, U_l, E_l U_l^{-1}, L_l^{-1} F_l and S_C itself. Once the previous blocks are obtained, the
process is repeated with A_{l+1} = S_C until the Schur complement of that level is small or
dense enough to be handled by a direct solver, or a maximum number of levels is reached.
It is worth noticing that, since the matrices E U^{-1} and L^{-1} F are only used to calculate the
Schur complement corresponding to the matrix that will be factorized in the next level, they
can be discarded once such Schur complement is available. All operations involving these
two matrices or their approximate versions can be computed using L_l, U_l, E_l, F_l.
The effectiveness of this approach naturally lies in the quality of the permutation chosen
to obtain the fine and coarse sets. In [89], Saad proposes several strategies to obtain a block B
that is as diagonally dominant as possible. They differ from the aforementioned strategies
in that Saad’s approach is dynamic rather than based on static orderings. Moreover, in
[30] Bollhöfer and Saad proposed another dynamic ordering scheme based on controlling the
growth of the inverses L_l^{-1} and U_l^{-1}. This consists in using the condition estimator presented
in Algorithm 13 to detect rows or columns that would push the norm of the inverse factors
above a certain threshold, defined as a parameter of the algorithm. If such a row or column is
detected, a diagonal pivoting step is performed such that the current row and column are pushed to
the bottom-right corner of the matrix. This can be combined with a previous static ordering
to generate an initial splitting like the above, where B is likely to have good properties. In
summary, one has possibly non-symmetric permutations P̂ and Q̂ such that
\[
\hat P^T A \hat Q = \tilde P^T (P^T A Q) \tilde Q
= \tilde P^T \begin{bmatrix} B & F \\ E & C \end{bmatrix} \tilde Q
= \begin{bmatrix} B_{11} & B_{12} & F_1 \\ B_{21} & B_{22} & F_2 \\ E_1 & E_2 & C \end{bmatrix}
= \begin{bmatrix} B_{11} & \hat F \\ \hat E & \hat C \end{bmatrix}. \qquad (2.79)
\]
Figure 2.3: A step of the Crout variant of the multilevel ILU factorization. Extracted from [8].
2.4 ILUPACK
Consider the linear system Ax = b, where the n × n coefficient matrix A is large and sparse,
and both the right-hand side vector b and the sought-after solution x contain n elements.
ILUPACK provides software routines to calculate an inverse-based multilevel ILU precon-
ditioner M , of dimension n × n, which can be applied to accelerate the convergence of
Krylov subspace iterative solvers. The package includes numerical methods for different
matrix types, precision, and arithmetic, covering Hermitian positive definite/indefinite and
general real and complex matrices. When using an iterative solver enhanced with an ILUPACK
preconditioner, the most demanding task from the computational point of view is the
application of the preconditioner, which occurs (at least once) per iteration of the solver.
The implementation of all solvers in ILUPACK follows a reverse communication ap-
proach, in which the backbone of the method is performed by a serial routine that is repeat-
edly called. This routine is re-entered at different points, and sets flags before exiting so that
operations such as SpMV, the application of the preconditioner, and convergence checks
can then be performed by external routines implemented by the user. This is aligned with the
decision adopted by ILUPACK to employ SPARSKIT3 as the backbone of the solvers.
Let us focus on the real case, where A, M ∈ Rn×n and x, b ∈ Rn . The computation of
ILUPACK’s preconditioner proceeds following three steps:
3 Available at [Link]
where LB , DB and UB are blocks of the factors of the multilevel LDU preconditioner (with
LB , UBT unit lower triangular and DB diagonal); and Ml+1 stands for the preconditioner
computed at level l + 1.
For the review of this operation, we consider its application at level l, for example, to
compute z := M_l^{-1} r. This requires solving the system of linear equations:
\[
\begin{bmatrix} L_B & 0 \\ L_G & I \end{bmatrix}
\begin{bmatrix} D_B & 0 \\ 0 & M_{l+1} \end{bmatrix}
\begin{bmatrix} U_B & U_F \\ 0 & I \end{bmatrix}
\hat P^T \tilde P^T \tilde D^{-1} z
=
\hat P^T \tilde P^T \tilde D\, r. \qquad (2.82)
\]
Breaking down (2.82), we first recognize two transformations of the residual vector,
r̂ := P̂^T P̃^T (D̃ r), before the following block system is defined:
\[
\begin{bmatrix} L_B & 0 \\ L_G & I \end{bmatrix}
\begin{bmatrix} D_B & 0 \\ 0 & M_{l+1} \end{bmatrix}
\begin{bmatrix} U_B & U_F \\ 0 & I \end{bmatrix} w = \hat r. \qquad (2.83)
\]
This system is solved in three consecutive steps,
\[
\begin{bmatrix} L_B & 0 \\ L_G & I \end{bmatrix} y = \hat r, \qquad
\begin{bmatrix} D_B & 0 \\ 0 & M_{l+1} \end{bmatrix} x = y, \qquad
\begin{bmatrix} U_B & U_F \\ 0 & I \end{bmatrix} w = x, \qquad (2.84)
\]
where the recursion is defined in the second one. In turn, the expressions in (2.84) also need
to be solved in two steps. Assuming y and r̂ are split conformally with the factors, for the
expression on the left of (2.84) we have
\[
\begin{bmatrix} L_B & 0 \\ L_G & I \end{bmatrix}
\begin{bmatrix} y_B \\ y_C \end{bmatrix}
=
\begin{bmatrix} \hat r_B \\ \hat r_C \end{bmatrix}
\;\Rightarrow\;
L_B y_B = \hat r_B, \quad y_C := \hat r_C - L_G y_B. \qquad (2.85)
\]
Partitioning the vectors as earlier, the expression in the middle of (2.84) involves the
diagonal block and the preconditioner of the next level,
\[
D_B x_B = y_B, \qquad M_{l+1} x_C = y_C,
\]
so that x_B := D_B^{-1} y_B and x_C results from recursively applying the preconditioner computed at level l + 1.
In the recursion base step, Ml+1 is void and only xB has to be computed. Finally, after an
analogous partitioning, the expression on the right of (2.84) can be reformulated as
\[
\begin{bmatrix} U_B & U_F \\ 0 & I \end{bmatrix}
\begin{bmatrix} w_B \\ w_C \end{bmatrix}
=
\begin{bmatrix} x_B \\ x_C \end{bmatrix}
\;\Rightarrow\;
w_C := x_C, \quad U_B w_B = x_B - U_F w_C. \qquad (2.87)
\]
To save memory, ILUPACK discards the off-diagonal blocks LG and UF once the level of
the preconditioner is calculated, keeping only the rectangular matrices G and F of (2.77),
which are often much sparser. Thus, (2.85) and (2.87) are changed as:
\[
L_G = G U_B^{-1} D_B^{-1} \;\Rightarrow\; y_C := \hat r_C - G U_B^{-1} D_B^{-1} y_B = \hat r_C - G U_B^{-1} D_B^{-1} L_B^{-1} \hat r_B, \qquad (2.88)
\]
\[
U_F = D_B^{-1} L_B^{-1} F \;\Rightarrow\; U_B w_B = D_B^{-1} y_B - D_B^{-1} L_B^{-1} F w_C. \qquad (2.89)
\]
To summarize the previous description of the method, the application of the precondi-
tioner requires, at each level, two sparse matrix-vector products (SpMV), solving two linear
systems with coefficient matrix of the form LDU , and a few vector kernels.
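The following dense, CPU-only sketch makes the order of these operations explicit for one call z := M_l^{-1} r, following (2.85)-(2.89). The Level structure, the omission of the scalings and permutations of (2.82), and the dense storage are simplifications introduced here for illustration; the actual codes use sparse storage and, in our GPU variants, cuSPARSE kernels for the triangular solves and SpMVs.

```cpp
// Schematic per-level application of the multilevel preconditioner (dense stand-in).
#include <vector>
#include <cstddef>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

struct Level {
    Mat LB, UB;        // unit lower / unit upper triangular factors of the leading block
    Vec DB;            // diagonal of the LDU factorization of the leading block
    Mat F, G;          // off-diagonal blocks kept instead of U_F and L_G
    const Level* next; // next level, or nullptr at the recursion base
};

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}
static Vec solve_unit_lower(const Mat& L, Vec b) {        // forward substitution
    for (std::size_t i = 0; i < b.size(); ++i)
        for (std::size_t j = 0; j < i; ++j) b[i] -= L[i][j] * b[j];
    return b;
}
static Vec solve_unit_upper(const Mat& U, Vec b) {        // backward substitution
    for (std::size_t ii = b.size(); ii-- > 0; )
        for (std::size_t j = ii + 1; j < b.size(); ++j) b[ii] -= U[ii][j] * b[j];
    return b;
}

Vec apply_preconditioner(const Level& lvl, const Vec& r)
{
    const std::size_t nB = lvl.DB.size(), nC = r.size() - nB;
    Vec rB(r.begin(), r.begin() + nB), rC(r.begin() + nB, r.end());

    // (2.85) and (2.88): L_B y_B = r_B; y_C = r_C - G U_B^{-1} D_B^{-1} y_B.
    Vec yB = solve_unit_lower(lvl.LB, rB);
    Vec t = yB;
    for (std::size_t i = 0; i < nB; ++i) t[i] /= lvl.DB[i];
    Vec s = solve_unit_upper(lvl.UB, t);
    Vec yC = rC, Gs = matvec(lvl.G, s);
    for (std::size_t i = 0; i < nC; ++i) yC[i] -= Gs[i];

    // Middle step of (2.84): the recursion acts on the C part (identity if absent).
    Vec xC = (lvl.next != nullptr && nC > 0) ? apply_preconditioner(*lvl.next, yC) : yC;

    // (2.87) and (2.89): w_C = x_C; U_B w_B = D_B^{-1}( y_B - L_B^{-1} (F w_C) ).
    Vec rhs = yB;
    if (nC > 0) {
        Vec v = solve_unit_lower(lvl.LB, matvec(lvl.F, xC));
        for (std::size_t i = 0; i < nB; ++i) rhs[i] -= v[i];
    }
    for (std::size_t i = 0; i < nB; ++i) rhs[i] /= lvl.DB[i];
    Vec wB = solve_unit_upper(lvl.UB, rhs);

    Vec z(wB);
    z.insert(z.end(), xC.begin(), xC.end());
    return z;
}
```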
Here, ‖L_B^{-1}‖ ≲ κ and E contains the elements dropped during the IC factorization. Restarting
the process with A = S_C, we obtain a multilevel approach.
Then, the application of the preconditioner in the PCG algorithm, for a given level l, is
where M0 = M .
Operating properly on the vectors,
\[
\hat P^T P^T D^{-1} z = \hat z = \begin{bmatrix} \hat z_B \\ \hat z_C \end{bmatrix}, \qquad
\hat P^T P^T D r = \hat r = \begin{bmatrix} \hat r_B \\ \hat r_C \end{bmatrix}. \qquad (2.93)
\]
The task-parallel version of this procedure employs Nested Dissection (ND) [88] to extract
parallelism. To illustrate this, consider a partitioning, defined by a permutation P̄ ∈ Rn×n ,
such that
\[
\bar P^T A \bar P =
\begin{bmatrix} A_{00} & 0 & A_{02} \\ 0 & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{bmatrix}. \qquad (2.95)
\]
where
\[
S_{22} = A_{22} - L_{20} D_{00} L_{20}^T - L_{21} D_{11} L_{21}^T + E_2 \qquad (2.96)
\]
is the approximate Schur complement. By recursively proceeding in the same manner with
S_{22}, the IC factorization of P̄^T A P̄ is eventually completed.
The block structure in (2.95) allows the permuted matrix to be decoupled into two
Figure 2.4: Dependency tree of the diagonal blocks. Task Tj is associated with block Ajj. The
leaf tasks are associated with the sub-graphs of the leading block of the ND, while inner tasks are
associated with separators.
submatrices, so that the IC factorizations of the leading block of both submatrices can be
processed concurrently, with
" # " #" #" #
A00 A02 L00 0 D00 0 LT00 LT20
= + E0
A20 A022 L20 I 0 0
S22 0 I
A22 = A022 + A122 , (2.97)
" # " #" #" #
A11 A12 L11 0 D11 0 LT11 LT21
= + E1 ,
A21 A122 L21 I 0 1
S22 0 I
and
0
S22 = A022 − L20 D00 LT20 + E20 ; 1
S22 = A122 − L21 D11 LT21 + E21 .
Once the two systems are computed, S22 can be constructed given that
\[
S_{22} = S_{22}^{0} + S_{22}^{1}. \qquad (2.98)
\]
To further increase the amount of task-parallelism, one could find a permutation analogous
to P̄ for the two leading blocks following the ND scheme. For example, a block structure
similar to (2.95) would yield the following decoupled matrices:
\[
\begin{bmatrix}
A_{00} & 0 & 0 & 0 & A_{04} & 0 & A_{06} \\
0 & A_{11} & 0 & 0 & A_{14} & 0 & A_{16} \\
0 & 0 & A_{22} & 0 & 0 & A_{25} & A_{26} \\
0 & 0 & 0 & A_{33} & 0 & A_{35} & A_{36} \\
A_{40} & A_{41} & 0 & 0 & A_{44} & 0 & A_{46} \\
0 & 0 & A_{52} & A_{53} & 0 & A_{55} & A_{56} \\
A_{60} & A_{61} & A_{62} & A_{63} & A_{64} & A_{65} & A_{66}
\end{bmatrix}
\;\rightarrow\;
\begin{aligned}
\bar A_{00} &= \begin{bmatrix} A_{00} & A_{04} & A_{06} \\ A_{40} & A_{44}^{0} & A_{46}^{0} \\ A_{60} & A_{64}^{0} & A_{66}^{0} \end{bmatrix}, &
\bar A_{11} &= \begin{bmatrix} A_{11} & A_{14} & A_{16} \\ A_{41} & A_{44}^{1} & A_{46}^{1} \\ A_{61} & A_{64}^{1} & A_{66}^{1} \end{bmatrix}, \\
\bar A_{22} &= \begin{bmatrix} A_{22} & A_{25} & A_{26} \\ A_{52} & A_{55}^{2} & A_{56}^{2} \\ A_{62} & A_{65}^{2} & A_{66}^{2} \end{bmatrix}, &
\bar A_{33} &= \begin{bmatrix} A_{33} & A_{35} & A_{36} \\ A_{53} & A_{55}^{3} & A_{56}^{3} \\ A_{63} & A_{65}^{3} & A_{66}^{3} \end{bmatrix}.
\end{aligned}
\qquad (2.99)
\]
Figure 2.4 illustrates the dependency tree for the factorization of the diagonal blocks
in (2.99). The edges of the preconditioner directed acyclic graph (DAG) define the depen-
dencies between the diagonal blocks (tasks), which dictate the order in which these blocks
of the matrix have to be processed.
In summary, the task-parallel version of ILUPACK partitions the original matrix into a
number of decoupled blocks, and then delivers a partial IC factorization during the com-
putation of (2.97), with some differences with respect to the sequential procedure. The
main change is that the computation is restricted to the leading block, and therefore the
rejected pivots are moved to the bottom-right corner of the leading block; see Figure 2.5.
Figure 2.5: A step of the Crout variant of the parallel preconditioner computations. Extracted
from [8].
Although the recursive definition of the preconditioner, shown in (2.92), is still valid in the
task-parallel case, some recursion steps are now related to the edges of the corresponding
preconditioner DAG. Therefore different DAGs involve distinct recursion steps yielding dis-
tinct preconditioners, which nonetheless exhibit close numerical properties to that obtained
with the sequential ILUPACK [3].
As the definition of the recursion is maintained, the operations to apply the preconditioner,
in (2.94), remain valid. However, to complete the recursion step in the task-parallel case, the
DAG has to be traversed twice per solve zk+1 := M −1 rk+1 at each iteration of the PCG:
once from bottom to top and a second time from top to bottom (with dependencies/arrows
reversed in the DAG).
2.5 Related work
et al’s [31] acceleration of multigrid problems leveraging the graphics pipeline; and the
studies of the CG method by Goodnight et al. [57], as well as Krüger et al. [67].
With the start of the CUDA era, Buatois et al. [32] implemented the CG method in
conjunction with a Jacobi preconditioner, using the block compressed sparse row (BCSR)
format. Later, Bell and Garland [21] addressed SpMV, including several sparse storage
formats, that became the basis for the development of the CUSP library [22].
A parallel CG solver with preconditioning for the Poisson equation optimized for multi-
GPU architectures was presented by Ament et al. [15]. Sudan et al. [98] introduced GPU
kernels for SpMV and the block-ILU preconditioned GMRES in flow simulations, showing
promising speed-ups. In parallel, Gupta completed a master’s thesis [59] implementing a
deflated preconditioned CG for bubbly flow problems.
Naumov [74] produced a solver for sparse triangular linear systems on a GPU, one of
the major kernels for preconditioned variants of iterative methods such as CG or GMRES.
Later, the same author extended the proposal to compute incomplete LU and Cholesky
factorizations with no fill-in (ILU0) on a GPU [73]. The performance of the aforementioned
algorithms strongly depends on the sparsity pattern of the coefficient matrix.
Li and Saad [69] studied the GPU implementation of SpMV with different sparse for-
mats to develop data-parallel versions of CG and GMRES. Taking into account that the
performance of the triangular solve for CG and GMRES, preconditioned with IC (Incom-
plete Cholesky) and ILU respectively, was rather low on the GPU, a hybrid approach was
proposed in which the CPU is leveraged to solve the triangular systems.
Recently, He et al. [60] presented a hybrid CPU-GPU implementation of the GMRES
method preconditioned with a threshold-based ILU preconditioner to control the fill-in of the
factors. In the same work, the authors also propose a new algorithm to compute the SpMV
on the GPU.
Some of these ideas are currently implemented in libraries and frameworks such as cuSPARSE,
CUSP, CULA, PARALUTION and MAGMA-sparse. To the best of our knowledge, no
GPU implementations of multilevel preconditioners such as that underlying ILUPACK have
been developed, with the exception of our previous efforts and, in some aspects, the proposal
by Li and Saad.
ILU++
ILU++ is a software package, written entirely in C++, that comprises a set of iterative
solvers for sparse linear systems complemented with modern ILU preconditioners. The
package includes different permutation and scaling techniques, pivoting strategies and
dropping rules. In particular, ILU++ bundles the inverse-based dropping techniques discussed earlier in this chapter.
ViennaCL
The Vienna Computing Library (ViennaCL) is a scientific computing library written in C++
that provides a set of routines for the solution of large sparse systems of equations by means
of iterative methods. It uses either a host-based computing back-end, an OpenCL comput-
ing back-end, or CUDA to enable the execution on parallel architectures such as multi-core
CPUs, GPUs and MIC platforms (such as Xeon Phi processors). The package includes imple-
mentations of the Conjugate Gradient (CG), Stabilized BiConjugate Gradient (BiCGStab),
and Generalized Minimum Residual (GMRES) methods.
As for preconditioners, ViennaCL offers implementations of classical ILU0/IC0 and
ILUT/ICT, as well as a variant of ILU0/IC0 recently proposed by Chow and Patel [34] that
is more suitable for the execution on massively parallel devices than the standard approach.
It also offers Block-ILU preconditioners, simple Jacobi and Row-Scaling preconditioners,
and Algebraic Multi-Grid preconditioners.
The library strongly focuses on providing compatibility with the widest range of com-
puting platforms, as opposed to being specifically tailored to the hardware solutions of a
particular vendor.
pARMS
pARMS is a library of parallel solvers for distributed sparse linear systems of equations that
extends the sequential library ARMS. It provides Krylov subspace solvers, preconditioned
using a domain decomposition strategy based on a Recursive Multi-level ILU factorization.
The library includes many of the standard domain-decomposition-type iterative solvers in
a single framework. For example, the standard Schwarz procedures, as well as a num-
ber of Schur complement techniques, are included. ILUPACK implements its inverse-based
multilevel preconditioner using the ARMS framework.
PETSc
PETSc, which stands for Portable Extensible Toolkit for Scientific Computation, is a highly
complete suite of data structures and routines for the solution of scientific applications
related to partial differential equations. It supports MPI, and GPUs through CUDA or
OpenCL, as well as hybrid MPI-GPU parallelism. Among its features, PETSc includes
a number of Krylov space solvers for linear systems (CG, GMRES, BiCGStab, CGS, and
others), as well as standard ILU, Jacobi, and AMG preconditioners.
Hypre
HYPRE is a library of high-performance preconditioners and solvers featuring scalable
multigrid methods for both structured and unstructured grid problems. The HYPRE library
is highly portable and supports a number of languages.
Hypre contains several families of preconditioner algorithms focused on the scalable solu-
tion of very large sparse linear systems. For instance, it includes “grey-box” algorithms, such
as structured multigrid, that use additional information to solve certain classes of problems
more efficiently than general-purpose libraries.
It also offers a suite of common iterative methods, including the most commonly used Krylov-
based solvers, complemented with scalable preconditioners such as ILU, AMG
and AINV.
HiPS
HIPS (Hierarchical Iterative Parallel Solver) is a parallel solver for sparse linear systems
developed by Jérémie Gaidamour and Pascal Hénon in the INRIA team-project “Scalapplix”,
in collaboration with Yousef Saad from the University of Minnesota. It uses a domain
decomposition approach based on a Hierarchical Interface Decomposition (HID), which builds
a decomposition of the adjacency graph of the system into a set of small subdomains
with overlap [63]. This is similar to the application of nested dissection orderings
used by ILUPACK to expose parallelism and generate a task tree. Using this partition of
the problem, HIPS provides a solver that combines direct and iterative strategies to solve
different parts. Specifically, it uses a direct factorization to solve the leading block, while
using ILU-preconditioned iterative solvers for the Schur complement.
Trilinos
The Trilinos Project is an effort of Sandia National Laboratories (USA) to provide a suite
of algorithms for the solution of large-scale, complex multi-physics engineering and scientific
problems.
Trilinos includes a wide range of iterative and direct solvers, preconditioners, high-level
interfaces, and eigensolvers, bundled in different software packages.
WSMP
WSMP is a software package, developed by the IBM T.J. Watson Research Center, targeted
at the solution of large-scale sparse linear systems. It contains common Krylov subspace
solvers such as CG, GMRES, TFQMR, and BiCGStab. Accompanying these solvers, the package
currently supports Jacobi, Gauss-Seidel, and incomplete Cholesky/LDLT preconditioners for
SPD matrices, as well as Jacobi, Gauss-Seidel, and incomplete LU preconditioners for general
matrices. For the incomplete Cholesky preconditioner, the package allows the user to set the
drop tolerance and fill factor. An automatic tuning mechanism is also provided.
MAGMA Sparse
CHAPTER 3
Enabling GPU computing in sequential ILUPACK
The preconditioner on which ILUPACK is based effectively reduces the number of iterations
of common Krylov subspace solvers. Its convergence properties have been studied both with
theoretical and empirical approaches, where it has proven to be superior to other types of
ILU preconditioners in many moderate to large-scale scenarios. Unfortunately, these appeal-
ing properties come at the price of a highly complex construction process, and a multilevel
structure that implies a computationally demanding application. The considerable increment
in the cost of most solvers when using ILUPACK instead of simpler preconditioners
severely limits the applicability and usefulness of the package. In other words, it is often
preferable to perform many lightweight iterations with a simple ILU preconditioner than
only a few with ILUPACK. Of course, there are some cases in which simpler preconditioners
do not allow the solvers to converge, or achieve convergence in an impractical
number of iterations, but this range of application is narrower than desired.
One alternative is then to boost the performance of ILUPACK by means of HPC and
parallel computing techniques. In this sense, there are two main strategies that can be
followed, which are the exploitation of task-parallelism on the one hand, and data-parallelism
on the other. The first strategy consists in dividing the work into several tasks such that two
or more tasks can be executed concurrently in different computational units on the same
or different sets of data. The second applies when the same function or operation can be
applied to multiple data elements in parallel.
Previous efforts on the parallelization of ILUPACK have focused on the exploitation of
task-parallelism. The intrinsic sequential structure of most Krylov subspace solvers and the
multilevel preconditioner itself, however, severely constrains the execution schedule of the
different tasks involved. To overcome this limitation, the task-parallel ILUPACK versions
proposed in [3] and [4], respectively designed for shared-memory systems and for distributed-
memory platforms, leverage the parallelism extracted by a Nested Dissection ordering of the
coefficient matrix to divide both the construction and the application of the preconditioner
Chapter 3. Enabling GPU computing in sequential ILUPACK
into tasks that can be executed concurrently on different processors. This strategy will be
explained in more detail in the next chapter of this thesis, but at this point it is impor-
tant to mention the two main drawbacks of this approach. First, the strategy makes some
mathematical simplifications, and increases the amount of floating point operations neces-
sary to construct and apply the preconditioner in order to expose additional concurrency.
This results in a different preconditioner than that calculated by the sequential ILUPACK,
which could mean that some of its beneficial numerical properties no longer apply. The sec-
ond drawback is that the formulation of the task-parallel ILUPACK relies on the coefficient
matrix being symmetric and positive-definite (SPD), and an extension of this strategy to
general matrices is not trivial to derive.
This chapter investigates the exploitation of data-parallelism to accelerate the execution
of ILUPACK by utilizing massively parallel processors, such as GPUs. Unlike the previous
parallel versions, instead of trading floating-point operations (flops) for increased levels
of task-parallelism, the rationale is to exploit the data-parallelism intrinsic to the major
numerical operations in ILUPACK, off-loading them to the hardware accelerator, where they
are performed via highly-tuned data-parallel kernels. This makes it possible to accelerate ILUPACK
without compromising its mathematical properties.
The chapter begins by describing the general strategy, which is presented in the context
of accelerating the execution of SPD systems. The same strategy is applied later to deliver
a baseline data-parallel version of ILUPACK’s solvers for general (non-symmetric) and sym-
metric indefinite linear systems on GPUs. This is immediately followed by an experimental
evaluation of these baseline versions.
Later, the chapter describes the extensions and enhancements to our baseline devel-
opments, which tackle three important problems of the baseline data-parallel versions of
ILUPACK for sparse general linear systems. This is again accompanied by an experimental
evaluation.
The major contributions of this chapter are the following:
3.1 Platforms and test cases
The experiments conducted to test the performance of the developments presented in this
chapter were performed on different hardware platforms using several test cases. Since some
of the contributions to be presented in the following sections were motivated by experimental
results extracted from the baseline versions, the evaluation of the baseline is presented before
the enhancements motivated by these results. In order to facilitate the reading, in this section
we compile all the platforms and test cases that will appear in the remainder of the
chapter.
Bach
Bach is a server equipped with an Intel i7-2600 CPU (4 cores at 3.4 GHz), 8 MB of L3 cache
and 8 GB of RAM. The GPU in this platform is an Nvidia S2070, of the Fermi generation,
with 448 cores at 1.15 GHz and 5 GB of GDDR5 RAM. The CUDA Toolkit, which includes
the compiler and accelerated libraries such as cuBlas and cuSparse, is version 4.1. The
C and Fortran compiler employed for the CPU codes was gcc v4.4.6, and the operating
system on the server is CentOS Rel. 6.2.
Mozart
This platform features a low-end Intel i3-3220 CPU (2 cores at 3.3 GHz), with 3 MB
of L3 cache and 16 GB of RAM. Although the CPU in Bach is clearly more advanced,
the performance of both CPUs in single-thread codes is comparable. The server includes an
Nvidia K20 GPU of the Kepler architecture, with 2,496 cores at 0.71 GHz and 6 GB of
GDDR5 RAM. The version of the CUDA Toolkit in this case is 5.0, and the C and Fortran
compiler is gcc v4.4.7. The operating system is CentOS Rel. 6.4.
Beethoven
Beethoven presents an Intel i7-4770 processor (4 cores at 3.40 GHz) and 16 GB of DDR3
RAM (26 GB/s of bandwidth), connected to an NVIDIA Tesla K40 GPU with 2,880 CUDA
cores at 0.75 GHz and 12 GB of GDDR5 RAM (288 GB/s of bandwidth). NVIDIA
cuBLAS/cuSPARSE 6.5 was employed in the experimentation. For the CPU codes we used gcc v4.9.2
with -O4 as an optimization flag.
Brahms
This platform has an Intel(R) Xeon(R) CPU E5-2620 v2 (6 cores at 2.10GHz), and 128 GB
of DDR3 RAM memory. The platform also contains two NVIDIA “Kepler” K40m GPUs,
each with 2,880 CUDA cores and 12 GB of GDDR5 RAM. The CPU codes were compiled
with the Intel(R) Parallel Studio 2016 (update 3) with the -O3 flag set. The GPU compiler
and the CUSPARSE library were those in version 6.5 of the CUDA Toolkit.
We selected a set of matrices from the SuiteSparse matrix collection that comprises three
medium to large-scale SPD matrices, five large-scale non-symmetric matrices with dimension
n > 1,000,000, and six symmetric indefinite ones with n > 100,000. The dimension, number
of non-zeros and sparsity ratio of these matrices are displayed in Table 3.1.
Table 3.1: Matrices from the SuiteSparse collection used in the experiments.
We considered the Laplacian equation ∆u = f in a 3D unit cube Ω = [0, 1]3 with Dirichlet
boundary conditions u = g on δΩ. The discretization consists in a uniform mesh of size
h = 1/(N + 1) obtained from a seven-point stencil. The resulting linear system Au = b
has a sparse SPD coefficient matrix with seven nonzero elements per row, and n = N3
unknowns. The problem is set to generate five benchmark SPD linear systems of order
n ≈ 1M, 1.9M, 3.3M, 8M and 16M. In the experiments related to the SQMR solver, a
random set of eigenvalues of these matrices was modified by changing their sign, so that the
resulting problem becomes indefinite. We display the dimension and number of non-zeros
of each problem instance in Table 3.2.
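For reference, the sketch below assembles this 7-point finite-difference Laplacian on an N^3 grid in CSR format; the struct and function names are illustrative, Dirichlet values are assumed to be eliminated, and the usual 1/h^2 scaling is omitted.

```cpp
// Assembly of the 7-point Laplacian stencil (6, -1, -1, -1, -1, -1, -1) in CSR.
#include <vector>

struct Csr { std::vector<int> rowPtr, colInd; std::vector<double> val; };

Csr laplace3d(int N)
{
    auto idx = [N](int i, int j, int k) { return (k * N + j) * N + i; };
    Csr A;
    A.rowPtr.push_back(0);
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i) {
                auto add = [&](int c, double v) { A.colInd.push_back(c); A.val.push_back(v); };
                if (k > 0)     add(idx(i, j, k - 1), -1.0);   // neighbors below/left/back
                if (j > 0)     add(idx(i, j - 1, k), -1.0);
                if (i > 0)     add(idx(i - 1, j, k), -1.0);
                add(idx(i, j, k), 6.0);                        // diagonal entry
                if (i < N - 1) add(idx(i + 1, j, k), -1.0);   // neighbors above/right/front
                if (j < N - 1) add(idx(i, j + 1, k), -1.0);
                if (k < N - 1) add(idx(i, j, k + 1), -1.0);
                A.rowPtr.push_back(static_cast<int>(A.colInd.size()));
            }
    return A;   // at most seven nonzeros per row, n = N^3 unknowns
}
```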
We considered the convection-diffusion PDE −ε∆u + b · ∇u = f in Ω, where Ω = [0, 1]3. For this example, we use
homogeneous Dirichlet boundary conditions, i.e., u = 0 on ∂Ω. The diffusion coefficient ε is
The domain is discretized with a uniform mesh of size h = 1/(N + 1), resulting in a linear
system of size N3. For the experiments we set N = 200. For the diffusion part −ε∆u we
use a seven-point stencil. The convective part b · ∇u is discretized using up-wind differences.
We display the circular convective function in 2D/3D for illustration.
3.2 Baseline Data-Parallel variants of ILUPACK
The factors computed at each level of the preconditioner are transferred to the GPU together
with the scaling and permutation vectors that correspond to D̃, P̃, and P̂; see Section 2.4. The case of symmetric
matrices is analogous; the differences are that only the lower triangular factor L
and the (block-)diagonal D of the LDLT factorization are stored explicitly, and that G is
not stored because it is equal to FT.
The computational cost required to apply the preconditioner is dominated by the sparse
triangular system solves (SpTrSV) and SpMVs. The Nvidia cuSparse [99] library provides
efficient GPU implementations of these two kernels that support several common sparse
matrix formats, and it is therefore convenient to rely on this library. The rest of the operations
are mainly vector scalings and re-orderings, which become relevant only for highly
sparse matrices of large dimension, and are accelerated in our codes via ad-hoc CUDA
kernels.
In the following we provide some details about the work performed in order to enable
the use of the GPU in each of the main types of operations: LDU systems, matrix-vector
products and vector operations.
In order to compute the SpMV operations involved in the application of the preconditioner
on the GPU, G and F are also transferred to the device during the construction phase. As
these matrices are stored by ILUPACK in CSR format, no reorganization is needed prior to
the invocation of the cuSparse kernel for SpMV.
However, in the SPD case, the products that need to be computed during the application
of the preconditioner, at each level, are of the form x := F v and x := F T v. As both the
matrix and its transpose are involved in the calculations, two different strategies can be
applied. In particular, the cuSparse kernel for this operation, cusparseDcsrmv, includes an
argument (switch) that allows to multiply the vector with the transpose of the matrix passed
to the function. This makes it possible to store only F in the GPU, but compute both forms
of the product by setting the appropriate value for the transpose switch. Unfortunately,
the implementation of SpMV in cuSparse delivers considerably lower performance when
operating with the transposed matrix.
For these reasons we allocated both F and F T in the GPU, using a cuSparse routine
to transpose the matrix, which was done only once for each level of the preconditioner.
The memory overhead incurred for storing the transpose explicitly is unavoidable in the
non-symmetric case, but we find this performance-storage trade-off acceptable.
A similar situation occurs in the case of the BiCG solver, as the application of the precon-
ditioner involves G and F as well as their respective transposes. In this case, storing both
transposes can significantly exceed the memory requirements of the non-symmetric versions
of ILUPACK, as two additional matrices need to be stored, unlike in the symmetric case,
where only one extra matrix is needed. For this reason, our approach in this case only keeps
G and F in the accelerator, and operates with the transposed/non-transposed matrices by
setting the appropriate value for the transpose switch.
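As an illustration, the snippet below sketches the two alternatives for x := F^T v using the legacy cuSPARSE interface available in the CUDA releases employed in this chapter. The function and buffer names are assumptions, the device buffers are assumed to be already allocated and populated (the explicit transpose being built once per level, e.g., with cuSPARSE's CSR-to-CSC conversion), and error checking is omitted.

```cuda
#include <cusparse_v2.h>

// Sketch only: dF_* holds F (m x n, nnz nonzeros) in CSR; dFt_* holds an explicit
// copy of F^T (n x m). d_v (length m) and d_x (length n) reside on the GPU.
void spmv_with_Ft(cusparseHandle_t handle, cusparseMatDescr_t descr,
                  int m, int n, int nnz,
                  const double* dF_val,  const int* dF_rowPtr,  const int* dF_colInd,
                  const double* dFt_val, const int* dFt_rowPtr, const int* dFt_colInd,
                  const double* d_v, double* d_x, bool use_transpose_switch)
{
    const double one = 1.0, zero = 0.0;
    if (use_transpose_switch) {
        // Alternative 1: operate with F^T through the transpose switch
        // (lower memory footprint, but a considerably slower code path).
        cusparseDcsrmv(handle, CUSPARSE_OPERATION_TRANSPOSE, m, n, nnz, &one, descr,
                       dF_val, dF_rowPtr, dF_colInd, d_v, &zero, d_x);
    } else {
        // Alternative 2: call the non-transposed kernel on the explicit copy of F^T.
        cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, n, m, nnz, &one, descr,
                       dFt_val, dFt_rowPtr, dFt_colInd, d_v, &zero, d_x);
    }
}
```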
As mentioned earlier, the solution of the triangular systems and the matrix-vector multiplications
that appear at each level of the preconditioner application are the most time-consuming
operations in most problems. On the other hand, although the remaining vector operations
involved in the application of the preconditioner are not computationally intensive, if they
are performed on the CPU, it is necessary to transfer their results to the GPU when they
are needed, as well as to retrieve the results of the triangular system solves and SpMVs to the
CPU. Furthermore, as the “recursive” application of the preconditioner consists of a strictly
ordered sequence of steps (where concurrency is extracted from the operations that compose
each one of these steps), no overlapping between data transfers and calculations is possible.
For this reason, we off-load the entire preconditioner application to the GPU. This implies
that the residual rk+1 is transferred to the GPU before the operation commences, and then
all levels of the preconditioner are processed in the accelerator to obtain zk+1 := M −1 rk+1.
After the operation is completed, the preconditioned residual zk+1 is retrieved back to the
CPU. To make this possible, we implemented three additional GPU kernels:
• The diagonal scaling kernel multiplies each entry of an input vector by the correspond-
ing entry of a scaling vector. This is equivalent to multiplying a diagonal matrix –as
scaling vector– by the input vector.
• The ordering kernel reorganizes an input vector vin by applying a vector permutation
p, and produces an output vector vout , with entries vout [i] := vin [p(i)].
• The third kernel simply implements the subtraction c := a − b where a, b and c are
vectors.
By performing these operations in the GPU we can avoid some data transfers and reduce
the runtime significantly, especially for those cases where the size of the vectors is rather
large compared with the number of non-zeros of the triangular factors.
These three kernels appear in every variant of the preconditioner application routine.
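Minimal sketches of these three ad-hoc CUDA kernels are shown below; the kernel and variable names are illustrative, and the actual codes may fuse these operations with neighboring ones.

```cuda
// Diagonal scaling: vout := D * vin, with the diagonal D stored as a vector d.
__global__ void diag_scale(int n, const double* __restrict__ d,
                           const double* __restrict__ vin, double* __restrict__ vout)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) vout[i] = d[i] * vin[i];
}

// Permutation: vout[i] := vin[p(i)].
__global__ void permute(int n, const int* __restrict__ p,
                        const double* __restrict__ vin, double* __restrict__ vout)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) vout[i] = vin[p[i]];
}

// Vector subtraction: c := a - b.
__global__ void subtract(int n, const double* __restrict__ a,
                         const double* __restrict__ b, double* __restrict__ c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] - b[i];
}

// Typical launch: diag_scale<<<(n + 255) / 256, 256>>>(n, d_scal, d_in, d_out);
```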
Evaluation of CG
Table 3.3 compares the original CPU-based implementation of ILUPACK against our GPU-
aware version Base PCG, using the SPD cases of the benchmark collection in Bach and
Mozart hardware platforms. In particular, in the results we detail the number of iterations
needed for convergence (label “#iter”); the execution times spent during the preconditioner
application in the solution of triangular systems with coefficient matrix of the form LDLT ,
the SpMV operations, the CPU-GPU data transfers and the total preconditioner application
time (labels “LDLT time”, “SpMV time”, “Transfer time” and “Preconditioner time”,
respectively); the total solver time (label “Total PCG time”); and the relative residual
Table 3.3: Experimental results for the baseline GPU-enabled implementation of ILUPACK’s CG
method (Base PCG) in Mozart and Bach. Times are in milliseconds.
Platform | Matrix     | Device | #iter | LDL^T   | SpMV   | Transfer | Precond. | Total PCG | R(x*)
---------|------------|--------|-------|---------|--------|----------|----------|-----------|----------
Mozart   | ldoor      | CPU    | 200   | 101,713 | 8,646  | -        | 114,548  | 136,040   | 4.845E-07
Mozart   | ldoor      | GPU    | 200   | 74,805  | 1,191  | 1,074    | 77,291   | 96,880    | 4.963E-07
Mozart   | thermal2   | CPU    | 186   | 18,149  | 5,871  | -        | 30,097   | 40,060    | 2.397E-08
Mozart   | thermal2   | GPU    | 186   | 7,525   | 687    | 1,263    | 9,792    | 19,850    | 2.392E-08
Mozart   | G3_Circuit | CPU    | 75    | 8,364   | 3,120  | -        | 14,844   | 18,290    | 2.674E-06
Mozart   | G3_Circuit | GPU    | 75    | 3,603   | 385    | 645      | 4,787    | 8,200     | 2.674E-06
Mozart   | A126       | CPU    | 44    | 15,122  | 3,442  | -        | 21,056   | 23,660    | 5.143E-09
Mozart   | A126       | GPU    | 44    | 10,359  | 444    | 469      | 11,381   | 14,010    | 5.143E-09
Mozart   | A159       | CPU    | 52    | 37,097  | 8,616  | -        | 51,973   | 58,220    | 2.751E-09
Mozart   | A159       | GPU    | 52    | 17,351  | 1,063  | 1,094    | 19,751   | 26,080    | 2.751E-09
Mozart   | A200       | CPU    | 76    | 70,275  | 30,370 | -        | 124,123  | 142,880   | 5.618E-09
Mozart   | A200       | GPU    | 76    | 20,688  | 3,532  | 3,172    | 28,276   | 47,020    | 5.618E-09
Mozart   | A252       | CPU    | 338   | 383,617 | 98,296 | -        | 652,209  | 815,950   | 5.721E-08
Mozart   | A252       | GPU    | 338   | 72,262  | 5,763  | 28,240   | 111,754  | 279,780   | 5.721E-08
Bach     | ldoor      | CPU    | 200   | 90,199  | 7,637  | -        | 101,381  | 119,800   | 4.845E-07
Bach     | ldoor      | GPU    | 200   | 74,147  | 1,373  | 552      | 76,337   | 88,240    | 4.842E-07
Bach     | thermal2   | CPU    | 186   | 16,021  | 5,126  | -        | 26,617   | 34,730    | 2.397E-08
Bach     | thermal2   | GPU    | 186   | 10,948  | 987    | 658      | 13,024   | 21,450    | 2.302E-08
Bach     | G3_Circuit | CPU    | 75    | 7,138   | 2,725  | -        | 12,762   | 15,340    | 2.674E-06
Bach     | G3_Circuit | GPU    | 75    | 4,610   | 546    | 339      | 5,704    | 8,380     | 2.675E-06
Bach     | A126       | CPU    | 44    | 13,509  | 2,972  | -        | 18,688   | 20,680    | 5.143E-09
Bach     | A126       | GPU    | 44    | 10,720  | 541    | 247      | 11,673   | 13,760    | 5.143E-09
Bach     | A159       | CPU    | 52    | 32,648  | 7,424  | -        | 45,671   | 50,410    | 2.751E-09
Bach     | A159       | GPU    | 52    | 19,814  | 1,234  | 575      | 22,022   | 26,010    | 2.751E-09
Bach     | A200       | CPU    | 76    | 60,513  | 25,626 | -        | 106,540  | 120,830   | 5.618E-09
Bach     | A200       | GPU    | 76    | 32,245  | 4,263  | 1,674    | 39,536   | 54,670    | 5.618E-09
Bach     | A252       | CPU    | 338   | 264,697 | 74,812 | -        | 476,384  | 602,160   | 5.721E-08
Bach     | A252       | GPU    | 338   | 106,944 | 6,527  | 14,795   | 136,918  | 266,690   | 5.721E-08
Evaluation of SQMR
Figure 3.1: L factor of each level of the LDL multilevel factorization of matrix c-big. Levels
increase from left to right and from top to bottom.
The largest symmetric indefinite instance we were able to test was a non-linear pro-
gramming problem from the SuiteSparse collection, also studied in [95]. The results for this
benchmark are closer to those obtained in the symmetric indefinite PDE. There are four in-
stances of this problem, which vary in size. When τ = 0.01, the fill-in of the preconditioner
allows only the three smaller instances to be executed in the GPU.
To close this discussion, we note that, contrary to the results obtained in the previous
experiments, there are discrepancies in the number of iterations as well as in the residual errors
for some of the tested instances. These could be caused by floating-point errors and the use
of Bunch-Kaufman pivoting, but further tests are required.
For the evaluation of our GPU implementations for non-symmetric matrices, we first applied
the BiCG and GMRES methods to the SPD matrices associated with the Laplacian PDE,
treating them as if they were non-symmetric. The test instances in this benchmark can
be scaled up arbitrarily while preserving a certain pattern in the nonzero structure of the
factorization. The results in Table 3.5 show an important improvement in the performance
of the two GPU-accelerated solvers, though this is more noticeable for BiCG. The reason is
that the only stages of the solver that are off-loaded to the accelerator are the application
of the preconditioner and the SpMV. The first one occurs twice per iteration for BiCG (as the
transposed preconditioner also has to be applied), but only once for GMRES. If we consider
the stages that involve the preconditioner, the acceleration factor reaches up to 6× for the
transposed preconditioner in the largest test case. This is not surprising, given the memory-
bound nature of the problem and the fact that the GPU memory bandwidth is roughly 11× that
of the CPU. The matrices of this set are SPD and well-conditioned, allowing us to use
a drop tolerance τ = 0.1 and still converge in a few iterations. This arguably high
value of τ produces a sparser preconditioner, exposing a larger volume of data-parallelism
that is exploited by the GPU kernels.
To assess the performance of the non-symmetric solvers, we repeated the evaluation with
a set of large non-symmetric problems from the SuiteSparse collection. Table 3.5 reports
Table 3.4: Experimental results for the baseline GPU-enabled implementation of ILUPACK’s
SQMR method (Base SQMR) in platform Beethoven. Times are in seconds.
fair acceleration factors for the GPU versions of the solvers. In some detail, the speed-up
values for the application of the preconditioner are quite similar to those attained for the
Laplace matrices set, but the improvement experienced by the SpMV kernel is larger in all
cases.
Next, we tested our solver on the non-symmetric CDP cases (see Section 3.1). In these
experiments, the parallel versions outperform the serial CPU solvers, with speed-ups in the
order of 2×, while the acceleration of the preconditioner is almost 3×.
Comparing both solvers, the acceleration attained by GMRES is always lower than
that observed for BiCG. This can be easily explained by noting that, in BiCG, the GPU-
accelerated stages (preconditioner application and SpMV) take most of the execution time.
For GMRES, the time of the unaccelerated stages is more important, to the extent that, in
some cases, the unaccelerated steps of the GPU versions consume a higher fraction of the
time than the accelerated ones. A detailed analysis revealed that, in these cases, the bottle-
neck of the GMRES method is the modified Gram-Schmidt re-orthogonalization. Regarding
the quality of the computed solution x∗ , the GPU-enabled solvers converge in the same
number of iterations and present the same final relative residual error (calculated according
to Equation (3.1)) as the original version of ILUPACK.
Table 3.5: Experimental results for the baseline GPU-enabled implementation of ILUPACK’s
BiCG and GMRES methods (Base GMRES and Base BICG) in platform Beethoven. Times
are in seconds.
Figure 3.2: Algorithmic formulation of the preconditioned BiCG method. The steps have been
re-arranged so that the two sequences that compose the method can be isolated and executed in
different devices.
time-consuming stage. In the case of BiCG, the application of the transposed preconditioner
is often much slower than that of the non-transposed variant. This is related to the use of
cuSparse transposed operations, and especially the transposed version of the matrix-vector
product which, in general, is much slower than the non-transposed variant.
In this section we analyze the principal bottlenecks that arise after accelerating the
preconditioner application of sequential ILUPACK. Then we propose new versions of the
corresponding solvers in order to overcome these new limitations. Finally we perform an
experimental assessment of the advanced parallel variants.
Table 3.6: Evaluation of BiCG for selected cases in platform Brahms. The SpMV and the
application of the preconditioner both proceed in a single GPU while other minor operations are
performed by the CPU. Times are in seconds.
One possible solution is to maintain explicit copies of the transposed operands in the GPU memory and operate directly
with them. However, this solution is not satisfactory as, for some large-scale problems, the
amount of memory in the accelerator may be insufficient.
Contrary to most Krylov-based iterative linear solvers, in which there exist strict data
dependencies that serialize the sequence of kernels appearing in the iteration, BiCG is com-
posed of two quasi-independent recurrences, one with A and a second with A^T; see the
left-hand side column of Figure 3.2. Moreover, in the preconditioned version of the method,
there is no data dependence between the application of the transposed and non-transposed
preconditioner inside the iteration, exposing coarse-grain parallelism at the recurrence-level.
To improve the performance of BiCG, we rearranged the operations in the BiCG method
so that these two sequences are isolated and executed in a server equipped with two dis-
crete graphics accelerators. This enables the exploitation of coarse-grain task-parallelism in
conjunction with the data-parallelism that is leveraged inside each accelerator.
The specific operations off-loaded to each device in our dual-GPU implementation of the
BiCG method are outlined in Figure 3.2. This partitioning of the workload executes a single
SpMV and the application of one of the preconditioners in each GPU. As these are the
most computationally-demanding steps of the solver, this distribution of the workload can
be expected to be fairly well-balanced, despite the fact that one of the devices also computes
a dot product and two axpy operations.
In a system equipped with two discrete accelerators, each GPU (A and B) has its own
separate memory. Thus, to avoid wasting memory, and simultaneously exploit the faster
version of the cuSparse SpMV, we maintain the non-transposed operands in GPU A and the
transposed ones in GPU B. With this approach we do not use additional memory, but
exploit the best storage format in each device. However, it is now necessary to transfer the
transposed data to the second GPU, which adds some overhead.
In the previous implementation, the preconditioner was copied to the GPU asyn-
chronously, as it was calculated by the incomplete factorization routine, overlapping the
transfers with the calculation of the subsequent levels of the preconditioner. As the
BiCG method is the only one that makes use of the transposed matrix and preconditioner,
it makes no sense to perform the transfer of these transposed operands by default. For
that reason we report this communication cost as an overhead of BiCG. Nevertheless, a sig-
nificant part of this cost can be hidden if the transfer is performed asynchronously and
overlapped with the calculation of the preconditioner, as is the case for the non-transposed
data.
Regarding the concurrent execution in both devices, we leverage asynchronous memory
copies between the CPU and the GPU so that the accelerated section of the solver proceeds
asynchronously with respect to the host. To accomplish this, we first register the
memory area used by the solver as non-pageable (pinned), so that the operations corresponding
to different devices overlap even though they are launched serially.
In summary, by using both devices, we can exploit the increased computational power
provided by the second GPU in addition to the much faster execution of the SpMV when
invoked with non-transposed operands.
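As an illustration of this scheme, the following sketch shows how one iteration of the dual-GPU BiCG could be organized. The helper routines spmv_gpu and apply_prec_gpu are hypothetical placeholders for the cuSparse/cuBlas-based kernels discussed above, and the snippet assumes that the operands of each recurrence already reside in the memory of the corresponding device; it is a minimal sketch, not the actual implementation.

    #include <cuda_runtime.h>

    // Hypothetical helpers: each call only enqueues the corresponding cuSparse/cuBlas
    // work on the given stream and returns immediately (asynchronous with the host).
    void spmv_gpu(cudaStream_t s, bool transposed)       { /* SpMV with A or A^T */ }
    void apply_prec_gpu(cudaStream_t s, bool transposed) { /* multilevel solves  */ }

    // Done once before the iterative loop: pin the solver work area so that the
    // asynchronous copies issued for both devices can overlap with computation.
    void pin_solver_buffers(void *work, size_t bytes) {
        cudaHostRegister(work, bytes, cudaHostRegisterDefault);
    }

    // One BiCG iteration: the non-transposed recurrence is dispatched to device 0
    // and the transposed one to device 1; the host only joins at the end, before
    // the dot products and vector updates that close the iteration.
    void bicg_iteration_dual_gpu(cudaStream_t streamA, cudaStream_t streamB) {
        cudaSetDevice(0);                         // GPU A: A p  and  M^{-1} r
        spmv_gpu(streamA, false);
        apply_prec_gpu(streamA, false);

        cudaSetDevice(1);                         // GPU B: A^T p~  and  M^{-T} r~
        spmv_gpu(streamB, true);
        apply_prec_gpu(streamB, true);

        cudaSetDevice(0); cudaStreamSynchronize(streamA);
        cudaSetDevice(1); cudaStreamSynchronize(streamB);
        // ... CPU: dot products, axpy updates and convergence test ...
    }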
acceleration for the transposed preconditioner on the GPU, we will only evaluate the use
of the CPU for the transposed SpMV, using cuSparse to compute the application of the
transposed preconditioner. This should allow an important part of the non-transposed SpMV
and the application of the non-transposed preconditioner to overlap with the transposed
SpMV, while any overlap with the transposed preconditioner will rely on the use of streams.
Enhancing task-parallelism
One of the main bottlenecks of the application of ILUPACK's multilevel ILU preconditioner
is the solution of four triangular linear systems in each level, which can be derived from
Equation (2.90). The solution of these operations involves routine cusparseDcsrsv_solve
in cuSparse, whose implementation is based on the so-called level-set strategy [17] and
is described in [75]. In a broad sense, it consists of performing an analysis of the
triangular sparse matrix to determine sets of independent rows called levels. The triangular
solver will then launch a GPU kernel to process each one of these levels, processing the rows
that belong to each level in parallel. The number of levels that derive from the analysis can
vary greatly according to the sparsity pattern of each triangular matrix, so that for matrices
of considerable size it is usually in the order of hundreds or even a few thousand. In such
cases, the overhead due to launching the kernels that correspond to each level can become
significant [46]. In the context of a CPU-GPU concurrent execution, wasting one core on
launching kernels instead of doing useful computations can have a more significant impact
on the overall performance.
Recent research has aimed at reducing the synchronization overhead, as well as at replacing
the costly analysis phase with a less computationally-demanding process, based on a self-
scheduled strategy that effectively avoids the synchronization with the CPU [70]. In [44]
we followed these ideas to develop a synchronization-free GPU routine to solve triangular
linear systems for matrices in CSR format. The cost of launching the kernels involved
in this routine is completely negligible, so solving the triangular linear systems that
correspond to the application of the non-transposed preconditioner with this strategy should
enable a better overlapping with the transposed SpMV in the GPU. We next provide a brief description of
our routine.
Our self-scheduled procedure to compute the solution of a (lower) triangular sparse linear
system proceeds row-wise, assigning a warp to each unknown. To manage the dependencies
between unknowns, each warp must busy-wait until the entries in the solution vector that
are necessary to process that row have their final values. To keep track of this, a ready
vector of booleans, with an entry for each required unknown, indicates if the corresponding
row has been processed or not.
Before a warp can start processing its assigned row, it iteratively polls the corresponding
entries of the ready vector until all of them have been set to one by the corresponding warps.
The values in the local variable of each thread of the warp are then reduced by a warp-voting
primitive that returns one if the polled value is nonzero for all the active threads in the warp,
or zero otherwise.
This process could cause a deadlock if the warps that produce the values needed
by the current warp were never executed on the multiprocessor because all active warps are
caught in a busy-wait. We rely on the fact that accessing the ready vector on
global memory causes a warp stall and, therefore, the warp scheduler of the Streaming
Multiprocessor (SM) activates another warp, so eventually all the dependencies of the warp
get fulfilled.
Once this occurs, each thread of the warp multiplies a nonzero value of the current row
by the appropriate entry of the solution vector, accumulating the result in a register. If
there are more than 32 nonzero elements in the row, the warp moves back to busy-waiting
for the solution of the unknowns that correspond to the next 32 nonzero elements of the
row. The cycle is repeated until all nonzero entries in the row have been processed.
Finally, the registers that accumulate the products performed by each thread of the warp
are reduced using warp shuffle operations, and the first thread of the warp is responsible for
updating the solution and ready vector accordingly.
A more detailed explanation of this process, accompanied by experimental results, can
be found in Appendix A.
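The following kernel is a compact sketch of the strategy just described, under the assumptions that the diagonal element is stored as the last nonzero of each CSR row and that the ready vector has been zeroed before the launch; it is written for CUDA 9 or later, where the *_sync warp primitives are available, relies on the scheduling behaviour discussed above to make progress, and is only an illustration, not the exact code of [44].

    __global__ void sptrsv_csr_syncfree(const int *rowptr, const int *colind,
                                        const double *val, const double *b,
                                        volatile double *x, volatile int *ready,
                                        int n)
    {
        int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;   // one warp per row
        int lane = threadIdx.x & 31;
        if (row >= n) return;

        int begin = rowptr[row];
        int diag  = rowptr[row + 1] - 1;          // diagonal assumed last in the row
        double sum = 0.0;

        // Process the off-diagonal entries of the row in chunks of 32 columns.
        for (int chunk = begin; chunk < diag; chunk += 32) {
            int k   = chunk + lane;
            int col = (k < diag) ? colind[k] : -1;
            // Busy-wait until every dependency of this chunk has been produced;
            // lanes beyond the end of the row simply vote "ready".
            while (!__all_sync(0xffffffffu, col < 0 || ready[col] != 0)) { /* spin */ }
            if (k < diag)
                sum += val[k] * x[col];           // consume the already solved unknown
        }

        // Reduce the per-lane partial sums with warp shuffles.
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, offset);

        if (lane == 0) {
            x[row] = (b[row] - sum) / val[diag];
            __threadfence();                      // make x[row] visible to other warps
            ready[row] = 1;                       // publish the unknown
        }
    }

    // Launch sketch: one warp (32 threads) per row, with ready zeroed beforehand:
    //   cudaMemset((void *)ready, 0, n * sizeof(int));
    //   int threads = 256, blocks = (32 * n + threads - 1) / threads;
    //   sptrsv_csr_syncfree<<<blocks, threads>>>(rowptr, colind, val, b, x, ready, n);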
Table 3.7: Evaluation of GMRES for selected cases in platform Brahms. We display the execution
time (in seconds) of each stage of the GMRES solver and their cost (in %) relative to the total time.
In the GPU variant, the SpMV and the application of the preconditioner (identified with the label
"Appl. M^{-1}") both proceed in the GPU, while MGSO and other minor operations are performed
by the CPU.
The implementation of MGSO integrated into ILUPACK’s routine for GMRES is described
in Algorithm 7. For our enhanced data-parallel implementation of this procedure, we de-
signed a hybrid version that leverages the best type of architecture for each operation while,
at the same time, limiting the data transfers.
In particular, we rely on the CUBLAS library to perform the dot products (O1 and O3)
and vector updates (O2, O4 and O5) on the GPU. Since we already off-loaded the SpMV
appearing in GMRES to the accelerator, the basis vectors of MGSO reside in the GPU.
The output of the process is the current basis vector, orthogonalized with respect to the
remaining vectors, and the coefficients of the current row of the Hessenberg matrix. This
matrix is small, and the following application of rotations and triangular solve expose little
parallelism, so it is natural to keep it in the CPU memory. The coefficients of the matrix are
calculated serially via vector products with the basis vectors, so the GPU computes O(n) flops
for each scalar coefficient that is transferred back to the CPU. A similar approach was studied
in [60].
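A minimal sketch of one such hybrid MGSO step is given below. It assumes that the k previously stored basis vectors and the candidate vector already reside on the GPU; it is an illustration of the idea with standard cuBlas calls, not the exact routine integrated into ILUPACK. With the default host pointer mode, each dot product and norm returns its scalar result directly to host memory, which mirrors the behaviour described above.

    #include <cublas_v2.h>

    // Orthogonalize the device vector w (length n) against the k basis vectors
    // V[0..k-1] (device pointers) with modified Gram-Schmidt. The coefficients
    // h[0..k] are returned in host memory: only k+1 scalars travel from the GPU
    // to the CPU, while all O(n) work stays on the device.
    void mgso_step(cublasHandle_t handle, int n, int k,
                   double *const *V, double *w, double *h)
    {
        for (int i = 0; i < k; ++i) {
            cublasDdot(handle, n, V[i], 1, w, 1, &h[i]);   // h[i] = V[i]^T w
            double coeff = -h[i];
            cublasDaxpy(handle, n, &coeff, V[i], 1, w, 1); // w = w - h[i] V[i]
        }
        cublasDnrm2(handle, n, w, 1, &h[k]);               // h[k] = ||w||_2
        double inv = 1.0 / h[k];
        cublasDscal(handle, n, &inv, w, 1);                // normalize the new basis vector
    }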
3.3.4 BiCGStab
The BiCGStab method (see Section 2.2.2) is one of the most widespread iterative solvers
for general linear systems [88] for which, unfortunately, there is no support in the current
distribution of ILUPACK.
As stated in Section 2.4, the solvers in ILUPACK are implemented following a reverse
communication strategy. However, as BiCGStab can be efficiently implemented in the GPU
via calls to cuBlas, cuSparse and our data-parallel version of ILUPACK’s preconditioner,
it is natural to encapsulate the method in one monolithic function that receives as inputs,
among other parameters, the right-hand side vector and the initial guess, and produces the
approximate solution to the system in response.

Table 3.8: Experimental results for the enhanced coarse-grain data-parallel version of BiCG for
dual-GPU servers (Adv BICG 2GPU) in platform Brahms.
Our CPU implementation of BiCGStab relies on a subset of the BLAS routines dis-
tributed with ILUPACK, though a different implementation of BLAS can be used without
any major modification. Our GPU implementation of BiCGStab, in turn, follows the
ideas in [73] to off-load the entire solver (i.e., all kernels) to the GPU.
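The following sketch illustrates what such a monolithic routine can look like. The vectors are device pointers, the SpMV and preconditioner application are represented by the hypothetical helpers spmv_gpu and apply_prec_gpu (assumed to be provided elsewhere), and the remaining operations map to cuBlas calls. It is a simplified illustration, without breakdown checks, and not the actual code added to ILUPACK.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    // Hypothetical device-resident building blocks standing in for the cuSparse
    // SpMV (y = A x) and our data-parallel preconditioner (z = M^{-1} r).
    void spmv_gpu(const double *x, double *y);
    void apply_prec_gpu(const double *r, double *z);

    // Monolithic right-preconditioned BiCGStab: all vectors are device pointers of
    // length n; the routine returns the number of iterations performed.
    int bicgstab_gpu(cublasHandle_t h, int n, const double *b, double *x,
                     int maxit, double tol)
    {
        double *r, *rh, *p, *v, *s, *t, *ph, *sh;
        size_t sz = (size_t)n * sizeof(double);
        cudaMalloc(&r, sz);  cudaMalloc(&rh, sz); cudaMalloc(&p, sz);  cudaMalloc(&v, sz);
        cudaMalloc(&s, sz);  cudaMalloc(&t, sz);  cudaMalloc(&ph, sz); cudaMalloc(&sh, sz);

        const double one = 1.0, mone = -1.0;
        spmv_gpu(x, r);                                   // r = b - A x0
        cublasDscal(h, n, &mone, r, 1);
        cublasDaxpy(h, n, &one, b, 1, r, 1);
        cublasDcopy(h, n, r, 1, rh, 1);                   // rh = r0 (shadow residual)
        cublasDcopy(h, n, r, 1, p, 1);                    // p  = r0

        double rho, rho_old, alpha, omega, nb, nr;
        cublasDnrm2(h, n, b, 1, &nb);
        cublasDdot(h, n, rh, 1, r, 1, &rho);

        int it;
        for (it = 0; it < maxit; ++it) {
            apply_prec_gpu(p, ph);                        // ph = M^{-1} p
            spmv_gpu(ph, v);                              // v  = A ph
            double rv;  cublasDdot(h, n, rh, 1, v, 1, &rv);
            alpha = rho / rv;

            cublasDcopy(h, n, r, 1, s, 1);                // s = r - alpha v
            double ma = -alpha;
            cublasDaxpy(h, n, &ma, v, 1, s, 1);

            apply_prec_gpu(s, sh);                        // sh = M^{-1} s
            spmv_gpu(sh, t);                              // t  = A sh
            double ts, tt;
            cublasDdot(h, n, t, 1, s, 1, &ts);
            cublasDdot(h, n, t, 1, t, 1, &tt);
            omega = ts / tt;

            cublasDaxpy(h, n, &alpha, ph, 1, x, 1);       // x += alpha ph + omega sh
            cublasDaxpy(h, n, &omega, sh, 1, x, 1);

            cublasDcopy(h, n, s, 1, r, 1);                // r = s - omega t
            double mo = -omega;
            cublasDaxpy(h, n, &mo, t, 1, r, 1);

            cublasDnrm2(h, n, r, 1, &nr);
            if (nr <= tol * nb) { ++it; break; }          // converged

            rho_old = rho;
            cublasDdot(h, n, rh, 1, r, 1, &rho);
            double beta = (rho / rho_old) * (alpha / omega);
            cublasDaxpy(h, n, &mo, v, 1, p, 1);           // p = r + beta (p - omega v)
            cublasDscal(h, n, &beta, p, 1);
            cublasDaxpy(h, n, &one, r, 1, p, 1);
        }

        cudaFree(r); cudaFree(rh); cudaFree(p);  cudaFree(v);
        cudaFree(s); cudaFree(t);  cudaFree(ph); cudaFree(sh);
        return it;
    }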
Table 3.8 reports the execution time of the coarse-grain data-parallel version of BiCG for
dual-GPU servers (Adv BICG 2GPU). In this table, we separate the operations of each
iteration of BiCG into three groups/stages, and assess the relative cost corresponding to
each of them as well as the total time. The first stage aggregates the operations that
were accelerated by the two GPU versions of the solver: SpMV and application of the
preconditioner. The second stage measures the overhead of the initialization due to the copy
of the transposed matrix and preconditioner to the second GPU; therefore, it is only visible
for the new dual-GPU variant. The third stage comprises the remaining operations, mostly
dot products and vector updates, which are performed in the CPU in all our implementations
of BiCG.
The results show that the modifications produce a sensible reduction of the execution
time for the cases that take a higher number of iterations to converge. This is necessary
to compensate for the additional cost of copying and transferring the preconditioner and the
coefficient matrix to the second GPU. If we consider only the first stage, the acceleration
achieved by the dual-GPU variant with respect to the CPU version is of up to 16×. Nev-
ertheless, the cost of initialization and the unaccelerated parts of the solver significantly
affect the performance. Overall, despite this costly initialization, for some of the problem
instances the execution time of the iterative solve is reduced by a factor in the range 7–10×
with respect to the sequential version.
On the other hand, it is especially remarkable that, with our strategy, the speed-up
associated with doubling the number of many-core devices exceeds linear scaling for the largest
cases. Note that the dual-GPU version of BiCG outperforms the original GPU variant by a
factor of up to 2.88× for the whole method, yielding a super-linear speed-up.
We next perform the analysis of the experimental results obtained from the execution of
the single-GPU variants discussed in Section 3.3.2. We use the same set of matrices and
hardware platform employed for the evaluation of the dual-GPU proposal.
We include four variants in the evaluation:
• Cpu BICG mkl performs all the computations on the multi-core processor. The two
SpMVs of the BiCG iteration are performed using the multi-threaded routines of the MKL
library (version 2017, update 3), while the triangular solvers are computed with an optimized
sequential code. This variant does not exploit task-parallelism.
• Adv BICG 2str computes the operations corresponding to the GPU A and GPU B
blocks of Figure 3.2 on one device, assigning one GPU stream to each block. The GPU
computation of the SpMVs and sparse triangular linear systems is performed using the
cuSparse library, the vector operations of BiCG are computed using cuBlas, and the
less important vector operations inside the preconditioner application are implemented
using ad-hoc GPU kernels.
• Adv BICG Hyb Cusp employs the multi-core CPU to perform the transposed SpMV
of BiCG using the MKL library. The rest of the computations are performed in the
GPU using cuSparse and cuBlas libraries, as in Adv BICG 2str.
• Adv BICG Hyb SF replaces the triangular solver of the routine that applies the non-
transposed preconditioner with our new synchronization-free routine. It employs the
MKL library to perform the transposed SpMV of BiCG, and cuSparse to compute
the main operations of the transposed preconditioner. (A sketch of the hybrid scheme
shared by the two Hyb variants is given right after this list.)
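The sketch below illustrates the structure of one iteration in the hybrid variants: the GPU stream is first filled with the asynchronous non-transposed (and transposed-preconditioner) work, and the host thread then computes the transposed SpMV on the CPU while the device is busy. The helper names are hypothetical placeholders for the MKL, cuSparse and synchronization-free routines mentioned above.

    #include <cuda_runtime.h>

    // Hypothetical helpers: the *_gpu routines only enqueue work on the stream and
    // return immediately; spmv_cpu_transposed() is a blocking multi-threaded SpMV.
    void spmv_gpu(cudaStream_t s)                  { /* non-transposed SpMV (cuSparse)     */ }
    void apply_prec_gpu(cudaStream_t s)            { /* non-transposed preconditioner      */ }
    void apply_prec_gpu_transposed(cudaStream_t s) { /* transposed solves (cuSparse)       */ }
    void spmv_cpu_transposed()                     { /* multi-threaded SpMV with A^T (MKL) */ }

    void hybrid_bicg_step(cudaStream_t s)
    {
        spmv_gpu(s);                    // asynchronous: the GPU starts working
        apply_prec_gpu(s);
        apply_prec_gpu_transposed(s);
        spmv_cpu_transposed();          // meanwhile, the CPU computes the transposed SpMV
        cudaStreamSynchronize(s);       // join before the dots/axpys of the iteration
    }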
The results obtained for the four variants described are displayed in Table 3.9. We
only present the data corresponding to the accelerated part of BiCG because it is on these
computations that the four variants differ from each other. The runtimes of the remaining
stages are similar to those in Table 3.8. A discrepancy in the number of iterations performed
by each variant is obtained for the diagonal problem, which can be attributed to the effect
of floating-point rounding errors.
It can be observed that the proposed task-parallel variants are able to significantly
accelerate the targeted section of the BiCG iteration, with speed-ups that reach 12× relative to
the data-parallel multi-core version. In most of the cases, the improvement is mainly due
to the GPU acceleration of the application of the preconditioner, which can be deduced
from the speed-ups corresponding to version Adv BICG 2str. The usage of GPU streams,
however, has little effect on the overall performance, as almost no overlapping of operations
is observed in the timelines extracted with Nvidia Visual Profiler. This can be observed for
the A200 and cage15 sparse matrices in Figures 3.3 and 3.4 respectively.
Regarding the hybrid GPU-CPU variants, the results show that off-loading the trans-
posed SpMV to the multi-core can yield important benefits. In our experiments, the
improvements with respect to the Adv BICG 2str variant range from 8% to almost 65%.
A number of factors influence the relation between the performance of the Adv BICG 2str
and Adv BICG Hyb Cusp variants. For example, as can be observed in Figure 3.4
for matrix cage15, a great fraction of the performance improvement is due to the reduction
of the time taken by the transposed SpMV. In Adv BICG 2str, this takes more than half
of the execution time of the iteration, and the multi-core implementation in Adv BICG
Hyb Cusp is able to both reduce this time by half and overlap part of the application of
the preconditioner. A different situation is observed for matrix A200, where the cuSparse
routine for the transposed SpMV outperforms the MKL counterpart. In this case though,
the difference between the runtime of both routines is less critical, and a performance im-
provement is obtained regardless, due to the almost perfect overlapping of the application
of the preconditioner with the transposed SpMV.
The results also show that the use of our synchronization-free routine contributes an
additional performance improvement by enabling a higher degree of overlapping between
operations. This depends mostly on the number of level-sets of the incomplete factors,
as additional levels imply launching more kernels, which increases the corresponding
overhead. For instance, the analysis of the triangular factors generated for matrix A200
yields only 40 levels, whereas for cage15 it generates 616. Thus, it is not surprising that
the difference between the runtimes of Adv BICG Hyb Cusp and Adv BICG Hyb SF
for matrix A200 is minimal, while for cage15 the performance gain is relevant.
Table 3.9: Aggregated runtime (in seconds) and acceleration of the parallelized part of BiCG in
platform Brahms.
Matrix        Routine              # It.   SpMV + App. Prec. (s)   Speedup vs. Cpu BICG mkl
A200          Cpu BICG mkl          12      9.05                    -
A200          Adv BICG 2str         12      1.20                    7.51
A200          Adv BICG Hyb Cusp     12      1.04                    8.67
A200          Adv BICG Hyb SF       12      1.04                    8.74
A252          Cpu BICG mkl          12      12.52                   -
A252          Adv BICG 2str         12      2.49                    5.02
A252          Adv BICG Hyb Cusp     12      2.05                    6.09
A252          Adv BICG Hyb SF       12      2.04                    6.14
cage14        Cpu BICG mkl          14      2.15                    -
cage14        Adv BICG 2str         14      0.78                    2.77
cage14        Adv BICG Hyb Cusp     14      0.27                    7.88
cage14        Adv BICG Hyb SF       14      0.22                    9.85
Freescale1    Cpu BICG mkl          442     145.09                  -
Freescale1    Adv BICG 2str         442     37.75                   3.84
Freescale1    Adv BICG Hyb Cusp     442     32.02                   4.53
Freescale1    Adv BICG Hyb SF       442     28.41                   5.11
rajat31       Cpu BICG mkl          12      3.49                    -
rajat31       Adv BICG 2str         12      1.27                    2.75
rajat31       Adv BICG Hyb Cusp     12      0.85                    4.10
rajat31       Adv BICG Hyb SF       12      0.83                    4.21
cage15        Cpu BICG mkl          16      10.16                   -
cage15        Adv BICG 2str         16      3.09                    3.29
cage15        Adv BICG Hyb Cusp     16      1.17                    8.67
cage15        Adv BICG Hyb SF       16      0.85                    12.02
diagonal      Cpu BICG mkl          170     296.24                  -
diagonal      Adv BICG 2str         190     107.91                  2.75
diagonal      Adv BICG Hyb Cusp     192     97.45                   3.04
diagonal      Adv BICG Hyb SF       176     79.97                   3.70
circular      Cpu BICG mkl          158     262.67                  -
circular      Adv BICG 2str         158     86.56                   3.03
circular      Adv BICG Hyb Cusp     158     78.89                   3.33
circular      Adv BICG Hyb SF       158     72.23                   3.64
unit-vector   Cpu BICG mkl          170     303.07                  -
unit-vector   Adv BICG 2str         170     97.91                   3.10
unit-vector   Adv BICG Hyb Cusp     170     89.65                   3.38
unit-vector   Adv BICG Hyb SF       170     80.27                   3.78
Figure 3.3: Output from Nvidia Visual Profiler for matrix A200. The timeline is magnified to show a single iteration of the solver. The upper timeline
corresponds to Adv BICG 2str, the middle to Adv BICG Hyb Cusp, and the bottom to Adv BICG Hyb SF. The time spans of the iterations, from top to
bottom, are 237.675 ms, 208.03 ms and 206.801 ms, respectively.
Figure 3.4: Output from Nvidia Visual Profiler for matrix cage15. The timeline is magnified to show a single iteration of the solver. The upper timeline
corresponds to Adv BICG 2str, the middle to Adv BICG Hyb Cusp, and the bottom to Adv BICG Hyb SF. The time spans of the iterations, from top to
bottom, are 500.474 ms, 273.205 ms and 218.361 ms, respectively.
Table 3.10: Experimental results for the enhanced data-parallel version of MGSO and the GMRES
method (Adv GMRES) in platform Brahms. Times are in seconds.
Table 3.10 compares the execution time of the GPU-accelerated MGSO routine against its
CPU counterpart, showing the effect on the execution time of the entire solver. The results
demonstrate that the accelerated version of the routine achieves a notable reduction of its
execution time, reaching speed-ups between 3× and 7× over the CPU version. Moreover,
combining this with the previous enhancements, we obtain speed-ups of up to 5.8× for the
entire solver.
An additional observation from the data in Table 3.10 is that the factor which most affects
the acceleration is the number of iterations of the solver. This is due to the transfer of
the basis vectors from the device to the host that is necessary in order to perform the last
step of GMRES on the CPU. The two factors that determine the cost of MGSO at a
given iteration k of GMRES are the number of vectors involved in the orthogonalization,
which is equivalent to k modulo the restart parameter of GMRES, and their dimension. Since
one of these transfers occurs every time MGSO is invoked, the ratio between the volume of
transfers and the amount of computation in MGSO improves with each iteration until the
restart. As a consequence, the impact of the transfer overhead is greater in those cases
that converge in a small number of steps and do not reach the restart point of GMRES,
which we set to 30 iterations in our experiments.
last step of GMRES is performed on the accelerator.
Table 3.11 compares our CPU and GPU variants of BiCGStab. In addition to the total
execution time of the solver, we present the average time (and speed-up) per iteration, as
in some cases we obtain slightly different values between the GPU and CPU versions. The
discrepancies are small and occur for those cases with higher condition number, which are
more prone to floating-point rounding errors.
The execution times observed for BiCGStab highlight the benefits of including this
method in the suite of CPU solvers supported by ILUPACK, as in general it attains better
convergence rates and delivers lower execution times than the CPU versions of GMRES and
BiCG.
Table 3.11: Experimental results for the new data-parallel version of BiCGStab. Times are in
seconds.
Matrix        Device   #Iter.   Total time   Avg. time per iter.   Avg. speed-up per iter.   Relative residual
A200          CPU      3        4.27         1.423                 8.21                      1.40E-11
A200          GPU      3        0.52         0.173                 –                         1.40E-11
A252          CPU      3        7.16         2.386                 7.02                      1.30E-11
A252          GPU      3        1.02         0.340                 –                         1.30E-11
cage14        CPU      3        1.93         0.643                 5.36                      3.40E-10
cage14        GPU      3        0.36         0.120                 –                         3.40E-10
Freescale1    CPU      92       54.21        0.589                 9.07                      2.30E-03
Freescale1    GPU      87       5.65         0.064                 –                         2.20E-03
rajat31       CPU      2        1.62         0.810                 5.40                      7.80E-09
rajat31       GPU      2        0.30         0.150                 –                         7.80E-09
cage15        CPU      3        7.28         2.426                 6.07                      5.40E-10
cage15        GPU      3        1.20         0.400                 –                         5.40E-10
circular      CPU      115      239.84       2.085                 8.07                      6.90E-08
circular      GPU      100      25.84        0.258                 –                         3.90E-08
diagonal      CPU      111      281.80       2.538                 10.09                     6.60E-08
diagonal      GPU      114      28.68        0.251                 –                         5.20E-08
unit-vector   CPU      115      240.07       2.087                 8.22                      4.70E-08
unit-vector   GPU      108      27.42        0.253                 –                         4.00E-08
Regarding the use of the GPU, the acceleration factor for the solver iteration varies
between 4× and 10×, with the exact speed-up depending on characteristics such as the
dimension of the problem, the sparsity pattern of the coefficient matrix, and the sparsity
of the incomplete factors produced by the multilevel ILU factorization underlying
ILUPACK.
CHAPTER 4

Design of a Task-Parallel version of ILUPACK for Graphics Processors
The data-parallel variants presented in Chapter 3 are confined to execute in one compute
node, equipped with a multi-core CPU and one or two GPUs. This limitation has two
implications. First, compute nodes or servers equipped with more than two GPUs are
increasingly common, in part driven by the computational requirements of machine
learning. Second, being restricted to the memory subsystem of one node, including the
memory of the GPUs present in that node, constrains the size of the problems that can be
tackled with our current strategy.
A third drawback of our previous developments is that their potential to exploit the cores
of modern CPUs is fairly limited as well. This is mainly a consequence of the characteristics
of the problem at hand. Except for the BiCG method, the rest of the studied iterative
solvers are based on a sequence of basic matrix and vector computations that allow little
overlapping between them. Moreover, although data-parallelism can be exploited in these
operations, the performance gain will be strongly limited by the memory throughput of the
CPU. It is thus safe to assume that no radical performance boost can be extracted from the
exploitation of the multi-core CPU without significant effort.
In the final sections of Chapter 2 one such effort is described, based on a modification of
both the preconditioner and the PCG method of ILUPACK, which is capable of exposing
task-level parallelism. We call this software Task-Parallel ILUPACK, and the underlying
ideas are detailed in [3] and [4], where two different implementations, one for shared-memory
and the other for distributed-memory platforms, are presented and analyzed.
Although, as mentioned before, the Task-Parallel ILUPACK is restricted to the SPD
case, and might have some numerical disadvantages with respect to the sequential and data-
parallel versions, these factors are compensated by its excellent performance and scalability,
which makes this variant useful in large-scale SPD scenarios.
Encouraged by the results presented in Chapter 3, which suggest that sensible performance
improvements can be attained by the use of GPUs, and that this only exerts a mild
effect on the accuracy attained by the preconditioner, we are now interested in studying how
this data-parallelism can be harnessed in the Task-Parallel ILUPACK.
4.1. GPU acceleration of the shared-memory variant of ILUPACK
4.2. Overcoming memory capacity constraints
compute thrust available to solve the systems, and thus the size of the problems that can
be tackled. In this section we present the key features of our new GPU-enabled variant
of the distributed-memory ILUPACK, which makes use of massively-parallel architectures
residing in the nodes of a cluster. As before, we focus on the stage that corresponds to the
application of the preconditioner during the iteration of the PCG method.
This parallelization of ILUPACK builds upon the task-parallel version presented in [4],
following the main aspects described in Section 2.4.2. To execute in distributed-memory
environments, this version of ILUPACK maps each leaf task of the DAG to a different MPI
rank. This correspondence between tasks and ranks is statically defined by the root node
during the initialization steps, and is maintained for all subsequent operations. Then, the
root node distributes the data required for the parallel construction of the preconditioner
to the corresponding tasks. In our case, the initialization steps also include the creation
of a CUDA Stream for each leaf task. If there are many devices available in the compute
node, the streams, and consequently the tasks resident in that node, are mapped to different
devices in a round-robin fashion, based on their task id.
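A minimal sketch of this initialization step is shown next; the mapping is the simple round-robin just described, and the MPI rank is used as the task id, which is an assumption made for the sake of the example.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Bind the leaf task identified by task_id to one of the GPUs of its node
    // (round-robin) and create the CUDA stream that will carry its operations.
    cudaStream_t init_leaf_task(int task_id)
    {
        int ndevices = 0;
        cudaGetDeviceCount(&ndevices);
        cudaSetDevice(task_id % ndevices);   // tasks resident in the node alternate devices
        cudaStream_t stream;
        cudaStreamCreate(&stream);           // one stream per leaf task
        return stream;
    }

    // Usage (assuming one leaf task per MPI rank):
    //   int rank;  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    //   cudaStream_t s = init_leaf_task(rank);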
As in the case of the task-parallel multi-core variant in the previous section, the appli-
cation of the preconditioner consists of forward and backward substitution processes that
are now divided across the levels of the task-tree. Once the tree is traversed from bottom
to top, the backward substitution proceeds from top to bottom, until finally reaching the
leaf tasks again. Since the information of the preconditioner is spread across the task-tree,
traversing this structure requires a communication operation, and sometimes presents data
dependencies with the next or previous levels in the task-tree hierarchy. From the point of
view of the execution on a GPU, besides storing the residual rk+1 in GPU memory at the
beginning of the forward substitution phase, now it is necessary to transfer the intermediate
results of each task back to the CPU buffers before entering the recursive step.
As earlier, the partitioning strategy implies that only the tasks lying on the bottom
level of the tree perform a significant amount of work. Moreover, the multilevel structure
of ILUPACK’s preconditioner partitions the workload further, severely limiting the amount
of data-parallelism. These are the main motivations underlying the threshold introduced
for the multi-core variant. However, not all the operations involved in the application of
the preconditioner inside a leaf task are equally affected by these constraints to parallelism.
For example, the sparse triangular system solves that appear in the application of the pre-
conditioner tend to present considerably more data dependencies between their equations,
while the sparse matrix-vector products and vector operations present a degree of paral-
lelism similar to the non-partitioned case. In response to this, we propose a variation of the
threshold strategy where, instead of migrating the computation of the smaller levels entirely
to the CPU, we do this only for the triangular solves corresponding to such levels. This way
we can take a certain advantage of the data-parallelism present in the sparse matrix-vector
products and the vector operations, even in the smaller levels of each leaf. We call this
variant DistMem Thres.
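The following fragment sketches the DistMem Thres policy for the forward-substitution sweep inside one leaf task. The Level structure and the helper routines are hypothetical placeholders; the point is only that the triangular solves are routed to the CPU for the algebraic levels that fall below the threshold, while the SpMV and vector operations of every level remain on the GPU.

    // Hypothetical description of one algebraic level of the local preconditioner.
    struct Level { int n; /* dimension of the level */ };

    // Hypothetical helpers operating on the data of one level.
    void triangular_solves_gpu(const Level &lvl)   { /* cuSparse solves on the device      */ }
    void triangular_solves_cpu(const Level &lvl)   { /* sequential solves on the host      */ }
    void spmv_and_vector_ops_gpu(const Level &lvl) { /* always executed on the device      */ }

    // Forward substitution of one leaf task under the DistMem_Thres policy.
    void forward_substitution_thres(const Level *levels, int nlevels, int threshold)
    {
        for (int l = 0; l < nlevels; ++l) {
            if (levels[l].n >= threshold)
                triangular_solves_gpu(levels[l]);   // enough rows to occupy the GPU
            else
                triangular_solves_cpu(levels[l]);   // small level: avoid GPU overheads
                                                    // (requires moving that slice of the
                                                    //  working buffer between memories)
            spmv_and_vector_ops_gpu(levels[l]);     // data-parallel enough at every level
        }
    }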
Both approaches imply that some sections of the working buffer have to be transferred
between the CPU and GPU memories during the forward and backward substitution oper-
ations. Moreover, these transfers are synchronous with respect to the current task or GPU
stream, since the application of one algebraic level of the multilevel preconditioner cannot
commence until the results from the previous level are available. In our first approach, once
the first level is processed during the forward substitution, the data to be read by all the
operations of the following levels is sent to the CPU memory. In the backward substitution
phase, the data to be read by the first level needs to be transferred to the GPU. In our
second approach, only the data needed by the triangular solves of the smaller algebraic lev-
els is transferred. Although this scatters the communications into many small transfers, an
scenario which is not desirable in the GPU context for bandwidth and latency reasons, the
amount of data to be passed is almost the same, and the number of data movements does
not increase significantly.
The multi-core GPU server utilized for the experiments is Brahms, presented in Section
3.1.
The CPU code was compiled with the Intel(R) Parallel Studio 2016 (update 3), which
partially supports version 4.5 of the OpenMP specification (only for C++). However, none
of the new features of versions 3 and 4 is used in our code, and the compiler is fully compliant
with OpenMP v2.x.
The performance evaluation of the distributed-memory variant was carried out using 4 nodes
of a cluster installed in the CETA-CIEMAT1 center (Trujillo, Spain). Each node is equipped
with a 12-core Intel(R) Xeon(R) E5-2680 v3 processor (2.50GHz), 64 GB of DDR3 RAM
memory, and 2 Tesla K40 GPUs, each with 2,880 CUDA cores and 12 GB of GDDR5 RAM.
We used the OpenMPI v10.3.1 implementation of the MPI standard and gcc 4.8.5 with
the -O3 flag to compile the CPU code. The GPU compiler and the CUSPARSE library
1 Centro Extremeño de Tecnologı́as Avanzadas
4.3. Numerical evaluation
Test cases
The benchmark utilized for the experimental evaluation is the SPD case of scalable size
presented in Section 3.1.2.
Additionally, we tested our distributed-memory proposal on a circuit simulation problem
from the SuiteSparse2 collection (G3 Circuit), also described in Section 3.1.2, in order to
evaluate our strategy on a matrix with a different sparsity pattern than those offered by the
Laplace benchmark.
Table 4.1: Number of leaf tasks and average structure of each algebraic level of the preconditioner
using f = 2 and f = 4. To represent the structure of the levels, the average dimension, the number
of non-zeros and the rate of non-zeros per row are presented, together with the respective standard
deviations.
Matrix   # th. / f   # leaves   level   avg. n      σ(n)        avg. nnz     σ(nnz)       avg. nnz/n   σ(nnz/n)
A159     2           3          0       1,006,831   345,798     6,193,794    2,183,862    6.1          0.1
                                1       317,362     113,151     9,682,114    3,486,401    30.5         0.2
                                2       2,875       736         10,099       2,014        3.6          0.6
A159     4           6          0       502,108     159,044     3,116,629    1,005,408    6.2          0.1
                                1       156,048     50,647      4,685,500    1,537,905    30.0         0.2
                                2       1,251       437         4,095        1,754        3.2          0.4
A171     2           2          0       1,881,030   16,604      11,421,390   123,868      6.1          0.1
                                1       598,152     1,384       18,490,583   154,695      30.9         0.2
                                2       6,304       984         23,444       7,247        3.7          0.6
A171     4           4          0       937,998     6,011       5,764,461    71,397       6.1          0.1
                                1       294,702     1,310       8,967,985    72,180       30.4         0.2
                                2       2,845       506         10,885       3,768        3.7          0.7
A200     2           3          0       2,003,212   795,192     12,207,556   4,592,834    6.1          0.1
                                1       636,895     253,696     19,665,756   8,089,987    30.8         0.4
                                2       6,466       3,316       23,189       12,912       3.5          0.2
A200     4           7          0       856,365     186,595     5,283,746    1,141,907    6.2          0.1
                                1       268,523     59,553      8,155,375    1,842,559    30.3         0.2
                                2       2,449       525         8,552        2,032        3.5          0.4
A252     2           3          0       4,004,955   1,694,044   24,271,087   9,856,575    6.1          0.1
                                1       1,283,180   543,882     39,965,828   17,408,294   31.0         0.3
                                2       14,762      7,162       57,168       28,744       3.8          0.1
A252     4           6          0       1,998,470   494,294     12,196,071   3,070,313    6.1          0.1
                                1       635,612     159,942     19,603,140   4,936,718    30.8         0.1
                                2       6,523       1,429       23,807       5,758        3.6          0.3
Table 4.1 summarizes the structure of the multilevel preconditioner and the linear systems
corresponding to the leaf tasks that were generated using the aforementioned parameters.

2 [Link]
Figure 4.1: Execution time (in seconds) of preconditioner application for the three task parallel
variants, using two (left) and four (right) CPU threads. CPU version is the blue line with crosses.
ShMem GPU All version is the red line with circles. ShMem Thres is the black line with stars.
For
each tested matrix, the table presents the number of leaf tasks that resulted from the task
tree construction for f = 2 and f = 4, and next to it shows the average dimension of the
algebraic levels of the corresponding multilevel preconditioner, the average number of non-
zeros, and the average row density of the levels, with their respective standard deviations.
It can be easily observed that a higher value of f results in more leaf tasks of smaller
dimension. Regarding the algebraic levels of the factorization, the table shows how the
average dimension of the sub-matrices decreases from one level to the next. It is important
to notice how, in the second algebraic level, the submatrices already become about one
third smaller in dimension, and contains five times more nonzero elements per row. In other
words, the sub-problems become dramatically smaller and less sparse with each level of the
factorization, causing that, in this case, only the first algebraic level is attractive for GPU
acceleration.
Table 4.2 shows the results obtained for the original shared-memory version and the two
GPU-enabled ones for the matrices of the Laplace problem. The table reports the total
runtime of PCG, as well as the time spent on the preconditioner application stage and the
SpMV. The table also shows the number of iterations taken to converge to the desired
residual tolerance, and the final relative residual error attained, which is calculated as
in (3.1).
First, it should be noted that from the perspective of accuracy, there are no significant
differences between the task-parallel CPU variant and the GPU-enabled ones. Specifically,
the three versions reach the same number of iterations and final relative residual error for
each case.
From the perspective of performance it can be observed that, on the one hand, ShMem -
GPU All only outperforms ShMem CPU for the largest matrix (A252) and in the context
of 2 CPU threads. This result was expected, as the GPU requires large volumes of compu-
tations to truly leverage the device and hide the overhead due to memory transfers. On the
other hand, ShMem Thres is able to accelerate the multi-core counterpart for all covered
cases; see Figure 4.1.
Table 4.2: Runtime (in seconds) of the three task-parallel shared-memory variants in platform
Brahms.
# threads   Matrix   Iters.   tot. SpMV   R(x∗)       Version         tot. Prec   tot. PCG
2           A159     88       2.30        1.39E-008   ShMem CPU       29.55       32.86
                                                       ShMem GPU All   44.33       47.46
                                                       ShMem Thres     20.46       23.83
2           A171     97       3.07        1.52E-008   ShMem CPU       39.43       43.87
                                                       ShMem GPU All   48.02       52.36
                                                       ShMem Thres     30.62       35.19
2           A200     107      5.83        2.45E-008   ShMem CPU       71.58       79.98
                                                       ShMem GPU All   84.37       92.61
                                                       ShMem Thres     47.73       56.26
2           A252     131      13.86       3.23E-008   ShMem CPU       175.66      195.67
                                                       ShMem GPU All   153.48      173.62
                                                       ShMem Thres     120.19      140.50
4           A159     88       1.30        9.96E-009   ShMem CPU       22.72       24.55
                                                       ShMem GPU All   44.82       46.40
                                                       ShMem Thres     15.21       17.15
4           A171     95       1.58        2.20E-008   ShMem CPU       22.43       24.76
                                                       ShMem GPU All   57.84       59.78
                                                       ShMem Thres     17.50       19.87
4           A200     108      3.13        1.06E-008   ShMem CPU       40.34       45.03
                                                       ShMem GPU All   108.37      112.41
                                                       ShMem Thres     33.80       38.60
4           A252     130      8.25        2.16E-008   ShMem CPU       104.21      116.37
                                                       ShMem GPU All   193.19      204.74
                                                       ShMem Thres     90.05       104.60
This result reveals the potential benefit that arises from overlapping
computations on both devices. Hence, even in cases where the matrices present modest
dimensions, this version outperforms the highly tuned multi-core version. Additionally, the
benefits related to the use of the GPU are similar for all matrices of each configuration,
though the percentage of improvement is a bit higher for the smaller cases. This behavior is
not typical for GPU-based solvers, and one possible explanation is that the smaller cases are
near the optimal point (from the threshold perspective) while the largest cases are almost
able to compute 2 levels in the GPU. This can be confirmed in Table 4.3, where we add a variant
that computes the first 2 levels on the accelerator. As the multilevel factorization generates
only 3 levels, with the third one very small with respect to the other two, it is not surprising
that the runtimes of this version are almost equivalent to those of ShMem GPU All. The
table shows how the penalty of computing the second level in the GPU decreases as the
problem dimension grows.
Finally, ShMem Thres also offers higher performance improvements for the 2-thread
case than for its 4-thread counterpart.
Table 4.3: Runtime (in seconds) of ShMem Thres, adjusting the threshold to compute 1 and 2
levels in the GPU, in platform Brahms.
# threads   Matrix   ShMem Thres (1 lev)   ShMem Thres (2 lev)   ShMem GPU All
2           A159     23.83                 43.79                 44.33
2           A171     35.19                 47.52                 48.02
2           A200     56.26                 84.16                 84.37
2           A252     140.50                153.79                153.48
4           A159     17.15                 44.54                 44.82
4           A171     19.87                 57.21                 57.84
4           A200     38.60                 108.70                108.37
4           A252     104.60                185.72                193.19
Table 4.4: Runtime (in seconds) of the serial CPU version of ILUPACK’s PCG and its corre-
sponding GPU-accelerated version. The speed-up in the table can serve as an upper bound of the
maximum possible improvement when leveraging the GPUs in the distributed memory variant in
platform Falla.
To estimate the improvement that one can expect by including the GPU in the distributed-memory version, Table 4.4 reports the
speed-up experienced when comparing the execution of the serial CPU version and the cor-
responding GPU-accelerated version in the target platform. In the table, we report the total
runtime of PCG and the residual error achieved in each case, which is calculated according
to Equation (3.1).
Table 4.5 shows the results obtained for the original distributed-memory version and the
GPU-enabled code (single node) for the different evaluated test instances. The experiments
were run with 2, 4 and 8 tasks that are distributed among the four nodes in a round-robin
fashion. For the instances consisting of 2 and 4 tasks, each task resides in a different node
and uses a different GPU. For the instance with 8 tasks there are two tasks per compute
node, but each task runs on a distinct GPU.
The number of iterations and residual errors obtained for the CPU and GPU versions
differ slightly, especially in the case with 8 tasks. These discrepancies are due to the distinct
numerical properties of the task-parallel preconditioners and do not represent a significant
variation in the accuracy. The results for the A400 benchmark case with 2 tasks were not
included because the per-task memory requirements of this test case exceed the 64 GB of
RAM available in each compute node. This illustrates one of the main benefits
of our novel approach. Our previous efforts addressed the inclusion of GPUs in the shared-
memory version of the task-parallel ILUPACK. Obviously, using only one compute node
strongly limited the number of devices, as well as the amount of GPU memory available
and, therefore, the size of the problems that could be tackled with our previous parallel
version of ILUPACK.
Table 4.5: Runtime (in seconds) of version DistMem Thres in platform Falla. The experiments
were performed setting ILUPACK to use 2, 4, and 8 tasks. In the executions with 2 and 4 tasks,
each one resides in a different node with one GPU. In the case of the execution with 8 tasks, there
are 2 tasks per compute node, but each one uses a different GPU.
Table 4.6: An estimation of the memory overhead due to the partitioning for two of the benchmark
cases. Sum represents the sum of the dimensions of the leaves, Overhead is the difference between
Sum and the original dimension of the coefficient matrix, while Ratio is the ratio (in %) between
Overhead and the original dimension of the coefficient matrix.
                                        Number of leaves
Matrix                   2           4           8           16          32
A050    Sum              127,518     134,821     152,201     191,130     274,540
        Overhead         2,518       9,821       27,201      66,130      149,540
        Ratio (%)        2.01        7.86        21.76       52.90       119.63
A159    Sum              4,047,709   4,121,004   4,298,050   4,715,372   5,591,299
        Overhead         28,030      101,325     278,371     695,693     1,571,620
        Ratio (%)        0.70        2.52        6.93        17.31       39.10
At this point we note that there is a certain memory overhead caused by the initial
partitioning of the matrix, as some blocks have to be shared across several processes. This
can be appreciated in the expression in (2.99). In order to assess the importance of this
overhead, we evaluated the difference between the sum of the dimensions of the sub-matrices
corresponding to the leaves of the task tree and the original dimension of the coefficient
matrix. This distance was computed for two of the benchmark matrices, one small and the
other of moderate size, generating partitions of 2, 4, 8, 16 and 32 leaf tasks. Table 4.6 shows
that this overhead increases with the granularity of the partition, and the ratio between this
overhead and the original dimension of the matrix is much smaller for the larger matrix.
This means that, in order to improve the memory efficiency, one should partition the matrix
only when there is not enough memory in one node to perform the factorization and solution.
Moreover, this scheme is suitable to address large problems, where the relative memory cost
of the partitioning is less important.
Regarding the execution times, the GPU-enabled variant outperforms the CPU version
with an equal number of processes, except for the smaller cases of the benchmark when
using 8 tasks. Utilizing the GPU offers a 2× speed-up for the largest case. Moreover, the
advantage of using the GPUs tends to increase with the dimension of the problem.
The low performance of the smaller test cases can be easily explained, as smaller work-
loads make it more difficult to fully occupy the GPU resources and strongly increase the
relative cost associated to CPU-GPU communications. Additionally, partitioning the work
into many tasks constrains the data-parallelism and undermines the performance of some
GPU kernels.
Considering the previous discussion, it is obvious that our solution cannot attain strong
scalability [61] (i.e., a linear growth of the speed-up when augmenting the number of compute
resources for a problem of fixed size). However, as can be appreciated in Figure 4.2, the
results show that we are close to attaining weak scalability (i.e., maintaining the speed-up
when augmenting the compute resources in the same proportion as the problem size). For
example, if we take the number of nonzero values of the matrices as a rough estimation
of the problem size, the execution of the A159 problem in 2 processors/GPUs, A252 in 4
processors/GPUs, and A400 in 8 processors/GPUs all present similar acceleration factors.
Furthermore, when analyzing the speed-ups in Table 4.5, it should be taken into account
Figure 4.2: Execution time (in seconds) of DistMem Thres for each of the executed test cases
(left) and acceleration obtained by the inclusion of the GPU over DistMem CPU (right) in platform
Falla. The curves correspond to the CPU and GPU variants executed with 2, 4, and 8 tasks.
that only (part of) the application of the preconditioner in the leaf tasks is accelerated using
GPUs, while the rest of the PCG operations, though potentially parallelizable, currently
belong to the unaccelerated part of the program.
CHAPTER 5
Conclusions
This chapter closes the thesis by offering some final comments about the work described
previously. It also presents an outline of the publications related to this dissertation, and
closes with a discussion of several potential lines of future work.
• Aliaga, J. I., Bollhöfer, M., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2014).
Leveraging data-parallelism in ILUPACK using graphics processors. In IEEE 13th
International Symposium on Parallel and Distributed Computing, ISPDC 2014, Mar-
seille, France, June 24-27, 2014 , pages 119–126
• Aliaga, J. I., Bollhöfer, M., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S.
(2016a). A data-parallel ILUPACK for sparse general and symmetric indefinite linear
systems. In Euro-Par 2016: Parallel Processing Workshops - Euro-Par 2016 Inter-
national Workshops, Grenoble, France, August 24-26, 2016, Revised Selected Papers,
pages 121–133
• Aliaga, J. I., Bollhöfer, M., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2018).
Extending ILUPACK with a task-parallel version of BiCG for dual-GPU servers. In
Proceedings of the 9th International Workshop on Programming Models and Applica-
tions for Multicores and Manycores, PMAM@PPoPP 2018, February 25, 2018, Vi-
enna, Austria, pages 71–78
• Aliaga, J. I., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. Accelerating the
task/data-parallel version of ILUPACK's BiCG in multi-CPU/GPU configurations.
Under peer review at the Journal of Parallel Computing.
• Aliaga, J. I., Bollhöfer, M., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2018).
Extending ILUPACK with a GPU version of the BiCGStab method. In XLIV Latin
American Computing Conference, CLEI 2018, Sao Paulo, Brazil.
• Aliaga, J. I., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2016b). Design
of a task-parallel version of ILUPACK for graphics processors. In High Performance
Computing - Third Latin American Conference, CARLA 2016, Mexico City, Mexico,
August 29 - September 2, 2016, Revised Selected Papers, pages 91–103
• Aliaga, J. I., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2017b). Over-
coming memory-capacity constraints in the use of ILUPACK on graphics processors.
In 29th International Symposium on Computer Architecture and High Performance
Computing, SBAC-PAD 2017, Campinas, Brazil, October 17-20, 2017 , pages 41–48
Jetson
Also in the line of energy efficiency, we adapted and evaluated the performance of our
data-parallel version of ILUPACK for SPD systems on a low-power Jetson TX1 platform,
equipped with a Tegra GPU and low-power ARM processors. Although the use of these
devices implies an important performance penalty, the results suggest that they can be
convenient in cases where the size of the problem does not allow regular GPUs to be fully
exploited, considering their low power consumption.
The contributions relative to this topic are summarized in the following articles:
• Aliaga, J. I., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2017a). Evaluating
the NVIDIA tegra processor as a low-power alternative for sparse GPU computations.
In High Performance Computing - 4th Latin American Conference, CARLA 2017,
Buenos Aires, Argentina, and Colonia del Sacramento, Uruguay, September 20-22,
2017, Revised Selected Papers, pages 111–122
explored several lines of work that led to efficient implementations of the most important
solvers in ILUPACK, but during the process we have detected some directions in which this
work can be extended.
Next we provide some details about these aspects:
tion mechanisms introduced in CUDA 9.0. These developments are too recent to be
covered in this thesis, so they will be addressed as part of future work.
Appendices
APPENDIX A

Solution of sparse triangular linear systems
Many numerical linear algebra methods, such as the direct solution of sparse linear systems,
the application of preconditioners based on incomplete factorizations, and least squares
problems, require the solution of sparse triangular linear systems as one of their most
important building blocks [56].
this kernel over the years, and the efforts towards developing efficient implementations for
almost all competitive hardware platforms.
This operation poses serious challenges regarding its parallel execution as, in general, the
elimination of one unknown depends on the previous elimination of others. Additionally,
the triangular structure of the matrices is prone to create load imbalance issues. Although
in the case of dense systems, these limitations can be partially overcome by rearranging the
operations of the solver into a sequence of smaller triangular systems and dense matrix-
vector multiplications [24], in the sparse case this solution does not offer any benefit in most
cases. However, the sparsity of the matrices often allows different unknowns to be eliminated
in parallel (e.g. consider the extreme case of a diagonal matrix).
The parallel algorithms for this operation can be classified into two main contrasting
categories. On the one hand we find two-stage methods, which rely on a preprocessing
stage that analyzes the data dependencies to determine a scheduling for the elimination
of the unknowns that reveals as much parallelism as possible. On the other hand,
one-stage methods are based on a self-scheduled pool of tasks, in which some of the tasks
have to wait until the data necessary to perform their computations is made available by
other tasks1. Both paradigms are a good option for some instances and not so good for
others, hence making it impossible to offer a general result that determines which is the best
method. This decision problem is one of our lines of ongoing work and some preliminary
concepts are presented in [46].
1 Diagonal inverse-based methods [13, 12, 83, 14] and iterative approaches [19, 104, 101] are two alter-
native categories that can be considered. These are interesting strategies, but they face important difficulties
or are not general enough to be widely applicable.
Graphics Processors have evolved to become the top parallel device architecture, so the
development of algorithms that are able to exploit their benefits is of the utmost importance.
To this day, the most widespread GPU implementation of sparse triangular solvers is the one
distributed with the Nvidia cuSparse library2 , which belongs to the first class of methods.
Although the implementation is highly efficient, and has been improved and adapted to
new Nvidia architectures over the years, it has two main disadvantages. First, the analysis
phase, which determines the execution schedule, is very expensive. Second, this strategy
often pays the cost of constant synchronizations with the CPU [46].
As an alternative, a synchronization-free method was recently proposed by W. Liu et
al. [70]. The method is based on a dynamically-scheduled strategy and involves only a
light-weight analysis of the matrix before the solution phase. However, it has the potential
disadvantage of making extensive use of GPU atomic operations. Additionally, it targets
sparse triangular matrices stored in the CSC format, which is not as ubiquitous as CSR,
since the latter offers a much higher performance in other common operations like the sparse
matrix-vector product [56].
This appendix offers details about a new GPU algorithm, presented in [44], for the
solution of sparse triangular linear systems in which the coefficient matrix is stored in the
CSR format. The algorithm belongs to the second class of methods, i.e. it does not involve
a pre-processing stage, and it is also synchronization-free.
We also present a similar strategy to compute the depth of each node of the graph represented
by the sparse matrix. This information can later be used to calculate the same level sets
as those computed by cuSparse. Moreover, we design a routine that, if provided with the
sparse matrix coefficients and a right-hand side vector, is able to combine the computation
of the level sets and the solution of a linear system with little extra work. In contrast
with the routines offered by cuSparse, neither our analyzer nor our solver needs to
synchronize with the CPU until the end of the computations.
The procedure for solving a linear system of the form

    Lx = b,                                                         (A.1)

where L ∈ R^{n×n} is a (lower) sparse triangular matrix, b ∈ R^n is the right-hand side
vector, and x ∈ R^n is the sought-after solution, is called forward substitution. It consists
of substituting the values of the already solved unknowns into the subsequent equations,
i.e., multiplying the coefficients of these rows by their corresponding unknowns and
subtracting the obtained values from the right-hand side, before dividing by the diagonal
element in order to solve each new unknown. This clearly implies some serialization, since
an equation cannot be solved before all the unknowns on which it depends have been
substituted by their actual values.
Figures A.1 and A.2 present serial versions of the algorithm for solving a sparse lower
triangular linear system Lx = b, where the matrix L is stored in the CSR and CSC sparse
storage formats, respectively. The procedures differ in order to ensure data locality in the
accesses to the sparse matrix.
2 Available at [Link] as part of the CUDA Toolkit.
Figure A.3: Nonzero pattern of a lower triangular sparse matrix (top), its DAG (center), and its
level-set reordering (bottom).
A commonly used algorithm to compute the level sets in a DAG finds a topological
ordering of the graph using a variation of Kahn's algorithm [65]. It consists of successively
finding all the nodes that have no incoming edges (root nodes), adding them to the ordered
list, and removing their outgoing edges from the graph. In a given iteration, the root nodes
are the nodes that can be processed in parallel, given that they have no dependencies, and
hence form the current level. Removing their outgoing edges ensures that the root nodes of
one iteration are the nodes that depended only on the root nodes of the previous iteration.
After all nodes have been processed, this procedure retrieves an ordered list of nodes iorder,
where those that belong to the same level are grouped together, and a list of pointers ilevels
that indicates the starting position of each level in iorder.
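A minimal host-side sketch of this procedure is given below, assuming the lower triangular
matrix is provided in CSC form, so that column j lists the outgoing edges of node j; the
names iorder and ilevels follow the text, while everything else is illustrative and not taken
from cuSparse or the thesis sources.

    #include <stdlib.h>

    /* Level-set construction via Kahn's algorithm. ilevels must have room for
     * (number of levels + 1) entries; on exit ilevels[l] is the position in
     * iorder where level l starts. */
    void level_sets(const int *col_ptr, const int *row_idx, int n,
                    int *iorder, int *ilevels, int *nlevels)
    {
        int *indegree = calloc(n, sizeof(int));
        /* in-degree of node i = number of off-diagonal nonzeros in row i */
        for (int j = 0; j < n; j++)
            for (int p = col_ptr[j]; p < col_ptr[j + 1]; p++)
                if (row_idx[p] != j) indegree[row_idx[p]]++;

        int done = 0, begin = 0, level = 0;
        ilevels[0] = 0;
        for (int i = 0; i < n; i++)              /* initial root nodes */
            if (indegree[i] == 0) iorder[done++] = i;

        while (done > begin) {
            int end = done;                      /* [begin, end) is the current level */
            ilevels[++level] = end;
            for (int k = begin; k < end; k++) {
                int j = iorder[k];
                /* remove the outgoing edges of j (off-diagonal entries of column j) */
                for (int p = col_ptr[j]; p < col_ptr[j + 1]; p++) {
                    int i = row_idx[p];
                    if (i != j && --indegree[i] == 0)
                        iorder[done++] = i;
                }
            }
            begin = end;
        }
        *nlevels = level;
        free(indegree);
    }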
In [75] Naumov presented a GPU implementation of this approach. The implementation
is based on three GPU kernels that are executed iteratively until all the nodes of the graph
have been processed. The find roots kernel loads a buffer with the list of root nodes in the
current graph. Later, the analyze kernel processes the list of root nodes and the current
graph, removing the dependencies of the current root nodes. To optimize the process, it
also produces a list of candidates to be root nodes in the next iteration, so that the find
roots kernel only needs to process this list. Naumov utilizes two buffers for storing root
nodes. One of them stores the root nodes that need to be processed in the current iteration,
while the other is used to write the root nodes of the next iteration. To avoid copying data
between these two arrays, Naumov simply flips the pointers to the arrays at the end of each
iteration. To the best of our knowledge, the implementation distributed with the cuSparse
library follows these ideas.
Another approach that could be used to perform the analysis phase of the parallel Sp-
TrSV is based on computing the depth of each node. In [105], Wing and Huang define
the depth of a node vi as the maximum distance from an initial node to vi . In the case of
DAGs that represent task dependencies, the depth of a node i represents the minimum step
at which task i will have its dependencies fulfilled. With this definition, a level of the DAG
can be seen as a group of nodes that share the same depth. As stated by Anderson and Saad
in [18], the depth of each node can be calculated in one sweep through the adjacency matrix
L as follows:

    depth(i) = 1                                          if l_ij = 0 for all j < i,
    depth(i) = max_{j < i} { 1 + depth(j) : l_ij ≠ 0 }    otherwise.
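A direct sequential transcription of this recurrence for a lower triangular CSR matrix, with
the diagonal stored as the last nonzero of each row, could look as follows; this is only an
illustration, not the massively parallel routine described later in this appendix.

    /* One sweep over the rows of L computing the depth of every node. */
    void compute_depths(const int *row_ptr, const int *col_idx, int n, int *depth)
    {
        for (int i = 0; i < n; i++) {
            int d = 0;
            /* maximum depth among the dependencies (off-diagonal columns) of row i */
            for (int k = row_ptr[i]; k < row_ptr[i + 1] - 1; k++)
                if (depth[col_idx[k]] > d) d = depth[col_idx[k]];
            /* rows with only a diagonal entry get depth 1 */
            depth[i] = (row_ptr[i + 1] - row_ptr[i] == 1) ? 1 : 1 + d;
        }
    }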
Once the depth of each node is calculated, the iorder vector can be computed in O(n)
time, where n is the dimension of the matrix. One possible way of calculating such an
ordering could be the following:
• Given an array idepth of length n containing the depth of each node, obtain the max-
imum, which is equal to the total number of levels.
• Allocate the vector ilevels and initialize it such that ilevels(i) contains the number of
nodes in level i.
• Performing a scan operation on this vector will yield the starting position of each level
in the final iorder array, which is the final content of ilevels.
• Maintaining an offset variable for each level, assign each node j to the iorder array as
follows (see the sketch after this list):
iorder(ilevels(idepth(j)) + offset(idepth(j))) = j
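The steps above amount to a counting sort of the nodes by their depth. A compact sequential
sketch is shown next, assuming the idepth array has already been computed and that depths
are numbered from 1; the names are illustrative.

    #include <stdlib.h>

    /* Build ilevels (start of each level in iorder) and iorder from idepth.
     * ilevels must have at least (maximum depth + 1) entries. */
    void levels_from_depths(const int *idepth, int n, int *ilevels, int *iorder)
    {
        int nlev = 0;
        for (int i = 0; i < n; i++)              /* maximum depth = number of levels */
            if (idepth[i] > nlev) nlev = idepth[i];

        for (int l = 0; l <= nlev; l++) ilevels[l] = 0;
        for (int i = 0; i < n; i++) ilevels[idepth[i]]++;   /* nodes per level */

        int start = 0;                           /* exclusive scan: level start offsets */
        for (int l = 1; l <= nlev; l++) {
            int count = ilevels[l];
            ilevels[l] = start;
            start += count;
        }

        int *offset = calloc(nlev + 1, sizeof(int));
        for (int j = 0; j < n; j++)              /* place each node inside its level */
            iorder[ilevels[idepth[j]] + offset[idepth[j]]++] = j;
        free(offset);
    }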
main paradigms that have prevailed over time and present well-known GPU implementations,
namely, the level-set and synchronization-free approaches.
iteration. The values are then reduced by a warp-voting primitive that returns one if the
value in the corresponding register is nonzero for all the active threads in the warp, and
returns zero otherwise. We keep two versions of the ready vector, one in global memory and
the other distributed across the shared memory of the different thread blocks. Each time
a warp updates its corresponding ready value, it must do so both in global memory and in
the shared memory of its block. Thus, when a warp depends on a value that is
to be updated by another warp of the same thread block, the thread polling for that value
can do so in the much faster shared memory. This situation is common in matrices that
concentrate most nonzero entries close to the diagonal. We also use a similar mechanism to
update and fetch the value of the unknowns that are solved by a warp in the same thread
block.
For this process to be successful, it is necessary that the warp scheduler of the SM
activates the warps that will fulfill the dependencies of the current warp between one iteration
and the next. We rely on the fact that accessing the ready vector will cause the warp to halt
waiting for the value to be fetched from memory. Furthermore, although the ready vector
is initially stored in global memory, there will be only a moderate number of entries being
polled at a given moment, so they will hopefully be read from the cache most of the time.
The CUDA execution model specifies that only a limited number of blocks can be resident in
each Streaming Multiprocessor (SM). The warps that belong to these blocks have all their
resources allocated in the SM and can issue an instruction as soon as their operands are ready.
Non-resident blocks, however, have to wait until resident blocks finish their execution, so
no resident warp should wait for data that is to be produced by a non-resident one, since it
would cause a deadlock. The triangular structure of the matrix ensures that each unknown
depends on others of lower numbering. If each warp processes the row corresponding to
its warp identifier, no deadlock should occur as long as the warp identifiers of the active
(or resident) warps are always lower than those of the non-resident warps. Unfortunately,
although the evidence suggests that this property holds for most GPU architectures (if not
all), the order in which blocks are issued to the SMs is not specified by the manufacturer.
For this reason, we use a global variable in the GPU memory that the first thread of each
block must read and increment before commencing its computations. The value read from
this variable is used instead of the block identifier to form the warp identifier. This has a
cost of one atomicAdd per block, which is negligible in practice.
Once all the dependencies have been met, the algorithm can move on to the multiplying
stage, in which each thread of the warp multiplies a nonzero value of the current row by the
appropriate entry of the vector of unknowns, accumulating the result in a register.
If there are more than 32 nonzero elements in the row, the warp moves back to busy-
waiting for the solution of the unknowns that correspond to the next 32 nonzero elements.
The cycle is repeated until every nonzero in the row has been processed.
After the multiplying stage, the registers that accumulate the products performed by
each thread of the warp are reduced using warp shuffle operations; the first thread of the
warp then divides the result by the diagonal pivot and stores it in the vector of unknowns.
Finally, the first thread of the warp is responsible for marking the unknown as solved in
of the same thread block, we can fetch and update the corresponding values in the shared
memory instead of using the much slower global memory. We do not include this feature in
the outline of Figure A.5 for the sake of simplicity.
Table A.1: Some features of the employed matrices. Levels is the number of levels in the DAG
computed by the analysis phase.
Table A.1 lists, for each test matrix, its dimension, number of non-zeros and the amount of
levels produced by the analysis routine of cuSparse. As the exploitation of parallelism in
the level-based approach consists in processing the rows that belong to a given level in
parallel, dividing the number of rows by the number of levels gives a rough estimation of
the parallelism available for that matrix.
The selection of test cases is the same as that of [71], in order to make sure that the
results obtained for the CSC Sync-Free implementation are comparable. For the evaluation
of the solver routine we used the first set of 10 matrices, while for the analysis routine we
used the whole set of 20. For the evaluation of the combined solver we used only the first
set, performing a study similar to the one conducted in [46].
Ravel
Equipped with an Intel(R) Core(TM) i7-6700 processor (4 cores / 8 threads at 3.40 GHz)
and 64 GB of DDR3 RAM. The platform also features a GTX 1060 (Pascal) GPU with
1,152 CUDA cores and 3 GB of GDDR5 RAM. The compiler for this platform is gcc 14.0.0,
and for the GPU codes we use the CUDA/cuSparse 8.0 libraries.
Sibelius
Equipped with an Intel(R) Core(TM) i7-6700 processor (4 cores / 8 threads at 3.40 GHz)
and 64 GB of DDR3 RAM. The platform also features an NVIDIA GTX 1080 Ti (Pascal) GPU with
3,584 CUDA cores and 11 GB of GDDR5X RAM. The compiler for this platform is gcc
14.0.0, and we use the CUDA/cuSparse 8.0 libraries for the GPU codes.
Table A.2: Runtime (in milliseconds) to solve triangular sparse linear systems with single and double precision in Beethoven.
                    cuSparse                          CSC Sync-free                            Speedup vs. cuSparse    Speedup vs. CSC S-F
Matrix              Analysis   Resolve   Total        Analysis   Resolve   Total    Proposal   Solve    +Anal.          Solve    +Anal.
single precision
nlpkkt160 856,66 15,67 872,33 7,17 45,40 52,57 45,88 0,34 19,01 0,99 1,15
hollywood-2009 1.877,85 1.312,37 3.190,22 5,17 214,37 219,53 129,65 10,12 24,61 1,65 1,69
road central 456,71 22,01 478,72 9,34 81,82 91,16 77,87 0,28 6,15 1,05 1,17
road usa 789,55 34,66 824,20 5,05 165,47 170,52 157,06 0,22 5,25 1,05 1,09
ship 003 83,03 39,90 122,93 0,20 11,04 11,24 7,09 5,63 17,33 1,56 1,59
webbase-1M 40,04 4,70 44,74 0,23 7,10 7,33 7,10 0,66 6,30 1,00 1,03
wiki-Talk 75,65 15,02 90,67 0,36 10,54 10,91 10,04 1,50 9,03 1,05 1,09
crankseg 1 87,44 38,07 125,51 0,27 17,29 17,56 10,47 3,63 11,98 1,65 1,68
cant 44,12 23,49 67,60 0,12 5,12 5,25 4,13 5,69 16,37 1,24 1,27
double precision
chipcool0 12,30 4,07 16,36 0,04 1,70 1,73 1,01 4,02 16,19 1,68 1,72
nlpkkt160 857,16 18,83 875,99 7,18 126,99 134,17 57,11 0,33 15,34 2,22 2,35
hollywood-2009 1.879,79 1.390,01 3.269,79 5,22 341,45 346,67 176,63 7,87 18,51 1,93 1,96
road central 459,40 25,92 485,32 9,33 114,92 124,25 93,42 0,28 5,20 1,23 1,33
road usa 790,26 40,62 830,88 5,09 239,26 244,35 192,00 0,21 4,33 1,25 1,27
ship 003 84,21 41,78 125,99 0,21 16,86 17,07 9,16 4,56 13,76 1,84 1,86
webbase-1M 39,90 4,99 44,89 0,23 11,33 11,56 8,72 0,57 5,15 1,30 1,32
wiki-Talk 76,10 15,73 91,83 0,36 14,54 14,91 12,91 1,22 7,11 1,13 1,15
crankseg 1 87,25 40,26 127,51 0,25 26,87 27,12 13,99 2,88 9,12 1,92 1,94
cant 43,62 24,52 68,14 0,12 9,29 9,41 5,09 4,82 13,39 1,83 1,85
Table A.3: Runtime (in milliseconds) to solve triangular sparse linear systems with single and double precision in Ravel. The missing runtimes correspond to
matrices that did not fit in the GPU memory of the GTX1060.
                    cuSparse                          CSC Sync-free                            Speedup vs. cuSparse    Speedup vs. CSC S-F
Matrix              Analysis   Resolve   Total        Analysis   Resolve   Total    Proposal   Solve    +Anal.          Solve    +Anal.
single precision
chipcool0 7,42 1,61 9,02 0,06 0,38 0,44 0,37 4,31 24,19 1,02 1,17
nlpkkt160 - - - - - - - - - - -
hollywood-2009 1.186,62 515,20 1.701,82 6,88 123,95 130,83 78,00 6,61 21,82 1,59 1,68
road central 393,85 18,75 412,59 10,14 64,12 74,26 63,11 0,30 6,54 1,02 1,18
road usa - - - - - - - - - - -
ship 003 60,46 17,20 77,66 0,79 6,38 7,17 4,47 3,85 17,39 1,43 1,61
webbase-1M 32,69 2,56 35,25 1,11 4,50 5,61 5,13 0,50 6,87 0,88 1,09
wiki-Talk 54,63 7,01 61,64 1,27 6,82 8,08 6,68 1,05 9,23 1,02 1,21
crankseg 1 61,05 15,91 76,95 0,74 9,78 10,51 6,12 2,60 12,57 1,60 1,72
cant 30,99 10,95 41,94 0,48 2,82 3,30 2,21 4,95 18,94 1,28 1,49
double precision
chipcool0 7,45 1,88 9,34 0,30 0,56 0,86 0,61 3,07 15,23 0,91 1,40
nlpkkt160 - - - - - - - - - - -
hollywood-2009 - - - - - - - - - - -
road central 394,08 29,37 423,45 10,22 86,05 96,27 101,41 0,29 4,18 0,85 0,95
road usa - - - - - - - - - - -
ship 003 59,53 18,96 78,49 0,76 7,59 8,34 7,48 2,53 10,49 1,01 1,11
webbase-1M 32,67 3,59 36,26 1,05 6,09 7,14 8,61 0,42 4,21 0,71 0,83
wiki-Talk 54,89 9,82 64,70 1,28 11,56 12,84 15,79 0,62 4,10 0,73 0,81
crankseg 1 60,69 18,43 79,11 0,71 11,60 12,31 10,57 1,74 7,49 1,10 1,17
cant 31,49 11,24 42,73 0,50 3,34 3,84 3,48 3,23 12,27 0,96 1,10
Experimental Results
Table A.4 summarizes the runtimes obtained by the level-based routine and by our variant
to perform the analysis phase of the SpTrSV on both hardware platforms employed. The
acceleration factor, computed as the level-based analysis time divided by the runtime of our
proposal, is also provided.
The differences between the execution times obtained on the two graphics cards vary
according to the test instance. In general, the benchmark executed faster on the Sibelius
GPU, as expected, with differences that range from marginal to more than 2×. There are,
however, a few counterintuitive results but, as in both algorithms the execution schedule
can be greatly affected by the sparsity pattern of the matrix, a much deeper and more
complex analysis is required in order to explain them adequately.
The runtimes obtained in the experiments show a clear advantage of our approach. Our
routine to compute the depth of the nodes in the DAG presents acceleration factors of up
to 44× with respect to the level-based analysis routine for the evaluated instances. The most
significant acceleration is obtained on the Sibelius GPU for the nlpkkt160 case, which is the
largest instance in terms of memory usage.
It should be noted that the cuSparse analysis of the nlpkkt160 and road usa matrices
failed on the Ravel device, as the 3 GB of DRAM in the accelerator were not sufficient to
host the additional data structures generated by the routine. This reveals another advantage
of our method, as we only need to allocate space for the auxiliary is_solved vector, while the
cuSparse routine needs an amount of additional memory proportional to the number of
Table A.4: Runtime (in milliseconds) of the analysis phase of CUSPARSE and our analysis routine
in platforms Ravel and Sibelius.
Matrix              cuSparse time    Proposal time    Speedup
Ravel
cant 32.00 2.18 14.68
chipcool0 8.27 0.46 17.98
crankseg 1 61.43 5.94 10.34
hollywood-2009 1,179.61 76.48 15.42
nlpkkt160 - 25.61 -
road central 465.00 57.59 8.07
road usa - 128.20 -
ship 003 61.00 4.30 14.19
webbase-1M 37.49 4.67 8.03
wiki-Talk 63.87 5.85 10.92
cit-HepTh 5.70 0.29 19.66
dc2 13.75 0.83 16.57
epb3 12.61 11.65 1.08
g7jac140sc 7.04 0.43 16.37
lung2 9.58 1.34 7.15
rajat18 7.64 0.71 10.76
rajat25 8.66 0.94 9.21
soc-sign-epinions 13.39 0.72 18.60
TSOPF RS b162 c4 7.29 0.69 10.57
wordnet3 5.61 0.14 5.61
Sibelius
cant 29.73 3.14 9.47
chipcool0 8.31 0.53 15.68
crankseg 1 54.37 4.33 12.56
hollywood-2009 1,113.87 114.75 9.71
nlpkkt160 424.75 9.66 43.97
road central 271.23 20.77 13.06
road usa 464.23 46.19 10.05
ship 003 54.27 2.71 20.03
webbase-1M 30.27 2.15 14.08
wiki-Talk 48.80 2.87 17.00
cit-HepTh 5.67 0.31 18.29
dc2 13.59 0.38 35.76
epb3 13.02 5.65 2.30
g7jac140sc 7.46 0.30 24.87
lung2 10.39 0.91 11.42
rajat18 8.47 0.42 20.17
rajat25 9.30 0.58 16.03
soc-sign-epinions 13.46 0.58 23.21
TSOPF RS b162 c4 6.07 0.55 11.04
wordnet3 5.59 0.33 16.94
Table A.5: Runtime (in ms) of our combined analysis and solution routine in platform Ravel. The
column overhead represents the additional cost of solving one triangular system during the analysis,
computed as

    overhead = (T_combined − T_analysis) / T_combined × 100.

The values in the last column are calculated as the added time of performing our analysis phase
followed by cuSparse's solution phase, divided by the execution time of the combined analysis and
solution routine.
Tables A.5 and A.6 show the results obtained for our first set of matrices on platforms
Ravel and Sibelius, respectively, considering the solution of the system using single
and double precision arithmetic. The results obtained on the two platforms coincide in that
the simultaneous solution of the triangular system has a relatively small impact on the
performance of the analysis when using single precision, while it can imply a much larger
overhead when using double precision. This is expected, and it is related to the large
performance gap between single and double precision on this type of graphics card.
3 See, for example, the cuSparse Library User's Guide DU-06709-001 v8.0, January 2017.
Table A.6: Runtime (in milliseconds) of our combined analysis and solution routine in platform
Sibelius. The column overhead represents the additional cost of solving one triangular system
during the analysis. The values in the last column are calculated as the added time of performing
our analysis phase followed by cuSparse’s solution phase, divided by the execution time of the
combined analysis and solution routine.
Regarding the convenience of the combined solution, the results suggest that this strat-
egy could yield important performance benefits in cases where only a few triangular linear
systems need to be solved. Moreover, for some of the tested instances, the runtime of our
combined routine is even smaller than that of cuSparse's solve phase. This means that it is
better to solve the subsequent triangular linear systems using our strategy (without performing
the analysis) than to use cuSparse.
__device__ int row_ctr = 0;   // global block counter, initialized to 0 (as noted in the original listing)
#define WARP_SIZE 32

__global__ void sptrsv_kernel(const int *__restrict__ row_ptr,
                              const int *__restrict__ col_idx,
                              const VALUE_TYPE *__restrict__ val,
                              const VALUE_TYPE *__restrict__ b,
                              VALUE_TYPE *x,
                              volatile int *is_solved, int n)  // volatile: re-read on every poll
{
    // Take the block identifier from the global counter instead of blockIdx.x,
    // so that lower-numbered rows are always assigned to resident blocks.
    __shared__ int blIdx;
    if (threadIdx.x == 0) blIdx = atomicAdd(&row_ctr, 1);
    __syncthreads();

    // Global warp identifier: one warp per row
    int wrp = (threadIdx.x + blIdx * blockDim.x) / WARP_SIZE;
    // Identifies the thread relative to the warp
    int lne = threadIdx.x & 0x1f;
    if (wrp >= n) return;

    int row     = row_ptr[wrp];
    int nxt_row = row_ptr[wrp + 1];

    VALUE_TYPE left_sum = 0;
    VALUE_TYPE piv = 1 / val[nxt_row - 1];   // the diagonal is the last nonzero of the row
    int ready;
    int lock = 0;
    if (lne == 0)
        left_sum = b[wrp];

    int off = row + lne;
    // Wait and multiply
    while (off < nxt_row - 1)
    {
        ready = is_solved[col_idx[off]];
        lock  = __all(ready);                // warp vote: are all required unknowns solved?
        if (lock)
        {
            left_sum -= val[off] * x[col_idx[off]];
            off += WARP_SIZE;
        }
    }
    // Reduction of the partial sums of the warp
    for (int i = 16; i >= 1; i /= 2)
        left_sum += __shfl_xor(left_sum, i, 32);

    if (lne == 0) {
        // Write the result
        x[wrp] = left_sum * piv;
        __threadfence();
        // Mark the equation as solved
        is_solved[wrp] = 1;
    }
}
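A minimal host-side driver for the kernel above could be organized as follows; the names
and launch configuration are assumptions for illustration (one warp is assigned per row,
and WARP_SIZE and row_ctr are the declarations that accompany the kernel):

    void sptrsv_csr_gpu(const int *d_row_ptr, const int *d_col_idx,
                        const VALUE_TYPE *d_val, const VALUE_TYPE *d_b,
                        VALUE_TYPE *d_x, int *d_is_solved, int n)
    {
        int zero = 0;
        cudaMemcpyToSymbol(row_ctr, &zero, sizeof(int));  /* reset the block counter  */
        cudaMemset(d_is_solved, 0, n * sizeof(int));      /* no unknown is solved yet */

        int threads = 256;                                /* 8 warps per thread block */
        int blocks  = (n * WARP_SIZE + threads - 1) / threads;
        sptrsv_kernel<<<blocks, threads>>>(d_row_ptr, d_col_idx, d_val, d_b,
                                           d_x, d_is_solved, n);
        cudaDeviceSynchronize();                          /* single sync, at the end  */
    }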
// Uses the same WARP_SIZE constant and global __device__ int row_ctr counter
// (initialized to 0) declared with the previous kernel.
__global__ void sptrsv_kernel(const int *row_ptr,
                              const int *col_idx,
                              int *depths,
                              volatile int *is_solved, int n)  // volatile: re-read on every poll
{
    __shared__ int blIdx;
    if (threadIdx.x == 0) blIdx = atomicAdd(&row_ctr, 1);
    __syncthreads();

    // Global warp identifier: one warp per row
    int wrp = (threadIdx.x + blIdx * blockDim.x) / WARP_SIZE;
    // Identifies the thread relative to the warp
    int lne = threadIdx.x & 0x1f;
    if (wrp >= n) return;

    int row     = row_ptr[wrp];
    int nxt_row = row_ptr[wrp + 1];

    int depth = 0;
    int ready = 0;
    int off   = row + lne;
    int colidx = (off < nxt_row - 1) ? col_idx[off] : 0;

    while (off < nxt_row - 1)
    {
        // Wait and update the local depth
        if (!ready)
        {
            ready = is_solved[colidx];
            if (ready) {
                depth = max(depth, depths[colidx]);
            }
        }
        if (__all(ready)) {
            off += WARP_SIZE;
            if (off < nxt_row - 1) colidx = col_idx[off];
            ready = 0;
        }
    }
    // Reduction: maximum depth among the dependencies of the row
    for (int i = 16; i >= 1; i /= 2)
        depth = max(depth, __shfl_down(depth, i));

    if (lne == 0) {
        // Write the result
        depths[wrp] = 1 + depth;
        __threadfence();
        // Mark the equation as solved
        is_solved[wrp] = 1;
    }
}
APPENDIX B
Balancing energy and efficiency in ILUPACK
For over two decades, the LINPACK benchmark [42] has been employed to compile per-
formance and throughput-per-power unit rankings of most of the world’s fastest supercom-
puters twice per year [1]. Unfortunately, this test boils down to the LU factorization [55],
a compute-bound operation that may not be representative of the performance and power
dissipation experienced by many of the complex applications running in current high per-
formance computing (HPC) sites.
The alternative High Performance Conjugate Gradients (HPCG) benchmark [2, 40] has
been recently introduced with the specific purpose of exercising computational units and
producing data access patterns that mimic those present in an ample set of important
HPC applications. This attempt to change the reference benchmark is crucial because such
metrics may guide computer architecture designers, e.g. from AMD, ARM, IBM, Intel and
NVIDIA, to invest in future hardware features and components with a real impact on the
performance and energy efficiency of these applications.
The HPCG benchmark consists of basic numerical kernels such as the sparse matrix-
vector multiplication (SpMV) and sparse triangular solve; basic vector operations as e.g.
vector updates and dot products; and a simple smoother combined with a multigrid pre-
conditioner. The reference implementation is written in C++, with parallelism extracted
via MPI and OpenMP [2]. However, in an era where general-purpose processors (CPUs) as
well as the Intel Xeon Phi accelerator contain dozens of cores, the concurrency level that is
targeted by this legacy implementation may be too fine-grain for these architectures. Fur-
thermore, the reference implementation is certainly not portable to heterogeneous platforms
equipped with graphics processing units (GPUs) comprising thousands of simple arithmetic
processing units (e.g., NVIDIA’s CUDA cores).
This appendix describes our study of the performance and energy efficiency of state-of-
the-art multicore CPUs and many-core accelerators, using the task and data-parallel versions
of ILUPACK described in this thesis. Compared with the HPCG benchmark, these multi-
threaded implementations of ILUPACK are composed of the same sort of numerical kernels
and, therefore, exhibit analogous data access patterns and arithmetic-to-memory operation
ratios. On the other hand, our task-parallel version of ILUPACK is likely better suited to
exploit the hardware parallelism of both general-purpose processors and the Intel Xeon Phi,
while our data-parallel implementation targets the large volume of CUDA cores in Nvidia
architectures.

Figure B.1: Algorithmic formulation of the preconditioned CG method. Here, τmax is an upper
bound on the relative residual for the computed approximation to the solution.
Additionally, we analyze a data-parallel variant of ILUPACK adapted to low-power hard-
ware platforms such as the NVIDIA Jetson TX1.
Our task-parallel version of ILUPACK employs the task-based programming model embed-
ded in the OmpSs1 framework to decompose the solver into tasks (routines annotated by the
user via OpenMP-like directives) as well as to detect data dependencies between tasks at
execution time (with the help of directive clauses that specify the directionality and size of
the task operands). With this information, OmpSs implicitly generates a task graph during
the execution, which is utilized by the CPU threads in order to exploit the task parallelism
implicit in the operation via a dynamic out-of-order but dependency-aware schedule.
1 [Link]
Let us consider the PCG iteration. The variables that appear in these operations define
a partial order which enforces an almost strict serial execution. Specifically, at the (k + 1)-th
iteration, the chain

    . . . → O7 → O1 → O2 → O4 → O5 → O6 → O7 → O1 → . . .
must be computed in that order, but O3 and O8 can be computed any time once O2 and
O4 are respectively available. Further concurrency is exposed by dividing the application of
the preconditioner into subtasks of finer granularity, in a form analogous to that described
in Section 2.4.2.
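The following sketch illustrates how this kind of decomposition is expressed under OmpSs;
the routine names, array sections and the mapping to the operation labels are illustrative
only and do not correspond to the actual ILUPACK sources.

    /* Illustrative OmpSs task annotations for two PCG building blocks.
     * The in/out clauses let the runtime build the task graph at execution
     * time, so operations that are independent in the partial order above
     * (e.g. O3 and O8) can proceed concurrently. */

    #pragma omp task in(x[0;n]) inout(y[0;n])
    void axpy_task(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];                 /* vector update */
    }

    #pragma omp task in(x[0;n], y[0;n]) out(res[0;1])
    void dot_task(int n, const double *x, const double *y, double *res)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i] * y[i];                     /* dot product */
        *res = s;
    }

    /* Inside the PCG loop the calls become task instantiations, e.g.
     *     dot_task(n, r, z, &rho);     // may overlap with independent tasks
     *     axpy_task(n, beta, p, z);    // ordering enforced via dependencies
     *     #pragma omp taskwait         // before scalars are used on the CPU
     */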
The data-parallel version used for the energy evaluation is the baseline GPU-aware vari-
ant of the CG method described in Section 3.2.
• sandy: A server equipped with two hexacore Intel Xeon E5-2620 (“Sandy Bridge”)
processors (total of 12 cores) running at 2.0 GHz with 32 Gbytes of DDR3 RAM. The
compiler is gcc 4.4.7.
• haswell: A system with two hexacore Intel Xeon E5-2603v3 (“Haswell”) processors
(total of 12 cores) at 1.6 GHz with 32 Gbytes of DDR4 RAM. The compiler is gcc
4.4.7.
• xeon phi: A board with an Intel Xeon Phi 5110P co-processor. (The tests on this
board were run in native mode and, therefore, the specifications of the server are
irrelevant.) The accelerator comprises 60 x86 cores running at 1,053 MHz and 8
Gbytes of GDDR5 RAM. The Intel compiler is icc 13.1.3.
• kepler: An NVIDIA K40 board (“Kepler” GK110B GPU with 2,880 cores) with
12 Gbytes of GDDR5 RAM, connected via a PCI-e Gen3 slot to a server equipped with
an Intel i7-4770 processor (4 cores at 3.40 GHz) and 16 Gbytes of DDR3 RAM. The
compiler for this platform is gcc 4.9.2, and the codes are linked to CUDA/cuSPARSE
6.5. This platform is the same as Beethoven, but is renamed in this section for
clarity purposes.
Other software included ILUPACK (2.4), the Mercurium C/C++ compiler/Nanox (releases
1.99.7/0.9a for sandy, haswell and xeon phi) with support for OmpSs, and METIS
(5.0.2) for the graph reorderings.
Power/energy was measured via RAPL in sandy and haswell, reporting the aggregated
dissipation from the packages (sockets) and the DRAM chips. For xeon phi we leveraged
the routine mic_get_inst_power_readings from the libmicmgmt library to obtain the power of
the accelerator. In kepler, we use RAPL to measure the consumption of the server's
package and DRAM, and the NVML library to obtain the dissipation of the GPU.
For the analysis, we employed the SPD linear system arising from the finite difference
discretization of a 3D Laplace problem in Section 3.1, with three instances of different size;
see Table B.1. In the experiments, all entries of the right-hand side vector b were initialized
to 1, and the PCG was started with the initial guess x0 ≡ 0. For the tests, the parameters
that control the fill-in and convergence of the iterative process in ILUPACK were set as
droptol = 1.0E-2, condest = 5, elbow = 10, and restol = 1.0E-6.
We use GFLOPS and GFLOPS/W to assess, respectively, the performance and energy
consumption of the parallel codes/platforms. ILUPACK is in part a memory-bound compu-
tation. Therefore, an alternative performance metric could have been based on the attained
memory transfer rate (Gbytes/s). Nevertheless, given that the data matrices are all off-chip,
and ILUPACK performs a number of flops that is proportional to the volume of memory
accesses, we prefer to stay with the GFLOPS metric. This measure has the advantage of
being more traditional in the HPC community.
The OmpSs runtime allows the user to trade off performance for power (and, hopefully, en-
ergy) consumption by controlling the behaviour of idle OmpSs threads, setting it to a range
of modes that vary between pure blocking (idle-wait) and polling (busy-wait). To execute
our application in blocking mode, we set the arguments --enable-block and --spins=1 in
the NX_ARGS Nanos environment variable. The first parameter enables the blocking mode
while the second one indicates the number of spins before an idle thread is blocked. For the
polling mode, we simply do not include the option --enable-block; we set --enable-yield,
which forces threads to yield on an idle loop and a conditional wait loop; and we set
--yields=1000000 to specify the number of yields before blocking an idle thread.
On heterogeneous platforms consisting of a CPU and a GPU, our data-parallel version off-
loads a significant part of the computations to the graphics accelerator, rendering the CPU
idle for a significant fraction of the execution. In this scenario, a potential source of energy
savings is to operate in the CUDA blocking synchronization mode, which allows the
operating system to put the CPU to sleep (i.e., to promote it into a deep C-state) when it is idle.
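As an illustration of this mode (the exact call is not shown in the thesis, so the snippet
below is only an assumed example of how it can be selected with the CUDA runtime API):

    #include <cuda_runtime.h>

    /* Assumed example: request blocking synchronization before the CUDA
     * context is created, so the calling CPU thread sleeps (instead of
     * spinning) while it waits for the GPU, allowing the core to drop
     * into a deep C-state. */
    int enable_blocking_sync(void)
    {
        return cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) == cudaSuccess;
    }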
In this study, we consider the following configuration parameters:
• Degree of software concurrency, i.e., the number of threads that execute the application.
• Operation “behaviour” of idle threads (CPU power states or C-states). A thread with-
out work can remain in an active state, polling for a new job to execute. Alternatively,
it can be suspended (blocked) and awakened when a new job is assigned to it. The
polling mode favors performance at the expense of higher power consumption in some
platforms. The blocking mode, on the other hand, can produce lower power con-
sumption, by allowing the operating system to promote the suspended core into a
power-saving C-state, but may negatively impact the execution time because of the
time it takes to reset the core into the active C0 state. The effect of these two modes
on energy efficiency is uncertain, as energy is the product of time and power.
• Operation frequency of active threads (CPU performance states or P-states). Active
threads can operate on a range of frequency/voltage pairs (P-states) that, for the Intel
platforms evaluated in this work, can only be set on a per socket basis (i.e., for all cores
of the same socket). These modes are controlled by the operating system, though the
user can provide some general guidelines via the Linux governor modes. In general,
the P-states provide a means to trade off performance for power dissipation for active
cores/sockets.
• Binding of threads to hardware cores. The degree of software concurrency translates
into the exploitation of a certain level of hardware parallelism depending on the map-
ping of the software threads to the hardware (physical) cores. For the execution of
numerical codes on general-purpose x86 CPUs, the standard approach maps one thread
per core. For specialized hardware such as the Intel Xeon Phi (as well as the IBM
Power A2), better results may be obtained by using 2 or 4 software threads per core.
Figure B.2: GFLOPS (left) and GFLOPS/W (right) obtained with the task-parallel version of
ILUPACK on sandy and haswell, using the blocking and polling modes for benchmark A318.
The initial experiments in the remainder of this subsection aim to tune the previous
configuration parameters for the execution of the task-parallel version of ILUPACK on
sandy, haswell and xeon phi. For this purpose, we select the largest dataset that fits
into the memory of each platform (A318 for both sandy and haswell, and A171 on xeon
phi), and evaluate the GFLOPS and GFLOPS/W metrics as the number of threads grows. A
direct comparison between the xeon phi and the two general-purpose x86 platforms cannot
be done at this point. For the task-parallel version of ILUPACK, the degree of software
concurrency determines the number of tasks that should be present in the bottom level of
the dependency tree (see Section B.1), and the actual number of flops that is required for
the solution of each problem case.
Figure B.2 reports the performance and energy efficiency attained with the task-parallel
version of ILUPACK, on sandy and haswell, when OmpSs is instructed to operate in
either the polling or the blocking mode (see subsection B.1.2). This first experiment reveals
that the impact of these modes on both metrics is minor when up to 8 threads are employed.
However, for 12 threads, we can observe quite a different behaviour depending on the target
platform. Concretely, for sandy, it is more convenient to rely on the blocking mode, espe-
cially from the point of view of GFLOPS/W, while for haswell the polling mode yields
superior performance and energy efficiency over its blocking counterpart. According to these
results, in the following experiments we select the blocking and polling modes for sandy
and haswell, respectively.
Figure B.3: GFLOPS (left) and GFLOPS/W (right) obtained with the task-parallel version of
ILUPACK on sandy (top) and haswell (bottom), using different Linux governors for benchmark
A318.
Figure B.3 evaluates the impact of three Linux governors available in sandy and
haswell: performance, ondemand and userspace, with the latter set so that the sockets
operate in either the maximum or minimum frequencies (fmax and fmin , respectively) of the
corresponding platforms (fmax =2.0 GHz and fmin =1.2 GHz for sandy; and fmax =1.6 GHz
and fmin =1.2 GHz for haswell). The four plots in the figure reveal the small impact of
this configuration parameter on the performance and energy efficiency of the task-parallel
version of ILUPACK on both servers, which is only visible when 12 threads/cores are em-
ployed in the execution. Given these results, we select the userspace governor, with the P0
state (i.e., maximum frequency), in the remaining experiments with these two platforms.
Figure B.4: GFLOPS (left) and GFLOPS/W (right) obtained with the task-parallel version of
ILUPACK on xeon phi, using different binding options for benchmark A171.
The last experiment with the configuration parameters, in Figure B.4, exposes the
effect of populating each hardware core of xeon phi with 1, 2, 4 (software) threads
(Binding=4,2,1, respectively). The best choice is clearly the first option which, given a
fixed number of threads, maximizes the number of hardware cores employed in the ex-
periment. This will be the configuration adopted for the following experiments with this
platform.
Table B.2: Characterization of the four platforms obtained with the task-parallel and data-parallel
versions of ILUPACK.
Table B.2 evaluates the task-parallel version of ILUPACK running on sandy, haswell
or xeon phi, compared with the data-parallel version of the solver executed on kepler,
using four efficiency metrics: execution time, GFLOPS, (total) energy-to-solution, and
GFLOPS/W. For the Intel-based platforms, we use 12 threads in both sandy and haswell,
and 64 for xeon phi.
A direct comparison of the platforms using the same problem case is difficult. First,
due to the small memory of the xeon phi, the largest problem that could be solved on this
platform (A171) seems too small to exploit the large amount of hardware parallelism of this
accelerator. In addition, increasing the problem dimension shows different trends depending
on the platform, with a decline in the GFLOPS and GFLOPS/W rates for sandy and
haswell, but a rise for kepler. Finally, even if the problem case is the same, the solvers
do not necessarily perform the same amount of operations to converge, as the exact number
of flops depends, e.g., on the level of task-parallelism tackled by each solver/platform (12
tasks in the bottom level of the DAG for sandy and haswell, 64 for xeon phi, and a single
task for kepler), as well as on variations due to rounding errors, which affect the convergence
rate.
As an alternative, let us perform a comparison based on the largest problem case that
can be tackled on each platform: A318 for sandy and haswell, A171 for xeon phi and
A252 for kepler. Consider first the two platforms equipped with general-purpose CPUs.
As the two systems comprise 12 cores each, in principle we could expect better performance
from haswell, because the floating-point units (FPUs) available in this more recent architecture
can produce up to 16 DP flops/cycle compared with the 8 DP flops/cycle of sandy. However, the
irregular data access patterns present in ILUPACK turn the exploitation of the wide vector
(SIMD) units into a difficult task which, combined with the higher maximum frequency of
sandy, explains why this platform outperforms haswell. Interestingly, the gap between the
GFLOPS rates of these two platforms, a factor of about 2.21/1.66=1.30, is captured to high
accuracy by the difference between their maximum frequencies, 2.0/1.6=1.25. This variation
is not reflected in the energy and GFLOPS/W metrics, where haswell is only slightly
behind sandy. These particular trends make us believe that haswell could match sandy’s
performance and outperform its energy efficiency if both platforms were operated at the
same maximum frequency. Let us now include kepler and benchmark A252 in the comparison.
As the problems being solved are different, we will perform the comparison in terms
of GFLOPS and GFLOPS/W, and omit time and energy. In spite of the large number
of FPUs in kepler, we see that the difference in favor of this data-parallel architecture
is moderate, with a factor of 1.43 and 1.86 over sandy and haswell, respectively, in the
GFLOPS rate; and roughly 1.40 over any of the two systems in the GFLOPS/W metric.
Finally, we note that xeon phi lags behind any of the other three platforms in both GFLOPS
and GFLOPS/W.
In this section we describe the particular strategies that were adopted in order to execute
ILUPACK on low-power devices such as the Jetson TX1.
The development of this low power variant is based on the SPD data parallel implemen-
tation of ILUPACK, as we consider this can be the first step towards a distributed version
of ILUPACK capable of executing on clusters of low power devices.
As in the previously presented parallel variants, it was necessary to transform ILUPACK
data structures in order to use cuSparse. This transformation is done only once, during the
construction of each level of the preconditioner, and occurs entirely in the CPU. In devices
equipped with physical Unified Memory3, like the Jetson TX1, no further data transfer is
needed once this transformation has been done.
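The thesis does not show code for this step; the following is only a minimal illustration of
how a buffer can be shared by the ARM cores and the GPU of the Jetson TX1 through
CUDA managed memory, so that the converted data structures are used in place.

    #include <cuda_runtime.h>

    /* Illustration only (not the ILUPACK code): on a device with physically
     * unified memory, a managed allocation is directly visible to both the
     * CPU and the GPU, so the CSR arrays built during the preconditioner
     * construction can be used by the GPU kernels without explicit copies. */
    static double *alloc_shared_values(size_t nnz)
    {
        double *val = NULL;
        cudaMallocManaged((void **)&val, nnz * sizeof(double));
        return val;   /* fill on the ARM cores, then pass to cuSparse/kernels */
    }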
Regarding the computation of the SpMV in the GPU, it is important to remark that in
this case each level of the preconditioner involves a matrix-vector multiplication with F and
F T . As in the baseline SPD version described in Section 3.2, we store both F and F T in
the GPU memory, accepting some storage overhead in order to avoid using the transposed
routine offered by cuSparse. Future implementations could consider the development of
custom kernels and the use of adequate sparse storage formats that allow a more balanced
performance ratio between the regular and transposed SpMV.
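The following sketch shows the idea using the legacy cuSparse csrmv interface available in
CUDA 8; the routine and variable names are assumptions for illustration, not the ILUPACK
implementation.

    #include <cusparse_v2.h>

    /* y := F^T * x, computed with the non-transposed kernel on an explicitly
     * stored CSR copy of F^T (of dimension m x n here). Using
     * CUSPARSE_OPERATION_TRANSPOSE on F would avoid the extra storage, but
     * that code path is considerably slower in cuSparse. */
    void spmv_with_stored_transpose(cusparseHandle_t handle,
                                    cusparseMatDescr_t descr,
                                    int m, int n, int nnz,
                                    const double *valFt, const int *rowPtrFt,
                                    const int *colIndFt,
                                    const double *x, double *y)
    {
        const double one = 1.0, zero = 0.0;
        cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       m, n, nnz, &one, descr,
                       valFt, rowPtrFt, colIndFt, x, &zero, y);
    }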
For the rest of the work we will consider two different implementations:
• JTX1 GPU: computes the entire preconditioner application on the GPU while the
rest of the computations are carried out in the ARM processor.
• JTX1 ARM: performs all the computations on the ARM processor. This variant does
not leverage any data parallelism.
Jetson TX1
The NVIDIA Jetson TX1 is a low-power, GPU-enabled system that includes a 256-core
Maxwell GPU and a 64-bit quad-core ARM A57 processor, which was configured for maximum
performance. The platform is also equipped with 4 GB of unified LPDDR4 RAM with
a theoretical bandwidth of 25.6 GB/s (see [79]).
Table B.3 shows the comparison between the performance of the sequential and data-
parallel versions of ILUPACK executed on the Jetson module. It can be observed that
the JTX1 GPU version implies lower runtimes than the JTX1 ARM variant, but these
improvements are decreasing with the dimension of the addressed problem. This result is
consistent with other experiments, and relates to the fact that GPUs require large volumes
of data to effectively exploit their computational power.
If we put the performance of the JTX1 GPU variant in the context of other kinds of
hardware platforms, the benefits offered by the Jetson device are easy to see. To illustrate
this aspect, we contrast it with our previous results for the GPU-based ILUPACK from [5],
run on an NVIDIA K20 GPU. Table B.4 summarizes these results, focusing only on the
preconditioner application runtime, and contrasting it with the one obtained on the Jetson
TX1 platform.
Table B.4: Runtime (in seconds) of GPU-based ILUPACK on a K20 (from [5], using double-
precision arithmetic) and of the JTX1 GPU variant of ILUPACK on the Jetson TX1 (in single precision).
                 K20                                         Jetson
Case             iters   Prec. time   Time per iter          iters   Prec. time   Time per iter
A126 44 11.38 .26 156 44.36 .28
A159 52 19.75 .38 206 123.93 .60
A171 - - - 222 146.09 .66
A200 76 28.28 .37 - - -
not yet widespread, but some examples are the one built in the context of the Mont-Blanc
project, led by the Barcelona Supercomputing Center (BSC), Spain [36], and the one con-
structed by the ICARUS project of the Institute for Applied Mathematics, TU Dortmund,
Germany [54, 53].
APPENDIX C
Description of the GPU architectures used in this work
This section includes the description of the most remarkable characteristics of the three
NVIDIA GPU architectures employed in this thesis, namely the Fermi, Kepler and Maxwell
architectures.
Figure C.1: Block diagram of Fermi GF100 architecture (left) and its SM (right). Extracted from
NVIDIA Fermi GF100 Architecture Whitepaper. Available online.
According to NVIDIA reports, two Kepler cores perform similarly to one Fermi core while
consuming less than 50% of the energy.
From a hardware architecture point of view, Kepler employs a new generation of Stream-
ing Multiprocessor called SMX (see Figure C.3). Each SMX contains 192 single-precision
CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store
units (LD/ST). To increase the double precision performance offered by the previous gen-
eration, the SMX of Kepler has 8× more SFUs than the SM in Fermi. As before, the SMX
schedules threads in groups of 32, implicitly synchronized, parallel threads called warps.
Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units,
allowing the concurrent execution of up to four warps. Additionally, it allows single and
double precision instructions to be paired with other instructions.
The new design of the SMX also introduced a new high performance operation, and
significantly improved another one. On the one hand, the warp shuffle operations (SHFL)
appeared, which allow the threads within a warp to exchange data resident in registers
without using slower memory spaces such as the shared or global memory. This kind of
operation has four variants (indexed any-to-any, shift right/left to the n-th neighbour, and
butterfly (XOR) exchange) and is extremely useful in the implementation of many high
performance kernels. On the other hand, the atomic instructions got a 10× performance
gain through the inclusion of more atomic processors and a shorter processing pipeline.
Kepler also adds 64-bit support for the atomicMin, atomicMax, atomicAnd, atomicOr and
atomicXor operations.
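As an illustration of the butterfly variant, the following device function performs a warp-wide
sum with the pre-CUDA 9 __shfl_xor intrinsic, the same primitive employed by the solver
kernel of Appendix A (the function name is illustrative):

    /* Butterfly (XOR) reduction of one float per thread within a warp; after
     * the loop every lane of the warp holds the total sum, without touching
     * shared or global memory. */
    __device__ float warp_sum(float v)
    {
        for (int i = 16; i >= 1; i /= 2)
            v += __shfl_xor(v, i, 32);
        return v;
    }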
Regarding memories, each SMX has 64 KB of on-chip memory that can be configured as
48 KB of shared memory with 16 KB of L1 cache, 16 KB of shared memory with 48 KB of
L1 cache, or an even 32 KB/32 KB split.
Figure C.2: Block diagram of Kepler GK110 architecture. Extracted from NVIDIA Kepler GK110
Architecture Whitepaper. Available online.
Appendix C. Description of the GPU architectures used in this work
Figure C.3: Block diagram of the SMX presented by Kepler. Extracted from NVIDIA Kepler
GK110 Architecture Whitepaper. Available online.
C.1. Summary of computing platforms
Table C.1: CPU and main memory features of each of the utilized platforms.
Platform        Model                      Cores   Freq. (GHz)   Cache (MB)   RAM (GB)   Max. bw. (GB/s)
Bach Intel Core i7-2600 4 3.40 8 8 21.0
Mozart Intel Core i3-3220 2 3.30 3 16 25.6
Beethoven Intel Core i7-4770 4 3.40 8 16 25.6
Brahms Intel Xeon E5-2620 v2 12 2.10 15 128 51.2
Falla Intel Xeon E5-2680 v3 12 2.50 30 64 68.0
Jetson TX1 ARM A57 4 1.73 2 4 (unified) 25.6
Ravel Intel Core i7-6700 4 3.40 8 64 34.1
Sibelius Intel Core i7-6700 4 3.40 8 64 34.1
sandy Intel Xeon E5-2620 12 2.00 15 32 51.2
haswell Intel Xeon E5-2603 v3 12 1.60 15 32 51.0
Table C.2: Main features of the accelerators in each of the utilized platforms.
Platform        Model                      Arch.       Cores   Freq. (GHz)   RAM (GB)   Max. bw. (GB/s)
Bach Tesla C2070 Fermi 448 1.50 5 144
Mozart Tesla K20 Kepler 2496 0.70 6 250
Beethoven Tesla K40c Kepler 2880 0.75 11 288
Brahms 2 × Tesla K40m Kepler 2880 0.75 11 288
Falla 4 × 2 × Tesla K40m Kepler 2880 0.75 11 288
Jetson TX1 Tegra X1 Maxwell 256 0.99 4 (unified) 25.6
Ravel GTX 1060 Pascal 1152 1.50 3 192
Sibelius GTX 1080 Ti Pascal 3584 1.58 11 484
xeon phi Intel Xeon Phi 5110P Intel-MIC 60 1.05 8 320
Table C.3: Versions of the C and Fortran compilers and the CUDA Toolkit for each of the utilized
platforms.
Bibliography
[3] Aliaga, J. I., Bollhöfer, M., Martı́n, A. F., and Quintana-Ortı́, E. S. (2011). Exploit-
ing thread-level parallelism in the iterative solution of sparse linear systems. Parallel
Computing, 37(3), 183–202.
[4] Aliaga, J. I., Bollhöfer, M., Martı́n, A. F., and Quintana-Ortı́, E. S. (2012). Paralleliza-
tion of multilevel ILU preconditioners on distributed-memory multiprocessors. In Applied
Parallel and Scientific Computing, LNCS , volume 7133, pages 162–172. Springer, Berlin,
Heidelberg.
[5] Aliaga, J. I., Bollhöfer, M., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2014).
Leveraging data-parallelism in ILUPACK using graphics processors. In IEEE 13th In-
ternational Symposium on Parallel and Distributed Computing, ISPDC 2014, Marseille,
France, June 24-27, 2014 , pages 119–126.
[6] Aliaga, J. I., Bollhöfer, M., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2016a).
A data-parallel ILUPACK for sparse general and symmetric indefinite linear systems. In
Euro-Par 2016: Parallel Processing Workshops - Euro-Par 2016 International Workshops,
Grenoble, France, August 24-26, 2016, Revised Selected Papers, pages 121–133.
[7] Aliaga, J. I., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2016b). Design
of a task-parallel version of ILUPACK for graphics processors. In High Performance
Computing - Third Latin American Conference, CARLA 2016, Mexico City, Mexico,
August 29 - September 2, 2016, Revised Selected Papers, pages 91–103.
[8] Aliaga, J. I., Badia, R. M., Barreda, M., Bollhöfer, M., Dufrechou, E., Ezzatti, P., and
Quintana-Ortı́, E. S. (2016c). Exploiting task and data parallelism in ilupack’s precondi-
tioned CG solver on NUMA architectures and many-core accelerators. Parallel Comput-
ing, 54, 97–107.
[9] Aliaga, J. I., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2017a). Evaluating
the NVIDIA tegra processor as a low-power alternative for sparse GPU computations. In
High Performance Computing - 4th Latin American Conference, CARLA 2017, Buenos
Aires, Argentina, and Colonia del Sacramento, Uruguay, September 20-22, 2017, Revised
Selected Papers, pages 111–122.
[10] Aliaga, J. I., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2017b). Overcoming
memory-capacity constraints in the use of ILUPACK on graphics processors. In 29th
International Symposium on Computer Architecture and High Performance Computing,
SBAC-PAD 2017, Campinas, Brazil, October 17-20, 2017 , pages 41–48.
[11] Aliaga, J. I., Bollhöfer, M., Dufrechou, E., Ezzatti, P., and Quintana-Ortı́, E. S. (2018).
Extending ILUPACK with a task-parallel version of BiCG for dual-GPU servers. In Pro-
ceedings of the 9th International Workshop on Programming Models and Applications for
Multicores and Manycores, PMAM@PPoPP 2018, February 25, 2018, Vienna, Austria,
pages 71–78.
[12] Alvarado, F. and Schreiber, R. (1991). Fast parallel solution of sparse triangular sys-
tems. In 13th IMACS World Congress on Computation and Applied Mathematics, Dublin.
[13] Alvarado, F., Yu, D., and Betancourt, R. (1989). Ordering schemes for partitioned
sparse inverses. In SIAM Symposium on Sparse Matrices, Salishan Lodge, Gleneden
Beach, Oregon.
[14] Alvarado, F. L. and Schreiber, R. (1993). Optimal parallel solution of sparse triangular
systems. SIAM Journal on Scientific Computing, 14(2), 446–460.
[15] Ament, M., Knittel, G., Weiskopf, D., and Straßer, W. (2010). A parallel preconditioned
conjugate gradient solver for the poisson problem on a multi-gpu platform. In PDP
’10: Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and
Networkbased Processing, pages 583–592.
[16] Amestoy, P. R., Davis, T. A., and Duff, I. S. (1996). An approximate minimum degree
ordering algorithm. SIAM J. Matrix Anal. Appl., 17(4), 886–905.
[17] Anderson, E. and Saad, Y. (1989a). Solving sparse triangular linear systems on parallel
computers. International Journal of High Speed Computing, 1(01), 73–95.
[18] Anderson, E. and Saad, Y. (1989b). Solving sparse triangular linear systems on parallel
computers. International Journal of High Speed Computing, 1(1), 73–95.
[19] Anzt, H., Chow, E., and Dongarra, J. (2015). Iterative Sparse Triangular Solves for
Preconditioning, pages 650–661. Springer Berlin Heidelberg, Berlin, Heidelberg.
[20] Arnoldi, W. E. (1951). The principle of minimized iterations in the solution of the
matrix eigenvalue problem. Quarterly of applied mathematics, 9(1), 17–29.
[22] Bell, N. and Garland, M. (2012). Cusp: Generic parallel algorithms for sparse matrix
and graph computations. Version 0.3.0.
[23] Benzi, M. and Tuma, M. (1998). A sparse approximate inverse preconditioner for
nonsymmetric linear systems. SIAM Journal on Scientific Computing, 19(3), 968–994.
[24] Bientinesi, P., Gunnels, J. A., Myers, M. E., Quintana-Ortı́, E. S., and Geijn, R. A.
v. d. (2005). The science of deriving dense linear algebra algorithms. ACM Trans. Math.
Softw., 31(1), 1–26.
[26] Bollhöfer, M. (2001). A robust ILU with pivoting based on monitoring the growth of
the inverse factors. Linear Algebra and its Applications, 338(1), 201 – 218.
[27] Bollhöfer, M. (2003). A robust and efficient ILU that incorporates the growth of the
inverse triangular factors. SIAM Journal on Scientific Computing, 25(1), 86–103.
[29] Bollhöfer, M. and Saad, Y. (2002b). On the relations between ILUs and factored
approximate inverses. SIAM Journal on Matrix Analysis and Applications, 24(1), 219–
237.
[30] Bollhöfer, M. and Saad, Y. (2006). Multilevel preconditioners constructed from inverse-
based ILUs. SIAM Journal on Scientific Computing, 27(5), 1627–1650.
[31] Bolz, J., Farmer, I., Grinspun, E., and Schröder, P. (2003). Sparse matrix solvers on
the GPU: Conjugate gradients and multigrid. ACM Trans. Graph., 22(3), 917–924.
[32] Buatois, L., Caumon, G., and Levy, B. (2007). Concurrent number cruncher: An
efficient sparse linear solver on the GPU. In High Performance Computation Conference
(HPCC), Springer Lecture Notes in Computer Sciences.
[34] Chow, E. and Patel, A. (2015). Fine-grained parallel incomplete LU factorization. SIAM
Journal on Scientific Computing, 37(2), C169–C193.
[35] Cline, A. K., Moler, C. B., Stewart, G. W., and Wilkinson, J. H. (1979). An estimate for
the condition number of a matrix. SIAM Journal on Numerical Analysis, 16(2), 368–375.
[37] Cuthill, E. (1972). Several strategies for reducing the bandwidth of matrices. In D. J.
Rose and R. A. Willoughby, editors, Sparse Matrices and their Applications: Proceedings
of a Symposium on Sparse Matrices and Their Applications, held September 9–10, 1971, at
the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, and sponsored
by the Office of Naval Research, the National Science Foundation, IBM World Trade
Corporation, and the IBM Research Mathematical Sciences Department., pages 157–166,
Boston, MA. Springer US.
[38] Davis, T. (2006). Direct Methods for Sparse Linear Systems. Society for Industrial and
Applied Mathematics.
[39] Davis, T. A. and Hu, Y. (2011). The University of Florida sparse matrix collection. ACM
Trans. Math. Softw., 38(1), 1:1–1:25.
[40] Dongarra, J. and Heroux, M. A. (2013). Toward a new metric for ranking high perfor-
mance computing systems. Sandia Report SAND2013-4744, Sandia National Lab.
[41] Dongarra, J., Duff, I., Sorensen, D., and van der Vorst, H. (1998). Numerical Linear
Algebra for High-Performance Computers. Society for Industrial and Applied Mathemat-
ics.
[42] Dongarra, J. J., Luszczek, P., and Petitet, A. (2003). The LINPACK benchmark: past,
present and future. Concurrency and Computation: Practice and Experience, 15(9), 803–
820.
[43] Duff, I. S. and Koster, J. (2001). On algorithms for permuting large entries to the
diagonal of a sparse matrix. SIAM Journal on Matrix Analysis and Applications, 22(4),
973–996.
[44] Dufrechou, E. and Ezzatti, P. (2018). Solving sparse triangular linear systems in modern
gpus: A synchronization-free algorithm. In 26th Euromicro International Conference
on Parallel, Distributed and Network-based Processing, PDP 2018, Cambridge, United
Kingdom, March 21-23, 2018 , pages 196–203.
[45] Meurant, G. (1999). Computer Solution of Large Linear Systems. Studies in Mathe-
matics and Its Applications 28. Elsevier.
[46] Erguiz, D., Dufrechou, E., and Ezzatti, P. (2017). Assessing sparse triangular linear
system solvers on gpus. In 2017 International Symposium on Computer Architecture and
High Performance Computing Workshops (SBAC-PADW), pages 37–42.
[47] Fletcher, R. (1976). Conjugate gradient methods for indefinite systems. In Numerical
analysis, pages 73–89. Springer.
[48] Freund, R. and Zha, H. (1994). Simplifications of the nonsymmetric Lanczos process
and a new algorithm for Hermitian indefinite linear systems. AT&T Numerical Analysis
Manuscript, Bell Laboratories, Murray Hill, NJ.
[50] George, A. (1973). Nested dissection of a regular finite element mesh. SIAM Journal
on Numerical Analysis, 10(2), 345–363.
[51] George, A., Heath, M. T., Liu, J., and Ng, E. (1986). Solution of sparse positive
definite systems on a shared-memory multiprocessor. International Journal of Parallel
Programming, 15(4), 309–325.
[52] George, A., Liu, J., and Ng, E. (1994). Computer solution of sparse linear systems.
Academic, Orlando.
[53] Geveler, M. and Turek, S. (2017). How applied sciences can accelerate the energy
revolution: a pleading for energy awareness in scientific computing. In Newsletter of the
European Community on Computational Methods in Applied Sciences. Accepted.
[54] Geveler, M., Ribbrock, D., Donner, D., Ruelmann, H., Höppke, C., Schneider, D.,
Tomaschewski, D., and Turek, S. (2016). The ICARUS white paper: A scalable, energy-
efficient, solar-powered HPC center based on low power GPUs. In F. Desprez, P. Dutot,
C. Kaklamanis, L. Marchal, K. Molitorisz, L. Ricci, V. Scarano, M. Vega-Rodriguez,
A. Varbanescu, S. Hunold, S. Scott, S. Lankes, and J. Weidendorfer, editors, Workshop on
Unconventional HPC, Euro-Par 2016: Parallel Processing Workshops, Grenoble, France,
August 24-26, 2016, Revised Selected Papers, LNCS, pages 737–749. Springer. ISBN
978-3-319-58943-5. http://[Link]/10.1007/978-3-319-58943-5_59.
[55] Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. The Johns Hopkins
Univ. Press, Baltimore, 3rd edition.
[56] Golub, G. H. and Van Loan, C. F. (2012). Matrix Computations. The Johns Hopkins
Univ. Press, Baltimore, 4th edition.
[57] Goodnight, N., Woolley, C., Lewin, G., Luebke, D., and Humphreys, G. (2003). A
multigrid solver for boundary value problems using programmable graphics hardware.
In HWWS ’03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on
Graphics hardware, pages 102–111, Aire-la-Ville, Switzerland. Eurographics
Association.
[58] Greenbaum, A. (1997). Iterative Methods for Solving Linear Systems. Frontiers in
Applied Mathematics 17. Society for Industrial and Applied Mathematics, 1st edition.
[59] Gupta, R., van Gijzen, M. B., and Vuik, C. (2013). 3D bubbly flow simulation on the
GPU: iterative solution of a linear system using sub-domain and level-set deflation. In
21st Euromicro International Conference on Parallel, Distributed and Network-Based
Processing (PDP 2013), pages 359–366, Los Alamitos, CA, USA. IEEE Computer Society.
[60] He, K., Tan, S. X., Zhao, H., Liu, X.-X., Wang, H., and Shi, G. (2016). Parallel GMRES
solver for fast analysis of large linear dynamic systems on GPU platforms. Integration,
the VLSI Journal, 52, 10–22.
[62] Hestenes, M. R. and Stiefel, E. (1952). Methods of conjugate gradients for solving linear
systems. Journal of Research of the National Bureau of Standards, 49(1).
[63] Hénon, P. and Saad, Y. (2006). A parallel multistage ILU factorization based on a
hierarchical graph decomposition. SIAM Journal on Scientific Computing, 28(6), 2266–
2293.
[64] Jones, M. and Patrick, M. (1993). Bunch–Kaufman factorization for real symmetric
indefinite banded matrices. SIAM Journal on Matrix Analysis and Applications, 14(2),
553–559.
[65] Kahn, A. B. (1962). Topological sorting of large networks. Commun. ACM, 5(11),
558–562.
[66] Kirk, D. and Hwu, W. (2012). Programming Massively Parallel Processors, Second
Edition: A Hands-on Approach. Morgan Kaufmann.
[67] Krüger, J., Schiwietz, T., Kipfer, P., and Westermann, R. (2004). Numerical simulations
on PC graphics hardware. In ParSim 2004 (Special Session of EuroPVM/MPI 2004),
Budapest, Hungary.
[69] Li, R. and Saad, Y. (2013). GPU-accelerated preconditioned iterative linear solvers. The
Journal of Supercomputing, 63(2), 443–466.
[70] Liu, W., Li, A., Hogg, J., Duff, I. S., and Vinter, B. (2016). A synchronization-
free algorithm for parallel sparse triangular solves. In European Conference on Parallel
Processing, pages 617–630. Springer.
[71] Liu, W., Li, A., Hogg, J. D., Duff, I. S., and Vinter, B. (2017). Fast synchronization-free
algorithms for parallel sparse triangular solves with multiple right-hand sides. Concur-
rency and Computation: Practice and Experience, 29(21).
[72] Mayer, J. (2006). Alternative weighted dropping strategies for ILUTP. SIAM Journal on
Scientific Computing, 27(4), 1424–1437.
[74] Naumov, M. (2011b). Parallel solution of sparse triangular linear systems in the pre-
conditioned iterative methods on the GPU. NVIDIA white paper, NVIDIA Corporation.
[75] Naumov, M. (2011c). Parallel solution of sparse triangular linear systems in the pre-
conditioned iterative methods on the GPU. NVIDIA Corp., Westford, MA, USA, Tech.
Rep. NVR-2011-001.
[76] NVIDIA (1999). NVIDIA launches the world's first graphics processing unit: GeForce
256. [Online; accessed 10-June-2017].
[78] NVIDIA (2014). NVIDIA GeForce GTX 980 featuring Maxwell, the most advanced GPU
ever made. [Online; accessed 10-June-2017].
[79] NVIDIA (2015). NVIDIA Tegra X1: NVIDIA's New Mobile Superchip. http://
[Link]/pdf/tegra/[Link],
[Online; accessed 10-July-2017].
[80] NVIDIA (2008). CUBLAS library. NVIDIA Corporation, Santa Clara, California.
[82] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C.
(2008). GPU computing. Proceedings of the IEEE, 96(5), 879–899.
[83] Pothen, A. and Alvarado, F. L. (1992). A fast reordering algorithm for parallel sparse
triangular solution. SIAM Journal on Scientific and Statistical Computing, 13(2), 645–653.
[84] Rothberg, E. and Gupta, A. (1992). Parallel ICCG on a hierarchical memory multi-
processor - addressing the triangular solve bottleneck. Parallel Computing, 18(7), 719–741.
[85] Rumpf, M. and Strzodka, R. (2001). Using graphics cards for quantized FEM compu-
tations. In IASTED Visualization, Imaging and Image Processing Conference, pages
193–202.
[86] Saad, Y. (1994). ILUT: a dual threshold incomplete LU factorization. Numerical Linear
Algebra with Applications, 1(4), 387–402.
[87] Saad, Y. (2003a). Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia,
PA, USA, 2nd edition.
[88] Saad, Y. (2003b). Iterative Methods for Sparse Linear Systems. Society for Industrial
and Applied Mathematics, 2nd edition.
[89] Saad, Y. (2005). Multilevel ILU with reorderings for diagonal dominance. SIAM Journal
on Scientific Computing, 27(3), 1032–1057.
[91] Saltz, J. H. (1990). Aggregation methods for solving sparse triangular systems on
multiprocessors. SIAM Journal on Scientific and Statistical Computing, 11(1), 123–144.
[92] Saltz, J. H. (1987). Automated problem scheduling and reduction of synchronization
delay effects. Technical report.
[93] Schaller, R. R. (1997). Moore’s law: past, present and future. IEEE Spectrum, 34(6),
52–59.
[94] Schenk, O., Röllin, S., and Gupta, A. (2004). The effects of unsymmetric matrix permu-
tations and scalings in semiconductor device and circuit simulation. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 23, 400–411.
[95] Schenk, O., Wächter, A., and Weiser, M. (2009). Inertia-revealing preconditioning for
large-scale nonconvex constrained optimization. SIAM Journal on Scientific Computing,
31(2), 939–960.
[96] Sonneveld, P. (1989). CGS, A fast Lanczos-type solver for nonsymmetric linear systems.
SIAM Journal on Scientific and Statistical Computing, 10(1), 36–52.
[97] Stewart, G. W. (1998). Matrix Algorithms. Society for Industrial and Applied Mathe-
matics, 1st edition.
[98] Sudan, H., Klie, H., Li, R., and Saad, Y. (2010). High performance manycore solvers
for reservoir simulation. In 12th European conference on the mathematics of oil recovery.
[100] Thomas, L. H. (1949). Elliptic Problems in Linear Differential Equations over a Net-
work. Technical report, Columbia University.
[101] van der Vorst, H. A. (1982). A vectorizable variant of some ICCG methods. SIAM
Journal on Scientific and Statistical Computing, 3(3), 350–356.
[102] Van der Vorst, H. A. (1990). The convergence behaviour of preconditioned CG and
CG-S in the presence of rounding errors. In O. Axelsson and L. Y. Kolotilina, editors,
Preconditioned Conjugate Gradient Methods, pages 126–136, Berlin, Heidelberg. Springer
Berlin Heidelberg.
[103] van der Vorst, H. A. (1992). Bi-CGSTAB: a fast and smoothly converging variant of
Bi-CG for the solution of nonsymmetric linear systems. SIAM Journal on Scientific and
Statistical Computing, 13(2), 631–644.
[104] Van Duin, A. C. (1999). Scalable parallel preconditioning with the sparse approximate
inverse of triangular matrices. SIAM Journal on Matrix Analysis and Applications, 20(4),
987–1006.
[105] Wing, O. and Huang, J. W. (1980). A computation model of parallel solution of linear
equations. IEEE Trans. Computers, 29(7), 632–638.