
Least Squares Data Fitting

with Applications
Least Squares Data Fitting
with Applications
PER CHRISTIAN HANSEN
VÍCTOR PEREYRA
GODELA SCHERER

The Johns Hopkins University Press


Baltimore

© 2013 The Johns Hopkins University Press
All rights reserved. Published 2013
Printed in the United States of America on acid-free paper
987654321

The Johns Hopkins University Press


2715 North Charles Street
Baltimore, Maryland 21218-4363
www.press.jhu.edu

ISBN-13: 978-1-4214-0786-9 (hardcover : alk. paper)


ISBN-10: 1-4214-0786-8 (hardcover : alk. paper)
ISBN-13: 978-1-4214-0858-3 (electronic)
ISBN-10: 1-4214-0858-9 (electronic)

Library of Congress Control Number: 2012942511

A catalog record for this book is available from the British Library.

Special discounts are available for bulk purchases of this book. For more information,
please contact Special Sales at 410-516-6936 or [email protected].

The Johns Hopkins University Press uses environmentally friendly book materials,
including recycled text paper that is composed of at least 30 percent post-consumer
waste, whenever possible.
Contents

Preface ix

Symbols and Acronyms xiii

1 The Linear Data Fitting Problem 1


1.1 Parameter estimation, data approximation 1
1.2 Formulation of the data fitting problem 4
1.3 Maximum likelihood estimation 9
1.4 The residuals and their properties 13
1.5 Robust regression 19

2 The Linear Least Squares Problem 25


2.1 Linear least squares problem formulation 25
2.2 The QR factorization and its role 33
2.3 Permuted QR factorization 39

3 Analysis of Least Squares Problems 47


3.1 The pseudoinverse 47
3.2 The singular value decomposition 50
3.3 Generalized singular value decomposition 54
3.4 Condition number and column scaling 55
3.5 Perturbation analysis 58

4 Direct Methods for Full-Rank Problems 65


4.1 Normal equations 65
4.2 LU factorization 68
4.3 QR factorization 70
4.4 Modifying least squares problems 80
4.5 Iterative refinement 85
4.6 Stability and condition number estimation 88
4.7 Comparison of the methods 89


5 Direct Methods for Rank-Deficient Problems 91


5.1 Numerical rank 92
5.2 Peters-Wilkinson LU factorization 93
5.3 QR factorization with column permutations 94
5.4 UTV and VSV decompositions 98
5.5 Bidiagonalization 99
5.6 SVD computations 101

6 Methods for Large-Scale Problems 105


6.1 Iterative versus direct methods 105
6.2 Classical stationary methods 107
6.3 Non-stationary methods, Krylov methods 108
6.4 Practicalities: preconditioning and
stopping criteria 114
6.5 Block methods 117

7 Additional Topics in Least Squares 121


7.1 Constrained linear least squares problems 121
7.2 Missing data problems 131
7.3 Total least squares (TLS) 136
7.4 Convex optimization 143
7.5 Compressed sensing 144

8 Nonlinear Least Squares Problems 147


8.1 Introduction 147
8.2 Unconstrained problems 150
8.3 Optimality conditions for constrained
problems 156
8.4 Separable nonlinear least squares problems 158
8.5 Multiobjective optimization 160

9 Algorithms for Solving Nonlinear LSQ Problems 163


9.1 Newton’s method 164
9.2 The Gauss-Newton method 166
9.3 The Levenberg-Marquardt method 170
9.4 Additional considerations and software 176
9.5 Iteratively reweighted LSQ algorithms for robust data fitting
problems 178
9.6 Variable projection algorithm 181
9.7 Block methods for large-scale problems 186

10 Ill-Conditioned Problems 191


10.1 Characterization 191
10.2 Regularization methods 192
10.3 Parameter selection techniques 195
10.4 Extensions of Tikhonov regularization 198
10.5 Ill-conditioned NLLSQ problems 201

11 Linear Least Squares Applications 203


11.1 Splines in approximation 203
11.2 Global temperatures data fitting 212
11.3 Geological surface modeling 221

12 Nonlinear Least Squares Applications 231


12.1 Neural networks training 231
12.2 Response surfaces, surrogates or proxies 238
12.3 Optimal design of a supersonic aircraft 241
12.4 NMR spectroscopy 248
12.5 Piezoelectric crystal identification 251
12.6 Travel time inversion of seismic data 258

Appendix A Sensitivity Analysis 263


A.1 Floating-point arithmetic 263
A.2 Stability, conditioning and accuracy 264

Appendix B Linear Algebra Background 267


B.1 Norms 267
B.2 Condition number 268
B.3 Orthogonality 269
B.4 Some additional matrix properties 270

Appendix C Advanced Calculus Background 271


C.1 Convergence rates 271
C.2 Multivariable calculus 272

Appendix D Statistics 275


D.1 Definitions 275
D.2 Hypothesis testing 280

References 281

Index 301
Preface

This book surveys basic modern techniques for the numerical solution of
linear and nonlinear least squares problems and introduces the treatment
of large and ill-conditioned problems. The theory is extensively illustrated
with examples from engineering, environmental sciences, geophysics and
other application areas.
In addition to the treatment of the numerical aspects of least squares
problems, we introduce some important topics from the area of regression
analysis in statistics, which can help to motivate, understand and evaluate
the computed least squares solutions. The inclusion of these topics is one
aspect that distinguishes the present book from other books on the subject.
The presentation of the material is designed to give an overview, with
the goal of helping the reader decide which method would be appropriate
for a given problem, point toward available algorithms/software and, if nec-
essary, help in modifying the available tools to adapt for a given application.
The emphasis is therefore on the properties of the different algorithms and
few proofs are presented; the reader is instead referred to the appropriate
articles/books. Unfortunately, several important topics had to be left out,
among them, direct methods for sparse problems.
The content is geared toward scientists and engineers who must analyze
and solve least squares problems in their fields. It can be used as course
material for an advanced undergraduate or graduate course in the sciences
and engineering, presupposing a working knowledge of linear algebra and
basic statistics. It is written mostly in a terse style in order to provide a
quick introduction to the subject, while treating some of the not so well-
known topics in more depth. This in fact presents the reader with an
opportunity to verify the understanding of the material by completing or
providing the proofs without checking the references.
The least squares problem is known under different names in different
disciplines. One of our aims is to help bridge the communication gap be-
tween the statistics and the numerical analysis literature on the subject,
often due to the use of different terminology, such as l2 -approximation,


regularization, regression analysis, parameter estimation, filtering, process


identification, etc.
Least squares methods have been with us for many years, since Gauss
invented and used them in his surveying activities [83]. In 1965, the paper
by G. H. Golub [92] on using the QR factorization and later his devel-
opment of a stable algorithm for the computation of the SVD started a
renewed interest in the subject in the, by then, changed work environment
of computers.
Thanks also to, among many others, Å. Björck, L. Eldén, C. C. Paige,
M. A. Saunders, G. W. Stewart, S. van Huffel and P.-Å. Wedin, the topic
is now available in a robust, algorithmic and well-founded form.
There are many books partially or completely dedicated to linear and
nonlinear least squares. The first and one of the fundamental references for
linear problems is Lawson and Hanson’s monograph [150]. Besides summa-
rizing the state of the art at the time of its publication, it highlighted the
practical aspects of solving least squares problems. Bates and Watts [9]
have an early comprehensive book focused on the nonlinear least squares
problem with a strong statistical approach. Björck’s book [20] contains a
very careful and comprehensive survey of numerical methods for both lin-
ear and nonlinear problems, including the treatment of large, sparse prob-
lems. Golub and Van Loan’s Matrix Computations [105] includes several
chapters on different aspects of least squares solution and on total least
squares. The total least squares problem, known in statistics as latent root
regression, is discussed in the book by S. van Huffel and J. Vandewalle
[239]. Seber and Wild [223] consider exhaustively all aspects of nonlinear
least squares estimation and modeling. Although it is a general treatise
on optimization, Nocedal and Wright’s book [170] includes a very clear
chapter on nonlinear least squares. Additional material can be found in
[21, 63, 128, 232, 242, 251].

We would like to acknowledge the help of Michael Saunders (iCME, Stan-


ford University), who read carefully the whole manuscript and made a
myriad of observations and corrections that have greatly improved the final
product.
Per Christian Hansen would like to thank several colleagues from DTU
Informatics who assisted with the statistical aspects.
Godela Scherer gives thanks for all the support at the Department of
Mathematics and Statistics, University of Reading, where she was a visiting
research fellow while working on this book. In particular, she would like
to thank Professor Mike J. Baines and Dr. I. Llatas for numerous inspiring
discussions.
Victor Pereyra acknowledges Weidlinger Associates Inc. and most es-
pecially David Vaughan and Howard Levine, for their unflagging support

and for letting him keep his office and access to computing facilities after
retirement. Also Ms. P. Tennant helped immensely by improving many of
the figures.
Special thanks are due to the professional handling of the manuscript
by the publishers and more specifically to executive editor Vincent J. Burke
and production editor Andre Barnett.
Prior to his untimely death in November 2007, Professor Gene Golub
had been an integral part of this project team. Although the book has
changed significantly since then, it has greatly benefited from his insight
and knowledge. He was an inspiring mentor and great friend, and we miss
him dearly.
Symbols and Acronyms

Symbol Represents
A m × n matrix
A† , A− , AT pseudoinverse, generalized inverse and transpose of A
b right-hand side, length m
cond(·) condition number of matrix in l2 -norm
Cov(·) covariance matrix
diag(·) diagonal matrix
e vector of noise, length m
ei noise component in data
ei canonical unit vector
εM machine precision
E(·) expected value
fj (t) model basis function
Γ(t) pure-data function
M (x , t) fitting model
N normal (or Gaussian) distribution
null(A) null space of A
p degree of polynomial
P, Pi , Px probability
PX projection onto space X
Π permutation matrix
Q m × m orthogonal matrix, partitioned as Q = ( Q1 Q2 )
r = r(A) rank of matrix A
r, r∗ residual vector, least squares residual vector
ri residual for ith data
range(A) range of matrix A
R, R1 m × n and n × n upper triangular matrix
span{w 1 , . . . , w p } subspace generated by the vectors
Σ diagonal SVD matrix


Symbol Represents
ς, ςi standard deviation
σi singular value
t independent variable in data fitting problem
ti abscissa in data fitting problem
U, V m × m and n × n left and right SVD matrices
u i, v i left and right singular vectors, length m and n respectively
W m × m diagonal weight matrix
wi weight in weighted least squares problem
x, x∗ vector of unknowns, least squares solution, length n
x ∗ minimum-norm LSQ solution
x B , x TLS basic LSQ solution and total least squares solution
xi coefficient in a linear fitting model
y vector of data in a data fitting problem, length m
yi data in data fitting problem
 · 2 2-norm, x 2 = (x21 + · · · + x2n )1/2
 perturbed version of 

Acronym Name
CG conjugate gradient
CGLS conjugate gradient for LSQ
FG fast Givens
GCV generalized cross validation
G-N Gauss-Newton
GS Gram-Schmidt factorization
GSVD generalized singular value decomposition
LASVD SVD for large, sparse matrices
L-M Levenberg-Marquardt
LP linear prediction
LSE equality constrained LSQ
LSI inequality constrained LSQ
LSQI quadratically constrained LSQ
LSQ least squares
LSQR Paige-Saunders algorithm
LU LU factorization
MGS modified Gram-Schmidt
NLLSQ nonlinear least squares
NMR nuclear magnetic resonance
NN neural network
QR QR decomposition
RMS root mean square
RRQR rank revealing QR decomposition
SVD singular value decomposition
TLS total least squares problem
TSVD truncated singular value decomposition
UTV UTV decomposition
VARPRO variable projection algorithm, Netlib version
VP variable projection
Chapter 1

The Linear Data Fitting


Problem

This chapter gives an introduction to the linear data fitting problem: how
it is defined, its mathematical aspects and how it is analyzed. We also give
important statistical background that provides insight into the data fitting
problem. Anyone with more interest in the subject is encouraged to consult
the pedagogical expositions by Bevington [13], Rust [213], Strutz [233] and
van den Bos [242].
We start with a couple of simple examples that introduce the basic
concepts of data fitting. Then we move on to a more formal definition, and
we discuss some statistical aspects. Throughout the first chapters of this
book we will return to these data fitting problems in order to illustrate the
ensemble of numerical methods and techniques available to solve them.

1.1 Parameter estimation, data approximation


Example 1. Parameter estimation. In food-quality analysis, the amount
and mobility of water in meat has been shown to affect quality attributes like
appearance, texture and storage stability. The water contents can be mea-
sured by means of nuclear magnetic resonance (NMR) techniques, in which
the measured signal reflects the amount and properties of different types of
water environments in the meat. Here we consider a simplified example
involving frozen cod, where the ideal time signal φ(t) from NMR is a sum
of two damped exponentials plus a constant background,
φ(t) = x_1 e^{-λ_1 t} + x_2 e^{-λ_2 t} + x_3, \qquad λ_1, λ_2 > 0.
In this example we assume that we know the parameters λ1 and λ2 that
control the decay of the two exponential components. In practice we do not

measure this pure signal, but rather a noisy realization of it as shown in Figure 1.1.1.

Figure 1.1.1: Noisy measurements of the time signal φ(t) from NMR, for the example with frozen cod meat.
The parameters λ1 = 27 s^{−1} and λ2 = 8 s^{−1} characterize two different
types of proton environments, responsible for two different water mobilities.
The amplitudes x1 and x2 are proportional to the amount of water contained
in the two kinds of proton environments. The constant x3 accounts for an
undesired background (bias) in the measurements. Thus, there are three
unknown parameters in this model, namely, x1 , x2 and x3 . The goal of data
fitting in relation to this problem is to use the measured data to estimate
the three unknown parameters and then compute the different kinds of water
contents in the meat sample. The actual fit is presented in Figure 1.2.1.

In this example we used the technique of data fitting for the


purpose of estimating unknown parameters in a mathemat-
ical model from measured data. The model was dictated by
the physical or other laws that describe the data.
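To make the setup concrete, here is a minimal Python sketch of how noisy data of this form could be simulated and the model evaluated. The decay rates are those quoted above, and the amplitudes and noise level are taken from Example 3 below; the measurement times and the noise realization are our own illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: simulate noisy NMR-type data for the model
#   phi(t) = x1*exp(-lambda1*t) + x2*exp(-lambda2*t) + x3.
# The sampling times and noise realization are assumptions for illustration;
# the decay rates, amplitudes and noise level follow Examples 1 and 3.
rng = np.random.default_rng(0)

lam1, lam2 = 27.0, 8.0                  # decay rates from Example 1 (per second)
x_true = np.array([1.27, 2.04, 0.3])    # amplitudes and background from Example 3
t = np.linspace(0.0, 0.5, 50)           # assumed measurement times (50 points)
sigma = 0.1                             # noise standard deviation from Example 3

def phi(x, t):
    """Evaluate the NMR model for parameters x = (x1, x2, x3)."""
    return x[0]*np.exp(-lam1*t) + x[1]*np.exp(-lam2*t) + x[2]

y = phi(x_true, t) + sigma*rng.standard_normal(t.size)   # noisy measurements
```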

Example 2. Data approximation. We are given measurements of air


pollution, in the form of the concentration of NO, over a period of 24 hours,
on a busy street in a major city. Since the NO concentration is mainly due
to the cars, it has maximum values in the morning and in the afternoon,
when the traffic is most intense. The data is shown in Table 1.1 and the plot in Figure 1.2.2.

ti  yi        ti  yi        ti  yi        ti  yi        ti  yi
 0  110.49     5  29.37     10  294.75    15  245.04    20  216.73
 1  73.72      6  74.74     11  253.78    16  286.74    21  185.78
 2  23.39      7  117.02    12  250.48    17  304.78    22  171.19
 3  17.11      8  298.04    13  239.48    18  288.76    23  171.73
 4  20.31      9  348.13    14  236.52    19  247.11    24  164.05

Table 1.1: Measurements of NO concentration yi as a function of time ti. The units of yi and ti are μg/m³ and hours, respectively.
For further analysis of the air pollution we need to fit a smooth curve
to the measurements, so that we can compute the concentration at an arbi-
trary time between 0 and 24 hours. For example, we can use a low-degree
polynomial to model the data, i.e., we assume that the NO concentration
can be approximated by

f(t) = x_1 t^p + x_2 t^{p-1} + \cdots + x_p t + x_{p+1},
where t is the time, p is the degree of the polynomial and


x1 , x2 , . . . , xp+1 are the unknown coefficients in the polynomial. A better
model however, since the data repeats every day, would use periodic func-
tions:

f (t) = x1 + x2 sin(ω t) + x3 cos(ω t) + x4 sin(2ω t) + x5 cos(2ω t) + · · ·

where ω = 2π/24 is the period. Again, x1 , x2 , . . . are the unknown coeffi-


cients. The goal of data fitting in relation to this problem is to estimate the
coefficients x1 , x2 , . . ., such that we can evaluate the function f (t) for any
argument t. At the same time we want to suppress the influence of errors
present in the data.

In this example we used the technique of data fitting for the


purpose of approximating measured discrete data: we fitted a
model to given data in order to be able to compute smoothed
data for any value of the independent variable in the model.
We were free to choose the model as long as it gave an ade-
quate fit to the data.

Both examples illustrate that we are given data with measurement er-
rors and that we want to fit a model to these data that captures the “overall
behavior” of it without being too sensitive to the errors. The difference be-
tween the two examples is that in the first case the model arises from a
physical theory, while in the second there is an arbitrary continuous ap-
proximation to a set of discrete data.
Data fitting is distinctly different from the problem of interpolation,
where we seek a model – a function f (t) – that interpolates the given data,
i.e., it satisfies f (ti ) = yi for all the data points. We are not interested
in interpolation (which is not suited for noisy data) – rather, we want to
approximate the noisy data with a parametric model that is either given or
that we can choose, in such a way that the result is not too sensitive to the
noise. In this data fitting approach there is redundant data: i.e., more data
than unknown parameters, which also helps to decrease the uncertainty in

the parameters of the model. See Example 15 in the next chapter for a
justification of this.

1.2 Formulation of the data fitting problem


Let us now give a precise definition of the data fitting problem. We assume
that we are given m data points

(t1 , y1 ), (t2 , y2 ), . . . , (tm , ym ),

which can be described by the relation

yi = Γ(ti ) + ei , i = 1, 2, . . . , m. (1.2.1)

The function Γ(t), which we call the pure-data function, describes the
noise-free data (it may be unknown, or given by the application), while
e1 , e2 , . . . , em are the data errors (they are unknown, but we may have
some statistical information about them). The data errors – also referred
to as noise – represent measurement errors as well as random variations
in the physical process that generates the data. Without loss of generality
we can assume that the abscissas ti appear in non-decreasing order, i.e.,
t 1 ≤ t 2 ≤ · · · ≤ tm .
In data fitting we wish to compute an approximation to Γ(t) – typically
in the interval [t1 , tm ]. The approximation is given by the fitting model
M (x, t), where the vector x = (x1 , x2 , . . . , xn )T contains n parameters
that characterize the model and are to be determined from the given noisy
data. In the linear data fitting problem we always have a model of the form


Linear fitting model:   M(x, t) = \sum_{j=1}^{n} x_j f_j(t).   (1.2.2)

The functions fj (t) are called the model basis functions, and the number
n – the order of the fit – should preferably be smaller than the number
m of data points. A notable modern exception is related to the so-called
compressed sensing, which we discuss briefly in Section 7.5.
The form of the function M (x, t) – i.e., the choice of basis functions –
depends on the precise goal of the data fitting. These functions may be
given by the underlying mathematical model that describes the data – in
which case M (x, t) is often equal to, or an approximation to, the pure-data
function Γ(t) – or the basis functions may be chosen arbitrarily among all
functions that give the desired approximation and allow for stable numerical
computations.
The method of least squares (LSQ) is a standard technique for deter-
mining the unknown parameters in the fitting model. The least squares fit

is defined as follows. We introduce the residual ri associated with the data


points as
ri = yi − M (x, ti ), i = 1, 2, . . . , m,
and we note that each residual is a function of the parameter vector x, i.e.,
ri = ri (x). A least squares fit is a choice of the parameter vector x that
minimizes the sum-of-squares of the residuals:


LSQ fit:   \min_x \sum_{i=1}^{m} r_i(x)^2 = \min_x \sum_{i=1}^{m} \bigl( y_i - M(x, t_i) \bigr)^2.   (1.2.3)

In the next chapter we shall describe in which circumstances the least


squares fit is unique, and in the following chapters we shall describe a
number of efficient computational methods for obtaining the least squares
parameter vector x.
We note in passing that there are other related criteria used in data
fitting; for example, one could replace the sum-of-squares in (1.2.3) with
the sum-of-absolute-values:

\min_x \sum_{i=1}^{m} |r_i(x)| = \min_x \sum_{i=1}^{m} \bigl| y_i - M(x, t_i) \bigr|.   (1.2.4)

Below we shall use a statistical perspective to describe when these two


choices are appropriate. However, we emphasize that the book focuses on
the least squares fit.
In order to obtain a better understanding of the least squares data fitting
problem we take a closer look at the residuals, which we can write as
   
r_i = y_i - M(x, t_i) = \bigl( y_i - Γ(t_i) \bigr) + \bigl( Γ(t_i) - M(x, t_i) \bigr)
    = e_i + \bigl( Γ(t_i) - M(x, t_i) \bigr), \qquad i = 1, 2, \ldots, m.   (1.2.5)

We see that the ith residual consists of two components: the data error
ei comes from the measurements, while the approximation error Γ(ti ) −
M (x, ti ) is due to the discrepancy between the pure-data function and the
computed fitting model. We emphasize that even if Γ(t) and M (x, t) have
the same form, there is no guarantee that the estimated parameters x used
in M (x, t) will be identical to those underlying the pure-data function Γ(t).
At any rate, we see from this dichotomy that a good fitting model M (x, t)
is one for which the approximation errors are of the same size as the data
errors.
Underlying the least squares formulation in (1.2.3) are the assumptions
that the data and the errors are independent and that the errors are “white
noise.” The latter means that all data errors are uncorrelated and of the

same size – or in more precise statistical terms, that the errors ei have mean
zero and identical variance: E(ei ) = 0 and E(e2i ) = ς 2 for i = 1, 2, . . . , m
(where ς is the standard deviation of the errors).
This ideal situation is not always the case in practice! Hence, we also
need to consider the more general case where the standard deviation de-
pends on the index i, i.e.,

E(ei ) = 0, E(e2i ) = ςi2 , i = 1, 2, . . . , m,

where ςi is the standard deviation of ei . In this case, the maximum likeli-


hood principle in statistics (see Section 1.3), tells us that we should min-
imize the weighted residuals, with weights equal to the reciprocals of the
standard deviations:
\min_x \sum_{i=1}^{m} \left( \frac{r_i(x)}{ς_i} \right)^2 = \min_x \sum_{i=1}^{m} \left( \frac{y_i - M(x, t_i)}{ς_i} \right)^2.   (1.2.6)

Now consider the expected value of the weighted sum-of-squares:


E\left( \sum_{i=1}^{m} \left( \frac{r_i(x)}{ς_i} \right)^2 \right)
  = \sum_{i=1}^{m} \frac{E\bigl( r_i(x)^2 \bigr)}{ς_i^2}
  = \sum_{i=1}^{m} \frac{E\bigl( e_i^2 \bigr)}{ς_i^2}
    + \sum_{i=1}^{m} \frac{E\bigl( (Γ(t_i) - M(x, t_i))^2 \bigr)}{ς_i^2}
  = m + \sum_{i=1}^{m} \frac{E\bigl( (Γ(t_i) - M(x, t_i))^2 \bigr)}{ς_i^2},

where we have used that E(ei ) = 0 and E(e2i ) = ςi2 . The consequence of
this relation is the intuitive result that we can allow the expected value of
the approximation errors to be larger for those data (ti , yi ) that have larger
standard deviations (i.e., larger errors). Example 4 illustrates the usefulness
of this approach. See Chapter 3 in [233] for a thorough discussion on how
to estimate weights for a given data set.
We are now ready to state the least squares data fitting problem in
terms of matrix-vector notation. We define the matrix A ∈ Rm×n and the
vectors y, r ∈ Rm as follows,
A = \begin{pmatrix}
  f_1(t_1) & f_2(t_1) & \cdots & f_n(t_1) \\
  f_1(t_2) & f_2(t_2) & \cdots & f_n(t_2) \\
  \vdots   & \vdots   &        & \vdots   \\
  f_1(t_m) & f_2(t_m) & \cdots & f_n(t_m)
\end{pmatrix}, \qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \qquad
r = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{pmatrix},

i.e., y is the vector of observations, r is the vector of residuals and the


matrix A is constructed such that the jth column is the jth model basis

function sampled at the abscissas t1 , t2 , . . . , tm . Then it is easy to see that


for the un-weighted data fitting problem we have the relations

r = y - A x \quad \text{and} \quad ρ(x) = \sum_{i=1}^{m} r_i(x)^2 = \|r\|_2^2 = \|y - A x\|_2^2.

Similarly, for the weighted problem we have



ρ_W(x) = \sum_{i=1}^{m} \left( \frac{r_i(x)}{ς_i} \right)^2 = \|W (y - A x)\|_2^2,

with the weighting matrix and weights


W = diag(w_1, \ldots, w_m), \qquad w_i = ς_i^{-1}, \quad i = 1, 2, \ldots, m.
In both cases, the computation of the coefficients in the least squares fit is
identical to the solution of a linear least squares problem for x. Throughout
the book we will study these least squares problems in detail and give
efficient computational algorithms to solve them.
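The construction of A and the weighting translate directly into code. The following Python sketch (the helper names are ours, not from the book) builds A column by column from a list of basis functions and solves the, possibly weighted, problem with a standard LSQ solver.

```python
import numpy as np

def design_matrix(basis, t):
    """Build A with A[i, j] = f_j(t_i) from a list of basis functions."""
    return np.column_stack([f(t) for f in basis])

def lsq_fit(basis, t, y, sigma=None):
    """Solve the (optionally weighted) linear LSQ problem for x.

    If standard deviations sigma are given, rows are scaled by w_i = 1/sigma_i,
    which is equivalent to minimizing ||W(y - A x)||_2^2.
    """
    A = design_matrix(basis, np.asarray(t, dtype=float))
    y = np.asarray(y, dtype=float)
    if sigma is not None:
        w = 1.0 / np.asarray(sigma, dtype=float)
        A, y = A * w[:, None], y * w
    x, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x
```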
Example 3. We return to the NMR data fitting problem from Example 1.
For this problem there are 50 measured data points and the model basis
functions are
f_1(t) = e^{-λ_1 t}, \quad f_2(t) = e^{-λ_2 t}, \quad f_3(t) = 1,
and hence we have m = 50 and n = 3. In this example the errors in all
data points have the same standard deviation ς = 0.1, so we can use the
un-weighted approach. The solution to the 50 × 3 least squares problem is
x1 = 1.303, x2 = 1.973, x3 = 0.305.
The exact parameters used to generate the data are 1.27, 2.04 and 0.3,
respectively. These data were then perturbed with random errors. Figure
1.2.1 shows the data together with the least squares fit M (x, t); note how
the residuals are distributed on both sides of the fit.
For the data fitting problem in Example 2, we try both the polynomial
fit and the trigonometric fit. In the first case the basis functions are the
monomials fj(t) = t^{n−j}, for j = 1, . . . , n = p + 1, where p is the degree of
the polynomial. In the second case the basis functions are the trigonometric
functions:
f1 (t) = 1, f2 (t) = sin(ω t), f3 (t) = cos(ω t),
f4 (t) = sin(2ω t), f5 (t) = cos(2ω t), ...
Figure 1.2.2 shows the two fits using a polynomial of degree p = 8 (giving
a fit of order n = 9) and a trigonometric fit with n = 9. The trigonometric
fit looks better. We shall later introduce computational tools that let us
investigate this aspect in more rigorous ways.
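As an illustration, the two fits of order n = 9 to the NO data can be reproduced from Table 1.1 with a few lines of Python. The sketch below (our own code, not the authors' software) builds the two model matrices and compares the residual norms of the fits.

```python
import numpy as np

# NO concentration data from Table 1.1 (t in hours, y in micrograms/m^3).
t = np.arange(25, dtype=float)
y = np.array([110.49, 73.72, 23.39, 17.11, 20.31, 29.37, 74.74, 117.02,
              298.04, 348.13, 294.75, 253.78, 250.48, 239.48, 236.52,
              245.04, 286.74, 304.78, 288.76, 247.11, 216.73, 185.78,
              171.19, 171.73, 164.05])

# Polynomial fit of degree p = 8 (order n = 9): columns t^8, t^7, ..., t, 1.
A_poly = np.vander(t, 9)
x_poly, *_ = np.linalg.lstsq(A_poly, y, rcond=None)

# Trigonometric fit of order n = 9: 1, sin(k*omega*t), cos(k*omega*t), k = 1,...,4.
omega = 2*np.pi/24
A_trig = np.column_stack([np.ones_like(t)] +
                         [f(k*omega*t) for k in range(1, 5) for f in (np.sin, np.cos)])
x_trig, *_ = np.linalg.lstsq(A_trig, y, rcond=None)

# Compare the residual norms of the two order-9 fits.
for name, A, x in [("polynomial", A_poly, x_poly), ("trigonometric", A_trig, x_trig)]:
    print(name, np.linalg.norm(y - A @ x))
```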

Figure 1.2.1: The least squares fit (solid line) to the measured NMR data
(dots) from Figure 1.1.1 in Example 1.

Figure 1.2.2: Two least squares fits (both of order n = 9) to the mea-
sured NO data from Example 2, using a polynomial (left) and trigonometric
functions (right).

Figure 1.2.3: Histograms of the computed values of x2 for the modified


NMR data in which the first 10 data points have larger errors. The left
plot shows results from solving the un-weighted LSQ problem, while the
right plot shows the results when weights wi = ςi^{−1} are included in the LSQ
problem.

Example 4. This example illustrates the importance of using weights when


computing the fit. We use again the NMR data from Example 1, except
this time we add larger Gaussian noise to the first 10 data points, with
standard deviation 0.5. Thus we have ςi = 0.5 for i = 1, 2, . . . , 10 (the
first 10 data with larger errors) and ςi = 0.1 for i = 11, 12, . . . , 50 (the
remaining data with smaller errors). The corresponding weights wi = ςi^{−1}
are therefore 2, 2, . . . , 2, 10, 10, . . . , 10. We solve the data fitting problem
with and without weights for 10, 000 instances of the noise. To evaluate the
results, we consider how well we estimate the second parameter x2 whose
exact value is 2.04. The results are shown in Figure 1.2.3 in the form of
histograms of the computed values of x2 . Clearly, the weighted fit gives more
robust results because it is less influenced by the data with large errors.
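The experiment can be sketched in Python as follows; the sampling times are an assumption (they are not listed in the text), while the noise levels, the weights and the number of noise realizations follow Example 4.

```python
import numpy as np

# Sketch of the experiment in Example 4: repeat the NMR fit many times with
# larger noise on the first 10 points, with and without weights, and compare
# the spread of the estimated x2. The sampling times are an assumption.
rng = np.random.default_rng(1)
lam1, lam2 = 27.0, 8.0
x_true = np.array([1.27, 2.04, 0.3])
t = np.linspace(0.0, 0.5, 50)
sigma = np.where(np.arange(50) < 10, 0.5, 0.1)   # standard deviation per point

A = np.column_stack([np.exp(-lam1*t), np.exp(-lam2*t), np.ones_like(t)])
y_pure = A @ x_true

x2_plain, x2_weighted = [], []
for _ in range(10000):
    y = y_pure + sigma*rng.standard_normal(t.size)
    x_p, *_ = np.linalg.lstsq(A, y, rcond=None)                # unweighted fit
    w = 1.0/sigma
    x_w, *_ = np.linalg.lstsq(A*w[:, None], y*w, rcond=None)   # weighted fit
    x2_plain.append(x_p[1])
    x2_weighted.append(x_w[1])

print(np.std(x2_plain), np.std(x2_weighted))   # weighted spread is smaller
```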

1.3 Maximum likelihood estimation


At first glance the problems of interpolation and data fitting seem to re-
semble each other. In both cases, we use approximation theory to select
a good model (through the model basis functions) for the given data that
results in small approximation errors – in the case of data fitting, given by
the term Γ(t) − M (x, t) for t ∈ [t1 , tm ]. The main difference between the
two problems comes from the presence of data errors and the way we deal
with these errors.
In data fitting we deliberately avoid interpolating the data and instead
settle for less degrees of freedom in the model, in order to reduce the model’s
sensitivity to errors. Also, it is clear that the data noise plays an important
role in data fitting problems, and we should use concepts and tools from
statistics to deal with it.
The classical statistical motivation for the least squares fit is based
on the maximum likelihood principle. Our presentation follows [13]. We
assume that the data are given by (1.2.1), that the errors ei are unbiased

and uncorrelated, and that each error ei has a Gaussian distribution with
standard deviation ςi , i.e.,

yi = Γ(ti ) + ei , ei ∼ N (0, ςi2 ).

Here, N (0, ςi2 ) denotes the normal (Gaussian) distribution with zero mean
and standard deviation ςi . Gaussian errors arise, e.g., from the measure-
ment process or the measuring devices, and they are also good models of
composite errors that arise when several sources contribute to the noise.
Then the probability Pi for making the observation yi is given by
P_i = \frac{1}{ς_i \sqrt{2π}}\, e^{ -\frac{1}{2} \left( \frac{y_i - Γ(t_i)}{ς_i} \right)^2 }
    = \frac{1}{ς_i \sqrt{2π}}\, e^{ -\frac{1}{2} ( e_i / ς_i )^2 }.
The probability P for making the observed set of measurements y1 , y2 , . . . , ym
is the product of these probabilities:
P = P_1 P_2 \cdots P_m = \prod_{i=1}^{m} \frac{1}{ς_i \sqrt{2π}}\, e^{ -\frac{1}{2} ( e_i / ς_i )^2 }
  = K \prod_{i=1}^{m} e^{ -\frac{1}{2} ( e_i / ς_i )^2 },

where the factor K = \prod_{i=1}^{m} \bigl( ς_i \sqrt{2π} \bigr)^{-1} is independent of the pure-data
function.
Now assume that the pure-data function Γ(t) is identical to the fitting
model M (x, t), for a specific but unknown parameter vector x∗ . The proba-
bility for the given data y1 , y2 , . . . , ym to be produced by the model M (x, t)
for an arbitrary parameter vector x is given by
P_x = K \prod_{i=1}^{m} e^{ -\frac{1}{2} \left( \frac{y_i - M(x, t_i)}{ς_i} \right)^2 }
    = K\, e^{ -\frac{1}{2} \sum_{i=1}^{m} \left( \frac{y_i - M(x, t_i)}{ς_i} \right)^2 }.   (1.3.1)

The method of maximum likelihood applied to data fitting consists of mak-


ing the assumption that the given data are more likely to have come from
our model M (x, t) – with specific but unknown parameters x∗ – than any
other model of this form (i.e., with any other parameter vector x). The best
estimate of the model is therefore the one that maximizes the probability
Px given above, as a function of x. Clearly, Px is maximized by mini-
mizing the exponent, i.e., the weighted sum-of-squared residuals. We have
thus – under the assumption of Gaussian errors – derived the weighted least
squares data fitting problem (1.2.6). In practice, we use the same principle
when there is a discrepancy between the pure-data function and the fitting
model.
The same approach can be used to derive the maximum likelihood fit
for other types of errors. As an important example, we consider errors
that follow a Laplace distribution, for which the probability of making the

observation yi is given by
P_i = \frac{1}{2 ς_i}\, e^{ - \frac{|y_i - Γ(t_i)|}{ς_i} } = \frac{1}{2 ς_i}\, e^{ - \frac{|e_i|}{ς_i} },
where again ςi is the standard deviation. The Laplace density function
decays slower than the Gaussian density function, and thus the Laplace
distribution describes measurement situations where we are more likely to
have large errors than in the case of the Gaussian distribution.
Following once again the maximum likelihood approach we arrive at the
problem of maximizing the function

P_x = K' \prod_{i=1}^{m} e^{ - \frac{|y_i - M(x, t_i)|}{ς_i} }
    = K'\, e^{ - \sum_{i=1}^{m} \left| \frac{y_i - M(x, t_i)}{ς_i} \right| },

with K' = \prod_{i=1}^{m} (2 ς_i)^{-1}. Hence, for these errors we should minimize the
sum-of-absolute-values of the weighted residuals. This is the linear 1-norm
minimization problem that we mentioned in (1.2.4).
While the principle of maximum likelihood is universally applicable, it
can lead to complicated or intractable computational problems. As an ex-
ample, consider the case of Poisson data, where yi comes from a Poisson
distribution, with expected value Γ(ti) and standard deviation Γ(ti)^{1/2}.
Poisson data typically show up in counting measurements, such as the pho-
ton counts underlying optical detectors. Then the probability for making
the observation yi is
P_i = \frac{Γ(t_i)^{y_i}}{y_i!}\, e^{ -Γ(t_i) },
and hence, we should maximize the probability


P_x = \prod_{i=1}^{m} \frac{M(x, t_i)^{y_i}}{y_i!}\, e^{ -M(x, t_i) }
    = \left( \prod_{i=1}^{m} \frac{1}{y_i!} \right) \left( \prod_{i=1}^{m} M(x, t_i)^{y_i} \right) e^{ - \sum_{i=1}^{m} M(x, t_i) }.

Unfortunately, it is computationally demanding to maximize this quantity


with respect to x, and instead one usually makes the assumption that the
Poisson errors for each data value yi are nearly Gaussian, with standard deviation
ςi = Γ(ti)^{1/2} ≈ yi^{1/2} (see, e.g., pp. 342–343 in [146] for a justification of this
assumption). Hence, the above weighted least squares approach derived for
Gaussian noise, with weights wi = yi^{−1/2}, will give a good approximation
to the maximum likelihood fit for Poisson data.
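As a small sketch (with a hypothetical model matrix A and positive count data y, not an example from the book), this approximation amounts to a simple row scaling before solving the LSQ problem:

```python
import numpy as np

def poisson_weighted_lsq(A, y):
    """Approximate ML fit for Poisson counts via weighted LSQ.

    Uses weights w_i = 1/sqrt(y_i), valid for strictly positive counts y_i.
    """
    y = np.asarray(y, dtype=float)
    w = 1.0/np.sqrt(y)
    x, *_ = np.linalg.lstsq(A*w[:, None], y*w, rcond=None)
    return x
```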

The Gauss and Laplace errors discussed above are used to model addi-
tive errors in the data. We finish this section with a brief look at relative
errors, which arise when the size of the error ei is – perhaps to a good
approximation – proportional to the magnitude of the pure data Γ(ti ). A
straightforward way to model such errors, which fits into the above frame-
work, is to assume that the data yi can be described by a normal distribu-
tion with mean Γ(ti ) and standard deviation ςi = |Γ(ti )| ς. This “relative
Gaussian errors” model can also be written as

yi = Γ(ti ) (1 + ei ), ei ∼ N (0, ς 2 ). (1.3.2)

Then the probability for making the observation yi is


P_i = \frac{1}{|Γ(t_i)|\, ς \sqrt{2π}}\, e^{ -\frac{1}{2} \left( \frac{y_i - Γ(t_i)}{Γ(t_i)\, ς} \right)^2 }.

Using the maximum likelihood principle again and substituting the mea-
sured data yi for the unknown pure data Γ(ti ), we arrive at the following
weighted least squares problem:


\min_x \sum_{i=1}^{m} \left( \frac{y_i - M(x, t_i)}{y_i} \right)^2 = \min_x \|W (y - A x)\|_2^2,   (1.3.3)

with weights wi = yi^{−1}.


An alternative formulation, which is suited for problems with positive
data Γ(ti ) > 0 and yi > 0, is to assume that yi can be described by a log-
normal distribution, for which log yi has a normal distribution, with mean
log Γ(ti ) and standard deviation ς:

y_i = Γ(t_i)\, e^{ε_i} \quad \Leftrightarrow \quad \log y_i = \log Γ(t_i) + ε_i, \qquad ε_i \sim N(0, ς^2).

In this case we again arrive at a sum-of-squares minimization problem,


but now involving the difference of the logarithms of the data yi and the
model M (x, ti ). Even when M (x, t) is a linear model, this is not a linear
problem in x.
In the above log-normal model with standard deviation ς, the probabil-
ity Pi for making the observation yi is given by
P_i = \frac{1}{y_i\, ς \sqrt{2π}}\, e^{ -\frac{1}{2} \left( \frac{\log y_i - \log Γ(t_i)}{ς} \right)^2 }
    = \frac{1}{y_i\, ς \sqrt{2π}}\, e^{ -\frac{1}{2} \left( \frac{\log ŷ_i}{ς} \right)^2 },

with ŷi = yi/Γ(ti). Now let us assume that ς is small compared to Γ(ti),
such that yi ≈ Γ(ti) and ŷi ≈ 1. Then, we can write yi = Γ(ti)(1 + ei) ⇔
ŷi = 1 + ei, with ei ≪ 1 and log ŷi = ei + O(ei²). Hence, the exponential

factor in Pi becomes
e^{ -\frac{1}{2} \left( \frac{\log ŷ_i}{ς} \right)^2 }
  = e^{ -\frac{1}{2} \frac{ ( e_i + O(e_i^2) )^2 }{ ς^2 } }
  = e^{ -\frac{1}{2} \left( \frac{e_i^2}{ς^2} + \frac{O(e_i^3)}{ς^2} \right) }
  = e^{ -\frac{1}{2} ( e_i / ς )^2 }\, e^{ -\frac{O(e_i^3)}{2 ς^2} }
  = e^{ -\frac{1}{2} ( e_i / ς )^2 }\, O(1),

while the other factor in P_i becomes

\frac{1}{ y_i\, ς \sqrt{2π} } = \frac{1}{ Γ(t_i) (1 + e_i)\, ς \sqrt{2π} } = \frac{ 1 + O(e_i) }{ Γ(t_i)\, ς \sqrt{2π} }.
Hence, as long as ς ≪ Γ(ti) we have the approximation

P_i \simeq \frac{1}{ Γ(t_i)\, ς \sqrt{2π} }\, e^{ -\frac{1}{2} ( e_i / ς )^2 }
    = \frac{1}{ Γ(t_i)\, ς \sqrt{2π} }\, e^{ -\frac{1}{2} \left( \frac{y_i - Γ(t_i)}{Γ(t_i)\, ς} \right)^2 },
which is the probability introduced above for the case of “relative Gaussian
errors.” Hence, for small noise levels ς ≪ |Γ(ti)|, the two different models
for introducing relative errors in the data are practically identical, leading
to the same weighted LSQ problem (1.3.3).

1.4 The residuals and their properties


This section focuses on the residuals ri = yi − M (x, ti ) for a given fit
and how they can be used to analyze the “quality” of the fit M (x, t) that
we have computed. Throughout the section we assume that the residuals
behave like a time series, i.e., they have a natural ordering r1 , r2 , . . . , rm
associated with the ordering t1 < t2 < · · · < tm of the samples of the
independent variable t.
As we already saw in Equation (1.2.5), each residual ri consists of two
components – the data error ei and the approximation error Γ(ti )−M (x, ti ).
For a good fitting model, the approximation error should be of the same
size as the data errors (or smaller). At the same time, we do not want
the residuals to be too small, since then the model M (x, t) may overfit the
data: i.e., not only will it capture the behavior of the pure-data function
Γ(t), but it will also adapt to the errors, which is undesirable.
In order to choose a good fitting model M (x, t) we must be able to
analyze the residuals ri and determine whether the model captures the
pure-data function “well enough.” We can say that this is achieved when the
approximation errors are smaller than the data errors, so that the residuals
are practically dominated by the data errors. In that case, some of the
statistical properties of the errors will carry over to the residuals. For
example, if the noise is white (cf. Section 1.1), then we will expect that the
residuals associated with a satisfactory fit show properties similar to white
noise.

If, on the other hand, the fitting model does not capture the main
behavior of the pure-data function, then we can expect that the residuals
are dominated by the approximation errors. When this is the case, the
residuals will not have the characteristics of noise, but instead they will
tend to behave as a sampled signal, i.e., the residuals will show strong
local correlations. We will use the term trend to characterize a long-term
movement in the residuals when considered as a time series.
Below we will discuss some statistical tests that can be used to determine
whether the residuals behave like noise or include trends. These and many
other tests are often used in time series analysis and signal processing.
Throughout this section we make the following assumptions about the data
errors ei :

• They are random variables with mean zero and identical variance,
i.e., E(ei ) = 0 and E(e2i ) = ς 2 for i = 1, 2, . . . , m.

• They belong to a normal distribution, ei ∼ N (0, ς 2 ).

We will describe three tests with three different properties:

• Randomness test: check for randomness of the signs of the residuals.

• Autocorrelation test: check whether the residuals are uncorrelated.

• White noise test: check for randomness of the residuals.

The use of the tools introduced here is illustrated below and in Chapter 11
on applications.

Test for random signs


Perhaps the simplest analysis of the residuals is based on the statistical
question: can we consider the signs of the residuals to be random? (Which
will often be the case when ei is white noise with zero mean.) We can
answer this question by means of a run test from time series analysis; see,
e.g., Section 10.4 in [134].
Given a sequence of two symbols – in our case, “+” and “−” for positive
and negative residuals ri – a run is defined as a succession of identical
symbols surrounded by different symbols. For example, the sequence “+ +
+ − − − − + + − − − − − + + +” has m = 17 elements, n+ = 8 pluses,
n− = 9 minuses and u = 5 runs: + + +, − − −−, ++, − − − − − and
+ + +. The distribution of runs u (not the residuals!) can be approximated
by a normal distribution with mean μu and standard deviation ςu given by

μ_u = \frac{2 n_+ n_-}{m} + 1, \qquad ς_u^2 = \frac{ (μ_u - 1)(μ_u - 2) }{ m - 1 }.   (1.4.1)

With a 5% significance level we will accept the sign sequence as random if

z_± = \frac{ |u - μ_u| }{ ς_u } < 1.96   (1.4.2)

(other values of the threshold, for other significance levels, can be found in
any book on statistics). If the signs of the residuals are not random, then
it is likely that trends are present in the residuals. In the above example
with 5 runs we have z± = 2.25, and according to (1.4.2) the sequence of
signs is not random.
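A direct implementation of the run test is straightforward; the sketch below (our own helper function, not the book's code) reproduces z± ≈ 2.25 for the 17-element sign sequence above.

```python
import numpy as np

def run_test(r):
    """Return z_pm of (1.4.2); signs are taken as random (5% level) if z_pm < 1.96."""
    s = np.sign(r)
    s = s[s != 0]                       # ignore exact zeros
    m = s.size
    n_plus = np.sum(s > 0)
    n_minus = np.sum(s < 0)
    u = 1 + np.sum(s[1:] != s[:-1])     # number of runs
    mu_u = 2.0*n_plus*n_minus/m + 1.0
    var_u = (mu_u - 1.0)*(mu_u - 2.0)/(m - 1.0)
    return abs(u - mu_u)/np.sqrt(var_u)
```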

Test for correlation


Another question we can ask is whether short sequences of residuals are
correlated, which is a clear indication of trends. The autocorrelation of the
residuals is a statistical tool for analyzing this. We define the autocorrela-
tion of the residuals, as well as the trend threshold T , as the quantities


ϱ = \sum_{i=1}^{m-1} r_i\, r_{i+1}, \qquad T = \frac{1}{\sqrt{m-1}} \sum_{i=1}^{m} r_i^2.   (1.4.3)

Since ϱ is the sum of products of neighboring residuals, it is in fact the


unit-lag autocorrelation. Autocorrelations with larger lags, or distances in
the index, can also be considered. Then, we say that trends are likely to be
present in the residuals if the absolute value of the autocorrelation exceeds
the trend threshold, i.e., if |ϱ| > T. Similar techniques, based on shorter
sequences of residuals, are used for placing knots in connection with spline
fitting; see Chapter 6 in [125].
We note that in some presentations, the mean of the residuals is sub-
tracted before computing ϱ and T. In our applications this should not be
necessary, as we assume that the errors have zero mean.
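The autocorrelation test (1.4.3) can be sketched in the same style (our own helper function):

```python
import numpy as np

def autocorrelation_test(r):
    """Return (rho, T, trend_suspected) for the unit-lag autocorrelation test."""
    r = np.asarray(r, dtype=float)
    rho = np.sum(r[:-1]*r[1:])                 # unit-lag autocorrelation
    T = np.sum(r**2)/np.sqrt(r.size - 1)       # trend threshold
    return rho, T, abs(rho) > T                # True suggests a trend
```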

Test for white noise


Yet another question we can ask is whether the sequence of residuals be-
haves like white noise, which can be answered by means of the normalized
cumulative periodogram. The underlying idea is that white noise has a flat
spectrum, i.e., all frequency components in the discrete Fourier spectrum
have the same probability; hence, we must determine whether this is the
case. Let the complex numbers r̂k denote the components of the discrete
Fourier transform of the residuals, i.e.,


\hat r_k = \sum_{i=1}^{m} r_i\, e^{ 2π ı̂ (i-1)(k-1)/m }, \qquad k = 1, \ldots, m,

where ı̂ denotes the imaginary unit. Our indices are in the range 1, . . . , m
and thus shifted by 1 relative to the range 0, . . . , m−1 that is common in sig-
nal processing. Note that r̂1 is the sum of the residuals (called the DC com-
ponent in signal processing), while r̂q+1 with q = m/2 is the component of
the highest frequency. The squared absolute values |r̂1 |2 , |r̂2 |2 , . . . , |r̂q+1 |2
are known as the periodogram (in statistics) or the power spectrum (in
signal processing). Then the normalized cumulative periodogram consists
of the q numbers

c_i = \frac{ |\hat r_2|^2 + |\hat r_3|^2 + \cdots + |\hat r_{i+1}|^2 }{ |\hat r_2|^2 + |\hat r_3|^2 + \cdots + |\hat r_{q+1}|^2 }, \qquad i = 1, \ldots, q, \quad q = \lfloor m/2 \rfloor,

which form an increasing sequence from 0 to 1. Note that the sums exclude
the first term in the periodogram.
If the residuals are white noise, then the expected values of the normal-
ized cumulative periodogram lie on a straight line from (0, 0) to (q, 1). Any
realization of white noise residuals should produce a normalized cumula-
tive periodogram close to a straight line. For example, with the common
5% significance level from statistics, the numbers ci should lie within the
Kolmogorov-Smirnov limit ±1.35/q of the straight line. If the maximum
deviation maxi {|ci − i/q|} is smaller than this limit, then we recognize the
residual as white noise.
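The white-noise test can be sketched with a discrete Fourier transform as follows (our own helper; the sign convention of the FFT does not affect the periodogram):

```python
import numpy as np

def white_noise_test(r):
    """Normalized cumulative periodogram test for white-noise residuals."""
    r = np.asarray(r, dtype=float)
    m = r.size
    q = m // 2
    p = np.abs(np.fft.fft(r))**2                  # periodogram |r_hat_k|^2
    c = np.cumsum(p[1:q+1])/np.sum(p[1:q+1])      # exclude the DC component
    dev = np.max(np.abs(c - np.arange(1, q+1)/q)) # distance to the straight line
    return c, dev, dev < 1.35/q                   # True: consistent with white noise
```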
Figure 1.4.1: Two artificially created data sets used in Example 5. The second data set is inspired by the data on p. 60 in [125].

Example 5. Residual analysis. We finish this section with an example
that illustrates the above analysis techniques. We use the two different data
sets shown in Figure 1.4.1; both sets are artificially generated (in fact, the
second set is the first set with the ti and yi values interchanged). In both
examples we have m = 43 data points, and in the test for white noise
we have q = 21 and 1.35/q = 0.0643. The fitting model M (x, t) is the
polynomial of degree p = n − 1.
For fitting orders n = 2, 3, . . . , 9, Figure 1.4.2 shows the residuals and
the normalized cumulative periodograms, together with z± (1.4.2), the ratios
ϱ/T from (1.4.3) and the maximum distance of the normalized cumulative
periodogram to the straight line.

Figure 1.4.2: Residual analysis for polynomial fits to artificial data set 1.

Figure 1.4.3: Residual analysis for polynomial fits to artificial data set 2.

A visual inspection of the residuals in the
left part of the figure indicates that for small values of n the polynomial
model does not capture all the information in the data, as there are obvious
trends in the residuals, while for n ≥ 5 the residuals appear to be more
random. The test for random signs confirms this: for n ≥ 5 the numbers
z± are less than 1.96, indicating that the signs of the residuals could be
considered random. The autocorrelation analysis leads to approximately the
same conclusion: for n = 6 and 7 the absolute value of the autocorrelation
is smaller than the threshold T .
The normalized cumulative periodograms are shown in the right part
of Figure 1.4.2. For small values, the curves rise fast toward a flat part,
showing that the residuals are dominated by low-frequency components. The
closest we get to a straight line is for n = 6, but the maximum distance 0.134
to the straight line is still too large to clearly signify that the residuals are
white noise. The conclusion from these three tests is nevertheless that n = 6
is a good choice of the order of the fit.
Figure 1.4.3 presents the residual analysis for the second data set. A
visual inspection of the residuals clearly shows that the polynomial model
is not well suited for this data set – the residuals have a slowly varying
trend for all values of n. This is confirmed by the normalized cumulative
periodograms that show that the residuals are dominated by low-frequency
components. The random-sign test and the autocorrelation analysis also
give a clear indication of trends in the residuals.

1.5 Robust regression


The least squares fit introduced in this chapter is convenient and useful in
a large number of practical applications – but it is not always the right
choice for a data fitting problem. In fact, we have already seen in Section
1.3 that the least squares fit is closely connected to the assumption about
Gaussian errors in the data. There we also saw that other types of noise,
in the framework of maximum likelihood estimation, lead to other criteria
for a best fit – such as the sum-of-absolute-values of the residuals (the 1-
norm) associated with the Laplace distribution for the noise. The more
dominating the “tails” of the probablility density function for the noise, the
more important it is to use another criterion than the least squares fit.
Another situation where the least squares fit is not appropriate is when
the data contain outliers, i.e., observations with exceptionally large errors
and residuals. We can say that an outlier is a data point (ti , yi ) whose value
yi is unusual compared to its predicted value (based on all the reliable data
points). Such outliers may come from different sources:

• The data errors may come from more than one statistical distribution.
This could arise, e.g., in an astronomical CCD camera, where we have
Poisson noise (or “photon noise”) from the incoming light, Gaussian
noise from the electronic circuits (amplifier and A/D-converter), and
occasional large errors from cosmic radiation (so-called cosmic ray
events).
• The outliers may be due to data recording errors arising, e.g., when
the measurement device has a malfunction or the person recording
the data makes a blunder and enters a wrong number.
A manual inspection can sometimes be used to delete blunders from the
data set, but it may not always be obvious which data are blunders or
outliers. Therefore we prefer to have a mathematical formulation of the
data fitting problem that handles outliers in such a way that all data are
used, and yet the outliers do not have a deteriorating influence on the fit.
This is the goal of robust regression. Quoting from [82] we say that “an
estimator or statistical procedure is robust if it provides useful information
even if some of the assumptions used to justify the estimation method are
not applicable.”
Example 6. Mean and median. Assume we are given n − 1 samples
z1 , . . . , zn−1 from the same distribution and a single sample zn that is an
outlier. Clearly, the arithmetic mean (z1 + z2 + · · · + zn)/n is not a good
estimate of the expected value because the outlier contributes with the same
weight as all the other data points. On the other hand, the median gives a
robust estimate of the expected value since it is insensitive to a few outliers;
we recall that if the data are sorted then the median is z_{(n+1)/2} if n is odd,
and ½(z_{n/2} + z_{n/2+1}) if n is even.
The most common method for robust data fitting – or robust regression,
as statisticians call it – is based on the principle of M-estimation introduced
by Huber [130], which can be considered as a generalization of maximum
likelihood estimation. Here we consider un-weighted problems only (the
extension to weighted problems is straightforward). The underlying idea
is to replace the sum of squared residuals in (1.2.3) with the sum of some
function of the residuals:

Robust fit:   \min_x \sum_{i=1}^{m} ρ\bigl( r_i(x) \bigr) = \min_x \sum_{i=1}^{m} ρ\bigl( y_i - M(x, t_i) \bigr),   (1.5.1)

where the function ρ defines the contribution of each residual to the function
to be minimized. In particular, we obtain the least squares fit when ρ(r) = ½r².
The function ρ must satisfy the following criteria:

1. Non-negativity: ρ(r) ≥ 0 ∀r.


2. Zero only when the argument is zero: ρ(r) = 0 ⇔ r = 0.


3. Symmetry: ρ(−r) = ρ(r).
4. Monotonicity: ρ(r′) ≥ ρ(r) for r′ ≥ r.
Some well-known examples of the function ρ are (cf. [82, 171]):

Huber:    ρ(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| ≤ β \\ β |r| - \tfrac{1}{2} β^2, & |r| > β \end{cases}   (1.5.2)

Talwar:   ρ(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| ≤ β \\ \tfrac{1}{2} β^2, & |r| > β \end{cases}   (1.5.3)

Bisquare: ρ(r) = β^2 \log\bigl( \cosh(r/β) \bigr)   (1.5.4)

Logistic: ρ(r) = β^2 \left( \frac{|r|}{β} - \log\left( 1 + \frac{|r|}{β} \right) \right)   (1.5.5)

Figure 1.5.1: Four functions ρ(r) used in the robust data fitting problem. All of them increase slower than ½r², which defines the LSQ problem, and thus they lead to robust data fitting problems that are less sensitive to outliers.
Note that all four functions include a problem-dependent positive parameter
β that is used to control the behavior of the function for “large” values of
r, corresponding to the outliers. Figure 1.5.1 shows these functions for the
case β = 1, and we see that all of them increase slower than the function ½r²,
which underlies the LSQ problem. This is precisely why they lead to a
robust data fitting problem whose solution is less sensitive to outliers than
the LSQ solution. The parameter β should be chosen from our knowledge
of the standard deviation ς of the noise; if ς is not known, then it can be
estimated from the fit as we will discuss in Section 2.2.
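For reference, the four functions (1.5.2)–(1.5.5) translate directly into code; the sketch below uses our own function names and assumes β > 0, following the formulas as given above.

```python
import numpy as np

def rho_huber(r, beta):
    return np.where(np.abs(r) <= beta, 0.5*r**2, beta*np.abs(r) - 0.5*beta**2)

def rho_talwar(r, beta):
    return np.where(np.abs(r) <= beta, 0.5*r**2, 0.5*beta**2)

def rho_bisquare(r, beta):
    return beta**2*np.log(np.cosh(r/beta))

def rho_logistic(r, beta):
    return beta**2*(np.abs(r)/beta - np.log(1.0 + np.abs(r)/beta))
```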
It appears that the choice of function ρ for a given problem relies mainly
on experience with the specific data for that problem.

Figure 1.5.2: The pure-data function Γ(t) (thin line) and the data with Gaussian noise (dots); the outlier (t60, y60) = (3, 2.5) is outside the plot. The fitting model M(x, t) is a polynomial with n = 9. Left: the LSQ fit and the corresponding residuals; this fit is dramatically influenced by the outlier. Right: the robust Huber fit, using β = 0.025, together with the residuals; this is a much better fit to the given data because it approximates the pure-data function well.

Still, the Huber function has attained widespread use, perhaps due to the natural way it
distinguishes between large and small residuals:

• Small residuals satisfying |ri | ≤ β are treated in the same way as in


the LSQ fitting problem; if there are no outliers then we obtain the
LSQ solution.

• Large residuals satisfying |ri | > β are essentially treated as |ri | and
the robust fit is therefore not so sensitive to the corresponding data
points.

Thus, robust regression is a compromise between excluding the outliers


entirely from the analysis and including all the data points and treating
all of them equally in the LSQ regression. The idea of robust regression
is to weight the observations differently based on how well behaved these
observations are. For an early use in seismic data processing see [48, 219].

Example 7. Robust data fitting with the Huber function. This


example illustrates that the Huber function gives a more robust fit than the
LSQ fit. The pure-data function Γ(t) is given by

Γ(t) = \sin\bigl( π e^{-t} \bigr), \qquad 0 ≤ t ≤ 5.

We use m = 100 data points with ti = 0.05i, and we add Gaussian noise
with standard deviation ς = 0.05. Then we change the 60th data point to
an outlier with (t60 , y60 ) = (3, 2.5); Figure 1.5.2 shows the function Γ(t)
and the noisy data. We note that the outlier is located outside the plot.
As fitting model M (x, t) we use a polynomial with n = 9 and the left
part of Figure 1.5.2 shows the least squares fit with this model, together
with the corresponding residuals. Clearly, this fit is dramatically influenced
by the outlier, which is evident from the plot of the fit as well as from the
behavior of the residuals, which exhibit a strong positive “trend” in the range
2 ≤ t ≤ 4. This illustrates the inability of the LSQ fit to handle outliers in
a satisfactory way.
The right part of Figure 1.5.2 shows the robust Huber fit, with parameter
β = 0.025 (this parameter is chosen to reflect the noise level in the data).
The resulting fit is not influenced by the outlier, and the residuals do not
seem to exhibit any strong “trend.” This is a good illustration of robust
regression.
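The book's algorithms for such fits are described in Section 9.5. Purely as a sketch, a comparable robust fit can be obtained with SciPy's least_squares and its built-in Huber-type loss, where the parameter f_scale plays a role analogous to β; the data generation below follows Example 7, with our own noise realization.

```python
import numpy as np
from scipy.optimize import least_squares

# Sketch of Example 7: fit a degree-8 polynomial to data with one outlier.
rng = np.random.default_rng(2)
t = 0.05*np.arange(1, 101)                         # t_i = 0.05 i, i = 1,...,100
y = np.sin(np.pi*np.exp(-t)) + 0.05*rng.standard_normal(t.size)
y[59] = 2.5                                        # outlier at t_60 = 3

A = np.vander(t, 9)                                # polynomial model of order n = 9

x_lsq, *_ = np.linalg.lstsq(A, y, rcond=None)      # ordinary LSQ fit

res = least_squares(lambda x: A @ x - y, x_lsq,
                    loss='huber', f_scale=0.025)   # Huber-type robust fit
x_rob = res.x
```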

In Section 9.5 we describe numerical algorithms for computing the so-


lutions to robust data fitting problems.
Chapter 2

The Linear Least Squares


Problem

This chapter covers some of the basic mathematical facts of the linear least
squares problem (LSQ), as well as some important additional statistical
results for data fitting. We introduce the two formulations of the least
squares problem: the linear system of normal equations and the optimiza-
tion problem form.
The computation of the LSQ solution via an optimization problem has
two aspects: simplification of the problem structure and actual minimiza-
tion. In this and in the next chapter we present a number of matrix factor-
izations, for both full-rank and rank-deficient problems, which transform
the original problem to one easier to solve. The QR factorization is empha-
sized, for both the analysis and the solution of the LSQ problem, while in
the last section we look into the more expensive complete factorizations.
Some very interesting historical papers on Gaussian elimination, which also
include least squares problems, can be found in Grcar [109, 110].

2.1 Linear least squares problem formulation


As we saw in the previous chapter, underlying the linear (and possibly
weighted) least squares data fitting problem is the linear least squares prob-
lem
min_x ‖b − A x‖_2   or   min_x ‖W (b − A x)‖_2 ,

where A ∈ R^{m×n} is the matrix with samples of the model basis functions,
x is the vector of parameters to be determined, the right-hand side b is
the vector of observations and W is a diagonal weight matrix (possibly the
identity matrix). Since the weights can always be absorbed into A and b



Figure 2.1.1: The geometric interpretation of the linear least squares so-
lution x∗ . The plane represents the range of A, and if the vector b has
a component outside this subspace, then we have an inconsistent system.
Moreover, b∗ = A x∗ is the orthogonal projection of b on range(A), and r ∗
is the LSQ residual vector.

in the mathematical formulation we can, without loss of generality, restrict


our discussion to the un-weighted case. Also, from this point on, when
discussing the generic linear least squares problem, we will use the notation
b for the right-hand side (instead of y that we used in Chapter 1), which
is more common in the LSQ literature.
Although most of the material in this chapter is also applicable to the
underdetermined case (m < n), for notational simplicity we will always
consider the overdetermined case m ≥ n. We denote by r the rank of
A, and we consider both full-rank and rank-deficient problems. Thus we
always have r ≤ n ≤ m in this chapter.
It is appropriate to remember at this point that an m × n matrix A is
always a representation of a linear transformation x → A x with A : Rn →
Rm , and therefore there are two important subspaces associated with it:
The range or column space,

range(A) = {z ∈ Rm | x ∈ Rn , z = A x} ,

and its orthogonal complement, the null space of A^T:

null(A^T) = { y ∈ R^m | A^T y = 0 } .

When A is square and has full rank, then the LSQ problem min_x ‖A x − b‖_2 reduces to the linear system of equations A x = b. In all other cases, due to the data errors, it is highly probable that the problem is inconsistent, i.e., b ∉ range(A), and as a consequence there is no exact solution, i.e., no coefficients x_j exist that express b as a linear combination of columns of A.

Instead, we can find the coefficients xj for a vector b∗ in the range of A


and “closest” to b. As we have seen, for data fitting problems it is natural
to use the Euclidean norm as our measure of closeness, resulting in the least
squares problem

Problem LSQ:   min_x ‖b − A x‖_2² ,   A ∈ R^{m×n} ,   r ≤ n ≤ m        (2.1.1)

with the corresponding residual vector r given by

r = b − A x. (2.1.2)

See Figure 2.1.1 for a geometric interpretation. The minimizer, i.e., the
least squares solution – which may not be unique, as it will be seen later –
is denoted by x∗ . We note that the vector b∗ ∈ range(A) mentioned above
is given by b∗ = A x∗ .
The LSQ problem can also be looked at from the following point of view.
When our data are contaminated by errors, then the data are not in the
span of the model basis functions fj (t) underlying the data fitting problem
(cf. Chapter 1). In that case the data vector b cannot – and should not – be
precisely “predicted” by the model, i.e., the columns of A. Hence, it must
be perturbed by a minimum amount r, so that it can then be “represented”
by A, in the form of b∗ = A x∗ . This approach will establish a viewpoint
used in Section 7.3 to introduce the total least squares problem.
As already mentioned, there are good statistical reasons to use the Eu-
clidean norm. The underlying statistical assumption that motivates this
norm is that the vector r has random error elements, uncorrelated, with
zero mean and a common variance. This is justified by the following theo-
rem.

Theorem 8. (Gauss-Markov) Consider the problem of fitting a model


M (x, t) with the n-parameter vector x to a set of data bi = Γ(ti ) + ei
for i = 1, . . . , m (see Chapter 1 for details).
In the case of a linear model b = A x, if the errors are uncorrelated with
mean zero and constant variance ς 2 (not necessarily normally distributed)
and assuming that the m × n matrix A obtained by evaluating the model at
the data abscissas {ti }i=1,...,m has full rank n, then the best linear unbiased
estimator is the least squares estimator x∗ , obtained by solving the problem
min_x ‖b − A x‖_2².

For more details see [20] Theorem 1.1.1. Recall also the discussion
on maximum likelihood estimation in Chapter 1. Similarly, for nonlinear
models, if the errors ei for i = 1, . . . , m have a normal distribution, the
unknown parameter vector x estimated from the data using a least squares
criterion is the maximum likelihood estimator.

There are also clear mathematical and computational advantages asso-


ciated with the Euclidean norm: the objective function in (2.1.1) is dif-
ferentiable, and the resulting gradient system of equations has convenient
properties. Since the Euclidean norm is preserved under orthogonal trans-
formations, this gives rise to a range of stable numerical algorithms for the
LSQ problem.
Theorem 9. A necessary and sufficient condition for x* to be a minimizer of ‖b − A x‖_2² is that it satisfies

A^T (b − A x) = 0.        (2.1.3)

Proof. The minimizer of ρ(x) = ‖b − A x‖_2² must satisfy ∇ρ(x) = 0, i.e., ∂ρ(x)/∂x_k = 0 for k = 1, . . . , n. The kth partial derivative has the form

∂ρ(x)/∂x_k = 2 Σ_{i=1}^{m} ( b_i − Σ_{j=1}^{n} x_j a_ij ) (−a_ik) = −2 Σ_{i=1}^{m} r_i a_ik
           = −2 r^T A(:, k) = −2 A(:, k)^T r,

where A(:, k) denotes the kth column of A. Hence, the gradient can be written as

∇ρ(x) = −2 A^T r = −2 A^T (b − A x)

and the requirement that ∇ρ(x) = 0 immediately leads to (2.1.3).
Definition 10. The two conditions (2.1.2) and (2.1.3) can be written as a
symmetric (m + n) × (m + n) system in x and r, the so-called augmented
system:
[ I    A  ; A^T  0 ] [ r ; x ] = [ b ; 0 ] .        (2.1.4)
This formulation preserves any special structure that A might have,
such as sparsity. Also, it is the formulation used in an iterative refinement
procedure for the LSQ solution (discussed in Section 4.5), because of the
relevance it gives to the residual.
Theorem 9 leads to the normal equations for the solution x∗ of the least
squares problem:

Normal equations: AT A x = AT b. (2.1.5)

The normal equation matrix AT A, which is sometimes called the Gram-


mian, is square, symmetric and additionally:
• If r = n (A has full rank), then AT A is positive definite and the
LSQ problem has a unique solution. (Since the Hessian for the least
squares problem is equal to 2AT A, this establishes the uniqueness of
x∗ .)

• If r < n (A is rank deficient), then AT A is non-negative definite. In


this case, the set of solutions forms a linear manifold of dimension
n − r that is a translation of the subspace null(A).
Theorem 9 also states that the residual vector of the LSQ solution lies
in null(AT ). Hence, the right-hand side b can be decomposed into two
orthogonal components
b = A x + r,
with A x ∈ range(A) and r ∈ null(AT ), i.e., A x is the orthogonal projection
of b onto range(A) (the subspace spanned by the columns of A) and r is
orthogonal to range(A).
Example 11. The normal equations for the NMR problem in Example 1
take the form
[ 2.805   4.024   5.055 ] [ x1 ]   [ 13.14 ]
[ 4.024   8.156  15.21  ] [ x2 ] = [ 25.98 ] ,
[ 5.055  15.21   50     ] [ x3 ]   [ 51.87 ]

giving the least squares solution x∗ = ( 1.303 , 1.973 , 0.305 )T .
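As a quick numerical illustration (ours, not part of the book), this 3 × 3 system can be solved directly; since the displayed entries are rounded, the computed solution reproduces the quoted x* only to a few digits.

```python
import numpy as np

# Normal equations of Example 11 (entries as printed, i.e., rounded).
ATA = np.array([[2.805,  4.024,  5.055],
                [4.024,  8.156, 15.21 ],
                [5.055, 15.21,  50.0  ]])
ATb = np.array([13.14, 25.98, 51.87])

x = np.linalg.solve(ATA, ATb)
print(x)        # approximately (1.303, 1.973, 0.305)
```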


Example 12. Simplified NMR problem. In the NMR problem, let us
assume that we know that the constant background is 0.3, corresponding to
fixing x3 = 0.3. The resulting 2 × 2 normal equations for x1 and x2 take
the form
[ 2.805  4.024 ] [ x1 ]   [ 11.62 ]
[ 4.024  8.156 ] [ x2 ] = [ 21.42 ]
and the LSQ solution to this simplified problem is x1 = 1.287 and x2 =
1.991. Figure 2.1.2 illustrates the geometry of the minimization associated
with the simplified LSQ problem for the two unknowns x1 and x2 . The
left plot shows the residual norm surface as a function of x1 and x2 , and
the right plot shows the elliptic contour curves for this surface; the unique
minimum – the LSQ solution – is marked with a dot.
In the rank-deficient case the LSQ solution is not unique, but one can re-
duce the solution set by imposing additional constraints. For example, the
linear least squares problem often arises from a linearization of a nonlinear
least squares problem, and it may be of interest then to impose the addi-
tional constraint that the solution has minimal l2-norm, i.e., one selects the minimizer x̃* with smallest ‖x‖_2, so that the solution stays in the region where the linearization is valid. Because the set of all minimizers is convex, there is a unique such solution. Another
reason for imposing minimal length is stability, as we will see in the section
on regularization.
For data approximation problems, where we are free to choose the
model basis functions fj (t), cf. (1.2.2), one should do it in a way that

Figure 2.1.2: Illustration of the LSQ problem for the simplified NMR
problem. Left: the residual norm as a function of the two unknowns x1 and
x2 . Right: the corresponding contour lines for the residual norm.

gives A full rank. A necessary condition is that the (continuous) functions


f1 (t), . . . , fn (t) are linearly independent, but furthermore, they have to de-
fine linearly independent vectors when evaluated on the specific discrete set
of abscissas. More formally:
A necessary and sufficient condition for the matrix A to have full rank
is that the model basis functions be linearly independent over the abscissas
t1 , . . . , tm :


Σ_{j=1}^{n} α_j f_j(t_i) = 0  for i = 1, . . . , m   ⇔   α_j = 0  for j = 1, . . . , n.

Example 13. Consider the linearly independent functions f1 (t) = sin(t),


f2 (t) = sin(2t) and f3 (t) = sin(3t); if we choose the data abscissas ti =
π/4 + iπ/2, i = 1, . . . , m, the matrix A has rank r = 2, whereas the same
functions generate a full-rank matrix A when evaluated on the abscissas
ti = π(i/m), i = 1 . . . , m − 1.

An even stronger requirement is that the model basis functions fj (t) be


such that the columns of A are orthogonal.

Example 14. In general, for data fitting problems, where the model basis
functions fj (t) arise from the underlying model, the properties of the matrix
A are dictated by these functions. In the case of polynomial data fitting,
it is possible to choose the functions fj (t) so that the columns of A are
orthogonal, as described by Forsythe [81], which simplifies the computation
of the LSQ solution. The key is to choose a clever representation of the
fitting polynomials, different from the standard one with the monomials:

f_j(t) = t^{j−1}, j = 1, . . . , n, such that the sampled polynomials satisfy

Σ_{i=1}^{m} f_j(t_i) f_k(t_i) = 0   for j ≠ k.        (2.1.6)

When this is the case we say that the functions are orthogonal over the
given abscissas. This is satisfied by the family of orthogonal polynomials
defined by the recursion:

f1 (t) = 1
f2 (t) = t − α1
fj+1 (t) = (t − αj ) fj (t) − βj fj−1 (t), j = 2, . . . n − 1,

where the constants are given by

α_j = (1/s_j²) Σ_{i=1}^{m} t_i f_j(t_i)² ,    j = 0, 1, . . . , n − 1,
β_j = s_j² / s_{j−1}² ,    j = 0, 1, . . . , n − 1,
s_j² = Σ_{i=1}^{m} f_j(t_i)² ,    j = 2, . . . , n,

i.e., sj is the l2 -norm of the jth column of A. These polynomials satisfy


(2.1.6), hence, the normal equations matrix AT A is diagonal, and it follows
that the LSQ coefficients are given by

x*_j = (1/s_j²) Σ_{i=1}^{m} y_i f_j(t_i) ,    j = 1, . . . , n.
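The construction can be sketched in a short routine (ours, not the book's code; the function and variable names are our own). The three-term recurrence is evaluated directly on the abscissas, and since the resulting normal equations matrix is diagonal, the coefficients decouple.

```python
import numpy as np

def forsythe_fit(t, y, n):
    """Polynomial LSQ fit using polynomials orthogonal over the abscissas t."""
    m = len(t)
    F = np.zeros((m, n))              # column j holds the (j+1)th polynomial sampled at t
    s2 = np.zeros(n)                  # squared column norms s_j^2
    F[:, 0] = 1.0
    s2[0] = m
    if n > 1:
        alpha = np.sum(t * F[:, 0]**2) / s2[0]
        F[:, 1] = t - alpha
        s2[1] = np.sum(F[:, 1]**2)
    for j in range(1, n - 1):         # three-term recurrence
        alpha = np.sum(t * F[:, j]**2) / s2[j]
        beta = s2[j] / s2[j - 1]
        F[:, j + 1] = (t - alpha) * F[:, j] - beta * F[:, j - 1]
        s2[j + 1] = np.sum(F[:, j + 1]**2)
    x = (F.T @ y) / s2                # F^T F is diagonal, so x_j = (f_j^T y)/s_j^2
    return F @ x, x                   # fitted values and LSQ coefficients
```

A quick check of F.T @ F confirms that the sampled polynomials are orthogonal over the abscissas, up to rounding errors.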

When A has full rank, it follows from the normal equations (2.1.5) that
we can write the least squares solution as

x∗ = (AT A)−1 AT b,

which allows us to analyze the solution and the residual vector in statis-
tical terms. Consider the case where the data errors ei are independent,
uncorrelated and have identical standard deviations ς, meaning that the
covariance for b is given by

Cov(b) = ς 2 Im ,

since the errors ei are independent of the exact Γ(ti ). Then a standard
result in statistics says that the covariance matrix for the LSQ solution is

Cov(x∗ ) = (AT A)−1 AT Cov(b) A (AT A)−1 = ς 2 (AT A)−1 .



We see that the unknown coefficients in the fit – the elements of x∗ – are


uncorrelated if and only if AT A is a diagonal matrix, i.e., when the columns
of A are orthogonal. This is the case when the model basis functions are
orthogonal over the abscissas t1 , . . . , tm ; cf. (2.1.6).

Example 15. More data gives better accuracy. Intuitively we expect


that if we increase the number of data points then we can compute a more
accurate LSQ solution, and the present example confirms this. Specifically
we give an asymptotic analysis of how the solution’s variance depends on
the number m of data points, in the case of linear data fitting. There is no
assumption about the distribution of the abscissas ti except that they belong
to the interval [a, b] and appear in increasing order. Now let hi = ti − ti−1
for i = 2, . . . , m and let h = (b − a)/m denote the average spacing between
the abscissas. Then for j, k = 1, . . . , n the elements of the normal equation
matrix can be approximated as


m
1
m
(AT A)jk = h−1
i fj (ti )fk (ti )hi  fj (ti )fk (ti )hi
i=1
h i=1
 b
m
 fj (t)fk (t) dt,
b−a a

and the accuracy of these approximations increases as m increases. Hence,


if F denotes the matrix whose elements are the scaled inner products of the
model basis functions,
F_{jk} = (1/(b − a)) ∫_a^b f_j(t) f_k(t) dt ,    j, k = 1, . . . , n,

then for large m the normal equation matrix approximately satisfies

A^T A ≈ m F   ⇔   (A^T A)^{−1} ≈ (1/m) F^{−1} ,
where the matrix F is independent of m. Hence, the asymptotic result (as
m increases) is that no matter the choice of abscissas and basis functions,
as long as AT A is invertible we have the approximation for the white-noise
case:
Cov(x*) = ς² (A^T A)^{−1} ≈ (ς²/m) F^{−1} .
We see that the solution’s variance is (to a good approximation) inversely
proportional to the number m of data points.
To illustrate the above result we consider again the frozen cod meat
example, this time with two sets of abscissas ti uniformly distributed in
[0, 0.4] for m = 50 and m = 200, leading to the two matrices (AT A)−1

Figure 2.1.3: Histograms of the error norms ‖x_exact − x*‖_2 for the two test problems with additive white noise; the errors are clearly reduced by a factor of 2 when we increase m from 50 to 200.

given by
[  1.846  −1.300   0.209 ]         [  0.535  −0.359   0.057 ]
[ −1.300   1.200  −0.234 ]   and   [ −0.359   0.315  −0.061 ] ,
[  0.209  −0.234   0.070 ]         [  0.057  −0.061   0.018 ]

respectively. The average ratio between the elements in the two matrices
is 3.71, i.e., fairly close to the factor 4 we expect from the above analysis
when increasing m by a factor 4.
We also solved the two LSQ problems for 1000 realizations of additive
white noise, and Figure 2.1.3 shows histograms of the error norms ‖x_exact − x*‖_2, where x_exact = (1.27, 2.04, 0.3)^T is the vector of exact parameters for
the problem. These results confirm that the errors are reduced by a factor
of 2 corresponding to the expected reduction of the standard deviation by
the same factor.
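The 1/m scaling of the solution's covariance is easy to verify numerically. The small Monte Carlo sketch below uses an illustrative quadratic model and noise level of our own choosing, not the frozen cod meat data.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_cov(m, sigma=0.05, reps=2000):
    # Repeated LSQ fits of a quadratic model to noisy data; returns the
    # sample covariance of the estimated coefficient vectors.
    t = np.linspace(0.0, 1.0, m)
    A = np.column_stack([np.ones(m), t, t**2])
    X = np.empty((reps, 3))
    for k in range(reps):
        b = (1.0 + 2.0*t - t**2) + sigma * rng.standard_normal(m)
        X[k] = np.linalg.lstsq(A, b, rcond=None)[0]
    return np.cov(X.T)

# Increasing m from 50 to 200 should shrink the covariance by roughly 4.
print(np.trace(empirical_cov(50)) / np.trace(empirical_cov(200)))
```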

2.2 The QR factorization and its role


In this and the next section we discuss the QR factorization and its role in
the analysis and solution of the LSQ problem. We start with the simpler
case of full-rank matrices in this section and then move on to rank-deficient
matrices in the next section.
The first step in the computation of a solution to the least squares
problem is the reduction of the problem to an equivalent one with a more
convenient matrix structure. This can be done through an explicit fac-
torization, usually based on orthogonal transformations, where instead of
solving the original LSQ problem (2.1.1) one solves an equivalent problem
with a triangular matrix. The basis of this procedure is the QR factoriza-
tion, the less expensive decomposition that takes advantage of the isometric
properties of orthogonal transformations (proofs for all the theorems in this
section can be found in [20], [105] and many other references).

Theorem 16. QR factorization. Any real m × n matrix A can be fac-


tored as

A = Q R   with   Q ∈ R^{m×m} ,   R = [ R1 ; 0 ] ∈ R^{m×n} ,        (2.2.1)

where Q is orthogonal (i.e., QT Q = Im ) and R1 ∈ Rn×n is upper triangular.


If A has full rank, then so has R and therefore all its diagonal elements are
nonzero.

Theorem 17. Economical QR factorization. Let A ∈ Rm×n have full


column rank r = n. The economical (or thin) QR factorization of A is

A = Q1 R1 with Q1 ∈ Rm×n , R1 ∈ Rn×n , (2.2.2)

where Q1 has orthonormal columns (i.e., QT1 Q1 = In ) and the upper trian-
gular matrix R1 has nonzero diagonal entries. Moreover, Q1 can be chosen
such that the diagonal elements of R1 are positive, in which case R1 is the
Cholesky factor of AT A.

Similar theorems hold if the matrix A is complex, with the factor Q now
a unitary matrix.

Remark 18. If we partition the m × m matrix Q in the full QR factoriza-


tion (2.2.1) as
Q = ( Q1 Q2 ),
then the sub-matrix Q1 is the one that appears in the economical QR fac-
torization (2.2.2). The m × (m − n) matrix Q2 satisfies QT2 Q1 = 0 and
Q1 QT1 + Q2 QT2 = Im .

Geometrically, the QR factorization corresponds to an orthogonalization


of the linearly independent columns of A. The columns of matrix Q1 are an
orthonormal basis for range(A) and those of Q2 are an orthonormal basis
for null(AT ).
The following theorem expresses the least squares solution of the full-
rank problem in terms of the economical QR factorization.

Theorem 19. Let A ∈ Rm×n have full column rank r = n, with the eco-
nomical QR factorization A = Q1 R1 from Theorem 17. Considering that

‖b − A x‖_2² = ‖Q^T (b − A x)‖_2² = ‖ [ Q1^T b ; Q2^T b ] − [ R1 ; 0 ] x ‖_2²
            = ‖Q1^T b − R1 x‖_2² + ‖Q2^T b‖_2² ,
then, the unique solution of the LSQ problem min_x ‖b − A x‖_2² can be computed from the simpler, equivalent problem

min_x ‖Q1^T b − R1 x‖_2² ,

whose solution is

x* = R1^{−1} Q1^T b        (2.2.3)

and the corresponding least squares residual is given by

r* = b − A x* = (I_m − Q1 Q1^T) b = Q2 Q2^T b,        (2.2.4)

with the matrix Q2 that was introduced in Remark 18.


Of course, (2.2.3) is short-hand for solving R1 x* = Q1^T b, and one point
of this reduction is that it is much simpler to solve a triangular system of
equations than a full one. Further on we will also see that this approach has
better numerical properties, as compared to solving the normal equations
introduced in the previous section.
Example 20. In Example 11 we saw the normal equations for the NMR
problem from Example 1; here we take a look at the economical QR factor-
ization for the same problem:
     [ 1.00        1.00       1 ]         [ 0.597        −0.281    0.172 ]
     [ 0.80        0.94       1 ]         [ 0.479        −0.139    0.071 ]
     [ 0.64        0.88       1 ]         [ 0.384        −0.029   −0.002 ]
A =  [  ...         ...      ... ] ,  Q1 = [  ...           ...      ...  ] ,
     [ 3.2·10^−5   4.6·10^−2  1 ]         [ 1.89·10^−5     0.030    0.224 ]
     [ 2.5·10^−5   4.4·10^−2  1 ]         [ 1.52·10^−5     0.028    0.226 ]
     [ 2.0·10^−5   4.1·10^−2  1 ]         [ 1.22·10^−5     0.026    0.229 ]

     [ 1.67  2.40  3.02 ]              [ 7.81 ]
R1 = [ 0     1.54  5.16 ] ,   Q1^T b = [ 4.32 ] .
     [ 0     0     3.78 ]              [ 1.19 ]
We note that the upper triangular matrix R1 is also the Cholesky factor of
the normal equation matrix, i.e., AT A = R1T R1 .
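In code, the economical QR factorization yields the LSQ solution with a single triangular solve. The sketch below (ours) uses random stand-in data rather than the NMR matrix and checks the two properties just discussed.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 3))            # stand-in for the design matrix
b = rng.standard_normal(100)

Q1, R1 = np.linalg.qr(A, mode='reduced')     # economical QR: A = Q1 R1
x = np.linalg.solve(R1, Q1.T @ b)            # solve R1 x = Q1^T b, cf. (2.2.3)
r = b - A @ x

print(np.allclose(A.T @ r, 0.0, atol=1e-10)) # residual orthogonal to range(A)
print(np.allclose(A.T @ A, R1.T @ R1))       # R1^T R1 equals the normal equation matrix
```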

The QR factorization allows us to study the residual vector in more


detail. Consider first the case where we augment A with an additional
column, corresponding to adding an additional model basis function in the
data fitting problem.
Theorem 21. Let the augmented matrix Ā = ( A , a_{n+1} ) have the QR factorization

Ā = ( Q̄1  Q̄2 ) [ R̄1 ; 0 ] ,

with Q̄1 = ( Q1  q̄ ), Q1^T q̄ = 0 and Q̄2^T q̄ = 0. Then the norms of the least squares residual vectors r* = (I_m − Q1 Q1^T) b and r̄* = (I_m − Q̄1 Q̄1^T) b are related by

‖r*‖_2² = ‖r̄*‖_2² + (q̄^T b)².

Proof. From the relation Q̄1 Q̄1^T = Q1 Q1^T + q̄ q̄^T it follows that I_m − Q1 Q1^T = I_m − Q̄1 Q̄1^T + q̄ q̄^T, and hence,

‖r*‖_2² = ‖(I_m − Q1 Q1^T) b‖_2² = ‖(I_m − Q̄1 Q̄1^T) b + q̄ q̄^T b‖_2²
        = ‖(I_m − Q̄1 Q̄1^T) b‖_2² + ‖q̄ q̄^T b‖_2² = ‖r̄*‖_2² + (q̄^T b)²,

where we used that the two components of r* are orthogonal and that ‖q̄ q̄^T b‖_2 = |q̄^T b| ‖q̄‖_2 = |q̄^T b|.
This theorem shows that, when we increase the number of model basis
functions for the fit in such a way that the matrix retains full rank, then
the least squares residual norm decreases (or stays fixed if b is orthogonal
to q̄).
To obtain more insight into the least squares residual we study the
influence of the approximation and data errors. According to (1.2.1) we
can write the right-hand side as

b = Γ + e,

where the two vectors


Γ = ( Γ(t1), . . . , Γ(tm) )^T   and   e = ( e1, . . . , em )^T

contain the pure data (the sampled pure-data function) and the data errors,
respectively. Hence, the least squares residual vector is

r ∗ = Γ − A x∗ + e, (2.2.5)

where the vector Γ−A x∗ is the approximation error. From (2.2.5) it follows
that the least squares residual vector can be written as

r ∗ = (Im − Q1 QT1 ) Γ + (Im − Q1 QT1 ) e = Q2 QT2 Γ + Q2 QT2 e.



We see that the residual vector consists of two terms. The first term Q2 QT2 Γ
is an “approximation residual,” due to the discrepancy between the n model
basis functions (represented by the columns of A) and the pure-data func-
tion. The second term is the “projected error”, i.e., the component of the
data errors that lies in the subspace null(AT ). We can summarize the
statistical properties of the least squares residual vector as follows.

Theorem 22. The least squares residual vector r* = b − A x* has the following properties:

E(r*) = Q2 Q2^T Γ ,    Cov(r*) = Q2 Q2^T Cov(e) Q2 Q2^T ,

E(‖r*‖_2²) = ‖Q2^T Γ‖_2² + E(‖Q2^T e‖_2²).

If e is white noise, i.e., Cov(e) = ς² I_m, then

Cov(r*) = ς² Q2 Q2^T ,    E(‖r*‖_2²) = ‖Q2^T Γ‖_2² + (m − n) ς².

Proof. It follows immediately that

E(Q2 Q2^T e) = Q2 Q2^T E(e) = 0   and   E(Γ^T Q2 Q2^T e) = 0,

as well as

Cov(r*) = Q2 Q2^T Cov(Γ + e) Q2 Q2^T   and   Cov(Γ + e) = Cov(e).

Moreover,

E(‖r*‖_2²) = E(‖Q2 Q2^T Γ‖_2²) + E(‖Q2 Q2^T e‖_2²) + E(2 Γ^T Q2 Q2^T e).

It follows that

Cov(Q2^T e) = ς² I_{m−n}   and   E(‖Q2^T e‖_2²) = trace(Cov(Q2^T e)) = (m − n) ς².

From the above theorem we see that if the approximation error Γ−A x∗
is somewhat smaller than the data error e then, in the case of white noise,
the scaled residual norm s∗ (sometimes referred to as the standard error),
defined by
s* = ‖r*‖_2 / √(m − n) ,        (2.2.6)
provides an estimate for the standard deviation ς of the errors in the data.
Moreover, provided that the approximation error decreases sufficiently fast
when the fitting order n increases, then we should expect that for large

enough n the least squares residual norm becomes dominated by the pro-
jected error term, i.e.,
r* ≈ Q2 Q2^T e   for n sufficiently large.
Hence, if we monitor the scaled residual norm s∗ = s∗ (n) as a function of n,
then we expect to see that s∗ (n) initially decreases – when it is dominated
by the approximation error – while at a later stage it levels off, when the
projected data error dominates. The transition between the two stages of
the behavior of s∗ (n) indicates a good choice for the fitting order n.
Example 23. We return to the air pollution example from Example 2. We
compute the polynomial fit for n = 1, 2, . . . , 19 and the trigonometric fit for
n = 1, 3, 5, . . . , 19 (only odd values of n are used, because we always need
a sin-cos pair). Figure 2.2.1 shows the residual norm r ∗ 2 and the scaled
residual norm s∗ as functions of n.
The residual norm decreases monotonically with n, while the scaled
residual norm shows the expected behavior mentioned above, i.e., a decaying
phase (when the approximation error dominates), followed by a more flat
or slightly increasing phase when the data errors dominate.
The standard error s∗ introduced in (2.2.6) above, defined as the residual
norm adjusted by the degrees of freedom in the residual, is just one example
of a quantity from statistics that plays a central role in the analysis of
LSQ problems. Another quantity arising from statistics is the coefficient of
determination R2 , which is used in the context of linear regression analysis
(statistical modeling) as a measure of how well a linear model fits the data.
Given a model M (x, t) that predicts the observations b1 , b2 , . . . , bm and the
residual vector r = (b1 − M (x, t1 ), . . . , bm − M (x, tm ))T , the coefficient of
determination is defined by
R² = 1 − ‖r‖_2² / Σ_{i=1}^{m} (b_i − b̄)² ,        (2.2.7)

where b̄ is the mean of the observations. In general, it is an approximation


of the unexplained variance, since the second term compares the variance in
the model’s errors with the total variance of the data. Yet another useful
quantity for analysis is the adjusted coefficient of determination, adj R2 ,
defined in the same way as the coefficient of determination R2 , but adjusted
using the residual degrees of freedom,
adj R² = 1 − (s*)² / ( Σ_{i=1}^{m} (b_i − b̄)² / (m − 1) ) ,        (2.2.8)

making it similar in spirit to the squared standard error (s*)². In Chapter 11 we demonstrate the use of these statistical tools.
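As a small computational companion (ours, not the book's), the three quantities above can be evaluated directly from A, b and the fitted coefficients:

```python
import numpy as np

def fit_statistics(A, b, x):
    # Standard error s* (2.2.6), coefficient of determination R^2 (2.2.7)
    # and adjusted R^2 (2.2.8) for a linear fit b ~ A x.
    m, n = A.shape
    r = b - A @ x
    s_star = np.linalg.norm(r) / np.sqrt(m - n)
    ss_tot = np.sum((b - b.mean())**2)
    R2 = 1.0 - np.sum(r**2) / ss_tot
    adjR2 = 1.0 - s_star**2 / (ss_tot / (m - 1))
    return s_star, R2, adjR2
```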

Figure 2.2.1: The residual norm and the scaled residual norm, as functions
of the fitting order n, for the polynomial and trigonometric fits to the air
pollution data.

2.3 Permuted QR factorization


The previous section covered in detail full-rank problems, and we saw that
the QR factorization was well suited for solving such problems. However,
for parameter estimation problems – where the model is given – there is
no guarantee that A always has full rank, and therefore we must also con-
sider the rank-deficient case. We give first an overview of some matrix
factorizations that are useful for detecting and treating rank-deficient prob-
lems, although they are of course also applicable in the full-rank case. The
minimum-norm solution from Definition 27 below plays a central role in
this discussion.
When A is rank deficient we cannot always compute a QR factorization
(2.2.1) that has a convenient economical version, where the range of A is
spanned by the first columns of Q. The following example illustrates that
a column permutation is needed to achieve such a form.

Example 24. Consider the factorization

A = [ 0  0 ; 0  1 ] = [ c  −s ; s  c ] [ 0  s ; 0  c ] ,    for any c² + s² = 1.

This QR factorization has the required form, i.e., the first factor is orthog-
onal and the second is upper triangular – but range(A) is not spanned by
the first column of the orthogonal factor. However, a permutation Π of the

columns of A gives a QR factorization of the desired form,


A Π = [ 0  0 ; 1  0 ] = Q R = [ 0  1 ; 1  0 ] [ 1  0 ; 0  0 ] ,
with a triangular R and such that the range of A is spanned by the first
column of Q.
In general, we need a permutation of columns that selects the linearly
independent columns of A and places them first. The following theorem
formalizes this idea.
Theorem 25. QR factorization with column permutation. If A is
real, m × n with rank(A) = r < n ≤ m, then there exists a permutation Π,
not necessarily unique, and an orthogonal matrix Q such that
A Π = Q [ R11  R12 ]  } r
          [  0    0  ]  } m − r ,        (2.3.1)
where R11 is r × r upper triangular with positive diagonal elements. The
range of A is spanned by the first r columns of Q.
Similar results hold for complex matrices where Q now is unitary.
The first r columns of the matrix A Π are guaranteed to be linearly
independent. For a model with basis functions that are not linearly de-
pendent over the abscissas, this provides a method for choosing r linearly
independent functions. The rank-deficient least squares problem can now
be solved as follows.
Theorem 26. Let A be a rank-deficient m × n matrix with the pivoted QR
factorization in Theorem 25. Then the LSQ problem (2.1.1) takes the form
min_x ‖Q^T A Π Π^T x − Q^T b‖_2²
    = min_y ‖ [ R11  R12 ; 0  0 ] [ y1 ; y2 ] − [ d1 ; d2 ] ‖_2²
    = min_y ( ‖R11 y1 + R12 y2 − d1‖_2² + ‖d2‖_2² ),

where we have introduced

Q^T b = [ d1 ; d2 ]   and   y = Π^T x = [ y1 ; y2 ].

The general solution is

x* = Π [ R11^{−1}(d1 − R12 y2) ; y2 ] ,    y2 = arbitrary        (2.3.2)

and any choice of y2 leads to a least squares solution with residual norm ‖r*‖_2 = ‖d2‖_2.

Definition 27. Given the LSQ problem with a rank-deficient matrix A and the general solution given by (2.3.2), we define x̃* as the solution of minimal l2-norm that satisfies

x̃* = argmin_x ‖x‖_2   subject to   ‖b − A x‖_2 = min.

The choice y 2 = 0 in (2.3.2) is an important special case that leads to


the so-called basic solution,
x_B = Π [ R11^{−1} Q1^T b ; 0 ] ,

with at least n−r zero components. This corresponds to using only the first
r columns of A Π in the solution, while setting the remaining elements to
zero. As already mentioned, this is an important choice in data fitting – as
well as other applications – because it implies that b is represented by the
smallest subset of r columns of A, i.e., it is fitted with as few variables as
possible. It is also related to the new field of compressed sensing [4, 35, 235].

Example 28. Linear prediction. We consider a digital signal, i.e., a


vector s ∈ RN , and we seek a relation between neighboring elements of the
form

s_i = Σ_{j=1}^{ℓ} x_j s_{i−j} ,    i = ℓ + 1, . . . , N,        (2.3.3)

for some (small) value of ℓ. The technique of estimating the ith element
from a number of previous elements is called linear prediction (LP), and the
LP coefficients xi can be used to characterize various underlying properties
of the signal. Throughout this book we will use a test problem where the
elements of the noise-free signal are given by

si = α1 sin(ω1 ti ) + α2 sin(ω2 ti ) + · · · + αp sin(ωp ti ), i = 1, 2, . . . , N.

In this particular example, we use N = 32, p = 2, α1 = 2, α2 = −1 and no


noise.
There are many ways to estimate the LP coefficients in (2.3.3). One
of the popular methods amounts to forming a Toeplitz matrix A (i.e., a
matrix with constant diagonals) and a right-hand side b from the signal,
with elements given by

aij = sn+i−j , bi = sn+i , i = 1, . . . , m, j = 1, . . . , n,

where the matrix dimensions m and n satisfy m + n = N and min(m, n) ≥ ℓ.


We choose N = 32 and n = 7 giving m = 25, and the first 7 rows of A and

the first 7 elements of b are


[  1.011  −1.151  −0.918  −2.099  −0.029   2.770   0.875 ]       [  2.928 ]
[  2.928   1.011  −1.151  −0.918  −2.099  −0.029   2.770 ]       [  1.056 ]
[  1.056   2.928   1.011  −1.151  −0.918  −2.099  −0.029 ]       [ −1.079 ]
[ −1.079   1.056   2.928   1.011  −1.151  −0.918  −2.099 ] ,     [ −1.197 ] .
[ −1.197  −1.079   1.056   2.928   1.011  −1.151  −0.918 ]       [ −2.027 ]
[ −2.027  −1.197  −1.079   1.056   2.928   1.011  −1.151 ]       [  1.160 ]
[  1.160  −2.027  −1.197  −1.079   1.056   2.928   1.011 ]       [  2.559 ]

The matrix A is rank deficient and it turns out that for this problem we can
safely compute the ordinary QR factorization without pivoting, correspond-
ing to Π = I. The matrix R1 and the vector QT1 b are
[ 7.970  2.427  −3.392  −4.781  −5.273   1.890   7.510 ]       [  2.573 ]
[ 0      7.678   3.725  −1.700  −3.289  −6.482  −0.542 ]       [ −4.250 ]
[ 0      0       6.041   2.360  −4.530  −1.136  −2.765 ]       [ −1.942 ]
[ 0      0       0       5.836  −0.563  −4.195   0.252 ] ,     [ −5.836 ] ,
[ 0      0       0       0       ε       ε       ε     ]       [   ε    ]
[ 0      0       0       0       0       ε       ε     ]       [   ε    ]
[ 0      0       0       0       0       0       ε     ]       [   ε    ]

where ε denotes an element whose absolute value is of the order 10^{−14} or smaller. We see that the numerical rank of A is r = 4 and that b is
the weighted sum of columns 1 through 4 of the matrix A, i.e., four LP
coefficients are needed in (2.3.3). A basic solution is obtained by setting the
last three elements of the solution to zero:

xB = ( −0.096 , −0.728 , −0.096 , −1.000 , 0 , 0 , 0 )T .

A numerically safer approach for rank-deficient problems is to use the QR


factorization with column permutations from Theorem 25, for which we get
    [ 0  0  0  0  0  1  0 ]
    [ 1  0  0  0  0  0  0 ]
    [ 0  0  0  0  0  0  1 ]
Π = [ 0  0  1  0  0  0  0 ]
    [ 0  0  0  1  0  0  0 ]
    [ 0  0  0  0  1  0  0 ]
    [ 0  1  0  0  0  0  0 ]

and
      [ 8.052  1.747  −3.062  −4.725 ]         [ −3.277 ]
R11 = [ 0      7.833  −4.076  −2.194 ] ,  d1 = [  3.990 ] .
      [ 0      0       5.972  −3.434 ]         [ −5.952 ]
      [ 0      0       0       5.675 ]         [  6.79  ]

The basic solution corresponding to this factorization is


xB = ( 0 , −0.721 , 0 , −0.990 , 0.115 , 0 , 0.027 )T .
This basic solution expresses b as a weighted sum of columns 1, 3, 4 and
6 of A. The example shows that the basic solution is not unique – both
basic solutions given above solve the LSQ problem associated with the linear
prediction problem.
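A basic solution is easy to compute with a pivoted QR factorization. The sketch below (ours) uses SciPy's qr with pivoting=True; the rank tolerance is an ad hoc choice.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def basic_solution(A, b, tol=1e-10):
    # QR with column pivoting: A[:, piv] = Q R.  Keep the leading r columns,
    # solve R11 y1 = d1 and scatter y1 back into a length-n solution vector.
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    r = int(np.sum(np.abs(np.diag(R)) > tol * abs(R[0, 0])))
    d1 = (Q.T @ b)[:r]
    y1 = solve_triangular(R[:r, :r], d1)
    x = np.zeros(A.shape[1])
    x[piv[:r]] = y1
    return x, r
```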
The basic solution introduced above is one way of defining a particular
type of solution of the rank-deficient LSQ problem, and it is useful in some
applications. However, in other applications we require the minimum-norm solution x̃* from Definition 27, whose computation reduces to solving the least squares problem

min_x ‖x‖_2 = min_{y2} ‖ Π [ R11^{−1}(d1 − R12 y2) ; y2 ] ‖_2 .

Using the basic solution x_B the problem is reduced to

min_{y2} ‖ x_B − Π [ R11^{−1} R12 ; I ] y2 ‖_2 .

This is a full-rank least squares problem with matrix Π [ R11^{−1} R12 ; I ]. The right-hand side x_B and the solution y2* can be obtained via a QR factorization. The following result from [101] relates the norms of the basic and minimum-norm solutions:

1 ≤ ‖x_B‖_2 / ‖x̃*‖_2 ≤ ( 1 + ‖R11^{−1} R12‖_2² )^{1/2} .


Complete orthogonal factorization


As demonstrated above, we cannot immediately compute the minimum-
norm least squares solution x  ∗ from the pivoted QR factorization. How-
ever, the QR factorization with column permutations can be considered as
a first step toward the so-called complete orthogonal factorization. This is a
decomposition that, through basis changes by means of orthogonal transfor-
mations in both Rm and Rn , concentrates the whole information of A into
a leading square nonsingular matrix of size r × r. This gives a more direct
way of computing x  ∗ . The existence of complete orthogonal factorizations
is stated in the following theorem.
Theorem 29. Let A be a real m × n matrix of rank r. Then there is an
m × m orthogonal matrix U and an n × n orthogonal matrix V such that
A = U R V^T   with   R = [ R11  0 ; 0  0 ] ,        (2.3.4)

where R11 is an r × r nonsingular triangular matrix.


A similar result holds for complex matrices with U and V unitary. The
LSQ solution can now be obtained as stated in the following theorem.
Theorem 30. Let A have the complete orthogonal decomposition (2.3.4)
and introduce the auxiliary vectors
U^T b = g = [ g1 ; g2 ]  (g1 ∈ R^r, g2 ∈ R^{m−r}),    V^T x = y = [ y1 ; y2 ]  (y1 ∈ R^r, y2 ∈ R^{n−r}).        (2.3.5)

Then the solutions to min_x ‖b − A x‖_2 are given by

x* = V [ y1 ; y2 ] ,    y1 = R11^{−1} g1 ,    y2 = arbitrary,        (2.3.6)

and the residual norm is ‖r*‖_2 = ‖g2‖_2. In particular, the minimum-norm solution x̃* is obtained by setting y2 = 0.

Proof. Replacing A by its complete orthogonal decomposition we get

‖b − A x‖_2² = ‖b − U R V^T x‖_2² = ‖U^T b − R V^T x‖_2² = ‖g1 − R11 y1‖_2² + ‖g2‖_2².

Since the sub-vector y2 cannot lower this minimum, it can be chosen arbitrarily and the result follows.
The triangular matrix R11 contains all the fundamental information of
A. The SVD, which we will introduce shortly, is a special case of a complete
orthogonal factorization, which is more computationally demanding and
involves an iterative part. The most sparse structure that can be obtained
by a finite number of orthogonal transformations, the bidiagonal case, is left
to be analyzed exhaustively in the chapter on direct numerical methods.
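A complete orthogonal factorization can be sketched with two QR factorizations; this is one standard construction (ours), not the algorithm of [78] used in Example 31, and the rank tolerance is an assumption.

```python
import numpy as np
from scipy.linalg import qr

def complete_orthogonal(A, tol=1e-10):
    # Step 1: pivoted QR,  A[:, piv] = Q [R_top; 0]  with R_top having r rows.
    Q, R, piv = qr(A, mode='full', pivoting=True)
    r = int(np.sum(np.abs(np.diag(R)) > tol * abs(R[0, 0])))
    # Step 2: QR of R_top^T annihilates R12, giving A = U [[R11,0],[0,0]] V^T.
    Z, T = qr(R[:r, :].T, mode='full')
    n = A.shape[1]
    P = np.eye(n)[:, piv]               # permutation matrix with A @ P = Q @ R
    U, V, R11 = Q, P @ Z, T[:r, :r].T   # here R11 is r x r lower triangular
    return U, R11, V, r
```

As a sanity check, U[:, :r] @ R11 @ V[:, :r].T should reproduce A up to rounding errors.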
Example 31. We return to the linear prediction example from the previous
section; this time we compute the complete orthogonal factorization from
Theorem 29 and get
      [ −9.027  −4.690  −1.193   5.626 ]         [ −1.640 ]
R11 = [  0      −8.186  −3.923  −0.373 ] ,  g1 = [  3.792 ] ,
      [  0       0       9.749   5.391 ]         [ −6.704 ]
      [  0       0       0       9.789 ]         [  0.712 ]
and
    [ −0.035   0.355  −0.109  −0.521  −0.113  −0.417   0.634 ]
    [  0.582  −0.005   0.501  −0.103   0.310  −0.498  −0.237 ]
    [  0.078   0.548   0.044   0.534   0.507   0.172   0.347 ]
V = [ −0.809   0.034   0.369   0.001   0.277  −0.324  −0.163 ] .
    [  0      −0.757  −0.006   0.142   0.325  −0.083   0.543 ]
    [  0       0      −0.774   0.037   0.375  −0.408  −0.305 ]
    [  0       0       0      −0.641   0.558   0.520  −0.086 ]

This factorization is not unique and the zeros in V are due to the particular
algorithm from [78] used here. The minimum-norm solution is
x̃* = V [ R11^{−1} g1 ; 0 ]
    = ( −0.013 , −0.150 , −0.028 , −0.581 , 0.104 , 0.566 , −0.047 )^T .

This solution, as well as the basic solutions from the previous example, all
solve the rank-deficient least squares problem.
Example 32. We will show yet another way to compute the linear predic-
tion coefficients that uses the null space information in the complete orthog-
onal factorization. In particular, we observe that the last three columns of
V span the null space of A. If we extract them to form the matrix V0 and
compute a QR factorization of its transpose, then we get
                       [  0.767   0       0     ]
                       [  0.031   0.631   0     ]
                       [  0.119  −0.022   0.626 ]
V0^T = Q0 R0 ,  R0^T = [  0.001   0.452   0.060 ] .
                       [  0.446   0.001   0.456 ]
                       [ −0.085   0.624   0.060 ]
                       [ −0.436  −0.083   0.626 ]

Since A R0T = 0, we can normalize the last column (by dividing it by its
maximum norm) to obtain the null vector

v 0 = ( 0 , 0 , 1 , 0.096 , 0.728 , 0.096 , 1 )T ,

which is another way to describe the linear dependence between five neigh-
boring columns aj of A. Specifically, this v 0 states that a7 = −0.096a6 −
0.728 a5 − 0.096 a4 − a3, and it follows that the LP coefficients are x1 =
−0.728, x2 = −0.096, x3 = −0.096 and x4 = −1, which is identical to the
results in Example 28.
Chapter 3

Analysis of Least Squares Problems

The previous chapter has set the stage for introducing some fundamental
tools for analyzing further the LSQ problem, namely, the pseudoinverse
and the singular value decomposition. After introducing these concepts,
we complete this chapter with a definition of the condition number for
least squares problems, followed by a discussion of the robustness of the
solution to perturbations in the data or the model.

3.1 The pseudoinverse


The previous chapter described computational expressions for computing
LSQ solutions to full-rank and rank-deficient problems. However, it is also
convenient to have a simple, closed form expression for the LSQ solution,
similar to the notation x = A−1 b for full-rank square problems.
The goal of this section is to define the pseudoinverse of a rectangu-
lar matrix that plays a similar role to the inverse of a square nonsingular
matrix, but has only some of its properties. The concept of generalized
inverse found in the literature varies. We will use the following definition,
appropriate for the least squares problem.

Definition 33. Pseudoinverse. Corresponding to any real rectangular


m × n matrix A there exists a unique n × m matrix X having the following
four properties:
(i) AXA = A, (ii) XAX = X,

(iii) (AX)T = AX, (iv) (XA)T = XA.


For a complex matrix, the last two conditions are (AX)H = AX and
(XA)H = XA, where H stands for conjugate transpose. This matrix X is
called the pseudoinverse (or Moore-Penrose inverse) and is denoted by A† .
A matrix X that satisfies conditions (i)–(iii) is called a generalized inverse
A− ; if A is rank deficient, then A− is not unique.

The pseudoinverse satisfies the relations (A† )† = A and (AT )† = (A† )T


(or (AH )† = (A† )H in the complex case), but we note that there are impor-
tant “inverse properties” that the pseudoinverse does not have in general,
such as (AB)† = B† A† and A A† = A† A.
doinverse of A has an expression in terms of the inverse of some matrix:

• If r = n = m, then A† = A−1 .

• If r = n < m and A is real, then A† = (AT A)−1 AT .

• If both A ∈ Rm×r and B ∈ Rr×n have full rank r ≤ min(m, n), then

(AB)† = B † A† = B T (BB T )−1 (AT A)−1 AT .

• The pseudoinverse has a simple expression in terms of the complete


orthogonal decomposition of Theorem 29:
A = U [ R11  0 ; 0  0 ] V^T   ⇔   A† = V [ R11^{−1}  0 ; 0  0 ] U^T .

The following results relate the generalized inverse, the pseudoinverse and
the least squares solution to one another.

Lemma 34. For a rectangular m × n matrix A with rank r ≤ n ≤ m, there


is an n × (n − r) matrix B whose columns form a basis for null(A), such
that the set of generalized inverses can be written as

A− = A† + BY,

where Y is an arbitrary (n − r) × m matrix. If A has full rank r = n, then


there is only one generalized inverse: A− = A† .

The matrix AA† is a projection matrix that plays an important role in


the analysis of least squares problems. In terms of the pseudoinverse, the
orthogonal projections onto the fundamental subspaces associated with A
can be expressed as

Projection onto the column space of A: Prange(A) = AA† .


Projection onto the null space of AT : Pnull(AT ) = I − AA† .
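Numerically, these projectors are easy to form and check; the small illustration below uses random data of our own, and np.linalg.pinv computes A† via the SVD.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))              # full-rank example matrix

P_range = A @ np.linalg.pinv(A)              # projector onto range(A)
P_null_AT = np.eye(6) - P_range              # projector onto null(A^T)

print(np.allclose(P_range @ A, A))                       # leaves range(A) unchanged
print(np.allclose(A.T @ P_null_AT, 0.0, atol=1e-12))     # maps onto null(A^T)
```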

Theorem 35. If A is rank deficient with a generalized inverse A−, then any least squares solution x* = argmin_x ‖b − A x‖_2 can be expressed in terms of A− as

x* = A− b = x̃* + z ,    z ∈ null(A),

where x̃* is the solution of minimal l2-norm, given by

x̃* = A† b.        (3.1.1)

The norm of the corresponding least squares residual is

‖r*‖_2 = ‖(I − A A†) b‖_2 = ‖P_null(A^T) b‖_2 .

In other words:
• If A is a full-rank matrix, the LSQ problem has a unique solution that
can be expressed using the pseudoinverse: x∗ = A† b.
• If A is rank deficient, there are infinitely many solutions to the LSQ
problem, and they can be expressed by generalized inverses; the
minimum-norm solution is x̃* = A† b.
We emphasize that the pseudoinverse is mainly a theoretical tool for analy-
sis of the properties of least squares problems. For numerical computations
one should use methods based on orthogonal transformations and factor-
izations, as discussed in the previous sections. We give a more detailed
discussion of these methods in the next chapters.
An important result, needed later for the sensitivity analysis of the
least squares solution, is the continuity of the pseudoinverse. In fact, the
pseudoinverse of A+ΔA, with ΔA a perturbation matrix, is a well-behaved
function as long as the rank of the matrix does not change.
Theorem 36. If rank(A + ΔA) = rank(A) and η = ‖A†‖_2 ‖ΔA‖_2 < 1, then the pseudoinverse of the perturbed matrix is bounded:

‖(A + ΔA)†‖_2 ≤ ‖A†‖_2 / (1 − η) .
Note that rounding errors always introduce perturbations of A. In terms
of data fitting we can think of ΔA as also representing inaccuracies in the
model M (x, t); the above theorem gives an upper bound for the size of
these errors that ensures continuity of the pseudoinverse and a fortiori that
of the LSQ solution. If the rank of A changes due to the perturbation
ΔA, the pseudoinverse of the perturbed matrix may become unbounded as
shown, e.g., in [20], p. 26.
In a later chapter we will also need the following formula from [101] for
the derivative of an orthogonal projection.

Lemma 37. Let A = A(α) ∈ Rm×n be a matrix of local constant rank,


differentiable with respect to the vector of variables α ∈ Rp , and let A† be
the pseudoinverse of A. Then

d/dα (A A†) = P_null(A^T) (dA/dα) A† + (A†)^T (dA^T/dα) P_null(A^T) ,

where the notation dA/dα stands for taking the gradient of each component of
A and constructing a three-dimensional tensor of partial derivatives (this
is also called the Fréchet derivative).

3.2 The singular value decomposition


The singular value decomposition (SVD) is a very useful tool, both theo-
retically and computationally. Given the matrix A, the SVD provides sets
of orthogonal basis for Rm and Rn such that the mapping associated with
A is represented by a diagonal matrix.
Theorem 38. SVD. Let A be a real m × n matrix with rank r satisfying
r ≤ n ≤ m. Then there exist orthogonal matrices U ∈ Rm×m , V ∈ Rn×n
and a diagonal matrix Σ ∈ Rm×n such that
A = U Σ V^T ,    Σ = [ diag(σ1, . . . , σn) ; 0 ] .        (3.2.1)

The diagonal elements σi of Σ are the singular values of A, and they appear
in descending order. The rank r is determined by the number of positive
singular values; i.e., rank(A) = r iff σ1 ≥ σ2 ≥ · · · ≥ σr > 0 and σr+1 =
· · · = σn = 0.
The columns of U = (u1 , . . . , um ) and V = (v 1 , ..., v n ) are the left and
right singular vectors associated with σi ; they form an orthonormal basis
for Rm and Rn , respectively.
A simpler form of the decomposition omits the zero singular values,
giving rise to the so-called economical (or thin) version of the SVD.
Theorem 39. Economical SVD. Let A be an m × n matrix of rank r
satisfying r ≤ n ≤ m. Then A can be factorized as A = Ur Σr VrT , where
Σr is an r × r diagonal matrix with positive elements. Moreover, the m × r
matrix Ur satisfies UrT Ur = Ir and the n × r matrix Vr satisfies VrT Vr = Ir .
The next theorem expresses a rank-r matrix as a sum of rank-one ma-
trices.

Theorem 40. If the matrix A has rank r and singular value decomposition
A = U Σ V T , with the matrices defined in Theorems 38 or 39, then A can
be written as a so-called dyadic decomposition of rank-1 matrices:


A = Σ_{i=1}^{r} σ_i u_i v_i^T .

There are complex versions of these theorems when A has complex entries,
where the matrices U and V are unitary (the singular values are always
real). We summarize some useful results for working with the SVD:

• Pre- or post-multiplication by an orthogonal matrix Q does not alter


the singular values of A. In particular, any permutation of the rows
or columns of A does not alter its singular values.

• Av i = σi ui and AT ui = σi v i for i = 1, . . . , n, while AT ui = 0 for


i = n + 1, . . . , m.

• In addition, if r < n, then Av i = 0 and AT ui = 0 for i = r + 1, . . . , n.

• The sets of left singular vectors {u1 , . . . , ur } and {ur+1 , . . . , um } are,


respectively, orthonormal bases for range(A) and its orthogonal com-
plement null(AT ).

• The sets of right singular vectors {v 1 , . . . , v r } and


{v r+1 , . . . , v n } are, respectively, orthonormal bases for range(AT ) and
its orthogonal complement null(A).

• Spectral norm: ‖A‖_2 = σ1.


• Frobenius norm: ‖A‖_F = ( σ1² + σ2² + · · · + σr² )^{1/2}.

• The SVD is related to the eigenproblem of the symmetric, non-negative


definite matrices AT A and AAT . In fact, the eigenvalue-vector decom-
positions of these matrices are
V^T (A^T A) V = diag(σ1², σ2², . . . , σr², 0, . . . , 0)   (with n − r trailing zeros),

U^T (A A^T) U = diag(σ1², σ2², . . . , σr², 0, . . . , 0)   (with m − r trailing zeros).

• The unique σi2 correspond to the eigenvalues of AT A (but calculating


them this way is not a stable numerical process and is therefore not
recommended).

• The v i are the unique eigenvectors for simple σi2 . For multiple eigen-
values there is a unique associated subspace, but the orthogonal basis
vectors that span the subspace and constitute the corresponding right
singular vectors are not unique.
• The ui for i = 1, ..., n can, in principle, be computed from Av i =
σi ui . The {ui }-set is then completed up to m columns with arbitrary
orthogonal vectors. Hence, a matrix may have many singular value
decompositions.

The singular values of a matrix are well conditioned with respect to changes
in the elements of the matrix. The following theorem is directly related to
a theorem for an associated eigenvalue problem (see [150] section 5 or [20]
p. 14).
Theorem 41. Let A and ΔA be m × n matrices and denote the singular
values of A and A + ΔA by σi and σ̃i, respectively. Then

|σ̃i − σi| ≤ ‖ΔA‖_2 ,  i = 1, . . . , n ,   and   Σ_{i=1}^{n} |σ̃i − σi|² ≤ ‖ΔA‖_F² .

In other words, the perturbation in the singular values is at most as


large as the perturbation in the matrix elements. We cite from [20] p. 14
the following perturbation theorem for singular vectors.
Theorem 42. Let the SVD of A be as in Theorem 38 and let the singular
values and vectors of the perturbed matrix A + ΔA be σ̃i, ũi and ṽi. Let θ(ui, ũi) and θ(v_i, ṽi) denote the angles between the exact and perturbed singular vectors and let γi = min(σ_{i−1} − σi , σi − σ_{i+1}). Then

max{ sin θ(ui, ũi) , sin θ(v_i, ṽi) } ≤ ‖ΔA‖_2 / (γi − ‖ΔA‖_2) ,

provided that ‖ΔA‖_2 < γi.
Thus, if σi is well separated from its neighbors, its corresponding sin-
gular vectors ui and v i are well defined with respect to perturbations. On
the other hand, if σi is close to a neighboring singular value, then their
corresponding singular vectors are possibly ill-defined. For a cluster of sin-
gular values, the individual singular vectors may be inaccurate, but if the
cluster is well separated from the other singular values, then the subspace
itself, defined by the span of the singular vectors, is well defined (see [105],
p. 450). We note in passing that for the particular case of bidiagonal matri-
ces, the bounds depend on the relative gap between a singular value and its
neighbors, and therefore the errors produced by a perturbation are much
smaller [60].
The SVD can be used to show how close a given matrix is to the set
of matrices with lower rank and, in particular, how close it is to being

rank deficient. This is frequently used to determine numerical rank, as the


following Eckart-Young-Mirski theorem [72] states.

Theorem 43. Let A be an m × n matrix of rank r and define the rank-k


matrix

A_k = Σ_{i=1}^{k} σ_i u_i v_i^T .

Then A_k is the rank-k matrix closest to A and the distance is

‖A − A_k‖_2 = σ_{k+1} ,    ‖A − A_k‖_F = ( σ_{k+1}² + · · · + σ_r² )^{1/2} .
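A quick numerical check of the theorem (our own, on random data) uses the truncated SVD:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]          # best rank-k approximation
print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))                 # spectral distance
print(np.isclose(np.linalg.norm(A - Ak, 'fro'),
                 np.sqrt(np.sum(s[k:]**2))))                       # Frobenius distance
```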

Given the singular value decomposition of a matrix A, we have the


necessary tool to give an explicit expression for the pseudoinverse in terms
of the SVD components. Specifically, the pseudoinverse of a general real,
rectangular matrix A of rank r is given by

A† = V Σ† U^T ,    Σ† = diag(σ1^{−1}, σ2^{−1}, . . . , σr^{−1}, 0, . . . , 0) ∈ R^{n×m} .

The Moore-Penrose conditions are easily verified for this matrix. Notice
that
‖A†‖_2 = σ_r^{−1}
for a rank-r matrix A.

Theorem 44. The minimum-norm LSQ solution (that is identical to the


LSQ solution x∗ when A has full rank) is given by


x̃* = A† b = V Σ† U^T b = Σ_{i=1}^{r} (u_i^T b / σ_i) v_i ,        (3.2.2)

and the corresponding solution and residual norms are


‖x̃*‖_2² = Σ_{i=1}^{r} (u_i^T b / σ_i)² ,    ‖r*‖_2² = Σ_{i=r+1}^{m} (u_i^T b)² .        (3.2.3)

All these results can be extended to the complex case.
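In code, (3.2.2) translates into a short routine (a sketch of ours, with an ad hoc rank tolerance):

```python
import numpy as np

def svd_solve(A, b, tol=1e-12):
    # Minimum-norm LSQ solution  x = sum_{i<=r} (u_i^T b / sigma_i) v_i.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))
    return Vt[:r].T @ ((U[:, :r].T @ b) / s[:r])
```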

Example 45. We have already seen the normal equations and the QR
factorization for the NMR problem from Example 1; here we show the SVD
analysis of this problem:
Σ3 = diag( 7.46 , 2.23 , 0.59 ),

     [ 0.114   0.613   0.782 ]             [  7.846 ]
V3 = [ 0.312   0.725  −0.614 ] ,  U3^T b = [  4.758 ] ,
     [ 0.943  −0.314   0.109 ]             [ −0.095 ]

and therefore

u1^T b / σ1 = 1.052 ,    u2^T b / σ2 = 2.134 ,    u3^T b / σ3 = −0.161 .

Hence, the LSQ solution is expressed in terms of the SVD as

x* = Σ_{i=1}^{3} (u_i^T b / σ_i) v_i
   = 1.052 ( 0.114 , 0.312 , 0.943 )^T + 2.134 ( 0.613 , 0.725 , −0.314 )^T − 0.161 ( 0.782 , −0.614 , 0.109 )^T
   = ( 1.303 , 1.973 , 0.305 )^T .

3.3 Generalized singular value decomposition


For some of the material on least squares problems with constraints, a
generalized version of the SVD is needed. Given two matrices A ∈ Rm×n
and B ∈ Rp×n , the generalized SVD (GSVD) provides sets of orthogonal
bases in Rm , Rn and Rp , such that the mappings associated with both A
and B are represented by diagonal matrices.

Theorem 46. Given A ∈ Rm×n and B ∈ Rp×n , there are orthogonal ma-
trices U ∈ Rm×m and V ∈ Rp×p and a nonsingular matrix X ∈ Rn×n such
that U^T A X = D_A = diag(α1, . . . , αn) and V^T B X = D_B = diag(β1, . . . , βq),
where αi ≥ 0 for i = 1, . . . , n, βi ≥ 0 for i = 1, . . . , q and q = min(n, p).

The generalized SVD is related to the generalized eigenvalue problem:


there is a nonsingular matrix X that is common to both matrices, such that
both X T (AT A) X and X T (B T B) X are diagonal.
The definition of the GSVD in [20] is slightly stronger; it assumes that the diagonal elements are ordered and involves the rank of the stacked matrix [ A ; B ]. The
generalized SVD is but one of several generalized orthogonal decompositions
that are designed to avoid the explicit computation, for accuracy reasons,
of some matrix operations. In the present case, if A and B are square and
B is nonsingular, then the GSVD corresponds to the SVD of AB −1 .

Figure 3.4.1: The geometric interpretation of the SVD, based on the re-
lation Av i = σi ui for i = 1, . . . , n. The image AS of the hyper-sphere
S = {x ∈ Rn | x2 = 1} is a hyper-ellipsoid in Rm centered at the origin,
with principal axes being the singular vectors ui and with the singular values
σi as the half-axes lengths. The condition number of A is the eccentricity,
i.e., the ratio between the largest and smallest half-axes lengths.

3.4 Condition number and column scaling


The definition of the condition number given for square nonsingular matri-
ces can now be extended in a natural way to rectangular matrices, including
those that are rank deficient.

Definition 47. The l2 (or spectral) condition number of a rectangular rank-


r matrix with singular values σ1 ≥ σ2 ≥ . . . ≥ σr > σr+1 = . . . = σn = 0 is
defined by
l2 condition number:   cond(A) = σ1 / σr .        (3.4.1)

If the columns of A were orthonormal, then the condition number would


be cond(A) = 1, indicating a perfectly conditioned matrix. As we shall see
below, the condition number plays an important role in the study of the
sensitivity of the LSQ problem to data, model and computational errors.
One can visualize the relation between the singular values and the con-
dition number of A by considering the image by A of the unit hyper-sphere
S = {x ∈ Rn | x2 = 1} with center at the origin. The image AS is a
hyper-ellipsoid in Rm centered at the origin, with the singular vectors ui
as principal axes and with the singular values σi as the half-axes lengths.
This is easily deduced from the facts that the {v i }i=1,...,n are an or-
thonormal basis for Rn , v i ∈ S and Av i = σi ui . See also Figure 3.4.1. The
eccentricity of the hyper-ellipsoid, i.e., the ratio between the largest and
the smallest half-axes, is equal to cond(A). Note that if A is rank deficient
then the hyper-ellipsoid has collapsed to a smaller dimension.
The elongation of the hyper-ellipsoid gives a good graphical idea of
the condition number of the matrix and its nearness to singularity. For a

perfectly conditioned matrix the hyper-ellipsoid becomes a hyper-sphere.


Full-rank matrices generate a hyper-ellipsoid whose eccentricity increases
with the condition number, and in the rank-deficient case the presence of
zero singular values implies that the hyper-ellipsoid has collapsed in the
n − r directions ui corresponding to σi = 0.
One can pre-process the original LSQ problem by scaling to improve
the condition. The two options are row and column scaling. Row scaling
involves left multiplication of both A and b by a diagonal matrix; however,
this is equivalent to the use of weights and therefore changes the LSQ
problem and its solution. Column scaling, on the other hand, involves a
right multiplication of A by a diagonal matrix D corresponding to a scaling
of the variables.
Column scaling is therefore a candidate technique to improve the con-
dition number of the matrix in the LSQ problem, adding robustness to the
numerical least squares computations. The LSQ problem changes to

min b − ADD−1 x2 = min b − ADy2 with y = D−1 x.


x y

If A has full rank, the unique solution x∗ is obtained correctly from y ∗ .


However, if A is rank deficient, although the correct set of residual-minimizing vectors is obtained, the minimal-length vector ỹ* is chosen to minimize ‖D^{−1} x‖_2, not ‖x‖_2, and therefore the computed solution x = D ỹ* does not correspond to the correct minimum-norm solution x̃*.
The following theorem from [243] advises on an appropriate choice of
the diagonal scaling matrix D in order to reduce as much as possible the
condition number cond(AD) of the column-scaled matrix.
Theorem 48. Let B denote a symmetric positive definite matrix whose
diagonal elements are identical. Then

cond(B) ≤ n min_D cond(D^T B D),

where D varies over the set of all diagonal matrices.


This theorem, when applied to the normal equations matrix AT A, states
that, in the full-rank case, if the scaling D is such that all columns of AD
have unit Euclidean norm – so that the diagonal elements of A^T A are 1 – then the condition number of the matrix AD is within a factor √n of
the optimum minD cond(AD). Hence, unless there is additional statistical
knowledge, the model basis functions fj (t) in (1.2.3) should preferably be
scaled, so that all columns of A are of unit Euclidean length. The effect
can be important, as shown by the following example from [20], p. 50.
Example 49. In data fitting with polynomials one often uses the mono-
mials fj (t) = tj−1 for j = 1, . . . , n as the model basis functions. Then the

matrix is a Vandermonde matrix with elements aij = tj−1 i , where t1 , . . . , tm


are the data abscissas. In this example we use m = 21, n = 6 and the ab-
scissas 0, 1, 2, . . . , m. The corresponding Vandermonde matrix has elements
(i − 1)j−1 (there is a misprint in [20]) and condition number cond(A) =
6.40 · 106 . Scaling its columns to unit Euclidean norm reduces the condition
number to cond(AD) = 2.22 · 103 – a dramatic reduction.
It is interesting to know that in the square case there are fast algo-
rithms (requiring O(n2 ) flops) for solving Vandermonde systems that take
advantage of the special structure and are fairly impervious to this generic
ill-conditioning [27]. Higham [128] has provided a detailed analysis showing
why this algorithm performs as well as it does. This alerts us to the fact
that condition numbers are problem and algorithm dependent. In fact, for
ill-conditioned Vandermonde systems Gaussian elimination blows up (i.e.,
it is unstable).
Demeure and Sharf [62] derived a fast algorithm to compute the QR
factors of a complex Vandermonde matrix with complexity O(5mn) flops.
No discussion of the numerical stability of the method is given. See also
[206].
Example 50. We return to the NMR problem from Example 1, whose
normal equation matrix AT A is shown in Example 11. It follows from the
SVD analysis in Example 45 that cond(A) = σ1 /σ3 = 7.46/0.59 = 12.64.
The optimal column scaling for this matrix is approximately

Dopt = diag(0.659, 0.438, 0.141),

leading to the condition number cond(ADopt ) = 6.09 (there is an arbitrary


scaling in Dopt which we resolved by setting (dopt )33 = A(:, 3)−1
2 ). The
scaling D that normalizes the columns of A is

D = diag(0.697, 0.350, 0.141),

leading to cond(AD) = 6.21, which is just slightly larger than the optimal
one.
One important last observation is that column scaling affects the singu-
lar values; in fact, scaling may lead to a better computation of the numerical
rank (see the next section). This is particularly important in the case of
nearly rank-deficient matrices. For a careful discussion of scaling, espe-
cially the use of statistical information to define scaling matrices, see [150],
chapter 25 and [104].
Example 51. We finish this section with another example related to poly-
nomial fitting with monomials as basis functions. As we have already seen,
the condition number of the associated Vandermonde matrix can be quite
58 LEAST SQUARES DATA FITTING WITH APPLICATIONS

n cond(A) cond(A ) cond(A )


3 5.18 · 102 21.0 3.51
4 1.16 · 104 1.10 · 102 7.51
5 2.65 · 105 5.93 · 102 17, 2
6 6.40 · 106 3.26 · 103 38.9
7 1.66 · 108 1.8e · 104 91.9
8 4.58 · 109 1.05 · 105 2.16 · 102
9 1.33 · 1011 6.17 · 105 5.29 · 102
10 4.06 · 1012 3.73 · 106 1.29 · 103

Table 3.1: Condition numbers for Vandermonde matrices of size 21 × n in


polynomial data fitting with monomials as basis functions. The matrix A
corresponds to using the abscissas 0, 1, 2, . . . , 20, while A and A arise when
the t-interval is transformed to [0, 1] and [−1, 1], respectively.

large because of the growth of the polynomials. Hence, a way to improve the
conditioning of the matrix is to apply a change of variables in order to reduce
the t-interval. Table 3.1 shows the condition numbers for the 21 × n ma-
trices associated with the three invervals [0, 21], [0, 1] and [−1, 1] for fitting
orders n = 3, . . . , 10; clearly, the interval [−1, 1] gives the best-conditioned
problems.

3.5 Perturbation analysis


We finish this chapter with a study of the robustness of the least squares
solution under perturbations. We distinguish between two different kinds of
perturbations, namely, data errors in the form of measurement errors, which
manifest themselves as a perturbation of the right-hand side, and model
errors in the form of errors in the fitting model M (x, t), which manifest
themselves as perturbations of the matrix. At this stage we ignore the
rounding errors during the computations; we return to this aspect in the
next chapter.
When assessing the quality of a least squares solution and depending on
the problem, one might be more interested in a bound for either the LSQ
solution x∗ (for parameter estimation problems) or for the residual r ∗ (for
data approximation problems).
As is well known from the analysis of linear systems of equations,
cond(A) plays an important role in sensitivity studies. In particular, we
know that if the matrix is ill-conditioned, the solution of the linear system
may be very sensitive to perturbations. We shall see that cond(A) is also
important for the least squares problem, but we will also show that it does
not tell the full story about the sensitivity of the least squares solution.
ANALYSIS OF LEAST SQUARES PROBLEMS 59

This will motivate the introduction of the condition number cond(A, b) for
the least squares problem.
Throughout this section we use the following standard notation for per-
turbation analysis. The unperturbed problem minx b−A x2 has the LSQ
solution x∗ and the corresponding residual r ∗ . The perturbed system has
matrix A + ΔA and right-hand side b + Δb; we write the perturbed solution
as x∗ + Δx and the corresponding solution as r ∗ + Δr. We are interested
in deriving upper bounds for the norms of the perturbations Δx and Δr.
We have seen that the least squares solution x∗ is the solution to the
normal equations (2.1.5) – when A has full rank – and one can therefore,
in principle, use the perturbation bounds available for linear systems of
equations.
Example 52. Sensitivity analysis via normal equations. It follows
immediately from the normal equations that if ΔA = 0 then

Δx2 = (AT A)−1 AT Δb2 ≤ (AT A)−1 2 A2 Δb2 .


Moreover, we have A x∗ 2 ≤ A2 x∗ 2 . From the SVD of A we know
that A2 = σ1 and (AT A)−1 2 = max{σi−2 } = σn−2 . Combining these
bounds and using cond(A) = σ1 /σn , we obtain the following perturbation
bound for the solution:
Δx2 Δb2
≤ cond(A)2 .
x∗ 2 b2
The upper bound in this expression is too pessimistic, due to the presence of
the squared condition number; below we give attainable perturbation bounds,
and we show that a squared condition number only arises when matrix per-
turbations are present. However, we emphasize that the rounding errors in
the normal equations approach are proportional to cond(A)2 .
Even as a measure of the sensitivity of the solution of a linear system,
the matrix condition number may be too conservative. Chan and Foulser
[41] demonstrate how the projection of the right-hand side b onto the range
of A, in addition to cond(A), can strongly affect the sensitivity of the linear
system solution in some instances. Thus, the definition of the condition
number of a least squares problem that takes into account the right-hand
side is a natural step.
The following theorem, adapted from [20], p. 31, applies to the full-rank
case. It gives attainable upper bounds for the norms of the solution and
residual perturbations in terms of perturbations of A and b. These bounds
are obtained under the hypothesis that the components of A and b have
all errors of roughly the same order. Note that the use of norms ignores
how the perturbations are actually distributed among the components of
an array. For component-wise bounds see [21].
60 LEAST SQUARES DATA FITTING WITH APPLICATIONS

Theorem 53. LSQ perturbation bound. Assume that A has full rank
and that x∗ minimizes b − A x2 with residual r ∗ . Let ΔA, Δb and Δx
be perturbations to A, b and x∗ , so that x∗ + Δx minimizes (b + Δb) −
(A + ΔA)(x + Δx2 with residual r ∗ + Δr . Moreover, assume that

ΔA2 Δb2
≤ A, ≤ b
A2 b2

and that cond(A) A < 1. Then the perturbation in the solution satisfies

Δx2 cond(A) b b2 cond(A) A r ∗ 2


≤ A + +
x∗ 2 1 − cond(A) A A2 x∗ 2 A2 x∗ 2
(3.5.1)
and in the residual

Δr2 A2 x∗ 2 b2


≤ A cond(A) + + b . (3.5.2)
r ∗ 2 r ∗ 2 r ∗ 2

The condition cond(A) A < 1 ensures that the rank of A does not change
under the perturbation.

We emphasize that when matrix perturbations are present then the


solution’s sensitivity to these errors is proportional to cond(A)2 , while the
sensitivity of the residual is proportional to cond(A).
If we only consider data errors, i.e., if A = 0, then the perturbation
bounds for the LSQ solution and the residual take the simpler forms

Δx2 b2 Δr2 b2


≤ cond(A) b , ≤ b , (3.5.3)
x∗ 2 A2 x∗ 2 r ∗ 2 r ∗ 2

which are similar to the bounds for linear systems of equations where
b = A x. We see that the condition number cond(A) governs the data
perturbations in this case.
Let us now consider the opposite case, with only model errors, i.e.,
b = 0. Then the perturbation bound reduces to

Δx2 cond(A) r ∗ 2
≤ 1 + cond(A) A,
x∗ 2 1 − cond(A) A A2 x∗ 2

and we recall that the last factor A is a bound for the relative perturbation
in A. Inspired by this result it is common to define the condition number
of the LSQ problem with respect to model perturbations as

r ∗ 2
cond(A, b) = 1 + cond(A) cond(A). (3.5.4)
A2 x∗ 2
ANALYSIS OF LEAST SQUARES PROBLEMS 61

We note that the least squares problem condition number, cond(A, b), in-
cludes a term with the squared matrix condition number, although multi-
plied by the relative residual norm. The other important difference from the
traditional matrix condition number is its explicit dependence on b, through
x∗ and r ∗ . If A is ill-conditioned, then both condition numbers are large.
One can bound cond(A, b) from below by sec(θ) cond(A) ≤ cond(A, b), cf.
[108]. Here, θ is the angle between the right-hand side b and the subspace
range(A), i.e., a measure of the consistency of the problem. From this lower
bound it is straightforward to see that the problem is ill-conditioned if either
the matrix is ill-conditioned (i.e., cond(A) is large) or b is almost perpen-
dicular to range(A) (i.e., sec(θ) is large). The following example based on
an example from [105] with a modification by Grcar [108] illustrates this
aspect
Example 54. Consider the LSQ problem
⎛ ⎞ ⎛ ⎞
1 0 βc
A = ⎝0 α⎠ , b = ⎝β s⎠ , c2 + s2 = 1,
0 0 1
where α < 1 is such that cond(A) = α−1 . Moreover β controls the size of
the component of b in the range of A and s controls the components of b
along the left singular vectors u1 = (1 0 0)T and u2 = (0 1 0)T . The LSQ
solution, its residual and their norms are
⎛ ⎞ !
β c 0
∗ ⎝ ⎠ s2
x = β ∗
, r = 0 , x 2 = β c2 + 2 , r ∗ 2 = 1.
α s
α
1
Thus, the problem condition number is
2
1 1 (c2 + αs 2 )−1/2
cond(A, b) = 1+ .
α α β

There are two circumstances in which this condition number can become
large. One is when α is small, which corresponds to an ill-conditioned
matrix (in which case both cond(A) and cond(A, b) become large), while the
other situation is when β is small, which happens when the right-hand side
b has only a small component in the range of A (such that the LSQ solution
norm is small).
Let us now consider how the problem condition number depends on the
parameter s that controls the components of b along the two left singular
vectors. We have
⎧  
⎨ 1 + β1 α1 , for s → 1 (c → 0)
cond(A, b) →  
⎩ 1 + 1 1 1 , for s → 0 (c → 1).
α β α
62 LEAST SQUARES DATA FITTING WITH APPLICATIONS

s → 1 means that b becomes dominated by the singular vector u2 and we


see that the condition number is essentially proportional to β −1 and α−1 .
However, when s → 0, meaning that b becomes dominated by the principal
singular vector u1 , then the condition number is essentially proportional to
β −1 and α−2 ; note the squaring of the matrix condition number cond(A)
in the latter case.

The above example illustrates that the problem can be ill-conditioned


even if the matrix is well conditioned; this happens when the right-hand
side has only a small component in the range of A. This situation should,
of course, be avoided in data fitting because it means that the model is
not capable of describing the data very well. The example also illustrates
under which circumstances the influence of cond(A)2 is noticeable, namely,
when the right-hand side is dominated by components along the singular
vectors corresponding to the larger singular values (which is typical for
discretizations of ill-posed problems). In all cases, however, we should recall
that these effects depend on the size of the LSQ residual! If the problem is
almost consistent – because the data errors and approximation errors are
small – then the factor r ∗ 2 in cond(A, b) is also small and this condition
number behaves like cond(A). Only for large-residual problems we see a
major difference in the two condition numbers.
In practical applications all these error bounds are valuable only if there
is an efficient way to compute the condition number. For most of the al-
gorithms described in the next sections, at least an estimate of the matrix
condition number cond(A) can be computed inexpensively using informa-
tion generated while solving the problem.
In order to bound the actual error in a computed least squares solution,
i.e., to determine its accuracy, one can combine the above error estimates
with the backward stability bounds for the algorithm used, as described in
detail in Appendix A. For each of the algorithms that will be described in
the next chapters, the backward stability is also discussed.
We have examined the sensitivity of full-rank problems to model errors.
In the case of a rank-deficient matrix A no such estimate can be derived,
in general, because the solution may not be a continuous function of the
matrix elements, as one can see from the discontinuity of the pseudoinverse,
so one needs to restrict the size of the perturbations so that the rank remains
constant.
We finish this section with an example that illustrates the connection
between the SVD and the perturbations in the right-hand side. We also
demonstrate that the SVD gives a more detailed picture of the sensitivity
than the norm bounds.

Example 55. We consider the simplified NMR problem from Example 12,
ANALYSIS OF LEAST SQUARES PROBLEMS 63

Figure 3.5.1: Level curves for simplified NMR problem.

for which the singular values and right singular vectors are

0.472 −0.881
σ1 = 3.211, σ2 = 0.805, v1 = , v2 = .
0.881 0.472

Figure 3.5.1 shows the level curves for the least squares problem, where
the arrows depict the right singular vectors v 1 and v 2 respectively. The
cloud of dots are the LSQ solutions for 50 perturbed problems with the
same standard deviation ς = 0.2. This figure illustrates that the sensitivity
of the LSQ solution x∗ = (uT1 b/σ1 ) v 1 + (uT2 b/σ2 ) v 2 is different in the two
directions given by the singular vectors v 1 and v 2 . The solution is clearly
more sensitive in the direction of the singular vector v 2 that corresponds to
the smaller singular value and less sensitive in the direction of v 1 . This is
true in general – the sensitivity of the LSQ solution in the direction of v i
is proportional to σi−1 .
This page intentionally left blank
Chapter 4

Direct Methods for


Full-Rank Problems

This chapter describes a number of algorithms and techniques for the case
where the matrix has full rank, and therefore there is a unique solution
to the LSQ problem. We compare the efficiency of the different methods
and make suggestions about their use. We start with the least expensive,
albeit also the least robust method: solution of the normal equations, a
convenient approach only if the LSQ problem is well conditioned and the
number of rows of A is small. Next, a more robust method for slightly
overdetermined problems is described. The largest part of this chapter is
taken up with methods using orthogonal transformations.
A word on our notation: Throughout this chapter,  ei denotes the ith
column of the identity matrix – not to be confused with the noise vector e
that does not appear in this chapter. The computational work is measured
in flops, where one “flop” is a single floating-point operation (+, −, ∗ or /).

4.1 Normal equations


Theorem 9 shows that the LSQ solution x∗ can be obtained by solving
the normal equations AT Ax = AT b, which in the full-rank case is a sym-
metric positive definite system. While A is an m × n matrix, the Gram-
mian matrix AT A of the normal equations is n × n and thus comparatively
smaller if m  n. Also, only about half of the elements (more precisely,
1 T
2 n(n + 1)) of A A need to be computed and stored, resulting in a net
data compression. The condition number of the normal equations system,
however, is cond(AT A) = cond(A)2 , thus making it potentially more sensi-
tive to rounding errors than methods based on orthogonal transformations
applied directly to A.

65
66 LEAST SQUARES DATA FITTING WITH APPLICATIONS

The natural solution method for the normal equations is the Cholesky
factorization [20, 105] (known as the square root method in statistics). The
total operation count for forming the triangular part of the Grammian AT A,
computing the Cholesky factor R1 , such that AT A = R1T R1 , forming AT b,
and finally solving the resulting two triangular systems, amounts to n2 (m+
1
3 n) flops, which is more economical than other methods for LSQ problems.
The following theorem shows that, if we work with the Grammian for the
bordered matrix ( A b ), then as a by-product we obtain the norm of the
LSQ residual.
Theorem 56. The Grammian for the bordered matrix ( A b ) has the
(n + 1) × (n + 1) Cholesky factor
R1 QT1 b
C= ,
0 r ∗ 2
where the diagonal elements of R1 are positive and A = Q1 R1 is the eco-
nomical QR factorization (17). Moreover, QT1 b = R1−1 (AT b) is the reduced
right-hand side of the normal equations and r ∗ is the LSQ residual.
Proof. Using the full QR factorization of A it follows immediately that
R1 QT1 b
( A b ) = ( QR QQT b ) = Q .
0 QT2 b
By using an orthogonal transformation H2 (in fact, a Householder reflec-
tion, cf. Section 4.3), we can obtain H2T QT2 b = (QT2 b2 , 0, . . . , 0)T . Since
QT2 b2 = Q2 QT2 b2 = r ∗ 2 we get
⎛ ⎞
R1 QT1 b
( A b ) = Q ⎝ 0 r ∗ 2 ⎠ ,
0 0
showing that C is the Cholesky factor of the bordered matrix. Moreover,
from A = Q1 R1 we get QT1 = R1−1 AT , so that QT1 b = R1−1 (AT b) is the
reduced right-hand side of the normal equations.
There are some disadvantages when one uses the normal equations for
ill-conditioned problems. The first one is related to the computation of the
Grammian AT A. If the columns of A are close to being linearly dependent,
then there might be significant rounding error in the computation of this
matrix, because there is in fact a squaring of the elements, as shown in the
following example from [57].
Example 57. Consider the nonsingular matrix
⎛ ⎞
1 1
A = ⎝10−3 0 ⎠.
0 10−3
DIRECT METHODS FOR FULL-RANK PROBLEMS 67

If we use 6- or 7-digit chopped arithmetic, then the computed versions of


the normal equations matrix are
1 1 1 + 10−6 1
f 6 (AT A) = , f 7 (AT A) = .
1 1 1 1 + 10−6
The first matrix is singular due to the effect of the finite precision! The
second matrix is computed with just enough precision to be nonsingular
(but it is ill-conditioned).
The recommendation is to use extended precision accumulation for the
inner products in order to diminish rounding errors and so avoid a pos-
sible breakdown of the Cholesky factorization. Wilkinson has shown that
if εM cond(AT A) ≤ 1 then the Cholesky process can be completed, since
no square root of a negative number arises (see [105], p. 147). This condi-
tion number can be reduced by normalizing the columns of A, so that the
resulting matrix has columns of unit Euclidean norm, as discussed in the
previous chapter.
The least squares solution via the normal equations is not backward
stable, i.e., the computed solution is not, in general, the exact solution of
a slightly different least squares problem. If D = diag(A(:, j)2 ), then a
bound for the computed LSQ solution x∗ is approximately (see [128], pp.
387)  
D(x∗ − x∗ )  

2
 O cond(AD−1 )2 εM .
Dx 2
In theory, there exists a scaling D that gives a realistic upper bound for the
relative errors that is as small as possible, but in practice the scaling is not
so important, and experiments (illustrated by our figure) show that the nu-
merical results are practically independent of the scaling. In short, the nor-
mal equations method is the least expensive way to solve well-conditioned
problems. An example of this is data fitting with orthogonal polynomials,
where AT A is diagonal.
Example 58. Figure 4.1.1 illustrates the accuracy of the LSQ solution
when it is computed via the normal equations. We generated coefficient
matrices A with condition numbers varying from 102 to 1011 and solved the
LSQ problems via the normal equations, both in their standard formulation
and with the columns of A scaled to unit length. We generated two types of
problems – either with zero residual (i.e., b ∈ range(A)) or with a nonzero
residual r ∗ . The figure shows the relative errors x∗ − x∗ 2 / x∗ 2 , where
x∗ is the computed solution, and we see that these errors are proportional to
cond(A)2 . As cond(A) approaches 108 the relative errors approach 1, and
when cond(A) exceeds approximately 109 , then the Cholesky factorization
breaks down in most cases. We note that the relative errors are, on average,
the same for the standard and scaled versions of the problem.
68 LEAST SQUARES DATA FITTING WITH APPLICATIONS

Figure 4.1.1: The relative errors of LSQ residuals computed by means of


the normal equations, both in their standard form and in a variant where A
is normalized to have columns of unit 2-norm. The errors for both cases are
similar and proportional to cond(A)2 . The left plot is for a LSQ problem
with zero residual, while the right plot is for a problem with nonzero residual.

One practical suggestion for large problems, when secondary storage is


needed for the matrix A [20], is to compute AT A and AT b in an outer
product form instead of the usual inner product that requires the access to
each column of A many times. The idea of the outer product form assumes
that the bordered matrix ( A b ) is stored row-wise: [ A(i, :) , b(i) ]. The
matrix product AT A can be written as the sum of rank-one matrices:


m 
m
AT A = A(i, :)T A(i, :), AT b = b(i) A(i, :)T .
i=1 i=1

For each term in the above sums, only the ith row must be retrieved, and
thus it is retrieved only once.

4.2 LU factorization
A method introduced by Peters and Wilkinson [201], suited for slightly
overdetermined LSQ problems with m − n n, is to factorize the rectan-
gular matrix A into a product of two full-rank matrices. For early uses of
this factorization see [185, 186, 193] and Section 2.5 in [20] for more details.
The method has been adapted successfully to sparse and weighted problems
[22]. The existence of the factorization is proven in the following theorem:

Theorem 59. If the real rectangular m × n matrix A has rank r, then it


can be factorized in the form

A = B C,
DIRECT METHODS FOR FULL-RANK PROBLEMS 69

where both B ∈ Rm×r and C ∈ Rr×n have full rank r. Although this
factorization is not unique, the unique pseudoinverse is given by:
A† = C † B † = C T (C C T )−1 (B T B)−1 B T .
If A is itself full rank, Gaussian elimination can be used to compute the
factors of A = L U , with L a lower trapezoidal and U an upper triangular
matrix. For convenience, the upper triangular matrix can be decomposed
into D U, with D diagonal and U upper triangular with unit diagonal.
In order to improve stability, it is essential to use either full pivoting or
the less expensive partial pivoting, in order to keep |Lij | ≤ 1 (and |Uij | ≤ 1
when using complete pivoting). Note that, in principle, one does not know
if A is truly full rank, and as part of the elimination process one should
check for pivots (i.e., the elements of D) that are small in magnitude.
Assuming that A was indeed full rank, we obtain
L1
Π1 A Π2 = LDU, L= ,
L2
where Π1 and Π2 are permutation matrices (Π2 is the identity if partial
pivoting is used) that try to keep L well conditioned. Moreover, L1 is n × n
lower triangular and U is n × n upper triangular, both with unit diagonal.
The original LSQ problem is now replaced by another LSQ problem in L:
min L y − Π1 b2 , with y = DU ΠT2 x. (4.2.1)
y

A good option for solving the transformed problem is to use the correspond-
ing normal equations LT L y = LT Π1 b. The total cost of this approach is
2n2 (m − 31 n3 ) flops, and it is numerically safe because L is usually well
conditioned. This is illustrated by the following example due to Noble and
cited in [20].
Example 60. Consider the matrices
⎛ ⎞
1 1
3 3
A = ⎝1 1 + ε⎠ , AT A = ;
3 3 + 2ε2
1 1−ε

the latter has condition number cond(AT A)  6 ε−2 and is numerically


singular if ε is less than the square root of the machine precision. On the
other hand, if we use the above LU factorization, keeping |Lij | ≤ 1, then
⎛ ⎞
1 0
1 1 3 0
L U = ⎝1 1 ⎠ , LT L = ,
0 0 2
1 −1

with cond(LT L) = 1.5, and we see that LT L is much better conditioned


than AT A.
70 LEAST SQUARES DATA FITTING WITH APPLICATIONS

Another option due to Cline [49], for minimization of the transformed


problem (4.2.1), is to reduce the matrix L by orthogonal transformations
to lower triangular form,

L1 L c1
=Q , QT Π1 b = .
L2 0 c2

The resulting problem is then


 
 L̄ c1 
min L y − Π1 b2 = min  y−  ,
y y  0 c2 
2

i.e., one has to solve two triangular systems L̄y = c1 and U x = y. The
total cost of this approach is 3n2 (m − 97 n) flops, i.e., larger than the cost
of the Peters-Wilkinson algorithm. Since LU factorization is (in practice)
backward stable, as well as the Householder transformations, we believe
that the combined method is also backward stable. As far as we know,
there is no formal analysis published.

4.3 QR factorization
The QR factorization method is based on the factorization of A into an
orthogonal matrix and an upper triangular one, as described in Theorems
16 and 17, and it is a more stable – although also a more expensive – method
to solve LSQ problems. The factorization can be computed in several ways,
either implicitly by Householder or Givens transformations or by explicit
Gram-Schmidt orthogonalization. The basic steps for the LSQ solution are

• Transformation of the original problem to a nonsingular triangular


system of equations using orthogonal transformations.

• Solution by back-substitution of this system and, optionally, compu-


tation of the residual.

The actual computations, using either Householder or Givens transforma-


tion or Gram-Schmidt orthogonalization, are described below.

Householder and Givens transformations


As pointed out already, orthogonal transformations are isometric, i.e., they
preserve the l2 norm of vectors. Geometrically this means that they are
either rotations or reflections. Given a vector v, Householder reflections H
and Givens rotations G are designed to introduce zeros in specific positions
of a vector Hv or Gv, respectively.
DIRECT METHODS FOR FULL-RANK PROBLEMS 71

Householder reflections

Householder transformations have the form H = I − uT2 u u uT , where u is


an arbitrary nonzero vector and thus Hv ∈ span{u, v}. Geometrically, the
transformations are reflections with respect to the hyperplane with normal
u. Assume that we want to zero all elements of an m-vector v except the
first and obtain
Hv = α
e1 , where α = ± v2 .

Then u must be chosen as u = v ∓ v2  e1 . The reason for the sign


ambiguity is to avoid cancellation in the substraction; to prevent this, the
appropriate choice is sign(α) = −sign(v1 ).
The Householder matrix need not be formed explicitly, since we just
need its action on the vector v:

uT v
Hv = v − 2 u.
uT u

To store the information of a Householder transformation, only one vector


and one scalar are needed. The number of operations used when applying
it to v ∈ Rm is 4m flops. If one wants to zero all elements of an m-vector
v from and including vk+1 , one can embed an appropriate Householder
transformation Hk in a unit matrix, so that the transformation applied to
v is
Ik−1 0
Hk = .
0 Hk

This leaves the first k − 1 elements of Hk v unchanged. The submatrix


Hk is defined as before, using now the sub-vector (vk , vk+1 , . . . , vm )T of
v. A block version of the Householder transformation, suited for high-
performance computing, is described in [105], p. 213.
If instead of zeroing the elements in the columns of a matrix one wants
to zero elements in the matrix rows, the definition of the Householder trans-
formations will now involve the row elements, and the transformation will
be applied from the right.

Givens rotations

Instead of zeroing a vector of elements at once with a Householder reflection,


one can zero a single element by means of a Givens plane rotation, defined
in the 2 × 2 case by

c s
G= with c = cos(ϑ), s = sin(ϑ).
−s c
72 LEAST SQUARES DATA FITTING WITH APPLICATIONS

The product of G with a vector rotates the vector clockwise by an angle ϑ


defined by c, s. Specifically, if

a a b
v= , c= √ , s= √ ,
b a2+ b2 a2 + b2
then the application of G zeros the second component of v:

ca + sb a2 + b2
Gv = = .
−sa + cb 0

The cost of computing c and s is 5 flops. The implementation must take


into account possible under/overflow when constructing c and s (see, e.g.,
[20], p. 54, for a detailed algorithm). One can embed G into an m × m
identity matrix in order to manipulate single elements of a vector v ∈ Rm ,
⎛ ⎞
I
⎜ c s ⎟ i
⎜ ⎟
G(ij) = ⎜
⎜ I ⎟
⎟ .
⎝ −s c ⎠ j
I
The matrix G(ij) represents a clockwise rotation of angle ϑ in the plane
ei , 
span{ ej }. Given a vector v we can choose
1 1
c = vi /(vi2 + vj2 ) 2 , s = vj /(vi2 + vj2 ) 2 ,

in order to zero the jth element vj by multiplying v with G(ij) . Note that
there is no restriction on i, as long as i < j.
(ij)
The rotation matrix is never explicitly formed, and the √ storage of G
can actually be reduced to one scalar only, since s = 1 − c2 , see [20],
p. 55. As in the case of Householder transformations, one can zero elements
in rows by defining appropriate Givens rotations that will now involve only
row elements and are applied from the right.
The ability to selectively zero a single element makes the Givens rotation
better suited than Householder transformations for sparse and structured
matrices. When pre-multiplying or post-multiplying a matrix A with a
Givens rotation, only the two i, j rows or columns are affected. This al-
lows us to interchange the order of rotations and therefore to apply sets of
rotations in parallel, an additional advantage.

Fast Givens rotations


To take advantage of the flexibility of the Givens transformation, several
algorithms have been designed to reduce the number of operations; see, for
example, [115, 116]. We will describe the method fast Givens (FG), which
DIRECT METHODS FOR FULL-RANK PROBLEMS 73

eliminates the need for the square root when computing c and also halves
the number of multiplications used when applying G.
Consider a 2 × 2 Givens rotation matrix G applied to a 2 × n matrix A:

c s a11 ··· a1n


GA = .
−s c a21 ··· a2n

The main idea behind FG is the simplification of operations when G is


applied to a diagonal matrix. If K = diag(k1 , k2 ), then

c s k1 0 ck1 sk2
GK = = .
−s c 0 k2 −sk1 ck2

But this matrix can also be written as




⎪ k1 0 1 s k2

⎪ c k1
⎨ c 0 k2 − sc kk12 1
if c > s,
GK =

⎪ k1 0 k2 c
1

⎪ s k1 s otherwise.
⎩ 0 k2 1 k1 c
k2 s

So GK is actually of the form cKC or sKS, where C and S are 2×2 matrices
with only two elements different from 1. As both cases are analogous, we
will work out in detail only the case when c > s. The FG algorithm then
works with a scaling of the matrix A so that

GA = GKK −1 A ≡ GKA = cKCA .

The product CA involves only 4 flops for each column of A . It is


straightforward to verify that no square roots are needed, not even to com-
pute cK (for details, see [20], p. 57). If a sequence of FG transformations
has to be applied, the resulting matrix GA is already in the necessary scaled
form, now with the diagonal matrix cK.
There is one numerical consideration though: at each FG application,
the diagonal scaling matrix is multiplied by a factor c or s, smaller in
magnitude than 1, so there is a danger of underflow and some rescaling
may need to be applied [3]. Again, as with the previous transformations, for
applications to a general m × n matrix, the transformations are embedded
into an m × m identity matrix.

Summary of Householder and Givens transformations


First, a couple of comments: only the relevant information of the transfor-
mations needs to be stored. We display the cost of zeroing all the elements,
except the first one, of the first column of an m × n matrix. This amounts
to one Householder reflection or m − 1 Givens rotations.
74 LEAST SQUARES DATA FITTING WITH APPLICATIONS

• Storage:

– Householder reflections: one m-vector u, one scalar uT u.


– m − 1 Givens rotations: one scalar per rotation, giving a total
of m − 1 scalars.

• Work:

– Householder reflections: 2m flops and 1 square root to compute


u and zero all elements below the first entry of an m-vector.
Plus 2m(n − 1) flops for applying it to the rest of the m × n
matrix.
– m − 1 Givens rotations: 5(m − 1) flops and m − 1 square roots
to compute the rotations. Plus 6n × (m − 1) flops to apply them
to the m × n matrix.

All the transformations have very good numerical behavior, including fast
Givens. They are backward stable, i.e., the numerical application of an
orthogonal transformation is equivalent to an exact transformation of a
slightly perturbed vector, with a relative perturbation of the size of the
machine precision.

QR factorization using Householder or Givens


transformations
Using Householder or Givens transformations one can obtain a QR factor-
ization with a square m × m matrix Q. The solution steps are as follows:
1. Successive application of either n − 1 Householder reflections or (m −
1), (m − 2), . . . , (m − n) Givens rotations, to zero successive columns
of A below the diagonal:

• A(1) = A and for k = 1, . . . , n − 1:


• either A(k+1) = Hk A(k) or A(k+1) = Gk A(k) .

Here, Hk and Gk = G(k+1) · · · G(m) denote the Householder transfor-


mation and the product of Givens transformations that are necessary
to zero the elements below the diagonal in the kth column of A(k) .
Hence
• QT = Hn−1 · · · H1 or QT = Gn−1 · · · G1 .
2. The right-hand side must also be transformed, i.e., we must compute

d1 d1
• d= = Hn−1 · · · H1 b or d = = Gn−1 · · · G1 b.
d2 d2
DIRECT METHODS FOR FULL-RANK PROBLEMS 75

3. Solution of the equivalent LSQ problem and computation of the resid-


ual norm:

• x∗ = R1−1 d1 , r ∗ 2 = d2 2 .

4. Optional: Compute a condition number estimate via R1 ; see [20] p. 114


ff.
The matrix Q is stored in factorized form. To save memory, this can be
done by overwriting the matrix A with Q and R, plus using some small
additional storage only. The total amount of work is 2n2 (m − 13 n) flops
for Householder and 3n2 (m − 13 n) flops for Givens, for a full matrix. The
Givens method is more efficient in the structured sparse case, for example,
when applied to a Hessenberg matrix (an upper Hessenberg matrix has zero
entries below the first subdiagonal and a lower Hessenberg matrix has zero
entries above the first superdiagonal).
It is proved in [105] that the computed LSQ solution has an approx-
imate relative error bounded by εM [cond(A) + r ∗ 2 cond(A)2 ]. Any ill-
conditioning of A will appear as ill-conditioning of the triangular matrix
R1 of the QR factorization, because cond(R1 ) = cond(A). Therefore, both
Householder and Givens QR factorization algorithms break down in the
back-substitution if cond(A)  1/εM , as shown by the following example
computed in MATLAB with machine precision εM  2.22 · 10−16 .
Example 61. Define the 3 × 2 matrix
⎛ ⎞
1 1
A=⎝ 1 1 ⎠,
−16
1 1 − 10

with condition number cond(A) = 3.78 · 1016 . Using Householder transfor-


mations to compute the QR factorization we obtain the triangular matrix

−1.7321 −1.7321
R1 = ,
0 −9.1575 · 10−17

which is essentially singular, revealing that Ax ≈ b should be treated as a


rank-deficient problem.

Variants of Gram-Schmidt orthogonalization


Given a set of linearly independent vectors {ak }nk=1 in Rm , the classi-
cal Gram-Schmidt (GS) process generates an orthonormal set of vectors
{q k }nk=1 spanning the same subspace. The q k are generated successively
by subtracting from each ak the components in the directions of already ob-
tained orthonormal vectors. While being an important theoretical tool and
76 LEAST SQUARES DATA FITTING WITH APPLICATIONS

a computationally advantageous algorithm, because it is well adapted for


parallel computation (the main work can be performed as a matrix-vector
multiplication), numerically GS often produces a non-orthogonal set of vec-
tors due to cancellations in the substractions. A mathematically equivalent
and more robust variant of GS, although not as convenient for paralleliza-
tion, is the modified Gram-Schmidt (MGS) orthogonalization process:

Modified Gram-Schmidt
for k = 1, . . . , n
rkk = ak 2
q k = ak /rkk
for j = k + 1, . . . , n
rkj = (aTj q k )
aj = aj − rkj q k
end
end

In this algorithm the process of orthogonalizing a vector ak is refined


successively, i.e., q k is obtained by first projecting ak onto the subspace
orthogonal to q 1 , then the resulting vector is projected onto the subspace
orthogonal to span{q 1 , q 2 }, up to the projection onto the subspace orthogo-
nal to span{q 1 , q 2 , . . . , q k−1 }. This avoids an amplification of the rounding
errors affecting the orthogonality of previously computed q k , which may
happen in classical GS. This amplification may also occur in MGS, but to
a lesser extent, depending on the linear independence of the original vec-
tors, i.e., on cond(A) (details can be found in the experimental work of
[26], [136] or [208]). Björck proved that for MGS the loss of orthogonality
can be bounded by I − QT1 Q1 2 ≤ cond(A)εM , whereas there is no such
bound for the GS algorithm. In any case, if one needs a highly accurate
orthogonal set, one can always reorthogonalize with MGS, a process that
needs only to be done once, as described in [20].
The algorithm requires 2mn2 flops, as it always manipulates vectors
of size m. To save storage, A can be overwritten with Q1 = [q 1 , . . . , q n ],
but R1 needs extra storage. The process can be written in matrix form
as right multiplication of A by the square n × n upper triangular matrices
R(1) , R(2) , . . . , R(n) , such that A R(1) R(2) · · · R(n) = [q 1 , . . . , q n ], with the
R(i) matrices given by
⎛ ⎞
1
r11 − rr12
11
··· − rr1n
11
⎜ 1 ⎟
⎜ ⎟
R(1) = ⎜ .. ⎟,
⎝ . ⎠
1
DIRECT METHODS FOR FULL-RANK PROBLEMS 77
⎛ ⎞
1 ⎛ ⎞
⎜ 1
− rr23 ··· − rr2n ⎟ 1
⎜ r22 22 22 ⎟
⎜ 1 ⎟ ⎜ .. ⎟
R(2) =⎜ ⎟,...,R = ⎝ . ⎠.
⎜ .. ⎟
⎝ . ⎠ 1
rnn
1
Here the rij are the elements of the current transformed A matrix. Thus,
the MGS algorithm can be used to produce the economical QR factorization
of A = Q1 R1 with R = (R(1) R(2) · · · R(n) )−1 = (R(n) )−1 · · · (R(1) )−1 . This
factorization generates at the kth step the kth column of Q1 and the kth
row of R1 . This is another useful aspect of MGS: it produces vectors of the
new orthonormal basis from the first iteration, whereas in order to obtain
any basis vectors from the Householder or Givens transformations one has
to wait until completion, i.e., until all m columns of the square matrix Q
have been computed.
In order to apply MGS when the rank is unknown, one can improve the
basic algorithm by adding column pivoting, as will be explained in detail in
the next chapter. The following example from [20] illustrates the superiority
of MGS to classical GS.
Example 62. Consider the matrix
⎛ ⎞
1 1 1
⎜ε ⎟
A=⎜ ⎝
⎟,
⎠ ε small enough so that f (1 + ε2 ) = 1.
ε
ε

Then the Q1 matrices obtained through GS and MGS are


⎛ ⎞ ⎛ ⎞
1 1
⎜ε −α −α⎟ ⎜ ⎟
QGS = ⎜ ⎟ , Q = ⎜ε −α −β/2⎟ ,
⎝ α ⎠ ⎝ α −β/2⎠
α β

with α = 2−1/2 , β = ( 32 )1/2 . The maximum deviation from orthogonality


&
for GS is |q T2 q 3 | = 1/2 and for MGS |q T1 q 3 | = 2/3 ε, where could be as
√ −8
large as M ≈ 10 .

QR factorization using modified Gram-Schmidt


In principle, once the economical QR factorization A = Q1 R1 is computed
using MGS, one could obtain the solution x∗ of the least squares problem
by solving the triangular system R1 x = QT1 b. However, this may not give an
accurate solution for an ill-conditioned matrix due to the potential loss of
orthogonality in the MGS process, as shown above. A reorthogonalization
78 LEAST SQUARES DATA FITTING WITH APPLICATIONS

would give an accurate solution while duplicating the cost. A more stable
algorithm is obtained if one instead applies MGS to the bordered matrix:

R1 z
( A b ) = ( Q1 q n+1 ) .
0

Then we have
 
 x   
A x − b2 = 
( A b ) −1
 = Q1 (R1 x − z) − q n+1  .
 2
2

Since q n+1 is orthogonal to Q1 , because it was produced by the MGS


process, the term q n+1 is not taken into consideration in the minimization
process and the LSQ solution is therefore:

x∗ = R1−1 z, r ∗ 2 =  q n+1 2 = | |.

For more details see [16, 20, 105]. Note that it is not necessary to assume
that Q1 is perfectly orthogonal.
MGS has been proved to be backward stable by interpreting the MGS
decomposition as the Householder QR factorization applied to the matrix
A augmented with a square n × n matrix of zeros on top:

0
.
A

For details see [26].


One final point to stress is, quoting from Peters and Wilkinson (see [20],
p. 66): “Evidence is accumulating that the modified Gram-Schmidt method
(applied to the augmented system) gives better results than Householder.”
Experimental evidence shows that the LSQ solutions obtained via MGS
have roughly 15% more correct figures. The reason seems to be related to
the comparative invariance of MGS to row permutations. In fact, Björck
and Paige [26] state: “If the matrix is not well row scaled, row interchanges
may be needed to give an accurate solution for the least squares problem.
In this context, it is interesting to note that MGS is numerically invariant
under row permutations of A. However if row interchanges are used also
in QR and the row norms of A vary widely, QR is less expensive because
MGS would need reorthogonalization.”

Example 63. Figure 4.3.1 compares the accuracy of most of the LSQ algo-
rithms covered so far, namely, normal equations, LU factorization (Peters
and Wilkinson algorithm), Householder QR factorization, MGS and MGS
applied to the bordered matrix (A, b). The test matrices are the same as in
DIRECT METHODS FOR FULL-RANK PROBLEMS 79

Figure 4.3.1: The relative errors of LSQ solutions as functions of cond(A)


computed by means of the following methods: N. eq. = normal equations,
MGS = modified Gram-Schmidt, LU = LU factorization, QR = House-
holder QR factorization and MGS-b = MGS applied to the bordered matrix
( A b ). We show errors for problems with zero/negligible residual (top)
and a larger residual (bottom). The last three methods are always supe-
rior to the first two, and notice in the bottom plot that the errors change
from being proportional to cond(A) for well-conditioned problems to being
proportional to cond(A)2 for ill-conditioned ones. The top plot is for an
LSQ problem with zero residual, while the bottom plot is for a problem with
nonzero residual.
80 LEAST SQUARES DATA FITTING WITH APPLICATIONS

Example 58, and we show the relative errors as functions of cond(A). This
example shows that the errors from the normal equations and the “naive”
MGS approach are proportional to cond(A)2 independently of the size of
the residual.
The errors from the other three approaches, on the other hand, depend
on the size of the residual. For problems with a small/negligible residual,
the errors are proportional to cond(A), as seen in the top plot. For prob-
lems with a larger residual, the behavior of the errors depend on the size of
cond(A), as seen in the bottom plot. Notice the bend of the error curve at
cond(A) ≈ 103 . For well-conditioned problems, the errors are proportional
to cond(A), while they are proportional to cond(A)2 for ill-conditioned prob-
lems, as predicted by the theory. This correctly matches the sensitivity of
least squares solutions to perturbations in the data and confirms the stability
of the algorithms.

The above example illustrates that LU and QR factorization and MGS


applied to the bordered matrix are preferable when accuracy matters. The
example also illustrates that the error analysis from Section 3.5 carries
over to the influence of rounding errors, meaning that for ill-conditioned
problems with large residual we see a dependence on cond(A)2 for all three
accurate methods.

4.4 Modifying least squares problems


In many applications, a number of closely related least squares problems
have to be solved. For example, there might be some errors in the matrix
and therefore one might want to change some of its elements. In time series
analysis, such as in the neural network application of Section 12.1, a data
window slides in time, deleting old data and adding new data, i.e., chang-
ing the rows of the data matrix (down- and up-dating). Or in regression
analysis, one checks if a variable is representative by adding or eliminating
it from the model used, i.e., changing columns. Additional uses appear
in some of the constrained LSQ algorithms. Therefore, we will consider
how a matrix factorization is changed under two possible types of modifica-
tions: adding/deleting rows (i.e., data) and adding/deleting columns (i.e.,
variables).
An important point in the design of modification procedures is the effi-
ciency of the algorithms, because often the process has to be done in real
time. In addition, stability is essential because, for efficiency the computa-
tions are often performed in single precision.
One of the first considerations when modifying a least squares problem
with full-rank matrix A is to check if the modified matrix A remains of full
rank. Based on Theorem 4 from Appendix B about interlacing singular
DIRECT METHODS FOR FULL-RANK PROBLEMS 81

add delete
column A may be singular A is full-rank
row A is full-rank A may be singular

Table 4.1: Consequences of modifying a matrix A to a new matrix A by


adding or deleting a row or column.

values, the consequences of adding or deleting rows or columns of A are


summarized in Table 4.1.
Although updating algorithms have been designed for a number of ma-
trix factorizations like LU, Cholesky and complete orthogonal decomposi-
tions, we will not consider them here, but refer to [20, 87] for an exhaustive
treatment and rather concentrate on the less expensive algorithms, i.e., de-
scribe how to modify the normal equations and the QR factorization of a
full-rank matrix.

Normal equations (recursive least squares)


The algorithm known as recursive least squares is used in many applications
(signal analysis, for example), due to its simplicity and recursive nature.
It is also known as the covariance matrix method
−1 because it modifies the
inverse of the normal equations matrix AT A , which is the unscaled
covariance matrix. Unfortunately, it is very sensitive to rounding errors.
Given the normal equations formulation of the least squares problem,
assume that a new observation wT x = β is added. The updated solution
x̄∗ satisfies the normal equations system,
 T 
A A + w wT x̄∗ = AT b + βw.
This solution can be obtained via the Sherman-Morrison-Woodbury for-
mula (see B.4) and the inverse of AT A,
β − w T x∗  −1
x̄∗ = x∗ + u, u = AT A w.
1 + wT u
The inverse is really not necessary since u can be computed via the Cholesky
factor of AT A. If, instead, an observation is deleted, then the updated
solution x̄ satisfies the system
 T 
A A − w wT x̄∗ = AT b − βw
and the updated solution is
β − w T x∗
x̄∗ = x∗ − u,
1 − wT u
with the same u as before. It is of course assumed in both cases that the
updated least squares problem continues to have a full-rank matrix.
82 LEAST SQUARES DATA FITTING WITH APPLICATIONS

Full QR decompositions
Even without pivoting, the QR factorization algorithm is stable, and it is
used extensively in applications that require updating. For simplicity we
will consider changes that involve adding/deleting only one column or row,
but the algorithms given can easily be modified for changes involving blocks
of rows or columns [70]. A good reference for the material in this subsection
is chapter 12 of [105]. We note that the updating algorithms can also be
used in connection with QR factorization of the bordered matrix ( A b );
cf. Theorem 56.
As seen from Table 4.1, the modified matrix may have a different rank
and in particular it may become rank deficient. Because using the SVD and
then updating is too expensive (the amount of work involved being com-
parable to an SVD decomposition from scratch), QR-based decompositions
such as ULV and URV are used for this purpose. See [20] and [78] for more
details.
For the row modification algorithms we will assume, for simplicity, that
the first row is involved. When instead a row is added/deleted in an arbi-
trary position, the procedure is similar but involves some permutations.

Adding a row. Assume that, starting with A = Q R, we wish to obtain


wT
the QR factorization of A = . Note that
A

1 0 wT
Ā = = H,
0 QT R

with H an (m + 1) × n upper Hessenberg matrix. Using Givens rotations


G1 , . . . , Gn to zero the elements of R from r11 onward until rnn will trans-
form H to an upper triangular matrix R. The decomposition of A is

1 0
A= GT1 · · · GTn R.
0 QT

The number of operations involved in the update is 3n2 +O(n) flops.

Deleting a row. Deleting a row is somewhat more involved. Given an


m × n matrix A = QR, one wants the decomposition of A, the lower part
of A:
zT
A=
A

The algorithm has the following steps:


DIRECT METHODS FOR FULL-RANK PROBLEMS 83

• Use m − 1 Givens rotations G1 , . . . , Gm−1 such that when applied to


the first row q T of Q one obtains GT1 ...GTm−1 q = ±
e1 . Then,
±1 0
Q Gm−1 ...G1 = ,
0 Q
with Q orthogonal and dimensions (m − 1) × (m − 1).
• When the Givens rotations are applied to R one obtains an upper
Hessenberg matrix
vT
H = GT1 · · · GTm−1 R =
R
and
zT
A = = (QGm−1 · · · G1 )(GT1 · · · GTm−1 R).

±1 0 vT
=
0 Q̄ R
Therefore A = Q R.
Here the matrix Q is needed to determine the triangular matrix R. The
number of flops is O(m2 ), so if n is small this procedure is not worthwhile,
and instead it is better to compute the QR factorization from scratch.

Adding a column. Starting from the initial factorization

A = ( a1 , . . . , an ) = Q R,
assume that a new column z is added,

A = ( a1 , . . . , ak , z, ak+1 , . . . , an ).
The following two steps are needed:
• First compute the vector w = QT z. In QT Ā, one has then a matrix
that is essentially R, except for the full vector w after column k:
 
QT Ā = QT a1 , ..., QT ak , w, QT ak+1 , ..., QT an ≡ B,
where the matrix B has the form
⎛ ⎞
x x x x x x
⎜ 0 x x x x x ⎟
⎜ ⎟
⎜ 0 0 x x x x ⎟
⎜ ⎟
B=⎜ ⎜ 0 0 0 x x x ⎟
⎟.
⎜ 0 0 0 x 0 x ⎟
⎜ ⎟
⎝ 0 0 0 x 0 0 ⎠
0 0 0 x 0 0
84 LEAST SQUARES DATA FITTING WITH APPLICATIONS

• Now apply Givens rotations, Gm−1 , . . . , Gk+1 to obtain an upper tri-


angular structure. Note though, that in the process of zeroing ele-
ments wk+2 , . . . , wn some fill-in will occur in QT ak+1 , . . . , QT an , but
without marring the triangular structure. For the above B the steps
are
⎛ ⎞ ⎛ ⎞
x x x x x x x x x x x x
⎜ 0 x x x x x ⎟ ⎜ 0 x x x x x ⎟
⎜ ⎟ ⎜ ⎟
⎜ 0 0 x x x x ⎟ ⎜ 0 0 x x x x ⎟
⎜ ⎟ ⎜ ⎟
⎜ 0 0 0 x x x ⎟→⎜ 0 0 0 x x x ⎟→
⎜ ⎟ ⎜ ⎟
⎜ 0 0 0 x 0 x ⎟ ⎜ 0 0 0 x 0 x ⎟
⎜ ⎟ ⎜ ⎟
⎝ 0 0 0 x 0 0 ⎠ ⎝ 0 0 0 x 0 0 ⎠
0 0 0 x 0 0 0 0 0 0 0 0
⎛ ⎞ ⎛ ⎞
x x x x x x x x x x x x
⎜ 0 x x x x x ⎟ ⎜ 0 x x x x x ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟
0 0 x x x x ⎟ ⎜ 0 0 x x x x ⎟
⎜ ⎜ ⎟
⎜ 0 0 0 x x x ⎟ ⎜ ⎟
⎜ ⎟ → ⎜ 0 0 0 x x x ⎟.
⎜ ⎟
0 0 0 x 0 x ⎟ ⎜ ⎟
⎜ ⎜ 0 0 0 0 x x ⎟
⎝ 0 0 0 0 0 x ⎠ ⎝ 0 0 0 0 0 x ⎠
0 0 0 0 0 0 0 0 0 0 0 0

Again, the original matrix Q is needed to compute the new triangular ma-
trix. The work is done in O(mn) flops.

Deleting a column.
We start with the decomposition of A = ( a1 , . . . , an ) = QR with R =
( r 1 , . . . , r n ). Assume that the kth column of A is deleted,

A = ( a1 , . . . , ak−1 , ak+1 , . . . , an ).

It is easy to prove that

Ā = Q ( r 1 , . . . , r k−1 , r k+1, r n ) ≡ Q H,

where H is R with the kth column removed; thus it is triangular up to


column k − 1 and has a Hessenberg structure from there on, e.g., for m = 7,
n = 5, k = 3, ⎛ ⎞
x x x x
⎜ 0 x x x ⎟
⎜ ⎟
⎜ 0 0 x x ⎟
⎜ ⎟
H=⎜ ⎜ 0 0 x x ⎟.

⎜ 0 0 0 x ⎟
⎜ ⎟
⎝ 0 0 0 0 ⎠
0 0 0 0
DIRECT METHODS FOR FULL-RANK PROBLEMS 85

To obtain a QR factorization of A, a single Givens transformation has to be


applied to each of the columns of H from h̄k+1 , . . . , h̄n , to zero the elements
h̄k+1,k , . . . , h̄n,n−1 , i.e., R = GTn−1 · · · GTk H ⇔ H = Gk · · · Gn−1 R. The
new QR factorization is therefore

Ā = (Q Gk · · · Gn−1 ) R ≡ Q R.

The work required is O(n2 ) flops. Note that we do not make use of the
matrix Q to compute the triangular factor R.

4.5 Iterative refinement


The LSQ solution computed via the normal equations or the QR factoriza-
tion can be improved by iterative procedures that use the already available
factorization, although they need some additional storage. One should be
careful though if the problem is too ill-conditioned, as it does not make
sense to compute a very accurate solution to such a problem when the data
have comparatively large errors. There are several techniques that are in
use, differing in convergence rates and applicability. The first that we will
describe is the natural extension to least squares problems of the iterative
refinement method for linear systems.

Iterative refinement as extension of the linear system


case
Given an initial approximation x∗ to the LSQ solution, with residual r ∗ =
b − A x∗ , the error Δx is defined as x∗ + Δx = x∗ and it follows that
   
min b − A x2 = min b − A (x∗ + Δx)2 = min r ∗ − A Δx2 .
x Δx Δx

Hence, the correction Δx, in theory, can be obtained from solving a new
LSQ problem with the original matrix A. The correcting step may have to
be repeated if the new approximate solution is still not satisfactory. The
residual should be computed in extended precision and then rounded to
standard (fixed) precision.
This algorithm, implemented using the QR factorization, was analyzed
in [107], where it is shown that the procedure for solution refinement is
adequate only if the LSQ problem is nearly consistent.

Iterative refinement using the augmented system


A second, more robust and successful iterative refinement algorithm, using
extended precision where necessary, corrects not only the solution but also
the residual [17, 18]. The appropriate formulation of this least squares
86 LEAST SQUARES DATA FITTING WITH APPLICATIONS

problem is the augmented system (2.1.4). The errors Δx = x∗ − x∗ and


Δr = r ∗ − r ∗ satisfy

I A r ∗ + Δr b
= ,
AT 0 x∗ + Δx 0

leading to the solution of the system

I A Δr b I A r∗ f
= − ≡
AT 0 Δx 0 AT 0 x∗ g
to obtain the corrections. The algorithm steps are
r0 r∗
• Choose as initial values = .
x0 x∗
• For k = 0, 1, . . . until convergence, compute the “residuals” for the
augmented system using extended precision for the inner products,

f b I A rk
= − .
g 0 AT 0 xk

• Using the economical QR factorization of A, multiply the first equa-


tion by QT1 and use AT = R1T QT1 to obtain the system

I R1 QT1 Δr QT1 f
= .
R1T 0 Δx g

Solve this system in standard precision.


• Compute Δr = Q1 (QT1 Δr) and correct the solution and the residual:

r k+1 rk Δr
= + .
xk+1 xk Δx

Each iteration involves the solution of two triangular systems and a pre-
multiplication by Q1 QT1 to obtain Δr, for a total of 8mn − 2n2 flops in
standard precision and 4mn in extended precision per iteration. The con-
vergence rate here is c(m, n) cond(A)εM . In fact, as Björck points out ([20],
pp. 123): “Hence, in a sense, iterative refinement is even more satisfactory
for large residual least squares problems and may give solutions to full sin-
gle precision accuracy, even when the initial solution may have no correct
significant figures.” If one is working with a t-digit binary floating-point
number system, then on a problem with condition number cond(A)  2q ,
with q < t, after k iterations we can obtain k(t−q) binary digits of accuracy
in the LSQ solution. The algorithm works as well if modified Gram-Schmidt
is used. Open source software can be found in [64].
DIRECT METHODS FOR FULL-RANK PROBLEMS 87

Fixed precision iterative refinement and corrected semi-


normal equations
The last iterative refinement algorithm that we will discuss is a fixed preci-
sion iterative refinement method, applied to the so-called seminormal equa-
tions
RT R x = AT b,

where R is the Cholesky factor of AT A.

Fixed precision iterative refinement


x_0 = x̄   (the computed solution)
for k = 0, 1, . . . until convergence
    r_k = b − A x_k
    solve R^T (R Δx_k) = A^T r_k
    x_{k+1} = x_k + Δx_k
end

The cost for each step is roughly 4n2 flops (solving triangular systems),
plus 4mn flops for the two matrix-vector products. Note that the matrix
A is needed.
Fixed precision refinement can be used to correct the solution obtained
via the normal equations when R is computed from the Cholesky factor-
ization of AT A, although then only a convergence rate of O(cond(A)2 )
can be obtained. For ill-conditioned problems, several iteration steps may
be necessary to achieve good accuracy. See section 6.6.5 in [20] for more
details.
Another important application of the algorithm is in connection with
sparse least squares problems, where it is unfeasible to store the (often quite
dense) orthogonal factor Q or Q1 of the QR factorization of A. To solve
a given LSQ problem one only needs to compute and store the triangular
factor R1 and apply the orthogonal transformations “on the fly” to the right-
hand side to obtain QT1 b, and then x∗ is obtained via back-substitution.
But in order to solve a new system with the same A and a new right-
hand side, as is required for iterative refinement, Q1 would be needed. The
immediate solution is to recall that R1 is a Cholesky factor of the normal-
equation matrix AT A, and the seminormal equations could therefore be
used. One step of the above iterative refinement algorithm produces an
improved solution with an error proportional to cond(A). This technique,
called corrected seminormal equations (CSNE), was derived and analyzed
by Björck [19].
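
As a concrete, deliberately simplified illustration, the NumPy sketch below
solves the seminormal equations with only the triangular factor R of a QR
factorization and then applies fixed precision refinement steps in the spirit
of CSNE; the function name and the fixed number of refinement steps are our
own illustrative choices.

import numpy as np

def csne_solve(A, b, refine_steps=1):
    # Solve min ||b - A x||_2 via the seminormal equations R^T R x = A^T b,
    # followed by fixed precision refinement (CSNE).  Only the triangular
    # factor R of A = Q R is stored; Q is never formed.
    R = np.linalg.qr(A, mode='r')            # n x n upper triangular factor
    def semi_normal(rhs):
        # Solve R^T (R z) = rhs by two triangular solves.
        y = np.linalg.solve(R.T, rhs)
        return np.linalg.solve(R, y)
    x = semi_normal(A.T @ b)                 # seminormal-equation solution
    for _ in range(refine_steps):
        r = b - A @ x                        # residual uses the original A
        x = x + semi_normal(A.T @ r)         # one corrected (CSNE) step
    return x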

4.6 Stability and condition number estimation


In order to obtain accuracy bounds for the error in a computed approximate
LSQ solution x̄∗ to a well-conditioned least squares problem, it is useful to
prove the backward stability of the algorithm used (see Appendix A). This
implies that x̄∗ can be expressed as the exact solution of a perturbed least
squares problem:
\[
  \min_x \| (A + \Delta A)\, x - (b + \Delta b) \|_2 ,
\]
where ‖ΔA‖_2 and ‖Δb‖_2 are of the order of the machine precision ε_M.
In other words, the computed solution “almost” satisfies the least squares
minimization criterion min_x ‖A x − b‖_2.
Given the backward estimate one can then, in principle, estimate the
actual error of the computed solution x̄∗ in terms of the LSQ condition
number cond(A, b):
\[
  \frac{\| \bar{x}^* - x^* \|_2}{\| x^* \|_2} = O(\mathrm{cond}(A, b)\, \varepsilon_M).
\]
Hence, a small backward error only implies a computed solution x̄∗ close
to the exact one x∗ if the condition number is of reasonable size. The
backward stability of the algorithms for full-rank matrices described in this
chapter has been studied extensively.
Methods that explicitly form the normal equations cannot be backward
stable, because the round-off errors occurring while forming the matrix
products are not, in general, perturbations of size εM relative to A and
b. On the other hand, the methods based on the QR factorization are
backward stable, including the modified Gram-Schmidt method, provided
that the product QT b is computed implicitly via the augmented matrix
( A b ).
There are many specialized methods that apply to structured LSQ prob-
lems, such as those with a Toeplitz matrix, which are not – or are not known
to be – backward stable. Given a computed x∗ , it is possible to verify nu-
merically whether an algorithm is stable by computing an estimate for the
smallest backward error. In fact, a theorem proved in [249] gives an ex-
pression for the optimal backward error in the Frobenius norm, once an
approximate least squares solution has been computed. For practical pur-
poses, less expensive bounds are presented in [21], p. 6. One can be satisfied
with the computed solution if the backward errors ΔA and Δb are small in
comparison to the uncertainties in the data.
As mentioned in Section 3.5 on perturbation analysis, an important
factor in the behavior of the solution to the least squares problem is the
matrix condition number cond(A, b). It is therefore useful to obtain an
estimate for this number, if possible as a by-product of the computations

Method              Flops                    Sensitivity    Backwd stab

Normal equations    mn² + n³/3               cond(A)²       No
LU Cholesky         2mn² − 2n³/3             cond(L)²       No
LU Cline            3mn² − 7n³/3             cond(A)        Yes
QR Householder      2mn² − 2n³/3             cond(A, b)     Yes
QR Givens           3mn² − n³                cond(A, b)     Yes
QR fast Givens      2mn² − 2n³/3             cond(A, b)     Yes
QR MGS on (A, b)    2mn²                     cond(A, b)     Yes

Table 4.2: Comparison of the different computational methods for solving
the LSQ problem, from [105].

needed in the least squares solution process. This is obviously possible


in the specific case of the SVD algorithm, since the singular values are
computed explicitly. If the QR factorization is used and the matrix has full
rank, there are two options, both exploiting the triangular form of R1 .
One method, applicable if the QR factorization algorithm with column
pivoting is used, is based on the fact that there is a lower estimate for
cond(A) from the inequalities

\[
  |r_{11}| \leq \sigma_1 = \| R_1 \|_2 \leq n^{1/2} |r_{11}|
  \qquad \text{and} \qquad
  \sigma_n^{-1} = \| R_1^{-1} \|_2 \geq |r_{nn}^{-1}| .
\]

Using these results one obtains a lower bound for the condition as
\[
  |r_{11}| / |r_{nn}| \leq \mathrm{cond}(A).
\]

This may be a severe underestimate, though. One can improve it by per-


forming inverse iteration with R1T R1 to obtain a good estimate for the
dominant eigenvalue of (R1T R1 )−1 and thus for 1/σn . This can even be
used if the matrix is nearly rank deficient.
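
Both ideas can be sketched in a few lines of NumPy/SciPy for a computed
triangular factor R_1 from a column-pivoted QR factorization; the number of
inverse iteration steps and the use of the exact spectral norm of R_1 for σ_1
are illustrative simplifications of ours, not prescriptions from the text.

import numpy as np
from scipy.linalg import solve_triangular

def cond_estimates_from_R(R, inv_iter_steps=3, seed=0):
    # Condition estimates for A from the triangular factor R1 of a
    # column-pivoted QR factorization A Pi = Q1 R1.
    n = R.shape[1]
    lower_bound = abs(R[0, 0]) / abs(R[n - 1, n - 1])    # |r11| / |rnn|
    # Inverse iteration with R1^T R1: z converges towards the right singular
    # vector belonging to the smallest singular value sigma_n.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    z /= np.linalg.norm(z)
    for _ in range(inv_iter_steps):
        y = solve_triangular(R, z, trans='T')    # solve R1^T y = z
        w = solve_triangular(R, y)               # solve R1 w = y
        z = w / np.linalg.norm(w)
    sigma_n_est = np.linalg.norm(R @ z)          # approximates sigma_n
    sigma_1 = np.linalg.norm(R, 2)               # exact sigma_1, for the sketch
    return lower_bound, sigma_1 / sigma_n_est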
Another condition estimator, first designed by Hager, computes the con-
dition number of a triangular matrix in the L1 or L∞ norm, using optimiza-
tion techniques. A bound for an estimate of the spectral condition number
can then be obtained from
\[
  \| A \|_2 \leq \left( \| A \|_1 \, \| A \|_\infty \right)^{1/2} .
\]

Details can be found in [20] and the references cited there.

4.7 Comparison of the methods


Table 4.2 compiles a list of properties for the computational methods dis-
cussed in this section. The second column shows the dominant term in the
flop counts, i.e., the terms O(mn) and O(n²) are not included. The third
column shows the sensitivity to rounding errors, and the fourth column
shows whether the algorithm is backward stable. All the methods are stor-
age efficient, since they can essentially overwrite the matrix A. If iterative
refinement is to be used, however, then A must be saved.

• Normal equations. In spite of the squared condition number, the


normal equations should not be discarded out of hand because they
require less storage and only about half the number of operations
compared to the QR factorization methods. The method breaks down
for cond(A) ≈ ε_M^{−1/2} or greater.

• LU factorization. The Peters-Wilkinson LU factorization method


is a good choice if m ≫ n. This method can also be used to pro-
duce a better conditioned problem, sometimes just by using a partial
factorization.

• QR factorization. The QR factorization methods are the most


robust. QR using Householder- or Givens-type orthogonalizations
only breaks down when cond(A) ≈ ε_M^{−1}. The QR (MGS) method
produces the economical QR decomposition, whereas the Householder
and Givens forms produce the full version. Givens transformations
are well suited for sparse problems.
• Iterative refinement. Although one can apply iterative refinement
to both normal equations and the QR method, the rate of convergence
for the former involves O(cond(A)²), while for the latter it involves only
O(cond(A)).
Some final comments regarding the sensitivity to rounding errors of the
methods discussed so far, for the full-rank case: if cond(A) is large and the
residual norm ‖r∗‖_2 is small, i.e., the equations are almost consistent, then
QR (Householder or Givens) is more accurate than the normal equations.
If cond(A) and the residual norm are both large, then all methods are
inaccurate!
Chapter 5

Direct Methods for Rank-Deficient Problems

The methods developed so far break down when the matrix A is rank
deficient, i.e., for rank(A) = r < n, in which case there is no unique solution.
Within the linear manifold of solutions, the minimum and basic solutions
are specially useful because:
• If we assume that A is derived from noisy data, then the minimum-norm
  solution
  \[
    \hat{x}^* = \sum_{k=1}^{r} \frac{u_k^T b}{\sigma_k}\, v_k
  \]
  will filter the small singular values corresponding to noise (see Theorem 44).
• If the rank deficiency of A is caused by a redundancy in the terms of
the model, a basic solution xB will choose only linearly independent
columns of A, i.e., this approach provides subset selection.
• Also, there is currently a strong interest in sparse solutions (com-
pressed sensing) and the basic solution can provide a starting point
when seeking even sparser solutions.
• There are other applications where the minimum norm solution is
of importance, such as in solving nonlinear problems by successive
linearizations, where we want to keep the size of the corrections as
small as possible.
This chapter gives a short introduction to numerical methods designed
specifically for rank-deficient problems. They are also appropriate when, as
is often the case, there is no a priori knowledge of the rank. We start with
an analysis of the rank issue, then describe some factorization algorithms
that can handle rank deficiency: LU and QR with column permutations and
complete orthogonal factorization methods, including the URV, ULV and


VSV decompositions. Finally, we introduce bidiagonal reductions as part


of the SVD algorithm. Some of the algorithms are more appropriate for
computing basic solutions, others for computing minimum-norm solutions.

5.1 Numerical rank


Knowledge of the rank of A is useful when computing a solution to an
LSQ problem. However, in the presence of limited accuracy arithmetic it
is not enough to know the mathematical rank of A. In fact, an m × n
matrix with m ≥ n may have full rank mathematically – meaning that
all n singular values are positive – but if some of them are very small, for
example σn  εM , then, for all computational purposes, the matrix is rank
deficient. Or, vice versa, a matrix with mathematical rank r ≤ min(m, n),
whose elements have been perturbed by a small amount, could change rank.
We therefore introduce the concept of numerical rank, depending on a given
error level ε.

Definition 64. The numerical ε-rank rε of a matrix A is the minimum


rank of all possible norm-ε perturbations of A:

r_ε ≡ min_{‖E‖_2 ≤ ε} rank(A + E).    (5.1.1)

Expressed in another way, rε is the number of columns of A that remain


linearly independent, assuming a given error level ε for perturbations of the
matrix.
A quantitative definition can be obtained using the singular value de-
composition. In fact, using the bounds obtained in Theorem 41 for the
singular values of a perturbation of A, one can see that the singular values
σi (A + E) of the perturbed matrix can be considered numerically zero if
σi (A + E) ≤ ε. This justifies the following determination of the numerical
rank. The index rε is the numerical ε-rank if

σ_1 ≥ σ_2 ≥ . . . ≥ σ_{r_ε} > ε ≥ σ_{r_ε+1} ≥ . . . ≥ 0.    (5.1.2)

In other words, given the tolerance ε, the singular values from σrε +1 on-
ward are indistinguishable from zero. This criterion for rank determination
should be used only if there is a clear gap between the consecutive singular
values σrε and σrε +1 , because the number rε must be robust with respect
to perturbations of the singular values and the threshold.
Often, one considers instead the following criterion that uses the nor-
malized singular values and therefore is scale free:
\[
  \frac{\sigma_{r_\varepsilon+1}}{\sigma_1} \leq \tau < \frac{\sigma_{r_\varepsilon}}{\sigma_1} .
\]

This is equivalent to the preceding inequality (5.1.2) with τ = ε/σ1 and it is


more directly related to the condition number of the matrix. In particular,
we say that a matrix is numerically rank deficient if cond(A) > ε_M^{−1}, where
εM is the machine precision.
The perturbation E of the matrix A is in general not known; it may
be due to errors in the underlying model, or due to data errors, so that
the elements eij have some statistical distribution (for details see [20] and
[105]), or the error can be due to the numerical computations. The value
for the tolerance ε should then be chosen appropriately: if the error is due
to round-off we can choose ε ≈ ε_M ‖A‖_∞; if the error in the model or in
the data is larger than the round-off, then choose ε ≈ 10^{−p} ‖A‖_∞, when the
elements of A have p correct decimal digits. It is important to stress that
all the previous choices are only valid if the errors in all the elements of A
are roughly of the same size, see [230].
Golub and Van Loan [105], p. 262, give two options to estimate the
numerical rank if there is no clear gap in the singular values, both based
on a minimization process: one option is when an accurate solution of the
LSQ problem is important, and the other is used when minimizing the size
of the residual is more important.

Example 65. Consider the n × n matrix
\[
  \begin{pmatrix}
  0.1 & 1   &        &        &     \\
      & 0.1 & 1      &        &     \\
      &     & \ddots & \ddots &     \\
      &     &        & 0.1    & 1   \\
      &     &        &        & 0.1
  \end{pmatrix} .
\]

The singular values of A in the case n = 6 are

σ1 = 1.088, σ2 = 1.055, σ3 = 1.007,

σ4 = 0.955, σ5 = 0.915, σ6 = 9.9 · 10−7 ,


i.e., the numerical rank with respect to any threshold ε between 10^{−1} and
10^{−6} is r_ε(A) = n − 1, whereas the mathematical rank is n. There is a
distinct gap in the singular values.
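
The numerical rank in this example is easy to check directly from the singular
values; the small NumPy sketch below builds the n = 6 matrix and counts the
singular values above a threshold (the tolerance 1e-3 is simply an arbitrary
value inside the gap).

import numpy as np

def numerical_rank(A, tol):
    # Numerical epsilon-rank: number of singular values larger than tol.
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > tol)), s

n = 6
A = 0.1 * np.eye(n) + np.diag(np.ones(n - 1), k=1)   # matrix of Example 65
r_eps, s = numerical_rank(A, tol=1e-3)
print(s)        # the last singular value is of order 1e-6, the others are O(1)
print(r_eps)    # numerical rank n - 1 = 5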

5.2 Peters-Wilkinson LU factorization


The LU factorization method described in Section 4.2, can be modified
to compute the pseudoinverse of a rank-deficient matrix A ∈ Rm×n with
rank(A) = r < min(m, n) and from it the minimum-norm solution of the
least squares problem can be calculated. We recommend that this method
be used with caution, since rank determination in an LU factorization can


be risky because the singular values of A change during the elimination
process; see [162] for recent results on rank-revealing LU factorizations.
The algorithm uses Gaussian elimination with complete pivoting includ-
ing linear independency checks. In the present case, the process will stop
when there is no pivot larger than a specified tolerance. The factorization
is then Π1 A Π2 = LDU. Now, both L and U have trapezoidal form of width
r,
\[
  L = \begin{pmatrix} L_1 \\ L_2 \end{pmatrix} , \qquad U = ( U_1 , U_2 ),
\]
where L1 and U1 are nonsingular r × r matrices, lower and upper triangu-
lar, respectively, with unit diagonal, and D is diagonal as in Section 4.2.
Complete pivoting ensures that L and U will be well conditioned and that
D reflects cond(A).
A more convenient form of Π1 A Π2 is

\[
  \Pi_1 A\, \Pi_2 = \begin{pmatrix} I_r \\ T \end{pmatrix} L_1 D\, U_1\, ( I_r , S ),
\]

where T = L_2 L_1^{−1} and S = U_1^{−1} U_2 are obtained using back-substitution on
T L_1 = L_2 and U_1 S = U_2. The minimum-norm solution is then
\[
  \hat{x}^* = A^\dagger b = \Pi_2\, ( I_r , S )^\dagger\, U_1^{-1} D^{-1} L_1^{-1}
  \begin{pmatrix} I_r \\ T \end{pmatrix}^{\!\dagger} \Pi_1 b .
\]

In [22] there is also a modification of the Peters-Wilkinson algorithm for


sparse problems. The package LUSOL [90] implements sparse LDU factor-
ization with options for complete and rook pivoting, which also has rank-
revealing properties while giving sparse factors. Rook pivoting is a strategy
intermediate between full and row pivoting. At each step of elimination the
pivot is chosen as one that is the largest in the row and column to which it
belongs. The experimental performance of rook pivoting makes it compa-
rable to full pivoting in precision and to partial pivoting in computational
efficiency.

5.3 QR factorization with column permutations


The methods described in this section are particularly convenient for subset
selection when a basic solution is wanted. Given A and b we want to find a
permutation Π such that the first r = rank(A) columns of A Π are linearly
independent and also find a vector z ∈ Rr that solves

\[
  \min_z \| b - A_r z \|_2 , \qquad A_r = A\, \Pi \begin{pmatrix} I_r \\ 0 \end{pmatrix} .
\]

For a QR factorization with column permutations, orthogonal transfor-


mations such as Householder reflections are applied to zero out successive
sub-columns, interlaced with appropriate column permutations, to place
the most linearly independent columns in front. At step k (after the trans-
formation), the matrix has the form
\[
  R^{(k)} = Q_k^T A\, \Pi_k =
  \begin{pmatrix} R_{11}^{(k)} & R_{12}^{(k)} \\ 0 & A_{22}^{(k)} \end{pmatrix}
  \begin{matrix} k \\ m-k \end{matrix} ,
\]

with QTk = Hk · · · H1 and where Πk is the product of the permutation


matrices used so far. In theory, one should be able to obtain an upper
trapezoidal structured matrix after k = rank(A) steps, while in practice –
due to finite precision arithmetic – we must expect that A_{22}^{(k)} ≠ 0 for all k.
At each step k there are two important questions:
1. How to select the most linearly independent columns of A_{22}^{(k)}, i.e.,
what permutations should be performed.
2. How to decide when to stop the process because the numerical rank
has been attained and R_{11}^{(k)} contains the linearly independent infor-
mation of A.
To formalize the second point we introduce the concept of a rank-revealing
QR (RRQR) factorization, following [40]. Intuitively this is the QR factor-
ization of a permutation of A chosen to yield a “small” A_{22}^{(r)}, where r is the
numerical rank.
Definition 66. Rank-revealing QR (RRQR) factorization. Let A ∈
Rm×n have singular values σ1 ≥ σ2 ≥ ... ≥ σn ≥ 0, with a well-defined
numerical rank r, i.e., with a clear gap between σr and σr+1 . The factor-
ization
\[
  A\,\Pi = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix} , \qquad (5.3.1)
\]
with R_{11} upper triangular of order r, is an RRQR factorization of A if
\[
  \mathrm{cond}(R_{11}) \approx \frac{\sigma_1}{\sigma_r}
  \qquad \text{and} \qquad
  \| R_{22} \| \approx \sigma_{r+1} .
\]
σr
This means that R11 contains the linearly independent information of
A and R22 is a submatrix of small norm, in which case the matrix
\[
  W = \Pi \begin{pmatrix} R_{11}^{-1} R_{12} \\ -I \end{pmatrix} \qquad (5.3.2)
\]

is a good first approximation for the basis of the null space of A. The null
vectors can be improved by a few steps of subspace iteration as shown in
[42]. If after k steps the above two conditions of RRQR hold with r = k,
the best possible factorization has been attained and the numerical rank is
determined as r.
Hong and Pan [129] proved that there always exists a permutation ma-
trix Π so that A Π = QR is an RRQR factorization of A if the numerical
rank is known. The question is how to obtain an economic and reliable
algorithm. One can reformulate the RRQR conditions in terms of a bi-
objective optimization processes, i.e., the task is to find a permutation Π
that maximizes σ_r(R_{11}), the smallest singular value of R_{11}, and at the same
time minimizes σ_1(R_{22}), the norm of R_{22}.
The maximization ensures that the first r columns of Q span the range
of A. The pivoted QR factorization, a column pivoting strategy devised
by Businger and Golub (see [105], p. 235), fulfills this requirement. At
step k, the column of the rectangular matrix A_{22}^{(k)} with largest norm is
permuted to be in the first position, after which the Householder transfor-
mation that zeros all the elements of this sub-vector below the diagonal is
applied. This algorithm is implemented in Linpack and Lapack and can
be applied successfully in most cases if it is used with some safeguarding
condition estimation (see, for example, [20], pp. 114ff.).
Theoretically, this process should be continued up to k = rank(A). As
the rank is unknown, the stopping criterion in the Businger-Golub algo-
rithm is ‖A_{22}^{(k)}‖_2 ≤ ε_M ‖A‖_2. This criterion works well in most cases, but
may fail occasionally because, although a sufficiently small A_{22}^{(k)} means that
A is close to a matrix of rank k, it may happen that R_{11}^{(k)} is nearly rank
deficient without A_{22}^{(k)} being very small and the algorithm should have been
stopped.

Example 67. The following pathological example due to Kahan (see, for
example, [20]) illustrates the fact that the column pivoted QR factorization
algorithm can fail. Let
\[
  A = \mathrm{diag}(1, s, \ldots, s^{n-1})
  \begin{pmatrix}
  1 & -c & -c & \cdots & -c \\
    & 1  & -c & \cdots & -c \\
    &    & \ddots &    & \vdots \\
    &    &    & \ddots & -c \\
    &    &    &        & 1
  \end{pmatrix} . \qquad (5.3.3)
\]

Assume that n = 100, c = 0.2 and s = 0.9798. Then A is invariant under
the previous algorithm and A_{22}^{(n−1)} = a_{nn} = 0.133; however, A is almost
rank deficient with numerical rank n − 1, as can be seen from the singular
values σ_{n−1} = 0.1482 and σ_n = 0.368 · 10^{−8}. It was proved in [262] that
σ_{n−1} = s^{n−2} √(1 + c).
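
The following NumPy/SciPy sketch builds the matrix (5.3.3) and compares the
last diagonal element produced by column-pivoted QR with the smallest singular
value. Note that in floating point the pivoting on this particular matrix is
sensitive to rounding (a tiny diagonal scaling is sometimes used in the
literature to keep it invariant under pivoting), so the printed values are
only indicative.

import numpy as np
from scipy.linalg import qr

def kahan(n, c=0.2):
    # Kahan's matrix (5.3.3): diag(1, s, ..., s^(n-1)) times a unit upper
    # triangular matrix with -c above the diagonal, where s^2 + c^2 = 1.
    s = np.sqrt(1.0 - c * c)
    T = np.triu(-c * np.ones((n, n)), k=1) + np.eye(n)
    return np.diag(s ** np.arange(n)) @ T

A = kahan(100)
sing = np.linalg.svd(A, compute_uv=False)
Q, R, piv = qr(A, pivoting=True)
print(sing[-2], sing[-1])   # ~1.5e-1 and ~4e-9: A is nearly rank deficient
print(abs(R[-1, -1]))       # with no column exchanges this is a_nn ~ 0.13,
                            # far larger than sigma_n, so |r_nn| hides the gap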

Several hybrid algorithms that solve the second minimization have been
designed. They use the Businger-Golub algorithm in a pre-processing stage
and then they compute another permutation matrix to improve on the
rank-revealing aspect. The theoretical basis for most of these algorithms is
the following bound from [42], on the norm of the lower right sub-block in
the RRQR factorization:
\[
  \| R_{22} \|_2 \leq \| A\,\Pi\, Y \|_2\, \| Y_2^{-1} \|_2 , \qquad
  Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} , \quad
  Y_1 \in \mathbb{R}^{r \times (n-r)}, \;\;
  Y_2 \in \mathbb{R}^{(n-r) \times (n-r)},
\]

where the linearly independent columns of Π Y are approximate null vectors


of A. The norm of R22 will be small if one chooses a Y and a permutation
Π such that Y2 is well conditioned.
Assume, for example, that rank(A) = n − 1, that the pivoting algorithm
has been applied and a preliminary QR factorization has been obtained
with R1 triangular. The null space of A is spanned by v n , the right singu-
lar vector corresponding to σ_n. An approximating vector y ≈ v_n can be
obtained by a couple of inverse iteration steps on R1T R1 (see the subsec-
tion on properties of the SVD). The permutation Π is then determined by
moving the largest component in absolute value of y to the end, so that
‖Y_2‖_2 = |v_n| is large. This technique is then adapted to the general case
when rank(A) < n − 1, see [40] for details.
An efficient RRQR algorithm was developed by refining the pivoted QR
factorization (see [14] for a survey and description). The algorithm can
be found in [263] under TOMS/782, and it starts with a pre-processing
using pivoted QR, followed by interlaced column permutations with re-
triangularization to obtain the rank-revealing condition. A more expensive
algorithm to determine the r < n most linearly independent columns of A
is an SVD-based algorithm developed by Golub, Klema and Stewart [97].
It combines subset selection using SVD with the computation of the basic
solution using QR factorization (see [20], pp. 113, for details).
Once the RRQR factorization (5.3.1) has been computed, the basic so-
lution xB is easily computed by multiplying b with QT1 , followed by back-
substitution with R11 , which costs a total of 4mn − 2n2 + r2 flops. To
compute the LSQ solution of minimal norm (following [20], p. 107), we
must first compute the matrix W in (5.3.2) and then we have
\[
  \hat{x}^* = x_B - W x_N , \qquad x_N = W^\dagger x_B ,
\]

in which x_N is computed by solving the least squares problem min_z ‖W z −
x_B‖_2 (we note that the vector W x_N = W W^† x_B is the component of x_B in
the null space of A that we subtract from x_B to arrive at the minimal-norm
solution). The total cost of this approach is r²(n − r) + 2r(n − r)² + 2r²
flops.
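
To make the bookkeeping concrete, here is a NumPy/SciPy sketch that computes
a basic solution and the minimal-norm solution from a column-pivoted QR
factorization; the numerical rank is determined from the diagonal of R with a
simple relative tolerance, which is an illustrative choice rather than a
robust rank-revealing test.

import numpy as np
from scipy.linalg import qr, solve_triangular, lstsq

def basic_and_minnorm(A, b, tol=1e-10):
    # Basic and minimal-norm LSQ solutions from a pivoted QR factorization.
    m, n = A.shape
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    r = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))  # numerical rank
    R11, R12 = R[:r, :r], R[:r, r:]
    # Basic solution: solve R11 z = Q1^T b, zeros in the remaining positions.
    z = solve_triangular(R11, Q[:, :r].T @ b)
    x_basic = np.zeros(n)
    x_basic[piv[:r]] = z
    if r == n:                              # full rank: both solutions agree
        return x_basic, x_basic.copy()
    # Minimal-norm solution: x = x_B - W x_N with W = Pi [R11^{-1} R12; -I].
    W = np.zeros((n, n - r))
    W[piv[:r], :] = solve_triangular(R11, R12)
    W[piv[r:], :] = -np.eye(n - r)
    x_N = lstsq(W, x_basic)[0]              # x_N = W^+ x_B
    return x_basic, x_basic - W @ x_N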

5.4 UTV and VSV decompositions


A UTV decomposition is a slightly different type of decomposition, where
the matrix A is written as the product of three matrices, two orthogonal
ones and a middle upper or lower triangular (or block triangular) matrix.
These decompositions fill a space somewhere between the pivoted QR fac-
torization and the SVD. They provide orthonormal bases for the range and
null space, as well as estimates for the singular values – and they are faster
to compute and update. The algorithms to compute a URV decomposition
start with a QR step, followed by a rank revealing stage when the singular
vectors corresponding to the smaller singular values are estimated.
The URV decomposition, with an upper triangular middle matrix, takes
the following form:
Definition 68. Given the m × n matrix A, a URV decomposition is a
factorization of the form
\[
  A = U_R \begin{pmatrix} R_k & F \\ 0 & G \\ 0 & 0 \end{pmatrix} V_R^T ,
\]
with U_R and V_R orthogonal and partitioned conformally with the blocks,
where Rk is a k × k nonsingular upper triangular matrix and G is (n − k) ×
(n−k). If A has a well-defined gap between the singular values, σ_{k+1} ≪ σ_k,
then the URV decomposition is said to be rank revealing if
\[
  \sigma_{\min}(R_k) = O(\sigma_k)
  \qquad \text{and} \qquad
  \left\| \begin{pmatrix} F \\ G \end{pmatrix} \right\|_2 = O(\sigma_{k+1}).
\]
There is also a ULV decomposition, where the matrix L is lower triangu-
lar – we skip the details here and refer instead to [78]. The QR factorization
with column permutations can be considered as a special URV decomposi-
tion, and the justification is the following theorem [129].
Theorem 69. For 0 < k < n define
\[
  c_k = \sqrt{ k(n-k) + \min(k,\, n-k) } .
\]
Then there exists a permutation matrix Π such that the pivoted QR
factorization has the form
\[
  A\,\Pi = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix} ,
\]
with a k × k upper triangular R_{11}, σ_k(R_{11}) ≥ σ_k(A)/c_k and ‖R_{22}‖_2 ≤ c_k σ_{k+1}(A).
The normal equations can also handle rank-deficient problems, as long
as the numerical rank of A is well defined. The starting point is a Cholesky
factorization of AT A using symmetric pivoting,
ΠT (AT A) Π = R1T R1 ,
where Π is a permutation matrix, and R1 is the upper triangular (or trape-


zoidal) Cholesky factor; the numerical properties of this algorithm are dis-
cussed by Higham [127]. The next step is to compute a ULV decomposition
of the matrix E R1 E, where the “exchange matrix” E consists of the columns
of the identity matrix in reverse order,

\[
  E R_1 E = U_L\, L\, V_L^T , \qquad
  L = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} ,
\]

which leads to the symmetric VSV decomposition of AT A, having the form

\[
  A^T A = V\, L^T L\, V^T , \qquad V = ( V_1 \;\; V_2 ) = \Pi\, E\, V_L
\]

(the orthogonal factor UL is not used). Then the minimal norm solution is
given by
\[
  \hat{x}^* = V_1 L_{11}^{-1} L_{11}^{-T} V_1^T (A^T b).
\]

We refer to [124] for more details about definitions and algorithms for sym-
metric VSV decompositions.

5.5 Bidiagonalization
We now describe a procedure based on Householder transformations of the
compound matrix ( b A ) for computing a matrix bidiagonalization (see
[21] for more details). The process is also used in the solution of total
least squares problems. The same least squares problem structure can also
be achieved with a Lanczos-type process, as will be defined for use in the
LSQR algorithm in the next chapter, although the present method is more
stable.
For simplicity, we will assume first that A has full rank. The objec-
tive is to use Householder transformations to compute orthogonal matrices
UB ∈ Rm×m and VB ∈ Rn×n , such that UBT AVB is lower bidiagonal. For
convenience, instead of bidiagonalizing A and then computing UBT b, we will
apply the Householder transformations to the augmented matrix ( b A ).
This has the advantage that the transformed problem has a very simple
right-hand side. We use a sequence of interleaved left and right House-
holder reflections, so that in step k = 1, . . . , n + 1 a left reflection zeros the
elements in column k under the diagonal of the augmented matrix and a
right reflection zeros the elements in row k from element (k, k + 1) onward
of A, resulting in the decomposition
\[
  U_B^T\, ( \, b \;\;\; A V_B \, ) =
  \begin{pmatrix}
  \beta_1 & \alpha_1 &          &         &             \\
  0       & \beta_2  & \alpha_2 &         &             \\
          &          & \ddots   & \ddots  &             \\
          &          &          & \beta_n & \alpha_n    \\
          &          &          &         & \beta_{n+1} \\
  0       & \cdots   &          &         & 0           \\
  \vdots  &          &          &         & \vdots
  \end{pmatrix}
  \qquad \text{with } \beta_1 = \| b \|_2 .
\]

With x = VB y we now obtain

\[
  \min_x \| A x - b \|_2 = \min_y \| U_B^T A V_B\, y - U_B^T b \|_2
  = \min_y \| B y - \beta_1 e_1 \|_2 , \qquad (5.5.1)
\]

with the (n + 1) × n matrix
\[
  B = \begin{pmatrix}
  \alpha_1 &          &         &             \\
  \beta_2  & \alpha_2 &         &             \\
           & \ddots   & \ddots  &             \\
           &          & \beta_n & \alpha_n    \\
           &          &         & \beta_{n+1}
  \end{pmatrix} .
\]

After k < n steps of the bidiagonal reduction one has computed orthogonal
matrices Uk+1 and Vk such that their columns are the first k + 1 and k
columns in U_B and V_B respectively. U_{k+1}^T A V_k = B_k is the leading (k + 1) × k
submatrix of B.
Note that for the above-described bidiagonalization, as before for the
QR factorization method, no significant additional storage is needed be-
cause the relevant information about the transformations can be stored in
the empty spaces of A. The complete bidiagonal factorization applied only
to A can be used as a first step of an algorithm for calculating the SVD –
see the next section.
Another bidiagonalization algorithm, which is advantageous in the case
that m  n, was first described in [150], and it is known as R-bidiagonaliza-
tion. Instead of bidiagonalizing the original matrix A, the algorithm starts
by a QR factorization and then it bidiagonalizes the resulting upper trian-
gular matrix. In the second step the algorithm can take advantage of the
special structure and uses the less expensive Givens transformations. The
number of flops is changed from 4mn2 − 4n3 /3 for the standard bidiago-
nalization to 2mn2 + 2n3 for R-bidiagonalization, so the break-even point
is m = 5n/3.

The bidiagonalization can be stopped prematurely if a zero element


appears in B. One obtains then a so-called core least squares problem,
with important properties that simplify the solution of the LS problem; see
[180]. In fact, if α_{k+1} = 0, the least squares problem takes the form
\[
  \min_y \left\| \begin{pmatrix} B_k & 0 \\ 0 & A_k \end{pmatrix}
  \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
  - \beta_1 \begin{pmatrix} e_1 \\ 0 \end{pmatrix} \right\|_2 .
\]
Thus, it can be decomposed into two independent minimization problems:
\[
  \min_{y_1} \| B_k y_1 - \beta_1 e_1 \|_2 , \qquad \min_{y_2} \| A_k y_2 \|_2 .
\]

The first is the core subproblem; it has a full-rank matrix and hence a unique
solution that can be computed as above via QR with Givens rotations. The
minimum of the second problem is obtained for y 2 = 0. A similar argument
can be used to decompose the problem (5.5.1) into two separable ones, if
an element βk+1 = 0.
Using the fact that at each step the bidiagonal reduction produces Bk ,
the leading submatrix of the final B, Paige and Saunders derived a tech-
nique called LSQR to solve general sparse rank-deficient least squares prob-
lems [179]. Essentially, it interleaves QR solution steps with an iterative
formulation of the reduction to bidiagonal form in order to compute a se-
quence of solutions y k ∈ Rk to the reduced problem
\[
  \min_y \| \beta_1 e_1 - B_k y \|_2 .
\]

By applying Givens rotations, this lower bidiagonal problem can be trans-


formed to an equivalent upper bidiagonal one, which in the LSQR case
has specific computational advantages: the sequence of residual norms is
decreasing. One accepts x_k = V_k y_k as an approximate solution of the orig-
inal least squares problem if ‖A^T r_k‖_2 / (‖A‖_2 ‖r_k‖_2) is smaller than a given
tolerance.
The method is also known as partial least squares (PLS); see, for example,
[256], [257].

5.6 SVD computations


We conclude this chapter with a brief discussion of the Golub-Reinsch-
Kahan algorithm for computing the SVD; details can be found in [20] and
[105]. The steps are
1. Bidiagonalization of A by a finite number of Householder transforma-
tions from left and right,
\[
  U_B^T A\, V_B = \begin{pmatrix} B \\ 0 \end{pmatrix} , \qquad B \ \text{bidiagonal.}
\]
This is the closest to a diagonal form achievable by a finite process.

                    SVD (G-R-K)       R-SVD            To obtain

Σ                   4mn² − 4n³/3      2mn² + 2n³       rank
Σ, V, U_n^T b       4mn² + 8n³        2mn² + 11n³      LSQ solution
Σ, V, U_n           14mn² + 8n³       6mn² + 20n³      pseudoinverse

Table 5.1: Comparison of algorithms for computing the SVD: the stan-
dard Golub-Reinsch-Kahan algorithm and the variant with an initial QR
factorization. The matrix U_n consists of the first n columns of U.

Method                Used for         Cost

LU (Cholesky)         min solution     2mn² − 2n³/3
QR Householder        factorization    4mnr − 2r²(m + n) + 4r³/3
QR Householder        basic solution   4mn − 2n² + r²
QR Householder        min solution     r²(n − r) + 2r(n − r)² + 2r²
Bidiagonalization     min solution     4mn² − 4n³/3
R-bidiagonalization   min solution     2mn² + 2n³
SVD (G-R-K)           min solution     4mn² + 8n³
R-SVD                 min solution     2mn² + 11n³

Table 5.2: Comparison of some methods to solve the LSQ problem in the
rank-deficient case, where r is the rank of A. The LU and QR factorizations
use permutations.

2. Computation of the SVD of B by an iterative procedure, which pro-


duces:
U_Σ^T B V_Σ = Σ = diag(σ_1 , . . . , σ_n).

3. Finally, the SVD of A is
\[
  A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T , \qquad \text{with} \qquad
  U = U_B \begin{pmatrix} U_\Sigma & 0 \\ 0 & I_{m-n} \end{pmatrix}
  \quad \text{and} \quad V = V_B V_\Sigma .
\]

For the second step, an algorithm that is equivalent to an implicit-shift QR


for B T B is applied directly to B in order to zero its superdiagonal elements.
Basically the procedure consists of Givens rotations, alternately applied to
the right and left that chase down the nonzero off-diagonal elements. The
equivalence with the implicit-shift QR guarantees the convergence, which
is often cubic. As mentioned in the previous section, this algorithm can be
extended with a pre-processing stage that computes a QR factorization of
A, applies steps 1–3 to R, and finally updates U by replacing it with Q U.
Table 5.1, with information taken from [105], compares the properties
and costs of the SVD algorithms with and without R-bidiagonalization.

The matrices Σ and Un correspond to the economical version of the SVD.


For the LSQ problem, the matrix U need not be explicitly formed because
it can be applied to b as it is developed. The numbers above assume that
the iterative stage 2 needs on average 2 steps per singular value. If A is
nearly rank deficient, this will be reflected in the SVD, and the relative
accuracy of the smaller singular values will suffer.
A variant of the Golub-Reinsch-Kahan SVD algorithm was described in
[248]; this algorithm is called partial SVD, and it terminates the iterations
in step 2, once it is guaranteed that all remaining singular values to be
computed are smaller than a user-specified threshold.
Table 5.2, using information from [105], compares some of the methods
discussed in this chapter for matrices of rank r < min(n, m).
Chapter 6

Methods for Large-Scale Problems

Large-scale problems are those where the number of variables is such that
direct methods, like the ones we have described in earlier chapters, cannot
be applied because of storage limitations, accuracy restrictions or compu-
tational cost. Usually, large-scale problems have special structure, such as
sparseness, which favors the use of special techniques. We already saw that
Givens rotations are preferable to Householder transformations in this case.
There are many iterative methods tailored to specific problems and ap-
plications in this area; here we choose to survey only the most important
algorithms and concepts related to iterative methods for linear LSQ prob-
lems. We finally discuss a block approach that can be combined with either
direct or iterative methods and can handle fairly large problems in a dis-
tributed network of computers. The reader is referred to [20] for a general
survey of methods.

6.1 Iterative versus direct methods


If the LSQ problem is large, the question arises whether either iterative or
direct methods should be used. A problem with, for example, half a million
equations and n = O(10^5) parameters to be determined, such as is common
in solving inverse problems, needs conservatively O(n^3) operations if one
uses a direct method. On a modern computer (circa 2011), performing 10^10
floating-point operations/second, that calculation would require about 28
hours of CPU time, although hardware is getting cheaper and faster all the
time. Codes that take advantage of multicore hardware can also speed up
the process.


On the other hand, iterative methods compute an approximation to


the solution and then improve on it until the error is small enough. For
many large problems, iterative methods are favorable alternatives to direct
methods because they can (approximately) solve the problem faster and
with much less demands on memory. Typically, iterative methods do not
require the matrix to be explicitly stored – instead they use subroutines
that compute the necessary matrix-vector multiplications.
However, if the normal equations are sparse, a direct method may still
be a good option. The advantages/disadvantages of direct and iterative
methods can be summarized as follows:
• Speed
Direct methods: the number of necessary steps is known, but fill-
in (i.e., creation of new nonzero elements) may make the process
prohibitively expensive. Taking advantage of sparseness to minimize
fill-in is an art and it requires special algorithms.
Iterative methods: at best there is an estimate of the number of itera-
tions and the cost per iteration, but matrix structures are preserved.
• Storage
Direct methods: sometimes efficient compressed storage formats can
be used but retrieval may be costly.
Iterative methods: compressed storage formats can be chosen, subject
to the same proviso as above.
• Robustness
Direct methods: there is a direct method applicable to every problem
and the solution can always be attained.
Iterative methods: may require many iterations to converge.
• Formulation
Direct methods: solve the normal equations (this may destroy sparse-
ness) or the overdetermined system itself.
Iterative methods: solve the normal equations A^T(A x − b) = 0, the overde-
termined system, or the augmented matrix form.
A compromise between the two approaches are preconditioned iterative
methods, which use an approximate factorization of A. Also, block iter-
ative methods that are easily parallelizable are an interesting alternative
discussed at the end of the chapter.
For the sake of completeness, the classical iterative methods will be
surveyed first, and then some of the more efficient Krylov subspace methods
will be explained.

6.2 Classical stationary methods


The basic idea behind the definition of the classical iterative methods is to
split the matrix so that the resulting system can easily be solved at each
iteration. We consider first the application to the normal equations. Recall
that if A is a full-rank matrix, then the normal equations are symmetric
and positive definite. The general formulation is given below:
• Split AT A into convenient matrices Q − (Q − AT A), where Q is easily
invertible.
• The iteration is now defined by choosing a starting vector x(0) and
then solving at each step: Q x(k+1) = (Q − AT A) x(k) + AT b.
• The convergence properties depend on the spectral radius of the iter-
ation matrix: I − Q−1 (AT A).
The most common methods are
• Richardson: Q = I.
• Jacobi: Q = main diagonal of AT A.
• Gauss-Seidel: Q = lower triangular part of AT A. It can be shown
that this is a residual-reducing method.
• SOR: introduces a relaxation factor ω in the Gauss-Seidel method.
Convergence is assured if the acceleration parameter satisfies ω ∈
(0 , 2).
We point out that all these methods can be implemented without forming
the product AT A and that their convergence in the full-rank case is well
known. There are also convergence results for the rank-deficient case. A
not so well known fact is that splitting can also be applied directly to the
rectangular matrix A, thus avoiding the use of the normal equations and
its poorer conditioning. Not every splitting is valid in this case and one
needs to use a so-called proper splitting.
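
As a small illustration of the point that the normal-equation matrix need not
be formed explicitly, the sketch below implements a damped Jacobi iteration
for the normal equations using only products with A and A^T; the relaxation
parameter omega and the stopping tolerance are illustrative choices, and
convergence is subject to the spectral radius condition stated above.

import numpy as np

def jacobi_normal_equations(A, b, n_iter=500, omega=1.0):
    # Jacobi iteration for A^T A x = A^T b, implemented with matrix-vector
    # products only; Q = diag(A^T A) is the vector of squared column norms.
    d = np.sum(A * A, axis=0)               # diag(A^T A) without forming A^T A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (b - A @ x)               # normal-equation residual
        x = x + omega * g / d
        if np.linalg.norm(g) <= 1e-10 * np.linalg.norm(A.T @ b):
            break
    return x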
Definition 70. Given the matrix A, the splitting A = Q − (Q − A) is a
proper splitting if the ranges and null spaces of A and Q coincide.
For a proper splitting, the iteration Q x(k+1) = (Q − A) x(k) + b (the
problem to solve at each step is a least squares problem) converges to x∗ =
A† b if and only if the spectral radius ρ of the iteration matrix satisfies
ρ(Q† (Q − A)) < 1. For details see [20].
If the rate of convergence for an iterative method is slow, there are
several possible ways to improve it, such as Chebyshev acceleration or pre-
conditioning. In the Chebyshev method, k original iterates are combined
to define an iteration matrix with a smaller spectral radius. The other con-
vergence acceleration method, preconditioning, is also used extensively for
Krylov methods and will therefore be explained after these methods have
been introduced.

6.3 Non-stationary methods, Krylov methods


In contrast to the classical iterative methods described above, Krylov meth-
ods have no iteration matrix. Their popularity, both for the solution of
systems of equations and for least squares problems, is because, in addition
to low computing time and memory requirements, they are parameter free.
Access to A is only via subroutines for matrix/vector products, and there-
fore they are well suited to solving large, sparse problems. The convergence
rate for well-conditioned matrices is good, usually reaching an acceptable
accuracy in a small number of steps. In addition, the rate of convergence
can be improved with various preconditioning techniques. First we define
Krylov subspaces.
Definition 71. Given an m × m matrix A, a vector z and an integer d,
the Krylov subspace denoted by K_d(A, z) is the space generated by
\[
  K_d(A, z) \equiv \mathrm{span}\{ z,\, A z,\, \ldots,\, A^{d-1} z \} .
\]

Solution of square systems


In this subsection we consider the linear system A x = z, where A denotes
a square matrix. Krylov subspace methods for the solution of nonsingular
linear systems A x = z are based on the expansion of A−1 in powers of the
matrix A using its minimal polynomial
\[
  q(t) = \prod_{j=1}^{d} (t - \lambda_j)^{m_j} ,
\]

where λ_j are the distinct eigenvalues of A with corresponding multiplicities
m_j. The inverse of the matrix can then be written as
\[
  A^{-1} = \sum_{i=0}^{m-1} \gamma_i A^i ,
\]
with some coefficients γ_i and m = m_1 + · · · + m_d. The methods are also appli-
cable sometimes to the singular matrix case, based on the Drazin inverse
solution. For a very clear introduction see [133].
Given the power expansion of A−1 , it is easy to see that the linear
system solution: x = A−1 z, belongs to a Krylov subspace generated by
A and z. The dimension d of this space is defined by the degree of the


minimal polynomial of A and therefore by its eigenvalue distribution. It
will be smaller if there are clusters of eigenvalues (especially if their associ-
ated eigenvectors are linearly independent). Faster convergence of Krylov
algorithms is related to smaller dimensions.
Krylov algorithms compute successive approximations x(k) to the solu-
tion in expanding Krylov spaces Kk (A, z) for k = 1, . . . , d. Some of the
methods recast the approximations in the form: x(k) = x(0) + p(k) , with
x(0) the initial guess and p(k) direction vectors. This can be viewed as
solving the problem: A p = z − Ax(0) ≡ r (0) , with the p(k) approximations
to p belonging to the Krylov space Kk (A, r (0) ).

Conjugate gradient method for LSQ problems


The conjugate gradient (CG) method (see the original Hestenes and Stiefel
paper [126]), applicable to a symmetric positive definite system Ax = z, is a
Krylov subspace method that at the kth iteration minimizes the functional

\[
  \Phi(x) = \tfrac{1}{2}\, x^T A\, x - x^T z
\]
over the affine subspace x(0) + Kk (A, r (0) ). Note that a minimum of Φ(x)
occurs when A x = z.
If A has full column rank so that AT A is positive definite, CG can be
applied to the normal equations in factored form. An implementation can
be found in the program CGLS; see [263]. The algorithm generates solution
approximations of the form x(k) = x(k−1) + p(k−1) .
The solution of the problem AT A p = AT (z − A x(0) ) ≡ AT r (0) is
achieved by minimizing the error functional in the AT A-norm,

\[
  \| x^* - x \|_{A^T A}^2 \equiv (x^* - x)^T A^T A\, (x^* - x),
\]

over x(0) + Kk (AT A, AT r (0) ), by moving along a set of AT A-orthogonal


direction vectors p(k) .
At every iteration an update r^(k) of the residual and a new direction
vector p^(k) is calculated, so that the vectors {p^(k)}, k = 1, . . . , form an
A^T A-orthogonal (or conjugate orthogonal) set, i.e., (p^(k))^T A^T A p^(j) = 0,
j = 1, . . . , k − 1, and the residuals {r^(j)}, j = 1, . . . , k, form an orthogonal set.

Algorithm CGLS
    Initialize: x^(0) = 0, r^(0) = b − A x^(0), p^(0) = A^T r^(0)
    for k = 1, 2, . . .
        Compute the step by minimizing Φ(x) along the direction p^(k−1)
        Update x^(k) and r^(k)
        Compute the new direction p^(k), A^T A-orthogonal to all previous directions
        Check for convergence
    end
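
A compact NumPy version of the standard recurrences behind this pseudocode
is given below (variable names are ours); the two matrix-vector products per
step, with A and A^T, are clearly visible.

import numpy as np

def cgls(A, b, n_iter=100, tol=1e-10):
    # Conjugate gradients on the normal equations (CGLS) for min ||b - A x||_2.
    x = np.zeros(A.shape[1])
    r = b.copy()                     # r = b - A x
    s = A.T @ r                      # normal-equation residual
    p = s.copy()
    gamma = s @ s
    for _ in range(n_iter):
        q = A @ p
        alpha = gamma / (q @ q)      # step length minimizing Phi along p
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        if np.sqrt(gamma_new) <= tol * np.linalg.norm(A.T @ b):
            break
        p = s + (gamma_new / gamma) * p   # new A^T A-conjugate direction
        gamma = gamma_new
    return x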

Only two coupled two-term recurrences are needed: one to update the
residuals and another to update the directions, involving a total of 3n + 2m
flops/step, in addition to two matrix-vector multiplications (with A and
AT ) per step. The iteration vectors x(k) converge to the LSQ solution x∗
if x(0) ∈ range(AT ), a condition that holds if the initial vector is chosen as
x^(0) = 0. The rate of convergence can be estimated by
\[
  \| x^* - x^{(k)} \|_{A^T A} \, < \,
  2 \left( \frac{\sqrt{\mathrm{cond}(A^T A)} - 1}{\sqrt{\mathrm{cond}(A^T A)} + 1} \right)^{\!k}
  \| x^* - x^{(0)} \|_{A^T A} .
\]

The convergence is very fast if cond(A)2 ∼ 1, i.e., when the columns of A


are approximately orthogonal.
In exact arithmetic, the number of iterations to achieve convergence is
equal to the number of distinct singular values of A. In general this number
is not known, so an upper bound is the dimension n, but in practice it may
need far fewer or far more iterations because of the influence of rounding
errors on the orthogonalization process. Another important property is
that the norms of the error and the residual decrease monotonically with
k, but for ill-conditioned matrices A, the norm ‖A^T r^(k)‖_2 may oscillate as
it converges to zero.
Example 72. Computed Tomography. In medicine, materials science,
geoscience and many other areas, one is interested in computing an image of
an object’s interior without opening or destroying the object. A well-known
method for doing so is computed tomography (CT), where one records the
damping of a multitude of X-rays (or other signals or waves) sent through
the object from different directions. The damping of each ray or its time of
travel is given by a line (or curvilinear) integral along the ray of the object’s
spatially varying absorption coefficient (or velocity of propagation) and an
image of this absorption coefficient is then reconstructed.
When the CT problem is discretized, we obtain a model in the form of an
LSQ problem with a solution x∗ whose elements represent the pixels in the
image (or some other more concise parametrization of the medium). The
right-hand side b consists of the measured dampings or travel times and
the elements of A are related to the lengths of each of the rays through the
individual pixels of the image. Since each ray only touches a small fraction
of all the pixels in the image, the matrix A is very sparse, and we can use
CGLS to solve the associated LSQ problem.

Figure 6.3.1: The norms of the errors and the LSQ residuals as a function
of the number of iterations for the computed tomography problem.

Figure 6.3.2: The CT reconstructions after 100, 200, 500 and 1500 CGLS
iterations.

In this example with X-rays, we use a problem with 4096 unknowns,


corresponding to a 64 × 64 image and a total of 4776 rays, leading to a
4776 × 4096 sparse matrix A with a density of nonzero elements equal to
1.5% (the problem was generated with software from the AIR Tools package
[123]). To simplify the example there is no noise. We performed 3000
CGLS iterations with this system. The norms of the errors ‖x∗ − x^(k)‖_2
and the LSQ residuals ‖A x^(k) − b‖_2 as functions of the iteration number
k are shown in Figure 6.3.1. Clearly, both norms decay monotonically.
Figure 6.3.2 shows the CT reconstructions after 100, 200, 500 and 1500
CGLS iterations, and we see that the amount of artifacts and inaccuracies
is reduced considerably after 1500 iterations.
For seismic tomography via ray tracing in elastic media see [187, 189].

Bidiagonalization method: LSQR


A mathematically equivalent Krylov process can be generated by a Golub-
Kahan-type bidiagonalization method. An efficient implementation devel-
oped by Paige and Saunders in [179] is the LSQR code [263], which we
describe now. It has been used to solve singular and ill-conditioned prob-
lems and has been shown by example to be numerically very reliable. Es-
sentially, it interleaves the reduction to bidiagonal form with partial so-


lution steps in order to compute a sequence of approximating solutions
x(k) ∈ Kk (AT A, AT b), which are the minimizers of bidiagonal LSQ prob-
lems defined by a partial factorization of the matrix A.
In Section 5.5 we discussed an algorithm, based on interleaved left and
right Householder transformations applied to the augmented matrix ( b A),
that results in a factorization of the matrix A ∈ Rm×n into a product of
two orthogonal matrices and a lower bidiagonal m × n matrix containing
the nonsingular information of A:

B
A=U V T,
0

where U ∈ Rm×m and V ∈ Rn×n are orthogonal matrices with U T b =


(β_1, 0, . . . , 0)^T, β_1 = ‖b‖_2, and B is the lower bidiagonal matrix
\[
  B = \begin{pmatrix}
  \alpha_1 &          &        &          \\
  \beta_2  & \alpha_2 &        &          \\
           & \beta_3  & \ddots &          \\
           &          & \ddots & \alpha_n \\
           &          &        & \beta_{n+1}
  \end{pmatrix} \in \mathbb{R}^{(n+1) \times n} .
\]

The disadvantage of this stable algorithm is that the Householder transfor-


mations destroy the structure (in particular the sparsity) of A.
An alternative way to obtain the factorization is to generate the columns
of the orthogonal matrices and the elements of B sequentially after equating
the columns in the products AV = U1 B and AT U1 = V B T . Here U1 =
(u1 , . . . , un+1 ) (see Lanczos methods in chapter 9 of [105]). The scalars αj
and β_j are determined so that ‖u_j‖_2 = ‖v_j‖_2 = 1, and u_1 = b/‖b‖_2. Using
the recursive formulas for uj and v j one can show that uj ∈ Kj (AAT , u1 )
and v j ∈ Kj (AT A, AT u1 ).
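
A direct NumPy transcription of these recurrences (the Lanczos-type
Golub-Kahan process that LSQR builds on) might look as follows; for brevity
there is no reorthogonalization and no treatment of breakdown, so in floating
point the computed u_j and v_j gradually lose orthogonality.

import numpy as np

def golub_kahan_bidiag(A, b, k):
    # k steps of Golub-Kahan bidiagonalization, producing U (m x (k+1)),
    # V (n x k) and the (k+1) x k lower bidiagonal B with A V = U B.
    m, n = A.shape
    U = np.zeros((m, k + 1))
    V = np.zeros((n, k))
    alphas, betas = np.zeros(k), np.zeros(k + 1)
    betas[0] = np.linalg.norm(b)
    U[:, 0] = b / betas[0]
    v_old = np.zeros(n)
    beta = 0.0
    for j in range(k):
        w = A.T @ U[:, j] - beta * v_old     # alpha_j v_j = A^T u_j - beta_j v_{j-1}
        alphas[j] = np.linalg.norm(w)
        V[:, j] = w / alphas[j]
        u = A @ V[:, j] - alphas[j] * U[:, j]   # beta_{j+1} u_{j+1} = A v_j - alpha_j u_j
        beta = np.linalg.norm(u)
        betas[j + 1] = beta
        U[:, j + 1] = u / beta
        v_old = V[:, j]
    B = np.zeros((k + 1, k))
    B[np.arange(k), np.arange(k)] = alphas       # diagonal alpha_j
    B[np.arange(1, k + 1), np.arange(k)] = betas[1:]   # subdiagonal beta_{j+1}
    return U, B, V

# In exact arithmetic one has A @ V == U @ B, which can be checked with
# np.allclose on a small test problem.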
The sequence of approximate bidiagonal LSQ problems arises in the
following way. At step k, the bidiagonalization process has produced three
matrices: Uk+1 = (u1 , . . . , uk+1 ), Vk = (v 1 , . . . , v k ) (submatrices of U and
V ) and Bk , the leading (k + 1) × k submatrix of B.
An approximate solution x^(k) of the LSQ problem min_x ‖A x − b‖_2
is now sought in the Krylov subspace Kk (AT A, AT b) generated by the
columns of Vk = span{v 1 , . . . , v k }, i.e., we are looking for a linear combi-
nation of columns of Vk :
x(k) = Vk y k .
Using the recurrence relations, one obtains

\[
  b - A x^{(k)} = U_{k+1} ( \beta_1 e_1 - B_k y_k ),
\]
where e_1 = (1, 0, . . . , 0)^T is the first unit vector and β_1 = ‖b‖_2. Because
Uk+1 is theoretically orthonormal, the LSQ problem in x(k) can be replaced
by a very simple one in y k :
 
\[
  \min_{x^{(k)} \in \mathrm{range}(V_k)} \| b - A x^{(k)} \|_2
  = \min_{y_k} \| \beta_1 e_1 - B_k y_k \|_2 .
\]

This kth subproblem has a bidiagonal matrix and a special right-hand side
given by (β_1, 0, . . . , 0)^T. It can be solved with a QR factorization of B_k,
using Givens rotations to take advantage of the bidiagonal structure. The
computations can be organized in an efficient way so that at each step the
factorization is developed from the previous one, adding only the necessary
rotations to transform the additional column. Similarly, the iteration vector
x(k) is calculated by a recursion from the previous x(k−1) . To summarize,
the algorithm takes the following form

Algorithm LSQR
Initialize: u1 = b/ b2 , β1 = b2 and set v0 = 0,
for k = 1, 2, . . .
Define the kth approximate LSQ problem by generating
uk+1 , v k , αk , βk+1 using the formulas:
αk v k = AT uk − βk v k−1 , βk+1 uk+1 = Av k − αk uk ,
choosing αk and βk+1 so that v k and uk+1 are normalized.
Solve the problem:
Compute the QR factorization of Bk by updating that of
the previous step.
Compute x(k) from x(k−1) by a recursion.
Estimate the residual and solution norms ‖r_k‖_2, ‖x_k‖_2, and ‖A^T r_k‖_2,
and the condition of B_k.
Check for convergence.
end

The cost per iteration is 3m + 5n flops, in addition to two matrix-vector


multiplications (with A and AT ). Besides the storage for the matrix A,
the algorithm needs 2 vectors of length m and 3 vectors of length n. One
of the stopping criteria in [179] is specifically designed for singular or ill-
conditioned problems: the iterations are stopped if cond(Bk ) is larger than
a given tolerance.
A new formulation of the bidiagonalization process called LSMR was
recently presented [80]. Compared to LSQR, this algorithm requires an
additional n flops per iteration and one additional vector of length n. The
advantage of LSMR is that both ‖r_k‖_2 and ‖A^T r_k‖_2 decay monotonically
with k (while only ‖r_k‖_2 is monotonic for LSQR), thus making it safer to
terminate the algorithm as soon as the residual AT r k associated with the


normal equations is sufficiently small.
For very ill-conditioned problems, instead of solving the original LSQ
problem, there is the option in LSQR of solving a damped LSQ problem
(equivalent to Tikhonov regularization, as we will see in Section 10):
\[
  \min_x \left\| \begin{pmatrix} A \\ \lambda I \end{pmatrix} x
  - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2 .
\]

The quantities uj+1 , v j , αj , βj+1 generated by the Golub-Kahan process


are independent of λ, but y k will now be the solution of a damped LSQ
subproblem.
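
In practice one rarely codes LSQR by hand; SciPy, for instance, ships an
implementation (scipy.sparse.linalg.lsqr) that follows [179] and includes the
damping option. The usage sketch below is only illustrative, with arbitrary
problem sizes, tolerances and damping value.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

# A sparse random LSQ problem (purely illustrative sizes and data).
m, n = 10000, 2000
rng = np.random.default_rng(0)
A = sp.random(m, n, density=1e-3, format='csr', random_state=0)
x_true = rng.standard_normal(n)
b = A @ x_true + 1e-3 * rng.standard_normal(m)

# Plain LSQR; atol/btol play the role of the stopping tolerances in [179].
x, istop, itn, r1norm = lsqr(A, b, atol=1e-8, btol=1e-8, iter_lim=5000)[:4]

# Damped LSQR, i.e., min ||A x - b||^2 + damp^2 ||x||^2 (Tikhonov-type).
x_damped = lsqr(A, b, damp=1e-2, atol=1e-8, btol=1e-8)[0]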
As LSQR is mathematically equivalent to the conjugate gradient algo-
rithm applied to the normal equations for the LSQ problem, the conver-
gence results in exact arithmetic are the same. This means that the LSQR
iterates converge to the LSQ solution and the method should in theory take
at most n steps to obtain the exact solution. However, it may take many
fewer or many more, specially for ill-conditioned systems. As for CGLS the
norms of the solution and the residual have monotonic behavior.
With respect to stability, we mention the following comment from Björck
[21]: “The conclusion from both the theoretical analysis and the experimen-
tal evidence is that LSQR and CGLS are both well behaved and achieve a
final accuracy consistent with a backward stable method.”
Although LSQR has no relation to normal equations, as CGLS has, the
differences in stability of both methods are very small.

6.4 Practicalities: preconditioning and


stopping criteria
Preconditioning is a convergence acceleration technique that, in the LSQ
case, is equivalent to a transformation of variables. For this purpose one
chooses a nonsingular matrix S (the preconditioner), such that the modi-
fied problem min_y ‖A S^{−1} y − b‖_2, with S x = y, requires fewer iterations
and less work. Two important requirements are that AS −1 be better con-
ditioned than A and that it also be easy to recover x∗ from the minimum
y ∗ of the above-transformed problem.
Assuming for simplicity that the normal equations are nonsingular and
therefore symmetric and positive definite, the best preconditioner would be
S = R1 , the triangular Cholesky factor of AT A, since then the condition
number of the matrix AS −1 would be 1.
In general for LSQ problems, the following preconditioners are fre-
quently chosen:
• S = diag(‖a_1‖_2 , . . . , ‖a_n‖_2), which amounts to a diagonal column


scaling imposing unit length.
Incomplete factorizations, such as
• S = R̃, where A^T A = R̃^T R̃ − E; here the matrix R̃ is an incomplete
  Cholesky factor, leaving out inconvenient elements of the exact Cholesky
  factor but keeping ‖E‖_2 small. For example, one can either define a
  convenient storage structure for R̃ and ignore fill-in elements of the
  exact Cholesky factor or drop out elements that are small in magnitude.
  The various preconditioners define a spectrum of methods between the
  direct, where S = R (the exact Cholesky factor when E = 0), through
  the incomplete Cholesky factor R̃, to the un-preconditioned iterative
  method when S = I.

• S = R̃, where R̃ comes from an approximate QR factor of A.

• S = U Π_2^T, where the factors are from an LU factorization
  \[
    \Pi_1 A\, \Pi_2 = L \begin{pmatrix} U \\ 0 \end{pmatrix} .
  \]
  L has unit diagonals and Π_1 is chosen to keep |L_{ij}| ≤ O(1).
  (Π_2 is less critical if A has full column rank.)
It may be useful to remember the mathematical equivalence of LSQR
and CGLS. The latter method minimizes the quadratic form

\[
  \min_x \big[ (x^{(k)} - x)^T A^T A\, (x^{(k)} - x) \big] .
\]

Preconditioning either of the two procedures can be viewed intu-
itively as an attempt to stretch the quadratic form to make it more “spher-
ical,” so that the eigenvalues are closer to each other.
The simplest preconditioner, the diagonal matrix, is a scaling along the
coordinate axis, whereas choosing S −1 = (AT A)−1 would be a scaling along
the eigenvectors. Finding the best preconditioner depends on the problem
and is still a research topic. A good survey can be found in [20].
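
The effect of the simplest of these choices, diagonal column scaling, can be
tried directly: the sketch below applies LSQR to A S^{−1} with S = diag(‖a_1‖_2,
. . . , ‖a_n‖_2) via a SciPy LinearOperator and maps the result back through
x = S^{−1} y; the wrapper name and defaults are illustrative.

import numpy as np
from scipy.sparse.linalg import lsqr, aslinearoperator, LinearOperator

def diag_preconditioned_lsqr(A, b, **kw):
    # Solve min ||b - A x||_2 with LSQR applied to A S^{-1}, where
    # S = diag(column norms of A); the solution is recovered as x = S^{-1} y.
    if hasattr(A, "power"):                       # sparse matrix
        d = np.sqrt(np.asarray(A.power(2).sum(axis=0)).ravel())
    else:                                         # dense array
        d = np.linalg.norm(A, axis=0)
    Aop = aslinearoperator(A)
    AS = LinearOperator(A.shape, dtype=float,
                        matvec=lambda y: Aop.matvec(y / d),
                        rmatvec=lambda r: Aop.rmatvec(r) / d)
    y = lsqr(AS, b, **kw)[0]
    return y / d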
Stopping criteria should always be chosen with great care so as to en-
sure the computation of a solution with sufficient accuracy (that, in the
end, always depends on the application). For a good discussion of vari-
ous stopping criteria we refer to [179]. Some general considerations are as
follows:
1. Always specify a maximum number of iterations, in order to prevent
an infinite loop.
2. Stop when the least squares residual norm $\|A x^{(k)} - b\|_2$ is of the
same size as the norm $\|e\|_2$ of the vector of errors in the right-hand
side, cf. Section 2.2.5. When this is the case, we have computed a solution
$x^{(k)}$ that fits the data as accurately as the errors allow (recall that we
cannot achieve a zero LSQ residual for noisy data). It is not realistic to
expect that the LSQ residual norm can get smaller than $\|e\|_2$, and in fact
further iterations will lead to overfitting, i.e., modeling of the errors.

3. Stop when the norm of the normal-equation residual
$\|A^T(A x^{(k)} - b)\|_2$ is sufficiently small: in principle this norm
should converge to zero, but rounding errors will usually prevent it. For
well-conditioned problems, a small normal-equation residual is a sign that we
have computed an accurate LSQ solution.

4. For the LSQR algorithm applied to ill-conditioned problems, stop when the
condition number of the bidiagonal matrix $B_k$ (or an estimate) exceeds a
given threshold. This approach can be used to limit the sensitivity of the
approximate LSQR solution $x^{(k)}$ to data errors (at the cost of larger
residual norms).
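The sketch below (ours, not the book's software) wires criteria 1–3 into a
plain CGLS iteration; the noise-level estimate `noise_norm` and the tolerance
`ne_tol` are assumed inputs that the user must supply.

    import numpy as np

    def cgls(A, b, noise_norm=0.0, ne_tol=1e-10, max_iter=500):
        """Plain CGLS with the stopping rules discussed above: a maximum
        iteration count, the discrepancy criterion ||A x - b|| <= noise_norm,
        and a small normal-equation residual ||A^T (A x - b)||."""
        x = np.zeros(A.shape[1])
        r = b.copy()                 # residual b - A x
        s = A.T @ r                  # normal-equation residual
        p = s.copy()
        gamma = s @ s
        for k in range(max_iter):                    # criterion 1
            q = A @ p
            alpha = gamma / (q @ q)
            x += alpha * p
            r -= alpha * q
            if np.linalg.norm(r) <= noise_norm:      # criterion 2 (discrepancy)
                break
            s = A.T @ r
            gamma_new = s @ s
            if np.sqrt(gamma_new) <= ne_tol:         # criterion 3 (NE residual)
                break
            p = s + (gamma_new / gamma) * p
            gamma = gamma_new
        return x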

Example 73. Convergence of CGLS solutions. This example illustrates the
convergence of the iterates $x^{(k)}$ of the CGLS algorithm (the LSQR iterates
are practically identical for this problem). This LSQ problem arises in
geodetic surveying and is available as problem “WELL1850” from Matrix Market
[157], except that the noise-free right-hand side $b^{\mathrm{exact}}$ is
modified so that the noise-free system is consistent, i.e.,
$A\, x^{\mathrm{exact}} = b^{\mathrm{exact}}$. The matrix $A$ is
$1850 \times 712$ and its condition number is 111.3 (so this is a rather small
and well-conditioned system).
Figure 6.4.1: The relative norms of the errors, the LSQ residuals and the
normal equation (NE) residuals, as a function of the number of iterations,
for the WELL1850 geodetic surveying problem. Left: a system with a noise-free
right-hand side $b^{\mathrm{exact}}$. Right: a system with a noisy right-hand
side $b = b^{\mathrm{exact}} + e$.

Figure 6.4.1 shows the convergence for two systems. Left: for a system with
the noise-free right-hand side $b^{\mathrm{exact}}$. Right: with a noisy
right-hand side $b = b^{\mathrm{exact}} + e$, where the perturbation $e$ was
scaled such that $\|e\|_2 / \|b^{\mathrm{exact}}\|_2 = 3 \cdot 10^{-3}$. We
show:

• the relative error norm $\|x^{\mathrm{exact}} - x^{(k)}\|_2 / \|x^{\mathrm{exact}}\|_2$,

• the relative LSQ residual norm $\|b - A x^{(k)}\|_2 / \|b^{\mathrm{exact}}\|_2$ and

• the relative normal-equation (NE) residual norm
  $\|A^T(b - A x^{(k)})\|_2 / \|x^{\mathrm{exact}}\|_2$.

For the noise-free data we see that the solution error as well as both of the
residuals converge to zero, clearly demonstrating that the iterates converge
to the exact solution $x^{\mathrm{exact}}$, which in this case is identical to
the LSQ solution $x^*$. For the noisy data, both the relative error norm and
the relative LSQ residual norm level off, showing that, as expected, the
iterates no longer converge to $x^{\mathrm{exact}}$, due to the presence of
the noise. In fact, the relative norm of the LSQ residual levels off at
approximately $3 \cdot 10^{-3}$, as expected from the noise that we
introduced. The relative NE residual norm converges to zero, showing that
CGLS converges to a solution that satisfies the normal equations, i.e., the
LSQ solution $x^*$ for the problem.

6.5 Block methods


There are a variety of classical and modern variations of block methods.
We are interested here in methods that are suitable for parallel implementa-
tion. Very early on, Chazan and Miranker [43] devised a family of methods
for solving large systems of linear equations under the name of chaotic
relaxation with parallelization as the principal objective. The main charac-
teristic of interest was that the methods were asynchronous, an important
property in a network of heterogeneous computers or when the tasks are
not all uniform in the amount of work. Synchronization points force a
wait for all the processors to terminate and create the need for careful load
balancing to avoid inefficiencies.
We describe such an algorithm for solving large linear LSQ problems.
The idea is to partition the data into disjoint blocks and then select those
variables that are best defined by each particular block. If the subproblems
are small enough, then direct methods can be employed to solve them.
In Chapter 9 we extend these ideas to the nonlinear case and describe in
detail the algorithm to determine the subgroup of variables associated with
a given subset of equations. We also give there conditions for convergence of
the (necessarily) iterative process that results. Two variants are discussed:
a block Jacobi (BJ) and a block Gauss-Seidel (BGS) approach.
We allow repetition of variables in different blocks, and to prevent
oscillations we use a running average for updating such repeated variables in
the BGS method. For BJ a normal average is used after a full sweep of
all the blocks. A user-chosen threshold is an integral part of the variable
selection procedure. A smaller threshold allows more variables in the block
and vice versa; thus the threshold itself acts as a regularization parameter
for the least squares subproblems. Since we need to keep the blocks overde-
termined, that can also be achieved by a judicious use of the threshold. Of
course, if the global problem is very ill-conditioned the subproblems can
use further regularization. For direct methods one can either use TSVD or
Tikhonov regularization. Alternatively, one can use conjugate gradients for
the blocks, with its own self-regularization by early termination.
Specifically, let the kth block of the least squares problem be written as

    \[ \min_{x^{[k]}} \|r^{[k]}\|_2^2 = \min_{x^{[k]}} \sum_{i \in RI_k}
       \Big( \sum_{j \in CI_k} a_{ij} x_j^{[k]} +
       \sum_{j \notin CI_k} a_{ij} x_j - b_i \Big)^2, \qquad (6.5.1) \]

where $CI_k$ is the subset of indices of the variables that are active in
block $k$, while the variables $x_j$ with $j \notin CI_k$ are kept fixed. The
summation is over the set of data indices $RI_k$ corresponding to observations
in block $k$. The block iteration then consists in solving equation (6.5.1)
for each block $k = 1, \ldots, K$. Since

    \[ \min_x \|b - A x\|_2^2 = \min_x \sum_{k=1}^{K} \|r^{[k]}\|_2^2, \]

we expect that on termination of the block process we have an approximate
solution of the global problem.
we expect that on termination of the block process we have an approximate
solution of the global problem.
In a practical implementation on a cluster of processors, we assume that
the current value of all the variables is held in a central location accessible to
all the processors. If the values obtained in the solution of block k are used
immediately to update the central repository, then that is a block Gauss-
Seidel iteration, while if one waits until the end of the cycle to update all the
variables, that will be a block Jacobi iteration, requiring a synchronization
point at the end of each sweep over all the blocks.
In a sequential BGS iteration, the blocks would be visited in order,
while in a parallel implementation we allow the block solves to arrive in an
arbitrary order: that is the chaotic part. The theory provides for conver-
gence, provided that each block is solved “often” enough, and it allows for
occasional processor failures, an important feature.
We think that this algorithm is appropriate and will be most successful
for problems in which both the data and the unknowns are “local” and
essentially the problem can be decomposed into blocks that are only weakly
coupled to variables not in the block (using a reasonable threshold).
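As a simplified illustration of the block iteration (6.5.1) — simplified in
that it partitions only the variables into fixed, non-overlapping column
blocks and omits the variable-selection threshold described above — a
sequential block Gauss-Seidel sweep might look as follows; the function name
and the random test data are ours.

    import numpy as np

    def block_gauss_seidel_lsq(A, b, blocks, sweeps=20):
        """Sequential block Gauss-Seidel for min ||b - A x||_2.
        `blocks` is a list of index arrays partitioning the columns of A.
        Each subproblem (6.5.1) is solved directly by a dense LSQ solve."""
        x = np.zeros(A.shape[1])
        for _ in range(sweeps):
            for J in blocks:
                # residual with the variables outside block J held fixed
                r = b - A @ x + A[:, J] @ x[J]
                x[J], *_ = np.linalg.lstsq(A[:, J], r, rcond=None)
        return x

    # tiny usage sketch on synthetic data
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 30))
    b = rng.standard_normal(200)
    blocks = np.array_split(np.arange(30), 3)
    x_bgs = block_gauss_seidel_lsq(A, b, blocks)
    print(np.linalg.norm(A.T @ (b - A @ x_bgs)))   # NE residual should be small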
In the literature there are a number of algorithms similar but not identical
to this one, for which difficulties have been reported. Our experience
with this method has been quite positive, so maybe some of the differences
are important. Among the algorithms closest to ours is that of Steihaug and
Yalcinkaya [229], where they partition the problem only in the variables,
not allowing overlaps. Using sparse matrices from the Boeing collection,
they show that convergence may deteriorate if old information pollutes the
iteration. Bertsekas and his collaborators have done extensive work in this
area [166, 218], although again their approach is somewhat different from
the one described above.
A very interesting and extensive discussion of the solution of very large
linear systems by combining block and iterative methods can be found in
[52, 53]. It is a very good source for issues related to local clusters and ge-
ographically distributed ones (i.e., grid computing) and contains extensive
experimentation and good pointers to available software. Many of those
discussions are applicable to least squares problems.
Chapter 7

Additional Topics in Least Squares

In this chapter we collect some more specialized topics, such as problems
with constraints, sensitivity analysis, total least squares and compressed
sensing.

7.1 Constrained linear least squares problems


The inclusion of constraints in the linear least squares problem is often a
convenient way to incorporate a priori knowledge about the problem. Lin-
ear equality constraints are the easiest to deal with, and their solution is
an important part of solving problems with inequality constraints. Bound
constraints and quadratic constraints for linear least squares are also con-
sidered in this chapter.

Least squares with linear constraints

Linear equality constraints (LSE)

The general form of a linear least squares problem with linear equality
constraints is as follows: find a vector $x \in \mathbb{R}^n$ that minimizes
$\|b - A x\|_2$ subject to the constraints $C^T x = d$, where $C$ is a given
$n \times p$ matrix with $p \leq n$ and $d$ is a given vector of length $p$.
This results in the LSE problem:

    Problem LSE:   $\min_x \|b - A x\|_2$  subject to  $C^T x = d$.   (7.1.1)

A solution exists if $C^T x = d$ is consistent, which is the case if
$\mathrm{rank}(C) = p$. For simplicity, we assume that this is satisfied;
Björck ([20], pp. 188) explains ways to proceed when there is no a priori
knowledge about consistency. Furthermore, the minimizing solution will be
unique if the augmented matrix

    \[ A_{\mathrm{aug}} = \begin{pmatrix} C^T \\ A \end{pmatrix} \qquad (7.1.2) \]

has full rank $n$.


The idea behind the different algorithms for the LSE problem is to reduce it
to an unconstrained (if possible lower-dimensional) LSQ problem. To use
Lagrange multipliers is of course an option, but we will instead describe two
more direct methods using orthogonal transformations, each with different
advantages.

One option is to reformulate the LSE problem as a weighted LSQ problem,
assigning large weights (or penalties) to the constraints, thus enforcing that
they are almost satisfied:

    \[ \min_x \left\| \begin{pmatrix} \lambda C^T \\ A \end{pmatrix} x -
       \begin{pmatrix} \lambda d \\ b \end{pmatrix} \right\|_2,
       \qquad \lambda \ \text{large}. \qquad (7.1.3) \]

Using the generalized singular value decomposition (GSVD), it can be proved
that if $x(\lambda)$ is the solution to (7.1.3) then
$\|x(\lambda) - x\|_2 = O(\lambda^{-2})$, so that in fact
$x(\lambda) \to x$ as $\lambda \to \infty$. The LU factorization algorithm
from Section 4.2 is well suited to solve this problem, because $p$ steps of
Gaussian elimination will usually produce a well-conditioned $L$ matrix.
Although in principle only a general LSQ solver is needed, there are
numerical difficulties because the matrix becomes poorly conditioned for
increasing values of λ. However, a strategy described in [150], based on
Householder transformations combined with appropriate row and column
interchanges has been found to give satisfactory results. In practice, as [20]
mentions, if one uses the LSE equations in the form (7.1.3), it is sufficient
to apply a QR factorization with column permutations.
To get an idea of the size of the weight λ we refer to an example by
van Loan described in [20], pp. 193. One can obtain almost 14 digits ac-
curacy with a weight of λ = 107 using a standard QR factorization with
permutations (e.g., MATLAB’s function “qr”), if the constraints are placed
in the first rows. Inverting the order in the equations, though, gives only
10 digits for the same weight λ and an increase in λ only degrades the
computed solution. In addition, Björck [20] mentions a QR decomposition
based on self-scaling Givens rotations that can be used without the “risk of
overshooting the optimal weights.”
Another commonly used algorithm directly eliminates some of the vari-
ables by using the constraints. The actual steps to solve problem (7.1.1)
are:
1. Compute the QR factorization of C to obtain:

    \[ C = Q \begin{pmatrix} R_1 \\ 0 \end{pmatrix} \iff
       C^T = ( R_1^T \ \ 0 )\, Q^T. \]

2. Use the orthogonal matrix to define new variables:

    \[ \begin{pmatrix} u \\ v \end{pmatrix} = Q^T x \iff
       x = Q \begin{pmatrix} u \\ v \end{pmatrix} \qquad (7.1.4) \]

   and also to compute the value of the unknown p-vector $u$ by solving the
   lower triangular system $R_1^T u = d$.

3. Introduce the new variables into the equation

    \[ \|b - A x\|_2 = \|b - A Q Q^T x\|_2 \equiv
       \left\| b - \tilde{A} \begin{pmatrix} u \\ v \end{pmatrix} \right\|_2, \]

   where the matrix $\tilde{A} = A Q$ is partitioned according to the
   dimensions of $\begin{pmatrix} u \\ v \end{pmatrix}$:
   $\tilde{A} = ( \tilde{A}_1 \ \ \tilde{A}_2 )$, allowing the reduced
   $n - p$ dimensional LSQ problem to be written as

    \[ \min_v \|(b - \tilde{A}_1 u) - \tilde{A}_2 v\|_2. \qquad (7.1.5) \]

4. Solve the unconstrained, lower-dimensional LSQ problem in (7.1.5) and
   obtain the original unknown vector $x$ from (7.1.4).
This approach has the advantage that there are fewer unknowns in each system
that needs to be solved and, moreover, the reformulated LSQ problem is better
conditioned since, due to the interlacing property of the singular values,
$\mathrm{cond}(\tilde{A}_2) \leq \mathrm{cond}(\tilde{A}) = \mathrm{cond}(A)$.
The drawback is that sparsity will be destroyed by this process.

If the augmented matrix in (7.1.2) has full rank, one obtains a unique
solution $x_C^*$; if not, one has to apply a QR factorization with column
permutations while solving problem (7.1.5) to obtain the minimum-length
solution vector.
The method of direct elimination compares favorably with another QR-
based procedure, the null space method (see, for example, [20, 150]), both
in numerical stability and in operational count.
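A compact sketch of the direct elimination steps 1–4 with NumPy/SciPy is given
below; it is an illustration under the stated full-rank assumptions, with
helper names of our choosing, and not the book's own implementation.

    import numpy as np
    from scipy.linalg import solve_triangular

    def lse_direct_elimination(A, b, C, d):
        """Solve min ||b - A x||_2 subject to C^T x = d (C is n-by-p, rank p)."""
        n, p = C.shape
        Q, R = np.linalg.qr(C, mode='complete')      # C = Q [R1; 0]
        R1 = R[:p, :]
        u = solve_triangular(R1.T, d, lower=True)    # R1^T u = d
        At = A @ Q                                   # A~ = A Q = (A~1, A~2)
        A1, A2 = At[:, :p], At[:, p:]
        v, *_ = np.linalg.lstsq(A2, b - A1 @ u, rcond=None)   # problem (7.1.5)
        return Q @ np.concatenate([u, v])            # x from (7.1.4)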
Example 74. Here we return to the linear prediction problem from Ex-
ample 28, where we saw that 4 coefficients were sufficient to describe the
particular signal used there. Hence, if we use n = 4 unknown LP coef-
ficients, then we obtain a full-rank problem with a unique solution given
by
x∗ = ( −1, 0.096, −0.728, −0.096 )T .
If we want the prediction scheme in (2.3.3) to preserve the mean of the
predicted signal, then we should add a linear constraint to the LSQ problem,
forcing the prediction coefficients to sum to zero, i.e.,
$\sum_{i=1}^{4} x_i = 0$. In the above notation this corresponds to the linear
equality constraint

    $C^T x = d, \qquad C^T = (1, 1, 1, 1), \qquad d = 0.$

Following the direct elimination process described above, this corresponds to
the following steps for the actual problem.

1. Compute the QR factorization of C:

    \[ C = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = Q\, R =
       \begin{pmatrix} 0.5 & 0.5 & 0.5 & 0.5 \\
                       0.5 & 0.5 & -0.5 & -0.5 \\
                       0.5 & -0.5 & 0.5 & -0.5 \\
                       0.5 & -0.5 & -0.5 & 0.5 \end{pmatrix}
       \begin{pmatrix} 2 \\ 0 \\ 0 \\ 0 \end{pmatrix}. \]

2. Solve $R_1^T u = d \iff 2u = 0 \iff u = 0$.

3. Solve $\min_v \|b - \tilde{A}_2 v\|_2$ with

    \[ \tilde{A}_2 = A \begin{pmatrix} 0.5 & 0.5 & 0.5 \\
                                       0.5 & -0.5 & -0.5 \\
                                      -0.5 & 0.5 & -0.5 \\
                                      -0.5 & -0.5 & 0.5 \end{pmatrix}, \]

   giving $v = ( -0.149, -0.755, 0.324 )^T$.

4. Compute the constrained solution

    \[ x_C^* = Q \begin{pmatrix} u \\ v \end{pmatrix} =
       \begin{pmatrix} 0.5 & 0.5 & 0.5 \\
                       0.5 & -0.5 & -0.5 \\
                      -0.5 & 0.5 & -0.5 \\
                      -0.5 & -0.5 & 0.5 \end{pmatrix}
       \begin{pmatrix} -0.149 \\ -0.755 \\ 0.324 \end{pmatrix} =
       \begin{pmatrix} -0.290 \\ 0.141 \\ -0.465 \\ 0.614 \end{pmatrix}. \]

It is easy to verify that the elements of $x_C^*$ sum to zero. Alternatively
we can use the weighting approach in (7.1.3), with the row $\lambda C^T$ added
on top of the matrix $A$; with $\lambda = 10^2$, $10^4$ and $10^8$ we obtain
solutions almost identical to the above $x_C^*$, with elements that sum to
$-1.72 \cdot 10^{-3}$, $-1.73 \cdot 10^{-7}$ and $-1.78 \cdot 10^{-15}$,
respectively.
Figure 7.1.1: Constrained least squares polynomial fits (m = 30, n = 10).
The unconstrained fit is shown by the dashed lines, while the constrained fits
are shown by the solid lines. Left: equality constraints $M(x, 0.5) = 0.65$
and $M'(x, 0) = M'(x, 1) = 0$. Right: inequality constraints
$M'(x, t_i) \geq 0$ for $i = 1, \ldots, m$.

Linear inequality constraints (LSI)

Instead of linear equality constraints we can impose linear inequality
constraints on the least squares problem. Then the problem to be solved is

    Problem LSI:   $\min_x \|b - A x\|_2$  subject to  $l \leq C^T x \leq u$,   (7.1.6)

where the inequalities are understood to be component-wise.


There are several important cases extensively discussed in [20, 150], and
Fortran subroutines are available from [88]. A good reference for this part
is [91]. A constrained problem may have a minimizer of $\|b - A x\|_2$ that
is feasible, in which case it can be solved as an unconstrained problem.
Otherwise a constrained minimizer will be located on the boundary of the
feasible region. At such a solution, one or more constraints will be active,
i.e., they will be satisfied with equality.
Thus the solver needs to find which constraints are active at the solution.
If those were known a priori, then the problem could be solved as an equality
constrained one, using one of the methods we discussed above. If not, then
a more elaborate algorithm is necessary to verify the status of the variables
and modify its behavior according to which constraints become active or
inactive at any given time.
As we showed above, a way to avoid all this complication is to use
penalty functions that convert the problem into a sequence of unconstrained
problems. After a long hiatus, this approach has become popular again in
the form of interior point methods [259]. It is, of course, not devoid of its
own complications (principle of conservation of difficulty!).

Example 75. This example illustrates how equality and inequality constraints
can be used to control the properties of the fitting model $M(x, t)$, using
the rather noisy data (m = 30) shown in Figure 7.1.1 and a polynomial of
degree 9 (i.e., n = 10). In both plots the dashed line shows the
unconstrained fit.

Assume that we require that the model $M(x, t)$ must interpolate the point
$(t_{\mathrm{int}}, y_{\mathrm{int}}) = (0.5, 0.65)$ and have zero derivative
at the end points $t = 0$ and $t = 1$, i.e., $M'(x, 0) = M'(x, 1) = 0$. The
interpolation requirement corresponds to the equality constraint

    \[ (t_{\mathrm{int}}^9, t_{\mathrm{int}}^8, \ldots, t_{\mathrm{int}}, 1)\, x
       = y_{\mathrm{int}}, \]

while the constraint on $M'(x, t)$ has the form

    \[ (9t^8, 8t^7, \ldots, 2t, 1, 0)\, x = M'(x, t). \]

Hence the matrix and the right-hand side for the linear constraints in the
LSE problem (7.1.1) have the specific form

    \[ C^T = \begin{pmatrix}
         t_{\mathrm{int}}^9 & t_{\mathrm{int}}^8 & \cdots & t_{\mathrm{int}}^2 & t_{\mathrm{int}} & 1 \\
         0 & 0 & \cdots & 0 & 1 & 0 \\
         9 & 8 & \cdots & 2 & 1 & 0
       \end{pmatrix}, \qquad
       d = \begin{pmatrix} y_{\mathrm{int}} \\ 0 \\ 0 \end{pmatrix}. \]

The resulting constrained fit is shown as the solid line in the left plot in
Figure 7.1.1.

Now assume instead that we require that the model $M(x, t)$ be monotonically
non-decreasing in the data interval, i.e., that $M'(x, t) \geq 0$ for
$t \in [0, 1]$. If we impose this requirement at the data points
$t_1, \ldots, t_m$, we obtain a matrix $C$ in the LSI problem (7.1.6) of the
form

    \[ C^T = \begin{pmatrix}
         9t_1^8 & 8t_1^7 & \cdots & 2t_1 & 1 & 0 \\
         9t_2^8 & 8t_2^7 & \cdots & 2t_2 & 1 & 0 \\
         \vdots & \vdots &        & \vdots & \vdots & \vdots \\
         9t_m^8 & 8t_m^7 & \cdots & 2t_m & 1 & 0
       \end{pmatrix}, \]

while the two vectors with the bounds are $l = 0$ and $u = \infty$. The
resulting monotonic fit is shown as the solid line in the right plot in
Figure 7.1.1.
monotonic fit is shown as the solid line in the right plot in Figure 7.1.1.

Bound constraints
It is worthwhile to discuss the special case of bounded variables, i.e., when
C = I in (7.1.6) – also known as box constraints. Starting from a feasible
point (i.e., a vector x that satisfies the bounds), we iterate in the usual way
for an unconstrained nonlinear problem (see Chapter 9), until a constraint
is about to be violated. That identifies a face of the constraint box and a
particular variable whose bound is becoming active. We set that variable
to the bound value and project the search direction into the corresponding
face, in order to continue our search for a constrained minimum on it. See
Figure 7.1.2 for an illustration. If the method used is gradient oriented,
then this technique is called the gradient projection method [210].

Figure 7.1.2: Gradient projection method for a bound-constrained problem with
a solution at the upper right corner. From the starting point $x$ we move
along the search direction $h$ until we hit the active constraint in the third
coordinate, which brings us to the point $x_1$. In the next step we hit the
active constraint in the second coordinate, bringing us to the point $x_2$.
The third step brings us to the solution at $x_3$, in which all three
coordinates are at active constraints.

Example 76. The use of bound constraints can sometimes have a profound
impact on the LSQ solution, as illustrated in this example where we return
to the CT problem from Example 72. Here we add noise $e$ with relative noise
levels $\eta = \|e\|_2 / \|b\|_2$ equal to $10^{-3}$, $3 \cdot 10^{-3}$,
$10^{-2}$ and we enforce non-negativity constraints, i.e., $l = 0$ (and
$u = \infty$). Figure 7.1.3 compares the LSQ solutions (bottom row) with the
non-negativity constrained NNLS solutions (top row), and we see that even for
the largest noise level $10^{-2}$ the NNLS solution includes recognizable
small features – which are lost in the LS solution even for the smaller noise
level $3 \cdot 10^{-3}$.

Figure 7.1.3: Reconstructions of the CT problem for three different relative
noise levels $\eta = \|e\|_2 / \|b\|_2$. Top: LSQ solutions with
non-negativity constraints. Bottom: standard LSQ solutions.
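For bound-constrained problems of this kind, off-the-shelf solvers are readily
available; the sketch below (an illustration on synthetic data, not the code
used for Figure 7.1.3) uses SciPy's nnls for the non-negativity case and
lsq_linear for general box constraints.

    import numpy as np
    from scipy.optimize import nnls, lsq_linear

    rng = np.random.default_rng(2)
    A = rng.standard_normal((100, 40))
    x_true = np.clip(rng.standard_normal(40), 0, None)   # a non-negative "image"
    b = A @ x_true + 1e-3 * rng.standard_normal(100)     # noisy right-hand side

    x_nnls, rnorm = nnls(A, b)                    # min ||A x - b||_2  s.t.  x >= 0
    res = lsq_linear(A, b, bounds=(0, np.inf))    # same constraint via lsq_linear
    print(rnorm, np.linalg.norm(A @ res.x - b))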

Sensitivity analysis

Eldén [74] gave a complete sensitivity analysis for problem LSE (7.1.1),
including perturbations of all quantities involved in this problem; here we
specialize his results to the case where only the right-hand side $b$ is
perturbed. Specifically, if $\tilde{x}_C^*$ denotes the solution with the
perturbed right-hand side $b + e$, then Eldén showed that

    \[ \|\tilde{x}_C^* - x_C^*\|_2 \leq \|A_C^{\dagger}\|_2\, \|e\|_2, \qquad
       A_C = A\, (I - (C^{\dagger})^T C^T). \]

To derive a simpler expression for the matrix $A_C$ consider the QR
factorization $C = Q_1 R_1$ introduced above. Then
$C^{\dagger} = R_1^{-1} Q_1^T$ and it follows
that if $Q = ( Q_1 \ \ Q_2 )$ then
$I - (C^{\dagger})^T C^T = I - Q_1 Q_1^T = Q_2 Q_2^T$ and hence
$A_C = A\, Q_2 Q_2^T = \tilde{A}_2 Q_2^T$. Moreover, we have
$A_C^{\dagger} = Q_2 \tilde{A}_2^{\dagger}$ and thus
$\|A_C^{\dagger}\|_2 = \|\tilde{A}_2^{\dagger}\|_2$. The following theorem
then follows immediately.
Theorem 77. Let $x_C^*$ and $\tilde{x}_C^*$ denote the solutions to problem
LSE (7.1.1) with right-hand sides $b$ and $b + e$, respectively. Then,
neglecting second-order terms,

    \[ \frac{\|\tilde{x}_C^* - x_C^*\|_2}{\|x_C^*\|_2} \leq
       \mathrm{cond}(\tilde{A}_2)\,
       \frac{\|e\|_2}{\|\tilde{A}_2\|_2\, \|x_C^*\|_2}. \]

This implies that the equality-constrained LS solution is typically less
sensitive to perturbations than the unconstrained one, since the condition
number of $\tilde{A}_2 = A Q_2$ is less than or equal to the condition number
of $A$.

Least squares with quadratic constraints (LSQI)

If we add a quadratic constraint to the least squares problem, then we obtain
a problem of the form

    Problem LSQI:   $\min_x \|b - A x\|_2$  subject to  $\|d - B x\|_2 \leq \alpha$,   (7.1.7)

where $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times n}$. We
assume that

    \[ \mathrm{rank}(B) = r \quad \text{and} \quad
       \mathrm{rank}\begin{pmatrix} A \\ B \end{pmatrix} = n, \]

which guarantees a unique solution of the LSQI problem. Least squares
problems with quadratic constraints arise in many applications, including
ridge regression in statistics, Tikhonov regularization of inverse problems,
and generalized cross-validation; we refer to [121] for more details.
To facilitate the analysis it is convenient to transform the problem into
“diagonal” form by using the generalized singular value decomposition (GSVD)
from Section 3.3:

    \[ U^T A X = D_A, \quad V^T B X = D_B, \quad
       \tilde{b} = U^T b, \quad \tilde{d} = V^T d, \quad x = X y. \]

The matrices $D_A, D_B$ are diagonal with non-negative elements
$\alpha_1, \alpha_2, \ldots, \alpha_n$ and $\beta_1, \beta_2, \ldots, \beta_q$,
where $q = \min\{p, n\}$ and there are $r$ values $\beta_i > 0$. The
reformulated problem is now

    \[ \min_y \|\tilde{b} - D_A y\|_2^2 = \min_y \left[ \sum_{i=1}^{n}
       \left( \alpha_i y_i - \tilde{b}_i \right)^2 +
       \sum_{i=n+1}^{m} \tilde{b}_i^2 \right] \qquad (7.1.8) \]

    \[ \text{subject to} \quad \|\tilde{d} - D_B y\|_2^2 = \sum_{i=1}^{r}
       \left( \beta_i y_i - \tilde{d}_i \right)^2 +
       \sum_{i=r+1}^{p} \tilde{d}_i^2 \leq \alpha^2. \qquad (7.1.9) \]

A necessary and sufficient condition for the existence of a solution is that
$\sum_{i=r+1}^{p} \tilde{d}_i^2 \leq \alpha^2$. The way to solve problem
(7.1.8)–(7.1.9) will depend on the size of the term
$\sum_{i=r+1}^{p} \tilde{d}_i^2$.
1. If $\sum_{i=r+1}^{p} \tilde{d}_i^2 = \alpha^2$, the only $y$ that can
satisfy the constraints has as first $r$ elements
$y_i = \tilde{d}_i / \beta_i$, $i = 1, \ldots, r$. The remaining elements
$y_i$ are defined from the minimization of (7.1.8). The minimum is attained
if $\sum_{i=1}^{n} (\alpha_i y_i - \tilde{b}_i)^2 = 0$ or is as small as
possible, so that for $i = r + 1, \ldots, n$ we set
$y_i = \tilde{b}_i / \alpha_i$ if $\alpha_i \neq 0$ and $y_i = 0$ otherwise.

2. If $\sum_{i=r+1}^{p} \tilde{d}_i^2 < \alpha^2$, one could use the Lagrange
multipliers method directly – cf. Appendix C.2.1 – but it is also possible to
try a simpler approach to reach a feasible solution: define the vector $y$
that minimizes (7.1.8), which implies choosing $y_i = \tilde{b}_i / \alpha_i$
if $\alpha_i \neq 0$ as before. If $\alpha_i = 0$, then try to make the
left-hand side of the constraints as small as possible by defining
$y_i = \tilde{d}_i / \beta_i$ if $\beta_i \neq 0$ or else $y_i = 0$.

3. If the resulting solution $y$ is feasible, i.e., it satisfies the
constraints (7.1.9), then $x = X y$ is the LSQI solution.

4. If not, look for a solution on the boundary of the feasible set, i.e.,
where the constraint is satisfied with equality. That is the standard form
for the use of Lagrange multipliers, so the problem is now, for

    \[ \Phi(y; \mu) = \|\tilde{b} - D_A y\|_2^2 -
       \mu \left( \|\tilde{d} - D_B y\|_2^2 - \alpha^2 \right), \qquad (7.1.10) \]

   find $y_{\mu}$ and $\mu$ so that $\nabla \Phi = 0$.

It can be shown that the solution $y_{\mu}$ in step 4 satisfies the “normal
equations”

    \[ (D_A^T D_A + \mu D_B^T D_B)\, y_{\mu} =
       D_A^T \tilde{b} + \mu D_B^T \tilde{d}, \qquad (7.1.11) \]

where $\mu$ satisfies the secular equation:

    \[ \chi(\mu) = \|\tilde{d} - D_B y_{\mu}\|_2^2 - \alpha^2 = 0. \]

An iterative Newton-based procedure can be defined as follows. Starting from
an initial guess $\mu_0$, at each successive step calculate $y_{\mu_i}$ from
(7.1.11), then compute a Newton step for the secular equation obtaining a new
value $\mu_{i+1}$. It can be shown that this iteration is monotonically
convergent to a unique positive root if one starts with an appropriate
positive initial value and if instead of $\chi(\mu) = 0$ one uses the more
convenient form

    \[ \frac{1}{\|\tilde{d} - D_B y_{\mu}\|_2^2} - \frac{1}{\alpha^2} = 0. \]
Therefore the procedure can be used to obtain the unique solution of the LSQI
problem. This is the most stable, but also the most expensive, numerical
algorithm. If instead of using the GSVD reformulation one works with the
original equations (7.1.7), an analogous Newton-based method can be applied,
in which the first stage at each step is

    \[ \min_{x_{\mu}} \left\| \begin{pmatrix} A \\ \sqrt{\mu}\, B \end{pmatrix} x_{\mu}
       - \begin{pmatrix} b \\ \sqrt{\mu}\, d \end{pmatrix} \right\|_2. \]

Efficient methods for solution of this kind of least squares problem have been
studied for several particular cases; see [20] for details.
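The following sketch illustrates this approach on the original equations
(7.1.7); for simplicity it replaces the Newton step on the secular equation by
bisection on $\mu$ (a simplification made here, not the book's algorithm),
using the stacked least squares problem above at each trial value.

    import numpy as np

    def lsqi_bisection(A, b, B, d, alpha, mu_hi=1.0, tol=1e-8, max_iter=100):
        """Approximately solve min ||b - A x||_2  s.t.  ||d - B x||_2 <= alpha
        by searching for mu >= 0 with ||d - B x_mu||_2 = alpha, where x_mu
        solves min || [A; sqrt(mu) B] x - [b; sqrt(mu) d] ||_2."""
        def solve(mu):
            M = np.vstack([A, np.sqrt(mu) * B])
            rhs = np.concatenate([b, np.sqrt(mu) * d])
            return np.linalg.lstsq(M, rhs, rcond=None)[0]

        x = np.linalg.lstsq(A, b, rcond=None)[0]        # unconstrained minimizer
        if np.linalg.norm(d - B @ x) <= alpha:          # already feasible
            return x
        while np.linalg.norm(d - B @ solve(mu_hi)) > alpha:   # bracket the root
            mu_hi *= 10.0
        mu_lo = 0.0
        for _ in range(max_iter):                       # bisection on mu
            mu = 0.5 * (mu_lo + mu_hi)
            x = solve(mu)
            g = np.linalg.norm(d - B @ x) - alpha
            if abs(g) < tol:
                break
            mu_lo, mu_hi = (mu, mu_hi) if g > 0 else (mu_lo, mu)
        return x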

7.2 Missing data problems


The problem of missing data occurs frequently in scientific research. It
may be the case that in some experimental plan, where observations had to
be made at regular intervals, occasional omissions arise. Examples would
be clinical trials with incomplete case histories, editing of incomplete sur-
veys or, as in the example given at the end of this section, gene expression
microarray data, where some values are missing. Let us assume that the
missing data are MCAR (missing completely at random), i.e., the proba-
bility that a particular data element is missing is unrelated to its value or
any of the variables. For example, in the case of data arrays, independent
of the column or row.
The appropriate technique for data imputation (a statistical term, mean-
ing the estimation of missing values), will depend among other factors, on
the size of the data set, the proportion of missing values and on the type
of missing data pattern. If only a few values are missing, say, 1–5%, one
could use a single regression substitution: i.e., predict the missing values
using linear regression with the available data and assign the predicted
value to the missing score. The disadvantage of this approach is that this
information is only determined from the – now reduced – available data set.
However, with MCAR data, any subsequent statistical analysis remains un-
biased. This method can be improved by adding to the predicted value a
residual drawn to reflect uncertainty (see [152], Chapter 4).
Other classic processes to fill in the data to obtain a complete set are
as follows:

• Listwise deletion: omit the cases with missing values and work with
the remaining set. It may lead to a substantial decrease in the avail-
able sample size, but the parameter estimates are unbiased.

• Hot deck imputation: replace the missing data with a random value
drawn from the collection of data of similar participants. Although
widely used in some applications, there is scarce information about the
theoretical properties of the method.

• Mean substitution: substitute a mean of the available data for the


missing values. The mean may be formed within classes of data. The
mean of the resulting set is unchanged, but the variance is underesti-
mated.

More computationally intensive approaches based on least squares and


maximum-likelihood principles have been studied extensively in the past
decades and a number of software packages that implement the procedures
have been developed.

Maximum likelihood estimation


These methods rely on probabilistic modeling, where we wish to find the
maximum likelihood (ML) estimate for the parameters of a model, including
both the observed and the missing data.

Expectation-maximization (or EM) algorithm

One ML algorithm is the expectation-maximization algorithm. This algo-


rithm estimates the model parameters iteratively, starting from some initial
guess of the ML parameters, using, for example, a model for the listwise
deleted data. Then follows a recursion until the parameters stabilize, each
step containing two processes.

• E-step: the distribution of the missing data is estimated given the


observed data and the current estimate of the parameters.

• M-step: the parameters are re-estimated to those with maximum like-


lihood, assuming the complete data set generated in the E-step.

Once the iteration has converged, the final maximum likelihood estimates
of the regression coefficients are used to estimate the final missing data.
It has been proved that the method converges, because at each step the
likelihood is non-decreasing, until a local maximum is reached, but the
convergence may be slow and some acceleration method must be applied.
The global maximum can be obtained by starting the iteration several times
with randomly chosen initial estimates.
For additional details see [149, 152, 220]. For software IBM SPSS: miss-
ing value analysis module. Also free software such as NORM is available
from [163].

Multiple imputation (MI)


Instead of filling in a single value for each missing value, Rubin’s multiple
imputation procedure [152], replaces each missing value with a set of plau-
sible values that represent the uncertainty about the right value to impute.
Multiple imputation (MI) is a Monte Carlo simulation process in which a
number of full imputed data sets are generated. Statistical analysis is per-
formed on each set, and the results are combined [68] to produce an overall
analysis that takes the uncertainty of the imputation into consideration.
Depending on the fraction of missing values, a number between 3 and 10
sets must be generated to give good final estimates.
The critical step is the generation of the imputed data set. The choice
of the method used for this generation depends on the type of the missing
data pattern (see, for example, [132]). For monotone patterns, a parametric
regression method can be used, where a regression model is fitted for the
variable with missing values with other variables as covariates (there is a
hierarchy of missingness: if $z_b$ is observed, then $z_a$ for $a < b$ is also
observed). This procedure allows the incorporation of additional information
into the model, for example, to use predictors that one knows are related
to the missing data. Based on the resulting model, a new regression model
is simulated and used to impute the missing values.
For arbitrary missing data patterns, a computationally expensive Markov
Chain Monte Carlo (MCMC) method, based on an assumption of multi-
variate normality, can be used (see [220]). MI is robust to minor departures
from the normality assumption, and it gives reasonable results even in the
case of a large fraction of missing data or small size of the samples.
For additional information see [132, 149, 152]. For software see [132].

Least squares approximation

We consider here the data as a matrix A with m rows and n columns, which is
approximated by a low-rank model matrix. The disadvantage is that no
information about the data distribution is included, which may be important
when the data belong to a complex distribution. Two types of methods are in
common use: SVD-based and local least squares imputations.

SVD-based imputation

In this method, the singular value decomposition is used to obtain a set of
orthogonal vectors that are linearly combined to estimate the missing data
values. As the SVD can only be performed on complete matrices, one works with
an auxiliary matrix $A'$ obtained by substituting any missing position in $A$
by a row average. The SVD of $A'$ is as usual $A' = U \Sigma V^T$. We select
first the $k$ most significant right singular vectors of $A'$.

Then, one estimates the missing (i, j) value of A by first regressing the row
i against the significant right singular vectors and then using a linear
combination of the regression coefficients to compute an estimate for the
element (i, j). (Note that the jth components of the right singular vectors
are not used in the regression.) This procedure is repeated until it
converges, and an important point is that the convergence depends on the
configuration of the missing entries.
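A simplified variant of this iteration is sketched below: instead of the
row-wise regression just described it simply refills the missing positions
from a rank-k truncated SVD of the current estimate and repeats; the rank k
and the convergence tolerance are assumptions of this illustration.

    import numpy as np

    def svd_impute(A, k=3, max_iter=100, tol=1e-6):
        """Iterative SVD-based imputation of the NaN entries of A (simplified:
        missing entries are refilled from the rank-k truncated SVD each sweep)."""
        missing = np.isnan(A)
        X = A.copy()
        row_means = np.nanmean(A, axis=1, keepdims=True)
        X[missing] = np.broadcast_to(row_means, A.shape)[missing]   # initial fill
        for _ in range(max_iter):
            U, s, Vt = np.linalg.svd(X, full_matrices=False)
            Xk = (U[:, :k] * s[:k]) @ Vt[:k, :]       # rank-k approximation
            change = np.linalg.norm(X[missing] - Xk[missing])
            X[missing] = Xk[missing]
            if change < tol:
                break
        return X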

Local least squares imputation


We will illustrate the method with the imputation of a DNA microarray.
The so-called DNA microarray, or DNA chip, is a technology used to ex-
periment with many genes at once. To this end, single strands of comple-
mentary DNA for the genes of interest – which can be many thousands –
are immobilized on spots arranged in a grid (“array”) on a support that will
typically be a glass slide, a quartz wafer or a nylon membrane.
The data from microarray experiments are usually in the form of large
matrices of expression levels of genes (gene expression is the process by
which information from a gene is used in the synthesis of a functional gene
product), under different experimental conditions and frequently with some
values missing. Missing values occur for diverse reasons, including insuf-
ficient resolution, image corruption or simply due to dust or scratches on
the slide. Robust missing value estimation methods are needed, since many
algorithms for gene expression analysis require a complete matrix of gene
array values.
One method, developed by Kim, Golub and Park [145] for estimating
the missing values, is a least squares-based imputation approach using the
concept of coherent genes. The method is a local least squares (LLS) algo-
rithm, since only the genes with high similarity with the target gene, the
one with incomplete values, are used. The coherent genes are identified
using similarity measures based on the 2 -norm or the Pearson correlation
coefficient. Once identified, two approaches are used to estimate the miss-
ing values, depending on the relative sizes between the number of selected
similar genes and the available experiments:

1. Missing values can be estimated either by representing the target gene


with missing values as a linear combination of the similar genes.

2. The target experiment that has missing values can be represented as


a linear combination of related experiments.

Denote with $G \in \mathbb{R}^{m \times n}$ a gene expression data matrix with
m genes and n experiments and assume $m \gg n$. The row $g_i^T$ of $G$
represents expressions of the ith gene in n experiments. Assume for now that
only one value is missing and it corresponds to the first position of the
first gene, $G(1, 1) = g_1(1)$, denoted by $\alpha$ for simplicity.
Now, among the genes with a known first position value, either the k
nearest-neighbor gene vectors are located using the $\ell_2$-norm or the k
most similar genes are identified using the Pearson correlation coefficient.
Then, based on these k closest gene vectors, a matrix
$A \in \mathbb{R}^{k \times (q-1)}$ and two vectors $b \in \mathbb{R}^k$ and
$w \in \mathbb{R}^{q-1}$ are formed. The k rows of the matrix A consist of
the k closest gene vectors with their first values deleted; q varies depending
on the number of known values in these similar genes (i.e., the number of
experiments recorded successfully). The elements of b consist of the first
values of these gene vectors, and the elements of w are the $q - 1$ known
elements of $g_1$.
When $k < q - 1$, the missing value in the target gene is approximated using
the same-position value of the nearest genes:

1. Solve the local least squares problem $\min_x \|A^T x - w\|_2$.

2. Estimate the missing value as a linear combination of the first values of
   the coherent genes, $\alpha = b^T x$.

When $k \geq q - 1$, on the other hand, the missing value is estimated using
the experiments:

1. Solve the local least squares problem $\min_x \|A x - b\|_2$.

2. Estimate the missing value as a linear combination of values of the
   experiments, not taking into account the first experiment in the gene
   $g_1$, i.e., $\alpha = w^T x$.
An improvement is to add weights of similarity for the k nearest neighbors,
leading to weighted LSQ problems. In the actual DNA microarray data,
missing values may occur in several locations. For each missing value the
arrays A, b and w are generated and the LLS solved. When building the
matrix A for a missing value, the already estimated values of the gene are
taken into consideration.
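As an illustration of the $k < q - 1$ case above (a sketch only; the neighbor
selection is restricted to the $\ell_2$-norm and the helper name is our own),
one missing entry could be estimated as follows.

    import numpy as np

    def lls_impute_one(G, i0, j0, k=10):
        """Estimate the missing entry G[i0, j0] (stored as NaN) from the k rows
        most similar to row i0 in the 2-norm (the k < q-1 case described above)."""
        target = G[i0]
        known = ~np.isnan(target)
        known[j0] = False                    # the position we want to estimate
        # candidate rows must be observed at j0 and at all positions in `known`
        cand = [i for i in range(G.shape[0]) if i != i0
                and not np.isnan(G[i, j0]) and not np.isnan(G[i][known]).any()]
        dist = [np.linalg.norm(G[i][known] - target[known]) for i in cand]
        nearest = [cand[j] for j in np.argsort(dist)[:k]]
        A = G[np.ix_(nearest, np.flatnonzero(known))]   # k x (q-1) coherent genes
        b = G[nearest, j0]                              # their values at position j0
        w = target[known]                               # known values of the target
        x, *_ = np.linalg.lstsq(A.T, w, rcond=None)     # min || A^T x - w ||_2
        return b @ x                                    # alpha = b^T x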
An interesting result, based on experiments with data from the Stan-
ford Microarray Database (SMD), is that the most robust missing value
estimation method is the one based on representing a target experiment
that has a missing value as a linear combination of the other experiments.
A program called LSimpute is available from the authors of [145].
There are no theoretical results comparing different imputation algo-
rithms, but the test results of [145, 238] are consistent and suggest that
the above-described method is more robust for noisy data and less sensitive
to k, the number of nearest neighbors used, than the SVD method. The
appropriate choice of the k closest genes is still a matter of trial and error,
although some experiments with random matrices point to thresholds for
it [156]. See also [250].
7.3 Total least squares (TLS)

The assumption used until now for the LSQ problem is that errors are
restricted to the right-hand side b, i.e., the linear model is $Ax = b + r$,
where r is the residual. A new problem arises when the data matrix A is also
not known exactly, so both A and b have random errors. For simplicity, the
statistical hypothesis will be that the rows of the errors are independent and
have a uniform distribution with zero mean and common variance (a more general
case is treated in [106]). This leads to the total least squares (TLS) problem
of calculating a vector $x_{\mathrm{TLS}}$ so that the augmented residual
matrix $( E \ \ r )$ is minimized:

    Problem TLS:   $\min \|( E \ \ r )\|_F^2$  subject to  $(A + E)\, x = b + r$.   (7.3.1)

We note that

    \[ \|( E \ \ r )\|_F^2 = \|E\|_F^2 + \|r\|_2^2. \]


The relation to ordinary least squares problems is as follows:

• In the least squares approximation, one replaces the inconsistent problem
  $A x = b$ by the consistent system $A x = b'$, where $b'$ is the vector
  closest to b in range(A), obtained by minimizing the orthogonal distance to
  range(A).

• In the total least squares approximation, one goes further and replaces the
  inconsistent problem $A x = b$ by the consistent system $A' x = b'$, finding
  the closest $A'$ and $b'$ to A and b, respectively, by minimizing
  simultaneously the sum of squared orthogonal distances from the columns
  $a_i$ to $a_i'$ and from b to $b'$.
Example 78. To illustrate the idea behind TLS we consider the following small
problem:

    \[ A = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad
       b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \]

The TLS solution $x_{\mathrm{TLS}} = 1.618$ is obtained with

    \[ E = \begin{pmatrix} -0.276 \\ 0.447 \end{pmatrix}, \qquad
       r = \begin{pmatrix} 0.171 \\ -0.276 \end{pmatrix}. \]

Figure 7.3.1 illustrates the geometrical aspects of this simple problem. We
see that the perturbed right-hand side $b + r$ and the perturbed first (and
only) column $A + E$ are both orthogonal projections of b and A, respectively,
on the subspace range(A + E). The perturbations r and E are orthogonal to
$b + r$ and $A + E$, respectively, and they are chosen such that their lengths
are minimal. Then $x_{\mathrm{TLS}}$ is the ratio between the lengths of these
two projections.
Figure 7.3.1: Illustration of the geometry underlying the LSQ problem (left)
and the TLS problem (right). The LSQ solution $x^*$ is chosen such that the
residual $b' - b$ (the dashed line) is orthogonal to the vector $b' = A x^*$.
The TLS solution $x_{\mathrm{TLS}}$ is chosen such that the residuals $b' - b$
and $A' - A$ (the dashed lines) are orthogonal to $b' = A' x_{\mathrm{TLS}}$
and $A'$, respectively.

We will only consider the case when rank(A) = n and
$b \notin \mathrm{range}(A)$. The trivial case when
$b \in \mathrm{range}(A)$ means that the system is consistent,
$( E \ \ r ) = 0$ and the TLS solution is identical to the LSQ solution.

We note that the rank-deficient matrix case, rank(A) < n, has in principle
only a solution in the trivial case $b \in \mathrm{range}(A)$. In the general
case it is treated by reducing the TLS problem to a smaller, full-rank problem
using column subset selection techniques. More details are given in ([239],
section 3.4). The work of Paige and Strakoš [180] on core problems is also
relevant here. The following example, taken from [105], p. 595, illustrates
this case.

Example 79. Consider the problem defined by

    \[ A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
       b = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}. \]

The rank of A is 1 and the matrix is rank deficient, with
$b \notin \mathrm{range}(A)$, so there is no solution of the problem as is.
Note that

    \[ E_{\varepsilon} = \begin{pmatrix} 0 & 0 \\ 0 & \varepsilon \\
       0 & \varepsilon \end{pmatrix} \]

is a perturbation such that for any $\varepsilon \neq 0$ we have
$b \in \mathrm{range}(A + E_{\varepsilon})$, but there is no smallest
$\|E_{\varepsilon}\|_F$.
The TLS problem and the singular value decomposition

Let us rewrite the TLS problem as a system:

    \[ ( A \ \ b ) \begin{pmatrix} x \\ -1 \end{pmatrix} +
       ( E \ \ r ) \begin{pmatrix} x \\ -1 \end{pmatrix} = 0, \]

or equivalently

    \[ C z + F z = 0 \qquad \text{with} \qquad
       z = \begin{pmatrix} x \\ -1 \end{pmatrix}, \]

where $C = ( A \ \ b )$ and $F = ( E \ \ r )$ are matrices of size
$m \times (n + 1)$. We seek a solution $z$ to the homogeneous equation

    \[ (C + F)\, z = 0 \qquad (7.3.2) \]

that minimizes $\|F\|_F^2$. For the TLS problem to have a non-trivial solution
$z$, the matrix $C + F$ must be singular, i.e.,
$\mathrm{rank}(C + F) < n + 1$. To attain this, the SVD

    \[ C = ( A \ \ b ) = U \Sigma V^T = \sum_{i=1}^{n+1} \sigma_i u_i v_i^T \]

can be used to determine an acceptable perturbation matrix $F$. The singular
values of $A$, here denoted by
$\sigma_1' \geq \sigma_2' \geq \cdots \geq \sigma_n'$, will also enter in the
discussion.

The solution technique is easily understood in the case when the singular
values of $C$ satisfy
$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > \sigma_{n+1}$, i.e., when
the smallest singular value is isolated. Since we are considering the
full-rank, non-trivial case, we also have $\sigma_{n+1} > 0$.

From the Eckart-Young-Mirski Theorem 43, the matrix nearest to $C$ with rank
lower than $n + 1$ is at distance $\sigma_{n+1}$ from $C$, and it is given by
$\sum_{i=1}^{n} \sigma_i u_i v_i^T$. Thus, selecting
$C + F = \sum_{i=1}^{n} \sigma_i u_i v_i^T$ implies choosing a perturbation
$F = -\sigma_{n+1} u_{n+1} v_{n+1}^T$ with minimal norm:
$\|F\|_F = \sigma_{n+1}$. The solution $z$ is now constructed using the fact
that the $v_i$ are orthonormal, and therefore $(C + F)\, v_{n+1} = 0$. Thus,
a general solution of the TLS problem is obtained by scaling the right
singular vector $v_{n+1}$ corresponding to the smallest singular value, in
order to enforce the condition that $z_{n+1} = -1$:

    \[ z_i = \frac{v_{i,n+1}}{-v_{n+1,n+1}}, \qquad
       i = 1, 2, \ldots, n + 1. \qquad (7.3.3) \]

This is possible provided that $v_{n+1,n+1} \neq 0$. If $v_{n+1,n+1} = 0$, a
solution does not exist and the problem is called nongeneric.
does not exist and the problem is called nongeneric.
A theorem proved in [106] gives sufficient conditions for the existence of a
unique solution. If the smallest singular value $\sigma_n'$ of the full-rank
matrix $A$ is strictly larger than the smallest singular value
$\sigma_{n+1}$ of $( A \ \ b ) = C$, then $v_{n+1,n+1} \neq 0$ and a unique
solution exists.

Theorem 80. Denote by $\sigma_1, \ldots, \sigma_{n+1}$ the singular values of
the augmented matrix $( A \ \ b )$ and by $\sigma_1', \ldots, \sigma_n'$ the
singular values of the matrix $A$. If $\sigma_n' > \sigma_{n+1}$, then there
is a TLS correction matrix
$( E \ \ r ) = -\sigma_{n+1} u_{n+1} v_{n+1}^T$ and a unique solution vector

    \[ x_{\mathrm{TLS}} = -(v_{1,n+1}, \ldots, v_{n,n+1})^T / v_{n+1,n+1} \qquad (7.3.4) \]

that solves the TLS problem. Moreover, there are closed-form expressions for
the solution:

    \[ x_{\mathrm{TLS}} = (A^T A - \sigma_{n+1}^2 I)^{-1} A^T b \]

and the residual norm:

    \[ \|A\, x_{\mathrm{TLS}} - b\|_2^2 = \sigma_{n+1}^2 \left( 1 +
       \sum_{i=1}^{n} \frac{(u_i^T b)^2}{(\sigma_i')^2 - \sigma_{n+1}^2} \right). \]
The following example, taken from pp. 179 in [20], illustrates the difficulty
associated with the nongeneric case in terms of the SVDs.

Example 81. Consider the augmented matrix:

    \[ ( A \ \ b ) = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \qquad
       \text{where} \qquad A = \begin{pmatrix} 1 \\ 0 \end{pmatrix}. \]

The smallest singular value of $A$ is $\sigma_1' = 1$, and for the augmented
matrix it is $\sigma_2 = 1$. So here $\sigma_n' = \sigma_{n+1}$ and
$v_{n+1,n+1} = 0$, and therefore no solution exists. The formal algorithm
would choose the perturbation matrix from the dyadic form
$( A + E \ \ \ b + r ) = \sum_{i=1}^{n} \sigma_i u_i v_i^T$; then

    \[ ( A + E \ \ \ b + r ) = \begin{pmatrix} 0 & 0 \\ 0 & 2 \end{pmatrix} \]

and it is easy to see that $b + r \notin \mathrm{range}(A + E)$, and there is
no solution.
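In the generic case, Theorem 80 translates directly into a few lines of code;
the sketch below (an illustration, not the PTLS software mentioned later)
computes $x_{\mathrm{TLS}}$ from the right singular vector of $( A \ \ b )$
and simply raises an error in a nongeneric situation such as Example 81.

    import numpy as np

    def tls(A, b):
        """Generic-case TLS solution via the SVD of the augmented matrix (A b)."""
        C = np.column_stack([A, b])
        _, _, Vt = np.linalg.svd(C)
        v = Vt[-1, :]                 # right singular vector for sigma_{n+1}
        if abs(v[-1]) < np.finfo(float).eps:
            raise ValueError("nongeneric TLS problem: v[n+1, n+1] is zero")
        return -v[:-1] / v[-1]        # formula (7.3.4)

    # the small problem from Example 78
    A = np.array([[1.0], [0.0]])
    b = np.array([1.0, 1.0])
    print(tls(A, b))                  # approx. 1.618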
There are further complications with the nongeneric case, but we shall
not go into these issues here primarily because they seem to have been
resolved by the recent work of Paige and Strakoš [180].
In [239] p. 86, conditions for the existence and uniqueness of solutions
are given, as well as closed form expressions for these solutions. In [239]
p. 87, a general algorithm for the TLS problem is described. It solves
any generic and nongeneric TLS problem, including the case with multiple
right-hand sides. The software developed by van Huffel is available from
Netlib [263]. Subroutine PTLS solves the TLS problem by using the partial
singular value decomposition (PSVD) mentioned in Section 5.6, thereby
improving considerably the computational efficiency with respect to the
classical TLS algorithm. A large-scale algorithm based on Rayleigh quotient
iteration is described in [25]. See [155] for a recent survey of TLS methods.

Geometric interpretation

It is possible to give a geometric interpretation of the difference between
the fits using the least squares and total least squares methods. Define the
rows of the matrix $( A \ \ b )$ as the m points
$c_i = (a_{i1}, \ldots, a_{in}, b_i)^T \in \mathbb{R}^{n+1}$, to which we try
to fit the linear subspace $S_x$ of dimension n that is orthogonal to the
vector $z = \begin{pmatrix} x \\ -1 \end{pmatrix}$:

    \[ S_x = \mathrm{span}\left\{ \begin{pmatrix} x \\ -1 \end{pmatrix} \right\}^{\perp}
       = \left\{ \begin{pmatrix} a \\ b \end{pmatrix} \,\middle|\,
         a \in \mathbb{R}^n,\ b \in \mathbb{R},\ b = x^T a \right\}. \]

In LSQ, one minimizes

    \[ \left\| ( A \ \ b ) \begin{pmatrix} x \\ -1 \end{pmatrix} \right\|_2^2
       = \sum_{k=1}^{m} (c_k^T z)^2, \qquad
       z = \begin{pmatrix} x \\ -1 \end{pmatrix}, \]
which measures the sum of squared distances along the $(n+1)$st coordinate
axis from the points $c_k$ to the subspace $S_x$ (already known as the
residuals $r_i$). For the case $n = 1$, where
$A = (a_1, a_2, \ldots, a_m)^T$ and $x$ reduces to a single unknown scalar
$x$, the LSQ approximation minimizes the vertical distance of the points
$(a_i, b_i)$ to the line through the origin with slope $x$ (whose normal
vector is $(x, -1)^T$); see Figure 7.3.2.

Figure 7.3.2: Illustration of LSQ and TLS for the case n = 1. The ith data
point is $(a_i, b_i)$ and the point vertically above it on the line is
$(a_i, a_i \cdot x)$, hence the vertical distance is
$r_i = a_i \cdot x - b_i$. The orthogonal distance to the line is
$d_i = (x \cdot a_i - b_i)/\sqrt{x^2 + 1}$.
The TLS approach can be formulated as an equivalent problem. From the SVD of
$( A \ \ b )$ and the definition of the matrix 2-norm we have

    \[ \|( A \ \ b )\, z\|_2 \,/\, \|z\|_2 \geq \sigma_{n+1} \]

for any nonzero $z$. If $\sigma_{n+1}$ is isolated and the vector $z$ has unit
norm, then $z = \pm v_{n+1}$ and the inequality becomes an equality,
$\|( A \ \ b )\, z\|_2^2 = \sigma_{n+1}^2$. So, in fact the TLS problem
consists of finding a vector $x \in \mathbb{R}^n$ such that

    \[ \frac{\left\| ( A \ \ b ) \begin{pmatrix} x \\ -1 \end{pmatrix} \right\|_2^2}
            {\left\| \begin{pmatrix} x \\ -1 \end{pmatrix} \right\|_2^2}
       = \frac{\|A x - b\|_2^2}{\|x\|_2^2 + 1} = \sigma_{n+1}^2. \qquad (7.3.5) \]

The left-hand quantity is
$\sum_{i=1}^{m} \frac{(a_i^T x - b_i)^2}{x^T x + 1}$, where the $i$th term is
the square of the true Euclidean (or orthogonal) distance from $c_i$ to the
nearest point in the subspace $S_x$. Again, see Figure 7.3.2 for an
illustration in the case $n = 1$.

The equivalent TLS formulation
$\min_x \|A x - b\|_2^2 / (\|x\|_2^2 + 1)$ is convenient for regularizing the
TLS problem additively; see the next section. For more details see
Section 10.2.

Further aspects of TLS problems

The sensitivity of the TLS problem

A study of the sensitivity of the TLS problem when there is a unique solution,
i.e., when $\sigma_n' > \sigma_{n+1}$, is given in [106]. The starting point
for the analysis is the formulation of the TLS problem as an eigenvalue
problem for $C^T C = \sum_{i=1}^{n+1} v_i \sigma_i^2 v_i^T$ (i.e.,
$\sigma_i^2$ are the eigenvalues of $C^T C$ and $v_i$ are the corresponding
eigenvectors). The main tool used is the singular value interlacing property,
Theorem 4 from Appendix B. Thus, if $x \in \mathbb{R}^n$ and

    \[ C^T C \begin{pmatrix} x \\ -1 \end{pmatrix} =
       \sigma_{n+1}^2 \begin{pmatrix} x \\ -1 \end{pmatrix}, \]

then $x$ solves the TLS problem.

One interesting result is that the total least squares problem can be
considered as a de-regularization process, which is apparent when looking at
the first row of the above eigenvalue problem:

    \[ (A^T A - \sigma_{n+1}^2 I)\, x_{\mathrm{TLS}} = A^T b. \]

This implies that the TLS problem is worse conditioned than the associated LSQ
problem.

An upper bound for the difference of the LSQ and TLS solutions is

    \[ \frac{\|x_{\mathrm{TLS}} - x^*\|_2}{\|x^*\|_2} \leq
       \frac{\sigma_{n+1}^2}{(\sigma_n')^2 - \sigma_{n+1}^2}, \]

so that $(\sigma_n')^2 - \sigma_{n+1}^2$ measures the sensitivity of the TLS
problem. This suggests that the TLS solution is unstable (the bounds are
large) if $\sigma_n'$ is close to $\sigma_{n+1}$, for example, if
$\sigma_{n+1}$ is (almost) a multiple singular value. Stewart [231] proves
that, up to second-order terms in the error, $x_{\mathrm{TLS}}$ is insensitive
to column scalings of $A$. Thus, unlike ordinary LSQ problems, TLS cannot be
better conditioned by column scalings.

The mixed LS-TLS problem

Suppose that only some of the columns of A have errors. This leads to the
model

    \[ \min_{E_2, r} \left( \|E_2\|_F^2 + \|r\|_2^2 \right) \quad
       \text{subject to} \quad ( A_1 \ \ A_2 + E_2 )\, x = b + r, \qquad (7.3.6) \]

with $A_1 \in \mathbb{R}^{m \times n_1}$ and
$A_2, E_2 \in \mathbb{R}^{m \times n_2}$. This mixed LS-TLS problem
formulation encompasses the LSQ problem (when $n_2 = 0$), the TLS problem
(when $n_1 = 0$), as well as a combination of both.

How can one find the solution to (7.3.6)? The basic idea, developed in [95],
is to use a QR factorization of $( A \ \ b )$ to transform the problem to a
block structure, thereby reducing it to the solution of a smaller TLS problem
that can be solved independently, plus an LSQ problem. First, compute the QR
factorization $( A \ \ b ) = Q\, R$ with

    \[ R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}, \qquad
       R_{11} \in \mathbb{R}^{n_1 \times n_1}, \quad
       R_{22} \in \mathbb{R}^{(n_2+1) \times (n_2+1)}. \]

Note that $\|F\|_F = \|Q^T F\|_F$ and therefore the constraints on the
original and transformed systems are equivalent. Now compute the SVD of
$R_{22}$ and let $z_2 = v_{n_2+1}$ be the rightmost singular vector. We can
then set $F_{12} = 0$ and solve the standard least squares problem

    \[ R_{11} z_1 = -R_{12} z_2. \]

Note that, as the TLS part of $z$ has been obtained already and therefore
$F_{22}$ is minimized, there is no loss of generality when $F_{12} = 0$.
Finally, compute the mixed LS-TLS solution as

    \[ x = -z(1{:}n)/z(n+1), \qquad
       z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}. \]
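A compact NumPy sketch of this block reduction is given below; it is an
illustration under the stated generic-case assumptions, with a function name
of our choosing, and not the software referenced in this section.

    import numpy as np

    def mixed_ls_tls(A1, A2, b):
        """Mixed LS-TLS: columns of A1 are error-free, columns of A2 are noisy."""
        n1 = A1.shape[1]
        C = np.column_stack([A1, A2, b])
        Q, R = np.linalg.qr(C)                       # (A1 A2 b) = Q R
        R11, R12, R22 = R[:n1, :n1], R[:n1, n1:], R[n1:, n1:]
        _, _, Vt = np.linalg.svd(R22)
        z2 = Vt[-1, :]                               # TLS part from R22
        z1 = np.linalg.solve(R11, -R12 @ z2)         # triangular LSQ part
        z = np.concatenate([z1, z2])
        return -z[:-1] / z[-1]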

7.4 Convex optimization


In an earlier section of this chapter we have already met a convex optimiza-
tion problem: least squares with quadratic constraints. But there are many
more problems of interest and some of them are related to the subject of
this book.
Convex optimization has gained much attention in recent years, thanks
to recent advances in algorithms to solve such problems. Also, the efforts
of Stephen Boyd and his group at Stanford have helped to popularize the
subject and provide invaluable software to solve all sizes of problems. One
can say now that convex optimization problems have reached the point
where they can be solved as assuredly as the basic linear algebra problems
of the past.
One of the reasons for the success is that, although nonlinear, strictly
convex problems have unique solutions. General nonlinear programming
problems can have goal landscapes that are very bumpy and therefore it
can be hard to find a desired optimal point and even harder to find a global
one.
An important contribution in recent times has been in producing tools
to identify convexity and to find formulations of problems that are convex.
These problems look like general nonlinear programming problems, but the
objective function f (x) and the constraints gi (x) and hi (x) are required to
be convex:

    \[ \begin{array}{ll}
       \min & f(x) \\
       \text{subject to} & g_i(x) \leq 0, \quad i = 1, \ldots, m \\
                         & h_i(x) = 0, \quad i = 1, \ldots, p.
       \end{array} \]

In the past 30 years several researchers discovered that by adding a


property (self-concordance) to these problems made it possible to use inte-
rior point methods via Newton iterations that were guaranteed to converge
in polynomial time [167]. That was the breakthrough that allows us now
to solve large problems in a reasonable amount of computer time and has
provided much impetus to the application of these techniques to an ever
increasing set of engineering and scientific problems.

A notable effort stemming from the research group of S. Boyd at Stan-


ford is M. Grant’s cvx MATLAB package, available from http://cvxr.com.
van den Berg and Friedlander [240, 241] have also some interesting contribu-
tions, including efficient software for the solution of medium- and large-scale
convex optimization problems; we also mention the Python software cvxopt
by Dahl and Vandenberghe [56]. A comprehensive book on the subject is
[30].

7.5 Compressed sensing


The theory and praxis of compressed sensing has become an important
subject in the past few years. Compressed sensing is predicated on the
fact that we often collect too much data and then compress it, i.e., a good
deal of that data is unnecessary in some sense. Think of a mega-pixel
digital photography and its JPEG compressed version. Although this is a
lossy compression scheme, going back and forth allows for almost perfect
reconstruction of the original image. Compressed sensing addresses the
question: is it possible to collect much less data and use reconstruction
algorithms that provide a perfect image? The answer in many cases is yes.
The idea of compressed sensing is that solution vectors can be repre-
sented by much fewer coefficients if an appropriate basis is used, i.e., they
are sparse in such a base, say, of wavelets [164]. Following [34, 35] we con-
sider the general problem of reconstructing a vector $x \in \mathbb{R}^N$ from
linear measurements, $y_k = \varphi_k^T x$ for $k = 1, \ldots, K$. Assembling
all these measurements in matrix form we obtain the relation $y = \Phi x$,
where $y \in \mathbb{R}^K$ and the $\varphi_k$ are the rows of the matrix
$\Phi$.
The important next concept is that compressed sensing attempts to
reconstruct the original vector x from a number of samples smaller than
what the Nyquist-Shannon theorem says is possible. The latter theorem
states that if a signal contains a maximum frequency f (i.e., a minimum
wavelength w = 1/f ), then in order to reconstruct it exactly it is necessary
to have samples in the time domain that are spaced no more than 1/(2f )
units apart. The key to overcoming this restriction in the compressed sens-
ing approach is to use nonlinear reconstruction methods that combine l2
and l1 terms.
We emphasize that in this approach the system y = Φ x is under-
determined, i.e., the matrix Φ has more columns than rows. It turns out
that if one assumes that the vector x is sparse in some basis B, then solving
the following convex optimization problem:

    \[ \min_{\tilde{x}} \|\tilde{x}\|_1 \quad \text{subject to} \quad
       \Phi B \tilde{x} = y, \qquad (7.5.1) \]

reconstructs $x = B \tilde{x}$ exactly with high probability. The precise
statement is as follows.
Theorem 82. Assume that $x$ has at most S nonzero elements and that we take K
random linear measurements, where K satisfies

    \[ K \geq C \cdot S \cdot \log N, \]

where $C = 22(\delta + 1)$ and $\delta > 0$. Then, solving (7.5.1)
reconstructs $x$ exactly with probability greater than $1 - O(N^{-\delta})$.

The “secret” is that the Nyquist-Shannon theorem talks about a whole


band of frequencies, from zero to the highest frequency, while here we are
considering a sparse set of frequencies that may be in small disconnected
intervals. Observe that for applications of interest, the dimension N in
problem (7.5.1) can be very large and there resides its complexity. This
problem is convex and can be reduced to a linear programming one. For
large N it is proposed to use an interior-point method that basically solves
the optimality conditions by Newton’s method, taking care of staying feasi-
ble throughout the iteration. E. Candes (Caltech, Stanford) offers a number
of free software packages for this type of applications. Some very good re-
cent references are [240, 241].
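As an illustration of the reduction to linear programming mentioned above (a
sketch on synthetic data with $B = I$; the solver choice, problem sizes and
random test signal are assumptions made here, and this is not the software
referenced above), problem (7.5.1) can be solved by splitting the unknown into
positive and negative parts.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(3)
    N, K, S = 200, 60, 5
    Phi = rng.standard_normal((K, N))
    x_true = np.zeros(N)
    x_true[rng.choice(N, S, replace=False)] = rng.standard_normal(S)  # S-sparse
    y = Phi @ x_true

    # min ||x||_1 s.t. Phi x = y, written as an LP with x = u - v, u, v >= 0
    c = np.ones(2 * N)
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    x_rec = res.x[:N] - res.x[N:]
    print("recovery error:", np.linalg.norm(x_rec - x_true))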
As usual, in the presence of noise and even if the constraints in (7.5.1) are
theoretically satisfied, it is more appropriate to replace that problem by
$\min_{\tilde{x}} \|\tilde{x}\|_1$ subject to $\|\Phi x - y\|_2 \leq \alpha$,
where the positive parameter $\alpha$ is an estimate of the data noise level.
It is interesting to notice the connection with the basic solution of rank-
deficient problems introduced in Section 2.3. Already in 1964 J. B. Rosen
wrote a paper [209] about its calculation and in [193] an algorithm and im-
plementation for calculating the minimal and basic solutions was described.
That report contains extensive numerical experimentation and a complete
Algol program. Unfortunately, the largest matrix size treated there was
40 and therefore without further investigation that algorithm may not be
competitive today for these large problems. Basic solutions are used in
some algorithms as initial guesses to solve this problem, where one gener-
ally expects many fewer nonzero elements in the solution than the rank of
the matrix.
Chapter 8

Nonlinear Least Squares Problems

So far we have discussed data fitting problems in which all the unknown pa-
rameters appear linearly in the fitting model M (x, t), leading to linear least
squares problems for which we can, in principle, write down a closed-form
solution. We now turn to nonlinear least squares problems (NLLSQ) for
which this is not true, due to (some of) the unknown parameters appearing
nonlinearly in the model.

8.1 Introduction
To make precise what we mean by the term “nonlinear” we give the following
definitions. A parameter α of the function f appears nonlinearly if the
derivative ∂f /∂α is a function of α. A parameterized fitting model M (x, t)
is nonlinear if at least one of the parameters in x appears nonlinearly. For
example, in the exponential decay model

M(x1, x2, t) = x1 e^(−x2 t)

we have ∂M/∂x1 = e^(−x2 t) (which is independent of x1, so M is linear in x1) and
∂M/∂x2 = −t x1 e^(−x2 t) (which depends on x2), and thus M is a nonlinear
model with the parameter x2 appearing nonlinearly. We start with a few
examples of nonlinear models.

Example 83. Gaussian model. All measuring devices somehow influ-


ence the signal that they measure. If we measure a time signal gmes , then
we can often model it as a convolution of the true signal gtrue (t) and an


instrument response function Γ(t):


gmes(t) = ∫_{−∞}^{∞} Γ(t − τ) gtrue(τ) dτ.

A very common model for Γ(t) is the non-normalized Gaussian function


M(x, t) = x1 e^(−(t−x2)²/(2x3²)),                                  (8.1.1)

where x1 is the amplitude, x2 is the time shift and x3 determines the width
of the Gaussian function. The parameters x2 and x3 appear nonlinearly
in this model. The Gaussian model also arises in many other data fitting
problems.
Example 84. Rational model. Another model function that often arises
in nonlinear data fitting problems is the rational function, i.e., a quotient
of two polynomials of degree p and q, respectively,

M(x, t) = P(t)/Q(t)
        = (x1 t^p + x2 t^(p−1) + · · · + xp t + x_(p+1)) /
          (x_(p+2) t^q + x_(p+3) t^(q−1) + · · · + x_(p+q+1) t + x_(p+q+2)),     (8.1.2)

with a total of n = p + q + 2 parameters.


Rational models arise, e.g., in parameter estimation problems such as
chemical reaction kinetics, signal analysis (through Laplace and z trans-
forms), system identification and in general as a transfer function for a
linear time-invariant system (similar to the above-mentioned instrument
response function). In these models, the coefficients of the two polynomi-
als or, equivalently, their roots, characterize the dynamical behavior of the
system.
Rational functions are also commonly used as empirical data approxi-
mation functions. Their advantage is that they have a broader scope than
polynomials, yet they are still simple to evaluate.
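As a small illustration (not from the book), the rational model (8.1.2) can be evaluated with numpy's polyval, assuming the parameter vector stores the p + 1 numerator coefficients (highest power first) followed by the q + 1 denominator coefficients.

# Sketch: evaluating the rational model (8.1.2); coefficient ordering as in the text.
import numpy as np

def rational_model(x, t, p, q):
    num = np.polyval(x[:p + 1], t)            # x1 t^p + ... + x_(p+1)
    den = np.polyval(x[p + 1:p + q + 2], t)   # x_(p+2) t^q + ... + x_(p+q+2)
    return num / den

# Example with p = q = 1, i.e., M(x, t) = (x1 t + x2) / (x3 t + x4).
print(rational_model(np.array([1.0, 2.0, 1.0, 3.0]), np.linspace(0, 1, 5), 1, 1))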
The basic idea of nonlinear data fitting is the same as described in
Chapter 1, the only difference being that we now use a fitting model M (x, t)
in which (some of) the parameters x appear nonlinearly and that leads to
a nonlinear optimization problem. We are given noisy measurements yi =
Γ(ti )+ei for i = 1, . . . , m, where Γ(t) is the pure-data function that gives the
true relationship between t and the noise-free data, and we seek to identify
the model parameters x such that M (x, t) gives the best fit to the noisy
data. The likelihood function Px for x is the same as before, cf. (1.3.1),
leading us to consider the weighted residuals ri(x)/ςi = (yi − M(x, ti))/ςi
and to minimize the weighted sum-of-squares residuals.
Definition 85. Nonlinear least squares problem (NLLSQ). In the
case of white noise (where the errors have a common variance ς 2 ), find a

Figure 8.1.1: Fitting a non-normalized Gaussian function (full line) to
noisy data (circles).

minimizer x∗ of the nonlinear objective function f with the special form

min_x f(x) ≡ min_x ½ ‖r(x)‖₂² = min_x ½ Σ_{i=1}^m ri(x)²,          (8.1.3)

where x ∈ Rn and r(x) = (r1 (x), . . . , rm (x))T is a vector-valued function


of the residuals for each data point:

ri (x) = yi − M (x, ti ), i = 1, . . . , m,

in which yi are the measured data corresponding to ti and M (x, t) is the


model function.

The factor ½ is added for practical reasons, as will be clear later. Simply
replace ri(x) by wi ri(x) = ri(x)/ςi in (8.1.3), where wi = ςi^(−1) is the inverse
of the standard deviation of the noise in yi, to obtain the weighted problem
min_x ½ ‖W r(x)‖₂² for problems where the errors do not have the same
variance.

Example 86. Fitting with the Gaussian model. As an example, we


consider fitting the instrument response function in Example 83 by the
Gaussian model (8.1.1). First note that if we choose gtrue (τ ) = δ(τ ), a
delta function, then we have gmes (t) = Γ(t), meaning that we ideally mea-
sure the instrument response function. In practice, by sampling gmes (t) for
selected values t1 , t2 , . . . , tm of t we obtain noisy data yi = Γ(ti ) + ei for
i = 1, . . . , m to which we fit the Gaussian model M (x, t). Figure 8.1.1 illus-
trates this. The circles are the noisy data generated from an exact Gaussian
model with x1 = 2.2, x2 = 0.26 and x3 = 0.2, to which we added Gaussian
noise with standard deviation ς = 0.1. The full line is the least squares
fit with parameters x1 = 2.179, x2 = 0.264 and x3 = 0.194. In the next
chapter we discuss how to compute this fit.
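As a hedged illustration of this fit (not the code used for the figure), the following Python sketch generates synthetic data from the exact parameters and solves the resulting NLLSQ problem with scipy.optimize.least_squares. The sampling points and the noise realization are assumptions, so the fitted values will differ slightly from those quoted above.

# Sketch: fitting the non-normalized Gaussian model (8.1.1) to synthetic noisy data.
import numpy as np
from scipy.optimize import least_squares

def M(x, t):
    return x[0] * np.exp(-(t - x[1])**2 / (2.0 * x[2]**2))

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)                           # assumed sampling points
x_exact = np.array([2.2, 0.26, 0.2])
y = M(x_exact, t) + rng.normal(scale=0.1, size=t.size)  # noise with sigma = 0.1

residual = lambda x: y - M(x, t)                        # r_i(x) = y_i - M(x, t_i)
sol = least_squares(residual, x0=[1.0, 0.5, 0.5])
print("fitted parameters:", sol.x)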

Example 87. Nuclear magnetic resonance spectroscopy. This is a


technique that measures the response of nuclei of certain atoms that possess
spin when they are immersed in a static magnetic field and they are exposed
to a second oscillating magnetic field. NMR spectroscopy, which studies the
interaction of the electromagnetic radiation with matter, is used as a non-
invasive technique to obtain in vivo information about chemical changes
(e.g., concentration, pH), in living organisms. An NMR spectrum of, for
example, the human brain can be used to identify pathology or biochemical
processes or to monitor the effect of medication.
The model used to represent the measured NMR signals yi ∈ C in the
frequency domain is a truncated Fourier series of the Lorentzian distribu-
tion:


yi ≈ M(a, φ, d, ω, ti) = Σ_{k=1}^K ak e^(ı̂φk) e^((−dk + ı̂ωk) ti),    i = 1, . . . , m,

where ı̂ denotes the imaginary unit. The parameters of the NMR signal pro-
vide information about the molecules: K represents the number of different
resonance frequencies, the angular frequency ωk of the individual spectral
components identifies the molecules, the damping dk characterizes the mo-
bility, the amplitude ak is proportional to the number of molecules present,
and φk is the phase. All these parameters are real. To determine the NMR
parameters through a least squares fit, we minimize the squared modulus of
the difference between the measured spectrum and the model function:


min_{a,φ,d,ω} Σ_{i=1}^m |yi − M(a, φ, d, ω, ti)|².

This is another example of a nonlinear least squares problem.
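One practical point is that general NLLSQ software works with real residuals, while the NMR residuals above are complex. A minimal sketch of one common device, stacking the real and imaginary parts so that the squared modulus is minimized, is shown below; the parameter values and sampling grid are illustrative assumptions, not data from an actual NMR experiment.

# Sketch: handling the complex-valued NMR model with a real-valued NLLSQ solver
# by stacking real and imaginary parts of the residual (illustrative data only).
import numpy as np
from scipy.optimize import least_squares

def nmr_model(a, phi, d, omega, t):
    # sum_k a_k exp(i phi_k) exp((-d_k + i omega_k) t)
    return np.sum(a[:, None] * np.exp(1j * phi[:, None])
                  * np.exp((-d[:, None] + 1j * omega[:, None]) * t[None, :]), axis=0)

def residual(p, t, y):
    a, phi, d, omega = np.split(p, 4)            # 4K real parameters
    r = y - nmr_model(a, phi, d, omega, t)       # complex residual
    return np.concatenate([r.real, r.imag])      # its sum of squares equals sum |r_i|^2

m = 256
t = np.linspace(0.0, 1.0, m)
rng = np.random.default_rng(2)
p_true = np.array([1.0, 0.6, 0.1, -0.2, 5.0, 9.0, 30.0, 80.0])  # a, phi, d, omega (K = 2)
y = nmr_model(*np.split(p_true, 4), t) \
    + 0.01 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))

sol = least_squares(residual, x0=p_true + 0.01 * rng.standard_normal(p_true.size),
                    args=(t, y))
print(np.round(sol.x, 3))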

8.2 Unconstrained problems


Generally, nonlinear least squares problems will have multiple solutions
and without a priori knowledge of the objective function it would be too
expensive to determine numerically a global minimum, as one can only
calculate the function and its derivatives at a limited number of points.
Thus, the algorithms in this book will be limited to the determination of
local minima. A certain degree of smoothness of the objective function will
be required, i.e., having either one or possibly two continuous derivatives,
so that the Taylor expansion applies.

Optimality conditions
Recall (Appendix C) that the gradient ∇f (x) of a scalar function f (x) of
n variables is the vector with elements
[∇f(x)]_j = ∂f(x)/∂xj,    j = 1, . . . , n,

and the Hessian ∇²f(x) is the symmetric matrix with elements

[∇²f(x)]_ij = ∂²f(x)/(∂xi ∂xj),    i, j = 1, . . . , n.

The conditions for x∗ to be a local minimum of a twice continuously
differentiable function f are
1. First-order necessary condition. x∗ is a stationary point, i.e., the
gradient of f at x∗ is zero: ∇f (x∗ ) = 0.
2. Second-order sufficient conditions. The Hessian ∇2 f (x∗ ) is pos-
itive definite.
Now we consider the special case where f is the function in the nonlinear
least squares problem (8.1.3), i.e.,

f(x) = ½ Σ_{i=1}^m ri(x)²,    ri(x) = yi − M(x, ti).

The Jacobian J(x) of the vector function r(x) is defined as the matrix with
elements
[J(x)]_ij = ∂ri(x)/∂xj = −∂M(x, ti)/∂xj,    i = 1, . . . , m,  j = 1, . . . , n.

Note that the ith row of J(x) equals the transpose of the gradient of ri (x)
and also:
∇ri (x)T = −∇M (x, ti )T , i = 1, . . . , m.
The elements of the gradient and the Hessian of f are given by

∂f(x)/∂xj = Σ_{i=1}^m ri(x) ∂ri(x)/∂xj,

[∇²f(x)]_kℓ = ∂²f(x)/(∂xk ∂xℓ) = Σ_{i=1}^m ∂ri(x)/∂xk · ∂ri(x)/∂xℓ
              + Σ_{i=1}^m ri(x) ∂²ri(x)/(∂xk ∂xℓ),

and it follows immediately that the gradient and Hessian can be written in
matrix form as

∇f(x) = J(x)^T r(x),

∇²f(x) = J(x)^T J(x) + Σ_{i=1}^m ri(x) ∇²ri(x),

[∇²ri(x)]_kℓ = −[∇²M(x, ti)]_kℓ = −∂²M(x, ti)/(∂xk ∂xℓ),    k, ℓ = 1, . . . , n.

Notice the minus sign in the expression for ∇2 ri (x). The optimality con-
ditions now take the special form

∇f (x) = J(x)T r(x) = 0 (8.2.1)

and


∇²f(x) = J(x)^T J(x) + Σ_{i=1}^m ri(x) ∇²ri(x)  is positive definite.    (8.2.2)

The fact that distinguishes the least squares problem from among the
general optimization problems is that the first – and often dominant –
term J(x)T J(x) of the Hessian ∇2 f (x) contains only the Jacobian J(x)
of r(x), i.e., only first derivatives. Observe that in the second term the
second derivatives are multiplied by the residuals. Thus, if the model is
adequate to represent the data, then these residuals will be small near the
solution and therefore this term will be of secondary importance. In this
case one gets an important part of the Hessian “for free” if one has already
computed the gradient. Most specialized algorithms exploit this structural
property of the Hessian.
An inspection of the Hessian of f will reveal two comparatively easy
NLLSQ cases. As we observed above, the term Σ_{i=1}^m ri(x) ∇²ri(x) will
be small if the residuals are small. An additional favorable case occurs
when the problem is only mildly nonlinear, i.e., all the Hessians ∇2 ri (x) =
−∇2 M (x, ti ) have small norm. Most algorithms that neglect this sec-
ond term will then work satisfactorily. The smallest eigenvalue λmin of
J(x)T J(x) can be used to quantify the relative importance of the two
terms of ∇2 f (x): the first term of the Hessian dominates if for all x near
a minimum x∗ the quantities |ri(x)| ‖∇²ri(x)‖₂ for i = 1, . . . , m are small
relative to λmin . This obviously holds in the special case of a consistent
problem where r(x∗ ) = 0.
The optimality conditions can be interpreted geometrically by observing
that the gradient ∇f (x) is always a vector normal to the level set that
passes through the point x (we are assuming in what follows that f (x) is

Figure 8.2.1: Level sets L(c) (in R2 they are level curves) for a function
f whose minimizer x∗ is located at the black dot. The tangent plane is a
line perpendicular to the gradient ∇f (x), and the negative gradient is the
direction of steepest descent at x.

twice differentiable, for simplicity). A level set L(c) (level curve in R2 ) is


defined formally as
L(c) = {x : f (x) = c}.
The tangent plane at x is { y ∈ Rn | ∇f (x)T (y − x) = 0}, which shows that
∇f (x) is its normal and thus normal to the level set at x.
This leads to a geometric interpretation of directions of descent. A
direction p with ‖p‖₂ = 1 is said to be of descent (from the point x) if
f (x + tp) < f (x) for 0 < t < t0 . It turns out that directions of descent can
be characterized using the gradient vector. In fact, by Taylor’s expansion
we have

f (x + tp) = f (x) + t∇f (x)T p + O(t2 ).


Thus, for the descent condition to hold we need to have ∇f (x)T p < 0,
since for sufficiently small t the linear term will dominate. In addition
we have ∇f(x)^T p = cos(φ) ‖∇f(x)‖₂, where φ is the angle between p and
the gradient. From this we conclude that the direction p = −∇f (x) is of
maximum descent, or steepest descent, while any direction in the half-space
defined by the tangent plane and containing the negative gradient is also a
direction of descent, since cos(φ) is negative there. Figure 8.2.1 illustrates
this point. Note that a stationary point is one for which there are no descent
directions, i.e., ∇f (x) = 0.
We conclude with a useful geometric result when J(x∗ ) has full rank.
In this case the characterization of a minimum can be expressed by using
the so-called normal curvature matrix (see [9]) associated with the surface
defined by r(x) and with respect to the normalized vector r(x)/‖r(x)‖₂,
defined as

K(x) = − (1/‖r(x)‖₂) (J(x)†)^T [ Σ_{i=1}^m ri(x) ∇²ri(x) ] J(x)†.

Then the condition for a minimum can be reformulated as follows.

Condition 88. If J(x∗)^T (I − ‖r(x∗)‖₂ K(x∗)) J(x∗) is positive definite,


then x∗ is a local minimum.

The role of local approximations


Local approximations always play an important role in the treatment of
nonlinear problems and this case is no exception. We start with the Taylor
expansion of the model function:

M(x + h, t) = M(x, t) + ∇M(x, t)^T h + ½ h^T ∇²M(x, t) h + O(‖h‖₂³).

Now consider the ith residual ri (x) = yi − M (x, ti ), whose Taylor ex-
pansion for i = 1, . . . , m is given by

ri(x + h) = yi − M(x + h, ti)
          = yi − M(x, ti) − ∇M(x, ti)^T h − ½ h^T ∇²M(x, ti) h + O(‖h‖₂³).

Hence we can write

r(x + h) = r(x) + J(x) h + O(‖h‖₂²),

which is a local linear approximation at x valid for small h. If we keep x


fixed and consider the minimization problem

min_h ‖r(x + h)‖₂ ≈ min_h ‖J(x) h + r(x)‖₂,                        (8.2.3)

it is clear that we can approximate locally the nonlinear least squares prob-
lem with a linear one in the variable h. Moreover, we see that the h that
minimizes r(x + h)2 can be approximated by

h ≈ −(J(x)^T J(x))^(−1) J(x)^T r(x) = −J(x)† r(x),


where we have neglected higher-order terms. In the next chapter we shall
see how this expression provides a basis for some of the numerical methods
for solving the NLLSQ problem.
As a special case, let us now consider the local approximation (8.2.3) at
a least squares solution x∗ (a local minimizer) where the local, linear least
squares problem takes the form

min_h ‖J(x∗) h + r(x∗)‖₂.

Figure 8.2.2: Histograms and standard deviations for the least squares
solutions to 1000 noisy realizations of the Gaussian model fitting problem.
The vertical lines show the exact values of the parameters.

If we introduce x = x∗ + h, then we can reformulate the above problem as
one in x,

min_x ‖J(x∗) x − J(x∗) x∗ + r(x∗)‖₂,

which leads to an approximation of the covariance matrix for x∗ by using
the covariance matrix for the above linear least squares problem, i.e.,

Cov(x∗) ≈ J(x∗)† Cov(J(x∗) x∗ − r(x∗)) (J(x∗)†)^T
        = J(x∗)† Cov(r(x∗) − J(x∗) x∗) (J(x∗)†)^T
        = J(x∗)† Cov(y) (J(x∗)†)^T.                                 (8.2.4)

Here, we have used that r(x∗) = y − (M(x∗, t1), . . . , M(x∗, tm))^T and that
all the terms except y in r(x∗) − J(x∗) x∗ are considered constant.
The above equation provides a way to approximately assess the uncer-
tainties in the least squares solution x∗ for the nonlinear case, similar to
the result in Section 2.1 for the linear case. In particular, we see that the
Jacobian J(x∗ ) at the solution plays the role of the matrix in the linear
case. If the errors e in the data are uncorrelated with the exact data and
have covariance Cov(e) = ς² I, then Cov(x∗) ≈ ς² (J(x∗)^T J(x∗))^(−1).

Example 89. To illustrate the use of the above covariance matrix estimate,
we return to the Gaussian model M (x, t) from Examples 83 and 86, with
exact parameters x1 = 2.2, x2 = 0.26 and x3 = 0.2 and with noise level
ς = 0.1. The elements of the Jacobian for this problem are, for i = 1, . . . , m:

[J(x)]_i,1 = e^(−(ti−x2)²/(2x3²)),
[J(x)]_i,2 = x1 (ti − x2)/x3² · e^(−(ti−x2)²/(2x3²)),
[J(x)]_i,3 = x1 (ti − x2)²/x3³ · e^(−(ti−x2)²/(2x3²)).

The approximate standard deviations for the three estimated parameters
are the square roots of the diagonal of ς² (J(x∗)^T J(x∗))^(−1). In this case we
get the three values

6.58 · 10^(−2),    6.82 · 10^(−3),    6.82 · 10^(−3).

We compute the least squares solution x∗ for 1000 realizations of the noise.
Histograms of these parameters are shown in Figure 8.2.2. The standard
deviations of these estimated parameters are

std(x1) = 6.89 · 10^(−2),   std(x2) = 7.30 · 10^(−3),   std(x3) = 6.81 · 10^(−3).

In this example, the theoretical standard deviation estimates are in very


good agreement with the observed ones.
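A minimal sketch of this computation is given below: it evaluates the Jacobian formulas above at the fitted parameters and takes the square roots of the diagonal of ς² (J(x∗)^T J(x∗))^(−1). The sampling points ti are not listed in the example, so the grid used here is an assumption and the resulting numbers will not match the quoted values exactly.

# Sketch: approximate standard deviations from Cov(x*) ~ sigma^2 (J^T J)^{-1}.
import numpy as np

def jacobian(x, t):
    e = np.exp(-(t - x[1])**2 / (2.0 * x[2]**2))
    return np.column_stack([e,
                            x[0] * (t - x[1]) / x[2]**2 * e,
                            x[0] * (t - x[1])**2 / x[2]**3 * e])

x_star = np.array([2.179, 0.264, 0.194])   # fitted parameters from Example 86
t = np.linspace(0.0, 1.0, 50)              # assumed sampling points
sigma = 0.1                                # noise level

J = jacobian(x_star, t)
cov = sigma**2 * np.linalg.inv(J.T @ J)
print("approx. std. deviations:", np.sqrt(np.diag(cov)))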

8.3 Optimality conditions for constrained problems
In the previous section we reviewed the conditions for a point x∗ to be a
local minimizer of an unconstrained nonlinear least squares problem. Now
we consider a constrained problem

min_{x∈C} ½ ‖r(x)‖₂²,                                              (8.3.1)

where the residual vector r(x) is as above and the set C ⊂ Rn (called the
feasible region) is defined by a set of inequality constraints:

C = { x | ci (x) ≤ 0, i = 1, . . . , p},

where the functions ci (x) are twice differentiable. In the unconstrained


case, it could be that the only minimum is at infinity (think of f (x) =
a + bx). If instead we limit the variables to be in a closed, bounded set,
then the Bolzano-Weierstrass theorem [79] ensures us that there will be at
least one minimum point.
A simple way to define a bounded set C is to impose constraints on
the variables. The simplest constraints are those that set bounds on the
variables: li ≤ xi ≤ ui for i = 1, . . . , p. A full set of such constraints
(i.e., p = n) defines a bounded box in Rn and then we are guaranteed that
f (x) will have a minimum in the set. For general constraints some of the
functions ci (x) can be nonlinear.
The points that satisfy all the constraints are called feasible. A con-
straint is active if it is satisfied with equality, i.e., the point is on a boundary
of the feasible set. If we ignore the constraints, unconstrained minima can

Figure 8.3.1: Optimality condition for constrained optimization. The
shaded area illustrates the feasible region C defined by three constraints.
One of the constraints is active at the solution x∗ and at this point the
gradient of f is collinear with the normal to the constraint.

occur inside or outside the feasible region. If they occur inside, then the op-
timality conditions are the same as in the unconstrained case. However, if
all unconstrained minima are infeasible, then a constrained minimum must
occur on the boundary of the feasible region and the optimality conditions
will be different.
Let us think geometrically (better still, in R2 , see Figure 8.3.1). The
level curve values of f (x) decrease toward the unconstrained minimum.
Thus, a constrained minimum will lie on the level curve with the lowest
value of f (x) that is still in contact with the feasible region and at the
constrained minimum there should not be any feasible descent directions.
A moment of reflection tells us that the level curve should be tangent to
the active constraint (the situation is more complicated if more than one
constraint is active). That means that the normal to the active constraint
and the gradient (pointing to the outside of the constraint region) at that
point are collinear, which can be expressed as

∇f (x) + λ∇ci (x) = 0, (8.3.2)


where ci (x) represents the active constraint and λ > 0 is a so-called La-
grange multiplier for the constraint. (If the constraint were ci (x) ≥ 0, then
we should require λ < 0.) We see that (8.3.2) excludes feasible descent
directions: if p is a descent direction, then p^T ∇f(x) < 0, and from (8.3.2)

p^T ∇f(x) + λ p^T ∇ci(x) = 0,

so that, since λ > 0,

p^T ∇ci(x) > 0,

and therefore p must point outside the feasible region; see Figure 8.3.1 for an
illustration. This is a simplified version of the famous Karush-Kuhn-Tucker

first-order optimality conditions for general nonlinear optimization [139,


147]. Curiously enough, in the original Kuhn-Tucker paper these conditions
are derived from considerations based on multiobjective optimization!
The general case when more than one constraint is active at the solution,
can be discussed similarly. Now the infeasible directions instead of being
in a half-space determined by the tangent plane of a single constraint will
be in a cone formed by the normals to the tangent planes associated with
the various active constraints. The corresponding optimality condition is


∇f(x) + Σ_{i=1}^p λi ∇ci(x) = 0,                                    (8.3.3)

where the Lagrange multipliers satisfy λi > 0 for the active constraints
and λi = 0 for the inactive ones. We refer to [170] for more details about
optimality constraints and Lagrange multipliers.

8.4 Separable nonlinear least squares problems


In many practical applications the unknown parameters of the NLLSQ
problem can be separated into two disjoint sets, so that the optimization
with respect to one set is easier than with respect to the other. This sug-
gests the idea of eliminating the parameters in the easy set and minimizing
the resulting functional, which will then depend only on the remaining vari-
ables. A natural situation that will be considered in detail here arises when
some of the variables appear linearly.
The initial approach to this problem was considered by H. Scolnik in
his Doctoral Thesis at the University of Zurich. A first paper with gen-
eralizations was [113]. This was followed by an extensive generalization
that included a detailed algorithmic description [101, 102] and a computer
program called VARPRO that became very popular and is still in use. For
a recent detailed survey of applications see [103] and [198]. A paper that
includes constraints is [142]. For multiple right-hand sides see [98, 141].
A separable nonlinear least squares problem has a model of the special
form


M(a, α, t) = Σ_{j=1}^n aj φj(α, t),                                 (8.4.1)

where the two vectors a ∈ Rn and α ∈ Rk contain the parameters to be


determined and φj (α, t) are functions in which the parameters α appear
nonlinearly. Fitting this model to a set of m data points (ti , yi ) with m >
n + k leads to the least squares problem

min_{a,α} f(a, α) = min_{a,α} ½ ‖r(a, α)‖₂² = min_{a,α} ½ ‖y − Φ(α) a‖₂².

In this expression, Φ(α) is an m × n matrix function with elements

[Φ(α)]_ij = φj(α, ti),    i = 1, . . . , m,  j = 1, . . . , n.

The special case when k = n and φj(α, t) = e^(αj t) is called exponential data fitting.
In this as in other separable nonlinear least squares problems the matrix
Φ(α) is often ill-conditioned.
The variable projection algorithm – to be discussed in detail in the next
chapter – is based on the following ideas. For any fixed α, the problem

min_a ½ ‖y − Φ(α) a‖₂²

is linear with minimum-norm solution a∗ = Φ(α)† y, where Φ(α)† is the


Moore-Penrose generalized inverse of the rectangular matrix Φ(α), as we
saw in Chapter 3.
Substituting this value of a into the original problem, we obtain a re-
duced nonlinear problem depending only on the nonlinear parameters α,
with the associated function

f2(α) = ½ ‖y − Φ(α)Φ(α)† y‖₂² = ½ ‖PΦ(α) y‖₂²,

where, for a given α, the matrix

PΦ(α) = I − Φ(α)Φ(α)†

is the projector onto the orthogonal complement of the column space of the
matrix Φ(α). Thus the name variable projection given to this reduction
procedure. The solution of the nonlinear least squares problem minα f2 (α)
is discussed in the next chapter.
Once a minimizer α∗ for f2 (α) is obtained, it is substituted into the
linear problem min_a ½ ‖y − Φ(α∗) a‖₂², which is then solved to obtain a∗.
The least squares solution to the original problem is then (a∗ , α∗ ).
The justification for the variable projection algorithm is the following
theorem (Theorem 2.1 proved in [101]), which essentially states that the
set of stationary points of the original functional and that of the reduced
one are identical.
Theorem 90. Let f (a, α) and f2 (α) be defined as above. We assume that
in an open set Ω containing the solution, the matrix Φ(α) has constant rank
r ≤ min(m, n).
1. If α∗ is a critical point (or a global minimizer in Ω) of f2(α) and
a∗ = Φ(α∗)† y, then (a∗, α∗) is a critical point of f(a, α) (or a global
minimizer in Ω) and f(a∗, α∗) = f2(α∗).

2. If (a∗ , α∗ ) is a global minimizer of f (a, α) for α ∈ Ω, then α∗ is


a global minimizer of f2 (α) in Ω and f2 (α∗ ) = f (a∗ , α∗ ). Further-
more, if there is a unique a∗ among the minimizing pairs of f (a, α)
then a∗ must satisfy a∗ = Φ(α∗ )† y.
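The following Python sketch (not from the book) illustrates the reduction described above for exponential data fitting: the linear parameters a are eliminated with a least squares solve and the reduced functional f2(α) is minimized with a general-purpose optimizer. It is only meant to show the structure of the variable projection idea; it is not the VARPRO algorithm of [101, 102], and the problem data are illustrative.

# Sketch of the variable projection reduction for phi_j(alpha, t) = exp(alpha_j t).
import numpy as np
from scipy.optimize import minimize

def Phi(alpha, t):
    return np.exp(np.outer(t, alpha))            # m x n matrix [exp(alpha_j t_i)]

def f2(alpha, t, y):
    A = Phi(alpha, t)
    a = np.linalg.lstsq(A, y, rcond=None)[0]     # a = Phi(alpha)^+ y
    r = y - A @ a                                # projected residual
    return 0.5 * r @ r

rng = np.random.default_rng(3)
t = np.linspace(0.0, 2.0, 40)
alpha_true, a_true = np.array([-1.0, -3.0]), np.array([2.0, 1.0])
y = Phi(alpha_true, t) @ a_true + 0.01 * rng.normal(size=t.size)

res = minimize(f2, x0=[-0.5, -2.0], args=(t, y), method="Nelder-Mead")
alpha_star = res.x
a_star = np.linalg.lstsq(Phi(alpha_star, t), y, rcond=None)[0]
print("alpha*:", alpha_star, " a*:", a_star)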

8.5 Multiobjective optimization


This is a subject that is seldom treated in optimization books, although it
arises frequently in financial optimization, decision and game theory and
other areas. In recent times it has increasingly been recognized that many
engineering problems are also of this kind. We have also already mentioned
that most regularization procedures are actually bi-objective problems (see
[240, 241] for an interesting theory and algorithm).
The basic unconstrained problem is

min f (x),
x

where now f ∈ R^k is a vector function. Such problems arise in areas of
interest of this book, for instance, in cooperative fitting and inversion, when
measurements of several physical processes on the same sample are used to
determine properties that are interconnected.
In general, it will be unlikely that the k objective functions share a com-
mon minimum point, so the theory of multiobjective optimization is differ-
ent from the single-objective optimization case. The objectives are usually
in conflict and a compromise must be chosen. The optimality concept is
here the so-called Pareto equilibrium: a point x is a Pareto equilibrium
point if there is no other point for which all functions have smaller values.
This notion can be global (as stated) or local if it is restricted to a
neighborhood of x. In general there will be infinitely many such points. In
the absence of additional information, each Pareto point would be equally
acceptable. Let us now see if we can give a more geometric interpretation
of this condition, based on some of the familiar concepts used earlier.
For simplicity let us consider the case k = 2. First of all we observe that
the individual minimum points

x∗i = arg min_x fi(x),    i = 1, 2
are Pareto optimal. We consider the level sets corresponding to the two
functions and observe that at a point at which these two level sets are
tangent and their corresponding normals are in opposite directions, there
will be no common directions of descent for both functions. But that means
that there is no direction in which we can move so that both functions
are improved, i.e., this is a Pareto equilibrium point. Analytically this is
expressed as

Figure 8.5.1: Illustration of a multiobjective optimization problem. The
two functions f1(x) and f2(x) have minimizers x∗1 and x∗2. The curve be-
tween these two points is the set of Pareto points.

∇f1(x)/‖∇f1(x)‖₂ + ∇f2(x)/‖∇f2(x)‖₂ = 0,
which can be rewritten as

λ ∇f1(x) + (1 − λ) ∇f2(x) = 0,    0 ≤ λ ≤ 1.


It can be shown that for convex functions, all the Pareto points are
parametrized by this aggregated formula, which coincides with the opti-
mality condition for the scalar optimization problems

min_x [ λ f1(x) + (1 − λ) f2(x) ],    0 ≤ λ ≤ 1.

This furnishes a way to obtain the Pareto points by solving a number of


scalar optimization problems. Figure 8.5.1 illustrates this; the curve be-
tween the two minimizers x∗1 and x∗2 is the set of Pareto points.
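A small sketch of this scalarization approach is shown below for two convex quadratic objectives. Note that a plain sweep over λ does not, in general, produce the evenly spaced Pareto points discussed next; that requires the continuation strategy of [190, 194]. All functions and values are illustrative.

# Sketch: Pareto points via the aggregated problem min_x lambda*f1 + (1-lambda)*f2.
import numpy as np
from scipy.optimize import minimize

f1 = lambda x: np.sum((x - np.array([1.0, 0.0]))**2)   # minimizer x1* = (1, 0)
f2 = lambda x: np.sum((x - np.array([0.0, 2.0]))**2)   # minimizer x2* = (0, 2)

pareto = []
for lam in np.linspace(0.0, 1.0, 11):
    res = minimize(lambda x: lam * f1(x) + (1 - lam) * f2(x), x0=[0.0, 0.0])
    pareto.append((f1(res.x), f2(res.x)))
print(np.round(pareto, 3))                              # (f1, f2) pairs along the front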
Curiously enough, these optimality conditions were already derived in
the seminal paper on nonlinear programming by Kuhn and Tucker in 1951
[147].
A useful tool is given by the graph in phase space of the Pareto points,
the so-called Pareto front (see Figure 8.5.2). In fact, this graph gives a complete
picture of all possible solutions, and it is then usually straightforward to
make a decision by choosing an appropriate trade-off between the objec-
tives. Since one will usually be able to calculate only a limited number
of Pareto points, it is important that the Pareto front be uniformly sam-
pled. In [190, 194], a method based on continuation in λ with added dis-
tance constraints produces automatically equispaced representations of the
Pareto front. In Figure 8.5.2 we show a uniformly sampled Pareto front for
a bi-objective problem.
The ε-constraint method of Haimes [114] is frequently used to solve this
type of problem in the case that a hierarchical order of the objectives is

Figure 8.5.2: Evenly spaced Pareto front (in (f1 , f2 ) space) for a bi-
objective problem [190].

known and one can make an a priori call on a compromise upper value
for the secondary objective. It transforms the bi-objective problem into a
single-objective constrained minimization of the main goal. The constraint
is the upper bound of the second objective. In other words, one minimizes
the first objective subject to an acceptable level in the second objective
(which should be larger than its minimum value). From what we saw before,
the resulting solution may be sub-optimal.
A good reference for the theoretical and practical aspects of nonlinear
multiobjective optimization is [161].
Chapter 9

Algorithms for Solving Nonlinear LSQ Problems

The classical method of Newton and its variants can be used to solve
the nonlinear least squares problem formulated in the previous chapter.
Newton’s method for optimization is based on a second-order Taylor ap-
proximation of the objective function f (x) and subsequent minimization of
the resulting approximate function. Alternatively, one can apply Newton’s
method to solve the nonlinear system of equations ∇f (x) = 0.
The Gauss-Newton method is a simplification of the latter approach for
problems that are almost consistent or mildly nonlinear, and for which
the second term in the Hessian ∇2 f (x) can be safely neglected. The
Levenberg-Marquardt method can be considered a variant of the Gauss-
Newton method in which stabilization (in the form of Tikhonov regular-
ization, cf. Section 10) is applied to the linearized steps in order to solve
problems with ill-conditioned or rank-deficient Jacobian J(x).
Due to the local character of the Taylor approximation one can only
obtain local convergence, in general. In order to get global convergence, the
methods need to be combined with a line search. Global convergence means
convergence to a local minimum from any initial point. For convergence
to a global minimum one needs to resort, for instance, to a Monte Carlo
technique that provides multiple random initial points, or to other costlier
methods [191].
We will describe first the different methods and then consider the com-
mon difficulties, such as how to start and end and how to ensure descent
at each step.


9.1 Newton’s method


If we assume that f (x) is twice continuously differentiable, then we can use
Newton’s method to solve the system of nonlinear equations

∇f (x) = J(x)T r(x) = 0,

which provides local stationary points for f (x). Written in terms of deriva-
tives of r(x) and starting from an initial guess x0 this version of the Newton
iteration takes the form
xk+1 = xk − (∇²f(xk))^(−1) ∇f(xk)
     = xk − (J(xk)^T J(xk) + S(xk))^(−1) J(xk)^T r(xk),    k = 0, 1, 2, . . .

where S(xk ) denotes the matrix


S(xk) = Σ_{i=1}^m ri(xk) ∇²ri(xk).                                  (9.1.1)

As usual, no inverse is calculated to obtain a new iterate, but rather a linear


system is solved by a direct or iterative method to obtain the correction
Δxk = xk+1 − xk:

(J(xk)^T J(xk) + S(xk)) Δxk = −J(xk)^T r(xk).                       (9.1.2)

The method is locally quadratically convergent as long as ∇2 f (x) is Lip-


schitz continuous and positive definite in a neighborhood of x∗ . This follows
from a simple adaptation of the Newton convergence theorem in [66], which
leads to the result

‖xk+1 − x∗‖₂ ≤ γ ‖xk − x∗‖₂²,    k = 0, 1, 2, . . . .

The constant γ is a measure of the nonlinearity of ∇f(x); it depends on
the Lipschitz constant for ∇²f(x) and a bound for ‖∇²f(x∗)^(−1)‖₂, but the
size of the residual does not appear. The convergence rate will depend on
the nonlinearity, but the convergence itself will not. The foregoing result
implies that Newton’s method is usually very fast in the final stages, close
to the solution.
In practical applications Newton’s iteration may not be performed ex-
actly and theorems 4.22 and 4.30 in [143, 183] give convergence results when
the errors are controlled appropriately.
Note that the Newton correction Δxk will be a descent direction (as
explained in the previous chapter), as long as ∇2 f (xk ) = J(xk )T J(xk ) +

S(xk ) is positive definite. In fact, using the definition of positive definite


matrices, one obtains, by multiplying both sides of (9.1.2) by Δxk^T,

0 < Δxk^T (J(xk)^T J(xk) + S(xk)) Δxk = −Δxk^T J(xk)^T r(xk).

Therefore, the correction Δxk is in the same half-space as the steepest


descent direction −J T (xk )r(xk ). Although the Newton correction is a
descent direction, the step size may be too large, since the linear model is
only locally valid. Therefore to ensure convergence, Newton’s method is
used with step-length control to produce a more robust algorithm.
One of the reasons why Newton's method is not used more frequently
for nonlinear least squares problems is that its good convergence properties
come at a price: the mn² derivatives appearing in S(xk) must be computed!
This can be expensive and often the derivatives are not even available and
thus must be substituted by finite differences. Special cases where Newton’s
method is a good option are when S(xk ) is sparse, which happens frequently
if J(xk ) is sparse or when S(xk ) involves exponentials or trigonometric
functions that are easy to compute. Also, if one has access to the code that
calculates the model, automatic differentiation can be used [111].
If the second derivatives term in ∇2 f (xk ) = J(xk )T J(xk ) + S(xk ) is
unavailable or too expensive to compute and hence approximated, the re-
sulting algorithm is called a quasi-Newton method and although the conver-
gence will no longer be quadratic, superlinear convergence can be attained.
It is important to point out that only the second term of ∇2 f (xk ) needs
to be approximated since the first term J(xk )T J(xk ) has already been
computed. A successful strategy is to approximate S(xk ) by a secant-type
term, using updated gradient evaluations:


S(xk) ≈ Σ_{i=1}^m ri(xk) Gi(xk),

where the Hessian terms ∇2 ri (xk ) are approximated from the condition

Gi (xk ) (xk − xk−1 ) = ∇ri (xk ) − ∇ri (xk−1 ).

This condition would determine a secant approximation for n = 1, but in


the higher-dimensional case it must be complemented with additional re-
quirements on the matrix Gi (xk ): it must be symmetric and it must satisfy
the so-called least-change secant condition, i.e., that it be most similar to
the approximation Gi (xk−1 ) in the previous step. In [67], local superlinear
convergence is proved, under the assumptions of Lipschitz continuity and
bounded inverse of ∇2 f (x).

9.2 The Gauss-Newton method


If the problem is only mildly nonlinear or if the residual at the solution (and
therefore in a reasonable-sized neighborhood) is small, a good alternative to
Newton’s method is to neglect the second term S(xk ) of the Hessian alto-
gether. The resulting method is referred to as the Gauss-Newton method,
where the computation of the step Δxk involves the solution of the linear
system

(J(xk)^T J(xk)) Δxk = −J(xk)^T r(xk)                                (9.2.1)
and xk+1 = xk + Δxk .
Note that in the full-rank case these are actually the normal equations
for the linear least squares problem

min_Δxk ‖J(xk) Δxk − (−r(xk))‖₂                                     (9.2.2)

and thus Δxk = −J(xk )† r(xk ). By the same argument as used for the
Newton method, this is a descent direction if J(xk ) has full rank.
We note that the linear least squares problem in (9.2.2), which defines
the Gauss-Newton direction Δxk , can also be derived from the principle of
local approximation discussed in the previous chapter. When we linearize
the residual vector r(xk ) in the kth step we obtain the approximation in
(8.2.3) with x = xk , which is identical to (9.2.2).
The convergence properties of the Gauss-Newton method can be sum-
marized as follows [66] (see also [184] for an early proof of convergence):
Theorem 91. Assume that
• f (x) is twice continuously differentiable in an open convex set D.
• J(x) is Lipschitz continuous with constant γ, and ‖J(x)‖₂ ≤ α.
• There is a stationary point x∗ ∈ D, and
• for all x ∈ D there exists a constant σ such that
 
‖(J(x) − J(x∗))^T r(x∗)‖₂ ≤ σ ‖x − x∗‖₂.

If σ is smaller than λ, the smallest eigenvalue of J(x∗ )T J(x∗ ), then for any
c ∈ (1, λ/σ) there exists a neighborhood so that the Gauss-Newton sequence
converges linearly to x∗ starting from any initial point x0 in D:
‖xk+1 − x∗‖₂ ≤ (cσ/λ) ‖xk − x∗‖₂ + (cαγ/(2λ)) ‖xk − x∗‖₂²

and

‖xk+1 − x∗‖₂ ≤ ((cσ + λ)/(2λ)) ‖xk − x∗‖₂ < ‖xk − x∗‖₂.


If the problem is consistent, i.e., r(x∗ ) = 0, there exists a (maybe smaller)


neighborhood where the convergence will be quadratic.
The Gauss-Newton method can produce the whole spectrum of conver-
gence: if S(xk ) = 0, or is very small, the convergence can be quadratic,
but if S(xk ) is too large there may not even be local convergence. The
important constant is σ, which can be considered as an approximation of
‖S(x∗)‖₂ since, for x sufficiently close to x∗, it can be shown that

‖(J(x) − J(x∗))^T r(x∗)‖₂ ≈ ‖S(x∗)‖₂ ‖x − x∗‖₂.

The ratio σ/λ must be less than 1 for convergence. The rate of convergence
is inversely proportional to the “size” of the nonlinearity or the residual,
i.e., the larger ‖S(xk)‖₂ is in comparison to ‖J(xk)^T J(xk)‖₂, the slower
the convergence.
As we saw above, the Gauss-Newton method produces a direction Δxk
that is of descent, but due to the local character of the underlying approx-
imation, the step length may be incorrect, i.e., f (xk + Δxk ) > f (xk ). This
suggests a way to improve the algorithm, namely, to use damping, which
is a simple mechanism that controls the step length to ensure a sufficient
reduction of the function. An alternative is to use a trust-region strategy
to define the direction; this leads to the Levenberg-Marquardt algorithm
discussed later on.
Algorithm damped Gauss-Newton
• Start with an initial guess x0 and iterate for k = 0, 1, 2, . . .
• Solve min_Δxk ‖J(xk) Δxk + r(xk)‖₂ to compute the correction Δxk.
• Choose a step length αk so that there is enough descent.
• Calculate the new iterate: xk+1 = xk + αk Δxk .
• Check for convergence.
Several methods for choosing the step-length parameter αk have been pro-
posed and the key idea is to ensure descent by the correction step αk Δxk .
One popular choice is the Armijo condition (see Section 9.4), which uses a
constant αk ∈ (0, 1) to ensure enough descent in the value of the objective:

f(xk + αk Δxk) < f(xk) + αk ∇f(xk)^T Δxk
               = f(xk) + αk r(xk)^T J(xk) Δxk.                      (9.2.3)

This condition ensures that the reduction is (at least) proportional to both
the parameter αk and the directional derivative ∇f (xk )T Δxk .
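The following sketch (illustrative, not from the book) implements a damped Gauss-Newton iteration of this form. The acceptance test uses the common variant of (9.2.3) in which a small constant c1 multiplies the directional derivative, and the step length is halved until the test is satisfied.

# Minimal sketch of a damped Gauss-Newton iteration with backtracking line search.
import numpy as np

def damped_gauss_newton(r, J, x0, c1=1e-4, tol=1e-10, maxit=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        rk, Jk = r(x), J(x)
        g = Jk.T @ rk                                   # gradient of f = 0.5*||r||^2
        if np.linalg.norm(g) <= tol:
            break
        dx = np.linalg.lstsq(Jk, -rk, rcond=None)[0]    # Gauss-Newton correction
        f0, alpha = 0.5 * rk @ rk, 1.0
        while 0.5 * np.sum(r(x + alpha * dx)**2) >= f0 + c1 * alpha * (g @ dx):
            alpha *= 0.5                                # backtrack: 1, 1/2, 1/4, ...
            if alpha < 1e-12:
                break
        x = x + alpha * dx
    return x

It could be applied, for instance, to the residual and (analytic or finite-difference) Jacobian of the Gaussian model from Chapter 8.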
Using the two properties above, a descent direction and an appropri-
ate step length, the damped Gauss-Newton method is locally convergent

  y       x∗        f(x∗)     J(x∗)^T J(x∗)     S(x∗)
  8      0.6932     0          644.00           0
  3      0.4401     1.639      151.83           9.0527
 −1      0.0447     6.977       17.6492         8.3527
 −8     −0.7915    41.145        0.4520         2.9605

Table 9.1: Data for the test problem for four values of the scalar y. Note
that x∗, J(x∗) and S(x∗) are scalars in this example.

and often globally convergent. Still, the convergence rate may be slow
for problems for which the standard algorithm is inadequate. Also, the
inefficiency of the Gauss-Newton method when applied to problems with
ill-conditioned or rank-deficient Jacobian has not been addressed; the next
algorithm, Levenberg-Marquardt, will consider this issue.
Example 92. The following example from [66] will clarify the above con-
vergence results. We fit the one-parameter model M(x, t) = e^(xt) to the data
set
(t1 , y1 ) = (1, 2), (t2 , y2 ) = (2, 4), (t3 , y3 ) = (3, y),
where y can take one of the four values 8, 3, −1, −8. The least squares
problem is, for every y, to determine the single parameter x, to minimize
the function


3
* x +
f (x) = 1
2 ri (x)2 = 1
2 (e − 2)2 + (e2x − 4)2 + (e3x − y)2 .
i=1

For this function of a single variable x, both the simplified and full Hessian
are scalars:

J(x)^T J(x) = Σ_{i=1}^3 (ti e^(x ti))²    and    S(x) = Σ_{i=1}^3 ri(x) ti² e^(x ti).

Table 9.1 lists, for each of the four values of y, the minimizer x∗ and values
of several functions at the minima.
We use the Newton and Gauss-Newton methods with several starting
values x0 . Table 9.2 lists the different convergence behaviors for every case,
with two different starting values. The stopping criterion for the iterations
was |∇f(xk)| ≤ 10^(−10). Note that for the consistent problem with y = 8,
Gauss-Newton achieves quadratic convergence, since S(x∗) = 0. For y = 3,
the convergence factor for Gauss-Newton is σ/λ ≈ 0.06, which is small
compared to 0.47 for y = −1, although for this value of y there is still linear
but slow convergence. For y = −8 the ratio is 6.5 ≫ 1 and there is no
convergence.

                     Newton               Gauss-Newton
  y      x0      # iter.   rate        # iter.   rate
  8      1          7     quadratic       5     quadratic
         0.6        6     quadratic       4     quadratic
  3      1          9     quadratic      12     linear
         0.5        5     quadratic       9     linear
 −1      1         10     quadratic      34     linear (slow)
         0          4     quadratic      32     linear (slow)
 −8      1         12     quadratic      no convergence
        −0.7        4     quadratic      no convergence

Table 9.2: Convergence behavior for the four cases and two different start-
ing values.

Figure 9.2.1: Single-parameter exponential fit example: the scalars
J(x)^T J(x) (left) and ∇²f(x) (right) as functions of x and y (the white lines
are level curves). The second term in ∇²f(x) has increasing importance as
y decreases.

To support the above results, Figure 9.2.1 shows plots of the Hessian
∇2 f (x) and its first term J(x)T J(x) (recall that both are scalars) as func-
tions of x and y, for x ∈ [−0.7, 0.7] and y ∈ [−8, 8]. We see that the
deterioration of the convergence rate of Gauss-Newton’s method, compared
to that of Newton’s method, indeed coincides with the increasing impor-
tance of the term S(x) in ∇2 f (x), due to larger residuals at the solution as
y decreases.

For large-scale problems, where memory and computing time are lim-
iting factors, it may be infeasible to solve the linear LSQ problem for the
correction Δxk to high accuracy in each iteration. Instead one can use one
of the iterative methods discussed in Chapter 6 to compute an approximate
step x̃k , by terminating the iterations once the approximation is “accu-

rate enough,” e.g., when the residual in the normal equations is sufficiently
small:

J(xk )T J(xk ) Δx̃k + J(xk )T r(xk )2 < τ J(xk )T r(xk )2 ,

for some τ ∈ (0, 1). In this case, the algorithm is referred to as an inexact
Gauss-Newton method . For these nested iterations an approximate Newton
method theory applies [61, 143, 183].
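As an illustration, an approximate correction of this kind could be computed with an iterative solver such as LSQR; in the sketch below the tolerances atol and btol control the accuracy of the inner solve, which is related to, though not identical with, the normal-equations criterion quoted above. The function and parameter names are illustrative.

# Sketch: an approximate Gauss-Newton correction from LSQR with loose tolerances,
# applied to  min ||J(x_k) dx + r(x_k)||_2  for a (possibly sparse) Jacobian.
import numpy as np
from scipy.sparse.linalg import lsqr

def inexact_gn_step(Jk, rk, tau=1e-2):
    result = lsqr(Jk, -rk, atol=tau, btol=tau)
    return result[0]                          # approximate correction dx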

9.3 The Levenberg-Marquardt method


While the Gauss-Newton method can be used for ill-conditioned problems,
it is not efficient, as one can see from the above convergence relations when
λ → 0. The main difficulty is that the correction Δxk is too large and goes
in a “bad” direction that gives little reduction in the function. For such
problems, it is common to add an inequality constraint to the linear least
squares problem (9.2.2) that determines the step, namely, that the length
of the step, ‖Δxk‖₂, should be bounded by some constant. This so-called
trust region technique improves the quality of the step.
As we saw in the previous chapter, we can handle such an inequality
constraint via the use of a Lagrange multiplier and thus replace the problem
in (9.2.2) with a problem of the form
 
min_Δxk { ‖J(xk) Δxk + r(xk)‖₂² + λk ‖Δxk‖₂² },

where λk > 0 is the Lagrange multiplier for the constraint at the kth itera-
tion. There are two equivalent forms of this problem, either the “modified”
normal equations
 
(J(xk)^T J(xk) + λk I) Δxk = −J(xk)^T r(xk)                         (9.3.1)

or the “modified” least squares problem


 
min_Δxk ‖ [ J(xk) ; √λk I ] Δxk − [ −r(xk) ; 0 ] ‖₂,                (9.3.2)

where the semicolon denotes vertical stacking of the two blocks.

The latter is best suited for numerical computations. This approach, which
also handles a rank-deficient Jacobian J(xk ), leads to the Levenberg-Marquardt
method, which takes the following form:
Algorithm Levenberg-Marquardt

• Start with an initial guess x0 and iterate for k = 0, 1, 2, . . .

• At each step k choose the Lagrange multiplier λk .



• Solve (9.3.1) or (9.3.2) for Δxk.


• Calculate the next iterate xk+1 = xk + Δxk .
• Check for convergence.
The parameter λk influences both the direction and the length of the step
Δxk . Depending on the size of λk , the correction Δxk can vary from a
Gauss-Newton step for λk = 0, to a very short step approximately in the
steepest descent direction for large values of λk . As we see from these
considerations, the LM parameter acts similarly to the step control for the
damped Gauss-Newton method, but it also changes the direction of the
correction.
The Levenberg-Marquardt step can be interpreted as solving the normal
equations used in the Gauss-Newton method, but “shifted” by a scaled
identity matrix, so as to convert the problem from having an ill-conditioned
(or positive semidefinite) matrix J(xk )T J(xk ) into a positive definite one.
Notice that the positive definiteness implies that the Levenberg-Marquardt
direction is always of descent and therefore the method is well defined.
Another way of looking at the Levenberg-Marquardt iteration is to con-
sider the matrix √λk I as an approximation to the second derivative term
S(xk ) that was neglected in the definition of the Gauss-Newton method.
The local convergence of the Levenberg-Marquardt method is summa-
rized in the following theorem.
Theorem 93. Assume the same conditions as in Theorem 91 and in ad-
dition assume that the Lagrange multipliers λk for k = 0, 1, 2, . . . are non-
negative and bounded by b > 0. If σ < λ, then for any c ∈ (1, (λ+b)/(σ+b)),
there exists an open ball D around x∗ such that the Levenberg-Marquardt
sequence converges linearly to x∗ , starting from any initial point x0 ∈ D:
c(σ + b) cαγ
xk+1 − x∗ 2  xk − x∗ 2 + xk − x∗ 2
2
λ+b 2(λ + b)
and
c(σ + b) + (λ + b)
xk+1 − x∗ 2  xk − x∗ 2 < xk − x∗ 2 .
2(λ + b)
  
If r(x∗ ) = 0 and λk = O J(xk )T r(xk )2 , then the iterates xk converge
quadratically to x∗ .
Within the trust-region framework introduced above, there are several
general step-length determination techniques, see, for example, [170]. Here
we give the original strategy devised by Marquardt for the choice of the pa-
rameter λk , which is simple and often works well. The underlying principles
are

• The initial value λ0 should be of the same size as ‖J(x0)^T J(x0)‖₂.
• For subsequent steps, an improvement ratio is defined as in the trust
region approach:
ρk = (actual reduction)/(predicted reduction)
   = (f(xk) − f(xk+1)) / ( ½ Δxk^T (λk Δxk − J(xk)^T r(xk)) ).

Here, the denominator is the reduction in f predicted by the local linear
model. If ρk is large, then the pure Gauss-Newton model is good enough,
so λk+1 can be made smaller than at the previous step. If ρk is small
(or even negative), then a short, steepest-descent step should be used, i.e.,
λk+1 should be increased. Marquardt’s updating strategy is widely used
with some variations in the thresholds.
Algorithm Levenberg-Marquardt’s parameter updating
• If ρk > 0.75, then λk+1 = λk /3.
• If ρk < 0.25, then λk+1 = 2 λk.
• Otherwise use λk+1 = λk .
• If ρk > 0, then perform the update xk+1 = xk + Δxk .
A detailed description can be found in [154].
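The sketch below (illustrative, and not MINPACK's implementation) combines the augmented least squares formulation (9.3.2) with the parameter-updating rules above; the predicted reduction in the gain ratio is computed as ½ Δxk^T (λk Δxk − J(xk)^T r(xk)).

# Minimal sketch of a Levenberg-Marquardt iteration with Marquardt's updating.
import numpy as np

def levenberg_marquardt(r, J, x0, tol=1e-10, maxit=200):
    x = np.asarray(x0, dtype=float)
    n = x.size
    lam = np.linalg.norm(J(x).T @ J(x), 2)      # lambda_0 of the size of ||J^T J||_2
    for _ in range(maxit):
        rk, Jk = r(x), J(x)
        g = Jk.T @ rk
        if np.linalg.norm(g) <= tol:
            break
        # Solve (9.3.2): stack J over sqrt(lambda) I, and -r over 0.
        A = np.vstack([Jk, np.sqrt(lam) * np.eye(n)])
        b = np.concatenate([-rk, np.zeros(n)])
        dx = np.linalg.lstsq(A, b, rcond=None)[0]
        f0, f1 = 0.5 * rk @ rk, 0.5 * np.sum(r(x + dx)**2)
        rho = (f0 - f1) / (0.5 * dx @ (lam * dx - g))   # gain ratio
        if rho > 0.75:
            lam /= 3.0
        elif rho < 0.25:
            lam *= 2.0
        if rho > 0:                                     # accept only if f decreased
            x = x + dx
    return x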
The software package MINPACK-1 available from Netlib [263] includes
a robust implementation based on the Levenberg-Marquardt algorithm.
Similar to the Gauss-Newton method, we have inexact versions of the
Levenberg-Marquardt method, where the modified least squares problem
(9.3.2) for the correction is solved only to sufficient accuracy, i.e., for some
τ ∈ (0, 1) we accept the solution if:

‖(J(xk)^T J(xk) + λk I) Δxk + J(xk)^T r(xk)‖₂ < τ ‖J(xk)^T r(xk)‖₂.

If this system is to be solved for several values of the Lagrange parameter


λk , then the bidiagonalization strategy from Section 5.5 can be utilized
such that only one partial bidiagonalization needs to be performed; for
more details see [121, 260].
Example 94. We return to Example 92, and this time we use the Levenberg-
Marquardt method with the above parameter-updating algorithm and the
same starting values and stopping criterion as before. Table 9.3 compares
the performance of Levenberg-Marquardt’s algorithm with that of the Gauss-
Newton algorithm. For the consistent problem with y = 8, the convergence
is still quadratic, but slower. As the residual increases, the advantage of
the Levenberg-Marquardt strategy sets in, and we are now able to solve the
large-residual problem for y = −8, although the convergence is very slow.

                  Gauss-Newton              Levenberg-Marquardt
  y      x0     # iter.   rate            # iter.   rate
  8      1         5     quadratic          10     quadratic
         0.6       4     quadratic           7     quadratic
  3      1        12     linear             13     linear
         0.5       9     linear             10     linear
 −1      1        34     linear (slow)      26     linear (slow)
         0        32     linear (slow)      24     linear (slow)
 −8      1        no convergence           125     linear (very slow)
        −0.7      no convergence           120     linear (very slow)

Table 9.3: Comparison of the convergence behavior of the Gauss-Newton
and Levenberg-Marquardt methods for the same test problem as in Table
9.2.

Figure 9.3.1: Number of iterations to reach an absolute accuracy of 10^(−6)
in the solution by three different NLLSQ methods. Each histogram shows
results for a particular method and a particular value of ζ in the perturbation
of the starting point; top ζ = 0.1, middle ζ = 0.2, bottom ζ = 0.3.

Example 95. This example is mainly concerned with a comparison of the


robustness of the iterative methods with regards to the initial guess x0 . We
use the Gaussian model from Examples 83 and 86 with noise level ς = 0.1,
and we use three different algorithms to solve the NLLSQ problem:

• The standard Gauss-Newton method.

• An implementation of the damped Gauss-Newton algorithm from MAT-


LAB’s Optimization Toolbox, available via the options
’LargeScale’=’off’ and ’LevenbergMarquardt’=’off’ in the func-
tion lsqnonlin.

• An implementation of the Levenberg-Marquardt algorithm from MAT-


LAB’s Optimization Toolbox, available via the options
’Jacobian’=’on’ and ’Algorithm’=’levenberg-marquardt’ in the
function lsqnonlin.

The starting guess was chosen equal to the exact parameters (2.2, 0.26, 0.2)T
plus a Gaussian perturbation with standard deviation ζ. We used ζ = 0.1,
0.2, 0.3, and for each value we created 500 different realizations of the ini-
tial point. Figure 9.3.1 shows the number of iterations in the form of his-
tograms – notice the different axes in the nine plots. We give the number of
iterations necessary to reach an absolute accuracy of 10^(−6), compared to a
reference solution computed with much higher accuracy. The italic numbers
in the three left plots are the number of times the Gauss-Newton method did
not converge.
We see that when the standard Gauss-Newton method converges, it con-
verges rapidly – but also that it may not converge. Moreover, as the starting
point moves farther from the minimizer, the number of instances of non-
convergence increases dramatically.
The damped Gauss-Newton method used here always converged, thus was
much more robust than the undamped version. For starting points close to
the minimum, it may converge quickly, but it may also require about 40
iterations. For starting points further away from the solution, this method
always uses about 40 iterations.
The Levenberg-Marquardt method used here is also robust, in that it con-
verges for all starting points. In fact, for starting points close to the solution
it requires, on the average, fewer iterations than the damped Gauss-Newton
method. For starting points farther from the solution it still converges,
but requires many more iterations. The main advantage of the Levenberg-
Marquardt algorithm, namely, to handle ill-conditioned and rank-deficient
Jacobian matrices, does not come into play here as the particular problem
is well conditioned.
We emphasize that this example should not be considered as representa-
tive for these methods in general – rather, it is an example of the perfor-

Figure 9.3.2: Convergence histories from the solution of a particular


NLLSQ problem with three parameters by means of the standard Gauss-
Newton (top) and the Levenberg-Marquardt algorithm (bottom). Each fig-
ure shows the behavior of pairs of the three components of the iterates xk
together with level sets of the objective function.

mance of specific implementations for one particular (and well-conditioned)


problem.

Example 96. We illustrate the robustness of the Levenberg-Marquardt al-


gorithm by solving the same test problem as in the previous example by
means of the standard Gauss-Newton and Levenberg-Marquardt algorithms.
The progress of a typical convergence history for these two methods is shown
in Figure 9.3.2. There are three unknowns in the model, and the starting
point is x0 = (1, 0.3, 0.4)T . The top figures show different component
pairs of the iterates xk for k = 1, 2, . . . for the Gauss-Newton algorithm
(for example, the leftmost plot shows the third component of xk versus its
second component). Clearly, the Gauss-Newton iterates overshoot before
they finally converge to the LSQ solution. The iterates of the Levenberg-
Marquardt algorithm are shown in the bottom figures, and we see that they
converge much faster toward the LSQ solution without any big “detour.”

Characteristics                 Newton       G-N               L-M
Ill-conditioned Jacobian        yes          yes (but slow)    yes
Rank-deficient Jacobian         yes          no                yes
Convergence, S(xk) = 0          quadratic    quadratic         quadratic
Convergence, S(xk) small        quadratic    linear            linear
Convergence, S(xk) large        quadratic    slow or none      slow or none

Table 9.4: Comparison of some properties of the Newton, Gauss-Newton
(G-N) and Levenberg-Marquardt (L-M) algorithms for nonlinear least
squares problems.

9.4 Additional considerations and software


An overall comparison between the methods discussed so far is given in
Table 9.4. The “yes” or “no” indicates whether the corresponding algorithm
is appropriate or not for the particular problem. Below we comment on
some further issues that are relevant for these methods.

Hybrid methods
During the iterations we do not know whether we are in the region where
the convergence conditions of a particular method hold, so the idea in
hybrid methods is to combine a “fast” method, such as Newton (if second
derivatives are available), with a “safe” method such as steepest descent.
One hybrid strategy is to combine Gauss-Newton or Levenberg-Marquardt,
which in general is only linearly convergent, with a superlinearly convergent
quasi-Newton method, where S(xk ) is approximated by a secant term.
An example of this efficient hybrid method is the algorithm NL2SOL by
Dennis, Gay and Welsch [65]. It contains a trust-region strategy for global
convergence and uses Gauss-Newton and Levenberg-Marquardt steps ini-
tially, until it has enough good second-order information. Its performance
for large and very nonlinear problems is somewhat better than Levenberg-
Marquardt, in that it requires fewer iterations. For more details see [66].

Starting and stopping


In many cases there will be no good a priori estimates available for the ini-
tial point x0 . As mentioned in the previous section, nonlinear least squares
problems are usually non-convex and may have several local minima. There-
fore some global optimization technique must be used to descend from un-
desired local minima if the global minimum is required. One possibility is
a simple Monte Carlo strategy, in which multiple initial points are chosen
at random and the least squares algorithm is started several times. It is

hoped that some of these iterations will converge, and one can then choose
the best solution. See, e.g., [131] for an overview of global optimization
algorithms.
In order to make such a process more efficient, especially when function
evaluations are costly, a procedure has been described in [191] that saves
all iterates and confidence radii. An iteration is stopped if an iterate
lands in a previously calculated iterate’s confidence sphere. The assumption
underlying this decision is that the iteration will lead to a minimum already
calculated. The algorithm is trivially parallelizable and runs very efficiently
in a distributed network. Up to 40% savings have been observed from using
the early termination strategy. This simple approach is easily parallelizable,
but it can only be applied to low-dimensional problems.
Several criteria ought to be taken into consideration for stopping the
iterative methods described above.

• The sequence convergence criterion: ‖x_{k+1} − x_k‖_2 ≤ tolerance (not a particularly good one)!

• The consistency criterion: |f (xk+1 )| ≤ tolerance (relevant only for


problems with r(x∗ ) = 0).

• The absolute function criterion: ‖∇f(x_{k+1})‖_2 ≤ tolerance.

• The maximum iteration count criterion: k > kmax .

The absolute function criterion has a special interpretation for NLLSQ


problems, namely, that the residual at xk+1 is nearly orthogonal to the
subspace generated by the columns of the Jacobian. For the Gauss-Newton
and Levenberg-Marquardt algorithms, the necessary information to check
near-orthogonality is easily available.

Methods for step-length determination


Control of the step length is important for ensuring a robust algorithm that
converges from starting points far from the minimum. Given an iterate xk
and a descent direction p, there are two common ways to choose the step-
length αk .

• Take αk as the solution to the one-dimensional minimization problem

min_α ‖r(x_k + α p)‖_2.

This is expensive if one tries to find the exact solution. Fortunately,


this is not necessary, and so-called soft or inexact line search strategies
can be used.

• Inexact line searches: an αk is accepted if a sufficient descent condi-


tion is satisfied for the new point xk + αk p and if αk is large enough
that there is a gain in the step. Use the so-called Armijo-Goldstein
step-length principle, where αk is chosen as the largest number from the sequence α = 1, 1/2, 1/4, . . . for which the inequality

‖r(x_k)‖_2^2 − ‖r(x_k + α_k p)‖_2^2 ≥ ½ α_k ‖J(x_k) p‖_2^2

holds (a small code sketch of this rule is given below).

For details see, for example, chapter 3 in [170] and for a brief overview see
[20] p. 344.
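To make the Armijo-Goldstein principle concrete, here is a minimal Python sketch of the backtracking rule; the function and parameter names are ours, and r and J are assumed to be callables returning the residual vector and the Jacobian, respectively.

```python
import numpy as np

def armijo_goldstein_step(r, J, x, p, max_halvings=30):
    """Return the largest alpha in 1, 1/2, 1/4, ... satisfying
    ||r(x)||^2 - ||r(x + alpha*p)||^2 >= 0.5*alpha*||J(x) p||^2."""
    rx2 = np.linalg.norm(r(x))**2          # ||r(x_k)||_2^2
    Jp2 = np.linalg.norm(J(x) @ p)**2      # ||J(x_k) p||_2^2
    alpha = 1.0
    for _ in range(max_halvings):
        if rx2 - np.linalg.norm(r(x + alpha * p))**2 >= 0.5 * alpha * Jp2:
            break
        alpha *= 0.5                       # try the next value in 1, 1/2, 1/4, ...
    return alpha
```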

Software
In addition to the software already mentioned, the PORT library, avail-
able from AT&T Bell Labs and partly from Netlib [263], has a range of
codes for nonlinear least squares problems, some requiring the Jacobian
and others that need only information about the objective function. The
Gauss-Newton algorithm is not robust enough to be used on its own, but
most of the software for nonlinear least squares that can be found in NAG,
MATLAB and TENSOLVE (using tensor methods), have enhanced Gauss-
Newton codes. MINPACK-1, MATLAB and IMSL contain Levenberg-
Marquardt algorithms. The NEOS Guide on the Web [264] is a source of
information about optimization in general, with a section on NLLSQ. Also,
the National Institute of Standards and Technology Web page at [265] is
very useful, as well as the book [165].
Some software packages for large-scale problems with a sparse Jacobian
are VE10 and LANCELOT. VE10 [264], developed by P. Toint, implements
a line search method, where the search direction is obtained by a trun-
cated conjugate gradient technique. It uses an inexact Newton method for
partially separable nonlinear problems with sparse structure. LANCELOT
[148] is a software package for nonlinear optimization developed by A. Conn,
N. Gould and P. Toint. The algorithm combines a trust-region approach,
adapted to handle possible bound constraints and projected gradient tech-
niques. In addition it has preconditioning and scaling provisions.

9.5 Iteratively reweighted LSQ algorithms for


robust data fitting problems
In Chapter 1 we introduced the robust data fitting problem based on the principle of M-estimation, leading to the nonlinear minimization problem min_x Σ_{i=1}^m ρ(r_i(x)), where the function ρ is chosen so that it gives less

weight to residuals ri (x) with large absolute value. Here we consider the
important case of robust linear data fitting, where the residuals are the
elements of the residual vector r = b − A x. Below we describe several
iterative algorithms that, in spite of their differences, are commonly referred
to as iteratively reweighted least squares algorithms due to their use of a
sequence of linear least squares problems with weights that change during
the iterations.
To derive the algorithms we consider the problem of minimizing the function

f(x) = Σ_{i=1}^m ρ(r_i(x)),    r = b − A x.

We introduce the vector g ∈ Rm and the diagonal matrix D ∈ Rm×m with


elements defined by

g_i = ρ'(r_i),    d_ii = ρ''(r_i),    i = 1, . . . , m,    (9.5.1)

where ρ' and ρ'' denote the first and second derivatives of ρ(r) with respect
to r. Then the gradient and the Hessian of f are given by

∇f (x) = −AT g(x) and ∇2 f (x) = AT D(x) A, (9.5.2)

where we use the notation g(x) and D(x) to emphasize that these quantities
depend on x.
The original iteratively reweighted least squares algorithm [8] uses a
fixed-point iteration to solve ∇f (x) = 0. From (9.5.1) and (9.5.2) and
introducing the diagonal m × m weight matrix

W(r) = diag(ρ'(r_i)/r_i)  for  i = 1, . . . , m,    (9.5.3)

we obtain the nonlinear equations

AT g(x) = AT W (r) r = AT W (r) (b − A x) = 0

or
AT W (r) A x = AT W (r) b.
The fixed-point iteration scheme used in [8] is

x_{k+1} = (A^T W(r_k) A)^{-1} A^T W(r_k) b = argmin_x ‖W(r_k)^{1/2} (b − A x)‖_2,    k = 1, 2, . . .    (9.5.4)

where r k = b − A xk is the residual from the previous iteration. Hence,


each new iterate is the solution of a weighted least squares problem, in
which the weights depend on the solution from the previous iteration. This
algorithm, with seven different choices of the function ρ, is implemented

in a Fortran software package described in [50] where more details can be


found.
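A minimal Python sketch of the fixed-point iteration (9.5.4) is given below; it is not the package from [50], and the Huber function (with threshold beta) is used only as one possible choice of ρ, for which the weights ρ'(r_i)/r_i take a particularly simple form. The names, tolerance and iteration limit are our own choices.

```python
import numpy as np

def huber_weights(r, beta):
    """Weights rho'(r_i)/r_i for the Huber function with threshold beta."""
    w = np.ones_like(r)
    big = np.abs(r) > beta
    w[big] = beta / np.abs(r[big])
    return w

def irls(A, b, beta, max_iter=50, tol=1e-8):
    """Fixed-point iteratively reweighted least squares, cf. (9.5.4)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]        # start from the ordinary LSQ solution
    for _ in range(max_iter):
        w = huber_weights(b - A @ x, beta)          # weights from the previous residual
        sw = np.sqrt(w)
        x_new = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
        if np.linalg.norm(x_new - x) <= tol * (1 + np.linalg.norm(x_new)):
            return x_new
        x = x_new
    return x
```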
A faster algorithm for solving the robust linear data fitting problem – also commonly referred to as an iteratively reweighted least squares algorithm – is obtained by applying Newton's method from Section 9.1 to the function f(x) = Σ_{i=1}^m ρ(r_i(x)); see ([20], section 4.5.3) for details. According to (9.5.2), the Newton iteration is

x_{k+1} = x_k − (A^T D(x_k) A)^{-1} A^T g(x_k),    k = 1, 2, . . .    (9.5.5)

We emphasize that the Newton update is not a least squares solution, be-
cause the diagonal matrix D(xk ) appears in front of the vector g(xk ).
O’Leary [171] suggests a variant of this algorithm where, instead of updat-
ing the solution vectors, the residual vector is updated as

r_{k+1} = r_k − α_k A (A^T D(x_k) A)^{-1} A^T g(x_k),    k = 1, 2, . . . ,    (9.5.6)

and the step-length αk is determined via line search. Upon convergence


the robust solution xk is computed from the final residual vector r k as the
solution to the consistent system A x = b − r k . Five different numerical
methods for computing the search direction in (9.5.6) are compared in [171],
where it is demonstrated that the choice of the best method depends on
the function ρ. Wolke and Schwetlick [258] extend the algorithm to also
estimate the parameter β that appears in the function ρ and which must
reflect the noise level in the data.
As starting vector for the above iterative methods one can use the or-
dinary LSQ solution or, alternatively, the solution to the 1-norm problem min_x ‖b − A x‖_1 (see below).
The iteratively reweighted least squares formalism can also be used to solve the linear p-norm problem min_x ‖b − A x‖_p for 1 < p < 2. To see this, we note that

‖b − A x‖_p^p = Σ_{i=1}^m |r_i(x)|^p = Σ_{i=1}^m |r_i(x)|^{p−2} r_i(x)^2 = ‖W_p(r)^{1/2} (b − A x)‖_2^2,

where W_p(r) = diag(|r_i|^{p−2}). The iterations of the corresponding Newton-


type algorithm take the form

xk+1 = xk + Δxk , k = 1, 2, . . . , (9.5.7)

where Δx_k is the solution to the weighted LSQ problem

min_{Δx} ‖W_p(r_k)^{1/2} (r_k − A Δx)‖_2,    (9.5.8)

see (Björck [20], section 4.5.2) for details. Experience shows that, for p
close to 1, the LSQ problem (9.5.8) tends to become ill-conditioned as the
iterations converge to the robust solution. Special algorithms are needed
for the case p = 1; some references are [6, 51, 153].
For early use in geophysical prospecting see [48, 219].
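As an illustration, a bare-bones Python sketch of the Newton-type iteration (9.5.7)–(9.5.8) might look as follows; the small floor eps on |r_i|, which guards against division by zero, and the fixed iteration count are our own choices, and no special safeguards for p close to 1 are included.

```python
import numpy as np

def lp_irls(A, b, p=1.5, max_iter=50, eps=1e-8):
    """Approximate min_x ||b - A x||_p for 1 < p < 2 by IRLS, cf. (9.5.7)-(9.5.8)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]        # ordinary LSQ starting vector
    for _ in range(max_iter):
        r = b - A @ x
        w = np.maximum(np.abs(r), eps) ** (p - 2)   # W_p(r) = diag(|r_i|^(p-2))
        sw = np.sqrt(w)
        dx = np.linalg.lstsq(sw[:, None] * A, sw * r, rcond=None)[0]
        x = x + dx                                  # x_{k+1} = x_k + Delta x_k
    return x
```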

9.6 Variable projection algorithm


We finish this chapter with a brief discussion of algorithms to solve the
separable NLLSQ problem introduced in Section 8.4, where we listed some
important properties of this problem. The simplest method of solution that
takes into account the special structure is the algorithm NIPALS [256],
where an alternating optimization procedure is employed: the linear and
nonlinear parameters are successively fixed and the problem is minimized
over the complementary set. This iteration, however, only converges lin-
early.
The variable projection algorithm, developed by Golub and Pereyra
[101], takes advantage instead of the explicit coupling between the param-
eters α and a, in order to reduce the original minimization problem to two
subproblems that are solved in sequence, without alternation. The nonlin-
ear subproblem is smaller (although generally more complex) and involves
only the k parameters in α. One linear subproblem is solved a posteriori
to determine the n linear parameters in a. The difference with NIPALS is
perhaps subtle, but it makes the algorithm much more powerful, as we will
see below.
For the special case when φ_j(α, t) = e^{α_j t}, the problem is called expo-
nential data fitting, which is a very common and important application that
is notoriously difficult to solve. Fast methods that originated with Prony
[204] are occasionally used, but, unfortunately, the original method is not
suitable for problems with noise. The best-modified versions are from M.
Osborne [175, 176, 177], who uses the separability to achieve better results.
As shown in [198], the other method frequently used with success is the
variable projection algorithm. A detailed discussion of both methods and
their relative merits is found in chapter 1 of [198], while the remaining
chapters show a number of applications in very different fields and where
either or both methods are used and compared.
The key idea behind the variable projection algorithm is to eliminate
the linear parameters analytically by using the pseudoinverse, solve for
the nonlinear parameters first and then solve a linear LSQ problem for
the remaining parameters. Figure 9.6.1 illustrates the variable projection
principle in action. We depict I − Φ(α)Φ† (α) for a fixed α as a linear
mapping from Rn → Rm . That is, its range is a linear subspace of dimension
n (or less, if the matrix is rank deficient). When α varies, this subspace pivots around the origin.

Figure 9.6.1: The geometry behind the variable projection principle. For each α_i the projection I − Φ(α)Φ†(α) maps R^n into a subspace in R^m (depicted as a line).

For each α the residual is equal to the Euclidean
distance from the data vector y to the corresponding subspace. As usual,
there may not be any subspace to which the data belong, i.e., the problem
is inconsistent and there is a nonzero residual at the solution. This residual
is related to the l2 approximation ability of the basis functions φj (α, t).
There are several important results proved in [101], [227] and [212],
which show that the reduced function is better conditioned than the original
one. Also, we observe that

• The reduction in dimensionality of the problem has as a consequence


that fewer initial parameters need to be guessed to start a minimiza-
tion procedure.

• The algorithm is valid in the rank-deficient case. To guarantee con-


tinuity of the Moore-Penrose generalized inverse, only local constant
rank of Φ(α) needs to be assumed.

• The reduced nonlinear function, although more complex and there-


fore apparently costlier to evaluate, gives rise to a better-conditioned
problem, which always takes fewer iterations to converge than the full
problem [212]. This may include convergence when the Levenberg-
Marquardt algorithm for the full function does not converge.

• By careful implementation of the linear algebra involved and use of


a simplification due to Kaufman [140], the cost per iteration for the
reduced function is similar to that for the full function, and thus
minimization of the reduced problem is always faster. However, in

hard problems this simplification may lead to a less robust algorithm


[172].
The linear problem is easily solved by using the methods discussed in previ-
ous chapters. There are two types of iterative methods to solve the NLLSQ
problem in the variable projection algorithm:
• Derivative free methods such as PRAXIS [263] require only an
efficient computation of the nonlinear function

f_2(α) = ½ ‖y − Φ(α)Φ(α)† y‖_2^2 = ½ ‖P_{Φ(α)} y‖_2^2,  where P_{Φ(α)} = I − Φ(α)Φ(α)†.

Instead of the more expensive pseudoinverse computation it is possible


to obtain PΦ(α) by orthogonally transforming Φ(α) into trapezoidal
form. One obtains then a simple expression for the evaluation of the
function from the same orthogonal transformation when applied to
y. For details see [101].
• Methods that need derivatives of r(α) = y − Φ(α)Φ(α)† y. In [101], a formula for the Fréchet derivative of the orthogonal projector P_{Φ(α)} was developed and then used in the Levenberg-Marquardt algorithm, namely:

Dr(α) = −[ P_{Φ(α)} D(Φ(α)) Φ(α)† + (P_{Φ(α)} D(Φ(α)) Φ(α)†)^T ] y.

Kaufman [140] ignores the second term, producing a saving of up to


25% in computer time, without a significant increase in the number
of iterations. The Levenberg-Marquardt algorithm used in [101] and
[140] starts with an arbitrary α_0 and determines the iterates from the relation:

α_{k+1} = α_k − [ Dr(α_k) ; √λ_k I ]† [ r(α_k) ; 0 ],

where the semicolon denotes vertical stacking of the two blocks. At each iteration, this linear LSQ problem is solved by orthogonal transformations. Also, the Marquardt parameter λ_k can be determined so that divergence is prevented by enforcing descent: ‖r(α_{k+1})‖_2 < ‖r(α_k)‖_2.
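To make the variable projection principle concrete, here is a small Python sketch (it is not VARPRO, and all names are ours). It evaluates the reduced residual y − Φ(α)Φ(α)†y through a linear least squares solve and hands it to a general-purpose Levenberg-Marquardt solver with finite-difference derivatives; the linear parameters are recovered a posteriori.

```python
import numpy as np
from scipy.optimize import least_squares

def varpro_fit(Phi, alpha0, t, y):
    """Separable NLLSQ by variable projection.  Phi(alpha, t) must return the
    m-by-n matrix of basis functions phi_j(alpha, t_i)."""
    def reduced_residual(alpha):
        M = Phi(alpha, t)
        a = np.linalg.lstsq(M, y, rcond=None)[0]   # best linear parameters for this alpha
        return y - M @ a                           # = (I - Phi Phi^+) y
    res = least_squares(reduced_residual, alpha0, method="lm")  # finite-difference Jacobian
    M = Phi(res.x, t)
    a = np.linalg.lstsq(M, y, rcond=None)[0]       # recover the linear parameters
    return a, res.x

# Example basis for exponential fitting: Phi(alpha, t) = [1, e^{alpha_1 t}, e^{alpha_2 t}]
# Phi = lambda alpha, t: np.column_stack(
#     [np.ones_like(t), np.exp(alpha[0] * t), np.exp(alpha[1] * t)])
```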
The original VARPRO program by Pereyra is listed in a Stanford Uni-
versity report [100]; there the minimization is done via the Osborne mod-
ification of the Levenberg-Marquardt algorithm. A public domain version,
including modifications and additions by John Bolstadt, Linda Kaufman
and Randy LeVeque, can be found in Netlib [263] under the same name.
It incorporates the Kaufman modification, where the second term in the
Fréchet derivative is ignored and the information provided by the program
is used to generate a statistical analysis, including uncertainty bounds for the estimated linear and nonlinear parameters.

Figure 9.6.2: Data, fit and residuals for the exponential fitting problem.

Recent work by O’Leary


and Rust indicates that in certain problems the Kaufman modification can
make the algorithm less robust. They present in that work a modern and
more modular implementation. In the PORT library [86] of Netlib there
are careful implementations by Gay and Kaufman of variable projection
versions for the case of unconstrained and constrained separable NLLSQ
problems, including the option of using finite differences to approximate
the derivatives. VARP2 [98, 141] is an extension for problems with multi-
ple right-hand sides and it is also available in Netlib.
The VARPRO program was influenced by the state of the art in comput-
ing at the time of its writing, leading to a somewhat convoluted algorithm,
and in a way most of the sequels inherited this approach. The computa-
tional constraints in memory and operation speeds have now been removed,
since for most problems to which the algorithm is applicable in standard
current machinery, results are produced quickly, even if multiple runs are
executed in order to try to get a global minimum. In [198, 228] there is a
description of a simplified approach for the case of complex exponentials,
where some efficiency is sacrificed in order to achieve a clearer implemen-
tation that is easier to use and maintain.

Example 97. The following example from [174] illustrates the competitive-
ness of the variable projection method, both in speed and robustness. Given
a set of m = 33 data points (ti , yi ) (the data set is listed in the VARPRO

code in Netlib), fit the exponential model:

M(a, α, t) = a_1 + a_2 e^{α_1 t} + a_3 e^{α_2 t}.

We compare the Levenberg-Marquardt algorithm for minimization of the


full function with VARPRO, which uses the reduced function. The two
algorithms needed 32 and 8 iterations, respectively, to reduce the objective
function to less than 5 · 10−5 ; this is a substantial saving, considering that
the cost per iteration is similar for the two approaches. Moreover, the
condition numbers of the respective linearized problems close to the solution
are 48845 and 6.9; a clear example showing that the reduced problem is
better conditioned than the original one. Figure 9.6.2 shows the fit and the
residuals for the VP algorithm. The autocorrelation analysis from Section
1.4 gives = −5 · 10−6 and T = 9.7 · 10−6 , showing that the residuals can
be considered as uncorrelated and therefore that the fit is acceptable. The
least squares solution for this problem is

a∗ = (0.375, −1.46, 1.94)T , α∗ = (0.0221, 0.0129)T .

As explained in Section 3.5, the sensitivity of this solution can be assessed


via the diagonal elements of the estimated covariance matrix

ς^2 (J(x*)^T J(x*))^{-1}.

In particular, the square roots of the diagonal elements of this matrix are
estimates of the standard deviations of the five parameters. If we use the
estimate ς^2 ≈ ‖r(x*)‖_2^2/m, then these estimated standard deviations are

0.00191, 0.204, 0.203, 0.000824, 0.000413,

showing that the two linear parameters a2 and a3 are potentially very sen-
sitive to perturbations. To illustrate this, the two alternative sets of param-
eters
â = (0.374, −1.29, 1.76)T , α̂ = (0.0231, 0.0125)T
and
ã = (0.378, −2.29, 2.76)T , α̃ = (0.0120, 0.0140)T ,
both give fits that are almost indistinguishable from the least squares fit.
The corresponding residual norms are

‖r(x*)‖_2 = 7.39 · 10−3 ,   ‖r(x̂)‖_2 = 7.79 · 10−3 ,   ‖r(x̃)‖_2 = 8.24 · 10−3 ,

showing that the large changes in a2 and a3 give rise to only small variations
in the residuals, again demonstrating the lack of sensitivity of the residual
to changes in those parameters.

9.7 Block methods for large-scale problems


We have already mentioned the inexact Gauss-Newton and Levenberg-
Marquardt methods, based on truncated iterative solvers, as a way to deal
with large computational problems. A different approach is to use a divide-
and-conquer strategy that, by decomposing the problem into blocks, may
allow the use of standard solvers for the (smaller) blocks. First, one subdi-
vides the observations in appropriate non-overlapping groups. Through an
SVD analysis one can select those variables that are more relevant to each
subset of data; details of one such method for large-scale, ill-conditioned
nonlinear least squares problems are given below and in [188, 189, 191].
The procedure works best if the data can be broken up in such a way that
the associated variables have minimum overlap and only weak couplings are
left with the variables outside the block. Of course, we cannot expect the
blocks to be totally uncoupled; otherwise, the problem would decompose
into a collection of problems that can be independently solved.
Thus, in general, the procedure consists of an outer block nonlinear
Gauss-Seidel or Jacobi iteration, in which the NLLSQ problems correspond-
ing to the individual blocks are solved approximately for their associated
parameters. The block solver is initialized with the current value of the vari-
ables. The full parameter set is updated either after each block is processed
(Gauss-Seidel strategy of information updating as soon as it is available),
or after all the block solves have been completed (Jacobi strategy). The ill-
conditioning is robustly addressed by the use of the Levenberg-Marquardt
algorithm for the subproblems and by the threshold used to select the vari-
ables for each block.
The pre-processing for converting a large problem into block form starts
by scanning the data and subdividing it. The ideal situation is one in which
the data subsets are only sensitive to a small subset of the variables. Hav-
ing performed that subdivision, we proceed to analyze the data blocks to
determine which parameters are actually well determined by each subset of
data. During this data analysis phase we compute the SVD of the Jaco-
bian of each block; this potentially very large matrix is trimmed by deleting
columns that are zero or smaller than a given threshold. Finally, the right
singular vectors of the SVD are used to complete the analysis.

Selecting subspaces through the SVD


Jupp and Vozoff [138] introduced the idea of relevant and irrelevant param-
eters based on the SVD. We write first the Taylor expansion of the residual
vector at a given point x (see §8.2):

r(x + h) = r(x) + J(x) h + O(‖h‖_2^2).    (9.7.1)



Considering the SVD of the Jacobian J(x) = U Σ V T , we can introduce


the so-called rotated perturbations

p = σ_1 V^T h,

where σ_1 is the largest singular value of J(x). Neglecting higher-order terms in (9.7.1) we can write this system as

r(x + h) − r(x) = J(x) h = Σ_{i=1}^r (σ_i/σ_1) p_i u_i,
where σi /σ1 are the normalized singular values and r is the rank of the Ja-
cobian. This shows the direct relationship between the normalized singular
values, the rotated perturbations p_i = σ_1 v_i^T h, and their influence on the
variation of the residual vector. Thus,

‖r(x + h) − r(x)‖_2^2 = Σ_{i=1}^r (σ_i/σ_1)^2 p_i^2,

which shows that those parameters pi that are associated with small normal-
ized singular values will not contribute much to variations in the residual.
Here we have assumed that all the components of the perturbation vector
h are of similar size.
The above analysis is the key to the algorithm for partitioning the pa-
rameter set into blocks, once a partition of the data set has been chosen.
Let RIk denote the row indices for the kth block of data with m[k] elements.
Given a base point x, calculate the Jacobian J(x)[k] for this data set, i.e.,
J(x)[k] has only m[k] rows and less than n columns, since if the data have
been partitioned appropriately and due to the local representation of the
model, we expect that there will be a significant number of columns with
small components that can be safely neglected. Then the procedure is as follows (a small code sketch of these steps is given after the list):
1. Compute the SVD of J(x)[k] and normalize the singular values by
dividing them by the largest one.
2. Select the first n[k] normalized singular values that are above a given
threshold, and their associated right singular vectors.
3. Inspect the set of chosen singular vectors and select the largest com-
ponents of V [k] in absolute value.
4. Choose the indices of the variables in parameter space corresponding
to the columns of V [k] that contain large entries to form the set CIk of
column indices. This selects the subset of parameters that have most
influence on the variation in the misfit functional for the given data
set.
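The following Python sketch implements steps 1–4 for one data block; the two thresholds are arbitrary illustrative values, and all names are our own.

```python
import numpy as np

def select_block_variables(J_block, sv_threshold=1e-2, comp_threshold=0.1):
    """Return the column indices CI_k of the parameters that the data block
    determines well, based on the SVD of its Jacobian J_block (m_k-by-n)."""
    U, s, Vt = np.linalg.svd(J_block, full_matrices=False)
    s_norm = s / s[0]                                 # step 1: normalized singular values
    n_k = int(np.sum(s_norm > sv_threshold))          # step 2: values above the threshold
    V_k = Vt[:n_k, :].T                               # associated right singular vectors
    large = np.abs(V_k).max(axis=1) > comp_threshold  # steps 3-4: rows with large entries
    return np.flatnonzero(large)                      # indices forming CI_k
```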

With this blocking strategy, variables may be repeated in different subsets.


Observe also that it is possible for the union of all the subsets to be smaller
than the entire set of variables; this will indicate that there are variables
that cannot be resolved by the given data, at least in a neighborhood of
the base point x and for the chosen threshold. Since this analysis is local,
it should be periodically revised as the optimization process advances.
Once this partition has been completed, we use an outer block nonlinear
Gauss-Seidel or Jacobi iteration [173] in order to obtain the solution of the
full problem. To make this more precise, let us partition the index sets
M ={1, 2, . . . , m} and N = {1, 2, . . . , n} into the subsets {RIk } and {CIk },
i.e., the index sets that describe the partition of our problem into blocks.
Each subproblem can now be written as

min_x ½ ‖r(x)^{[k]}‖_2^2   subject to   x_i = x*_i,  i ∈ {1, 2, . . . , n} − CI_k,    (9.7.2)

where r(x)[k] is the sub-vector of r(x) with elements {r(x)i }i∈RIk . In other
words, we fix the values of the parameters that are not in block k to their
current values in the global set of variables x∗ . Observe that the dimension
of the subproblem is then m[k] × n[k] . By considering enough subsets k =
1, . . . , K we can make these dimensions small, especially n[k] and therefore
make the subproblems (9.7.2) amenable to direct techniques (and global
optimization, if necessary).
One step of a sequential block Gauss-Seidel iteration consists then in
sweeping over all the blocks, solving the subproblems to a certain level of
accuracy, and replacing the optimal values in the central repository of all
the variables at once. A sequential block Jacobi iteration does the same,
but it does not replace the values until the sweep over all the blocks is
completed.
Since we allow repetitions of variables in the sub-blocks, it is prudent
to introduce averaging of the multiple appearances of variables. In the case
of Jacobi, this can be done naturally at the end of a sweep. In the case of
Gauss-Seidel, one needs to keep a count of the repetitions and perform a
running average for each repeated variable.
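A schematic sequential block Gauss-Seidel sweep is sketched below in Python; the residual function, the index sets RI_k and CI_k, and the per-block solver are assumed to be given, repeated variables are simply overwritten rather than averaged, and no convergence test is included.

```python
import numpy as np
from scipy.optimize import least_squares

def block_gauss_seidel(residual, x0, row_blocks, col_blocks, n_sweeps=5):
    """residual(x): full residual vector r(x); row_blocks[k], col_blocks[k]: RI_k, CI_k."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_sweeps):
        for RI, CI in zip(row_blocks, col_blocks):
            def block_residual(z, RI=RI, CI=CI):
                xt = x.copy()
                xt[CI] = z                     # free only the block's own parameters
                return residual(xt)[RI]        # sub-vector r(x)^[k]
            sol = least_squares(block_residual, x[CI])   # solve the small subproblem
            x[CI] = sol.x                      # Gauss-Seidel: update immediately
    return x
```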

Parallel asynchronous block nonlinear Gauss-Seidel


iteration
The idea of chaotic relaxation for linear systems originated with Rosenfeld
in 1967 [211]. Other early actors in this important topic were A. Ostrowski
[178] and S. Schechter [221]. Chazan and Miranker published in 1969 a
detailed paper [43] describing and formalizing chaotic iterations for the
parallel iterative solution of systems of linear equations. This was extended
to the nonlinear case in [159, 160].

The purpose of chaotic relaxation is to facilitate the parallel implemen-


tation of iterative methods in a multi-processor system or in a network of
computers by reducing the amount of communication and synchronization
between cooperating processes and by allowing that assigned sub-tasks go
unfulfilled. This is achieved by not requiring that the relaxation follow a
pre-determined sequence of computations, but rather letting the different
processes start their evaluations from a current, centrally managed value of
the unknowns.
Baudet [10] defines the class of asynchronous iterative methods and
shows that it includes chaotic iterations. Besides the classical Jacobi and
Gauss-Seidel approaches he introduces the purely asynchronous method in
which, at each iteration within a block, current values are always used.
This is a stronger cooperative approach than Gauss-Seidel and he shows
in numerical experimentation how it is more efficient with an increasing
number of processors.
We paraphrase a somewhat more restricted definition for the case of
block nonlinear optimization that concerns us. Although Baudet’s method
would apply directly to calculating the zeros of ∇f (x), we prefer to describe
the method in the optimization context in which it will be used. We use
the notation introduced above for our partitioned problem.
Let

f(x) = ½ ‖r(x)‖_2^2 = Σ_{k=1}^K ½ ‖r(x)^{[k]}‖_2^2 = Σ_{k=1}^K f(x)^{[k]},

where the data are partitioned into K non-overlapping blocks. An asynchronous iteration for calculating min_x f(x) starting from the vector x_0 is a sequence of vectors x_k with elements defined by

x_j^{[k]} = arg min_x f(x)^{[k]}   for j ∈ CI_k,

subject to

x_j^{[k]} = x_j   for j ∉ CI_k.
The initial vector for the kth minimization is

x[k] = (x1 (s1 (k)), . . . , xn (sn (k))),


where S = {s1 (k), . . . , sn (k)} is a sequence of elements in N n that indicates
at which iteration a particular component was last updated. In addition,
the following conditions should be satisfied:

si (k) ≤ k − 1 and si (k) → ∞, as k → ∞.

These conditions guarantee that all the variables are updated often enough,
while the formulation allows for the use of updated subsets of variables “as

they become available.” Baudet [10] gives sufficient conditions for the con-
vergence of this type of process for systems of nonlinear equations. The
convergence of an asynchronous block Gauss-Seidel process and as a spe-
cial case, the previously described sequential algorithm, follows from the
following theorem proved in [188].

Theorem 98. Convergence of the asynchronous BGS L-M iteration. Let the operator

T(x) = x − (J(x)^T J(x) + λI)^{−1} J(x)^T r(x)

be Lipschitz, with constant γ, and uniformly monotone with constant μ, in a neighborhood of a stationary point of f(x), with 2μ/γ^2 > 1. Then T is vector contracting and the asynchronous Levenberg-Marquardt method for minimizing ½ ‖r(x)‖_2^2 is convergent.

As we indicated above, an alternative to this procedure is a chaotic


block Jacobi iteration, where all the processes use the same initial vector
at the beginning of a sweep, at the end of which the full parameter vector
is updated. In general, asynchronous block Gauss-Seidel with running av-
erages is the preferred algorithm, since Jacobi requires a synchronization
step at the end of each sweep that creates issues of load balancing.
Chapter 10

Ill-Conditioned Problems

Throughout this book, ill conditioning and rank deficiency have been dis-
cussed in several places. For completeness, we will summarize the main
ideas now. This chapter therefore surveys some important aspects of LSQ
problems with ill-conditioned matrices; for more details and algorithms we
refer to [20, 118, 121].

10.1 Characterization
As we have already discussed, the consequence of ill conditioning is per-
turbation amplification. The basic tool for analyzing ill-conditioned linear
least squares problems is the singular value decomposition A = U ΣV T .
Assume that A has no zero singular values and that the data in the LSQ problem are perturbed, i.e., b = b̄ + e, where b̄ is the exact right-hand side, x̄ is the exact solution and the noise vector e = Σ_{i=1}^n (u_i^T e) u_i represents white noise with zero mean due to measurement and/or approximation errors. Then the perturbed solution to the LSQ problem is

x* = Σ_{i=1}^n (u_i^T b / σ_i) v_i = Σ_{i=1}^n (u_i^T b̄ / σ_i) v_i + Σ_{i=1}^n (u_i^T e / σ_i) v_i = x̄ + Σ_{i=1}^n (u_i^T e / σ_i) v_i.

Given the statistical assumption on e, it follows that E(|u_i^T e|) is constant.
Hence, the contribution to the LS solution from the noise is amplified more
by the smaller singular values. This behavior is particularly pronounced
when A has a large condition number cond(A) = σ1 /σn . For these ill-
conditioned problems we must consider how to deal with the highly ampli-
fied noise in the solution.
Ill-conditioned problems are characterized by their distribution of the
singular values. Among the ill-conditioned problems there are two impor-

191
192 LEAST SQUARES DATA FITTING WITH APPLICATIONS

Figure 10.1.1: The singular values of a 512 × 512 discretization of the


“phillips” test problem. Notice that the singular values decay gradually to
zero with no specific gap between large and small values.

tant classes that have achieved special attention due to their importance:
numerically rank-deficient problems and discrete ill-posed problems.
Numerically rank-deficient problems have a cluster of small singular val-
ues with respect to some threshold, and there is a distinct gap between
the larger and smaller singular values. The linear prediction problem from
Example 28 is an example of such a problem. The numerical rank defi-
ciency underlying this class of problems represents linear dependencies in
the underlying mathematical model. The small singular values may be of
the order of the machine precision, relative to σ1 , in which case they are
nonzero due to rounding errors in the computations. Alternatively, the
small singular values can be somewhat larger – but still well separated
from the larger ones – in which case their sizes represent approximation
and truncation errors in the model.
Discrete ill-posed problems are generated by the discretization of con-
tinuous ill-posed problems such as Fredholm equations of the first kind and
similar inverse problems. These problems have singular values that decay
gradually to zero, and there is no gap anywhere between large and small
singular values. This characteristic behavior reflects fundamental proper-
ties of the underlying ill-posed problem. Figure 10.1.1 shows the singular
values of a discretization of size 512 × 512 of the particular test problem
“phillips” (see p. 32 in [121]). As the matrix dimensions of a discrete ill-
posed problem increase, eventually the computed singular values will level
off as a result of rounding errors.

10.2 Regularization methods


Whenever small singular values are present, the idea of regularization is
to extract the linearly independent information from the matrix and the

noisy right-hand side. For numerically rank-deficient problems the concept


of numerical rank comes into play, while for discrete ill-posed problems the
approach is perhaps less intuitive. In both cases, however, the objective is
to suppress the noisy SVD components of the LSQ solution corresponding
to the small singular values.
If the problems are “small,” in the sense that we can easily compute the
SVD, then it is often natural to regularize the problem through the use of
the truncated SVD (TSVD) method. In TSVD, the number of terms used
in the least squares solution is restricted to k < n, so that the regularized
solution will be

TSVD solution:   x*_k = Σ_{i=1}^k (u_i^T b / σ_i) v_i.    (10.2.1)

A convenient way to interpret this approach is that, instead of solving the


original problem with an ill-conditioned matrix, we solve a problem with
an approximate, mathematically rank-deficient and better-conditioned one.
This is because the TSVD solution x*_k solves the problem

min_x ‖A_k x − b‖_2   with   A_k = Σ_{i=1}^k σ_i u_i v_i^T.    (10.2.2)

The regularization is controlled by the truncation parameter k, which is


chosen to minimize the influence of noise. It can be shown that the condition
number for x∗k is σ1 /σk , which, depending on the choice of k, can be much
smaller than cond(A) = σ1 /σn . For numerically rank-deficient problems
it is natural to select the parameter k according to the singular values –
specifically as the numerical rank with respect to a threshold that reflects
the errors in the data. For discrete ill-posed problems, on the other hand,
the choice of k must involve the right-hand side; we return to this aspect
in the next section.
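For problems small enough that the SVD can be computed, the TSVD solution (10.2.1) takes only a few lines; a Python sketch for a dense matrix (names are ours):

```python
import numpy as np

def tsvd_solution(A, b, k):
    """Truncated SVD solution x_k of min ||A x - b||_2, cf. (10.2.1)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    coef = (U[:, :k].T @ b) / s[:k]     # u_i^T b / sigma_i,  i = 1, ..., k
    return Vt[:k, :].T @ coef
```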
An alternative strategy is Tikhonov regularization [236] (which often
gives results that are similar to TSVD, cf. [24], p. 512). Tikhonov regular-
ization can be implemented efficiently for much larger problems because,
as we shall see, it leads to a least squares problem. The underlying idea
is to add to the LSQ minimization problem a quadratic penalty term that
prevents the solution norm from growing large. The original problem is
now replaced by computation of the solution x*_λ to the problem

Tikhonov problem:   min_x { ‖A x − b‖_2^2 + λ^2 ‖x‖_2^2 },    (10.2.3)

which is equivalent to the damped least squares problem

min_x ‖ [A; λI] x − [b; 0] ‖_2,    (10.2.4)

where [A; λI] and [b; 0] denote the matrix and right-hand side obtained by stacking A on top of λI and b on top of the zero vector.

Here λ is the regularization parameter that controls the trade-off between


minimizing the residual norm and minimizing the norm of the solution.
Since the Tikhonov solution x∗λ is a solution to a LSQ problem, it can be
computed by a variety of direct or iterative methods suited for the particular
properties of the matrix A (structure, sparsity, etc). A disadvantage is
that the damped LSQ problem must be solved for every value of λ, which
can be cumbersome if one needs to try many different values; however,
if a bidiagonalization of A can be computed (cf. section 5.5), then the
computational effort for each λ is reduced considerably.
One often encounters the alternative formulation (AT A+λ2 I) x = AT b,
which is the normal equations for (10.2.4) and therefore mathematically
equivalent to (10.2.3) and (10.2.4). This formulation is not well suited
for numerical computations due to the loss of accuracy when forming the
cross-product matrix AT A with the ill-conditioned matrix A.
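A simple and numerically sound alternative is to solve the stacked formulation (10.2.4) directly with a standard least squares routine; a Python sketch for one value of λ and a dense matrix:

```python
import numpy as np

def tikhonov_solution(A, b, lam):
    """Tikhonov solution via the damped least squares problem (10.2.4)."""
    n = A.shape[1]
    A_aug = np.vstack([A, lam * np.eye(n)])        # stacked matrix [A; lam*I]
    b_aug = np.concatenate([b, np.zeros(n)])       # stacked right-hand side [b; 0]
    return np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]
```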
A common way to present and analyze the Tikhonov solution x*_λ is via the SVD representation:

Tikhonov solution:   x*_λ = Σ_{i=1}^n γ_i (u_i^T b / σ_i) v_i   where   γ_i = σ_i^2 / (σ_i^2 + λ^2).    (10.2.5)
The γi are called filters, and they damp the effect of terms corresponding
to singular values smaller than λ, thus establishing the relation with the
TSVD.
For even larger problems, both the SVD and LSQ computations, associ-
ated with the TSVD and Tikhonov methods, are prohibitive. If the matrix
A is sparse and one knows beforehand that only a small fraction of the SVD
components should be included in the regularized solution, then it could
be worthwhile to compute approximate singular values via Golub-Kahan
iterations; see [12] and the code LASVD in [263].
An alternative is to use regularizing iterations, which rely on the fact
that some iterative methods for the LSQ problem, such as CGLS and LSQR,
have a regularizing effect, where the number of iterations plays the role of
the regularization parameter (see [117] and [121]). Since these methods
are defined by Krylov iterations, they restrict the solution x(k) in the kth
iteration to a k-dimensional subspace spanned by the right Golub-Kahan
vectors. They can therefore be considered as projection methods that re-
duce the large and ill-conditioned problem to a smaller one with only k
unknowns. Using the Courant-Fischer min-max theorem [20] one can show
that these approximations are better conditioned than the original problem.
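In practice such regularizing iterations are easy to carry out with an off-the-shelf LSQR implementation by simply capping the number of iterations; for instance, with SciPy's sparse LSQR (the choice of the iteration count k is up to the user, and the damp argument alternatively implements Tikhonov's penalty term):

```python
from scipy.sparse.linalg import lsqr

def lsqr_regularized(A, b, k):
    """Run (at most) k LSQR iterations; k plays the role of the regularization parameter."""
    return lsqr(A, b, iter_lim=k)[0]
```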
Both theory and computational experience show that the Krylov sub-
space algorithms in their early iterations produce regularized solutions x(k)
that capture the dominating SVD components of the exact solution, i.e.,
the ones corresponding to the largest singular values. However, at later iter-

ations — as the projected problems converge to the original ill-conditioned


one — more noise enters into the solution. That is why the subspace di-
mension k acts now as the regularization parameter.
In some situations noise may start to dominate the solution x(k) even
before all large singular values have been captured by the Golub-Kahan
process. When this happens, the “self-regularizing effect” of the projection
alone is insufficient to produce a useful regularized solution and instead so-
called hybrid methods should be used. Once the dimension of the Krylov
subspace is large enough to ensure that we capture all the desired infor-
mation in the data, a regularization method like TSVD or Tikhonov is
introduced for the much smaller bidiagonal matrices of the LSQR method.
In this way we still prevent noise from entering the solution; for details
see, e.g., [117], [118] and [144]. The disadvantage is that now two param-
eters have to be chosen: the number of iterations and the regularization
parameter for the projected bidiagonal problem.
One important fact in the case of TSVD and Tikhonov regularization, as
well as regularizing iterations, is that the solution and the residual norms
are monotone functions of the corresponding regularizing parameters. A
larger truncation parameter k, a smaller Tikhonov parameter λ, or a larger
number of iterations, correspond to an increase of the solution norm and a
decrease in the residual norm. This, as we will see in the next section, is
important for several parameter selection techniques.

10.3 Parameter selection techniques


In each of the regularization techniques described above some parameter
must be selected: the TSVD truncation parameter k, the Tikhonov regu-
larization parameter λ, or the number of CGLS or LSQR iterations. Here
we will briefly survey some computational techniques for choosing these pa-
rameters. By xreg we will denote the regularized solution from any of the
above methods. We focus on discrete ill-posed problems, where the choice
cannot be based solely on the singular values. These methods have in com-
mon the fact that they implicitly incorporate information about both the
singular values and the left singular vectors through the use of the residual
norm.
When accurate information is available about the variance of the noise,
represented by the perturbation vector e, the regularization parameter can
be chosen using Morozov’s discrepancy principle [144]. Here the regular-
ization parameter is chosen such that the residual norm for the regularized
solution is equal to – or close to – the norm of the noise vector:
‖b − A x_reg‖_2 ≈ ‖e‖_2.
This technique of “fitting to the noise level” was already introduced in Sec-

tion 6.4 as a stopping criterion for iterative methods. When used as a


parameter-choice method, one should be aware that the regularized solu-
tion is quite sensitive to underestimates of ‖e‖_2 [121].
When no reliable estimate is available for ‖e‖_2 , there are several other
possible parameter-choice methods. Among them is the L-curve technique,
which determines the parameter from a log/log plot of the solution norm
versus the residual norm, for different values of the regularizing parameter.
In other words, each point on the L-curve is given by
( log ‖b − A x_reg‖_2 , log ‖x_reg‖_2 ),
where both norms depend on the regularization parameter. For example,
in the case of TSVD the two norms depend on the truncation parameter k
as follows:

‖x_k‖_2^2 = Σ_{i=1}^k (u_i^T b / σ_i)^2,
‖b − A x_k‖_2^2 = Σ_{i=k+1}^n (u_i^T b)^2,

and it is easy to see that the norm ‖x_k‖_2 of the TSVD solution increases with k, i.e., when more SVD components are included. At the same time, the norm ‖b − A x_k‖_2 of the corresponding residual vector is seen to decrease with k. For Tikhonov regularization, one can show that ‖x_λ‖_2 increases monotonically when λ decreases, while ‖b − A x_λ‖_2 decreases monotonically
as λ decreases. See [240, 241] for an interesting theory and algorithm based
on bi-objective optimization to solve this problem efficiently. Possibly, using
[190, 194] techniques for sampling the Pareto front uniformly could be of
help in improving even more an excellent algorithm that has many other
important applications.
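For a dense problem, the points on the Tikhonov L-curve (and the residual norms needed for the discrepancy principle) are cheap to evaluate once the SVD is available; the following Python sketch returns them for a user-supplied grid of λ values (all names are ours).

```python
import numpy as np

def tikhonov_lcurve_points(A, b, lambdas):
    """Return (||b - A x_lambda||_2, ||x_lambda||_2) for each lambda, via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b                                       # SVD coefficients u_i^T b
    incompressible = np.linalg.norm(b - U @ beta)**2     # residual part outside range(A)
    res_norms, sol_norms = [], []
    for lam in lambdas:
        phi = s**2 / (s**2 + lam**2)                     # filter factors gamma_i, cf. (10.2.5)
        sol_norms.append(np.linalg.norm(phi * beta / s))
        res_norms.append(np.sqrt(np.linalg.norm((1 - phi) * beta)**2 + incompressible))
    return np.array(res_norms), np.array(sol_norms)
```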
Further analysis in [118, 121] shows that this curve, in log/log scale, will
often have a characteristic L shape as shown in Figure 10.3.1. The idea is
now to determine the regularization parameter (e.g., k or λ) correspond-
ing to the “corner” of the L-curve, the intuitive argument being that this
choice will balance the residual and solution norms. For more details and
algorithms we refer to [118, 120, 121]; the methods of Section 8.5 can also
be applied. The technique is known to fail if the solution is very smooth,
i.e., if the solution is dominated by very few SVD components [119].
When using iterative algorithms based on Golub-Kahan bidiagonaliza-
tion, one can use the so-called L-ribbons [33] to estimate the best regular-
ization parameter. These are inexpensive by-products of the G-K process,
and an extension of this work is the computation of the curvature ribbon de-
scribed in [32]. A similar methodology for large-scale problems is discussed
in [99].

Figure 10.3.1: Shaw test problem. Left: the singular values σi and the
absolute values of the right-hand side’s SVD coefficients |uTi b|; notice that
the latter level off at the standard deviation 10−3 of the noise. Middle:
the L-curve for Tikhonov regularization with corner for λ = 6.8 · 10−4 ; the
vertical line at ‖b − A x_λ‖_2 = ‖e‖_2 represents the discrepancy principle.
Right: the GCV function G(λ) with a minimum for λ = 8.1 · 10−4 .

Generalized cross-validation (GCV) is another parameter selection tech-


nique that does not need additional statistical information about the noise.
First proposed by Golub et al. [94], it seeks to minimize the predictive
mean-square error ‖b̄ − A x_λ‖_2 , where b̄ is the exact right-hand side. The
computational problem amounts to minimizing the GCV function, given
here for Tikhonov regularization:

G(λ) = [ (1/m) ‖b − A x_λ‖_2^2 ] / [ (1/m) trace(I − F_λ) ]^2,    F_λ = A (A^T A + λ^2 I)^{−1} A^T.    (10.3.1)
An efficient algorithm for computing the above GCV function, based on
the bidiagonalization of A, can be found in [73].
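For dense problems G(λ) can also be evaluated directly from the SVD, since trace(I − F_λ) = m − Σ_i σ_i²/(σ_i² + λ²); a Python sketch (one would typically minimize this function over a grid of λ values or with a one-dimensional optimizer):

```python
import numpy as np

def gcv_function(A, b, lam):
    """Evaluate the GCV function (10.3.1) for Tikhonov regularization via the SVD."""
    m = A.shape[0]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    phi = s**2 / (s**2 + lam**2)                     # filter factors
    res2 = np.linalg.norm((1 - phi) * beta)**2 + np.linalg.norm(b - U @ beta)**2
    return (res2 / m) / ((m - np.sum(phi)) / m)**2   # trace(I - F_lambda) = m - sum(phi)
```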
Yet another approach for choosing the regularization parameter is to
perform a residual analysis along the lines described in Section 1.4. An
advantage of this approach is that it incorporates statistical information
from the right-hand side (instead of just using the residual norm as in the
above methods). In the case of white noise in the data, we emphasize the
use of the normalized cumulative periodogram as originally advocated by
Rust [215, 216], where the key idea is to choose the regularization parameter
as soon as the residuals can be considered white noise. For more details
and implementation aspects see [122, 216].
A careful comparison of the different parameter-choice methods can
be found in [118]. The general conclusion is that the optimal parameter
selection technique is problem and regularization method dependent.
Example 99. Regularization of the “shaw” test problem. In this
example we consider the “shaw” test (see p. 49 in [121]), which models the
scattering of light as it passes through a thin slit. We use a discretization

with m = n = 64 and we add white Gaussian noise with standard deviation


ς = 10−3 to the right-hand side. The left part of Figure 10.3.1 shows the
singular values σi and the absolute values of the right-hand side’s SVD
coefficients |uTi b|; notice that the latter level off at the standard deviation
10−3 of the noise. This means that all the SVD coefficients for i > 10 are
dominated by the noise contributions uTi e and therefore should be purged
from the regularized solution. This can be achieved by using k = 10 in the
TSVD method.
The middle part of Figure 10.3.1 shows the L-curve for the problem,
and we see two distinct parts – a horizontal part where too few SVD com-
ponents are included in the solution and a vertical part where too many
components are included. In the latter part the noisy SVD components
uTi e/σi completely dominate the solution. The corner corresponds to the
choice λ = 6.8 · 10−4 of the Tikhonov regularization parameter. Some of
the corresponding filter factors γi in (10.2.5) are γ8 = 0.98, γ9 = 0.79,
γ_10 = 1.2 · 10−2 and γ_11 = 2.0 · 10−4 , showing that this choice of λ filters
out SVD components for i > 10.
The vertical line in the plot at ‖b − A x_λ‖_2 = ‖e‖_2 = 8.2·10−3 represents the discrepancy principle, while the small inset figure shows ‖b − A x_λ‖_2 as a function of λ, together with a horizontal line at ‖e‖_2 = 8.2·10−3 . For this
problem, the discrepancy principle leads to the choice λ = 9.1 · 10−3 , which
is somewhat larger than that chosen by the L-curve, leading to a regularized
solution with effectively only 8 SVD components.
Finally, the right part of Figure 10.3.1 shows the GCV function G(λ)
in (10.3.1). The minimum occurs for λ = 8.1 · 10−4 , which is quite close to
the value indicated by the L-curve. The figure illustrates a potential problem
with the GCV method, namely, that the minimum is quite flat and thus can
be difficult to locate.

10.4 Extensions of Tikhonov regularization


The Tikhonov regularization problem in (10.2.3) was motivated by the need
to suppress the noisy SVD components of the LSQ solution, and this is achieved by adding the penalization term ‖x‖_2^2 , with weight λ^2 , to the LSQ
problem. In this section we discuss ways to extend this approach, depending
on the user’s prior information about the problem.
In some applications we have prior information about the “smoothness” of the solution, in the sense that instead of using the norm ‖x‖_2^2 as the penalty term we prefer to use ‖L x‖_2^2 where L is a p × n matrix. If rank(L) = n then ‖L x‖_2 is a norm, otherwise L has a non-trivial null space and ‖L x‖_2 is a semi-norm; in the latter case, the Tikhonov solution
is unique if the null spaces of A and L intersect trivially. Common choices
of L are finite-difference approximations to the first or second derivative

since they lead to regularized solutions that tend to be “smoother” than the
standard Tikhonov solution x∗λ , in the sense that there is a higher correla-
tion between neighboring elements of the solution (and the solutions appear
as a smoother function when plotted). If we do not incorporate boundary
conditions about the solution, then the standard choices of L are the two
rectangular matrices

        ⎛ −1   1              ⎞                ⎛ 1  −2   1              ⎞
  L_1 = ⎜       ⋱    ⋱        ⎟ ,        L_2 = ⎜      ⋱    ⋱    ⋱       ⎟ ,
        ⎝            −1   1   ⎠                ⎝           1  −2   1    ⎠

which represent discrete approximations to the first- and second-derivative


operators on a regular grid. Variants of these matrices that incorporate
boundary conditions, and extensions to two-dimensional problems, are dis-
cussed in Section 8.2 of [121]. The regularization and parameter-choice
algorithms for the case L = I can easily be extended to the case L ≠ I; we
refer to [118] for more details about theory, algorithms, and how to work
with semi-norms. See also section 7.1 on the LSQI problem.
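Setting up these operators is straightforward; for an n-vector the following Python sketch builds dense versions of L_1 and L_2 (for large n one would of course use sparse matrices instead):

```python
import numpy as np

def first_difference(n):
    """(n-1)-by-n matrix L1 with rows (-1, 1)."""
    return np.diff(np.eye(n), axis=0)

def second_difference(n):
    """(n-2)-by-n matrix L2 with rows (1, -2, 1)."""
    return np.diff(np.eye(n), n=2, axis=0)
```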
The above choice of penalty term is appropriate when the goal is to
compute smooth solutions; however, in certain applications we prefer to
compute solutions that are less smooth than those produced by Tikhonov
regularization. An important case is when our regularized solutions are
assumed to be piecewise smooth, i.e., we allow a small amount of elements
in the solution that are not correlated with their neighbors, leading to so-
lutions with small “jumps” in the size of the elements. The corresponding
penalty term originates in the concept of total variation (which is a funda-
mental tool in solving nonlinear conservation law problems and in image
processing), and it takes the form ‖L_1 x‖_1 , assuming that the solution x
represents a one-dimensional signal; see, e.g., [55] for the extension to the
two-dimensional case.
Total variation regularization is now used successfully in two- and three-
dimensional problems, such as image deblurring and tomographic recon-
struction. There are too many different algorithms available today for solv-
ing the total variation regularization problem to survey here; most of the
algorithms reformulate the problem as a convex optimization problem with
a smooth approximation to ‖L_1 x‖_1 and then use large-scale techniques tai-
lored to the particular application. See [55] for a recent survey of software
for total variation image deblurring.
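As a tiny illustration of the smooth-approximation idea for a one-dimensional signal, one can replace ‖L_1 x‖_1 by Σ_i ((L_1 x)_i^2 + β^2)^{1/2} with a small β and hand the resulting differentiable objective to a general-purpose optimizer. The Python sketch below does exactly that; it is not one of the specialized algorithms surveyed in [55], and the choices of β and of the optimizer are our own.

```python
import numpy as np
from scipy.optimize import minimize

def tv_regularized(A, b, lam, beta=1e-4):
    """Minimize ||A x - b||_2^2 + lam * sum_i sqrt((L1 x)_i^2 + beta^2)."""
    n = A.shape[1]
    L1 = np.diff(np.eye(n), axis=0)          # first-difference operator
    def fun(x):
        r = A @ x - b
        d = L1 @ x
        s = np.sqrt(d**2 + beta**2)          # smooth approximation of |d_i|
        f = r @ r + lam * s.sum()
        grad = 2 * A.T @ r + lam * (L1.T @ (d / s))
        return f, grad
    x0 = np.linalg.lstsq(A, b, rcond=None)[0]
    return minimize(fun, x0, jac=True, method="L-BFGS-B").x
```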

Example 100. This example illustrates the use of Tikhonov regularization


with L = I and L = L2 as well as total variation regularization. The under-
lying example is the test problem “shaw” from Example 99 with custom-made
solutions. The top and bottom rows in Figure 10.4.1 show reconstructions (solid lines) of a smooth and a non-smooth solution, respectively (shown as the dashed lines). For the smooth solution, Tikhonov with L = L_2 gives the best reconstruction, while for the non-smooth solution the total variation reconstruction is superior.

Figure 10.4.1: Reconstructions of smooth and non-smooth solutions by means of Tikhonov and total variation regularization.

Prior information can take other forms, and we have already seen, in
Chapter 1, that different types of noise in the data lead to different mini-
mization problem. Underlying the total least squares (TLS) problem from
Section 7.3 is the fact that the coefficient matrix can also be influenced by
errors. We can incorporate this point of view into the Tikhonov formalism
and define the regularized TLS (R-TLS) problem:
( )
A x − b22 2
min + λ 2
L x2 , (10.4.1)
x 1 + x22

where the matrix L is as discussed above. We make the following observa-


tions related to this problem.
1. The Tikhonov problem (10.2.3) with smoothing term ‖L x‖_2^2 and the R-TLS problem (10.4.1) have the alternative formulations

min_x ‖A x − b‖_2^2   s.t.   ‖L x‖_2 ≤ α

and

min_x ‖A x − b‖_2^2 / (1 + ‖x‖_2^2)   s.t.   ‖L x‖_2 ≤ α,

where α is a positive parameter. If α is smaller than ‖L x*‖_2 or ‖L x_TLS‖_2 , respectively (which is the relevant case for regularization
problems), then the solutions to the above two problems satisfy the
constraint with equality. We assume this in the discussion below.
2. For the case L = I the R-TLS problem (10.4.1) takes the form min_x {(1 + α^2)^{−1} ‖A x − b‖_2^2 + λ^2 ‖x‖_2^2}, which is identical to the standard-form Tikhonov problem in (10.2.3). Hence in this case the


R-TLS solution is identical to the Tikhonov solution. There is no
need to solve the more complicated R-TLS problem.
3. For the case L ≠ I it was proved in [93] that the R-TLS solution is a solution to the problem

(A^T A − λ_1 I + λ_2 L^T L) x = A^T b,

where the two parameters λ_1 and λ_2 are given by

λ_1 = ‖A x − b‖_2^2 / (1 + ‖x‖_2^2)   and   λ_2 = (b − A x)^T A x / α^2

and it is also shown in [93] that λ_2 > 0 as long as α < ‖L x_TLS‖_2 .


This means that the R-TLS solution is genuinely different from the
Tikhonov solution whenever the residual vector b − A x is different
from zero, since both λ1 and λ2 are nonzero in this case.
The most common algorithms for solving the R-TLS problem are variants
of an iterative procedure that require at each iteration the solution of a
quadratic eigenvalue problem; see [225] and [207]. A MATLAB implementa-
tion RTLSQEP of such an algorithm was written by D. M. Sima and is avail-
able from homes.esat.kuleuven.be/~dsima/software/RTLS/RTLS.html. An
alternative algorithm described in [11] is based on solving a sequence of
quadratic optimization subproblems with a quadratic constraint.

10.5 Ill-conditioned NLLSQ problems


An ill-conditioned nonlinear least squares problem has a solution that is
very sensitive to data perturbations. The notion of rank deficiency can be
extended to the nonlinear least squares problem. We say that a nonlinear
problem is rank deficient if the Jacobian J(x) has rank r < min{m, n} in a
neighborhood of a local minimum of the objective function f (x). Of course,
a Jacobian that is rank deficient along the iteration also presents a problem
(branching or bifurcation), which may call for additional techniques such
as continuation to follow different branches leading most likely to different
solutions.

Note that all the optimization methods introduced to solve the NLLSQ
problem depend heavily on the behavior of the Jacobian, even if the Hessian
is considered, since the Jacobian is also the dominant part of the second-
order information. One regularization technique, in the rank-deficient case,
is the TSVD approach, where one replaces the Jacobian J by Jk , the rank-
k approximation (10.2.2) defined using the SVD of J. Unfortunately, the
determination of an appropriate truncation parameter k is more costly than
in the linear case, since the Jacobian is changing with the iteration.
Tikhonov regularization is another choice. Instead of solving the original
minimization problem (8.1.3), the problem one considers is

min_x { ‖f(x)‖_2^2 + μ^2 ‖x − c‖_2^2 },

for some parameter μ. Here c is an approximation to the solution (if one


has a good estimate) and otherwise the zero vector.
As we saw in detail in the previous chapter, the method of Levenberg-
Marquardt is a regularized version of Gauss-Newton and close in spirit to
Tikhonov regularization in the nonlinear case. Böckman [28] has considered
regularization via a trust-region algorithm combined with separation of
variables and has obtained excellent results in comparison with state-of-
the-art solvers applied to the unreduced problem.
Chapter 11

Linear Least Squares Applications

We now present in detail two larger applications of linear least squares for
fitting of temperature profiles and geological surface modeling. Both of
them use splines, so for completeness, we start with a summary of their
definitions and basic properties.

11.1 Splines in approximation


This is a short summary of splines with an emphasis on univariate and
bivariate cubic splines, their B-splines representation, and their use in least
squares approximation. The topic of splines is important in many data
fitting applications, especially where there is no pre-determined functional
form. There is a distinct advantage in using splines for data representation,
whether in interpolation or least squares approximation, because of their
local support and consequent good quality of local approximation. We
concentrate in detail on cubic splines because they are well suited to our
type of applications. For an exhaustive discussion see some of the classical
books, such as [7, 29, 69].

Univariate splines
A spline is a function defined piecewise over its domain by polynomials of
a certain degree that are joined together smoothly.

Definition 101. A spline $s^{(k)}(x)$ of degree k, with domain [a, b] and nodes or knots (breakpoints) $a = t_0 \leq t_1 \leq \ldots \leq t_N = b$ has the following properties:


Figure 11.1.1: Left: four consecutive cubic B-splines defined on the uni-
formly distributed knots tj = 0, 2, . . . , 7. Right: Cubic spline with coincident
end knots at 0.

• it is a polynomial of degree k in each subinterval $[t_j, t_{j+1}]$, $j = 0, \ldots, N-1$;
• at each interior knot $t_1, \ldots, t_{N-1}$ that does not coincide with its neighbors, the function $s^{(k)}(x)$ and its derivatives up to order $k-1$ are continuous.

The class of such splines is a linear function space of dimension N + k, as one can verify from the number of conditions imposed for the continuity of the splines and their derivatives.
There are various possible representations for a spline function, but the
numerically most stable is in terms of a basis of B-splines.
Definition 102. Given the non-decreasing sequence of knots $\{t_j\}_{j=0,\ldots,N}$, the normalized jth B-spline of degree k is defined by
$$B_j^{(k)}(x) = (t_{j+k+1} - t_j)\, [t_j, \ldots, t_{j+k+1}]\, (t - x)_+^k,$$
where $(t - x)_+^k \equiv \max((t - x)^k, 0)$ is the truncation function and $[t_j, \ldots, t_{j+k+1}](t - x)_+^k$ denotes the kth divided difference of the truncation function with x fixed.
Each $B_j^{(k)}(x)$ is in fact a specific k-degree spline with local support or “active” on $[t_j, t_{j+k+1}]$, i.e., nonzero only in k + 1 consecutive subintervals. Vice versa, at any $[t_j, t_{j+1}]$, only k + 1 nonzero B-splines overlap. Also, at any $x \in [t_j, t_{j+1}]$, the sum $\sum_j B_j^{(k)}(x) = 1$. If the knots are uniformly distributed, the maximum value of any B-spline $B_j^{(k)}(x)$ occurs at the midpoint of its “definition” interval. See Figure 11.1.1 for an illustration with k = 3.
Instead of deriving the B-splines from the definition, one can generate
a B-spline of any degree recursively, starting from the piecewise constant

B-splines $B_j^{(0)}$, by using the de Boor-Cox formulas on the knot sequence $\{t_j\}_{j=0,\ldots,m}$. For $k = 1, 2, \ldots$

• $B_j^{(0)}(x) = \begin{cases} 1, & t_j \leq x < t_{j+1} \\ 0, & \text{otherwise;} \end{cases}$

• $B_j^{(k)}(x) = \dfrac{x - t_j}{t_{j+k} - t_j}\, B_j^{(k-1)}(x) + \dfrac{t_{j+k+1} - x}{t_{j+k+1} - t_{j+1}}\, B_{j+1}^{(k-1)}(x).$

When some knots coincide, any resulting indeterminate of the form 0/0 is defined as 0.
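As an illustration of the recursion (a small sketch of our own, not code referenced in this book), the formulas translate directly into a few lines of Python, with the 0/0 convention handled explicitly:

```python
def bspline(j, k, t, x):
    """Evaluate B_j^{(k)}(x) on the knot sequence t by the de Boor-Cox
    recursion; indeterminate terms of the form 0/0 are defined as 0."""
    if k == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    # First term: (x - t_j)/(t_{j+k} - t_j) * B_j^{(k-1)}(x)
    d1 = t[j + k] - t[j]
    term1 = 0.0 if d1 == 0.0 else (x - t[j]) / d1 * bspline(j, k - 1, t, x)
    # Second term: (t_{j+k+1} - x)/(t_{j+k+1} - t_{j+1}) * B_{j+1}^{(k-1)}(x)
    d2 = t[j + k + 1] - t[j + 1]
    term2 = 0.0 if d2 == 0.0 else (t[j + k + 1] - x) / d2 * bspline(j + 1, k - 1, t, x)
    return term1 + term2

# Cubic B-splines on the uniform knots 0, 1, ..., 7 sum to one on [3, 4].
t = list(range(8))
print(sum(bspline(j, 3, t, 3.5) for j in range(4)))   # ~1.0
```

In practice one would of course use a library implementation (such as the stable evaluation routines from [29] mentioned later) rather than this naive recursion, which re-evaluates lower-degree splines many times.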
The knot distribution is very important, because the relative spacing
of the nodes determines the shape of the B-splines. Neither scaling nor
translating the knot vector affects the shape. This implies that in the case
of uniformly spaced knots (i.e., tj+1 − tj = constant ∀j), these equations
can be simplified using the knot vector {0, 1, . . . , N }.
We will concentrate now on cubic splines and simplify the notation by
omitting the upper-script k = 3, i.e., we use s(x) for a cubic spline and
Bj (x) for a cubic B-spline. Returning to the representation of splines in
terms of B-splines, a general spline can be expressed as
$$s(x) = \sum_j c_j B_j(x), \qquad x \in [a, b].$$
The coefficients $c_j$ are known as control vertices or control points (from the application of splines in computer graphics).
The knots $\{t_0, t_1, \ldots, t_N\}$ define the $N - 3$ splines $B_0, \ldots, B_{N-4}$. Therefore, to have a basis of N + 3 functions for the space of cubic splines on the domain [a, b] one needs additional knots to define the 6 additional basis functions required. Thus, 6 more knots, sometimes called phantom knots, must be associated with $t_{-3}, t_{-2}, t_{-1}, t_{N+1}, t_{N+2}, t_{N+3}$. These knots can be chosen to coincide with the end points a and b, or to be outside the interval [a, b]. The right plot in Figure 11.1.1 illustrates this point for the case N = 1 with a single knot interval $[t_0, t_1] = [3, 4]$.
The set $\{B_{-3}, \ldots, B_{N-1}\}$ is now a basis for any spline s(x) defined on $x \in [a, b]$:
$$s(x) = \sum_{j=-3}^{N-1} c_j B_j(x).$$

As mentioned before, the cubic B-splines are only active (i.e., nonzero) in
four consecutive intervals, so a change in a control vertex cj only changes
the local behavior of the curve.
Multiple knots are particular cases of a nonuniform knot distribution.
Another important use of multiple knots is to reduce smoothness. If two
knots coincide, the continuity of the derivatives of the spline is reduced by

one order, i.e., for a cubic spline, instead of continuity of derivatives up to


order 2, one would have only first-order continuity. In general, there is a
reduction of one order of continuity for each additional coincident knot. A
special case is an open uniform distribution where 3 knot values at each end
of the knot vector coincide. This is useful for imposing boundary conditions.
One can use a nonuniform distribution of knots to improve the approximation quality. The difference in the number of knots needed for a given approximation accuracy, be it by interpolation or least squares approximation, can be remarkable compared to a uniform distribution. For example, when interpolating the function $\sqrt{x}$ in [0, 1] to a precision of $O(10^{-4})$ with a uniform distribution one needs $N > O(10^6)$, whereas with a nonuniform one it can be done with $N \approx 70$ (see [202], p. 255).
Some of the algorithms that use nonuniform knot sequences may require
the addition of knots. It is of interest therefore to have a technique that,
given a spline s(x) defined on a knot sequence {tj }j=0,...,m , computes the
coefficients of the representation of the spline when some knots have been
inserted, maybe for added accuracy. An efficient algorithm for this is the
Oslo algorithm, described in [69], chapter 1. In the same reference, formulas
for the derivative and integral of splines are given, as well as an algorithm
for calculating the Fourier coefficients that is of interest in signal processing.

Cubic splines least squares approximation


Originally, splines were designed for interpolation, but this is not appropri-
ate in the presence of noise. In this case we recall that the data are of the
form
$$y_i = \Gamma(x_i) + e_i, \qquad i = 1, \ldots, m,$$
with Γ a smooth, low-frequency function, whereas the errors $e_i$ generally consist of all frequencies, including high-frequency oscillations. The “local” character of splines, where the value of any control vertex $c_j$ is only affected by the data in a few neighboring intervals of the knot $t_j$, can promote the approximation not only of the data trend but also of the less smooth noise, more so than when using more global approximating functions.
Essentially there are two main strategies when using splines to obtain
a smooth fit to noisy data:

• smoothing splines,

• regression splines.

For completeness and because of their importance, we briefly describe smoothing splines (following [29]), although we will not pursue them in detail here. Given our fitting problem for the data set $\{(x_i, y_i)\}_{i=1,\ldots,m}$ with $y_i = \Gamma(x_i) + e_i$ and with estimates $\varsigma_i$ of the standard deviation in $e_i$, a

smoothing cubic spline is constructed by solving the following minimization problem over the space of all cubic splines:
$$\min_f \left\{ p \sum_{i=1}^{m} \left( \frac{y_i - f(x_i)}{\varsigma_i} \right)^2 + (1 - p) \int_{t_0}^{t_m} f''(x)^2 \, dx \right\}, \qquad 0 < p < 1.$$
The parameter p is chosen to balance the closeness of the fit with the smoothness of the approximating function. This is actually, again, a bi-objective optimization problem, and different values of the parameter p will give different points on the Pareto front. Which one to choose is problem dependent, but as usual, the most complete solution would require finding a uniform sampling of the Pareto front and then choosing the appropriate trade-off between accuracy of the fit and smoothness.
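Readers who want to experiment with smoothing splines can use the implementation in SciPy (version 1.10 or later, an assumption on our part); note that it is parameterized by a single penalty weight λ, which plays the role of (1 − p)/p in the criterion above, so the sketch below illustrates the idea rather than the exact formulation used here:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# lam weighs the integrated squared second derivative against the data
# misfit; a larger lam gives a smoother (stiffer) spline.
spline = make_smoothing_spline(x, y, lam=1e-4)
print("residual norm:", np.linalg.norm(y - spline(x)))
```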
Another way to approximate the smooth function contained in the data
is via a least squares approximation with splines, a technique known under
the name of regression splines. The key is now that the knot sequence is not the set $\{x_i\}_{i=1,\ldots,m}$ at which the data $y_i$ were sampled but an independent sequence covering the same domain interval [a, b] where all the $x_i$ are contained. In general, fewer knots will give smoother splines; the question is to determine the number and distribution of knots to achieve the optimal representation. For a unique solution to exist, every knot interval must contain at least one $x_i$ (adaptation of the Schoenberg-Whitney theorem, see [29]).
Once a set of knots has been chosen, the appropriate spline vertices $c_j$ are determined as solutions of a linear least squares problem:
$$\min_c \|y - B c\|_2^2,$$
where the so-called observation matrix B is the $m \times (N+3)$ matrix of B-spline functions sampled at the abscissas of the data points,
$$b_{ij} = B_j(x_i), \qquad i = 1, \ldots, m, \quad j = 1, \ldots, N+3.$$
For practical reasons we have shifted here the sub-indices, both of the $c_j$ and the $B_j(x)$, from $-3, \ldots, N-1$ to $1, \ldots, N+3$.
We will assume that the data points xi are given in increasing order.
The matrix B has full rank if, as mentioned before, there are enough data
in each knot interval; cf. [29]. If there are not enough data per subinterval,
it may still have full rank but be ill-conditioned.
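A minimal computational sketch of this regression-spline fit (our own example, assuming SciPy 1.8 or later for the BSpline.design_matrix routine; all data and names are illustrative) builds the observation matrix B explicitly and solves the linear least squares problem with a dense solver:

```python
import numpy as np
from scipy.interpolate import BSpline

# Noisy samples of a smooth function on [0, 1] (synthetic example).
rng = np.random.default_rng(0)
m = 200
x = np.sort(rng.uniform(0.0, 1.0, m))
y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(m)

# Uniform interior knots; the end knots are repeated k times (coincident
# "phantom" knots) so that the N + 3 cubic B-splines form a basis on [0, 1].
N, k = 8, 3
t = np.concatenate(([0.0] * k, np.linspace(0.0, 1.0, N + 1), [1.0] * k))

# Observation matrix with b_ij = B_j(x_i); it has 4 nonzeros per row.
B = BSpline.design_matrix(x, t, k).toarray()          # shape (m, N + 3)

# Control vertices from the linear LSQ problem min_c ||y - B c||_2.
c, *_ = np.linalg.lstsq(B, y, rcond=None)
s = BSpline(t, c, k)                                  # the regression spline
print("residual norm:", np.linalg.norm(y - s(x)))
```

For large m a sparse representation of B and one of the banded or iterative solvers discussed below would be used instead of the dense solve.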
Structurally, B is a sparse matrix with row bandwidth 4 for cubic splines. Let us partition the matrix into N − 1 blocks,
$$B = \begin{pmatrix} B_1 \\ \vdots \\ B_{N-1} \end{pmatrix} \quad \text{with} \quad B_j \in \mathbb{R}^{m_j \times (N+3)}, \quad j = 1, \ldots, N-1,$$

where the number mj of rows in the jth block is the number of data per
subinterval. Then each block is sparse and the four nonzero elements of
each row are in columns j through j + 3 (since in every subinterval only
four B-splines are active).
For very large problems, Lawson and Hanson ([150], chapter 27) describe a sequential Householder QR process that is applied in N − 1 steps and can be tailored to reduce considerably the working storage needed, namely, to $(N + 3 + \max_j m_j) \times 5$ locations instead of $m \times (N + 3)$. At each step, only one new block is incorporated and the already obtained QR decomposition is updated to account for this block. The algorithm requires $m(N+3)^2 - (N+3)^3/3$ flops; see the example in [150], p. 221. Alternatively, row-wise Givens
rotations can be used efficiently to compute the QR decomposition with
low fill-in. A good reference for details regarding adaptations of storage
methods and algorithms for normal equations and Householder and Givens
QR decompositions to banded matrices is [20].
A considerably more complex problem is the nonlinear least squares problem resulting when the number, but not the distribution, of the knots is chosen a priori. In other words, the minimization problem must be solved for both the knots and the control vertices:
$$\min_{c,\, t} \sum_{i=1}^{m} \left[ y_i - s(x_i; c, t) \right]^2.$$

This is a nonlinear constrained separable least squares problem, as the knots


have to be in non-descending order. Therefore, the methods described in
the previous chapters can be applied. In [137] the difficulties that appear
are discussed; see also chapter 4 in [69].

Bivariate tensor product cubic splines


We now describe bivariate tensor product splines; the extension to higher
dimensions is straightforward. Some definitions will be needed.

Definition 103. Given the space U of functions defined on a domain X and


the space V of functions with domain Y, the tensor product of two functions
u ∈ U and v ∈ V, is a function with domain X ×Y = {(x, y) , x ∈ X , y ∈ Y}:

u ⊗ v (x, y) = u(x)v(y).

The tensor product of the spaces U ⊗ V is the set of all such products and
their linear combinations.
The Kronecker or tensor product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$ is the $mp \times nq$ block matrix
$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix}.$$

An important property is that the pseudoinverse of the Kronecker product of matrices is the Kronecker product of their pseudoinverses: $(A \otimes B)^\dagger = A^\dagger \otimes B^\dagger$. By vec(X) we denote a one-dimensional (column) array obtained by storing a matrix X consecutively by columns. We are interested in bicubic splines written in terms of a tensor product basis of cubic B-splines.

Definition 104. Consider the rectangular region [a, b] × [c, d] and define a mesh on it with knots $(t_j^{(x)}, t_k^{(y)})$ for $j = 0, \ldots, J$ and $k = 0, \ldots, K$. Assume further that $B_j(x)$ and $B_k(y)$ are the cubic B-splines defined as above on the corresponding one-dimensional knot sequences in the x and y directions. Then, provided that (as in 1D) additional knots (phantom knots) have been added, i.e., $(t_j^{(x)}, t_k^{(y)})$ for $j, k = -3, -2, -1$ and $j = J+1, J+2, J+3$, $k = K+1, K+2, K+3$, a basis of bicubic tensor product B-splines is formed by the terms $B_j(x) B_k(y)$.

A general two-dimensional tensor product spline can be written as
$$s(x, y) = \sum_{j=-3}^{J-1} \sum_{k=-3}^{K-1} c_{jk} B_j(x) B_k(y), \qquad \forall (x, y) \in [a, b] \times [c, d].$$

The basis functions have now support in 4 × 4 adjacent sub-rectangles or


cells. And, vice versa, only 16 terms Bj (x)Bk (y) are active in each cell; see
Figure 11.1.2 for an example.
The properties of the bicubic spline are analogous to the 1D case: on each sub-rectangle $[t_j^{(x)}, t_{j+1}^{(x)}] \times [t_k^{(y)}, t_{k+1}^{(y)}]$ the spline s(x, y) is a polynomial of degree 3 in x and y. The spline and all its derivatives up to second order are continuous under the provision, as before, that the knots are distinct in both directions:
$$\frac{\partial^{i+l} s(x, y)}{\partial x^i\, \partial y^l} = \sum_{j=-3}^{J-1} \sum_{k=-3}^{K-1} c_{jk}\, \frac{\partial^i B_j(x)}{\partial x^i}\, \frac{\partial^l B_k(y)}{\partial y^l}, \qquad i, l = 0, \ldots, 2.$$

The dimension of the vector space of bicubic splines is (J +3)(K +3). Bi-
variate splines inherit (with suitable adjustments) the properties described
for 1D: smoothness, shape invariance, etc. Details of evaluation, derivation
and integration can be found in [69], chapter 2.

Figure 11.1.2: Bicubic tensor product spline.

Least squares approximation with bicubic splines


We now consider the following problem. Given values of the dependent
variable z, corresponding to a set of independent variables (x, y) in the
domain D = [a, b] × [c, d], find a bicubic spline best l2 -approximation to the
values of z. We assume that D is covered with a rectangular spline knot
mesh as described above.
In the two-dimensional approximation problem, the distribution of the
data points (x, y) in D determines whether a number of techniques from
one dimension can be carried over.
In many applications one has uniform mesh data, i.e., the values of z
are sampled on a rectangular grid G = {(xi , yl )}i=1,...,I; l=1,...,L . In this
case, the solution of the two-dimensional problem can be reduced to a
cascade of one-dimensional problems. This is possible because, for mesh
data, the observation matrix B can be written as the Kronecker product of
observation matrices Bx and By corresponding to each of the independent
variables. The least squares problem is then to minimize
$$\sum_{i=1}^{I} \sum_{l=1}^{L} \left( z_{il} - \sum_{j=-3}^{J-1} \sum_{k=-3}^{K-1} c_{jk} B_j(x_i) B_k(y_l) \right)^2,$$
which in matrix form can be written as
$$\| (B_x \otimes B_y)\, \mathrm{vec}(C) - \mathrm{vec}(Z) \|_2^2,$$
where the matrices C and Z contain the control vertices and the data, respectively. Its solution can be written in terms of the pseudoinverse of the tensor product of the matrices. A fast algorithm has been developed by Eric Grosse [112] based on the work in [195].
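To make the reduction concrete, the following small NumPy sketch (ours, not the algorithm of [112, 195]; note that the order of the factors in the Kronecker product depends on the vec ordering convention, here column-major) verifies that the gridded problem can be solved as a cascade of one-dimensional least squares problems:

```python
import numpy as np

rng = np.random.default_rng(0)
I, L, p, q = 30, 25, 6, 5            # data grid I x L, control net p x q
Bx = rng.standard_normal((I, p))     # 1D observation matrix in x
By = rng.standard_normal((L, q))     # 1D observation matrix in y
Z = rng.standard_normal((I, L))      # gridded data values

# Direct solve of the Kronecker-structured problem (column-major vec).
A = np.kron(By, Bx)
c, *_ = np.linalg.lstsq(A, Z.flatten(order="F"), rcond=None)
C_direct = c.reshape(p, q, order="F")

# Cascade of two 1D solves, using (A (x) B)^+ = A^+ (x) B^+.
C_cascade = np.linalg.pinv(Bx) @ Z @ np.linalg.pinv(By).T

print(np.allclose(C_direct, C_cascade))   # True
```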

In the not uncommon situation that some data values zil on the grid
are missing, several general techniques still allow the reduction to 1D prob-
lems. One method is to complete the grid by linear interpolation using the
nearest available grid values. Another successful alternative is an iterative
procedure whereby, starting with some initial values for the unknown zil ,
the least squares algorithm for full grids is applied recursively to fill in the
missing values, until the l2 -error at these gridpoints is small compared to
the l2 -error on the whole mesh. It can be proved ([69], Chapter 10) that
this algorithm converges. Note that the initial values are “smoothed out”
by the least squares, so they need not be accurate.
If the data are scattered, i.e., not on a uniform mesh, another approach for the least squares approximation is needed. We assume again that the data are $z_i = \Gamma(x_i, y_i) + e_i$, $i = 1, \ldots, m$, with $(x_i, y_i)$ the independent variables in the domain D, which may be of irregular shape. As above, we superimpose a rectangular spline knot mesh $(t_j^{(x)}, t_k^{(y)})$ on a rectangle containing the data and must now minimize
$$\sum_{i=1}^{m} \left( z_i - \sum_{j=-3}^{J-1} \sum_{k=-3}^{K-1} c_{jk} B_j(x_i) B_k(y_i) \right)^2.$$

To write this in matrix form, we store the control vertices $c_{jk}$ in the usual form in vec(C), and on row i of the observation matrix B we put the terms $B_j(x_i) B_k(y_i)$ consistent with the ordering of c = vec(C). As in the one-dimensional case, we shift the indices so that the numeration of j and k starts with 1. The problem is now
$$\min_c \|B c - z\|_2^2.$$

The matrix B is sparse with only $4^2 = 16$ non-consecutive entries per row.


One can get a compact block banded (4-diagonal) structure if the data are
organized by cells and these in turn are stored by varying first in the y
direction (see [69], p. 148ff. for details).
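Each row of B is thus the Kronecker product of the corresponding rows of the two one-dimensional observation matrices. A small sketch of ours (assuming SciPy 1.8 or later; dense arrays are used only for clarity, a sparse format would be preferred in practice; all data are synthetic) assembles B for scattered data and solves the least squares problem:

```python
import numpy as np
from scipy.interpolate import BSpline

def open_uniform_knots(n_intervals, k=3):
    interior = np.linspace(0.0, 1.0, n_intervals + 1)
    return np.concatenate(([0.0] * k, interior, [1.0] * k))

def tensor_observation_matrix(x, y, tx, ty, k=3):
    """Row i holds the 16 products B_j(x_i) B_k(y_i), flattened so that the
    x-index varies slowest, matching a row-major vec(C)."""
    Bx = BSpline.design_matrix(x, tx, k).toarray()
    By = BSpline.design_matrix(y, ty, k).toarray()
    # Row-wise Kronecker product of the two 1D observation matrices.
    return np.einsum("ij,ik->ijk", Bx, By).reshape(x.size, -1)

rng = np.random.default_rng(0)
m = 2000
x, y = rng.uniform(0.0, 1.0, m), rng.uniform(0.0, 1.0, m)
z = np.sin(3 * x) * np.cos(2 * y) + 0.01 * rng.standard_normal(m)

tx, ty = open_uniform_knots(7), open_uniform_knots(7)
B = tensor_observation_matrix(x, y, tx, ty)            # m x 100
c, *_ = np.linalg.lstsq(B, z, rcond=None)
print("RMS:", np.linalg.norm(z - B @ c) / np.sqrt(m))
```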
If the problems are large, iterative methods are the most efficient. An-
other approach for very large data sets, common in geological applications
(see Section 11.3), is to take advantage of the local character of splines and
partition the global least squares problem into smaller subproblems corre-
sponding to a domain decomposition of both the data set and the control
vertices. A block iteration is then started, whereby all these local sub-
problems are solved and the computed local control vertices are assigned
as approximations of the global vertices. The block iteration stops when
the control vertices converge. Here again, the smoothing effect of the least
squares approximation is helpful, and no more than a couple of block iter-
ations are needed for convergence (see Section 6.5).

With scattered data, if the data are poorly distributed, with some cells
containing none or few data points, the problem can be singular or ill-
conditioned. Unfortunately, as we observed in our applications and as also
pointed out in [112], often there is no real gap in the singular values of the
observation matrix B and the numerical rank cannot be determined. In
this case the regularization methods of Chapter 10 ought to be used.
One last word about data approximation with spline tensor products.
There is no economical way to adapt the spline knot mesh to features
that are not in either of the two axis directions. In fact, to improve the
representation of such features it is necessary to refine in both directions, thereby generating a two-dimensional fine mesh around the feature, which is only needed in a small part of the domain.

Literature and software


From the extensive literature on splines, the classic book by Carl de Boor
[29] treats the theory of univariate splines in detail, as well as the topics
of tensor product of splines and smoothing splines. The Fortran programs
described and used in that book are available from Netlib and include stable
evaluation of B-splines, smoothing, least squares and curve approximation.
They also form the core of the MATLAB Spline Toolbox. A concise and
clear theoretical introduction of splines and the application to interpolation
can be found in the book by Powell [202].
Dierckx’s monograph [69] covers carefully and exhaustively the most
important aspects and algorithms of data fitting with splines in both one-
and two-dimensions. It has detailed discussions of least squares methods
and smoothing criteria and gives pointers to available software. It also
includes a description of a Fortran package FITPACK for data fitting in
one- and two-dimensions. The software package can be found in Netlib
under the name of DIERCKX. Another good introduction to splines with
applications to computer graphics in mind is the book by Bartels et al. [7].
To discuss in more detail the important case of splines with nonuniform
knots is beyond the scope of this book. In addition to the information
and software available in [69], one software package in the public domain is
spline2 [226].

11.2 Global temperatures data fitting


As discussed in Chapter 1, in some data fitting problems there is a well-
understood underlying physical process: one knows the form of the func-
tional relationship between dependent and independent variables and just
wants to determine some of its parameters. In other cases, either the func-
tional relationship is very complicated and one wants to approximate it by

Figure 11.2.1: Earth temperature anomalies (from [135]).

a simpler mathematical function, or one has no previous information about


the function type altogether. In these latter cases a parametrized model
must be chosen.
We will examine one such problem, describe the steps used to choose
a suitable model with the help of some graphical and statistical tools, and
then analyze the quality of the computed fit.

Example 105. The data set under consideration (see Figure 11.2.1), is
the time series of the Earth global annual temperature anomalies between
1856 and 1999, compiled at the University of East Anglia [135]. They were
obtained from the recorded temperatures by subtracting the global average
(14◦ C) for the years 1961–1990. This set has been exhaustively analyzed
by B. Rust and others [213, 222, 234] to try to clarify the question: “Is
rising carbon dioxide (CO2 ) the cause of the temperature rising?” (See Rust’s
webpage for interesting background material on climate change [214].)

How to choose a model to fit a given set of data


We will start with some exploratory data graphics: the scatter plot in Figure
11.2.1 of the temperature Ti versus the time ti for i = 1, . . . , m provides
insight into the underlying structure of the data and helps decide an appro-
priate type of approximating function. From this plot of the data set one

Figure 11.2.2: Quadratic polynomial fit of temperature data, residuals and


periodogram.

can appreciate that the relationship is clearly more complicated than linear:
there are several max/min (four, at approximately 1870, 1910, 1940 and at
1980), and there seems to be an underlying periodicity. Several data points
may be considered as outliers (close to the years 1875, 1940 and 2000).
The model function M (t) should be both simple to evaluate and robust,
i.e., well conditioned. Commonly used model functions are polynomials,
rational functions, exponential functions, trigonometric functions and their
combinations. The simplest choice of approximating function, in order to
obtain a linear least squares problem, is a polynomial. Several of these
models were explored by Rust [213] and we refer to his careful evaluation of
the fit. Here we will only sketch some of the reasons for his choice of model
based on residual analysis. A basic quadratic polynomial model is plotted in
Figure 11.2.2.
A quadratic will clearly produce an underfit of the data, because from
the number of extrema one can deduce that at least a polynomial of degree
5 is needed. Visual inspection of the residual plot in Figure 11.2.2 not only
corroborates this, but also shows that the residuals have some systematic
variations. Apparently some periodicity of approximately 60–70 years has
not been modeled by the polynomial. Figure 11.2.2 also shows a periodogram
(or power spectrum, see Section 1.3.3) of the residuals, with a peak indi-
cating a dominant cycle of 72 years. This oscillation was reported in an
article in Nature [222], and it was thought to arise from ocean-atmospheric
interaction.
Table 11.1 lists the coefficient of determination R2 , defined in (2.2.7),
for different models tested in [213]. We see that a model that adds a sine
function to the basic quadratic model
$$M(c, t) = c_1 + c_2 (t - t_0)^2 + c_3 \sin\!\left( \frac{t + c_5}{c_4} \right), \qquad (11.2.1)$$

explains 80% of the variance in the data, and therefore it provides a good
fit.

Model R2
Linear 0.618
Quadratic 0.713
Quintic 0.770
Exponential 0.724
Linear + sine 0.758
Quadratic + sine 0.810
Exponential + sine 0.810
Uniform spline (N = 8) 0.801

Table 11.1: Coefficients of determination (from [213]).

The disadvantage of model (11.2.1) is that to compute its coefficients


one has to solve a nonlinear least squares problem, although for this specific
model it can be done using the techniques for separable variables described
previously. We will instead try another approach leading to a linear least
squares problem.
Since there is no a priori information about the function type we can
select a cubic B-spline, a flexible function that can represent almost any
structure. As we discussed in the previous section, these functions have the
advantage of being well-conditioned, as long as there are enough data in
every segment.
The n-term spline is written as $B(t) = \sum_{j=1}^{n} c_j B_j(t)$ where, as in the previous section, the indices have been shifted to start from j = 1 and, for simplicity, we denote n ≡ N + 3, where N is the number of knot intervals. The spline is defined on a uniformly distributed knot mesh that covers the time interval [1856, 1999]. With n ≪ m, the approximating spline function can now be determined by solving the linear least squares problem
$$\min_c \| T - B c \|_2,$$

where the (i, j)-element of the matrix B is $b_{ij} = B_j(t_i)$, i.e., the jth B-spline basis function evaluated at the data abscissas $t_i$. The vector T denotes the
temperature anomalies data and c is the parameter vector. The appropriate
algorithms to solve this problem were described in the previous section. If
the knot sequence is chosen so that there are some data ti in every subin-
terval, the problem is nonsingular.
The optimum number of knots, or equivalently, the number n of basis
functions Bj (t), must be determined. This will be achieved by striking a
balance between smaller residuals r ∗ = T − B c∗ and fewer parameters c∗
to be determined, where c∗ is the LSQ solution. As discussed in Chapter
1, the goal is to find a model that represents the information in the data
well enough but “leaves out” the noise, avoiding the danger of overfitting

n     ||r*||_2^2   (s*)^2    R^2     adj R^2   |ρ|/T
6     1.59         0.0115    0.782   0.77      3.83
8     1.45         0.0107    0.801   0.79      3.23
10    1.44         0.0107    0.802   0.79      3.23
12    1.45         0.011     0.801   0.78      3.23
14    1.42         0.0109    0.805   0.79      3.01
24    1.21         0.0101    0.835   0.80      1.59

Table 11.2: Results for different number n of B-splines in the fit to the temperature data.

the data.
Starting with n = 6 we will solve the least squares problem with an increasing number of knots and use three statistical measures to assess the goodness of the fit to decide on a provisional optimum value of n: the squared residual norm $\|r^*\|_2^2$, the squared standard error $(s^*)^2 = \|r^*\|_2^2/(m - n)$ (i.e., the squared residual norm adjusted by the residual degrees of freedom, cf. equation (2.2.6)) and the adjusted coefficient of determination adj $R^2$ (see 2.2):
$$\mathrm{adj}\,R^2 = 1 - \frac{\sum_{i=1}^{m} r_i^2/(m - n)}{\sum_{i=1}^{m} (T_i - \bar{T})^2/(m - 1)} = 1 - \frac{(s^*)^2}{\sum_{i=1}^{m} (T_i - \bar{T})^2/(m - 1)}, \qquad (11.2.2)$$
where $\bar{T}$ is the mean of the data $\{T_i\}_{i=1,\ldots,m}$. These quantities are shown as a function of n in Table 11.2. Once a tentative “best” value has been selected, we will analyze the corresponding residual vector to evaluate the quality of the approximation.
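These statistics are inexpensive to obtain from the residual vector; a small sketch of our own in Python (the helper name is illustrative):

```python
import numpy as np

def fit_statistics(T, r, n):
    """Squared residual norm, squared standard error, R^2 and adjusted R^2
    for a fit with n parameters to the m data values T with residuals r."""
    m = T.size
    rss = np.sum(r**2)                       # ||r*||_2^2
    s2 = rss / (m - n)                       # (s*)^2
    tss = np.sum((T - T.mean())**2)
    R2 = 1.0 - rss / tss
    adjR2 = 1.0 - s2 / (tss / (m - 1))       # equation (11.2.2)
    return rss, s2, R2, adjR2
```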
Under the assumption that the approximation errors are small, i.e., the
model represents the pure-data function well, cf. Section 2.2, the squared
standard error (s∗ )2 is a good estimate of the variance of the noise. The
values of (s∗ )2 are larger than the noise variance when n is small and
decrease as n increases, until they stabilize at a fairly constant value. The
first corresponding n is then a reasonable choice for the tentative number
of terms. Of course, a small value for (s∗ )2 is desirable.
The inconvenience in using R2 is that increasing the number of coef-
ficients increases R2 without necessarily improving the fit; this is avoided
by adjusting its value with the degrees of freedom. The values of adj R2 are
smaller than 1, and again, a value closer to 1 indicates a better fit. Negative
values may indicate an overfit. Its use is recommended over the plain R2
when comparing a series of models obtained by adding parameters.
Table 11.2 also shows a trend parameter |ρ|/T, appropriate for equispaced data, where ρ is the autocorrelation and T is a threshold. (Both are defined and used in Section 1.3.) The values of this trend parameter, all

Figure 11.2.3: Cubic B-spline approximation to temperature anomaly data


for n = 8.

larger than 1, seem to indicate that there is still structure in the data that
has not been modeled adequately. Larger number of knots will of course
lead to better approximations of the data, but the work involved in choos-
ing so many parameters appears to be unjustified, although the condition
number of the matrix for all the above n is not larger than 40. The most
balanced choice is probably n = 8, giving the plot shown in Figure 11.2.3.
The noise variance estimate (s∗ )2 has stabilized and is roughly the same
for all n in the interval [6, 24] and, from the values of adj R2 , the model
explains approximately 80% of the data.

Model validation
As was pointed out already in Section 1.3, in a good fitting model the resid-
uals are practically dominated by the errors in the data and have the same
statistical characteristics. Therefore the residuals must be random, uncor-
related, normally distributed with mean zero and identical variance. The
quantitative checks of these properties have been described in Section 1.3;
hence we take here a complementary approach and concentrate on the use
of graphical tools. We want an answer to the following questions for the
“best” n = 8 spline function (shown in Figure 11.2.3):

1. Are the residuals small?

2. Are the residuals random and uncorrelated?

3. Is the variation in the response different along the independent vari-


able range?

4. Are the residuals (approximately) normally distributed with zero mean


and constant variance?
5. Is the fit adequate, or do the residuals suggest a better fit, additional
terms, or a trend?
6. Or, have we overfitted, i.e., can some terms be eliminated?
In fact, some of the questions are related: if the fit is adequate, the residuals
follow the characteristics of the errors, and these in most cases have a
normal distribution and vice versa. But looking at each point will allow us
to display several of the common techniques for residual analysis.
The first question has been answered numerically, and Figure 11.2.3
confirms it. For the second question, although a plot of the residual versus
the independent variable would probably suffice, a confirmation can be ob-
tained from an autocorrelation plot. The vertical axis of this plot is defined
by the autocorrelation coefficient $R_h$ with time lag h:
$$R_h = \frac{\frac{1}{m}\sum_{i=1}^{m-h} (r_i - \bar{r})(r_{i+h} - \bar{r})}{\frac{1}{m}\sum_{i=1}^{m} (r_i - \bar{r})^2}, \qquad h = 1, 2, \ldots,$$
where $\bar{r}$ is the mean of the residuals and the horizontal axis is the time lag h. (As this is a plot of residuals, we can assume $\bar{r} = 0$.) The unit-lag autocorrelation ρ in (1.4.3) is a special case for h = 1. Autocorrelations should be near zero for randomness. In fact, to detect non-randomness it is usually enough to know the unit-lag autocorrelation ρ, i.e., the autocorrelation between $r_i$ and $r_{i+1}$, which can easily be seen from a lag plot where the vertical axis is $r_{i+1}$ and the horizontal axis is $r_i$; see Figure 11.2.4. This plot shows no apparent structure and the residuals seem random.
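The autocorrelation coefficient is easily computed from the residual vector; a short sketch of ours (the 1/m factors in the numerator and denominator cancel):

```python
import numpy as np

def autocorrelation(r, h):
    """Autocorrelation coefficient R_h of the residuals r at lag h >= 1."""
    rb = r - r.mean()
    return np.dot(rb[:-h], rb[h:]) / np.dot(rb, rb)
```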
One very important point is to check whether there is a constant varia-
tion of the residual (dependent variable) over the course of the observations,
because if there is, the standard least squares approximation would not pro-
vide good estimates of the parameters, as shown in the example of Chapter
1; instead a weighted approximation is necessary. A plot of the residual
can be used to show whether the variation along the range of the indepen-
dent variable is fixed, and therefore it is unnecessary to use weighted least
squares; this is clearly apparent here.
For the fourth question about the distribution of the residuals, a his-
togram (with vertical axis: counts and horizontal axis: residual values) will
give a first idea, and a normal probability plot of the residuals would be
further confirmation. Both plots are shown in Figure 11.2.5.
For the normal probability plot, we choose as vertical axis the ordered
residuals ri and as horizontal axis the corresponding normal order statistic
medians xi . These xi for i = 1, . . . , m are calculated using the following
formula with the inverse of the normal distribution function:

Figure 11.2.4: Unit-lag plot of residuals for the spline approximation with
n = 8.

Figure 11.2.5: Statistical plots associated with the spline approximation


for n = 8. Left: histogram of residual. Middle: normal probability plot.
Right: periodogram of the residual.

j c∗j ±δj c∗j /δj


1 −0.685 ±0.725 −0.95
2 0.402 ±0.124 −3.2
3 −0.164 ±0.063 −2.6
4 −0.566 ±0.050 −11
5 0.156 ±0.050 3.1
6 −0.260 ±0.063 −4.2
7 0.479 ±0.124 3.9
8 0.902 ±0.725 1.2

Table 11.3: Parameter values and uncertainties for n = 8.



$$\Pr(X < x_i) = \begin{cases} 1 - 0.5^{1/m} & \text{for } i = 1 \\ 0.5^{1/m} & \text{for } i = m \\ \dfrac{i - 0.3175}{m + 0.365} & \text{otherwise.} \end{cases}$$
A way to think about this is that the sample values are plotted against what
we would expect to see if the residual distribution were strictly consistent
with the normal distribution. If the data are consistent with a sample from
a normal distribution, the points should lie close to a straight line.
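A sketch of ours showing how these plotting positions can be computed with the inverse normal distribution function from scipy.stats (the helper name is illustrative):

```python
import numpy as np
from scipy.stats import norm

def normal_order_statistic_medians(m):
    """Plotting positions x_i for a normal probability plot of m residuals."""
    p = (np.arange(1, m + 1) - 0.3175) / (m + 0.365)
    p[0] = 1.0 - 0.5 ** (1.0 / m)
    p[-1] = 0.5 ** (1.0 / m)
    return norm.ppf(p)

# The plot is then: sorted residuals versus these medians; an approximately
# straight line indicates normally distributed residuals.
```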
The histogram, while roughly following the normal distribution bell shape,
is skewed with the center displaced. The normal probability plot shows ba-
sically a linear pattern except for some deviations at both ends, so on this
account the model is not yet very good.
The periodogram of the residual, shown in the right plot of Figure 11.2.5,
also points to some underfitting. There is a peak corresponding to a weak
periodic component of 20 years per cycle that has not been modeled yet. This
is also mentioned in [216], where they use splines with a nonuniform knot
distribution. Apparently it corresponds to some real phenomenon, although
a physical explanation has not yet been given.
When B has full rank and the noise in the data has covariance matrix $\varsigma^2 I$, the covariance $\mathrm{Cov}(c^*) = \varsigma^2 (B^T B)^{-1}$ for the LSQ solution defines the uncertainties in the parameters. If the error variance $\varsigma^2$ is unknown, as is the case here, an estimated value is given by the above-defined $(s^*)^2$. Let $\delta_j$ denote the square roots of the diagonal elements of $\mathrm{Cov}(c^*)$, i.e.,
$$\delta_j^2 = [\mathrm{Cov}(c^*)]_{jj}, \qquad j = 1, \ldots, n.$$
Then $\delta_j$ is the standard deviation for the corresponding parameter $c_j^*$, so that a one-standard-deviation interval around this parameter is $c_j^* \pm \delta_j$.
In Table 11.3, the values of the parameters $c_j^*$ with their uncertainties of $\pm 1\delta_j$ are listed. In the rightmost column we list the ratios $c_j^*/\delta_j$ of the

parameters to their corresponding uncertainties. If this ratio is large, the probability is small that the parameter $c_j^*$ is zero, making $B_j$ redundant. On the other hand, if the uncertainty interval for a parameter is such that $c_j^*$ may be zero, it is convenient to compute the confidence levels of the parameters. The confidence bounds for $c_j$ are given by
$$\Pr\left\{ c_j^* - \kappa_p \delta_j < c_j < c_j^* + \kappa_p \delta_j \right\} = 1 - p,$$
with 1 − p the desired confidence level (a common choice, for example, is p = 0.05, for a 95% confidence level). The parameter $\kappa_p$ depends on the confidence level and is obtained from the inverse of the Student's t cumulative distribution function with m − n degrees of freedom.
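A sketch of ours for these computations, using the inverse Student's t distribution from scipy.stats (the helper name and default p are illustrative):

```python
import numpy as np
from scipy.stats import t as student_t

def parameter_uncertainties(B, r, p=0.05):
    """Standard deviations delta_j of the LSQ parameters and the half-widths
    kappa_p * delta_j of the (1 - p) confidence intervals; the noise variance
    is estimated by (s*)^2 = ||r||_2^2 / (m - n)."""
    m, n = B.shape
    s2 = np.sum(r**2) / (m - n)
    cov = s2 * np.linalg.inv(B.T @ B)        # Cov(c*) with estimated variance
    delta = np.sqrt(np.diag(cov))
    kappa = student_t.ppf(1.0 - p / 2.0, m - n)
    return delta, kappa * delta
```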
For the above problem, the first and last parameters $c_1$ and $c_8$ have a small ratio $c_j^*/\delta_j$, so it is worthwhile to compute the confidence bounds for these two parameters. We use the Student's t distribution with 144 − 8 = 136 degrees of freedom, giving $\kappa_p = 0.67449$, and for the first parameter we must evaluate
$$\Pr\{-0.685 - 0.67449 \cdot 0.725 < c_1 < -0.685 + 0.67449 \cdot 0.725\}$$
and similarly for the last parameter. We obtain
$$\Pr\{-1.174 < c_1 < -0.1964\} = 0.5, \qquad \Pr\{0.06866 < c_8 < 1.736\} = 0.75.$$
Hence, for $c_1^*$ we can only guarantee with 50% probability that the confidence interval does not include zero, while for $c_8^*$ this confidence goes up to 75%. A similar technique can be used to check for overfitting; for the present model it is redundant, as one can rather make a case for underfitting against overfitting.

11.3 Geological surface modeling


An interesting application area, both for linear and nonlinear least squares
problems, is exploration seismology, as used in oil and gas prospecting (for
details see [224]). One of the problems in this area is the determination
of the location and shape of geological formations in the Earth from mea-
surements on its surface. Although the presence of oil or gas per se cannot
be detected by the elastic properties of rocks or by the shape of the inter-
faces, such as where the rocks tilt upward, or where the strata are broken by
faults, those are valuable information for identifying favorable conditions of
hydrocarbon presence (traps for the fluids that tend to rise and accumulate
at plugs provided by impervious rocks). Also, determining the elastic prop-
erties of rocks helps, through rock physics, to obtain valuable information

about properties that are useful to reservoir engineering, such as density,


porosity and permeability. We will discuss some of these applications in
the next chapter, as they lead to nonlinear least squares problems.

Example 106. Salt SEG surface


Our next example will consider the computation of an analytical approx-
imation of a complex geological interface defined by a discrete scattered set
of depth data points [58, 197]. This is just one surface from a data set (Salt
SEG) created under the sponsorship of the Society of Exploration Geophysi-
cists (SEG). The surface represents the depth of a boundary between two
different sedimentary rock formations pierced by a salt body (center hole).
The surface includes a large normal fault that runs across the salt, from
northwest to southeast, and some smaller faults that produce sharp gradi-
ents in the surface.
The set contains m = 71, 952 depth data points (xi , yi , zi ) in the domain
(xi , yi ) ∈ [0, 280] × [0, 280], with a depth range: zi ∈ [400, 17300]. Although
the (xi , yi ) points all lie on a rectangular grid, there are large data gaps in
the domain due to the salt body intrusion and several faults (that belong to
other surfaces). The depth values zi have not only a wide range but, due to
the faults, they present large discontinuities. Figure 11.3.1 shows a plot of
the data.
We assume that the data values can be fitted by a smooth, twice differ-
entiable function, the errors being random and normally distributed. The
idea is not to extrapolate to cover the holes, but rather to approximate the
data only where the surface is defined. A much more laborious alternative
is to segment the domain and approximate the surface in patches that then
need to be stitched together.
A suitable mathematical model M(x, y) to represent this surface is a tensor product of cubic B-splines, which has the appropriate smoothness and, an important feature in this case of large domain gaps, a representation in terms of basis functions with local support.
To define the model function we introduce a uniform mesh, determined on the given [0, 280] × [0, 280] rectangular domain by the knots $(x_j^{\mathrm{knot}}, y_k^{\mathrm{knot}})$ for $j = 1, \ldots, J$ and $k = 1, \ldots, K$. The model function is then
$$M(x, y) = \sum_{j=1}^{J} \sum_{k=1}^{K} c_{jk} B_j(x) B_k(y).$$

The coefficients (or control vertices) $c_{jk}$ are determined by a large, linear, overdetermined and inconsistent system of equations:
$$\sum_{j=1}^{J} \sum_{k=1}^{K} c_{jk} B_j(x_i) B_k(y_i) \approx z_i, \qquad i = 1, \ldots, m.$$

Figure 11.3.1: The data that define the salt SEG surface.

A best fit in the $l_2$-norm is obtained by solving
$$\min_{c_{jk}} \sum_{i=1}^{m} \left( z_i - \sum_{j=1}^{J} \sum_{k=1}^{K} c_{jk} B_j(x_i) B_k(y_i) \right)^2,$$
where we assume that $n = J \cdot K \ll m$. As explained in a previous section, the least squares problem can be recast in matrix form $\min_c \|z - B c\|_2^2$, where z is the vector of the data, the coefficients are stored in the vector c and B is the usual matrix associated with the B-spline basis functions, with the terms $B_j(x_i) B_k(y_i)$ stored in row i consistently with the ordering in the coefficient vector, as explained in detail in the section on splines.
We recall that the problem is sparse, because the B-spline functions have
local support – at any given evaluation point only four coefficients (con-
trol vertices) in each coordinate are different from zero, for a total of 16
coefficients in two-dimensions. Because of the data “gaps” and the large
discontinuities, the problem is ill-conditioned. In fact, if the spline mesh
is so fine that a control knot $(x_j^{\mathrm{knot}}, y_k^{\mathrm{knot}})$ is more than two mesh intervals from every data point, the associated coefficient $c_{jk}$ will not enter into any
evaluation of the least squares matrix and thus will produce a column of
zeros in B. In addition, control vertices associated with knots that are close
to only a few data points can be poorly defined by the data and also lead to
small singular values and ill-conditioning.
Figure 11.3.2 shows a log-scale plot of the singular value distribution of
the matrix B for a 20 × 20 B-splines grid model of a representative subset

of the data. The singular values are normalized with respect to the largest one, $\sigma_1$. Note the gap before the last three singular values, which are of order $10^{-7}$ relative to the largest singular value and are nonzero due to rounding errors. The gap at $10^{-4}$ shows a typical behavior for rank-deficient problems and also suggests the numerical rank of B.
For the special case that the data are uniformly spaced, a fast algorithm
to solve the tensor product least squares problem has been developed [187,
195], based on the reduction to a cascade of 1D problems (tensor product
fitting to tensor product data). In the present case, in order to use this
reduction, one could re-grid the data to a uniform mesh, filling any gaps by
interpolation or extrapolation.
The alternative that we present here instead involves considering the
multidimensional least squares problem as a whole. To give a reasonable
fit, the number of unknown parameters must be at least n = 400. So we
are dealing with a large, linear, sparse and ill-conditioned problem, with the
main difficulty being the ill-conditioning.
The size of the problem excludes the use of a simple truncated SVD al-
gorithm, which otherwise would be a good choice given the rank deficiency.
The matrix B is sparse (16 entries per row), so LASVD, the method de-
veloped by Berry, seems an option. Unfortunately, the exploratory SVD
computation using a data subset (shown in Figure 11.3.2), indicates that
most of the singular values are relevant and would have to be computed,
making LASVD impractical. We therefore try BTSVD [196, 197], a block
algorithm based on TSVD and compare its performance with the iterative
Lanczos method LSQR that has good regularizing properties. BTSVD is a
hybrid algorithm, based on partitioning the original problem into subprob-
lems small enough to make a truncated SVD algorithm practical.
Because of the local character of the B-splines, the global LSQ problem
subdivides naturally into subproblems, corresponding to a domain decom-
position of the independent variables of the data set and the corresponding
relevant control vertices. Some natural overlap arises from the phantom
vertices, i.e., the control vertices associated with the two outermost basis
functions in each direction, which are affected by the data in the neighbor-
ing domain.
Thus, the local LSQ problems are loosely coupled and therefore they are
suited to a block Gauss-Seidel approach (for details see [196, 197] and 6.5).
The blocks are visited sequentially and a master copy of all the control ver-
tices is updated as soon as a block fit has been completed. This updating
strategy should achieve a global C 1 approximation, since we impose appro-
priate continuity conditions across the block boundaries. Among the several
possibilities to determine a value for the shared phantom vertices we choose
to assign a dynamic mean value of the corresponding local values.
BTSVD can be considered a special case of the block Gauss-Seidel it-

eration considered in Section 6.5; it is amenable to parallelization and the


parallel version can be asynchronous. The convergence follows from Theo-
rem 98.
An important issue is the local regularization (see Chapter 10), since a
number of the subproblems are rank deficient. For each of these subproblems
one has to define a regularization parameter, i.e., the number l of SVD
terms used in a regularized solution. Among several options, we choose the
L-curve because it does not require additional information about the data,
though, as it is costly to compute the optimum regularization parameter, we
have implemented a simplified strategy based on the monotone behavior of
the solution norms in order to compute a reasonably good approximation of
the parameter.
The plot of the singular values in Figure 11.3.2 can give some idea of the rank of the complete least squares problem, $\mathrm{rank}(B) \approx r_\tau$, where $r_\tau$ is the number of normalized singular values bounded below by $\tau = 10^{-4}$. We use this value as a starting value for l. From the first tentative TSVD solution $c_l$ and its residual $r_l = z - B c_l$, successive improved approximations of the local control vertices and the residuals are computed until the desired
solution quality is obtained. We stop the process when the solution norm
starts to increase dramatically (which is the central idea in the L-curve
criterion).
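For reference, the truncation rule used here is easy to state in code; the following sketch of ours computes a TSVD solution with a relative threshold τ on the normalized singular values (a dense SVD, so it is only practical for the local subproblems):

```python
import numpy as np

def tsvd_solve(B, z, tau=1e-4):
    """Truncated SVD solution of min ||B c - z||_2, keeping only the singular
    values with sigma_i / sigma_1 >= tau (the numerical rank r_tau)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s / s[0] >= tau))
    c = Vt[:k].T @ ((U[:, :k].T @ z) / s[:k])
    return c, k
```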
We have found that the most efficient strategy for BTSVD combines
some initial LSQR iterations with a switch to a TSVD solver for handling
the ill-conditioning and termination issues. The local LSQR iterations are
an essential element of the algorithm because they determine a good ini-
tial approximation to the phantom vertices used in the TSVD steps, thus
speeding up the convergence of the block Gauss-Seidel method.
On the other hand, Krylov subspace methods are self-regularizing; the
subspace dimension (= iteration number) is the regularization parameter.
As mentioned in Section 6.4, a reliable, automatic regularization is not
easily implemented. The LSQR program requires as input the parameter
CONLIM, an estimate of the condition of the bidiagonal matrices Bk (see
Section 6.3) and the algorithm will stop at iteration k if:

cond(Bk ) ≥ CONLIM.

This heuristic should have a similar effect as TSVD with a normalized singular value threshold of $\tau \approx 1/\mathrm{CONLIM}$.
Table 11.4 illustrates the difficulty of stopping the LSQR iteration if
there is no indication of the appropriate CONLIM value. LSQR was stopped
using different values, and we recorded the RMS as well as the percentage of
data points with residual farther than 1 standard deviation from the mean
(“Large residual” in Table 11.4). We also list the relation of the computer
time needed by the two algorithms LSQR and BTSVD to obtain the same

quality of results.
The stopping criterion for the Gauss-Seidel sweeps through the blocks is a bound on the residual mean square, $\mathrm{RMS} = \|r\|_2/\sqrt{m}$. The experience
with this and other problems is that at most 3 sweeps over all the blocks
are needed to attain the desired precision. The choices CONLIM = $10^3$ and CONLIM = $10^4$ for LSQR give similar approximation quality, while the solution for CONLIM = $10^3$ is computed at a fraction of the number of operations needed when using CONLIM = $10^4$. A second way to introduce
regularization into LSQR is available through the parameter ATOL, which
can cause iterations to terminate before the norm of the solution becomes
too large. This was our approach.
BTSVD was run using a 5 × 5 block domain decomposition, with 10 × 10 local vertices and a common threshold of $\tau = 10^{-4}$. An RMS of $8.18 \cdot 10^{-3}$ was obtained, with 91% of the data having an error less than 0.5% in 2 sweeps. The first sweep used 32 local LSQR iterations and the second one used the local TSVD. Figures 11.3.3 and 11.3.4 show plots of the componentwise relative errors obtained with the two methods. We see that, even close to the holes and the large-gradient areas, the accuracy of the fit is reasonable, but more importantly, these difficult local features do not contaminate the smoother regions.
In general, BTSVD is applicable (and has been applied; see [197]) to
data in irregularly shaped domains, including holes, provided that these will
also be the domains of evaluation of the model functions. Such domains
must first be embedded into a rectangular region.
An important question that we have not yet considered is, How do we
choose the number of basis functions J and K? The aim is for the resulting
representation to preserve the character of the discrete data without intro-
ducing extraneous artifacts and/or excessive smoothing. To address this
issue we tried a multigrid strategy. Starting from a coarse spline mesh and
possibly a subset of the data, we solved the resulting least squares problem.
The quality of the fit was checked via the residual norm. Note that in ad-
dition, other properties of the model could be evaluated by computing local differences and comparing them with the derivatives of the fitting function:
$$B_x(x, y) = \sum_{j,k} c_{jk}\, B_j'(x) B_k(y), \qquad B_y(x, y) = \sum_{j,k} c_{jk}\, B_j(x) B_k'(y).$$

The norm of the difference between these quantities can be used in conjunc-
tion with the function residual to choose an appropriate parameter space
(i.e., the values of J and K).

Figure 11.3.2: Singular values for SEG salt model.

CONLIM     RMS            Large res.   # iters   LSQR/BTSVD time
10^2       8.5 · 10^-3    8%           31        0.1
10^3       7.1 · 10^-3    5%           66        0.24
5 · 10^3   7.06 · 10^-3   5%           167       0.6
10^4       7.05 · 10^-3   5%           863       3.1

Table 11.4: LSQR performance for various values of CONLIM.



Figure 11.3.3: Relative error for LSQR. As expected, the largest errors (of
25% for about 8% of the points) are close to the faults and the salt inclusion
(holes). Elsewhere, the errors are as small as 0.025%.

Figure 11.3.4: Relative error for BTSVD. The error behavior is similar to
LSQR.

Figure 11.3.5: Monterey Bay, California: Data.

Example 107. Monterey Bay, California, topography


To give an idea of the computing resources needed, we describe a sim-
ilar exercise with a data set that corresponds to altimetry and bathymetry
(height and underwater depth measurements) of Monterey Bay, California,
showing the well-known and striking underwater canyon there. The data
set has m = 40, 000 data points in a uniform 200 × 200 mesh, and the com-
plete approximation uses n = 10, 000 basis functions, for a LSQ problem
of dimensions 40, 000 × 10, 000. Decomposing the domain into 5 × 5 blocks
(of dimension 40 × 40), initializing this time with the CG method (which
gives similar results to LSQR) and then finishing up (because of potential
ill-conditioning) with the BTSVD method, we obtain a robust and efficient
algorithm that takes less than one minute on a MacBook Pro (with Intel
Duo processors).

Figure 11.3.6: Monterey Bay, California: B-spline fit.


Chapter 12

Nonlinear Least Squares Applications

In this chapter we consider several nonlinear least squares applications in


detail, starting with the fast training of neural networks and their use in
optimal design to generate surrogates of very expensive functionals. Then
we consider several inverse problems related to the design of piezoelectrical
transducers, NMR and geophysical tomography. Not surprisingly, several
of the applications lead to separable problems that are solved using variable
projection.

12.1 Neural networks training


Neural networks are nonlinear parametric models. Training the network
corresponds to fitting the parameters in these models in the least squares
sense by using pre-classified data (a training set) and an optimization algo-
rithm. Since the nonlinear least squares problem (NLLSQ) that results is
one whose linear and nonlinear variables separate, the use of the variable
projection (VP) algorithm is possible, and it increases both the speed and
the robustness of the training process [101, 103, 168, 182, 199, 227, 252,
253, 254], compared to traditional methods.
First we explain the necessary neural network concepts and notations
that lead to the specific least squares problem that has to be solved to
train the network. Then we explain and test a training algorithm based
on variable projection. The algorithm is applicable to one hidden layer,
fully connected, neural network models using several types of activation
functions. A generalization of VP developed and implemented by Golub
and Leveque [98] (see also [84, 85, 141]) can be used for the case of multiple
outputs.


Neural network concepts


Neural networks are a convenient way to represent general nonlinear mappings between multidimensional spaces in terms of superpositions of nonlinear functions, by using so-called activation functions and hidden units.
We discuss here multilayer, feed-forward, fully connected neural net-
works (NNs), although the techniques are applicable to more general ones,
for instance, those with feedback loops. The NN we consider consist of 3
types of nodes arranged in layers: input, hidden and output layers. Each
node can have several inputs and outputs, and they act on the input infor-
mation in ways that depend on the layer type:
• Input node: no action.
• Hidden node:

– Weighted sum of its inputs with a possible offset (bias) that can be incorporated as an additional input. If $w_i$ are the weights and $x_i$ the inputs, with $x_0$ corresponding to the bias, i.e., $x_0 = 1$, then
$$y = \sum_{i=0}^{d} w_i x_i \equiv w^T x.$$

– This linear output can be generalized by applying a nonlinear function f(·), called an activation function, so that the output is instead $z = f(y) = f(w^T x)$. Observe that this provides a standardized way to handle multivariable inputs.

• Output node: it can weight and sum inputs and additionally apply
an activation function.
In a feed-forward NN there is information flowing in only one direction. A
fully connected NN has every node in one layer connected to every node
in the next layer, and there are no connections between nodes of the same
layer. See Figure 12.1.1 for the general architecture of a NN.
The notation used is as follows:
• The training set is defined by (input/output) pairs, $x_{1,i}$ and $t_i$, $i = 1, \ldots, m$.
• At the lth layer, $x_{l,i}$ and $x_{l+1,i}$ denote the input and output vectors, whereas $y_{l,i}$ is used for the vector of weighted sums.
• The final output is $x_{L+1,i}$.
• Note that these vectors may have different lengths in different layers, given that the number of nodes can vary.
NONLINEAR LEAST SQUARES APPLICATIONS 233

Figure 12.1.1: Neural network architecture.

• At the lth-layer, the weights for calculating the weighted sums are denoted by the vector $w_l$. The number of its elements is the product
of the number of nodes in layer l − 1, times the number of nodes in
layer l.
• The total weight vector for the NN will be denoted by $W^T = (\,w_1^T\ \ w_2^T\ \ \cdots\ \ w_L^T\,)$.

Perceptron models
These are networks that use nonlinear activation functions, and they are
usually applied to classification problems. To allow for a general mapping,
one must consider successive transformations corresponding to several lay-
ers of adaptive parameters. The most common choice for the activation
function is the logistic sigmoid function, see Figure 12.1.2, defined by
$$f(y) = \frac{1}{1 + e^{-y}}. \qquad (12.1.1)$$
Some of its properties are
• For y ∈ (−∞, ∞), f(y) ∈ (0, 1).
• The derivative approaches 0 when |y| ≫ 1.
• It has a simple derivative form: f′(y) = f(y)(1 − f(y)).
• It is continuously differentiable and able to approximate the hard delimiter step function.
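As a small illustration of these properties, the following self-contained NumPy sketch (our own, not part of any code referenced in this chapter) evaluates the sigmoid, its derivative via the identity above, and a hidden-node output $z = f(w^T x)$ with the bias handled through $x_0 = 1$.

```python
import numpy as np

def sigmoid(y):
    # Logistic sigmoid (12.1.1); maps (-inf, inf) to (0, 1).
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_derivative(y):
    # Uses the identity f'(y) = f(y) * (1 - f(y)).
    f = sigmoid(y)
    return f * (1.0 - f)

def hidden_node_output(w, x):
    # w and x include the bias term: x[0] = 1, w[0] = bias weight.
    return sigmoid(np.dot(w, x))

if __name__ == "__main__":
    w = np.array([0.5, -1.2, 0.8])      # bias weight plus two input weights
    x = np.array([1.0, 0.3, 2.0])       # x[0] = 1 is the bias input
    print(hidden_node_output(w, x))
    print(sigmoid_derivative(10.0))     # close to 0 when |y| >> 1
```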

Figure 12.1.2: Sigmoid function.

Since the sigmoid functions are dense in the space of continuous functions,
a single hidden layer NN can approximate any continuous function to any
precision, if enough nodes are used [54].

Training as a separable nonlinear least squares problem


The training of a neural network uses known input-output pairs (the train-
ing data) in order to find the NN parameters that best reproduce their
behavior. A frequently used measure of performance is the sum of squares
of the residuals between the NN output and that of the training set. Given
a training set defined by (x1,i , ti ), i = 1, . . . , m and assuming that the out-
puts are independent, the weight parameters W are to be determined so
that the sum of squares residuals is minimized:

$$\min_{W}\ \frac{1}{2}\sum_{i=1}^{m} \|x_{L+1,i} - t_i\|_2^2, \qquad (12.1.2)$$

where, of course, $x_{L+1,i}$ depends nonlinearly on both the weights $W$ and the inputs $x_{1,i}$.
When there is a correlation between the outputs, or the errors have
different sizes, [181] suggests a more appropriate formulation involving the
output error correlation matrix, giving rise to a generalized least squares
problem [20]. See also Chapter 6 in [15].
In what follows we will consider a simplified 3-layer NN such that there is only one output node to which no activation function is applied. In this case the approximating function reduces to a weighted combination of activation functions, and the training of the network leads to a separable nonlinear least squares problem, for whose solution we propose to use variable projection.
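To make the separable structure concrete, the following sketch (our own illustration in NumPy, not the VARPRO code discussed below) builds the matrix of hidden-node outputs for fixed nonlinear weights, solves for the optimal linear output weights by linear least squares, and returns the reduced (variable projection) residual; an outer optimizer such as Levenberg-Marquardt would then minimize this residual over the nonlinear weights only.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def hidden_layer_matrix(alpha, X):
    # X: m x d matrix of inputs (one row per training sample).
    # alpha: q x (d+1) matrix of nonlinear weights, one row per hidden node
    #        (last column is the bias). Returns the m x (q+1) design matrix
    #        whose columns are the hidden-node outputs plus a constant column.
    m = X.shape[0]
    Xb = np.hstack([X, np.ones((m, 1))])          # append bias input
    Phi = sigmoid(Xb @ alpha.T)                   # m x q
    return np.hstack([Phi, np.ones((m, 1))])      # constant output-bias column

def reduced_residual(alpha, X, t):
    # Variable projection idea: for fixed nonlinear weights alpha, the optimal
    # linear output weights solve a linear least squares problem; the reduced
    # residual depends on alpha only.
    Phi = hidden_layer_matrix(alpha, X)
    a, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return t - Phi @ a, a

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))              # 50 samples, 3 inputs
    t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
    alpha = rng.standard_normal((5, 4))           # 5 hidden nodes, 3 inputs + bias
    r, a = reduced_residual(alpha, X, t)
    print(0.5 * np.dot(r, r))                     # value of the reduced functional
```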

Why variable projection?

Besides the obvious advantage of reducing the number of parameters to be determined by eliminating the linear parameters, it has been shown [169, 181, 212, 227] that the resulting reduced problem is better conditioned than the original full one and that, if the same optimization algorithm is used, it always converges in fewer iterations. Since with a careful implementation the cost per iteration is about the same, these results and extensive practical experience as reported in [103] (where a wealth of references can be found) show that there is a net gain in using variable projection over solving the unreduced problem with a conventional nonlinear least squares algorithm. Since neural networks are usually trained using very slow algorithms (requiring hundreds of thousands of iterations), such as back propagation (gradient method with step control), the gain compared to these conventional training methods is even larger.
On the negative side (for every algorithm), if there are no good a priori
initial values for the nonlinear parameters, then these nonlinear, non-convex
problems will frequently have multiple solutions and therefore some kind of
global optimization technique will need to be used to escape from undesired
local minima. We show in the computational section below that a simple
Monte Carlo method in which multiple initial values are chosen at random
helps with this problem.
Valid concerns of NN practitioners are the design of the network and
the so-called bias-variance dilemma. Since we are considering single hidden
layer perceptrons, the only design parameter in this case is the number
of nodes in the hidden layer, which should be as small as possible, since
parsimonious models (i.e., models with as few parameters as possible) are
to be desired. The techniques we have described before for choosing the
number of basis functions in a model are applicable here.
The second concern is related to the fact that training data in real
situations will have errors and that the resulting problems are often ill-
conditioned and hence there is a real possibility of overfitting the training
data. Minimizing the bias (i.e., the residual norm) and the variance are de-
sirable but conflicting goals. As in any bi-objective optimization problem, a
compromise has to be reached. The procedure described below uses a mod-
ern implementation of the Levenberg-Marquardt method (cf. [151]), which
has a built-in approach to palliate the above problem, if an approximation
to the variance is available. We have also discussed in detail ill-conditioned
problems and regularization in Section 10.
236 LEAST SQUARES DATA FITTING WITH APPLICATIONS

VARPRO program and numerical tests

One of the variable projection implementations we use is based on the original program written by Pereyra [101] as modified by J. Bolstad [263]. The
minimization method used is a modification by Osborne of the Levenberg-
Marquardt (L-M) algorithm, and it also includes the Kaufman simplifica-
tion. Careful implementation of the linear algebra involved in the differenti-
ation of the variable projector and the L-M algorithm produces an efficient
algorithm. The information provided by the program allows for a statisti-
cal analysis, including, for example, uncertainty bounds in the parameter
estimations (see [113, 213] for a brief overview).
Analysis and experimental results in [217] for fully connected NNs using sigmoid activation functions (which have limited discrimination capabilities) suggest that many network training problems are ill-conditioned.
One can show that columns of the Jacobian can easily be nearly linearly
dependent. As proved in [227], the use of the variable projection method
improves the condition of the problem.
The interesting analysis in [76] might be useful to design an even more
robust algorithm that would be applicable not only to the original NN
problem but also to nonlinear separable problems in general. In fact the
authors show theoretically and practically that regularizing the NN non-
linear problem directly (using the Tikhonov approach to compensate for
ill-conditioning) and then linearizing using Gauss-Newton gives a problem
with a smaller condition number than the usual approach for a nonlinear
problem, namely, to linearize and then regularize (Levenberg-Marquardt).

Example 108. We consider a test problem from a public benchmark, about the prediction of hourly electrical energy consumption in a building (named
“a”), based on the date, time of day, outside temperature, outside air hu-
midity, solar radiation and wind speed. The data can be found in [203].
Complete hourly data for four consecutive months is provided for training,
and output data for the next two months should be predicted.
The purpose of this test is to show the performance of the training algo-
rithm and not to design the “optimal” network. The combination of a small
network, a large number of data points, and the regularization provided
by the use of the variable projection and Levenberg-Marquardt algorithms
should guarantee that there will be no overfitting of the training data (see
Chapter 9 in [15] for a more complete discussion of these issues).
We use a single hidden layer network with 5 nodes, sigmoid activation
functions and a constant bias. Since the extrapolation that is required will
not overlap in the time of the year with the training data, we ignore the
year and month and use only an hour count, hour of the day count, and a
day of the week count as time input variables. We also include the output
data of the three previous hours as memory (previous hour, first and second differences), and thus we have 10 input variables and 1 output variable.

Figure 12.1.3: Results for building "a"; perceptron training with 5 nodes.


This is then a feedback network. All the variables are normalized to mean
0 and variance 1.
The resulting model, when compared with the normalized training data, has a residual mean error $\sum_{i=1}^{m} r_i^2/m = 0.17$. Figure 12.1.3 shows some
detailed results of the training step, including a comparison of calculated
against training output, the absolute error at the training points as a func-
tion of the hour counter, and finally a scatter plot of those quantities over
the whole range (a perfect fit should give a straight line with a slope of 45◦ ).
The main observation is that the global fit of this extensive and oscilla-
tory data set with just 5 nodes (i.e., 55 nonlinear and 6 linear parameters)
is remarkably good, and the training time per run is still only 24.2 seconds
on average on an 800 MHz PC.

12.2 Response surfaces, surrogates or proxies

Now that computer power is such that many complex multi-physics prob-
lems (i.e., involving several simultaneous physical phenomena) can be ad-
equately modeled, scientists and engineers are pushing the boundary to
the next challenge that requires many simulations, such as those involved
in optimal design and material identification. In order to make the solu-
tion of these problems practical with current hardware, one has to resort
either to large-scale parallelization or, as we describe below, to response
surfaces, surrogates or proxies, which are approximations obtained from a
limited training set of simulations. Another application of these techniques
is the fast evaluation of simulations (i.e., faster than by full high-fidelity
modeling), required in many real-time applications, as we exemplify below.
Many approximations are used in practice, such as low-order polynomials,
but here we will use instead neural networks. Low-order polynomials may
not lead to appropriate proxies, and the number of terms grows rapidly
with the number of unknowns.

Example 109. This real data test is taken from a finite element simulation
database developed by Weidlinger Associates Inc. to evaluate the progressive
collapse potential of reinforced-concrete and steel-frame buildings; see Fig-
ure 12.2.1 for an example. The independent variables correspond to config-
uration parameters (such as bay size, beam type, etc.) and the load applied
to the structure. The dependent variable is the response of the structure
(e.g., deflection) to such load. Hence, the data set defines the load-response
characteristics of the structure for particular configurations, as illustrated
in Figure 12.2.2.
One of the data sets taken from the database and used for illustration
herein is given in Table 12.1. The independent variables (x1 , x2 , x3 ) cor-
respond to the beam type, the bay size and the applied load, respectively.
The dependent variable, y1 , corresponds to the deflection response. Many
other response quantities of interest such as rotation, thrust and moment
are included in the database, but have been left out of the sample data set
for the sake of clarity.
We have used program VARPRO for training, considering different numbers of nodes with sigmoid activation functions in a single hidden layer, fully connected perceptron, with three input and one output node. In Table 12.2 we summarize the results on an 800 MHz Celeron PC, running under the
Solaris x86 operating system. For the last two rows, the RMS indicates that
we are essentially interpolating the data. This surrogate can then be used
to rapidly evaluate the response for other values of the input parameters.

Figure 12.2.1: A typical frame unit of a building.

Figure 12.2.2: Load-response characteristics.



x1 x2 x3 y x1 x2 x3 y
90 20 10 −0.71 90 20 30 −6.50
308 20 80 −0.82 306 20 140 −17.05
824 20 150 −0.58 824 20 400 −20.57
1125 20 200 −0.52 1125 20 550 −19.73
1607 20 300 −0.57 1507 20 800 −24.52
90 25 10 −1.42 90 25 30 −18.54
308 25 20 −0.46 308 25 100 −16.25
824 25 100 −0.67 824 25 300 −23.31
1125 25 150 −0.72 1125 25 400 −20.05
1607 25 250 −0.93 1607 25 600 −25.38
90 30 10 −2.67 90 30 30 −26.61
308 30 20 −0.79 308 30 60 −19.03
824 30 100 −1.30 824 30 250 −30.86
1125 30 150 −1.46 1125 30 300 −17.78
1607 30 200 −1.21 1807 30 450 −22.15
90 20 20 −1.94 90 20 50 −23.59
308 20 100 −3.36 308 20 180 −33.97
824 20 300 −4.32 824 20 500 −41.02
1125 20 400 −3.51 1125 20 685 −40.60
1607 20 600 −4.79 1607 20 928 −41.09
90 25 20 −5.92 90 25 40 −26.62
306 25 40 −0.98 308 25 160 −45.40
824 25 200 −3.38 824 25 400 −49.30
1125 25 300 −4.57 1125 25 550 −49.85
1807 25 450 −5.56 1607 25 710 −53.26
90 30 20 −13.73 90 30 40 −33.77
308 30 40 −1.96 308 30 120 −44.38
824 30 I −3.60 824 30 350 −59.24
1125 30 250 −7.12 1125 30 500 −63.97
1607 30 350 −6.11 1607 30 685 −67.05

Table 12.1: Data used in example. x1 = beam type, x2 = bay size, x3 = applied load, y = deflection response.

# nodes   RMS   iterations   CPU time in seconds (100 runs)   # nonlinear parameters
2 0.3693 52 2.39 8
4 0.1128 167 18.71 16
6 0.089 214 53.91 24
8 0.0551 262 102.48 32
10 0.0175 300 156.42 40
12 0.0053 287 205.09 48
14 6.25 · 10−13 185 251.77 56
16 1.23 · 10−13 72 240.4 64

Table 12.2: Results for progressive collapse example. The CPU time is the
accumulated time for 100 runs.

12.3 Optimal design of a supersonic aircraft


Optimization problems in many industrial applications are extremely hard
to solve in a general manner. Good examples of such problems can be
found in the design of aerospace systems. Because of the high level of
integration of today’s systems and the increase in complexity of analysis and
design methods for the evaluation of system performance, such problems
are characterized by multi-disciplinary simulations, goal functionals and
constraints that are expensive to evaluate, and, in addition, they often
have large-dimensional design parameter spaces that further complicate the
solution of the problem.
The resulting optimization problems are frequently non-convex, i.e.,
multi-modal and ill-conditioned, making complete optimizations of com-
plex aircraft configurations prohibitively expensive. Finally, the simula-
tions may require using legacy or commercial codes that have to be used as
black boxes that, in particular, may not produce the derivative information
required by some optimization techniques.
Multi-modal, ill-conditioned problems require global optimization tech-
niques and regularization [118, 189] and present some of the most chal-
lenging problems for robust initialization and ulterior accurate solution. In
high-dimensional spaces, the available techniques are problematic at their
best and one often must resort to surrogate models, divide and conquer
techniques and parallel computing, in order to even have a chance to solve
the problem in a reasonable time [2, 45, 188, 189].
As we indicated above, the most common surrogate models consist of
low-degree polynomial approximations. These are inadequate as surrogates
for highly discontinuous functions such as supersonic boom. In this section
we use neural networks to produce surrogates that more faithfully and
successfully reproduce such functions. As before, we also use a fast training
tool for generating the NN via the variable projection method.

Example 110. We consider the generation of sample data for the aero-
dynamic and boom characteristics of a generic supersonic aircraft config-
uration, the fitting of this data using neural networks, and the use of the
resulting surrogate models in a representative design optimization problem.
For this test example, some of the most relevant design parameters were
selected, as described below. Although we use only a small number of design
variables relative to a realistic aerospace vehicle design, the problem has all
of the elements necessary to exercise and evaluate the proposed initializa-
tion and optimization techniques and also to appreciate the performance of
the surrogates.
The two goals in this bi-objective problem are to maximize the total
range of the aircraft and to minimize the perceived loudness of the ground
boom signature (measured in dBA), while satisfying a number of mission
and buildability constraints, such as

• Structural integrity of the aircraft for a N = 2.5 g pull-up maneuver.

• Takeoff field length < 6,000 ft.

• Landing field length < 6,000 ft.

For high-fidelity aerodynamic modeling we use the A502 solver, also known
as PanAir [38, 39], a flow solver developed at Boeing to compute the aerody-
namic properties of arbitrary aircraft configurations flying at either subsonic
or supersonic speeds. This code uses a higher-order (quadratic doublet, lin-
ear source) panel method, based on the solution of the linearized potential
flow boundary-value problem. Results are generally valid for cases that sat-
isfy the assumptions of linearized potential flow theory – small disturbance,
not transonic, irrotational flow, and negligible viscous effects. Once the
solution is found for the aerodynamic properties on the surface of the air-
craft, A502 can then easily calculate the flow properties at any location
in the flow field, hence obtaining the near-field pressure signature needed
for sonic boom prediction. In keeping with the axisymmetric assumption
of sonic boom theory, the near-field pressure can be obtained at arbitrary
distances below the aircraft [5].
The high-fidelity method for computing the ground boom signature is
shown in Figure 12.3.2. At the near-field plane location, the pressure sig-
nature created by the aircraft is extracted and propagated down to the ground
using extrapolation methods based on geometric acoustics.
The location of the near-field must be far enough from the aircraft, so
that its flow field is nearly axisymmetric and there are no remaining diffrac-
tion effects, which cannot be handled by the extrapolation scheme. Since
A502/Panair only uses a surface mesh for all of its calculations, it is able to obtain near-field pressures at arbitrary distances without changes in the computational cost.

Figure 12.3.1: Un-intersected components of a transonic business jet configuration (left) and intersected components forming the outer mold line, a well-defined three-dimensional surface that provides the outer geometry of the plane (right).

Figure 12.3.2: Sonic boom propagation procedure.

In this work we are using the Sboom [19] extrapolation
method to propagate near-field signatures into ground booms. The sonic
boom extrapolation method accounts for vertical gradients of atmospheric
properties and for stratified winds (although the winds have been set to zero
in this example). The method relies on results from geometric acoustics
for the evolution of the wave amplitude and utilizes isentropic wave theory
to account for nonlinear waveform distortion due to atmospheric density
gradients and stratified winds.
Past research on low-boom aircraft design focused on reducing the mag-
nitude of only the initial peak of the ground boom signature [1, 46]. This
requirement, which had been suggested as the goal of the DARPA-sponsored
Quiet Supersonic Platform (QSP) program (Δp0 < 0.3 pound/foot²), ne-
glects the importance of the full signature, which depends on the more ge-
ometrically complex aft portion of the aircraft, where empennage, engine
nacelles and diverters create more complicated flow patterns. Moreover,
such designs often have two shock waves very closely following each other
in the front portion of the signature [39, 47], a behavior that is not robust
and is therefore undesirable. For these reasons, we are computing the per-
ceived loudness (measured in dBA) of the complete signature. Frequency
weighting methods are used to account for the fact that humans do not have
an equal response to sounds of different frequencies. In these calculations,
less weight is given to the frequencies to which the ear is less sensitive.
In addition, all signatures computed are post-processed, to add a physical
rise time across the shock waves yielding loudness numbers that are more
representative of those perceived in reality.

Generation of response surfaces (surrogates)


Using the tools described above, 450 configurations obtained via Latin hyper-
cube sampling (LHS) were generated at a high computing cost. This sample
input-output data set was fitted using a neural network (NN) in order to
produce surrogates that could be used for optimization in a more economical
form. The NN was a single hidden layer perceptron with sigmoid activa-
tion functions that provided a general nonlinear model. We used for its
fast training (i.e., determination of the NN parameters) the variable pro-
jection algorithm (VARPRO) to solve the resulting nonlinear least squares
separable problem in order to generate a reduced cost approximation of the
objective space. This was combined with a global optimization algorithm,
since the resulting problems are generally multimodal [37].
The training of a single hidden layer neural network using sigmoid ac-
tivation functions leads to a separable nonlinear least squares problem, in
which the unknown parameters in the network are determined as the best
fit of the training data in the l2 sense (i.e., the input/output data contained
in the simulation database). This results in a surrogate function that is a linear combination of sigmoids:

$$S(x; a, \alpha^1, \alpha^2, \ldots) = \sum_j a_j \frac{1}{1 + e^{-(x^T \alpha^j + \alpha_0^j)}},$$

where $x \in \mathbb{R}^k$ corresponds to the input or design parameters, $a_j$ are the linear parameters and $\alpha^j \in \mathbb{R}^k$ are the nonlinear parameters. The data used corresponds to a representative supersonic business jet configuration.
For each output, the training files contain 300 data points, while a test set
of an additional 150 data points is used for evaluation of the fit. There are
eight independent or input variables, namely:

• Wing reference area;

• Wing aspect ratio;

• Longitudinal position of the wing;

• Wing sweep angle;

• Lift coefficient at initial cruise;

• Lift coefficient at final cruise;

• Altitude at initial cruise; and

• Altitude at final cruise.

There are 10 dependent variables that we want to approximate with neural networks, namely:

• Drag coefficient at initial cruise;

• Sonic boom initial rise at initial cruise;

• Sonic boom sound level at initial cruise without rise time modification;

• Sonic boom sound level at initial cruise with first type rise time mod-
ification;

• Sonic boom sound level at initial cruise with third type rise time mod-
ification;

• Drag coefficient at final cruise;

• Sonic boom initial rise at final cruise;

• Sonic boom sound level at final cruise without rise time modification;
• Sonic boom sound level at final cruise with first type rise time modification; and

• Sonic boom sound level at final cruise with third type rise time modification.

We scale the input variables so that they have zero mean and variance equal to 1. In Table 12.3 we show the results of training and testing for each one of the 10 outputs.

Output   Max. value   # Sigmoids   RMS training   RMS test   Time (sec)   Max. Res.
1    –      8    0.0000865   0.00033    872    0.0009
2    –      8    0.0033      0.206      831    0.15
3    –      10   0.406       1.2        1872   1.5
4    95     10   0.848       2.0        2332   3.0
5    95     10   0.541       2.01       2393   1.9
6    0.017  10   0.0000721   0.000229   2782   0.00023
7    3      8    0.0364      2.66       1393   0.26
8    99     10   0.414       1.14       1773   1.9
9    94     10   0.747       4.46       2090   3.2
10   94     10   0.545       1.71       1917   2.2

Table 12.3: Neural Network results for surrogate functionals.
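As an illustration of how a fitted surrogate of this form is evaluated (all parameter values below are invented placeholders, not the coefficients obtained in the study), the following sketch standardizes the eight inputs to zero mean and unit variance and computes the surrogate output together with an RMS misfit on a test set, the quantity reported in Table 12.3.

```python
import numpy as np

def surrogate(X, a, alpha, alpha0):
    # X: n x k standardized design points; a: (q,) linear parameters;
    # alpha: q x k nonlinear parameters; alpha0: (q,) offsets.
    # Returns S(x) = sum_j a_j / (1 + exp(-(x^T alpha^j + alpha0_j))).
    Z = X @ alpha.T + alpha0
    return (1.0 / (1.0 + np.exp(-Z))) @ a

def rms(y_pred, y_true):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_raw = rng.uniform(size=(150, 8))            # 150 test points, 8 design variables
    # Standardize to zero mean and unit variance, as done for the training data.
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
    q = 10                                        # number of sigmoids (hypothetical)
    a = rng.standard_normal(q)
    alpha = rng.standard_normal((q, 8))
    alpha0 = rng.standard_normal(q)
    y_true = rng.standard_normal(150)             # placeholder test outputs
    print(rms(surrogate(X, a, alpha, alpha0), y_true))
```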
Using these surrogates we conduct a number of multiobjective genetic
algorithm optimizations that attempt to simultaneously maximize the range
(in nautical miles) and minimize the perceived noise level at the ground
(in dB), while satisfying all the constraints. The NSGA-II evolutionary
algorithm [59] is used to generate an approximation to the Pareto front,
and a summary of the most promising results can be seen in Figure 12.3.3.

The initial population is composed of 64 alternative designs that are randomly distributed within the design space. The level of performance of
these designs is represented by the open circles in Figure 12.3.3. These
circles constitute a pair of (range, loudness) for each of the aircraft in
question. Note that the optimization algorithm attempts to drive all designs
toward the upper left-hand corner of the graph. Of the 64 initial design
alternatives only some represent feasible designs, in the sense that they
satisfy the constraints of the problem (takeoff and landing field lengths <
6,000 ft). The open circles represent designs that do not meet the design
constraints and are therefore infeasible and must be discarded. That is the
reason why some open circles appear to have extraordinary performance.
Notice, again, that the baseline configuration does not meet the constraints (takeoff field length in this case) of the optimization problem.

Figure 12.3.3: Results in objective space for the multiobjective genetic algorithm optimization. Initial population: circles. Final population: asterisks (all feasible designs).


In the genetic algorithm we choose a crossover probability for the vari-
ables of 0.75 and a mutation probability of 0.125. The distribution indices
for both the crossover and mutation operations are taken to be 12% and
25%, respectively. The population is allowed to evolve over 1,000 genera-
tions. As the generations are evolved, the entire population migrates toward
feasibility and attempts to maximize their range and minimize their noise
level simultaneously. The result is the Pareto front of solutions given by
the asterisks in the graph. This Pareto front represents the set of non-
dominated solutions of the problem (no other solution is better than any
of those solutions in both performance measures simultaneously). Notice
that all asterisks on the Pareto front correspond to aircraft that satisfy the
constraints of the problem (that was not the case for the initial population).
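The non-domination test that defines the Pareto front can be written compactly; the following sketch (a plain filter, not part of the NSGA-II implementation used here) extracts the non-dominated designs from a list of (range, loudness) pairs when range is maximized and loudness minimized. The design values in the example are random placeholders.

```python
import numpy as np

def pareto_front(range_vals, loudness_vals):
    # Design i is dominated if some j has range >= and loudness <=,
    # with at least one strict inequality. Returns indices of the front.
    n = len(range_vals)
    keep = []
    for i in range(n):
        dominated = False
        for j in range(n):
            if j == i:
                continue
            if (range_vals[j] >= range_vals[i] and loudness_vals[j] <= loudness_vals[i]
                    and (range_vals[j] > range_vals[i] or loudness_vals[j] < loudness_vals[i])):
                dominated = True
                break
        if not dominated:
            keep.append(i)
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    rng_nm = rng.uniform(2000, 4000, size=64)     # hypothetical ranges (nautical miles)
    loud_db = rng.uniform(65, 90, size=64)        # hypothetical loudness (dBA)
    print(pareto_front(rng_nm, loud_db))
```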
The designs/aircrafts on the Pareto front provide the designer with al-
ternatives to trade-off the relative benefits of each of the two (conflicting)
objective functions: one can choose an aircraft with very low boom signa-
ture, but also very low range, or an aircraft with extremely high range, but
rather poor sonic signature, or something in between. Note that the shape
of the Pareto front is quite unique: it is initially very steep and then its
slope appears to decrease to a smaller value (the shape is reversed from the
usual L-curve because we are minimizing one objective and maximizing the
other).
In fact the Pareto front is well approximated by two straight-line seg-
ments with a knee located at a loudness level of approximately 67.5 dBA.
The interpretation is clear: it pays off to sacrifice a little bit of the quiet-
ness of the aircraft to attain significant improvements in its range. After
the loudness increases beyond 67.5 dBA, modest improvements in range are
obtained at the expense of very large increases in sonic boom loudness.
Observe that this optimization task would be impossible with current computer machinery if we needed to use the high-fidelity simulations for all the 64,000 evaluations involved in the calculation. Amazingly, the optimal design chosen, when verified with the high-fidelity simulators, had a performance within 2% of the surrogate prediction, so it would not really be necessary, for practical engineering purposes, to perform any additional refinements.

12.4 NMR spectroscopy


Nuclear magnetic resonance (NMR) in liquids and solids was discovered
more than 50 years ago by Ed Purcell (Harvard) and Felix Bloch (Stanford),
who shared the Nobel Prize in Physics in 1952 for their work. NMR is now
a fundamental analytical tool in synthetic chemistry, plays an important
role in biomedical research, and has revolutionized modern radiology and
neurology. Its applications outside the laboratory or medical clinic are as
diverse as imaging oil-wells, analyzing food, and detecting explosives.


A typical NMR experiment involves placing the sample under study in
a strong magnetic field, which forces the magnetic moments or spins of all
the nuclei in the sample to line up along the main applied field and precess
around this direction. The spins precess at the same frequency but with
random phases. Pulses of radiofrequency (RF) magnetic fields are then
applied that disturb the spin alignments but make the phases coherent and
detectable. As this state precesses in the magnetic field, the spins emit
radio-frequency radiation that can be analyzed to reveal the structural,
chemical and dynamical properties of the sample. The idea of applying
strong RF pulses is due to H. C. Torrey and E. L. Hahn. Higher and
higher magnetic fields, of the order of teslas, have been used in order to
increase the sensitivity of the method until just recently, when researchers
in the United States and Germany have shown that MR can be performed
with fields in the microtesla range, by pre-polarizing the nuclei and using a
superconducting quantum interference device (SQUID) to detect individual
flux quanta. In a magnetic field of 1.8 microtesla, the researchers observe
proton magnetic resonance in a liquid sample at about 100 Hz with an
astonishing degree of sensitivity.
In vivo NMR spectroscopy has the strongest connection and owes most
to the variable projection methodology. In 1988, van der Veen et al. [244],
a group of researchers at Delft University and Phillips Medical Systems in
the Netherlands, published a very influential paper on the accurate quan-
tification of in vivo NMRS data, using the variable projection method and
prior knowledge (constraints). According to Google Scholar, as of August
2012 this paper had more than 335 citations.
The problem here is to fit NMR spectra in the time domain, using mod-
els whose parameters have physical significance. The most commonly used
model in NMRS is a linear combination of exponentially damped sinusoids,
but other types of nonlinear functions are also considered:


$$y_n = \sum_{k=1}^{K} a_k\, e^{j\phi_k} e^{(-d_k + j 2\pi f_k)t_n} + e_n, \qquad n = 0, 1, \ldots, N,$$

where $j = \sqrt{-1}$, $a_k$ is the amplitude, $\phi_k$ is the phase, $d_k$ is the damping factor, and $f_k$ is the frequency.
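A direct transcription of this model (with invented parameter values, purely for illustration) is given below; note that the complex amplitudes $a_k e^{j\phi_k}$ enter linearly while the dampings and frequencies enter nonlinearly, which is what makes the fitting problem separable.

```python
import numpy as np

def nmr_model(t, amp, phase, damping, freq):
    # y_n = sum_k a_k exp(j phi_k) exp((-d_k + j 2 pi f_k) t_n)
    # t: (N+1,) sample times; amp, phase, damping, freq: (K,) arrays.
    lin = amp * np.exp(1j * phase)                            # complex linear parameters
    expo = np.exp(np.outer(t, -damping + 2j * np.pi * freq))  # (N+1) x K
    return expo @ lin

if __name__ == "__main__":
    t = np.arange(256) * 1e-3                                 # hypothetical sampling times
    amp = np.array([1.0, 0.5])
    phase = np.array([0.3, -0.7])
    damping = np.array([20.0, 35.0])
    freq = np.array([120.0, 310.0])
    y = nmr_model(t, amp, phase, damping, freq)
    print(y[:3])
```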
Van der Veen et al. consider NMRS measurements of human calf muscle
and human brain tissue and compare the FFTs of the original data with a
linear prediction and SVD decomposition, and VARPRO with and without
prior knowledge. The best results are obtained with VARPRO plus prior
knowledge, which was difficult or impossible to impose on previous simpler
approaches. They also list as desirable features the fact that starting values
of the amplitudes of the spectral components are not required and that there
are no restrictions on the form of the model functions.

The model, without prior knowledge, involves 256 complex data points
and 11 exponentials, for a total of 44 parameters divided equally between
linear and nonlinear. In order to obtain good initial values, the time signal
was Fourier transformed and displayed. Then, an interactive peak-picking
was performed. This provided good initial values for the frequency and
the damping of each peak. When the full functional is minimized, one also
needs initial values for the linear parameters, and these are obtained by
solving the LSQ problem obtained by evaluating the nonlinear part at the
chosen initial values, a very wise choice as indicated in the original paper
[101], since this is much better than using arbitrary values.
Van der Veen et al. first consider three different optimization methods
for the full and the reduced problem, using variable projection with the
Kaufman improvement:

• The original VARPRO algorithm with the Levenberg-Marquardt implementation,

• a secant type code NL2SOL [65],

• and LMDER, a modern Levenberg-Marquardt implementation from MINPACK [263].

According to their results, NL2SOL with the separated functional seems to be the most reliable, even for large levels of noise. MINPACK’s routine is
almost as reliable and systematically faster. We should say here that the
reported average times for this problem are about one minute on a SUN
ULTRA2 (200 MHz), so in current and future platforms the differences in
performance are negligible.
The most important conclusion drawn there is that if a VARPRO-type
code is to be used, it should include the Kaufman simplification and should
take advantage of the advances in numerical optimization. The MINPACK
NLLS solver LMDER or the Gay-Kaufman implementation are good can-
didates for replacement codes. See, however, the recent results of [172].
Some years later, a group in Leuven, Belgium [245, 246, 247], made a
comparative study for this problem, including artificial noise and also using
prior knowledge. They examined one of the data sets considered earlier by
van der Veen et al. They added different levels of white Gaussian noise,
small (5%), medium (15%) and large (25%), and considered 300 runs per
level in a Monte Carlo simulation.
The study of Vanhamme et al. considers also three different methods
to obtain starting values: HSVD, a fully automatic parameter estimation method that combines a state-space approach with SVDs; pick1, the peak-picking method described above; and pick2, a more careful (and expensive) version of pick1. Since the influence of these different initial value
choices seems to be method-independent, only results for MINPACK were
presented. The conclusions were that the desirable automatic procedure works very well for low and medium noise levels, but not as well for high
levels of noise. The procedures pick1 and pick2 are similar in performance
and reliability, with a slight edge for pick2.
Finally, they considered the effect of using prior knowledge about the
problem. In prior1, the number of linear variables is reduced to 11 by
noting that all the peaks have a phase of 135◦ . In prior2 all the known prior
knowledge is used to eliminate variables, obtaining a problem with 5 linear
and 12 nonlinear parameters. Results are given for MINPACK only and
here, using prior1, variable projection has a slight edge in performance for
small and medium noise levels, although for medium and high levels of noise
there is a deterioration in the reliability. For prior2, variable projection is
consistently more reliable than the full functional approach, with a slight
penalty in performance (under 8%), which has now come down to under 10
seconds of CPU time.
As a conclusion to this study the authors suggested using the full func-
tional instead of the reduced one, and they proceeded to write their own
solver AMARES to do that. This new solver includes some other features
special to the problem that were not present in [244]. The conclusions of
Vanhamme et al. indicate that with the Gay and Kaufman implementa-
tion of variable projection, the balance clearly tilts in favor of the reduced
functional, both in speed and reliability. Both VARPRO and AMARES are
currently offered in the MRUI system, which according to the development
group, “more than 780 research groups worldwide in 53 countries benefit
from.”

12.5 Piezoelectric crystal identification


The physical model
A piezoelectric transducer converts electrical signals into mechanical de-
formations to generate sound waves, or conversely, it converts mechanical
deformations into electric signals in order to monitor and record incoming
sound waves. Piezoelectric transducer arrays are used as ultrasound sources
and receivers for medical imaging, sonar and oil exploration applications.
There is extensive research in the ultrasound industry to validate and design
new prototype piezoelectric transducers with optimal performance charac-
teristics.

Example 111. A homogeneous piezoelectric crystal model is defined by ten elastic, electromagnetic and coupling parameters, by the geometry of
the sample and that of the electrodes. One can use a finite element (FE)
time domain code for the numerical modeling of the device response. The
numerical model can be used to provide insight into the performance of a

specific design, but also, via a nonlinear inversion process, it can be used
to determine the material property parameters of an unknown crystal from
complex impedance measurements. These properties are essential for for-
ward simulation in design processes.
The problem that we will consider involves a nonlinear least squares
inversion algorithm to improve on the accuracy of some nominal initial
values of the parameters of a given transducer crystal, using measured values
at successive time intervals of the resulting impedance, when a small voltage
is applied to the sample.
We accomplish this task by coupling forward modeling and optimiza-
tion tools, namely, PZFLEX [205], a coupled finite element time domain
code for the elastic and electromagnetic parts of the problem, and PRAXIS,
a general unconstrained minimization code for the nonlinear least squares
fit [31]. No explicit derivatives of the goal functional are required by this
code. Bound constraints are imposed in order to limit the variability of the
parameters to physically meaningful values, and, since PRAXIS is an un-
constrained optimization code, these constraints are introduced via a change
of independent variables, so that when the new variables run over the whole
space (i.e., unconstrained), the physical parameters stay in the desired box.
This can be achieved, for example, by using the transformation:

$$\alpha = \underline{\alpha}/(1 + e^{x}) + \bar{\alpha}/(1 + e^{-x}),$$

which has the desired properties, i.e., when $-\infty < x < \infty$, then $\underline{\alpha} < \alpha < \bar{\alpha}$. The inverse relationship is

$$x = -\ln\bigl[(\bar{\alpha} - \alpha)/(\alpha - \underline{\alpha})\bigr],$$

and that is the change of variables that we use to convert the box-constrained problem into an unconstrained one.
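A minimal sketch of this change of variables (with arbitrary bounds chosen only for illustration) is given below; the forward map sends any real x strictly inside the interval between the bounds, and the inverse map recovers x from a feasible α.

```python
import numpy as np

def to_bounded(x, alpha_lo, alpha_hi):
    # alpha = alpha_lo/(1 + e^x) + alpha_hi/(1 + e^{-x});
    # maps -inf < x < inf into (alpha_lo, alpha_hi).
    return alpha_lo / (1.0 + np.exp(x)) + alpha_hi / (1.0 + np.exp(-x))

def to_unbounded(alpha, alpha_lo, alpha_hi):
    # Inverse: x = -ln[(alpha_hi - alpha)/(alpha - alpha_lo)].
    return -np.log((alpha_hi - alpha) / (alpha - alpha_lo))

if __name__ == "__main__":
    lo, hi = 1.0, 3.0                      # arbitrary bounds for illustration
    x = np.array([-5.0, 0.0, 5.0])
    alpha = to_bounded(x, lo, hi)
    print(alpha)                           # stays strictly inside (1, 3)
    print(to_unbounded(alpha, lo, hi))     # recovers x
```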
In order to characterize the material properties of a given crystal, cur-
rent practice uses several IEEE standard shapes (corresponding essentially
to asymptotic limit cases) to determine different groups of parameters at a
time by trial and error. These samples are fairly expensive and sometimes
hard to manufacture. Given the analysis tools at our disposal, we will use
shapes different from the IEEE ones in order to see whether we can deter-
mine as many parameters as possible at once and in an automated manner.
For this purpose and as a pre-process, we propose to carry out sensitivity
analyses of different crystal shapes, to find configurations for which as many
as possible of the parameters can be determined at once.
For this analysis we consider the linearized model, which is represented
by the Jacobian of the Fourier transform of the impedance values with re-
spect to model parameters. These derivatives are estimated by finite dif-
ferences. Since the number of parameters is small, we can apply a direct
singular value decomposition (SVD) based solver to this rectangular matrix,
calculated at a reference value of the parameters (in practice all these pa-
rameters are known within 10% of their true values, for a given material).
As explained in Section 9.7 (see also [36, 187, 192, 200]), we can use the
SVD to rank the relative relevancy of each parameter, for a given data set.
This can be used as a guide to design geometrical configurations and place-
ment of electrodes, so that the measured impedance is sensitive to as many
parameters as possible, thus minimizing the number of samples that need to
be built in order to determine material properties accurately.

PZFLEX modeling
Operational emphasis for imaging transducers is broadband (impulsive) rather
than narrowband (continuous wave). Transducers are currently available for
diagnostic imaging and Doppler velocity measurement, as well as for a host
of specialty applications (intracavity, biopsy, etc.) and disease treatment
(lithotripsy, hyperthermia, tissue ablation). Over the past two decades the
ultrasound industry has done a remarkable job in developing and refining
these devices, using a combination of semi-analytical design procedures and
prototype experiments.
However, it is apparent to many that conventional design methods are
approaching practical limits of effectiveness. The industry has been slowly
recognizing discrete numerical modeling on the computer as a complemen-
tary and useful tool, i.e., using virtual prototyping instead of, or comple-
menting, actual laboratory experimentation. Today, nearly all of the major
ultrasound system companies are experimenting with finite element mod-
els using commercial packages like ANSYS, or by writing their own codes.
Most have enjoyed only limited success at significant development and/or
simulation costs.
We suggest that the main source of difficulty is universal reliance on
classical implicit algorithms for frequency-domain and time-domain analy-
sis based on related experience with shock and wave propagation problems.
In general, implicit algorithms are best suited to linear static problems,
steady-state vibrations and low-frequency dynamics. A much better choice
for transient phenomena, linear or nonlinear, is an explicit time-domain
algorithm, which exploits the hyperbolic (wave) nature of the governing dif-
ferential equations.
The finite element method reduces the electromechanical partial differ-
ential equations (PDEs) over the model domain to a system of ordinary
differential equations (ODEs) in time. This is done by using one of the
nearly equivalent integral formalisms: virtual work, weak form, Galerkin’s
method, weighted residuals, or less formally, using pointwise enforcement of
the conservation and balance laws. The result is that spatial derivatives in
the PDEs are reduced to a summation of “elemental” systems of linear al-
gebraic equations on the unknown field values at nodes of the finite element
discretization.
The continuum elements used in the PZFLEX code are 4-node quadrilat-
erals in two-dimensions and 8-node hexahedrons in three-dimensions. The
unknown field over an element is represented by low-order shape functions
determined by nodal (corner) values, i.e., bilinear in two-dimensions and tri-
linear in three-dimensions. Using a minimum of 15 elements per wavelength
limits wave dispersion errors to less than 1%. Experience has shown that
these choices offer the most robust basis for large-scale wave propagation
analysis in structural and isotropic or anisotropic continuum models.
When transient signals are of principal interest, the most direct solution
method is step-by-step integration in time. There are many ways to evaluate
the current solution from known results at previous time steps. Implicit
methods couple the current and previous time step solution vectors and
hence, a global system of equations must be solved at each time step. Their
advantage is unconditional stability with respect to time step. By contrast,
explicit methods decouple the current solution vectors, eliminating the need
for a global system solve, but they are only conditionally stable, i.e., there is
a time step restriction (the Courant-Friedrichs-Lewy (CFL) condition) above
which the method is unstable.
The caveat for integration of wave phenomena is that solution accuracy
requires a time step smaller than one-tenth the period of the highest fre-
quency to be resolved. This is close to the CFL stability limit for explicit
methods and effectively removes the principal advantage of implicit integra-
tion. Explicit integration of the field equations involves diagonalizing the
uncoupled mass and damping matrices, using nodal lumping, replacing the
time derivatives with finite differences and integrating using a central dif-
ference scheme (second-order accurate). For stability the time step must
be smaller than the shortest wave transit time across any element (CFL
condition). More details on the finite element approach can be found in
[36].
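As a schematic illustration of explicit central-difference time stepping and the CFL restriction (for a one-dimensional scalar wave equation with made-up material values, not the coupled electromechanical system solved by PZFLEX), consider the following sketch.

```python
import numpy as np

def explicit_wave_1d(c=1500.0, length=0.01, nx=200, nsteps=400, cfl=0.9):
    # Second-order central differences in space and time for u_tt = c^2 u_xx.
    # The time step is chosen as a fraction of the CFL limit dt <= dx / c.
    dx = length / nx
    dt = cfl * dx / c
    x = np.linspace(0.0, length, nx + 1)
    u_old = np.exp(-((x - 0.5 * length) / (0.05 * length)) ** 2)  # initial pulse
    u = u_old.copy()                                              # zero initial velocity
    r2 = (c * dt / dx) ** 2
    for _ in range(nsteps):
        u_new = np.empty_like(u)
        u_new[1:-1] = 2 * u[1:-1] - u_old[1:-1] + r2 * (u[2:] - 2 * u[1:-1] + u[:-2])
        u_new[0] = u_new[-1] = 0.0                                # fixed ends
        u_old, u = u, u_new
    return u

if __name__ == "__main__":
    print(np.max(np.abs(explicit_wave_1d())))
```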

The nonlinear least squares problem

Given a homogeneous piezoelectric crystal, a small voltage is applied and the resulting impedance is calculated from measurements as a digitized function of time. This function is fast Fourier transformed and the resulting complex samples constitute the observed data, $\{I_i^o\},\ i = 1, \ldots, m$. Given a vector of model parameters $\alpha$ and using the finite element simulator described above, we can calculate a similar response that we call $\{I_i^c(\alpha)\}$.
In the present problem, we will improve the accuracy of some of the
parameters that define a particular transducer, by applying a nonlinear least
squares inversion to this measured data, namely:


$$\min_{\alpha} g(\alpha) = \min_{\alpha} \sum_{i=1}^{m} \bigl(I_i^o - I_i^c(\alpha)\bigr)^2.$$

Not all the α’s are physically feasible; there is only a range $[\underline{\alpha}, \bar{\alpha}]$ of parameters that corresponds to possible materials and the minimization includes, therefore, bound constraints:

$$\min_{\alpha} g(\alpha), \qquad \alpha_i \in [\underline{\alpha}_i, \bar{\alpha}_i], \quad i = 1, \ldots, n.$$

Sensitivity analysis
As the manufacturing costs of transducer prototypes are high, we use a lin-
earized sensitivity analysis, as described in Section 9.7, to determine a pri-
ori which parameters can be calculated from measured impedances on several
crystal samples with different geometries. A summary of the procedure fol-
lows: Given a parameter vector α, calculate J(α) by finite differences and
its SVD.

1. Let the matrix of the right singular vectors, scaled by the corresponding singular values, be $\bar{V} = V\Sigma$.

2. Inspect the rows of $\bar{V}$ and select the elements above a certain threshold. Choose the indices of the variables in parameter space corresponding to these entries to form the subset of parameters that are most influenced by the data set. Observe that this does not establish a direct correspondence between small singular values and parameters, but rather one between singular values and linear combinations of parameters, given by the rotated variables.
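One possible reading of this ranking procedure in code (the Jacobian in the example is random and scaled only to mimic poorly determined parameters; the threshold of 0.1 mirrors the discussion below, but both are placeholders) is the following sketch.

```python
import numpy as np

def sensitivity_ranking(J, threshold=0.1):
    # J: m x n Jacobian of the (transformed) impedance data with respect to the
    # n model parameters, e.g., obtained by finite differences.
    # Returns, for each parameter, the largest magnitude among the entries of
    # its row of the scaled right singular vectors Vbar = V * Sigma, and the
    # indices of parameters whose value exceeds the threshold (well determined).
    _, s, Vt = np.linalg.svd(J, full_matrices=False)
    Vbar = Vt.T * s                       # scale columns of V by singular values
    score = np.max(np.abs(Vbar), axis=1)  # one number per parameter (row of Vbar)
    well_determined = np.where(score > threshold)[0]
    return score, well_determined

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    J = rng.standard_normal((100, 10)) * np.array([1, 1, 1, 0.001, 1, 1, 0.01, 1, 1, 1])
    score, idx = sensitivity_ranking(J)
    print(score)
    print(idx)                            # weakly sensed parameters drop out
```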

Numerical results
We consider a cylinder of piezoceramic material. The finite element mod-
eling assumes two planes of symmetry: axial at half height and radial, in
order to optimize the computation, which then becomes two-dimensional.
For a fixed diameter of 10 mm we consider several heights, giving aspect
ratios (diameter/height) between 20/1 (corresponding to one of the IEEE
standard shapes) and 1/2. By performing the analysis described in the pre-
vious section we have calculated the results shown in Table 12.4, where the
numbers under the various parameters indicate how well determined they
are when the real and imaginary part of the impedance are used as the data
set. A value larger than 0.1 indicates a well-determined parameter, while a
value less than 0.01 indicates a poorly determined one.

Asp. rat. ep11 ep33 s11 s12 s13


20/1 6e-4 0.62 0.79 0.2 0.04
5/1 0.07 0.3 0.89 0.25 0.07
1/1 0.05 0.5 0.4 0.12 0.3
1/2 0.046 0.42 0.4 0.12 0.21
Asp. rat. s33 s44 d15 d13 d33
20/1 0.07 8e-4 3e-6 0.51 0.07
5/1 0.07 0.01 0.01 0.58 0.04
1/1 0.44 0.075 0.087 0.41 0.66
1/2 0.56 0.086 0.083 0.26 0.74

Table 12.4: Sensitivity analysis for a cylinder of piezoceramic material. The bold numbers indicate well-determined parameters.

Param Name Target Final % error


ep11 dielectric 1 0.213 · 10−7 0.213 · 10−7 0.12
ep33 dielectric 2 0.297 · 10−7 0.296 · 10−7 −0.33
s11 compliance 1 1.559 · 10−11 1.530 · 10−11 2.11
s12 compliance 2 −0.441 · 10−11 −0.491 · 10−11 12.2
s13 compliance 3 −0.819 · 10−11 −0.758 · 10−11 6.81
s33 compliance 4 2.0 · 10−11 2.01 · 10−11 0.68
s44 compliance 5 4.48 · 10−11 4.49 · 10−11 0.3
d15 piezoe. stress 1. 7.19 · 10−10 7.2 · 10−10 0.32
d13 piezoe. stress 2. −2.895 · 10−10 −2.63 · 10−10 −9.08
d33 piezoe. stress 3. 6.047 · 10−10 6.521 · 10−10 7.79

Table 12.5: Target and final parameters.



Figure 12.5.1: Initial, target and final, real and imaginary parts of the
impedance spectrum for low frequencies.

These results indicate that, considering truly three-dimensional shapes, such as those with 1/1 or 1/2 aspect ratios, provides a better way to estimate
more of the relevant parameters at once than using the IEEE shapes. We
consider now the disk with aspect ratio 1/2 to test our parameter determi-
nation code. We will use PRAXIS to least squares fit a data set consisting
of the real and imaginary parts of the FFT of the impedance, which in the
range 1 KHz to 1 MHz is represented in digital form by 1341 unequally
spaced samples. One important feature of PRAXIS is that it does not re-
quire derivatives with respect to the unknown parameters, which are not
available in the large-scale code PZFLEX.
Because of symmetries and the fact that the material is considered ho-
mogeneous, the problem can be cast as a two-dimensional problem, and for
the wavelengths, materials and model size involved, a mesh of 25 × 50 ele-
ments is appropriate. For a driving frequency of 3.75 MHz, the elastic part
of PZFLEX requires 7091 time steps. On a 300 MHz Pentium II computer
under the SOLARIS operating system an average solve takes approximately
75 seconds.
We first generate synthetic data by running PZFLEX with the set of
parameters shown in Table 12.5 (the “target values”). Then we perturb these
parameter values by 5% and call them the “initial values.” The objective is
to see how many of the target values of these parameters we can recover,
say, to 1% accuracy or better. We also use a change of variables strategy
to force the parameters to stay within a ±8% box around the initial guess.
After 172 evaluations (i.e., PZFLEX solves; a 3.5-hour job), the results
are shown in Table 12.5. The initial and final residual mean squares were
64844 and 1068, respectively.

Figure 12.5.2: Initial, target and final, real and imaginary parts of the
impedance spectrum for high frequencies.

In Figures 12.5.1 and 12.5.2 we show the plots of the target and final,
real and imaginary parts of the impedance spectrum for low and high fre-
quencies separately. Observe that there is a change of scale going from low
to high frequency in order to enhance the details. In this typical example
we have been able to recover half of the parameters to the desired accuracy,
and an additional parameter has been improved by more than 50%, while
the estimates for the remaining parameters have worsened. Observe that
our bound constraints allow for a maximum error of 16% in the worst-case
scenario.

12.6 Travel time inversion of seismic data


A seismic survey for the exploration of hydrocarbons consists of generating
elastic waves and recording their back-scattering from the material discon-
tinuities in the underground, with the purpose of obtaining information
about their composition and geological structure. By processing and inter-
preting the obtained data, it is sometimes possible to extract information
of the travel time of different coherent events (reflections), which in turn
can be used to improve the knowledge of the material properties of the
subsoil. Only a restricted number of measurements is available, and these
problems are therefore highly underdetermined. In principle, the process
will estimate the rock parameters as well as the subsurface structure only
at a discrete set of points or as a finite-dimensional parametrized analytic
model.
The objective of the travel time inversion process is to improve on the
accuracy of the parameters α that appear in the description of reflector positions and inhomogeneous velocities within a given geological region
using seismic data. For this task we require a complete set of forward
modeling tools that include interactive structural model construction and
three-dimensional seismic ray tracing (see [189]).
Given a set of parameters that describe the geological medium, we can
compute via seismic ray tracing the approximate time response to a source
excitation, as measured at a geophone receiver array. We aim to determine,
from a set of $m$ observed and interpreted arrival times $t_i^0$ measured at the receivers, improved values of the $n$ model parameters $\alpha$, such that the computed arrival times $t_i^c(\alpha)$ match the observed times “as well as possible” in, for example, the $l_2$-norm sense. The resulting nonlinear least squares problem is

$$\min_{\alpha} \sum_{i=1}^{m} \bigl(t_i^0 - t_i^c(\alpha)\bigr)^2.$$
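A minimal sketch of this misfit and of the per-shot RMS used later in this section is given below; the forward model is a caller-supplied function standing in for the ray tracer, and the toy linear model in the example is only a placeholder.

```python
import numpy as np

def misfit(alpha, t_obs, forward):
    # forward(alpha) returns the m computed travel times t^c(alpha);
    # the objective is the sum of squared travel time residuals.
    r = t_obs - forward(alpha)
    return np.dot(r, r)

def rms_per_shot(alpha, t_obs, forward, shot_index):
    # shot_index[i] gives the shot to which observation i belongs;
    # returns RMS = sqrt((1/m) sum (t^0_i - t^c_i(alpha))^2) for each shot.
    r = t_obs - forward(alpha)
    shots = np.unique(shot_index)
    return {s: np.sqrt(np.mean(r[shot_index == s] ** 2)) for s in shots}

if __name__ == "__main__":
    # Toy forward model: travel times linear in the parameters (illustration only).
    rng = np.random.default_rng(4)
    G = rng.uniform(0.1, 1.0, size=(200, 6))
    alpha_true = rng.uniform(1.0, 2.0, size=6)
    t_obs = G @ alpha_true + 0.001 * rng.standard_normal(200)
    shot_index = np.repeat(np.arange(4), 50)          # 4 shots, 50 picks each
    forward = lambda a: G @ a
    print(misfit(alpha_true, t_obs, forward))
    print(rms_per_shot(alpha_true, t_obs, forward, shot_index))
```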

Travel times between sources and receivers and derivatives with respect
to the parameters are required. Calculation of travel time derivatives can be
accomplished in this case economically with information produced during
the ray tracing (see [189]). The vector of parameters α can be subdivided
into variables that define the interfaces and variables that describe material
properties in the regions between the interfaces. A seismic survey usually
extends over a considerable area, where many shots are placed together
with local receiver arrays, leading to small aperture beams of rays that
only “see” a limited part of the model at a time.
If the model is parametrized in a local way, for instance, using B-splines,
then this leads to a natural setup for dealing with the resulting very large-
scale problems (millions of observations, thousands of parameters), using
the block partitioning approach described in an earlier chapter. If we par-
tition the data set by shots, we can then determine the parameters that
are sensed by each subset of data, thus obtaining as many subproblems of
the same kind as the global one as there are shots. A block nonlinear asyn-
chronous Gauss-Seidel iteration leads to a solution of the global problem
that is naturally parallelizable in a network of computers (see [187]).

Example 112. Application to synthetic data


We consider a synthetic data set corresponding to a geothermal field in
Indonesia. The aim of the exercise is to demonstrate the viability of the
travel time inversion to provide velocity sections in regions with complex
near-surface geology and significant topographic relief, using both turning-
ray and reflection data. The data are based on an acquisition geometry and
geological structure for the Silangkitang, Indonesia, field area. By synthetic
data we mean that, given a structural model and a set of model parameters
αtarget for the velocity, we generate by ray tracing a set of turning-ray and
reflected travel times. We call those travel times “the observed data.” We
then alter the parameters, generating an initial velocity and improve it step
by step using the methods described above. This is a standard procedure
to generate problems with known solution in order to test and validate a
complex algorithm.
The Silangkitang two-dimensional section has a lateral extent of 4000m,
and we invert it for a depth of 1000m. The shot gather data are given by
as many records as data points, in the format

xsource , ysource , zsource , xreceiver , yreceiver , zreceiver , travel time.

From these coordinates we deduce the topography and fit a B-spline curve to represent it. In this model there was no reflection data available, but we put
an artificial reflector at 1000 m depth to create the synthetic data. We use
50 shots, distributed along the surface of the model, and 160 fixed receivers
spaced at 25m.
As we indicated above, we start with the turning-ray data (also referred
to as refracted arrivals or diving rays), which join source to receivers without
reflecting. In order to create a one-dimensional (variation in depth only)
initial velocity model, we examined the travel time curves for a few shots
and decided on an appropriate gradient:

v^init(z) = v₀ + gz.

We add to this background gradient a one-dimensional B-spline correction
with 8 equally spaced basis functions. With this gradient and the correction
set to zero, we proceed to perform one sweep of a block Jacobi inversion
procedure, i.e., for each block we solve the nonlinear least squares problem
and save the updated parameters. Once all the blocks are processed we
average those parameters that had more than one correction. Block Jacobi
is a synchronous procedure. Currently, we favor a Gauss-Seidel approach
that does averaging on the fly for parameters that appear in more than one
block.
The next step consists of replicating this velocity laterally, creating a 16×
8 mesh of two-dimensional tensor product B-spline basis functions. This
two-dimensional model was then inverted. We performed two full sweeps.
In Figure 12.6.1 each symbol represents the residual RMS associated with
one shot,
    RMS = ( (1/m) ∑_{i=1}^m (t_i^0 − t_i^c(α))² )^{1/2},

measured in seconds. In the upper left panel of Figure 12.6.1 we see that
the initial guess (*) is so poor that a number of shots do not produce enough
arrivals and therefore they are absent in the inversion. The one-dimensional

Figure 12.6.1: RMS at different stages of inversion of the Silangkitang synthetic data set. We show RMS for the observations corresponding to each of 50 shots.

correction improves matters, especially in the middle of the model, but it
is the two-dimensional lateral correction that brings the maximum RMS
to about 17msec. For real, good quality data we would consider a good fit
to have a maximum RMS under 10msec. In Figure 12.6.1 we cross-plot
the RMS per shot for the two data sets at different stages of this process.
Using the computed two-dimensional velocity as an initial guess, we invert
the synthetic reflection data: that completes a full sweep through the whole
data set. Now the maximum RMS has been reduced to below 3msec (lower-
right figure), an excellent result that validates the approach, but, of course,
remember that this is synthetic, noiseless data.
Appendix A

Sensitivity Analysis

A.1 Floating-point arithmetic


The classic reference for finite precision arithmetic is Wilkinson’s mono-
graph “Rounding Errors in Algebraic Processes” [255], while a more recent
treatment is Higham’s “Accuracy and Stability of Numerical Algorithms”
[128]. Almost any numerical analysis book has an introductory chapter
about this topic. Here we list some of the basic ideas used in our text.
Digital computers use floating-point representation for real and complex
numbers based on the binary system, i.e., the basis is 2. Real numbers are
rewritten in a special normalized form, where the mantissa is less than 1.
Usually there is the option to use single (t-digit) or double (2t-digit) length
mantissa representation and arithmetic. If we denote by fl(x) the floating-
point computer representation of a real number x, and by ⊕ the floating-
point addition, then the unit round-off μ (for a given computer) is defined
as the smallest ε such that in floating-point arithmetic: fl(1) ⊕ ε > fl(1).
For a binary t-digit floating-point system μ = 2^{−t}. The machine epsilon
εM = 2μ is the gap between 1 and the next larger floating-point number
(and, thus, in a relative sense, gives an indication of the gap between the
floating-point numbers). Several of the bounds in this book contain the unit
round-off or the machine precision; it is therefore advisable to check the size
of εM for a particular machine and word length. A small Fortran program is
available from Netlib to compute the machine precision for double precision
and can be adapted easily for single.
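For a quick check, the halving loop below (a Python stand-in for that routine, not the routine itself) recovers both quantities under IEEE double precision:

import numpy as np

eps = 1.0
while 1.0 + eps > 1.0:      # halve until 1 + eps is no longer distinguishable from 1
    eps /= 2.0
# On exit eps = 2^(-53), i.e., the unit round-off mu for IEEE double precision;
# the machine epsilon eps_M = 2*mu is the gap between 1 and the next float.
print("unit round-off mu =", eps)
print("machine epsilon   =", 2.0 * eps)
print("numpy finfo eps   =", np.finfo(float).eps)   # should equal 2*mu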
Representation error. The relative error in the computer representation fl(x) of a real number x ≠ 0 satisfies

    |fl(x) − x| / |x| ≤ μ,

implying that fl(x) ∈ [x(1 − εM ), x(1 + εM )].


Rounding error. The error in a given floating-point operation ⊛, corresponding to the real operation ∗, satisfies

    fl(x) ⊛ fl(y) = (x ∗ y)(1 + ε),  with |ε| ≤ μ.

To measure the cost of the different algorithms described in the book


we use as the unit the flop. A word of warning though: its definition differs
from one author to the other; here we follow the one used in [105, 128],
which is also common in many articles in the literature.

Definition 1. A flop is roughly the work associated with a floating-point


operation (addition, subtraction, multiplication, or division).
In March 2011 the cheapest cost per Gigaflop (109 flops) was $1.80,
achieved on the computer cluster HPU4Science, made of six dual Core 2
Quad off-the-shelf machines at a cost of $30,000, with performance enhanced by combining the CPUs with graphics processing units (GPUs). In comparison,
the cost in 1984 was $15 million on a Cray X-MP.

A.2 Stability, conditioning and accuracy


A clear and concise review of these topics can be found in [57, 128, 237].
One general comment first: given a t-digit arithmetic, there is a limit to the
attainable accuracy of any computation, because even the data themselves
may not be representable by a t-digit number. Additionally, in practical
applications, one should not lose sight of the fact that usually the data,
derived from observations, already have a physical error much larger than
the one produced by the floating-point representation.
Let us formally define a mathematical problem by a function that relates
a data space X with a solutions space Y, i.e., P : X (data) → Y(solutions).
Let us also define a specific algorithm for this problem as a function P̃ :
X → Y. One is interested in evaluating how close the solution computed
by the algorithm is to the exact solution of the mathematical problem.
The accuracy will depend on the sensitivity of the mathematical problem P to perturbations of its data, the condition of the problem, and on
the sensitivity of the algorithm P̃ to perturbations of the input data, the
stability of the algorithm.
The condition of the mathematical problem is commonly measured by
the condition number κ(x). We emphasize the problem dependency, so,
for example, the same matrix A may give rise to an ill-conditioned least
squares problem and a well-conditioned eigenvector problem.
A formal definition of the condition number follows.

Definition 2. The condition number is defined by

    κ(x) = sup_{δx} ( ‖P(x + δx) − P(x)‖₂ / ‖P(x)‖₂ ) / ( ‖δx‖₂ / ‖x‖₂ ).

If the mapping P is differentiable with Jacobian [J(x)]_{ij} = ∂P_i/∂x_j, the above
formula can be replaced by

    κ(x) = ‖J(x)‖₂ ‖x‖₂ / ‖P(x)‖₂.

The condition number κ(x) is the leading coefficient of the data perturbation in the error bound for the solution.
Among the possible formalizations of the stability concept, backward
stability is a convenient one; an algorithm is backward stable if it computes
the exact solution for slightly perturbed data, or in other words, quoting
[237], an algorithm is backward stable if “[it] gives the right answer to nearly
the right question”: P̃(x) = P(x̃) = P(x + Δx). The perturbation of the
input data, ‖Δx‖₂, is the backward error. Formally,

Definition 3. The algorithm P̃ is backward stable if the computed solution
x̃ is the exact solution of a slightly perturbed problem; i.e., if

    P̃(x) = P(x̃)

for some x̃ with

    ‖x̃ − x‖₂ / ‖x‖₂ = O(εM).
One can bound the actual error in the solution P(x) of a problem with
condition number κ(x) if it is computed using a backward stable algorithm
P̃(x):

    ‖P̃(x) − P(x)‖₂ / ‖P(x)‖₂ = O(κ(x) εM).
In words, a backward stable algorithm, when applied to a well-conditioned
problem, yields an accurate solution, and if the backward error is smaller
than the data errors, the problem is solved to the same extent that it is actu-
ally known. On the other hand, the computed solution to an ill-conditioned
problem can have a very large error, even if the algorithm used was back-
ward stable, the condition number acting as an amplification factor of the
data errors.
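
The following Python sketch illustrates this amplification on the classical, notoriously ill-conditioned Hilbert matrix; the solver is backward stable, yet the computed solution loses roughly log₁₀ κ digits (the matrix and sizes are chosen only for demonstration):

import numpy as np

for n in (4, 8, 12):
    i = np.arange(n)
    A = 1.0 / (i[:, None] + i[None, :] + 1.0)   # Hilbert matrix of order n
    x_exact = np.ones(n)
    b = A @ x_exact
    x = np.linalg.solve(A, b)                   # backward stable (LU with pivoting)
    err = np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact)
    print(f"n={n:2d}  kappa={np.linalg.cond(A):9.2e}  relative error={err:9.2e}")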
Appendix B

Linear Algebra Background

B.1 Norms
Vector norms
The most commonly used norms for vectors are the l1 -, l2 - and l∞ -norms,
denoted by || · ||1 , || · ||2 , and || · ||∞ , respectively. These norms are defined
by:
• ‖v‖₁ = ∑_{i=1}^n |vᵢ|.

• ‖v‖₂ = (∑_{i=1}^n |vᵢ|²)^{1/2} = (vᵀv)^{1/2} (Euclidean norm).

• ‖v‖∞ = max_{1≤i≤n} |vᵢ| (Chebyshev norm).

A useful relationship between an inner product and the l2-norms of its
factors is the Cauchy-Schwarz inequality:

    |xᵀy| ≤ ‖x‖₂ ‖y‖₂.

Norms are continuous functions of the entries of their arguments. It follows
that a sequence of vectors x₀, x₁, . . . converges to a vector x if and only if
lim_{k→∞} ‖x_k − x‖ = 0 for any norm.

Matrix norms
A natural definition of a matrix norm is the so-called induced or operator
norm that, starting from a vector norm ‖·‖, defines the matrix norm as
the maximum amount that the matrix A can stretch a unit vector, or more
formally: ‖A‖ = max_{‖v‖=1} ‖Av‖. Thus, the induced norms associated with
the usual vector norms are

• ‖A‖₁ = max_j ∑_{i=1}^m |a_{ij}|.

• ‖A‖₂ = [max(eigenvalue of (AᵀA))]^{1/2}.

• ‖A‖∞ = max_i ∑_{j=1}^n |a_{ij}|.

In addition the so-called Frobenius norm (Euclidean length of A considered
as an nm-vector) is

• ‖A‖_F = (∑_{j=1}^n ∑_{i=1}^m |a_{ij}|²)^{1/2} = trace(AᵀA)^{1/2}.

For square, orthogonal matrices Q ∈ R^{n×n} we have ‖Q‖₂ = 1 and ‖Q‖_F = √n. Both the Frobenius and the matrix l2-norms are compatible with
the Euclidean vector norm. This means that ‖Ax‖ ≤ ‖A‖ ‖x‖ is true
when using the l2-norm for the vector and either the l2 or the Frobenius
norm for the matrix. Also, both are invariant with respect to orthogonal
transformations Q:

    ‖QA‖₂ = ‖A‖₂,  ‖QA‖_F = ‖A‖_F.

In terms of the singular values of A, the l2-norm can be expressed as ‖A‖₂ =
max_i σᵢ = σ₁, where σᵢ, i = 1, . . . , min(m, n) are the singular values of A
in descending order of size. In the special case of symmetric matrices the
l2-norm reduces to ‖A‖₂ = max_i |λᵢ|, with λᵢ an eigenvalue of A. This is
also called the spectral radius of the matrix A.

B.2 Condition number


The condition number of a general matrix A in the norm ‖·‖_p is

    κ_p(A) = ‖A‖_p ‖A†‖_p.

For a vector-induced norm, the condition number of A is the ratio of the


maximum to the minimum stretch produced by the linear transformation
represented by this matrix, and therefore it is greater than or equal to 1.
In the l2 -norm, κ2 (A) = σ1 /σr , where σr is the smallest nonzero singular
value of A and r is the rank of A.
In finite precision arithmetic, a large condition number can be an in-
dication that the “exact” matrix is close to singular, as some of the zero
singular values may be represented by very small numbers.
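
These formulas are easy to verify numerically; a minimal Python check on an arbitrary random matrix:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
s = np.linalg.svd(A, compute_uv=False)              # singular values, descending

print(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())       # max column sum
print(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())  # max row sum
print(np.linalg.norm(A, 2), s[0])                   # spectral norm = sigma_1
print(np.linalg.norm(A, 'fro'), np.sqrt((A**2).sum()))
print(np.linalg.cond(A), s[0] / s[-1])              # kappa_2 = sigma_1 / sigma_r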

B.3 Orthogonality

The notation used for the inner product of vectors is vᵀw = ∑_i vᵢwᵢ. Note
that ‖v‖₂² = vᵀv. If v, w ≠ 0 and vᵀw = 0, then these vectors are
orthogonal, and they are orthonormal if in addition they have unit length.
Orthogonal matrices. A square matrix Q is orthogonal if QᵀQ = I
or QQᵀ = I, i.e., the columns or rows are orthonormal vectors and thus
‖Q‖₂ = 1. It follows that orthogonal matrices represent isometric transformations that can only change the direction of a vector (rotation, reflection), but not its Euclidean norm, a reason for their practical importance:
‖Qv‖₂ = ‖v‖₂, ‖QA‖₂ = ‖AQ‖₂ = ‖A‖₂.
Permutation matrix. A permutation matrix is an identity matrix
with permuted rows or columns. It is orthogonal, and products of permu-
tation matrices are again permutations.
Orthogonal projection onto a subspace of an inner product
space. Given an orthonormal basis, {u1 , u2 , . . . , un } of a subspace S ⊆
X , where X is an inner product space, the orthogonal projection P : X → S
satisfies

    Px = ∑_{i=1}^n (xᵀuᵢ) uᵢ.

The operator P is linear and satisfies Px = x if x ∈ S (idempotent) and
‖Px‖₂ ≤ ‖x‖₂ ∀x ∈ X. Therefore, the associated square matrix P is an
orthogonal projection matrix if it is Hermitian and idempotent, i.e., if

Pᵀ = P and P² = P.

Note that an orthogonal projection divides the whole space into two or-
thogonal subspaces. If P projects a vector onto the subspace S, then I − P
projects it onto S ⊥ , the orthogonal complement of S with S + S ⊥ = X and
S ∩ S ⊥ = 0. An orthogonal projection matrix P is not necessarily an or-
thogonal matrix, but I −2P is orthogonal (see Householder transformations
in Section 4.3).
An important projector is defined by a matrix of the form PU = U U T ,
where U has p orthonormal columns u1 , u2 , . . . , up . This is a projection
onto the subspace spanned by the columns of U. In particular, the projection onto the subspace spanned by a single (not necessarily of norm 1)
vector u is defined by the rank-one matrix P_u = uuᵀ/(uᵀu).
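
A minimal Python sketch of these facts, building P_U = UUᵀ from an orthonormal basis obtained by a QR factorization (the matrix A below is an arbitrary example):

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 3))
U, _ = np.linalg.qr(A)              # orthonormal basis of span(A)
P = U @ U.T                         # orthogonal projector onto span(A)

print(np.allclose(P, P.T), np.allclose(P @ P, P))    # Hermitian and idempotent
print(np.allclose(P @ A[:, 0], A[:, 0]))             # Px = x for x in span(A)
print(np.allclose((np.eye(6) - P) @ A, 0))           # I - P annihilates span(A)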

Gram matrix. The Gram matrix or Grammian Gram(A) of an m × n


matrix A is AT A. Its elements are thus the n2 possible inner products
between pairs of columns of A.

B.4 Some additional matrix properties


The Sherman-Morrison-Woodbury formula gives a representation of
the inverse of a rank-one perturbation of a matrix in terms of its inverse:

    (A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹)/(1 + vᵀA⁻¹u),

provided that 1 + vᵀA⁻¹u ≠ 0. As usual, for calculations, A⁻¹u is shorthand for “solve the system” Ax = u.
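
The formula is easily checked numerically; in this sketch the diagonal shift added to A only keeps the example well conditioned:

import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n)) + n * np.eye(n)   # comfortably nonsingular
u, v = rng.normal(size=n), rng.normal(size=n)

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1.0 + v @ Ainv @ u)
print(np.allclose(lhs, rhs))                  # True: the two expressions agree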
The next theorem presents the interlacing property of the singu-
lar values of A with those of matrices obtained by removing or adding a
column or a row (see [20]).
Theorem 4. Let A be bordered by a column u ∈ Rᵐ, Â = (A, u) ∈ R^{m×n},
m ≥ n. Then the ordered singular values σᵢ of A separate the singular
values of Â as follows:

    σ̂₁ ≥ σ₁ ≥ σ̂₂ ≥ σ₂ ≥ . . . ≥ σ̂_{n−1} ≥ σ_{n−1} ≥ σ̂_n.

Similarly, if A is bordered by a row v ∈ Rⁿ,

    Â = [A; vᵀ] ∈ R^{m×n} (A stacked on top of vᵀ),  m ≥ n,

then

    σ̂₁ ≥ σ₁ ≥ σ̂₂ ≥ σ₂ ≥ . . . ≥ σ̂_{n−1} ≥ σ_{n−1} ≥ σ̂_n ≥ σ_n.
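
A quick numerical illustration of the column-bordered case of Theorem 4 (sizes chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(8, 4))                       # n - 1 = 4 columns
Ahat = np.hstack([A, rng.normal(size=(8, 1))])    # bordered by one extra column

s = np.linalg.svd(A, compute_uv=False)            # sigma_1 >= ... >= sigma_4
shat = np.linalg.svd(Ahat, compute_uv=False)      # sigmahat_1 >= ... >= sigmahat_5
# interlacing: sigmahat_i >= sigma_i >= sigmahat_{i+1}
print(all(shat[i] >= s[i] >= shat[i + 1] for i in range(len(s))))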
Appendix C

Advanced Calculus
Background

C.1 Convergence rates


Definition 5. Let x∗ , xk ∈ R for k = 0, 1, . . . The sequence {xk } is said to
converge to x∗ if

    lim_{k→∞} |x_k − x∗| = 0.

The convergence is

linear if ∃ c ∈ [0, 1) and an integer K > 0 such that for k ≥ K,
    |x_{k+1} − x∗| ≤ c |x_k − x∗|;

superlinear if ∃ c_k → 0 and an integer K > 0 such that for k ≥ K,
    |x_{k+1} − x∗| ≤ c_k |x_k − x∗|;

quadratic if ∃ c ∈ [0, 1) and an integer K > 0 such that for k ≥ K,
    |x_{k+1} − x∗| ≤ c |x_k − x∗|².
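
The difference between the rates is easy to see numerically; the sketch below contrasts a simple contraction (linear rate) with Newton's method (quadratic rate) for solving x² = 2, both iterations being illustrative choices:

import numpy as np

xstar = np.sqrt(2.0)
x_lin = x_newt = 2.0
print("k   |error| linear    |error| quadratic")
for k in range(1, 7):
    x_lin = x_lin - 0.25 * (x_lin**2 - 2.0)    # fixed-point contraction: linear
    x_newt = 0.5 * (x_newt + 2.0 / x_newt)     # Newton's method: quadratic
    print(f"{k}   {abs(x_lin - xstar):.1e}           {abs(x_newt - xstar):.1e}")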

Definition 6. A locally convergent iterative algorithm converges


to the correct answer if the iteration starts close enough. A globally
convergent iterative algorithm converges when starting from almost
any point. For minimization, this is not to be confused with finding the
global minimum of a functional on a compact domain (see below).
Global and local minimum: x∗ is a global minimizer of a function
f : Rⁿ → R on a compact domain D if f(x∗) ≤ f(x) ∀x ∈ D. x∗ is a
local minimizer inside a certain region, usually defined as an open “ball” of
size δ around x∗, if f(x∗) ≤ f(x) for ‖x − x∗‖₂ < δ.


C.2 Multivariable calculus


The gradient and Hessian of a scalar function of several variables f (x) are
a vector and a matrix, respectively, defined by
    ∇f(x) ≡ (∂f/∂x₁, . . . , ∂f/∂xₙ)ᵀ,    ∇²f(x) ≡ [∂²f/(∂xᵢ∂xⱼ)].

For a vector function of several variables

r(x) = (r1 (x), r2 (x), . . . , rm (x))T ,


with each rk (x) : Rn → R, we denote by J(x) the Jacobian of r(x) and by
Gk the Hessian of a component function rk :
    J(x) = [∂rᵢ/∂xⱼ],    G_k(x) = [∂²r_k/(∂xᵢ∂xⱼ)].
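
A common practical use of these definitions is to check an analytic Jacobian against finite differences; the small vector function r below is a hypothetical example:

import numpy as np

def r(x):                          # r : R^2 -> R^3
    return np.array([x[0]**2, x[0] * x[1], np.sin(x[1])])

def J(x):                          # analytic Jacobian, [J]_ij = dr_i/dx_j
    return np.array([[2 * x[0], 0.0],
                     [x[1],     x[0]],
                     [0.0,      np.cos(x[1])]])

x = np.array([1.3, 0.7])
h = 1e-7
Jfd = np.column_stack([(r(x + h * e) - r(x)) / h for e in np.eye(2)])
print(np.max(np.abs(Jfd - J(x))))  # ~1e-7, consistent with O(h) forward differences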

Definition 7. Descent direction p of a function f : Rn −→ R.


p is a descent direction at x_c for a function f(x) if, for sufficiently
small and positive α, f(x_c + αp) < f(x_c). Alternatively, p is a descent
direction at x_c if the directional derivative (projection of the gradient on
a given direction) of the function f(x) at x_c in the direction p is negative:
∇f(x_c)ᵀp < 0.
Theorem 8. Taylor’s theorem for a scalar function.
If f : Rn −→ R is continuously differentiable, then, for some t ∈ [0, 1] ,

f (x + p) = f (x) + ∇f (x + tp)T p.

If f (x) is twice continuously differentiable then, for some t ∈ [0, 1],

f(x + p) = f(x) + ∇f(x)ᵀp + ½ pᵀ∇²f(x + tp)p.

A necessary condition for x∗ to be a stationary point of f (x) is that


∇f (x∗ ) = 0. The sufficient conditions for x∗ to be a local minimizer are
∇f(x∗) = 0 and ∇²f(x∗) is positive definite.
The derivative DA(x) of an m × n nonlinear matrix function A(x),
where x ∈ Rᵏ, is a three-dimensional tensor formed with k matrices (slabs)
of dimension m × n, each one containing the partial derivatives of the elements of A with respect to one of the variables of the vector x. Thus, the
second derivative of the vector function r(x) is the three-dimensional tensor
G = [G₁, . . . , G_m].
The derivative of the orthogonal projector PA(x) onto the column space
of a differentiable m × n matrix function A(x) of local constant rank can
be obtained as follows.

Lemma 9. (Lemma 4.1 in [101]) Let A† (x) be the pseudoinverse of A(x).


Then P_{A(x)} = AA† and

    D P_{A(x)} = P⊥_{A(x)} DA A† + (P⊥_{A(x)} DA A†)ᵀ,

where P⊥_{A(x)} = I − P_{A(x)}.

Definition 10. A function f is Lipschitz continuous with constant γ in a


set X ⊆ R if
∀x, y ∈ X, |f (x) − f (y)| ≤ γ |x − y| . (C.2.1)
Lipschitz continuity is an intermediary concept between continuity and dif-
ferentiability. The operator V : Rⁿ → Rⁿ is Lipschitz continuous with
constant γ in a set X ⊆ Rⁿ if

    ∀x, y ∈ X, ‖V(x) − V(y)‖₂ ≤ γ ‖x − y‖₂.

V is contracting if it is Lipschitz continuous with constant less than unity:
γ < 1. V is uniformly monotone if there exists a positive m such that

    m ‖x − y‖₂² ≤ (V(x) − V(y))ᵀ(x − y).

V is vector Lipschitz if

    |V(x) − V(y)| ≤ A |x − y|,

where A is a non-negative n×n matrix and the inequality is meant elementwise. V is vector contracting if it is vector Lipschitz continuous with ‖A‖ < 1.

Lagrange multipliers
Lagrange multipliers are used to find the local extrema of a function f (x)
of n variables subject to k equality constraints gi (x) = ci , by reducing the
problem to an (n − k)-variable problem without constraints. Formally, new
scalar variables λ = (λ₁, λ₂, . . . , λ_k)ᵀ, one for each constraint, are introduced
and a new function is defined as

    F(x, λ) = f(x) + ∑_{i=1}^k λᵢ(gᵢ(x) − cᵢ).

The local extrema of this extended function F (the Lagrangian) occur at
the points where its gradient is zero: ∇F(x, λ) = 0, λ ≠ 0, or equivalently,

    ∇_x F(x, λ) = 0,  ∇_λ F(x, λ) = 0.

This form encodes compactly the constraints, because

∇λi F (x, λ) = 0 ⇔ gi (x) = ci .

An alternative to this method is to use the equality constraints to eliminate


k of the original variables. If the problem is large and sparse, this elimina-
tion may destroy the sparsity and thus the Lagrange multipliers approach
is preferable.
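
As a small worked example of the Lagrangian approach (objective and constraint are hypothetical), minimizing ½‖x − a‖₂² subject to gᵀx = c leads to the stationarity conditions (x − a) + λg = 0, gᵀx = c, i.e., a linear system:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
g = np.array([1.0, 1.0, 1.0])
c = 3.0

# grad_x F = (x - a) + lam * g = 0 and grad_lam F = g^T x - c = 0
K = np.block([[np.eye(3),  g[:, None]],
              [g[None, :], np.zeros((1, 1))]])
rhs = np.concatenate([a, [c]])
sol = np.linalg.solve(K, rhs)
x, lam = sol[:3], sol[3]
print("x =", x, " lambda =", lam, " g^T x =", g @ x)   # x = [0, 1, 2], lambda = 1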
Appendix D

Statistics

D.1 Definitions
We list some basic definitions and techniques taken from statistics that are
needed to understand some sections of this book.

• A sample space is the set of possible outcomes of an experiment


or a random trial. For some kind of experiments (trials), there may
be several plausible sample spaces available. The complete set of
outcomes can be constructed as a Cartesian product of the individual
sample spaces. An event is a subset of a sample space.

• We denote by Pr{event} the probability of an event. We often


consider finite sample spaces, such that each outcome has the same
probability (equiprobable).

• X is a random variable defined on a sample space if it assigns


a unique numerical value to every outcome, i.e., it is a real-valued
function defined on a sample space.

• X is a continuous random variable if it can assume every value in


an interval, bounded or unbounded (continuous distribution). It can
be characterized by a probability density function (pdf ) p(x)
defined by
    Pr{a ≤ X ≤ b} = ∫_a^b p(x) dx.

• X is a discrete random variable if it assumes at most a countable


set of different values (discrete distribution). It can be characterized
by its probability function (pf ), which specifies the probability


that the random variable takes each of the possible values from say
{x1 , x2 , . . . , xi , . . .}:
p(xi ) = Pr{X = xi }.

• All distributions, discrete or continuous, can be characterized through


their (cumulative) distribution function, the total probability up
to, and including, a point x. For a discrete random variable,

P (xi ) = Pr{X ≤ xi }.

• The expectation or expected value E [X] for a random variable X


or equivalently, the mean μ of its distribution, contains a summary of
the probabilistic information about X. For a continuous variable X,
the expectation is defined by
    E[X] = ∫_{−∞}^{+∞} x p(x) dx.

For a discrete random variable with probability function p(x) the


expectation is defined as

E[X] = xi p(xi ).
∀i

For a discrete random variable X with values x1 , x2 , . . . , xN and with


all p(xᵢ) equal (all xᵢ equiprobable, p(xᵢ) = 1/N), the expected value
coincides with the arithmetic mean

    x̄ = (1/N) ∑_{i=1}^N xᵢ.

• The expectation is linear, i.e., for a, b, c constants and X, Y two


random variables,

E[X + c] = E[X] + c,
E[aX + bY ] = E[aX] + E[bY ].

• A convenient measure of the dispersion (or spread about the average)


is the variance of the random variable, var [X] or ς 2 :

var [X] = ς 2 = E[(X − E(X))2 ].

In the equiprobable case this reduces to

    var[X] = ς² = (1/N) ∑_{i=1}^N (xᵢ − x̄)².

• For a random sample of observations x1 , x2 , . . . , xN , the following


formula is an estimate of the variance ς 2 :

    s² = (1/(N − 1)) ∑_{i=1}^N (xᵢ − x̄)².

In the case of a sample from a normal distribution this is a particularly


good estimate.
• The square root of the variance is the standard deviation, ς. It is
known by physicists as RMS or root mean square in the equiprob-
able case:

    ς = ( (1/N) ∑_{i=1}^N (xᵢ − x̄)² )^{1/2}.

• Given two random variables X and Y with expected values E [X] and
E [Y ] , respectively, the covariance is a measure of how the values of
X and Y are related:

Cov{X, Y } = E(XY ) − E(X)E(Y ).

For random vectors X ∈ Rm and Y ∈ Rn the covariance is the m × n


matrix
Cov{X, Y} = E(XYᵀ) − E(X)E(Y)ᵀ.
The (i, j)th element of this matrix is the covariance between the ith
component of X and the jth component of Y .
• In order to estimate the degree of interrelation between variables in a
manner not influenced by measurement units, the (Pearson) corre-
lation coefficient is used:
    c_XY = Cov{X, Y} / (var[X] var[Y])^{1/2}.
Correlation is a measure of the strength of the linear relationship
between the random variables; nonlinear ones are not measured sat-
isfactorily.
• If Cov{X, Y } = 0, the correlation coefficient is zero and the variables
X, Y are uncorrelated .
• The random variables X and Y , with distribution functions PX (x),
PY (y) and densities pX (x), pY (y), respectively, are statistically in-
dependent if and only if the combined random variable (X, Y ) has
a joint cumulative distribution function PX,Y (x, y) = PX (x)PY (y) or

equivalently a joint density pX,Y (x, y) = pX (x)pY (y). The expecta-


tion and variance operators have the properties E [XY ] = E [X] E [Y ] ,
var [X + Y ] = var [X] + var [Y ]. It follows that independent random
variables have zero covariance.

• Independent variables are uncorrelated, but the opposite is not true


unless the variables belong to a normal distribution.

• A random vector w is a white noise vector if E(w) = 0 and E(wwT ) =


ς 2 I. That is, it is a zero mean random vector where all the elements
have identical variance. Its autocorrelation matrix is a multiple of the
identity matrix; therefore the vector elements are uncorrelated. Note
that Gaussian noise ≠ white noise.

• The coefficient of variation is a normalized measure of the disper-


sion:

    c_v = ς/μ.

In signal processing the reciprocal ratio μ/ς is referred to as the signal
to noise ratio.

• The coefficient of determination R2 is used in the context of


linear regression analysis (statistical modeling), as a measure of how
well a linear model fits the data. Given a model M that predicts values
Mi , i = 1, . . . , N for the observations x1 , x2 , . . . , xN and the residual
vector r = (M1 − x1, . . . , MN − xN )T , the most general definition for
the coefficient of determination is
    R² = 1 − ‖r‖₂² / ∑_{i=1}^N (xᵢ − x̄)².

In general, it is an approximation of the unexplained variance, since


the second term compares the variance in the model’s errors with the
total variance of the data x1 , . . . , xN .

• The normal (or Gaussian) probability function N (μ, ς 2 ) has mean


μ and variance ς 2 . Its probability density function takes the form
    p(x) = (1/(√(2π) ς)) exp( −(1/2) ((x − μ)/ς)² ).

• If Xᵢ, i = 1, . . . , n are random variables with normal distributions
N(0, 1), then ∑_{i=1}^n Xᵢ² has a chi-square distribution with n degrees
of freedom. Given two independent random variables Y, W with chi-square distributions with m and n degrees of freedom, respectively,
the random variable X,

    X = (Y/m)/(W/n),

has an F-distribution with m and n degrees of freedom, F(m, n).
• The Student’s tn random variable with n degrees of freedom is de-
fined as
    t_n = Z / √(χ²_n/n),

where Z is a standard normal random variable, χ²_n is a chi-square
random variable with n degrees of freedom and Z and χ²_n are independent.
• The F-distribution becomes relevant when testing hypotheses about
the variances of normal distributions. For example, assume that we
have two independent random samples (of sizes n1 and n2 respec-
tively) from two normal distributions; the ratio of the unbiased esti-
mators s1 and s2 of the two variances,
    (s₁/(n₁ − 1)) / (s₂/(n₂ − 1)),

is distributed according to an F-distribution F(n₁, n₂) if the variances
are equal: ς₁² = ς₂².
• A time series is an ordered sequence of observations. The ordering
is usually in time, often in terms of equally spaced time intervals, but
it can also be ordered in, for example, space. The time series elements
are frequently considered random variables with the same mean and
variance.
• A periodogram is a data analysis technique for examining the fre-
quency domain of an equispaced time series and search for hidden
periodicities. Given a time series vector of N observations x(j), the
discrete Fourier transform (DFT) is given by a complex vector of
length N :


    X(k) = ∑_{j=1}^N x(j) ω_N^{(j−1)(k−1)},    1 ≤ k ≤ N,

with ω_N = exp((−2πi)/N).
• The magnitude squared of the discrete Fourier transform components
|X(k)|² is called the power. The periodogram is a plot of the power
components versus the frequencies {1/N, 2/N, . . . , k/N, . . .}.
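
A minimal periodogram computation via the FFT, applied to a synthetic series with one hidden periodicity at 0.125 cycles per sample; the 0-based indexing follows numpy rather than the 1-based formula above:

import numpy as np

N = 256
t = np.arange(N)
rng = np.random.default_rng(6)
x = np.sin(2 * np.pi * 0.125 * t) + 0.5 * rng.normal(size=N)   # signal + noise

X = np.fft.fft(x)                        # discrete Fourier transform
power = np.abs(X)**2                     # periodogram ordinates
freqs = np.arange(N) / N                 # frequencies k/N
k = np.argmax(power[1:N // 2]) + 1       # skip the zero-frequency term
print("dominant frequency:", freqs[k])   # close to 0.125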

D.2 Hypothesis testing


In hypothesis testing one uses statistics to determine whether a given
hypothesis is true. The process consists of four steps:
• Formulate the null hypothesis H0 and the alternative hypothesis Ha ,
which are mutually exclusive.
• Identify a test statistic t that can be used to assess the truth of the
null hypothesis from the sample data. It can involve means, propor-
tions, standard deviations, etc.
• Determine the corresponding distribution function, assuming that the
null hypothesis is true and, after choosing an acceptable level of
significance α (common choices are 0.01, 0.05, 0.1), determine the
critical region for t (see details below).
• Compute the t for the observations. If the computed value of the test
statistics falls into the critical region, reject H0 .
In other words, the test of a hypothesis on the significance level α is per-
formed by means of a statistic t and critical values tα and t̄α so that
1 − α = Pr {tα ≤ t ≤ t̄α } , 0≤α≤1
if H0 holds. The hypothesis H0 is rejected if t falls outside the range [tα , t̄α ].
This guarantees that H0 will only be erroneously rejected in 100 × α% of
the cases.
For example, assume that we have approximated m observations
y1 , y2, . . . , ym with a linear model Mfull with n terms and we want to know
if k of these terms are redundant, i.e., if one could get a good enough model
when setting k coefficient of the original model to 0. We have the residual
norms ρfull and ρred for both possible models, with ρfull < ρred .
We formulate the hypothesis H0 : the reduced model is good enough,
i.e., the reduction in residual when using all n terms is negligible.
Under the assumption that the errors in the data y1 , y2 , . . . , ym are
normally distributed with constant variance and zero mean, we can choose
a statistic based on a proportion of variance estimates:
    f_obs = ((ρ_red − ρ_full)/ρ_full) · ((m − n)/k).
This statistic follows an F -distribution with k and m−n degrees of freedom:
F (k, m − n). The common practice is to denote by fα the value of the
statistic with cumulative probability Fα (k, m − n) = 1 − α.
If fobs > fα , the computed statistic falls into the critical region for the
F -distribution and the H0 hypothesis should be rejected, with a possible
error of α, i.e., we do need all the terms of the full model.
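
A sketch of this test in Python, with hypothetical residual norms and sample sizes, using scipy.stats to obtain the critical value f_α:

from scipy.stats import f

m, n, k = 100, 6, 2            # observations, full-model terms, terms set to zero
rho_full, rho_red = 1.8, 2.9   # residual norms of the full and reduced fits
alpha = 0.05                   # significance level

f_obs = (rho_red - rho_full) / rho_full * (m - n) / k
f_alpha = f.ppf(1 - alpha, k, m - n)       # critical value F_alpha(k, m - n)
print(f"f_obs = {f_obs:.2f},  f_alpha = {f_alpha:.2f}")
print("reject H0: the full model is needed" if f_obs > f_alpha
      else "keep the reduced model")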
References

[1] J. J. Alonso, I. M. Kroo and A. Jameson, “Advanced algorithms for


design and optimization of quiet supersonic platforms.” AIAA Paper
02-0114, 40th AIAA Aerospace Sciences Meeting and Exhibit, Reno,
NV, 2002.

[2] J. J. Alonso, P. Le Gresley, E. van der Weide and J. Martins, “pyMDO:


a framework for high-fidelity multidisciplinary optimization.” 10th
AIAA/ISSMO Multidisciplinary Analysis and Optimization Confer-
ence, Albany, NY, 2004.

[3] A. Anda and H. Park, “Fast plane rotations with dynamic scaling.”
SIAM J. Matrix Anal. Appl. 15:162–174, 1994.

[4] E. Angelosante, D. Giannakis and G. B. Grossi, “Compressed sensing


of time-varying signals.” Digital Signal Processing, 16th International
Conference, Santorini, Greece, 2009.

[5] H. Ashley and M. Landahl, Aerodynamics of Wings and Bodies.


Dover, New York, 1985.

[6] I. Barrodale and F. D. K. Roberts, “An improved algorithm for dis-


crete ℓ1 linear approximation.” SIAM J. Numer. Anal. 10:839–848,
1973.

[7] R. Bartels, J. Beaty and B. Barski, An Introduction to Splines for Use


in Computer Graphics and Parametric Modeling. Morgan Kaufman
Publishers, Los Altos, CA, 1987.

[8] A. E. Beaton and J. W. Tukey, “The fitting of power series, meaning


polynomials, illustrated on band-spectroscopic data.” Technometrics
16:147–185, 1974.

[9] D. M. Bates and D. G. Watts, Nonlinear Regression Analysis and Its


Applications. J. Wiley, New York, 1988.


[10] G. M. Baudet, “Asynchronous iterative methods for multiprocessors.”


J. ACM 15:226–244, 1978.

[11] A. Beck, A. Ben-Tal and M. Teboulle, “Finding a global optimal solu-


tion for a quadratically constrained fractional quadratic problem with
applications to the regularized total least squares.” SIAM J. Matrix
Anal. Appl. 28:425–445, 2006.

[12] M. Berry, “Large scale sparse singular value computations.” Interna-


tional J. Supercomp. Appl. 6:13–49, 1992.

[13] P. R. Bevington, Data Reduction and Error Analysis for the Physical
Sciences. McGraw-Hill, New York, 1969.

[14] C. H. Bischof and G. Qintana-Orti, “Computing rank-revealing QR


factorization of dense matrices.” ACM TOMS 24:226–253, 1998.

[15] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford Uni-


versity Press, New York, 1995.

[16] Å. Björck, “Solving linear least squares problems by Gram-Schmidt


orthogonalization.” BIT 7:1–21, 1967.

[17] Å. Björck, “Iterative refinement of linear least squares solutions.” BIT


7:257–278, 1967.

[18] Å. Björck, “Iterative refinement of linear least squares solutions II.”


BIT 8:8–30, 1968.

[19] Å. Björck, “Stability analysis of the method of seminormal equations


for linear least squares problems.” Lin. Alg. Appl. 88/89:31–48, 1987.

[20] Å. Björck, Numerical Methods for Least Squares Problems. SIAM,


Philadelphia, 1996.

[21] Å. Björck, “The calculation of linear least squares problems.” Acta


Numerica 13:1–53, 2004.

[22] Å. Björck and I. S. Duff, “A direct method for the solution of sparse
linear least squares problems.” Lin. Alg. Appl. 34:43–67, 1980.

[23] Å. Björck and G. H. Golub, “Iterative refinement of linear least


squares solution by Householder transformation.” BIT 7:322–337,
1967.

[24] Å. Björck, E. Grimme and P. Van Dooren, “An implicit shift bidiag-
onalization algorithm for ill-posed systems.” BIT 34:510–534, 1994.

[25] Å. Björck, P. Heggernes and P. Matstoms, “Methods for large scale


total least squares problems.” SIAM J. Matrix Anal. Appl. 22:413–
429, 2000.

[26] Å. Björck and C. C. Paige, “Loss and recapture of orthogonality in the


modified Gram-Schmidt algorithm.” SIAM J. Matrix Anal. 13:176–
190, 1992.

[27] Å. Björck and V. Pereyra, “Solution of Vandermonde systems of equa-


tions.” Math. Comp. 24:893–904, 1970.

[28] C. Böckman, “A modification of the trust-region Gauss-Newton


method for separable nonlinear least squares problems.” J. Math. Sys-
tems, Estimation and Control 5:1–16, 1995.

[29] C. de Boor, A Practical Guide to Splines. Applied Mathematical Sci-


ences 27. Springer, New York, 1994.

[30] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge


University Press, Cambridge, 2003.

[31] R. Brent, Algorithms for Minimization Without Derivatives. Prentice


Hall, Englewood Cliffs, NJ, 1973. Reprinted by Dover, New York,
2002.

[32] D. Calvetti, P. C. Hansen and L. Reichel, “L-curve curvature bounds


via Lanczos bidiagonalization.” Electr. Transact. on Num. Anal.
14:134–149, 2002.

[33] D. Calvetti, G. H. Golub and L. Reichel, “Estimation of the L-curve


via Lanczos bidiagonalization algorithm for ill-posed systems.” BIT
39:603–619, 1999.

[34] E. J. Candes, “Compressive sampling.” Proceedings of the Interna-


tional Congress of Mathematicians, Madrid, Spain, 2006.

[35] E. J. Candes and M. B. Wakin, “People hearing without listening: An


introduction to compressive sampling.” Signal Processing Magazine,
IEEE 25:21–30, 2008.

[36] L. Carcione, J. Mould, V. Pereyra, D. Powell and G. Wojcik, “Nonlin-


ear inversion of piezoelectrical transducer impedance data.” J. Comp.
Accoustics 9:899–910, 2001.

[37] L. Carcione, V. Pereyra and D. Woods, GO: Global Optimization.


Weidlinger Associates Report, 2005.

[38] R. I. Carmichael and L. I. Erickson, “A higher order panel method for


predicting subsonic or supersonic linear potential flow about arbitrary
configurations.” American Institute of Aeronautics and Astronautics
Paper 81–1255, 1981.
[39] M. Chan, “Supersonic aircraft optimization for minimizing drag and
sonic boom.” Ph.D. Thesis, Stanford University, Stanford, CA, 2003.
[40] T. F. Chan, “Rank revealing QR-factorizations.” Lin. Alg. Appl.
88/89:67–82, 1987.
[41] T. F. Chan and D. E. Foulser, “Effectively well-conditioned linear
systems.” SIAM J. Sci. Stat. Comput. 9:963–969, 1988.
[42] T. F. Chan and P. C. Hansen, “Computing truncated SVD least
squares solutions by rank revealing QR-factorizations.” SIAM J. Sci.
Statist. Comput. 11:519–530, 1991.
[43] D. Chazan and W. L. Miranker, “Chaotic relaxation.” Lin. Alg. Appl.
2:199–222, 1969.
[44] W. Cheney and D. Kincaid, Numerical Mathematics and Computing.
Brooks/Cole, Belmont, CA, 2007.
[45] S. Choi, J. J. Alonso and H. S. Chung, “Design of low-boom supersonic
business jet using evolutionary algorithms and an adaptive unstruc-
tured mesh method.” 45th AIAA/ASME/ASCE/AHS/ASC Struc-
tures, Structural Dynamics and Materials Conference, Palm Springs,
CA, 2004.
[46] H. S. Chung, S. Choi and J. J. Alonso, “Supersonic business jet
design using knowledge-based genetic algorithms with adaptive, un-
structured grid methodology.” AIAA 2003-3791, 21st Applied Aero-
dynamic Conference, Orlando, Fl., June 2003.
[47] H. S. Chung, “Multidisciplinary design optimization of supersonic
business jets using approximation model-based genetic algorithms.”
Ph.D. Thesis, Stanford University, Stanford, CA, 2004.
[48] J. F. Claerbout and F. Muir, “Robust modeling with erratic data.”
Geophysics 38:826–844, 1973.
[49] A. K. Cline, “An elimination method for the solution of linear least
squares problems.” SIAM J. Numer. Anal. 10:283–289, 1973.
[50] D. Coleman, P. Holland, N. Kaden, V. Klema and S. C. Peters, “A
system of subroutines for iteratively reweighted least squares compu-
tations.” ACM TOMS 6:328–336, 1980.

[51] T. F. Coleman and Y. Li, “A globally and quadratically convergent


affine scaling method for linear ℓ1 problems.” Mathematical Program-
ming, Series A 56:189–222, 1992.
[52] T. P. Collignon, “Efficient iterative solution of large linear systems
on heterogeneous computing systems.” Ph. D. Thesis, TU Delft, The
Netherlands, 2011.
[53] T. P. Collignon and M. B. van Gijzen. “Parallel scientific computing
on loosely coupled networks of computers.” In B. Koren and C. Vuik,
editors, Advanced Computational Methods in Science and Engineer-
ing. Springer Series Lecture Notes in Computational Science and En-
gineering, 71:79–106. Springer Verlag, Berlin/Heidelberg, Germany,
2010.
[54] G. Cybenko, “Approximation by superpositions of a sigmoidal func-
tion.” Math. Control Signals Systems 2:303–314, 1989.
[55] J. Dahl, P. C. Hansen, S. H. Jensen and T. L. Jensen, “Algorithms
and software for total variation image reconstruction via first-order
methods.” Numer. Algo. 53:67–92, 2010.
[56] J. Dahl and L. Vanderberghe, CVXOPT: A Python Package for Con-
vex Optimization. http://abel.ee.ucla.edu/cvxopt, 2012.
[57] G. Dahlquist and Å. Björck, Numerical Methods in Scientific Com-
puting. SIAM, Philadelphia, 2008.
[58] T. A. Davis and Y. Hu, “The University of Florida sparse matrix
collection.” ACM TOMS 38:1–25, 2011.
[59] K. Deb, A. Pratap, S. Agrawal and T. Meyarivan, A Fast and Elitist
Multiobjective Genetic Algorithm: NSGA-II. Technical Report No.
2000001. Indian Institute of Technology, Kanpur, India, 2000.
[60] P. Deift, J. Demmel, L.-C. Li and C. Tomei, “The bidiagonal singular
value decomposition and Hamiltonian mechanics.” SIAM J. Numer.
Anal. 28:1463–1516, 1991.
[61] R. S. Dembo, S. C. Eisenstat and T. Steihaug, “Inexact Newton meth-
ods.” SIAM J. Numer. Anal. 19:400–408, 1982.
[62] C. J. Demeure and L. L. Scharf, “Fast least squares solution of Van-
dermonde systems of equations.” Acoustics, Speech and Signal Pro-
cessing 4:2198–2210, 1989.
[63] J. W. Demmel, Applied Numerical Linear Algebra. SIAM, Philadel-
phia, 1997.

[64] J. Demmel, Y. Hida, W. Riedy and X. S. Li, “Extra-precise iterative


refinement for overdetermined least squares problems.” ACM TOMS
35:1–32, 2009.

[65] J. Dennis, D. M. Gay and R. Welsch, “Algorithm 573 NL2SOL – An


adaptive nonlinear least-squares algorithm.” ACM TOMS 7:369–383,
1981.

[66] J. Dennis and R. Schnabel, Numerical Methods for Unconstrained


Optimization and Nonlinear Equations. SIAM, Philadelphia, 1996.

[67] J. E. Dennis and H. F. Walker, “Convergence theorems for least-


change secant update methods.” SIAM J. Num. Anal. 18:949–987,
1981.

[68] A. P. Dempster, N. M. Laird and D. B. Rubin. “Maximum likelihood


from incomplete data via the EM algorithm (with discussion).” Jour-
nal of the Royal Statistical Society B 39:1–38, 1977.

[69] P. Dierckx, Curve and Surface Fitting with Splines. Clarendon Press,
Oxford, 1993.

[70] J. Dongarra, J. R. Bunch, C. B. Moler and G. W. Stewart, LINPACK


User’s Guide. SIAM, Philadelphia, 1979.

[71] N. Draper and H. Smith, Applied Regression Analysis. J. Wiley, New


York, 1981.

[72] C. Eckart and G. Young, “The approximation of one matrix by an-


other of lower rank.” Psychometrika 1:211–218, 1936.

[73] L. Eldén, “A note on the computation of the generalized cross-


validation function for ill-conditioned least squares problems.” BIT
24:467–472, 1984.

[74] L. Eldén, “Perturbation theory for the linear least squares problem
with linear equality constraints.” BIT 17:338–350, 1980.

[75] L. Eldén, Matrix Methods in Data Mining and Pattern Recognition.


SIAM, Philadelphia, 2007.

[76] J. Eriksson, P.-Å. Wedin, M. E. Gulliksson and I. Söderkvist, “Reg-


ularization methods for uniformly rank-deficient nonlinear least-
squares problems.” J. Optimiz. Theory and Appl. 127:1–26, 2005.

[77] J. Eriksson and P.-Å. Wedin, “Truncated Gauss-Newton algorithms


for ill-conditioned nonlinear least squares problems.” Optimiz. Meth.
and Software 19:721–737, 2004.

[78] R. D. Fierro, P. C. Hansen and P. S. K. Hansen, “UTV tools: MAT-


LAB templates for rank-revealing UTV decompositions.” Numer.
Algo. 20:165–194, 1999. The software is available from:
http://www.netlib.org/numeralgo.

[79] P. M. Fitzpatrick, Advanced Calculus. Second edition Brooks/Cole,


Belmont, CA, 2006.

[80] D. Fong and M. A. Saunders, “LSMR: An iterative algorithm for


sparse least-squares problems.” SIAM J. Sci. Comput. 33:2950–2971,
2011.

[81] G. E. Forsythe, “Generation and use of orthogonal polynomials for


data-fitting with a digital computer.” J. SIAM 5:74–88, 1957.

[82] J. Fox and S. Weisberg, Robust Regression in R, An Appendix


to An R Companion to Applied Regression, Second edition.
http://socserv.socsci.mcmaster.ca/jfox/Books
/Companion/appendix.html.

[83] C. F. Gauss, Theory of the Combination of Observations Least Sub-


ject to Errors. Parts 1 and 2, Supplement, G. W. Stewart. SIAM,
Philadelphia, 1995.

[84] D. M. Gay, Usage Summary for Selected Optimization Routines.


AT&T Bell Labs. Comp. Sc. Tech. Report, 1990.

[85] D. M. Gay and L. Kaufman, Tradeoffs in Algorithms for Separable


Nonlinear Least Squares. AT&T Bell Labs. Num. Anal. Manuscript,
90–11, 1990.

[86] D. M. Gay, NSF and NSG; PORT Library. AT&T Bell Labs.
http://www.netlib.org/port, 1997.

[87] P. E. Gill, G. H. Golub, W. Murray and M. A. Saunders, “Methods


for modifying matrix factorisations.” Math. Comp. 28:505–535, 1974.

[88] P. E. Gill, S. J. Hammarling, W. Murray, M. A. Saunders and M.


H. Wright, Users Guide for LSSOL (Version 1.0): A Fortran Pack-
age for Constrained Linear Least-Squares and Convex Quadratic Pro-
gramming. Report 86-1 Department of Operation Research, Stanford
University, CA, 1986.

[89] P. E. Gill, W. Murray and M. A. Saunders, “SNOPT: An SQP al-


gorithm for large-scale constrained optimization.” SIAM Rev. 47:99–
131, 2005.

[90] P. E. Gill, W. Murray, M. A. Saunders and M. H. Wright, “Maintain-


ing LU factors of a general sparse matrix.” Linear Algebra and its
Applications 88–89:239–270, 1987.

[91] P. E. Gill, W. Murray and M. H. Wright, Practical Optimization.


Academic Press, 1981.

[92] G. H. Golub, “Numerical methods for solving linear least squares


problems.” Numer. Math. 7:206–16, 1965.

[93] G. H. Golub, P. C. Hansen and D. P. O’Leary, “Tikhonov regulariza-


tion and total least squares.” SIAM J. Matrix Anal. Appl. 21:185–194,
1999.

[94] G. H. Golub, M. Heath and G. Wahba, “Generalized cross-validation


as a method for choosing a good ridge parameter.” Technometrics
21:215–223, 1979.

[95] G. H. Golub, A. Hoffman and G. W. Stewart, “A generalization of


the Eckhard-Young-Mirsky matrix approximation theorem.” Linear
Algebra Appl. 88/89:317–327, 1987.

[96] G. H. Golub and W. Kahan, “Calculating the singular values and


pseudo-inverse of a matrix.” SIAM J, Numer. Anal. Ser. B 2:205–
224, 1965.

[97] G. H. Golub, V. Klema and G. W. Stewart, Rank Degeneracy and


Least Squares. Report TR-456, Computer Science Department, Uni-
versity of Maryland, College Park, 1977.

[98] G. H. Golub and R. Le Veque, “Extensions and uses of the variable


projection algorithm for solving nonlinear least squares problems.”
Proceedings of the Army Numerical Analysis and Computers Confer-
ence, White Sands Missile Range, New Mexico, pp. 1–12, 1979.

[99] G. H. Golub and U. von Matt, “Tikhonov regularization for large


problems.” Workshop on Scientific Computing, Ed. G. H. Golub, S.
H. Lui, F. Luk and R. Plemmons, Springer, New York, 1997.

[100] G. H. Golub and V. Pereyra, The Differentiation of Pseudo-Inverses


and Nonlinear Least Squares Problems Whose Variables Separate.
STAN-CS-72-261, Stanford University, Computer Sciences Depart-
ment, 1972. (It contains the original VARPRO computer code.)

[101] G. H. Golub and V. Pereyra, “The differentiation of pseudo-inverses


and nonlinear least squares problems whose variables separate.” SIAM
J. Numer. Anal. 10:413–432, 1973.

[102] G. H. Golub and V. Pereyra, “Differentiation of pseudoinverses, sepa-


rable nonlinear least squares problems, and other tales.” Proceedings
MRC Seminar on Generalized Inverses and Their Applications, Ed.
Z. Nashed, pp. 302–324, Academic Press, NY, 1976.

[103] G. H. Golub and V. Pereyra, “Separable nonlinear least squares: The


variable projection method and its applications.” Inverse Problems
19:R1–R26, 2003.

[104] G. H. Golub and J. M. Varah, “On a characterization of the best


ℓ2-scaling of a matrix.” SIAM J. Numer. Anal. 11:472–479, 1974.

[105] G. H. Golub and C. F. Van Loan, Matrix Computations. Third edi-


tion, John Hopkins University Press, Baltimore, 1996.

[106] G. H. Golub and C. F. Van Loan, “An analysis of the total least
squares problem.” SIAM J. Numer. Anal. 17:883–893, 1980.

[107] G. H. Golub and J. H. Wilkinson, “Note on iterative refinement of


least squares solutions.” Numer. Math. 9:139–148, 1966.

[108] J. F. Grcar, Optimal Sensitivity Analysis of Linear Least Squares.


Lawrence Berkeley National Laboratory, Report LBNL-52434, 2003.

[109] J. F. Grcar, “Mathematicians of Gaussian elimination.” Notices of the


AMS 58:782–792, 2011.

[110] J. F. Grcar, “John von Neumann’s analysis of Gaussian elimina-


tion and the origins of modern Numerical Analysis.” SIAM Review
53:607–682, 2011.

[111] A. Griewank and A. Walther, Evaluating Derivatives: Principles and


Techniques of Algorithmic Differentiation. Other Titles in Applied
Mathematics 105. Second edition, SIAM, Philadelphia, 2008.

[112] E. Grosse, “Tensor spline approximation.” Linear Algebra Appl.


34:29–41, 1980.

[113] I. Gutman, V. Pereyra and H. D. Scolnik, “Least squares estimation


for a class of nonlinear models.” Technometrics 15:209–218, 1973.

[114] Y. Y. Haimes, L. S. Ladon and D. A. Wismer, “On a bicriterion


formulation of the problem of integrated system identification and
system optimization.” IEEE Trans. on Systems, Man and Cybernetics
1:296–297, 1971.

[115] S. Hammarling, “A note on modifications to the Givens plane rota-


tions.” J. Inst. Math. Applic. 13:215–218, 1974.

[116] S. Hammarling, A Survey of Numerical Aspects of Plane Rotations.


University of Manchester, MIMS Eprint 2008.69, 1977.
[117] M. Hanke and P. C. Hansen, “Regularization methods for large-scale
problems.” Surv. Math. Ind. 3:253–315, 1993.
[118] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems. SIAM,
Philadelphia, 1998.
[119] P. C. Hansen, “The L-curve and its use in the numerical treatment of
inverse problems.” In Comp. Inverse Problems in Electrocardiology.
Ed. P. Johnston, pp. 119–142, WIT Press, Southampton, UK, 2001.
[120] P. C. Hansen, “Regularization tools version 4.0 for MATLAB 7.3.”
Numer. Algorithms 46:189–194, 2007.
[121] P. C. Hansen, Discrete Inverse Problems – Insight and Algorithms.
SIAM, Philadelphia, 2010.
[122] P. C. Hansen, M. Kilmer and R. H. Kjeldsen, “Exploiting residual
information in the parameter choice for discrete ill-posed problems.”
BIT 46:41–59, 2006.
[123] P. C. Hansen and M. Saxild-Hansen, “AIR tools – A MATLAB pack-
age of algebraic iterative reconstruction methods.” J. Comp. Appl.
Math. 236:2167–2178, 2011.
[124] P. C. Hansen and P. Yalamov, “Computing symmetric rank-revealing
decompositions via triangular factorization.” SIAM J. Matrix Anal.
Appl. 23:443–458, 2001.
[125] J. G. Hayes, ed., Numerical Approximation to Functions and Data.
Athlone Press, London, 1970.
[126] M. R. Hestenes and E. Stiefel, “Methods of conjugate gradients for
solving linear systems.” J. Res. Nat. Stan. B. 49:409–432, 1952.
[127] N. Higham, “Analysis of the Cholesky decomposition of a semi-definite
matrix.” In Reliable Numerical Computing, ed. M. G. Cox and S. J.
Hammarling, Oxford University Press, London, 1990.
[128] N. Higham, Accuracy and Stability of Numerical Algorithms. Second
edition, SIAM, Philadelphia, 2002.
[129] H. P. Hong and C. T. Pan, “Rank-revealing QR factorization and
SVD.” Math. Comp. 58:213–232, 1992.
[130] P. J. Huber and E. M. Ronchetti, Robust Statistics. Second edition,
J. Wiley, NJ, 2009.

[131] R. Horst, P. M. Pardalos and N. Van Thoai, Introduction to Global


Optimization. Second edition, Springer, New York, 2000.

[132] N. J. Horton and S. R. Lipsitz, “Multiple imputation in practice:


Comparison of software packages for regression models with missing
variables.” The American Statistician 55, 2011.

[133] I. C. F. Ipsen and C. D. Meyer, “The idea behind Krylov methods.”


Amer. Math. Monthly 105:889–899, 1998.

[134] R. A. Johnson, Miller & Freund’s Probability and Statistics for En-
gineers. Pearson Prentice Hall, Upper Saddle River, NJ, 2005.

[135] P. Jones, Data Tables of Global and Hemispheric Temperature


Anomalies
http://cdiac.esd.ornl.gov/trends/temp/jonescru/data.html.

[136] T. L. Jordan, “Experiments on error growth associated with some


linear least squares procedures.” Math. Comp. 22:579–588, 1968.

[137] D. L. Jupp, “Approximation to data by splines with free knots.” SIAM


J. Numer. Anal. 15:328, 1978.

[138] D. L. Jupp and K. Vozoff, “Stable iterative methods for the inversion
of geophysical data.” J. R. Astr. Soc. 42:957–976, 1975.

[139] W. Karush, “Minima of functions of several variables with inequalities


as side constraints.” M. Sc. Dissertation. Department of Mathematics,
University of Chicago, Chicago, Illinois, 1939.

[140] L. Kaufman, “A variable projection method for solving nonlinear least


squares problems.” BIT 15:49–57, 1975.

[141] L. Kaufman and G. Sylvester, “Separable nonlinear least squares with


multiple right-hand sides.” SIAM J. Matrix Anal. and Appl. 13:68–
89, 1992.

[142] L. Kaufman and V. Pereyra, “A method for separable nonlinear least


squares problems with separable equality constraints.” SIAM J. Nu-
mer. Anal. 15:12–20, 1978.

[143] H. B. Keller and V. Pereyra, Finite Difference Solution of Two-Point


BVP. Preprint 69, Department of Mathematics, University of South-
ern California, 1976.

[144] M. E. Kilmer and D. P. O’Leary, “Choosing regularization parameters


in iterative methods for ill-posed problems.” SIAM. J. Matrix Anal.
Appl. 22:1204–1221, 2001.

[145] H. Kim, G. H. Golub and H. Park, “Missing value estimation for DNA
microarray expression data: Least squares imputation.” In Proceed-
ings of CSB’2004, Stanford, CA, pp. 572–573, 2004.
[146] S. Kotz, N. L. Johnson and A. B. Read, Encyclopedia of Statistical
Sciences. 6. Wiley-Interscience, New York, 1985.
[147] H. W. Kuhn and A. W. Tucker, Nonlinear Programming. Proceedings
of 2nd Berkeley Symposium, 481–492, University of California Press,
Berkeley, 1951.
[148] Lancelot: Nonlinear Programming Code
http://www.numerical.rl.ac.uk/lancelot/
[149] K. Lange, Numerical Analysis for Statisticians. Second edition,
Springer, New York, 2010.
[150] C. Lawson and R. Hanson, Solving Least Squares Problems. Prentice
Hall, Englewood Cliffs, NJ, 1974.
[151] K. Levenberg, “A method for the solution of certain nonlinear prob-
lems in least squares.” Quart. J. App. Math. 2:164–168, 1948.
[152] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing
Data. Second edition. J. Wiley, Hoboken, NJ, 2002.
[153] K. Madsen and H. B. Nielsen, “A finite smoothing algorithm for linear
ℓ1 estimation.” SIAM J. Optimiz. 3:223–235, 1993.
[154] K. Madsen and H. B. Nielsen, Introduction to Optimization and
Data Fitting. Lecture notes, Informatics and Mathematical Mod-
elling, Technical University of Denmark, Lyngby, 2008.
[155] I. Markovsky and S. Van Huffel, “Overview of total least-squares
methods.” Signal Processsing 87:2283–2302, 2007.
[156] O. Mate, “Missing value problem.” Master’s Thesis, Stanford Univer-
sity, Stanford, CA, 2007.
[157] Matrix Market. http://math.nist.gov/MatrixMarket
[158] C. D. Meyer, Matrix Analysis and Applied Linear Algebra. SIAM,
Philadelphia, 2000.
[159] J.-C. Miellou, “Iterations chaotiques a retards.” C. R. Acad. Sci. Paris
278:957–960, 1974.
[160] J.-C. Miellou, “Algorithmes de relaxation chaotique a retards.”
RAIRO R1:55–82, 1975.

[161] K. M. Miettinen, Nonlinear Multiobjective Optimization. Kluwers


Academic, Boston, 1999.
[162] L. Miranian and M. Gu, “Strong rank revealing LU factorizations.”
Lin. Alg. Appl. 367:1–16, 2003.
[163] Software for multiple imputation.
http://www.stat.psu.edu/~jls/misoftwa.html
[164] M. Mohlenkamp and M. C. Pereyra, Wavelets, Their Friends, and
What They Can Do For You. EMS Lecture Series in Mathematics,
European Mathematical Society, Zurich, Switzerland, 2008.
[165] J. Moré and S. J. Wright, Optimization Software Guide. SIAM,
Philadelphia, 1993.
[166] A. Nedic, D. P. Bertsekas and V. S. Borkar, “Distributed asyn-
chronous incremental subgradient methods.” Studies in Computa-
tional Mathematics 8:381–407, 2001.
[167] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Methods
in Convex Programming. Studies in Applied Mathematics 13. SIAM,
Philadelphia, 1994.
[168] L. Ngia, “System modeling using basis functions and applications to
echo cancellation.” Ph.D. Thesis, Chalmers Institute of Technology,
Sweden, Goteborg, 2000.
[169] L. Ngia, Separable Nonlinear Least Squares Methods for On-Line Es-
timation of Neural Nets Hammerstein Models. Department of Signals
and Systems, Chalmers Institute of Technology, Sweden, 2001.
[170] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, New
York, second edition, 2006.
[171] D. P. O’Leary, “Robust regression computation using iteratively
reweighted least squares.” SIAM J. Matrix. Anal. Appl. 22:466–480,
1990.
[172] D. P. O’Leary and B. W. Rust, “Variable projection for non-
linear least squares problems.” Submitted for publication in
Computational Optimization and Applications (2011). Also at:
http://www.cs.umd.edu/users/oleary/software/varpro.pdf
[173] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear
Equations in Several Variables. Academic Press, New York, 1970.
[174] M. R. Osborne, “Some special nonlinear least squares problems.”
SIAM J. Numer. Anal. 12:571–592, 1975.

[175] M. R. Osborne and G. K. Smyth, “A modified Prony algorithm for


fitting functions defined by difference equations.” SIAM J. Sci. Comp.
12:362–382, 1991.

[176] M. R. Osborne and G. K. Smyth, “A modified Prony algorithm for


exponential function fitting.” SIAM J. Sci. Comp. 16:119–138, 1995.

[177] M. R. Osborne, “Separable least squares, variable projections, and


the Gauss-Newton algorithm.” ETNA 28:1–15, 2007.

[178] A. Ostrowski, “Determinanten mit ueberwiegender Hauptdiagonale


and die absolute Konvergenz von linearen Iterationsprozessen.”
Comm. Math. Helv. 30:175–210, 1955.

[179] C. C. Paige and M. A. Saunders, “LSQR: An algorithm for sparse lin-


ear equations and sparse least squares.” ACM TOMS 8:43–71, 1982.

[180] C. C. Paige and Z. Strakos, “Core problems in linear algebraic sys-


tems.” SIAM J. Matrix Anal. Appl. 27:861–875, 2006.

[181] R. Parisi, E. D. Di Claudio, G. Orlandi and B. D. Rao, “A generalized


learning paradigm exploiting the structure of feedforward neural net-
works.” IEEE Transactions on Neural Networks 7:1450–1460, 1996.

[182] Y. C. Pati and P. S. Krishnaprasad, “Analysis and synthesis of feedfor-


ward neural networks using discrete affine wavelet transformations.”
IEEE Trans. Neural Networks 4:73–85, 1993.

[183] V. Pereyra, “Accelerating the convergence of discretization algo-


rithms.” SIAM J. Numer. Anal. 4:508–533, 1967.

[184] V. Pereyra, “Iterative methods for solving nonlinear least squares


problems.” SIAM J. Numer. Anal. 4:27–36, 1967.

[185] V. Pereyra, “Stability of general systems of linear equations.” Aeq.


Math. 2:194–206, 1969.

[186] V. Pereyra, “Stabilizing linear least squares problems.” Proc. IFIP,


Suppl. 68:119–121, 1969.

[187] V. Pereyra, “Modeling, ray tracing, and block nonlinear travel-time


inversion in 3D.” Pure and App. Geoph. 48:345–386, 1995.

[188] V. Pereyra, “Asynchronous distributed solution of large scale nonlin-


ear inversion problems.” J. App. Numer. Math. 30:31–40, 1999.

[189] V. Pereyra, “Ray tracing methods for inverse problems.” Invited top-
ical review. Inverse Problems 16:R1–R35, 2000.
REFERENCES 295

[190] V. Pereyra, “Fast computation of equispaced Pareto manifolds


and Pareto fronts for multiobjective optimization problems.” Math.
Comp. in Simulation 79:1935–1947, 2009.

[191] V. Pereyra, M. Koshy and J. Meza, “Asynchronous global optimiza-


tion techniques for medium and large inversion problems.” SEG An-
nual Meeting Extended Abstracts 65:1091–1094, 1995.

[192] V. Pereyra and P. Reynolds, “Application of optimization techniques


to finite element analysis of piezocomposite devices.” IEEE Ultrason-
ics Symposium, Montreal, CANADA, 2004.

[193] V. Pereyra and J. B. Rosen, Computation of the Pseudoinverse of


a Matrix of Unknown Rank. Report CS13, 39 pp. Computer Science
Department, Stanford University, Stanford, CA, 1964.

[194] V. Pereyra, M. Saunders and J. Castillo, Equispaced Pareto Front


Construction for Constrained Biobjective Optimization. SOL Report-
2010-1, Stanford University, Stanford, CA, 2010. Also CSRCR2009-
05, San Diego State University, 2009. In Press, Mathematical and
Computer Modelling, 2012.

[195] V. Pereyra and G. Scherer, “Efficient computer manipulation of ten-


sor products with applications in multidimensional approximation.”
Math. Comp. 27:595–605, 1973.

[196] V. Pereyra and G. Scherer, “Least squares scattered data fitting by


truncated SVD.” Applied Numer. Math. 40:73–86, 2002.

[197] V. Pereyra and G. Scherer, “Large scale least squares data fitting.”
Applied Numer. Math. 44:225–239, 2002.

[198] V. Pereyra and G. Scherer, “Exponential data fitting.” In Exponential


Data Fitting and Its Applications, Ed. V. Pereyra and G. Scherer.
Bentham Books, Oak Park, IL, 2010.

[199] V. Pereyra, G. Scherer and F. Wong, “Variable projections neural net-


work training.” Mathematics and Computers in Simulation 73:231–
243, 2006.

[200] V. Pereyra, G. Wojcik, D. Powell, C. Purcell and L. Carcione, “Folded


shell projectors and virtual optimization.” US Navy Workshop on
Acoustic Transduction Materials and Devices, Baltimore, 2001.

[201] G. Peters and J. H. Wilkinson, “The least squares problem and


pseudo-inverses.” The Computer Journal 13:309–316, 1969.
296 REFERENCES

[202] M. J. D. Powell, Approximation Theory and Methods. Cambridge Uni-


versity Press, New York, 1981.

[203] L. Prechelt, Proben 1-A Set of Benchmark Neural Network Problems


and Benchmarking Rules. Technical Report 21, Fakultaet fuer Infor-
matik, Universitaet Karlsruhe, 1994.

[204] Baron Gaspard Riche de Prony, “Essai experimental et analytique:


sur les lois de la dilatabilite de fluides elastique et sur celles de la
force expansive de la vapeur de l’alkool, a differentes temperatures.”
J. Ecole Polyt. 1:24–76, 1795.

[205] PZFLEX: Weidlinger Associates Inc. Finite Element Code to Simu-


late Piezoelectric Phenomena. http://www.wai.com, 2006.

[206] L. Reichel, “Fast QR decomposition of Vandermonde-like matrices


and polynomial least squares approximations.” SIAM J. Matrix Anal.
Appl. 12:552–564, 1991.

[207] R. A. Renaut and H. Guo, “Efficient algorithms for solution of regu-


larized total least squares.” SIAM J. Matrix Anal. Appl. 26:457–476,
2005.

[208] J. R. Rice, “Experiments on Gram-Schmidt orthogonalization.” Math.


Comp. 20:325–328, 1966.

[209] J. B. Rosen, “Minimum and basic solutions to singular systems.” J.


SIAM 12:156–162, 1964.

[210] J. B. Rosen and J. Kreuser, “A gradient projection algorithm for non-


linear constraints.” In Numerical Methods for Non-Linear Optimiza-
tion. Ed. F. A. Lootsma, Academic Press, New York, pp. 297–300,
1972.

[211] J. Rosenfeld, A Case Study on Programming for Parallel Processors.


Research Report RC-1864, IBM Watson Research Center, Yorktown
Heights, New York, 1967.

[212] Å. Ruhe and P-Å. Wedin, “Algorithms for separable non-linear least
squares problems.” SIAM Rev. 22:318–337, 1980.

[213] B. W. Rust, “Fitting nature’s basis functions.” Parts I-IV, Computing


in Sci. and Eng. 2001–03.

[214] B. W. Rust, http://math.nist.gov/~BRust/Gallery.html, 2011.


REFERENCES 297

[215] B. W. Rust, Truncating the Singular Value Decomposition for Ill-


Posed Problems. Tech. Report NISTIR 6131, Mathematics and Com-
puter Sciences Division, National Institute of Standards and Technol-
ogy, 1998.

[216] B. W. Rust and D. P. O’Leary, “Residual periodograms for choosing


regularization parameters for ill-posed problems.” Inverse Problems
24, 2008.

[217] S. Saarinen, R. Bramley and G. Cybenko, “Ill-conditioning in neural


network training problems.” SIAM J. Sci. Stat. Comput. 14:693–714,
1993.

[218] S. A. Savari and D. P. Bertsekas, “Finite termination of asynchronous


iterative algorithms.” Parallel Comp. 22:39–56, 1996.

[219] J. A. Scales, A. Gersztenkorn and S. Treitel, “Fast lp solution of large,


sparse, linear systems: application to seismic travel time tomogra-
phy.” J. Comp. Physics 75:314–333, 1988.

[220] J. L. Schafer, Analysis of Incomplete Multivariate Data. Chapman &


Hall, London, 1997.

[221] S. Schechter, “Relaxation methods for linear equations.” Comm. Pure


Appl. Math. 12:313–335, 1959.

[222] M. E. Schlesinger and N. Ramankutty, “An oscillation in the global


climate system of period 65–70 years.” Nature 367:723–726, 1994.

[223] G. A. F. Seber and C. J. Wild, Nonlinear Regression. J. Wiley, New


York, 1989.

[224] R. Sheriff and L. Geldart, Exploration Seismology. Cambridge Uni-


versity Press, second edition, Cambridge, 1995.

[225] D. Sima, S. Van Huffel and G. Golub, “Regularized total least squares
based on quadratic eigenvalue problem solvers.” BIT 44:793–812,
2004.

[226] Spline2. http://www.structureandchange.3me.tudelft.nl/.

[227] J. Sjöberg and M. Viberg, “Separable non-linear least squares


minimization–possible improvements for neural net fitting.” IEEE
Workshop in Neural Networks for Signal Processing, Amelia Island
Plantation, FL, 1997.
298 REFERENCES

[228] N. Srivastava, R. Suaya, V. Pereyra and K. Banerjee, “Accurate cal-


culations of the high-frequency impedance matrix for VLSI intercon-
nects and inductors above a multi-layer substrate: A VARPRO suc-
cess story.” In Exponential Data Fitting and Its Applications, Eds. V.
Pereyra and G. Scherer. Bentham Books, Oak Park, IL, 2010.

[229] T. Steihaug and Y. Yalcinkaya, “Asynchronous methods in least


squares: An example of deteriorating convergence.” Proc. 15 IMACS
World Congress on Scientific Computation, Modeling and Applied
Mathematics, Berlin, Germany, 1997.

[230] G. W. Stewart, “Rank degeneracy.” SIAM J. Sci. Stat. Comput.


5:403–413, 1984.

[231] G. W. Stewart, “On the invariance of perturbed null vectors under


column scaling.” Numer. Math. 44:61–65, 1984.

[232] G. W. Stewart, Matrix Algorithms. Volume I: Basic Decompositions.


SIAM, Philadelphia, 1998.

[233] T. Strutz, Data Fitting and Uncertainty. Vieweg+Teubner Verlag,


Wiesbaden, 2011.

[234] B. J. Thijsse and B. Rust, “Freestyle data fitting and global temper-
atures.” Computing in Sci. and Eng. pp. 49–59, 2008.

[235] R. Tibshirani, “Regression shrinkage and selection via the lasso.”


Journal of the Royal Statistical Society. Series B (Methodological)
58:267–288, 1996.

[236] A. N. Tikhonov, “On the stability of inverse problems.” Dokl. Akad.


Nauk SSSR 39:195–198, 1943.

[237] L. N. Trefethen and D. Bau, Numerical Linear Algebra. SIAM,


Philadelphia, 1997.

[238] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R.


Tibshirani, D. Botstein and R. B. Altman, “Missing value estimation
methods for DNA microarrays.” Bioinformatics 17:520–525, 2001.

[239] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem.
Computational Aspects and Analysis, SIAM, Philadelphia, 1991.

[240] E. van den Berg and M. P. Friedlander, “Probing the Pareto frontier
for basis pursuit solutions.” SIAM J. Sci. Comp. 31:890–912, 2008.

[241] E. van den Berg and M. P. Friedlander, “Sparse optimization with


least-squares constraints.” SIAM J. Optim. 21:1201–1229, 2011.
REFERENCES 299

[242] A. van den Bos, Parameter Estimation for Scientists and Engineers.
Wiley-Interscience, Hoboken, NJ, 2007.

[243] A. van der Sluis, “Condition numbers and equilibration of matrices.”


Numer. Math. 14:14–23, 1969.

[244] J. W. van der Veen, R. de Beer, P. R. Luyten and D. Van Ormondt,


“Accurate quantification of in vivo 31P NMR signals using the variable
projection method and prior knowledge.” Magn. Reson. Med. 6:92–
98, 1988.

[245] L. Vanhamme, A. van den Boogaart and S. Van Huffel, “Improved


method for accurate and efficient quantification of MRS data with
use of prior knowledge.” J. Magn. Reson. 129:35–43, 1997.

[246] L. Vanhamme, S. Van Huffel, P. Van Hecke and D. van Ormondt,


“Time-domain quantification of series of biomedical magnetic reso-
nance spectroscopy signals.” J. Magn. Reson. 140:120–130, 1999.

[247] L. Vanhamme, T. Sundin, P. Van Hecke, S. Van Huffel and R. Pin-


telon, “Frequency-selective quantification of biomedical magnetic res-
onance spectroscopy data.” J. Magn. Reson. 143:1–16, 2000.

[248] S. Van Huffel, ‘Partial singular value decomposition algorithm.” J.


Comp. Appl. Math. 33:105–112, 1990.

[249] B. Walden, R. Karlsson and J.-G. Sun, “Optimal backward pertur-


bation bounds for the linear least squares problem.” Numer. Linear
Algebra Appl. 2:271–286, 2005.

[250] I. Wasito and B. Mirkin, “Nearest neighbors in least squares data


imputation algorithms with different missing patterns.” Comp. Stat.
& Data Analysis 50: 926–949, 2006.

[251] D. S. Watkins, Fundamentals of Matrix Computations. J. Wiley, New


York, 2002.

[252] K. Weigl and M. Berthod, Neural Networks as Dynamical Bases in


Function Space. Report 2124, INRIA, Programe Robotique, Image et
Vision, Sophia-Antipolis, France, 1993.

[253] K. Weigl, G. Giraudon and M. Berthod, Application of Projection


Learning to the Detection of Urban Areas in SPOT Satellite Im-
ages. Report #2143, INRIA, Programe Robotique, Image et Vision,
Sophia-Antipolis, France, 1993.
300 REFERENCES

[254] K. Weigl and M. Berthod, “Projection learning: Alternative approach


to the computation of the projection.” Proceedings of the European
Symposium on Artificial Neural Networks pp. 19–24, Brussels, Bel-
gium, 1994.
[255] J. H. Wilkinson, Rounding Errors in Algebraic Processes. Prentice-
Hall, Englewood Cliffs, NJ, 1963. Reprint Dover, New York, 1994.
[256] H. Wold and E. Lyttkens, “Nonlinear iterative partial least squares
(NIPALS) estimation procedures.” Bull. ISI 43:29–51, 1969.
[257] S. Wold, A. Ruhe, H. Wold and W. J. Dunn, “The collinearity prob-
lem in linear regression. The partial least squares (PLS) approach to
generalized inverses.” SIAM J. Sci. Stat. Comput. 5:735–743, 1984.
[258] R. Wolke and H. Schwetlick, “Iteratively reweighted least squares:
Algorithms, convergence analysis, and numerical comparisons.” SIAM
J. Sci. Stat. Comput. 9:907–921, 1988.
[259] S. J. Wright, Primal-Dual Interior-Point Methods. SIAM, Philadel-
phia, 1997.
[260] S. J. Wright, J. N. Holt, “Algorithms for non-linear least squares
with linear inequality constraints.” SIAM J. Sci. and Stat. Comput.
6:1033–1048, 1985.

[261] P. Zadunaisky and V. Pereyra, “On the convergence and precision of


a process of successive differential corrections.” Proc. IFIPS 65, 1965.
[262] H. Zha, “Singular values of a classical matrix.” The American Math-
ematical Monthly 104:172–173, 1997.

[263] Netlib Repository at UTK and ORNL. http://www.netlib.org.


[264] NEOS Wiki. http://www.neos-guide.org.

[265] NIST/SEMATECH e-Handbook of Statistical Methods.


http://www.itl.nist.gov/div898/strd/nls/nls_info.shtml, 2011.
Index

Activation function, 232
Active constraint, 127, 156
Aerodynamic modeling, 242
Annual temperature anomalies, 213
Asynchronous
    block Gauss-Seidel, 190
    iterative methods, 189
    method, 117
Autocorrelation, 15, 216, 218
    test, 14

B-spline, 260
B-splines representation, 203
Back-scattering, 258
Backward stable, 88, 265
Basic solution, 41, 42, 91, 94, 97, 145
Bidiagonalization, 99–102, 112
    method
        Golub-Kahan, 111
        LSQR, 111
Block
    Gauss-Seidel, 117, 224
    Jacobi, 117
    methods, 117, 186
    nonlinear Gauss-Seidel, 186, 259
    nonlinear Jacobi, 186
Bolzano-Weierstrass theorem, 156
Bound constraints, 126, 127

Cauchy-Schwartz inequality, 267
CFL condition, 254
CGLS solution convergence, 116
Chaotic relaxation, 117, 188
Chebyshev
    acceleration, 107
    method, 107
    norm, 267
Chi-square distribution, 278
Cholesky factor, 34, 66, 114
Coefficient of determination, 38, 278
    adjusted, 38, 216
Complex exponentials, 184
Compressed sensing, 4, 41, 91, 121, 144
Computed tomography (CT), 110
Condition number, 55, 268
    estimation, 88
Conjugate gradient method, 109
Control vertices, 205
Covariance, 31, 277
    matrix, 31
        approximation, 155
        method, 81
Cubic splines, 203

Data approximation, 2
Data fitting, 4
    problem, 1
Data imputation, 131
Derivative free methods, 183
Descent direction, 153, 272
Direct methods, 65, 91
Distribution function, 276

Eckart-Young-Mirski theorem, 53
Elastic waves, 258
Euclidean norm, 267
Expectation, 276

F-distribution, 279
Feasible region, 156
Finite-element simulation, 238
First-order necessary condition, 151
Fitting model, 4, 13, 16
Flop, 264
Fréchet derivative, 50
Frobenius norm, 268

Gaussian
    density function, 11
    distribution, 10, 11
    errors, 12
    model, 147, 155
    probability function, 278
Gauss-Newton
    damped, 167
    direction, 166
    inexact, 170
    method, 163
Generalized cross-validation, 197
Genetic algorithm, 248
Geological
    medium, 259
    surface modeling, 221
Geophone receiver array, 259
Givens rotations, 71
    fast, 72
Globally convergent algorithm, 271
Global minimum, 271
Global optimization, 177, 241
Gradient vector, 272
Grammian, 28
    matrix, 65
Gram-Schmidt
    modified, 77
    orthogonalization, 70, 75
Ground boom signature, 242

Hessian, 28, 152, 163, 165, 272
Householder transformations, 71, 122
Hybrid methods, 176
Hypothesis testing, 280

IEEE standard shapes, 252
Ill-conditioned, 191, 207, 223
    Jacobian, 168, 176
Ill-conditioning, 191
Inexact methods, 186
Interlacing property, 80, 123, 141, 270
Interpolation, 3
Inverse problem, 129
Iterative
    methods, 105, 107, 108, 119
    process, 117
    refinement, 85

Jacobian, 272

Kronecker product, 209, 210
Krylov
    iterations, 194
    process, 111
    subspace, 108
        methods, 225

Lagrange multipliers, 158, 273
Lagrangian, 273
Laplace
    density function, 11
    distribution, 11
    errors, 12
Large-scale problems, 105
Least squares
    data fitting problem, 6
    fit, 5, 7, 9
    overdetermined, 68
    recursive, 81
Least squares problems
    condition number, 47
    large
        linear, 117
    linear, 25
        constrained, 121
    minimum-norm solution, 93
    modifying, 80
    rank deficient, 39
Least squares solution
    full-rank problem, 34
Level curves, 63
Level of significance, 280
Level set, 153
Levenberg-Marquardt
    algorithm, 170
    asynchronous BGS iteration, 190
    direction, 171
    implementation, 250
    method, 170
    programs, 178
    step, 171
Linear constraints, 121
    equality, 121
    inequality, 125
Linear data fitting, 4
Linearized sensitivity analysis, 255
Linear least squares applications, 203
Linear prediction, 41
Lipschitz continuity, 273
Local
    minimizer, 156, 271
    minimum, 151
Locally convergent algorithm, 271
Log-normal model, 12
LU factorization, 68
    Peters-Wilkinson, 93

Machine epsilon, 263
Matrix norm, 267
Maximum likelihood principle, 6, 9
Minimal solution, 145
Model basis functions, 4
Model validation, 217
Monte Carlo, 176
Moore-Penrose
    conditions, 53
    generalized inverse, 182
Multiobjective optimization, 160
Multiple initial points, 176
    random, 163
Multiple right-hand sides, 184

Neural networks, 231
    training algorithm, 231
Newton's method, 163
NIPALS, 181
NMR, 7
    nuclear magnetic resonance, 2, 248
    spectroscopy, 150
        problem, 29
Non-dominated solutions, 248
Nonlinear data fitting, 148
Nonlinear least squares, 147, 148
    ill-conditioned, 201
    large scale, 186
    separable, 158, 231, 234, 236
    unconstrained, 150
Non-stationary methods
    Krylov methods, 108
Normal distribution, 14
Normal equations, 28, 65
Normalized cumulative periodograms, 16
Numerically rank-deficient problems, 192
Numerical rank, 92

Operator norm, 267
Optimality conditions, 151
    constrained problems, 156
Order of fit, 4
Orthogonal factorization
    complete, 43
Orthogonal matrices, 269
Orthogonal projection, 269
    derivative, 49

Parameter estimation, 1
Pareto
    equilibrium point, 160
    front, 161, 248
        equispaced representation, 161
    optimal, 160
Pearson correlation coefficient, 277
Permutation matrix, 269
Perturbation analysis, 58
Piezoelectric transducer, 251
Poisson data, 11
PRAXIS, 183, 252
Preconditioning, 114
Probability density function, 275
Probability function, 275
Probability of an event, 275
Projection matrix, 48
Proper splitting, 107
Proxies, 238
Pseudoinverse, 47
Pure-data function, 4, 10

QR factorization, 33, 70, 94
    economical, 34
    full, 82
    pivoted, 39, 96
    rank-revealing, 95
Quadratic constraints, 129

Randomness test, 14
Random signs, 14
Random variable, 275
    continuous, 275
    discrete, 275
Rank deficiency, 191
Rank-deficient problems, 91
Rational model, 148
Regularization, 192, 241
Residual analysis, 16
Response surfaces, 238, 244
Root mean square, 277

Sample space, 275
Second-order
    sufficient conditions, 151
Secular equation, 130
Seismic
    ray tracing, 259
    survey, 258
Sensitivity analysis, 59, 127
Sherman-Morrison-Woodbury
    formula, 270
Sigmoid activation functions, 244
Signal to noise ratio, 278
Singular value, 50, 110
    normalized, 92
Singular value decomposition, 47, 50
    economical, 50
    generalized, 54, 122, 129
    partial, 140
Singular vectors, 50
Stability, 88
Standard deviation, 185, 277
Standard error, 37
Stationary methods, 107
Stationary point, 272
Statistically independent, 277
Steepest descent, 153
Step length, 177
Stopping criteria, 114
Sum-of-absolute-values, 5, 11
Surrogates, 238
SVD computations, 101
Systems of linear equations
    parallel iterative solution, 188

Taylor's theorem, 272
Tensor product, 209
    bivariate splines, 208
    cubic B-splines, 222
    data, 224
    fitting, 224
Tikhonov regularization, 118, 163
Toeplitz matrix, 41, 88
Total least squares, 136
Truncated SVD, 118

Uniformly monotone operator, 273
Unitary matrix, 34
Unit round-off, 263
UTV decomposition, 98

Vandermonde matrix, 57
Variable projection, 231, 235, 249
    algorithm, 181, 183
    principle, 181
Variance, 276
VARP2, 184
VARPRO program, 183, 236
Vector
    contracting, 273
    Lipschitz, 273
    norms, 267

Weighted residuals, 6
White noise, 5, 14
    test, 14, 15
    vector, 278