Stat 413
Adam B Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1
Preface 1
2 Model Checking 21
2.1 The Jackknife . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Resampling Bootstrap . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Parametric Bootstrap . . . . . . . . . . . . . . . . . . . . . . 25
2.2.3 Smooth Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 Stochastic Optimization 39
4.1 Basic Stochastic Search . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Simple Random Search . . . . . . . . . . . . . . . . . . . . . 39
4.1.2 Localized Random Search . . . . . . . . . . . . . . . . . . . . 40
4.1.3 Enhanced Localized Search . . . . . . . . . . . . . . . . . . . 42
4.1.4 Noisy Loss Function . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Stochastic Approximation . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Gradient-free Methods . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Example: Least Squares and Neural Networks . . . . . . . . . . . . . 47
A Appendix 49
A.1 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Preface
This course came into being only a few years ago. At its birth, it was called
Statistical Computing. Then, it was rebranded as Computing for Data Science. If
I were to give it a more descriptive name, I would call it Stochastic Computation
and Algorithms. The main focus of these notes is how to calculate a desired number
using randomization. In Chapter 1, we first discuss how to generate
random numbers using a deterministic computer. It has been said that “It may
seem perverse to use a computer, that most precise and deterministic of all machines
conceived by the human mind, to produce ‘random’ numbers” (Press et al.). Nevertheless, we
require random numbers, for some definition of random, in order to proceed with
stochastic algorithms. In Chapter 3, we consider how to use stochastic algorithms
to compute volumes and integrate functions. In Chapter 4, we consider how to use
stochastic algorithms to optimize a function.
There are three textbooks I used as inspiration for these course notes: George
Fishman’s Monte Carlo: Concepts, Algorithms, and Applications (Fishman, 2013);
the famous Numerical Recipes by Press, Teukolsky, Vetterling, and Flannery (Press et al., 2007);
and James Spall’s Introduction to Stochastic Search and Optimization (Spall, 2005).
Adam B Kashlak
Edmonton, Canada
Fall 2022
One year later, I decided to add a short new chapter, Chapter 2, on model
checking. Here, we will discuss simulating data to check the performance of a model.
Of note, we will cover an introduction to cross validation and bootstrapping.
This came about after discussions with a past student who is currently working
at ATB Financial. According to her, most of the work revolves around using
Monte Carlo methods to check and validate models. Hence, I thought it would be
a valuable addition to this course.
Adam B Kashlak
Edmonton, Canada
Winter 2024
Chapter 1
Introduction
Type ?set.seed into the R command prompt and start reading under the “Details”
heading. It is easy to take random number generation for granted, but an incredible
amount of work has gone into making sure that your simulations in R or Python or
other languages are as random as possible given a deterministic computer. Many
of the implemented methods within R or Python or other computing languages are
based on algorithms such as Linear Feedback Shift Registers (LFSR) and Linear
Congruential Generators (LCG). More details on these methods can be found in
Chapter 7 of Fishman (2013) and Chapter 7 of Numerical Recipes (Press et al.). When discussing “random”
numbers generated on a computer, they are often called pseudorandom numbers as
they are deterministic but satisfy many properties of a set of truly random numbers.
For this chapter, we assume that it is possible to generate independent uniformly
random binary variables. That is, for any n ∈ N, we can generate B1, . . . , Bn such
that P(Bi = 1) = P(Bi = 0) = 0.5 and such that Bi and Bj are independent for all
i ≠ j. At the time of writing, pure randomness has not been achieved, and one has
to be cautious of the fact that such a sequence of Bi’s is necessarily deterministic,
but sufficiently chaotic to be considered as random.
From uniformly random bits come uniformly random variates on the unit interval
[0, 1]. We will denote U1, . . . , Un ∼ iid Uniform[0, 1] to be n independent uniform
random variates. This implies that the cumulative distribution function (CDF) for
Ui is

P(Ui < u) = 0 for u ≤ 0,  u for u ∈ (0, 1),  and 1 for u ≥ 1,
and Ui ⊥⊥ Uj for i ≠ j.¹ Generating such uniform variates can be done in R via the
function runif( ). This function directly calls compiled C code that implements
¹ The symbol ⊥⊥ is often used to indicate independence between two random variables.
the chosen random generator from set.seed( ).
Note that the events U = 0 and U = 1 have probability zero of occurring. In
fact, in the C code underlying the R code for random number generation, if the
pseudorandom uniform variate is exactly 0 or 1, the code adjusts it very slightly up
or down, respectively, so that runif( ) never actually returns 0 or 1.
Remark 1.0.1. In probability & measure theory, there is a theorem proving the
existence of countably infinite independent random variables on R. The technique of
the proof uses precisely the same approach as the computer does. That is, (1) generate
independent Bernoulli random variables; (2) use those to construct independent
uniform random variables; (3) use the uniforms to construct a general collection of
independent random variables. See STAT 571 notes for more details on this proof.
Z = F⁻¹(U),

P(Z ≤ t) = P(F⁻¹(U) ≤ t) = P(U ≤ F(t)) = F(t),

for u ∈ [0, 1]. A jump discontinuity in a CDF implies a point mass exists in the
distribution. Only a countable number of these are possible.
1.1.1 CDFs with computable inverses
If the function F has an inverse that exists and furthermore if that inverse can be
expressed in a closed form analytically, then generating random variables from F
is as easy as evaluating the inverse F −1 . However, just because generating random
variables from F is easy to code does not imply that this approach is computationally
efficient. More on this will be discussed in Section 1.2.
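As a minimal sketch of this inverse-CDF approach, consider F(x) = x² on [0, 1], whose inverse F⁻¹(u) = √u has a closed form (the function names below are illustrative, not from any library in the notes):

```python
import math
import random

def sample_inverse_cdf(inv_cdf, n, seed=42):
    """Draw n variates by applying an inverse CDF to Uniform[0,1] draws."""
    rng = random.Random(seed)
    return [inv_cdf(rng.random()) for _ in range(n)]

# F(x) = x^2 on [0, 1] has closed-form inverse F^{-1}(u) = sqrt(u),
# so X = sqrt(U) has density f(x) = 2x and mean E X = 2/3.
samples = sample_inverse_cdf(math.sqrt, 10_000)
print(sum(samples) / len(samples))  # should be near 2/3
```

The same pattern works for any distribution whose inverse CDF can be written down, e.g. the Exponential(λ) case discussed later.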
1.1.2 CDFs without computable inverses
It often is the case that F⁻¹ exists, but no closed form analytic expression is known.
The most obvious example of this is the normal or Gaussian distribution. The CDF
is written in terms of an integral

F(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.
To solve F(x) = u, Newton-Raphson linearizes F at a current guess x0: the slope satisfies

f(x0) = (F(x0) − u) / (x0 − x1)

and solving for x1 gives

x1 = x0 − (F(x0) − u) / f(x0).
The algorithm can be iterated by repeating the above with x1 in place of x0 to get
a new root x2 . This may continue until convergence is achieved.
It can be shown using Taylor’s Theorem that if F has a continuous second
derivative, then the Newton-Raphson algorithm converges quadratically. Indeed,
denoting the true root by x∞, we have that

|x_{i+1} − x∞| = C_i (x_i − x∞)².

But this requires some additional conditions such as f(x∞) = F′(x∞) ≠ 0. Otherwise,
convergence can be hindered.
Furthermore, the choice of starting point is very important for quick convergence.
Also, certain functions will cause this algorithm to fail. For example, try applying
Newton-Raphson to F(x) = √x for x ≥ 0 and F(x) = −√(−x) for x < 0 for an initial
value of x0 = 1 and see what happens.
Algorithm 1 The Newton-Raphson Algorithm
Input:
F : D → R, an invertible differentiable function
f : D → R+ , the derivative f = F 0
x0 ∈ D, an initial value in the domain of F
τ ∈ R+ , a threshold for convergence
Algorithm:
For i ≥ 1
xi = xi−1 − F (xi−1 )/f (xi−1 )
if |xi − xi−1 | < τ
stop
return xi
Output:
xi such that F (xi ) ≈ 0
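A sketch of Algorithm 1 in Python (the function names are illustrative):

```python
import math

def newton_raphson(F, f, x0, tol=1e-10, max_iter=100):
    """Find a root of F via Newton-Raphson, where f = F'."""
    x = x0
    for _ in range(max_iter):
        x_new = x - F(x) / f(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: invert the Exponential(1) CDF at u = 0.5, i.e. solve
# F(x) = 1 - exp(-x) - 0.5 = 0; the true root is log(2) ≈ 0.6931.
root = newton_raphson(lambda x: 1 - math.exp(-x) - 0.5,
                      lambda x: math.exp(-x), x0=1.0)
print(root)
```

Starting close to the true root, only a handful of iterations are needed thanks to the quadratic convergence noted above.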
Remark 1.1.3 (Halley’s Method). Many other methods of root finding exist. One
method known as Halley’s method can be achieved by applying Newton-Raphson to
the function F(x)/√|F′(x)|, assuming that for any x such that F(x) = 0, we have
that F′(x) ≠ 0. In Halley’s method, the update step of

x_{i+1} = x_i − F(x_i)/F′(x_i)

is replaced by

x_{i+1} = x_i − 2F(x_i)F′(x_i) / (2F′(x_i)² − F(x_i)F″(x_i)).
This approach comes with its own assumptions and concerns, which will be left to
other courses to investigate. See “Householder Methods” for even more.
Much like a binary search algorithm, the Bisection algorithm is guaranteed to
stop after ⌈log₂((b − a)/τ₁)⌉ steps. Hence, the tighter the interval [a, b]
is, the quicker the algorithm will converge; so, like the starting value for Newton-Raphson,
the closer you are to the true root, the faster the algorithm will terminate.
for x ∈ [0, 1] where B(α, β) = Γ(α)Γ(β)/Γ(α + β) is the beta function and Γ( ) is the
gamma function. The CDF is then known as the regularized incomplete beta function

F(x) = (1/B(α, β)) ∫₀ˣ t^{α−1}(1 − t)^{β−1} dt.
Algorithm 2 The Bisection Algorithm
Input:
F : D → R, a function with a root to find
a, b ∈ D such that F (a) < 0 < F (b), interval to search
τ1 , τ2 ∈ R+ , thresholds for convergence
Algorithm:
Set x0 = a and x1 = b
For i ≥ 1
Compute z = (x0 + x1 )/2
Compute F (z)
if F (z) < 0
Set x0 = z
else if F (z) ≥ 0
Set x1 = z
if |x1 − x0 | < τ1 or |F (z)| < τ2
stop
return z
Output:
z such that F (z) ≈ 0
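A sketch of Algorithm 2 in Python (illustrative names; thresholds correspond to τ₁ and τ₂):

```python
def bisection(F, a, b, tol1=1e-10, tol2=1e-12, max_iter=200):
    """Find a root of F in [a, b], assuming F(a) < 0 < F(b)."""
    x0, x1 = a, b
    z = (x0 + x1) / 2
    for _ in range(max_iter):
        z = (x0 + x1) / 2
        Fz = F(z)
        if Fz < 0:
            x0 = z          # root lies in the upper half
        else:
            x1 = z          # root lies in the lower half
        if abs(x1 - x0) < tol1 or abs(Fz) < tol2:
            return z
    return z

# Example: solve F(x) = x - 0.25 = 0 on [0, 1]
print(bisection(lambda x: x - 0.25, 0.0, 1.0))  # ≈ 0.25
```

As noted above, the interval halves each step, so roughly ⌈log₂((b − a)/τ₁)⌉ iterations suffice.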
for some normalizing constant C and a CDF in terms of something called a Gaussian
Hypergeometric Function. Suffice it to say, we don’t want to try to write down F let
alone invert it. Luckily, as noted above in Example 1.1.2 on the Cauchy distribution,
a t random variate X with ν degrees of freedom can be defined as X = Z/√(V/ν) where
Z ∼ N(0, 1), V ∼ χ²(ν), and Z ⊥⊥ V. This is precisely how R generates t random
variates.
Example 1.2.2 (Exponential distribution). As discussed in Example 1.1.1, exponential
random variates with rate parameter λ > 0 can be generated by using the
probability integral transform

X = −(1/λ) log(U)

for U ∼ Uniform[0, 1]. However, more efficient methods exist.
In Fishman (2013) Section 3.10, it is noted that more efficient methods of generation
exist. In particular, let X0 be a random variate from a truncated exponential
distribution. That is, for x ∈ [0, log 2], X0 has PDF and CDF

f_{X0}(x) = 2e^{−x} and F_{X0}(x) = 2 − 2e^{−x}.

Next, let K ∼ Geometric(1/2), which is supported on the non-negative integers such
that

p_K(k) = P(K = k) = (1/2)^{k+1},

and such that X0 ⊥⊥ K. Then, we can write X = X0 + K log 2, which is Exponential(1).
Indeed, any x ∈ R⁺ can be uniquely written as x = x0 + k log 2 for k ∈ Z⁺ and
x0 ∈ [0, log 2). Denoting the convolution product by ⋆, the PDF is

f_X(x) = (f_{X0} ⋆ p_K)(x) = P(K = k) f_{X0}(x0) = 2^{−k−1} · 2e^{−x0} = exp(−k log 2 − x0) = e^{−x}.
Meanwhile, the current R source code uses an algorithm published by Ahrens & Dieter
in 1972 in a paper called “Computer methods for sampling from the exponential
and normal distributions.”
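A quick sketch of the probability integral transform for Exponential(λ) in Python (illustrative code, not the R internals):

```python
import math
import random

def rexp(n, rate=1.0, seed=0):
    """Exponential(rate) variates via X = -log(U)/rate, U ~ Uniform(0,1]."""
    rng = random.Random(seed)
    # use 1 - U so the argument to log is in (0, 1], never exactly 0
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]

x = rexp(100_000, rate=2.0)
print(sum(x) / len(x))  # sample mean should be near 1/rate = 0.5
```

This is the easy-to-code baseline against which faster methods like the truncated-exponential-plus-geometric construction above are compared.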
are independent N (0, 1) random variables.
Proof. This theorem is proven via a change of variables. Note that

V = (X² + Y²)/2 and U = (1/2π) arctan(Y/X).

Then, defining the functions v(x, y) = (x² + y²)/2 and u(x, y) = arctan(y/x)/2π
gives partial derivatives

∂v/∂x = x and ∂v/∂y = y,

∂u/∂x = (1/2π) · (1/(1 + y²/x²)) · (−y/x²) = −y/(4π v(x, y)),

∂u/∂y = (1/2π) · (1/(1 + y²/x²)) · (1/x) = x/(4π v(x, y)).

Then, by the change of variables, we have that

f_{X,Y}(x, y) = f_{U,V}(u, v) |∂u/∂x · ∂v/∂y − ∂u/∂y · ∂v/∂x|

= 1[u(x, y) ∈ [0, 1]] e^{−v(x,y)} |−y²/(4π v(x, y)) − x²/(4π v(x, y))|

= (1/2π) exp(−(x² + y²)/2).
As this is the Gaussian PDF in R2 with mean zero and covariance I2 , we have our
conclusion.
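The Box-Muller transform itself is a short piece of code; a sketch in Python, using U ∼ Uniform[0, 1] and V ∼ Exponential(1) as in the proof above:

```python
import math
import random

def box_muller(n_pairs, seed=1):
    """Generate 2*n_pairs standard normal variates from (U, V) pairs,
    where U ~ Uniform[0,1] and V ~ Exponential(1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_pairs):
        u = rng.random()
        v = -math.log(1.0 - rng.random())   # Exponential(1) variate
        r = math.sqrt(2.0 * v)
        out.append(r * math.cos(2.0 * math.pi * u))
        out.append(r * math.sin(2.0 * math.pi * u))
    return out

z = box_muller(50_000)
print(sum(z) / len(z))                    # near 0
print(sum(zi * zi for zi in z) / len(z))  # near 1
```

Each (U, V) pair yields two independent N(0, 1) variates, which is exactly the content of the theorem.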
Note that the acceptance-rejection method is a critical step in the Metropolis-Hastings
algorithm for performing Markov Chain Monte Carlo. However, MCMC
is beyond the scope of this course. In these notes, we restrict to independent sampling
rather than invoking Markov chains.
Theorem 1.3.1 (Von Neumann, 1951). Let f(x) be a pdf with support on [a, b] that
can be written as f(x) = c g(x) h(x) where h(x) ≥ 0, ∫ₐᵇ h(x)dx = 1, g(x) ∈ [0, 1],
and some constant c ≥ 1. If Z ∼ h(z) and U ∼ Uniform[0, 1] and Z ⊥⊥ U, then
X = Z|{U ≤ g(Z)} has pdf f(x).
Proof. Since U and Z are independent, the joint pdf is simply f(u, z) = 1_{u∈[0,1]} h(z).
Consequently, the conditional pdf for X satisfies

h(z|U ≤ g(Z)) P(U ≤ g(Z)) = ∫₀^{g(z)} f(u, z) du,

so that

h(z|U ≤ g(Z)) = ∫₀^{g(z)} f(u, z) du / P(U ≤ g(Z)).

The numerator becomes ∫₀^{g(z)} f(u, z) du = g(z)h(z). The denominator becomes

P(U ≤ g(Z)) = ∫∫_{[0,1]×[a,b]} 1[u ≤ g(z)] du h(z) dz = ∫ₐᵇ ∫₀¹ 1[u ≤ g(z)] du h(z) dz = ∫ₐᵇ g(z)h(z) dz.

As f(x) is a pdf,

1 = ∫ₐᵇ f(x) dx = c ∫ₐᵇ g(x)h(x) dx.

Thus, ∫ₐᵇ g(x)h(x) dx = c⁻¹, and consequently, h(z|U ≤ g(Z)) = c g(z)h(z) = f(z).
Remark 1.3.2. While many choices of c and g(x) are possible, the acceptance-
rejection method is most efficient when c is close to 1. This is because we accept a
Algorithm 3 The Acceptance-Rejection Algorithm
Input:
g(x) ∈ [0, 1]
h(x), a pdf to sample from
Algorithm:
Repeat
Generate z from pdf h(z)
Generate u from Uniform [0, 1]
if u ≤ g(z)
stop
return x = z
Output:
x from pdf f (x) = cg(x)h(x).
randomly generated variate X when U ≤ g(Z) which occurs with probability 1/c as
per the above proof.
Hence, given a pdf of interest f and a candidate distribution h, c is chosen
to be

c = sup_x {f(x)/h(x)}

and g chosen accordingly.
Furthermore, the number of iterations until the event U ≤ g(Z) occurs has a geometric
distribution with probability 1/c of success. Hence, the expected number of iterates until
acceptance is achieved is c.
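As a sketch of Algorithm 3, here is acceptance-rejection for the Beta(2, 2) density f(x) = 6x(1 − x), using the Uniform[0, 1] candidate h(x) = 1 so that c = sup f/h = 1.5 and g = f/c (an illustrative choice, not an example from the notes):

```python
import random

def accept_reject(g, sample_h, n, seed=2):
    """Draw n variates from f = c*g*h: accept Z ~ h when U <= g(Z)."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        z = sample_h(rng)
        if rng.random() <= g(z):
            out.append(z)
    return out

# Beta(2, 2): f(x) = 6x(1 - x); candidate h = Uniform[0, 1]; c = 1.5
g = lambda x: 6.0 * x * (1.0 - x) / 1.5
draws = accept_reject(g, lambda r: r.random(), 20_000)
print(sum(draws) / len(draws))  # Beta(2, 2) has mean 1/2
```

With c = 1.5 here, on average 1.5 candidate draws are needed per accepted sample, matching the geometric-distribution remark above.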
Algorithm 4 The Acceptance-Rejection-Squeeze Algorithm
Input:
g, gL , gU ∈ [0, 1] such that gL ≤ g ≤ gU
h(x), a pdf to sample from
Algorithm:
Repeat
Generate z from pdf h(z)
Generate u from Uniform [0, 1]
if u ≤ gL (z) pretest with gL
stop
else if u ≤ gU (z) pretest with gU
if u ≤ g(z) only evaluate g if pretests fail
stop
return x = z
Output:
x from pdf f (x) = cg(x)h(x).
In this setting, we require one Uniform and one Exponential random variate until
a sample is “accepted”. Then, one final Uniform random variate is required to set
the sign, i.e. positive or negative.
1.3.1 Acceptance-Rejection-Squeeze
Sometimes the function g(x) may be slow to compute. However, if we can construct
an envelope gL(x) ≤ g(x) ≤ gU(x) where gL, gU are “simpler” functions, then we
can use gL and gU to perform pretests on whether or not we should accept or reject.
The most natural choices for gL and gU come from power series. For example,
1 − x ≤ e^{−x} ≤ 1 − x + x²/2 for x ≥ 0. Hence, instead of merely evaluating
exp[−(x − 1)²/2] in the previous example for the half-normal distribution, we could
choose

gL(x) = 1 − (x − 1)²/2 and gU(x) = 1 − (x − 1)²/2 + (x − 1)⁴/8.
formly random angle between 0 and 2π and a distance to the origin (radius) that is
Uniform[0, 1]. After that is achieved, we claim that

√(−2 log(U² + V²)) · V/√(U² + V²) ∼ N(0, 1).

This is because U² + V² is Uniform[0, 1], thus making −log(U² + V²) ∼ Exponential(1).
Furthermore, the sine and cosine of the angle of the ordered pair (U, V) can be
computed as U/√(U² + V²) and V/√(U² + V²). Thus, we can compare to the original Box-Muller
transform

x = √(2v) cos(2πu), and
y = √(2v) sin(2πu).
Let D0 ⊂ R² be defined as

D0 = {(x, y) : 0 ≤ x ≤ √(r(y/x))}

and let (X, Y) be a uniform random point in D1, some bounded region such that
D0 ⊂ D1, and let W = X². Then, if W ≤ r(Y/X), Z = Y/X has pdf f(z).
Proof. Let |D0| denote the area⁵ of the region D0. Then, conditioning on (X, Y)
being in D0, the joint pdf of (X, Y) is simply f(x, y) = |D0|⁻¹. Since W = X²
and Z = Y/X, we have that X = √W and Y = Z√W. Applying the change of
variables (X, Y) → (W, Z) gives

f_{W,Z}(w, z) = J f_{X,Y}(√w, z√w)

where J is the Jacobian determinant

J = det [ ∂x/∂w  ∂x/∂z ; ∂y/∂w  ∂y/∂z ] = det [ 1/(2√w)  0 ; z/(2√w)  √w ] = 1/2.

Therefore, f_{W,Z}(w, z) = [2|D0|]⁻¹ 1[0 ≤ w ≤ r(z)]. Therefore, the pdf for Z is

f_Z(z) = ∫₀^{r(z)} f_{W,Z}(w, z) dw = r(z)/(2|D0|).

Since f_Z(z) must integrate to 1, we have that 2|D0| = ∫_{−∞}^{∞} r(z) dz.
The region D0 may be hard to visualize and even harder to sample from. Hence,
we can bound D0 with a rectangle D1 by noting the following. On the x-axis,
the region D0 is constrained to the interval [0, x∨] where x∨ = sup_z √(r(z)). On
the y-axis, we have that y = zx = z√(r(z)). Hence, the region D0 is constrained
to the interval [y∧, y∨] = [inf_z z√(r(z)), sup_z z√(r(z))]. Thus, we can use this rectangle for
implementing the ratio of uniforms method as detailed in Algorithm 5.
Note that since we are sampling from a rectangle D1 rather than the actual
set of interest D0, there is an acceptance-rejection step in Algorithm 5. Thus, the
probability of accepting a randomly drawn point (X, Y) is |D0|/|D1|. From the
above proof, this probability can be written down explicitly:

P((X, Y) ∈ D0) = |D0|/|D1| = (1/(2x∨(y∨ − y∧))) ∫_{−∞}^{∞} r(z) dz.
Remark 1.4.1 (Shift and Scale). Note that we can always take Y /X and shift and
scale it to get a0 + a1 (Y /X) if desired. The above theorem, proof, and algorithm can
be modified accordingly.
As noted, we already saw specific examples of the ratio of uniforms method. In
the case of generating Cauchy random variables, we found that for X ∼ Uniform[0, 1]
and Y ∼ Uniform[−1, 1], then Z = Y/X is Cauchy conditioned on X² + Y² ≤ 1.
If we set r(z) = (1 + z²)⁻¹, then

x∨ = sup_{z∈R} 1/√(1 + z²) = 1,

y∨ = sup_{z∈R} z/√(1 + z²) = sup_{z∈R} (z/|z|) · 1/√(1 + z⁻²) = 1
5
i.e. Lebesgue measure
Algorithm 5 The Ratio of Uniforms Algorithm
Input:
function r(z) = cf (z) for some c > 0
Algorithm:
Compute x∨ = sup_z √(r(z)).
Compute y∨ = sup_z z√(r(z)).
Compute y∧ = inf_z z√(r(z)).
Repeat
Generate x from pdf Uniform [0, x∨ ]
Generate y from pdf Uniform [y∧ , y∨ ]
if x2 ≤ r(y/x)
stop
return z = y/x
Output:
z from pdf f (z).
and y∧ = −1 from the same formula. Furthermore, the half circle is recovered by
noting that

0 ≤ x ≤ 1/√(1 + y²/x²)  ⟹  0 ≤ x² + y² ≤ 1.

And thus we can rederive the formula for a Cauchy random variable via the ratio
of uniforms method. In this case, the probability of acceptance is π/4.
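A sketch of Algorithm 5 for this Cauchy case, where x∨ = 1, y∧ = −1, y∨ = 1, and r(z) = 1/(1 + z²), so the acceptance region is just the half disk:

```python
import random

def ratio_of_uniforms_cauchy(n, seed=3):
    """Cauchy variates via ratio of uniforms: draw (x, y) uniform in
    [0,1] x [-1,1], accept when x^2 + y^2 <= 1, return z = y / x."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = rng.random()                # Uniform[0, 1]
        y = 2.0 * rng.random() - 1.0    # Uniform[-1, 1]
        if 0.0 < x and x * x + y * y <= 1.0:
            out.append(y / x)
    return out

z = ratio_of_uniforms_cauchy(10_000)
# The Cauchy mean does not exist, so check the median instead.
print(sorted(z)[len(z) // 2])  # near 0
```

The fraction of candidate points accepted should be close to π/4, the acceptance probability derived above.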
variables. However, this would be highly computationally inefficient.
Instead, we can apply the Ratio of Uniforms method by choosing r(z) = z^{α−1} e^{−z}
where we impose that α > 1. Thus, the region D0 is

D0 = {(x, y) ∈ R² : 0 ≤ x ≤ √((y/x)^{α−1} e^{−y/x})},

and we sample from a bounding rectangle D1 ⊃ D0 as before. More details on this can be found in Fishman
(2013) and Numerical Recipes (Press et al.).
for some constant r > 0. The most common setting is r = 1. Then, choosing
x0 = 1 − Σ_{i=1}^n x_i gives the (n + 1)-long vector (x0, x1, . . . , xn), which is a probability
vector.

To generate a uniform random point inside the simplex Sn(1), we first generate
n Uniform[0, 1] random variables U1, . . . , Un and sort them so that U1 ≤ U2 ≤
. . . ≤ Un. Note that there are efficient algorithms for generating an ordered set of
independent random variables; see order statistics. Then, we simply set X1 = U1
and Xi = Ui − Ui−1 for i = 2, . . . , n. The remainder is X0 = 1 − Un.
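The spacing construction above can be sketched as follows (the coordinate ordering returned here, spacings first and remainder last, is an illustrative choice):

```python
import random

def uniform_simplex(n, seed=4):
    """Uniform point on the n-simplex via sorted Uniform[0,1] spacings:
    returns n + 1 nonnegative coordinates summing to 1."""
    rng = random.Random(seed)
    u = sorted(rng.random() for _ in range(n))
    # X1 = U(1), Xi = U(i) - U(i-1), and the remainder X0 = 1 - U(n)
    return [u[0]] + [u[i] - u[i - 1] for i in range(1, n)] + [1.0 - u[-1]]

p = uniform_simplex(5)
print(p, sum(p))  # six nonnegative coordinates summing to 1
```

Sorting n uniforms costs O(n log n); the order-statistics tricks mentioned above can avoid the sort entirely.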
In Bayesian analysis and other topics in statistics and machine learning, we
often want to generate a probability vector p = (p1, . . . , pn) such that pi ≥ 0 for all
i and Σ_{i=1}^n p_i = 1. Furthermore, we may want a distribution on p other than the
uniform. That is, we say that p has a Dirichlet distribution with parameter vector
α = (α1, . . . , αn) with αi > 0 for all i if it has pdf

f(p) = (Γ(α1 + . . . + αn) / (Γ(α1) · · · Γ(αn))) ∏_{i=1}^n p_i^{αi−1}.
This approach can be shown to yield a Dirichlet random variate, by first starting
with the joint pdf of X1, . . . , Xn, which is

f_X(x1, . . . , xn) = ∏_{i=1}^n (x_i^{αi−1} e^{−xi} / Γ(αi)) = exp(−Σ_{i=1}^n x_i) ∏_{i=1}^n (x_i^{αi−1} / Γ(αi)).
where we can find that det(J) = T^{n−1}. Thus, we have that

f_p(p1, . . . , p_{n−1}, T) = e^{−T} [T(1 − Σ_{i=1}^{n−1} p_i)]^{αn−1} (∏_{i=1}^{n−1} (T p_i)^{αi−1}) / (∏_{i=1}^n Γ(αi)) · T^{n−1}

= (∏_{i=1}^{n−1} p_i^{αi−1}) (1 − Σ_{i=1}^{n−1} p_i)^{αn−1} / (∏_{i=1}^n Γ(αi)) · T^{Σ_i αi − 1} e^{−T}.
1.5.2 Spheres
The first problem in this section is to generate a uniformly random point on the
surface of an n-dimensional hypersphere (or just the usual 3D sphere, if you prefer),
which is S_{n−1}(r) = {z ∈ Rⁿ : Σ z_i² = r²}. To achieve this, we can simulate a
multivariate standard normal random vector and normalize it. That is, if

X = (X1, . . . , Xn) ∼ N(0, In),

W ∼ Beta(n, 1),

f_Z(z|r) = Γ(n/2) / (2π^{n/2} r^{n−1}) for a sphere with radius r,

f_W(w) = n w^{n−1} for w ∈ [0, 1],

f_Z(z|r = w) f_W(w) = nΓ(n/2) / (2π^{n/2}) = Γ(n/2 + 1) / π^{n/2}.
Chapter 2
Model Checking
Introduction
In statistical and data science applications, we almost surely estimate parameters,
whether of the data itself (e.g. correlation, skewness) or in the process of fitting a
model to the data. However, the task of determining whether or not these estimates
are “good” becomes a bit more challenging. In some classical settings, like least
squares regression with homogeneous normal errors, there are exact formulae for
computing statistical significance and the width of a confidence interval. But in
more complex modelling settings, such nice formulae likely do not exist. Hence, in
this chapter, we discuss some methods of model checking; namely, we first look at
bootstrap methods for estimating variances and biases for estimators, and secondly,
we look at cross validation for model tuning and selection.
jackknife estimate of θ,

θ̂_jack = (1/n) Σ_{i=1}^n θ̂_i.
We can go one step further by combining the original estimate and the bias to get a
“de-biased” or “bias-corrected” estimator,

θ̃ = n θ̂ − (n − 1) θ̂_jack.

This has some nice properties in theory, but we will only investigate its performance
in simulation.
Similarly, we can compute a jackknife estimator for the standard error, which
comes from the sample variance of the n jackknife estimators. It is

ŝe(θ̂)_jack = √( ((n − 1)/n) Σ_{i=1}^n (θ̂_i − θ̂_jack)² ).
The inclusion of the (n − 1) in the above standard error and bias formulae may
look a bit arbitrary at first. However, these multipliers are included to make the
Jackknife estimate of the sample mean and variance of the sample mean coincide
with the usual formula.
θ̂_jack = (1/n) Σ_{i=1}^n (1/(n − 1)) Σ_{j≠i} X_j = X̄.
Hence, the jackknife bias-corrected estimator for the population mean is precisely
just the usual sample mean, which, of course, is already unbiased!
Example 2.1.2 (The variance of the sample mean). Let X1, . . . , Xn be iid with the
sample mean µ̂ = X̄ as an estimator for the population mean µ. Then, we know
that

Var(µ̂) = σ²/n

where σ² is the variance of the Xi. In practice, we typically do not know what σ²
is, so it is estimated by the sample variance σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².
Then, under the assumption that the Xi are normal, we can construct a (1 − α)
confidence interval for µ as

|X̄ − µ| ≤ z_{1−α/2} √(σ̂²/n)

where z_{1−α/2} comes from the normal distribution. We could also use the t-distribution,
but if n is large enough and α is not extremely small, then the normal z-score
approach will be fine.
We could take a jackknife approach to the estimation of the variance of µ̂. To
do that, we compute

X̄_i = (1/(n − 1)) Σ_{j≠i} X_j,

which gives n different estimates of the sample mean. Therefore, we can consider
the jackknife standard error, which is

ŝe(X̄)_jack = √( ((n − 1)/n) Σ_{i=1}^n (X̄_i − X̄)² )

= √( ((n − 1)/n) Σ_{i=1}^n ( (1/(n − 1)) X̄ − (1/(n − 1)) X_i )² ) = √(σ̂²/n).
To see if this improves our estimate, let a = 1. In that case, the order statistics
are distributed as X_(k) ∼ Beta(k, n − k + 1) for k = 1, . . . , n. Hence, the original
estimator has a bias of

E X_(n) − 1 = n/(n + 1) − 1 = −1/(n + 1).

The bias corrected estimator has a bias of

E[2X_(n) − (1/n)X_(n) − ((n − 1)/n)X_(n−1)] − 1

= 2 · n/(n + 1) − (1/n) · n/(n + 1) − ((n − 1)/n) · (n − 1)/(n + 1) − 1

= (1/(n + 1))(2n − 1 − n + 2 − 1/n − n − 1) = −1/(n(n + 1)).

Hence, the bias went from order n⁻¹ to n⁻², which is a good improvement.
can use these θ_i* to get an idea of the variability of the original estimate θ̂. For
example, we can compute a bootstrap estimate of Var(θ̂) by computing

Var_boot(θ̂) = (1/(B − 1)) Σ_{i=1}^B (θ_i* − θ̄*)²

where θ̄* is the sample average of the B bootstrap estimators.
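A resampling-bootstrap sketch of this variance estimate (illustrative code):

```python
import random

def bootstrap_variance(data, estimator, B=2000, seed=6):
    """Bootstrap estimate of Var(theta_hat): resample with replacement
    B times, re-estimate, and take the sample variance."""
    rng = random.Random(seed)
    n = len(data)
    thetas = [estimator([data[rng.randrange(n)] for _ in range(n)])
              for _ in range(B)]
    theta_bar = sum(thetas) / B
    return sum((t - theta_bar) ** 2 for t in thetas) / (B - 1)

mean = lambda xs: sum(xs) / len(xs)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# For the sample mean, this should land near sigma_hat^2 / n.
print(bootstrap_variance(data, mean))
```

Unlike the jackknife, the bootstrap has Monte Carlo noise that shrinks as B grows.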
y = β0 + β1 x + ε

to get estimates β̂0 and β̂1. Then, new data can be simulated by
done for multiple splits or folds of the data. While there are a multitude of cross
validation methods, we will just briefly mention the most common ones being k-fold
and leave-one-out cross validation.
As an example of where this would be used (see Jupyter notebook), we can
consider ridge regression, which fits a linear model by minimizing the penalized
least squares criterion

β̂_ridge = argmin_β { Σ_{i=1}^n (y_i − x_iᵀβ)² + λ Σ_j β_j² }.
Leave-One-Out
Leave-one-out cross validation looks a lot like the jackknife estimator. However,
the overall goal is different. Given a dataset X1, . . . , Xn, the jackknife recomputes
an estimator θ̂ on the dataset with one point missing, i.e. θ̂_i from X1, . . . , Xi−1,
Xi+1, . . . , Xn. The goal is to estimate the bias and standard error of the estimator
θ̂.
In leave-one-out cross validation, we also start from X1 , . . . , Xi−1 , Xi+1 , . . . , Xn .
This dataset is used to fit some statistical models. Each model is then tested on
the missing data point Xi . In this way, the model that gives the best prediction for
data point Xi can be selected from the candidates.
k-fold
In k-fold cross validation, we first randomize the order of the data. This preliminary
step is important in case your data has already come in a specific order, e.g.
chronological.
Once your data is randomized, you split it into k roughly equal subsets.
Then, each subset is sequentially left out. The remaining k − 1 sets are combined
and used to fit a statistical model. Then, the left out set is used for testing.
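The splitting step can be sketched as follows (illustrative code, no modelling library assumed):

```python
import random

def kfold_indices(n, k, seed=7):
    """Shuffle the indices 0..n-1 and split them into k (nearly) equal folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 3)
for i, test_fold in enumerate(folds):
    # everything outside the held-out fold is the training set
    train = [j for f in folds if f is not test_fold for j in f]
    print(f"fold {i}: test={sorted(test_fold)} train size={len(train)}")
```

Each index appears in exactly one test fold, so every data point is used for testing exactly once.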
Chapter 3
Introduction
In Chapter 3 of these lecture notes, we are concerned with using randomly generated
variates to actually compute things. This will take on different forms. One will
be to compute the volume of some region in space. Another will be to numerically
integrate some function. Much as before, we will have to be clever to sample random
variates in a “smart” way in order to produce efficient algorithms.
One of the new considerations in this chapter is the accuracy of computation.
That is, we will be concerned with topics like error estimation, confidence intervals
for estimates, and dependency on sample size. That is, how large of a sample do we
need to get an estimate with a given accuracy?
Remark 3.1.1. Without loss of generality, we will assume that D lives within the
d-dimensional unit cube
C = [0, 1]⊗d .
We can always scale and translate a region D to make this so.
This region is 1/2^d of the d-dimensional sphere. By looking up formulae for volume
and surface area, we find that

Vol(D) = (π/4)^{d/2} / Γ(d/2 + 1) and Surf(D) = 2(π/4)^{d/2} / Γ(d/2).

Therefore, we may want to compute this volume using Algorithm 6 to an accuracy
of ε. However, Vol(D) → 0 as d → ∞. Hence, we would have to adapt ε to the
dimension of the problem. Therefore, we will consider the normalized error bound:

|Vol(D) − k^{−d} S| / Vol(D) ≤ (1/k) · Surf(D)/Vol(D).
Algorithm 6 Volume via regular grid
Input:
function φ(x) that is 1 on D and 0 otherwise.
k ∈ N, the number of points per dimension to sample.
Algorithm:
Set S = 0.
Construct a grid X of k d points x.
for each x ∈ X
if φ(x) = 1
S ←S+1
return S/k d
Output:
Volume estimate of region D.
Setting the relative error to ε and solving for n = k^d gives

ε = n^{−1/d} · Surf(D)/Vol(D),

n = ε^{−d} (Surf(D)/Vol(D))^d = ((2/ε) · Γ(d/2 + 1)/Γ(d/2))^d = (d/ε)^d.

Hence, if we want the relative error less than, say, 0.001, then the sample sizes are
approximately

10^{6.6}, 10^{10.4}, 10^{14.4}, 10^{18.5}

for d = 2, 3, 4, 5.
This algorithm is detailed in Algorithm 7. Going one step farther, we can compute
the variance of this estimate as well. We can consider each b_i = φ(x_i) to be a
Bernoulli random variable with probability |D| of getting a 1 and 1 − |D| of getting
a zero. Hence, for each i,

Var(b_i) = |D|(1 − |D|),

and for our estimate, we have

σ² = Var( (1/n) Σ_{i=1}^n b_i ) = |D|(1 − |D|)/n.
Thus, the variance of the estimate decreases at the usual n⁻¹ rate. We can also
compute the unbiased estimate of the variance, which is

σ̂² = (S/n)(1 − S/n) / (n − 1).
The “usual” normal style confidence interval is

| |D| − S/n | ≤ t_{1−α/2,n−1} σ̂

where t_{1−α/2,n−1} is the value such that P(T > t_{1−α/2,n−1}) = α/2 for T ∼
t(n − 1). This hints that our estimation error decreases at a rate of O(n^{−1/2}) for
sample size n.
The problem with this is that in the normal distribution case, the estimate for
the mean µ̂ and the estimate for the variance σ̂² are independent of each other.
However, for these Bernoulli random variables, this is not the case. Hence, S/n and
σ̂² may be correlated, which can invalidate the confidence interval. We will consider
two more powerful methods for confidence interval construction and sample size
estimation.
Chebyshev’s Inequality
One approach to confidence interval construction is to use Chebyshev’s inequality.
This is a very general and very useful upper bound on the tail probability of a
random variable, only assuming a finite variance. Hence, it can apply to a wide
range of distributions. For a random variable Z and any ε > 0,

P(|Z| > ε) ≤ E Z²/ε².
Algorithm 7 Volume via uniform Monte Carlo
Input:
function φ(x) that is 1 on D and 0 otherwise.
n ∈ N, the number of random points to simulate.
Algorithm:
Set S = 0.
for i = 1, . . . , n
Generate x with iid entries xj ∼ Uniform [0, 1] for j = 1, . . . , d.
if φ(x) = 1
S ←S+1
return estimate S/n
return variance [(S/n)(1 − S/n)]/(n − 1)
return confidence interval (see below)
Output:
Volume estimate of region D along with variance and confidence interval.
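A sketch of Algorithm 7 for the quarter disk D = {(x, y) ∈ [0, 1]² : x² + y² ≤ 1}, whose volume is π/4 (an illustrative region):

```python
import random

def mc_volume(phi, d, n, seed=8):
    """Monte Carlo volume of D inside [0,1]^d using the indicator phi."""
    rng = random.Random(seed)
    S = sum(phi([rng.random() for _ in range(d)]) for _ in range(n))
    est = S / n
    var = est * (1.0 - est) / (n - 1)   # unbiased variance estimate
    return est, var

quarter_disk = lambda x: 1 if x[0] ** 2 + x[1] ** 2 <= 1.0 else 0
est, var = mc_volume(quarter_disk, d=2, n=100_000)
print(est)  # near pi/4 ≈ 0.785
```

Unlike the regular-grid approach of Algorithm 6, the error here decays at the dimension-free O(n^{−1/2}) rate discussed above.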
Hoeffding’s Inequality
Hoeffding’s Inequality is a slightly more sophisticated result than Chebyshev’s
inequality. It applies to sums of bounded random variables¹ such as sums of Bernoulli
random variables, which is how we state the following version of Hoeffding’s inequality.
¹ and more generally to sums of sub-Gaussian random variables

Theorem 3.1.2 (Hoeffding’s Inequality). Let B1, . . . , Bn ∼ iid Bernoulli(p) for some
p ∈ (0, 1) and B̄ = n⁻¹ Σ_{i=1}^n B_i. Then, for any ε > 0,

P(|B̄ − p| ≥ ε) ≤ 2 exp(−2nε²).
3.2 Integration
Above, we noticed that computing the volume of some region D ⊂ C can be written
as an integral of some function φ(x) that is 1 for x ∈ D and 0 for x ∉ D. That is,

Vol(D) = ∫_C φ(x) dx ≈ (1/n) Σ_{i=1}^n φ(x_i).
3.2.1 Deterministic Approaches
There are tons of approaches to numerical integration in existence. We will briefly
discuss non-Monte Carlo methods before resuming the MC discussion. Methods
referred to as “Quadrature” typically involve picking points x_i and weights ω_i and
computing

∫₀¹ g(x) dx ≈ Σ_{i=1}^n ω_i g(x_i).

Let 0 = x0 < x1 < . . . < xn = 1. The rectangular or midpoint rule yields

∫₀¹ g(x) dx ≈ Σ_{i=1}^n (x_i − x_{i−1}) g((x_i + x_{i−1})/2),

which is just the width times the height of n rectangles. The trapezoidal rule yields

∫₀¹ g(x) dx ≈ Σ_{i=1}^n (x_i − x_{i−1}) (g(x_i) + g(x_{i−1}))/2,

which linearly interpolates between (x_{i−1}, g(x_{i−1})) and (x_i, g(x_i)). Of course, many
more sophisticated methods exist.
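The two rules above, sketched for an equally spaced grid on [0, 1]:

```python
def midpoint(g, n):
    """Midpoint rule for the integral of g over [0, 1] with n panels."""
    h = 1.0 / n
    return h * sum(g((i + 0.5) * h) for i in range(n))

def trapezoid(g, n):
    """Trapezoidal rule for the integral of g over [0, 1] with n panels."""
    h = 1.0 / n
    return h * (0.5 * g(0.0) + sum(g(i * h) for i in range(1, n)) + 0.5 * g(1.0))

g = lambda x: x * x  # true integral over [0, 1] is 1/3
print(midpoint(g, 100), trapezoid(g, 100))
```

Both have O(n⁻²) error for smooth integrands, in contrast to the O(n^{−1/2}) Monte Carlo rate below, but they scale poorly in dimension.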
If we take each x_i to be a uniform random variate X_i ∼ iid Uniform[0, 1], then we have
for each i = 1, . . . , n that

E g(X_i) = ∫₀¹ g(x) dx

and thus

E( (1/n) Σ_{i=1}^n g(X_i) ) = ∫₀¹ g(x) dx,

which means we have an unbiased estimator of the integral.
Furthermore, we can compute the variance and get

Var( (1/n) Σ_{i=1}^n g(X_i) ) = (1/n²) Σ_{i=1}^n Var(g(X_i)) = (1/n) Var(g(X)).
For X ∼ Uniform[0, 1],

σ² = Var(g(X)) = ∫₀¹ g(x)² dx − (∫₀¹ g(x) dx)² = ∫₀¹ g(x)² dx − µ².

Therefore, we have the usual conclusion that Var( (1/n) Σ_{i=1}^n g(X_i) ) = σ²/n. This can
be estimated by the unbiased estimator

σ̂_n² = (1/(n − 1)) Σ_{i=1}^n ( g(x_i) − (1/n) Σ_{j=1}^n g(x_j) )² = (1/(n − 1)) ( Σ_{i=1}^n g(x_i)² − n µ̂_n² ).
Confidence Intervals
If we have that ∫₀¹ g(x)⁴ dx < ∞,² then the central limit theorem tells us that

(µ̂_n − µ) / √(σ̂_n²/n) →d N(0, 1).
Steaming Computation
For a highly accurate estimate of µ, we may require a very large n. In the above
formulation, computing the variance σ̂n2 requires computation of µ̂n first. Thus, we
would have to save n evaluations g(xi ) in memory, which could become cumbersome.
Instead, we can reformulate the above computations in a streaming fashion that does
not require saving all of the evaluations to memory.
A first attempt would be to write

    σ̂_n² = (1/(n−1)) Σ_{i=1}^n g(x_i)² − (1/(n(n−1))) (Σ_{i=1}^n g(x_i))².

Hence, we could save only Σ_{i=1}^n g(x_i) and Σ_{i=1}^n g(x_i)². However, the use of
subtraction could result in loss of numerical precision.
Algorithm 8 Integral via uniform Monte Carlo
Input:
    function g(x) on the domain [0, 1].
    n ∈ N, the number of random points to simulate.
Algorithm:
    Set S = 0. (Integral Estimate)
    Set V = 0. (Variance Estimate)
    for i = 1, . . . , n
        Generate x ∼ Uniform[0, 1].
        Compute g(x)
        V ← ((i−2)/(i−1)) V + (1/i) (g(x) − S)²  (for i ≥ 2; V stays 0 at i = 1)
        S ← (1/i) g(x) + ((i−1)/i) S
    return estimate S
    return variance V
    return confidence interval S ± z_{1−α/2} √(V/n)
Output:
    Integral estimate along with variance and confidence interval.
Instead, we can update the mean recursively as

    µ̂_n = (1/n) g(x_n) + ((n−1)/n) µ̂_{n−1}

and use similar running updates to compute σ̂_n² at each step of the algorithm, as
in Algorithm 8.
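These recursions can be sketched in code as follows (a one-pass, Welford-style update; the test function is my own choice):

```python
import random

def streaming_mc(g, n, seed=0):
    # S holds the running mean of g, V the running sample variance; both are
    # updated in one pass without storing the evaluations g(x_i).
    rnd = random.Random(seed)
    S, V = 0.0, 0.0
    for i in range(1, n + 1):
        gx = g(rnd.uniform(0.0, 1.0))
        if i >= 2:
            V = ((i - 2) / (i - 1)) * V + (gx - S) ** 2 / i
        S = gx / i + ((i - 1) / i) * S
    return S, V

# For g(t) = t^2 + t^3: mean 7/12 ≈ 0.583 and variance ≈ 0.336.
S, V = streaming_mc(lambda t: t**2 + t**3, 50_000)
```

Memory use is constant in n, which is the point of the streaming formulation.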
3.3 Importance Sampling

In the previous sections, we only considered sampling uniformly from [0, 1]. However,
it is often worth considering non-uniform distributions in order to sample more
efficiently. That is, starting from the above integration problem, say we want to
compute

    µ = ∫_0^1 g(t) dt = E[g(X)]

for X ∼ Uniform[0, 1]. The estimator we had before will be denoted

    µ̂_n = (1/n) Σ_{i=1}^n g(x_i)
for x1 , . . . , xn drawn from the Uniform [0, 1] distribution. Then, given some density
function f (t) which has support on [0, 1] and f (t) > 0 for all t ∈ [0, 1], we can write
    µ = ∫_0^1 g(t) dt = ∫_0^1 (g(t)/f(t)) f(t) dt = E[g(Z)/f(Z)]

for Z ∼ f(z). Numerically speaking, we can simulate z_1, . . . , z_n from f(z) and then
estimate the integral to be

    µ̃_n = (1/n) Σ_{i=1}^n g(z_i)/f(z_i).
Then, the question to consider is, “What should f (z) be in order to get the most
efficient estimate?”
Based on the above equations, we can derive the variance for our new estimator
to be

    Var(g(Z)/f(Z)) = E[(g(Z)/f(Z) − µ)²] = E[g(Z)²/f(Z)²] − µ² = ∫_0^1 g(t)²/f(t) dt − µ².
In contrast, the variance of g(X) for X uniform is ∫_0^1 g(t)² dt − µ². Thus, we can
consider the difference

    Var(g(X)) − Var(g(Z)/f(Z)) = ∫_0^1 g(t)² (1 − 1/f(t)) dt.
Thus, we want this integral to be as big as possible to indicate a large drop in the
variance. Solving for the perfect f is typically not doable. However, the above
equation gives us valuable intuition. In particular, if g(t)² is big then we want
f(t) > 1, and if g(t)² is small then we want f(t) < 1. Thus, relative to the uniform
distribution, we sample more frequently where g² is large and less frequently where
it is small.
Example 3.3.1. If we want to numerically estimate the integral of g(t) = t² + t³,
we can try simulating from the uniform distribution (i.e. Beta(1, 1)) and the
Beta(2, 1) distribution with pdf f(t) = 2t. Computing the variance difference from
above yields

    ∫_0^1 (t² + t³)² (1 − 1/(2t)) dt = ∫_0^1 (t⁴ + 2t⁵ + t⁶) dt − (1/2) ∫_0^1 (t³ + 2t⁴ + t⁵) dt
                                     = (1/5 + 1/3 + 1/7) − (1/8 + 1/5 + 1/12) = 15/56 ≈ 0.268.

The variance under the uniform distribution is

    ∫_0^1 (t² + t³)² dt − (∫_0^1 (t² + t³) dt)² ≈ 0.3359.
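A quick numerical check of this example (my own sketch): Beta(2, 1) has cdf t², so inverse-transform sampling gives Z = √U for U uniform, and we can compare the empirical variances of the plain and importance-sampling estimators.

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda t: t**2 + t**3
n = 200_000

# Plain Monte Carlo with X ~ Uniform[0, 1].
plain = g(rng.uniform(size=n))

# Importance sampling with Z ~ Beta(2, 1): pdf f(t) = 2t, cdf t^2,
# so Z = sqrt(U) by inverse transform.
z = np.sqrt(rng.uniform(size=n))
weighted = g(z) / (2 * z)

# Both averages estimate 1/3 + 1/4; the variance gap should be near 15/56.
var_plain = plain.var(ddof=1)
var_is = weighted.var(ddof=1)
```

Both estimators are unbiased; the importance-sampling one simply has much smaller variance because f(t) = 2t puts more samples where g is large.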
where Var_{A_i}(g(X)) is the variance over each stratum A_i. Therefore, our goal is to
choose the A_i and n_i to reduce the variance.
One way to get a variance reduction is to choose each n_i = nω_i, which simply
says to sample from each A_i proportional to its size. That is,

    Var(µ̂_n^ss) = Σ_{i=1}^k (ω_i/n) Var_{A_i}(g(X))
                = (1/n) Σ_{i=1}^k ω_i E_{A_i}(g(X) − E_{A_i}[g(X)])²
                = (1/n) Σ_{i=1}^k ω_i E_{A_i}((g(X) − E[g(X)]) − (E_{A_i}[g(X)] − E[g(X)]))²
                = (1/n) Σ_{i=1}^k ω_i (E_{A_i}(g(X) − E[g(X)])² − (E_{A_i}[g(X)] − E[g(X)])²)
                = Var(µ̂_n) − (1/n) Σ_{i=1}^k ω_i (E_{A_i}[g(X)] − E[g(X)])²,
which is less than or equal to Var (µ̂n ). The main intuition here is that we are
forcing the sampling to be more evenly spread out. However, this approach can still
be outperformed with more clever thought. In particular, this above partitioning
does not take the function g into account at all.
If we instead partition into intervals [0, 0.5] and [0.5, 1] with n/2 points in each, we
get

    µ̂_n^ss = Σ_{i=1}^2 (1/2) (1/(n/2)) Σ_{j=1}^{n/2} x_{i,j} = (1/n) Σ_{i=1}^2 Σ_{j=1}^{n/2} x_{i,j},

which is the same as µ̂_n except that exactly half of the samples are in each of the
two subintervals. The variance of this estimator is

    Var(µ̂_n^ss) = Σ_{i=1}^2 ((1/2)²/(n/2)) Var_{A_i}(X) = Σ_{i=1}^2 (1/(2n)) ((1/2)²/12) = 1/(48n),

recalling that the variance of the Uniform[a, b] distribution is (b − a)²/12.
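The factor-of-4 reduction versus Var(µ̂_n) = 1/(12n) can be checked empirically (a sketch; the repetition counts are my own choices). Here g(x) = x, so each estimator targets E X = 1/2:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 20_000

def plain_mean():
    return rng.uniform(0.0, 1.0, n).mean()

def stratified_mean():
    # n/2 points in each of [0, 0.5] and [0.5, 1], strata weighted by 1/2.
    a = rng.uniform(0.0, 0.5, n // 2)
    b = rng.uniform(0.5, 1.0, n // 2)
    return 0.5 * a.mean() + 0.5 * b.mean()

plain = np.array([plain_mean() for _ in range(reps)])
strat = np.array([stratified_mean() for _ in range(reps)])
# Theory: Var(plain) = 1/(12n) and Var(strat) = 1/(48n).
```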
Chapter 4
Stochastic Optimization
Introduction
In the last chapter, we saw how Monte Carlo methods can be used to estimate
volumes and integrals. In this chapter we will consider a different but also important
application of Monte Carlo methods: Optimization.
In many areas of application, we find ourselves having to solve similar minimiza-
tion / maximization problems. For example, in parametric statistics, we often wish
to find a model that maximizes the likelihood. Alternatively, we may want to choose
model parameters that best minimize a loss or risk function. In this chapter, we will
formulate all problems as minimizing some loss function L(θ) which maps a vector
θ ∈ Rd into R. The goal will be to find θ∗ = arg minθ {L(θ)}. Of course, any of the
ideas below can be easily reformulated into a maximization problem.
There are also two settings we will consider: noise-free and noisy optimization.
In the first case, we simply try to minimize the deterministic function L(θ), which
nevertheless may be a very complex function to minimize. In the second case, we
do not get to observe L(θ), but instead observe L(θ) + εθ where εθ is a random error
in the measurement that may depend on θ.
Algorithm 9 Simple Random Search
Input:
Loss function L(θ) to minimize.
Domain Θ to search over.
Maximum number of samples n.
Algorithm:
Pick θbest ∈ Θ uniformly at random
Set Lbest = L(θbest )
for i from 2 to n
Pick θ ∈ Θ uniformly at random
Compute L = L(θ)
if L < Lbest (i.e. if we do improve)
Set Lbest = L and θbest = θ
Output:
{θbest , Lbest }
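Algorithm 9 can be sketched as follows (the toy loss and domain Θ = [0, 1] are my own choices, not from the text):

```python
import random

def simple_random_search(L, sample_theta, n, seed=0):
    # Draw theta uniformly from the domain n times, keeping the best seen.
    rnd = random.Random(seed)
    theta_best = sample_theta(rnd)
    L_best = L(theta_best)
    for _ in range(n - 1):
        theta = sample_theta(rnd)
        val = L(theta)
        if val < L_best:                 # keep only strict improvements
            theta_best, L_best = theta, val
    return theta_best, L_best

L = lambda t: (t - 0.7) ** 2             # minimized at theta* = 0.7
theta_best, L_best = simple_random_search(L, lambda r: r.uniform(0.0, 1.0), 5_000)
```

The method never uses past information, which is exactly what the localized variants below try to improve on.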
• The minimizer θ∗ is unique, and for all ε > 0, inf kθ−θ∗ k>ε L(θ) > L(θ∗ )
• A “continuity”-like condition: for all ε > 0, there exists δ > 0 such that
which basically means that θ∗ cannot be some isolated singleton point with respect
to how we are randomly generating θ's.
However, we won’t prove that in these notes as we would have to discuss what
convergence almost surely means first as well as other analysis topics.
Algorithm 10 Localized Random Search
Input:
Loss function L(θ) to minimize.
Domain Θ to search over.
Maximum number of samples n.
Distribution f to sample from (typically standard normal)
Algorithm:
Pick θ1 ∈ Θ uniformly at random
Set L1 = L(θ1 )
for i from 2 to n
Pick v from distribution f
Set θi = θi−1 + v
if θi ∉ Θ
Set θi ← θi−1 and Li ← Li−1
Go to next i
Compute Li = L(θi )
if Li ≥ Li−1 (i.e. if we don’t improve)
Set Li ← Li−1 and θi ← θi−1
Output:
{θn , Ln }
In the localized search method detailed in Algorithm 10, we move from one θ to
the next by randomly perturbing it by some random vector v. That is, the new θ
becomes θ + v. Thus, this method randomly walks around the domain Θ looking
for the minimum. It can be shown that if
• L is continuous
• Θ is bounded
then our sequence of θ’s will get arbitrarily close to the minimizer θ∗ as n → ∞. If
there is more than one minimizer—i.e. θ∗ is not unique—then this algorithm will
eventually become close to one of the minmizers.
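A minimal sketch of Algorithm 10 on the one-dimensional domain Θ = [0, 1] (the loss and step size are my own choices):

```python
import random

def localized_random_search(L, theta0, n, step=0.5, seed=0):
    # Random-walk search: perturb by a normal step v and accept the move
    # only if it stays in [0, 1] and strictly lowers the loss.
    rnd = random.Random(seed)
    theta, val = theta0, L(theta0)
    for _ in range(n - 1):
        cand = theta + rnd.gauss(0.0, step)
        if 0.0 <= cand <= 1.0:
            cand_val = L(cand)
            if cand_val < val:
                theta, val = cand, cand_val
    return theta, val

L = lambda t: (t - 0.3) ** 2             # minimized at theta* = 0.3
theta_n, L_n = localized_random_search(L, theta0=0.9, n=5_000)
```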
4.1.3 Enhanced Localized Search
The last algorithm we consider in this section is almost identical to the above local-
ized search method. However, it is “enhanced” by adding a bias term that pushes
the random walk in a given direction if that direction looks to be minimizing the
loss function L. This bias term gives the random walk some inertia to keep drifting
in the same direction (roughly). This is detailed in Algorithm 11.
Specifically, the update step from the localized search,
LRS: θi ← θi−1 + v,
is replaced with
ERS: θi ← θi−1 + bi−1 + v,
where bi−1 is the bias term computed on the previous step of the search algorithm.
If the new θi gives an improvement in the loss value—i.e. L(θi ) < L(θi−1 )—then
the new bias term is
bi ← 0.2bi−1 + 0.4v.
Otherwise, the new bias term is a shrunken version of the previous term bi ←
0.5bi−1 . The choices of these scaling values (0.2,0.4,0.5) were picked by the algorithm
designers, but other values could be considered as well. There is nothing inherently
perfect about those choices other than that they seem to work well in practice.
More references to such enhanced localized search methods are discussed on page 45,
chapter 2, of Spall (2005).
Algorithm 11 Enhanced Localized Search
Input:
Loss function L(θ) to minimize.
Domain Θ to search over.
Maximum number of samples n.
Distribution f to sample from (typically standard normal)
Algorithm:
Pick θ1 ∈ Θ uniformly at random
Set L1 = L(θ1 )
Set b1 = 0
for i from 2 to n
Pick v from distribution f
Set θi = θi−1 + bi−1 + v (try adding v)
if θi ∈ Θ and Li := L(θi ) < L(θi−1 )
Set bi = 0.2bi−1 + 0.4v
Go to next i
Set θi = θi−1 + bi−1 − v (try subtracting v)
if θi ∈ Θ and Li := L(θi ) < L(θi−1 )
Set bi = 0.2bi−1 − 0.4v
Go to next i
Set bi = 0.5bi−1
Set Li ← Li−1 and θi ← θi−1 (or keep θ unchanged)
Output:
{θn , Ln }
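A sketch of Algorithm 11 in one dimension (again with my own toy loss and domain Θ = [0, 1]); the bias b carries momentum from recently successful moves:

```python
import random

def enhanced_localized_search(L, theta0, n, step=0.5, seed=0):
    # Localized search plus a bias term: try theta + b + v, then theta + b - v;
    # on success reinforce the bias, otherwise shrink it by half.
    rnd = random.Random(seed)
    theta, val, b = theta0, L(theta0), 0.0
    for _ in range(n - 1):
        v = rnd.gauss(0.0, step)
        improved = False
        for sign in (1.0, -1.0):
            cand = theta + b + sign * v
            if 0.0 <= cand <= 1.0 and L(cand) < val:
                theta, val = cand, L(cand)
                b = 0.2 * b + 0.4 * sign * v
                improved = True
                break
        if not improved:
            b = 0.5 * b
    return theta, val

L = lambda t: (t - 0.3) ** 2
theta_n, L_n = enhanced_localized_search(L, theta0=0.9, n=5_000)
```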
Thus, we have reduced the variance by a factor of 1/k. However, this would require
k-times more evaluations of L at every step. This may be prohibitively expensive
depending on how hard it is to compute loss measurements.
Instead, we can set a threshold for improvement. That is, in the originally stated
(noise-free) algorithms, we update θ_{i−1} to θ_i if L(θ_i) < L(θ_{i−1}). In this noisy setting,
we can pick a threshold τ_i, which could be dependent on the iteration i, and update
the parameter vector θ only if

    L(θ_i) < L(θ_{i−1}) − τ_i.

The intuition comes from hypothesis testing. If τ_i is, say, two standard deviations
of the noise, then we only accept a new θ_i when the observed drop in the loss
exceeds two standard deviations.
This means that we would have only a 2.3% chance of picking a worse choice of
parameters θ. Of course, this conceptualization assumes we know what the unknown
noise η 2 is. In practice, we would have to treat τ as a tuning parameter. If τ is
small, we jump around Θ more liberally. If τ is large, we require a large drop in the
observed loss value before we jump to a new θ.
Let

    y(θ) = ∂Q(θ, v)/∂θ

be the function for which we want to find a root. The function y(θ) is a noisy
measurement of the gradient of Q.
Robbins-Monro

The Robbins-Monro algorithm dates back to the 1950s and will look very similar to
Newton-Raphson. In short, we require a sequence of step-sizes, sometimes called a
gain sequence, a_k such that

    Σ_{k=1}^∞ a_k = ∞  and  Σ_{k=1}^∞ a_k² < ∞,

which basically means that the step-sizes a_k → 0 but not too fast. This allows us
to explore the space Θ, but eventually calm down into a final solution.
Algorithm 12 The Robbins-Monro Algorithm
Input:
y : Θ → R, a function to find the root
θ0 ∈ Θ, an initial value in Θ
τ ∈ R+ , a threshold for convergence
Algorithm:
For k ≥ 1
θk = θk−1 − ak y(θk−1 )
if kθk − θk−1 k < τ
stop
return θk
Output:
θk such that y(θk ) ≈ 0
Given the parameter vector θ_k from the previous step, the update for the Robbins-
Monro algorithm is

    θ_{k+1} ← θ_k − a_k y(θ_k).

The algorithm is detailed in Algorithm 12. A standard choice for the step-size
sequence a_k is a_k = a/k for some constant a > 0. Besides this sequence, a few more
technical conditions must be satisfied to prove convergence. These include additive
mean-zero noise, a guarantee that the variance doesn't blow up, and convexity of
L(θ). However, convergence criteria take a lot of background knowledge to discuss
properly and thus will be skipped in these notes.
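A sketch of Algorithm 12 with gain a_k = a/k on a toy root-finding problem (the noisy measurement y is my own choice):

```python
import random

def robbins_monro(y, theta0, a=1.0, tol=1e-6, max_iter=100_000):
    # Step against the noisy measurement y with decaying gain a/k until
    # successive iterates barely move.
    theta = theta0
    for k in range(1, max_iter + 1):
        new = theta - (a / k) * y(theta)
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta

# E[y(theta)] = theta - 2, so the root is theta* = 2.
rnd = random.Random(0)
root = robbins_monro(lambda t: t - 2.0 + rnd.gauss(0.0, 0.1), theta0=0.0)
```

The decaying gain is what lets the iterates settle despite every measurement being noisy; a constant step would keep jittering around the root.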
Kiefer–Wolfowitz
Much like the Robbins-Monro algorithm from the previous section, the update step
for the Kiefer-Wolfowitz algorithm is
Algorithm 13 The Kiefer–Wolfowitz Algorithm
Input:
y : Θ → R, a function to find the root
θ0 ∈ Θ, an initial value in Θ
τ ∈ R+ , a threshold for convergence
Algorithm:
For k ≥ 1
Estimate the gradient ŷk (θk−1 )
θk = θk−1 − ak ŷk (θk−1 )
if kθk − θk−1 k < τ
stop
return θk
Output:
θk such that y(θk ) ≈ 0
    θ_{k+1} ← θ_k − a_k ŷ_k(θ_k),

where ŷ_k(θ_k) is an estimate of the gradient at the input θ_k. To estimate the gradient,
we require a second step-size sequence c_k. Then, we get

    ŷ_k(θ_k) = ( [Q(θ_k + c_k e_1) − Q(θ_k − c_k e_1)]/(2c_k), . . . , [Q(θ_k + c_k e_d) − Q(θ_k − c_k e_d)]/(2c_k) )^T,

where e_1, . . . , e_d are the standard basis vectors.
One natural choice for these sequences is a_k = a/k for some a > 0 as before and
c_k = c/k^α for some c > 0 and some α ∈ (0, 1/2), e.g. α = 1/3 or α = 1/4, but
α ≠ 1/2 as then the final sum will not converge.
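A sketch of Algorithm 13 with a_k = a/k and c_k = c/k^{1/3} on a noisy quadratic (the loss, gains, and iteration count are all my own choices):

```python
import numpy as np

def kiefer_wolfowitz(Q, theta0, a=0.5, c=0.5, alpha=1/3, n_iter=2_000):
    # Finite-difference gradient estimate along each basis vector,
    # followed by a Robbins-Monro style step.
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    for k in range(1, n_iter + 1):
        ak, ck = a / k, c / k**alpha
        grad = np.empty(d)
        for j in range(d):
            e = np.zeros(d)
            e[j] = 1.0
            grad[j] = (Q(theta + ck * e) - Q(theta - ck * e)) / (2 * ck)
        theta = theta - ak * grad
    return theta

rng = np.random.default_rng(3)
target = np.array([1.0, -2.0])
Q = lambda t: np.sum((t - target) ** 2) + rng.normal(0.0, 0.01)
theta_hat = kiefer_wolfowitz(Q, theta0=[0.0, 0.0])
```

Note the cost: 2d evaluations of Q per iteration, which motivates the random-direction variant below in high dimensions.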
Rather than perturbing one coordinate at a time, we can instead draw a random
perturbation vector v from some suitable distribution (e.g. normal) and estimate
the gradient at only 2 points. The gradient equation is

    ŷ_k(θ_k) = ([Q(θ_k + c_k v) − Q(θ_k − c_k v)]/(2c_k)) (v_1, . . . , v_d)^T.
Thus, a method like this is nicer in high dimensions especially if Q is costly to
evaluate.
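A sketch of this random-direction variant on a noisy quadratic; note there are only two evaluations of Q per iteration, regardless of the dimension d (all constants here are my own choices):

```python
import numpy as np

def random_direction_search(Q, theta0, a=0.2, c=0.5, alpha=1/3,
                            n_iter=5_000, seed=4):
    # Perturb along a single random vector v and scale v by the observed
    # two-point difference: two Q evaluations per step, however large d is.
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        ak, ck = a / k, c / k**alpha
        v = rng.standard_normal(theta.size)
        diff = (Q(theta + ck * v) - Q(theta - ck * v)) / (2 * ck)
        theta = theta - ak * diff * v
    return theta

noise = np.random.default_rng(5)
target = np.array([1.0, -2.0, 0.5])
Q = lambda t: np.sum((t - target) ** 2) + noise.normal(0.0, 0.01)
theta_hat = random_direction_search(Q, theta0=np.zeros(3))
```

Since E[v vᵀ] = I for standard normal v, the scaled difference is an (inflated-variance) unbiased estimate of the gradient in expectation.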
Taking a derivative means we can consider the Robbins-Monro algorithm with the
following update equation:

    θ_{k+1} = θ_k + a_k [z_k − h(θ_k, x_k)] (∂h/∂θ)(θ_k, x_k).

In this case, we consider each input-output pair one at a time rather than all at
once. This is useful, for example, with large and/or streaming data sets.
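A sketch of this update with a linear model h(θ, x) = θᵀx (the data-generating process is my own toy choice; the text's h may be nonlinear):

```python
import numpy as np

rng = np.random.default_rng(6)
theta_true = np.array([2.0, -1.0])

theta = np.zeros(2)
a = 1.0
for k in range(1, 20_001):
    x = rng.standard_normal(2)                 # one input at a time
    z = theta_true @ x + rng.normal(0.0, 0.1)  # noisy observation z_k
    resid = z - theta @ x                      # z_k - h(theta, x_k)
    theta = theta + (a / k) * resid * x        # dh/dtheta = x for linear h
```

Each pair (x_k, z_k) is used once and discarded, so the memory cost does not grow with the amount of data.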
The above formulation can be applied to any least squares problem where h and
the gradient of h can be computed. For example, h could be some nonlinear
regression equation or a function described by a neural network. The layered
construction of a neural network allows the gradient of h to be evaluated via
backpropagation.
Appendix A
Appendix
The determinant of J will be of particular interest and is often also referred to simply
as the Jacobian. Note that we want this determinant to be non-zero. Otherwise,
the mapping from x → y is not invertible. The change-of-variables formula is as
follows. Note that more general formulations of this result exist, but we stick with
a simpler one suitable for our purposes.
Proposition A.1.1 (Change of Variables). Let U ⊂ R^n be an open set, and let the
change of variables function ϕ : x → y be one-to-one, differentiable, have continuous
partial derivatives, and non-zero Jacobian for all x ∈ U. Then, for any real-valued
function f that is continuous with support contained in ϕ(U),

    ∫_{ϕ(U)} f(y) dy = ∫_U f(ϕ(x)) |det(J)| dx.
In this case, the Jacobian matrix is

    J = [ ∂x/∂r  ∂x/∂φ  ∂x/∂θ ]   [ sin φ cos θ   r cos φ cos θ   −r sin φ sin θ ]
        [ ∂y/∂r  ∂y/∂φ  ∂y/∂θ ] = [ sin φ sin θ   r cos φ sin θ    r sin φ cos θ ]
        [ ∂z/∂r  ∂z/∂φ  ∂z/∂θ ]   [ cos φ         −r sin φ         0             ]

and the determinant is det(J) = r² sin φ. Let ϕ : (r, φ, θ) → (x, y, z). If we have a
region U ⊂ R³ and function f, then

    ∫∫∫_{ϕ(U)} f(x, y, z) dx dy dz = ∫∫∫_U f(r sin φ cos θ, r sin φ sin θ, r cos φ) r² sin φ dr dφ dθ.
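As a quick numerical check of this formula (my own sketch), integrating f ≡ 1 over the unit ball via the right-hand side should return the volume 4π/3:

```python
import math

# Midpoint rule over r in [0, 1] and phi in [0, pi]; the integrand
# r^2 sin(phi) does not involve theta, so that factor is just 2*pi.
n = 60
total = 0.0
for i in range(n):
    r = (i + 0.5) / n
    for j in range(n):
        phi = (j + 0.5) * math.pi / n
        total += r * r * math.sin(phi)
vol = total * (1.0 / n) * (math.pi / n) * (2.0 * math.pi)
```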
    f_{W,Z}(w, z) = f_{X,Y}(x(w, z), y(w, z)) |(∂x/∂w)(∂y/∂z) − (∂x/∂z)(∂y/∂w)|,

which is used (among other places) in the proof of the Box-Muller transform and
the Ratio of Uniforms method.
Bibliography