
Computing for Data Science

or Stochastic Computation and Algorithms


Course notes for STAT 413

Adam B Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1

February 13, 2024


This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
Contents

Preface

1 Random Number Generation
  1.1 Probability Integral Transform
    1.1.1 CDFs with computable inverses
    1.1.2 CDFs without computable inverses
    1.1.3 CDFs with discrete distributions
  1.2 Other Transformations
    1.2.1 Box-Muller Transform
  1.3 Acceptance-Rejection Sampling
    1.3.1 Acceptance-Rejection-Squeeze
    1.3.2 Box-Muller without Trigonometry
  1.4 Ratio of Uniforms
    1.4.1 Gamma Random Variables
  1.5 Points in Geometric Objects
    1.5.1 Simplexes and the Dirichlet Distribution
    1.5.2 Spheres

2 Model Checking
  2.1 The Jackknife
  2.2 The Bootstrap
    2.2.1 Resampling Bootstrap
    2.2.2 Parametric Bootstrap
    2.2.3 Smooth Bootstrap
  2.3 Cross Validation

3 Volume Computation and Integration
  3.1 Volume Computation
    3.1.1 Sampling from a regular grid
    3.1.2 Uniform Monte Carlo
  3.2 Integration
    3.2.1 Deterministic Approaches
    3.2.2 Monte Carlo Integration
  3.3 Importance Sampling
  3.4 Stratified Sampling

4 Stochastic Optimization
  4.1 Basic Stochastic Search
    4.1.1 Simple Random Search
    4.1.2 Localized Random Search
    4.1.3 Enhanced Localized Search
    4.1.4 Noisy Loss Function
  4.2 Stochastic Approximation
    4.2.1 Gradient Methods
    4.2.2 Gradient-free Methods
  4.3 Example: Least Squares and Neural Networks

A Appendix
  A.1 Change of Variables
Preface

Never really seized by the need to calculate, he was more


apt to be aware of pattern than of brute numeration.
Ratner’s Star
Don DeLillo (1976)

This course came into being a few years ago. At its birth, it was called Statistical Computing. Then, it was rebranded as Computing for Data Science. If I were to give it a more descriptive name, I would call it Stochastic Computation and Algorithms. The main focus of these notes is how to calculate a desired quantity using randomization. In Chapter 1, we first discuss how to generate random numbers using a deterministic computer. It has been said that "It may seem perverse to use a computer, that most precise and deterministic of all machines conceived by the human mind, to produce 'random' numbers" (Press et al., Numerical Recipes). Nevertheless, we require random numbers, for some definition of random, in order to proceed with stochastic algorithms. In Chapter 3, we consider how to use stochastic algorithms to compute volumes and integrate functions. In Chapter 4, we consider how to use stochastic algorithms to optimize a function.

There are three textbooks I used as inspiration for these course notes: George Fishman's Monte Carlo: concepts, algorithms, and applications (Fishman, 2013); the famous Numerical Recipes by Press, Teukolsky, Vetterling, and Flannery; and James Spall's Introduction to stochastic search and optimization (Spall, 2005).

Adam B Kashlak
Edmonton, Canada
Fall 2022

One year later, I decided to add a short new chapter, Chapter 2, on model
checking. Here, we will discuss simulating data to check the performance of a model.
Of note, we will cover an introduction to cross validation and bootstrapping.
This came about after discussions with a past student who is currently working at ATB Financial. According to her, most of the work there revolves around using
Monte Carlo methods to check and validate models. Hence, I thought it would be
a valuable addition to this course.

Adam B Kashlak
Edmonton, Canada
Winter 2024

Chapter 1

Random Number Generation

Introduction
Type ?set.seed into the R command prompt and start reading under the “Details”
heading. It is easy to take random number generation for granted, but an incredible
amount of work has gone into making sure that your simulations in R or Python or
other languages are as random as possible given a deterministic computer. Many
of the implemented methods within R or Python or other computing languages are
based on algorithms such as Linear Feedback Shift Registers (LFSR) and Linear
Congruential Generators (LCG). More details on these methods can be found in
Chapter 7 of Fishman (2013) and Chapter 7 of Numerical Recipes (Press et al.). When discussing "random" numbers generated on a computer, they are often called pseudorandom numbers as they are deterministic but satisfy many properties of a set of truly random numbers.
For this chapter, we assume that it is possible to generate independent uniformly random binary variables. That is, for any n ∈ N, we can generate B1, . . . , Bn such that P(Bi = 1) = P(Bi = 0) = 0.5 and such that Bi and Bj are independent for all i ≠ j. At the time of writing, pure randomness has not been achieved, and one has to be cautious of the fact that such a sequence of Bi's is necessarily deterministic, but sufficiently chaotic to be considered as random.
From uniformly random bits come uniformly random variates on the unit interval [0, 1]. We will denote U1, . . . , Un ∼ iid Uniform [0, 1] to be n independent uniform random variates. This implies that the cumulative distribution function (CDF) for Ui is

P(Ui ≤ u) = 0 for u ≤ 0,  u for u ∈ (0, 1),  1 for u ≥ 1,

and Ui ⊥⊥ Uj for i ≠ j.1 Generating such uniform variates can be done in R via the
function runif( ). This function directly calls compiled C code that implements
1 The symbol ⊥⊥ is often used to indicate independence between two random variables.

the chosen random generator from set.seed( ).
Note that the events U = 0 and U = 1 have probability zero of occurring. In
fact, in the C code underlying the R code for random number generation, if the
pseudorandom uniform variate is exactly 0 or 1, the code adjusts very slightly up
or down, respectively, so that runif( ) never actually returns 0 or 1.

Remark 1.0.1. In probability & measure theory, there is a theorem proving the
existence of a countably infinite collection of independent random variables on R. The technique of the proof uses precisely the same approach as the computer does. That is, (1) generate independent Bernoulli random variables; (2) use those to construct independent uniform random variables; (3) use the uniforms to construct a general collection of
independent random variables. See STAT 571 notes for more details on this proof.

1.1 Probability Integral Transform


Thus far, we have our collection of iid uniform random variates U1 , . . . , Un . However,
more interesting and diverse distributions are often desirable. We will use U1 , . . . , Un
to construct new random variables with arbitrary probability distributions. The first
tool we will make use of is the probability integral transform.

Theorem 1.1.1 (Probability Integral Transform). Let F : R → [0, 1] be a strictly increasing continuous function such that lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. Then, F^{−1} : [0, 1] → R exists, and if U ∼ Uniform [0, 1], then

Z = F^{−1}(U)

has F as its cumulative distribution function.

Proof. We want the CDF for Z = F^{−1}(U). That is,

P(Z ≤ t) = P(F^{−1}(U) ≤ t) = P(U ≤ F(t)) = F(t).

Alternatively, this theorem implies that if Z has a continuous increasing CDF F, then F(Z) is uniform on [0, 1]. Note that this theorem can be extended to CDFs with discontinuities by defining the inverse CDF as follows:

F^{−1}(u) = inf{t ∈ R : F(t) ≥ u}

for u ∈ [0, 1]. A jump discontinuity in a CDF implies a point mass exists in the distribution. Only a countable number of these are possible.

1.1.1 CDFs with computable inverses
If the function F has an inverse that exists and furthermore if that inverse can be
expressed in a closed form analytically, then generating random variables from F
is as easy as evaluating the inverse F −1 . However, just because generating random
variables from F is easy to code does not imply that this approach is computationally
efficient. More on this will be discussed in Section 1.2.

Example 1.1.1 (Exponential Distribution). The exponential distribution is critically important in queuing processes, Markov processes, and other models where a "memoryless" waiting time is required. Thus, being able to simulate exponential random variates is very desirable.
The exponential distribution with rate parameter λ > 0 has PDF and CDF, respectively,

f(x) = λ e^{−λx} and F(x) = 1 − e^{−λx}.

Hence, F^{−1}(u) = −λ^{−1} log(1 − u). However, if U ∼ Uniform [0, 1] then 1 − U ∼ Uniform [0, 1]. Thus, to generate exponential random variates with rate λ, we can compute

X = −(1/λ) log(U).
The parameter λ is referred to as the rate parameter as it can be interpreted as
the reciprocal of the expected waiting time. That is, EX = 1/λ. Hence, if we are
modelling, say, the time until the next lightning strike occurs in a thunderstorm or
the next atom radioactively decays, a larger λ means a faster rate of occurrences.
Note that while this is an easy way to generate exponential random variates, it
is not the way that R does it with the rexp() function in the base stats package.
This will be discussed in a subsequent section.
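As a minimal illustration (and not the rexp() implementation), inverse-transform sampling for the exponential can be coded in R as follows; the function name rexp_inv and the choice λ = 2 are our own for this sketch.

# Inverse transform sampler for Exponential(lambda): X = -log(U)/lambda.
# Illustrative sketch only; R's rexp() uses a different algorithm.
rexp_inv <- function(n, lambda) {
  -log(runif(n)) / lambda
}

set.seed(413)
x <- rexp_inv(1e5, lambda = 2)
c(mean(x), 1 / 2)   # sample mean versus the expected waiting time 1/lambda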

Example 1.1.2 (Cauchy Distribution). The Cauchy distribution is a Student's t-distribution with only 1 degree of freedom. While it has a bell-curve shape similar to the normal distribution, none of the moments of the Cauchy distribution are defined; e.g. it does not have a finite mean.
A random variable X is Cauchy if it has the PDF and CDF, respectively,

f(x) = 1/(π(1 + x^2)) and F(x) = (1/π) arctan(x) + 1/2.

Thus, we can write X = F^{−1}(U) = tan(π(U − 1/2)) to generate Cauchy random variates X given U ∼ Uniform [0, 1].
The R source code uses this approach for generating Cauchy random variates via rcauchy(). For the more general t-distribution, one can note that if Z ∼ N(0, 1), V ∼ χ^2(ν), and Z ⊥⊥ V, then X = Z/√(V/ν) ∼ t(ν). Thus, the R code in rt.c generates a standard normal random variate Z and a chi-squared variate V in order to return a t random variate, which will be Cauchy if ν = 1.

1.1.2 CDFs without computable inverses
It often is the case that F^{−1} exists, but no closed-form analytic expression is known. The most obvious example of this is the normal or Gaussian distribution. The CDF is written in terms of an integral

F(x) = (1/√(2π)) ∫_{−∞}^x e^{−t^2/2} dt,

which does not have a closed form.


However, such an issue does not completely ruin our ability to use the probability integral transform. Indeed, if we wish to compute x = F^{−1}(u) for some u ∈ [0, 1], this is equivalent to finding an x such that 0 = F(x) − u. Hence, we have a root-finding problem, which can be solved by many algorithms. Most notably, we can use the Newton-Raphson method, which is detailed in Algorithm 1.

The Newton-Raphson Method


The Newton-Raphson algorithm can be used to find the root of F(x) − u assuming that the PDF f(x) = F′(x) exists and that both f(x) and F(x) can be evaluated on a computer. The idea behind the Newton-Raphson algorithm is to start at a point x0 in the domain D of F. Then, a tangent line can be drawn passing through the point (x0, F(x0) − u). Let x1 be the root of this tangent line; then

f(x0) = (F(x0) − u)/(x0 − x1),

and solving for x1 gives

x1 = x0 − (F(x0) − u)/f(x0).

The algorithm can be iterated by repeating the above with x1 in place of x0 to get a new root x2. This may continue until convergence is achieved.
It can be shown using Taylor’s Theorem that if F has a continuous second
derivative, then the Newton-Raphson algorithm converges quadratically. Indeed,
denoting the true root by x∞, we have that

|x_{i+1} − x∞| = Ci (xi − x∞)^2.

But this requires some additional conditions such as f(x∞) = F′(x∞) ≠ 0. Otherwise, convergence can be hindered.
Furthermore, the choice of starting point is very important for quick convergence.
Also, certain functions will cause this algorithm to fail. For example, try applying Newton-Raphson to F(x) = √x for x ≥ 0 and F(x) = −√(−x) for x < 0 with an initial value of x0 = 1 and see what happens.

Algorithm 1 The Newton-Raphson Algorithm
Input:
F : D → R, an invertible differentiable function
f : D → R+ , the derivative f = F 0
x0 ∈ D, an initial value in the domain of F
τ ∈ R+ , a threshold for convergence
Algorithm:
For i ≥ 1
xi = xi−1 − F (xi−1 )/f (xi−1 )
if |xi − xi−1 | < τ
stop
return xi
Output:
xi such that F (xi ) ≈ 0
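As a hedged sketch of Algorithm 1, the following R code inverts the standard normal CDF by Newton-Raphson, using the built-in pnorm for F and dnorm for f; the function name qnorm_newton and the default starting point x0 = 0 are our own choices.

# Newton-Raphson inversion of the standard normal CDF:
# iterate x <- x - (F(x) - u)/f(x) until the step size is below tau.
qnorm_newton <- function(u, x0 = 0, tau = 1e-12, maxit = 100) {
  x <- x0
  for (i in 1:maxit) {
    step <- (pnorm(x) - u) / dnorm(x)
    x <- x - step
    if (abs(step) < tau) break
  }
  x
}

u <- c(0.025, 0.5, 0.975)
rbind(newton = sapply(u, qnorm_newton), exact = qnorm(u))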

Remark 1.1.3 (Halley's Method). Many other methods of root finding exist. One method, known as Halley's method, can be achieved by applying Newton-Raphson to the function F(x)/√|F′(x)|, assuming that for any x such that F(x) = 0, we have that F′(x) ≠ 0. In Halley's method, the update step of

x_{i+1} = xi − F(xi)/F′(xi)

is replaced by

x_{i+1} = xi − 2F(xi)F′(xi) / (2F′(xi)^2 − F(xi)F″(xi)).

This approach comes with its own assumptions and concerns, which will be left to other courses to investigate. See "Householder Methods" for even more.

The Bisection Method


In the context of random variate generation, the existence of a CDF usually implies
the existence of a PDF. However, given a function F to find a root, we may not
have access to the derivative f . In that case, we require a gradient-free root finding
method. Many of these exist. In this section, we will highlight a simple algorithm
applicable when it is known that the root of F lies in the finite interval [a, b] and
F (a) < 0 and F (b) > 0.2 This is the Bisection algorithm detailed in Section 2.3
of Fishman (2013) and written out in Algorithm 2. It is effectively a binary search
algorithm.
2 Or alternatively, F(a) > 0 > F(b); just as long as the signs are different.

Much like a binary search algorithm, the Bisection algorithm is guaranteed to stop after ⌈log_2((b − a)/τ1)⌉ steps. Hence, the tighter the interval [a, b] is, the quicker the algorithm will converge; so, like the starting value for Newton-Raphson, the closer you are to the true root, the faster the algorithm will terminate.

Example 1.1.4 (Beta Distribution). The beta distribution commonly occurs as a Bayesian prior and posterior distribution when a parameter of interest lies in the unit interval. If X ∼ Beta(α, β), then the PDF is

f(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1}

for x ∈ [0, 1] where B(α, β) = Γ(α)Γ(β)/Γ(α + β) is the beta function and Γ( ) is the gamma function.3 The CDF is known as the regularized incomplete beta function,

F(x) = (1/B(α, β)) ∫_0^x t^{α−1} (1 − t)^{β−1} dt.

Of note, when α = β = 1/2, the distribution is known as the arcsine distribution as its CDF is

F(x) = (2/π) arcsin(√x)

for x ∈ [0, 1]. Finding a root x of F(x) − u for some u ∈ [0, 1] could be done with the Bisection algorithm.
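For instance, here is a hedged R sketch of Algorithm 2 applied to the arcsine distribution, using the built-in pbeta as the CDF; the function name bisect_root and the tolerances are our own choices.

# Bisection on [a, b] for F(x) - u = 0, assuming F(a) - u < 0 < F(b) - u.
bisect_root <- function(F, u, a = 0, b = 1, tau1 = 1e-10, tau2 = 1e-12) {
  x0 <- a; x1 <- b
  repeat {
    z <- (x0 + x1) / 2
    Fz <- F(z) - u
    if (Fz < 0) x0 <- z else x1 <- z
    if (abs(x1 - x0) < tau1 || abs(Fz) < tau2) break
  }
  z
}

# The arcsine distribution is Beta(1/2, 1/2); compare against qbeta.
u <- 0.3
c(bisect = bisect_root(function(x) pbeta(x, 0.5, 0.5), u),
  exact = qbeta(u, 0.5, 0.5))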

1.1.3 CDFs with discrete distributions


There are many discrete probability distributions that we may wish to simulate from. These include the binomial, Poisson, geometric, and hypergeometric distributions. In such a case, we have a discrete support D = {x0, x1, x2, . . .} and a CDF F such that

0 < F(x0) < F(x1) < . . . < F(xi) < F(x_{i+1}) < . . . ≤ 1.

If the support is finite with, say, |D| = m, then F(xm) = 1. Otherwise, F(xi) → 1 as i → ∞.4
One easy approach (coding-wise) to generating random variates from a discrete distribution is to first compute the ordered values F(xi) ∈ [0, 1]. Then, generate a random U ∼ Uniform [0, 1] and find the smallest i such that F(xi) ≥ U (see the sketch below). In practice, the C code written to generate, say, Poisson or binomial random variables for R is immensely complex, needing to consider various extreme cases and numerical stability.
3 For n ∈ N, Γ(n) = (n − 1)!, and for general x ∈ R^+, Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt.
4 We can also extend the support of F to −∞ if desired. However, most discrete distributions of interest have support contained in Z^+.
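To illustrate the easy (coding-wise) approach just described, here is a hedged R sketch for the Poisson distribution, using the built-in CDF ppois; the function name rpois_inv is ours, and this is emphatically not how R's rpois is implemented.

# Discrete inverse transform: return the smallest i with F(x_i) >= U.
rpois_inv <- function(n, mu) {
  u <- runif(n)
  sapply(u, function(ui) {
    i <- 0
    while (ppois(i, mu) < ui) i <- i + 1
    i
  })
}

set.seed(1)
mean(rpois_inv(1e4, mu = 3))   # should be close to mu = 3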

Algorithm 2 The Bisection Algorithm
Input:
F : D → R, a function with a root to find
a, b ∈ D such that F (a) < 0 < F (b), interval to search
τ1 , τ2 ∈ R+ , thresholds for convergence
Algorithm:
Set x0 = a and x1 = b
For i ≥ 1
Compute z = (x0 + x1 )/2
Compute F (z)
if F (z) < 0
Set x0 = z
else if F (z) ≥ 0
Set x1 = z
if |x1 − x0 | < τ1 or |F (z)| < τ2
stop
return z
Output:
z such that F (z) ≈ 0

1.2 Other Transformations


The probability integral transform described in the previous section takes a uniform
random variate U and transforms it into a new random variate with some CDF F .
In this section, we will consider other ways to transform one random variate into
another.
Example 1.2.1 (Student’s t distribution). The t distribution is of critical impor-
tance in statistical hypothesis testing under the assumption of Gaussian errors. How-
ever, we may also wish to generate t random variates. The t distribution with ν
degrees of freedom has PDF

f(x) = C (1 + x^2/ν)^{−(ν+1)/2}

for some normalizing constant C and a CDF in terms of something called a Gaussian
Hypergeometric Function. Suffice to say, we don’t want to try to write down F let
alone invert it. Luckily, as noted above in Example 1.1.2 on the Cauchy distribution, a t random variate X with ν degrees of freedom can be defined as X = Z/√(V/ν) where Z ∼ N(0, 1), V ∼ χ^2(ν), and Z ⊥⊥ V. This is precisely how R generates t random variates.
Example 1.2.2 (Exponential distribution). As discussed in Example 1.1.1, exponential random variates with rate parameter λ > 0 can be generated by using the probability integral transform

X = −(1/λ) log(U)
for U ∼ Uniform [0, 1]. However, more efficient methods exist.
In Fishman (2013) Section 3.10, it is noted that more efficient methods of generation exist. In particular, let X0 be a random variate from a truncated exponential distribution. That is, for x ∈ [0, log 2], X0 has PDF and CDF

f_{X0}(x) = 2e^{−x} and F_{X0}(x) = 2 − 2e^{−x}.

Next, let K ∼ Geometric(1/2), which is supported on the non-negative integers such that

p_K(k) = P(K = k) = (1/2)^{k+1},

with X0 ⊥⊥ K. Then, we can write X = X0 + K log 2, which is Exponential(1). Indeed, any x ∈ R^+ can be uniquely written as x = x0 + k log 2 for k ∈ Z^+ and x0 ∈ [0, log 2). Denoting the convolution product by ⋆, the PDF is

f_X(x) = (f_{X0} ⋆ p_K)(x) = P(K = k) f_{X0}(x0) = 2^{−k−1} · 2e^{−x0} = exp(−k log 2 − x0) = e^{−x}.

Meanwhile, the current R source code uses an algorithm published by Ahrens & Dieter in 1972 in a paper called "Computer methods for sampling from the exponential and normal distributions."

1.2.1 Box-Muller Transform


The Gaussian or Normal distribution is perhaps the most widely used probability distribution. While its shape is an elegant bell curve, the CDF has no closed form, making use of tools like the Probability Integral Transform very tedious to implement. Luckily, the Box-Muller transform gives us an alternative approach.
We say that X ∼ N(µ, σ^2) if

f_X(x) = (1/√(2πσ^2)) exp(−(x − µ)^2/(2σ^2))

for µ ∈ R and σ^2 > 0. Strictly speaking, σ^2 = 0 allows for a degenerate Gaussian distribution that reduces to a point mass at the mean µ. It can be shown that if Z ∼ N(0, 1), which is referred to as a standard normal distribution, then X = µ + σZ ∼ N(µ, σ^2). Hence, we only need to be concerned with generating standard normal random variates Z.

Theorem 1.2.1 (Box-Muller Transform, 1958). If U ∼ Uniform [0, 1] and V ∼ Exponential(1) and U ⊥⊥ V, then

X = √(2V) cos(2πU) and Y = √(2V) sin(2πU)

are independent N (0, 1) random variables.
Proof. This theorem is proven via a change of variables. Note that

V = (X^2 + Y^2)/2 and U = (1/(2π)) arctan(Y/X).

Then, defining the functions v(x, y) = (x^2 + y^2)/2 and u(x, y) = arctan(y/x)/(2π) gives partial derivatives

∂v/∂x = x and ∂v/∂y = y,
∂u/∂x = (1/(2π)) · 1/(1 + y^2/x^2) · (−y/x^2) = −y/(4πv(x, y)),
∂u/∂y = (1/(2π)) · 1/(1 + y^2/x^2) · (1/x) = x/(4πv(x, y)).

Then, by the change of variables formula, we have that

f_{X,Y}(x, y) = f_{U,V}(u, v) |∂u/∂x · ∂v/∂y − ∂u/∂y · ∂v/∂x|
= 1[u(x, y) ∈ [0, 1]] e^{−v(x,y)} |−y^2/(4πv(x, y)) − x^2/(4πv(x, y))|
= (1/(2π)) exp(−(x^2 + y^2)/2).

As this is the Gaussian PDF in R^2 with mean zero and covariance I2, we have our conclusion.

The Box-Muller transform reduces the task of generating a normal random variate to the tasks of generating a uniform and an exponential random variate. However, as is typical in these notes, more sophisticated and optimized generation methods exist in the R source code.
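A minimal R sketch of Theorem 1.2.1 follows; the function name rnorm_bm is ours, and rexp() supplies the Exponential(1) variate.

# Box-Muller: each (U, V) pair yields two independent N(0,1) variates.
rnorm_bm <- function(n) {
  m <- ceiling(n / 2)
  u <- runif(m)
  v <- rexp(m)                     # Exponential(1)
  z <- c(sqrt(2 * v) * cos(2 * pi * u),
         sqrt(2 * v) * sin(2 * pi * u))
  z[1:n]
}

set.seed(2)
z <- rnorm_bm(1e5)
c(mean(z), var(z))                 # approximately 0 and 1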

1.3 Acceptance-Rejection Sampling


Acceptance-Rejection sampling, sometimes just called Rejection sampling, is a pow-
erful technique for sampling from complex distributions. In short, we wish to sample
from some probability density f (x), but it is hard to do so. The solution is to find a
proposal or candidate distribution h(x) that is close to f and easy to sample from.
Then, we sample from h instead and reject any samples that don’t look like they
came from f . This is made more precise in the following theorem attributed to John
Von Neumann.

Note that the acceptance-rejection method is a critical step in the Metropolis-Hastings algorithm for performing Markov Chain Monte Carlo. However, MCMC is beyond the scope of this course. In these notes, we restrict to independent sampling rather than invoking Markov chains.

Remark 1.3.1. In the following theorem, we use | to denote conditioning. That is, we define a random variable X to equal another random variable Z after forcing a condition to hold. For example, you could consider a random 20-sided die roll conditioned on the number displayed being less than 8.
In general, for two real-valued random variables X and Y, we can write the conditional density function as

f_{Y|X}(y|x) = f(x, y)/f_X(x)

where f(x, y) is the joint density function.

Theorem 1.3.1 (Von Neumann, 1951). Let f(x) be a pdf with support on [a, b] that can be written as f(x) = c g(x) h(x) where h(x) ≥ 0, ∫_a^b h(x) dx = 1, g(x) ∈ [0, 1], and some constant c ≥ 1. If Z ∼ h(z) and U ∼ Uniform [0, 1] and Z ⊥⊥ U, then X = Z|{U ≤ g(Z)} has pdf f(x).

Proof. Since U and Z are independent, the joint pdf is simply f(u, z) = 1_{u∈[0,1]} h(z). Consequently, the conditional pdf for X satisfies

h(z|U ≤ g(Z)) P(U ≤ g(Z)) = ∫_0^{g(z)} f(u, z) du,

so that

h(z|U ≤ g(Z)) = ∫_0^{g(z)} f(u, z) du / P(U ≤ g(Z)).

The numerator becomes ∫_0^{g(z)} f(u, z) du = g(z) h(z). The denominator becomes

P(U ≤ g(Z)) = ∫∫_{[0,1]×[a,b]} 1[u ≤ g(z)] du h(z) dz = ∫_a^b ∫_0^1 1[u ≤ g(z)] du h(z) dz = ∫_a^b g(z) h(z) dz.

As f(x) is a pdf,

1 = ∫_a^b f(x) dx = c ∫_a^b g(x) h(x) dx.

Thus, ∫_a^b g(x) h(x) dx = c^{−1}, and consequently, h(z|U ≤ g(Z)) = c g(z) h(z) = f(z).

Remark 1.3.2. While many choices of c and g(x) are possible, the acceptance-
rejection method is most efficient when c is close to 1. This is because we accept a

Algorithm 3 The Acceptance-Rejection Algorithm
Input:
g(x) ∈ [0, 1]
h(x), a pdf to sample from
Algorithm:
Repeat
Generate z from pdf h(z)
Generate u from Uniform [0, 1]
if u ≤ g(z)
stop
return x = z
Output:
x from pdf f (x) = cg(x)h(x).

randomly generated variate X when U ≤ g(Z), which occurs with probability 1/c as per the above proof.
Hence, given a pdf of interest f and a candidate distribution h, c is chosen to be

c = sup_x {f(x)/h(x)}

and g chosen accordingly.
Furthermore, the number of trials until the event U ≤ g(Z) occurs has a geometric distribution with probability 1/c of success. Hence, the expected number of iterates until acceptance is achieved is c.

Example 1.3.3 (Half-Normal Distribution). This example is taken from Example 3.1 of Fishman (2013). The half-normal distribution is the standard normal but only supported on the non-negative reals. That is,

f(x) = √(2/π) e^{−x^2/2} = √(2e/π) e^{−(x−1)^2/2} e^{−x}.

Thus, taking c = √(2e/π), g(x) = exp(−(x − 1)^2/2), and h(x) = exp(−x), we can sample from the exponential distribution e^{−x} and apply Algorithm 3 to generate half-normal random variates.
Furthermore, this allows for an alternative way to sample from the normal distribution. That is, use Algorithm 3 to generate X as half-normal. Then, generate U ∼ Uniform [0, 1] and set X ← X if U ≥ 1/2 and X ← −X if U < 1/2.

Algorithm 4 The Acceptance-Rejection-Squeeze Algorithm
Input:
g, gL , gU ∈ [0, 1] such that gL ≤ g ≤ gU
h(x), a pdf to sample from
Algorithm:
Repeat
Generate z from pdf h(z)
Generate u from Uniform [0, 1]
if u ≤ gL (z) pretest with gL
stop
else if u ≤ gU (z) pretest with gU
if u ≤ g(z) only evaluate g if pretests fail
stop
return x = z
Output:
x from pdf f (x) = cg(x)h(x).

In this setting, we require one Uniform and one Exponential random variate until
a sample is “accepted”. Then, one final Uniform random variate is required to set
the sign, i.e. positive or negative.
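A hedged R sketch of this half-normal sampler and the sign flip follows; the function name rnorm_ar is our own, and the code is written for clarity rather than speed.

# Acceptance-rejection for the half-normal with proposal h(x) = exp(-x)
# and g(x) = exp(-(x - 1)^2 / 2), followed by a random sign flip.
rnorm_ar <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      z <- rexp(1)                          # proposal from h
      u <- runif(1)
      if (u <= exp(-(z - 1)^2 / 2)) break   # accept with probability g(z)
    }
    s <- if (runif(1) < 0.5) -1 else 1      # one final uniform sets the sign
    out[i] <- s * z
  }
  out
}

set.seed(3)
z <- rnorm_ar(1e4)
c(mean(z), var(z))                          # approximately 0 and 1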

1.3.1 Acceptance-Rejection-Squeeze
Sometimes the function g(x) may be slow to compute. However, if we can construct an envelope gL(x) ≤ g(x) ≤ gU(x) where gL, gU are "simpler" functions, then we can use gL and gU to perform pretests on whether or not we should accept or reject.
The most natural choices for gL and gU come from power series. For example, 1 − x ≤ e^{−x} ≤ 1 − x + x^2/2 for x ≥ 0. Hence, instead of merely evaluating exp(−(x − 1)^2/2) in the previous example for the half-normal distribution, we could choose

gL(x) = 1 − (x − 1)^2/2 and gU(x) = 1 − (x − 1)^2/2 + (x − 1)^4/8.

1.3.2 Box-Muller without Trigonometry


Instead of generating two independent uniform deviates to perform the Box-Muller transform (i.e. sampling within the square [0, 1] × [0, 1]), we can instead sample from within the unit circle using an acceptance-rejection step to generate normal random variates without needing to evaluate sines and cosines.
Indeed, we can generate U, V ∼ Uniform [−1, 1] until U^2 + V^2 ≤ 1, and hence (U, V) lies within the unit circle. This means that the ordered pair (U, V) has a uniformly random angle between 0 and 2π and a squared distance to the origin U^2 + V^2 that is Uniform [0, 1]. After that is achieved, we claim that

√(−2 log(U^2 + V^2)) · V/√(U^2 + V^2) ∼ N(0, 1).

This is because U^2 + V^2 is Uniform [0, 1], thus making −log(U^2 + V^2) ∼ Exponential(1). Furthermore, the sine and cosine of the angle of the ordered pair (U, V) can be computed as U/√(U^2 + V^2) and V/√(U^2 + V^2). Thus, we can compare to the original Box-Muller transform

x = √(2v) cos(2πu) and y = √(2v) sin(2πu)

for u uniform and v exponential in this case.


The same trick can be applied to generate Cauchy random variates. Recall from Example 1.1.2 that for U ∼ Uniform [0, 1], X = tan(π(U − 1/2)) is Cauchy. If we prefer not to evaluate the tangent, we can instead sample from the upper half of the unit circle. That is, generate U ∼ Uniform [−1, 1] and V ∼ Uniform [0, 1] until U^2 + V^2 ≤ 1. Then, the tangent of the angle (measured from the vertical axis) of the ordered pair (U, V) is simply

U/V ∼ Cauchy.

See Sections 7.3.4 and 7.3.7 of Numerical Recipes (Press et al.) for detailed code for these methods.
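As a hedged sketch of the trigonometry-free normal generator described above (the function name rnorm_polar is ours):

# Polar variant of Box-Muller: reject points outside the unit disk,
# then use s = u^2 + v^2 ~ Uniform[0,1] in place of the exponential.
rnorm_polar <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      u <- runif(1, -1, 1)
      v <- runif(1, -1, 1)
      s <- u^2 + v^2
      if (s > 0 && s <= 1) break
    }
    out[i] <- sqrt(-2 * log(s)) * v / sqrt(s)
  }
  out
}

set.seed(4)
z <- rnorm_polar(1e4)
c(mean(z), var(z))   # approximately 0 and 1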

1.4 Ratio of Uniforms


In this section, we generalize the previously discussed idea of generating a uniform element of the unit circle to simulate from more complex univariate distributions. This Ratio-of-Uniforms method is attributed to Kinderman & Monahan (1977). The idea is to define a region in R^2 (e.g. the unit circle), then generate a uniformly random point from that region, and then return the ratio of its coordinates, which will have the desired distribution. Once again, we have a theorem to make this more precise.

Theorem 1.4.1 (Ratio of Uniforms). Let f(z) be a pdf on R and r(z) a non-negative function such that

f(z) = r(z) / ∫_{−∞}^∞ r(t) dt.

Let D0 ⊂ R^2 be defined as

D0 = {(x, y) : 0 ≤ x ≤ √(r(y/x))},

and let (X, Y) be a uniform random point in D1, some bounded region such that D0 ⊂ D1, and let W = X^2. Then, if W ≤ r(Y/X), Z = Y/X has pdf f(z).

Proof. Let |D0| denote the area5 of the region D0. Then, conditioning on (X, Y) being in D0, the joint pdf of (X, Y) is simply f(x, y) = |D0|^{−1}. Since W = X^2 and Z = Y/X, we have that X = √W and Y = Z√W. Applying the change of variables (X, Y) → (W, Z) gives

f_{W,Z}(w, z) = J f_{X,Y}(√w, z√w)

where J is the Jacobian determinant

J = det( [ ∂x/∂w, ∂x/∂z ; ∂y/∂w, ∂y/∂z ] ) = det( [ 1/(2√w), 0 ; z/(2√w), √w ] ) = 1/2.

Therefore, f_{W,Z}(w, z) = [2|D0|]^{−1} 1[0 ≤ w ≤ r(z)]. Therefore, the pdf for Z is

f_Z(z) = ∫_0^{r(z)} f_{W,Z}(w, z) dw = r(z)/(2|D0|).

Since f_Z(z) must integrate to 1, we have that 2|D0| = ∫_{−∞}^∞ r(z) dz.

The region D0 may be hard to visualize and even harder to sample from. Hence, we can bound D0 with a rectangle D1 by noting the following. On the x-axis, the region D0 is constrained to the interval [0, x∨] where x∨ = sup_z √(r(z)). On the y-axis, for a point (x, y) ∈ D0 with z = y/x, we have that y = zx where x ≤ √(r(z)). Hence, the region D0 is constrained to the interval [y∧, y∨] = [inf_z z√(r(z)), sup_z z√(r(z))]. Thus, we can use this rectangle for implementing the ratio of uniforms method as detailed in Algorithm 5.
Note that since we are sampling from the rectangle D1 rather than the actual set of interest D0, there is an acceptance-rejection step in Algorithm 5. Thus, the probability of accepting a randomly drawn point (X, Y) is |D0|/|D1|. From the above proof, this probability can be written down explicitly:

P((X, Y) ∈ D0) = |D0|/|D1| = (1/(2 x∨ (y∨ − y∧))) ∫_{−∞}^∞ r(z) dz.
Remark 1.4.1 (Shift and Scale). Note that we can always take Y /X and shift and
scale it to get a0 + a1 (Y /X) if desired. The above theorem, proof, and algorithm can
be modified accordingly.
As noted, we already saw specific examples of the ratio of uniforms method. In the case of generating Cauchy random variables, we found that for X ∼ Uniform [0, 1] and Y ∼ Uniform [−1, 1], Z = Y/X is Cauchy conditioned on X^2 + Y^2 ≤ 1. If we set r(z) = (1 + z^2)^{−1}, then

x∨ = sup_{z∈R} 1/√(1 + z^2) = 1,
y∨ = sup_{z∈R} z/√(1 + z^2) = sup_{z∈R} (z/|z|) · 1/√(1 + z^{−2}) = 1,

5 i.e. Lebesgue measure

Algorithm 5 The Ratio of Uniforms Algorithm
Input:
function r(z) = cf (z) for some c > 0
Algorithm:
Compute x∨ = sup_z √(r(z)).
Compute y∨ = sup_z z√(r(z)).
Compute y∧ = inf_z z√(r(z)).
Repeat
Generate x from pdf Uniform [0, x∨ ]
Generate y from pdf Uniform [y∧ , y∨ ]
if x2 ≤ r(y/x)
stop
return z = y/x
Output:
z from pdf f (z).

and y∧ = −1 from the same formula. Furthermore, the half circle is recovered by noting that

0 ≤ x ≤ 1/√(1 + y^2/x^2) ⇒ 0 ≤ x^2 + y^2 ≤ 1.

And thus we can rederive the formula for a Cauchy random variable via the ratio of uniforms method. In this case, the probability of acceptance is π/4.
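The Cauchy case of Algorithm 5 can be sketched in R as follows (the function name rcauchy_rou is ours):

# Ratio of uniforms with r(z) = 1/(1 + z^2): the bounding rectangle is
# x in [0, 1], y in [-1, 1], and acceptance means x^2 <= r(y/x).
rcauchy_rou <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      x <- runif(1, 0, 1)
      y <- runif(1, -1, 1)
      if (x^2 <= 1 / (1 + (y / x)^2)) break   # equivalent to x^2 + y^2 <= 1
    }
    out[i] <- y / x
  }
  out
}

z <- rcauchy_rou(1e4)
median(z)   # near 0; the mean is undefined for the Cauchy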

1.4.1 Gamma Random Variables


In this section, we examine a method for generating Gamma random variates making use of the Ratio of Uniforms algorithm. Note that the current R source code uses methods attributed to Ahrens & Dieter in rgamma(), and other methods also exist.
We say that a random variable Z ∼ Gamma(α, β) for α, β > 0 if Z ≥ 0 and has pdf

f(z) = (β^α/Γ(α)) z^{α−1} e^{−βz}.

But if Z ∼ Gamma(α, 1), then Z/β ∼ Gamma(α, β). Thus, we can restrict our investigation to generating Gamma random variables with β = 1.
Gamma distributions "add" in the sense that if Z1 ∼ Gamma(α1, β) and Z2 ∼ Gamma(α2, β) and Z1 ⊥⊥ Z2, then Z1 + Z2 ∼ Gamma(α1 + α2, β). Furthermore, Gamma(1, 1) is the standard exponential distribution. Hence, one could generate a Gamma(n, 1) random variable by summing n independent Exponential(1) random variables.6 However, this would be highly computationally inefficient.
Instead, we can apply the Ratio of Uniforms method by choosing r(z) = z^{α−1} e^{−z} where we impose that α > 1.7 Thus, the region D0 is

D0 = {(x, y) ∈ R^2 : 0 ≤ x ≤ √((y/x)^{α−1} e^{−y/x})},

and the bounding rectangle is

x ∈ [0, ((α − 1)/e)^{(α−1)/2}], y ∈ [0, ((α + 1)/e)^{(α+1)/2}],

and the probability of an acceptance is

|D0|/|D1| = Γ(α) e^α / (2 (α − 1)^{(α−1)/2} (α + 1)^{(α+1)/2}).

For large α, this is approximately O(α^{−1/2}) as can be derived via Stirling's approximation to the Gamma function. This implies that the ratio of uniforms will have vanishingly small acceptance probabilities as α grows. The proposed solution to this problem is to not sample from the rectangle D1 but to instead construct the parallelogram defined by

P = {(x, y) ∈ R^2 : α^{−1/2} ≤ y − x ≤ √(2/(αe)), 0 ≤ y ≤ 1}

and sample from P ∩ ([0, 1] × [0, 1]). More details on this can be found in Fishman (2013).

1.5 Points in Geometric Objects


The following two subsections consider random generation of points within a geomet-
ric object. The first object of interest is the simplex, which has direct application
to topics like Bayesian priors on probability vectors. The second is spheres and
ellipsoids, which arise in areas like directional data among others.

1.5.1 Simplexes and the Dirichlet Distribution


A simplex in R^n is a convex polytope consisting of n + 1 vertices. Effectively, it is a generalized "triangle" or "tetrahedron" in n dimensions. Fixing one of the vertices at the origin allows us to define a standard simplex as

Sn(r) = {x = (x1, . . . , xn)^T : Σ_{i=1}^n xi ≤ r, xi ≥ 0 ∀i}

6 When α = n is an integer, the Gamma distribution is sometimes called the Erlang distribution, named after Agner Krarup Erlang.
7 The case 0 < α < 1 is handled by other algorithms.

for some constant r > 0. The most common setting is r = 1. Then, choosing x0 = 1 − Σ_{i=1}^n xi gives the (n + 1)-long vector (x0, x1, . . . , xn), which is a probability vector.8
To generate a uniform random point inside the simplex Sn(1), we first generate n Uniform [0, 1] random variables U1, . . . , Un and sort them so that U1 ≤ U2 ≤ . . . ≤ Un. Note that there are efficient algorithms for generating an ordered set of independent random variables; see order statistics. Then, we simply set X1 = U1 and Xi = Ui − U_{i−1} for i = 2, . . . , n. The remainder is X0 = 1 − Un.
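A short hedged R sketch of this construction (the function name runif_simplex is ours):

# Uniform point in S_n(1): spacings of sorted Uniform[0,1] variates.
runif_simplex <- function(n) {
  u <- sort(runif(n))
  x <- diff(c(0, u))      # X_1 = U_(1), X_i = U_(i) - U_(i-1)
  c(1 - u[n], x)          # prepend the remainder X_0 = 1 - U_(n)
}

set.seed(5)
p <- runif_simplex(4)
p
sum(p)                    # equals 1: a probability vector of length 5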
In Bayesian analysis and other topics in statistics and machine learning, we often want to generate a probability vector p = (p1, . . . , pn) such that pi ≥ 0 for all i and Σ_{i=1}^n pi = 1. Furthermore, we may want a distribution on p other than the uniform. That is, we say that p has a Dirichlet distribution with parameter vector α = (α1, . . . , αn) with αi > 0 for all i if it has pdf

f(p) = (Γ(α1 + . . . + αn) / (Γ(α1) · · · Γ(αn))) Π_{i=1}^n pi^{αi−1}.

If all of the αi = 1, then we have a uniform distribution on the simplex. If the αi > 1, then the mass of the distribution is pushed towards the centre of the simplex. If the αi < 1, then the mass is pushed towards the vertices of the simplex.
An easy-to-code method for generating random variates from a Dirichlet distribution is to generate n independent Gamma random variables

Xi ∼ Gamma(αi, 1) for i = 1, . . . , n

such that Xi ⊥⊥ Xj for all i ≠ j. Then, we set

pi = Xi / Σ_{j=1}^n Xj.

This approach can be shown to yield a Dirichlet random variate by first starting with the joint pdf of X1, . . . , Xn, which is

f_X(x1, . . . , xn) = Π_{i=1}^n xi^{αi−1} e^{−xi}/Γ(αi) = exp(−Σ_{i=1}^n xi) Π_{i=1}^n xi^{αi−1}/Γ(αi).

Then, we denote the sum T = Σ_{i=1}^n xi, and we change variables from xi to pi = xi/T for i = 1, . . . , n − 1, keeping T as the nth variable. Hence, xi = pi T for i = 1, . . . , n − 1, and xn = T(1 − Σ_{i=1}^{n−1} pi). The Jacobian matrix with i, jth entry ∂xi/∂pj (and last column ∂xi/∂T) is

J = ( T   0   · · ·  p1
      0   T   · · ·  p2
      ⋮   ⋮   ⋱     ⋮
      −T  −T  · · ·  1 − Σ_{i=1}^{n−1} pi )

where we can find that det(J) = T^{n−1}. Thus, we have that

f_p(p1, . . . , p_{n−1}, T) = e^{−T} [T(1 − Σ_{i=1}^{n−1} pi)]^{αn−1} Π_{i=1}^{n−1} (T pi)^{αi−1} T^{n−1} / Π_{i=1}^n Γ(αi)
= (Π_{i=1}^{n−1} pi^{αi−1}) (1 − Σ_{i=1}^{n−1} pi)^{αn−1} / (Π_{i=1}^n Γ(αi)) · T^{Σ αi − 1} e^{−T}.

And integrating out T from 0 to ∞ gives the desired formula as

Γ(Σ αi) = ∫_0^∞ T^{Σ αi − 1} e^{−T} dT.

8 i.e. the entries are non-negative and sum to one.
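A minimal R sketch of this Gamma construction (the function name rdirichlet_gamma is ours; contributed packages offer equivalents):

# Dirichlet(alpha) variate: normalize independent Gamma(alpha_i, 1) draws.
rdirichlet_gamma <- function(alpha) {
  x <- rgamma(length(alpha), shape = alpha, rate = 1)
  x / sum(x)
}

set.seed(6)
rdirichlet_gamma(c(2, 3, 5))          # mass pushed towards the centre
rdirichlet_gamma(c(0.2, 0.2, 0.2))    # mass pushed towards the vertices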

1.5.2 Spheres
The first problem in this section is to generate a uniformly random point on the surface of an n-dimensional hypersphere (or just the usual 3D sphere, if you prefer), which is S_{n−1}(r) = {z ∈ R^n : Σ zi^2 = r^2}. To achieve this, we can simulate a multivariate standard normal random vector and normalize it. That is, if

X = (X1, . . . , Xn) ∼ N(0, In),

then the vector Z with entries

Z1 = X1/√(Σ_{i=1}^n Xi^2), . . . , Zn = Xn/√(Σ_{i=1}^n Xi^2)

will be uniformly distributed on the surface of an n-dimensional sphere of radius 1.


To change the radius, multiply each Zi by some r > 0.
The second problem is to generate points within an n-dimensional sphere, which is B_{n−1}(r) = {z ∈ R^n : Σ zi^2 ≤ r^2}. This can be achieved by sampling from the surface as above and the beta distribution. Indeed, let Z be uniformly random on the surface of the sphere and let

W ∼ Beta(n, 1)

be independent of Z. Then we claim that WZ is uniformly distributed within the sphere of radius 1, where Z is the position on the sphere and W acts as a random radius. To see this, note that

f_Z(z|r) = Γ(n/2)/(2π^{n/2} r^{n−1}) for a sphere with radius r,
f_W(w) = n w^{n−1} for w ∈ [0, 1],
f_Z(z|r = w) f_W(w) = nΓ(n/2)/(2π^{n/2}) = Γ(n/2 + 1)/π^{n/2},

which is constant in both z and w and equals the reciprocal of the volume of the unit ball, as required for uniformity.
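Both constructions are easy to sketch in R (the function names are ours):

# Uniform point on the surface of the unit sphere in R^n: normalize a
# standard normal vector. For the interior, scale by W ~ Beta(n, 1).
runif_sphere <- function(n) {
  x <- rnorm(n)
  x / sqrt(sum(x^2))
}
runif_ball <- function(n) {
  rbeta(1, n, 1) * runif_sphere(n)   # random radius times random direction
}

set.seed(7)
sum(runif_sphere(3)^2)   # equals 1 up to rounding
sum(runif_ball(3)^2)     # at most 1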

Chapter 2

Model Checking

Introduction
In statistical and data science applications, we almost surely estimate parameters
whether about the data itself (e.g. correlation, skewness) or in the process of fitting a
model to the data. However, the task of determining whether or not these estimates
are “good” becomes a bit more challenging. In some classical settings, like least
squares regression with homogeneous normal errors, there are exact formulae for
computing statistical significance and the width of a confidence interval. But in
more complex modelling settings, such nice formulae likely do not exist. Hence, in
this chapter, we discuss some methods of model checking; namely, we first look at
bootstrap methods for estimating variances and biases for estimators, and secondly,
we look at cross validation for model tuning and selection.

2.1 The Jackknife


An early precursor to the modern day bootstrap and cross validation methods is
the jackknife estimator, which dates back to the 1950s. The rough idea is to take
a data set of, say, n observations and compute an estimator θ̂ for some parameter
of interest θ; e.g. this could be the mean or variance or something more complex.
Then, we compute the estimator θ̂i for the same data with the ith entry removed; i.e.
(x1 , . . . , xi−1 , xi+1 , . . . , xn ). The result is a collection of n leave-one-out estimators
θ̂1 , . . . , θ̂n . These θ̂i can be used to get a sense of the distribution of the original
parameter estimate θ̂. This approach can be used, for example, to estimate the bias
and variance of θ̂.
The bias of an estimator is b(θ̂) = E[θ̂] − θ. The problem is that we typically do
not know what either E[θ̂] or θ is. In the jackknife approach, we use θ̂ in place of θ
and the jackknife estimator in place of E[θ̂]. That is, for the bias computation, we
first compute θ̂, the leave-one-out estimates θ̂i , and then combine them to get the

jackknife estimate of θ,

θ̂_jack = (1/n) Σ_{i=1}^n θ̂i.

Then, the jackknife bias estimator is defined to be

b̂(θ̂)_jack = (n − 1)(θ̂_jack − θ̂).

We can go one step further by combining the original estimate and the bias to get a "de-biased" or "bias-corrected" estimator,

θ̂ − b̂(θ̂)_jack = nθ̂ − (n − 1)θ̂_jack.

This has some nice properties in theory, but we will only investigate its performance
in simulation.
Similarly, we can compute a jackknife estimator for the standard error, which comes from the sample variance of the n jackknife estimators. It is

ŝe(θ̂)_jack = √( ((n − 1)/n) Σ_{i=1}^n (θ̂i − θ̂_jack)^2 ).

The inclusion of the (n − 1) in the above standard error and bias formulae may look a bit arbitrary at first. However, these multipliers are included to make the jackknife estimates for the sample mean and for the variance of the sample mean coincide with the usual formulae.
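The whole recipe fits in a few lines of R; the following hedged sketch (our function name jackknife) returns the bias, standard error, and bias-corrected estimate for a generic estimator.

# Generic jackknife for an estimator theta_hat applied to a sample x.
jackknife <- function(x, theta_hat) {
  n <- length(x)
  theta <- theta_hat(x)
  theta_i <- sapply(1:n, function(i) theta_hat(x[-i]))  # leave-one-out
  theta_jack <- mean(theta_i)
  list(bias = (n - 1) * (theta_jack - theta),
       se = sqrt((n - 1) / n * sum((theta_i - theta_jack)^2)),
       corrected = n * theta - (n - 1) * theta_jack)
}

set.seed(8)
jackknife(runif(50), max)   # cf. Example 2.1.3 below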

Example 2.1.1 (The Sample Mean). This is an illustrative but ultimately uninteresting example. That is, if θ̂ = n^{−1} Σ_{i=1}^n Xi, then removing one data point gives θ̂i = (n − 1)^{−1} Σ_{j≠i} Xj, and summing over all i gets us back where we started:

θ̂_jack = (1/n) Σ_{i=1}^n (1/(n − 1)) Σ_{j≠i} Xj = X̄.

Hence, the jackknife bias-corrected estimator for the population mean is precisely just the usual sample mean, which, of course, is already unbiased!

Example 2.1.2 (The variance of the sample mean). Let X1, . . . , Xn be iid with the sample mean µ̂ = X̄ as an estimator for the population mean µ. Then, we know that

Var(µ̂) = σ^2/n

where σ^2 is the variance of the Xi. In practice, we typically do not know what σ^2 is; the classic approach is to estimate the sample variance as σ̂^2 = (n − 1)^{−1} Σ_{i=1}^n (Xi − X̄)^2.

Then, under the assumption that the Xi are normal,1 we can construct a (1 − α) confidence interval for µ as

|X̄ − µ| ≤ z_{1−α/2} √(σ̂^2/n)

where z_{1−α/2} comes from the normal distribution. We could also use the t-distribution, but if n is large enough and α is not extremely small, then the normal z-score approach will be fine.
We could take a jackknife approach to the estimation of the variance of µ̂. To do that, we compute

X̄i = (1/(n − 1)) Σ_{j≠i} Xj,

which gives n different estimates of the sample mean. Therefore, we can consider the jackknife standard error, which is

ŝe(X̄)_jack = √( ((n − 1)/n) Σ_{i=1}^n (X̄i − X̄)^2 )
= √( ((n − 1)/n) Σ_{i=1}^n ((1/(n − 1)) X̄ − (1/(n − 1)) Xi)^2 ) = √(σ̂^2/n).

In the above two examples, nothing interesting really happened. However, we can take this same jackknife idea and apply it to a variety of statistical estimators θ̂ for some parameter θ.
Example 2.1.3 (Maximum of Uniforms). Let U1, . . . , Un ∼ iid Uniform [0, a] for some unknown maximum a > 0. One easy estimator for a is â = max{U1, . . . , Un}. However, we know that this estimator will have a negative bias as necessarily â < a. Hence, we can try to improve this estimator with the jackknife.
We need to use the standard order statistic notation X(1) ≤ X(2) ≤ . . . ≤ X(n). If we delete the ith data point, then either âi = X(n), being the max Xj, or âi = X(n−1), being the second largest Xj, in the case that Xi is the maximum data point and we deleted it. Therefore, the jackknife estimator is

â_jack = ((n − 1)/n) X(n) + (1/n) X(n−1),

which is actually smaller than X(n), which is smaller than a. However, we can use this value to compute a bias-corrected estimator for a as

nâ − (n − 1)â_jack = nX(n) − ((n^2 − 2n + 1)/n) X(n) − ((n − 1)/n) X(n−1)
= 2X(n) − (1/n) X(n) − ((n − 1)/n) X(n−1).
1 Or close enough to normal that we can safely invoke the central limit theorem.

To see if this improves our estimate, let a = 1. In that case, the order statistics are distributed as X(k) ∼ Beta(k, n − k + 1) for k = 1, . . . , n. Hence, the original estimator has a bias of

E X(n) − 1 = n/(n + 1) − 1 = −1/(n + 1).

The bias-corrected estimator has a bias of

E[2X(n) − (1/n) X(n) − ((n − 1)/n) X(n−1)] − 1
= 2 · n/(n + 1) − (1/n) · n/(n + 1) − ((n − 1)/n) · (n − 1)/(n + 1) − 1
= (2n^2 − n − (n − 1)^2 − n(n + 1)) / (n(n + 1)) = −1/(n(n + 1)).

Hence, the bias went from order n^{−1} to n^{−2}, which is a good improvement.

2.2 The Bootstrap


Bootstrap is a collection of various resampling methods that can be used to ef-
fectively sample a new dataset from a given dataset with the aim of estimating
variance, estimating bias, constructing confidence intervals, and other tasks. The
original bootstrap was proposed in 1979 by Bradley Efron with a lot of follow up
work in the 1980s and 1990s. However, research into bootstrap methods continues
to the present day. We will briefly consider three bootstrap methodologies: the
original bootstrap that samples from the empirical data distribution; the paramet-
ric bootstrap that samples from a parametric model fit to the data; and the smooth
bootstrap, which combines the two ideas and samples from a kernel density estimate
for the data distribution. There are many other bootstrap methods that have been
developed and studied for a variety of complex datasets and modelling scenarios.

2.2.1 Resampling Bootstrap


The original bootstrap method is resampling with replacement. That is, given a dataset x1, . . . , xn, we compute some estimate θ̂ like, say, the sample mean or sample variance. Then, we choose a number of times to resample the data, B ∈ N. For each i = 1, . . . , B, we sample n datapoints from x1, . . . , xn with replacement and recompute the estimate θi∗. In the end, there are B re-estimated values θ1∗, . . . , θB∗, and we can use these θi∗ to get an idea of the variability of the original estimate θ̂. For example, we can compute a bootstrap estimate of Var(θ̂) by computing

Var_boot(θ̂) = (1/(B − 1)) Σ_{i=1}^B (θi∗ − θ̄∗)^2

where θ̄∗ is the sample average of the B bootstrap estimators.
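A hedged R sketch of the resampling bootstrap, here for the variance of the sample median (the function name boot_var is ours):

# Resampling bootstrap: B resamples with replacement, re-estimate each
# time, then take the sample variance of the bootstrap estimates.
boot_var <- function(x, estimator, B = 2000) {
  theta_star <- replicate(B, estimator(sample(x, replace = TRUE)))
  var(theta_star)
}

set.seed(9)
x <- rexp(100, rate = 2)
boot_var(x, median)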

2.2.2 Parametric Bootstrap


Instead of sampling with replacement, the parametric bootstrap generates new bootstrapped datasets by simulating data from a fitted model. An easy example would be to assume that x1, . . . , xn come from a normal distribution. The mean and variance can be estimated and new bootstrap samples can be drawn from N(µ̂, σ̂^2).
A more interesting application would be to fit some statistical model and then
sample from that model. For example, given pairs (xi , yi ), we can fit a linear model

y = β0 + β1 x + ε

to get estimates of β̂0 and β̂1 . Then, new data can be simulated by

yi∗ = β̂0 + β̂1 xi + εi

with εi simulated from a desired distribution. In classic regression theory, we would assume that εi is normal. But if the distribution of εi differs from normal, we can use such bootstrapping to estimate the significance of β̂0 and β̂1.

2.2.3 Smooth Bootstrap


The smooth or kernel bootstrap is a mix of the previous two ideas. The resampling bootstrap samples with replacement from the data, whereas the parametric bootstrap samples from some parametric distribution based on the data. The smooth bootstrap samples from a kernel estimator for the density of the data. In effect, this just means that we sample a new point xi with replacement and add a small bit of Gaussian noise to it.
Given a dataset x1, . . . , xn, we create a new sample x1∗, . . . , xn∗ where each xi∗ = xi + zi for zi ∼ N(0, σ^2), where typically σ^2 is small. If we take σ^2 → 0, we recover the original resampling bootstrap. From here, we proceed as before by computing the estimate θi∗ of the parameter of interest on each new bootstrapped dataset.

2.3 Cross Validation


Selecting a good model for your data can be a challenging task. This includes picking
terms in a regression model but also tuning hyperparameters such as in ridge or lasso
regression. In an ideal world, we would be able to fit many models to a dataset and
then test them on a brand new independent dataset. In the real world, we generally
don’t get that lucky. Thus, one approach to model selection and model tuning and
avoiding overfitting is to use Cross Validation.
Cross Validation is a simple idea: split your data into pieces; fit or train a model on one piece; test or validate your model on the other piece. This is typically done for multiple splits or folds of the data. While there are a multitude of cross
validation methods, we will just briefly mention the most common ones being k-fold
and leave-one-out cross validation.
As an example of where this would be used (see the Jupyter notebook), we can consider ridge regression, which fits a linear model by minimizing the penalized least squares equation:

min_{β∈R^p} ‖Y − Xβ‖^2 + λ‖β‖^2.

In the above optimization, λ ≥ 0 is a tuning parameter. Hence, we could use cross validation to try to find the "best" choice of λ.

Leave-One-Out
Leave-one-out cross validation looks a lot like the jackknife estimator. However, the overall goal is different. Given a dataset X1, . . . , Xn, the jackknife recomputes an estimator θ̂ on the dataset with one point missing, i.e. θ̂i from X1, . . . , X_{i−1}, X_{i+1}, . . . , Xn. The goal is to estimate the bias and standard error of the estimator θ̂.
In leave-one-out cross validation, we also start from X1 , . . . , Xi−1 , Xi+1 , . . . , Xn .
This dataset is used to fit some statistical models. Each model is then tested on
the missing data point Xi . In this way, the model that gives the best prediction for
data point Xi can be selected from the candidates.

k-fold
In k-fold cross validation, we first randomize the order of the data. This preliminary step is important in case your data has already come in a specific order, e.g. chronological.
Once your data is randomized, you split it into k sub-datasets of size m = n/k like

{X1, . . . , Xm}, {X_{m+1}, . . . , X_{2m}}, . . . , {X_{n−m+1}, . . . , Xn}.

Then, each subset is sequentially left out; the remaining k − 1 sets are combined and used to fit a statistical model, and the left-out set is used for testing.
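Here is a hedged R sketch of k-fold cross validation for a simple linear model, scored by mean squared prediction error; the variable names and simulated data are our own.

# 5-fold cross validation for simple linear regression.
set.seed(10)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
k <- 5
folds <- sample(rep(1:k, length.out = n))   # randomized fold labels

mse <- sapply(1:k, function(j) {
  train <- folds != j
  fit <- lm(y[train] ~ x[train])            # fit on k - 1 folds
  pred <- coef(fit)[1] + coef(fit)[2] * x[!train]
  mean((y[!train] - pred)^2)                # test on the left-out fold
})
mean(mse)   # cross-validated estimate of prediction error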

Chapter 3

Volume Computation and Integration

Introduction
In Chapter 3 of these lecture notes, we are concerned with using randomly generated variates to actually compute things. This will take on different forms. One will be to compute the volume of some region in space. Another will be to numerically integrate some function. Much as before, we will have to be clever to sample random variates in a "smart" way in order to produce efficient algorithms.
One of the new considerations in this chapter is the accuracy of computation. That is, we will be concerned with topics like error estimation, confidence intervals for estimates, and dependency on sample size; that is, how large of a sample do we need to get an estimate with a given accuracy?

3.1 Volume Computation


Consider the following problem: Let D be a bounded region in R^d. How can we compute the d-dimensional volume of the region D? While the region D may be complex, we assume there exists a function

φ(x) = 1 for x ∈ D and φ(x) = 0 for x ∉ D

that is easy to compute. Then, we want to compute the integral

vol(D) = ∫_D dx = ∫_{R^d} φ(x) dx.

Hence, we will try to sample points in R^d to estimate the integral of φ(x).

Remark 3.1.1. Without loss of generality, we will assume that D lives within the
d-dimensional unit cube
C = [0, 1]⊗d .
We can always scale and translate a region D to make this so.

3.1.1 Sampling from a regular grid


Eschewing randomness for the moment, we could merely sample from a regular grid within the unit cube C. That is, for some k ∈ N, we can generate a grid of n = k^d points of the form x = (x1, . . . , xd) where each

xi ∈ {(0 + 0.5)/k, (1 + 0.5)/k, . . . , ((k − 1) + 0.5)/k}.

Then, we can simply count how many of these points x are in D, i.e. φ(x) = 1. This is performed in Algorithm 6. The obvious problem here is that k^d will get very big very fast as the dimension increases.
In Section 2.1 of Fishman (2013), it is claimed that the error of this estimate can be upper bounded as follows:

|Vol(D) − S/k^d| ≲ Surface(D)/k,

where S is the count of grid points in D and Surface(D) is the "surface area" of the boundary of D. If we are in 2 dimensions, this is the length of the boundary. If we are in 3 dimensions, this is the usual idea of surface area. In d-dimensional space, this is a (d − 1)-dimensional boundary measure. Since the number of points considered is n = k^d, this implies that the error bound shrinks like O(n^{−1/d}). Another way to think about this is that if we want an error less than some ε > 0, then we require on the order of n ≳ (1/ε)^d points, where the notation ≳ means n ≥ C(1/ε)^d for some unspecified constant C > 0.
Example 3.1.2 (Fraction of a Sphere). As an example, consider the region

D = {x ∈ R^d : xi ≥ 0 for all i and x1^2 + . . . + xd^2 ≤ 1}.

This region is 1/2^d of the d-dimensional sphere. By looking up formulae for volume and surface area, we find that

Vol(D) = (π/4)^{d/2}/Γ(d/2 + 1) and Surf(D) = 2(π/4)^{d/2}/Γ(d/2).

Therefore, we may want to compute this volume using Algorithm 6 to an accuracy of ε. However, Vol(D) → 0 as d → ∞. Hence, we would have to adapt ε to the dimension of the problem. Therefore, we will consider the normalized error bound:

|Vol(D) − k^{−d} S| / Vol(D) ≲ (1/k) · Surf(D)/Vol(D).
Algorithm 6 Volume via regular grid
Input:
function φ(x) that is 1 on D and 0 otherwise.
k ∈ N, the number of points per dimension to sample.
Algorithm:
Set S = 0.
Construct a grid X of k d points x.
for each x ∈ X
if φ(x) = 1
S ←S+1
return S/k d
Output:
Volume estimate of region D.

Then, setting the normalized error to ε, the sample size becomes

ε = n^{−1/d} Surf(D)/Vol(D)
n = ε^{−d} (Surf(D)/Vol(D))^d = (2/ε)^d (Γ(d/2 + 1)/Γ(d/2))^d = (d/ε)^d,

using Γ(d/2 + 1) = (d/2) Γ(d/2). Hence, if we want the relative error less than, say, 0.001, then the sample sizes are approximately

10^{6.6}, 10^{10.4}, 10^{14.4}, 10^{18.5}

for d = 2, 3, 4, 5.

3.1.2 Uniform Monte Carlo


Instead of using a regular grid, we can do what we did in Chapter 1 of these notes. That is, we can sample uniformly at random from the unit cube C and use acceptance-rejection to compute the volume of D.
Indeed, we can simulate X1, . . . , Xn ∈ C with iid Uniform [0, 1] entries. Then,

vol(D) = ∫_{R^d} φ(x) dx ≈ (1/n) Σ_{i=1}^n φ(Xi) = |{Xi ∈ D}|/n.

This algorithm is detailed in Algorithm 7. Going one step farther, we can compute the variance of this estimate as well. We can consider each bi = φ(Xi) to be a Bernoulli random variable with probability |D| of getting a 1 and 1 − |D| of getting a zero. Hence, for each i,

Var(bi) = |D|(1 − |D|),

and for our estimate, we have

σ^2 = Var((1/n) Σ_{i=1}^n bi) = |D|(1 − |D|)/n.

Thus, the variance of the estimate decreases at the usual n^{−1} rate. We can also compute the unbiased estimate of the variance, which is

σ̂^2 = (S/n)(1 − S/n)/(n − 1).
The “usual” normal style confidence interval is

S
|D| − ≤ t1−α/2,n−1 σ̂
n

where t1−α/2,n−1 is the value such that P T > t1−α/2,n−1 = α/2 where T ∼
t (n − 1). This hints that our estimation error decreases at a rate of O(n−1/2 ) for
sample size n.
The problem with this is that in the normal distribution case, the estimate for
the mean μ̂ and the estimate for the variance σ̂² are independent of each other.
However, for these Bernoulli random variables, this is not the case. Hence, S/n and
σ̂² may be correlated, which can invalidate the confidence interval. We will consider
two more powerful methods for confidence interval construction and sample size
estimation.
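For reference, here is a short Python sketch of Algorithm 7; one hedge is that the hard-coded 1.96 is the standard normal quantile used in place of t_{1−α/2,n−1}, which is essentially the same for large n.

```python
import numpy as np

def mc_volume(phi, d, n, seed=0):
    """Algorithm 7: volume of D inside the unit cube via uniform sampling."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, d))
    s = sum(phi(xi) for xi in x)          # S = number of points landing in D
    p = s / n                             # volume estimate S/n
    var = p * (1 - p) / (n - 1)           # unbiased variance estimate
    half = 1.96 * np.sqrt(var)            # ~95% normal-style half-width
    return p, var, (p - half, p + half)

phi = lambda x: float(np.sum(x**2) <= 1.0)   # quarter of the unit disk
print(mc_volume(phi, d=2, n=100_000))        # estimate near pi/4 = 0.785
```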

Chebyshev’s Inequality
One approach to confidence interval construction is to use Chebyshev’s inequality.
This is a very general and very useful upper bound on the tail probability of a
random variable assuming only a finite variance. Hence, it applies to a wide
range of distributions.

Theorem 3.1.1 (Chebyshev’s Inequality). Let Z ∈ ℝ be a mean-zero random
variable with EZ² < ∞. Then,
\[ P(|Z| > \varepsilon) \le \frac{\mathbb{E}Z^2}{\varepsilon^2}. \]

Algorithm 7 Volume via uniform Monte Carlo
Input:
function φ(x) that is 1 on D and 0 otherwise.
n ∈ N, the number of random points to simulate.
Algorithm:
Set S = 0.
for i = 1, . . . , n
Generate x with iid entries xj ∼ Uniform [0, 1] for j = 1, . . . , d.
if φ(x) = 1
S ←S+1
return estimate S/n
return variance [(S/n)(1 − S/n)]/(n − 1)
return confidence interval (see below)
Output:
Volume estimate of region D along with variance and confidence interval.

As computed above, we have Z = S/n − |D|. Then, EZ² is the variance of the
estimate, namely |D|(1 − |D|)/n. Applying Chebyshev’s inequality results in
\[ P\left(|S/n - |D|| > \varepsilon\right) \le \frac{|D|(1-|D|)}{n\varepsilon^2} \le \frac{1}{4n\varepsilon^2}. \]
This is because |D|(1 − |D|) ≤ 1/4. Therefore, if we want a confidence level of 1 − α,
then we set
\[ \alpha = (4n\varepsilon^2)^{-1} \quad\Rightarrow\quad \varepsilon = (4n\alpha)^{-1/2} \]
and conclude that the confidence interval
\[ \left|S/n - |D|\right| \le (4n\alpha)^{-1/2} \]
has probability at least 1 − α.
Using this approach, we learn a few things. First, we see again that the error
tolerance ε = O(n^{−1/2}) for a fixed confidence level 1 − α. We also see that the
sample size is n = (4αε²)^{−1}. Hence, n is inversely proportional to α; e.g. cutting α in
half requires a doubling of the sample size. Lastly, even though we have the same
O(n^{−1/2}) conclusion as occurs for the normal distribution, usage of Chebyshev’s
inequality does not invoke the Central Limit Theorem and hence applies for any n
as opposed to only as n → ∞.

Hoeffding’s Inequality
Hoeffding’s Inequality is a slightly more sophisticated result than Chebyshev’s in-
equality. It applies to sums of bounded random variables¹ such as sums of Bernoulli

random variables, which is how we state the following version of Hoeffding’s inequality.

¹ and more generally to sums of sub-Gaussian random variables

Theorem 3.1.2 (Hoeffding’s Inequality). Let B₁, . . . , B_n be iid Bernoulli(p) for some
p ∈ (0, 1) and let \bar B = n^{-1}\sum_{i=1}^n B_i. Then,
\[ P\left(|\bar B - p| > t\right) \le 2\exp(-2nt^2). \]

Applying Hoeffding’s inequality to the above problem results in
\[ P\left(|S/n - |D|| > \varepsilon\right) \le 2\exp(-2n\varepsilon^2). \]
Therefore, a 1 − α confidence interval can be computed by
\[ \alpha = 2e^{-2n\varepsilon^2} \quad\Rightarrow\quad \varepsilon = \sqrt{-\frac{\log(\alpha/2)}{2n}}, \]
which results in the confidence interval
\[ \left|S/n - |D|\right| \le \sqrt{-\frac{\log(\alpha/2)}{2n}} \]
having probability at least 1 − α.
Once again, we see that the width of the confidence interval is proportional to
n^{−1/2}, much like with Chebyshev’s inequality. However, now we have that the sample
size depends on ε and α as n = −log(α/2)/(2ε²). Hence, n grows like −log(α) as
α → 0.
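To make the comparison concrete, the following sketch computes the sample sizes implied by the two bounds; the formulas are exactly n = (4αε²)⁻¹ and n = log(2/α)/(2ε²) from above.

```python
import numpy as np

def n_chebyshev(eps, alpha):
    """Sample size for a Chebyshev interval of half-width eps: n = 1/(4 alpha eps^2)."""
    return int(np.ceil(1.0 / (4.0 * alpha * eps**2)))

def n_hoeffding(eps, alpha):
    """Sample size for a Hoeffding interval of half-width eps: n = log(2/alpha)/(2 eps^2)."""
    return int(np.ceil(np.log(2.0 / alpha) / (2.0 * eps**2)))

for alpha in (0.05, 0.01, 0.001):
    print(alpha, n_chebyshev(0.01, alpha), n_hoeffding(0.01, alpha))
```

For ε = 0.01 and α = 0.05, this gives roughly 50,000 points under Chebyshev but only about 18,500 under Hoeffding, and the gap widens rapidly as α shrinks.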

3.2 Integration
Above, we noticed that computing the volume of some region D ⊂ C can be written
as an integral of some function φ(x) that is 1 for x ∈ D and 0 for x ∉ D. That is,
\[ \mathrm{Vol}(D) = \int_C \phi(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n \phi(x_i) \]
where the x₁, . . . , x_n are uniformly random points in the unit cube C.

Now, we will do the same thing, but for an integrable function g : ℝ → ℝ. That
is, assume that the domain of g is [0, 1]. Then, we can pick x₁, . . . , x_n ∈ [0, 1] and
estimate
\[ \int_0^1 g(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n g(x_i). \]
First, we have to determine how to pick the points x_i.

3.2.1 Deterministic Approaches
There are numerous approaches to numerical integration in existence. We will briefly
discuss non-Monte Carlo methods before resuming the MC discussion. Methods
referred to as “quadrature” typically involve picking points x_i and weights ω_i and
computing
\[ \int_0^1 g(x)\,dx \approx \sum_{i=1}^n \omega_i\, g(x_i). \]
Let 0 = x₀ < x₁ < . . . < x_n = 1. The rectangular or midpoint rule yields
\[ \int_0^1 g(x)\,dx \approx \sum_{i=1}^n (x_i - x_{i-1})\,g\!\left(\frac{x_i + x_{i-1}}{2}\right), \]
which is just the width times the height of n rectangles. The trapezoidal rule yields
\[ \int_0^1 g(x)\,dx \approx \sum_{i=1}^n (x_i - x_{i-1})\,\frac{g(x_i) + g(x_{i-1})}{2}, \]
which linearly interpolates between (x_{i−1}, g(x_{i−1})) and (x_i, g(x_i)). Of course, many
more sophisticated methods exist.
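For concreteness, here is a small Python sketch of both rules on an equally spaced partition; the test integrand e^x is an arbitrary choice with the known integral e − 1.

```python
import numpy as np

def midpoint(g, n):
    """Midpoint rule on [0, 1] with n equal subintervals."""
    x = (np.arange(n) + 0.5) / n
    return np.mean(g(x))

def trapezoid(g, n):
    """Trapezoidal rule on [0, 1] with n equal subintervals."""
    y = g(np.linspace(0.0, 1.0, n + 1))
    return (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1]) / n

g = np.exp                                   # true integral is e - 1
print(midpoint(g, 100), trapezoid(g, 100), np.e - 1)
```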

3.2.2 Monte Carlo Integration


In contrast to deterministic approaches, we will focus on Monte Carlo methods for
numerically computing the value of an integral. Considering the same problem as
above, we want to estimate
\[ \mu = \int_0^1 g(x)\,dx \approx \frac{1}{n}\sum_{i=1}^n g(x_i) = \hat\mu. \]
If we take each x_i to be a uniform random variate X_i ∼ iid Uniform[0, 1], then we have
for each i = 1, . . . , n that
\[ \mathbb{E}\,g(X_i) = \int_0^1 g(x)\,dx \]
and thus
\[ \mathbb{E}\left(\frac{1}{n}\sum_{i=1}^n g(X_i)\right) = \int_0^1 g(x)\,dx, \]
which means we have an unbiased estimator of the integral.
Furthermore, we can compute the variance and get
\[ \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n g(X_i)\right) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(g(X_i)) = \frac{1}{n}\,\mathrm{Var}(g(X)). \]

For X ∼ Uniform[0, 1],
\[ \sigma^2 = \mathrm{Var}(g(X)) = \int_0^1 g(x)^2\,dx - \left(\int_0^1 g(x)\,dx\right)^2 = \int_0^1 g(x)^2\,dx - \mu^2. \]

Therefore, we have the usual conclusion that \mathrm{Var}\left(n^{-1}\sum_{i=1}^n g(X_i)\right) = \sigma^2/n. This can
be estimated by the unbiased estimator
\[ \hat\sigma_n^2 = \frac{1}{n-1}\sum_{i=1}^n \left( g(x_i)^2 - \Big(\frac{1}{n}\sum_{j=1}^n g(x_j)\Big)^2 \right) = \frac{1}{n-1}\sum_{i=1}^n g(x_i)^2 - \frac{n}{n-1}\,\hat\mu_n^2. \]

Confidence Intervals
If we have that \int_0^1 g(x)^4\,dx < ∞,² then the central limit theorem tells us that
\[ \frac{\hat\mu_n - \mu}{\sqrt{\hat\sigma_n^2 / n}} \overset{d}{\longrightarrow} N(0, 1). \]
Therefore, we can use z-scores to construct an asymptotic (1 − α)-confidence interval
for μ being
\[ |\hat\mu_n - \mu| \le z_{1-\alpha/2}\sqrt{\hat\sigma_n^2 / n}. \]
Note that in this formulation, μ̂_n and σ̂_n² are computed from the same data. If
one has the CPU cycles to spare, it is preferable to generate independent sets of
data to estimate μ̂_n and σ̂_n² independently.

Streaming Computation
For a highly accurate estimate of μ, we may require a very large n. In the above
formulation, computing the variance σ̂_n² requires computation of μ̂_n first. Thus, we
would have to save all n evaluations g(x_i) in memory, which could become cumbersome.
Instead, we can reformulate the above computations in a streaming fashion that does
not require saving all of the evaluations to memory.
A first attempt would be to write
\[ \hat\sigma_n^2 = \frac{1}{n-1}\sum_{i=1}^n g(x_i)^2 - \frac{1}{n(n-1)}\left(\sum_{i=1}^n g(x_i)\right)^2. \]
Hence, we could save only \sum_{i=1}^n g(x_i) and \sum_{i=1}^n g(x_i)^2. However, the use of sub-
traction could result in a loss of numerical precision.
² i.e. finite fourth moment.

Algorithm 8 Integral via uniform Monte Carlo
Input:
function g(x) on the domain [0, 1].
n ∈ N, the number of random points to simulate.
Algorithm:
Set S = 0. (integral estimate)
Set V = 0. (variance estimate)
for i = 1, . . . , n
Generate x ∼ Uniform[0, 1].
Compute g(x)
if i ≥ 2: V ← ((i − 2)/(i − 1)) V + (1/i)(g(x) − S)²
S ← (1/i) g(x) + ((i − 1)/i) S
return estimate S
return variance V
return confidence interval S ± z_{1−α/2} √(V/n)
Output:
Integral estimate of ∫₀¹ g(x)dx along with variance and confidence interval.

A more sophisticated approach is to write
\begin{align*}
(n-1)\hat\sigma_n^2 - (n-2)\hat\sigma_{n-1}^2
&= \sum_{i=1}^n \left(g(x_i) - \hat\mu_n\right)^2 - \sum_{i=1}^{n-1} \left(g(x_i) - \hat\mu_{n-1}\right)^2 \\
&= g(x_n)^2 - n\hat\mu_n^2 + (n-1)\hat\mu_{n-1}^2 \\
&= g(x_n)^2 - \frac{1}{n}\left(g(x_n) + (n-1)\hat\mu_{n-1}\right)^2 + (n-1)\hat\mu_{n-1}^2 \\
&= \frac{n-1}{n}\,g(x_n)^2 - 2\,\frac{n-1}{n}\,g(x_n)\hat\mu_{n-1} - \frac{(n-1)^2 - n(n-1)}{n}\,\hat\mu_{n-1}^2 \\
&= \frac{n-1}{n}\left[g(x_n) - \hat\mu_{n-1}\right]^2,
\end{align*}
so that
\[ \hat\sigma_n^2 = \frac{n-2}{n-1}\,\hat\sigma_{n-1}^2 + \frac{1}{n}\left[g(x_n) - \hat\mu_{n-1}\right]^2. \]
Therefore, we can update
\[ \hat\mu_n = \frac{g(x_n)}{n} + \frac{n-1}{n}\,\hat\mu_{n-1} \]
and use these two running updates to compute σ̂_n² at each step of the algorithm.
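The following Python sketch implements Algorithm 8 with the streaming recursions just derived; the only addition is an explicit guard at i = 1, where the variance recursion is not yet defined, and the 1.96 is again the large-n normal quantile.

```python
import numpy as np

def mc_integrate_streaming(g, n, seed=0):
    """Algorithm 8: estimate int_0^1 g(x) dx and Var(g(X)) in one pass
    using the running mean/variance recursions derived above."""
    rng = np.random.default_rng(seed)
    mean, var = 0.0, 0.0
    for i in range(1, n + 1):
        y = g(rng.uniform())
        if i >= 2:                                      # recursion starts at i = 2
            var = (i - 2) / (i - 1) * var + (y - mean) ** 2 / i
        mean = y / i + (i - 1) / i * mean               # running mean update
    half = 1.96 * np.sqrt(var / n)                      # ~95% CI half-width
    return mean, var, (mean - half, mean + half)

print(mc_integrate_streaming(lambda t: t**2 + t**3, 100_000))   # true mean 7/12
```

Note that the variance update must happen before the mean update, since the recursion uses μ̂_{i−1}, not μ̂_i.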

3.3 Importance Sampling
In the previous sections, we only considered sampling uniformly from [0, 1]. However,
it is often worth considering non-uniform distributions in order to sample more
efficiently. That is, starting from the above integration problem, say we want to
compute
\[ \mu = \int_0^1 g(t)\,dt = \mathbb{E}[g(X)] \]
for X ∼ Uniform[0, 1]. The estimator we had before will be denoted
\[ \hat\mu_n = \frac{1}{n}\sum_{i=1}^n g(x_i) \]
for x₁, . . . , x_n drawn from the Uniform[0, 1] distribution. Then, given some density
function f(t) which has support on [0, 1] and f(t) > 0 for all t ∈ [0, 1], we can write
\[ \mu = \int_0^1 g(t)\,dt = \int_0^1 g(t)\,\frac{f(t)}{f(t)}\,dt = \int_0^1 \frac{g(t)}{f(t)}\,f(t)\,dt = \mathbb{E}[g(Z)/f(Z)] \]
for Z ∼ f(z). Numerically speaking, we can simulate z₁, . . . , z_n from f(z) and then
estimate the integral to be
\[ \tilde\mu_n = \frac{1}{n}\sum_{i=1}^n \frac{g(z_i)}{f(z_i)}. \]
Then, the question to consider is, “What should f (z) be in order to get the most
efficient estimate?”
Based on the above equations, we can derive the variance for our new estimator
to be
\[ \mathrm{Var}\left(\frac{g(Z)}{f(Z)}\right) = \mathbb{E}\left[\left(\frac{g(Z)}{f(Z)} - \mu\right)^2\right] = \mathbb{E}\left[\frac{g(Z)^2}{f(Z)^2}\right] - \mu^2 = \int_0^1 \frac{g(t)^2}{f(t)}\,dt - \mu^2. \]
In contrast, the variance of g(X) for X uniform is \int_0^1 g(t)^2\,dt - \mu^2. Thus, we can
consider the difference
\[ \mathrm{Var}(g(X)) - \mathrm{Var}\left(\frac{g(Z)}{f(Z)}\right) = \int_0^1 g(t)^2\left(1 - \frac{1}{f(t)}\right)dt. \]
We want this integral to be as big as possible to indicate a large drop in
variance. Solving for the perfect f is typically not doable. However, the above
equation gives us valuable intuition. In particular, where g(t)² is big we want
f(t) > 1, and where g(t)² is small we want f(t) < 1. Thus we sample more
or less frequently than the uniform distribution depending on the choice of f.

Example 3.3.1. If we want to numerically estimate the integral of g(t) = t² + t³,
we can try simulating from the uniform distribution—i.e. Beta(1, 1)—and the
Beta(2, 1) distribution with pdf f(t) = 2t. Computing the variance difference from
above yields
\begin{align*}
\int_0^1 (t^2 + t^3)^2\left(1 - \frac{1}{2t}\right)dt
&= \int_0^1 (t^4 + 2t^5 + t^6)\,dt - \frac{1}{2}\int_0^1 (t^3 + 2t^4 + t^5)\,dt \\
&= \left(\frac{1}{5} + \frac{1}{3} + \frac{1}{7}\right) - \left(\frac{1}{8} + \frac{1}{5} + \frac{1}{12}\right) = \frac{15}{56} \approx 0.268.
\end{align*}
The variance under the uniform distribution is
\[ \int_0^1 (t^2 + t^3)^2\,dt - \left(\int_0^1 (t^2 + t^3)\,dt\right)^2 \approx 0.3359. \]
Hence, dropping by 0.268 is a big drop in the variance of this estimator.
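The computation in Example 3.3.1 is easy to verify by simulation. The sketch below uses numpy's Beta sampler for the f(t) = 2t proposal; note that g(z)/f(z) = (z + z²)/2 stays bounded near zero, so the division is numerically safe here.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda t: t**2 + t**3                  # integrand; true integral is 7/12
n = 100_000

# Plain Monte Carlo with Uniform[0,1] draws
x = rng.uniform(size=n)
plain = g(x)

# Importance sampling with Beta(2,1) draws, density f(t) = 2t
z = rng.beta(2.0, 1.0, size=n)
weighted = g(z) / (2.0 * z)

print(plain.mean(), weighted.mean())       # both near 7/12 = 0.583
print(plain.var(), weighted.var())         # roughly 0.336 versus 0.068
```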

3.4 Stratified Sampling


Another useful tool for variance reduction is to take the region you are trying to
integrate over, break it into k pieces (or strata), and sample from each to estimate
the integral.
To integrate g(x) over the interval [0, 1], we partition the interval into disjoint
sets A₁, . . . , A_k such that [0, 1] = \bigcup_{i=1}^k A_i. Then, we can write
\[ \mu = \int_0^1 g(x)\,dx = \sum_{i=1}^k \int_{A_i} g(x)\,dx = \sum_{i=1}^k \omega_i \int_{A_i} g(x)\,\frac{dx}{\omega_i} = \sum_{i=1}^k \omega_i\,\mathbb{E}_{A_i}[g(X)] \]
where \omega_i = |A_i| = \int_{A_i} dx. The point of weighting with ω_i is so that each
integral becomes an expectation. Note that ω1 + . . . + ωk = 1. We can also change
the probability distribution for sampling on each Ai if we want to complicate this
further.
For each Ai , we sample ni points from Ai and denote these points as xi,j . Then,
our estimator becomes
\[ \hat\mu_n^{ss} = \sum_{i=1}^k \frac{\omega_i}{n_i}\sum_{j=1}^{n_i} g(x_{i,j}) \]
and the variance of the estimator becomes
\[ \mathrm{Var}(\hat\mu_n^{ss}) = \sum_{i=1}^k \frac{\omega_i^2}{n_i}\,\mathrm{var}_{A_i}(g(X)) \]
where var_{A_i}(g(X)) is the variance over each stratum A_i. Therefore, our goal is to
choose the A_i and n_i to reduce the variance.

One way to get a variance reduction is to choose each n_i = nω_i, which simply
says to sample from each A_i proportionally to its size. That is,
\begin{align*}
\mathrm{Var}(\hat\mu_n^{ss})
&= \sum_{i=1}^k \frac{\omega_i}{n}\,\mathrm{var}_{A_i}(g(X)) \\
&= \frac{1}{n}\sum_{i=1}^k \omega_i\,\mathbb{E}_{A_i}\!\left[(g(X) - \mathbb{E}_{A_i}[g(X)])^2\right] \\
&= \frac{1}{n}\sum_{i=1}^k \omega_i\,\mathbb{E}_{A_i}\!\left[(g(X) - \mathbb{E}[g(X)] + \mathbb{E}[g(X)] - \mathbb{E}_{A_i}[g(X)])^2\right] \\
&= \frac{1}{n}\sum_{i=1}^k \omega_i\left(\mathbb{E}_{A_i}\!\left[(g(X) - \mathbb{E}[g(X)])^2\right] - \left(\mathbb{E}_{A_i}[g(X)] - \mathbb{E}[g(X)]\right)^2\right) \\
&= \mathrm{Var}(\hat\mu_n) - \frac{1}{n}\sum_{i=1}^k \omega_i\left(\mathbb{E}_{A_i}[g(X)] - \mathbb{E}[g(X)]\right)^2,
\end{align*}
which is less than or equal to Var(μ̂_n). The main intuition here is that we are
forcing the sampling to be more evenly spread out. However, this approach can still
be outperformed with more clever thought. In particular, the above partitioning
does not take the function g into account at all.

Example 3.4.1. If we simply want to integrate g(x) = x on [0, 1], we generate
x₁, . . . , x_n uniformly and get³
\[ \hat\mu_n = \frac{1}{n}\sum_{i=1}^n x_i \quad\text{and}\quad \mathrm{Var}(\hat\mu_n) = \frac{1}{12n}. \]
If we instead partition into the intervals [0, 0.5] and [0.5, 1] with n/2 points in each, we
get
\[ \hat\mu_n^{ss} = \sum_{i=1}^2 \frac{1/2}{n/2}\sum_{j=1}^{n/2} x_{i,j} = \frac{1}{n}\sum_{i=1}^2\sum_{j=1}^{n/2} x_{i,j}, \]
which is the same as μ̂_n except that exactly half of the samples are in each of the
two subintervals. The variance of this estimator is
\[ \mathrm{Var}(\hat\mu_n^{ss}) = \sum_{i=1}^2 \frac{(1/2)^2}{n/2}\,\mathrm{var}_{A_i}(X) = \sum_{i=1}^2 \frac{(1/2)^2}{n/2}\cdot\frac{(1/2)^2}{12} = \frac{1}{48n}. \]
Hence, our variance has been reduced by a factor of 4.

³ Recall that the variance of the Uniform[a, b] distribution is (b − a)²/12.
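Example 3.4.1 can likewise be checked by simulation. The sketch below repeats both estimators many times and compares the empirical variances to the theoretical values 1/(12n) and 1/(48n); the number of repetitions is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: x                              # integrand from Example 3.4.1
n, reps = 1000, 2000                         # repetitions to see the variance

plain = [g(rng.uniform(size=n)).mean() for _ in range(reps)]

strat = [0.5 * g(rng.uniform(0.0, 0.5, size=n // 2)).mean()
         + 0.5 * g(rng.uniform(0.5, 1.0, size=n // 2)).mean()
         for _ in range(reps)]

print(np.var(plain), 1 / (12 * n))           # empirical vs theoretical 1/(12n)
print(np.var(strat), 1 / (48 * n))           # empirical vs theoretical 1/(48n)
```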

Chapter 4

Stochastic Optimization

Introduction
In the last chapter, we saw how Monte Carlo methods can be used to estimate
volumes and integrals. In this chapter we will consider a different but also important
application of Monte Carlo methods: Optimization.
In many areas of application, we find ourselves having to solve similar minimiza-
tion / maximization problems. For example, in parametric statistics, we often wish
to find a model that maximizes the likelihood. Alternatively, we may want to choose
model parameters that best minimize a loss or risk function. In this chapter, we will
formulate all problems as minimizing some loss function L(θ) which maps a vector
θ ∈ Rd into R. The goal will be to find θ∗ = arg minθ {L(θ)}. Of course, any of the
ideas below can be easily reformulated into a maximization problem.
There are also two settings we will consider: noise-free and noisy optimization.
In the first case, we simply try to minimize the deterministic function L(θ), which
nevertheless may be a very complex function to minimize. In the second case, we
do not get to observe L(θ), but instead observe L(θ) + εθ where εθ is a random error
in the measurement that may depend on θ.

4.1 Basic Stochastic Search


The ideas in this section come from Chapter 2 of Spall (2005). These methods are
all relatively simple to code and implement and, though simple, can still be
applicable to real-world problems.

4.1.1 Simple Random Search


The first and by far easiest method of stochastic search and optimization is to simply
pick points (θ’s) uniformly at random. This is outlined in Algorithm 9. In practice,
we don’t have to simulate θ’s uniformly at random, but can use other distributions

Algorithm 9 Simple Random Search
Input:
Loss function L(θ) to minimize.
Domain Θ to search over.
Maximum number of samples n.
Algorithm:
Pick θbest ∈ Θ uniformly at random
Set Lbest = L(θbest )
for i from 2 to n
Pick θ ∈ Θ uniformly at random
Compute L = L(θ)
if L < Lbest (i.e. if we do improve)
Set Lbest = L and θbest = θ
Output:
{θbest , Lbest }

as well. However, with no additional information on L(θ), sampling uniformly is


best.
Under some mild conditions on the loss function L, it can be shown that Al-
gorithm 9 will find the optima as n → ∞. The theorem presented in Spall (2005)
notes that the optimum can be found when

• The minimizer θ∗ is unique, and for all ε > 0, \inf_{\|\theta - \theta^*\| > \varepsilon} L(\theta) > L(\theta^*).

• The minimum value L(θ∗ ) > −∞.

• A “continuity”-like condition: for all ε > 0, there exists δ > 0 such that
\[ P(L(\theta) < L(\theta^*) + \varepsilon) \ge \delta, \]
which basically means that θ∗ cannot be an isolated point that our random
generation of θ’s has no chance of approaching.

However, we won’t prove that in these notes as we would have to discuss what
convergence almost surely means first as well as other analysis topics.
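For reference, a minimal Python sketch of Algorithm 9 on a box-shaped domain follows; the quadratic test loss and its minimizer at (0.3, 0.3) are invented purely for illustration.

```python
import numpy as np

def simple_random_search(L, low, high, n, seed=0):
    """Algorithm 9: evaluate L at n uniform points in the box
    [low_1, high_1] x ... x [low_d, high_d] and keep the best."""
    rng = np.random.default_rng(seed)
    theta_best, L_best = None, np.inf
    for _ in range(n):
        theta = rng.uniform(low, high)       # uniform draw from the box
        val = L(theta)
        if val < L_best:                     # keep only improvements
            theta_best, L_best = theta, val
    return theta_best, L_best

L = lambda th: np.sum((th - 0.3) ** 2)       # toy loss minimized at (0.3, 0.3)
print(simple_random_search(L, np.zeros(2), np.ones(2), n=10_000))
```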

4.1.2 Localized Random Search


While the simple search method from the previous section is valid and will succeed
as n → ∞, it doesn’t make use of the loss function L(θ) at all. Typically, we would
expect some smoothness in the function L. That is, if we find a “small” value at
θ₀, then we would expect θ’s close to θ₀ to also give small values of L.

Algorithm 10 Localized Random Search
Input:
Loss function L(θ) to minimize.
Domain Θ to search over.
Maximum number of samples n.
Distribution f to sample from (typically standard normal)
Algorithm:
Pick θ1 ∈ Θ uniformly at random
Set L1 = L(θ1 )
for i from 2 to n
Pick v from distribution f
Set θi = θi−1 + v
if θi ∉ Θ
Set θi ← θi−1 and Li ← Li−1
Go to next i
Compute Li = L(θi )
if Li ≥ Li−1 (i.e. if we don’t improve)
Set Li ← Li−1 and θi ← θi−1
Output:
{θn , Ln }

In the localized search method detailed in Algorithm 10, we move from one θ to
the next by randomly perturbing it by some random vector v. That is, the new θ
becomes θ + v. Thus, this method randomly walks around the domain Θ looking
for the minimum. It can be shown that if

• L is continuous

• Θ is bounded

• v ∼ N (0, cId ) for some choice of c > 0

then our sequence of θ’s will get arbitrarily close to the minimizer θ∗ as n → ∞. If
there is more than one minimizer—i.e. θ∗ is not unique—then this algorithm will
eventually get close to one of the minimizers.

Remark 4.1.1 (Boundary Conditions). Note that if Θ is a bounded subset of Rd


and if the v’s come from the normal distribution, then we could pick a θ ∉ Θ. In
this case, we would need to have our algorithm check to make sure we haven’t left
the domain of interest.
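A sketch of Algorithm 10 in Python follows, with the domain check from Remark 4.1.1 built in; the unit-cube domain, the perturbation scale c = 0.1, and the test loss are all illustrative choices rather than recommendations.

```python
import numpy as np

def localized_random_search(L, theta0, n, c=0.1, seed=0,
                            in_domain=lambda th: np.all((th >= 0) & (th <= 1))):
    """Algorithm 10 on the unit cube: perturb by N(0, c^2 I) and keep
    the move only if it stays in the domain and decreases the loss."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    best = L(theta)
    for _ in range(n):
        cand = theta + c * rng.standard_normal(theta.shape)
        if not in_domain(cand):
            continue                         # boundary check (Remark 4.1.1)
        val = L(cand)
        if val < best:                       # accept only improvements
            theta, best = cand, val
    return theta, best

L = lambda th: np.sum((th - 0.3) ** 2)       # toy loss minimized at (0.3, 0.3)
print(localized_random_search(L, theta0=np.array([0.9, 0.9]), n=5000))
```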

4.1.3 Enhanced Localized Search
The last algorithm we consider in this section is almost identical to the above local-
ized search method. However, it is “enhanced” by adding a bias term that pushes
the random walk in a given direction if that direction looks to be minimizing the
loss function L. This bias term gives the random walk some inertia to keep drifting
in the same direction (roughly). This is detailed in Algorithm 11.
Specifically, the update step from the localized search,

LRS: θi ← θi−1 + v,

is replaced with
ERS: θi ← θi−1 + bi−1 + v,
where bi−1 is the bias term computed on the previous step of the search algorithm.
If the new θi gives an improvement in the loss value—i.e. L(θi ) < L(θi−1 )—then
the new bias term is
bi ← 0.2bi−1 + 0.4v.
Otherwise, the new bias term is a shrunken version of the previous term bi ←
0.5bi−1 . The choices of these scaling values (0.2,0.4,0.5) were picked by the algorithm
designers, but other values could be considered as well. There is nothing inherently
perfect about those choices other than that they seem to work well in practice.
More references to such enhanced localized search methods are given on page 45,
chapter 2, of Spall (2005).

4.1.4 Noisy Loss Function


The above algorithms were stated assuming that the loss function L is noise-free—i.e.
deterministic. Trying to minimize a deterministic function in multiple dimensions
can be a hard task. Adding noise into it can make it even more arduous. Noise or
measurement errors can enter into a loss evaluation in many ways. In these notes,
we will just consider additive noise. That is, we don’t observe L(θ), but instead
observe
L(θ) + εθ
where εθ represents some additive noise that can vary depending on the value of θ.
Typically, we will just consider the simplest setting of having ε_θ ∼ iid N(0, η²) for some
unknown η² > 0. We will consider two approaches to handling noisy measurements:
multiple evaluations and thresholding.
Based on the usual ideas of unbiased estimators and the law of large numbers,
we can estimate the true value of L(θ), by making k evaluations L(θ) + εθ and
averaging them together. In which case, we would have measurements L1 , . . . , Lk
and an average value of L̄. As a result,

EL̄ = L(θ) and Var L̄ = η 2 /k.




Algorithm 11 Enhanced Localized Search
Input:
Loss function L(θ) to minimize.
Domain Θ to search over.
Maximum number of samples n.
Distribution f to sample from (typically standard normal)
Algorithm:
Pick θ1 ∈ Θ uniformly at random
Set L1 = L(θ1 )
Set b1 = 0
for i from 2 to n
Pick v from distribution f
Set θi = θi−1 + bi−1 + v (try adding v)
if θi ∈ Θ and Li := L(θi ) < L(θi−1 )
Set bi = 0.2bi−1 + 0.4v
Go to next i
Set θi = θi−1 + bi−1 − v (try subtracting v)
if θi ∈ Θ and Li := L(θi ) < L(θi−1 )
Set bi = 0.2bi−1 − 0.4v
Go to next i
Set bi = 0.5bi−1
Set Li ← Li−1 and θi ← θi−1 (or keep θ unchanged)
Output:
{θn , Ln }

Thus, we have reduced the variance by a factor of 1/k. However, this would require
k-times more evaluations of L at every step. This may be prohibitively expensive
depending on how hard it is to compute loss measurements.
Instead, we can set a threshold for improvement. That is, in the originally stated
(noise-free) algorithms, we update θi−1 to θi if L(θi ) < L(θi−1 ). In this noisy-setting,
we can pick a threshold τi , which could be dependent on the iteration i, and update
the parameter vector θ if

L(θi ) + εi < L(θi−1 ) + εi−1 − τi .

The intuition comes from hypothesis testing. If τ_i is, say, two standard deviations
of the noise difference ε_i − ε_{i−1}, then whenever L(θ_i) ≥ L(θ_{i−1}),
\[ P\left(L(\theta_i) + \varepsilon_i < L(\theta_{i-1}) + \varepsilon_{i-1} - \tau_i\right) \le 2.3\%. \]

This means that we would have only a 2.3% chance of picking a worse choice of
parameters θ. Of course, this conceptualization assumes we know what the unknown

noise η 2 is. In practice, we would have to treat τ as a tuning parameter. If τ is
small, we jump around Θ more liberally. If τ is large, we require a large drop in the
observed loss value before we jump to a new θ.

4.2 Stochastic Approximation


In this final section, we will consider some methods of stochastic approximation,
which are either concerned with root finding or optimization. Most methods can be
used for either problem. Mainly, if we want to minimize a loss function L(θ), we are
effectively looking for the root of
d
L(θ) = 0.

dL
Of course, this assumes we have access to the derivative dθ . We also assume in
these final sections that all measurements are noisy.

4.2.1 Gradient Methods


Earlier in these notes, we discussed both the Newton-Raphson and bisection al-
gorithms as techniques for root finding for deterministic functions. However, these
methods will not work well in the presence of noise. To solve the noisy root-finding
problem (i.e. finding a critical point of the loss function), we will consider the
Robbins-Monro algorithm.
In this section, we will assume that the loss function L(θ) = E[Q(θ, v)] where Q
is the noisy loss function with some noise variable v. Then, let
\[ y(\theta) = \frac{\partial Q(\theta, v)}{\partial \theta} \]
be the function for which we want to find a root. The function y(θ) is then a noisy
measurement of the gradient of L.

Robbins-Monro
The Robbins-Monro algorithm dates back to the 1950’s and will look very similar to
Newton-Raphson. In short, we require a sequence of step sizes, sometimes called
a gain sequence, a_k such that
\[ \sum_{k=1}^\infty a_k = \infty \quad\text{and}\quad \sum_{k=1}^\infty a_k^2 < \infty, \]
which basically means that the step sizes a_k → 0 but not too fast. This allows us
to explore the space Θ, but eventually calm down into a final solution.

Algorithm 12 The Robbins-Monro Algorithm
Input:
y : Θ → ℝ, the function whose root we seek
θ0 ∈ Θ, an initial value in Θ
τ ∈ R+ , a threshold for convergence
Algorithm:
For k ≥ 1
θk = θk−1 − ak y(θk−1 )
if ‖θ_k − θ_{k−1}‖ < τ
stop
return θk
Output:
θk such that y(θk ) ≈ 0

Given the parameters θ_k from the previous step, the update step for the Robbins-
Monro algorithm is
\[ \theta_{k+1} \leftarrow \theta_k - a_k\, y(\theta_k). \]
The algorithm is detailed in Algorithm 12. A standard choice for the step-size
sequence a_k is a_k = a/k for some constant a > 0. Besides this sequence, a few more
technical conditions must be satisfied to prove convergence. These include additive
mean-zero noise, a guarantee that the variance doesn’t blow up, and convexity of
L(θ). However, convergence criteria take a lot of background knowledge to discuss
properly and thus will be skipped in these notes.
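A minimal Python sketch of Algorithm 12 follows, using the standard gain sequence a_k = a/k; the test problem, a noisy gradient of L(θ) = ‖θ‖²/2, is invented for illustration.

```python
import numpy as np

def robbins_monro(y, theta0, a=1.0, tol=1e-8, max_iter=100_000):
    """Algorithm 12: theta_k = theta_{k-1} - a_k y(theta_{k-1})
    with the gain sequence a_k = a / k."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, max_iter + 1):
        step = (a / k) * y(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:       # convergence threshold tau
            break
    return theta

# Toy problem: noisy gradient of L(theta) = ||theta||^2 / 2, so y = theta + noise
rng = np.random.default_rng(0)
y = lambda th: th + 0.1 * rng.standard_normal(th.shape)
print(robbins_monro(y, theta0=np.array([2.0, -1.0])))   # approaches the root (0, 0)
```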

4.2.2 Gradient-free Methods


Very often, we do not have access to the derivative of a noisy loss function. That is,
for some L and Q such that L(θ) = E[Q(θ, v)], we can evaluate Q at chosen values
of θ but cannot evaluate ∂Q/∂θ. This problem necessitates gradient-free methods.
These methods typically revolve around sampling points to estimate the gradient.
Most notably, we will discuss the Kiefer–Wolfowitz method, which is also known
as the finite difference method. A second method to consider is the simultaneous
perturbation method, which tries to reduce the number of sample points used to
estimate the gradient.

Kiefer–Wolfowitz
Much like the Robbins-Monro algorithm from the previous section, the update step
for the Kiefer-Wolfowitz algorithm is

θk+1 ← θk − ak ŷk (θk )

Algorithm 13 The Kiefer–Wolfowitz Algorithm
Input:
Q(θ), the noisy loss function used to form the gradient estimates ŷ_k
θ0 ∈ Θ, an initial value in Θ
τ ∈ R+ , a threshold for convergence
Algorithm:
For k ≥ 1
Estimate the gradient ŷk (θk−1 )
θk = θk−1 − ak ŷk (θk−1 )
if ‖θ_k − θ_{k−1}‖ < τ
stop
return θk
Output:
θk such that y(θk ) ≈ 0

where ŷ_k(θ_k) is an estimate of the gradient at the input θ_k. To estimate the gradient,
we require a second step-size sequence c_k. Then, we get
\[ \hat y_k(\theta_k) = \begin{pmatrix} \frac{1}{2c_k}\left[Q(\theta_k + c_k e_1) - Q(\theta_k - c_k e_1)\right] \\ \vdots \\ \frac{1}{2c_k}\left[Q(\theta_k + c_k e_d) - Q(\theta_k - c_k e_d)\right] \end{pmatrix} \]
where e_i = (0, . . . , 0, 1, 0, . . . , 0) is the standard basis vector in ℝ^d with a 1 in
the ith position. Thus, at each iterate, we need to evaluate the loss function at 2d
different inputs. This algorithm is detailed in Algorithm 13.
The convergence of this method depends on proper selection of the step-size or
gain sequences ak and ck . We require:

• positivity: ak > 0 and ck > 0 for all k

• convergence to zero: ak → 0 and ck → 0


• summations: \sum_{k=0}^\infty a_k = \infty, \quad \sum_{k=0}^\infty a_k c_k < \infty, \quad \sum_{k=0}^\infty a_k^2/c_k^2 < \infty.

One natural choice for these sequences is a_k = a/k for some a > 0 as before and
c_k = c/k^α for some c > 0 and some α ∈ (0, 1/2), e.g. α = 1/3 or α = 1/4, but
α ≠ 1/2 as then the final sum will not converge.
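Here is a Python sketch of the Kiefer–Wolfowitz iteration under the gain choices just described; the noisy quadratic loss and the particular constants a, c, α are illustrative, not tuned recommendations.

```python
import numpy as np

def kw_gradient(Q, theta, c):
    """Central finite-difference gradient estimate: 2d evaluations of Q."""
    d = theta.size
    grad = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0                                     # standard basis vector e_i
        grad[i] = (Q(theta + c * e) - Q(theta - c * e)) / (2.0 * c)
    return grad

def kiefer_wolfowitz(Q, theta0, a=0.5, c=0.5, alpha=1/3, n=5000):
    """Algorithm 13 with gain sequences a_k = a/k and c_k = c/k^alpha."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n + 1):
        theta = theta - (a / k) * kw_gradient(Q, theta, c / k**alpha)
    return theta

rng = np.random.default_rng(0)
Q = lambda th: np.sum((th - 1.0) ** 2) + 0.1 * rng.standard_normal()
print(kiefer_wolfowitz(Q, theta0=np.zeros(3)))         # approaches (1, 1, 1)
```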

Random Perturbation FDSA


This method is the same as Kiefer-Wolfowitz except for how the gradient is esti-
mated. Instead of picking 2d points along the Cartesian axes, we randomly draw a

perturbation vector v from some suitable distribution (e.g. normal) and estimate
the gradient with only 2 evaluations. The gradient equation is
\[ \hat y_k(\theta_k) = \frac{Q(\theta_k + c_k v) - Q(\theta_k - c_k v)}{2c_k} \begin{pmatrix} v_1 \\ \vdots \\ v_d \end{pmatrix}. \]
Thus, a method like this is nicer in high dimensions, especially if Q is costly to
evaluate.

Simultaneous Perturbation FDSA


A similar idea is the simultaneous perturbation method. In this case, we choose a
different perturbation vector v with entries of ±1 with probability 1/2 each (this is
known as the Rademacher distribution). Other distributions can also be considered,
but this setup is closest to the classic Kiefer-Wolfowitz algorithm. The gradient
estimation equation is
\[ \hat y_k(\theta_k) = \frac{Q(\theta_k + c_k v) - Q(\theta_k - c_k v)}{2c_k} \begin{pmatrix} 1/v_1 \\ \vdots \\ 1/v_d \end{pmatrix}. \]
Note that for Rademacher vectors v, this approach is the same as the random
perturbation method since 1/v_i = v_i when v_i = ±1. However, the point is that other
distributions can also be chosen, as long as v doesn’t get close to zero with high
probability.
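A sketch of the simultaneous perturbation update follows; note that the gradient estimate costs two evaluations of Q no matter the dimension, in contrast to the 2d evaluations of Kiefer–Wolfowitz. The noisy quadratic test loss is again invented for illustration.

```python
import numpy as np

def spsa_gradient(Q, theta, c, rng):
    """Simultaneous perturbation gradient: 2 evaluations of Q for any d."""
    v = rng.choice([-1.0, 1.0], size=theta.size)       # Rademacher entries
    diff = Q(theta + c * v) - Q(theta - c * v)
    return diff / (2.0 * c * v)                        # componentwise 1/v_i

def spsa(Q, theta0, a=0.5, c=0.5, alpha=1/3, n=5000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n + 1):
        theta = theta - (a / k) * spsa_gradient(Q, theta, c / k**alpha, rng)
    return theta

noise = np.random.default_rng(1)
Q = lambda th: np.sum((th - 1.0) ** 2) + 0.1 * noise.standard_normal()
print(spsa(Q, theta0=np.zeros(10)))                    # approaches (1, ..., 1)
```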

4.3 Example: Least Squares and Neural Networks


If we have a collection of input and output pairs (x_k, z_k), a common desire is to fit
a function that best approximates the relation between x_k and z_k. That is, we have
a parametric function h(θ, x_k) and errors ε_k = z_k − h(θ, x_k). The goal is to pick a
parameter vector θ∗ to minimize the expected value of \sum_{k=1}^n \varepsilon_k^2. That is,
\[ \text{minimize:}\quad L(\theta) = \frac{1}{2n}\,\mathbb{E}\sum_{k=1}^n \left(z_k - h(\theta, x_k)\right)^2. \]
Taking a derivative means we can consider the Robbins-Monro algorithm with the
following update equation:
\[ \theta_{k+1} = \theta_k + a_k\left[z_k - h(\theta_k, x_k)\right]\frac{\partial h}{\partial \theta}(\theta_k, x_k), \]
noting that the gradient of the squared error is −[z_k − h(θ_k, x_k)] ∂h/∂θ.
In this case, we consider each input-output pair one at a time rather than all at
once. This is useful, for example, with large and/or streaming data sets.

This above formulation can be applied to any least squares problem where h and
the gradient of h can be computed. For example, h could be some nonlinear
regression equation or a function described by a neural network. The segmented
construction of a neural network allows for evaluating the gradient of h.
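As a sketch of this recursion, the code below fits a linear model h(θ, x) = θᵀx, for which ∂h/∂θ = x, to a stream of noisy pairs; the true coefficients and noise level are invented for illustration. Note the update adds a_k (z_k − h) ∂h/∂θ, i.e. it steps against the gradient of the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])           # hypothetical true coefficients
h = lambda theta, x: x @ theta               # linear model, so dh/dtheta = x

theta = np.zeros(2)
for k in range(1, 50_001):
    x = rng.uniform(-1.0, 1.0, size=2)                   # streaming input
    z = x @ theta_true + 0.1 * rng.standard_normal()     # noisy output
    resid = z - h(theta, x)
    theta = theta + (1.0 / k) * resid * x                # stochastic gradient step
print(theta)                                             # close to (2, -1)
```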

Appendix A

Appendix

A.1 Change of Variables


A tool that is integral to many of the theorems and proofs in these notes is the
change-of-variables formula. Imagine we are in ℝⁿ with coordinate variables (x₁, . . . , x_n),
but now we wish to consider a new coordinate system (y₁, . . . , y_n). This allows us
to define the n × n Jacobian matrix
\[ J = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n} \end{pmatrix}. \]

The determinant of J will be of particular interest and is often also referred to simply
as the Jacobian. Note that we want this determinant to be non-zero. Otherwise,
the mapping from x → y is not invertible.1 The change-of-variables formula is as
follows. Note that more general formulations of this result exist, but we stick with
a simpler one suitable for our purposes.
Proposition A.1.1 (Change of Variables). Let U ⊂ Rn be an open set, and let the
change of variables function ϕ : x → y be one-to-one, differentiable, have continuous
partial derivatives, and non-zero Jacobian for all x ∈ U . Then, for any real-valued
function f that is continuous with support contained in ϕ(U ),
\[ \int_{\varphi(U)} f(y)\,dy = \int_U f(\varphi(x))\,\left|\det(J)\right|\,dx. \]

Example A.1.2 (Spherical Coordinates). An archetypal application is switching


from Cartesian to polar coordinates. In R3 , we can define two coordinate systems
(x, y, z) and (r, θ, φ) such that

x = r sin φ cos θ, y = r sin φ sin θ, z = r cos φ.


¹ see the Inverse Function Theorem.

In this case, the Jacobian matrix is
\[ J = \begin{pmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \phi} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \phi} & \frac{\partial y}{\partial \theta} \\ \frac{\partial z}{\partial r} & \frac{\partial z}{\partial \phi} & \frac{\partial z}{\partial \theta} \end{pmatrix} = \begin{pmatrix} \sin\phi\cos\theta & r\cos\phi\cos\theta & -r\sin\phi\sin\theta \\ \sin\phi\sin\theta & r\cos\phi\sin\theta & r\sin\phi\cos\theta \\ \cos\phi & -r\sin\phi & 0 \end{pmatrix}, \]

and the determinant is det(J) = r² sin φ. Let ϕ : (r, φ, θ) → (x, y, z). If we have a
region U ⊂ ℝ³ and function f, then
\[ \iiint_{\varphi(U)} f(x, y, z)\,dx\,dy\,dz = \iiint_U f(r\sin\phi\cos\theta,\; r\sin\phi\sin\theta,\; r\cos\phi)\, r^2\sin\phi\; dr\,d\phi\,d\theta. \]

The change-of-variables formula is critical to working with probabilities and


random variables. This is because probabilities are actually measures which are
actually integrals. More precisely, consider real valued random variables X, Y with
joint density function fX,Y (x, y). Then,
ZZ
P ((X, Y ) ∈ U ) = fX,Y (x, y)dx dy.
U

If we wish to transform into new random variables W, Z where (W, Z) = ϕ(X, Y),
then
\[ P((W, Z) \in \varphi(U)) = \iint_{\varphi(U)} f_{W,Z}(w, z)\,dw\,dz = \iint_{\varphi(U)} f_{X,Y}(x(w,z),\, y(w,z)) \left| \frac{\partial x}{\partial w}\frac{\partial y}{\partial z} - \frac{\partial x}{\partial z}\frac{\partial y}{\partial w} \right| dw\,dz. \]
This implies that the joint density of W and Z is
\[ f_{W,Z}(w, z) = f_{X,Y}(x(w,z),\, y(w,z)) \left| \frac{\partial x}{\partial w}\frac{\partial y}{\partial z} - \frac{\partial x}{\partial z}\frac{\partial y}{\partial w} \right|, \]
which is used (among other places) in the proof of the Box-Muller transform and
the Ratio of Uniforms method.

Pro Tip: det(AB) = det(A) det(B) and det(A⁻¹) = det(A)⁻¹.

Bibliography

George Fishman. Monte Carlo: concepts, algorithms, and applications. Springer
Science & Business Media, 2013.

James C. Spall. Introduction to stochastic search and optimization: estimation,
simulation, and control, volume 65. John Wiley & Sons, 2005.

