STA732
Statistical Inference
Lecture 01: Course Introduction
Yuansi Chen
Spring 2023
Duke University
https://www2.stat.duke.edu/courses/Spring23/sta732.01/
Goal of Lecture 01
• Logistics
• Introduce “the problem”
• Discuss what it means to have “the best” estimator
• (If time permits) Review of measure theory basics
Logistics
Coordinates
• TA: Christine Shen, [email protected]
• Course websites:
• Main:
https://www2.stat.duke.edu/courses/Spring23/sta732.01/
• Sakai
Lectures and Office hours
• Lectures: Monday and Wednesday 3:30-4:45pm in Old Chem 025
• Office hours:
• Yuansi: MW 4:45-5:30
• Christine: see website
• Ed Discussion on Sakai
Sakai tools
• Announcements
• Zoom Meetings (for possible online office hours)
• Resources (for HW problem sets)
• Ed Discussion (for online discussions)
• Gradescope (for HW submission and exam grading)
Textbooks
• Keener, Theoretical Statistics: Topics for a Core Course, 2010
• Lehmann and Casella, Theory of Point Estimation, 1998
• Lehmann and Romano, Testing Statistical Hypotheses, 2005
All are available online via the Duke library website
Grading
• Weekly homework (due on Wednesdays at 11am)
• One midterm + one final
Homework 25%
Midterm 25%
Final 45%
Participation 5%
Scribing
Everyone is required to scribe notes for at least one lecture. Please sign
up via the link in Sakai.
Policies
Check website
• Duke Community Standard
• I will not lie, cheat, or steal in my academic endeavors
• I will conduct myself honorably in all of my endeavors
• I will act if the standard is compromised
• Plagiarism
• You may use online resources, but make sure you understand the
material in your own words, and make sure to cite them (code or theory)
• Answer sharing between groups or individuals is not allowed
• Homework Policy:
• Late HW: no homework more than two full days (48 hours) late
will be accepted. Each late day results in a one-level
downgrade (10% off) for that HW
• Regrade requests on Gradescope within 2 days
• Drop the HW with the lowest score for final grade
• Exam Policy: no makeup exams
HW0 released
• Designed for trying out Gradescope on Sakai
• Due Wednesday 18 at 11am
• Will not be counted toward the final grade
The statistical inference problem
Statistical inference in a dictionary
Oxford Dictionary of Statistics or Wikipedia:
“Statistical inference is the process of using data analysis to infer
properties of an underlying distribution of probability”
Statistical experiment
Statistical experiment
A statistical experiment is a procedure/process that generates a
collection of data, X
• For example, a coin tossing experiment: tossing a coin n times.
Sample space
The set of possible data values is called the sample space S
• For example, in the coin tossing experiment, S = {0, 1}n , the
sample space contains all length n string with 0s and 1s
Statistical model
Statistical model
A statistical model is a family of possible distributions {Pθ , θ ∈ Ω}
for X, where Ω is called the parameter space
• Note that the family can be very small (e.g. a single
distribution) or very large (e.g. all absolutely continuous
distributions)
• Bayesians also put assumptions on θ (we will deal with this later)
• A model is in essence a collection of assumptions regarding
the sampling distribution of the data
Event in the sample space
• E ⊂ S is called an event
• Each distribution in a model can specify the probability of an
event
Pθ (E) = Probθ (X ∈ E)
Example: coin tossing experiment
• Data: X = (X1 , X2 , . . . , Xn ) where
Xi = 1 if the ith toss is H, and Xi = 0 if the ith toss is T
• Statistical model: contains all joint distributions of n
independent Bernoulli distribution with equal head
probability θ, θ ∈ [0, 1]
Pθ (X1 = x1 , . . . , Xn = xn ) = θ^(∑ xi) (1 − θ)^(n − ∑ xi)
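As a quick sanity check of this model, the sketch below (plain Python; the function name joint_pmf is mine) verifies that the joint pmf sums to 1 over the sample space S = {0, 1}^n.

```python
from itertools import product

def joint_pmf(x, theta):
    """P_theta(X1 = x1, ..., Xn = xn) = theta^(sum xi) * (1 - theta)^(n - sum xi)."""
    s = sum(x)
    return theta ** s * (1 - theta) ** (len(x) - s)

# The probabilities over all 2^n binary strings must sum to 1
n, theta = 5, 0.3
total = sum(joint_pmf(x, theta) for x in product([0, 1], repeat=n))
print(abs(total - 1.0) < 1e-12)  # True
```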
Statistical inference
Inference
Inference about g(θ) (estimand) is making an “educated guess”
about g(θ) based on the data
• for example, blindly guessing θ to be 0 is a type of inference
• guessing θ to be a real number between 0 and 1 is also an
example of inference
Common types of statistical inference problems
1. Point estimation
2. Hypothesis testing
3. Interval estimation (confidence intervals or credible regions)
4. Prediction
Common types of statistical inference problems (2)
1. Point estimation: an estimator is a statistic (a function of the
data) for the purpose of guessing the value for some g(θ),
which is hopefully close to g(θ)
2. Hypothesis testing: let Ω = Ω0 ∪ Ω1 be a disjoint union. Ask
whether
H0 : θ ∈ Ω0 or H1 : θ ∈ Ω1
Common types of statistical inference problems (3)
3. Interval estimation: suppose g(θ) ∈ R. We want an interval
that contains g(θ) “with high probability”
• A level (1 − α) ∗ 100% confidence interval [l(X), u(X)]:
Pθ (g(θ) ∈ [l(X), u(X)]) ≥ 1 − α
(1 − α) is called the coverage (or confidence) level; α is the significance level
4. Prediction: what would a new data point look like?
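For the interval-estimation setting above, a small simulation can illustrate coverage. This is a sketch under assumptions not in the slides (n i.i.d. N(θ, 1) observations with known variance, and the interval X̄ ± z/√n); the function name coverage is mine.

```python
import random

def coverage(theta, n, z=1.96, trials=20000, seed=0):
    """Fraction of repeated experiments in which the interval
    [Xbar - z/sqrt(n), Xbar + z/sqrt(n)] contains theta (X_i iid N(theta, 1))."""
    rng = random.Random(seed)
    half = z / n ** 0.5
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(theta, 1) for _ in range(n)) / n
        hits += (xbar - half <= theta <= xbar + half)
    return hits / trials

print(coverage(theta=2.0, n=25))  # close to the nominal level 0.95
```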
Based on our definition, the bar for doing inference is quite low. We
need a notion of “good inference” to compare inference methods, and to
rule out the clearly useless inference methods!
What does it mean to have “the best”
estimator?
Take point estimation as an example
Objective of point estimation:
Construct a statistic T(X) that is “close” to g(θ).
• What is a formal notion of “closeness”?
• Introduce a loss function
L(θ, d) = the loss incurred when estimating g(θ) by d
Note that d will be taken to be our estimator, which is a statistic
(it depends on X)
Examples of loss functions
• Squared error loss
L(θ, d) = (d − g(θ))2
• Lp loss
L(θ, d) = |d − g(θ)|p
• ϵ-step error loss, ϵ > 0
L(θ, d) = 1 if |d − g(θ)| > ϵ, and 0 otherwise
• In general, the loss does not have to be symmetric, e.g.
L(θ, d) = 3(d − g(θ)) if d − g(θ) > 0, and −(d − g(θ)) otherwise
We assume for simplicity that the minimum of L(θ, d) over d is attained at d = g(θ)
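The loss functions above can be written out directly; a minimal sketch (function names are mine, and theta_g stands for g(θ)):

```python
def squared_error(theta_g, d):
    """Squared error loss L(theta, d) = (d - g(theta))^2."""
    return (d - theta_g) ** 2

def lp_loss(theta_g, d, p):
    """L_p loss |d - g(theta)|^p."""
    return abs(d - theta_g) ** p

def eps_step(theta_g, d, eps):
    """eps-step loss: 1 if the estimate misses by more than eps, else 0."""
    return 1.0 if abs(d - theta_g) > eps else 0.0

def asymmetric(theta_g, d):
    """An asymmetric loss: over-estimation costs three times as much."""
    return 3 * (d - theta_g) if d > theta_g else (theta_g - d)

# All four losses attain their minimum value 0 at d = g(theta)
print([f(1.0, 1.0) for f in (squared_error, lambda g, d: lp_loss(g, d, 2),
                             lambda g, d: eps_step(g, d, 0.5), asymmetric)])
# [0.0, 0.0, 0.0, 0.0]
```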
Example: a normal experiment
• Data: X = (X1 , X2 , . . . , Xn ) i.i.d. N (θ, 1)
• Estimand: g(θ) = θ
• Loss fun: L(θ, d) = (d − θ)2 , the squared error loss
If we take d = δ(X) as our estimator, then the loss L(θ, δ(X))
• is random (under repeated experiments)
• depends on the unknown θ
How do we evaluate the performance of different estimators? Consider, say,
• the sample mean X̄ = (1/n) ∑ Xi
• the sample median med(X) = median(X1 , . . . , Xn )
Risk function
Risk function
The risk function R(θ, δ) for an estimator δ is the average loss under
repeated experiments (this is the frequentist perspective).
R(θ, δ) = EX∼Pθ [L(θ, δ(X))] .
In general, we want to find estimators that have “low” risk.
But how low is low? And low for which θ?
Example: a normal experiment (cont’d)
• The risk of the sample mean is
R(θ, X̄) = Eθ [(X̄ − θ)^2] = Varθ (X̄) = 1/n
• The risk of the sample median
• is also constant over all θ
• is larger than 1/n (we will prove this later)
In this case, the sample mean is preferred under the squared error
loss (and repeated experiments) for all θ. We say the sample mean
is uniformly better than the sample median.
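The comparison on this slide can be checked numerically. Below is a Monte Carlo sketch (the helper name mc_risk is mine, not from the slides) that approximates R(θ, δ) for the sample mean and the sample median.

```python
import random
import statistics

def mc_risk(estimator, theta, n, trials=20000, seed=1):
    """Monte Carlo approximation of R(theta, delta) = E_theta[(delta(X) - theta)^2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(theta, 1) for _ in range(n)]
        total += (estimator(x) - theta) ** 2
    return total / trials

n, theta = 9, 0.0
mean_risk = mc_risk(lambda x: sum(x) / len(x), theta, n)
median_risk = mc_risk(statistics.median, theta, n)
print(mean_risk)    # close to 1/n = 0.111...
print(median_risk)  # noticeably larger, consistent with the claim on the slide
```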
A natural question:
Given a loss function, does there exist a uniformly best estimator,
which has lower risk than any other estimator over all values of θ?
The answer to the previous question is NO in general!
Proof sketch: no uniformly best estimator
1. Suppose δ ∗ is uniformly best. Then for every constant estimator δc ≡ c,
R(θ, δ ∗ ) ≤ R(θ, δc ) = L(θ, c) for all θ
2. In particular, taking c = g(θ) for each fixed θ gives R(θ, δ ∗ ) ≤ L(θ, g(θ))
3. Since L(θ, g(θ)) is the minimum possible loss, L(θ, δ ∗ (X)) = L(θ, g(θ)) for all
θ and (almost) all X. This is a degenerate case
In other words, δ ∗ (X) = arg min_d L(θ, d) no matter what the data X is.
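The first step of the argument can also be seen numerically: the (useless) constant estimator δc is unbeatable at θ = c, so a uniformly best estimator would need zero risk everywhere. A sketch (the helper mc_risk is mine, not from the slides):

```python
import random

def mc_risk(estimator, theta, n, trials=5000, seed=2):
    """Monte Carlo approximation of E_theta[(delta(X) - theta)^2]."""
    rng = random.Random(seed)
    return sum((estimator([rng.gauss(theta, 1) for _ in range(n)]) - theta) ** 2
               for _ in range(trials)) / trials

# The constant estimator delta_c = 3 has zero risk when theta happens to equal 3 ...
print(mc_risk(lambda x: 3.0, theta=3.0, n=10))  # 0.0
# ... while the sample mean pays its usual 1/n there
print(mc_risk(lambda x: sum(x) / len(x), theta=3.0, n=10))  # close to 0.1
```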
Various approaches to define “good” estimator with “low” risk
Since no uniformly best estimator exists, we must be careful when
claiming that “we have the best estimator”. We need some
compromises or restrictions in defining what “the best” means.
Three general approaches
1. Restrict attention to a smaller (but hopefully reasonable) class
of estimators (avoid comparison to all estimators)
2. Apply global measures of risk and minimize those, rather
than trying to find an estimator with the lowest risk at every θ
(no need to be good at every θ)
3. Large sample (asymptotic) approach
I. Restrict attention to a smaller class of estimators
Strategy A: restrict attention to unbiased estimators
• bias: the bias of δ is EX∼Pθ [δ(X)] − g(θ)
• UMVU: uniformly minimum variance unbiased estimator
• If δ is unbiased, then for the squared error loss
R(θ, δ) = Eθ [(δ − g(θ))^2]
= Varθ (δ) + (Eθ [δ] − g(θ))^2
= Varθ (δ)
Strategy B: restrict attention to estimators with certain symmetry
It is sometimes reasonable to require an estimator to be equivariant
δ(X1 + c, . . . , Xn + c) = δ(X1 , . . . , Xn ) + c
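The equivariance condition is easy to check for a concrete estimator; a sketch (the data and shift c are chosen arbitrarily):

```python
import random

def sample_mean(x):
    return sum(x) / len(x)

# Shifting every observation by c shifts the sample mean by exactly c,
# so the sample mean satisfies the equivariance condition above
rng = random.Random(3)
x = [rng.gauss(0, 1) for _ in range(20)]
c = 5.0
print(abs(sample_mean([xi + c for xi in x]) - (sample_mean(x) + c)) < 1e-12)  # True
```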
II. Global approaches, A
Strategy A: minimax
• We want to minimize the maximum value of the risk function, i.e. find δ ∗
satisfying
sup_{θ∈Ω} R(θ, δ ∗ ) ≤ sup_{θ∈Ω} R(θ, δ) for any other δ
• Such an estimator is called minimax. It seems rather
pessimistic:
• An estimator might be good for a vast majority of θ, but bad for
a single value of θ. It will not be considered good under
the minimax strategy!
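A classical illustration (not from the slides; it appears in, e.g., Lehmann and Casella): for S ∼ Binomial(n, θ) under squared error loss, the estimator (S + √n/2)/(n + √n) has constant risk, and its worst-case risk beats that of the sample mean, whose risk peaks at θ = 1/2.

```python
def risk_mean(theta, n):
    """R(theta, S/n) = theta(1 - theta)/n for the Bernoulli sample mean."""
    return theta * (1 - theta) / n

def risk_shrunk(theta, n):
    """Risk of delta = (S + sqrt(n)/2)/(n + sqrt(n)), via bias-variance:
    [n*theta*(1-theta) + (n/4)*(1-2*theta)^2] / (n + sqrt(n))^2, constant in theta."""
    a = n ** 0.5 / 2
    return (n * theta * (1 - theta) + a * a * (1 - 2 * theta) ** 2) / (n + 2 * a) ** 2

n = 25
grid = [i / 1000 for i in range(1001)]
print(max(risk_mean(t, n) for t in grid))    # 1/(4n) = 0.01, attained at theta = 1/2
print(max(risk_shrunk(t, n) for t in grid))  # n/(4(n + sqrt(n))^2) ~ 0.0069 < 0.01
```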
II. Global approaches, B
Strategy B: minimize the average risk
• We want to minimize the risk averaged under some weight
function, i.e. find δ ∗ to minimize
∫ R(θ, δ) dΛ(θ)
where Λ is some measure over θ
• If Λ is taken as a probability distribution over the parameter
space Ω, then it is called the prior distribution
• Depending on the prior about θ, we weight the risk differently
• Such an estimator is called a Bayes estimator
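For a concrete sketch (assumptions mine: Bernoulli data, a Beta(2, 2) prior playing the role of Λ, squared error loss), the posterior mean (S + 2)/(n + 4) should have smaller average risk than the sample mean; a paired Monte Carlo check:

```python
import random

def avg_risk(estimator, a=2.0, b=2.0, n=20, trials=20000, seed=4):
    """Average (Bayes) risk: the integral of R(theta, delta) against a Beta(a, b)
    prior, approximated by Monte Carlo over (theta, X) pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta = rng.betavariate(a, b)                    # draw theta from the prior
        s = sum(rng.random() < theta for _ in range(n))  # draw S ~ Binomial(n, theta)
        total += (estimator(s, n) - theta) ** 2
    return total / trials

posterior_mean = lambda s, n: (s + 2.0) / (n + 4.0)  # Bayes estimator under Beta(2, 2)
sample_mean = lambda s, n: s / n
print(avg_risk(posterior_mean) < avg_risk(sample_mean))  # True
```

Because both calls use the same seed, the two estimators are evaluated on the same (θ, X) draws, which makes the comparison a low-noise paired one.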
III. Large sample approach
Intuition for the large sample approach: as n tends to infinity, the
risk simplifies, and we might be able to define which estimator is the
best without making too many compromises
Now we can understand why the Lehmann and Casella book
(Theory of Point Estimation) is organized as follows
• Preparations
• Unbiasedness
• Equivariance
• Average Risk Optimality
• Minimaxity and Admissibility
• Asymptotic Optimality
What is covered in this course?
• The first half: focus on the logic of Lehmann and Casella
• The second half: focus on
• hypothesis testing
• how classical theory studies the maximum likelihood estimator
• But before the first half, we need to have some probability
background and to build the basic language
• Measure theory basics
• Exponential families
• Sufficient statistics
• Rao-Blackwell theorem (a generic way to improve an estimator)
Review of measure theory basics
Measure theory overview
• Measure theory is the foundation of probability theory, on which
all rigorous statistical theory is built
• We will go through the basics, but it is recommended to review
the Sta 711 textbooks for a thorough understanding!
Measure
Given a set X , a measure µ maps subsets A ⊆ X to [0, ∞]
• Example 1: if X is countable (e.g. X = Z), the counting
measure #(A) equals the number of points in A
• Example 2: if X = Rn , the Lebesgue measure is
λ(A) = ∫ · · · ∫_A dx1 . . . dxn = Vol(A)
Due to pathological sets, λ(A) can only be defined for some subsets
A ⊆ Rn . This leads to the introduction of σ-algebra (or σ-field).
σ-algebra
A σ-algebra F on a set X is a collection of subsets of X satisfying
• it includes X and the empty set
• it is closed under complement
• it is closed under countable unions
• Example 1: if X is countable, F = 2X (all subsets)
• Example 2: if X = Rn , F is the Borel σ-field, B, the smallest
σ-algebra that contains all rectangles.
Formal definition of measure under σ-algebra
Given (X , F) (a measurable space), a measure is any map
µ : F → [0, ∞] such that
µ(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ µ(Ai) for disjoint Ai ∈ F
If in addition µ(X ) = 1, then µ is a probability measure
Integrals
We can now define integrals using measures. Intuitively, ∫ f dµ
means summing the values of f weighted by µ
• For the counting measure, ∫ f(x) d#(x) = ∑_{x∈X} f(x)
• For the Lebesgue measure, ∫ f(x) dλ(x) = ∫ · · · ∫ f(x) dx1 . . . dxn
• Indicator function 1_{x∈A}
• Simple function ∑_i ai 1_{x∈Ai}
• Measurable function (one whose pre-images lie in F): these can be
approximated by simple functions (Theorem 1.8 in Keener)
Densities
Given (X , F) and two measures µ, P, we say P is absolutely
continuous with respect to µ if P(A) = 0 whenever µ(A) = 0. We write
this as P ≪ µ.
• If P ≪ µ, we can define the density function
p = dP/dµ,
where P(A) = ∫_A p(x) dµ(x). This p is also called the Radon-Nikodym
derivative
• If µ is the counting measure, then p is a probability mass
function. If µ is the Lebesgue measure, then p is a probability
density function
According to the definition, the density function is not unique, but any
two versions agree almost everywhere
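A pmf as a density with respect to counting measure can be sketched directly (the choice of a Binomial(3, 1/2) distribution is mine):

```python
from math import comb

def p(x, n=3, theta=0.5):
    """pmf of Binomial(3, 1/2): the density dP/d# w.r.t. counting measure on {0,...,n}."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

def P(A):
    """P(A) = integral of p over A against counting measure = sum of p over A."""
    return sum(p(x) for x in A)

print(P({0, 1, 2, 3}))  # 1.0: the density integrates to 1
print(P({0, 3}))        # 0.25
```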
Probability space
A probability space is the triple (Ω, F, P)
• Sample space Ω, which is the set of all possible outcomes
• Event space F; A ∈ F is called an event
• Probability function P, P(A) is the probability of A
Random variable
A random variable is a function Y : Ω → X
• We say Y has distribution Q (written Y ∼ Q) if
P(Y ∈ B) = P({w : Y(w) ∈ B}) = Q(B)
for every measurable B ⊆ X
Expectation
The expectation is an integral with respect to P
E[Y] = ∫_Ω Y(w) dP(w) = ∫_X x dQ(x).
Need to know more about measure theory?
• More in Keener Chap. 1
• More in Sta 711
Summary
What we have covered
• Statistical inference problem
• Intuitively how to argue for the best estimator
What is next lecture?
• Exponential families
Thank you