Lecture Note
Statistical Inference (Stat-3052)
April 2020
Jimma, Ethiopia
Outline
1 Chapter 0: Preliminaries
  Definitions of Some Basic Terms
  Sampling Distribution
  What is Statistical Inference?
2 Chapter 1: Parametric Point Estimation
  Methods of Finding Parametric Point Estimators
    Maximum Likelihood (ML) Method
    Properties of MLE
    Method of Moments
  Properties of Point Estimators
    Unbiased Estimators
    Mean Square Error (MSE) of an Estimator
    Efficiency of an Estimator
Chapter 0: Preliminaries
The aim of statistical inference is to draw conclusions about the unknown constants, known as parameters, in the underlying distribution.
To emphasize the importance of the basic concepts, we begin the preliminary chapter with a review of the definitions of terms related to random sampling and the sampling distributions of some estimators.
The first step in statistical inference is point estimation, in which we compute a single value (a statistic) from the sample data to estimate a population parameter.
The general concept of point estimators, different methods of finding estimators, and their properties are discussed in Chapter 1.
We then proceed to interval estimation: a method of obtaining, at a given level of confidence (or probability), two statistics that include within their range an unknown but fixed parameter, discussed in Chapter 2.
In Chapter 3 we discuss a second major area of statistical inference, testing of hypotheses. The significance of differences between parameters estimated from two or more samples, such as the difference between two population means, is also included in this chapter.
Nonparametric methods, which are not based on parametric sampling distributions, are discussed in Chapter 4 (group work to be presented by students).
Let $X_1, X_2, \dots, X_n$ be a random sample of size $n$. Define
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i. \quad (1)$$
$$S^2 = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2. \quad (2)$$
$$S^{2*} = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2. \quad (3)$$
The sampling distribution of the sample mean $\bar{X}$:
$$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}(n\mu) = \mu, \quad \text{since } E[X_i] = \mu.$$
$$\mathrm{Var}[\bar{X}] = \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \left(\frac{1}{n}\right)^{2}\sum_{i=1}^{n}\mathrm{Var}[X_i] = \left(\frac{1}{n}\right)^{2}\sum_{i=1}^{n}\sigma^{2} = \frac{1}{n^{2}}\left(n\sigma^{2}\right) = \frac{\sigma^{2}}{n}, \quad \text{since the } X_i \text{ are iid.}$$
Therefore, $\bar{X} \sim N\!\left(\mu, \frac{\sigma^{2}}{n}\right)$ is the sampling distribution of $\bar{X}$.
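As a quick numerical check (not part of the original notes), the following Python sketch simulates many samples of size $n$ from a normal population with made-up values $\mu = 5$ and $\sigma = 2$, and confirms that the sample means concentrate around $\mu$ with variance close to $\sigma^2/n$:

```python
# Simulation sketch: the sample mean of n iid N(mu, sigma^2) draws
# has mean mu and variance sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 30, 100_000

# reps independent samples of size n; one sample mean per row
xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(np.mean(xbars))   # approximately mu = 5.0
print(np.var(xbars))    # approximately sigma^2 / n = 4/30 ~ 0.133
```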
Assessment
Every scientific discipline applies statistics to seek relevant information from a given sample of data. The procedure leads to conclusions regarding a population, which includes all possible observations of the process or phenomenon, and is called statistical inference.
The mathematical theory of statistical inference builds mostly on calculus and probability.
Statistical inference comprises three modes: 1 Point estimation; 2 Interval estimation; 3 Testing of hypotheses.
What is Statistical Inference?
Statistics is closely related to probability theory, but the two have entirely different goals.
Recall, from statistical theory, that a typical probability problem starts with some assumptions about the distribution of a random variable (e.g., that it is binomial), and the objective is to derive some properties (probabilities, expected values, etc.) of said random variable based on the stated assumptions.
In statistics, a sample from a given population is observed, and the goal is to learn
something about that population based on the sample.
Definition:
A point estimate of some population parameter $\theta$ is a single numerical value of a statistic $\hat{\theta}$. The statistic $\hat{\theta}$ is called the point estimator.
Example: Consider the Bernoulli pmf of a discrete random variable $X$ taking values in $\{0, 1\}$, with parameter $p$ and parameter space $\Theta = (0, 1)$:
$$f(x_j \mid p) = p^{x_j}(1-p)^{1-x_j}, \quad \text{where } x_j = 0, 1.$$
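For instance (a sketch with made-up 0/1 data), the likelihood of $p$ under this pmf can be evaluated on a grid; its maximizer, previewing the maximum likelihood method below, is the sample proportion:

```python
# Bernoulli log-likelihood on a grid of p values; the grid maximizer
# agrees with the closed-form MLE, the sample proportion.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # made-up data
grid = np.linspace(0.01, 0.99, 981)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)

print(grid[np.argmax(loglik)])   # ~0.70
print(x.mean())                  # sample proportion: 0.70
```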
For a random sample $x_1, \dots, x_n$ from $N(\mu, \sigma^2)$, the likelihood of $\mu$ is
$$L(\mu \mid x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(\frac{-1}{2\sigma^2}(x_j-\mu)^2\right) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\left(\frac{-1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\right)$$
and the log-likelihood is
$$\ln L(\mu \mid x) = \ln\left[\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\left(\frac{-1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\right)\right] = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2. \quad (6)$$
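The following sketch (made-up data, $\sigma$ fixed at 1 for simplicity) numerically maximizes the log-likelihood in Equation (6) over $\mu$ and confirms that the maximizer is the sample mean:

```python
# Numerically maximize the normal log-likelihood over mu (sigma fixed).
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([4.1, 5.3, 4.8, 5.9, 5.0, 4.6])   # made-up observations
sigma = 1.0

def neg_loglik(mu):
    # -ln L(mu | x), dropping the constant terms that do not involve mu
    return np.sum((x - mu) ** 2) / (2 * sigma ** 2)

res = minimize_scalar(neg_loglik, bounds=(0, 10), method="bounded")
print(res.x, x.mean())   # both ~4.95: the MLE of mu is the sample mean
```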
The first sample moment is
$$\mu_1 = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X},$$
and the first population moment is $m_1 = E[X] = \mu$.
Equating $\mu_1 = m_1$:
$$\mu_1 = m_1 \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} X_i = \mu.$$
Therefore, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$.
The second population moment is $m_2 = E\left[X^2\right] = \sigma^2 + \mu^2$ (Verify!).
Equating $\mu_2 = m_2$:
$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 = \sigma^2 + \mu^2$$
$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 = \sigma^2 + \bar{X}^2, \quad \text{since } \hat{\mu} = \bar{X}$$
$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \sigma^2$$
$$\frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 = \sigma^2$$
Thus, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$.
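A short sketch of the method-of-moments recipe above, using simulated $N(3, 4)$ data (made-up parameters):

```python
# Method of moments: equate the first two sample moments to
# E[X] = mu and E[X^2] = sigma^2 + mu^2, then solve.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=10_000)

m1 = x.mean()              # (1/n) sum X_i
m2 = np.mean(x ** 2)       # (1/n) sum X_i^2

mu_hat = m1                # mu-hat = X-bar
sigma2_hat = m2 - m1 ** 2  # sigma^2-hat = (1/n) sum X_i^2 - X-bar^2

print(mu_hat, sigma2_hat)  # approximately 3.0 and 4.0
```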
The main point here is how close an estimator is to the true value of the unknown parameter.
When an estimator is unbiased, its bias is zero.
Example 1: Suppose that X is a random variable with mean µ and variance σ². Let X1, ..., Xn be a random sample of size n from the population represented by X. Show that X̄ and S²* defined in Equations (1) and (3) are unbiased estimators of µ and σ², respectively.
Discussion 1:
$$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n}X_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[X_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{1}{n}(n\mu) = \mu, \quad \text{since } E[X_i] = \mu.$$
Therefore, $\bar{X}$ is an unbiased estimator of the population mean $\mu$.
Next,
$$\mathrm{Var}[X_i] = E\left[(X_i - \mu)^2\right]$$
$$\sigma^2 = E\left[X_i^2 - 2\mu X_i + \mu^2\right] = E\left[X_i^2\right] - 2\mu E[X_i] + \mu^2 = E\left[X_i^2\right] - 2\mu^2 + \mu^2 = E\left[X_i^2\right] - \mu^2$$
$$\Rightarrow E\left[X_i^2\right] = \sigma^2 + \mu^2.$$
To show that $E\left[\bar{X}^2\right] = \mu^2 + \frac{\sigma^2}{n}$ (**):
$$\mathrm{Var}[\bar{X}] = E\left[\bar{X}^2\right] - \left(E[\bar{X}]\right)^2$$
$$\frac{\sigma^2}{n} = E\left[\bar{X}^2\right] - \mu^2$$
$$\Rightarrow E\left[\bar{X}^2\right] = \mu^2 + \frac{\sigma^2}{n}.$$
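A simulation sketch (assumed values $\mu = 10$, $\sigma^2 = 9$, $n = 5$) illustrating Example 1: the averages of $\bar{X}$ and $S^{2*}$ across many samples match $\mu$ and $\sigma^2$, while the divisor-$n$ estimator $S^2$ from Equation (2) falls short by the factor $(n-1)/n$:

```python
# Unbiasedness check: X-bar and S^2* (divisor n-1) versus S^2 (divisor n).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 10.0, 9.0, 5, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = samples.mean(axis=1)
s2_star = samples.var(axis=1, ddof=1)   # divisor n-1, Equation (3)
s2 = samples.var(axis=1, ddof=0)        # divisor n, Equation (2)

print(xbar.mean())     # ~10.0 -> unbiased for mu
print(s2_star.mean())  # ~9.0  -> unbiased for sigma^2
print(s2.mean())       # ~7.2  -> biased: (n-1)/n * sigma^2 = 4/5 * 9
```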
Assertion:
The mean square error of $\hat{\theta}$ equals the variance of the estimator plus the squared bias. That is, $\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}[\hat{\theta}] + (\mathrm{Bias})^2$.
Proof:
$$\mathrm{MSE}(\hat{\theta}) = E\left[\left(\hat{\theta}-\theta\right)^2\right] = E\left[\left(\left(\hat{\theta}-E[\hat{\theta}]\right) + \left(E[\hat{\theta}]-\theta\right)\right)^2\right]$$
$$= E\left[\left(\hat{\theta}-E[\hat{\theta}]\right)^2\right] + \left(E[\hat{\theta}]-\theta\right)^2 + 2\underbrace{E\left[\hat{\theta}-E[\hat{\theta}]\right]}_{=0\;(***)}\left(E[\hat{\theta}]-\theta\right) \quad (10)$$
$$= E\left[\left(\hat{\theta}-E[\hat{\theta}]\right)^2\right] + \left(E[\hat{\theta}]-\theta\right)^2 = \mathrm{Var}[\hat{\theta}] + (\mathrm{Bias})^2.$$
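A quick numeric check of the identity (again with made-up values), estimating the MSE, variance, and bias of the divisor-$n$ variance estimator by simulation:

```python
# Numeric sketch of MSE = Var + Bias^2 for the biased estimator S^2.
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, reps = 9.0, 5, 500_000

s2 = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n)).var(axis=1, ddof=0)

mse = np.mean((s2 - sigma2) ** 2)
var = s2.var()
bias2 = (s2.mean() - sigma2) ** 2

print(mse, var + bias2)   # the two agree, up to Monte Carlo error
```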
The mean square error of an estimator, which equals the sum of its variance and the square of its bias, can be used as a relative measure of efficiency when comparing two or more estimators: the relative efficiency of $\hat{\theta}_1$ to $\hat{\theta}_2$ is $\mathrm{RE} = \mathrm{MSE}(\hat{\theta}_1)/\mathrm{MSE}(\hat{\theta}_2)$.
Remark:
If this relative efficiency is less than 1, we conclude that θ̂1 is a more efficient estimator of θ than θ̂2, in the sense that it has a smaller mean square error.
Definition: Consistency
An estimator $\hat{\theta}_n$, based on a sample of size $n$, is a consistent estimator of a parameter $\theta$ if, for any positive number $\epsilon$,
$$\lim_{n\to\infty} \Pr\left[\left|\hat{\theta}_n - \theta\right| \leq \epsilon\right] = 1. \quad (12)$$
Example: Consider the estimators
$$\hat{\theta}_1 = \frac{X_1 + X_2 + \dots + X_7}{7} \quad \text{and} \quad \hat{\theta}_2 = \frac{2X_1 - X_6 + X_4}{2}.$$
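A Monte Carlo sketch comparing the two estimators (standard normal data assumed): both are unbiased for the mean, but $\hat{\theta}_1$ has variance $\sigma^2/7 \approx 0.143$ while $\hat{\theta}_2$ has variance $(4+1+1)\sigma^2/4 = 1.5\sigma^2$, so $\hat{\theta}_1$ is far more efficient:

```python
# Both estimators are unbiased; theta1 has much smaller variance.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=(200_000, 7))   # mu = 0, sigma = 1

theta1 = x.mean(axis=1)
theta2 = (2 * x[:, 0] - x[:, 5] + x[:, 3]) / 2   # (2X1 - X6 + X4)/2

print(theta1.mean(), theta2.mean())   # both ~0 -> both unbiased
print(theta1.var(), theta2.var())     # ~0.143 versus ~1.5
```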
2. Let $\sigma^2$ be unknown. Set $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$.
Recall that
$$\frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2 \sim \chi^2_{(n-1)}.$$
From the chi-square tables, determine any pair $0 < L < U$ for which $\Pr[L \leq X \leq U] = 1 - \alpha$, where $X \sim \chi^2_{n-1}$. Then we have
$$\Pr\left[L \leq \frac{(n-1)S^2}{\sigma^2} \leq U\right] = 1 - \alpha, \quad \text{for all } \sigma^2$$
$$\Pr\left[\frac{1}{U} \leq \frac{\sigma^2}{(n-1)S^2} \leq \frac{1}{L}\right] = 1 - \alpha$$
$$\Pr\left[\frac{(n-1)S^2}{U} \leq \sigma^2 \leq \frac{(n-1)S^2}{L}\right] = 1 - \alpha$$
$$\Pr\left[\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right] = 1 - \alpha,$$
where $\chi^2_{n-1,\alpha/2}$ and $\chi^2_{n-1,1-\alpha/2}$ are the values that a $\chi^2_{n-1}$ variate exceeds with probabilities $\alpha/2$ and $1-\alpha/2$, respectively.
Example: The compressive strengths of 40 test cubes of concrete were measured, with sample mean and sample standard deviation of 60.14 and 5.02 N/mm², respectively. We also assume that the compressive strengths are normally distributed. To facilitate the application, let us assume that the estimated standard deviation of 5.02 N/mm² is the true known value.
a. Construct a 95% confidence interval for the population mean µ.
b. Construct an upper one-sided 99% confidence limit for the population variance.
c. Construct a 95% two-sided confidence interval for the population variance.
Discussion: Given $n = 40$, $\bar{X} = 60.14$ and $\hat{S} = 5.02$.
a. From the standard normal table, $Z_{\alpha/2} = Z_{0.025} = 1.96$. Using Equation (16), we have
$$\Pr\left[\bar{X} - Z_{\alpha/2}\,\hat{S}/\sqrt{n} \leq \mu \leq \bar{X} + Z_{\alpha/2}\,\hat{S}/\sqrt{n}\right] = 1 - \alpha$$
$$\Pr\left[60.14 - 1.96 \times 5.02/\sqrt{40} \leq \mu \leq 60.14 + 1.96 \times 5.02/\sqrt{40}\right] = 0.95$$
$$\Pr\left[58.58 \leq \mu \leq 61.70\right] = 0.95$$
Therefore we are 95% confident that the interval (58.58, 61.70) includes the true population mean µ.
The length of the confidence interval is 61.70 − 58.58 = 3.12.
Example Cont...
b. From the $\chi^2$ table, $\chi^2_{n-1,\,1-\alpha} = \chi^2_{39,\,0.99} = 21.426$. Using Equation (18), we have
$$\Pr\left[\sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha}}\right] = 1 - \alpha$$
$$\Pr\left[\sigma^2 \leq \frac{39(5.02)^2}{\chi^2_{39,0.99}}\right] = 0.99$$
$$\Pr\left[\sigma^2 \leq \frac{39(25.2004)}{21.426}\right] = 0.99 \;\Rightarrow\; \Pr\left[\sigma^2 \leq 45.87\right] = 0.99.$$
Hence the 99% upper confidence limit for $\sigma$ is $\sqrt{45.87} = 6.77$ N/mm².
c. From the $\chi^2$ table, $\chi^2_{n-1,\,\alpha/2} = \chi^2_{39,\,0.025} = 58.120$ and $\chi^2_{n-1,\,1-\alpha/2} = \chi^2_{39,\,0.975} = 23.654$. Using Equation (17), we have
$$\Pr\left[\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right] = 1 - \alpha$$
$$\Pr\left[\frac{(39)(25.2004)}{\chi^2_{39,0.025}} \leq \sigma^2 \leq \frac{(39)(25.2004)}{\chi^2_{39,0.975}}\right] = 0.95$$
$$\Pr\left[\frac{982.816}{58.120} \leq \sigma^2 \leq \frac{982.816}{23.654}\right] = 0.95 \;\Rightarrow\; \Pr\left[16.91 \leq \sigma^2 \leq 41.55\right] = 0.95.$$
Hence the 95% two-sided confidence interval for $\sigma$ is (4.11, 6.45) in N/mm².
This interval is fairly wide because there is a lot of variability in the measured compressive strengths of the concrete cubes.
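All three intervals can be reproduced with scipy (a sketch; the quantile calls below are standard scipy.stats functions):

```python
# Reproducing the concrete-cube intervals: (a) z-interval for the mean,
# (b) and (c) chi-square intervals for the variance.
import numpy as np
from scipy import stats

n, xbar, s = 40, 60.14, 5.02

# (a) 95% z-interval for the mean (sigma taken as known)
z = stats.norm.ppf(0.975)                                     # 1.96
print(xbar - z * s / np.sqrt(n), xbar + z * s / np.sqrt(n))   # (58.58, 61.70)

# (b) 99% upper confidence limit for sigma^2
chi2_low = stats.chi2.ppf(0.01, df=n - 1)                     # 21.426
print((n - 1) * s**2 / chi2_low)                              # ~45.87

# (c) 95% two-sided interval for sigma^2
u = stats.chi2.ppf(0.975, df=n - 1)                           # 58.120
l = stats.chi2.ppf(0.025, df=n - 1)                           # 23.654
print((n - 1) * s**2 / u, (n - 1) * s**2 / l)                 # ~(16.91, 41.55)
```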
Simultaneous Confidence Interval for the Mean and Variance (Small Sample)
In Examples 1 and 2, the position was adopted that only one of the parameters in the $N(\mu, \sigma^2)$ distribution was unknown. In practice, both µ and σ² are most often unknown. In this subsection we pave the way to solving that problem.
Definition:
If $\bar{X}$ and $S$ are the sample mean and sample standard deviation of a random sample $X_1, X_2, \dots, X_n$ from a normal distribution with unknown variance $\sigma^2$, a $100(1-\alpha)\%$ confidence interval for the population mean µ is
$$\left(\bar{X} - t_{n-1;\alpha/2}\frac{S}{\sqrt{n}},\; \bar{X} + t_{n-1;\alpha/2}\frac{S}{\sqrt{n}}\right)$$
where $t_{n-1;\alpha/2}$ is the upper $100\alpha/2$ percentage point of the t distribution with $n-1$ degrees of freedom.
One-sided confidence bounds for the mean based on the t distribution are also of interest; they are obtained simply by using only the appropriate lower or upper confidence limit from Equation (19) and replacing $t_{n-1;\alpha/2}$ with $t_{n-1;\alpha}$.
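A minimal sketch of the t-interval, assuming a small made-up data set; scipy's t.interval reproduces the hand formula:

```python
# 95% t-interval for the mean with sigma unknown.
import numpy as np
from scipy import stats

x = np.array([12.1, 11.6, 12.9, 12.4, 11.8, 12.2, 12.7, 12.0])  # made-up
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

t = stats.t.ppf(0.975, df=n - 1)
print(xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

# equivalently, via the built-in interval helper
print(stats.t.interval(0.95, df=n - 1, loc=xbar, scale=s / np.sqrt(n)))
```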
Simultaneous Confidence Interval cont...
$$\left(\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}},\; \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right) = \left(\frac{9(0.284^2)}{\chi^2_{9,0.025}},\; \frac{9(0.284^2)}{\chi^2_{9,0.975}}\right) = \left(\frac{0.726}{19.02},\; \frac{0.726}{2.70}\right) = (0.038,\; 0.269)$$
This last expression may be converted into a confidence interval on the standard deviation σ by taking the square root of both sides, resulting in (0.195, 0.518).
Therefore, at the 95% level of confidence, the thermal conductivity data indicate that the process standard deviation could be as small as 0.195 Btu/hr-ft-°F and as large as 0.518 Btu/hr-ft-°F.
Population Proportion
Suppose that a random sample of size n has been taken from a large (possibly infinite) population and that $X \leq n$ observations in this sample belong to a class of interest. Then
$$\hat{P} = \frac{X}{n} \quad (21)$$
is a point estimator of the proportion p of the population that belongs to this class, where n and p are the parameters of a binomial distribution.
For large n,
$$Z = \frac{\hat{P} - p}{\sqrt{\frac{p(1-p)}{n}}} \approx N(0, 1),$$
where the quantity $\sqrt{\frac{p(1-p)}{n}}$ in Equation (22) is called the standard error of the point estimator $\hat{P}$.
Let $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ be independent random samples from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ distributions, respectively, with $\mu_1$, $\mu_2$ and the common variance $\sigma^2$ unknown.
Further recall that if
$$S_X^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(X_i - \bar{X}\right)^2 \quad \text{and} \quad S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2,$$
then
$$\frac{(m-1)S_X^2}{\sigma^2} \sim \chi^2_{(m-1)} \quad \text{and} \quad \frac{(n-1)S_Y^2}{\sigma^2} \sim \chi^2_{(n-1)}.$$
By independence of the X and Y samples,
$$\frac{(m-1)S_X^2 + (n-1)S_Y^2}{\sigma^2} \sim \chi^2_{(m+n-2)}. \quad (24)$$
From Equations (23) and (24),
$$\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2}\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}} \sim t_{m+n-2}. \quad (25)$$
Thus, from the t-distribution in Equation (25), the confidence interval for the difference of the true population means $\mu_1 - \mu_2$ is
$$\bar{X} - \bar{Y} \pm t_{m+n-2,\alpha/2}\sqrt{\frac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2}\left(\frac{1}{m} + \frac{1}{n}\right)}. \quad (26)$$
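A sketch of Equation (26) with two made-up samples assumed to share a common variance:

```python
# Pooled-variance 95% CI for mu1 - mu2, Equation (26).
import numpy as np
from scipy import stats

x = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.3])   # made-up, m = 6
y = np.array([4.5, 4.9, 4.2, 4.8, 4.6])        # made-up, n = 5
m, n = len(x), len(y)

sp2 = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)
half = stats.t.ppf(0.975, df=m + n - 2) * np.sqrt(sp2 * (1 / m + 1 / n))

diff = x.mean() - y.mean()
print(diff - half, diff + half)   # 95% CI for mu1 - mu2
```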
Now let $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ be independent random samples from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ distributions, respectively, with all of $\mu_1$, $\mu_2$, $\sigma_1^2$ and $\sigma_2^2$ unknown.
A $100(1-\alpha)\%$ confidence interval for $\frac{\sigma_1^2}{\sigma_2^2}$ is then
$$\left(\frac{S_X^2}{S_Y^2}\,F_{n-1,m-1;1-\alpha/2},\; \frac{S_X^2}{S_Y^2}\,F_{n-1,m-1;\alpha/2}\right). \quad (28)$$
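Equation (28) can likewise be evaluated with scipy's F quantiles (made-up samples; under the convention used here, $F_{n-1,m-1;1-\alpha/2}$ is the lower quantile and $F_{n-1,m-1;\alpha/2}$ the upper):

```python
# 95% CI for the variance ratio sigma1^2 / sigma2^2, Equation (28).
import numpy as np
from scipy import stats

x = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.3])   # made-up, m = 6
y = np.array([4.5, 4.9, 4.2, 4.8, 4.6])        # made-up, n = 5
m, n = len(x), len(y)

ratio = x.var(ddof=1) / y.var(ddof=1)
lo = ratio * stats.f.ppf(0.025, dfn=n - 1, dfd=m - 1)   # F_{n-1,m-1;1-a/2}
hi = ratio * stats.f.ppf(0.975, dfn=n - 1, dfd=m - 1)   # F_{n-1,m-1;a/2}
print(lo, hi)
```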
Figure: Confidence interval for the ratio of variances from two independent populations (F-distribution)
1. Let X̄ = 102, n = 50 and σ² = 10. What is a 95% confidence interval for µ?
2. A survey was made in the core course, asking (among other things) the annual salary of the jobs that the students had before enrolling as full-time PhD students. Here is a subset (n = 10) of those responses (in thousands of dollars):
20, 34, 52, 21, 26, 29, 71, 41, 23, 67
a. Construct a 95% confidence interval for the true average income of incoming full-time PhD students.
b. Construct a 95% confidence interval for the true standard deviation of income of incoming full-time PhD students.
3. A forester wishes to estimate the average number of "count trees" per acre (trees larger
than a specified size) on a 2,000-acre plantation. She can then use this information to
determine the total timber volume for trees in the plantation. A random sample of n = 50
one-acre plots is selected and examined. The average (mean) number of count trees per
acre is found to be 27.3, with a standard deviation of 12.1. Use this information to
construct a 99% confidence interval for µ, the mean number of count trees per acre for the
entire plantation.
Figure: The distribution of Z when H0 : µ = µ0 is true, with critical region for the two-sided alternative Ha : µ ≠ µ0
Figure: Critical regions for the one-sided alternative Ha : θ > θ0 (left) and the one-sided alternative Ha : θ < θ0 (right), for standard normal Z
Critical values: The values of the test statistic that separate the rejection and non-rejection regions. They are the boundary values corresponding to the preset significance level.
Rejection region: The set of values of the test statistic that leads to rejection of H0.
Non-rejection region: The set of values not in the rejection region, which leads to non-rejection of H0.
                     H0 True            H0 False
Fail to reject H0    Correct decision   Type II error
Reject H0            Type I error       Correct decision
The type I error specification is the probability of making an error when the null hypothesis is true. This specification is commonly represented by the symbol α.
For example, if we say that a test has α ≤ 0.05, we guarantee that if the null hypothesis is true the test will err at most 1 time in 20.
P(type I error) = P(rejecting H0 when H0 is true) = α (the significance level).
The type II error specification is the probability of making an error when the null hypothesis is false. This specification is commonly represented by the symbol β.
P(type II error) = P(accepting H0 when H0 is false) = β.
For example, if we say that β is unknown for a test, we cannot guarantee how the test will behave when the null hypothesis is actually false.
The power of a test is the probability of correctly rejecting the null hypothesis when it is false; that is, the probability of making the correct decision when the alternative hypothesis is true.
Thus the power is 1 − β.
P-Value
The p-value is the smallest significance level at which the null hypothesis can be rejected; it is not the probability that the null hypothesis is true.
We noted previously that reporting the results of a hypothesis test in terms of a P-value is
very useful because it conveys more information than just the simple statement "reject H0 "
or "fail to reject H0 ".
The p-value is a number between 0 and 1 that represents a probability.
The observed level of significance or p-value is the probability of obtaining a result as far
away from the expected value as the observation is, or farther, purely by chance, when the
null hypothesis is true.
Notice that a smaller observed level of significance indicates that the null hypothesis is less
likely.
If this observed level of significance is small enough, we conclude that the null hypothesis
is not plausible.
In many instances we choose a critical level of significance before observations are made.
The most common choices for the critical level of significance are 10%, 5%, and 1%.
If the observed level of significance is smaller than a particular critical level of significance,
we say that the result is statistically significant at that level of significance.
If the observed level of significance is not smaller than the critical level of significance, we
say that the result is not statistically significant at that level of significance.
Assume that a random sample X1, X2, ..., Xn has been taken from the population. Based on our previous discussion, the sample mean X̄ is an unbiased point estimator of µ with variance σ²/n.
1. The test of hypotheses:
$$H_0 : \mu = \mu_0 \quad \text{versus} \quad H_a : \mu \neq \mu_0$$
where µ0 is a specified constant.
2. The test statistic:
$$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \quad (29)$$
3. If the null hypothesis $H_0 : \mu = \mu_0$ is true, then $E[\bar{X}] = \mu_0$ and $Z_{cal} \sim N(0, 1)$. It follows that the probability is $1 - \alpha$ that the test statistic $Z_{cal}$ falls between $-Z_{\alpha/2}$ and $Z_{\alpha/2}$, where $Z_{\alpha/2}$ is the upper $100\alpha/2$ percentage point of the standard normal distribution. That is,
$$\Pr\left[-Z_{\alpha/2} \leq Z_{cal} \leq Z_{\alpha/2}\right] = 1 - \alpha.$$
4. Rejection region: Reject H0 if the observed value of the test statistic satisfies $Z_{cal} > Z_{\alpha/2}$ or $Z_{cal} < -Z_{\alpha/2}$.
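A sketch of steps 1 through 4 with made-up numbers (n = 36, X̄ = 101.2, σ = 4, µ0 = 100):

```python
# Two-sided z-test for a mean with sigma known, Equation (29).
import numpy as np
from scipy import stats

n, xbar, sigma, mu0, alpha = 36, 101.2, 4.0, 100.0, 0.05

z_cal = (xbar - mu0) / (sigma / np.sqrt(n))
z_crit = stats.norm.ppf(1 - alpha / 2)            # 1.96

print(z_cal, z_crit)                  # 1.8 < 1.96 -> fail to reject H0
print(2 * stats.norm.sf(abs(z_cal)))  # two-sided p-value ~0.072
```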
The important point upon which the test procedure relies is that if X1, X2, ..., Xn is a random sample from a normal distribution with mean µ and unknown variance σ², then the random variable
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \quad (30)$$
has a t distribution with n − 1 degrees of freedom.
Based on our previous discussion, the sample mean X̄ is an unbiased point estimator of µ with estimated standard error S/√n, and we have:
1. The test of hypotheses:
$$H_0 : \mu = \mu_0 \quad \text{versus} \quad H_a : \mu \neq \mu_0$$
where µ0 is a specified constant.
2. The test statistic:
$$T_c = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \quad (31)$$
3. To control the type I error probability at the desired level α, we take the t percentage points $-t_{\alpha/2,\,n-1}$ and $t_{\alpha/2,\,n-1}$ as the boundaries of the critical region, so that we reject $H_0 : \mu = \mu_0$ if
$$T_c > t_{\alpha/2,\,n-1} \quad \text{or} \quad T_c < -t_{\alpha/2,\,n-1}.$$
Figure: Critical regions for the two-sided alternative (a), the one-sided alternative Ha : θ > θ0 (left), and the one-sided alternative Ha : θ < θ0 (right), for Student-t distributed T
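A sketch of the t-test with made-up data; scipy.stats.ttest_1samp reproduces the hand computation of Equation (31):

```python
# One-sample two-sided t-test, Equation (31).
import numpy as np
from scipy import stats

x = np.array([9.8, 10.4, 10.1, 9.7, 10.6, 10.2, 9.9, 10.3])  # made-up
mu0 = 10.0

t_c = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
print(t_c, stats.t.ppf(0.975, df=len(x) - 1))  # compare to +-t_{a/2, n-1}

print(stats.ttest_1samp(x, mu0))               # same statistic and p-value
```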
where
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.$$
1. The test of hypotheses:
$$H_0 : \sigma^2 = \sigma_0^2 \quad \text{versus} \quad H_a : \sigma^2 \neq \sigma_0^2$$
where σ0² is a specified constant population variance.
2. The test statistic:
$$\chi^2_{cal} = \frac{(n-1)S^2}{\sigma_0^2} \quad (32)$$
3. To control the type I error probability at the desired level α, we take the $\chi^2$ percentage points $\chi^2_{n-1,\alpha/2}$ and $\chi^2_{n-1,1-\alpha/2}$ as the boundaries of the critical region, so that we reject $H_0 : \sigma^2 = \sigma_0^2$ if
$$\chi^2_{cal} > \chi^2_{n-1,\alpha/2} \quad \text{or} \quad \chi^2_{cal} < \chi^2_{n-1,1-\alpha/2},$$
recalling that $\chi^2_{n-1,\alpha/2}$ denotes the value exceeded with probability $\alpha/2$.
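A sketch with made-up numbers (n = 20, S² = 0.0153, σ0² = 0.01):

```python
# Two-sided chi-square test for a variance, Equation (32).
import numpy as np
from scipy import stats

n, s2, sigma2_0, alpha = 20, 0.0153, 0.01, 0.05

chi2_cal = (n - 1) * s2 / sigma2_0
lower = stats.chi2.ppf(alpha / 2, df=n - 1)        # ~8.907
upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)    # ~32.852

print(chi2_cal)   # 29.07 lies inside (8.907, 32.852) -> fail to reject H0
```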
Recall that a random sample of size n has been taken from a large (possibly infinite) population and that X (≤ n) observations in this sample belong to a class of interest.
Then P̂ = X/n is a point estimator of the proportion p of the population that belongs to this class.
Note that X has a binomial distribution with parameters n and p, with mean np and variance np(1 − p); the normal approximation to the binomial applies if p is not too close to either 0 or 1 and n is relatively large.
1. The test of hypotheses:
$$H_0 : p = p_0 \quad \text{versus} \quad H_a : p \neq p_0$$
where p is the binomial parameter, assuming that $X \approx N(np_0, np_0(1-p_0))$ under H0.
2. The test statistic:
$$Z_{cal} = \frac{X - np_0}{\sqrt{np_0(1-p_0)}} \quad (33)$$
3. Rejection region: Reject H0 if the observed value of the test statistic satisfies $Z_{cal} > Z_{\alpha/2}$ or $Z_{cal} < -Z_{\alpha/2}$.
4. Conclusion: Reject H0 since $Z_{cal} = -1.95 < -Z_{\alpha} = -Z_{0.05} = -1.645$. We conclude that the process is capable.
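The quoted conclusion is consistent with, for instance, n = 200 trials, X = 4 successes and p0 = 0.05 under a one-sided alternative Ha : p < p0; these illustrative numbers are an assumption, not given in the notes:

```python
# One-proportion z-test, Equation (33); numbers are assumed for illustration.
import numpy as np
from scipy import stats

n, x_obs, p0, alpha = 200, 4, 0.05, 0.05   # hypothetical setup

z_cal = (x_obs - n * p0) / np.sqrt(n * p0 * (1 - p0))
print(z_cal)                        # ~ -1.95
print(-stats.norm.ppf(1 - alpha))   # one-sided critical value -1.645
```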
1. The test of hypotheses: $H_0 : \mu_1 - \mu_2 = D_0$ versus $H_a : \mu_1 - \mu_2 \neq D_0$.
2. The test statistic:
$$T_c = \frac{\bar{X}_1 - \bar{X}_2 - D_0}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{92.255 - 92.733 - 0}{2.70\sqrt{\frac{1}{8} + \frac{1}{8}}} = -0.35 \quad (34)$$
4. Conclusion: Since $-t_{0.025,14} = -2.145 < T_c = -0.35 < t_{0.025,14} = 2.145$, H0 is not rejected. That is, at the α = 0.05 level of significance, we do not have strong evidence to conclude that catalyst 2 results in a mean yield that differs from the mean yield when catalyst 1 is used.
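A sketch verifying the arithmetic in Equation (34) from the quoted summary statistics:

```python
# Pooled two-sample t statistic for the catalyst example.
import numpy as np
from scipy import stats

xbar1, xbar2, sp, n1, n2, d0 = 92.255, 92.733, 2.70, 8, 8, 0.0

t_c = (xbar1 - xbar2 - d0) / (sp * np.sqrt(1 / n1 + 1 / n2))
print(t_c)                                  # ~ -0.35
print(stats.t.ppf(0.975, df=n1 + n2 - 2))   # t_{0.025,14} = 2.145
```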
Tests on Two Population Proportions
Recall that random samples of sizes n1 and n2 have been taken from two large (possibly infinite) populations, and that X1 (≤ n1) and X2 (≤ n2) observations in these samples belong to a class of interest.
Then P̂1 = X1/n1 and P̂2 = X2/n2 are point estimators of the population proportions p1 and p2 that belong to this class.
Note that Xi has a binomial distribution with parameters ni and pi, with mean ni pi and variance ni pi(1 − pi); the normal approximation applies if pi is not too close to either 0 or 1 and ni is relatively large.
1. The test of hypotheses:
$$H_0 : p_1 = p_2$$