Chapter Four
Statistical Estimation
may instead use an interval estimate; he estimates that the mean weekly summer income of
second-year business students lies between $380 and $420.
Numerous applications of estimation occur in the real world. For example, television network
executives want to know the proportion of television viewers who are tuned in to their networks;
an economist wants to know the mean income of university graduates; and a medical researcher
wishes to estimate the recovery rate of heart attack victims treated with a new drug. In each of
these cases, to accomplish the objective exactly, the statistics practitioner would have to examine
each member of the population and then calculate the parameter of interest. For instance,
network executives would have to ask each person in the country what he or she is watching to
determine the proportion of people who are watching their shows. Because there are millions of
television viewers, the task is both impractical and prohibitively expensive.
An alternative would be to take a random sample from this population, calculate the sample
proportion, and use that as an estimator of the population proportion. The use of the sample
proportion to estimate the population proportion seems logical.
The selection of the sample statistic to be used as an estimator, however, depends on the
characteristics of that statistic. Naturally, we want to use the statistic with the following most
desirable qualities.
1. Unbiased Estimator- An unbiased estimator of a population parameter is an estimator
whose expected value is equal to that parameter.
This means that if you were to take an infinite number of samples and calculate the value of the
estimator in each sample, the average value of the estimators would equal the parameter. This
amounts to saying that, on average, the sample statistic is equal to the parameter (i.e.,
$E(\bar{X}) = \mu$). We also know that the sample proportion is an unbiased estimator of the
population proportion because $E(\hat{p}) = p$ and that the difference between two sample means
is an unbiased estimator of the difference between two population means because
$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2$.
Knowing that an estimator is unbiased only assures us that its expected value equals the
parameter; it does not tell us how close the estimator is to the parameter. Another desirable
quality is that as the sample size grows larger, the sample statistic should come closer to the
population parameter. This quality is called consistency.
2. Consistency- An unbiased estimator is said to be consistent if the difference between the
estimator and the parameter grows smaller as the sample size grows larger.
The measure we use to gauge closeness is the variance (or the standard deviation). Thus, $\bar{X}$
is a consistent estimator of $\mu$ because the variance of $\bar{X}$ is $\sigma^2/n$. This implies
that as n grows larger, the variance of $\bar{X}$ grows smaller. As a consequence, an increasing
proportion of sample means falls close to $\mu$. Similarly, $\hat{p}$ is a consistent estimator of
p because it is unbiased and the variance of $\hat{p}$ is $p(1-p)/n$, which grows smaller as n
grows larger.
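The shrinking variance of the sample mean can be checked with a small simulation. The sketch below is illustrative only: the normal population with μ = 50 and σ = 10, the number of repetitions, and the seed are assumptions chosen for the demonstration, not values from this chapter.

```python
import random
import statistics

def variance_of_sample_means(n, reps=4000, mu=50.0, sigma=10.0, seed=1):
    """Empirical variance of the sample mean over `reps` samples of size n.

    mu, sigma, reps and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.pvariance(means)

# Compare the simulated variance with the theoretical value sigma^2 / n.
for n in (5, 20, 80):
    observed = variance_of_sample_means(n)
    theoretical = 10.0 ** 2 / n
    print(f"n={n:3d}  observed={observed:7.3f}  theoretical={theoretical:7.3f}")
```

Each time n is multiplied by four, the simulated variance should fall to roughly one quarter of its previous value, in line with σ²/n.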
A third desirable quality is relative efficiency, which compares two unbiased estimators of a
parameter.
3. Relative Efficiency- If there are two unbiased estimators of a parameter, the one whose
variance is smaller is said to be relatively more efficient.
Statisticians have established that the sample median is an unbiased estimator but that its
variance is greater than that of the sample mean (when the population is normal). As a
consequence, the sample mean is relatively more efficient than the sample median when
estimating the population mean.
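The comparison between the sample mean and the sample median can be illustrated by simulation. This is a minimal sketch under assumed settings (a standard normal population, samples of size 25, 4,000 repetitions, a fixed seed); none of these values come from the text.

```python
import random
import statistics

def estimator_variances(n=25, reps=4000, mu=0.0, sigma=1.0, seed=7):
    """Return (variance of sample means, variance of sample medians).

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)

var_mean, var_median = estimator_variances()
print(f"variance of sample mean:   {var_mean:.4f}")
print(f"variance of sample median: {var_median:.4f}")
```

With a normal population, the variance reported for the median should be visibly larger than that for the mean, which is what relative efficiency describes.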
4.2 Interval Estimator of Population Mean
4.2.1 Interval Estimator for Population Mean When σ is Known
In order to develop an interval estimate of a population mean, either the population standard
deviation σ or the sample standard deviation s must be used to compute the margin of error. In
most applications σ is not known, and s is used to compute the margin of error. In some
applications, however, large amounts of relevant historical data are available and can be used to
estimate the population standard deviation prior to sampling. In addition, in quality control
applications where a process is assumed to be operating correctly, or “in control,” it is
appropriate to treat the population standard deviation as known. We refer to such cases as the σ
known case. In this section, we introduce an example in which it is reasonable to treat σ as
known and show how to construct an interval estimate for this case.
We now describe how an interval estimator is produced from a sampling distribution. Suppose
we have a population with mean μ and standard deviation σ . The population mean is assumed to
be unknown, and our task is to estimate its value. As we just discussed, the estimation procedure
requires the statistics practitioner to draw a random sample of size n and calculate the sample
mean X̄ .
The central limit theorem, presented earlier, states that $\bar{X}$ is normally distributed if X is
normally distributed, or approximately normally distributed if X is non-normal and n is
sufficiently large. This means that the variable
$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
is standard normally distributed (or approximately so).
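Thus, we can develop the following probability statement associated with the sampling
distribution of the mean:
$$P\left(\mu - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \bar{X} < \mu + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
which was derived from
$$P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha$$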
Using a similar algebraic manipulation, we can express the probability in a slightly different
form:
$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
Notice that in this form the population mean is the middle term of the inequality, bracketed by
the interval created by adding and subtracting $z_{\alpha/2}$ standard errors (the margin of
error) to and from the sample mean. It is important for you to understand that this is merely
another form of probability statement about the sample mean. This equation says that, with
repeated sampling from this population, the proportion of values of $\bar{X}$ for which the interval
$\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
includes the population mean $\mu$ is equal to $1-\alpha$. This form of probability statement is
very useful to us because it is the confidence interval estimator of $\mu$.
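The algebraic manipulation mentioned above can be written out step by step. The following is one way to carry it out, starting from the standard normal statement and isolating μ in the middle; multiplying through by −1 in the last step reverses the inequalities.

```latex
\begin{align*}
P\!\left(-z_{\alpha/2} < \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) &= 1-\alpha \\
P\!\left(-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \bar{X}-\mu < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) &= 1-\alpha \\
P\!\left(-\bar{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) &= 1-\alpha \\
P\!\left(\bar{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) &= 1-\alpha
\end{align*}
```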
A confidence interval is a range of values constructed from sample data so that the population
parameter is likely to occur within that range with a specified probability. The specified
probability is called the level of confidence.
The probability $1-\alpha$ is called the confidence level (or confidence coefficient).
$\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is called the lower confidence limit (LCL).
$\bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is called the upper confidence limit (UCL).
Because the confidence level is the probability that the interval includes the actual value of μ, we
generally set $1-\alpha$ close to 1 (usually between 0.90 and 0.99). In the table below we list four
commonly used confidence levels and their associated values of $z_{\alpha/2}$. For example, if the
confidence level is $1-\alpha = 0.95$, then $\alpha = 0.05$, $\alpha/2 = 0.025$, and
$z_{\alpha/2} = z_{0.025} = 1.96$. The resulting confidence interval estimator is then called the
95% confidence interval estimator of µ.
Four Commonly Used Confidence Levels and $z_{\alpha/2}$
1 − α     α       α/2      $z_{\alpha/2}$
0.90      0.10    0.05     $z_{0.05}$ = 1.645
0.95      0.05    0.025    $z_{0.025}$ = 1.96
0.98      0.02    0.01     $z_{0.01}$ = 2.33
0.99      0.01    0.005    $z_{0.005}$ = 2.575
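The tabulated z values can also be computed rather than looked up. The sketch below uses `NormalDist.inv_cdf` from Python's standard `statistics` module; the helper name `z_upper` is ours, introduced only for illustration.

```python
from statistics import NormalDist

def z_upper(alpha_half):
    """z value with area `alpha_half` in the upper tail of the standard normal."""
    return NormalDist().inv_cdf(1.0 - alpha_half)

for confidence in (0.90, 0.95, 0.98, 0.99):
    alpha = 1.0 - confidence
    print(f"{confidence:.2f}  alpha/2={alpha / 2:.3f}  z={z_upper(alpha / 2):.3f}")
```

The computed values agree with the table to within table rounding: the entries 2.33 and 2.575 are the usual printed-table values, while the computation gives slightly more precise figures.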
Example: The Doll Computer Company makes its own computers and delivers them directly to
customers who order them via the Internet. Doll competes primarily on price and speed of
delivery. To achieve its objective of speed, Doll makes each of its five most popular computers
and transports them to warehouses across the country. The computers are stored in the
warehouses from which it generally takes 1 day to deliver a computer to the customer. This
strategy requires high levels of inventory that add considerably to the cost. To lower these costs,
the operations manager wants to use an inventory model.
He notes that both daily demand and lead time are random variables. He concludes that demand
during lead time is normally distributed, and he needs to know the mean to compute the optimum
inventory level. He observes 25 lead time periods and records the demand during each period.
These data are listed here. The manager would like a 95% confidence interval estimate of the
mean demand during lead time. From long experience, the manager knows that the standard
deviation is 75 computers. Construct the confidence interval for Doll Computer Company.
Demand during lead time
235  261  374  466  316
309  499  253  334  421
374  361  535  296  514
462  369  330  302  344
386  332  348  439  394
We need four values to construct the confidence interval estimate of µ. They are $\bar{X}$,
$z_{\alpha/2}$, $\sigma$, and n.
Solution
$$\bar{X} = \frac{\sum x_i}{n} = \frac{9{,}254}{25} = 370.16$$
The confidence level is set at 95%; thus, $1-\alpha = 0.95$, $\alpha = 0.05$, and $\alpha/2 = 0.025$.
From the above table we can find $z_{\alpha/2} = z_{0.025} = 1.96$.
Substituting these values into the confidence interval estimator, we find
$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 370.16 \pm 1.96\,\frac{75}{\sqrt{25}} = 370.16 \pm 29.40 = (340.76,\ 399.56).$$
Here the numerical value 29.40 represents the margin of error.
Interpretation: The operations manager estimates that the mean demand during lead-time lies
between 340.76 and 399.56. He can use this estimate as an input in developing an inventory
policy.
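The arithmetic of this example can be reproduced in a few lines. The sketch below uses the 25 demand values of the example (they sum to 9,254), σ = 75 and z = 1.96 as given in the text; the function name `z_interval` is an illustrative choice of ours.

```python
import math

# The 25 observed demands during lead time (sum = 9,254).
demand = [235, 261, 374, 466, 316, 309, 499, 253, 334,
          421, 374, 361, 535, 296, 514, 462, 369,
          330, 302, 344, 386, 332, 348, 439, 394]

def z_interval(data, sigma, z):
    """Return (sample mean, margin of error, (LCL, UCL)) when sigma is known."""
    n = len(data)
    mean = sum(data) / n
    margin = z * sigma / math.sqrt(n)
    return mean, margin, (mean - margin, mean + margin)

mean, margin, (lcl, ucl) = z_interval(demand, sigma=75, z=1.96)
print(f"mean = {mean:.2f}, margin of error = {margin:.2f}")
print(f"95% confidence interval: ({lcl:.2f}, {ucl:.2f})")
```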
Practical Advice
If the population follows a normal distribution, the confidence interval provided by the
confidence interval estimator is exact. In other words, if the estimator were used repeatedly to
generate 95% confidence intervals, 95% of the intervals generated would, in the long run, contain
the population mean. If the population does not follow a normal distribution, the confidence
interval provided by the estimator will be approximate. In this case, the quality of the
approximation depends on both the distribution of the population and the sample size.
In most applications, a sample size of n ≥ 30 is adequate when using the estimator to develop
an interval estimate of a population mean. If the population is not normally distributed, but is
roughly symmetric, sample sizes as small as 15 can be expected to provide good approximate
confidence intervals. With smaller sample sizes, the estimator should only be used if the analyst
believes, or is willing to assume, that the population distribution is at least approximately
normal.
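The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a normal population with μ = 370 and σ = 75 (values chosen only to resemble the Doll example), draws many samples of size 25, and counts how often the 95% interval contains μ. The repetition count and seed are likewise assumptions.

```python
import math
import random

def coverage(mu=370.0, sigma=75.0, n=25, z=1.96, reps=10000, seed=3):
    """Fraction of simulated z intervals that contain the true mean mu.

    The population parameters, reps and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - margin < mu < xbar + margin:
            hits += 1
    return hits / reps

print(f"proportion of intervals containing mu: {coverage():.3f}")
```

With a finite number of repetitions the proportion will not be exactly 0.95, but it should be close to it.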
4.2.2 Interval Estimator of Population Mean When σ is Unknown
When developing an interval estimate of a population mean we usually do not have a good
estimate of the population standard deviation either. In these cases, we must use the same sample
to estimate both μ and σ. This situation represents the σ unknown case. When s is used to
estimate σ, the margin of error and the interval estimate for the population mean are based on a
probability distribution known as the t distribution (Student's t distribution). Although the
mathematical development of the t distribution is based on the assumption of a normal
distribution for the population we are sampling from, research shows that the t distribution can
be successfully applied in many situations where the population deviates significantly from
normal. Later in this section we provide guidelines for using the t distribution if the population is
not normally distributed.
The t distribution is a family of similar probability distributions, with a specific t distribution
depending on a parameter known as the degrees of freedom. The t distribution with one degree
of freedom is unique, as is the t distribution with two degrees of freedom, with three degrees of
freedom, and so on. As the number of degrees of freedom increases, the difference between the t
distribution and the standard normal distribution becomes smaller and smaller.
Note that a t distribution with more degrees of freedom exhibits less variability and more closely
resembles the standard normal distribution. Note also that the mean of the t distribution is zero.
We place a subscript on t to indicate the area in the upper tail of the t distribution. For example,
just as we used z0.025 to indicate the z value providing a 0.025 area in the upper tail of a standard
normal distribution, we will use t0.025 to indicate a 0.025 area in the upper tail of a t distribution.
In general, we will use the notation tα/2 to represent a t value with an area of α/2 in the upper tail
of the t distribution.
To find a t value, we can use the t distribution table. Each row in the table corresponds to a separate
t distribution with the degrees of freedom shown. For example, for a t distribution with 9
degrees of freedom, t0.025 = 2.262. Similarly, for a t distribution with 60 degrees of freedom,
t0.025 = 2.000. As the degrees of freedom continue to increase, t 0.025 approaches z 0.025 = 1.96. In
fact, the standard normal distribution z values can be found in the infinite degrees of freedom
row (labeled ∞ ) of the t distribution table. If the degrees of freedom exceed 100, the infinite
degrees of freedom row can be used to approximate the actual t value; in other words, for more
than 100 degrees of freedom, the standard normal z value provides a good approximation to the t
value.
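The table values quoted above can be reproduced numerically. Python's standard library has no t distribution, so the sketch below integrates the t density with Simpson's rule and finds the quantile by bisection. This is only one possible approach, and the function names are ours; in practice a statistics library would normally supply this value directly. The search range assumes the answer lies between 0 and 100.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def area_zero_to(x, df, steps=2000):
    """Area under the t density between 0 and x (Simpson's rule, even steps)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return total * h / 3

def t_upper(alpha_half, df):
    """t value with area `alpha_half` in the upper tail, found by bisection."""
    target = 0.5 - alpha_half  # area between 0 and the answer, by symmetry
    lo, hi = 0.0, 100.0        # assumed search range
    for _ in range(60):
        mid = (lo + hi) / 2
        if area_zero_to(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (9, 60, 1000):
    print(f"df={df:4d}  t_0.025={t_upper(0.025, df):.3f}")
```

The values for 9 and 60 degrees of freedom should match the table entries quoted in the text, and the value for a large number of degrees of freedom should be close to 1.96.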
The following characteristics of the t distribution are based on the assumption that the
population of interest is normal, or nearly normal.
a. It is, like the z distribution, a continuous distribution.
b. It is, like the z distribution, bell-shaped and symmetrical.
c. There is not one t distribution, but rather a "family" of t distributions. All t distributions
have a mean of 0, but their standard deviations differ according to the sample size, n.
There is a t distribution for a sample size of 20, another for a sample size of 22, and so
on. The standard deviation for a t distribution with 5 observations is larger than for a t
distribution with 20 observations.
d. The t distribution is more spread out and flatter at the center than the standard normal
distribution. As the sample size increases, however, the t distribution approaches the
standard normal distribution, because the errors in using s to estimate σ decrease with
larger samples.
To develop a confidence interval for the population mean using the t distribution, we adjust the
above formula to:
$$\bar{X} \pm t\,\frac{s}{\sqrt{n}}$$
To put it another way, to develop a confidence interval for the population mean with an unknown
population standard deviation we:
i. Assume the sample is from a normal population.
ii. Estimate the population standard deviation (σ) with the sample standard deviation (s).
whether σ is known or not. When σ is known, we use z; when it is not, we use t. The rule of
using z when the sample is 30 or more is based on the fact that the t distribution approaches the
normal distribution as the sample size increases.
When the sample reaches 30, there is little difference between the z and t values, so we may
ignore the difference and use z. We will show this when we discuss the details of the t
distribution and how to find values in a t distribution. The following chart summarizes the
decision-making process.
[Figure: decision chart with YES/NO branches leading to four outcomes: use an appropriate
nonparametric test, use the z distribution, use the t distribution, use the z distribution.]
The following example will illustrate a confidence interval for a population mean when the
population standard deviation is unknown and how to find the appropriate value of t in a table.
Example: A tire manufacturer wishes to investigate the tread life of its tires. A sample of 10
tires driven 50,000 miles revealed a sample mean of 0.32 inch of tread remaining with a standard
deviation of 0.09 inch. Construct a 95 percent confidence interval for the population mean.
Would it be reasonable for the manufacturer to conclude that after 50,000 miles the population
mean amount of tread remaining is 0.30 inches?
Solution: To begin, we assume the population distribution is normal. In this case, we don't have a
lot of evidence, but the assumption is probably reasonable. We do not know the population
standard deviation, but we know the sample standard deviation, which is 0.09 inches. To use the
central limit theorem, we need a large sample, that is, a sample of 30 or more. In this instance
there are only 10 observations in the sample. Hence, we cannot use the central limit theorem. We
use the formula:
$$\bar{X} \pm t\,\frac{s}{\sqrt{n}}$$
To find the value of t we use the t distribution table. The first step for locating t is to move across
the row identified for "Confidence Intervals" to the level of confidence requested. In this case we
want the 95 percent level of confidence, so we move to the column headed "95%." The column
on the left margin is identified as "df." This refers to the number of degrees of freedom. The
number of degrees of freedom is the number of observations in the sample minus the number of
samples, written n ─ 1. In this case it is 10 - 1 = 9. For a 95 percent level of confidence and 9
degrees of freedom, we select the row with 9 degrees of freedom. The value of t is 2.262.
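With t = 2.262 in hand, the interval for the tire example follows from the formula above. The sketch below carries out that arithmetic using the figures given in the example (sample mean 0.32, s = 0.09, n = 10); the resulting endpoints are our own computation, and the function name `t_interval` is an illustrative choice.

```python
import math

def t_interval(mean, s, n, t_value):
    """Confidence interval for a mean when sigma is unknown: mean ± t * s / sqrt(n)."""
    margin = t_value * s / math.sqrt(n)
    return mean - margin, mean + margin

lower, upper = t_interval(mean=0.32, s=0.09, n=10, t_value=2.262)
print(f"95% confidence interval: {lower:.3f} to {upper:.3f} inches")
print("0.30 lies inside the interval:", lower <= 0.30 <= upper)
```

Because 0.30 falls inside the computed interval, the sample gives no reason to rule out a population mean of 0.30 inch of tread remaining.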
To develop a confidence interval for a proportion, we need to meet the following assumptions.
1. The binomial conditions, discussed in Chapter 2, have been met. Briefly, these conditions
are:
a. The sample data is the result of counts.
b. There are only two possible outcomes. (We usually label one of the outcomes a
"success" and the other a "failure.")
c. The probability of a success remains the same from one trial to the next.
d. The trials are independent. This means the outcome on one trial does not affect the
outcome on another.
2. The values np and n(1 - p) should both be greater than or equal to 5. This condition allows
us to employ the standard normal distribution, that is, Z, to complete a confidence interval.
With margin of error, the general expression for an interval estimate of a population proportion is
as follows.
$$\hat{p} \pm z_{\alpha/2}\,\sigma_{\hat{p}}, \quad \text{where } \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
But $\sigma_{\hat{p}}$ cannot be used directly in the computation of the margin of error because p
will not be known; p is what we are trying to estimate. So $\hat{p}$ is substituted for p, and the
interval estimate of a population proportion is given by
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Example: The following example illustrates the computation of the margin of error and interval
estimate for a population proportion. A national survey of 900 women golfers was conducted to
learn how women golfers view their treatment at golf courses in the United States. The survey
found that 396 of the women golfers were satisfied with the availability of tee times. Thus, the
point estimate of the proportion of the population of women golfers who are satisfied with the
availability of tee times is 396/900 = 0.44. Using the above expression and a 95% confidence
level,
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.44 \pm 1.96\sqrt{\frac{0.44(1-0.44)}{900}} = 0.44 \pm 0.0324 = 0.4076 \text{ to } 0.4724$$
Thus, the margin of error is 0.0324 and the 95% confidence interval estimate of the population
proportion is 0.4076 to 0.4724. Using percentages, the survey results enable us to state with 95%
confidence that between 40.76% and 47.24% of all women golfers are satisfied with the
availability of tee times.
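The golfer example can be checked with a short computation. The sketch below uses the survey figures from the text (396 of 900, z = 1.96) and also checks the condition that both counts are at least 5; the function name `proportion_interval` is an illustrative choice of ours.

```python
import math

def proportion_interval(successes, n, z):
    """Return (p_hat, margin of error, (lower, upper)) for a population proportion."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin, (p_hat - margin, p_hat + margin)

successes, n = 396, 900
# Normal approximation condition, evaluated with the sample proportion.
assert successes >= 5 and n - successes >= 5

p_hat, margin, (lower, upper) = proportion_interval(successes, n, z=1.96)
print(f"point estimate = {p_hat:.2f}, margin of error = {margin:.4f}")
print(f"95% confidence interval: {lower:.4f} to {upper:.4f}")
```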
4.4 Determining the Sample Size
A concern that usually arises when designing a statistical study is "How many items should be in
the sample?" If a sample is too large, money is wasted collecting the data. Similarly, if the
sample is too small, the resulting conclusions will be uncertain. The necessary sample size
depends on three factors:
i. The level of confidence desired.
ii. The margin of error the researcher will tolerate.
iii. The variability in the population being studied.
The first factor is the level of confidence. Those conducting the study select the level of
confidence. The 95% and the 99% levels of confidence are the most common, but any value
between 0 and 100 percent is possible. The 95% level of confidence corresponds to a z 𝛼/2 value
of 1.96, and a 99% level of confidence corresponds to a z 𝛼/2 value of 2.58. The higher the level of
confidence selected, the larger the size of the corresponding sample.
The second factor is the allowable error. The maximum allowable error (or margin of error),
designated as E, is the amount that is added to and subtracted from the sample mean (or sample
proportion) to determine the endpoints of the confidence interval. It is the amount of error those
conducting the study are willing to tolerate. It is also one-half the width of the corresponding
confidence interval. A small allowable error will require a large sample. A large allowable error
will permit a smaller sample.
The third factor in determining the size of a sample is the population standard deviation. If the
population is widely dispersed, a large sample is required. On the other hand, if the population is
concentrated (homogeneous), the required sample size will be smaller. However, it may be
necessary to use an estimate for the population standard deviation. Here are three suggestions for
finding that estimate.
i. Use a comparable study. Use this approach when there is an estimate of the
dispersion available from another previous study. Suppose we want to estimate the
number of hours worked per week by refuse workers. Information from certain state or
federal agencies who regularly sample the workforce might be useful to provide an
estimate of the standard deviation. If a standard deviation observed in a previous study is
thought to be reliable, it can be used in the current study to help provide an approximate
sample size.
ii. Use a range-based approach. To use this approach we need to know or have an
estimate of the largest and smallest values in the population. Recall from Chapter 2,
where we described the Empirical Rule, that virtually all the observations could be
expected to be within plus or minus 3 standard deviations of the mean, assuming that the
distribution was approximately normal. Thus, the distance between the largest and the
smallest values is 6 standard deviations. We could estimate the standard deviation as one-
sixth of the range. For example, the director of operations at University Bank wants an
estimate of the number of checks written per month by college students. She believes that
the distribution is approximately normal, the minimum number of checks written is 2 per
month, and the most is 50 per month. The range of the number of checks written per
month is 48, found by 50 - 2. The estimate of the standard deviation then would be 8
checks per month, 48/6.
iii. Conduct a pilot study. This is the most common method. Suppose we want an estimate
of the number of hours per week worked by students enrolled in the College of Business at the
University of Addis. To test the validity of our questionnaire, we use it on a small sample of
students. From this small sample we compute the standard deviation of the number of hours
worked and use this value to determine the appropriate sample size.
To understand how the sample size determination process works, we return to the σ known case
presented in the above section. The confidence interval estimate is
$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
Let E = the desired margin of error, so that
$$E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
Solving for $\sqrt{n}$, we have
$$\sqrt{n} = \frac{z_{\alpha/2}\,\sigma}{E}$$
Squaring both sides of this equation, we obtain
$$n = \frac{z_{\alpha/2}^{2}\,\sigma^{2}}{E^{2}}$$
Note that we use the same formula to determine the sample size when σ is unknown, with an
estimate of σ obtained by one of the methods above used in its place.
Example: A student in public administration wants to determine the mean amount members of
city councils in large cities earn per month as remuneration for being a council member. The
error in estimating the mean is to be less than $100 with a 95 percent level of confidence. The
student found a report by the Department of Labor that estimated the standard deviation to be
$1,000. What is the required sample size?
Solution: The maximum allowable error, E, is $100. The value of z α/ 2 for a 95 percent level of
confidence is 1.96, and the estimate of the standard deviation is $1,000. Substituting these values
into the above formula gives the required sample size as:
$$n = \frac{z_{\alpha/2}^{2}\,\sigma^{2}}{E^{2}} = \frac{(1.96)^{2}(1{,}000)^{2}}{(100)^{2}} = \frac{3{,}841{,}600}{10{,}000} = 384.16$$
The computed value of 384.16 is rounded up to 385. A sample of 385 is required to meet the
specifications.
If the student wants to increase the level of confidence, for example to 99 percent, this will
require a larger sample. The z value corresponding to the 99 percent level of confidence is 2.58.
$$n = \frac{z_{\alpha/2}^{2}\,\sigma^{2}}{E^{2}} = \frac{(2.58)^{2}(1{,}000)^{2}}{(100)^{2}} = \frac{6{,}656{,}400}{10{,}000} = 665.64$$
We recommend a sample size of 666. Observe how much the change in the confidence level
changed the size of the sample. An increase from the 95 percent to the 99 percent level of
confidence resulted in an increase of 281 observations. This could greatly increase the cost of the
study, both in terms of time and money. On the other hand, it increases the confidence that can
be placed in the study's conclusions. Hence, the level of confidence should be considered carefully.
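The two sample-size calculations above can be reproduced with a small helper. This is a minimal sketch of the formula n = z²σ²/E², rounding up to the next whole observation; the function name `sample_size_for_mean` is ours.

```python
import math

def sample_size_for_mean(z, sigma, margin):
    """Smallest whole n with z * sigma / sqrt(n) <= margin, i.e. ceil((z*sigma/E)^2)."""
    return math.ceil((z * sigma / margin) ** 2)

n_95 = sample_size_for_mean(z=1.96, sigma=1000, margin=100)
n_99 = sample_size_for_mean(z=2.58, sigma=1000, margin=100)
print(f"95% confidence: n = {n_95}")
print(f"99% confidence: n = {n_99}")
print(f"additional observations needed: {n_99 - n_95}")
```

Rounding up rather than to the nearest integer guarantees that the margin of error does not exceed E.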
The procedure just described can be adapted to determine the sample size for a proportion.
Again, three items need to be specified: the desired level of confidence, the margin of error in the
population proportion and an estimate of the population proportion.
$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Solving this equation for n provides a formula for the sample size that will provide a margin of
error of size E.
$$n = \frac{z_{\alpha/2}^{2}\,\hat{p}(1-\hat{p})}{E^{2}}$$
Note, however, that we cannot use this formula to compute the sample size that will provide the
desired margin of error because $\hat{p}$ will not be known until after we select the sample. What
we need, then, is a planning value for $\hat{p}$ that can be used to make the computation. Using p*
to denote the planning value for $\hat{p}$, the following formula can be used to compute the sample size
that will provide a margin of error of size E.
$$n = \frac{z_{\alpha/2}^{2}\,p^{*}(1-p^{*})}{E^{2}}$$
In practice, the planning value p* can be chosen by one of the following procedures.
a) Use the sample proportion from a previous sample of the same or similar units.
b) Use a pilot study to select a preliminary sample. The sample proportion from this sample
can be used as the planning value, p*.
c) Use judgment or a “best guess” for the value of p*.
d) If none of the preceding alternatives apply, use a planning value of p* = 0.50.
Example: The study in the previous example also estimates the proportion of cities that have
private refuse collectors. The student wants the estimate to be within 0.10 of
the population proportion, the desired level of confidence is 90 percent, and no estimate is
available for the population proportion. What is the required sample size?
Solution: The estimate of the population proportion is to be within 0.10, so E = 0.10. The desired
level of confidence is 0.90, which corresponds to a z α/ 2 value of 1.65. Because no estimate of the
population proportion is available, we use p* = 0.50. The suggested number of observations is
$$n = \frac{z_{\alpha/2}^{2}\,p^{*}(1-p^{*})}{E^{2}} = \frac{(1.65)^{2}(0.5)(1-0.5)}{(0.1)^{2}} = \frac{0.680625}{0.01} = 68.0625$$
Therefore, the student needs a random sample of 69 cities.
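The same calculation can be written as a short function. The sketch below implements n = z²p*(1 − p*)/E² with a default planning value of p* = 0.50, as suggested when no other estimate is available; the function name `sample_size_for_proportion` is an illustrative choice of ours.

```python
import math

def sample_size_for_proportion(z, margin, p_star=0.5):
    """Sample size for estimating a proportion, rounded up to a whole observation."""
    return math.ceil(z ** 2 * p_star * (1 - p_star) / margin ** 2)

n_cities = sample_size_for_proportion(z=1.65, margin=0.10)
print(f"required sample size: {n_cities}")

# p* = 0.5 maximises p*(1 - p*), so any other planning value needs no more observations.
print(sample_size_for_proportion(z=1.65, margin=0.10, p_star=0.3))
```

Using p* = 0.50 is the conservative choice: it gives the largest sample size the formula can produce for a given z and E.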