Engineering Data Analysis Guide
Republic of the Philippines
Cagayan State University
Carig Campus
COLLEGE OF ENGINEERING

FLIPPED NOTES NUMBER 7

In partial fulfilment for the requirements of the course
ENGINEERING DATA ANALYSIS

By:
SUBONG, JOEMAR D.
BACANI, VALERIE ELAINE M.

Transforming lives by Educating for the BEST.

CSU Mission
CSU is committed to transform the lives of people and communities through high quality
instruction and innovative research, development, production and extension.

CSU – IGA
Competence
Social Responsibility
Unifying Presence

Personal Responsibility
Empathy
Research Skill
Entrepreneurial Skill
I. Introduction
Objectives:
Definition 1
A point estimate of a population parameter is a single value of a statistic used to estimate the
value of the target parameter. For example, the sample mean x̄ is a point estimate of the population
mean μ. Similarly, the sample proportion p̂ is a point estimate of the population proportion p.
Definition 2
A point estimate of some population parameter μ is a single numerical value x̄ of a statistic X̄. The
statistic X̄ is called the point estimator.
An estimator should be “close” in some sense to the true value of the unknown parameter.
Formally, we say that X̄ is an unbiased estimator of μ if the expected value of X̄ is equal to μ.
This is equivalent to saying that the mean of the probability distribution of X̄ (or the mean of
the sampling distribution of X̄) is equal to μ:
$$E(\bar{X}) = \mu, \quad \text{or equivalently} \quad E(\bar{X}) - \mu = 0$$
Example 1 If the average height of 100 randomly selected men aged 18 is 70.6 inches, then
we would say that the average height of all 18-year-old men is (at least approximately) 70.6
inches.
Explanation Estimating a population parameter like this by a single number is called point
estimation. The only drawback with a point estimate is that it gives no indication of how
reliable the estimate is. In brief, in the case of estimating a population mean μ we use a
formula to compute from the data a number E, called the margin of error of the estimate, and
form the interval [x̄ − E, x̄ + E]. We do this in such a way that a certain proportion, say 95%,
of all the intervals constructed from sample data by means of this formula contains the
unknown parameter μ. Such an interval is called a 95% confidence interval for μ. (Shafer
and Zhang, 2019)
Objectives:
To become familiar with the concept of an interval estimate of the population mean.
To calculate the confidence interval estimating the population mean.
To determine the relationship of the confidence interval with the confidence coefficient and sample size.
The term large-sample refers to the sample being of a sufficiently large size that we
can apply the Central Limit Theorem to determine the form of the sampling distribution of
x̄. The Central Limit Theorem says that, for large samples (samples of size n ≥ 30), when
viewed as a random variable the sample mean X̄ is approximately normally distributed with
mean μ_X̄ = μ and standard deviation σ_X̄ = σ/√n. The Empirical Rule says that we must go
about two standard deviations from the mean to capture 95% of the values of X̄ generated by
sample after sample. A more precise distance based on the normality of X̄ is 1.960 standard
deviations, which is
$$E = \frac{1.960\,\sigma}{\sqrt{n}}$$
It is standard practice to identify the level of confidence in terms of the area α in the
two tails of the distribution of X̄ when the middle part specified by the level of confidence is
taken out. The figures present the general situation and the resulting confidence interval.
The z-value that cuts off a right tail of area c is denoted z_c. Thus the number 1.960 in the
example is z_0.025, which is z_{α/2} for α = 1 − 0.95 = 0.05.
Figure 2. For 95% confidence the area in each tail is α/2=0.025
The level of confidence can be any number between 0 and 100%, but the most common
values are probably 90% (α=0.10), 95% (α=0.05), and 99% (α=0.01).
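The critical values z_{α/2} for these common levels can be checked numerically. The following is a minimal sketch using only Python's standard library; the helper name `z_critical` is an illustrative choice, not something defined in these notes.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_{alpha/2}: the z-value that cuts off a right tail of area alpha/2."""
    alpha = 1.0 - confidence
    # inv_cdf gives the point with the requested area to its LEFT,
    # so a right tail of alpha/2 corresponds to a left area of 1 - alpha/2.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

The 99% value comes out near 2.576; the worked examples in these notes round it to 2.58.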
Example 2 A sample of size 49 has sample mean 35 and sample standard deviation 14.
Construct a 98% confidence interval for the population mean using this information. Interpret
its meaning.
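Solution:
For confidence level 98%, α = 1 − 0.98 = 0.02, so z_{α/2} = z_0.01. From the critical values
table, we read directly that z_0.01 = 2.326. Thus
$$\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}} = 35 \pm 2.326\left(\frac{14}{\sqrt{49}}\right) = 35 \pm 4.652 \approx 35 \pm 4.7$$
One may be 98% confident that the population mean is contained in the interval
(35 − 4.7, 35 + 4.7) = (30.3, 39.7).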
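The large-sample computation in Example 2 can be sketched in a few lines of Python. The function name `large_sample_ci` is hypothetical, and the critical value is computed from the standard normal distribution rather than read from a table, so it differs from 2.326 only in later decimal places.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_ci(xbar, s, n, confidence):
    """Large-sample interval xbar +/- z_{alpha/2} * s / sqrt(n) (assumes n >= 30)."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin, margin

# Example 2: n = 49, sample mean 35, sample standard deviation 14, 98% confidence
low, high, margin = large_sample_ci(35, 14, 49, 0.98)
print(f"35 ± {margin:.3f}  ->  ({low:.1f}, {high:.1f})")
```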
Definition 3
A confidence interval for a parameter is an interval of numbers within which we expect the true
value of the population parameter to be contained. It is a range of possible values the parameter
might take, constructed so as to control the probability that the parameter is not lower than the
lowest value in this range and not higher than the highest value. It is used to express the
precision and uncertainty associated with a particular sampling method. A confidence interval
consists of three parts: a confidence level, a statistic, and a margin of error. The endpoints of
the interval are computed based on sample information.
Example 3 A random sample of 120 students from a large university yields mean
GPA 2.71 with sample standard deviation 0.51. Construct a 90% confidence interval for the
mean GPA of all students at the university.
Solution:
For confidence level 90%, α = 1 − 0.90 = 0.10, so z_{α/2} = z_0.05. From the critical values
table we read directly that z_0.05 = 1.645. Since n = 120, x̄ = 2.71, and s = 0.51,
$$\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}} = 2.71 \pm 1.645\left(\frac{0.51}{\sqrt{120}}\right) = 2.71 \pm 0.0766$$
One may be 90% confident that the true average GPA of all students at the university is
contained in the interval (2.71−0.08, 2.71+0.08)=(2.63,2.79).
Intervals can also be calculated with a level of confidence other than 95%.
This is done by choosing a confidence coefficient other than .95.
Definition 4
The confidence coefficient is the proportion of times that a confidence interval encloses the true
value of the population parameter if the confidence interval procedure is used repeatedly a very
large number of times.
The first step in constructing a confidence interval with any desired confidence coefficient is
to notice that, for a 95% confidence interval, the confidence coefficient of 95% is equal to the
total area under the sampling distribution (1.00), less .05 of the area, which is divided equally
between the two tails of the distribution. Thus, each tail has an area of .025. Second, consider
that the tabulated value of z that cuts off an area of .025 in the right tail of the standard
normal distribution is 1.96. The value z = 1.96 is also the distance, in terms of standard
deviations, that x̄ is from each endpoint of the 95% confidence interval. By assigning a
confidence coefficient other than .95 to a confidence interval, we change the area under the
sampling distribution between the endpoints of the interval, which in turn changes the tail area
associated with z. Thus, this z-value provides the key to constructing a confidence interval
with any desired confidence coefficient.
Assumption: n ≥ 30
[When the value of σ is unknown, the sample standard deviation s may be used to
approximate σ in the formula for the confidence interval. The approximation is generally
quite satisfactory when n ≥ 30.]
Example 4 A random sample of seniors in a certain university was asked to report the
number of hours they spent on their studies during a certain week. Results show that the
average was 40 hours and the standard deviation was 10 hours. A new study will be conducted
to find out whether students are now studying more than they used to. Suppose 50
students are interviewed and the results yield the statistics x̄ = 41.5 hours and s = 9.2
hours.
Estimate μ, the mean number of hours spent on study, using a 99% confidence interval.
Interpret the interval in terms of the problem.
$$\bar{x} \pm 2.58\frac{\sigma}{\sqrt{n}} \approx \bar{x} \pm 2.58\frac{s}{\sqrt{n}} = 41.5 \pm 2.58\left(\frac{9.2}{\sqrt{50}}\right) = 41.5 \pm 3.36$$
or (38.14, 44.86).
Therefore, we can be 99% confident that the interval (38.14, 44.86) encloses the true mean
weekly time spent on study. Since all the values in the interval fall above 38 hours and
below 45 hours, we conclude that students now tend to spend more than 6 hours and less than
7.5 hours per day on study on average (supposing that they do not study on Sunday).
Solution
a. The form of a large-sample 95% confidence interval for a population mean μ is
$$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}} \approx \bar{x} \pm 1.96\frac{s}{\sqrt{n}} = 41.5 \pm 1.96\left(\frac{9.2}{\sqrt{50}}\right) = 41.5 \pm 2.55$$
or (38.95, 44.05).
b. The 99% confidence interval for μ was determined in Example 4 to be (38.14, 44.86),
while the 95% confidence interval obtained in this example is (38.95, 44.05). From this,
it is concluded that the 95% confidence interval is narrower than the 99% confidence
interval.
Solution
a. Substitution of the values of the sample statistics into the general formula for a 99%
confidence interval for μ yields
$$\bar{x} \pm 2.58\frac{\sigma}{\sqrt{n}} \approx \bar{x} \pm 2.58\frac{s}{\sqrt{n}} = 41.5 \pm 2.58\left(\frac{9.2}{\sqrt{100}}\right) = 41.5 \pm 2.37$$
or (39.13, 43.87).
b. The 99% confidence interval based on a sample of size n = 100, constructed in part a, is
(39.13, 43.87), and the 99% confidence interval based on a sample of size n = 50 is (38.14,
44.86). From this, it is concluded that the 99% confidence interval based on a sample of
size n = 100 is narrower than the latter.
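The two comparisons above (lower confidence coefficient, larger sample size) can be reproduced side by side. This is a small sketch using the rounded table values 1.96 and 2.58 from the examples; the helper name `margin` is illustrative only.

```python
from math import sqrt

def margin(z, s, n):
    """Half-width z * s / sqrt(n) of a large-sample confidence interval."""
    return z * s / sqrt(n)

xbar, s = 41.5, 9.2
cases = [
    ("95%, n = 50", 1.96, 50),
    ("99%, n = 50", 2.58, 50),
    ("99%, n = 100", 2.58, 100),
]
for label, z, n in cases:
    e = margin(z, s, n)
    print(f"{label:>13}: {xbar} ± {e:.2f}  (width {2 * e:.2f})")
```

Lowering the confidence coefficient or increasing n both shrink the half-width, which is the pattern the solutions describe.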
The interpretation of a 95% confidence interval is that 95% of the intervals constructed in this
manner will contain the population mean. Thus, any interval computed in this manner has a
95% confidence of containing the population mean. By changing the constant from 1.96 to
1.645, a 90% confidence interval can be obtained. It should be noted from the formula for an
interval estimate that a 90% confidence interval is narrower than a 95% confidence interval
and as such has a slightly smaller confidence of including the population mean. Lower levels
of confidence lead to even more narrow intervals. In practice, a 95% confidence interval is
the most widely used.
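The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below draws many samples from a normal population whose mean and standard deviation are made-up values chosen only for illustration (μ = 50, σ = 10, n = 36), builds the interval x̄ ± zσ/√n each time, and counts how often it contains μ.

```python
import random
from math import sqrt

def coverage(mu=50.0, sigma=10.0, n=36, trials=2000, z=1.96, seed=1):
    """Fraction of simulated intervals xbar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    e = z * sigma / sqrt(n)            # margin of error (sigma treated as known)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - e <= mu <= xbar + e:
            hits += 1
    return hits / trials

print("z = 1.96 :", coverage(z=1.96))    # expected to be close to 0.95
print("z = 1.645:", coverage(z=1.645))   # expected to be close to 0.90
```

With the same simulated samples, the narrower 90% intervals can never capture μ more often than the 95% intervals, which is the trade-off between width and confidence described above.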
In this section, the concept of estimating the population mean μ based on a large
sample was introduced. Terms such as confidence interval and confidence
coefficient were also presented, and the relationship of the confidence interval to the
confidence coefficient and the sample size was analyzed.
III. Estimation of a population mean: small sample case
Objectives
In the previous section, the Central Limit Theorem was used to estimate the
population mean from a large sample. However, it cannot be used in this section
unless a certain assumption is made and satisfied.
If this assumption is valid, then we may again use x̄ as a point estimate of μ, and
the general form of a small-sample confidence interval for μ is as shown in the next box.
The graph for the Student’s t-distribution is similar to the standard normal curve and
at infinite degrees of freedom it is the normal distribution. This can be confirmed by reading
the bottom line at infinite degrees of freedom for a familiar level of confidence, e.g. at
column 0.05, 95% level of confidence, the t-value of 1.96 is at infinite degrees of freedom.
The mean for the Student’s t-distribution is zero and the distribution is symmetric
about zero, similar to the standard normal distribution.
The Student’s t-distribution has more probability in its tails than the standard normal
distribution because the spread of the t-distribution is greater than the spread of the standard
normal. Therefore the graph of the Student’s t-distribution will be thicker in the tails and
shorter in the center than the graph of the standard normal distribution.
The exact shape of the Student’s t-distribution depends on the degrees of freedom. As
the degrees of freedom increases, the graph of Student’s t-distribution becomes more like the
graph of the standard normal distribution.
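These shape properties can be checked numerically from the density of Student's t-distribution. The sketch below evaluates the standard t density formula with ν degrees of freedom next to the standard normal density; the function names are illustrative choices.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # Work on the log scale so large df does not overflow the gamma function.
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return exp(-x * x / 2) / sqrt(2 * pi)

for df in (5, 30, 1000):
    print(f"df = {df:>4}: center {t_pdf(0, df):.4f}, tail at 3: {t_pdf(3, df):.4f}")
print(f"normal   : center {normal_pdf(0):.4f}, tail at 3: {normal_pdf(3):.4f}")
```

For small degrees of freedom the t density is lower at the center and higher in the tails than the normal density, and the two get closer as the degrees of freedom increase.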
The underlying population of individual observations is assumed to be normally
distributed with unknown population mean μ and unknown population standard deviation σ.
This assumption comes from the Central Limit Theorem because the individual observations
in this case are the x̄’s of the sampling distribution. The size of the underlying population is
generally not relevant unless it is very small. If it is normal then the assumption is met and
doesn’t need discussion.
Solution:
Since the population is normally distributed, the sample is small, and the population standard
deviation is unknown, the formula that applies is
$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$
Confidence level 95% means that α = 1 − 0.95 = 0.05, so α/2 = 0.025. Since the sample size is
n = 15, there are n − 1 = 14 degrees of freedom and t_0.025 = 2.145. Thus
$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) = 35 \pm 2.145\left(\frac{14}{\sqrt{15}}\right) = 35 \pm 7.8$$
Therefore one may be 95% confident that the true value of μ is contained in the interval
(35 − 7.8, 35 + 7.8) = (27.2, 42.8).
Solution:
Since the population is normally distributed, the sample is small, and the population standard
deviation is unknown, the formula that applies is
$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$
Confidence level 90% means that α = 1 − 0.90 = 0.10, so α/2 = 0.05. Since the sample size is
n = 12, there are n − 1 = 11 degrees of freedom and t_0.05 = 1.796. Thus
$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) = 2.71 \pm 1.796\left(\frac{0.51}{\sqrt{12}}\right) = 2.71 \pm 0.26$$
Therefore, one may be 90% confident that the true average GPA of all students at the
university is contained in the interval (2.71 − 0.26, 2.71 + 0.26) = (2.45, 2.97).
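The two small-sample solutions above follow the same recipe, which can be sketched as a short function. The critical values are passed in as read from a t table (2.145 for 14 degrees of freedom, 1.796 for 11), because Python's standard library has no t quantile function. The name `small_sample_ci` is hypothetical, and s = 0.51 in the second call is an assumption: it is the value that reproduces the stated margin of 0.26.

```python
from math import sqrt

def small_sample_ci(xbar, s, n, t_crit):
    """Interval xbar +/- t_{alpha/2} * s / sqrt(n), with t_crit taken from a t table
    at n - 1 degrees of freedom (assumes a normally distributed population)."""
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin

# 95% interval: n = 15, mean 35, s = 14, t_0.025 (df = 14) = 2.145
print(small_sample_ci(35, 14, 15, 2.145))

# 90% interval: n = 12, mean 2.71, s = 0.51 (assumed), t_0.05 (df = 11) = 1.796
print(small_sample_ci(2.71, 0.51, 12, 1.796))
```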
Example 9 The average earnings per share (EPS) for 10 industrial stocks randomly selected
from those listed on the Dow-Jones Industrial Average was found to be X =1.85 with a
standard deviation of s=0.395. Calculate a 99% confidence interval for the average EPS of all
the industrials listed on the DJIA.
Solution
To help visualize the process of calculating a confidence interval we draw the appropriate
distribution for the problem. In this case this is the Student’s t because we do not know the
population standard deviation and the sample is small, less than 30.
To find the appropriate t-value requires two pieces of information, the level of confidence
desired and the degrees of freedom. The question asked for a 99% confidence level. On the
graph this is shown where (1-α) , the level of confidence , is in the unshaded area. The tails,
thus, have .005 probability each, α/2. The degrees of freedom for this type of problem is n-1=
9. From the Student’s t table, at the row marked 9 and column marked .005, is the number of
standard deviations to capture 99% of the probability, 3.2498. These are then placed on the
graph remembering that the Student’s t is symmetrical and so the t-value is both plus or
minus on each side of the mean.
Inserting these values into the formula gives the result. These values can be placed on the
graph to see the relationship between the distribution of the sample means, X̄’s, and the
Student’s t distribution.
$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) = 1.851 \pm 3.2498\left(\frac{0.395}{\sqrt{10}}\right) = 1.851 \pm 0.406$$
$$1.445 \le \mu \le 2.257$$
With 99% confidence, the average EPS of all the industrials listed on the DJIA is between 1.44
and 2.26.
IV. Estimation of a population proportion
Objectives
Compute the confidence interval to estimate a population proportion
Interpret the confidence interval in context.
This section focuses on the method for estimating population proportion. The
procedure to find the confidence interval for a population proportion is similar to that for the
population mean, but the formulas are a bit different although conceptually identical. While
the formulas are different, they are based upon the same mathematical foundation given by
the Central Limit Theorem. In determining if the problem falls under this section, the
underlying distribution must have a binary random variable and therefore is a binomial
distribution. (There is no mention of a mean or average.) If X is a binomial random variable,
then X ~ B(n, p) where n is the number of trials and p is the probability of a success. To form
a sample proportion, take X, the random variable for the number of successes and divide it
by n, the number of trials (or the sample size). The random variable P′ (read “P prime”) is the
sample proportion,
$$P' = \frac{X}{n}$$
The formula for the confidence interval for a population proportion is shown below.
Remember that as p moves further from 0.5, the binomial distribution becomes less
symmetrical. Because we are approximating the binomial with the symmetrical normal
distribution, the further from symmetrical the binomial becomes, the less confidence we
have in the estimate.
This conclusion can be demonstrated through the following analysis. Proportions are
based upon the binomial probability distribution. The possible outcomes are binary, either
“success” or “failure”. This gives rise to a proportion, meaning the percentage of the
outcomes that are “successes”. It was shown that the binomial distribution could be fully
understood if we knew only the probability of a success in any one trial, called p. The mean
and the standard deviation of the binomial were found to be:
$$\mu = np, \qquad \sigma = \sqrt{npq}$$
It was also shown that the binomial could be estimated by the normal distribution if BOTH
np and nq were greater than 5. Unfortunately, there is no correction factor for cases where the
sample size is small so np′ and nq’ must always be greater than 5 to develop an interval
estimate for p.
Example 9 According to a 2010 report from the American Council on Education, females
make up 57% of the college population in the United States. Students in a statistics class at
Tallahassee Community College want to determine the proportion of female students at TCC.
They select a random sample of 135 TCC students and find that 72 are female, which is a
sample proportion of 72 / 135 ≈ 0.533. So 53.3% of the students in the sample are female.
What can they conclude about the proportion of females at the college? How confident can
they be in their estimate?
Solution:
Step 1. Find a confidence interval.
Note that a confidence interval comes from a normal model of the sampling distribution and
there are two conditions for using a normal model for sample proportions:
The sample must be random.
The expected number of successes in the sample, np, and the expected number of
failures, n(1 – p), are both greater than or equal to 10. In symbols, this is np ≥ 10
and n(1 − p) ≥ 10. Recall that success doesn’t mean good and failure doesn’t mean bad.
A success is just what we are counting.
Advanced theory tells us that if the actual number of successes and failures in the sample are
greater than or equal to 10, then a normal model is still a good fit.
This sample contains 72 successes (female students) and 63 failures (male students). Both are
greater than 10. We therefore use the normal model for the sampling distribution.
Step 2. Find the margin of error:
Note that a sample proportion is only an estimate of the population proportion. The
sample proportion is generally not equal to the population proportion, so there is some error
due to random chance. The standard deviation of the sample proportions is used to describe the
amount of error that is expected in random samples. This is called the standard error.
When using a normal model for the sampling distribution, 95% of sample proportions
estimate the population proportion within approximately 2 standard errors. So the margin of
error is the following:
$$2\sqrt{\frac{p(1-p)}{n}}$$
Now let’s calculate the margin of error for the TCC estimate of 53.3%. Since the population
proportion p is unknown, the margin of error cannot be calculated from this formula directly.
The solution to this problem is to estimate the standard error using the sample proportion p̂
in place of p. This is called the estimated standard error, and the formula is:
$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
For this example, the estimated standard error is
$$\sqrt{\frac{0.533(1-0.533)}{135}} \approx 0.043$$
So the margin of error for the 95% confidence interval is:
$$2\sqrt{\frac{0.533(1-0.533)}{135}} \approx 2(0.043) = 0.086$$
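The steps of the TCC example can be gathered into one short sketch. It uses the "2 standard errors" rule from the text and the 10 successes / 10 failures condition; the function name `proportion_ci` is an illustrative choice, and the resulting interval is simply the sample proportion plus or minus the margin of error.

```python
from math import sqrt

def proportion_ci(successes, n, z=2.0):
    """Approximate 95% interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    failures = n - successes
    if successes < 10 or failures < 10:
        raise ValueError("normal model needs at least 10 successes and 10 failures")
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error
    margin = z * se
    return p_hat, se, margin, (p_hat - margin, p_hat + margin)

# TCC sample: 72 female students out of 135
p_hat, se, margin, interval = proportion_ci(72, 135)
print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}, margin = {margin:.3f}")
print(f"interval: ({interval[0]:.3f}, {interval[1]:.3f})")
```

Under these assumptions the interval runs from roughly 0.45 to 0.62, so it includes the national figure of 57% quoted in the example.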
Objectives
The figure below illustrates the conceptual framework of investigation in this section. Each
population has a mean and a standard deviation. We arbitrarily label one population as
Population 1 and the other as Population 2, and subscript the parameters with the
numbers 1 and 2 to tell them apart. We draw a random sample from Population 1 and label
the sample statistics it yields with the subscript 1. Without reference to the first sample we
draw a sample from Population 2 and label its sample statistics with the subscript 2.
Definition 5
Independence. Samples from two distinct populations are independent if each one is drawn
without reference to the other, and has no connection with the other.
The goal is to use the information in the samples to estimate the difference μ1−μ2 in the
means of the two populations and to make statistically valid inferences about it.
Since the mean x̄1 of the sample drawn from Population 1 is a good estimator of μ1 and the
mean x̄2 of the sample drawn from Population 2 is a good estimator of μ2, a reasonable
point estimate of the difference μ1−μ2 is x̄1 − x̄2. In order to widen this point estimate into a
confidence interval, we first suppose that both samples are large, that is, that
both n1≥30 and n2≥30. If so, then the following formula for a confidence interval for μ1−μ2 is
valid. The symbols s1² and s2² denote the squares of s1 and s2. (In the relatively rare case that
both population standard deviations σ1 and σ2 are known they would be used instead of the
sample standard deviations.)
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
(Note: We have used the sample variances s1² and s2² as approximations to the corresponding
population parameters.)
The assumptions upon which the above procedure is based are the following:
Company 1      Company 2
n1 = 174       n2 = 355
x̄1 = 3.51      x̄2 = 3.24
s1 = 0.51      s2 = 0.52
Construct a point estimate and a 99% confidence interval for μ1−μ2 , the difference in average
satisfaction levels of customers of the two companies as measured on this five-point scale.
Solution:
The point estimate of μ1−μ2 is x̄1 − x̄2 = 3.51 − 3.24 = 0.27.
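The 99% confidence interval requested in this example follows from the large-sample formula for μ1−μ2. The sketch below is one way to carry out that computation with the sample statistics in the table; the function name is hypothetical and the critical value is computed from the standard normal distribution (about 2.576).

```python
from math import sqrt
from statistics import NormalDist

def two_mean_large_ci(x1, s1, n1, x2, s2, n2, confidence):
    """(x1 - x2) +/- z_{alpha/2} * sqrt(s1^2/n1 + s2^2/n2), for n1, n2 >= 30."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    diff = x1 - x2
    margin = z * sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return diff, margin, (diff - margin, diff + margin)

# Company 1: n = 174, mean 3.51, s = 0.51; Company 2: n = 355, mean 3.24, s = 0.52
diff, margin, interval = two_mean_large_ci(3.51, 0.51, 174, 3.24, 0.52, 355, 0.99)
print(f"{diff:.2f} ± {margin:.2f}  ->  ({interval[0]:.2f}, {interval[1]:.2f})")
```

Under these inputs the whole interval lies above zero, which would indicate a higher average satisfaction level for Company 1.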
When estimating the difference between two population means, based on small samples from
each population, we must make specific assumptions about the relative frequency
distributions of the two populations, as indicated in the box.
3. The random samples are selected in an independent manner from two populations.
When these assumptions are satisfied, we may use the procedure specified in the next box to
construct a confidence interval for (μ1 − μ2), based on small samples (n1 and n2 < 30) from
the respective populations.
Example 11 A software company markets a new computer game with two experimental
packaging designs. Design 1 is sent to 11 stores; their average sales the first month is 52 units
with sample standard deviation 12 units. Design 2 is sent to 6 stores; their average sales the
first month is 46 units with sample standard deviation 10 units. Construct a point estimate and
a 95% confidence interval for the difference in average monthly sales between the two
package designs.
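Solution:
The point estimate of μ1−μ2 is
$$\bar{x}_1 - \bar{x}_2 = 52 - 46 = 6$$
In words, we estimate that the average monthly sales for Design 1 is 6 units more per month
than the average monthly sales for Design 2.
The pooled sample variance is
$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} = \frac{(10)(12)^2 + (5)(10)^2}{15} \approx 129.3$$
With n1 + n2 − 2 = 15 degrees of freedom, t_0.025 = 2.131. Thus
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 6 \pm (2.131)\sqrt{129.3\left(\frac{1}{11} + \frac{1}{6}\right)} \approx 6 \pm 12.3$$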
We are 95% confident that the difference in the population means lies in the
interval [−6.3,18.3], in the sense that in repeated sampling 95% of all intervals constructed
from the sample data in this manner will contain μ1−μ2. Because the interval contains both
positive and negative values the statement in the context of the problem is that we
are 95% confident that the average monthly sales for Design 1 is between 18.3 units higher
and 6.3 units lower than the average monthly sales for Design 2.
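The pooled-variance computation in Example 11 can be sketched as follows. The critical value t_0.025 = 2.131 (15 degrees of freedom) is supplied from a t table, and the function name `pooled_ci` is an illustrative choice.

```python
from math import sqrt

def pooled_ci(x1, s1, n1, x2, s2, n2, t_crit):
    """Small-sample interval for mu1 - mu2 using the pooled variance s_p^2.
    t_crit is t_{alpha/2} with n1 + n2 - 2 degrees of freedom, from a t table."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    margin = t_crit * sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x1 - x2
    return diff, sp2, margin, (diff - margin, diff + margin)

# Design 1: 11 stores, mean 52, s = 12; Design 2: 6 stores, mean 46, s = 10
diff, sp2, margin, interval = pooled_ci(52, 12, 11, 46, 10, 6, 2.131)
print(f"s_p^2 = {sp2:.1f}, {diff} ± {margin:.1f}")
print(f"interval: [{interval[0]:.1f}, {interval[1]:.1f}]")
```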
VI. Estimation of the difference between two population means: Matched pairs
The procedure for estimating the difference between two population means presented in
Section 5 was based on the assumption that the samples were randomly selected from the
target populations. Sometimes we can obtain more information about the difference between
population means (μ1 − μ2) by selecting paired observations.
Objectives
To compute the confidence interval estimating the difference in the means of two distinct populations using paired samples
To test hypotheses using the critical value approach
Testing hypotheses concerning the difference of two population means using paired
difference samples is done precisely as it is done for independent samples, although now the
null and alternative hypotheses are expressed in terms of μd instead of μ1−μ2. Thus the null
hypothesis will always be written
H0:μd=D0
The three forms of the alternative hypothesis, with the terminology for each case, are:
Form of Ha        Terminology
Ha: μd < D0       Left-tailed
Ha: μd > D0       Right-tailed
Ha: μd ≠ D0       Two-tailed
The same conditions on the population of differences that were required for
constructing a confidence interval for the difference of the means must also be met when
hypotheses are tested. Here is the standardized test statistic that is used in the test.
$$T = \frac{\bar{d} - D_0}{s_d/\sqrt{n}}$$
where there are n pairs, d̄ is the mean and s_d is the standard deviation of their differences.
The test statistic has Student’s t-distribution with df=n−1 degrees of freedom.
Example 12 Suppose that the n = 10 pairs of achievement test scores are as given in Table 7.
Find a 95% confidence interval for the difference in mean achievement, μd = (μ1 − μ2).
Student pair       1   2   3   4   5   6   7   8   9  10
Method 1 score    78  63  72  89  91  49  68  76  85  55
Method 2 score    71  44  61  84  74  51  55  60  77  39
Pair difference    7  19  11   5  17  -2  13  16   8  16
Solution The differences between matched pairs of reading achievement test scores are
computed as Method 1 score minus Method 2 score, as listed in the pair difference row of the
table. Their mean is d̄ = 11.0 and their standard deviation is s_d = 6.53.
The value of t.025, based on (n − 1) = 9 degrees of freedom, is given in Table 2 of Appendix C
as t.025 = 2.262. Substituting these values into the formula for the confidence interval, we
obtain
$$\bar{d} \pm t_{.025}\frac{s_d}{\sqrt{n}} = 11.0 \pm 2.262\left(\frac{6.53}{\sqrt{10}}\right) = 11.0 \pm 4.7$$
or (6.3, 15.7).
We estimate, with 95% confidence, that the difference between mean reading achievement
test scores for methods 1 and 2 falls within the interval from 6.3 to 15.7. Since all the values
within the interval are positive, method 1 seems to produce a mean achievement test score
that is substantially higher than the mean score for method 2.
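The summary statistics used in Example 12 can be recomputed directly from the paired scores in the table. This is a minimal sketch with Python's `statistics` module; the critical value 2.262 (9 degrees of freedom) is taken from the t table as in the solution.

```python
from math import sqrt
from statistics import mean, stdev

method1 = [78, 63, 72, 89, 91, 49, 68, 76, 85, 55]
method2 = [71, 44, 61, 84, 74, 51, 55, 60, 77, 39]

# Paired differences: Method 1 score minus Method 2 score for each student pair
diffs = [a - b for a, b in zip(method1, method2)]

d_bar = mean(diffs)            # mean of the differences
s_d = stdev(diffs)             # sample standard deviation (n - 1 in the denominator)
t_crit = 2.262                 # t_.025 with 9 degrees of freedom, from a t table
margin = t_crit * s_d / sqrt(len(diffs))

print(f"d_bar = {d_bar}, s_d = {s_d:.2f}, margin = {margin:.1f}")
print(f"interval: ({d_bar - margin:.1f}, {d_bar + margin:.1f})")
```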
Using the data of Table 6.1, test the hypothesis that mean fuel economy for Type 1 gasoline
is greater than that for Type 2 gasoline against the null hypothesis that the two formulations
of gasoline yield the same mean fuel economy. Test at the 5% level of significance using the
critical value approach, given d̄ = 0.14, s_d = 0.16 and n = 9.
Solution:
The only part of the table that we use is the third column, the differences.
Step 1. Since the differences were computed in the order Type 1 mpg − Type 2 mpg,
better fuel economy with Type 1 fuel corresponds to μd = μ1 − μ2 > 0. Thus the test is
H0: μd = 0
vs.
Ha: μd > 0 at α = 0.05
Step 4. Since the symbol in Ha is “>” this is a right-tailed test, so there is a single
critical value, tα = t0.05 with 8 degrees of freedom, which from the row labeled df=8
is read off as 1.860. The rejection region is [1.860, ∞).
Step 5. The test statistic falls in the rejection region. The decision
is to reject H0. In the context of the problem our conclusion is:
Conclusion: The data provide sufficient evidence, at the 5% level of significance, to conclude
that the mean fuel economy provided by Type 1 gasoline is greater than that for
Type 2 gasoline.
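The decision in this test can be checked from the summary statistics. The sketch below assumes n = 9 pairs (consistent with the 8 degrees of freedom used for the critical value) and D0 = 0; it works from the rounded values d̄ = 0.14 and s_d = 0.16, so a statistic computed from unrounded data may differ slightly. The function name is hypothetical.

```python
from math import sqrt

def paired_t_statistic(d_bar, s_d, n, d0=0.0):
    """Standardized test statistic T = (d_bar - D0) / (s_d / sqrt(n))."""
    return (d_bar - d0) / (s_d / sqrt(n))

t_stat = paired_t_statistic(0.14, 0.16, 9)   # n = 9 is an assumption (df = 8)
critical = 1.860                             # t_0.05 with df = 8, from a t table
reject = t_stat >= critical                  # right-tailed rejection region [1.860, inf)

print(f"T = {t_stat:.3f}, critical value = {critical}, reject H0: {reject}")
```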
This section extends the method of Section 4 to the case in which we want to estimate the
difference between two population proportions. Suppose we wish to compare the proportions
of two populations that have a specific characteristic, such as the proportion of men who are
left-handed compared to the proportion of women who are left-handed.
Objectives
To construct a confidence interval estimating the difference in the proportions of two
distinct populations that have a particular characteristic of interest
The figure below illustrates the conceptual framework of our investigation. Each population
is divided into two groups, the group of elements that have the characteristic of interest (for
example, being left-handed) and the group of elements that do not. We arbitrarily label one
population as Population 1 and the other as Population 2, and subscript the proportion of each
population that possesses the characteristic with the number 1 or 2 to tell them apart. We
draw a random sample from Population 1 and label the sample statistic it yields with the
subscript 1. Without reference to the first sample we draw a sample from Population 2 and
label its sample statistic with the subscript 2.
Figure 7.1 Independent Sampling from Two Populations In Order to Compare Proportions
The goal is to use the information in the samples to estimate the difference p1−p2 in the two
population proportions and to make statistically valid inferences about it.
To judge the reliability of the point estimate (p̂1 − p̂2), we need to know the characteristics of
its performance in repeated independent sampling from two populations. This information is
provided by the sampling distribution of (p̂1 − p̂2), shown in the next box.
Sampling distribution of (p̂1 − p̂2)
For sufficiently large sample sizes n1 and n2, the sampling distribution of (p̂1 − p̂2), based on
independent random samples from two populations, is approximately normal with
Mean:
$$\mu_{(\hat{p}_1 - \hat{p}_2)} = p_1 - p_2$$
and
Standard deviation:
$$\sigma_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$
where q1 = 1 − p1 and q2 = 1 − p2, and p̂1 and p̂2 are the sample proportions of observations
with the characteristic of interest.
Assumption: The samples are sufficiently large so that the approximation is valid. As a general
rule of thumb, we will require that the intervals
$$\hat{p}_1 \pm 2\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}} \quad \text{and} \quad \hat{p}_2 \pm 2\sqrt{\frac{\hat{p}_2\hat{q}_2}{n_2}}$$
do not contain 0 or 1.
Solution
Because the “No public web access” population was labeled as Population 1 and the “Public
web access” population was labeled as Population 2, in words this means that we estimate
that the proportion of projects that passed on the first inspection increased by 13 percentage
points after records were posted on the web.
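The sample sizes are sufficiently large for constructing a confidence interval since for sample 1:
$$3\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}} = 3\sqrt{\frac{(0.67)(0.33)}{500}} = 0.06$$
so that
$$\left[\hat{p}_1 - 3\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}},\ \hat{p}_1 + 3\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}}\right] = [0.67-0.06,\ 0.67+0.06] = [0.61,\ 0.73] \subset [0,1]$$
and for sample 2:
$$3\sqrt{\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = 3\sqrt{\frac{(0.8)(0.2)}{100}} = 0.12$$
so that
$$\left[\hat{p}_2 - 3\sqrt{\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}},\ \hat{p}_2 + 3\sqrt{\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\right] = [0.8-0.12,\ 0.8+0.12] = [0.68,\ 0.92] \subset [0,1]$$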
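To apply the formula for the confidence interval, we first observe that the 90% confidence
level means that α = 1 − 0.90 = 0.10 so that z_{α/2} = z_0.05 and z_0.05 = 1.645. Thus the desired
confidence interval is
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = -0.13 \pm 1.645\sqrt{\frac{(0.67)(0.33)}{500} + \frac{(0.8)(0.2)}{100}} = -0.13 \pm 0.07$$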
The 90% confidence interval is [−0.20,−0.06]. We are 90% confident that the difference in
the population proportions lies in the interval [−0.20,−0.06], in the sense that in repeated
sampling 90% of all intervals constructed from the sample data in this manner will
contain p1−p2. Taking into account the labeling of the two populations, this means that we
are 90% confident that the proportion of projects that pass on the first inspection is between 6
and 20 percentage points higher after public access to the records than before.
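The interval for p1−p2 in this example can be reproduced with a short sketch. The sample proportions 0.67 (n1 = 500) and 0.80 (n2 = 100) are the values that appear in the solution; the function name `two_proportion_ci` is an illustrative choice.

```python
from math import sqrt

def two_proportion_ci(p1, n1, p2, n2, z):
    """(p1 - p2) +/- z * sqrt(p1(1-p1)/n1 + p2(1-p2)/n2) for large independent samples."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z * se
    diff = p1 - p2
    return diff - margin, diff + margin, margin

# Population 1: no public web access; Population 2: public web access; z_0.05 = 1.645
low, high, margin = two_proportion_ci(0.67, 500, 0.80, 100, 1.645)
print(f"-0.13 ± {margin:.2f}  ->  [{low:.2f}, {high:.2f}]")
```

Both endpoints are negative, which matches the conclusion that the pass rate was higher after the records were posted.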
Objective
Compute the sample size required to estimate population parameters for population
mean, population proportion, and two independent samples.
Calculating the right sample size is crucial to gaining accurate information. In fact, in a
survey, the confidence level and margin of error depend almost solely on the number of
responses received.
The first thing to understand is the difference between confidence levels and margins of
error. Simply put, a confidence level describes how sure you can be that your results are
accurate, whereas the margin of error shows the range the survey results would fall between
if our confidence level held true.
The margin of error (MOE) for the 95% confidence interval (CI) for µ is
$$MOE = E \approx \frac{2s}{\sqrt{n}}$$
where s is the standard deviation of the sample. The 95% CI is
$$\bar{X} \pm MOE$$
Example 15 A manufacturer of cereal boxes wants to know the mean weight of the boxes it
produces. Previous studies have shown the population standard deviation of the weights of
the boxes to be 0.1 ounces. They would like to estimate µ with 95% confidence and have the
MOE no greater than 0.012.
Solution
The 95% CI for µ depends on the MOE, which depends on s and the sample size n:
\[ \mathrm{MOE} = E \approx \frac{2s}{\sqrt{n}} \]
The sample standard deviation s is an estimate for the population standard deviation σ. If we
happen to know σ, we will use it instead of s for our MOE
\[ \mathrm{MOE} = E \approx \frac{2\sigma}{\sqrt{n}} \]
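This client has asked that the MOE be no greater than 0.012, and we know σ = 0.1:
\[ \mathrm{MOE} = \frac{2\sigma}{\sqrt{n}} \]
\[ 0.012 = \frac{2(0.1)}{\sqrt{n}} \]
Solving for n gives:
\[ (0.012)^2 = \left(\frac{2(0.1)}{\sqrt{n}}\right)^2 \]
\[ 0.000144 = \frac{0.04}{n} \]
\[ n = \frac{0.04}{0.000144} = 277.78 \approx 278 \]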
To have a 95% confidence interval for µ with a MOE of 0.012, this company will have to
sample 278 boxes.
The sample size needed to be 95% confident that $\bar{x}$, the sample mean, will be within MOE of
the population mean µ is
\[ n \ge \left(\frac{2\sigma}{\mathrm{MOE}}\right)^2 \]
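The calculation in Example 15 can be sketched as a small helper. The function name `sample_size_for_mean` is hypothetical, and the multiplier 2 is the approximate 95% value used in these notes.

```python
import math

def sample_size_for_mean(sigma, moe):
    """Smallest n with 2*sigma/sqrt(n) <= moe (95% confidence, multiplier 2)."""
    raw = (2 * sigma / moe) ** 2
    # Round up; the small tolerance guards against floating-point round-off.
    return math.ceil(raw - 1e-9)

# Inputs from Example 15: sigma = 0.1 ounces, MOE = 0.012
print(sample_size_for_mean(sigma=0.1, moe=0.012))
```

Rounding up rather than to the nearest integer matters here: any smaller n would give a margin of error above the requested limit.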
The margin of error (MOE) for the 95% confidence interval (CI) for a population proportion p
is
\[ \mathrm{MOE} = E \approx 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
and the 95% CI is
\[ \hat{p} \pm \mathrm{MOE} \]
Solution
\[ \mathrm{MOE} = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
It turns out that the margin of error is largest when $\hat{p} = 0.5$. So, since $\hat{p}$ is unknown before
we collect our data, we plug in $\hat{p} = 0.5$, which gives us the widest possible MOE (i.e. we are
assuming the worst-case scenario for estimating p).
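A short derivation, added here as a sketch, shows why 0.5 is the worst case. Completing the square in the product $\hat{p}(1-\hat{p})$ gives a bound that does not depend on the data:

```latex
\hat{p}(1-\hat{p}) = \frac{1}{4} - \left(\hat{p} - \frac{1}{2}\right)^2 \le \frac{1}{4},
\qquad \text{with equality only when } \hat{p} = \frac{1}{2}.

\text{Hence}\quad
\mathrm{MOE} = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\le 2\sqrt{\frac{1}{4n}} = \frac{1}{\sqrt{n}}.
```

So for any sample proportion, the 95% margin of error never exceeds $1/\sqrt{n}$.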
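After plugging in $\hat{p} = 0.5$ and the requested MOE from the client, this gives us
\[ \mathrm{MOE} = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
\[ 0.04 = 2\sqrt{\frac{0.5(1-0.5)}{n}} \]
\[ (0.04)^2 = \left(2\sqrt{\frac{0.5(0.5)}{n}}\right)^2 \]
\[ 0.0016 = \frac{4(0.5)(0.5)}{n} = \frac{1}{n} \]
\[ n = \frac{1}{0.0016} = 625 \]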
To have a 95% confidence interval for p with a MOE of 0.04, this company will have to
sample 625 coats.
The sample size needed to be 95% confident that $\hat{p}$, the sample proportion, will be within MOE
of the population proportion p is
\[ n \ge \frac{1}{\mathrm{MOE}^2} \]
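The same rule can be sketched in Python. The helper `sample_size_for_proportion` is a hypothetical name. The optional `p_guess` argument is an addition for illustration: it shows that a prior guess other than 0.5 can only reduce the required sample size.

```python
import math

def sample_size_for_proportion(moe, p_guess=0.5):
    """Smallest n with 2*sqrt(p(1-p)/n) <= moe; p_guess = 0.5 is the worst case."""
    raw = 4 * p_guess * (1 - p_guess) / moe ** 2
    # Round up; the small tolerance guards against floating-point round-off.
    return math.ceil(raw - 1e-9)

print(sample_size_for_proportion(moe=0.04))               # worst case, n >= 1/MOE^2
print(sample_size_for_proportion(moe=0.04, p_guess=0.2))  # smaller n with a prior guess
```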
In studies where the plan is to estimate the difference in means between two independent
populations, the formula for determining the sample sizes required in each comparison group is
given below:
\[ n = 2\left(\frac{Z\sigma}{E}\right)^2 \]
where n is the sample size required in each group, Z is the value from the standard normal
distribution reflecting the confidence level that will be used, E is the desired margin of error,
and σ is the standard deviation of the outcome variable.
When a confidence interval estimate for the difference in means is generated, $S_p$, the pooled
estimate of the common standard deviation, can be used as a measure of variability in the
outcome, where $S_p$ is computed as
\[ S_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} \]
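The pooled standard deviation can be computed directly from the two sample sizes and sample standard deviations. The sketch below uses a hypothetical function name and hypothetical summary statistics, chosen only to show the call.

```python
from math import sqrt

def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate S_p of a common standard deviation from two samples."""
    numerator = (n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2
    return sqrt(numerator / (n1 + n2 - 2))

# Hypothetical summary statistics, for illustration only.
print(round(pooled_sd(n1=15, s1=4.0, n2=12, s2=5.0), 3))
```

The result always lies between the two sample standard deviations and sits closer to the one from the larger sample.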
Example 17 An investigator wants to plan a clinical trial to evaluate the efficacy of a new
drug designed to increase HDL cholesterol (the "good" cholesterol). The plan is to enrol
participants and to randomly assign them to receive either the new drug or a placebo. HDL
cholesterol will be measured in each participant after 12 weeks on the assigned treatment.
Based on prior experience with similar trials, the investigator expects that 10% of all
participants will be lost to follow up or will drop out of the study over 12 weeks. A 95%
confidence interval will be estimated to quantify the difference in mean HDL levels between
patients taking the new drug as compared to placebo. The investigator would like the margin
of error to be no more than 3 units. How many patients should be recruited into the study?
Solution
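Using Z = 1.96 for 95% confidence, E = 3, and taking σ = 17.1 as the standard deviation of
HDL cholesterol (the value used in the cited source, Sullivan, and consistent with the sample
sizes reported below):
\[ n = 2\left(\frac{Z\sigma}{E}\right)^2 = 2\left(\frac{1.96 \times 17.1}{3}\right)^2 = 249.6 \approx 250 \]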
Samples of size n1=250 and n2=250 will ensure that the 95% confidence interval for the
difference in mean HDL levels will have a margin of error of no more than 3 units. These
sample sizes refer to the numbers of participants with complete data. The investigators
hypothesized a 10% attrition (or drop-out) rate (in both groups). In order to ensure that the
total sample size of 500 is available at 12 weeks, the investigator needs to recruit more
participants to allow for attrition.
N = 500/0.90 = 555.6 ≈ 556
If they anticipate a 10% attrition rate, the investigators should enroll 556 participants. This
will ensure N = 500 with complete data at the end of the trial.
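The two steps of Example 17 (per-group sample size, then inflation for attrition) can be sketched as follows. The function names are hypothetical. The value σ = 17.1 is an assumption: it is not stated in the example text, but it is the value that reproduces the reported n = 250 per group.

```python
import math

def per_group_size(z, sigma, moe):
    """n per group for estimating a difference in means: n = 2 (Z*sigma/E)^2."""
    raw = 2 * (z * sigma / moe) ** 2
    return math.ceil(raw - 1e-9)  # round up, tolerating float round-off

def enrollment_with_attrition(total_complete, attrition):
    """Participants to enroll so total_complete remain after the given attrition."""
    return math.ceil(total_complete / (1 - attrition) - 1e-9)

n = per_group_size(z=1.96, sigma=17.1, moe=3)  # sigma = 17.1 is an assumption
total = 2 * n
print(n, total, enrollment_with_attrition(total, attrition=0.10))
```

Dividing by 0.90 rather than multiplying by 1.10 is deliberate: enrolling 550 and losing 10% would leave only 495 participants with complete data.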
Objectives
χ2 (chi-square) distribution
Suppose we take a random sample of size n from a normal population with mean µ and
standard deviation σ. Then the sample statistic
\[ \chi^2 = \frac{(n-1)s^2}{\sigma^2} \]
has a chi-square distribution with n − 1 degrees of freedom.
Because of the characteristics just described, the χ² curve is right skewed. In other
words, the chi-square distribution is not symmetric.
There is a different curve for each number of degrees of freedom, n − 1. As the number
of degrees of freedom increases, the χ² curve begins to look more symmetric.
Example 18 Find the critical values in the χ² distribution which separate the middle 95%
from the 2.5% in each tail, assuming there are 12 degrees of freedom.
Solution
Using the table of Chi-square distribution in the appendix, it can be inferred that the two
critical values given by the above conditions are 4.404 and 23.337.
For easier reference, the table below is given, with the row containing the answer
highlighted in red.
To construct the confidence intervals, we need to find the critical values of a chi-square
distribution for the given confidence level 100 (1 – α)%. We can use either the chi-square
table (table A-4) or technology. Table A-4 shows the degrees of freedom in the left column.
The area to the right of the χ2 critical value is given across the top of the table. (See appendix)
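As one example of the "technology" route, the critical values can be computed with the Python standard library alone. The sketch below is an illustration, not the method of the notes: it evaluates the chi-square CDF with a series for the incomplete gamma function and inverts it by bisection. All function names are hypothetical. If SciPy is available, `scipy.stats.chi2.ppf` gives the same quantiles.

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable, via the lower incomplete gamma series."""
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while abs(term) > 1e-15 * abs(total):
        term *= t / (a + k)
        total += term
        k += 1
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))

def chi2_critical(right_tail_area, df):
    """Value with the given area to its right (the convention of Table A-4)."""
    target = 1.0 - right_tail_area  # convert to area on the left
    lo, hi = 0.0, df + 10.0
    while chi2_cdf(hi, df) < target:  # widen the bracket if needed
        hi *= 2
    for _ in range(100):  # bisection
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example 18: middle 95% with 12 degrees of freedom
print(round(chi2_critical(0.975, 12), 3), round(chi2_critical(0.025, 12), 3))
```

The two printed values should agree with the table entries 4.404 and 23.337 quoted in Example 18.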
Since the chi-square distribution is not symmetric, we cannot construct the confidence interval
for σ² using the "point estimate ± margin of error" method. We must find two different
chi-square critical values for each confidence interval at the given confidence level
100(1 – α)%. The confidence interval for σ² is
\[ \frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \]
where $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ are values of χ² that locate an area of α/2 to the right and α/2 to the left,
respectively, of a chi-square distribution based on (n - 1) degrees of freedom.
Assumption: The population from which the sample is selected has an approximate normal
distribution.
Example 19 Suppose a sample of 30 ECC students are given an IQ test. If the sample has a
standard deviation of 12.23 points, find a 90% confidence interval for the population standard
deviation.
Solution
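We first need to find the critical values. With n − 1 = 29 degrees of freedom and α = 0.10,
the chi-square table gives $\chi^2_{\alpha/2} = \chi^2_{0.05} = 42.557$ and $\chi^2_{1-\alpha/2} = \chi^2_{0.95} = 17.708$. Then
\[ \frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \]
\[ \frac{(30-1)(12.23)^2}{42.557} < \sigma^2 < \frac{(30-1)(12.23)^2}{17.708} \]
\[ 101.9249 < \sigma^2 < 244.9472 \]
Taking square roots gives the interval for the standard deviation:
\[ 10.10 < \sigma < 15.65 \]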
So we are 90% confident that the standard deviation of the IQ of ECC students is between
10.10 and 15.65 points.
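The arithmetic of Example 19 can be sketched in Python, taking the two critical values from the chi-square table as inputs. The function name and parameter names are hypothetical.

```python
from math import sqrt

def sigma_confidence_interval(n, s, chi2_right, chi2_left):
    """CI for sigma from a normal sample.

    chi2_right: critical value with area alpha/2 to its right (the larger one).
    chi2_left:  critical value with area alpha/2 to its left (the smaller one).
    """
    ss = (n - 1) * s ** 2
    return sqrt(ss / chi2_right), sqrt(ss / chi2_left)

# Example 19: n = 30, s = 12.23, table values for 29 degrees of freedom
low, high = sigma_confidence_interval(n=30, s=12.23, chi2_right=42.557, chi2_left=17.708)
print(f"{low:.2f} < sigma < {high:.2f}")
```

Note that the larger critical value produces the lower limit, because it appears in the denominator.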
Summary
This chapter presented the technique of estimation - that is, using sample information to make
an inference about the value of a population parameter, or the difference between two
population parameters. In each instance, we presented the point estimate of the parameter of
interest, its sampling distribution, the general form of a confidence interval, and any
assumptions required for the validity of the procedure. In addition, we provided techniques
for determining the sample size necessary to estimate each of these parameters.
References
Shafer, D. S. & Zhang, Z. (2013). Introductory Statistics. Flat World Knowledge Inc.
Myers, S. L., Ye, K., & Walpole, R. E. (2007). Probability & statistics for engineers &
scientists. 8th Ed. Upper Saddle River, NJ: Pearson Prentice Hall.
Sullivan, L. Power and Sample Size Determination. Boston University School of Public
Health.
Appendix