STA 301D Statistical Methods
COURSE TITLE
All rights reserved. No part of this publication should be reproduced, stored in a retrieval system or
transmitted in any form or by any means, electronic, mechanical, photocopying or otherwise, without the prior
permission of the copyright holder.
This Course Book “STA 301D Statistical Methods” has been exclusively written by
experts in the discipline to up-date your general knowledge of Education in order to
equip you with the basic tool you will require for your professional training as a teacher.
This three-credit course book of thirty-six (36) sessions has been structured to reflect
the weekly three-hour lecture for this course in the University. Thus, each session is
equivalent to a one-hour lecture on campus. As a distance learner, however, you are
expected to spend a minimum of three hours and a maximum of five hours on each
session.
To help you do this effectively, a Study Guide has been particularly designed to show
you how this book can be used. In this study guide, your weekly schedules are clearly
spelt out as well as dates for quizzes, assignments and examinations.
Also included in this book is a list of all symbols and their meanings. They are meant to
draw your attention to vital issues of concern and activities you are expected to perform.
Blank sheets have been also inserted for your comments on topics that you may find
difficult. Remember to bring these to the attention of your course tutor during your
face-to-face meetings.
It is in the foregoing context that the names of Prof. Nathaniel K. Howard and Dr.
Bismark K. Nkansah, University of Cape Coast, who wrote and edited the content of
this course book for CoDEUCC, will ever remain in the annals of the College. This
special remembrance also applies to those who assisted me in the final editing of the
document.
I wish to thank the Vice-Chancellor, Prof. Johnson Nyarko Boampong, the Pro-Vice-
Chancellor, Prof. (Mrs.) Rosemond Boohene and all the staff of the University’s
Administration without whose diverse support this course book would not have been
completed.
Finally, I am greatly indebted to the entire staff of CoDEUCC, especially Mrs. Christina
Hesse for formatting the scripts.
Any limitations in this course book, however, are exclusively mine. But the good
comments must be shared among those named above.
Content
About this Book
Acknowledgement
Table of Contents
Symbols and their Meanings
Statistical Tables
References
INTRODUCTION
OVERVIEW
UNIT OBJECTIVES
SESSION OBJECTIVES
DO AN ACTIVITY
REFER TO
READ OR LOOK AT
SUMMARY
ASSIGNMENT
In this unit, you will learn about some common concepts used in hypothesis testing,
procedures in hypothesis testing, and how to conduct hypothesis tests that are based on
single samples. In Section 1, we shall discuss some basic concepts in hypothesis testing
and learn about test procedures. Sections 2 and 3 will focus on tests concerning single
population means for large and small samples, respectively. Sections 4 and 5 will focus
on tests concerning single population proportions for large and small samples,
respectively. The final section will teach you how to conduct tests
concerning a single population variance.
UNIT OBJECTIVES
By the end of the unit, you should be able to:
1. explain the basic concepts used in hypotheses testing;
2. formulate the null and alternative hypotheses for single population parameters;
3. conduct tests for single population means (large and small samples);
4. conduct tests for single population proportions (large and small samples); and
5. conduct tests for single population variances.
Objectives
By the end of the session, you should be able to:
1. explain the term hypothesis;
2. distinguish between null and alternative hypotheses;
3. distinguish between simple and composite tests;
4. distinguish between one-sided and two-sided tests;
5. explain the types of errors that may occur in hypothesis tests;
6. define a test statistic;
7. explain the critical (or rejection) region of a test;
8. explain the level of significance of a test;
9. distinguish between significant and non-significant tests; and
10. explain two approaches to making a decision in a hypothesis test.
Now read on …
Definition 1
A null hypothesis is the assertion that is initially assumed to be true (the prior belief).
It is denoted by H0.
Definition 2
An alternative hypothesis is the assertion that contradicts the null hypothesis. It is
denoted by H1 (or by HA in other literature).
Hypothesis testing involves the use of sample data to decide whether the null hypothesis
should be rejected or not. If evidence from sample data strongly contradicts the null
hypothesis, then the null hypothesis is rejected in favour of the alternative
hypothesis. If evidence from sample data does not strongly contradict the null
hypothesis, then we have no reason to reject it. In other words, we fail
to reject the null hypothesis if sample data do not strongly contradict it. Thus, the
possible conclusions from a hypothesis test are “reject H0 in favour of H1” or
“fail to reject H0.”
Definition 3
If θ can take on only a single value under each hypothesis, then both the null hypothesis (H0) and alternative
hypothesis (H1) are called simple hypotheses.
Definition 4
If θ can take on multiple values or a range of values under each hypothesis, then both the null (H0) and alternative (H1) hypotheses are
called composite hypotheses.
In any particular hypothesis testing problem, the null and alternative hypotheses could
be a combination of simple and composite hypotheses. In most cases, the null
hypothesis is simple or composite; and the alternative hypothesis is composite.
Definition 5
When both the null and alternative hypotheses are composite and each represents one side of
the parameter space around some value θ0, then the test is said to be a one-sided test.
One-sided tests are also called one-tailed tests. Examples of one-sided tests are:
H 0 : θ ≤ θ 0 against H 1 : θ > θ 0 , or
H 0 : θ ≥ θ 0 against H 1 : θ < θ 0 .
Definition 6
When the null hypothesis is simple and the alternative hypothesis represents the rest of
the parameter space of θ , then the test is said to be a two-sided test.
Two-sided tests are also called two-tailed tests. An example of a two-sided test is:
H 0 : θ = θ 0 against H 1 : θ ≠ θ 0 .
1. H 0 is true and it is rejected at the end of the test. This is a wrong decision, and the
resulting error is classified as Type I error.
2. H 0 is false and it is rejected at the end of the test. This is a correct decision.
3. H 0 is true and it is not rejected at the end of the test. This is a correct decision.
4. H 0 is false and it is not rejected at the end of the test. This is a wrong decision,
and the resulting error is classified as Type II error.
                       H0 is true          H0 is false
Reject H0              Type I error        Correct decision
Fail to reject H0      Correct decision    Type II error
As mentioned earlier, hypothesis testing involves the use of sample data to decide
whether the null hypothesis should be rejected or not. The decision to reject the null
hypothesis or not is based on the value of a test statistic. A test statistic is an estimator
whose value is calculated from sample data. Its distribution is known under the
assumption that the null hypothesis is true.
If there is a large difference between what is expected under the null hypothesis and
what is observed in a sample, then the null hypothesis is rejected; and the result is said
to be statistically significant. If, on the other hand, the difference between what is
expected and what is observed is small, then there is not enough evidence to reject the
null hypothesis; and the result is said to be not statistically significant.
There are two approaches to determining whether to reject the null hypothesis or not.
The first involves the determination of the rejection or critical region of the test. The
rejection or critical region is a set of values of the test statistic that will enable us to
reject H 0 . It is obtained by using a pre-determined level of significance (or size of the
test). The level of significance, denoted by α , is the probability of committing a Type I
error. The levels of significance often used in literature include α = 1% (i.e. α = 0.01 ),
or α = 5% (i.e. α = 0.05 ), or α = 10% (i.e. α = 0.10 ).
The second approach involves calculation of the p-value of the test. The p-value of the
test is the probability, computed under the assumption that the null hypothesis is true, of
observing a value of the test statistic at least as extreme as the one actually observed.
The null hypothesis is rejected for “small” p-values (usually for
p < 0.05). Generally, the null hypothesis is rejected at the level of significance α if
p < α. For values of p ≥ α, there is not enough evidence to reject the null hypothesis.
We shall limit ourselves to the first approach in this module.
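Although this module uses the critical-region approach, the p-value rule described above can be sketched in a few lines. The following is only an illustration, assuming a test statistic that follows the standard normal distribution under H0; the function name `p_value_z` and the observed value z = 2.0 are hypothetical and do not come from the text.

```python
from statistics import NormalDist

def p_value_z(z, alternative):
    """p-value of a z test.

    `alternative` is 'less', 'greater' or 'two-sided' (hypothetical labels
    for H1: theta < theta0, H1: theta > theta0 and H1: theta != theta0).
    """
    cdf = NormalDist().cdf  # standard normal cumulative distribution
    if alternative == "less":
        return cdf(z)
    if alternative == "greater":
        return 1 - cdf(z)
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

alpha = 0.05       # pre-determined level of significance
z_observed = 2.0   # illustrative value, not taken from the text
p = p_value_z(z_observed, "greater")

# Decision rule from the text: reject H0 when p < alpha.
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"p = {p:.4f}: {decision}")
```

Both approaches lead to the same decision for a given α; the p-value simply reports how far into the tail the observed statistic lies.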
The probability of rejecting the null hypothesis when, in fact, it is true is always known,
since it is the pre-determined value of α. Therefore,
rejecting the null hypothesis is a strong and reliable statistical result. On the other hand,
being unable to reject the null hypothesis is a weak result and should not necessarily
lead to its acceptance, because the probability of failing to reject the null hypothesis
when it is false is hardly known. Thus, in the event of failing to reject the null
hypothesis, the conclusion should be that there is not enough evidence to reject the null
hypothesis based on the given data.
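The statement that α is the probability of a Type I error can be checked by simulation. The sketch below is an illustration under assumed settings (a normal population with mean 0 and standard deviation 1, samples of size 25, 2000 repetitions, a two-sided z test); none of these numbers come from the text, and the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def type_i_error_rate(trials=2000, n=25, mu0=0.0, sigma=1.0, alpha=0.05, seed=1):
    """Estimate how often a two-sided z test rejects a TRUE H0: mu = mu0."""
    rng = random.Random(seed)                     # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    rejections = 0
    for _ in range(trials):
        # Sample from a population in which H0 really is true.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1                       # a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f}")
```

The estimated rate should lie close to α = 0.05, whereas the Type II error rate cannot be obtained this way without also specifying which alternative value of the mean is true.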
3. Which of the following is true about the null and alternative hypotheses?
A. Exactly one hypothesis must be true.
B. Both hypotheses must be true.
C. It is possible for both hypotheses to be true.
D. It is possible for neither hypothesis to be true.
4. One-sided alternative hypotheses are phrased in terms of ______.
A. ≠
B. > or <
C. ≈ or =
D. ≥ or ≤
5. A sampling distribution can be based on which of the following?
A. Sample means
B. Sample correlations
C. Sample proportions
D. All of the above
6. A Type II error occurs when _____.
A. the null hypothesis is false and we fail to reject it.
B. the null hypothesis is true and we reject it.
C. the sample mean differs from the population mean.
D. the test is biased.
7. The form of the alternative hypothesis can be _____.
A. one-sided.
B. two-sided.
C. neither one-sided nor two-sided.
D. one-sided or two-sided.
8. A two-sided test is one where _____.
A. results in only one direction can lead to rejection of the null hypothesis.
B. negative sample means lead to rejection of the null hypothesis.
C. results in either of two directions can lead to rejection of the null hypothesis.
D. no results lead to the rejection of the null hypothesis.
9. The value chosen for α in a hypothesis test is known as _____.
A. the rejection level.
B. the acceptance level.
C. the significance level.
D. the error in the hypothesis test.
In this session, we shall learn how to test H 0 : µ = µ 0 against any of the three possible
alternative hypotheses given that
(1) the population we are sampling is normal and its standard deviation, σ , is known;
and
(2) the standard deviation, σ , is unknown; but the sample size, n, is large.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single population mean;
2. conduct tests for single means with known population σ ; and
3. conduct tests for single means with unknown population σ ; but large sample size,
n.
Now read on …
H1 : µ < µ0 H1 : µ > µ0 H1 : µ ≠ µ0
If the population we are sampling is normal and σ is known, then the test statistic is
given by
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, \qquad (1.1)$$
where x̄ is the mean of the sample data, µ0 the value of the population mean under H0, σ the standard
deviation of the population, and n the size of the sample. Note that the term σ/√n is
called the standard error of the mean.
Table 1.3 contains the critical regions for conducting any of the tests in Table 1.2.
In Table 1.3, −z_{α} is the value of z that leaves an area of α to its left, z_{α} is the value of
z that leaves an area of α to its right, −z_{α/2} is the value of z that leaves an area of α/2 to its left, and
z_{α/2} is the value of z that leaves an area of α/2 to its right. An example of the z table is shown in Table I under Statistical
Tables at the end of the module.
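The critical values described above can also be computed rather than read from a table. The sketch below uses Python's standard library; the function names and the labels 'less', 'greater' and 'two-sided' are hypothetical conveniences, and the rejection rules are the standard ones for a z test (reject for z ≤ −z_{α}, z ≥ z_{α}, or |z| ≥ z_{α/2}).

```python
from statistics import NormalDist

def z_critical(alpha):
    """Return z_alpha: the value with area alpha to its right under N(0, 1)."""
    return NormalDist().inv_cdf(1 - alpha)

def in_critical_region(z, alpha, alternative):
    """True if the observed z falls in the rejection region of the test."""
    if alternative == "less":        # H1: mu < mu0
        return z <= -z_critical(alpha)
    if alternative == "greater":     # H1: mu > mu0
        return z >= z_critical(alpha)
    if alternative == "two-sided":   # H1: mu != mu0
        return abs(z) >= z_critical(alpha / 2)
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

for a in (0.10, 0.05, 0.025, 0.01, 0.005):
    print(f"z_{a} = {z_critical(a):.3f}")
```

Computed values may differ from printed tables in the third decimal place; for instance, z_{0.005} evaluates to about 2.576, while many tables list 2.575.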
We now illustrate how to conduct each of the tests in Table 1.2 with the following three
examples.
Example 1.1
The scores of an examination for some students have been normally distributed for
some time now with mean 200 and standard deviation 16. Currently some lecturers
think that the performance has gone down. To support this claim, scores of 100 students
were taken for a study. It was found that the mean for the hundred students was 193.2.
(a) Set up the null and alternative hypotheses.
(b) Will you agree with the lecturers’ claim at the 5% significance level?
Solution
(a) The null and alternative hypotheses are H0: µ = 200 (in this case µ0 = 200)
against H1: µ < 200.
(b) The test statistic is
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{193.2 - 200}{16 / \sqrt{100}} = -4.25$$
From the z-tables, the value of z that leaves an area of 0.05 to its left is −1.645. Thus
we would reject the null hypothesis if z ≤ −1.645. Since z = −4.25 is less than
−1.645, we reject H0 in favour of H1. We therefore agree with the lecturers that
the performance of students has gone down.
Example 1.2
The sales of a store had an average of GHc/ 8,000 per day. The store introduced several
advertising campaigns in order to increase sales. To determine whether or not the
advertising campaigns have been effective, a sample of 64 days of sales was selected. It
was found that the average was GHc/ 8,300 per day. From past information, it is known
that the population follows a normal distribution with a standard deviation of
GHc/ 1,200 .
(a) Set up the null and alternative hypotheses.
(b) Test whether or not the advertising campaigns were effective at the 0.05 level of
significance.
Solution
(a) We wish to test H0: µ = 8,000 against H1: µ > 8,000.
(b) The test statistic is
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{8{,}300 - 8{,}000}{1{,}200 / \sqrt{64}} = 2.00$$
From the z-tables, the value of z that leaves an area of 0.05 to its right is 1.645. Thus
we would reject the null hypothesis if z ≥ 1.645. Since z = 2.00 is greater than
1.645, we reject H0 in favour of H1. We therefore conclude that the advertising
campaigns were effective.
Example 1.3
The head teacher of a certain Junior High School claims that the average height of
students in his school equals 130 cm. A random sample of nine students was selected
and their average height was found to be 131.08 cm. Suppose that the distribution of
heights of students is normal with standard deviation 1.5 cm.
(a) Set up the null and alternative hypotheses.
(b) Determine whether the data contradict the head teacher’s claim, at the 0.01 level
of significance.
Solution
(a) The null hypothesis is H0: µ = 130 and the alternative hypothesis is H1: µ ≠ 130
(Note: both µ < 130 and µ > 130 contradict H0).
(b) The test statistic is
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{131.08 - 130}{1.5 / \sqrt{9}} = 2.16$$
From the z-tables, the value of z that leaves an area of 0.005 to its right is 2.575
(and −2.575 leaves an area of 0.005 to its left). Thus, we would reject the null hypothesis if z ≥ 2.575 or
z ≤ −2.575. Now since z = 2.16 is neither greater than 2.575 nor less than
−2.575, we are unable to reject the null hypothesis at the 0.01 level of
significance. We therefore conclude that the data do not contradict the head
teacher’s claim.
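The arithmetic in Examples 1.1 to 1.3 can be reproduced with a short script. This is a sketch only: the function name `z_statistic` is hypothetical, and the critical values 1.645 and 2.575 are the table values quoted in the examples rather than values computed here.

```python
from math import sqrt

def z_statistic(x_bar, mu0, sigma, n):
    """Test statistic of Equation (1.1): (x_bar - mu0) / (sigma / sqrt(n))."""
    return (x_bar - mu0) / (sigma / sqrt(n))

# Example 1.1: H1: mu < 200 at the 0.05 level; reject if z <= -1.645
z1 = z_statistic(193.2, 200, 16, 100)
reject1 = z1 <= -1.645

# Example 1.2: H1: mu > 8000 at the 0.05 level; reject if z >= 1.645
z2 = z_statistic(8300, 8000, 1200, 64)
reject2 = z2 >= 1.645

# Example 1.3: H1: mu != 130 at the 0.01 level; reject if |z| >= 2.575
z3 = z_statistic(131.08, 130, 1.5, 9)
reject3 = abs(z3) >= 2.575

for name, z, reject in [("1.1", z1, reject1), ("1.2", z2, reject2), ("1.3", z3, reject3)]:
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"Example {name}: z = {z:.2f}, {verdict}")
```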
2.2 Tests for Means from Single Populations with unknown σ but
large sample size, n
We now turn our attention to test means from single populations whose standard
deviations, σ , are not known, but the sample sizes are large.
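When the population standard deviation, σ, is not known but the sample size is large
enough (n ≥ 30), the z tests in Session 2.1 can easily be modified to yield valid test
procedures. A large sample size n implies that the standardized variable
$$z = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
has approximately a standard normal distribution. The test statistic is therefore
$$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$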
CoDEUCC/Post-Diploma in Mathematics and Science Education 15
UNIT 1 TESTS CONCERNING A SINGLE POPULATION
SESSION 2 MEAN (LARGE SAMPLES)
Example 1.4
A random sample of size n = 100 observations taken from a population with mean µ
yielded the sample mean x = 18.9 and sample standard deviation of s = 12.6 . If the
hypotheses are H 0 : µ = 16 and H1 : µ ≠ 16 ,
(a) calculate the value of the appropriate test statistic for this test;
(b) hence determine whether H 0 should be rejected if the significance level were 1%.
Solution
(a) In this problem, we do not know σ, but the sample size, n = 100, is large. We
therefore replace σ with s in Equation (1.1) to get
$$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}.$$
Substituting x̄ = 18.9, µ0 = 16, s = 12.6, and n = 100 into the equation above,
we obtain
$$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{18.9 - 16}{12.6 / \sqrt{100}} = 2.30$$
(b) From the z-tables, z_{0.005} = 2.575. Thus, we would reject the null hypothesis if
z ≥ 2.575 or z ≤ −2.575. Now since z = 2.30 is neither greater than 2.575 nor
less than −2.575, we cannot reject the null hypothesis at the 0.01 level of
significance.
14.1 14.5 15.5 16.0 16.0 16.7 16.9 17.1 17.5 17.8
17.8 18.1 18.2 18.3 18.3 19.0 19.2 19.4 20.0 20.0
20.8 20.8 21.0 21.5 23.5 27.5 27.5 28.0 28.3 30.0
30.0 31.6 31.7 31.7 32.5 33.5 33.9 35.0 35.0 35.0
36.7 40.0 40.0 41.3 41.7 47.5 50.0 51.0 51.8 54.4
55.0 57.0
In some situations, we may not know the standard deviation, σ , of the population we
are sampling and yet the sample size may be small (i.e. less than 30). In this session, we
shall learn how to perform tests for single population means when σ is not known and
n < 30 .
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single population means;
and
2. conduct tests for single means for populations with unknown σ and small sample
size, n.
Now read on …
If the population standard deviation σ is unknown and the sample size n is small, then
we cannot assume that the sample standard deviation s will be a good approximation for
σ . We must, therefore, use the t-distribution instead of the standard normal z
distribution to make inferences about a population mean, µ .
Therefore, for tests concerning single population means where σ is not known and n is
small, we use the t statistic:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
Table 1.4 contains the critical regions for testing H 0 : µ = µ 0 against any of the
possible alternatives.
Note that t_{α} and t_{α/2} are based on n − 1 degrees of freedom. Their values can be read
from the t-table (see Table II under Statistical Tables at the end of the module). Let us
illustrate with an example.
Example 1.5
The manufacturer of a new fiberglass tire claims that its average life will be at least
40,000 miles. To verify this claim a sample of 12 tires is tested, with their lifetimes (in
1,000s of miles) being as follows:
Tire Life
1 36.1
2 40.2
3 33.8
4 38.5
5 42.0
6 35.8
7 37.0
8 41.0
9 36.8
10 37.2
11 33.0
12 36.0
Since σ is not known and n = 12 is small, we have to use the
t-distribution with 11 (= n − 1) degrees of freedom. From the data, n = 12, µ0 = 40,
x̄ = 37.2833 and s = 2.7319.
Substituting the above into the t statistic, we obtain
$$t = \frac{37.2833 - 40}{2.7319 / \sqrt{12}} = -3.445$$
From the one-tailed t-tables, the value that corresponds to α = 0.05 with 11 degrees of
freedom is 1.796. That is, t_{0.05}(11) = 1.796. Since t = −3.445 is less than
−t_{0.05}(11) = −1.796, we reject H0 and conclude that the average life of the new
fiberglass tire is less than 40,000 miles.
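The summary statistics and the t statistic of Example 1.5 can be recomputed from the raw tire data as a check. This is a minimal sketch using only the standard library; the critical value 1.796 is the table value t_{0.05}(11) quoted in the example, not something computed by the code.

```python
from math import sqrt
from statistics import mean, stdev

# Tire lifetimes from Example 1.5, in thousands of miles
lifetimes = [36.1, 40.2, 33.8, 38.5, 42.0, 35.8,
             37.0, 41.0, 36.8, 37.2, 33.0, 36.0]

mu0 = 40.0                 # hypothesized mean life
n = len(lifetimes)
x_bar = mean(lifetimes)    # sample mean
s = stdev(lifetimes)       # sample standard deviation (divisor n - 1)

t = (x_bar - mu0) / (s / sqrt(n))

t_crit = 1.796             # t_0.05 with n - 1 = 11 degrees of freedom, from the t-table
reject = t <= -t_crit      # lower-tailed test

print(f"x_bar = {x_bar:.4f}, s = {s:.4f}, t = {t:.3f}, reject H0: {reject}")
```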
Example 1.6
A sample of 12 radon detectors of a certain type was selected, and each was exposed to
100 pCi/L of radon. The resulting readings were as follows:
105.6 90.9 91.2 96.9 96.5 91.3
100.1 105.5 99.6 107.7 103.3 92.4
Do these data suggest that the population mean reading under these conditions differs
from 100?
(a) State the null and alternative hypotheses.
(b) Test these hypotheses at α = 0.05.
Solution
We wish to test
H 0 : µ = 100 against H1 : µ ≠ 100 .
From the t-tables, the value that corresponds to an upper-tail area of α/2 = 0.025 with 11 degrees of
freedom is 2.201. That is, t_{0.025}(11) = 2.201. Since t = −0.8885 is neither less than
−t_{0.025}(11) = −2.201 nor greater than t_{0.025}(11) = 2.201, we cannot reject H0. We,
therefore, conclude that the data do not suggest that the population mean reading
under these conditions differs from 100.
1. A manufacturer of computer disk drives monitors the retail prices of its drives in
order to gauge the market. For one type of drive the list price is GHc/ 750 , and the
manufacturer wishes to know whether the current mean retail price differs from
the list price. Seventeen retail establishments are sampled, and the current prices
for the drive are determined. The mean and standard deviation for the 17 retail
prices are calculated:
x = GHc/ 732 s = GHc/ 38
Does this sample provide sufficient evidence to conclude that the mean retail
price differs from the list price of GHc/ 750 ?
Do the data supply sufficient evidence to allow the manufacturer to conclude that
this type of engine meets the pollution standard? Assume that the manufacturer is
willing to risk a Type I error with probability equal to α = 0.01 .
3. The specifications for a certain kind of ribbon call for a mean breaking strength of
185 pounds. If five pieces randomly selected from different rolls have breaking
strengths of 171.6, 191.8, 187.3, 184.9 and 189.1 pounds, test the hypotheses
H 0 : µ = 185 against H1 : µ < 185
at the 0.05 level of significance.
In this session, you will learn about one of the most common tests based on count data,
namely, a test concerning the parameter, p, of the binomial distribution. You will learn
to conduct tests to determine whether the proportion (or percentage) of votes obtained
by a presidential candidate is enough to win him/her the presidency, or whether the true
proportion of schools on the School Feeding Programme is 20%.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single population
proportion; and
2. conduct tests concerning single population proportions for large samples.
Now read on …
Tests concerning the population proportion, p, are based on a random sample of size n
from the population. If n is large [np ≥ 10 and n(1 − p) ≥ 10], then both x (the number
of successes in the experiment) and the estimator, p̂ = x/n, are approximately normally
distributed.
Suppose that we wish to test the null hypothesis H0: p = p0 against any of the
alternatives H1: p < p0 or H1: p > p0 or H1: p ≠ p0. If n is large and H0 is true,
then the test statistic is given by
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}. \qquad (1.2)$$
Note that these test procedures are valid provided that n is large
[i.e. np0 ≥ 10 and n(1 − p0) ≥ 10].
In the next three examples, we illustrate how to conduct each of the possible tests.
Example 1.7
An oil company claims that less than 20% of all car owners have not tried its gasoline.
Test the claim, at the 0.01 level of significance, if a random check reveals that 22 out of
200 car owners have not tried the company’s gasoline.
Solution
We wish to test
H0: p = 0.20 against H1: p < 0.20.
Here p0 = 0.20, the number of successes, x = 22, and the sample size is n = 200.
Thus, p̂ = 22/200 = 0.11. Substituting these into Equation (1.2), we obtain
$$z = \frac{0.11 - 0.20}{\sqrt{0.20 (1 - 0.20) / 200}} = \frac{-0.09}{0.0283} = -3.1802.$$
From the z tables, z_{0.01} = 2.33. Therefore, the rejection region is z < −2.33. Since
z = −3.1802 < −2.33, we reject the null hypothesis. We, therefore, conclude that less
than 20% of all car owners have not tried the company’s gasoline.
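The large-sample test for a proportion can be sketched as a small function. The name `proportion_z` and the validity check are illustrative choices, not part of the text; the data are those of Example 1.7, and 2.33 is the table value z_{0.01} quoted there.

```python
from math import sqrt

def proportion_z(x, n, p0):
    """Test statistic of Equation (1.2) for H0: p = p0, given x successes in n trials."""
    # Large-sample condition stated in the text: n*p0 >= 10 and n*(1 - p0) >= 10
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("sample too small for the normal approximation")
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Example 1.7: 22 of 200 car owners, H0: p = 0.20 against H1: p < 0.20
z = proportion_z(22, 200, 0.20)
reject = z < -2.33   # rejection region at the 0.01 level

print(f"z = {z:.4f}, reject H0: {reject}")
```

Carrying full precision gives z of about −3.182; the small difference from the hand calculation comes from rounding the standard error to 0.0283 before dividing.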
Example 1.8
Natural cork in wine bottles is subject to deterioration, and as a result wine in such
bottles may experience contamination. It is reported that, in a tasting of commercial
Solution
We wish to test
H0: p = 0.15 against H1: p > 0.15.
Here p0 = 0.15, the number of successes, x = 16, and the sample size is n = 91. Thus,
p̂ = 16/91 = 0.1758. Substituting these into Equation (1.2), we obtain
$$z = \frac{0.1758 - 0.15}{\sqrt{0.15 (1 - 0.15) / 91}} = \frac{0.0258}{0.0374} = 0.6898.$$
From the z tables, z_{0.10} = 1.28. Therefore, the rejection region is z > 1.28. Since
z = 0.6898 < 1.28, we fail to reject the null hypothesis. We therefore say that there is no
strong evidence to conclude that more than 15% of all such bottles are contaminated.
Example 1.9
In a certain school, 500 students were polled and 245 of them endorsed the new system
of paying club dues. Test the null hypothesis
H0: p = 0.55
against
H1: p ≠ 0.55
at the 0.05 level of significance.
Solution
Here p0 = 0.55, the number of successes, x = 245, and the sample size is n = 500.
Thus, p̂ = 245/500 = 0.49. Substituting these into Equation (1.2), we obtain
$$z = \frac{0.49 - 0.55}{\sqrt{0.55 (1 - 0.55) / 500}} = \frac{-0.06}{0.0222} = -2.703.$$
From the z tables, z_{α/2} = z_{0.05/2} = z_{0.025} = 1.96. Therefore, the rejection region is either
z < −1.96 or z > 1.96. Since z = −2.703 < −1.96, we reject the null hypothesis and
conclude that the percentage of students who endorsed the new system of paying dues is
different from 55%.
2. A manufacturer of a spot remover claims that his product removes 90% of all
spots. If, in a sample, only 174 of 200 spots were removed with the
manufacturer’s product, test the null hypothesis p = 0.90 against the alternative
hypothesis p < 0.90 at the 0.05 level of significance.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single
population proportion; and
2. conduct tests concerning single population proportions for small samples.
Now read on …
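Here B(k; n, p0) denotes the binomial probability of exactly k successes in n trials when p = p0.
The critical region, if the alternative hypothesis were H1: p < p0, would be x ≤ k_{α},
where k_{α} is the largest integer for which
$$\sum_{k=0}^{k_{\alpha}} B(k; n, p_0) \le \alpha;$$
The critical region, if the alternative hypothesis were H1: p > p0, would be x ≥ k*_{α},
where k*_{α} is the smallest integer for which
$$\sum_{k=k^{*}_{\alpha}}^{n} B(k; n, p_0) \le \alpha.$$
The critical region, if the alternative hypothesis were H1: p ≠ p0, would be x ≤ k_{α/2} or x ≥ k*_{α/2},
where k_{α/2} is the largest integer for which
$$\sum_{k=0}^{k_{\alpha/2}} B(k; n, p_0) \le \frac{\alpha}{2};$$
and k*_{α/2} is the smallest integer for which
$$\sum_{k=k^{*}_{\alpha/2}}^{n} B(k; n, p_0) \le \frac{\alpha}{2}.$$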
Example 1.10
It is claimed that 40% of patients that attend a certain clinic on any day are smokers.
Suppose that on a particular day, 3 out of a sample of 13 patients attending the clinic
were found to be smokers. Test the hypothesis
H 0 : p = 0.40
against
H 1 : p ≠ 0.40
Solution
In this problem, x = 3 and n = 13. Since α = 0.05, k_{α/2} = k_{0.025}. Now, from binomial
tables with n = 13 and p0 = 0.40, k_{0.025} = 1 and k*_{0.025} = 10.
To be able to reject the null hypothesis, the number of successes, x, must be either less than or
equal to 1, or greater than or equal to 10. Since x = 3 is neither less than or equal to 1 nor
greater than or equal to 10, we cannot reject the null hypothesis. We, therefore, conclude that
the data do not contradict the claim that 40% of patients that attend the clinic on any day are smokers.
Example 1.11
It has been claimed that more than 40% of all shoppers can identify a highly advertised
trademark. If, in a sample, 10 of 18 shoppers were able to identify the trademark, test at
the 0.05 level of significance whether H 0 : p = 0.40 can be rejected against
H1 : p > 0.40 .
Solution
In this problem, x = 10 and n = 18. Since α = 0.05, k*_{α} = k*_{0.05}, and the critical region is
x ≥ k*_{0.05}, where k*_{0.05} is the smallest integer for which
$$\sum_{k=k^{*}_{0.05}}^{n} B(k; n, p_0) \le 0.05.$$
From binomial tables with n = 18 and p0 = 0.40, k*_{0.05} = 12.
Since x = 10 is not greater than or equal to 12, we are not able to reject H0. We conclude,
therefore, that there is not enough evidence that more than 40% of all shoppers can identify the highly advertised
trademark.
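The cut-off values used in Examples 1.10 and 1.11 can be found by summing binomial probabilities directly instead of consulting tables. The sketch below assumes that B(k; n, p0) denotes the probability of exactly k successes; the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_critical(n, p0, tail_area):
    """Largest k with P(X <= k) <= tail_area, or None if no such k exists."""
    total, best = 0.0, None
    for k in range(n + 1):
        total += binom_pmf(k, n, p0)
        if total > tail_area:
            break
        best = k
    return best

def upper_critical(n, p0, tail_area):
    """Smallest k with P(X >= k) <= tail_area, or None if no such k exists."""
    total, best = 0.0, None
    for k in range(n, -1, -1):
        total += binom_pmf(k, n, p0)
        if total > tail_area:
            break
        best = k
    return best

# Example 1.10: n = 13, p0 = 0.40, two-sided test with alpha = 0.05
k_low = lower_critical(13, 0.40, 0.05 / 2)
k_high = upper_critical(13, 0.40, 0.05 / 2)

# Example 1.11: n = 18, p0 = 0.40, upper-tailed test with alpha = 0.05
k_star = upper_critical(18, 0.40, 0.05)

print(k_low, k_high, k_star)
```

Because the binomial distribution is discrete, the actual probability of a Type I error with these cut-offs is usually smaller than the nominal α.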
In this session, you will learn how to conduct tests concerning single population
variances. In particular, you will learn how to test the null hypothesis H0: σ² = σ0²
against any one of the alternatives H1: σ² < σ0², or H1: σ² > σ0², or H1: σ² ≠ σ0².
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single population
variance; and
2. conduct tests concerning single population variances.
Now read on …
Suppose that we wish to test the null hypothesis H0: σ² = σ0² against any one of the
three alternatives H1: σ² < σ0², or H1: σ² > σ0², or H1: σ² ≠ σ0². If the population we
are sampling has a normal distribution, then the test statistic is given by
$$\chi^2 = \frac{(n - 1) s^2}{\sigma_0^2} \qquad (1.3)$$
where χ² is the value of a random variable having the chi-square distribution with
n − 1 degrees of freedom. Values of χ² can be read from chi-square distribution
tables (see Table IV under Statistical Tables at the end of the module).
Table 1.6 contains the critical regions for testing H0: σ² = σ0² against any of the three
possible alternatives.
Example 1.12
Given that n = 25 and s² = 9, test H0: σ² = 10 against the two-sided alternative
H1: σ² ≠ 10 at the 1% significance level.
Solution
The value of the test statistic is
$$\chi^2 = \frac{(25 - 1)(9)}{10} = 21.6.$$
Since we are dealing with a two-sided test and the significance level is 1%, the critical
region consists of values of χ² less than χ²_{0.995} or greater than χ²_{0.005}. From chi-square tables, the value
of χ²_{0.995} (with 24 degrees of freedom) is 9.886 and that of χ²_{0.005} (with 24 degrees of
freedom) is 45.558. Since χ² = 21.6 is neither less than 9.886 nor greater than 45.558,
we cannot reject the null hypothesis.
Example 1.13
Suppose that the thickness of a part used in a semiconductor is its critical dimension and
that measurements of the thickness of a random sample of 18 such parts have the
variance s² = 0.68, where the measurements are in thousandths of an inch. The process is
considered to be under control if the variation of the thickness is given by a variance not
greater than 0.36. Assume that the measurements constitute a random sample from a
normal population and test the null hypothesis H0: σ² = 0.36 against H1: σ² > 0.36 at the
0.05 level of significance.
Solution
The value of the test statistic is
$$\chi^2 = \frac{(18 - 1)(0.68)}{0.36} = 32.11.$$
From chi-square tables, the value of χ²_{0.05} (with 17 degrees of freedom) is 27.587.
Since χ² = 32.11 is greater than 27.587, we reject the null hypothesis.
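The statistic of Equation (1.3) is simple to evaluate for both examples. In this sketch the function name is hypothetical, and the critical values 9.886, 45.558 and 27.587 are the chi-square table values quoted in Examples 1.12 and 1.13 (the last being the upper 5% point for n − 1 = 17 degrees of freedom), not values computed by the code.

```python
def chi_square_statistic(n, s_squared, sigma0_squared):
    """Test statistic of Equation (1.3): (n - 1) * s^2 / sigma0^2."""
    return (n - 1) * s_squared / sigma0_squared

# Example 1.12: two-sided test at the 1% level, 24 degrees of freedom
chi2_a = chi_square_statistic(25, 9, 10)
reject_a = chi2_a < 9.886 or chi2_a > 45.558

# Example 1.13: upper-tailed test at the 5% level, 17 degrees of freedom
chi2_b = chi_square_statistic(18, 0.68, 0.36)
reject_b = chi2_b > 27.587

print(f"Example 1.12: chi-square = {chi2_a:.2f}, reject H0: {reject_a}")
print(f"Example 1.13: chi-square = {chi2_b:.2f}, reject H0: {reject_b}")
```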
Note that if the population is not normal but the size of the sample is large, then we can
test the null hypothesis by using the statistic
$$z = \frac{s - \sigma_0}{\sigma_0 / \sqrt{2n}}.$$
2. In a random sample, the weights of 24 Black Angus steers of a certain age have a
standard deviation of 238 pounds. Assuming that the weights constitute a random
sample from a normal population, test the null hypothesis H 0 : σ = 250 against the
two-sided alternative H1: σ ≠ 250 at the 0.01 level of significance.
3. In a random sample, s = 2.53 minutes for the amount of time that 30 women took
to complete the written test for their driver’s licenses. At the 0.05 level of
significance, test the null hypothesis H 0 : σ = 2.85 against the alternative
hypothesis H1 : σ < 2.85 minutes.
4. Past data indicate that the standard deviation of measurements made on sheet
metal stampings by experienced inspectors is 0.41 square inch. If a new inspector
measures 50 stampings with a standard deviation of 0.49 square inch, test the null
hypothesis H0: σ = 0.41 against the alternative hypothesis H1: σ > 0.41 square
inch.
In this unit, you will learn about tests of hypotheses concerning two populations.
You will learn how to conduct tests concerning two population means under various
assumptions, tests concerning two population proportions, and tests concerning two population
variances.
UNIT OBJECTIVES
By the end of the unit, you should be able to:
1. formulate null and alternative hypotheses for two population parameters;
2. conduct tests concerning two population means (large and independent samples);
3. conduct tests concerning two population means (small and independent samples,
σ 1 and σ 2 assumed equal);
4. conduct tests concerning two population means (small and independent samples,
σ 1 and σ 2 assumed unequal);
5. conduct tests concerning two population proportions; and
6. conduct tests concerning two population variances.
Each one of the scenarios above calls for an investigation to find the right answers. In
this session, you will learn to conduct tests concerning the means of two populations. In
particular, you will learn to conduct tests concerning two population means based on
large and independent samples.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses concerning tests on two population
means;
2. conduct tests concerning means of two independent samples drawn from
populations with known standard deviations; and
3. conduct tests concerning means of two large and independent samples drawn from
populations with unknown standard deviations.
Now read on …
All of the tests you learnt about in Unit 1 involved a single population. We now turn our
attention to the problem of comparing the means of two populations based on two
independent random samples drawn from these populations.
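Suppose that we have two independent random samples with means x̄1 and x̄2, and
respective sample sizes n1 and n2, from normal populations with means µ1 and µ2
and known variances σ1² and σ2². To test the null hypothesis H0: µ1 − µ2 = δ, we use
the test statistic

$$ z = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \tag{2.1} $$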
Can you think about the critical regions for the various tests? The critical regions
remain the same as those in Table 1.3 in Session 2 of Unit 1. This is the case because
the test statistic is the same standard normal random variable, z.
Example 2.1
A random sample of 100 observations is drawn from a normal population with variance
16 and the sample mean was found to be 10.8. Another sample of 64 observations is
drawn from a second and independent normal population with variance 25 and the
sample mean was found to be 9.6. Test the hypotheses:
H 0 : the population means are equal.
against
H1 : the population means are not equal.
at 5% significance level.
Solution
The hypotheses above are equivalent to
H0: µ1 − µ2 = 0
against
H1: µ1 − µ2 ≠ 0
Substituting into Equation (2.1) with δ = 0, we obtain

$$ z = \frac{10.8 - 9.6 - 0}{\sqrt{\dfrac{16}{100} + \dfrac{25}{64}}} = 1.6172 . $$

From z tables, $z_{\alpha/2} = z_{0.025} = 1.96$, giving the critical region as z < −1.96 or z > 1.96.
Since z = 1.617 is neither less than −1.96 nor greater than 1.96, we cannot reject the
null hypothesis. We conclude, therefore, that there is insufficient evidence that the
population means differ.
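The arithmetic of Example 2.1 can be checked with a short Python sketch. The function name `two_sample_z` and its parameter names are illustrative assumptions, not part of the course text; the critical value is taken from the standard library's `statistics.NormalDist` rather than from the z tables.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, mean2, var1, var2, n1, n2, delta=0.0):
    """Test statistic of Equation (2.1) for known population variances."""
    standard_error = sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2 - delta) / standard_error


# Data of Example 2.1
z = two_sample_z(10.8, 9.6, 16, 25, 100, 64)

# Two-sided test at the 5% level: reject H0 when |z| exceeds z_{0.025}
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_crit

print(f"z = {z:.4f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```

For a one-sided alternative, the same statistic would be compared with the one-tailed critical value instead of using the absolute value.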
2.2 Tests Concerning Two Population Means (Large Independent Samples with
σ1² and σ2² unknown)
When independent random samples are drawn from populations with unknown variances
σ1² and σ2², which may not even be normal, we can still conduct the test described
under Session 2.1 with s1² and s2² substituted for σ1² and σ2², respectively, in
Equation (2.1), provided both n1 and n2 are large. In that case, the test statistic in
Equation (2.1) becomes

$$ z = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{2.2} $$

where s1² and s2² are the respective sample estimates of σ1² and σ2², and z is the usual
standard normal random variable.
Example 2.2
Suppose that we have randomly selected two independent samples from populations
having means µ1 and µ2, with x̄1 = 25, x̄2 = 20, s1 = 3, s2 = 4, n1 = 100 and n2 = 100.
Test the null hypothesis H0: µ1 − µ2 = 0 against H1: µ1 − µ2 > 0 at the 0.05 level of
significance. What do you conclude about how µ1 and µ2 compare?
Solution
Substituting into Equation (2.2) with δ = 0, we obtain

$$ z = \frac{25 - 20 - 0}{\sqrt{\dfrac{3^2}{100} + \dfrac{4^2}{100}}} = \frac{5}{0.5} = 10 . $$

From z tables, $z_{0.05} = 1.645$, giving the critical region as z > 1.645. Since z = 10 is
greater than 1.645, we reject the null hypothesis and conclude that µ1 is greater than
µ2.
5. A study of the number of business lunches that executives in the insurance and
banking industries claim as deductible expenses per month was based on random
samples and yielded the following results:
     n1 = 40     x̄1 = 9.1     s1 = 1.9
     n2 = 50     x̄2 = 8.0     s2 = 2.1
Assuming that the population variances σ1² and σ2² are equal, test the null
hypothesis H0: µ1 − µ2 = 0 against H1: µ1 − µ2 ≠ 0 at the 0.05 level of
significance.
6. Sample surveys conducted in a large county in a certain year and again 20 years
later showed that originally the average height of 400 ten-year old boys was 53.8
inches with a standard deviation of 2.4 inches, whereas 20 years later the average
height of 500 ten-year old boys was 54.5 inches with a standard deviation of 2.5
inches. Assuming that the population variances σ1² and σ2² are equal, test
H0: µ1 − µ2 = −0.5 against H1: µ1 − µ2 < −0.5 at the 0.05 level of significance.
In this session, we shall learn how to perform tests concerning means of two small
( n 1< 30 and n 2 < 30 ) independent samples drawn from populations with unknown
standard deviations.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests concerning two population
means; and
2. conduct tests concerning means of two small and independent samples drawn
from populations with unknown variances which are assumed equal.
Now read on …
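Suppose that we have two independent random samples with means x̄1 and x̄2, and
respective sample sizes n1 < 30 and n2 < 30, from populations with means µ1 and µ2
and unknown variances σ1² and σ2². If σ1² and σ2² are assumed to be equal, the
appropriate test statistic is

$$ t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \tag{2.3} $$

where the pooled standard deviation s_p is given by

$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \tag{2.4} $$

Here s1² and s2² are the respective variances of sample 1 and sample 2, and t has the
t-distribution with n1 + n2 − 2 degrees of freedom.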
Can you think about the critical regions for the various tests under the first assumption?
The critical regions remain the same as those in Table 1.4 in Session 3 of Unit 1. This is
because the test statistic again follows the t-distribution.
Example 2.3
Two independent random samples of sizes n1 = 16 and n2 = 10 from normal
populations with unknown standard deviations have means x̄1 = 23.4 and x̄2 = 18.2,
with corresponding standard deviations s1 = 3.5 and s2 = 4.8. Test H0: µ1 − µ2 = δ
against H1: µ1 − µ2 > δ, with δ = 0, at the 10% significance level, assuming that the
population variances are equal.
Solution
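Assuming that the population variances are equal, we first substitute n1 = 16, s1 = 3.5,
n2 = 10, s2 = 4.8 into Equation (2.4) to evaluate s_p as follows:

$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(16 - 1)(3.5)^2 + (10 - 1)(4.8)^2}{16 + 10 - 2}} = 4.04 $$

Substituting into Equation (2.3) with δ = 0 then gives

$$ t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{23.4 - 18.2 - 0}{(4.04)\sqrt{\dfrac{1}{16} + \dfrac{1}{10}}} = 3.193 $$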
From the t-tables, t 0.10 with 24 (= 16 + 10 − 2) degrees of freedom is 1.318. Thus, the
critical region is t > 1.318 . Since t = 3.193 > 1.318 , we reject H 0 and conclude that
µ1 > µ 2 .
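The pooled-variance calculation of Example 2.3 can be reproduced with a minimal Python sketch. The helper names `pooled_sd` and `pooled_t` are assumptions made for illustration, and the critical value is simply the t-table value quoted in the example.

```python
from math import sqrt


def pooled_sd(s1, s2, n1, n2):
    """Pooled standard deviation s_p of Equation (2.4)."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def pooled_t(mean1, mean2, s1, s2, n1, n2, delta=0.0):
    """Test statistic of Equation (2.3); returns (t, degrees of freedom)."""
    sp = pooled_sd(s1, s2, n1, n2)
    t = (mean1 - mean2 - delta) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Data of Example 2.3
t, df = pooled_t(23.4, 18.2, 3.5, 4.8, 16, 10)
T_CRIT = 1.318  # t_{0.10} with 24 degrees of freedom, read from t-tables

print(f"s_p = {pooled_sd(3.5, 4.8, 16, 10):.4f}, t = {t:.4f}, df = {df}")
print("reject H0:", t > T_CRIT)
```

Because the sketch does not round s_p to 4.04 before dividing, its value of t differs from the hand calculation in the third decimal place; the decision is unaffected.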
Example 2.4
In the comparison of two kinds of paint, a consumer testing service finds that four
1-gallon cans of one brand cover on the average 546 square feet with a standard
deviation of 31 square feet, whereas four 1-gallon cans of another brand cover on the
average 492 square feet with a standard deviation of 26 square feet. Assuming that the
two populations sampled are normal and have equal variances, test the null hypothesis
H0: µ1 − µ2 = 0 against H1: µ1 − µ2 > 0 at the 0.05 level of significance.
Solution
Assuming that the population variances are equal, we first substitute n1 = 4, s1 = 31,
n2 = 4, s2 = 26 into Equation (2.4) to evaluate s_p as follows:

$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(4 - 1)(31)^2 + (4 - 1)(26)^2}{4 + 4 - 2}} = 28.609 $$
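Substituting x̄1 = 546, x̄2 = 492, s_p = 28.609 and δ = 0 into Equation (2.3) gives

$$ t = \frac{546 - 492 - 0}{(28.609)\sqrt{\dfrac{1}{4} + \dfrac{1}{4}}} = 2.67 $$

From the t-tables, $t_{0.05}$ with 6 (= 4 + 4 − 2) degrees of freedom is 1.9432. Thus, the
critical region is t > 1.9432. Since t = 2.67 > 1.9432, we reject H0 and conclude that
µ1 > µ2.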
Example 2.5
A production supervisor at a major chemical company wishes to determine which of
two catalysts, catalyst XA–100 or catalyst ZB–200, maximizes the hourly yield of a
chemical process. In order to compare the mean hourly yields obtained by using the two
catalysts, the supervisor runs the process using each catalyst for five one-hour periods.
The resulting yields (in pounds per hour) for each catalyst, along with the means and
variances of the yields, are given below:
     Catalyst XA–100     Catalyst ZB–200
          801                 752
          814                 718
          784                 776
          836                 742
          820                 763
     x̄1 = 811            x̄2 = 750.2
     s1² = 386            s2² = 484.2
Assuming that the two populations sampled are normal and have equal variances, test
H0: µ1 − µ2 = 0 against H1: µ1 − µ2 ≠ 0 by setting α equal to 0.10, 0.05, 0.01 and
0.001. How much evidence is there that µ1 and µ2 differ?
Solution
Assuming that the population variances are equal, we first substitute n1 = 5, s1² = 386,
n2 = 5, s2² = 484.2 into Equation (2.4) to evaluate s_p as follows:

$$ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(5 - 1)(386) + (5 - 1)(484.2)}{5 + 5 - 2}} = 20.859 $$
Substituting x̄1 = 811, x̄2 = 750.2, s_p = 20.859 and δ = 0 into Equation (2.3) gives

$$ t = \frac{811 - 750.2 - 0}{(20.859)\sqrt{\dfrac{1}{5} + \dfrac{1}{5}}} = 4.6087 . $$

(a) If α = 0.10, then from the t-tables, $t_{\alpha/2} = t_{0.05}$ with 8 (= 5 + 5 − 2)
degrees of freedom is 1.8595. Thus, the critical region is t < −1.8595 or
t > 1.8595. Since t = 4.6087 > 1.8595, we reject H0 and conclude that µ1 ≠ µ2.
(b) If α = 0.05, then from the t-tables, $t_{\alpha/2} = t_{0.025}$ with 8 (= 5 + 5 − 2)
degrees of freedom is 2.3060. Thus, the critical region is t < −2.3060 or
t > 2.3060. Since t = 4.6087 > 2.3060, we reject H0 and conclude that µ1 ≠ µ2.
(c) If α = 0.01, then from the t-tables, $t_{\alpha/2} = t_{0.005}$ with 8 (= 5 + 5 − 2)
degrees of freedom is 3.3554. Thus, the critical region is t < −3.3554 or
t > 3.3554. Since t = 4.6087 > 3.3554, we reject H0 and conclude that µ1 ≠ µ2.
(d) If α = 0.001, then from the t-tables, $t_{\alpha/2} = t_{0.0005}$ with 8 (= 5 + 5 − 2)
degrees of freedom is 5.0413. Thus, the critical region is t < −5.0413 or
t > 5.0413. Since t = 4.6087 is neither less than −5.0413 nor greater than
5.0413, we are not able to reject H0 at this level.
There is very strong evidence that µ1 and µ2 differ: H0 is rejected at the 0.10, 0.05
and 0.01 levels of significance, although not at the 0.001 level.
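The four decisions in Example 2.5 can be tabulated with a short Python sketch that starts from the raw yields. The variable names are illustrative assumptions, and the critical values are the t-table values for 8 degrees of freedom quoted in the example, not values computed by the code.

```python
from math import sqrt
from statistics import mean, variance

xa_100 = [801, 814, 784, 836, 820]
zb_200 = [752, 718, 776, 742, 763]

n1, n2 = len(xa_100), len(zb_200)
s1_sq, s2_sq = variance(xa_100), variance(zb_200)  # sample variances (n - 1 divisor)

# Pooled standard deviation (2.4) and test statistic (2.3) with delta = 0
sp = sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
t = (mean(xa_100) - mean(zb_200)) / (sp * sqrt(1 / n1 + 1 / n2))

# Two-sided critical values t_{alpha/2} with 8 degrees of freedom, from t-tables
critical_values = {0.10: 1.8595, 0.05: 2.3060, 0.01: 3.3554, 0.001: 5.0413}
decisions = {alpha: abs(t) > crit for alpha, crit in critical_values.items()}

print(f"t = {t:.4f}")
for alpha, reject in decisions.items():
    print(f"alpha = {alpha}: {'reject' if reject else 'do not reject'} H0")
```

The smaller the value of α at which H0 is still rejected, the stronger the evidence against H0.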
In this session, we shall learn how to perform tests concerning means of two small
( n 1< 30 and n 2 < 30 ) independent samples drawn from populations with unknown
standard deviations.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests concerning two population
means; and
2. conduct tests concerning means of two small and independent samples drawn
from populations with unknown variances which are assumed unequal.
Now read on …
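Suppose that we have two independent random samples with means x̄1 and x̄2, and
respective small sample sizes n1 < 30 and n2 < 30, from populations with means µ1
and µ2 and unknown variances σ1² and σ2².
Suppose that σ1² and σ2² are unknown and are assumed to be different from each other.
Then the appropriate test statistic for such tests is given by

$$ t^{*} = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{2.5} $$

which has approximately the t-distribution with v degrees of freedom, where

$$ v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}{\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}} \tag{2.6} $$

If v is not a whole number, then we have to round down to the nearest whole number.
Note that the critical regions remain the same as in Table 1.4 in Session 3 of Unit 1.
This is because the test statistic still follows the t-distribution.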
Example 2.6
Rework Example 2.3 on the assumption that the population variances σ 12 and σ 22 are
different.
Solution
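Assuming that the population variances are not equal and substituting n1 = 16,
x̄1 = 23.4, s1 = 3.5, n2 = 10, x̄2 = 18.2, s2 = 4.8 and δ = 0 into Equation (2.5), we
obtain

$$ t^{*} = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{23.4 - 18.2 - 0}{\sqrt{\dfrac{(3.5)^2}{16} + \dfrac{(4.8)^2}{10}}} = 2.9680 $$

Substituting the same values into Equation (2.6) gives the degrees of freedom:

$$ v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}{\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}} = \frac{\left(\dfrac{(3.5)^2}{16} + \dfrac{(4.8)^2}{10}\right)^{2}}{\dfrac{\left((3.5)^2/16\right)^{2}}{16 - 1} + \dfrac{\left((4.8)^2/10\right)^{2}}{10 - 1}} = 14.98 , $$

which rounds down to 14.
From the t-tables, $t_{0.10}$ with 14 degrees of freedom is 1.345. Thus, the critical region is
t* > 1.345. Since t* = 2.9680 > 1.345, we reject H0 and conclude that µ1 > µ2.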
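The unequal-variance statistic and its degrees of freedom involve enough arithmetic that a check is worthwhile. The sketch below is an illustration only: the function name `welch_t` is an assumption, and it returns the value of v before rounding so that it can be compared with the hand calculation.

```python
from math import floor, sqrt


def welch_t(mean1, mean2, s1, s2, n1, n2, delta=0.0):
    """Return (t*, v): the statistic of Equation (2.5) and the
    degrees of freedom of Equation (2.6) before rounding."""
    a = s1**2 / n1
    b = s2**2 / n2
    t_star = (mean1 - mean2 - delta) / sqrt(a + b)
    v = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return t_star, v


# Data of Example 2.6 (the same samples as in Example 2.3)
t_star, v = welch_t(23.4, 18.2, 3.5, 4.8, 16, 10)
print(f"t* = {t_star:.4f}, v = {v:.2f}, rounded down: {floor(v)}")
```

The same function can be applied to the data of Example 2.4 by calling `welch_t(546, 492, 31, 26, 4, 4)`.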
Example 2.7
Rework Example 2.4 on the assumption that the population variances σ 12 and σ 22 are
different.
Solution
Assuming that the population variances are not equal and substituting n1 = 4, x̄1 = 546,
s1 = 31, n2 = 4, x̄2 = 492, s2 = 26 and δ = 0 into Equation (2.5), we obtain

$$ t^{*} = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{546 - 492 - 0}{\sqrt{\dfrac{(31)^2}{4} + \dfrac{(26)^2}{4}}} = 2.669 $$

Now we need to substitute n1 = 4, s1 = 31, n2 = 4, s2 = 26 into Equation (2.6) to
obtain the degrees of freedom:

$$ v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}{\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}} = \frac{\left(\dfrac{(31)^2}{4} + \dfrac{(26)^2}{4}\right)^{2}}{\dfrac{\left((31)^2/4\right)^{2}}{4 - 1} + \dfrac{\left((26)^2/4\right)^{2}}{4 - 1}} = 5.8235 , $$

which rounds down to 5.
From the t-tables, $t_{0.05}$ with 5 degrees of freedom is 2.015. Thus, the critical region is
t* > 2.015. Since t* = 2.669 > 2.015, we reject H0 and conclude that µ1 > µ2.
Example 2.8
Rework Example 2.5 on the assumption that the population variances σ 12 and σ 22 are
different.
Solution
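Assuming that the population variances are not equal and substituting n1 = 5, x̄1 = 811,
s1² = 386, n2 = 5, x̄2 = 750.2, s2² = 484.2 and δ = 0 into Equation (2.5), we obtain

$$ t^{*} = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{811 - 750.2 - 0}{\sqrt{\dfrac{386}{5} + \dfrac{484.2}{5}}} = 4.609 $$

Now we need to substitute n1 = 5, s1² = 386, n2 = 5, s2² = 484.2 into Equation (2.6) to
obtain the degrees of freedom:

$$ v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}{\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}} = \frac{\left(\dfrac{386}{5} + \dfrac{484.2}{5}\right)^{2}}{\dfrac{\left(386/5\right)^{2}}{5 - 1} + \dfrac{\left(484.2/5\right)^{2}}{5 - 1}} = 7.8994 , $$

which rounds down to 7.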
(a) If α = 0.10, then from the t-tables, $t_{\alpha/2} = t_{0.05}$ with 7 degrees of
freedom is 1.8946. Thus, the critical region is t* < −1.8946 or t* > 1.8946. Since
t* = 4.609 > 1.8946, we reject H0 and conclude that µ1 ≠ µ2.
(b) If α = 0.05, then from the t-tables, $t_{\alpha/2} = t_{0.025}$ with 7 degrees of
freedom is 2.3646. Thus, the critical region is t* < −2.3646 or t* > 2.3646.
Since t* = 4.609 > 2.3646, we reject H0 and conclude that µ1 ≠ µ2.
(c) If α = 0.01, then from the t-tables, $t_{\alpha/2} = t_{0.005}$ with 7 degrees of
freedom is 3.4995. Thus, the critical region is t* < −3.4995 or t* > 3.4995.
Since t* = 4.609 > 3.4995, we reject H0 and conclude that µ1 ≠ µ2.
(d) If α = 0.001, then from the t-tables, $t_{\alpha/2} = t_{0.0005}$ with 7 degrees of
freedom is 5.4079. Thus, the critical region is t* < −5.4079 or t* > 5.4079.
Since t* = 4.609 is neither less than −5.4079 nor greater than 5.4079, we are not
able to reject H0 at this level.
There is very strong evidence that µ1 and µ2 differ: H0 is rejected at the 0.10, 0.05
and 0.01 levels of significance, although not at the 0.001 level.
In this session, we shall learn how to conduct tests concerning the means of two
samples that are not independent of each other.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests concerning the means of
two samples that are not independent of each other; and
2. conduct tests concerning the means of two samples that are not independent of
each other.
Now read on …
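Suppose that the original problem was to conduct any of the tests in
Table 2.1 (a).
Table 2.1 (a): Test of H0: µ1 − µ2 = δ against various alternatives
       (a)                   (b)                   (c)
  H0: µ1 − µ2 = δ       H0: µ1 − µ2 = δ       H0: µ1 − µ2 = δ
  H1: µ1 − µ2 < δ       H1: µ1 − µ2 > δ       H1: µ1 − µ2 ≠ δ
When the two samples are paired, these tests can be restated in terms of the mean
difference µd, as in Table 2.1 (b).
Table 2.1 (b): Test of H0: µd = δ against various alternatives
       (a)                   (b)                   (c)
  H0: µd = δ            H0: µd = δ            H0: µd = δ
  H1: µd < δ            H1: µd > δ            H1: µd ≠ δ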
Let µd be the mean of the normally distributed population of paired differences, and let
d̄ and s_d be the mean and standard deviation of a sample of n paired differences that
have been selected randomly from the population. Then the appropriate test statistic for
conducting any one of the tests in Table 2.1 (b) is given by

$$ t = \frac{\bar{d} - \delta}{s_d / \sqrt{n}} \tag{2.7} $$

which has the t-distribution with n − 1 degrees of freedom.
As in all previous cases where the test statistic follows the t-distribution, the critical
regions are as in Table 1.4 in Session 3 of Unit 1.
Example 2.9
The data below are the weights before and after ten boxers were fed with a
weight-reducing diet:
  i             1    2    3    4    5    6    7    8    9   10
  Before, xi   69   50   61   72   78   66   75   89   86   54
  After, yi    66   49   63   70   71   65   75   88   87   51
Solution
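By calculating the differences di = yi − xi, we obtain
  i             1    2    3    4    5    6    7    8    9   10
  xi           69   50   61   72   78   66   75   89   86   54
  yi           66   49   63   70   71   65   75   88   87   51
  yi − xi      −3   −1    2   −2   −7   −1    0   −1    1   −3
The differences have mean d̄ = −1.5 and standard deviation s_d = 2.506, so that
Equation (2.7) with δ = 0 gives

$$ t = \frac{-1.5 - 0}{2.506 / \sqrt{10}} = -1.893 . $$

From t-tables, $t_{0.05}$ with 9 degrees of freedom is 1.833. Thus, the critical region is
t < −1.833. Since t = −1.893 < −1.833, we reject H0 and conclude that µd < 0, that is,
µ2 < µ1.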
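The paired-difference calculation of Example 2.9 can be sketched in Python as follows. The variable names are illustrative assumptions, and the critical value is the t-table value quoted in the example.

```python
from math import sqrt
from statistics import mean, stdev

before = [69, 50, 61, 72, 78, 66, 75, 89, 86, 54]
after = [66, 49, 63, 70, 71, 65, 75, 88, 87, 51]

# Paired differences d_i = y_i - x_i (after minus before)
d = [y - x for x, y in zip(before, after)]

n = len(d)
d_bar = mean(d)
s_d = stdev(d)  # sample standard deviation (n - 1 divisor)
delta = 0
t = (d_bar - delta) / (s_d / sqrt(n))  # Equation (2.7)

T_CRIT = 1.833  # t_{0.05} with 9 degrees of freedom, from t-tables
print(f"d_bar = {d_bar}, s_d = {s_d:.4f}, t = {t:.4f}")
print("reject H0:", t < -T_CRIT)
```

Pairing matters here: the test uses only the ten differences, so the large boxer-to-boxer variation in weight does not enter the standard error.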
Example 2.10
The management of the Daily Guide Newspaper knows that there are substantial
differences in the abilities of its machine operators. Therefore it decided to compare its
machines using the paired difference approach. Suppose that eight randomly selected
machine operators produce papers for one hour using machine 1 and for one hour using
machine 2, with the following results:
                    Machine Operator
               1    2    3    4    5    6    7    8
  Machine 1   53   60   58   48   46   54   62   49
  Machine 2   50   55   56   44   45   50   57   47
Assuming normality, perform a hypothesis test to determine whether or not there is a
difference between the mean hourly outputs of the two machines. Use α = 0.05.
Solution
We wish to test H0: µd = 0 against H1: µd ≠ 0 at α = 0.05. By calculating the differences
between Machines 1 and 2, we obtain
                    Machine Operator
               1    2    3    4    5    6    7    8
  M1          53   60   58   48   46   54   62   49
  M2          50   55   56   44   45   50   57   47
  M1 − M2      3    5    2    4    1    4    5    2
From t-tables, $t_{0.025}$ with 7 degrees of freedom is 2.365. Thus, the critical region is
t < −2.365 or t > 2.365. Since t = 6.17 > 2.365, we reject H0 and conclude that
µd ≠ 0 at α = 0.05.
Example 2.11
Lactation promotes a temporary loss of bone mass to provide adequate amounts of
calcium for milk production. An experiment resulted in the following data on total body
bone mineral content for a sample both during lactation (L) and in post-weaning period
(P).
                                Subject
        1     2     3     4     5     6     7     8     9    10
  L  1928  2549  2825  1924  1628  2175  2114  2621  1843  2541
  P  2126  2885  2895  1942  1750  2184  2164  2626  2006  2627
Does the data suggest that true average total body bone mineral content during post-
weaning exceeds that during lactation by more than 25? State and test the appropriate
hypotheses using the 0.05 level of significance.
Solution
We wish to test H 0 : µ d = 25 against H1 : µ d > 25 at α = 0.05 . By calculating the
differences between P and L, D = P − L , we obtain
                                Subject
        1     2     3     4     5     6     7     8     9    10
  L  1928  2549  2825  1924  1628  2175  2114  2621  1843  2541
  P  2126  2885  2895  1942  1750  2184  2164  2626  2006  2627
  D   198   336    70    18   122     9    50     5   163    86
From t-tables, t0.05 with 9 degrees of freedom is 1.833. Thus, the critical region is
t > 1.833 . Since t = 2.46 > 1.833 , we reject H 0 and conclude that µ d > 25 at
α = 0.05 . That is, the data suggests that the true average total body bone mineral
content during post weaning exceeds that during lactation by more than 25.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses concerning tests of two population
proportions;
2. conduct tests concerning proportions of two large samples that are independent of
each other.
Now read on …
Tests concerning two different population proportions, p1 and p2 , are based on two
different random samples of sizes n1 and n2 , from populations 1 and 2, respectively.
Suppose that we select a random sample of size n1 from a population, and denote the
proportion of “successes” (i.e. sample units that fall into a certain category of interest)
by p̂1. That is,

$$ \hat{p}_1 = \frac{x_1}{n_1} , $$

where x1 is the number of successes. Again, suppose that we select a random sample of
size n2 from a second and different population, and denote the proportion of successes
by

$$ \hat{p}_2 = \frac{x_2}{n_2} . $$

If each of the sample sizes n1 and n2 is large, and if the samples are independent of
each other, then we can compare p1 and p2 by testing the null hypothesis
H0: p1 − p2 = δ against any one of the alternatives H1: p1 − p2 < δ, or
H1: p1 − p2 > δ, or H1: p1 − p2 ≠ δ using the test statistic

$$ z = \frac{\hat{p}_1 - \hat{p}_2 - \delta}{\sigma_{\hat{p}_1 - \hat{p}_2}} \tag{2.8} $$

If δ = 0, then $\sigma_{\hat{p}_1 - \hat{p}_2}$ is estimated by

$$ s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{2.9} $$

where p̂, called the combined sample proportion, is given by

$$ \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} . $$

If δ ≠ 0, then $\sigma_{\hat{p}_1 - \hat{p}_2}$ is estimated by

$$ s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1 - 1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2 - 1}} \tag{2.10} $$

Note that the critical regions for the various tests will be the same as those in Table 1.3
in Unit 1 Session 2.
Example 2.12
If x1 = 18, x2 = 15, n1 = 35 and n2 = 42, test the null hypothesis
H0: p1 − p2 = 0
against
H1: p1 − p2 > 0
at 5% significance level.
Solution
We note that δ = 0 and estimate $\sigma_{\hat{p}_1 - \hat{p}_2}$ by Equation (2.9). We first find the combined
sample proportion as follows:

$$ \hat{p} = \frac{18 + 15}{35 + 42} = 0.4286 . $$

Therefore,

$$ s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{(0.4286)(0.5714)\left(\frac{1}{35} + \frac{1}{42}\right)} = 0.1133 $$
64 CoDEUCC/Post-Diploma in Mathematics and Science Education
TESTS OF HYPOTHESES ON TWO POPULATIONS UNIT 2
SESSION 5
Substituting p̂1 = 18/35 = 0.5143, p̂2 = 15/42 = 0.3571, δ = 0 and
$s_{\hat{p}_1 - \hat{p}_2} = 0.1133$ into Equation (2.8), we obtain

$$ z = \frac{\hat{p}_1 - \hat{p}_2 - \delta}{s_{\hat{p}_1 - \hat{p}_2}} = \frac{0.5143 - 0.3571 - 0}{0.1133} = 1.3875 $$

From the z tables, $z_{0.05} = 1.645$; therefore the critical region is z > 1.645. Since
z = 1.3875 < 1.645, we fail to reject H0. The data therefore do not provide sufficient
evidence that p1 > p2.
Example 2.13
Refer to Example 2.12 and test the null hypothesis
H 0 : p1 − p2 = −0.15
against
H1 : p1 − p2 > −0.15
Solution
We note that δ ≠ 0 and estimate $\sigma_{\hat{p}_1 - \hat{p}_2}$ by Equation (2.10).
We have n1 = 35, n2 = 42, p̂1 = 18/35 = 0.5143 and p̂2 = 15/42 = 0.3571.
Therefore,

$$ s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1 - 1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2 - 1}} = \sqrt{\frac{(0.5143)(0.4857)}{35 - 1} + \frac{(0.3571)(0.6429)}{42 - 1}} = 0.1138 $$

Substituting p̂1 = 0.5143, p̂2 = 0.3571, δ = −0.15 and $s_{\hat{p}_1 - \hat{p}_2} = 0.1138$ into
Equation (2.8), we obtain

$$ z = \frac{\hat{p}_1 - \hat{p}_2 - \delta}{s_{\hat{p}_1 - \hat{p}_2}} = \frac{0.5143 - 0.3571 - (-0.15)}{0.1138} = 2.6995 $$
From the z tables, $z_{0.05} = 1.645$; therefore the critical region is z > 1.645. Since
z = 2.6995 > 1.645, we reject H0. We therefore conclude that p1 − p2 > −0.15. That is,
p1 does not fall below p2 by as much as 0.15.
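Examples 2.12 and 2.13 differ only in which standard error is used, and a single function makes the distinction explicit. This is a minimal sketch: the name `two_proportion_z` is an assumption, and the choice between Equations (2.9) and (2.10) follows the rule stated in this session.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2, delta=0.0):
    """z statistic of Equation (2.8).

    Uses the pooled standard error (2.9) when delta == 0 and the
    unpooled standard error (2.10), with n - 1 divisors, otherwise.
    """
    p1, p2 = x1 / n1, x2 / n2
    if delta == 0:
        p = (x1 + x2) / (n1 + n2)  # combined sample proportion
        se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    else:
        se = sqrt(p1 * (1 - p1) / (n1 - 1) + p2 * (1 - p2) / (n2 - 1))
    return (p1 - p2 - delta) / se


z_equal = two_proportion_z(18, 35, 15, 42)               # Example 2.12
z_shift = two_proportion_z(18, 35, 15, 42, delta=-0.15)  # Example 2.13

print(f"Example 2.12: z = {z_equal:.4f}")
print(f"Example 2.13: z = {z_shift:.4f}")
```

The sketch works with unrounded proportions, so its output may differ from the hand calculations in the last decimal place.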
Rather than make assumptions about (equality or otherwise) of the unknown population
variances σ12 and σ 22 , we can perform a test to ascertain them. In this session, you will
learn about a method for comparing the variances of two different populations.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses concerning tests of two population
variances;
2. conduct tests concerning variances of two samples that are independent of each
other.
Now read on …
Suppose that we have two independent random samples of sizes n1 and n2, taken from
different normal populations with variances σ1² and σ2². Then we can compare σ1² and
σ2² by testing the null hypothesis H0: σ1² = σ2² against any of the alternatives
H1: σ1² < σ2², or H1: σ1² > σ2², or H1: σ1² ≠ σ2². The appropriate test statistic is given
by

$$ F = \frac{s_1^2}{s_2^2} \tag{2.11} $$

where s1² and s2² are the sample variances. The F-statistic is the value of a random
variable having the F-distribution with n1 − 1 (numerator degrees of freedom) and
n2 − 1 (denominator degrees of freedom). Values of F for a given numerator and
denominator degrees of freedom can be read from F-distribution tables (see Table V
under Statistical Tables at the end of the module).
Example 2.14
Suppose that observations from two independent random samples from two normal
populations yielded the following results: n1 = 11, s1² = 18.4, n2 = 16 and s2² = 13.5.
Test the null hypothesis H0: σ1² = σ2² against H1: σ1² ≠ σ2² at the 10% significance
level.
Solution
Substituting s1² = 18.4 and s2² = 13.5 into Equation (2.11), we obtain

$$ F = \frac{s_1^2}{s_2^2} = \frac{18.4}{13.5} = 1.363 . $$

From the F tables, $F_{0.05}(10, 15) = 2.54$. Using the relation

$$ F_{1-\alpha}(n_1 - 1,\, n_2 - 1) = \frac{1}{F_{\alpha}(n_2 - 1,\, n_1 - 1)} , $$

we have

$$ F_{0.95}(10, 15) = \frac{1}{F_{0.05}(15, 10)} = \frac{1}{2.85} = 0.35 . $$

Therefore, the critical region is F < 0.35 or F > 2.54. Since F = 1.36 is neither less
than 0.35 nor greater than 2.54, we cannot reject the null hypothesis. We therefore
conclude that there is no evidence that σ1² and σ2² differ.
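The two-sided F test of Example 2.14 can be sketched in Python as follows. The function name `variance_ratio` is an assumption, and the two upper 5% points are the F-table values quoted in the example, entered by hand rather than computed.

```python
def variance_ratio(s1_sq, s2_sq):
    """F statistic of Equation (2.11)."""
    return s1_sq / s2_sq


# Data of Example 2.14
f_stat = variance_ratio(18.4, 13.5)

# Upper 5% points read from F tables (table values, not computed here)
F_05_10_15 = 2.54  # F_{0.05}(10, 15)
F_05_15_10 = 2.85  # F_{0.05}(15, 10)

upper = F_05_10_15
lower = 1 / F_05_15_10  # F_{0.95}(10, 15) by the reciprocal relation

reject_h0 = f_stat < lower or f_stat > upper
print(f"F = {f_stat:.3f}, critical region: F < {lower:.2f} or F > {upper:.2f}")
print("reject H0:", reject_h0)
```

Note that the lower critical value needs the table entry with the degrees of freedom interchanged, which is why two separate table look-ups appear.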
Example 2.15
The result from a certain experiment reported the following data on tensile strength
(psi) of liner specimens both when a certain fusion process was used and when this
process was not used.
No fusion 2748 2700 2655 2822 2511
3148 3257 3213 3220 2753
n1 = 10 x1 = 2902.8 s1 = 277.3
Does the data suggest that the standard deviation of the strength of distribution for fused
specimens is smaller than that for not fused specimens? Carry out a test at the 0.01 level
of significance.
Solution
Let σ 1 represent the standard deviation of the strength of distribution for the not fused
specimens, σ 2 represent the standard deviation of the strength of distribution for the
fused specimens.
$$ F = \frac{s_1^2}{s_2^2} = \frac{(277.3)^2}{(205.9)^2} = 1.814 . $$
Therefore, the critical region is F < 0.178 . Since F = 1.814 is greater than 0.178, we
cannot reject the null hypothesis. The data does not suggest that the standard deviation
of the strength of distribution for fused specimens is smaller than that for not fused
specimens.
Example 2.16
Refer to Example 2.4, test the null hypothesis H 0 : σ 1 − σ 2 = 0 against the alternative
hypothesis H1 : σ 1 − σ 2 > 0 at the 0.05 level of significance.
Solution
Substituting s1 = 31 and s2 = 26 into Equation (2.11), we obtain

$$ F = \frac{s_1^2}{s_2^2} = \frac{(31)^2}{(26)^2} = 1.422 . $$

From the F tables, $F_{0.05}(3, 3) = 9.28$. Therefore, the critical region is F > 9.28. Since
F = 1.422 is less than 9.28, we cannot reject the null hypothesis. There is therefore no
evidence that σ1 > σ2. That is, the data do not show that the standard deviation of the
first brand of paint is larger than that of the second.
1. In comparing the variability of the tensile strength of two kinds of structural steel,
an experiment yielded the following results: n1 = 13, s1² = 19.2, n2 = 16, and
s2² = 3.5, where the units of measurement are 1,000 pounds per square inch.
Assuming that the measurements constitute independent random samples from
two normal populations, test the null hypothesis H0: σ1² = σ2² against the
alternative H1: σ1² ≠ σ2² at the 0.02 level of significance.
2. To find out whether the inhabitants of two South Pacific islands may be regarded
as having the same racial ancestry, an anthropologist determines the cephalic
indices of six adult males from each island, getting x̄1 = 77.4, x̄2 = 72.2, and the
corresponding standard deviations s1 = 3.3 and s2 = 2.1. Test at the 0.10 level of
significance whether it is reasonable to assume that the two populations
have equal variances.
3. For a sample of 28 elderly men, the sample standard deviation of serum ferritin
(mg/L) was s1 = 52.6 ; for 26 young men, the sample standard deviation was
s2 = 84.2 . Does the data suggest that the ferritin distribution in the elderly had a
smaller variance than in the younger adults? Carry out a test at the 0.01 level of
significance.
UNIT OUTLINE:
So far, all the tests that you have learnt about in Units 1 and 2 are
tests for quantitative data. In this unit, we shall turn our attention to
some common tests that are considered appropriate for qualitative
data.
We shall learn about situations in which observations can be classified as falling into
exactly one of a number of mutually exclusive categories. We shall then be concerned
with the number of observations that fall in each of these categories, and investigate
whether we can reject or fail to reject hypotheses about these numbers.
UNIT OBJECTIVES
By the end of the unit, you should be able to:
1. formulate the null and alternative hypotheses for tests of qualitative data;
2. conduct tests based on whether a set of observations are drawn from specified
distributions;
3. conduct goodness-of-fit tests for homogeneity;
4. conduct goodness-of-fit tests for independence; and
5. calculate and interpret measures of strength of association between two
categorical variables.
In this session, you will learn about the multinomial distribution, which is an extension
of the binomial.
Objectives
By the end of the session, you should be able to:
1. state at least four properties of the multinomial experiment; and
2. determine whether or not a given distribution follows the multinomial distribution.
You learnt about the binomial distribution during your study for the Diploma degree.
Recall that a binomial experiment possesses the following properties: (1) the
experiment consists of n identical trials, (2) each trial of the experiment can result in
two possible outcomes which may be classified as a “success” or a “failure”, (3) the
probability of a success or failure is the same for each experimental trial, and (4) the n
experimental trials are independent of each other. Multinomial experiments have similar
properties, although each trial of a multinomial experiment can result in two or more
possible outcomes.
Example 3.1
Empirical studies have shown that the distribution of blood group in the population of a
certain human race is as follows:
Blood type Percentage
A 41
B 9
AB 4
O 46
Assuming that blood group is independent from person to person, this can be
regarded as a multinomial distribution with four possible outcomes, blood
groups A, B, AB and O, with probabilities p1 = 0.41, p2 = 0.09, p3 = 0.04 and
p4 = 0.46, respectively. Note that p1 + p2 + p3 + p4 = 1. Therefore, the given
distribution of blood groups follows a multinomial distribution.
Example 3.2
The table shows the market shares for different brands of televisions.
Brand of TV Market share
LG 20%
Samsung 30%
Panasonic 35%
Sony 15%
It is clear that the brands of television are independent of each other.
The TV brands LG, Samsung, Panasonic and Sony have distribution probabilities
p1 = 0.20, p 2 = 0.30, p3 = 0.35, and p 4 = 0.15, respectively. So we have
0.20 + 0.30 + 0.35 + 0.15 = 1 and therefore, the distribution of brands of television sets
follows a multinomial distribution.
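One of the conditions used in Examples 3.1 and 3.2 is that the outcome probabilities are valid and sum to 1. That condition can be checked mechanically, as in the sketch below; the function name `is_valid_probability_vector` is an assumption. The other properties of a multinomial experiment (identical, independent trials) concern how the data were collected and cannot be checked this way.

```python
from math import isclose


def is_valid_probability_vector(probabilities):
    """True when every p_i lies in [0, 1] and the p_i sum to 1."""
    values = list(probabilities)
    in_range = all(0 <= p <= 1 for p in values)
    return in_range and isclose(sum(values), 1.0, abs_tol=1e-9)


# Example 3.1: blood groups; Example 3.2: television brands
blood_groups = {"A": 0.41, "B": 0.09, "AB": 0.04, "O": 0.46}
tv_brands = {"LG": 0.20, "Samsung": 0.30, "Panasonic": 0.35, "Sony": 0.15}

print(is_valid_probability_vector(blood_groups.values()))
print(is_valid_probability_vector(tv_brands.values()))
```

A set of percentages that does not add up to 100 would fail this check, which is a quick way to catch transcription slips in a table of probabilities.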
Example 3.3
The table shows the results of a consumer preference survey.
A B Store Brand
61 53 36
Try to justify why this is a multinomial distribution.
Self-Assessment Questions
Exercise 3.1
1. A study of the political affiliation of students in Mangoase SHS is given in the
table below. Justify why this is a multinomial distribution.
2. Suppose that the distribution of marital status of women in a large city is given in
the table below. Explain why it is a multinomial distribution.
In this session, you will learn how to formulate and test hypotheses concerning outcome
probabilities of multinomial distributions.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses concerning the outcome probabilities
of a multinomial distribution; and
2. conduct tests concerning outcome probabilities of a multinomial distribution.
Refer to Example 3.3. Suppose that we wish to test the null hypothesis that there is no
preference for any of the three brands against the alternative hypothesis that a
preference exists for one or more of the brands. Then we can let
p1 = Proportion of all customers who preferred brand A
p 2 = Proportion of all customers who preferred brand B
p3 = Proportion of all customers who preferred the store brand
Similarly,
E (n1 ) = E (n2 ) = 50 (If no preference exists)
To measure the degree of disagreement between the data and the null hypothesis, we
use the statistic

$$ \chi^2 = \frac{[n_1 - E(n_1)]^2}{E(n_1)} + \frac{[n_2 - E(n_2)]^2}{E(n_2)} + \frac{[n_3 - E(n_3)]^2}{E(n_3)} = \frac{(n_1 - 50)^2}{50} + \frac{(n_2 - 50)^2}{50} + \frac{(n_3 - 50)^2}{50} $$
We shall now give the general form of a test of hypothesis concerning multinomial
probabilities. Suppose we wish to test the null hypothesis

$$ H_0 : p_1 = p_{10},\ p_2 = p_{20},\ \ldots,\ p_k = p_{k0} $$

against the alternative that at least one of the probabilities differs from its
hypothesized value. The test statistic is

$$ \chi^2 = \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)} \tag{3.1} $$

where $E(n_i) = n p_{i0}$ is the expected number of type i outcomes assuming that H0 is
true. The test statistic has an approximate chi-square distribution with (k − 1) degrees of
freedom. The total sample size is n and the rejection region is $\chi^2 \ge \chi^2_{\alpha}$.
Note that the approximation is good provided the sample size is large
enough that, for every cell, the expected cell frequency $E(n_i)$ is 5 or more.
TESTS ON CATEGORICAL DATA UNIT 3
SESSION 2
Example 3.4
The following is the criterion used by a certain firm to decide on annual pay rise of its
employees: employees who score an average above 80 in a series of evaluations will
receive a merit pay rise, those who score between 50 and 80 will receive the standard
pay rise, and those below 50 will receive no pay rise. The firm designed the plan with
the objective that, on the average, 25% of its employees would receive merit pay rise,
65% would receive standard pay rise, and 10% would receive no pay rise. The
distribution of pay rise for 600 employees after the evaluation is given below:
No pay rise Standard pay rise Merit pay rise
42 365 193
Test, at the 0.01 level of significance, whether the data indicate that the distribution of
pay rise differs from those established by the firm.
Solution
Let
p1 = Proportion of employees who receive no pay rise
p 2 = Proportion of employees who receive a standard pay rise
p3 = Proportion of employees who receive a merit pay rise
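Under the firm's plan, the null hypothesis is $H_0: p_1 = 0.10,\ p_2 = 0.65,\ p_3 = 0.25$, so
the expected frequencies for $n = 600$ employees are $600(0.10) = 60$, $600(0.65) = 390$ and
$600(0.25) = 150$. Thus,
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{3} \frac{[n_i - E(n_i)]^2}{E(n_i)} \\
       &= \frac{(42 - 60)^2}{60} + \frac{(365 - 390)^2}{390} + \frac{(193 - 150)^2}{150} \\
       &= 19.33
\end{aligned}
\]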
From the $\chi^2$ tables, the value of $\chi^2_{0.01}$ with $k - 1 = 2$ degrees of freedom is 9.210.
Therefore, the critical region is $\chi^2 \ge 9.210$. Since 19.33 is greater than 9.210, we
reject the null hypothesis and conclude that the data contradict the company's plan.
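The calculation in Example 3.4 can be checked with a short script. The following is a minimal sketch in Python; the function name `chi_square_gof` and the variable names are illustrative assumptions and do not come from the course text. The critical value 9.210 is the table value quoted in the solution.

```python
def chi_square_gof(observed, probs):
    """Return (statistic, expected) for a multinomial goodness-of-fit test.

    observed: list of observed cell counts n_i
    probs:    list of hypothesised cell probabilities p_i0 (summing to 1)
    """
    n = sum(observed)
    expected = [n * p for p in probs]          # E(n_i) = n * p_i0
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Example 3.4: no pay rise, standard pay rise, merit pay rise
observed = [42, 365, 193]
planned = [0.10, 0.65, 0.25]                   # proportions set by the firm

stat, expected = chi_square_gof(observed, planned)
critical = 9.210                               # table value, alpha = 0.01, df = 2
reject = stat >= critical

print("chi-square statistic:", round(stat, 2))
print("reject H0 at the 0.01 level:", reject)
```

The same function can be reused for any test of specified multinomial probabilities, provided every expected frequency is at least 5.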
Example 3.5
The headteacher of a primary school is interested in knowing whether there exist colour
preferences among the pupils of his school. A sample of 100 pupils was drawn from the
school and shown identically shaped objects, coloured red, blue, yellow, green or pink.
When each child was asked to pick the most preferred colour, 30 picked red, 18 blue, 12
yellow, 20 green and 20 pink. Test, at 5% significance level, the hypothesis:
H 0 : there does not exist colour preferences among the pupils
against
H 1 : colour preference does exist.
Solution
If there are no preferences, then the probability of choosing any colour is the same.
Since there are five colours, the probability $p_i$ of choosing any colour is
\[
p_i = \frac{1}{5} = 0.2 \qquad (i = 1, 2, \ldots, 5).
\]
H 0 : p1 = p 2 = p3 = p 4 = p5 = 0.2
against
H 1 : at least one of the pi s ≠ 0.2
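Thus,
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{5} \frac{[n_i - E(n_i)]^2}{E(n_i)} \\
       &= \frac{10^2}{20} + \frac{(-2)^2}{20} + \frac{(-8)^2}{20} + \frac{0^2}{20} + \frac{0^2}{20} \\
       &= 5 + 0.2 + 3.2 \\
       &= 8.4
\end{aligned}
\]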
At the 5% significance level, the chi-square tables give $\chi^2_{0.05} = 9.49$ at $df = 5 - 1 = 4$.
Therefore, the critical region is $\chi^2 \ge 9.49$. Since $\chi^2 = 8.4 < \chi^2_{0.05}(4) = 9.49$, we
cannot reject the null hypothesis. That is, we do not have sufficient evidence against the
null hypothesis.
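The table comparison in Example 3.5 can also be viewed through the upper-tail probability of the observed statistic. The sketch below is an optional check, not part of the course method: it relies on the fact that, when the degrees of freedom are even, the chi-square upper-tail probability has the closed form $e^{-x/2}\sum_{j=0}^{df/2-1}(x/2)^j/j!$. The function name `chi_square_sf_even_df` is an assumption made for illustration.

```python
import math


def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(chi-square with df d.f. >= x).

    Uses a closed form that is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


# Example 3.5: observed statistic 8.4 with df = 4
p_value = chi_square_sf_even_df(8.4, 4)
print("upper-tail probability:", round(p_value, 3))
print("reject H0 at the 5% level:", p_value <= 0.05)
```

An upper-tail probability above 0.05 corresponds to a statistic below the table value 9.49, so this check leads to the same decision as the solution above.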
Self-Assessment Questions
Exercise 3.2
1. Assume that a die is thrown 60 times and a record is kept of the number of times a
1, 2, 3, 4, 5 or 6 is observed.
Face 1 2 3 4 5 6
Number 13 10 8 10 12 7
SESSION 3
Objectives
By the end of the session, you should be able to:
1. formulate null and alternative hypotheses for carrying out goodness-of-fit tests of
a set of data to a Poisson, or a binomial, or a normal distribution;
2. conduct goodness-of-fit tests of a set of data to a Poisson
distribution;
3. conduct goodness-of-fit tests of a set of data to a binomial distribution; and
4. conduct goodness-of-fit tests of a set of data to a normal distribution.
The goodness-of-fit test can be applied to test a sample data set as coming from a
population having a Poisson, or binomial, or normal distribution. Unlike previous
statistical tests, however, the hypothesis of interest is the null hypothesis and not the
alternative. The test statistic for such tests is given by
\[
\chi^2 = \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)}, \tag{3.2}
\]
where k is the number of classes; ni the number of observations that fall into class i;
and E (ni ) the expected number of observations that fall into class i. The expected
number of observations that fall into class i, E (ni ) , is given by E (ni ) = n pi , where pi
is the probability of an observation falling into class i and n is the total sample size.
The test statistic in Equation (3.2) is approximately chi-square distributed with degrees
of freedom given by (k − m − 1) , where m is the number of independent parameters that
have to be estimated from the sample. The chi-square approximation is particularly
The goodness-of-fit test is constructed in such a way that we will reject the null
hypothesis, at a given significance level $\alpha$, if the observed value of the test statistic is
larger than or equal to the corresponding value, $\chi^2_{\alpha}$, from chi-square tables. That is, we
will reject $H_0$ if $\chi^2 \ge \chi^2_{\alpha}$.
Examples 3.6, 3.7 and 3.8 illustrate how to perform the calculations involved and test
the fit of a set of data to a Poisson, a binomial and a normal distribution, respectively.
Example 3.6
The weekly number of power failures reported in a certain district in 50 weeks is
recorded as follows:
Number of failures Number of weeks
0 6
1 8
2 13
3 11
4 7
5 4
6 1
Determine whether the weekly number of power failures in the district follows a
Poisson distribution at the 5% significance level.
Solution
We wish to test the hypotheses:
H 0 : the weekly number of power failures follows a Poisson
distribution.
against
H 1 : the weekly number of power failures does not follow a
Poisson distribution.
The Poisson probability of $x$ failures in a week is $p_x = \lambda^x e^{-\lambda} / x!$, where $\lambda$ is the
mean of the distribution. In this example, we have to estimate the value of $\lambda$ (its value
is not given in the problem) by calculating the mean of the given sample data.
Calculating the mean, we have
x f fx
0 6 0
1 8 8
2 13 26
3 11 33
4 7 28
5 4 20
6 1 6
50 121
Therefore, the mean $\bar{x}$ is given by
\[
\bar{x} = \frac{\sum_{i=1}^{7} f_i x_i}{\sum_{i=1}^{7} f_i} = \frac{121}{50} = 2.42 \approx 2.4
\]
Thus, we can calculate the various probabilities as follows.
\[
p_0 = \frac{(2.4)^0 e^{-2.4}}{0!} = 0.091, \qquad
p_1 = \frac{(2.4)^1 e^{-2.4}}{1!} = 0.218
\]
The rest of the calculation is summarized below.
Number of    Number of    Poisson          Expected
failures     weeks        probabilities    frequencies
i            n_i          p_i              E(n_i) = 50 p_i
0            6            0.091            4.55
1            8            0.218            10.90
2            13           0.261            13.05
3            11           0.209            10.45
4            7            0.125            6.25
5            4            0.060            3.00
6            1            0.024            1.20
Note that the values of $p_i$ for $i = 0, 1, 2, \ldots, 6$ can also be obtained from Poisson
probability tables. For example, the value of $p_0$ can be read off as 0.091 where the
row with $\lambda = 2.4$ intersects the column $x = 0$ in Table VI under Statistical Tables at
the end of the module.
In the table above, we see that two out of seven expected frequencies
(i.e. approximately 29%) are less than 5. To satisfy the requirement that no more than
20% of the expected frequencies are less than 5, we will follow the common practice of
merging adjacent classes. In this case, it is the last three classes that we merge into one
to obtain five classes as shown.
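Thus, we obtain
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)} \\
       &= \frac{(6 - 4.55)^2}{4.55} + \frac{(8 - 10.90)^2}{10.90} + \cdots + \frac{(12 - 10.45)^2}{10.45} \\
       &= 0.4621 + 0.7716 + \cdots + 0.0289 + 0.2299 \\
       &= 1.49
\end{aligned}
\]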
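The steps of Example 3.6 (estimating the mean, computing Poisson probabilities, merging the last three classes and evaluating the statistic) can be sketched in Python as follows. This is an illustration under stated assumptions: like the table above, it uses $P(X = 6)$ for the last class; the degrees of freedom follow the $(k - m - 1)$ rule with $m = 1$ estimated parameter; and the critical value 7.815 is the chi-square table value for $\alpha = 0.05$ and 3 degrees of freedom. All names in the code are hypothetical.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability of exactly x events when the mean is lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


weeks = [6, 8, 13, 11, 7, 4, 1]            # weeks with 0, 1, ..., 6 failures
n = sum(weeks)                              # 50 weeks in total

# Sample mean 121/50 = 2.42, rounded to 2.4 as in the text
lam = round(sum(x * f for x, f in enumerate(weeks)) / n, 1)

expected = [n * poisson_pmf(x, lam) for x in range(len(weeks))]

# Merge the last three classes (4, 5 and 6 failures) into one
obs_merged = weeks[:4] + [sum(weeks[4:])]
exp_merged = expected[:4] + [sum(expected[4:])]

stat = sum((o - e) ** 2 / e for o, e in zip(obs_merged, exp_merged))
df = len(obs_merged) - 1 - 1                # k - m - 1 with m = 1 (lambda estimated)
critical = 7.815                            # table value, alpha = 0.05, df = 3

print("statistic:", round(stat, 2), "df:", df)
print("reject H0 at the 5% level:", stat >= critical)
```

Because the script keeps the probabilities unrounded, the statistic may differ from the hand calculation in the third decimal place. Under these assumptions the statistic falls below the table value, which would mean that the Poisson hypothesis is not rejected.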
Example 3.7
Four identical six-sided dice, each with faces marked 1 to 6, are rolled 200 times. At
each rolling, a record is made of the number of dice whose uppermost face shows an
even score. The results are as follows.
Test, at 5% level of significance, that the number of even faces follows a binomial
distribution with n = 4 and p = 0.5 .
Solution
We wish to test
H 0 : Number of even scores is ~ B (4, 0.5)
against
H 1 : Number of even scores is not ~ B (4, 0.5)
We have
\[
p(x) = B(n, p) = {}^{n}C_{x}\, p^{x} (1 - p)^{n - x}
\]
Thus,
\[
p(0) = B(4, 0.5) = {}^{4}C_{0}\, (0.5)^{0} (0.5)^{4} = 0.0625
\]
We can now calculate the expected cell frequencies and summarize them in a table as
shown.
x ni p ( xi ) E ( ni )
0 10 0.0625 12.50
1 41 0.2500 50.00
2 70 0.3750 75.00
3 57 0.2500 50.00
4 22 0.0625 12.50
Note that it is possible to read off the values of binomial probabilities from
binomial distribution tables. See, for example, Table III under Statistical Tables
at the end of the module.
The requirement for the use of the χ 2 distribution is satisfied since no expected
frequency is less than 5. We now evaluate the value of the test statistic as shown
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)} \\
       &= \frac{(10 - 12.50)^2}{12.50} + \cdots + \frac{(22 - 12.50)^2}{12.50} \\
       &= 0.500 + 1.620 + \cdots + 0.980 + 7.220 \\
       &= 10.653
\end{aligned}
\]
The degrees of freedom are calculated as $df = k - 1 = 5 - 1 = 4$, since no parameter is
estimated from the sample.
Since $\chi^2 = 10.653$ is greater than $\chi^2_{0.05}(4) = 9.488$, we reject $H_0$ and conclude that the
number of even scores does not follow a $B(4, 0.5)$ distribution.
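The expected frequencies and the individual terms of the statistic in Example 3.7 can be reproduced with the sketch below. It assumes Python 3.8 or later (for `math.comb`); the variable names are illustrative only.

```python
import math

observed = [10, 41, 70, 57, 22]     # rolls showing 0, 1, 2, 3, 4 even scores
n_trials, p = 4, 0.5                # hypothesised B(4, 0.5)
total = sum(observed)               # 200 rolls

# Binomial probabilities p(x) = nCx * p^x * (1 - p)^(n - x)
probs = [math.comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)
         for x in range(n_trials + 1)]
expected = [total * q for q in probs]

# Contribution of each class to the chi-square statistic
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
stat = sum(contributions)

for x, (o, e, c) in enumerate(zip(observed, expected, contributions)):
    print(x, o, e, round(c, 3))
print("statistic:", round(stat, 3))
```

Printing the contributions class by class shows that the last class (all four dice even) accounts for most of the statistic.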
Example 3.8
The following is the distribution of the readings obtained with a Geiger counter of the
number of particles emitted by a radioactive substance in 100 successive 40-second
intervals:
Number of particles Frequency
5– 9 1
10 – 14 10
15 – 19 37
20 – 24 36
25 – 29 13
30 – 34 2
35 – 39 1
Solution
(a) To answer this question, we need to calculate $\sum f x$, taking the class midpoints as
the values of $x$. Therefore,
\[
\bar{x} = \frac{\sum f x}{\sum f} = \frac{2000}{100} = 20,
\]
as required.
(b) Since a normally distributed variable ranges from negative infinity to positive
infinity, the area beyond the class intervals must also be accounted for. Thus, the
area below 9.5 is the area below the Z value
\[
Z = \frac{9.5 - 20}{5} = -2.1
\]
From Table I in the Appendix, the area below $Z = -2.1$ is approximately 0.0179.
To calculate the area between 9.5 and 14.5, the area below 14.5 is first calculated as
follows:
\[
Z = \frac{14.5 - 20}{5} = -1.1
\]
From Table I in the Appendix, the area below $Z = -1.1$ is approximately 0.1357. Thus,
the area between 9.5 and 14.5 is the difference between the area below 14.5 and the
area below 9.5, which is $0.1357 - 0.0179 = 0.1178$.
To calculate the area between 14.5 and 19.5, the area below 19.5 is first calculated as
follows:
\[
Z = \frac{19.5 - 20}{5} = -0.1
\]
From Table I, the area below $Z = -0.1$ is approximately 0.4602. Thus, the area
between 14.5 and 19.5 is the difference between the area below 19.5 and the area
below 14.5, which is $0.4602 - 0.1357 = 0.3245$.
The area in each of the remaining class intervals is calculated in a similar manner.
The calculations are summarized in the next table.
Classes              Frequency    p(x)
x < 9.5              1            0.0179
9.5 ≤ x < 14.5       10           0.1178
14.5 ≤ x < 19.5      37           0.3245
19.5 ≤ x < 24.5      36           0.3557
24.5 ≤ x < 29.5      13           0.1554
29.5 ≤ x < 34.5      2            0.0268
x ≥ 34.5             1            0.0019
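The class probabilities in the table can be checked without statistical tables by using the error function. The sketch below assumes, as in the Z calculations of the solution, a mean of 20 and a standard deviation of 5; the helper `normal_cdf` is a hypothetical name introduced for illustration.

```python
import math


def normal_cdf(x, mean, sd):
    """Area under the normal curve below x, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


mean, sd = 20.0, 5.0
boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5]

# Cumulative areas, padded with 0 (minus infinity) and 1 (plus infinity)
cdf = [0.0] + [normal_cdf(b, mean, sd) for b in boundaries] + [1.0]

# Probability of each class = difference of successive cumulative areas
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]

# Expected frequencies for the 100 readings
expected = [100 * q for q in probs]

for q, e in zip(probs, expected):
    print(round(q, 4), round(e, 2))
```

The first and last classes are open-ended, which is how the areas in the two tails of the normal curve are accounted for.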
The expected normal curve frequencies for the various classes are summarized in
the table below.
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{5} \frac{[n_i - E(n_i)]^2}{E(n_i)} \\
       &= \frac{(1 - 1.79)^2}{1.79} + \frac{(10 - 11.78)^2}{11.78} + \cdots + \frac{(16 - 18.41)^2}{18.41} \\
       &= 1.5823.
\end{aligned}
\]
After combining classes so that they satisfy the requirement for use of the chi-square
distribution, 5 classes remain. Thus, the degrees of freedom for the test are
3, since the mean of the distribution is estimated from the sample.
From the chi-square tables and at the 0.05 level of significance, we have
$\chi^2_{0.05}(3) = 7.815$.
Since the calculated chi-square value of 1.5823 is less than the table value of
7.815, we fail to reject the null hypothesis. We therefore conclude that the data
may be looked upon as a random sample from a normal population.
Self-Assessment Questions
Exercise 3.3
1. The actual arrivals per minute during lunch time for 200 people are shown below.
2. A farmer kept record of the number of heifer calves born to each of his cows
during the first five years of breeding of each cow. The results are summarized
below.
No. of heifers 0 1 2 3 4 5
No. of Cows 4 19 41 52 26 8
For these data, x = 2.80 and s = 0.97 . Test at the 0.05 significance level, whether or
not the battery lives follow a normal distribution.
In this session, you will learn how to conduct goodness-of-fit test for homogeneity. This
test is used to determine whether frequency counts for a given variable are distributed
identically across different populations.
Objectives
By the end of the session, you should be able to:
1. formulate null and alternative hypotheses for goodness-of-fit tests for
homogeneity; and
2. conduct goodness-of-fit tests for homogeneity.
The goodness-of-fit test for homogeneity is considered appropriate when the following
conditions are met.
1. The method for selecting a sample from each population is simple random
sampling.
2. The variable under study is categorical.
3. The expected frequency for each cell should be at least 5.
Suppose that data are sampled from p mutually exclusive populations, and that the
categorical variable has l mutually exclusive levels. If $n_{ij}$ denotes the number of
individuals in the sample(s) that fall in row i and column j of the table, that is, in the
$(i, j)$th cell, then the data can be arranged in a two-way table as shown in Table 3.1.
Table 3.1: A p × l contingency table
                           Variable
Population    1       2       ...     l       Totals
1             n_11    n_12    ...     n_1l    P_1
2             n_21    n_22    ...     n_2l    P_2
...           ...     ...     ...     ...     ...
p             n_p1    n_p2    ...     n_pl    P_p
Totals        L_1     L_2     ...     L_l     n
Thus, Table 3.1 is a p × l contingency table in which the data is sampled from p
populations and the variable of interest has l levels.
The number of observations, $n_{ij}$, that fall into each cell is called the observed cell
frequency.
$P_i = \sum_{j=1}^{l} n_{ij}$ is the marginal total for row i, whilst
$L_j = \sum_{i=1}^{p} n_{ij}$ is the marginal total for column j.
Note that $\sum_{i=1}^{p} P_i = \sum_{j=1}^{l} L_j = n$ is the total sample size.
The null hypothesis states that, at any specific level of the variable, the p mutually
exclusive populations have the same proportions against the alternative that, at least one
of the null hypotheses is false. Thus,
against
H 1 : At least one of the H 0 is false
The test statistic is
\[
\chi^2 = \sum_{i=1}^{p} \sum_{j=1}^{l} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}, \tag{3.3}
\]
where $E(n_{ij})$ is the expected cell frequency for the $(i, j)$th cell. It can be shown that
\[
E(n_{ij}) = \frac{P_i \times L_j}{n}.
\]
The test statistic in Equation (3.3), under the null hypothesis, has an approximate
chi-square distribution with the number of degrees of freedom given by
\[
df = (p - 1)(l - 1),
\]
where p is the number of populations, and l is the number of levels of the categorical
variable in the test. The null hypothesis is rejected at significance level $\alpha$ if
\[
\chi^2 \ge \chi^2_{\alpha}[(p - 1)(l - 1)].
\]
Example 3.9
In a study of television viewing habits of children, a developmental psychologist selects
a random sample of 300 primary school pupils, 100 boys and 200 girls. Each child is
asked which of the following television programmes they liked best: The Talented
Child, or The Pulpit, or Math and Science Quiz. The results are shown below.
Viewing Preferences
The Talented Child The Pulpit Math and Science Quiz
Boys 50 30 20
Girls 50 80 70
Do boys’ preferences for the television programmes differ significantly from the girls’
preferences? Use the 0.05 level of significance.
Solution
We can calculate the population totals and television programme totals as shown in the
table.
Viewing Preferences
Talented Child Pulpit Math & Science Quiz Totals
Boys 50 30 20 100
Girls 50 80 70 200
Totals 100 110 90 300
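The expected cell frequencies are
\[
\begin{aligned}
E(n_{11}) &= \frac{P_1 \times L_1}{n} = \frac{100 \times 100}{300} = 33.33, &
E(n_{12}) &= \frac{P_1 \times L_2}{n} = \frac{100 \times 110}{300} = 36.67, \\
E(n_{13}) &= \frac{P_1 \times L_3}{n} = \frac{100 \times 90}{300} = 30.00, &
E(n_{21}) &= \frac{P_2 \times L_1}{n} = \frac{200 \times 100}{300} = 66.67, \\
E(n_{22}) &= \frac{P_2 \times L_2}{n} = \frac{200 \times 110}{300} = 73.33, &
E(n_{23}) &= \frac{P_2 \times L_3}{n} = \frac{200 \times 90}{300} = 60.00.
\end{aligned}
\]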
Substituting the observed and corresponding expected cell frequencies into Equation
(3.3), we obtain
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{p} \sum_{j=1}^{l} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} \\
       &= \frac{(50 - 33.33)^2}{33.33} + \cdots + \frac{(70 - 60)^2}{60} \\
       &= 8.3375 + \cdots + 1.6667 \\
       &= 19.3255
\end{aligned}
\]
with
\[
df = (p - 1)(l - 1) = (2 - 1)(3 - 1) = 2.
\]
Now, from chi-square tables, the value of chi-square at the 0.05 level of significance
with 2 degrees of freedom is 5.99. Since 19.3255 is greater than 5.99, we reject the null
hypothesis and conclude that the boys' preferences for the television programmes differ
significantly from the girls' preferences.
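The computation for a two-way table follows the same pattern whatever the number of rows and columns, so it lends itself to a small helper. The sketch below is one possible way to organise it; `chi_square_table` is an illustrative name, not a function from any particular library. It is applied to the data of Example 3.9.

```python
def chi_square_table(table):
    """Chi-square statistic, degrees of freedom and expected counts
    for a two-way table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # E(n_ij) = (row total i) * (column total j) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


# Example 3.9: rows are boys and girls, columns are the three programmes
viewing = [[50, 30, 20],
           [50, 80, 70]]
stat, df, expected = chi_square_table(viewing)

print("statistic:", round(stat, 2), "df:", df)
print("reject H0 at the 0.05 level:", stat >= 5.99)   # 5.99 is the table value for df = 2
```

Because the expected frequencies are not rounded to two decimal places here, the statistic may differ slightly in the second decimal place from the hand calculation.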
Self-Assessment Questions
Exercise 3.4
1. The Director for Academic Affairs of the University of Cape Coast was
concerned that males and females were accepted at different rates into the four
different colleges (EDUCATION, CANS, CHLS and HAAS) in the university.
He, therefore, collected the following data on the acceptance of 1200 males and
800 females who applied to the university:
Are males and females distributed equally among the various colleges?
(a) State the appropriate null and alternative hypotheses for conducting the test
above.
(b) Conduct the test at the 0.05 level of significance.
2. The head of surgery department at the university of Cape Coast medical school
was concerned that Surgical Residents in training applied unnecessary blood
transfusions at a different rate than the more experienced Attending Physicians.
Therefore, he ordered a study of the 49 Attending Physicians and 71 Residents in
Training with privileges at the hospital. For each of the 120 surgeons, the number
of blood transfusions prescribed unnecessarily in a one-year period was recorded.
Based on the number recorded, a surgeon was identified as either prescribing
unnecessary blood transfusion Frequently, Occasionally, Rarely, or Never. The
following is a summary of the resulting data.
Are attending physician and residents in training distributed equally among the
various unnecessary blood transfusion categories?
(a) State the appropriate null and alternative hypotheses for conducting the test
above.
(b) Conduct the test at the 0.05 level of significance.
In this session, you will learn how to conduct goodness-of-fit tests for independence,
which are similar to goodness-of-fit tests for homogeneity.
Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for goodness-of-fit tests for
independence; and
2. conduct goodness-of-fit tests for independence.
Now read on…
Table 3.2 is an r × c contingency table in which variable 1 (in the rows) is classified
into r categories and variable 2 (in the columns) is classified into c categories.
Table 3.2: An r × c contingency table
                           Variable 2
Variable 1    1       2       ...     c       Totals
1             n_11    n_12    ...     n_1c    R_1
2             n_21    n_22    ...     n_2c    R_2
...           ...     ...     ...     ...     ...
r             n_r1    n_r2    ...     n_rc    R_r
Totals        C_1     C_2     ...     C_c     n
The number of observations contained in each cell is called the observed cell frequency.
$R_i = \sum_{j=1}^{c} n_{ij}$ is the marginal total for row i, whilst
$C_j = \sum_{i=1}^{r} n_{ij}$ is the marginal total for column j.
Note that $\sum_{i=1}^{r} R_i = \sum_{j=1}^{c} C_j = n$ is the total sample size.
The test statistic is
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})},
\]
where $E(n_{ij})$ is the expected cell frequency for the $(i, j)$th cell. It can be shown that
\[
E(n_{ij}) = \frac{R_i \times C_j}{n}.
\]
The test statistic, under the null hypothesis, is approximately
chi-square distributed with the number of degrees of freedom given by
\[
df = (r - 1)(c - 1).
\]
Note that in tests for independence, both row and column marginal totals are free to
vary although the sample size is fixed. The test for independence is a test of association,
not a test of cause and effect. Thus, the fact that two variables are dependent does not
imply that one causes the other.
Example 3.10
The table below is based on the classification by size and colour of a sample of 120
shirts drawn from a large population.
                     Size
Colour     Small    Medium    Large
Red        10       13        12
Yellow     12       11        14
Green      18       20        10
Solution
The row marginal totals are $R_1 = 35$, $R_2 = 37$ and $R_3 = 48$; and the column marginal
totals are $C_1 = 40$, $C_2 = 44$ and $C_3 = 36$. Therefore, the corresponding expected cell
frequencies are obtained by substituting appropriate values into
\[
E(n_{ij}) = \frac{R_i \times C_j}{n}.
\]
Thus,
\[
\begin{aligned}
E(n_{11}) &= \frac{35 \times 40}{120} = 11.67, & E(n_{12}) &= \frac{35 \times 44}{120} = 12.83, & E(n_{13}) &= \frac{35 \times 36}{120} = 10.50, \\
E(n_{21}) &= \frac{37 \times 40}{120} = 12.33, & E(n_{22}) &= \frac{37 \times 44}{120} = 13.57, & E(n_{23}) &= \frac{37 \times 36}{120} = 11.10, \\
E(n_{31}) &= \frac{48 \times 40}{120} = 16.00, & E(n_{32}) &= \frac{48 \times 44}{120} = 17.60, & E(n_{33}) &= \frac{48 \times 36}{120} = 14.40.
\end{aligned}
\]
Now, substituting both the observed and their corresponding expected frequencies
into the test statistic, we obtain
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} \\
       &= \frac{(10 - 11.67)^2}{11.67} + \frac{(13 - 12.83)^2}{12.83} + \cdots + \frac{(10 - 14.40)^2}{14.40} \\
       &= 3.63
\end{aligned}
\]
with $df = (3 - 1)(3 - 1) = 4$.
Therefore, from chi-square tables, $\chi^2_{0.05}(4) = 9.49$. Since $\chi^2 = 3.63$ is less than
$\chi^2_{0.05}(4) = 9.49$, we cannot reject the null hypothesis. Therefore, the data give no
evidence that size and colour are dependent.
Example 3.11
A public opinion poll surveyed a simple random sample of 1000 voters. Respondents
were classified by gender (male or female) and by voting preference (NPP, NDC, or
PPP). The results are shown in the contingency table below.
Voting Preferences
NPP NDC PPP
Male 200 150 50
Female 250 300 50
Is there a gender gap? Do the men’s voting preferences differ significantly from the
women’s preferences? Use the 0.05 level of significance.
Solution
We wish to test the hypothesis
H 0 : Gender and voting preferences are independent.
against
H 1 : Gender and voting preferences are not independent.
The row marginal totals are $R_1 = 400$ and $R_2 = 600$; and the column marginal totals are
$C_1 = 450$, $C_2 = 450$ and $C_3 = 100$. The sample size is 1,000. Therefore, the
corresponding expected cell frequencies are obtained by substituting appropriate values
into
\[
E(n_{ij}) = \frac{R_i \times C_j}{n}.
\]
Thus,
Voting Preferences
NPP NDC PPP
Male 180 180 40
Female 270 270 60
Now, substituting both the observed and their corresponding expected frequencies
into the test statistic, we obtain
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} \\
       &= \frac{(200 - 180)^2}{180} + \cdots + \frac{(50 - 60)^2}{60} \\
       &= 16.2
\end{aligned}
\]
with $df = (2 - 1)(3 - 1) = 2$.
Therefore, from chi-square tables, $\chi^2_{0.05}(2) = 5.99$. Since $\chi^2 = 16.2$ is greater than
$\chi^2_{0.05}(2) = 5.99$, we reject the null hypothesis. Therefore, gender and voting
preferences are not independent.
Self-Assessment Questions
Exercise 3.5
1. A recent study of educational levels of 1000 voters and their political party
affiliations in Ghana showed the results given in the table. Set up the appropriate
hypotheses and test, at the 0.05 level of significance, if party affiliation is
independent of the educational level of the voters.
Party Affiliation
NPP NDC PPP
JHS 95 80 115
SHS 135 85 105
Tertiary 160 105 120
2. Pollsters have found that the public’s confidence in business is closely tied to the
economic climate. When businesses grow and employment increases, public
confidence goes high. When the opposite occurs, public confidence goes low. A
scholar hypothesized that there is a relationship between level of confidence in
business and job satisfaction, and that this is true for both union and non-union
workers. He analysed sample data collected by the National Opinion Research
Center and shown below.
Union workers
                                      Job Satisfaction
Level of Confidence    Very         Moderately    A little        Very
in Business            Satisfied    Satisfied     Dissatisfied    Dissatisfied
A Great Deal           26           15            2               1
Only Some              95           73            16              5
Hardly Any             34           28            10              9
Non-union workers
                                      Job Satisfaction
Level of Confidence    Very         Moderately    A little        Very
in Business            Satisfied    Satisfied     Dissatisfied    Dissatisfied
(a) State the appropriate null and alternative hypotheses for these tests.
(b) Conduct the appropriate tests.
(c) The scholar concluded that his hypothesis was not supported by the data. Do you
agree?
3. Consider the data below. At the 0.05 level of significance, would you say that the
level of teaching evaluation is related to rank? Or are full professors more likely
to be judged above average than other ranks?
                                        Rank
Teaching                     Senior      Associate    Full
Evaluation       Lecturer    Lecturer    Professor    Professor
Above Average    36          62          45           50
Average          48          50          35           43
Below Average    30          13          20           35
Objectives
By the end of this session, you should be able to:
1. explain coefficient of contingency;
2. calculate Pearson’s coefficient of contingency;
3. interpret Pearson’s coefficient of contingency;
4. calculate Cramer’s coefficient of contingency;
5. interpret Cramer’s coefficient of contingency.
\[
PCoC = \sqrt{\frac{\chi^2}{n + \chi^2}}, \tag{3.4}
\]
where
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}.
\]
Example 3.12
Refer to Example 3.11. Calculate and interpret Pearson’s coefficient of contingency.
Solution
We already know from Example 3.11 that the two variables, gender and voting
preferences, are dependent. Substituting χ 2 = 16.2 and n = 1,000 into Equation (3.4),
we have
\[
PCoC = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{16.2}{1{,}000 + 16.2}} = 0.1263
\]
We can, therefore, conclude that the association between gender and voting preferences
is low.
Example 3.13
Refer to Question 3 of Exercise 3.5. Calculate and interpret Pearson’s coefficient of
contingency.
Solution
\[
PCoC = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{17.4354}{467 + 17.4354}} = 0.1897
\]
We can, therefore, conclude that the association between Teaching Evaluation and Rank
is low.
\[
CCoC = \sqrt{\frac{\chi^2}{n (t - 1)}}, \tag{3.5}
\]
where t is the smaller of the number of rows and the number of columns. The value of
CCoC lies in the interval 0 to 1.
Example 3.14
Refer to Example 3.11. Calculate and interpret Cramer’s coefficient of contingency.
Solution
We already know from Example 3.11 that the two variables, gender and voting
preferences, are dependent. We note that r = 2 is smaller than c = 3 . Substituting
t = 2 , χ 2 = 16.2 and n = 1,000 into Equation (3.5), we have
\[
CCoC = \sqrt{\frac{\chi^2}{n (t - 1)}} = \sqrt{\frac{16.2}{1{,}000\,(2 - 1)}} = 0.1273
\]
We can, therefore, conclude that the association between gender and voting preferences
is low.
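Both coefficients depend only on the chi-square value, the sample size and the dimensions of the table, so they are easy to compute together. The sketch below assumes the square-root forms $PCoC = \sqrt{\chi^2 / (n + \chi^2)}$ and $CCoC = \sqrt{\chi^2 / [n(t - 1)]}$, which reproduce the values obtained in Examples 3.12 and 3.14; the function names are hypothetical.

```python
import math


def pearson_coc(chi2, n):
    """Pearson's coefficient of contingency."""
    return math.sqrt(chi2 / (n + chi2))


def cramer_coc(chi2, n, n_rows, n_cols):
    """Cramer's coefficient of contingency; t is the smaller table dimension."""
    t = min(n_rows, n_cols)
    return math.sqrt(chi2 / (n * (t - 1)))


# Example 3.11 data: chi-square = 16.2, n = 1000, a 2 x 3 table
pcoc = pearson_coc(16.2, 1000)
ccoc = cramer_coc(16.2, 1000, n_rows=2, n_cols=3)

print("Pearson:", round(pcoc, 4))
print("Cramer: ", round(ccoc, 4))
```

For a table with only two rows or two columns, $t - 1 = 1$, so the two coefficients differ only through the extra $\chi^2$ in the denominator of Pearson's coefficient.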
Example 3.15
Refer to Question 3 of Exercise 3.5. Calculate and interpret Cramer’s coefficient of
contingency.
Solution
\[
CCoC = \sqrt{\frac{\chi^2}{n (t - 1)}} = \sqrt{\frac{17.4354}{467\,(3 - 1)}} = 0.1366
\]
We can, therefore, conclude that the association between Teaching Evaluation and Rank
is low.
Self-Assessment Questions
Exercise 3.6
Objectives
By the end of the unit you should be able to:
1. state the general form of the simple linear regression equation and define the
terms involved;
2. determine an equation for the simple linear model using the method of least
squares and use it to make estimate for the response variable;
3. interpret the coefficients of the simple linear regression model obtained from a
given data;
4. determine the types of variation in a given dataset on two related variables;
5. compute the coefficient of determination;
6. compute the coefficient of correlation between two variables;
7. determine regression functions involving two variables and describe the
relationship that exists between them;
8. use the regression coefficients of the two models to obtain the correlation
between the two variables;
9. assess the quality of a regression model by using the standard error of estimate
10. relate results of simple linear regression using raw and transformed datasets.
Objectives
By the end of this session, you should be able to
1. State the general form of the simple linear regression equation and define the
terms involved;
2. determine an equation for the simple linear model using the method of least
squares and use it to make estimate for the response variable
Figure 4.1 shows a scatter plot of data on variables X and Y. The plot shows two
characteristics of the linear relationship:
1. A tendency for Y to decrease in straight line fashion as X increases.
2. A scattering of points around the straight line.
[Figure 4.1: Scatter plot of Y against X]
Another observation from the plot is that for a value of $X = 10$, for example, there are
two values of Y. For any specific value x, there are, in theory, many Y values
with the same x, as a result of other factors that affect Y. Thus, for each x, there is a
population of Y values. Denote the mean of these Y values by $\mu_{Y|X}$. The straight-line
tendency observed in the scatter plot can be represented by assuming that $\mu_{Y|X}$ is
related to X by
\[
\mu_{Y|X} = \beta_0 + \beta_1 x \tag{4.1}
\]
Thus, Equation (4.1) may be regarded as a line of means. The values $\beta_0$ and $\beta_1$ are
referred to as regression parameters.
If we take into account the factors other than X that affect Y, then the value of Y is the
sum of the mean value and an error term that represents all the other factors. Thus,
\[
Y = \beta_0 + \beta_1 x + \varepsilon
\]
At any value of X, there is a population of potential error term values, which
describe the different potential effects of these factors on Y. We will discuss the
behaviour of the error terms in more detail in the next unit.
An overall measure of the quality of the fit is given by the sum of squared deviations or
errors (SSE). Denote, for now, the SSE by Q. Then
\[
Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2 \tag{4.3}
\]
To obtain the values of $b_0$ and $b_1$ for which SSE is a minimum, we differentiate
Q partially with respect to each $b_r$ $(r = 0, 1)$ and equate to zero. We obtain
the following equations:
\[
\frac{\partial Q}{\partial b_0} = -2 \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)] = 0
\quad\Longrightarrow\quad
n b_0 + b_1 \sum x_i = \sum y_i \tag{4.4}
\]
\[
\frac{\partial Q}{\partial b_1} = -2 \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]\, x_i = 0
\quad\Longrightarrow\quad
b_0 \sum x_i + b_1 \sum x_i^2 = \sum x_i y_i \tag{4.5}
\]
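Equations (4.4) and (4.5) are often called the normal equations. We can solve the two
equations for $b_0$ and $b_1$. Now writing the system of equations as
\[
\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
=
\begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}, \tag{4.6}
\]
we obtain, as a ratio of determinants,
\[
b_1 = \frac{\begin{vmatrix} n & \sum y_i \\ \sum x_i & \sum x_i y_i \end{vmatrix}}
           {\begin{vmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{vmatrix}}
    = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \tag{4.7}
\]
Substituting this value of $b_1$ into Equation (4.4), and then dividing through by n, we obtain
\[
b_0 = \frac{1}{n} \sum y_i - \frac{b_1}{n} \sum x_i
\]
or
\[
b_0 = \bar{y} - b_1 \bar{x} \tag{4.8}
\]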
Equation (4.2) obtained in terms of the least squares estimates in (4.7) and (4.8) gives
the least squares regression line. This line is what is referred to as the line of best fit.
Example 4.1
The data shown concerns the number of hours spent by ten groups of workers on similar
jobs.
Size of group (x)    5    8    4    6    10    3    5    9    10    7
No. of hours (y)     10   7    13   4    1     7    8    3    2     5
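\[
b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
\]
Now, $n = 10$, $\sum x = 67$, $\sum y = 60$, $\sum xy = 335$, $\sum x^2 = 505$ and $\sum y^2 = 486$, so that
\[
b_1 = \frac{10(335) - (67)(60)}{10(505) - (67)^2} = \frac{-670}{561} = -1.1943
\]
and
\[
b_0 = \bar{y} - b_1 \bar{x} = 6 - (-1.1943)(6.7) = 14.0
\]
Therefore, the simple linear model for hours spent on the job by number of workers
is
\[
\hat{y} = 14.0 - 1.1943x
\]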
(b) Substituting $x = 4$ into the equation, we obtain
\[
\hat{y} = 14.0 - 1.1943(4) = 9.2228
\]
Therefore, when there are 4 workers on the job, the expected time to complete the
job would be about 9.2 hours.
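The least squares estimates in Example 4.1 can be verified directly from the raw data using Equations (4.7) and (4.8). The following is a minimal sketch; `fit_line` and the variable names are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 from the normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # Equation (4.7)
    b0 = sum_y / n - b1 * sum_x / n                                  # Equation (4.8)
    return b0, b1


group_size = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]     # x
hours = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]          # y

b0, b1 = fit_line(group_size, hours)
predicted_at_4 = b0 + b1 * 4

print("b0:", round(b0, 4), "b1:", round(b1, 4))
print("predicted hours for a group of 4:", round(predicted_at_4, 1))
```

The script carries the intercept at full precision rather than rounding it to 14.0, so the prediction agrees with the hand calculation to one decimal place but may differ beyond that.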
Example 4.2
The amount (in Gh¢) of electricity consumed by a household for an average weekly
temperature (in Degrees Celsius) for eight weeks is given in the table shown.
Amount of electricity (Gh¢)    75    71    92    88    108    120    118    126
Solution
(a) The scatter plot of the data is as shown in Figure 4.2.
[Figure 4.2: Scatter plot of electricity consumption against temperature]
From the plot we realize that electricity consumption (Y) increases as temperature (X)
increases, and this relationship appears to be linear and quite strong. Thus, a straight
line equation would describe the relationship between X and Y.
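\[
b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
\]
Now, $\sum x = 238.80$, $\sum y = 798$, $\sum_{i=1}^{n} x_i y_i = 24997.0$, $\sum x^2 = 7609$ and
$\sum y^2 = 82738$. Also,
\[
b_0 = \frac{1}{n} \sum y_i - \frac{b_1}{n} \sum x_i
\]
With $b_1 = 2.4471$, we have
\[
b_0 = \frac{1}{8}(798) - \frac{2.4471}{8}(238.8) = 26.7034
\]
Therefore, the simple linear model for electricity consumption (Y) in terms of
temperature (X) is
\[
\hat{y} = 26.7034 + 2.4471x
\]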
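When only the totals are available, as in Example 4.2, the normal equations can be applied to the sums directly. The sketch below is an illustration with a hypothetical function name, `fit_line_from_sums`. Because the totals quoted in the solution are rounded, the estimates it produces may differ from those in the text in the third or fourth decimal place.

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least squares intercept and slope computed from summary totals only."""
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Totals quoted in the solution to Example 4.2 (n = 8 weeks)
b0, b1 = fit_line_from_sums(n=8, sum_x=238.80, sum_y=798,
                            sum_xy=24997.0, sum_x2=7609)

print("b0:", round(b0, 2), "b1:", round(b1, 3))
```

Working from totals is convenient for hand calculation, but keeping the raw data is preferable where possible, since rounding in the totals carries through to the estimates.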
Self-Assessment Questions
Exercise 4.1
1. The profit y, in GH¢, of a certain small scale business establishment in
the xth year of its operation is given in the table.
x 1 2 3 4 5
y 125 140 165 195 230
(a)Find the simple linear regression model for determining the profit for any given
year.
(b) Use your model to determine the profit in the 6th year.
x 23 11 20 17 15 21 24 13 19 25
y 360 196.2 346.2 273 282 331.8 387 255.6 327 345
(a)Find the simple linear regression model for determining the sales for any given
home size.
(b) Use your model to determine the sales for home size of 180 sq.ft.
TPaste Industries produces various brands of toothpaste. For effective management
of inventory, the company would like to predict more efficiently the demand for
one of its premium products, Toothgate. To develop a prediction model, the
company has gathered data concerning demand for Toothgate over the last 20 sales
periods (where a sales period is defined to be a four-week period). The data were
obtained on the following variables:
y ― the demand for Toothgate (in thousand pieces) in the sales period
x1 ― the price (in GH¢) offered by the company.
x2 ― price difference between price of the company and average industry price of
competitors’ similar product.
Period x1 x2 y Period x1 x2 y
1 4.35 -0.05 5.38 11 4.20 0.40 7.10
2 4.25 0.25 6.51 12 4.25 0.45 6.86
3 4.20 0.60 7.52 13 4.30 0.30 6.87
4 4.20 0.00 5.50 14 4.20 0.50 7.26
5 4.10 0.25 7.33 15 4.30 0.50 7.00
6 4.10 0.20 6.28 16 4.30 -0.05 5.65
7 4.30 0.05 5.87 17 4.05 0.10 6.50
8 4.30 -0.15 5.10 18 4.25 0.00 5.67
9 4.35 0.15 6.00 19 4.30 0.05 5.93
10 4.40 0.20 5.89 20 4.20 0.55 7.26
Objective
By the end of this session, you should be able to:
• Interpret the coefficients of the simple linear regression model obtained
from a given data.
It has been noted that $\mu_{Y|X}$ represents the mean of the response variable Y when the
value of the predictor variable X is x. It has also been noted that $\beta_0$ and $\beta_1$ are the
regression parameters.
Now, if $X = 0$, then $\mu_{Y|X} = \beta_0$. Thus, $\beta_0$ is the mean response when the predictor
variable assumes the value 0. Some care should be taken in interpreting the value of
$\beta_0$, as its interpretation may not be relevant in the context of the given problem.
Thus, $\beta_1$ is the change in the mean of the response variable associated with a unit
change in X. If $\beta_1 > 0$, then the mean value of y increases as x increases. If $\beta_1 < 0$, then
the mean value of y decreases as x increases.
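The interpretation of $\beta_1$ can be seen by comparing the line of means in Equation (4.1) at two values of the predictor that are one unit apart. The following short derivation is a sketch using only Equation (4.1); the notation $\mu_{Y|x}$ for the mean response at a particular value x is introduced here for convenience.

```latex
\[
\mu_{Y|x+1} - \mu_{Y|x}
  = \left[\beta_0 + \beta_1 (x + 1)\right] - \left[\beta_0 + \beta_1 x\right]
  = \beta_1 .
\]
```

The difference does not depend on x, which is why a single number describes the change in the mean response per unit change in X along the whole line.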
Example 4.3
Refer to Example 4.1.
Interpret the coefficients of the linear regression model for determining the number of
hours spent on the job in terms of the number of workers involved.
Solution
Example 4.4
Refer to Example 4.2.
Interpret the coefficients of the linear regression model for determining the amount of
electricity consumed in terms of temperature.
Solution
The linear model was obtained as
y = 26.7034 + 2.4471x
From the equation, $b_o = 26.7034$.
The value of $b_o$ means that when the temperature is 0 °C, the expected amount spent
on electricity would be Gh¢26.70.
128
CoDEUCC/Post-Diploma in Mathematics and Science Education
SIMPLE LINEAR REGRESSION UNIT 4
SESSION 2
Self-Assessment Questions
Exercise 4.2
SIMPLE LINEAR REGRESSION UNIT 4
SESSION 3
referred to as Total Variation in y, SST. If the model is really useful, then it should
explain a high portion of this variation. In this section, we will discuss how to compute
the portion of the total variation accounted for by the model and how this is used as a
measure of the usefulness of the model. Using this measure, we will also find a measure
of the linear relationship between the two variables.
Objectives
By the end of this session, you should be able to:
1. determine the types of variation in a given dataset on two related variables.
2. compute the coefficient of determination.
3. compute the coefficient of correlation between two variables.
represents the amount of variation in y that is not explained by using X (i.e., the model).
We will denote this unexplained variation by SSE.
[Figure: plot of y against x, showing the observed values of y and the mean of y]
the regression sum of squares (SSR) and given by $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$. It represents the
amount of variation in Y explained by the model. The three types of variation are given
by the computational formulae as
\[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \tag{4.9} \]
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - b_o \sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i y_i \tag{4.10} \]
\[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b_o \sum_{i=1}^{n} y_i + b_1 \sum_{i=1}^{n} x_i y_i - n\bar{y}^2 \tag{4.11} \]
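The three types of variation can be checked numerically. The following is a minimal sketch, assuming the group-size data of Example 4.1; the function name `variation` is only illustrative. It computes each sum of squares from its definition, so that the identity SST = SSR + SSE can be confirmed directly.

```python
# Group-size data from Example 4.1 (x = size of group, y = hours spent).
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]

def variation(x, y):
    """Return (SST, SSR, SSE) for the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(a * a for a in x) - n * x_bar ** 2
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    fitted = [b0 + b1 * a for a in x]
    sst = sum((b - y_bar) ** 2 for b in y)                 # total variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)            # explained variation
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))     # unexplained variation
    return sst, ssr, sse

sst, ssr, sse = variation(x, y)
print(round(sst, 3), round(ssr, 3), round(sse, 3))
```

Because the coefficients are not rounded before use, the results may differ slightly from hand calculations that use rounded values of the coefficients.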
Example 4.5
The amount (in Gh¢) of electricity consumed by a household for an average weekly
temperature (in Degrees Celsius) for eight weeks is given in the table shown.
Amount of Elect: 75 71 92 88 108 120 118 126
Find:
(a) the total variation in the amount of electricity consumed
(b) variation in the amount of electricity consumed which is accounted for by the
temperature levels.
Solution
(a) The total variation in amount of electricity is
\[ SST = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = 82738 - 8(99.75)^2 = 3137.5 \]
(b) The regression equation for determining the amount of electricity bills given
temperature was obtained as
Amt = 26.7 + 2.45 Temp
\[ SSR = b_o \sum_{i=1}^{n} y_i + b_1 \sum_{i=1}^{n} x_i y_i - n\bar{y}^2 = 82474.259 - 79600.5 = 2873.759 \]
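\[ r^2 = \frac{\left( n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i \right)^2}{\left[ n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 \right]\left[ n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \right]} \tag{4.13} \]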
Example 4.6
The data on the number of hours spent by ten groups of workers on similar jobs
in Example 4.1 is reproduced below.
Size of group (x): 5 8 4 6 10 3 5 9 10 7
No. of Hrs (y): 10 7 13 4 1 7 8 3 2 5
Solution
From the table we obtain the following sums:
$\sum x = 67$, $\sum y = 60$, $\sum xy = 335$, $\sum x^2 = 505$, and $\sum y^2 = 486$
Substituting into the expression in Equation (4.13) with n = 10, we have
\[ r^2 = \frac{(10(335) - 67(60))^2}{[10(486) - (60)^2][10(505) - (67)^2]} = \frac{448900}{1260(561)} = 0.6351 \]
Therefore, 63.5% of variation in number of hours of work is accounted for by the
number of people engaged on the work.
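The substitution above can be reproduced in a few lines. This is a small sketch of Equation (4.13) using only the raw sums; the function name `r_squared` is an assumption made for the illustration.

```python
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]    # size of group
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]    # number of hours

def r_squared(x, y):
    """Coefficient of determination from the computational formula (4.13)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = (n * sum_xy - sum_x * sum_y) ** 2
    denominator = (n * sum_y2 - sum_y ** 2) * (n * sum_x2 - sum_x ** 2)
    return numerator / denominator

r2 = r_squared(x, y)
print(round(r2, 4))
```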
\[ r = \begin{cases} +\sqrt{r^2}, & b_1 > 0 \\ -\sqrt{r^2}, & b_1 < 0 \end{cases} \]
For example, for the data in Example 4.6, the correlation coefficient between hours of
work and number of people at work is simply $-0.7969$ (i.e., $-\sqrt{0.6351}$). Note that the
value should be negative because we already know that as the number of people at work
increases, their time spent on the work decreases. Thus, $b_1 < 0$; we actually do not need
to know the value of $b_1$ to determine the sign of the correlation coefficient.
If we have computed $b_1$ and $r^2$, then the above approach for determining the
correlation coefficient is simple. However, we can also obtain r by using the basic
definition of the correlation as
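\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{4.14} \]
which simplifies as
\[ r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[ n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 \right]\left[ n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \right]}} \tag{4.15} \]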
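As a quick check that the basic definition carries the sign of the relationship automatically, here is a minimal sketch of Equation (4.14) applied to the group-size data of Example 4.1. The function name `correlation` is illustrative only.

```python
import math

x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]    # size of group
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]    # number of hours

def correlation(x, y):
    """Correlation coefficient from the basic definition (4.14)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = correlation(x, y)
print(round(r, 4))
```

The numerator of (4.14) has the same sign as the slope of the fitted line, which is why no separate sign rule is needed here.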
We can portray the relationship described above in charts called scatter diagrams (or
scatter plots). Figure 4.4 shows scatter diagrams for specified values of the correlation
coefficients.
[Figure 4.4: Scatter diagrams of Y against X. Recoverable panel titles: (a) Perfect positive
correlation (r = +1); (b) Perfect negative correlation (r = –1); (e) Weak negative
correlation (–0.5 < r < 0); (f) Strong negative correlation (–1 < r < –0.5)]
Figure 4.4(a), (c), and (d) all have positive correlation coefficients, indicating a direct
relationship between X and Y. On the other hand, Figure 4.4(b), (e) and (f) all have
negative correlation coefficients, indicating an inverse relationship between X and Y.
Figure 4.4(g) has zero correlation, indicating that there is absolutely no linear
relationship between X and Y.
[Figure 4.5: Scale of the correlation coefficient, marked at –1, –0.5, 0, 0.5 and 1]
Self-Assessment Questions
Exercise 4.3
1. The data on profit y (in GH¢), of a certain small scale business establishment in the
xth year of its operation in Exercise 4.1, Question 1, is given in the table.
x 1 2 3 4 5
y 125 140 165 195 230
(a) Find the coefficient of determination for the model obtained for profit in terms
of the year.
(b) Deduce the coefficient of correlation between profit and the year of operation.
(c) Comment on your values in (a) and (b).
2. The data on sales price, y, (in thousands of GH¢) of a house and home size, x, (in
tens of square feet) in Exercise 4.1, Question 2, is given in the table.
x 23 11 20 17 15 21 24 13 19 25
y 360 196.2 346.2 273 282 331.8 387 255.6 327 345
(a) Find the coefficient of determination for the model obtained for sales price in
terms of home size.
(b) Deduce the coefficient of correlation between sales price and home size.
(c) Comment on your values in (a) and (b).
SIMPLE LINEAR REGRESSION UNIT 4
SESSION 4
Objectives
By the end of this session, you should be able to:
1. determine regression functions involving two variables and describe the
relationship that exists between them.
2. use the regression coefficients of the two models to obtain the correlation
between the two variables.
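Observe that by multiplying (4.17) by 1, i.e. $\sigma_y / \sigma_y$, we have
\[ b_{yx} = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \times \frac{\sigma_y}{\sigma_x} = \rho \frac{\sigma_y}{\sigma_x} \]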
So we can obtain the regression coefficient in terms of the population correlation
coefficient, ρ and the standard deviations of the two variables.
Example 4.7
The data on electricity consumption and temperature in Example 4.2 is given below.
Amount of Elect (y): 75 71 92 88 108 120 118 126
Temp (x): 19.0 19.0 23.5 30.0 34.7 36.4 37.0 39.2
Solution
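(a) $\sum x = 238.80$, $\sum y = 798.00$, $\sum x^2 = 7609$, $\sum y^2 = 82738$
\[ Var(X) = \frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1} = \frac{7609 - \frac{1}{8}(238.80)^2}{7} = 68.6886 \]
\[ Var(Y) = \frac{\sum y^2 - \frac{1}{n}\left(\sum y\right)^2}{n - 1} = \frac{82738 - \frac{1}{8}(798)^2}{7} = 448.2143 \]
Therefore, $\sigma_x = \sqrt{68.6886} = 8.29$ and $\sigma_y = \sqrt{448.2143} = 21.17$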
correlation coefficient.
4.2 Functions of X on Y
Similar to regression function of Y on X, the regression of X on Y is of the form
x = a o + a1 y. (4.18)
We will derive expression for finding the coefficients in (4.18) and also determine other
properties of the model.
Now, taking sum of both sides in Equation (4.18), and dividing through by n, we have
\[ \sum x = n a_o + a_1 \sum y \;\Rightarrow\; \bar{x} = a_o + a_1 \bar{y} \]
Thus, $(\bar{x}, \bar{y})$ lies on the regression line of X on Y. If we find $a_1$ we can find $a_o$ from the
last equation.
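Now multiply (4.18) by y and take sums. Substituting for $a_o$, we obtain
\[ \begin{aligned} \sum xy &= a_o \sum y + a_1 \sum y^2 \\ &= (\bar{x} - a_1 \bar{y}) \sum y + a_1 \sum y^2 \\ &= \frac{1}{n} \sum x \sum y - a_1 \frac{1}{n} \left(\sum y\right)^2 + a_1 \sum y^2 \end{aligned} \]
\[ a_1 \left[ n \sum y^2 - \left(\sum y\right)^2 \right] = n \sum xy - \sum x \sum y \]
\[ a_1 = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - \left(\sum y\right)^2} = \frac{\frac{1}{n} \sum xy - \bar{x}\bar{y}}{\frac{1}{n} \sum y^2 - \bar{y}^2} = \frac{\sigma_{xy}}{\sigma_y^2} \]
Therefore, $a_o = \bar{x} - \dfrac{\sigma_{xy}}{\sigma_y^2} \bar{y}$.
Denoting $a_1$ by $b_{xy}$, we see that
\[ b_{xy} = \frac{\sigma_{xy}}{\sigma_y^2} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \times \frac{\sigma_x}{\sigma_y} = \rho \frac{\sigma_x}{\sigma_y} \]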
Now we make the following observations about the two models Y on X and X on Y.
1. The product of the slopes of the two lines is
\[ b_{yx} b_{xy} = \rho \frac{\sigma_y}{\sigma_x} \times \rho \frac{\sigma_x}{\sigma_y} = \rho^2 \tag{4.19} \]
Example 4.8
Refer to Example 4.7.
(a) Find the regression function of Temperature on Amount of Electricity consumption.
(b) Using regression coefficients, verify the correlation coefficient between the two
variables.
Solution
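(a) We already know that for this data, correlation coefficient r = 0.9570,
$\bar{x} = 29.85$, $\sigma_x = 8.29$, $\bar{y} = 99.75$, $\sigma_y = 21.17$
Substituting into the expression $x = \bar{x} + r \dfrac{\sigma_x}{\sigma_y} (y - \bar{y})$, we have
\[ x = 29.85 + 0.9570 \left( \frac{8.29}{21.17} \right) (y - 99.75) \]
Simplifying, we obtain
\[ x = -7.5317 + 0.3748 y \]
(b) Now, $b_{xy} = 0.3748$ and $b_{yx} = 2.447$. Taking product of the two,
\[ b_{xy} b_{yx} = 0.3748 \times 2.447 = 0.9171 \]
By Equation (4.19) this product is $r^2$, and both slopes are positive, so $r = \sqrt{0.9171}$.
Therefore, the correlation coefficient between the two variables is 0.9577.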
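The verification in part (b) can also be done directly from the data. The sketch below, which assumes the temperature and electricity values tabulated in Example 4.7 and uses an illustrative helper named `slope`, fits both regressions and multiplies the slopes. Since no intermediate rounding is done, the figures may differ in the third or fourth decimal place from the hand-worked values.

```python
import math

temp = [19.0, 19.0, 23.5, 30.0, 34.7, 36.4, 37.0, 39.2]   # x
amt = [75, 71, 92, 88, 108, 120, 118, 126]                # y

def slope(u, v):
    """Least squares slope of the regression of v on u."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    return suv / suu

b_yx = slope(temp, amt)     # regression of Y on X
b_xy = slope(amt, temp)     # regression of X on Y
r2 = b_yx * b_xy            # Equation (4.19)
r = math.copysign(math.sqrt(r2), b_yx)   # r takes the sign of the slopes
print(round(b_yx, 4), round(b_xy, 4), round(r, 4))
```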
In the next sub-session, we will examine the geometric relationship between the two
regression lines involving two variables. For brevity of presentation henceforth, denote
the regression of Y on X as simply Yx and that of X on Y as X y .
4.3 Geometric Relationship Between Regression Functions
The general relationship between the graphs of regression lines Yx and X y is shown in
Figure 4.6. Note that this is one of two ways the two lines can relate. The two lines will
always have the same sign of slope. In the case in Figure 4.6, we assume that there is a
positive relationship between the two variables.
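[Figure 4.6: The regression lines $Y_x$ and $X_y$ drawn on x and y axes, intersecting at the
point $P(\bar{x}, \bar{y})$ at an angle $\theta$, with angles $\varphi_y$ and $\varphi_x$ marked on the x-axis]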
\[ \tan\theta = \frac{\left( \dfrac{1}{\rho} - \rho \right) \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \tag{4.20} \]
1 − ρ 2 = 0, ⇒ ρ 2 = 1 ⇒ ρ = ±1
Thus, if the two lines coincide, then it implies that the two variables are
perfectly correlated.
2. If $\theta = 90^\circ$, then the lines are perpendicular. This implies that
\[ \frac{1 - \rho^2}{2\rho} = \infty \;\Rightarrow\; \rho = 0 \]
Thus, if the two lines are perpendicular, then it implies that the two variables are
uncorrelated.
Self-Assessment Questions
Exercise 4.4
Objective
\[ S^2 = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2} \tag{4.22} \]
Clearly, if $Y_i - \hat{Y}_i = 0$ for every i, then there is no error in prediction and so it makes
sense that $S^2 = 0$. In that case, every observed point lies on the fitted line. As the quality
of the fit gets worse, the difference between $Y_i$ and $\hat{Y}_i$ increases and $S^2$ gets larger. Therefore,
the standard error of estimate is another useful measure of the quality of the regression
equation.
As observed in Session 4.3, $S^2$ may also be written as
\[ S^2 = \frac{SSE}{n - 2} \]
also called the Mean Square Error (MSE). Therefore, the standard error of estimate is
given by
\[ S = \sqrt{\frac{SSE}{n - 2}} \]
The Error Sum of Squares is given by
\[ SSE = \sum_{i=1}^{n} y_i^2 - b_o \sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i y_i \]
where $b_o$ and $b_1$ are the regression coefficients.
Example 4.9
Solution
(a) The regression equation for determining the amount of electricity consumption
given temperature was obtained as
Amt = 26.7 + 2.45 Temp
Thus, $b_o = 26.7$ and $b_1 = 2.447$.
Also $\sum_{i=1}^{n} y_i^2 = 82738$, $\sum_{i=1}^{n} y_i = 798$, $\sum_{i=1}^{n} x_i y_i = 24997$
\[ SSE = \sum_{i=1}^{n} y_i^2 - \left( b_o \sum_{i=1}^{n} y_i + b_1 \sum_{i=1}^{n} x_i y_i \right) = 82738 - 82474.259 = 263.741 \]
(b) The standard error of the estimate is given as
\[ S = \sqrt{\frac{263.741}{6}} = \sqrt{43.9568} = 6.63 \]
In the example above, we cannot tell whether or not the value of the standard error is
low enough to justify that the model is good. If however we have two models for the
same purpose of predicting y, the model with the smaller error is usually preferred.
SIMPLE LINEAR REGRESSION UNIT 4
SESSION 5
Equation (4.24) shows that if the correlation coefficient is known and the total variation
in the dependent variable Y is also known, we can find the standard error of estimate of
y.
The result also indicates that if the coefficient of determination is high, then S will be
low.
Self-Assessment Questions
Exercise 4.5
SIMPLE LINEAR REGRESSION UNIT 4
SESSION 6
It is important to know the nature of the data used for any statistical analysis, as this
has implications for clarity of interpretation of results. So far we have been dealing
with the raw data. We can also work with mean-corrected data or standardized data. In
mean-corrected data, the mean of all values on each variable is subtracted from each of
the values on that variable. If the mean-corrected data is then divided by the standard
deviation of the variable, the data is then standardized.
Depending on the nature of the raw data and the objective of the analysis, we may
choose any of the three forms of the data for analysis. In this session, we will consider
how the form of the data affects the regression model.
Objective
You can check that the data is standardized data by showing that the mean value is 0
and the variance is 1. Using the data in Table 4.1, obtain the simple linear regression of
Amount on Temperature. Your result should be the same as
Amt = 0.958Temp
Notice that in this case, the intercept bo = 0 . Notice also that b1 = 0.958 is the same as
the correlation coefficient between the original variables in Example 4.7. This result is
not peculiar to this data. The result can be generalized to any dataset.
From the general equation of the regression of Y on X,
\[ y = \bar{y} + \rho \frac{\sigma_y}{\sigma_x} (x - \bar{x}) \]
Re-arranging in terms of sample estimates, we have
\[ \frac{y - \bar{y}}{s_y} = r \, \frac{x - \bar{x}}{s_x} \]
Notice that the expression $\dfrac{y - \bar{y}}{s_y}$ is the standardized value of Y and $\dfrac{x - \bar{x}}{s_x}$ is the
standardized value of X. Denoting each simply by Y and X, we have
\[ Y = rX \tag{4.25} \]
This gives the regression $Y_x$ using standardized values, and the coefficient is the
correlation coefficient. Since $b_{xy} = r \, s_x / s_y$ and standardized variables have unit standard
deviation, the regression $X_y$ of X on Y then becomes $X = rY$.
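The claim that the slope on standardized data equals the correlation coefficient can be checked numerically. The sketch below assumes the temperature and electricity data of Example 4.7 and standardizes with the sample standard deviation; the helper names `standardize` and `fit` are illustrative, not taken from the course text.

```python
import math
import statistics

temp = [19.0, 19.0, 23.5, 30.0, 34.7, 36.4, 37.0, 39.2]   # x
amt = [75, 71, 92, 88, 108, 120, 118, 126]                # y

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def fit(u, v):
    """Return (intercept, slope) of the least squares regression of v on u."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    slope = suv / suu
    return v_bar - slope * u_bar, slope

b0_raw, b1_raw = fit(temp, amt)
# Correlation recovered from the raw slope: r = b_yx * s_x / s_y.
r = b1_raw * statistics.stdev(temp) / statistics.stdev(amt)

b0_std, b1_std = fit(standardize(temp), standardize(amt))   # Y on X
a0_std, a1_std = fit(standardize(amt), standardize(temp))   # X on Y
print(round(b0_std, 6), round(b1_std, 3), round(a1_std, 3), round(r, 3))
```

With standardized values, both regressions have zero intercept and the same slope r, so the two lines coincide only when r = ±1.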
\[ s_y^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{d^2}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2. \quad \text{Thus, } s_y^2 = d^2 s_Y^2 \]
The correlation coefficient between x and y in terms of the transformed data X and Y is
therefore given as follows:
\[ r = \frac{s_{xy}}{s_x s_y} = \frac{cd \, s_{XY}}{cd \, s_X s_Y} = \frac{s_{XY}}{s_X s_Y} \]
The result shows that the value of the correlation coefficient is not affected by the use of
the transformed data.
The slope of the regression line using the transformation is obtained similarly as
follows:
\[ b_{yx} = \frac{s_{xy}}{s_x^2} = \frac{cd \, s_{XY}}{c^2 s_X^2} = \frac{d}{c} \, \frac{s_{XY}}{s_X^2} \]
Example 4.10
The data in Exercise 4.1, Question 1 is on the profit y, (in GH¢), of a certain small scale
business establishment in the xth year of its operation. It is given in the table shown.
x 1 2 3 4 5
y 125 140 165 195 230
By using a suitable transformation,
(a) determine the regression model for profit in terms of the year of operation.
(b) find the correlation between x and y.
Solution
(a) Define the linear transformation as follows
x = 3 + u and y = 165 + 5v
That is, $u = x - 3$ and $v = \dfrac{y - 165}{5}$
Using the transformation, we have the data as follows
u v uv
−2 −8 16
−1 −5 5
0 0 0
1 6 6
2 13 26
(b) The correlation between x and y is the same as that between u and v. Now,
$b_{vu} = r \dfrac{s_v}{s_u}$. So $r = \dfrac{s_u}{s_v} b_{vu}$.
\[ r^2 = b_{vu}^2 \times \frac{\sum u^2 - n\bar{u}^2}{\sum v^2 - n\bar{v}^2} = b_{vu}^2 \times \frac{10}{294 - 5(1.2)^2} = \frac{28.09 \times 10}{286.8} = 0.9794 \]
Therefore, r = 0.9897.
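The effect of the transformation can be confirmed numerically. The following sketch assumes the transformation x = 3 + u, y = 165 + 5v used in this example (so c = 1 and d = 5); the helper names `slope` and `corr` are illustrative only. It compares the slope and correlation computed from the transformed and the original data.

```python
import math

x = [1, 2, 3, 4, 5]
y = [125, 140, 165, 195, 230]
u = [xi - 3 for xi in x]              # u = x - 3
v = [(yi - 165) / 5 for yi in y]      # v = (y - 165) / 5

def slope(p, q):
    """Least squares slope of the regression of q on p."""
    n = len(p)
    p_bar, q_bar = sum(p) / n, sum(q) / n
    spq = sum((a - p_bar) * (b - q_bar) for a, b in zip(p, q))
    spp = sum((a - p_bar) ** 2 for a in p)
    return spq / spp

def corr(p, q):
    """Correlation as the signed square root of the product of slopes."""
    return math.copysign(math.sqrt(slope(p, q) * slope(q, p)), slope(p, q))

b_vu = slope(u, v)
b_yx = slope(x, y)
r_uv = corr(u, v)
r_xy = corr(x, y)
print(round(b_vu, 4), round(b_yx, 4), round(r_uv, 4), round(r_xy, 4))
```

The slope on the original scale is d/c times the slope on the transformed scale, while the correlation is the same on both scales.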
Self-Assessment Questions
Exercise 4.6
Unit Outline
Session 1: Test of Significance involving Simple Linear Regression Model
Session 2: Assumptions and Properties of Least Squares Regression Line
Session 3: Confidence Interval
Session 4: The Mean and Variance of the Simple Linear Regression Estimates
Session 5: The Use of Analysis of Variance in Simple Linear Regression
Session 6: Using Matrix Algebra in Simple Linear Regression
Objectives
By the end of the unit you should be able to:
1. conduct a test of significance of the simple linear regression
model;
2. conduct a test of significance of the independent variable in a simple
linear regression model;
3. state and verify whether or not the regression assumptions are satisfied
for a given model;
4. state and verify that a developed simple linear regression model has
the desired properties;
5. calculate the 95 percent confidence interval for the mean value of the
dependent variable in a simple linear regression model;
6. calculate the 95 percent confidence interval for the slope of the
regression line;
\[ F = \frac{SSR / 1}{SSE / (n - 2)} = \frac{MSR}{MSE} \tag{5.1} \]
In Equation (5.1), SSR is divided by 1 since there is only one variable in the model. In
the denominator, SSE is divided by (n − 2) since we have used the data to make two
estimates, bo and b1 . For a good model, we expect explained variation to be large and
unexplained variation to be small. Thus, a large value of F is preferred. The statistic F
has the F-distribution with degrees of freedom 1 and (n − 2) . We therefore reject the
null hypothesis for a large value of F.
Another approach is to examine the significance of X in predicting Y. Thus, the test is
based on the hypothesis
H o : β1 = 0
The hypothesis says that there is no change in the mean value of y associated with an
increase in x. That is, the variable X is not important in predicting Y. The alternative
hypothesis is that
H a : β1 ≠ 0
If we reject H o , we will conclude that X is significantly related to Y and is therefore
relevant in the model. To conduct the test, we have already computed the least squares
estimate $b_1$ of $\beta_1$ from a sample of n observations of the dependent variable Y. Note
that for each value of X, there is an infinite number of Y values that could be observed. So
there is potentially an infinite number of samples that could be obtained, and hence an
infinite population of potential values of estimates of the regression coefficient. Based
on the regression assumptions (Session 2), the population of all values of $b_1$ is normally
distributed with mean $\beta_1$ and standard deviation given by
\[ \sigma_{b_1} = s \sqrt{c_{11}} \]
where
\[ c_{11} = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{1}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \tag{5.2} \]
and s is the square root of the mean square error (MSE) associated with the model. It is
given by
\[ s^2 = \frac{SSE}{n - 2}. \]
The test statistic for the test of $H_o$ is given by
\[ t = \frac{b_1 - \beta_1 \mid H_o}{se(b_1)} \]
Under $H_o$ the test statistic becomes
\[ t = \frac{b_1}{se(b_1)} \]
which has the t-distribution with n − 2 degrees of freedom. A high absolute value of the
statistic compared to $t_{\alpha/2}$ shows a departure of $b_1$ from the hypothesized value. We will
therefore reject $H_o$ in this case.
Example 5.1
Refer to the data in Example 4.2.
INFERENCE ABOUT SIMPLE LINEAR UNIT 5
REGRESSION SESSION 1
(a) Determine whether or not temperature is significant in the model for amount of
electricity consumption.
(b) Determine whether or not the model is significant. Take α = 5% .
Solution
(a) In that data the model is found to be Amt = 26.7 + 2.447 Temp.
We will determine whether or not Temperature is useful in this model.
The regression sum of squares is SSR = 2873.759 and the total sum of squares is
SST = 3137.5. Therefore, SSE = 3137.5 − 2873.759 = 263.741 and
\[ s^2 = \frac{263.741}{6} = 43.9568. \]
The null hypothesis for the test of significance is
\[ H_o: \beta_1 = 0 \quad \text{against} \quad H_a: \beta_1 \neq 0 \]
The test statistic is given by
\[ t = \frac{b_1}{se(b_1)} \]
Now,
\[ c_{11} = \frac{1}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{1}{7609 - 8(29.85)^2} = 0.00208 \]
Thus,
\[ t = \frac{b_1}{se(b_1)} = \frac{2.447}{6.63\sqrt{0.00208}} = \frac{2.447}{0.3023} = 8.0926. \]
From the t-distribution table, we find $t_{0.025,\,6} = 2.447$.
We observe that the value of the test statistic is greater than the table value. We
therefore reject $H_o: \beta_1 = 0$ and conclude that $\beta_1$ is significantly different from 0.
Therefore, Temperature is significant in the model.
(b) We test the significance of the contribution of the entire model in accounting for
the variation in Y. The test statistic is given as
\[ F = \frac{2873.759}{43.9568} = 65.3769 \]
From the F table, $F_{0.05,\,1,\,6} = 5.99$.
Since the value of the test statistic is much greater than the table value, we reject
$H_o$ and conclude that the model is significant.
You will notice that the results of assessing the significance of the model by examining
the explained variation and that of examining the significance of X appear to be the
same.
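The agreement between the two tests is not a coincidence of this dataset: in simple linear regression the F statistic is the square of the t statistic for the slope. The sketch below illustrates this with the temperature and electricity data tabulated in Example 4.7; the function name `t_and_f` is an assumption for the illustration, and because nothing is rounded along the way the values will differ slightly from the hand-worked figures above.

```python
import math

temp = [19.0, 19.0, 23.5, 30.0, 34.7, 36.4, 37.0, 39.2]   # x
amt = [75, 71, 92, 88, 108, 120, 118, 126]                # y

def t_and_f(x, y):
    """Return (t, F) for testing H_o: beta_1 = 0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    ssr = sum((b0 + b1 * a - y_bar) ** 2 for a in x)
    mse = sse / (n - 2)
    se_b1 = math.sqrt(mse / sxx)          # s * sqrt(c11)
    return b1 / se_b1, (ssr / 1) / mse

t_stat, f_stat = t_and_f(temp, amt)
print(round(t_stat, 3), round(t_stat ** 2, 3), round(f_stat, 3))
```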
\[ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \tag{5.3} \]
which has the t-distribution with n − 2 degrees of freedom. A high value of t indicates
that the correlation is significant.
Example 5.2
In the sample of n = 10 pairs of values drawn on the number of hours taken to complete
a job by groups of workers of various sizes, the observed correlation coefficient
between the variables is r = 0.7969.
(a) Assess the significance of this value.
(b) Find the minimum value of r for a sample of this size that is significant at the 5%
level.
Solution
(a) Our hypothesis is $H_o: \rho = 0$ against $H_a: \rho \neq 0$. Substituting values into
$t = \dfrac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$, we have
\[ t = \frac{0.7969\sqrt{10 - 2}}{\sqrt{1 - 0.7969^2}} = 3.7313 \]
From the t table, $t_{0.025,\,8} = 2.306$. Thus, the statistic is greater. So we reject $H_o$ and
conclude that the correlation coefficient is significant.
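For part (b), Equation (5.3) can be solved for r at the critical value of t, which gives $r = t / \sqrt{t^2 + n - 2}$. The sketch below is one way to carry out both parts; the function names are illustrative, and the critical value 2.306 is the table value quoted in part (a).

```python
import math

def t_from_r(r, n):
    """Test statistic of Equation (5.3) for H_o: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def min_significant_r(t_crit, n):
    """Smallest |r| whose t statistic reaches t_crit (Equation (5.3) solved for r)."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)

n = 10
t = t_from_r(0.7969, n)
r_min = min_significant_r(2.306, n)
print(round(t, 3), round(r_min, 4))
```

Any sample correlation of size n = 10 whose absolute value exceeds `r_min` would lead to rejection of $H_o$ at the 5% level under this two-sided test.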
Self-Assessment Questions
Exercise 5.1
INFERENCE ABOUT SIMPLE LINEAR
UNIT 5
REGRESSION SESSION 2
Objectives
By the end of this session, you should be able to:
1. State and verify whether or not the regression assumptions are satisfied
for a given model;
2. State and verify that a developed simple linear regression model has the desired
properties.
Size of group (x): 5 8 4 6 10 3 5 9 10 7
No. of Hrs (y): 10 7 13 4 1 7 8 3 2 5
You will notice that at any value of the independent variable, X, the dependent variable
Y could assume several values. For example, when X = 5, Y takes the two values 10
and 8. Again, when X = 10, Y takes the values 1 and 2. This suggests that the value of Y
is influenced not only by the value of X but also by other factors. There is
therefore a population of error term values that could potentially occur, describing the
different potential effects on Y of all factors other than X. These explain the variation in
Y values observed when X = x.
The scatter plot of the data with fitted regression line is shown in Figure 5.1.
[Figure 5.1: Plot of data on number of hours (Hrs) spent by ten groups of workers on
similar jobs against group size (GpSize), with fitted line]
The four stated assumptions can be summarized into one point: the observations $y_i$
are independently and normally distributed with mean $E(y_i) = \beta_o + \beta_1 x_i$ and
constant variance $\sigma^2$. That is, $y_i \sim N(\beta_o + \beta_1 x_i, \sigma^2)$.
SN  Size of Group (X)  No. of hours (Y)  Fits ($\hat{Y}$)  Residual (e)
1 5 10 8.0303 1.96970
2 8 7 4.4474 2.55258
3 4 13 9.2246 3.77540
4 6 4 6.8360 -2.83601
5 10 1 2.0588 -1.05882
6 3 7 10.4189 -3.41889
7 5 8 8.0303 -0.03030
8 9 3 3.2531 -0.25312
9 10 2 2.0588 -0.05882
10 7 5 5.6417 -0.64171
1. The sum of the residuals is zero. That is, $\sum_{i=1}^{n} e_i = 0$.
\[ \begin{aligned} \sum_{i=1}^{n} e_i &= \sum_{i=1}^{n} (y_i - b_o - b_1 x_i) \\ &= \sum_{i=1}^{n} y_i - n b_o - b_1 \sum_{i=1}^{n} x_i \\ &= 0 \end{aligned} \]
The final equality follows from the first of the normal equations. For example, in
Table 5.1, it can be verified that $\sum_{i=1}^{10} e_i = 0$.
2. The sum of squared residuals $\sum_{i=1}^{n} e_i^2$ is a minimum. This is the condition under
note that
\[ \begin{aligned} \sum_{i=1}^{n} y_i &= n b_o + b_1 \sum_{i=1}^{n} x_i \\ &= \sum_{i=1}^{n} b_o + \sum_{i=1}^{n} b_1 x_i \\ &= \sum_{i=1}^{n} (b_o + b_1 x_i) \\ &= \sum_{i=1}^{n} \hat{y}_i \end{aligned} \]
The result also implies that the mean of the fitted values is the same as the mean
of the observed values. In Table 5.1, verify that the mean of Y equals the mean of $\hat{Y}$
and is equal to 6.00.
4. Sum of weighted residuals is zero when the residual in the ith trial is weighted
by the level of independent variable in the ith trial. That is, $\sum_{i=1}^{n} x_i e_i = 0$. We
observe that
\[ \begin{aligned} \sum_{i=1}^{n} x_i e_i &= \sum_{i=1}^{n} x_i (y_i - b_o - b_1 x_i) \\ &= \sum_{i=1}^{n} x_i y_i - b_o \sum_{i=1}^{n} x_i - b_1 \sum_{i=1}^{n} x_i^2 \\ &= 0 \end{aligned} \]
The final equality follows from the second of the normal equations. Verify that this
is true from Table 5.1.
5. Sum of weighted residuals is zero when the residual in the ith trial is weighted
by the fitted value of the response variable for the ith trial. That is, $\sum_{i=1}^{n} \hat{y}_i e_i = 0$.
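The properties listed above can be verified numerically for the data in Table 5.1. The following is a minimal sketch; the variable names are chosen for this illustration only, and the sums are zero only up to floating-point rounding.

```python
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]    # size of group
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]    # number of hours

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * a for a in x]                 # fits
resid = [b - f for b, f in zip(y, fitted)]        # residuals e_i

sum_e = sum(resid)                                     # Property 1
sum_xe = sum(a * e for a, e in zip(x, resid))          # Property 4
sum_fe = sum(f * e for f, e in zip(fitted, resid))     # Property 5
mean_fitted = sum(fitted) / n                          # equals the mean of y

print(round(sum_e, 10), round(sum_xe, 10), round(sum_fe, 10), round(mean_fitted, 2))
```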
Self-Assessment Questions
Exercise 5.2
Objectives
By the end of this session, you should be able to:
1. calculate the 95 percent confidence interval for the mean value of the
dependent variable in a simple linear regression model.
2. calculate the 95 percent confidence interval for the slope of the regression
line.
\[ x_o' (X'X)^{-1} x_o = \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \tag{5.5} \]
The value $t_{\alpha/2}$ is based on n − 2 degrees of freedom.
Example 5.3
Refer to Example 4.1. The data on the number of hours spent by ten groups of workers
on similar jobs is reproduced below.
Size of group (x): 5 8 4 6 10 3 5 9 10 7
No. of Hrs (y): 10 7 13 4 1 7 8 3 2 5
Find the 95% confidence interval for the estimate of the number of hours spent by 10
workers.
Solution
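The model for y is already obtained as y = 14.0 − 1.19x
$\sum x = 67$, $\sum y = 60$, $\sum xy = 335$, $\sum x^2 = 505$, and $\sum y^2 = 486$
\[ \begin{aligned} \left[ \hat{y} \pm t_{\alpha/2} \, S \sqrt{\text{Distance value}} \right] &= \left[ 2.1 \pm 2.306(2.3973)\sqrt{0.2941} \right] \\ &= [2.1 \pm 2.9980] \\ &= [-0.898, \; 5.098] \end{aligned} \]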
This means that in repeated sampling, about 95 percent of the intervals calculated with
the given formula would contain the true mean of y at X = 10. For this sample, the
interval is [−0.898, 5.098].
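The interval can be recomputed from the raw data as a check. The sketch below assumes the table value $t_{0.025,\,8} = 2.306$ used in the example; the function name `mean_response_ci` is illustrative. Because the fitted value is not rounded to 2.1 before use, the end points differ slightly from the hand-worked interval.

```python
import math

x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]    # size of group
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]    # number of hours

def mean_response_ci(x, y, x0, t_crit):
    """Return (fitted value, distance value, half-width) at X = x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))                  # standard error of estimate
    distance = 1 / n + (x0 - x_bar) ** 2 / sxx    # Equation (5.5)
    half_width = t_crit * s * math.sqrt(distance)
    return b0 + b1 * x0, distance, half_width

y_hat, distance, half_width = mean_response_ci(x, y, x0=10, t_crit=2.306)
print(round(y_hat, 4), round(distance, 4), round(half_width, 4))
print(round(y_hat - half_width, 3), round(y_hat + half_width, 3))
```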
Figure 5.2 is a MINITAB output of the 95 percent confidence interval for the problem
in Example 5.3.
[Figure 5.2: Geometric relationship between a 95-percent confidence interval and the
actual regression line (Hrs plotted against GpSize, showing the regression line and the
95% CI band)]
Note that the shorter the confidence interval, the better the prediction based on the
model. Geometrically, the closer the interval band to the regression line the better the
prediction. From Figure 5.2, what can you say about estimates based on the regression
line?
\[ c_{11} = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{1}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \]
Example 5.4
Refer to Example 5.3. Calculate the 95 percent confidence interval for the slope of the
regression line. Comment on your result.
Solution
The model is obtained as y = 14.0 − 1.19x.
\[ s_{xx} = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2 = 505 - \frac{1}{10}(67)^2 = 505 - 448.9 = 56.1. \]
Thus, $c_{11} = 1/56.1 = 0.0178$. Since S = 2.3973 and $t_{0.025,\,8} = 2.306$, it follows that the desired 95
percent confidence interval is
\[ \begin{aligned} \left[ b_1 \pm t_{\alpha/2} \, S \sqrt{c_{11}} \right] &= \left[ -1.19 \pm 2.306(2.3973)\sqrt{0.0178} \right] \\ &= [-1.19 \pm 0.7381] \\ &= [-1.9281, \; -0.4519] \end{aligned} \]
The interval does not contain the value $\beta_1 = 0$ hypothesized under $H_o$. This implies that the
slope value is significantly different from zero. Therefore, the independent variable X is
significant in the model.
Self-Assessment Questions
Exercise 5.3
Refer to the data in Exercise 4.1, for Questions 3 and 4. In each relevant case, calculate
a 95 percent confidence interval
(a) for the slope of the regression line for demand for Toothgate.
(b) for demand when the price is GH¢4.40.
(c) for demand when the price difference is GH¢0.50.
In each case, comment on your result.
Objectives
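\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{5.6} \]
\[ \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} (x_i - \bar{x}) Y_i - \bar{Y} \sum_{i=1}^{n} (x_i - \bar{x}) \]
But $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
Therefore,
\[ \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} (x_i - \bar{x}) Y_i \]
So Equation (5.6) may be written simply as
\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) Y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{5.7} \]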
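With $k_i = \dfrac{x_i - \bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, so that $b_1 = \sum_{i=1}^{n} k_i Y_i$,
\[ \begin{aligned} E(b_1) &= E\left( \sum_{i=1}^{n} k_i Y_i \right) \\ &= \sum_{i=1}^{n} k_i E(Y_i) \\ &= \sum_{i=1}^{n} k_i (\beta_o + \beta_1 x_i) \\ &= \beta_o \sum_{i=1}^{n} k_i + \beta_1 \sum_{i=1}^{n} k_i x_i \end{aligned} \]
Thus,
\[ E(b_1) = \beta_1 \sum_{i=1}^{n} k_i x_i \]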
By the argument preceding Equation (5.7), we have
\[ \sum_{i=1}^{n} k_i x_i = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 1 \]
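Substituting for $k_i$,
\[ Var(b_1) = \sigma^2 \sum_{i=1}^{n} \left[ \frac{x_i - \bar{x}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]^2 = \frac{\sigma^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]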
Using the sample standard error as an estimate of $\sigma$, we obtain an estimate of the standard
error of $b_1$ by taking the square root of the variance
\[ s^2(b_1) = \frac{MSE}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{5.9} \]
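The mean and variance of $b_1$ derived above can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the design points are the group sizes of Example 4.1, while the values of `beta_o`, `beta_1` and `sigma` are hypothetical and chosen purely for the demonstration. Repeated samples are generated from the model, the slope is computed each time as $\sum k_i Y_i$, and the simulated mean and variance are compared with $\beta_1$ and $\sigma^2 / \sum (x_i - \bar{x})^2$.

```python
import random
import statistics

# Design points: group sizes from Example 4.1.
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
# Hypothetical population parameters, assumed only for this illustration.
beta_o, beta_1, sigma = 14.0, -1.2, 2.0

x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)
k = [(xi - x_bar) / sxx for xi in x]    # weights k_i, so that b1 = sum(k_i * Y_i)

def simulate_slopes(reps, seed=0):
    """Draw `reps` samples from the model and return the fitted slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        ys = [beta_o + beta_1 * xi + rng.gauss(0.0, sigma) for xi in x]
        slopes.append(sum(ki * yi for ki, yi in zip(k, ys)))
    return slopes

slopes = simulate_slopes(20000)
mean_b1 = statistics.fmean(slopes)       # should be close to beta_1
var_b1 = statistics.pvariance(slopes)    # should be close to sigma^2 / sxx
theory_var = sigma ** 2 / sxx
print(round(mean_b1, 3), round(var_b1, 4), round(theory_var, 4))
```

The simulated values only approximate the theoretical ones; the agreement improves as the number of replications increases.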
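\[ \begin{aligned} Var(b_o) &= Var\left( \frac{1}{n} \sum_{i=1}^{n} Y_i - \bar{x} \sum_{i=1}^{n} k_i Y_i \right) \\ &= Var\left( \sum_{i=1}^{n} \left( \frac{1}{n} - \bar{x} k_i \right) Y_i \right) \\ &= \sum_{i=1}^{n} \left( \frac{1}{n} - \bar{x} k_i \right)^2 Var(Y_i) \end{aligned} \]
Expanding the bracket and noting that $\sum_{i=1}^{n} k_i = 0$,
\[ \begin{aligned} Var(b_o) &= \sigma^2 \sum_{i=1}^{n} \left( \frac{1}{n^2} - \frac{2}{n} \bar{x} k_i + \bar{x}^2 k_i^2 \right) \\ &= \sigma^2 \left( \frac{1}{n} + \bar{x}^2 \sum_{i=1}^{n} k_i^2 \right) \\ &= \sigma^2 \left( \frac{1}{n} + \bar{x}^2 \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2} \right) \\ &= \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right) \end{aligned} \]
Noting again that $\bar{x} = \frac{1}{n} \sum x_i$, the result is further simplified as
\[ Var(b_o) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Thus, an estimate of the standard error of $b_o$ is the square root of the variance
\[ s^2(b_o) = \frac{MSE \sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{5.10} \]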
\[ \hat{Y}_o = b_o + b_1 x_o \]
Note that for the same values of x, we expect sample-to-sample variation in $\hat{Y}_o$. We
seek to estimate $E(\hat{Y}_o)$ and $Var(\hat{Y}_o)$ at $X = x_o$.
Thus, the observed mean estimate $\hat{Y}_o$ is an unbiased estimate of the population mean of Y
at a specific value $X = x_o$.
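\[ \begin{aligned} Var(\hat{Y}_o) &= Var\left( \frac{1}{n} \sum_{i=1}^{n} Y_i + (x_o - \bar{x}) \sum_{i=1}^{n} k_i Y_i \right) \\ &= Var\left( \sum_{i=1}^{n} \left[ \frac{1}{n} + (x_o - \bar{x}) k_i \right] Y_i \right) \\ &= \sum_{i=1}^{n} \left[ \frac{1}{n} + (x_o - \bar{x}) k_i \right]^2 Var(Y_i) \\ &= \sigma^2 \left[ \frac{1}{n} + \frac{2(x_o - \bar{x})}{n} \sum_{i=1}^{n} k_i + (x_o - \bar{x})^2 \sum_{i=1}^{n} k_i^2 \right] \\ &= \sigma^2 \left[ \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] \end{aligned} \]
Therefore, an estimate of the standard error of estimate of $\hat{Y}_o$ is given by the square root
of
\[ s^2(\hat{Y}_o) = MSE \left[ \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] \tag{5.11} \]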
In Session 6, you will realize that this result is simply the product of standard error of
the estimate of the regression model Yˆ and the distance of X = xo from previously
observed mean value x of X.
Self-Assessment Questions
Exercise 5.4
$\sum_{i=1}^{n} x_i^2$
Objectives
By the end of this session, you should be able to:
1. Present the analysis of variance of a simple regression analysis
2. Obtain the three types of variation using matrix approach
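\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + 2 \sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i). \]
The cross-product term can be written as
\[ \sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = \sum_{i=1}^{n} \hat{y}_i (y_i - \hat{y}_i) - \bar{y} \sum_{i=1}^{n} (y_i - \hat{y}_i) \]
From Property 1 of the estimated regression line in Session 2, the second sum
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i) = \sum_{i=1}^{n} e_i = 0. \]
For the first sum,
\[ \begin{aligned} \sum_{i=1}^{n} \hat{y}_i (y_i - \hat{y}_i) &= \sum_{i=1}^{n} \hat{y}_i e_i \\ &= \sum_{i=1}^{n} (b_o + b_1 x_i) e_i \\ &= b_o \sum_{i=1}^{n} e_i + b_1 \sum_{i=1}^{n} x_i e_i \\ &= 0 \end{aligned} \]
since $\sum_{i=1}^{n} e_i = 0$ and $\sum_{i=1}^{n} x_i e_i = 0$ by Properties 1 and 4, respectively, of Session 2.
Therefore, we have
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{5.12} \]
$\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares (SST)
$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is the regression sum of squares (SSR)
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the error sum of squares (SSE)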
Now if the slope of the estimated regression line is zero, i.e., under H o , then SSR = 0 .
On the other hand, if SSR is large and deviates far from 0, the linear term b1 x accounts
for a large variation in the observations.
An important point to note is that in the computation of these variations, we make use
of estimates of the sample which leads to a loss of various degrees of freedom. For
example, for SST, we use y , which comes with loss of 1 degree of freedom. Thus, for
the SST there are n − 1 degrees of freedom. Similarly, for the SSE, there are n − 2
degrees of freedom. Since the degrees of freedom are additive, from Equation (5.12),
we should have
df ( SSR) = df ( SST ) − df ( SSE )
Thus, SSR has 1 degree of freedom. It will be observed later that the degrees of freedom
of the SSR are always equal to the number of variables in the model.
To conduct the test of $H_o$, we have stated the statistic as
\[ F = \frac{SSR / 1}{SSE / (n - 2)} = \frac{MSR}{MSE} \]
which has the F distribution with 1 and n − 2 degrees of freedom.
Table 5.2 shows the general layout of the analysis of variance for the simple linear
model.
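Source      Degrees of freedom  Sum of squares                                Mean square                                          F
Regression  1                   $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$      $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / 1$         $\dfrac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / 1}{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - 2)}$
Error       n − 2               $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$          $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - 2)$
Total       n − 1               $\sum_{i=1}^{n} (y_i - \bar{y})^2$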
Example 5.5
Refer to the data in Example 4.2.
(a) Present the analysis of variance for the model for amount of electricity consumption
in terms of temperature.
(b) Comment on your analysis.
Solution
(a) We test the null hypothesis that there is no linear relationship between amount
spent on electricity and temperature level against the alternative that there is linear
relationship at α = 0.05 .
We know that
$$SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = 82738 - \frac{1}{8}(798)^2 = 3137.5$$
SSE = 263.741,
SSR = 3137.5 − 263.741 = 2873.759
$F_{0.05,\,1,\,6} = 5.99$
(b) Since F = 65.3769 > F0.05, 1, 6 = 5.99 , the null hypothesis of no linear regression is
rejected, and we conclude that the amount spent on electricity is linearly influenced
by temperature.
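The arithmetic behind this analysis of variance can be checked with a short script. This is a minimal sketch: the function name `simple_regression_anova` is a hypothetical helper introduced for illustration, and the inputs are the SST, SSE and n quoted in the solution above.

```python
def simple_regression_anova(sst, sse, n):
    """Return (SSR, MSR, MSE, F) for a simple linear regression."""
    ssr = sst - sse          # Equation (5.12): SST = SSR + SSE
    msr = ssr / 1            # regression has 1 degree of freedom
    mse = sse / (n - 2)      # error has n - 2 degrees of freedom
    return ssr, msr, mse, msr / mse


# Values quoted in Example 5.5 (n = 8 observations).
ssr, msr, mse, f_stat = simple_regression_anova(sst=3137.5, sse=263.741, n=8)
print(round(ssr, 3), round(mse, 4), round(f_stat, 4))
```

The computed F can then be compared with the tabulated value 5.99, as in part (b).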
expect some relationship to exist between the two tests; and there is actually a
relationship between them. To establish the relationship, we note that since the
estimated regression line is
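$$\hat{y}_i = \bar{y} + b_1 (x_i - \bar{x})$$
or
$$\hat{y}_i - \bar{y} = b_1 (x_i - \bar{x})$$
By squaring and summing both sides over all $i = 1, 2, \ldots, n$ we obtain
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (5.13)$$
Also, the estimated variance of $b_1$ is
$$s^2(b_1) = \frac{MSE}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Thus, $MSE = s^2(b_1) \sum_{i=1}^{n} (x_i - \bar{x})^2$.
Then,
$$F = \frac{MSR}{MSE} = \frac{b_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{s^2(b_1) \sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{b_1^2}{s^2(b_1)} = \left(\frac{b_1}{s(b_1)}\right)^2$$
That is, the F statistic is the square of the t statistic for testing $H_o: \beta_1 = 0$.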
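The relationship between the two tests can be verified numerically. The sketch below uses the group-size data of Example 4.1 purely as an illustration; the variable names are assumptions made for this example.

```python
from math import sqrt

x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]    # size of group
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]    # number of hours
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
ssr = b1 ** 2 * sxx                # Equation (5.13)

f_stat = ssr / mse                 # MSR / MSE, with 1 df for regression
t_stat = b1 / sqrt(mse / sxx)      # b1 / s(b1)
print(f_stat, t_stat ** 2)         # the two printed values should agree
```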
Self-Assessment Questions
Exercise 5.3
1. Refer to Example 4.1.
The data on the number of hours spent by ten groups of workers on similar jobs is
as shown.
Size of group (x)    5    8    4    6   10    3    5    9   10    7
No. of Hrs (y)      10    7   13    4    1    7    8    3    2    5
Conduct analysis of variance on the linear model for hours spent on the job by the
number of workers in the various groups.
x 1 2 3 4 5
y 125 140 165 195 230
(a) Conduct analysis of variance for the linear model for profit in terms of the year
of operation.
(b) Test the significance of the year of operation in the model.
(c) Comment on your results in (a) and (b).
Objectives
By the end of this session, you should be able to:
1. Derive the least squares simple linear regression using matrix approach
2. Obtain the three types of variation using matrix approach
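$$\begin{aligned}
y_1 &= \beta_o + \beta_1 x_1 \\
y_2 &= \beta_o + \beta_1 x_2 \\
&\ \ \vdots \\
y_n &= \beta_o + \beta_1 x_n
\end{aligned}$$
In matrix form,
$$Y = X\beta \qquad (5.14)$$
where
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_o \\ \beta_1 \end{pmatrix} \qquad (5.15)$$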
Notice that
Y is n × 1 vector of observed values on Y
X is n × (k + 1) data matrix, k = 1
β is (k + 1) × 1 , k = 1
In the definition of the dimension of the matrices/vectors above, k is the number of
variables in the model; in this case, k = 1 .
For least squares estimation of the parameters, the normal equations are obtained as
follows:
Pre-multiplying (5.14) by $X'$, we obtain
$$(X'X)\beta = X'Y$$
where $X'X$ is a 2 × 2 matrix, or generally a $(k + 1) \times (k + 1)$ matrix.
If $X'X$ has an inverse, we then obtain the solution for the vector $\beta$ as
$$\beta = (X'X)^{-1} X'Y \qquad (5.16)$$
CoDEUCC/Post-Diploma in Mathematics and Science Education
INFERENCE ABOUT SIMPLE LINEAR REGRESSION UNIT 5
SESSION 6
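$$(X'X)^{-1} = \frac{1}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix} \qquad (5.18)$$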
Example 5.6
The data on the number of hours spent by ten groups of workers on similar jobs is as
shown.
Size of group (x)    5    8    4    6   10    3    5    9   10    7
No. of Hrs (y)      10    7   13    4    1    7    8    3    2    5
Use the matrix approach to obtain the regression equation for the number of hours spent in
terms of the number of workers in the group.
Solution
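$$X'X = \begin{pmatrix} 10 & 67 \\ 67 & 505 \end{pmatrix}, \qquad
(X'X)^{-1} = \begin{pmatrix} 0.9002 & -0.1194 \\ -0.1194 & 0.0178 \end{pmatrix}, \qquad
X'Y = \begin{pmatrix} 60 \\ 335 \end{pmatrix}$$
so that
$$\beta = (X'X)^{-1} X'Y = \begin{pmatrix} 14.0018 \\ -1.1943 \end{pmatrix}$$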
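The matrix computation in this example can be reproduced in a few lines. The following is a minimal sketch using plain Python lists rather than a matrix library; the variable names are illustrative assumptions, and the inverse is formed directly from Equation (5.18).

```python
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]
n = len(x)

sum_x = sum(x)
sum_x2 = sum(xi * xi for xi in x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# X'X = [[n, sum_x], [sum_x, sum_x2]] and X'Y = [sum_y, sum_xy]
det = n * sum_x2 - sum_x ** 2
xtx_inv = [[sum_x2 / det, -sum_x / det],
           [-sum_x / det, n / det]]              # Equation (5.18)
xty = [sum_y, sum_xy]

# Equation (5.16): b = (X'X)^{-1} X'Y
b = [sum(xtx_inv[r][c] * xty[c] for c in range(2)) for r in range(2)]
print([round(v, 4) for v in b])
```

The two entries of `b` are the intercept and slope of the fitted line.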
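$$S^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}$$
Taking the numerator,
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - 2\sum_{i=1}^{n} y_i \hat{y}_i + \sum_{i=1}^{n} \hat{y}_i^2 \qquad (5.19)$$
Since $\sum y_i \hat{y}_i = \sum \hat{y}_i^2$ for the least squares fit, this reduces to
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} \hat{y}_i^2 \qquad (5.20)$$
Noting that
$$\sum_{i=1}^{n} y_i^2 = Y'Y \quad \text{and} \quad \sum_{i=1}^{n} \hat{y}_i^2 = (X\beta)'(X\beta) = \beta'(X'X)\beta$$
and that $(X'X)\beta = X'Y$ from the normal equations, we obtain
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = Y'Y - \beta'X'Y$$
Now, $Y'Y = 486$ and
$$\beta'X'Y = \begin{pmatrix} 14.0018 & -1.1943 \end{pmatrix} \begin{pmatrix} 60 \\ 335 \end{pmatrix} = 440.0178$$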
Therefore,
noting that in simple linear regression, there is only one variable. Expanding and
simplifying, we obtain
$$SSR = \sum_{i=1}^{n} \hat{y}_i^2 - n\bar{y}^2$$
In vector form
$$SSR = \beta'X'Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 \qquad (5.22)$$
For example, for the data above, $\beta'X'Y = 440.0178$ and $\sum_{i=1}^{n} y_i = 60$.
Therefore, $SSR = 440.0178 - \frac{1}{10}(60)^2 = 80.0178$
When we divide SSR by the number of variables in the model, we obtain the regression
mean square (MSR). Note that in simple linear regression, there is only one variable.
$$SST = Y'Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 \qquad (5.23)$$
You will find these matrix representations very useful in multiple regression.
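The three types of variation in matrix form can be computed together. This sketch uses the data of Example 5.6 and assumes only the formulas quoted above; the names `yty` and `b_xty` are illustrative stand-ins for Y′Y and β′X′Y.

```python
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Least squares coefficients from the normal equations (X'X) b = X'Y.
det = n * sum_x2 - sum_x ** 2
b0 = (sum_x2 * sum_y - sum_x * sum_xy) / det
b1 = (n * sum_xy - sum_x * sum_y) / det

yty = sum(yi * yi for yi in y)          # Y'Y
b_xty = b0 * sum_y + b1 * sum_xy        # b'X'Y
correction = sum_y ** 2 / n             # (1/n)(sum of y)^2

sse = yty - b_xty                       # error sum of squares
ssr = b_xty - correction                # Equation (5.22)
sst = yty - correction                  # Equation (5.23)
print(round(sse, 4), round(ssr, 4), round(sst, 4))
```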
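$$\begin{aligned}
x_o'(X'X)^{-1}x_o
&= \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\begin{pmatrix} 1 & x_o \end{pmatrix}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}
\begin{pmatrix} 1 \\ x_o \end{pmatrix} \\
&= \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\begin{pmatrix} \sum x_i^2 - x_o\sum x_i & \ -\sum x_i + n x_o \end{pmatrix}
\begin{pmatrix} 1 \\ x_o \end{pmatrix} \\
&= \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\left( \sum x_i^2 - x_o\sum x_i - x_o\sum x_i + n x_o^2 \right) \\
&= \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\left( \sum x_i^2 - 2x_o\sum x_i + n x_o^2 \right) \\
&= \frac{\sum (x_i - x_o)^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}
\end{aligned}$$
Note that
$$\sum (x_i - x_o)^2 = \sum (x_i - \bar{x} + \bar{x} - x_o)^2 = \sum (x_i - \bar{x})^2 + n(x_o - \bar{x})^2$$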
Making substitutions,
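$$\begin{aligned}
x_o'(X'X)^{-1}x_o
&= \frac{\sum x_i^2 - n\bar{x}^2 + n(x_o - \bar{x})^2}{n\left(\sum x_i^2 - n\bar{x}^2\right)} \\
&= \frac{\sum x_i^2 - n\bar{x}^2}{n\left(\sum x_i^2 - n\bar{x}^2\right)} + \frac{n(x_o - \bar{x})^2}{n\left(\sum x_i^2 - n\bar{x}^2\right)} \\
&= \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum x_i^2 - n\bar{x}^2}
\end{aligned}$$
That is,
$$x_o'(X'X)^{-1}x_o = \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum x_i^2 - n\bar{x}^2} \qquad (5.24)$$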
This result reflects the distance of $X = x_o$ from the previously observed mean value $\bar{x}$
of X.
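Equation (5.24) can be checked numerically against the quadratic form it was derived from. The sketch below uses the x values of Example 5.6; the trial points 3, 6.7 and 12 are arbitrary values chosen for illustration.

```python
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
n = len(x)
sum_x = sum(x)
sum_x2 = sum(xi * xi for xi in x)
x_bar = sum_x / n

det = n * sum_x2 - sum_x ** 2
xtx_inv = [[sum_x2 / det, -sum_x / det],
           [-sum_x / det, n / det]]               # Equation (5.18)


def quad_form(x0):
    """Evaluate x_o' (X'X)^{-1} x_o directly, with x_o = (1, x0)'."""
    v = [1.0, x0]
    return sum(v[r] * xtx_inv[r][c] * v[c] for r in range(2) for c in range(2))


def closed_form(x0):
    """Right-hand side of Equation (5.24)."""
    return 1 / n + (x0 - x_bar) ** 2 / (sum_x2 - n * x_bar ** 2)


for x0 in (3, 6.7, 12):
    print(x0, quad_form(x0), closed_form(x0))
```

The value is smallest (1/n) at the mean of x and grows as the point moves away from it.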
Self-Assessment Questions
Exercise 5.6
1. Let the following matrices be defined as follows:
Y is n × 1 vector of observed values on Y
X is n × (k + 1) data matrix
β is (k + 1) × 1 ,
In the definition of the dimension of the matrices/vectors above, k is the number of
variables in the model. For the simple linear model,
state the dimensions of the following products:
(a) β′X′Y
(b) ( X′X) −1 X′Y
2. The data on sales price, y, (in thousands of Ghana Cedis) of a house and home size,
x, (in tens of square feet) in Exercise 4.1, Question 2, is given in the table.
x 23 11 20 17 15 21 24 13 19 25
y 360 196.2 346.2 273 282 331.8 387 255.6 327 345
MULTIPLE LINEAR REGRESSION AND ANALYSIS
OF VARIANCE UNIT 6
Objectives
By the end of the unit you should be able to:
1. derive the general linear regression for given data on several variables;
2. interpret the coefficients of the derived model;
3. assess a derived regression model by using various measures of quality;
4. assess the quality of a regression model by conducting relevant statistical tests;
5. determine regression model in terms of indicator variables;
6. interpret the coefficients of dummy variable regression;
7. fit a polynomial model to a given data;
8. assess the suitability of the fitted model;
9. perform one-way analysis of variance on a given suitable data;
10. determine exact pairs of classes of samples that are different after establishing
that such differences exist;
11. perform a simple two-way analysis of variance on a given suitable data;
12. perform a two-way analysis of variance with interaction on a given suitable data.
Objectives
By the end of this session, you should be able to:
1. Derive the general linear regression for given data on several variables.
2. Interpret the coefficients of the derived model
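$$\begin{aligned}
Y_1 &= \beta_o + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_j x_{1j} + \cdots + \beta_p x_{1p} + \varepsilon_1 \\
Y_2 &= \beta_o + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_j x_{2j} + \cdots + \beta_p x_{2p} + \varepsilon_2 \\
&\ \ \vdots \\
Y_i &= \beta_o + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_j x_{ij} + \cdots + \beta_p x_{ip} + \varepsilon_i \\
&\ \ \vdots \\
Y_n &= \beta_o + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_j x_{nj} + \cdots + \beta_p x_{np} + \varepsilon_n
\end{aligned}$$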
MULTIPLE LINEAR REGRESSION AND
UNIT 6
ANALYSIS OF VARIANCE SESSION 1
Example 6.1
Refer to the data in Example 4.1. Suppose it is suspected that the length of experience
of the members in a group that work on the given job also influences the time spent on
carrying out the job. The data including the average years of experience of the group is
as shown.
Table 6.1: Time spent on completing a piece of job and influential variables
Size of Group (x1)    Av. Years of Experience (x2)    No. of Hrs (y)
        5                        5.5                        10
        8                        7.5                         7
        4                        4.0                        13
        6                        9.0                         4
       10                        9.0                         1
        3                        7.0                         7
        5                        5.0                         8
        9                        8.0                         3
       10                        7.5                         2
        7                        8.0                         5
Obtain the regression equation for the number of hours spent in terms of the number of
workers in the group and group average years of experience.
Solution
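The normal equations $(X'X)b = X'Y$ for the data in Table 6.1 can be set up and solved numerically. The following is a minimal sketch, not the course book's own working: `solve` is a hypothetical helper written for this illustration (Gauss-Jordan elimination on plain lists), and the other names are likewise assumptions.

```python
def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [list(map(float, row)) + [float(v)] for row, v in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


x1 = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]                     # size of group
x2 = [5.5, 7.5, 4.0, 9.0, 9.0, 7.0, 5.0, 8.0, 7.5, 8.0]   # av. years of experience
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]                      # number of hours

rows = [[1.0, a, b] for a, b in zip(x1, x2)]              # data matrix X
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]

b = solve(xtx, xty)                                       # (X'X) b = X'Y
print([round(v, 4) for v in b])
```

The entries of `b` are the estimates of the intercept and of the coefficients of x1 and x2, in that order.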
To interpret $\beta_1$ and $\beta_2$, suppose that in the first instance, the number of workers
involved is c with average experience d. Then the average hours taken to do the job is
$$Y_{x_1 = c} = \beta_o + \beta_1 c + \beta_2 d$$
Again let the next group be made up of $c + 1$ members with average experience d. Then
the time taken on the job is
$$Y_{x_1 = c+1} = \beta_o + \beta_1 (c + 1) + \beta_2 d$$
The difference between the two is
$$Y_{x_1 = c+1} - Y_{x_1 = c} = \beta_1$$
Thus, $\beta_1$ is interpreted as the difference or change in the average time on the job if one
additional worker is engaged (or for a unit increase in number of workers), whilst
holding the value of the other variable unchanged. Therefore, for our example, if one
additional person is engaged, and does not cause any change in the original average
experience of the group, then the time taken to complete the job is expected to reduce by
0.5861 hrs (or about 35 mins).
Now suppose that the number of workers involved is q with average experience r.
Then the average hours taken to do the job is
$$Y_{x_2 = r} = \beta_o + \beta_1 q + \beta_2 r$$
Again let the next group be made up of q members with average experience $r + 1$.
Then the time taken on the job is
$$Y_{x_2 = r+1} = \beta_o + \beta_1 q + \beta_2 (r + 1)$$
The difference between the two is
$$Y_{x_2 = r+1} - Y_{x_2 = r} = \beta_2$$
Thus, $\beta_2$ is interpreted as the difference or change in the average time on the job if one
additional year of experience is gained by the workers engaged, whilst holding the
number of workers unchanged. Therefore, for our example, if the group of workers
gains one additional year of experience, whilst their number remains unchanged, then
the time taken to complete the job is expected to reduce by 1.4129 hrs.
Self-Assessment Questions
Exercise 6.1
1. Refer to the data in Exercise 4.1, Question 2.
Suppose that the rating of the house is also believed to influence the price of the
house. Each house is therefore rated based on its ‘pleasant appearance’ on a scale
of 1 – 10, where 1 represents worst and 10 represents best. The data including the
new variable is presented in Table 6.2.
Table 6.2: Data on sales price of houses and
influential variables
Sales price Home Size Pleasant Rating
360 23 5
196.2 11 2
346.2 20 9
273 17 3
282 15 8
331.8 21 4
387 24 7
255.6 13 6
327 19 7
345 25 2
(a) Obtain the matrix X′X and its inverse.
(b) Hence, find the regression coefficients vector β and write down the regression
model for y.
(c) Interpret all the coefficients of your model.
MULTIPLE LINEAR REGRESSION AND ANALYSIS UNIT 6
OF VARIANCE SESSION 2
Objectives
By the end of this session, you should be able to:
1. assess a derived regression model by using various measures of quality.
2. assess the quality of a regression model by conducting relevant statistical tests.
$$S^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - (p + 1)}$$
In matrix notation, the MSE may be written as
$$S^2 = \frac{Y'Y - \beta'X'Y}{n - (p + 1)} \qquad (6.5)$$
The standard error of the estimate of the regression model is the square root of the
expression in (6.5)
For example, for the problem in Example 6.1, we obtain
Y ′Y = 486 , β ′X′Y = 470.6175
Therefore,
$$r^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
which is the ratio of the explained variation to the total variation. It measures the
proportion of the total variation in the n values of Y that is explained by the overall
regression model. The value $r = \sqrt{r^2}$ is the multiple correlation coefficient.
Recall that in matrix form, the SSR and SST are given, respectively, by
$$SSR = b'X'Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$
$$SST = Y'Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$
Example 6.2
Refer to Example 6.1.
Find the multiple coefficient of determination. Comment on your result.
Solution
From Example 6.1, $\beta'X'Y = 470.6175$ and $\sum_{i=1}^{n} y_i = 60$.
$$SSR = 470.6175 - \frac{1}{10}(60)^2 = 110.6175, \text{ and}$$
$$SST = 486.0 - \frac{1}{10}(60)^2 = 126$$
Therefore, the multiple coefficient of determination is
$$r^2 = \frac{110.6175}{126} = 0.8779$$
Thus, the entire model explains 87.79 percent of variation in the amount of time taken
to perform a task.
$$Y = b_o + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$$
The test of significance of the model is based on the hypothesis
$$H_o: \beta_1 = \beta_2 = \cdots = \beta_p = 0$$
against the alternative
$$H_a: \beta_j \neq 0, \text{ for some } j = 1, 2, \ldots, p$$
The null hypothesis says that none of the variables $(x_1, x_2, x_3, \ldots, x_p)$ has an effect on the
response variable Y. The alternative hypothesis, on the other hand, suggests that at least
one of the variables is influential.
The test statistic for the test is given by
$$F = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 / p}{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 / [n - (p + 1)]}$$
That is, F is the ratio of the mean square regression to the mean square error. The
statistic has the F distribution with p and $n - (p + 1)$ degrees of freedom.
A large value of F indicates that the model accounts for a large part of the variation in Y, implying
that the model is useful. Thus, we reject $H_o$ for a large value of F compared to a critical
value of $F_{0.05,\, p,\, n-(p+1)}$.
Example 6.3
Refer to Example 6.1.
Test the significance of the overall model for determining the amount of time spent on
the job by the number of workers involved and their average years of experience.
Solution
The test is based on the hypothesis that
H o : β1 = β 2 = 0 against H a : β j ≠ 0, for some j = 1, 2
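We know that the mean square error is
$$S^2 = \frac{Y'Y - \beta'X'Y}{n - 3} = \frac{15.3825}{7} = 2.1975$$
and the mean square regression is
$$MSR = \frac{b'X'Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2}{p} = \frac{110.6175}{2} = 55.3088$$
Source        df    SS          MS         F
Regression     2    110.6175    55.3088    25.1690
Error          7     15.3825     2.1975
Total          9    126.0000
$F_{2,\,7,\,(0.05)} = 4.74$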
(b) Since F = 25.1690 > F0.05, 2, 7 = 4.74 , the null hypothesis of no linear regression is
rejected, and we conclude that the time spent on a piece of job is linearly
influenced by number of workers on the job and the group average experience.
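The quantities used in Examples 6.2 and 6.3 follow from three numbers: Y′Y, b′X′Y and the sum of the y values. The sketch below recomputes them; `overall_fit` is a hypothetical helper name introduced for this illustration.

```python
def overall_fit(yty, b_xty, sum_y, n, p):
    """Return r^2, MSR, MSE and F from the matrix-form sums of squares."""
    correction = sum_y ** 2 / n
    ssr = b_xty - correction              # regression sum of squares
    sst = yty - correction                # total sum of squares
    sse = sst - ssr                       # error sum of squares
    msr = ssr / p
    mse = sse / (n - (p + 1))
    return {"r2": ssr / sst, "MSR": msr, "MSE": mse, "F": msr / mse}


# Values quoted for Example 6.1: n = 10 observations, p = 2 variables.
result = overall_fit(yty=486.0, b_xty=470.6175, sum_y=60, n=10, p=2)
print({k: round(v, 4) for k, v in result.items()})
```

The F value is then compared with the tabulated critical value with 2 and 7 degrees of freedom.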
Example 6.4
Refer to Example 6.1.
Test the significance of each of the independent variables.
Solution
The model is obtained as
y = 19.8875 − 0.5861x1 − 1.4129 x 2
Example 6.5
Refer to Example 6.1.
Determine the 95 percent confidence interval for β1 and β 2 .
Comment on your result in each case.
Solution
For the problem in Example 6.1, it is known that the model is
$$Y = 19.8875 - 0.5861 x_1 - 1.4129 x_2$$
and s = 1.4824. From the matrix
$$(X'X)^{-1} = \begin{pmatrix} 2.0323 & -0.0024 & -0.2718 \\ -0.0024 & 0.0299 & -0.0281 \\ -0.2718 & -0.0281 & 0.0652 \end{pmatrix}$$
$c_{11} = 0.0299$ and $c_{22} = 0.0652$ based on n = 10.
The 95 percent confidence interval for $\beta_1$ is given as
$$\begin{aligned}
b_1 \pm t_{0.025,\,7}\, se(b_1) &= -0.5861 \pm 2.365\left(1.4824\sqrt{0.0299}\right) \\
&= -0.5861 \pm 0.6062 \\
&= (-1.1923,\ 0.0201)
\end{aligned}$$
The interval is obtained by a method of estimation that, over repeated samples, yields
intervals containing $\beta_1$ in 95 percent of cases. Since the interval contains 0, the
hypothesis $H_o: \beta_1 = 0$ is not rejected at the 5 percent level. It implies that the
variable $X_1$ (the number of workers engaged) is not statistically significant in the
determination of Y (the amount of time spent on the job).
The 95 percent confidence interval for $\beta_2$ is given as
$$\begin{aligned}
b_2 \pm t_{0.025,\,7}\, se(b_2) &= -1.4129 \pm 2.365\left(1.4824\sqrt{0.0652}\right) \\
&= -1.4129 \pm 0.8952 \\
&= (-2.3081,\ -0.5177)
\end{aligned}$$
As before, the interval is obtained by a method that, over repeated samples, yields
intervals containing $\beta_2$ in 95 percent of cases. Since the interval does not contain 0,
the hypothesis $H_o: \beta_2 = 0$ is rejected at the 5 percent level. It implies that the
variable $X_2$ (average years of experience of workers engaged) is statistically significant
in the determination of Y (the amount of time spent on the job).
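Both intervals follow the same recipe, $b_j \pm t \cdot s\sqrt{c_{jj}}$, so they can be computed with one small function. This is a sketch under the values quoted above; `conf_interval` is a hypothetical helper name, and the critical value 2.365 is taken from the text rather than computed.

```python
from math import sqrt


def conf_interval(b, t_crit, s, c_jj):
    """Return (lower, upper) for b +/- t_crit * s * sqrt(c_jj)."""
    half_width = t_crit * s * sqrt(c_jj)
    return b - half_width, b + half_width


ci_b1 = conf_interval(b=-0.5861, t_crit=2.365, s=1.4824, c_jj=0.0299)
ci_b2 = conf_interval(b=-1.4129, t_crit=2.365, s=1.4824, c_jj=0.0652)

for name, (lo, hi) in (("beta1", ci_b1), ("beta2", ci_b2)):
    contains_zero = lo <= 0 <= hi
    print(name, round(lo, 4), round(hi, 4), "contains 0:", contains_zero)
```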
Self-Assessment Questions
Exercise 6.2
Objectives
Without using the sex of the individual, the regression model for expenditure in terms
of income is given in Table 6.4.
Table 6.4: Regression Model for Expenditure
Predictor Coef SE Coef T P
Constant 72.42 41.82 1.73 0.103
Income 0.3614 0.1114 3.24 0.005
S = 34.4081 R-Sq = 39.7% R-Sq(adj) = 35.9%
Notice that this model explains only about 40 percent of variation in the data. In
addition, the standard error is high. These suggest that this model could be improved.
Suppose we have some reason to believe that the sex of the individual also plays a part
in determining expenditure. Our suspicion is based on the scatter plot of the data (see
Figure 6.1) using the level of sex (i.e., male and female) as the grouping variables. In
fact, we should say that we are using Male (M) and Female (F) as indicators of the
variable Sex.
[Figure 6.1: Scatter plot of expenditure (Exp) against Income, with points grouped by Sex (F, M)]
The graph shows that males have the tendency to spend more than their female
counterparts. It implies therefore that the level of one’s expenditure may be influenced by
one’s sex. Based on this, we incorporate sex differences in the model for expenditure. In
order to put the qualitative variable, Sex, on an acceptable scale, we have to create
‘indicator’ variables for it. Since there are only two levels of sex, the inclination will be
to define two indicators. However, this would pose a problem for the inversion of the
matrix $X'X$. To avoid this, we create only one indicator, say for female, as
$$F = \begin{cases} 1, & \text{if individual is a female} \\ 0, & \text{otherwise} \end{cases}$$
In general, if a qualitative variable has m levels it can be represented by $m - 1$ indicator
variables, each assigned the values 0 and 1.
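$$Y = \begin{pmatrix}
150 \\ 180 \\ 150 \\ 195 \\ 210 \\ 125 \\ 186 \\ 258 \\ 223 \\
198 \\ 260 \\ 255 \\ 214 \\ 197 \\ 200 \\ 183 \\ 229 \\ 294
\end{pmatrix}, \qquad
X = \begin{pmatrix}
1 & 220 & 0 \\ 1 & 277 & 0 \\ 1 & 303 & 1 \\ 1 & 326 & 0 \\ 1 & 332 & 1 \\ 1 & 333 & 1 \\
1 & 339 & 1 \\ 1 & 344 & 0 \\ 1 & 346 & 0 \\ 1 & 352 & 1 \\ 1 & 356 & 0 \\ 1 & 413 & 0 \\
1 & 418 & 0 \\ 1 & 419 & 0 \\ 1 & 431 & 1 \\ 1 & 431 & 1 \\ 1 & 434 & 0 \\ 1 & 555 & 0
\end{pmatrix}$$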
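$$X'X = \begin{pmatrix} 18 & 6629 & 7 \\ 6629 & 2536697 & 2521 \\ 7 & 2521 & 7 \end{pmatrix}, \qquad
X'Y = \begin{pmatrix} 3707 \\ 1399384 \\ 1252 \end{pmatrix}$$
$$b = (X'X)^{-1}(X'Y) = \begin{pmatrix} 98.2535 \\ 0.3345 \\ -39.8716 \end{pmatrix}$$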
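The indicator coding and the resulting least squares fit can be reproduced as follows. This is a minimal sketch: the labels "M" and "F" are an assumed way of recording the sex of each individual (the matrix X above stores only the 0/1 indicator), and `solve` is a hypothetical helper written for this illustration.

```python
def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [list(map(float, row)) + [float(v)] for row, v in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# (expenditure, income, sex) for the 18 individuals in Y and X above.
data = [(150, 220, "M"), (180, 277, "M"), (150, 303, "F"), (195, 326, "M"),
        (210, 332, "F"), (125, 333, "F"), (186, 339, "F"), (258, 344, "M"),
        (223, 346, "M"), (198, 352, "F"), (260, 356, "M"), (255, 413, "M"),
        (214, 418, "M"), (197, 419, "M"), (200, 431, "F"), (183, 431, "F"),
        (229, 434, "M"), (294, 555, "M")]

# One indicator for a two-level variable: 1 for female, 0 otherwise.
rows = [[1, income, 1 if sex == "F" else 0] for _, income, sex in data]
y = [expenditure for expenditure, _, _ in data]

xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
b = solve(xtx, xty)

print(xtx, xty)
print([round(v, 4) for v in b])
```

Because the indicator takes only the values 0 and 1, the last diagonal entry of X′X is simply the number of females in the sample.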
Using MINITAB, we obtain the summary of the regression analysis result in addition to
the test as shown in Table 6.5.
Analysis of Variance
Source df SS MS F P
Regression 2 18994.0 9791.8 12.43 0.001
Residual Error 15 12409.0 827.299
Total 17 31403.0
Notice that sex and income are both significant in the model. Again, the inclusion of sex
has improved the quality of the model. This shows that sex differences for expenditure
exist and should not be disregarded. Notice also that the results from the use of table
values are confirmed by the p-values associated with the tests.
Notice that bo is the intercept of the model for males. It implies that bo (= 97.36)
represents the average expenditure for a male when no income is earned yet. For
females with no income average expenditure would be GH¢56.38.
Self-Assessment Questions
Exercise 4.3
1. The data on profit, in GH¢y, of a certain small scale business establishment in the
xth year of its operation in Exercise 2 of Section 1 is given in the table.
x 1 2 3 4 5
y 125 140 165 195 230
(a) Find the coefficient of determination for the model obtained for profit in terms
of the year.
(b) Deduce the coefficient of correlation between profit and the year of operation.
(c) Comment on your values in (a) and (b).
2. The data on sales price of a house (in thousands of Ghana Cedis), y, and home size
(in tens of square feet), x, in Exercise 4.1, Question 2, is given in the table.
x 23 11 20 17 15 21 24 13 19 25
y 360 196.2 346.2 273 282 331.8 387 255.6 327 345
(a) Find the coefficient of determination for the model obtained for sales price in
terms of home size.
(b) Deduce the coefficient of correlation between sales price and home size.
(c) Comment on your values in (a) and (b).
Objectives
By the end of this session, you should be able to:
1. Fit a polynomial model to a given data
2. Assess the suitability of the fitted model
where the coefficients β r (r = 0, 1, 2) are not all zero. The error term ε describes the
effects on y of all factors other than x and x 2 .
In a quadratic model, the mean of y is either
1. increasing at an increasing rate as x increases;
2. increasing at a decreasing rate as x increases;
3. decreasing at an increasing rate as x increases; or
4. decreasing at a decreasing rate as x increases.
[Figure: four panels of y plotted against x, illustrating the four cases listed above]
Note that although the relationship between y and x is not linear because of the squared
term x 2 , the model is a linear model. This is because the expression
y = β o + β1 x + β 2 x 2
Toothgate (see Exercise 6.5, Question 2). A plot of demand versus Ad. expenditure is
shown in Figure 6.3. The graph also shows the fitted quadratic curve.
[Figure 6.3: Scatter plot showing a quadratic relationship between Demand and Ad.
Expenditure (AdExp), with the fitted quadratic curve]
We see that a quadratic curve appears to fit the scatter plot quite well. This quite clearly
suggests that a quadratic model might be suitable for the data.
We can use the least squares method to determine the values of the coefficients,
β r (r = 0, 1, 2) .
Let the error sum of squares be given as
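$$Q = \sum_{i=1}^{n} (b_o + b_1 x_i + b_2 x_i^2 - y_i)^2$$
$$\frac{\partial Q}{\partial b_1} = 2\sum_{i=1}^{n} (b_o + b_1 x_i + b_2 x_i^2 - y_i)\, x_i = 0$$
$$b_o \sum x_i + b_1 \sum x_i^2 + b_2 \sum x_i^3 = \sum x_i y_i$$
$$\frac{\partial Q}{\partial b_2} = 2\sum_{i=1}^{n} (b_o + b_1 x_i + b_2 x_i^2 - y_i)\, x_i^2 = 0$$
$$b_o \sum x_i^2 + b_1 \sum x_i^3 + b_2 \sum x_i^4 = \sum x_i^2 y_i$$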
The system of equations is thus obtained in matrices as
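$$\begin{pmatrix}
n & \sum x_i & \sum x_i^2 \\
\sum x_i & \sum x_i^2 & \sum x_i^3 \\
\sum x_i^2 & \sum x_i^3 & \sum x_i^4
\end{pmatrix}
\begin{pmatrix} b_o \\ b_1 \\ b_2 \end{pmatrix}
=
\begin{pmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{pmatrix} \qquad (6.7)$$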
Example 6.6
Refer to the problem in Exercise 4.1, Question 1, which is on the profit, in GH¢y, of a
certain small scale business establishment in the xth year of its operation. The data is
given in the table.
x 1 2 3 4 5
y 125 140 165 195 230
(a) Fit a least squares quadratic regression model to the data
(b) Obtain a plot of the data with the fitted quadratic curve. Comment on the graph.
(c) Find the coefficient of multiple determination of the model.
(d) Find also the standard error of the estimate of the model.
Solution
(a) Since the data involve large values, let us use the transformation in Example 4.10.
That is,
$$x = 3 + u \ \text{ and } \ y = 165 + 5v \ \Rightarrow \ u = x - 3 \ \text{ and } \ v = \frac{y - 165}{5}$$
In order to solve the normal equations, we generate the relevant values in the table
shown.
     u     v    uv    u²   u²v    u³    u⁴
    −2    −8    16     4   −32    −8    16
    −1    −5     5     1    −5    −1     1
     0     0     0     0     0     0     0
     1     6     6     1     6     1     1
     2    13    26     4    52     8    16
Σ    0     6    53    10    21     0    34
$$\begin{pmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{pmatrix}
\begin{pmatrix} a_o \\ a_1 \\ a_2 \end{pmatrix} =
\begin{pmatrix} 6 \\ 53 \\ 21 \end{pmatrix}$$
From the equation, we see that
$$5a_o + 10a_2 = 6, \quad 10a_1 = 53, \quad \text{and} \quad 10a_o + 34a_2 = 21$$
Solving, $a_o = -0.0857$, $a_1 = 5.3$ and $a_2 = 0.6429$, so that
$$\frac{y - 165}{5} = -0.0857 + 5.3(x - 3) + 0.6429(x - 3)^2$$
Simplifying gives
$$y = 114.0 + 7.213x + 3.215x^2$$
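The same quadratic can be fitted without the change of variables by solving the normal equations (6.7) directly on x and y. The sketch below is an illustration only; `solve`, `power_sum` and `moment` are hypothetical helper names. The coefficients it prints should agree with those above up to rounding of the intermediate values.

```python
def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [list(map(float, row)) + [float(v)] for row, v in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


x = [1, 2, 3, 4, 5]
y = [125, 140, 165, 195, 230]


def power_sum(k):
    """Sum of x_i^k (equals n when k = 0)."""
    return sum(xi ** k for xi in x)


def moment(k):
    """Sum of x_i^k * y_i."""
    return sum(xi ** k * yi for xi, yi in zip(x, y))


lhs = [[power_sum(i + j) for j in range(3)] for i in range(3)]   # matrix in (6.7)
rhs = [moment(i) for i in range(3)]
b = solve(lhs, rhs)
print([round(v, 4) for v in b])
```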
(b) The quadratic curve relating profit to year of operation is given in Figure 6.4.
[Figure 6.4: A fitted quadratic model for profit versus year of business operation]
From the graph, the fitted curve passes through all the points almost exactly. This
indicates that there is almost perfect quadratic relationship between profit and year
of business operation.
(c) Using the quadratic equation, the fitted values are shown along with the original
values of y
SN y ŷ
1 125 124.429
2 140 141.286
3 165 164.571
4 195 194.286
5 230 230.429
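$$\sum_{i=1}^{n} y_i = 855.00, \qquad \sum_{i=1}^{n} y_i^2 = 153375.00 \qquad \text{and} \qquad \sum_{i=1}^{n} \hat{y}_i^2 = 153372.00$$
$$s_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = 153375.00 - \frac{1}{5}(855)^2 = 7170$$
$$s_{\hat{y}\hat{y}} = \sum_{i=1}^{n} \hat{y}_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = 153372.00 - \frac{1}{5}(855)^2 = 7167$$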
The coefficient of multiple determination is
s yˆyˆ 7167
r2 = = = 0.9996
s yy 7170
Thus, the model accounts for 99.96 percent of variation in profit.
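The coefficient of multiple determination can be recomputed from the fitted values. In this sketch the coefficients 114, 101/14 and 45/14 are the unrounded least squares values for these data (an assumption stated here because the text quotes them rounded), so small differences from the hand calculation are due to rounding.

```python
x = [1, 2, 3, 4, 5]
y = [125, 140, 165, 195, 230]
n = len(y)

# Unrounded quadratic coefficients for these data (assumed, see lead-in).
coef = (114.0, 101 / 14, 45 / 14)
y_hat = [coef[0] + coef[1] * xi + coef[2] * xi ** 2 for xi in x]

correction = sum(y) ** 2 / n
s_yy = sum(yi ** 2 for yi in y) - correction          # total variation
s_yhat = sum(v ** 2 for v in y_hat) - correction      # explained variation
r2 = s_yhat / s_yy

print([round(v, 3) for v in y_hat])
print(round(s_yy, 2), round(s_yhat, 2), round(r2, 4))
```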
where the coefficients $\beta_r\ (r = 0, 1, 2, \ldots, p)$ are not all zero. If the degree is decided,
the coefficients may be determined by least squares estimation. For brevity, we will
write the average value of y in Equation (6.8) as
$$\hat{y} = \sum_{r=0}^{p} b_r x^r$$
$$\frac{\partial Q}{\partial b_k} = \sum_{i=1}^{n} \left( y_i - \sum_{r=0}^{p} b_r x_i^r \right) (-2 x_i^k)$$
Equating to zero,
$$\sum_{i=1}^{n} \sum_{r=0}^{p} b_r x_i^{r+k} = \sum_{i=1}^{n} x_i^k y_i \qquad (6.9)$$
Equation (6.9) may further be written as
$$\sum_{i=1}^{n} x_i^k \hat{y}_i = \sum_{i=1}^{n} x_i^k y_i$$
It is noticeable that in Equation (6.9), the left-hand side is the kth moment of the
polynomial and the right-hand side is the kth moment of the data. Thus, fitting a
polynomial curve of degree p to a set of data by the method of least squares is
equivalent to equating the moments of order $k = 0, 1, 2, \ldots, p$ of the polynomial to
those of the data.
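The equality of moments can be checked on the quadratic fitted in Example 6.6. This sketch assumes the unrounded coefficients 114, 101/14 and 45/14 for those data; the function names are illustrative.

```python
x = [1, 2, 3, 4, 5]
y = [125, 140, 165, 195, 230]
coef = (114.0, 101 / 14, 45 / 14)     # unrounded least squares values (assumed)

y_hat = [sum(c * xi ** r for r, c in enumerate(coef)) for xi in x]


def moment(values, k):
    """kth moment about the origin: sum of x_i^k * value_i."""
    return sum(xi ** k * v for xi, v in zip(x, values))


for k in range(len(coef)):
    # Moments of the fitted polynomial and of the data, order k = 0, 1, 2.
    print(k, round(moment(y_hat, k), 6), moment(y, k))
```

For each order k up to the degree of the polynomial, the two printed moments should coincide apart from floating-point error.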
Self-Assessment Questions
Exercise 6.4
1. Refer to the problem in Exercise 4.1, Question 1, which is on the profit, in GH¢y,
of a certain small scale business establishment in the xth year of its operation. The
data is given in the table.
x 1 2 3 4 5
y 125 140 165 195 230
(a) Test the significance of the quadratic term in the quadratic model.
(b) Obtain the analysis of variance for both linear and quadratic models.
(c) Obtain a plot of the data with the fitted linear and quadratic curves in separate
panels of the graph. Comment on the graphs.
(d) Which of the two models is more appropriate for the data? Explain.
2. Refer to the problem in Exercise 4.1, Questions 3 and 4. Suppose that a variable on
expenditure on advertisement (X3) is included to determine the demand for
Toothgate. The data on X3 along with the demand (Y) is given in the table shown.
Period x3 y Period x3 y
1 3.50 5.38 11 5.00 7.10
2 4.75 6.51 12 4.90 6.86
3 5.25 7.52 13 4.80 6.87
4 3.50 5.50 14 5.10 7.26
5 5.00 7.33 15 5.00 7.00
6 4.50 6.28 16 4.25 5.65
7 3.25 5.87 17 5.00 6.50
8 3.25 5.10 18 3.75 5.67
9 4.00 6.00 19 3.80 5.93
10 4.50 5.89 20 4.80 7.26
(a) Determine the least squares quadratic model for y in terms of x3.
(b) Find the multiple coefficient of determination of the model.
(c) Find also the standard error of the estimate of the model.
Objectives
By the end of this session, you should be able to:
1. Perform one-way analysis of variance on a given suitable data
2. Determine exact pairs of classes of samples that are different after establishing
that such differences exist.
member of the ith class as xij . The layout of the sample is presented in Table 6.6.
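$$\begin{array}{c|cccccc}
 & \multicolumn{6}{c}{\text{Element}} \\
\text{Class} & 1 & 2 & \cdots & j & \cdots & n_i \\ \hline
1 & x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1n_1} \\
2 & x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n_2} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
m & x_{m1} & x_{m2} & \cdots & x_{mj} & \cdots & x_{mn_m}
\end{array}$$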
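Thus, for each class i, the sum of deviations from the class mean is
$$\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.}) = 0$$
We consider the sum of squared deviations of all values from the general mean, given as
$\sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2$. Writing
$x_{ij} - \bar{x}_{..} = (x_{ij} - \bar{x}_{i.}) + (\bar{x}_{i.} - \bar{x}_{..})$, squaring and summing, the
cross-product term vanishes because the deviations from each class mean sum to zero.
Hence,
$$\sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{m} n_i (\bar{x}_{i.} - \bar{x}_{..})^2 + \sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 \qquad (6.10)$$
The left-hand side
$\sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2$ represents the total variation (SST).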
The right-hand side shows a decomposition of the total variation into two components.
The first separated component
$\sum_{i=1}^{m} n_i (\bar{x}_{i.} - \bar{x}_{..})^2$ represents the variation between classes or treatments (SSTR).
If the values across classes vary widely, then the deviations of the class means from the
general mean are also expected to be large. A large value of the between-class variation
is thus a reflection of wide variability or heterogeneity across classes. Wide variability
is an indication of real differences between classes or treatments.
The second component
$\sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2$ represents the variation within classes (SSE).
If the values within each class do not vary widely, then the deviations of the class
values from their class mean are expected to be small. A small value of the within-class
variation is thus a reflection of low variability or homogeneity within classes.
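More computational formulas for the variation sum of squares components are
$$SST = \sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{m}\sum_{j=1}^{n_i} x_{ij}^2 - \frac{T_{..}^2}{n}$$
$$SSTR = \sum_{i=1}^{m}\sum_{j=1}^{n_i} (\bar{x}_{i.} - \bar{x}_{..})^2 = \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{n}$$
$$SSE = \sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 = \sum_{i=1}^{m}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i}$$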
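The computational formulas and the decomposition in Equation (6.10) can be checked on a small data set. The numbers below are made up purely for illustration (three classes of unequal size) and are not taken from the course book.

```python
# Hypothetical responses for m = 3 classes (illustrative values only).
groups = {"A": [12, 15, 11], "B": [18, 20, 17, 19], "C": [10, 9, 13]}

values = [v for g in groups.values() for v in g]
n = len(values)
grand_total = sum(values)                                          # T..
sum_sq = sum(v * v for v in values)
between = sum(sum(g) ** 2 / len(g) for g in groups.values())       # sum of T_i.^2 / n_i

# Computational formulas
sst = sum_sq - grand_total ** 2 / n
sstr = between - grand_total ** 2 / n
sse = sum_sq - between

# Definitional forms, for comparison
grand_mean = grand_total / n
sst_def = sum((v - grand_mean) ** 2 for v in values)
sse_def = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)

print(round(sst, 4), round(sstr, 4), round(sse, 4))
```

The two routes give the same sums of squares, and SST equals SSTR plus SSE.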
$$x_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, m; \quad j = 1, 2, \ldots, n_i \qquad (6.12)$$
The model given by Equation (6.12) is known as the fixed effect model. An equivalent
hypothesis to Equation (6.11) that involves the effect of the classes is given by
H o : τ i = 0, ∀i (6.13)
This says that there is no class or treatment effect on the responses, indicating that the m
population means are the same.
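The components of the variations in Equation (6.10) are associated with respective
degrees of freedom. The SST has $n - 1$ degrees of freedom as a result of the constraint
$\sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..}) = 0$. Similarly, SSTR has $m - 1$ degrees of freedom. Then, since
the degrees of freedom are additive, SSE has $(n - 1) - (m - 1) = n - m$ degrees of freedom,
and the test statistic is
$$F = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i} (\bar{x}_{i.} - \bar{x}_{..})^2 / (m - 1)}{\sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 / (n - m)} \qquad (6.14)$$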
Example 6.7
A car manufacturing company wants to determine the fuel consumption rate of its new
cars. In an experiment, each of three sets of five cars of the same brand is filled with
a different type of fuel (A, B or C) and the distance covered before refueling is recorded.
Table 6.7 contains the distances covered in the experiment.
Test whether there are differences in consumption due to the three types of fuel.
Solution
The null hypothesis is
H o : µ A = µ B = µC
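$$SST = \sum_{i=1}^{m}\sum_{j=1}^{n_i} x_{ij}^2 - \frac{T_{..}^2}{n} = 970550 - 970485.144 = 64.856$$
$$\begin{aligned}
SSTR &= \sum_{i=1}^{m}\sum_{j=1}^{n_i} (\bar{x}_{i.} - \bar{x}_{..})^2 = \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{n} \\
&= \frac{1}{5}\left(1269.90^2 + 1283.10^2 + 1262.40^2\right) - 970485.144 \\
&= 970529.076 - 970485.144 \\
&= 43.932
\end{aligned}$$
$$\begin{aligned}
SSE &= \sum_{i=1}^{m}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 = \sum_{i=1}^{m}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i} \\
&= 970550 - \frac{1}{5}\left(1269.90^2 + 1283.10^2 + 1262.40^2\right) \\
&= 970550 - 970529.076 \\
&= 20.924
\end{aligned}$$
Total 14 64.856
$f_{2,\,12,\,(0.05)} = 3.89$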
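The sums of squares and the F ratio for this example can be recomputed from the class totals alone. This is a sketch under the figures quoted in the solution (class totals, the overall sum of squared observations, and five cars per fuel type); the raw distances of Table 6.7 are not used here.

```python
totals = [1269.90, 1283.10, 1262.40]   # T_i. for fuel types A, B, C
n_i = 5                                # cars per fuel type
sum_sq = 970550.0                      # sum of all squared distances
m = len(totals)
n = m * n_i

correction = sum(totals) ** 2 / n                  # T..^2 / n
between = sum(t ** 2 / n_i for t in totals)        # sum of T_i.^2 / n_i

sst = sum_sq - correction
sstr = between - correction
sse = sum_sq - between

mstr = sstr / (m - 1)                  # treatment mean square
mse = sse / (n - m)                    # error mean square
f_stat = mstr / mse

print(round(sstr, 3), round(sse, 3), round(sst, 3))
print(round(mstr, 4), round(mse, 4), round(f_stat, 2))
```

The F ratio is compared with the tabulated value 3.89 with 2 and 12 degrees of freedom.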
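By $H_o$, we mean that treatments i and k have the same effect on the mean response. By
$H_a$ we mean that i and k have different effects on the mean response. The test statistic
for the test is given as
$$T = \frac{\bar{x}_i - \bar{x}_k}{\sqrt{MSE\left(\dfrac{1}{n_i} + \dfrac{1}{n_k}\right)}} \qquad (6.15)$$
and has the t distribution with $n - m$ degrees of freedom. The null hypothesis is rejected
for large values of $|T|$ compared to $t_{\alpha/2,\, n-m}$.
The corresponding $100(1 - \alpha)$ percent confidence interval for the difference between
the two treatment means is
$$(\bar{x}_i - \bar{x}_k) \pm t_{\alpha/2,\, n-m} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_k}\right)}, \qquad i \neq k, \quad i, k = 1, 2, \ldots, m \qquad (6.16)$$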
Example 6.8
Refer to Example 6.7.
If differences in treatment effects are detected, carry out a follow-up test to determine
the types of treatment effects that differ.
Solution
The test in Example 6.7 concluded that differences exist between treatment effects.
Therefore we conduct a test of pair-wise comparison.
The 95 percent confidence interval for the difference between Fuel types A and B is
constructed as follows:
$H_o: \mu_B - \mu_A = 0$ against $H_a: \mu_B - \mu_A \neq 0$
$$\begin{aligned}
(256.62 - 253.98) \pm 2.179\sqrt{1.7437\left(\frac{1}{5} + \frac{1}{5}\right)} &= [2.64 \pm 1.8198] \\
&= [0.8202,\ 4.4598]
\end{aligned}$$
Since the interval does not contain the hypothesized value of $\mu_B - \mu_A = 0$, it implies
that the fuel types A and B differ in their effect on the distance covered. Since the whole
interval lies above zero, the average distance covered with Fuel B is estimated to be
higher than with Fuel A.
The 95 percent confidence interval for the difference between Fuel types A and C is
constructed as follows:
$H_o: \mu_C - \mu_A = 0$ against $H_a: \mu_C - \mu_A \neq 0$
$$\begin{aligned}
(252.48 - 253.98) \pm 2.179\sqrt{1.7437\left(\frac{1}{5} + \frac{1}{5}\right)} &= [-1.5 \pm 1.8198] \\
&= [-3.3198,\ 0.3198]
\end{aligned}$$
Since the interval contains the hypothesized value of $\mu_C - \mu_A = 0$, it implies that the
fuel types A and C do not differ in their effects on the distance covered.
The 95 percent confidence interval for the difference between Fuel types B and C is
constructed as follows:
$H_o: \mu_C - \mu_B = 0$ against $H_a: \mu_C - \mu_B \neq 0$
$$\begin{aligned}
(252.48 - 256.62) \pm 2.179\sqrt{1.7437\left(\frac{1}{5} + \frac{1}{5}\right)} &= [-4.14 \pm 1.8198] \\
&= [-5.9598,\ -2.3202]
\end{aligned}$$
Since the interval does not contain the hypothesized value of $\mu_C - \mu_B = 0$, it implies
that the fuel types B and C differ in their effects on distance covered. Since the whole
interval lies below zero, the average distance covered with Fuel B is estimated to be
higher than with Fuel C.
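The three pairwise intervals share the same half-width because every class has five observations. The following sketch recomputes them from the class means, the MSE and the critical value quoted above; the dictionary layout and the function name `interval` are assumptions made for this illustration.

```python
from math import sqrt

means = {"A": 253.98, "B": 256.62, "C": 252.48}   # class means from Example 6.7
mse, n_i, t_crit = 1.7437, 5, 2.179               # t value for 12 df, from the text

half_width = t_crit * sqrt(mse * (1 / n_i + 1 / n_i))


def interval(first, second):
    """95 percent interval for mean(first) - mean(second), Equation (6.16)."""
    diff = means[first] - means[second]
    return diff - half_width, diff + half_width


pairs = [("B", "A"), ("C", "A"), ("C", "B")]
intervals = {pair: interval(*pair) for pair in pairs}
for pair, (lo, hi) in intervals.items():
    differs = not (lo <= 0 <= hi)                 # 0 outside the interval
    print(pair, round(lo, 4), round(hi, 4), "differ:", differs)
```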
The plot clearly shows that the confidence interval for treatment B does not overlap
with any of the other two. This means that $\mu_B \neq \mu_A$ and $\mu_B \neq \mu_C$. However, there
appears to be some amount of overlap between the confidence intervals for treatments A
and C. Our computations confirm that the two are not significantly different.
Self-Assessment Questions
Exercise 6.5
1. Table 6.8 shows the yield for three hybrids of corn A, B, and C from an acre plot.
Perform a one-way analysis of variance and draw conclusions.
Objectives
By the end of this session, you should be able to:
1. perform a simple two-way analysis of variance on a given suitable data;
2. perform a two-way analysis of variance with interaction on a given suitable data.
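$$\begin{array}{c|cccccc}
 & \multicolumn{6}{c}{\text{Factor B}} \\
\text{Factor A} & 1 & 2 & \cdots & j & \cdots & n \\ \hline
1 & x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1n} \\
2 & x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
m & x_{m1} & x_{m2} & \cdots & x_{mj} & \cdots & x_{mn}
\end{array}$$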
Note that in Table 6.9, we have assumed for now that there is only one response
value in the ith class of factor A and jth class of factor B.
Using a partition similar to that in Session 5, the sum of squared deviations from the
general mean is given as
MULTIPLE LINEAR REGRESSION AND ANALYSIS UNIT 6
OF VARIANCE SESSION 6
ni ni
i =1 j =1 i =1 j =1
∑∑ ( x
i =1 j =1
ij − x.. ) 2 represents the total variation (SST).
The right-hand side shows a decomposition of the total variation into three components.
The first component,
\[
\sum_{i=1}^{m} n(\bar{x}_{i.}-\bar{x}_{..})^2 ,
\]
represents the variation due to factor A (SSA).
If the values across the levels of factor A vary widely, then the deviations of the level
means from the general mean are also expected to be large. A large value of the between-
factor A variation is thus a reflection of wide variability or heterogeneity across levels
of the factor. This is an indication of real differences between levels of factor A.
The second component,
\[
\sum_{j=1}^{n} m(\bar{x}_{.j}-\bar{x}_{..})^2 ,
\]
represents the variation due to factor B (SSB).
If the values across the levels of factor B vary widely, then the deviations of the level
means from the general mean are also expected to be large. A large value of the between-
factor B variation is thus a reflection of wide variability or heterogeneity across levels
of the factor. This is an indication of real differences between levels of factor B.
The third component,
\[
\sum_{i=1}^{m}\sum_{j=1}^{n}(x_{ij}-\bar{x}_{i.}-\bar{x}_{.j}+\bar{x}_{..})^2 ,
\]
represents the residual variation that remains after the variations due to factors A and B
have been removed.
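A minimal Python sketch of this partition for a table with one value per cell. The function name `two_way_partition` and the 3 x 4 table are hypothetical and are not data from the text; rows play the role of factor A and columns the role of factor B.

```python
def two_way_partition(table):
    """Split the total sum of squares of an m x n table (one value per cell)."""
    m, n = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (m * n)
    row_means = [sum(row) / n for row in table]
    col_means = [sum(table[i][j] for i in range(m)) / m for j in range(n)]
    sst = sum((table[i][j] - grand) ** 2 for i in range(m) for j in range(n))
    ssa = n * sum((r - grand) ** 2 for r in row_means)   # factor A (rows)
    ssb = m * sum((c - grand) ** 2 for c in col_means)   # factor B (columns)
    sse = sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(m) for j in range(n))       # residual
    return sst, ssa, ssb, sse

# Hypothetical data, chosen only to illustrate the identity.
data = [[4.0, 6.0, 5.0, 9.0],
        [7.0, 8.0, 6.0, 11.0],
        [3.0, 5.0, 7.0, 7.0]]
sst, ssa, ssb, sse = two_way_partition(data)
print(sst, ssa + ssb + sse)
```

The two printed numbers agree up to rounding, which is the partition of the total variation into the three components.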
In the general case, corresponding to the ith level of factor A and the jth level of
factor B, there are l replicate values.
Table 6.10: Layout of Data for General Two-way Anova
Thus, there are in total $N = lmn$ response values. Denote by $x_{ijk}$ the kth replicate value
of X in the ith class of factor A and the jth class of factor B. The layout of the data for
the general Two-way Anova is given in Table 6.10.
The treatments in the layout are the combinations of the levels of factor A and factor B.
We also define $\mu_{ij}$ as the mean value of the response variable obtained using level i
of factor A and level j of factor B.
Of interest are the following sums of squares and associated degrees of freedom.
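Source                        Sum of squares                                                                                  Degrees of freedom
Factor A                      $nl\sum_{i=1}^{m}(\bar{x}_{i..}-\bar{x}_{...})^2$                                               $m-1$
Factor B                      $ml\sum_{j=1}^{n}(\bar{x}_{.j.}-\bar{x}_{...})^2$                                               $n-1$
Interaction between A and B   $l\sum_{i=1}^{m}\sum_{j=1}^{n}(\bar{x}_{ij.}-\bar{x}_{i..}-\bar{x}_{.j.}+\bar{x}_{...})^2$      $(m-1)(n-1)$
Error                         $\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{l}(x_{ijk}-\bar{x}_{ij.})^2$                           $mn(l-1)$
Total                         $\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{l}(x_{ijk}-\bar{x}_{...})^2$                           $lmn-1$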
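These sums of squares and degrees of freedom can be computed directly from the definitions. The following Python sketch assumes a balanced layout in which `cells[i][j]` holds the l replicate values for level i of factor A and level j of factor B; the function names and the 2 x 2 data with two replicates are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def two_way_with_replicates(cells):
    """Sums of squares and degrees of freedom for a balanced two-way layout."""
    m, n, l = len(cells), len(cells[0]), len(cells[0][0])
    grand = mean([x for row in cells for cell in row for x in cell])
    a_means = [mean([x for cell in row for x in cell]) for row in cells]
    b_means = [mean([x for row in cells for x in row[j]]) for j in range(n)]
    cell_means = [[mean(cell) for cell in row] for row in cells]
    pairs = [(i, j) for i in range(m) for j in range(n)]
    ss = {
        "A": n * l * sum((a - grand) ** 2 for a in a_means),
        "B": m * l * sum((b - grand) ** 2 for b in b_means),
        "AB": l * sum((cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
                      for i, j in pairs),
        "Error": sum((x - cell_means[i][j]) ** 2
                     for i, j in pairs for x in cells[i][j]),
        "Total": sum((x - grand) ** 2 for i, j in pairs for x in cells[i][j]),
    }
    df = {"A": m - 1, "B": n - 1, "AB": (m - 1) * (n - 1),
          "Error": m * n * (l - 1), "Total": l * m * n - 1}
    return ss, df

# Hypothetical data: 2 levels of A, 2 levels of B, 2 replicates per cell.
cells = [[[1.0, 3.0], [5.0, 7.0]],
         [[2.0, 4.0], [10.0, 12.0]]]
ss, df = two_way_with_replicates(cells)
print(ss, df)
```

In a balanced layout the first four sums of squares add up to the total, and the same holds for the degrees of freedom.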
Example 6.9
Refer to Example 6.7.
Suppose that, in addition to each selected set of five cars being filled with one of
the three types of fuel, each set must contain one car of each of the five brands that the
company manufactures. Determine the effect of the brand of car and the fuel type on the
distance covered by the fifteen cars.
Solution
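Let $x_{ij}$ be the distance covered by the car of the ith brand filled with the jth fuel type.
Thus, $i = 1, 2, 3, 4, 5$ and $j = 1, 2, 3$.
Notice that the values involved are large, so we transform the data by $x'_{ij} = x_{ij} - 250$.
Subtracting a constant from every observation leaves all the sums of squares unchanged.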
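A change of origin is harmless here because every deviation from the mean is unaffected by it. A small Python check, using hypothetical distances near 250 rather than the data of the example (the helper name `sum_sq_dev` is also an assumption):

```python
def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)

# Hypothetical distances, not the data of Example 6.9.
distances = [251.2, 254.0, 249.5, 256.3, 252.0]
coded = [d - 250 for d in distances]   # same transformation as in the text

print(sum_sq_dev(distances), sum_sq_dev(coded))
```

The two results agree up to floating-point rounding, so the analysis of variance can be carried out on the smaller coded values.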
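Noting that
\[
\sum_{i=1}^{m}\sum_{j=1}^{n}(x_{ij}-\bar{x}_{..})^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}x_{ij}^2 - N\bar{x}_{..}^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}x_{ij}^2 - \frac{T_{..}^2}{N}
\]
\[
\sum_{i=1}^{m}\sum_{j=1}^{n}(\bar{x}_{i.}-\bar{x}_{..})^2 = \sum_{i=1}^{m}n_i(\bar{x}_{i.}-\bar{x}_{..})^2 = \sum_{i=1}^{m}\frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{N}
\]
\[
\sum_{i=1}^{m}\sum_{j=1}^{n}(\bar{x}_{.j}-\bar{x}_{..})^2 = \sum_{j=1}^{n}n_j(\bar{x}_{.j}-\bar{x}_{..})^2 = \sum_{j=1}^{n}\frac{T_{.j}^2}{n_j} - \frac{T_{..}^2}{N}
\]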
We obtain the following values, squares and sums in the table using our new origin.
Car Brand    Fuel A    Fuel B    Fuel C    $T_{i.}$    $T_{i.}^2$    $\sum_j x_{ij}^2$
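\[
\sum_{i}\sum_{j}x_{ij}^2 = 349.76, \quad T_{..} = \sum_{j}T_{.j} = 65.4, \quad \sum_{j}T_{.j}^2 = 1645.38
\]
Thus,
\[
SST = \sum_{i=1}^{m}\sum_{j=1}^{n}x_{ij}^2 - \frac{T_{..}^2}{N} = 349.76 - \frac{65.4^2}{15} = 349.76 - 285.144 = 64.616
\]
\[
SSA = \sum_{i=1}^{m}\frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{N} = \frac{897.76}{3} - \frac{65.4^2}{15} = 299.2533 - 285.144 = 14.1093
\]
Similarly, the sum of squares for fuel type is 1645.38/5 − 285.144 = 43.932, and the
residual sum of squares is 64.616 − 14.1093 − 43.932 = 6.5747.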
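The shortcut formulas can be checked from the summary totals alone. In the Python sketch below the variable names are illustrative assumptions; the figures are the coded-data totals quoted in the solution, with m = 5 car brands and n = 3 fuel types.

```python
# Summary figures from the worked solution (coded data, origin 250).
sum_x_sq = 349.76          # sum of squared coded observations
grand_total = 65.4         # T..
sum_brand_tot_sq = 897.76  # sum of squared car-brand totals (3 values each)
sum_fuel_tot_sq = 1645.38  # sum of squared fuel totals (5 values each)
m, n = 5, 3                # car brands, fuel types
N = m * n

correction = grand_total ** 2 / N            # T..^2 / N
sst = sum_x_sq - correction                  # total
ssa = sum_brand_tot_sq / n - correction      # between car brands
ssb = sum_fuel_tot_sq / m - correction       # between fuel types
sse = sst - ssa - ssb                        # residual, by subtraction

print(round(sst, 4), round(ssa, 4), round(ssb, 4), round(sse, 4))
```

The residual is obtained by subtraction, which is valid because the three components add up to the total sum of squares.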
Source of variation    df    SS         MS        F statistic
Between fuel type      2     43.932     21.966    26.7291
Between car brand      4     14.1093    3.5273    4.2922
Residual               8     6.5747     0.8218
Total                  14    64.616
$f_{0.05,\,2,\,8} = 4.46$, $f_{0.05,\,4,\,8} = 3.84$
Since 26.7291 > 4.46 and 4.2922 > 3.84, there are significant differences in distance
covered due to fuel type and due to car brand at the 5 percent significance level.
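The mean squares, F statistics and test decisions follow mechanically from the sums of squares and degrees of freedom. A short Python sketch, with dictionary keys chosen for illustration and the critical values taken from the text:

```python
# Sums of squares and degrees of freedom from the Anova table of Example 6.9.
ss = {"fuel": 43.932, "brand": 14.1093, "residual": 6.5747}
df = {"fuel": 2, "brand": 4, "residual": 8}
# Tabulated 5 percent points quoted in the text: f(2, 8) and f(4, 8).
critical = {"fuel": 4.46, "brand": 3.84}

ms = {source: ss[source] / df[source] for source in ss}
f_stats = {source: ms[source] / ms["residual"] for source in critical}
significant = {source: f_stats[source] > critical[source] for source in critical}

for source in critical:
    print(source, round(f_stats[source], 2), significant[source])
```

The F statistics differ slightly from the table in the third decimal place because the table divides by the rounded residual mean square 0.8218.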
Example 6.10
Refer to Example 6.9.
Suppose that the manufacturer wants to be very sure of the effects of the two factors and
selects two cars of each brand in the test process. The resulting data are as shown.
Carry out an analysis of variance test and draw conclusions.
Solution
Source of variation    df    SS         MS         F statistic
Between fuel type      2     101.953    50.9763    237.10
Between car brand      4     39.021     9.7553     45.37
Interaction            8     13.451     1.6813     7.82
Error                  15    3.225      0.215
Total                  29    157.65
It can be verified that all three F statistics are significant. Thus, there are significant
differences in distance covered by the cars, due most strongly to the type of fuel. There are
also differences due to the car brand and to the interaction between car brand and fuel type.
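The error sum of squares is determined by the other entries: it is the total minus the three effect sums of squares, with 29 − 2 − 4 − 8 = 15 degrees of freedom. A Python sketch of that subtraction and of the resulting F ratios, using the figures from the table (the variable names are assumptions):

```python
# Effect sums of squares and degrees of freedom from the Anova table of Example 6.10.
ss = {"fuel": 101.953, "brand": 39.021, "interaction": 13.451}
df = {"fuel": 2, "brand": 4, "interaction": 8}
ss_total, df_total = 157.65, 29

# Error line obtained by subtraction from the total.
ss_error = ss_total - sum(ss.values())
df_error = df_total - sum(df.values())
ms_error = ss_error / df_error

# Each F statistic is the effect mean square divided by the error mean square.
f_stats = {source: (ss[source] / df[source]) / ms_error for source in ss}

print(round(ss_error, 3), df_error, round(ms_error, 3))
print({source: round(value, 2) for source, value in f_stats.items()})
```

The ratios reproduce the F statistics reported in the table, which confirms that each effect was tested against the error mean square.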
Self-Assessment Questions
Exercise 6.6
                      Rice Type
Fertilizer    A       B       C       D
P             18.4    21.8    23.8    22.1
              17.2    22.8    22.6    20.9
              17.8    22.3    23.2    21.5
Q             19.9    24.0    24.8    24.1
              19.4    24.6    24.2    22.3
              20.4    23.4    25.4    23.2
ANSWERS TO SELF-ASSESSMENT QUESTIONS
UNIT 1
Session 1
1. C 8. C
2. B 9. C
3. A 10. D
4. A 11. D
5. D 12. B
6. A 13. A
7. D 14. B
Session 2
2. $z = 0.6$; we conclude that there is not enough evidence to reject the null hypothesis.
3. (a) Hypotheses: $H_0: \mu = 800$ against $H_1: \mu \neq 800$
(b) $z = 1.92$; we fail to reject $H_0$ and conclude that the viscosity of the liquid
averages 800 centistokes at 25 °C.
Session 3
3. $t = -0.49$. Fail to reject $H_0$ since $t = -0.49$ is greater than $-t_{0.05}(4) = -2.132$.
Session 4
Session 5
kα = 0 .
2
Session 6
1. $\chi^2 = 5.9168$, so fail to reject $H_0$ since $\chi^2 = 5.9168$ is not less than $\chi^2_{0.95} = 2.733$.
2. $\chi^2 = 20.845$, so fail to reject $H_0$ since $\chi^2 = 20.845$ is neither less than
$\chi^2_{0.995} = 11.689$ nor greater than $\chi^2_{0.005} = 38.076$.
3. $\chi^2 = 27.887$, so fail to reject $H_0$ since $\chi^2 = 27.887$ is not less than $\chi^2_{0.95} = 13.121$.
4. $\chi^2 = 55.704$. Fail to reject $H_0$ since $\chi^2 = 55.704$ is not greater than $\chi^2_{0.05} = 55.758$.
UNIT 2
Session 1
3. (a) $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \neq 0$
(b) $z = -5.06$. Reject $H_0$ at $\alpha = 0.05$ and conclude that $\mu_1$ and $\mu_2$ differ.
Session 2
1. (a) $t = -1.2545$. We fail to reject $H_0$ at the 0.10, 0.05, 0.01 and
0.001 levels of significance. This is because $t = -1.2545$ is not less than
$-t_{0.10}(18) = -1.330$, $-t_{0.05}(18) = -1.73$, $-t_{0.01}(18) = -2.55$,
or $-t_{0.001}(18) = -3.610$.
Session 3
1. (a) $t = -1.6112$. Reject the null hypothesis at the 0.10 level because
$t = -1.6112$ is less than $-t_{0.10}(11) = -1.363$. However, fail to
reject the null hypothesis at the 0.05, 0.01 and 0.001 levels.
This is because $t = -1.6112$ is not less than
$-t_{0.05}(11) = -1.796$, nor less than $-t_{0.01}(11) = -2.718$, nor less
than $-t_{0.001}(11) = -4.025$.
Session 4
1. $t = 1.897$. Reject $H_0$ and conclude that the mean weight of boxers before they
took the diet is greater than their mean weight after the diet. This implies that the
weight-reducing diet is effective.
2. (b) $\bar{d} = -0.8$
(c) $s_d^2 = 3.7$
(d) $s_d = 1.92$
Session 5
1. (a) $H_0: p_1 - p_2 = 0$ against $H_1: p_1 - p_2 \neq 0$
(b) $z = 2.13$. Reject $H_0$ at $\alpha = 0.05$, and conclude that $p_1$ and $p_2$
differ.
2. $z = 0.8086$. Fail to reject $H_0$ since $z = 0.8086$ is neither less than $-z_{0.025} = -1.96$
nor greater than $z_{0.025} = 1.96$.
3. (a) $H_0: p_1 - p_2 = 0$ against $H_1: p_1 - p_2 \neq 0$
(b) $z = 0.505$. Fail to reject $H_0$ at $\alpha = 0.02$, and conclude that $p_1$
and $p_2$ do not differ.
4. $z = 0.126$. Fail to reject $H_0$ since $z = 0.126$ is neither less than $-z_{0.01} = -2.33$ nor
greater than $z_{0.01} = 2.33$.
Session 6
2. $F = 2.469$. Reject $H_0$ at the 0.10 level of significance, and conclude that
$H_1: \sigma_1^2 \neq \sigma_2^2$ holds. That is, it is not reasonable to assume that the two
populations have equal variances.
UNIT 3
Session 2
1. $\chi^2 = 2.1$. Since $\chi^2 = 2.1$ is less than $\chi^2_{0.05}(5) = 11.070$, we fail to reject $H_0$ and
conclude that the die is fair.
2. $\chi^2 = 8.998$. Since $\chi^2 = 8.998$ is less than $\chi^2_{0.01}(2) = 9.210$, we fail to reject
$H_0$ and conclude that the proportions of Muslims, Christians and other religions
in the Ashanti region are 15%, 77% and 8% respectively.
Session 3
2. $\chi^2 = 6.306$. Since $\chi^2 = 6.306$ is less than $\chi^2_{0.05}(7) = 14.067$, we do not reject
$H_0$ and conclude that the binomial distribution is an adequate model for the data,
at the 0.05 level of significance.
Session 4
Session 5
(b) For union workers: $\chi^2 = 12.20$. Fail to reject $H_0$ and conclude that the level
of confidence in business and job satisfaction are independent for unionized
workers.
For non-union workers: $\chi^2 = 9.996$. Fail to reject $H_0$ and conclude that the
level of confidence in business and job satisfaction are independent for non-
unionized workers as well.
3. $\chi^2 = 17.4354$. Reject $H_0$ and conclude that Teaching Evaluation and Rank are
not independent random variables.