STA 301D Statistical Methods

The document is a course book for STA 301D Statistical Methods at the University of Cape Coast, designed for distance learners to enhance their statistical knowledge relevant to education. It includes a structured outline of sessions covering topics such as hypothesis testing, regression analysis, and analysis of variance. The book also provides a study guide, symbols and their meanings, and acknowledgments for contributors to the course material.

UNIVERSITY OF CAPE COAST

COLLEGE OF DISTANCE EDUCATION

COURSE TITLE

STA 301D Statistical Methods

© COLLEGE OF DISTANCE EDUCATION, UNIVERSITY OF CAPE COAST


CODE PUBLICATIONS, 2019
STA 301D Statistical Methods

Prof. Nathaniel K. Howard


Dr. Bismark K. Nkansah
First printed in 2017 by Hampton Press Ltd., Cape Coast
Printed in 2019 by Wilson Books & Stationery Ent. Ltd., Cape Coast
Printed in 2021 by Paramount Paper Works, Accra
Printed in 2024 by UCC Press

© COLLEGE OF DISTANCE EDUCATION, UNIVERSITY OF CAPE COAST (CoDE UCC),
2017, 2019, 2021, 2024

CoDE PUBLICATIONS 2024

All rights reserved. No part of this publication should be reproduced, stored in a retrieval system or
transmitted in any form or by any means, electronic, mechanical, photocopying or otherwise, without the prior
permission of the copyright holder.

Cover page illustrated by William Jacobs


ABOUT THIS BOOK

This Course Book “STA 301D Statistical Methods” has been exclusively written by
experts in the discipline to update your general knowledge of Education in order to
equip you with the basic tools you will require for your professional training as a teacher.

This three-credit course book of thirty-six (36) sessions has been structured to reflect
the weekly three-hour lecture for this course in the University. Thus, each session is
equivalent to a one-hour lecture on campus. As a distance learner, however, you are
expected to spend a minimum of three hours and a maximum of five hours on each
session.

To help you do this effectively, a Study Guide has been particularly designed to show
you how this book can be used. In this study guide, your weekly schedules are clearly
spelt out as well as dates for quizzes, assignments and examinations.

Also included in this book is a list of all symbols and their meanings. They are meant to
draw your attention to vital issues of concern and activities you are expected to perform.

Blank sheets have also been inserted for your comments on topics that you may find
difficult. Remember to bring these to the attention of your course tutor during your
face-to-face meetings.

We wish you a happy and successful study.

Prof. Nathaniel K. Howard


Dr. Bismark K. Nkansah

CoDEUCC/Post-Diploma in Mathematics and Science Education i


ACKNOWLEDGEMENT

It has become a tradition in academic circles to acknowledge the assistance one
received from colleagues in the writing of an academic document. Those who
contributed in diverse ways toward the production of this particular course book merit
more than mere acknowledgement for two main reasons. First, they worked beyond
their normal limits in writing, editing and providing constant support and
encouragement without which the likelihood of giving up the task was very high.
Second, the time span for the writing and editing of this particular course book was so
short that their exceptional commitment and dedication were the major factors that
contributed to its accomplishment.

It is in the foregoing context that the names of Prof. Nathaniel K. Howard and Dr.
Bismark K. Nkansah, University of Cape Coast, who wrote and edited the content of
this course book for CoDEUCC, will ever remain in the annals of the College. This
special remembrance also applies to those who assisted me in the final editing of the
document.

I wish to thank the Vice-Chancellor, Prof. Johnson Nyarko Boampong, the Pro-Vice-
Chancellor, Prof. (Mrs.) Rosemond Boohene and all the staff of the University’s
Administration without whose diverse support this course book would not have been
completed.

Finally, I am greatly indebted to the entire staff of CoDEUCC, especially Mrs. Christina
Hesse for formatting the scripts.

Any limitations in this course book, however, are exclusively mine. But the good
comments must be shared among those named above.

Prof. Anokye Mohammed Adam


(Provost)

TABLE OF CONTENTS

Content
About this Book
Acknowledgement
Table of Contents
Symbols and their Meanings

UNIT 1: TESTS OF HYPOTHESES ON A SINGLE SAMPLE

Session 1: Hypotheses and Test Procedures
1.1 What is a Hypothesis
1.2 Types of Hypotheses
1.3 Classification of Hypothesis Tests

Session 2: Tests Concerning a Single Population Mean (Large Samples)
2.1 Tests for Means from Single Normal Populations with known σ
2.2 Tests for Means from Single Populations with unknown σ but large sample size, n

Session 3: Tests Concerning a Single Population Mean (Small Samples)

Session 4: Tests Concerning a Single Population Proportion (Large Samples)

Session 5: Tests Concerning a Single Population Proportion (Small Samples)

Session 6: Tests Concerning a Single Population Variance

UNIT 2: TESTS OF HYPOTHESES ON TWO POPULATIONS

Session 1: Tests Concerning Two Population Means (Large and Independent Samples)
2.1 Tests Concerning Two Population Means (Independent Samples with $\sigma_1^2$ and $\sigma_2^2$ known)
2.2 Tests Concerning Two Population Means (Large Independent Samples with $\sigma_1^2$ and $\sigma_2^2$ unknown)

Session 2: Tests Concerning Two Population Means (Small and Independent Samples, $\sigma_1$ and $\sigma_2$ Assumed Equal)

Session 3: Tests Concerning Two Population Means (Small and Independent Samples, $\sigma_1$ and $\sigma_2$ Assumed Unequal)

Session 4: Tests Concerning Two Population Means (Paired Data)

Session 5: Tests Concerning Two Population Proportions

Session 6: Tests Concerning Two Population Variances

UNIT 3: TESTS ON CATEGORICAL DATA

Session 1: The Multinomial Distribution

Session 2: Goodness-of-Fit Tests (when Categorical Probabilities are Completely Specified)

Session 3: Goodness-of-Fit Tests for the Poisson, Binomial and Normal Distributions

Session 4: Goodness-of-Fit Tests for Homogeneity

Session 5: Goodness-of-Fit Tests for Independence

Session 6: Coefficients of Contingency
6.1 Pearson’s Coefficient of Contingency
6.2 Cramer’s Coefficient of Contingency

UNIT 4: SIMPLE LINEAR REGRESSION

Session 1: The Line of Best Fit
1.1 Introduction to Simple Linear Regression
1.2 The Line of Best Fit

Session 2: Interpretation of the Regression Model Coefficients

Session 3: The Simple Coefficient of Determination and Correlation
3.1 Types of Variation
3.2 The Simple Coefficient of Determination
3.3 The Simple Correlation Coefficient

Session 4: Regression Functions
4.1 Function of Y on X Reviewed
4.2 Functions of X on Y
4.3 Geometric Relationship Between Regression Functions

Session 5: Standard Error of Estimate
5.1 The Standard Error of Estimate
5.2 The Standard Error in Terms of Correlation Coefficient

Session 6: Effect of Variable Transformation on Simple Linear Regression
6.1 Effect of Standardized Data on Simple Linear Regression
6.2 Effect of General Linear Transformation on Correlation Coefficient

UNIT 5: INFERENCE ABOUT SIMPLE LINEAR REGRESSION

Session 1: Test of Significance Involving Simple Linear Regression Model
1.1 Test of Significance of Simple Linear Regression Model
1.2 Test of Significance of the Correlation Coefficient

Session 2: Assumptions and Properties of Least Squares Regression Line
2.1 Regression Assumptions
2.2 Properties of Fitted Regression Line

Session 3: Confidence Interval
3.1 Confidence Interval for Mean Value of y
3.2 Confidence Interval of the Regression Slope

Session 4: The Mean and Variance of the Simple Linear Regression Estimates
4.1 The Mean and Variance of the Slope Parameter
4.2 The Mean and Variance of the Intercept
4.3 Estimate of the Mean Response

Session 5: The Use of Analysis of Variance in Simple Linear Regression
5.1 Presenting ANOVA in Simple Linear Regression
5.2 Relationship between the F test and the t test

Session 6: Using Matrix Algebra in Simple Linear Regression
6.1 The Least Squares Regression Parameter Estimates
6.2 Estimation of Variations by Matrix Approach
6.3 Application to Distance Measure


UNIT 6: MULTIPLE LINEAR REGRESSION AND ANALYSIS OF VARIANCE

Session 1: The General Linear Model
1.1 The Data and Regression Model
1.2 Interpreting the Coefficients of the Linear Model

Session 2: Inference about the Multiple Linear Regression
2.1 The Standard Error of Estimate and the Multiple Coefficient of Determination
2.2 Assessing the Significance of the General Linear Model
2.3 Assessing the Significance of Independent Variable
2.4 Confidence Interval for Regression Coefficients

Session 3: Dummy Variable Regression
3.1 Indicator Variables
3.2 Interpretation of Dummy Variable Regression Coefficients and Confidence Intervals

Session 4: Polynomial Regression
4.1 Quadratic Regression Model
4.2 Least Squares Estimation of the Polynomial Regression

Session 5: One-Way Analysis of Variance
5.1 Data Representation and Fundamental Equation of One-Way ANOVA
5.2 Testing the Hypothesis for One-Way ANOVA
5.3 Pair-wise Comparison

Session 6: Two-Way Analysis of Variance
6.1 Data Layout and Fundamental Equation of Simple Two-Way ANOVA
6.2 Data Layout and Fundamental Equation of General Two-Way ANOVA
6.3 Testing the Hypothesis for General Two-Way ANOVA

Statistical Tables

References

Answers to Self-Assessment Questions

SYMBOLS AND THEIR MEANINGS

INTRODUCTION

OVERVIEW

UNIT OBJECTIVES

SESSION OBJECTIVES

DO AN ACTIVITY

NOTE AN IMPORTANT POINT

TIME TO THINK AND ANSWER QUESTION(S)

REFER TO

READ OR LOOK AT

SUMMARY

SELF- ASSESSMENT TEST

ASSIGNMENT



UNIT 1: TESTS OF HYPOTHESES ON A SINGLE SAMPLE


UNIT OUTLINE:
Session 1: Hypotheses and Test Procedures
Session 2: Tests concerning a single Population Mean
(Large Samples)
Session 3: Tests concerning a single Population Mean
(Small Samples)
Session 4: Tests concerning a single Population Proportion
(Large Samples)
Session 5: Tests concerning a single Population Proportion
(Small Samples)
Session 6: Tests concerning a single Population Variance

In this module, titled Statistical Methods, you will be introduced to
some methods often used in statistical analyses. These methods
include hypothesis testing, correlation and regression analyses, and
analysis of variance. The first three units, Units 1 to 3, will be devoted to hypothesis
testing; and the last three, Units 4 to 6, to correlation and regression analyses, and
analysis of variance.

In this unit, you will learn about some common concepts used in hypothesis testing,
procedures in hypothesis testing, and how to conduct hypothesis tests that are based on
single samples. In Session 1, we shall discuss some basic concepts in hypothesis testing
and learn about test procedures. Sessions 2 and 3 will focus on tests concerning single
population means for large and small samples, respectively. Sessions 4 and 5 will focus
on tests concerning single population proportions for large and small samples,
respectively. The final session will teach you how to conduct tests
concerning a single population variance.

UNIT OBJECTIVES
By the end of the unit, you should be able to:
1. explain the basic concepts used in hypotheses testing;
2. formulate the null and alternative hypotheses for single population parameters;
3. conduct tests for single population means (large and small samples);
4. conduct tests for single population proportions (large and small samples); and
5. conduct tests for single population variances.


SESSION 1: HYPOTHESES AND TEST PROCEDURES


In this session, you will learn about some common terms and concepts
used in hypothesis testing. You will also learn about the procedures
used in hypothesis testing, which later sessions apply to tests that are based on single
samples.

Objectives
By the end of the session, you should be able to:
1. explain the term hypothesis;
2. distinguish between null and alternative hypotheses;
3. distinguish between simple and composite tests;
4. distinguish between one-sided and two-sided tests;
5. explain the types of errors that may occur in hypothesis tests;
6. define a test statistic;
7. explain the critical (or rejection) region of a test;
8. explain the level of significance of a test;
9. distinguish between significant and non-significant tests; and
10. explain two approaches to making a decision in a hypothesis test.

Now read on …

1.1 What is a Hypothesis


A hypothesis is a statement or assertion about the value of a population parameter or
about a characteristic of a probability distribution. It can also be a claim about the
values of several population parameters or about the form of a probability distribution.
The following are examples.
1. The mean age, $\mu$, of students in the College of Distance Education, UCC, in
2017, equals 36 years. That is, $\mu = 36$ years.
This is an assertion or claim about the mean age (parameter) of a population
(students of CoDE, UCC in 2017). Therefore, it is a hypothesis.
2. The proportion, $p_1$, of males in Ghana is lower than that, $p_2$, of females. That is,
$p_1 < p_2$.
3. The weekly number of power failures in a certain village follows a Poisson
distribution.
This is a claim about the form of distribution (Poisson distribution) of the number of
power failures in a certain village. Therefore, it is a hypothesis.

Try to make at least two statements of hypotheses of your own.


1.2 Types of Hypotheses


In any hypothesis testing problem, there are two contradictory hypotheses to choose
from. The two contradictory hypotheses are called the null hypothesis and alternative
hypothesis.

Definition 1
A null hypothesis is the assertion that is initially assumed to be true (or the prior belief
assertion). It is denoted by $H_0$.

Definition 2
An alternative hypothesis is the assertion that contradicts the null hypothesis. It is
denoted by $H_1$ (or by $H_A$ in other literature).

Hypothesis testing involves the use of sample data to decide whether the null hypothesis
should be rejected or not. If evidence from sample data strongly contradicts the null
hypothesis, then the null hypothesis has to be rejected in favour of the alternative
hypothesis. If evidence from sample data does not strongly contradict the null
hypothesis, then we have no reason to reject the null hypothesis. In other words, we fail
to reject the null hypothesis if sample data does not strongly contradict it. Thus, the
possible conclusions from a hypothesis test should be “reject $H_0$ in favour of $H_1$” or
“fail to reject $H_0$.”

1.3 Classification of Hypothesis Tests


In any hypothesis testing problem, the null and alternative hypotheses may further be
classified as simple or composite, and the test as one-sided or two-sided, depending on
how the test is set up.

1.3.1 Simple and composite tests


Suppose that the parameter of interest in a statistical testing problem is $\theta$. Then the
hypotheses may be classified as simple or composite depending on whether $\theta$ takes on
a single value, or multiple values or a range of values.

Definition 3
If $\theta$ can take on only a single value under a hypothesis, then that hypothesis, whether
the null ($H_0$) or the alternative ($H_1$), is called a simple hypothesis.


Examples of simple hypotheses are $H_0: \theta = 0$ or $H_1: \theta = 20$. Try to give at least two
examples of your own.

Definition 4
If $\theta$ can take on multiple values or a range of values under a hypothesis, then that
hypothesis, whether the null ($H_0$) or the alternative ($H_1$), is called a composite hypothesis.

Examples of composite hypotheses are $H_0: \theta \ge 100$ or $H_1: \theta \in \{10, 100\}$. Try to give at
least two different examples of your own.

In any particular hypothesis testing problem, the null and alternative hypotheses could
be a combination of simple and composite hypotheses. In most cases, the null
hypothesis is simple or composite; and the alternative hypothesis is composite.

1.3.2 One-sided and two-sided tests


The null ($H_0$) and alternative ($H_1$) hypotheses in any statistical testing problem may
further be classified as one-sided or two-sided.

Definition 5
When the null and alternative hypotheses are both composite and each represents one side of
the parameter space around some value $\theta_0$, the test is said to be a one-sided test.

One-sided tests are also called one-tailed tests. Examples of one-sided tests are:
$H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$, or
$H_0: \theta \ge \theta_0$ against $H_1: \theta < \theta_0$.

Definition 6
When the null hypothesis is simple and the alternative hypothesis represents the rest of
the parameter space of $\theta$, the test is said to be a two-sided test.

Two-sided tests are also called two-tailed tests. An example of a two-sided test is:
$H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$.


1.3.3 Errors in hypothesis testing


In any hypothesis testing problem, the null hypothesis $H_0$ can either be true or false,
and at the end of the test there are two possible conclusions: reject $H_0$ or fail to reject
$H_0$. These lead to four possible decisions that can be made:

1. $H_0$ is true and it is rejected at the end of the test. This is a wrong decision, and the
resulting error is classified as Type I error.
2. $H_0$ is false and it is rejected at the end of the test. This is a correct decision.
3. $H_0$ is true and it is not rejected at the end of the test. This is a correct decision.
4. $H_0$ is false and it is not rejected at the end of the test. This is a wrong decision,
and the resulting error is classified as Type II error.

These scenarios and decisions are summarized in Table 1.1.

Table 1.1: Decision table for hypothesis testing

| Decision | $H_0$ is true | $H_0$ is false |
|---|---|---|
| Reject $H_0$ | Type I error | Correct decision |
| Fail to reject $H_0$ | Correct decision | Type II error |

As mentioned earlier, hypothesis testing involves the use of sample data to decide
whether the null hypothesis should be rejected or not. The decision to reject the null
hypothesis or not is based on the value of a test statistic. A test statistic is a quantity
calculated from sample data whose distribution is known under the
assumption that the null hypothesis is true.

If there is a large difference between what is expected under the null hypothesis and
what is observed in a sample, then the null hypothesis is rejected; and the result is said
to be statistically significant. If, on the other hand, the difference between what is
expected and what is observed is small, then there is not enough evidence to reject the
null hypothesis; and the result is said to be not statistically significant.

There are two approaches to determining whether to reject the null hypothesis or not.
The first involves the determination of the rejection or critical region of the test. The
rejection or critical region is the set of values of the test statistic for which we
reject $H_0$. It is obtained by using a pre-determined level of significance (or size of the
test). The level of significance, denoted by α, is the probability of committing a Type I
error. The levels of significance often used in the literature include α = 1% (i.e. α = 0.01),
or α = 5% (i.e. α = 0.05 ), or α = 10% (i.e. α = 0.10 ).
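
The meaning of α as the probability of a Type I error can be illustrated by simulation. The following is a minimal sketch, not part of the course text: it assumes a normal population with known standard deviation and an upper-tailed z test (the test developed in Session 2), and the population values, sample size, number of trials and function name are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import NormalDist


def simulated_type_i_error_rate(alpha, n=25, trials=20000, seed=2024):
    """Estimate how often a true H0 is rejected by an upper-tailed z test."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    mu0, sigma = 50.0, 10.0             # hypothetical population; H0: mu = mu0 is TRUE
    critical_value = NormalDist().inv_cdf(1 - alpha)
    standard_error = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / standard_error
        if z >= critical_value:         # rejecting here is a Type I error
            rejections += 1
    return rejections / trials


for alpha in (0.01, 0.05, 0.10):
    rate = simulated_type_i_error_rate(alpha)
    print(f"alpha = {alpha:.2f}: simulated Type I error rate = {rate:.3f}")
```

Because the samples are drawn with the null hypothesis true, every rejection is a Type I error, so the simulated rejection rate should settle close to the chosen α.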

The second approach involves calculation of the p-value of the test. The p-value is the
probability, computed under the assumption that the null hypothesis is true, of obtaining a
value of the test statistic at least as extreme as the one actually observed.
The null hypothesis is rejected for “small” p-values (usually for
p < 0.05 ). Generally, the null hypothesis is rejected at the level of significance α if
p < α . For values of p ≥ α , there is not enough evidence to reject the null hypothesis.
We shall limit ourselves to the first approach in this module.
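
Although this module uses the critical-region approach, the p-value rule above is easy to sketch in code. The example below is illustrative only and assumes that the test statistic follows the standard normal distribution when the null hypothesis is true; the function name `p_value`, the labels for the alternatives and the value z = 1.96 are hypothetical.

```python
from statistics import NormalDist


def p_value(z, alternative):
    """p-value of an observed z, assuming z ~ N(0, 1) when H0 is true."""
    cdf = NormalDist().cdf
    if alternative == "less":          # H1: parameter below the null value
        return cdf(z)
    if alternative == "greater":       # H1: parameter above the null value
        return 1 - cdf(z)
    if alternative == "two-sided":     # H1: parameter differs from the null value
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


alpha = 0.05
p = p_value(1.96, "two-sided")         # hypothetical observed statistic
print(f"p = {p:.4f}; reject H0 at alpha = {alpha}: {p < alpha}")
```

The decision rule in the last line is the one stated in the text: reject the null hypothesis when p < α.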

The probability of rejecting the null hypothesis when it is in fact true is controlled,
since it equals the pre-determined level of significance α. Therefore,
rejecting the null hypothesis is a strong and reliable statistical result. On the other hand,
being unable to reject the null hypothesis is a weak result and should not necessarily
lead to its acceptance, because the probability of failing to reject the null hypothesis
when it is false is usually unknown. Thus, in the event of failing to reject the null
hypothesis, the conclusion should be that there is not enough evidence to reject the null
hypothesis based on the given data.

Self- Assessment Questions


Exercise 1.1

1. Which of the following constitute a valid pair of hypotheses?
I. $H_0: \bar{x} = 30$ versus $H_1: \bar{x} < 30$.
II. $H_0: \mu = 270$ versus $H_1: \mu \ne 270$.
III. $H_0: \pi > 0.3$ versus $H_1: \pi = 0.3$.
IV. $H_0: \mu = 70$ versus $H_1: \mu = 69$.
V. $H_0: \pi = 0.3$ versus $H_1: \pi > 0.3$.
A. II only
B. V only
C. II and V only
D. All of the above
2. The null and alternative hypotheses divide all possibilities into ___.
A. two sets that overlap
B. two non-overlapping sets.
C. two sets that may or may not overlap.
D. as many sets as necessary to cover all possibilities.


3. Which of the following is true about the null and alternative hypotheses?
A. Exactly one hypothesis must be true.
B. Both hypotheses must be true.
C. It is possible for both hypotheses to be true.
D. It is possible for neither hypothesis to be true.
4. One-sided alternative hypotheses are phrased in terms of ______.
A. ≠
B. > or <
C. ≈ or =
D. ≥ or ≤
5. A sampling distribution can be based on which of the following?
A. Sample means
B. Sample correlations
C. Sample proportions
D. All of the above
6. A Type II error occurs when _____.
A. the null hypothesis is false and we fail to reject it.
B. the null hypothesis is true and we reject it.
C. the sample mean differs from the population mean.
D. the test is biased.
7. The form of the alternative hypothesis can be _____.
A. one-sided.
B. two-sided.
C. neither one-sided nor two-sided.
D. one-sided or two-sided.
8. A two-sided test is one where _____.
A. results in only one direction can lead to rejection of the null hypothesis.
B. negative sample means lead to rejection of the null hypothesis.
C. results in either of two directions can lead to rejection of the null hypothesis.
D. no results lead to the rejection of the null hypothesis.
9. The value chosen for α in a hypothesis test is known as _____.
A. the rejection level.
B. the acceptance level.
C. the significance level.
D. the error in the hypothesis test.


10. The null hypothesis usually represents _____.


A. the theory the researcher would like to prove.
B. the preconceived ideas of the researcher.
C. the perceptions of the sample population.
D. the status quo.
11. Which of the following values is not often used for α ?
A. 0.01
B. 0.05
C. 0.10
D. 0.25
12. Small p-values indicate more evidence in support of _____.
A. the null hypothesis
B. the alternative hypothesis
C. the quality of the researcher.
D. further testing.
13. If a teacher is trying to prove that a new method of teaching statistics is more
effective than the traditional one, he/she will conduct a _____.
A. one-sided test.
B. two-sided test.
C. point estimate of the population parameter.
D. confidence interval.

14. A Type I error occurs when _____.


A. the null hypothesis is false and we fail to reject it.
B. the null hypothesis is true and it is rejected.
C. the sample mean differs from the population mean.
D. the test is biased.


SESSION 2: TESTS CONCERNING A SINGLE POPULATION MEAN (LARGE SAMPLES)

Suppose that we wish to test the null hypothesis that the mean, $\mu$, of a
normal population with variance $\sigma^2$ equals a specific value, $\mu_0$. Then
there are three possible alternative hypotheses, namely $H_1: \mu < \mu_0$, or $H_1: \mu > \mu_0$,
or $H_1: \mu \ne \mu_0$.

In this session, we shall learn how to test $H_0: \mu = \mu_0$ against any of the three possible
alternative hypotheses in two cases:
(1) the population we are sampling is normal and its standard deviation, $\sigma$, is known;
and
(2) the standard deviation, $\sigma$, is unknown, but the sample size, n, is large.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single population mean;
2. conduct tests for single means with known population σ ; and
3. conduct tests for single means with unknown population σ ; but large sample size,
n.
Now read on …

2.1 Tests for Means from Single Normal Populations with known σ

Suppose that we wish to test the null hypothesis that the mean, $\mu$, of a normal
population with variance $\sigma^2$ equals a specific value, $\mu_0$. That is, suppose we wish to
test $H_0: \mu = \mu_0$ against one of the three alternative hypotheses $H_1: \mu < \mu_0$,
$H_1: \mu > \mu_0$, or $H_1: \mu \ne \mu_0$. Then we need to perform one of the tests in Table 1.2
based on a random sample of size n from this population.

Table 1.2: Test of $H_0: \mu = \mu_0$ against various alternatives

| (a) | (b) | (c) |
|---|---|---|
| $H_0: \mu = \mu_0$ | $H_0: \mu = \mu_0$ | $H_0: \mu = \mu_0$ |
| $H_1: \mu < \mu_0$ | $H_1: \mu > \mu_0$ | $H_1: \mu \ne \mu_0$ |

If the population we are sampling is normal and $\sigma$ is known, then the test statistic is
given by

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, \qquad (1.1)$$

where $\bar{x}$ is the mean of the sample data, $\mu_0$ the hypothesised value of the population
mean, $\sigma$ the standard deviation of the population, and n the size of the sample. Note that
the term $\sigma / \sqrt{n}$ is
called the standard error of the mean.
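
Equation (1.1) can be computed directly. The short sketch below is an illustration rather than part of the course text; the function name `z_statistic` and the figures passed to it are hypothetical.

```python
import math


def z_statistic(sample_mean, mu0, sigma, n):
    """Return (z, standard_error) for testing H0: mu = mu0 with known sigma."""
    standard_error = sigma / math.sqrt(n)      # standard error of the mean
    z = (sample_mean - mu0) / standard_error   # Equation (1.1)
    return z, standard_error


# Hypothetical figures: sample mean 52 from n = 25, testing mu0 = 50 with sigma = 10.
z, se = z_statistic(sample_mean=52.0, mu0=50.0, sigma=10.0, n=25)
print(f"standard error = {se:.2f}, z = {z:.2f}")
# prints: standard error = 2.00, z = 1.00
```

Note that the denominator is the standard error $\sigma / \sqrt{n}$, not $\sigma$ itself, so the same difference between the sample mean and $\mu_0$ gives a larger z as the sample size grows.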

Table 1.3 contains the critical regions for conducting any of the tests in Table 1.2.

Table 1.3: Critical regions for testing $H_0: \mu = \mu_0$ against various alternatives

| Alternative hypothesis | Rejection region for level $\alpha$ test |
|---|---|
| $H_1: \mu < \mu_0$ | $z \le -z_{\alpha}$ (lower-tailed test) |
| $H_1: \mu > \mu_0$ | $z \ge z_{\alpha}$ (upper-tailed test) |
| $H_1: \mu \ne \mu_0$ | Either $z \le -z_{\alpha/2}$ or $z \ge z_{\alpha/2}$ (two-tailed test) |

In Table 1.3, $-z_{\alpha}$ is the value of z that leaves an area of $\alpha$ to its left, $z_{\alpha}$ is the value of
z that leaves an area of $\alpha$ to its right, and $-z_{\alpha/2}$ and $z_{\alpha/2}$ are the values of z that
leave an area of $\alpha/2$ to the left and to the right, respectively. An example of the z table is
shown in Table I under Statistical
Tables at the end of the module.
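
The rejection regions in Table 1.3 can also be obtained without printed tables. The following is a minimal sketch using the standard normal distribution in Python's standard library; the function names and the labels "less", "greater" and "two-sided" are assumptions made for this illustration. The computed critical values may differ from the printed z table in the third decimal place.

```python
from statistics import NormalDist


def critical_region(alpha, alternative):
    """Return (lower, upper) critical values; None means that tail is unused."""
    std_normal = NormalDist()
    if alternative == "less":                       # reject if z <= -z_alpha
        return std_normal.inv_cdf(alpha), None
    if alternative == "greater":                    # reject if z >= z_alpha
        return None, std_normal.inv_cdf(1 - alpha)
    if alternative == "two-sided":                  # reject if |z| >= z_{alpha/2}
        cutoff = std_normal.inv_cdf(1 - alpha / 2)
        return -cutoff, cutoff
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


def reject_h0(z, alpha, alternative):
    """Apply the rejection rule of Table 1.3 to an observed z."""
    lower, upper = critical_region(alpha, alternative)
    in_lower_tail = lower is not None and z <= lower
    in_upper_tail = upper is not None and z >= upper
    return in_lower_tail or in_upper_tail


for alt in ("less", "greater", "two-sided"):
    print(alt, critical_region(0.05, alt))
```

Splitting α equally between the two tails in the two-sided case is what makes the cut-off $z_{\alpha/2}$ larger than $z_{\alpha}$ at the same level of significance.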

We now illustrate how to conduct each of the tests in Table 1.2 with the following three
examples.
Example 1.1
The scores of an examination for some students have been normally distributed for
some time now with mean 200 and standard deviation 16. Currently some lecturers
think that the performance has gone down. To support this claim, scores of 100 students
were taken for a study. It was found that the mean for the hundred students was 193.2.
(a) Set up the null and alternative hypotheses.
(b) Will you agree with the lecturers’ claim at the 5% significance level?


Solution
(a) The null and alternative hypotheses are $H_0: \mu = 200$ (in this case $\mu_0 = 200$)
against $H_1: \mu < 200$.

(b) Substituting $\bar{x} = 193.2$, $\mu_0 = 200$, $\sigma = 16$, and $n = 100$ in Equation (1.1), we
obtain

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{193.2 - 200}{16 / \sqrt{100}} = -4.25$$

We are given α = 5% = 0.05 , and so we sketch the rejection region as follows:

From the z-tables, the value of z that leaves an area of 0.05 to its left is −1.645. Thus
we would reject the null hypothesis if $z \le -1.645$. Since $z = -4.25$ is less than
−1.645, we reject $H_0$ in favour of $H_1$. We therefore agree with the lecturers that
the performance of students has gone down.
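
The arithmetic in Example 1.1 can be checked with a few lines of code. This is only a verification sketch using the figures given in the example; the variable names are chosen for illustration.

```python
import math
from statistics import NormalDist

# Figures from Example 1.1 (lower-tailed test of H0: mu = 200).
sample_mean, mu0, sigma, n, alpha = 193.2, 200.0, 16.0, 100, 0.05

z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # Equation (1.1)
critical_value = NormalDist().inv_cdf(alpha)       # value with area alpha to its left
reject = z <= critical_value

print(f"z = {z:.2f}, critical value = {critical_value:.3f}, reject H0: {reject}")
# prints: z = -4.25, critical value = -1.645, reject H0: True
```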

Example 1.2
The sales of a store had an average of GH¢8,000 per day. The store introduced several
advertising campaigns in order to increase sales. To determine whether or not the
advertising campaigns have been effective, a sample of 64 days of sales was selected. It
was found that the average was GH¢8,300 per day. From past information, it is known
that the population follows a normal distribution with a standard deviation of
GH¢1,200.
(a) Set up the null and alternative hypotheses.
(b) Test whether or not the advertising campaigns were effective at the 0.05 level of
significance.


Solution
(a) We wish to test $H_0: \mu = 8{,}000$ against $H_1: \mu > 8{,}000$.

(b) Substituting $\bar{x} = 8{,}300$, $\mu_0 = 8{,}000$, $\sigma = 1{,}200$, and $n = 64$ in Equation (1.1),
we obtain

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{8{,}300 - 8{,}000}{1{,}200/\sqrt{64}} = 2.00$$

We are given $\alpha = 5\% = 0.05$, so the rejection region is the upper tail of the
standard normal curve with area 0.05.

From the z-tables, the value of z that leaves an area of 0.05 to its right is 1.645. Thus
we would reject the null hypothesis if $z \ge 1.645$. Since $z = 2.00$ is greater than
1.645, we reject $H_0$ in favour of $H_1$. We therefore conclude that the advertising
campaigns were effective.

Example 1.3
The head teacher of a certain Junior High School claims that the average height of
students in his school equals 130 cm. A random sample of nine students was selected
and their average height was found to be 131.08 cm. Suppose that the distribution of
heights of students is normal with standard deviation 1.5 cm.
(a) Set up the null and alternative hypotheses.
(b) Determine whether the data contradict the head teacher's claim, at the 0.01 level
of significance.

Solution
(a) The null hypothesis is $H_0: \mu = 130$ and the alternative hypothesis is $H_1: \mu \ne 130$
(Note: both $\mu < 130$ and $\mu > 130$ contradict $H_0$).

(b) Substituting $\bar{x} = 131.08$, $\mu_0 = 130$, $\sigma = 1.5$, and $n = 9$ in Equation (1.1), we
obtain

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{131.08 - 130}{1.5/\sqrt{9}} = 2.16.$$

Now we have $z_{\alpha/2} = z_{0.01/2} = z_{0.005}$.

From the z-tables, the value of z that leaves an area of 0.005 to its right is 2.575
(and $-2.575$ leaves the same area to its left). Thus, we would reject the null
hypothesis if $z \ge 2.575$ or $z \le -2.575$. Now since $z = 2.16$ is neither greater than
2.575 nor less than $-2.575$, we are unable to reject the null hypothesis at the 0.01
level of significance. We therefore conclude that the data do not contradict the head
teacher's claim.
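The three z tests in Examples 1.1 to 1.3 can also be checked numerically. The following is a minimal Python sketch and not part of the original module: the function name `z_test` and its arguments are our own assumptions, and the critical values are computed by Python's standard library rather than read from Table I (so the two-tailed 1% value comes out as 2.576 instead of the tabulated 2.575).

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha, alternative):
    """One-sample z test with known sigma (hypothetical helper).

    alternative is 'less', 'greater' or 'two-sided'.
    Returns the test statistic and whether H0 is rejected.
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    if alternative == "less":
        reject = z <= -std_normal.inv_cdf(1 - alpha)
    elif alternative == "greater":
        reject = z >= std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":
        reject = abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, reject


# Example 1.1: lower-tailed test at the 5% level
z1, reject1 = z_test(193.2, 200, 16, 100, 0.05, "less")
# Example 1.2: upper-tailed test at the 5% level
z2, reject2 = z_test(8300, 8000, 1200, 64, 0.05, "greater")
# Example 1.3: two-tailed test at the 1% level
z3, reject3 = z_test(131.08, 130, 1.5, 9, 0.01, "two-sided")

for label, z, reject in [("1.1", z1, reject1), ("1.2", z2, reject2), ("1.3", z3, reject3)]:
    print(f"Example {label}: z = {z:.2f}, reject H0: {reject}")
```

The decisions agree with the worked solutions: $H_0$ is rejected in Examples 1.1 and 1.2 and is not rejected in Example 1.3.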

2.2 Tests for Means from Single Populations with unknown σ but
large sample size, n
We now turn our attention to tests of means from single populations whose standard
deviations, σ, are not known, but whose sample sizes are large.

When the population standard deviation, σ, is not known but the sample size is large
enough ($n \ge 30$), the z tests in Section 2.1 can easily be modified to yield valid test
procedures. A large sample size n implies that the standardized variable

$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

has approximately a standard normal distribution. If we substitute the null value $\mu_0$ in
place of $\mu$, we obtain the test statistic

$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

which has an approximate standard normal distribution when $H_0$ is true.

The critical regions for these tests remain the same as in Table 1.3.

Example 1.4
A random sample of size n = 100 observations taken from a population with mean µ
yielded the sample mean x = 18.9 and sample standard deviation of s = 12.6 . If the
hypotheses are H 0 : µ = 16 and H1 : µ ≠ 16 ,
(a) calculate the value of the appropriate test statistic for this test;
(b) hence determine whether H 0 should be rejected if the significance level were 1%.

Solution
(a) In this problem, we do not know σ, but the sample size, $n = 100$, is large. We
therefore replace σ with s in Equation (1.1) to get

$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}.$$

Substituting $\bar{x} = 18.9$, $\mu_0 = 16$, $s = 12.6$, and $n = 100$ into the equation above,
we obtain

$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{18.9 - 16}{12.6/\sqrt{100}} = 2.30$$

(b) Now we have $z_{\alpha/2} = z_{0.01/2} = z_{0.005}$.

From the z-tables, $z_{0.005} = 2.575$. Thus, we would reject the null hypothesis if
$z \ge 2.575$ or $z \le -2.575$. Now since $z = 2.30$ is neither greater than 2.575 nor
less than $-2.575$, we cannot reject the null hypothesis at the 0.01 level of
significance.


Self- Assessment Questions


Exercise 1.2
1. Given the data below, test the hypotheses: H 0 : µ = 30 versus H 1 : µ < 30 , at the
0.05 level of significance,

14.1 14.5 15.5 16.0 16.0 16.7 16.9 17.1 17.5 17.8
17.8 18.1 18.2 18.3 18.3 19.0 19.2 19.4 20.0 20.0
20.8 20.8 21.0 21.5 23.5 27.5 27.5 28.0 28.3 30.0
30.0 31.6 31.7 31.7 32.5 33.5 33.9 35.0 35.0 35.0
36.7 40.0 40.0 41.3 41.7 47.5 50.0 51.0 51.8 54.4
55.0 57.0

2. Given that $n = 64$, $\bar{x} = 10.3$ and $s = 4$, test the null hypothesis $H_0: \mu = 10$
against the one-sided alternative $H_1: \mu > 10$, at the $\alpha = 0.05$ significance level.

3. The viscosity of a liquid detergent is supposed to average 800 centistokes at
25 °C. A random sample of 16 batches of detergent is collected, and the average
viscosity is 812. Suppose we know that the standard deviation of viscosity is
$\sigma = 25$ centistokes.
(a) State the hypotheses that should be tested.
(b) Test the hypothesis in (a) at the 0.05 significance level, and draw
appropriate conclusions.


SESSION 3: TESTS CONCERNING A SINGLE POPULATION MEAN (SMALL SAMPLES)
In Session 2, we learnt how to test H 0 : µ = µ 0 against any of the three
possible alternative hypotheses given that (1) the population we are
sampling is normal and its standard deviation, σ , is known; and (2) the standard
deviation, σ , is unknown; but the sample size, n, is large.

In some situations, we may not know the standard deviation, σ , of the population we
are sampling and yet the sample size may be small (i.e. less than 30). In this session, we
shall learn how to perform tests for single population means when σ is not known and
n < 30 .

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single population means;
and
2. conduct tests for single means for populations with unknown σ and small sample
size, n.

Now read on …

If the population standard deviation σ is unknown and the sample size n is small, then
we cannot assume that the sample standard deviation s will be a good approximation for
σ . We must, therefore, use the t-distribution instead of the standard normal z
distribution to make inferences about a population mean, µ .

Therefore, for tests concerning single population means where σ is not known and n is
small, we use the t statistic:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

which has an approximate t-distribution with $n - 1$ degrees of freedom.

Table 1.4 contains the critical regions for testing H 0 : µ = µ 0 against any of the
possible alternatives.


Table 1.4: Critical regions for testing $H_0: \mu = \mu_0$ against various alternatives

Alternative hypothesis      Rejection region for level α test
$H_1: \mu < \mu_0$          $t \le -t_{\alpha}$ (lower-tailed test)
$H_1: \mu > \mu_0$          $t \ge t_{\alpha}$ (upper-tailed test)
$H_1: \mu \ne \mu_0$        Either $t \le -t_{\alpha/2}$ or $t \ge t_{\alpha/2}$ (two-tailed test)

Note that $t_{\alpha}$ and $t_{\alpha/2}$ are based on $n - 1$ degrees of freedom. Their values can be read
from the t-table (see Table II under Statistical Tables at the end of the module). Let us
illustrate with an example.

Example 1.5
The manufacturer of a new fiberglass tire claims that its average life will be at least
40,000 miles. To verify this claim a sample of 12 tires is tested, with their lifetimes (in
1,000s of miles) being as follows:

Tire Life
1 36.1
2 40.2
3 33.8
4 38.5
5 42.0
6 35.8
7 37.0
8 41.0
9 36.8
10 37.2
11 33.0
12 36.0

Test the manufacturer’s claim at the 5 percent level of significance.


Solution
We wish to test
$H_0: \mu \ge 40$ against $H_1: \mu < 40$.

Observe that σ is not known and $n = 12$ is small; we therefore have to use the
t-distribution with $n - 1 = 11$ degrees of freedom. From the data, $n = 12$, $\mu_0 = 40$,
$\bar{x} = 37.2833$ and $s = 2.7319$.
Substituting the above into the t statistic, we obtain

$$t = \frac{37.2833 - 40}{2.7319/\sqrt{12}} = -3.445$$

From the one-tailed t-tables, the value that corresponds to $\alpha = 0.05$ with 11 degrees of
freedom is 1.796. That is, $t_{0.05}(11) = 1.796$. Since $t = -3.445$ is less than
$-t_{0.05}(11) = -1.796$, we reject $H_0$ and conclude that the average life of the new
fiberglass tire is less than 40,000 miles.

Example 1.6
A sample of 12 radon detectors of a certain type was selected, and each was exposed to
100 pCi/L of radon. The resulting readings were as follows:
105.6 90.9 91.2 96.9 96.5 91.3
100.1 105.5 99.6 107.7 103.3 92.4
Do these data suggest that the population mean reading under these conditions differs
from 100?
(a) State the null and alternative hypotheses.
(b) Test these hypotheses at $\alpha = 0.05$.

Solution
We wish to test
$H_0: \mu = 100$ against $H_1: \mu \ne 100$.

From the data, $n = 12$, $\mu_0 = 100$, $\bar{x} = 98.42$ and $s = 6.160$.

Substituting the above into the t statistic, we obtain

$$t = \frac{98.42 - 100}{6.160/\sqrt{12}} = -0.8885$$

Since the test is two-tailed with $\alpha = 0.05$, we need the t value that leaves an area of
$\alpha/2 = 0.025$ in each tail. From the t-tables, the value with 11 degrees of
freedom is 2.201. That is, $t_{0.025}(11) = 2.201$. Since $t = -0.8885$ is neither less than
$-t_{0.025}(11) = -2.201$ nor greater than $t_{0.025}(11) = 2.201$, we cannot reject $H_0$. We,
therefore, conclude that the data do not suggest that the population mean reading
under these conditions differs from 100.
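The arithmetic in Examples 1.5 and 1.6 can be reproduced directly from the raw data. The sketch below is an illustration only and not part of the original module: the helper name `t_statistic` is our own, and because Python's standard library has no t-distribution, the critical values 1.796 and 2.201 are simply typed in as quoted from the t-table in the text.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(data, mu0):
    """Return sample mean, sample sd (n - 1 divisor) and the one-sample t statistic."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)
    t = (xbar - mu0) / (s / sqrt(n))
    return xbar, s, t


# Example 1.5: tire lifetimes in 1,000s of miles
tire_life = [36.1, 40.2, 33.8, 38.5, 42.0, 35.8, 37.0, 41.0, 36.8, 37.2, 33.0, 36.0]
# Example 1.6: radon detector readings in pCi/L
radon = [105.6, 90.9, 91.2, 96.9, 96.5, 91.3, 100.1, 105.5, 99.6, 107.7, 103.3, 92.4]

# Critical values for 11 degrees of freedom, copied from the text (assumed table values)
T_05_11 = 1.796    # one-tailed, alpha = 0.05
T_025_11 = 2.201   # two-tailed, alpha/2 = 0.025

xbar_tire, s_tire, t_tire = t_statistic(tire_life, 40)
reject_tire = t_tire <= -T_05_11            # lower-tailed test

xbar_radon, s_radon, t_radon = t_statistic(radon, 100)
reject_radon = abs(t_radon) >= T_025_11     # two-tailed test

print(f"Tires: mean = {xbar_tire:.4f}, s = {s_tire:.4f}, t = {t_tire:.3f}, reject H0: {reject_tire}")
print(f"Radon: mean = {xbar_radon:.4f}, s = {s_radon:.4f}, t = {t_radon:.3f}, reject H0: {reject_radon}")
```

For the radon data the computed t differs slightly in the third decimal place from the value in the text, because the text rounds the sample mean to 98.42 before substituting; the decision is unaffected.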

Self- Assessment Questions


Exercise 1.3

1. A manufacturer of computer disk drives monitors the retail prices of its drives in
order to gauge the market. For one type of drive the list price is GH¢ 750, and the
manufacturer wishes to know whether the current mean retail price differs from
the list price. Seventeen retail establishments are sampled, and the current prices
for the drive are determined. The mean and standard deviation for the 17 retail
prices are calculated:
$\bar{x}$ = GH¢ 732, $s$ = GH¢ 38

Does this sample provide sufficient evidence to conclude that the mean retail
price differs from the list price of GH¢ 750?

2. A major car manufacturer wants to test a new engine to determine whether it
meets new air pollution standards. The mean emission, $\mu$, of all engines of this
type must be less than 20 parts per million of carbon. Ten engines are
manufactured for testing purposes, and the mean and standard deviation of the
emission for this sample of engines are determined to be
$\bar{x} = 17.1$ parts per million, $s = 3.0$ parts per million

Do the data supply sufficient evidence to allow the manufacturer to conclude that
this type of engine meets the pollution standard? Assume that the manufacturer is
willing to risk a Type I error with probability equal to α = 0.01 .

3. The specifications for a certain kind of ribbon call for a mean breaking strength of
185 pounds. If five pieces randomly selected from different rolls have breaking
strengths of 171.6, 191.8, 187.3, 184.9 and 189.1 pounds, test the hypotheses
H 0 : µ = 185 against H1 : µ < 185
at the 0.05 level of significance.


SESSION 4: TESTS CONCERNING A SINGLE POPULATION PROPORTION (LARGE SAMPLES)
There are certain experiments whose outcomes result in count data. For
example, the votes obtained by a presidential candidate in an election,
the number of members of parliament in Ghana who are females, the number of schools
on the School Feeding Programme in Ghana, etc., are all examples of count data.
Appropriate models for the analysis of count data include the binomial and Poisson
distributions which you studied in the module Introduction to Probability.

In this session, you will learn about one of the most common tests based on count data,
namely, a test concerning the parameter, p, of the binomial distribution. You will learn
to conduct tests to determine whether the proportion (or percentage) of votes obtained
by a presidential candidate is enough to win him/her the presidency, or whether the true
proportion of schools on the School Feeding Programme is 20%.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single population
proportion; and
2. conduct tests concerning single population proportions for large samples.

Now read on …

Tests concerning the population proportion, p, are based on a random sample of size n
from the population. If n is large [$np \ge 10$ and $n(1 - p) \ge 10$], then both x (the number
of successes in the experiment) and the estimator, $\hat{p} = x/n$, are approximately normally
distributed.

Suppose that we wish to test the null hypothesis $H_0: p = p_0$ against any of the
alternatives $H_1: p < p_0$ or $H_1: p > p_0$ or $H_1: p \ne p_0$. If n is large and $H_0$ is true,
then the test statistic is given by

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}, \qquad (1.2)$$

and has approximately a standard normal distribution.


Table 1.5 contains the critical regions for testing H 0 : p = p 0 against any of the three
possible alternatives.


Table 1.5: Critical regions for testing $H_0: p = p_0$ against various alternatives

Alternative hypothesis      Rejection region for level α test
$H_1: p < p_0$              $z \le -z_{\alpha}$ (lower-tailed test)
$H_1: p > p_0$              $z \ge z_{\alpha}$ (upper-tailed test)
$H_1: p \ne p_0$            Either $z \le -z_{\alpha/2}$ or $z \ge z_{\alpha/2}$ (two-tailed test)

Note that these test procedures are valid provided that n is large
[i.e. $np_0 \ge 10$ and $n(1 - p_0) \ge 10$].

In the next three examples, we illustrate how to conduct each of the possible tests.

Example 1.7
An oil company claims that less than 20% of all car owners have not tried its gasoline.
Test the claim, at the 0.01 level of significance, if a random check reveals that 22 out of
200 car owners have not tried the company’s gasoline.

Solution
We wish to test

$H_0: p = 0.20$ against $H_1: p < 0.20$.

Here $p_0 = 0.20$, the number of successes is $x = 22$, and the sample size is $n = 200$.
Thus, $\hat{p} = 22/200 = 0.11$. Substituting these into Equation (1.2), we obtain

$$z = \frac{0.11 - 0.20}{\sqrt{0.20(1 - 0.20)/200}} = \frac{-0.09}{0.0283} = -3.1802.$$

From the z tables, $z_{0.01} = 2.33$. Therefore, the rejection region is $z < -2.33$. Since
$z = -3.1802 < -2.33$, we reject the null hypothesis. We, therefore, conclude that less
than 20% of all car owners have not tried the company's gasoline.

Example 1.8
Natural cork in wine bottles is subject to deterioration, and as a result wine in such
bottles may experience contamination. It is reported that, in a tasting of commercial
chardonnays, 16 of 91 bottles were considered spoiled to some extent by
cork-associated characteristics. Do these data provide strong evidence for concluding that
more than 15% of all such bottles are contaminated in this way? Conduct the test at the
0.10 level of significance.

Solution
We wish to test

$H_0: p = 0.15$ against $H_1: p > 0.15$.

Here $p_0 = 0.15$, the number of successes is $x = 16$, and the sample size is $n = 91$. Thus,
$\hat{p} = 16/91 = 0.1758$. Substituting these into Equation (1.2), we obtain

$$z = \frac{0.1758 - 0.15}{\sqrt{0.15(1 - 0.15)/91}} = \frac{0.0258}{0.0374} = 0.6898.$$

From the z tables, $z_{0.10} = 1.28$. Therefore, the rejection region is $z > 1.28$. Since
$z = 0.6898 < 1.28$, we fail to reject the null hypothesis. We therefore say that there is no
strong evidence to conclude that more than 15% of all such bottles are contaminated.

Example 1.9
In a certain school, 500 students were polled and 245 of them endorsed the new system
of paying club dues. Test the null hypothesis

H 0 : p = 0.55

against the two-sided alternative hypothesis

H 1 : p ≠ 0.55 ,

at the level of significance α = 0.05 .

Solution
Here $p_0 = 0.55$, the number of successes is $x = 245$, and the sample size is $n = 500$.
Thus, $\hat{p} = 245/500 = 0.49$. Substituting these into Equation (1.2), we obtain

$$z = \frac{0.49 - 0.55}{\sqrt{0.55(1 - 0.55)/500}} = \frac{-0.06}{0.0222} = -2.703.$$

From the z tables, $z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$. Therefore, the rejection region is either
$z < -1.96$ or $z > 1.96$. Since $z = -2.703 < -1.96$, we reject the null hypothesis and
conclude that the percentage of students who endorsed the new system of paying dues is
different from 55%.
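Examples 1.7 to 1.9 all apply Equation (1.2), so they can be checked with one small routine. What follows is a hedged Python sketch rather than anything prescribed by the module: the function name `proportion_z_test` is hypothetical, and the critical values are computed from the standard normal distribution in Python's standard library instead of being read from Table I. Because the code does not round the denominator to four decimal places, the statistics for Examples 1.7 and 1.9 differ from the text in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(x, n, p0, alpha, alternative):
    """Large-sample z test for a single proportion (hypothetical helper)."""
    # The normal approximation is only used when n*p0 >= 10 and n*(1 - p0) >= 10
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("sample too small for the normal approximation")
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()
    if alternative == "less":
        reject = z <= -std_normal.inv_cdf(1 - alpha)
    elif alternative == "greater":
        reject = z >= std_normal.inv_cdf(1 - alpha)
    elif alternative == "two-sided":
        reject = abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, reject


z7, reject7 = proportion_z_test(22, 200, 0.20, 0.01, "less")         # Example 1.7
z8, reject8 = proportion_z_test(16, 91, 0.15, 0.10, "greater")       # Example 1.8
z9, reject9 = proportion_z_test(245, 500, 0.55, 0.05, "two-sided")   # Example 1.9

for label, z, reject in [("1.7", z7, reject7), ("1.8", z8, reject8), ("1.9", z9, reject9)]:
    print(f"Example {label}: z = {z:.3f}, reject H0: {reject}")
```

The decisions match the worked solutions: reject in Examples 1.7 and 1.9, fail to reject in Example 1.8.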

Self- Assessment Questions


Exercise 1.4
1. In a certain random sample of 600 cars making a right turn at a certain
intersection, 157 pulled into the wrong lane. Use the 0.05 level of significance to
test the null hypothesis that the actual proportion of drivers who make this mistake at the
given intersection is $p = 0.30$ against the alternative hypothesis $p \ne 0.30$.

2. A manufacturer of a spot remover claims that his product removes 90% of all
spots. If, in a sample, only 174 of 200 spots were removed with the
manufacturer’s product, test the null hypothesis p = 0.90 against the alternative
hypothesis p < 0.90 at the 0.05 level of significance.


SESSION 5: TESTS CONCERNING A SINGLE POPULATION PROPORTION (SMALL SAMPLES)
In Session 4, we learnt that if the sample size n is large then both the
number of successes in the experiment, x, and the estimator, $\hat{p} = x/n$, are
approximately normally distributed. Therefore, tests concerning p are based on z, the
standard normal distribution. In situations where n is small, tests concerning p are
based directly on binomial probabilities.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single
population proportion; and
2. conduct tests concerning single population proportions for small samples.

Now read on …

Suppose that we wish to test $H_0: p = p_0$ against $H_1: p < p_0$ at the α level of significance.
Then the critical region is $x \le k_{\alpha}$, where x is the number of observed successes, $k_{\alpha}$ is the
largest integer for which

$$\sum_{k=0}^{k_{\alpha}} B(k; n, p_0) \le \alpha,$$

and $B(k; n, p_0)$ is the probability of observing k successes in n binomial trials when
$p = p_0$.

The critical region, if the alternative hypothesis were $H_1: p > p_0$, would be $x \ge k^*_{\alpha}$,
where $k^*_{\alpha}$ is the smallest integer for which

$$\sum_{k=k^*_{\alpha}}^{n} B(k; n, p_0) \le \alpha.$$

Similarly, the critical region, if the alternative hypothesis were $H_1: p \ne p_0$, would be
$x \le k_{\alpha/2}$ or $x \ge k^*_{\alpha/2}$, where $k_{\alpha/2}$ is the largest integer for which

$$\sum_{k=0}^{k_{\alpha/2}} B(k; n, p_0) \le \frac{\alpha}{2},$$

and $k^*_{\alpha/2}$ is the smallest integer for which

$$\sum_{k=k^*_{\alpha/2}}^{n} B(k; n, p_0) \le \frac{\alpha}{2}.$$

Example 1.10
It is claimed that 40% of patients that attend a certain clinic on any day are smokers.
Suppose that on a particular day, 3 out of a sample of 13 patients attending the clinic
were found to be smokers. Test the hypothesis

H 0 : p = 0.40
against
H 1 : p ≠ 0.40

at the 5% significance level.

Solution
In this problem, $x = 3$ and $n = 13$. Since $\alpha = 0.05$, $k_{\alpha/2} = k_{0.025}$. Now, from binomial
tables (see Table III at the end of the module),

$$\sum_{k=0}^{1} B(k; 13, 0.40) = B(0; 13, 0.40) + B(1; 13, 0.40) = 0.0013 + 0.0113 = 0.0126$$

Thus,

$$\sum_{k=0}^{1} B(k; 13, 0.40) = 0.0126 < 0.025,$$

whereas including $B(2; 13, 0.40) = 0.0453$ would raise the sum to 0.0579, which exceeds
0.025. This implies that the largest integer $k_{\alpha/2}$ for which

$$\sum_{k=0}^{k_{\alpha/2}} B(k; 13, 0.40) \le 0.025$$

is 1.

Similarly, the smallest integer $k^*_{\alpha/2}$ for which

$$\sum_{k=k^*_{\alpha/2}}^{13} B(k; 13, 0.40) \le 0.025$$

is 10. That is,

$$\sum_{k=10}^{13} B(k; 13, 0.40) = B(10; 13, 0.40) + \cdots + B(13; 13, 0.40) = 0.0065 + 0.0012 + 0.0001 + 0.0000 = 0.0078,$$

whereas including $B(9; 13, 0.40) = 0.0243$ would raise the sum to 0.0321, which exceeds
0.025.

We reject the null hypothesis only if the number of successes, x, is less than or
equal to 1 or greater than or equal to 10. Since $x = 3$ is neither less than or equal to 1 nor
greater than or equal to 10, we cannot reject the null hypothesis. We, therefore, conclude that
the data do not contradict the claim that 40% of patients that attend the clinic on any day
are smokers.

Example 1.11
It has been claimed that more than 40% of all shoppers can identify a highly advertised
trademark. If, in a sample, 10 of 18 shoppers were able to identify the trademark, test at
the 0.05 level of significance whether H 0 : p = 0.40 can be rejected against
H1 : p > 0.40 .

Solution

In this problem, $x = 10$ and $n = 18$. Since $\alpha = 0.05$, $k^*_{\alpha} = k^*_{0.05}$, and the critical region is
$x \ge k^*_{0.05}$, where $k^*_{0.05}$ is the smallest integer for which

$$\sum_{k=k^*_{0.05}}^{n} B(k; n, p_0) \le 0.05.$$

Now, from binomial tables,

$$\sum_{k=12}^{18} B(k; 18, 0.40) = B(12; 18, 0.40) + \cdots + B(18; 18, 0.40) = 0.0145 + 0.0045 + 0.0011 + 0.0002 + 0.0000 + 0.0000 + 0.0000 = 0.0203,$$

whereas including $B(11; 18, 0.40) = 0.0374$ would raise the sum to 0.0577, which exceeds
0.05. Therefore, the smallest integer $k^*_{0.05}$ for which

$$\sum_{k=k^*_{0.05}}^{18} B(k; 18, 0.40) \le 0.05$$

is 12.

Since $x = 10$ is not greater than or equal to 12, we are not able to reject $H_0$. We conclude,
therefore, that there is not enough evidence that more than 40% of all shoppers can
identify the highly advertised trademark.
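The critical values $k_{\alpha}$ and $k^*_{\alpha}$ used in Examples 1.10 and 1.11 can be found by accumulating binomial probabilities until the tail area would exceed the allowed level. The Python sketch below illustrates that search under our own naming (`binom_pmf`, `lower_critical` and `upper_critical` are hypothetical helpers, not notation from the module); it computes the probabilities exactly instead of reading them from Table III.

```python
from math import comb


def binom_pmf(k, n, p):
    """B(k; n, p): probability of k successes in n binomial trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_critical(n, p0, level):
    """Largest k with P(X <= k) <= level, or None if even k = 0 is too likely."""
    cumulative = 0.0
    k_crit = None
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p0)
        if cumulative > level:
            break
        k_crit = k
    return k_crit


def upper_critical(n, p0, level):
    """Smallest k with P(X >= k) <= level, or None if even k = n is too likely."""
    tail = 0.0
    k_crit = None
    for k in range(n, -1, -1):
        tail += binom_pmf(k, n, p0)
        if tail > level:
            break
        k_crit = k
    return k_crit


# Example 1.10: two-sided test, n = 13, p0 = 0.40, alpha = 0.05 (0.025 in each tail)
k_low = lower_critical(13, 0.40, 0.025)
k_high = upper_critical(13, 0.40, 0.025)
reject_110 = 3 <= k_low or 3 >= k_high      # observed x = 3

# Example 1.11: upper-tailed test, n = 18, p0 = 0.40, alpha = 0.05
k_star = upper_critical(18, 0.40, 0.05)
reject_111 = 10 >= k_star                   # observed x = 10

print(f"Example 1.10: reject if x <= {k_low} or x >= {k_high}; reject H0: {reject_110}")
print(f"Example 1.11: reject if x >= {k_star}; reject H0: {reject_111}")
```

Returning `None` when no critical value exists is a design choice of this sketch; with very small n the tail area may exceed the significance level even at the most extreme outcome, in which case $H_0$ can never be rejected in that direction.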


Self- Assessment Questions


Exercise 1.5
1. A doctor claims that less than 30% of all persons exposed to a certain
amount of radiation will feel any ill effects. If, in a random sample, only 1 of 19
persons exposed to such radiation felt any ill effects, test H 0 : p = 0.30 against
H1 : p < 0.30 at the 0.05 level of significance.

2. In a random sample, 12 of 14 industrial accidents were due to unsafe working
conditions. Use the 0.01 level of significance to test
$H_0: p = 0.40$ against $H_1: p \ne 0.40$.

3. If $x = 4$ of $n = 20$ patients suffered serious side effects from a new medication,
test $H_0: p = 0.50$ against $H_1: p \ne 0.50$, where p is the proportion of patients
suffering serious side effects from the new medication. Test at the 0.05 level of
significance.


SESSION 6: TESTS CONCERNING A SINGLE POPULATION VARIANCE
It is important to conduct tests concerning population variances for a
number of reasons. For example, a pharmacist may want to know whether
the variation in the potency of a medicine is within acceptable limits; a teacher may
want to know the range of performance of his/her best student; and a surgeon may want
to know variations in the potencies in different anesthesia.

In this session, you will learn how to conduct tests concerning single population
variances. In particular, you will learn how to test the null hypothesis $H_0: \sigma^2 = \sigma_0^2$
against any one of the alternatives $H_1: \sigma^2 < \sigma_0^2$, or $H_1: \sigma^2 > \sigma_0^2$, or $H_1: \sigma^2 \ne \sigma_0^2$.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests on single population
variance; and
2. conduct tests concerning single population variances.

Now read on …

Suppose that we wish to test the null hypothesis $H_0: \sigma^2 = \sigma_0^2$ against any one of the
three alternatives $H_1: \sigma^2 < \sigma_0^2$, or $H_1: \sigma^2 > \sigma_0^2$, or $H_1: \sigma^2 \ne \sigma_0^2$. If the population we
are sampling has a normal distribution, then the test statistic is given by

$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} \qquad (1.3)$$

where $\chi^2$ is the value of the random variable having the chi-square distribution with
$n - 1$ degrees of freedom. Values of $\chi^2$ can be read from chi-square distribution
tables (see Table IV under Statistical Tables at the end of the module).

Table 1.6 contains the critical regions for testing $H_0: \sigma^2 = \sigma_0^2$ against any of the three
possible alternatives.


Table 1.6: Critical regions for testing $H_0: \sigma^2 = \sigma_0^2$ against the various alternatives

Alternative hypothesis          Rejection region for level α test
$H_1: \sigma^2 < \sigma_0^2$    $\chi^2 < \chi^2_{1-\alpha}$ (lower-tailed test)
$H_1: \sigma^2 > \sigma_0^2$    $\chi^2 > \chi^2_{\alpha}$ (upper-tailed test)
$H_1: \sigma^2 \ne \sigma_0^2$  Either $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$ (two-tailed test)

Example 1.12
Given that $n = 25$ and $s^2 = 9$, test $H_0: \sigma^2 = 10$ against the two-sided alternative
$H_1: \sigma^2 \ne 10$ at the 1% significance level.

Solution

Substituting $n = 25$, $s^2 = 9$, and $\sigma_0^2 = 10$ into Equation (1.3), we obtain

$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} = \frac{(25 - 1)(9)}{10} = 21.6.$$

Since we are dealing with a two-sided test and the significance level is 1%, we reject
$H_0$ if $\chi^2$ is less than $\chi^2_{0.995}$ or greater than $\chi^2_{0.005}$. From chi-square tables, the value
of $\chi^2_{0.995}$ (with 24 degrees of freedom) is 9.886 and that of $\chi^2_{0.005}$ (with 24 degrees of
freedom) is 45.558. Since $\chi^2 = 21.6$ is neither less than 9.886 nor greater than 45.558,
we cannot reject the null hypothesis.

Example 1.13
Suppose that the thickness of a part used in a semiconductor is its critical dimension and
that measurements of the thickness of a random sample of 18 such parts have the
variance $s^2 = 0.68$, where the measurements are in thousandths of an inch. The process is
considered to be under control if the variation of the thickness is given by a variance not
greater than 0.36. Assume that the measurements constitute a random sample from a
normal population and test the null hypothesis $H_0: \sigma^2 = 0.36$ against $H_1: \sigma^2 > 0.36$ at the
0.05 level of significance.


Solution

Substituting $n = 18$, $s^2 = 0.68$, and $\sigma_0^2 = 0.36$ into Equation (1.3), we obtain

$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} = \frac{(18 - 1)(0.68)}{0.36} = 32.11.$$

From chi-square tables, the value of $\chi^2_{0.05}$ (with $n - 1 = 17$ degrees of freedom) is 27.587.
Since $\chi^2 = 32.11$ is greater than 27.587, we reject the null hypothesis.
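Equation (1.3) is simple enough to verify by hand, but a short script makes the two decisions in Examples 1.12 and 1.13 explicit. This is only a sketch under stated assumptions: the function name `chi_square_statistic` is hypothetical, and since Python's standard library has no chi-square distribution, the critical values (9.886 and 45.558 for 24 degrees of freedom, 27.587 as the upper 0.05 value for 17 degrees of freedom) are typed in from the chi-square table values quoted in the text.

```python
def chi_square_statistic(n, sample_var, sigma0_sq):
    """Test statistic (n - 1) * s^2 / sigma0^2 for a single population variance."""
    return (n - 1) * sample_var / sigma0_sq


# Example 1.12: two-sided test at the 1% level, 24 degrees of freedom
chi2_ex12 = chi_square_statistic(25, 9, 10)
LOWER_995_24 = 9.886     # table value, lower critical point
UPPER_005_24 = 45.558    # table value, upper critical point
reject_ex12 = chi2_ex12 < LOWER_995_24 or chi2_ex12 > UPPER_005_24

# Example 1.13: upper-tailed test at the 5% level, 17 degrees of freedom
chi2_ex13 = chi_square_statistic(18, 0.68, 0.36)
UPPER_05_17 = 27.587     # table value
reject_ex13 = chi2_ex13 > UPPER_05_17

print(f"Example 1.12: chi-square = {chi2_ex12:.2f}, reject H0: {reject_ex12}")
print(f"Example 1.13: chi-square = {chi2_ex13:.2f}, reject H0: {reject_ex13}")
```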

Note that if the population is not normal but the size of the sample is large, then we can
test the null hypothesis by using the statistic

$$z = \frac{s - \sigma_0}{\sigma_0/\sqrt{2n}},$$

which has an approximate standard normal distribution.

Self- Assessment Questions


Exercise 1.6
1. Nine determinations of the specific heat of iron had a standard deviation of
0.0086. Assuming that these determinations constitute a random sample from a
normal population, test the null hypothesis $H_0: \sigma = 0.0100$ against the alternative
$H_1: \sigma < 0.0100$ at the 0.05 level of significance.

2. In a random sample, the weights of 24 Black Angus steers of a certain age have a
standard deviation of 238 pounds. Assuming that the weights constitute a random
sample from a normal population, test the null hypothesis $H_0: \sigma = 250$ against the
two-sided alternative $H_1: \sigma \ne 250$ at the 0.01 level of significance.

3. In a random sample, $s = 2.53$ minutes for the amount of time that 30 women took
to complete the written test for their driver's licenses. At the 0.05 level of
significance, test the null hypothesis $H_0: \sigma = 2.85$ minutes against the alternative
hypothesis $H_1: \sigma < 2.85$ minutes.

4. Past data indicate that the standard deviation of measurements made on sheet
metal stampings by experienced inspectors is 0.41 square inch. If a new inspector
measures 50 stampings with a standard deviation of 0.49 square inch, test the null
hypothesis $H_0: \sigma = 0.41$ square inch against the alternative hypothesis $H_1: \sigma > 0.41$ square
inch.


UNIT 2: TESTS OF HYPOTHESES ON TWO POPULATIONS


UNIT OUTLINE:
Session 1: Tests concerning two Population Means (Large and Independent
Samples)
Session 2: Tests concerning two Population Means (Small and Independent
Samples, σ 1 and σ 2 assumed equal)
Session 3: Tests concerning two Population Means (Small and independent
Samples, σ 1 and σ 2 assumed unequal)
Session 4: Tests concerning two Population Means (Paired Data)
Session 5: Tests concerning two Population Proportions
Session 6: Tests concerning two Population Variances

In Unit 1, you learnt about tests of hypotheses concerning a single
sample. You learnt how to conduct tests concerning single
population means, single population proportions, and single
population variances.

In this unit, you will learn about tests of hypotheses concerning two population samples.
You will learn how to conduct tests concerning two population means under various
assumptions, concerning two population proportions, and concerning two population
variances.

UNIT OBJECTIVES
By the end of the unit, you should be able to:
1. formulate null and alternative hypotheses for two population parameters;
2. conduct tests concerning two population means (large and independent samples);
3. conduct tests concerning two population means (small and independent samples,
σ 1 and σ 2 assumed equal);
4. conduct tests concerning two population means (small and independent samples,
σ 1 and σ 2 assumed unequal);
5. conduct tests concerning two population proportions; and
6. conduct tests concerning two population variances.


SESSION 1: TESTS CONCERNING TWO POPULATION MEANS (LARGE AND INDEPENDENT SAMPLES)
There are many situations in real life where we may want to compare
the means of two populations. For example, the head teacher of a
public school would want to know whether on the average, male students are older than
female students, or vice versa. Vehicle owners may want to know whether on the
average tires from one manufacturer last longer than those from another. The Chief
Executive Officer of a firm may want to ascertain from past records whether on the
average there is a difference in the performance of female and male managers.

Each one of the scenarios above calls for an investigation to find the right answers. In
this session, you will learn to conduct tests concerning the means of two populations. In
particular, you will learn to conduct tests concerning two population means based on
large and independent samples.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses concerning tests on two population
means;
2. conduct tests concerning means of two independent samples drawn from
populations with known σs; and
3. conduct tests concerning means of two large and independent samples drawn from
populations with unknown σs.

Now read on …

2.1 Tests Concerning Two Population Means (Independent Samples with $\sigma_1^2$ and $\sigma_2^2$ known)

All of the tests you learnt about in Unit 1 involved a single population. We now turn our
attention to the problem of comparing the means of two populations based on two
independent random samples drawn from these populations.

Suppose that we have two independent random samples with means $\bar{x}_1$ and $\bar{x}_2$, and
respective sample sizes $n_1$ and $n_2$, from normal populations with means $\mu_1$ and $\mu_2$,
and variances $\sigma_1^2$ and $\sigma_2^2$.


Then we can compare $\mu_1$ and $\mu_2$ by testing $H_0: \mu_1 - \mu_2 = \delta$, where $\delta$ is a given
constant (the hypothesized difference between the means), against any one of the
alternatives $H_1: \mu_1 - \mu_2 < \delta$, or $H_1: \mu_1 - \mu_2 > \delta$, or $H_1: \mu_1 - \mu_2 \neq \delta$, under various
assumptions about the population variances.

If $\sigma_1^2$ and $\sigma_2^2$ are known, then the test statistic is given by

$$z = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \qquad (2.1)$$

where $z$ is the usual standard normal random variable.

Can you think about the critical regions for the various tests? The critical regions
remain the same as those in Table 1.3 in Session 2 of Unit 1. This is the case because
the test statistic is the same standard normal random variable, z.

Example 2.1
A random sample of 100 observations is drawn from a normal population with variance
16 and the sample mean was found to be 10.8. Another sample of 64 observations is
drawn from a second and independent normal population with variance 25 and the
sample mean was found to be 9.6. Test the hypotheses:
$H_0$: the population means are equal
against
$H_1$: the population means are not equal
at the 5% significance level.

Solution
The hypotheses above are equivalent to
$H_0: \mu_1 - \mu_2 = 0$
against
$H_1: \mu_1 - \mu_2 \neq 0$

We need to evaluate the test statistic by substituting $n_1 = 100$, $\bar{x}_1 = 10.8$, $\sigma_1^2 = 16$,
$n_2 = 64$, $\bar{x}_2 = 9.6$, $\sigma_2^2 = 25$ and $\delta = 0$ into Equation (2.1) to obtain


$$z = \frac{10.8 - 9.6 - 0}{\sqrt{\dfrac{16}{100} + \dfrac{25}{64}}} = \frac{1.2}{\sqrt{0.550625}} = 1.6172 .$$

From z tables, $z_{\alpha/2} = z_{0.025} = 1.96$, giving the critical region as $z < -1.96$ or $z > 1.96$.

Since $z = 1.6172$ is neither less than $-1.96$ nor greater than $1.96$, we cannot reject the
null hypothesis. We conclude, therefore, that there is insufficient evidence at the 5% level
that the population means differ.
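
The calculation in Example 2.1 can be checked with a short Python sketch. The function name `two_sample_z` and its argument names are illustrative assumptions and are not part of the course text; the critical value is obtained from the standard library's `statistics.NormalDist` instead of printed z tables.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(xbar1, var1, n1, xbar2, var2, n2, delta=0.0):
    """Test statistic of Equation (2.1); var1 and var2 are the KNOWN population variances."""
    return (xbar1 - xbar2 - delta) / sqrt(var1 / n1 + var2 / n2)


# Data of Example 2.1 (two-sided test at the 5% level)
z = two_sample_z(10.8, 16, 100, 9.6, 25, 64)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
reject = abs(z) > z_crit                      # True means z falls in the critical region

print(f"z = {z:.4f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

Direct evaluation gives a statistic of about 1.617, which lies between $-1.96$ and $1.96$, so the null hypothesis is not rejected at the 5% level.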

2.2 Tests Concerning Two Population Means (Large Independent Samples with $\sigma_1^2$ and $\sigma_2^2$ unknown)
When independent random samples are drawn from populations, which may not even be
normal, with unknown variances $\sigma_1^2$ and $\sigma_2^2$, we can still conduct the test described
in Section 2.1, with $s_1^2$ and $s_2^2$ substituted for $\sigma_1^2$ and $\sigma_2^2$ respectively in
Equation (2.1), provided that both $n_1$ and $n_2$ are large. In that case, the test statistic in
Equation (2.1) becomes

$$z = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} , \qquad (2.2)$$

where $s_1^2$ and $s_2^2$ are the respective sample estimates of $\sigma_1^2$ and $\sigma_2^2$, and $z$ is
approximately the usual standard normal random variable.

Example 2.2
Suppose that we have randomly selected two independent samples from populations
having means $\mu_1$ and $\mu_2$, and that $\bar{x}_1 = 25$, $\bar{x}_2 = 20$, $s_1 = 3$, $s_2 = 4$, $n_1 = 100$ and $n_2 = 100$.
Test the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 > 0$ at the 0.05 level of
significance. What do you conclude about how $\mu_1$ and $\mu_2$ compare?

Solution
Substituting $n_1 = 100$, $\bar{x}_1 = 25$, $s_1^2 = 9$, $n_2 = 100$, $\bar{x}_2 = 20$, $s_2^2 = 16$ and $\delta = 0$ into
Equation (2.2), we obtain

$$z = \frac{25 - 20 - 0}{\sqrt{\dfrac{9}{100} + \dfrac{16}{100}}} = 10 .$$


From z tables, $z_{0.05} = 1.645$, giving the critical region as $z > 1.645$. Since $z = 10$
is greater than 1.645, we reject the null hypothesis and conclude that $\mu_1$ is greater than
$\mu_2$.
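
The choice of critical region depends only on the form of the alternative hypothesis. The sketch below, assuming the hypothetical helper names `large_sample_z` and `reject_h0` (neither appears in the course text), evaluates Equation (2.2) for the data of Example 2.2 and applies the left-tailed, right-tailed or two-tailed rule as requested.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_z(xbar1, s1, n1, xbar2, s2, n2, delta=0.0):
    """Equation (2.2): sample standard deviations stand in for the unknown sigmas."""
    return (xbar1 - xbar2 - delta) / sqrt(s1**2 / n1 + s2**2 / n2)


def reject_h0(z, alpha, alternative):
    """Return True when z falls in the critical region for the stated alternative."""
    nd = NormalDist()
    if alternative == "less":        # H1: mu1 - mu2 < delta
        return z < -nd.inv_cdf(1 - alpha)
    if alternative == "greater":     # H1: mu1 - mu2 > delta
        return z > nd.inv_cdf(1 - alpha)
    if alternative == "two-sided":   # H1: mu1 - mu2 != delta
        return abs(z) > nd.inv_cdf(1 - alpha / 2)
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


# Data of Example 2.2: right-tailed test at the 0.05 level
z = large_sample_z(25, 3, 100, 20, 4, 100)
decision = reject_h0(z, 0.05, "greater")
print(f"z = {z:.2f}, reject H0: {decision}")
```

Keeping the decision rule separate from the statistic means the same function can be reused for the left-tailed and two-tailed questions in Exercise 2.1.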

Self-Assessment Questions

Exercise 2.1

1. Suppose that a random sample of 100 public owner-controlled companies in
Ghana gives a mean audit delay of $\bar{x}_1 = 82.6$ days with a standard deviation of
$s_1 = 32.83$ days, while a random sample of 100 public manager-controlled
companies in Ghana gives a mean audit delay of $\bar{x}_2 = 92$ days with a standard
deviation of $s_2 = 37.18$ days. Test the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ against
$H_1: \mu_1 - \mu_2 < 0$ at the 0.05 level of significance. What do you conclude about
how $\mu_1$ and $\mu_2$ compare?

2. The University of Cape Coast wishes to demonstrate that car ownership is
detrimental to academic achievement. A random sample of 100 students who do
not own cars had a mean Grade Point Average (GPA) of 2.68 with standard
deviation 0.7, while a random sample of 100 students who own cars had a mean
GPA of 2.55 with standard deviation 0.6. Assuming that the independence
assumption holds,
(a) Set up the null and alternative hypotheses that should be used to justify that
the mean GPA for non-car owners is higher than the mean GPA for car
owners.
(b) Test the hypotheses in part (a) at the 0.05 significance level. Interpret the
results of your test.
3. Perceptual ratings were measured by using a nine-point agree-disagree scale.
Suppose that the results of a telephone survey of 175 technical managers and 125
purchasing managers reveal that the mean perception score for technical managers
is 7.3 with a standard deviation of 1.4 and that the mean perception score for
purchasing managers is 8.2 with a standard deviation of 1.6.
(a) Set up the null and alternative hypotheses needed to establish whether or not
$\mu_1$ and $\mu_2$ differ. If $\mu_1$ and $\mu_2$ do not differ, what does $\mu_1 - \mu_2$ equal?
(b) Assuming that the samples of 175 technical managers and 125 purchasing
managers are independent, test the hypotheses in part (a) at the 0.05 level of
significance. What do you conclude about whether or not $\mu_1$ and $\mu_2$
differ?

4. An experiment is performed to determine whether the average nicotine content of
one kind of cigarette exceeds that of another kind by 0.20 milligram. If $n_1 = 50$
cigarettes of the first kind had an average nicotine content of $\bar{x}_1 = 2.61$ milligrams
with a standard deviation of $s_1 = 0.12$ milligram, whereas $n_2 = 40$ cigarettes of
the other kind had an average nicotine content of $\bar{x}_2 = 2.38$ milligrams with a
standard deviation of $s_2 = 0.14$ milligram, test the null hypothesis $\mu_1 - \mu_2 = 0.20$
against the alternative hypothesis $\mu_1 - \mu_2 \neq 0.20$ at the 0.05 level of significance.

5. A study of the number of business lunches that executives in the insurance and
banking industries claim as deductible expenses per month was based on random
samples and yielded the following results:
$n_1 = 40$    $\bar{x}_1 = 9.1$    $s_1 = 1.9$
$n_2 = 50$    $\bar{x}_2 = 8.0$    $s_2 = 2.1$

Assuming that the population variances $\sigma_1^2$ and $\sigma_2^2$ are equal, test the null
hypothesis $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \neq 0$ at the 0.05 level of
significance.
6. Sample surveys conducted in a large county in a certain year and again 20 years
later showed that originally the average height of 400 ten-year-old boys was 53.8
inches with a standard deviation of 2.4 inches, whereas 20 years later the average
height of 500 ten-year-old boys was 54.5 inches with a standard deviation of 2.5
inches. Assuming that the population variances $\sigma_1^2$ and $\sigma_2^2$ are equal, test
$H_0: \mu_1 - \mu_2 = -0.5$ against $H_1: \mu_1 - \mu_2 < -0.5$ at the 0.05 level of significance.


SESSION 2: TESTS CONCERNING TWO POPULATION MEANS (SMALL AND INDEPENDENT SAMPLES, $\sigma_1$ AND $\sigma_2$ ASSUMED EQUAL)

In Session 1, you learnt how to conduct tests (1) concerning means of two
independent samples drawn from populations with known standard
deviations; and (2) concerning means of two large and independent samples drawn from
populations with unknown standard deviations.

In this session, we shall learn how to perform tests concerning means of two small
($n_1 < 30$ and $n_2 < 30$) independent samples drawn from populations with unknown
standard deviations.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests concerning two population
means; and
2. conduct tests concerning means of two small and independent samples drawn
from populations with unknown variances which are assumed equal.
Now read on …

Suppose that we have two independent random samples with means $\bar{x}_1$ and $\bar{x}_2$, and
respective sample sizes $n_1 < 30$ and $n_2 < 30$, from populations with means $\mu_1$ and $\mu_2$,
and unknown variances $\sigma_1^2$ and $\sigma_2^2$.

Then we can compare $\mu_1$ and $\mu_2$ by testing $H_0: \mu_1 - \mu_2 = \delta$, where $\delta$ is a given
constant, against any one of the alternatives $H_1: \mu_1 - \mu_2 < \delta$, or $H_1: \mu_1 - \mu_2 > \delta$, or
$H_1: \mu_1 - \mu_2 \neq \delta$, under two different assumptions about the population variances. Under
the first assumption, $\sigma_1^2$ and $\sigma_2^2$ are unknown and both are assumed to be equal to a
common variance $\sigma^2$, whereas under the second, $\sigma_1^2$ and $\sigma_2^2$ are unknown and are assumed
to be different from each other. In this session we shall concern ourselves with
problems under the first assumption, whereas in Session 3 we shall consider problems
under the second assumption.

Suppose that $\sigma_1^2$ and $\sigma_2^2$ are unknown but assumed to be equal to a common
population variance $\sigma^2$. That is, $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Then
the appropriate test statistic for such tests is given by


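$$t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} , \qquad (2.3)$$

where $s_p$, called the pooled sample standard deviation (its square $s_p^2$ is the pooled
sample variance), is given by

$$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} , \qquad (2.4)$$

$s_1^2$ and $s_2^2$ are the respective variances of sample 1 and sample 2, and $t$ has the
t-distribution with $n_1 + n_2 - 2$ degrees of freedom.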

Can you think about the critical regions for the various tests under the first assumption?
The critical regions remain the same as those in Table 1.4 in Session 3 of Unit 1. This is
because the test statistic again follows the t-distribution.

Example 2.3
Two independent random samples of sizes $n_1 = 16$ and $n_2 = 10$ from normal
populations with unknown standard deviations have means $\bar{x}_1 = 23.4$ and $\bar{x}_2 = 18.2$,
with corresponding standard deviations $s_1 = 3.5$ and $s_2 = 4.8$. Test $H_0: \mu_1 - \mu_2 = 0$
against $H_1: \mu_1 - \mu_2 > 0$ at the 10% significance level, assuming that the population
variances are equal.

Solution
Assuming that the population variances are equal, we first substitute $n_1 = 16$, $s_1 = 3.5$,
$n_2 = 10$, $s_2 = 4.8$ into Equation (2.4) to evaluate $s_p$ as follows:

$$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(16 - 1)(3.5)^2 + (10 - 1)(4.8)^2}{16 + 10 - 2}} = 4.04$$

Now substituting $s_p = 4.04$, $n_1 = 16$, $\bar{x}_1 = 23.4$, $n_2 = 10$, $\bar{x}_2 = 18.2$ and $\delta = 0$ into
Equation (2.3), we obtain


$$t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{23.4 - 18.2 - 0}{(4.04) \sqrt{\dfrac{1}{16} + \dfrac{1}{10}}} = 3.193$$

From the t-tables, $t_{0.10}$ with $24 \ (= 16 + 10 - 2)$ degrees of freedom is 1.318. Thus, the
critical region is $t > 1.318$. Since $t = 3.193 > 1.318$, we reject $H_0$ and conclude that
$\mu_1 > \mu_2$.
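
Equations (2.3) and (2.4) can be sketched in Python as follows. The function names `pooled_sd` and `pooled_t` are illustrative assumptions, and the critical value is simply copied from the t-tables because the standard library has no t-distribution. Since the code keeps $s_p$ unrounded, it returns a statistic of about 3.195, slightly different from the hand calculation that uses the rounded value $s_p = 4.04$.

```python
from math import sqrt


def pooled_sd(s1, n1, s2, n2):
    """Equation (2.4): pooled sample standard deviation."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def pooled_t(xbar1, s1, n1, xbar2, s2, n2, delta=0.0):
    """Equation (2.3): returns the t statistic and its degrees of freedom."""
    sp = pooled_sd(s1, n1, s2, n2)
    t = (xbar1 - xbar2 - delta) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Data of Example 2.3: right-tailed test at the 10% level
t, df = pooled_t(23.4, 3.5, 16, 18.2, 4.8, 10)
T_CRIT = 1.318           # t_0.10 with 24 degrees of freedom, from the t-tables
reject = t > T_CRIT
print(f"t = {t:.3f} on {df} df, reject H0: {reject}")
```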

Example 2.4
In the comparison of two kinds of paint, a consumer testing service finds that four
1-gallon cans of one brand cover on the average 546 square feet with a standard
deviation of 31 square feet, whereas four 1-gallon cans of another brand cover on the
average 492 square feet with a standard deviation of 26 square feet. Assuming that the
two populations sampled are normal and have equal variances, test the null hypothesis
$H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 > 0$ at the 0.05 level of significance.

Solution
Assuming that the population variances are equal, we first substitute $n_1 = 4$, $s_1 = 31$,
$n_2 = 4$, $s_2 = 26$ into Equation (2.4) to evaluate $s_p$ as follows:

$$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(4 - 1)(31)^2 + (4 - 1)(26)^2}{4 + 4 - 2}} = 28.609$$

Now substituting $s_p = 28.609$, $n_1 = 4$, $\bar{x}_1 = 546$, $n_2 = 4$, $\bar{x}_2 = 492$ and $\delta = 0$ into
Equation (2.3), we obtain

$$t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{546 - 492 - 0}{(28.609) \sqrt{\dfrac{1}{4} + \dfrac{1}{4}}} = 2.67$$

From the t-tables, $t_{0.05}$ with $6 \ (= 4 + 4 - 2)$ degrees of freedom is 1.9432. Thus, the
critical region is $t > 1.9432$. Since $t = 2.67 > 1.9432$, we reject $H_0$ and conclude that
$\mu_1 > \mu_2$.

Example 2.5
A production supervisor at a major chemical company wishes to determine which of
two catalysts, catalyst XA–100 or catalyst ZB–200, maximizes the hourly yield of a
chemical process. In order to compare the mean hourly yields obtained by using the two
catalysts, the supervisor runs the process using each catalyst for five one-hour periods.
The resulting yields (in pounds per hour) for each catalyst, along with the means and
variances of the yields, are given below:

Catalyst XA–100        Catalyst ZB–200
801                    752
814                    718
784                    776
836                    742
820                    763
$\bar{x}_1 = 811$      $\bar{x}_2 = 750.2$
$s_1^2 = 386$          $s_2^2 = 484.2$

Assuming that the two populations sampled are normal and have equal variances, test
$H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \neq 0$ by setting $\alpha$ equal to 0.10, 0.05, 0.01 and
0.001. How much evidence is there that the difference between $\mu_1$ and $\mu_2$ is not equal to
0?

Solution

Assuming that the population variances are equal, we first substitute $n_1 = 5$, $s_1^2 = 386$,
$n_2 = 5$, $s_2^2 = 484.2$ into Equation (2.4) to evaluate $s_p$ as follows:


$$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(5 - 1)(386) + (5 - 1)(484.2)}{5 + 5 - 2}} = 20.859$$

Now substituting $s_p = 20.859$, $n_1 = 5$, $\bar{x}_1 = 811$, $n_2 = 5$, $\bar{x}_2 = 750.2$ and $\delta = 0$ into
Equation (2.3), we obtain
$$t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{811 - 750.2 - 0}{(20.859) \sqrt{\dfrac{1}{5} + \dfrac{1}{5}}} = 4.6087$$

(a) If $\alpha = 0.10$, then from the t-tables, $t_{\alpha/2} = t_{0.10/2} = t_{0.05}$ with $8 \ (= 5 + 5 - 2)$
degrees of freedom is 1.8595. Thus, the critical region is $t < -1.8595$ or
$t > 1.8595$. Since $t = 4.6087 > 1.8595$, we reject $H_0$ and conclude that $\mu_1 \neq \mu_2$.

(b) If $\alpha = 0.05$, then from the t-tables, $t_{\alpha/2} = t_{0.05/2} = t_{0.025}$ with $8 \ (= 5 + 5 - 2)$
degrees of freedom is 2.3060. Thus, the critical region is $t < -2.3060$ or
$t > 2.3060$. Since $t = 4.6087 > 2.3060$, we reject $H_0$ and conclude that $\mu_1 \neq \mu_2$.

(c) If $\alpha = 0.01$, then from the t-tables, $t_{\alpha/2} = t_{0.01/2} = t_{0.005}$ with $8 \ (= 5 + 5 - 2)$
degrees of freedom is 3.3554. Thus, the critical region is $t < -3.3554$ or
$t > 3.3554$. Since $t = 4.6087 > 3.3554$, we reject $H_0$ and conclude that $\mu_1 \neq \mu_2$.

(d) If $\alpha = 0.001$, then from the t-tables, $t_{\alpha/2} = t_{0.001/2} = t_{0.0005}$ with $8 \ (= 5 + 5 - 2)$
degrees of freedom is 5.0413. Thus, the critical region is $t < -5.0413$ or
$t > 5.0413$. Since $t = 4.6087$ is neither less than $-5.0413$ nor greater than
$5.0413$, we are not able to reject $H_0$ at this level.

There is very strong evidence that $\mu_1$ and $\mu_2$ differ: $H_0$ is rejected at the 0.10,
0.05 and 0.01 levels of significance, although not at the 0.001 level.
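
When the raw observations are available, as in Example 2.5, the sample means and variances need not be computed by hand. The sketch below, with variable names chosen purely for illustration, uses the standard library's `statistics` module and then compares the pooled t statistic with the four tabulated two-tailed critical values for 8 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

# Hourly yields from Example 2.5
xa_100 = [801, 814, 784, 836, 820]
zb_200 = [752, 718, 776, 742, 763]

n1, n2 = len(xa_100), len(zb_200)
s1_sq, s2_sq = variance(xa_100), variance(zb_200)   # sample variances (divisor n - 1)

# Equations (2.4) and (2.3) with delta = 0
sp = sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
t = (mean(xa_100) - mean(zb_200)) / (sp * sqrt(1 / n1 + 1 / n2))

# Two-tailed critical values t_{alpha/2} with 8 df, copied from the t-tables
critical = {0.10: 1.8595, 0.05: 2.3060, 0.01: 3.3554, 0.001: 5.0413}
decisions = {alpha: abs(t) > c for alpha, c in critical.items()}

print(f"t = {t:.4f}")
for alpha, rejected in decisions.items():
    print(f"alpha = {alpha}: reject H0 = {rejected}")
```

The statistic exceeds the critical values for $\alpha = 0.10$, 0.05 and 0.01 but not for $\alpha = 0.001$.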


Self-Assessment Questions

Exercise 2.2

1. Suppose that we have taken independent random samples of sizes $n_1 = 5$ and $n_2 = 15$
from two normally distributed populations having means $\mu_1$ and $\mu_2$, and
suppose that we obtain $\bar{x}_1 = 57$ and $\bar{x}_2 = 60$, $s_1 = 3$ and $s_2 = 5$. Assuming that
the equal variances assumption holds,

(a) test the null hypothesis $H_0: \mu_1 - \mu_2 \geq 0$ against the alternative hypothesis
$H_1: \mu_1 - \mu_2 < 0$, by setting $\alpha$ equal to 0.10, 0.05, 0.01 and 0.001. How
much evidence is there that the difference between $\mu_1$ and $\mu_2$ is less than
0?

(b) test the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ against the alternative hypothesis
$H_1: \mu_1 - \mu_2 \neq 0$, by setting $\alpha$ equal to 0.10, 0.05, 0.01 and 0.001. How
much evidence is there that the difference between $\mu_1$ and $\mu_2$ is not equal
to 0?


SESSION 3: TESTS CONCERNING TWO POPULATION MEANS (SMALL AND INDEPENDENT SAMPLES, $\sigma_1$ AND $\sigma_2$ ASSUMED UNEQUAL)

In Session 1, you learnt how to conduct tests (1) concerning means of
two independent samples drawn from populations with known standard
deviations; and (2) concerning means of two large and independent samples drawn from
populations with unknown standard deviations.

In this session, we shall learn how to perform tests concerning means of two small
($n_1 < 30$ and $n_2 < 30$) independent samples drawn from populations with unknown
standard deviations.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests concerning two population
means; and
2. conduct tests concerning means of two small and independent samples drawn
from populations with unknown variances which are assumed unequal.
Now read on …

Suppose that we have two independent random samples with means $\bar{x}_1$ and $\bar{x}_2$, and
respective small sample sizes $n_1 < 30$ and $n_2 < 30$, from populations with means $\mu_1$
and $\mu_2$, and unknown variances $\sigma_1^2$ and $\sigma_2^2$.

Then we can compare $\mu_1$ and $\mu_2$ by testing $H_0: \mu_1 - \mu_2 = \delta$, where $\delta$ is a given
constant, against any one of the alternatives $H_1: \mu_1 - \mu_2 < \delta$, or $H_1: \mu_1 - \mu_2 > \delta$, or
$H_1: \mu_1 - \mu_2 \neq \delta$, under two different assumptions about the population variances. Under
the first assumption, $\sigma_1^2$ and $\sigma_2^2$ are unknown and both are assumed to be equal to a
common variance $\sigma^2$, whereas under the second, $\sigma_1^2$ and $\sigma_2^2$ are unknown and are assumed
to be different from each other. In this session we shall concern ourselves with
problems under the second assumption. Recall that problems under the first assumption
were considered in Session 2.

Suppose that $\sigma_1^2$ and $\sigma_2^2$ are unknown and are assumed to be different from each other.
Then the appropriate test statistic for such tests is given by

$$t^{*} = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} , \qquad (2.5)$$

where $t^{*}$ is approximately t-distributed with degrees of freedom, $v$, given by

$$v = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{\left( s_1^2 / n_1 \right)^2}{n_1 - 1} + \dfrac{\left( s_2^2 / n_2 \right)^2}{n_2 - 1}} \qquad (2.6)$$

If $v$ is not a whole number, then we have to round down to the nearest whole number.

Note that the critical regions remain the same as in Table 1.4 in Session 3 of Unit 1.
This is because the test statistic still follows (approximately) the t-distribution.

Example 2.6
Rework Example 2.3 on the assumption that the population variances σ 12 and σ 22 are
different.

Solution
Assuming that the population variances are not equal and substituting $n_1 = 16$,
$\bar{x}_1 = 23.4$, $s_1 = 3.5$, $n_2 = 10$, $\bar{x}_2 = 18.2$, $s_2 = 4.8$ and $\delta = 0$ into Equation (2.5), we
obtain

$$t^{*} = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{23.4 - 18.2 - 0}{\sqrt{\dfrac{(3.5)^2}{16} + \dfrac{(4.8)^2}{10}}} = 2.9680$$

Now we need to substitute $n_1 = 16$, $s_1 = 3.5$, $n_2 = 10$, $s_2 = 4.8$ into Equation (2.6) to
obtain the degrees of freedom:


$$v = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{\left( s_1^2 / n_1 \right)^2}{n_1 - 1} + \dfrac{\left( s_2^2 / n_2 \right)^2}{n_2 - 1}} = \frac{\left( \dfrac{(3.5)^2}{16} + \dfrac{(4.8)^2}{10} \right)^2}{\dfrac{\left( (3.5)^2 / 16 \right)^2}{16 - 1} + \dfrac{\left( (4.8)^2 / 10 \right)^2}{10 - 1}} = \frac{(3.069625)^2}{0.039079 + 0.589824} = 14.98 \approx 14$$

From the t-tables, $t_{0.10}$ with 14 degrees of freedom is 1.345. Thus, the critical region is
$t^{*} > 1.345$. Since $t^{*} = 2.9680 > 1.345$, we reject $H_0$ and conclude that $\mu_1 > \mu_2$.

Example 2.7
Rework Example 2.4 on the assumption that the population variances σ 12 and σ 22 are
different.

Solution
Assuming that the population variances are not equal and substituting $n_1 = 4$, $\bar{x}_1 = 546$,
$s_1 = 31$, $n_2 = 4$, $\bar{x}_2 = 492$, $s_2 = 26$ and $\delta = 0$ into Equation (2.5), we obtain

$$t^{*} = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{546 - 492 - 0}{\sqrt{\dfrac{(31)^2}{4} + \dfrac{(26)^2}{4}}} = 2.669$$

Now we need to substitute $n_1 = 4$, $s_1 = 31$, $n_2 = 4$, $s_2 = 26$ into Equation (2.6) to
obtain the degrees of freedom:

$$v = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{\left( s_1^2 / n_1 \right)^2}{n_1 - 1} + \dfrac{\left( s_2^2 / n_2 \right)^2}{n_2 - 1}} = \frac{\left( \dfrac{(31)^2}{4} + \dfrac{(26)^2}{4} \right)^2}{\dfrac{\left( (31)^2 / 4 \right)^2}{4 - 1} + \dfrac{\left( (26)^2 / 4 \right)^2}{4 - 1}} = 5.8235 \approx 5$$


From the t-tables, $t_{0.05}$ with 5 degrees of freedom is 2.015. Thus, the critical region is
$t^{*} > 2.015$. Since $t^{*} = 2.669 > 2.015$, we reject $H_0$ and conclude that $\mu_1 > \mu_2$.
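
Equations (2.5) and (2.6) are tedious to evaluate by hand, so a small Python sketch may help. The function name `unequal_variance_t` is an assumption made for illustration; the degrees of freedom are rounded down as described above, and the critical value is copied from the t-tables.

```python
from math import floor, sqrt


def unequal_variance_t(xbar1, s1, n1, xbar2, s2, n2, delta=0.0):
    """Return (t_star, v, df): Equation (2.5), Equation (2.6), and v rounded down."""
    a = s1**2 / n1          # s1^2 / n1
    b = s2**2 / n2          # s2^2 / n2
    t_star = (xbar1 - xbar2 - delta) / sqrt(a + b)
    v = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return t_star, v, floor(v)


# Data of Example 2.7 (the paint comparison), right-tailed test at the 0.05 level
t_star, v, df = unequal_variance_t(546, 31, 4, 492, 26, 4)
T_CRIT = 2.015              # t_0.05 with 5 degrees of freedom, from the t-tables
reject = t_star > T_CRIT
print(f"t* = {t_star:.3f}, v = {v:.4f}, df used = {df}, reject H0: {reject}")
```

Rounding $v$ down gives a slightly larger critical value than rounding to the nearest integer would, which makes the test a little more conservative.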

Example 2.8
Rework Example 2.5 on the assumption that the population variances σ 12 and σ 22 are
different.

Solution
Assuming that the population variances are not equal and substituting $n_1 = 5$, $\bar{x}_1 = 811$,
$s_1^2 = 386$, $n_2 = 5$, $\bar{x}_2 = 750.2$, $s_2^2 = 484.2$ and $\delta = 0$ into Equation (2.5), we obtain

$$t^{*} = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{811 - 750.2 - 0}{\sqrt{\dfrac{386}{5} + \dfrac{484.2}{5}}} = 4.609$$

Now we need to substitute $n_1 = 5$, $s_1^2 = 386$, $n_2 = 5$, $s_2^2 = 484.2$ into Equation (2.6) to
obtain the degrees of freedom:

$$v = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{\left( s_1^2 / n_1 \right)^2}{n_1 - 1} + \dfrac{\left( s_2^2 / n_2 \right)^2}{n_2 - 1}} = \frac{\left( \dfrac{386}{5} + \dfrac{484.2}{5} \right)^2}{\dfrac{\left( 386 / 5 \right)^2}{5 - 1} + \dfrac{\left( 484.2 / 5 \right)^2}{5 - 1}} = 7.8994 \approx 7$$

(a) If $\alpha = 0.10$, then from the t-tables, $t_{\alpha/2} = t_{0.10/2} = t_{0.05}$ with 7 degrees of
freedom is 1.8946. Thus, the critical region is $t^{*} < -1.8946$ or $t^{*} > 1.8946$. Since
$t^{*} = 4.609 > 1.8946$, we reject $H_0$ and conclude that $\mu_1 \neq \mu_2$.


(b) If $\alpha = 0.05$, then from the t-tables, $t_{\alpha/2} = t_{0.05/2} = t_{0.025}$ with 7 degrees of
freedom is 2.3646. Thus, the critical region is $t^{*} < -2.3646$ or $t^{*} > 2.3646$.
Since $t^{*} = 4.609 > 2.3646$, we reject $H_0$ and conclude that $\mu_1 \neq \mu_2$.

(c) If $\alpha = 0.01$, then from the t-tables, $t_{\alpha/2} = t_{0.01/2} = t_{0.005}$ with 7 degrees of
freedom is 3.4995. Thus, the critical region is $t^{*} < -3.4995$ or $t^{*} > 3.4995$.
Since $t^{*} = 4.609 > 3.4995$, we reject $H_0$ and conclude that $\mu_1 \neq \mu_2$.

(d) If $\alpha = 0.001$, then from the t-tables, $t_{\alpha/2} = t_{0.001/2} = t_{0.0005}$ with 7 degrees of
freedom is 5.4079. Thus, the critical region is $t^{*} < -5.4079$ or $t^{*} > 5.4079$.
Since $t^{*} = 4.609$ is neither less than $-5.4079$ nor greater than $5.4079$, we are not
able to reject $H_0$ at this level.

There is very strong evidence that $\mu_1$ and $\mu_2$ differ: $H_0$ is rejected at the 0.10,
0.05 and 0.01 levels of significance, although not at the 0.001 level.

Self-Assessment Questions

Exercise 2.3

1. Suppose that we have taken independent random samples of sizes $n_1 = 5$ and $n_2 = 15$
from two normally distributed populations having means $\mu_1$ and $\mu_2$, and
suppose that we obtain $\bar{x}_1 = 57$ and $\bar{x}_2 = 60$, $s_1 = 3$ and $s_2 = 5$. Assuming that
the equal variances assumption does not hold,

(a) test the null hypothesis $H_0: \mu_1 - \mu_2 \geq 0$ against the alternative hypothesis
$H_1: \mu_1 - \mu_2 < 0$, by setting $\alpha$ equal to 0.10, 0.05, 0.01 and 0.001. How
much evidence is there that the difference between $\mu_1$ and $\mu_2$ is less than
0?

(b) test the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ against the alternative hypothesis
$H_1: \mu_1 - \mu_2 \neq 0$, by setting $\alpha$ equal to 0.10, 0.05, 0.01 and 0.001. How
much evidence is there that the difference between $\mu_1$ and $\mu_2$ is not equal
to 0?


SESSION 4: TESTS CONCERNING TWO POPULATION MEANS (PAIRED DATA)

The tests discussed so far in this unit cannot be applied directly when
the samples are not independent. For example, they cannot be used for
observations that occur in pairs (as in “before and after experiments”).

In this session, we shall learn how to conduct tests concerning the means of two
samples that are not independent of each other.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for tests concerning the means of
two samples that are not independent of each other; and
2. conduct tests concerning the means of two samples that are not independent of
each other.

Now read on …

Suppose that $x_1, x_2, \ldots, x_n$ are the observations on $n$ individuals before an experiment,
and $y_1, y_2, \ldots, y_n$ are the corresponding observations after the experiment. Then the
observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ constitute a paired data set. To compare the
means of these data, we can transform the data into a single sample by finding the
differences between corresponding observations. Consequently, the problem reduces to
that of tests concerning one population mean.

Suppose that the original problem was to conduct any of the tests in
Table 2.1 (a).

Table 2.1 (a): Test of $H_0: \mu_1 - \mu_2 = \delta$ against various alternatives
(a)                               (b)                               (c)
$H_0: \mu_1 - \mu_2 = \delta$     $H_0: \mu_1 - \mu_2 = \delta$     $H_0: \mu_1 - \mu_2 = \delta$
$H_1: \mu_1 - \mu_2 < \delta$     $H_1: \mu_1 - \mu_2 > \delta$     $H_1: \mu_1 - \mu_2 \neq \delta$

Then, by calculating the differences $d_i = y_i - x_i$, $i = 1, 2, \ldots, n$, between corresponding
observations, the tests reduce to those in Table 2.1 (b).


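Table 2.1 (b): Test of $H_0: \mu_d = \delta$ against various alternatives
(a)                       (b)                       (c)
$H_0: \mu_d = \delta$     $H_0: \mu_d = \delta$     $H_0: \mu_d = \delta$
$H_1: \mu_d < \delta$     $H_1: \mu_d > \delta$     $H_1: \mu_d \neq \delta$

where $\mu_d$ in each test is the mean of the population of paired differences. The order of
subtraction matters: with $\mu_1$ and $\mu_2$ denoting the population means of the $x$ and $y$
observations, differences taken as $d_i = y_i - x_i$ give $\mu_d = \mu_2 - \mu_1$, whereas differences
taken as $x_i - y_i$ give $\mu_d = \mu_1 - \mu_2$ and reverse the direction of a one-sided alternative.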

Let $\mu_d$ be the mean of the normally distributed population of paired differences, and let
$\bar{d}$ and $s_d$ be the mean and standard deviation of a sample of $n$ paired differences that
have been selected randomly from the population. Then the appropriate test statistic for
conducting any one of the tests in Table 2.1 (b) is given by

$$t = \frac{\bar{d} - \delta}{s_d / \sqrt{n}} , \qquad (2.7)$$

where $t$ has the t-distribution with $(n - 1)$ degrees of freedom.

As in all previous cases where the test statistic follows the t-distribution, the critical
regions are as in Table 1.4 in Session 3 of Unit 1.

Example 2.9
The data below are the weights before and after ten boxers were fed with a
weight-reducing diet:

$i$              1    2    3    4    5    6    7    8    9    10
Before, $x_i$    69   50   61   72   78   66   75   89   86   54
After, $y_i$     66   49   63   70   71   65   75   88   87   51

Test the null hypothesis $H_0: \mu_d = 0$ against the alternative hypothesis $H_1: \mu_d < 0$ at
the 5% level of significance.

Solution
By calculating the differences $y_i - x_i$, we obtain

$i$            1    2    3    4    5    6    7    8    9    10
$x_i$          69   50   61   72   78   66   75   89   86   54
$y_i$          66   49   63   70   71   65   75   88   87   51
$y_i - x_i$    −3   −1   2    −2   −7   −1   0    −1   1    −3


Considering the differences as one sample, we find that $n = 10$, $\bar{d} = -1.5$,
$s_d = 2.5$ and $\delta = 0$. Substituting these into Equation (2.7), we obtain

$$t = \frac{\bar{d} - \delta}{s_d / \sqrt{n}} = \frac{-1.5 - 0}{2.5 / \sqrt{10}} = -1.8973$$

From t-tables, $t_{0.05}$ with 9 degrees of freedom is 1.833. Thus, the critical region is
$t < -1.833$. Since $t = -1.897 < -1.833$, we reject $H_0$ and conclude that $\mu_2 < \mu_1$.
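
The paired-difference procedure can be sketched in Python as follows. The function name `paired_t` is an illustrative assumption; it forms the differences as second minus first, matching the $y_i - x_i$ convention of Example 2.9. Because the code keeps $s_d$ unrounded (about 2.506 instead of 2.5), it gives a statistic of about $-1.893$, which leads to the same decision.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second, delta=0.0):
    """Equation (2.7) applied to the differences second[i] - first[i]."""
    d = [b - a for a, b in zip(first, second)]
    n = len(d)
    t = (mean(d) - delta) / (stdev(d) / sqrt(n))   # stdev uses divisor n - 1
    return t, n - 1


# Weights of the ten boxers in Example 2.9
before = [69, 50, 61, 72, 78, 66, 75, 89, 86, 54]
after = [66, 49, 63, 70, 71, 65, 75, 88, 87, 51]

t, df = paired_t(before, after)
T_CRIT = 1.833                  # t_0.05 with 9 degrees of freedom, from the t-tables
reject = t < -T_CRIT            # left-tailed test, H1: mu_d < 0
print(f"t = {t:.3f} on {df} df, reject H0: {reject}")
```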

Example 2.10
The management of the Daily Guide Newspaper knows that there are substantial
differences in the abilities of its machine operators. Therefore it decided to compare its
machines using the paired difference approach. Suppose that eight randomly selected
machine operators produce papers for one hour using machine 1 and for one hour using
machine 2, with the following results:

Machine Operator    1    2    3    4    5    6    7    8
Machine 1           53   60   58   48   46   54   62   49
Machine 2           50   55   56   44   45   50   57   47

Assuming normality, perform a hypothesis test to determine whether or not there is a
difference between the mean hourly outputs of the two machines. Use $\alpha = 0.05$.

Solution
We wish to test $H_0: \mu_d = 0$ against $H_1: \mu_d \neq 0$ at $\alpha = 0.05$. By calculating the differences
between Machines 1 and 2, we obtain

Machine Operator    1    2    3    4    5    6    7    8
$M_1$               53   60   58   48   46   54   62   49
$M_2$               50   55   56   44   45   50   57   47
$M_1 - M_2$         3    5    2    4    1    4    5    2

Considering the differences as one sample, we find that $n = 8$, $\bar{d} = 3.25$,
$s_d = 1.49$ and $\delta = 0$. Substituting these into Equation (2.7), we obtain

$$t = \frac{\bar{d} - \delta}{s_d / \sqrt{n}} = \frac{3.25 - 0}{1.49 / \sqrt{8}} = 6.17$$


From t-tables, $t_{0.025}$ with 7 degrees of freedom is 2.365. Thus, the critical region is
$t < -2.365$ or $t > 2.365$. Since $t = 6.17 > 2.365$, we reject $H_0$ and conclude that
$\mu_d \neq 0$ at $\alpha = 0.05$.

Example 2.11
Lactation promotes a temporary loss of bone mass to provide adequate amounts of
calcium for milk production. An experiment resulted in the following data on total body
bone mineral content for a sample both during lactation (L) and in the post-weaning period
(P).

Subject    1      2      3      4      5      6      7      8      9      10
L          1928   2549   2825   1924   1628   2175   2114   2621   1843   2541
P          2126   2885   2895   1942   1750   2184   2164   2626   2006   2627

Do the data suggest that the true average total body bone mineral content during
post-weaning exceeds that during lactation by more than 25? State and test the appropriate
hypotheses using the 0.05 level of significance.

Solution
We wish to test $H_0: \mu_d = 25$ against $H_1: \mu_d > 25$ at $\alpha = 0.05$. By calculating the
differences between P and L, $D = P - L$, we obtain

Subject    1      2      3      4      5      6      7      8      9      10
L          1928   2549   2825   1924   1628   2175   2114   2621   1843   2541
P          2126   2885   2895   1942   1750   2184   2164   2626   2006   2627
D          198    336    70     18     122    9      50     5      163    86

Considering the differences as one sample, we find that $n = 10$, $\bar{d} = 105.7$,
$s_d = 103.85$ and $\delta = 25$. Substituting these into Equation (2.7), we obtain

$$t = \frac{\bar{d} - \delta}{s_d / \sqrt{n}} = \frac{105.7 - 25}{103.85 / \sqrt{10}} = 2.46$$

From t-tables, $t_{0.05}$ with 9 degrees of freedom is 1.833. Thus, the critical region is
$t > 1.833$. Since $t = 2.46 > 1.833$, we reject $H_0$ and conclude that $\mu_d > 25$ at
$\alpha = 0.05$. That is, the data suggest that the true average total body bone mineral
content during post-weaning exceeds that during lactation by more than 25.
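
Example 2.11 differs from the earlier paired examples in that the hypothesized difference is $\delta = 25$ and not 0. The short sketch below, with variable names chosen only for illustration, recomputes the differences from the raw data and shows where $\delta$ enters the statistic.

```python
from math import sqrt
from statistics import mean, stdev

# Total body bone mineral content from Example 2.11
lactation = [1928, 2549, 2825, 1924, 1628, 2175, 2114, 2621, 1843, 2541]
post_weaning = [2126, 2885, 2895, 1942, 1750, 2184, 2164, 2626, 2006, 2627]

d = [p - l for l, p in zip(lactation, post_weaning)]   # D = P - L
n = len(d)
delta = 25                                             # hypothesized mean difference

standard_error = stdev(d) / sqrt(n)
t = (mean(d) - delta) / standard_error                 # Equation (2.7)

T_CRIT = 1.833          # t_0.05 with 9 degrees of freedom, from the t-tables
reject = t > T_CRIT     # right-tailed test, H1: mu_d > 25
print(f"mean difference = {mean(d):.1f}, s_d = {stdev(d):.2f}, t = {t:.2f}, reject H0: {reject}")
```

Subtracting $\delta$ in the numerator is what distinguishes this test from a test of $\mu_d = 0$; omitting it would overstate the evidence.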


Self-Assessment Questions

Exercise 2.4

1. Rework Example 2.9 by calculating the differences $(x_i - y_i)$ and testing
$H_0: \mu_d = 0$ against $H_1: \mu_d > 0$.

2. A manufacturer is interested in showing that a newly developed gasoline additive
increases gasoline mileage obtained by most makes of cars. To do so, the
manufacturer obtains five new cars. Each car is tested using premium unleaded
gasoline and then using premium unleaded with the additive. The variable of
interest each time is the number of miles per gallon obtained. The data from the
tests are:

x (no additive)    y (with additive)    d = x − y (difference)
15                 16
17                 15
12                 15
25                 27
35                 35

(a) Copy and complete the above table by finding the five difference scores
generated by these paired observations.
(b) Find an unbiased estimate for the mean difference in mileage with and
without the additive.
(c) Find an unbiased estimate of the variance in the difference in mileage
obtained with and without the additive.
(d) Find $s_d$.
(e) Assuming normality, perform a hypothesis test to determine whether or not
there is a difference between the mean mileage of the two gasolines. Use
$\alpha = 0.05$.


SESSION 5: TESTS CONCERNING TWO POPULATION PROPORTIONS
In Sessions 1 – 4 of Unit 2, you learnt about methods for comparing
the means of two different populations. In this session, you will learn
about a method for comparing the proportions of two different populations.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses concerning tests of two population
proportions;
2. conduct tests concerning proportions of two large samples that are independent of
each other.

Now read on …

Tests concerning two different population proportions, p1 and p2 , are based on two
different random samples of sizes n1 and n2 , from populations 1 and 2, respectively.

Suppose that we select a random sample of size $n_1$ from a population, and denote the
proportion of “successes” (i.e. sample units that fall into a certain category of interest)
by $\hat{p}_1$. That is,

$$\hat{p}_1 = \frac{x_1}{n_1},$$

where $x_1$ is the number of successes. Again, suppose that we select a random sample of
size $n_2$ from a second and different population, and denote the proportion of successes
by

$$\hat{p}_2 = \frac{x_2}{n_2}.$$

If each of the sample sizes $n_1$ and $n_2$ is large, and if the samples are independent of
each other, then we can compare the population proportions $p_1$ and $p_2$ by testing the
null hypothesis $H_0 : p_1 - p_2 = \delta$ against any one of the alternatives
$H_1 : p_1 - p_2 < \delta$, or $H_1 : p_1 - p_2 > \delta$, or $H_1 : p_1 - p_2 \neq \delta$, using the test statistic

$$z = \frac{\hat{p}_1 - \hat{p}_2 - \delta}{\sigma_{\hat{p}_1 - \hat{p}_2}}. \qquad (2.8)$$


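If $\delta = 0$, then $\sigma_{\hat{p}_1 - \hat{p}_2}$ is estimated by

$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}, \qquad (2.9)$$

where $\hat{p}$, called the combined sample proportion, is given by

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}.$$

If $\delta \neq 0$, then $\sigma_{\hat{p}_1 - \hat{p}_2}$ is estimated by

$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1 - 1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2 - 1}}. \qquad (2.10)$$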

Note that the critical regions for the various tests will be the same as those in Table 1.3
in Unit 1 Session 2.

Example 2.12
If x1 = 18 , x 2 = 15 , n1 = 35 and n2 = 42 , test the null hypothesis

H 0 : p1 − p2 = 0
against
H1 : p1 − p2 > 0

at 5% significance level.

Solution
We note that $\delta = 0$ and estimate $\sigma_{\hat{p}_1 - \hat{p}_2}$ by Equation (2.9). We first find the combined
sample proportion as follows:

$$\hat{p} = \frac{18 + 15}{35 + 42} = 0.4286.$$

Therefore,

$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{(0.4286)(0.5714)\left(\frac{1}{35} + \frac{1}{42}\right)} = 0.1133$$

Substituting $\hat{p}_1 = \frac{18}{35} = 0.5143$, $\hat{p}_2 = \frac{15}{42} = 0.3571$, $\delta = 0$ and $s_{\hat{p}_1 - \hat{p}_2} = 0.1133$ into
Equation (2.8), we obtain

$$z = \frac{\hat{p}_1 - \hat{p}_2 - \delta}{\sigma_{\hat{p}_1 - \hat{p}_2}} = \frac{0.5143 - 0.3571 - 0}{0.1133} = 1.3875$$

From the z tables, $z_{0.05} = 1.645$; therefore the critical region is $z > 1.645$. Since
$z = 1.3875 < 1.645$, we fail to reject $H_0$. That is, the data do not provide sufficient
evidence that $p_1 > p_2$.

Example 2.13
Refer to Example 2.12 and test the null hypothesis
H 0 : p1 − p2 = −0.15
against
H1 : p1 − p2 > −0.15

at 5% significance level. Interpret your result.

Solution
We note that δ ≠ 0 and estimate σ pˆ 1 − pˆ 2 by Equation (2.10).

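We have $n_1 = 35$, $n_2 = 42$, $\hat{p}_1 = \frac{18}{35} = 0.5143$ and $\hat{p}_2 = \frac{15}{42} = 0.3571$.
Therefore,

$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1 - 1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2 - 1}} = \sqrt{\frac{(0.5143)(0.4857)}{35 - 1} + \frac{(0.3571)(0.6429)}{42 - 1}} = 0.1138$$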


Substituting $\hat{p}_1 = \frac{18}{35} = 0.5143$, $\hat{p}_2 = \frac{15}{42} = 0.3571$, $\delta = -0.15$ and $s_{\hat{p}_1 - \hat{p}_2} = 0.1138$ into
Equation (2.8), we obtain

$$z = \frac{\hat{p}_1 - \hat{p}_2 - \delta}{\sigma_{\hat{p}_1 - \hat{p}_2}} = \frac{0.5143 - 0.3571 - (-0.15)}{0.1138} = 2.6995$$

From the z tables, $z_{0.05} = 1.645$; therefore the critical region is $z > 1.645$. Since
$z = 2.6995 > 1.645$, we reject $H_0$ and conclude that $p_1 - p_2 > -0.15$, i.e. $p_1 > p_2 - 0.15$.
That is, if $p_1$ is smaller than $p_2$ at all, it is smaller by less than 0.15.
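
Examples 2.12 and 2.13 differ only in the value of δ and therefore in which standard error is used. A minimal Python sketch of both calculations follows. The function name `two_proportion_z` is an assumption made for illustration, the standard errors are taken as the square-root forms that reproduce the values 0.1133 and 0.1138 in the worked examples, and 1.645 is the z-table value quoted above.

```python
import math

def two_proportion_z(x1, n1, x2, n2, delta=0.0):
    """z statistic for H0: p1 - p2 = delta, following Equations (2.8)-(2.10)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    if delta == 0:
        # Equation (2.9): pooled (combined) sample proportion
        p_hat = (x1 + x2) / (n1 + n2)
        se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    else:
        # Equation (2.10), with the n - 1 divisors used in this text
        se = math.sqrt(p1_hat * (1 - p1_hat) / (n1 - 1)
                       + p2_hat * (1 - p2_hat) / (n2 - 1))
    return (p1_hat - p2_hat - delta) / se

Z_CRITICAL = 1.645  # z_0.05 from the z tables

z_212 = two_proportion_z(18, 35, 15, 42)               # Example 2.12
z_213 = two_proportion_z(18, 35, 15, 42, delta=-0.15)  # Example 2.13
print(round(z_212, 2), z_212 > Z_CRITICAL)
print(round(z_213, 2), z_213 > Z_CRITICAL)
```

Because the script carries full precision rather than four-decimal intermediate values, its results may differ from the hand calculations in the third or fourth decimal place.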


Self-Assessment Questions


Exercise 2.5
1. The table below shows the total number of black and white males
who live in a certain district, and those who are homosexuals.
         Number in sample    Number of male homosexuals
White    804                 575
Black    175                 111
We wish to determine if there is a difference between the proportion of
homosexuals in the white male population and the proportion of homosexuals in
the black male population in this district.
(a) State the null and alternative hypotheses for the test.
(b) Conduct the test in part (a) using the 0.05 significance level.
2. Refer to Exercise 1 above and test $H_0 : p_1 - p_2 = 0.05$ against
$H_1 : p_1 - p_2 \neq 0.05$ at 5% significance level. Interpret your result.

3. The effectiveness of a newly developed allergy-relief capsule is to be compared
with that of one which has been on the market for a number of years. A sample of
250 persons using the new capsule revealed that 150 received satisfactory relief.
Out of a group of 400 using the older capsule, 232 received satisfactory relief.
Using the 0.02 significance level, test the null hypothesis that the proportion
receiving relief from the new capsule is equal to the proportion receiving relief
from the older capsule. Use a two-tailed test.
(a) State the null and alternative hypotheses, using H 0 and H1 .
(b) Conduct the test in part (a) above.
4. Refer to Exercise 3 above and test H 0 : p1 − p2 = 0.015 against
H1 : p1 − p2 ≠ 0.015 at α = 0.02 .
5. Suppose the incumbent President of Ghana comes from a large urban area. We
suspect that a larger proportion of urban voters than of rural voters will
support the President in his bid for re-election. Random samples of 500 urban and
400 rural voters are selected. The results are shown below.
                                            Urban       Rural
Plan to vote for the incumbent President    x1 = 300    x2 = 200
Number in sample                            n1 = 500    n2 = 400
Test at the 0.05 level of significance whether these data demonstrate that a
greater proportion of urban voters plan to vote for the incumbent President.


SESSION 6: TESTS CONCERNING TWO POPULATION VARIANCES
In Unit 2 Sessions 1 – 4, you learnt about methods for comparing the
means of two different populations. You will recall that in Sessions 3
and 4, you learnt about tests concerning two population means based on small and
independent samples. The methods used in those tests were based on assumptions about
the unknown population variances $\sigma_1^2$ and $\sigma_2^2$, which were assumed equal in Session 3,
and unequal in Session 4.

Rather than make assumptions about the equality or otherwise of the unknown population
variances $\sigma_1^2$ and $\sigma_2^2$, we can perform a test to check whether they are equal. In this session, you will
learn about a method for comparing the variances of two different populations.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses concerning tests of two population
variances;
2. conduct tests concerning variances of two samples that are independent of each
other.

Now read on …

Suppose that we have two independent random samples of sizes $n_1$ and $n_2$, taken from
different normal populations with variances $\sigma_1^2$ and $\sigma_2^2$. Then we can compare $\sigma_1^2$ and
$\sigma_2^2$ by testing the null hypothesis $H_0 : \sigma_1^2 = \sigma_2^2$ against any of the alternatives
$H_1 : \sigma_1^2 < \sigma_2^2$, or $H_1 : \sigma_1^2 > \sigma_2^2$, or $H_1 : \sigma_1^2 \neq \sigma_2^2$. The appropriate test statistic is given
by

$$F = \frac{s_1^2}{s_2^2}, \qquad (2.11)$$

where $s_1^2$ and $s_2^2$ are the sample variances. The F-statistic is the value of a random
variable having the F-distribution with $n_1 - 1$ (numerator degrees of freedom) and
$n_2 - 1$ (denominator degrees of freedom). Values of F for a given numerator and
denominator degrees of freedom can be read from F-distribution tables (see Table V
under Statistical Tables at the end of the module).


The critical regions for testing $H_0 : \sigma_1^2 = \sigma_2^2$ against an appropriate alternative, at
significance level $\alpha$, are shown in Table 2.2.

Table 2.2: Critical regions for testing $H_0 : \sigma_1^2 = \sigma_2^2$ against the various
alternatives
Alternative hypothesis                Rejection region for level α test
$H_1 : \sigma_1^2 < \sigma_2^2$       $F < F_{1-\alpha}(n_1 - 1, n_2 - 1)$
$H_1 : \sigma_1^2 > \sigma_2^2$       $F > F_{\alpha}(n_1 - 1, n_2 - 1)$
$H_1 : \sigma_1^2 \neq \sigma_2^2$    $F < F_{1-\alpha/2}(n_1 - 1, n_2 - 1)$ or $F > F_{\alpha/2}(n_1 - 1, n_2 - 1)$

You may find the following identity useful later.

$$F_{1-\alpha}(n_1 - 1, n_2 - 1) = \frac{1}{F_{\alpha}(n_2 - 1, n_1 - 1)}.$$
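
The identity follows from the fact that the reciprocal of an F random variable is again F-distributed, with the degrees of freedom interchanged. A short derivation is sketched below, writing $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ (shorthand introduced here, not used elsewhere in the module) and taking $F_{\alpha}$ to be the value with upper-tail area $\alpha$, as in the tables:

```latex
\begin{align*}
F \sim F(\nu_1,\nu_2) \;&\Longrightarrow\; \frac{1}{F} \sim F(\nu_2,\nu_1),\\
P\bigl(F < F_{1-\alpha}(\nu_1,\nu_2)\bigr) = \alpha
\;&\Longleftrightarrow\;
P\!\left(\frac{1}{F} > \frac{1}{F_{1-\alpha}(\nu_1,\nu_2)}\right) = \alpha,\\
\text{so}\quad \frac{1}{F_{1-\alpha}(\nu_1,\nu_2)} &= F_{\alpha}(\nu_2,\nu_1).
\end{align*}
```

This is why lower-tail critical values can be obtained from tables that list only upper-tail values.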

Example 2.14
Suppose that observations from two independent random samples from two normal
populations yielded the following results: $n_1 = 11$, $s_1^2 = 18.4$, $n_2 = 16$ and $s_2^2 = 13.5$.
Test the null hypothesis $H_0 : \sigma_1^2 = \sigma_2^2$ against $H_1 : \sigma_1^2 \neq \sigma_2^2$ at the 10% significance
level.

Solution

Substituting $s_1^2 = 18.4$ and $s_2^2 = 13.5$ into Equation (2.11), we obtain

$$F = \frac{s_1^2}{s_2^2} = \frac{18.4}{13.5} = 1.363.$$

The test is two-sided with $\alpha = 0.10$, so we need to evaluate
$F_{1-\alpha/2}(n_1 - 1, n_2 - 1) = F_{0.95}(10, 15)$ and $F_{\alpha/2}(n_1 - 1, n_2 - 1) = F_{0.05}(10, 15)$.
From the F-distribution tables, we have

$$F_{0.05}(10, 15) = 2.54.$$

Using the identity

$$F_{1-\alpha}(n_1 - 1, n_2 - 1) = \frac{1}{F_{\alpha}(n_2 - 1, n_1 - 1)},$$

we have

$$F_{0.95}(10, 15) = \frac{1}{F_{0.05}(15, 10)} = \frac{1}{2.85} = 0.35.$$

Therefore, the critical region is $F < 0.35$ or $F > 2.54$. Since $F = 1.363$ is neither less
than 0.35 nor greater than 2.54, we cannot reject the null hypothesis. That is, the data do
not provide sufficient evidence that $\sigma_1^2 \neq \sigma_2^2$.
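
The two-sided decision rule of Example 2.14 can be sketched in Python as follows. The function name `f_two_sided` and its argument names are assumptions made for illustration. The two upper-tail critical values are not computed but passed in as read from the F tables (2.54 and 2.85 above), and the lower critical value is obtained from the reciprocal identity.

```python
def f_two_sided(s1_sq, s2_sq, upper_crit, upper_crit_swapped):
    """Two-sided variance-ratio test of H0: sigma1^2 = sigma2^2.

    upper_crit         : F_{alpha/2}(n1 - 1, n2 - 1) from the F tables
    upper_crit_swapped : F_{alpha/2}(n2 - 1, n1 - 1) from the F tables
    """
    f = s1_sq / s2_sq
    lower_crit = 1 / upper_crit_swapped   # reciprocal identity for F_{1-alpha/2}
    reject = f < lower_crit or f > upper_crit
    return f, lower_crit, reject

# Example 2.14: n1 = 11, n2 = 16, table values F_0.05(10, 15) and F_0.05(15, 10)
f, lower, reject = f_two_sided(18.4, 13.5, upper_crit=2.54, upper_crit_swapped=2.85)
print(round(f, 3), round(lower, 2), reject)
```

Passing the table values in as arguments keeps the sketch free of any statistical library; a one-sided test would compare `f` with a single critical value instead.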

Example 2.15
The result from a certain experiment reported the following data on tensile strength
(psi) of liner specimens both when a certain fusion process was used and when this
process was not used.
No fusion    2748  2700  2655  2822  2511
             3148  3257  3213  3220  2753
             $n_1 = 10$, $\bar{x}_1 = 2902.8$, $s_1 = 277.3$

Fused        3027  3356  3359  3297
             3125  2910  2889  2902
             $n_2 = 8$, $\bar{x}_2 = 3108.1$, $s_2 = 205.9$

Does the data suggest that the standard deviation of the strength distribution for fused
specimens is smaller than that for not fused specimens? Carry out a test at the 0.01 level
of significance.

Solution
Let $\sigma_1$ represent the standard deviation of the strength distribution for the not fused
specimens, and $\sigma_2$ represent the standard deviation of the strength distribution for the
fused specimens.

A smaller standard deviation for the fused specimens corresponds to $\sigma_2 < \sigma_1$, so we
wish to test $H_0 : \sigma_1 = \sigma_2$ against $H_1 : \sigma_1 > \sigma_2$.

Substituting $s_1 = 277.3$ and $s_2 = 205.9$ into Equation (2.11), we obtain

$$F = \frac{s_1^2}{s_2^2} = \frac{(277.3)^2}{(205.9)^2} = 1.814.$$

From Table 2.2, the critical region for this alternative is $F > F_{0.01}(10 - 1, 8 - 1)$.
From the F-distribution tables, we have

$$F_{0.01}(9, 7) = 6.72.$$

Therefore, the critical region is $F > 6.72$. Since $F = 1.814$ is less than 6.72, we
cannot reject the null hypothesis. The data does not suggest that the standard deviation
of the strength distribution for fused specimens is smaller than that for not fused
specimens.

Example 2.16
Refer to Example 2.4 and test the null hypothesis $H_0 : \sigma_1 - \sigma_2 = 0$ against the alternative
hypothesis $H_1 : \sigma_1 - \sigma_2 > 0$ at the 0.05 level of significance.

Solution
Substituting $s_1 = 31$ and $s_2 = 26$ into Equation (2.11), we obtain

$$F = \frac{s_1^2}{s_2^2} = \frac{(31)^2}{(26)^2} = 1.422.$$

Now we need to evaluate $F_{0.05}(4 - 1, 4 - 1)$. From the F-distribution tables, we have

$$F_{0.05}(3, 3) = 9.28.$$

Therefore, the critical region is $F > 9.28$. Since $F = 1.422$ is less than 9.28, we cannot
reject the null hypothesis. That is, the data do not provide sufficient evidence that
$\sigma_1 > \sigma_2$; we cannot conclude that the standard deviation of the first brand of paint is
larger than that of the second.


Self-Assessment Questions


Exercise 2.6

1. In comparing the variability of the tensile strength of two kinds of structural steel,
an experiment yielded the following results: $n_1 = 13$, $s_1^2 = 19.2$, $n_2 = 16$, and
$s_2^2 = 3.5$, where the units of measurement are 1,000 pounds per square inch.
Assuming that the measurements constitute independent random samples from
two normal populations, test the null hypothesis $H_0 : \sigma_1^2 = \sigma_2^2$ against the
alternative $H_1 : \sigma_1^2 \neq \sigma_2^2$ at the 0.02 level of significance.

2. To find out whether the inhabitants of two South Pacific islands may be regarded
as having the same racial ancestry, an anthropologist determines the cephalic
indices of six adult males from each island, getting $\bar{x}_1 = 77.4$, $\bar{x}_2 = 72.2$, and the
corresponding standard deviations $s_1 = 3.3$ and $s_2 = 2.1$. Test at the 0.10 level of
significance whether it is reasonable to assume that the two populations sampled
have equal variances.
3. For a sample of 28 elderly men, the sample standard deviation of serum ferritin
(mg/L) was $s_1 = 52.6$; for 26 young men, the sample standard deviation was
$s_2 = 84.2$. Does the data suggest that the ferritin distribution in the elderly had a
smaller variance than in the younger adults? Carry out a test at the 0.01 level of
significance.


UNIT 3: TESTS ON CATEGORICAL DATA

UNIT OUTLINE:

Session 1: The Multinomial Distribution
Session 2: Goodness-of-fit Tests (When Categorical Probabilities are Completely Specified)
Session 3: Goodness-of-fit Tests for the Poisson, Binomial and Normal Distributions
Session 4: Goodness-of-fit Tests for Homogeneity
Session 5: Goodness-of-fit Tests for Independence
Session 6: Coefficients of Contingency

So far, all the tests that you have learnt about in Units 1 and 2 are
tests for quantitative data. In this unit, we shall turn our attention to
some common tests that are considered appropriate for qualitative
data.

We shall learn about situations in which observations can be classified as falling into
exactly one of a number of mutually exclusive categories. We shall then be concerned
with the number of observations that fall in each of these categories, and investigate
whether we can reject or fail to reject hypotheses about these numbers.

UNIT OBJECTIVES
By the end of the unit, you should be able to:
1. formulate the null and alternative hypotheses for tests of qualitative data;
2. conduct tests based on whether a set of observations are drawn from specified
distributions;
3. conduct goodness-of-fit tests for homogeneity;
4. conduct goodness-of-fit tests for independence; and
5. calculate and interpret measures of strength of association between two
categorical variables.


SESSION 1: THE MULTINOMIAL DISTRIBUTION


The statistical technique used for achieving each of the objectives set for
Unit 3 has the multinomial distribution as its mathematical basis. It is,
therefore, prudent to begin the unit by learning about the multinomial distribution.

In this session, you will learn about the multinomial distribution, which is an extension
of the binomial.

Objectives
By the end of the session, you should be able to:
1. state at least four properties of the multinomial experiment; and
2. determine whether or not a given distribution follows the multinomial distribution.

Now read on…

You learnt about the binomial distribution during your Diploma studies.
Recall that a binomial experiment possesses the following properties: (1) the
experiment consists of n identical trials, (2) each trial of the experiment results in
one of two possible outcomes, which may be classified as a “success” or a “failure”, (3) the
probability of a success (and hence of a failure) is the same for each experimental trial, and (4) the n
experimental trials are independent of each other. Multinomial experiments have similar
properties, although each trial of a multinomial experiment can result in one of two or more
possible outcomes.

The properties of the multinomial experiment are listed below:

1. The experiment consists of n identical trials.
2. There are k possible outcomes associated with each trial.
3. The probabilities of the k outcomes, denoted by $p_1, p_2, \ldots, p_k$, remain constant
from trial to trial; and $p_1 + p_2 + \cdots + p_k = 1$.
4. The n trials are independent of each other.
5. The random variables of interest are the counts $n_1, n_2, \ldots, n_k$ in each of the k
cells.

Note that when k = 2, a multinomial experiment reduces to a binomial experiment.
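
The reduction to the binomial case can be illustrated numerically. The sketch below uses the standard multinomial probability formula, which is not stated in this session and is included here as background. The function name `multinomial_pmf` and the values n = 10, x = 3 and p = 0.4 are illustrative assumptions, not taken from the text.

```python
import math

def multinomial_pmf(counts, probs):
    """P(n_1, ..., n_k) = n! / (n_1! ... n_k!) * p_1^n_1 * ... * p_k^n_k."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    coefficient = math.factorial(sum(counts))
    for c in counts:
        coefficient //= math.factorial(c)   # exact at every step
    return coefficient * math.prod(p ** c for p, c in zip(probs, counts))

# With k = 2 the multinomial probability equals the binomial probability.
n, x, p = 10, 3, 0.4                      # illustrative values, not from the text
binomial = math.comb(n, x) * p**x * (1 - p)**(n - x)
multinomial = multinomial_pmf([x, n - x], [p, 1 - p])
print(math.isclose(binomial, multinomial))
```

The check on the probabilities mirrors property 3: the k outcome probabilities must add up to 1.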


The following are examples of multinomial distributions.


Example 3.1
Empirical studies have shown that the distribution of blood group in the population of a
certain human race is as follows:
Blood type Percentage
A 41
B 9
AB 4
O 46
Assuming that the blood group is independent from person to person,
this can be looked on as a multinomial distribution with four possible outcomes, the blood
groups A, B, AB and O, with probabilities $p_1 = 0.41$, $p_2 = 0.09$, $p_3 = 0.04$ and
$p_4 = 0.46$, respectively. Note that $p_1 + p_2 + p_3 + p_4 = 1$. Therefore, the given
distribution of blood groups follows a multinomial distribution.

Example 3.2
The table shows the market shares for different brands of televisions.
Brand of TV Market share
LG 20%
Samsung 30%
Panasonic 35%
Sony 15%
It is clear that the brands of television are independent of each other.
The TV brands LG, Samsung, Panasonic and Sony have distribution probabilities
p1 = 0.20, p 2 = 0.30, p3 = 0.35, and p 4 = 0.15, respectively. So we have
0.20 + 0.30 + 0.35 + 0.15 = 1 and therefore, the distribution of brands of television sets
follows a multinomial distribution.

Example 3.3
The table shows the results of a consumer preference survey.
A B Store Brand
61 53 36
Try to justify why this is a multinomial distribution.


Self-Assessment Questions
Exercise 3.1
1. A study of the political affiliation of students in Mangoase SHS is given in the
table below. Justify why this is a multinomial distribution.

NPP    NDC    PPP    OTHERS
105    80     35     30

2. Suppose that the distribution of marital status of women in a large city is given in
the table below. Explain why it is a multinomial distribution.

Married    Single    Others
20%        15%       65%


SESSION 2: GOODNESS-OF-FIT TESTS (WHEN CATEGORICAL PROBABILITIES ARE COMPLETELY SPECIFIED)
In Session 1, you learnt about the properties of a multinomial
distribution.

In this session, you will learn how to formulate and test hypotheses concerning outcome
probabilities of multinomial distributions.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses concerning the outcome probabilities
of a multinomial distribution; and
2. conduct tests concerning outcome probabilities of a multinomial distribution.

Now read on…

Refer to Example 3.3. Suppose that we wish to test the null hypothesis that there is no
preference for any of the three brands against the alternative hypothesis that a
preference exists for one or more of the brands. Then we can let
p1 = Proportion of all customers who preferred brand A
p 2 = Proportion of all customers who preferred brand B
p3 = Proportion of all customers who preferred the store brand

Hence, we can formulate the null and alternative hypotheses as follows:

$H_0 : p_1 = p_2 = p_3 = \frac{1}{3}$ (no preference exists)

$H_1$ : At least one of the proportions is different from $\frac{1}{3}$ (a preference exists).

If the null hypothesis is true (i.e. if $p_1 = p_2 = p_3 = \frac{1}{3}$), then we would expect that
approximately $\frac{1}{3}$ of the customers in the sample preferred each brand. That is, the
expected value of the number of customers who preferred brand A is given by

$$E(n_1) = np_1 = (150)\left(\tfrac{1}{3}\right) = 50.$$



Similarly,

$$E(n_2) = E(n_3) = 50 \quad \text{(if no preference exists).}$$

To measure the degree of disagreement between the data and the null hypothesis, we
use the statistic

$$\chi^2 = \frac{[n_1 - E(n_1)]^2}{E(n_1)} + \frac{[n_2 - E(n_2)]^2}{E(n_2)} + \frac{[n_3 - E(n_3)]^2}{E(n_3)} = \frac{(n_1 - 50)^2}{50} + \frac{(n_2 - 50)^2}{50} + \frac{(n_3 - 50)^2}{50}.$$

This test statistic, $\chi^2$, has an approximate chi-square distribution with $(k - 1)$ degrees
of freedom, where $k = 3$ is the number of categories. We will reject the null hypothesis of
no preference, at a given significance level $\alpha$, if $\chi^2 \geq \chi^2_{\alpha}$. The critical values for the
chi-square distribution are given in Table IV.

We shall now give the general form of a test of hypothesis concerning multinomial
probabilities. Suppose we wish to test the null hypothesis

$H_0 : p_1 = p_{1,0},\ p_2 = p_{2,0},\ \ldots,\ p_k = p_{k,0}$, where $p_{1,0}, p_{2,0}, \ldots, p_{k,0}$
represent the hypothesized values of the multinomial probabilities,
against
$H_1$ : At least one of the multinomial probabilities does not equal its
hypothesized value.

Then the test statistic is given by

$$\chi^2 = \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)}, \qquad (3.1)$$

where $E(n_i) = np_{i,0}$ is the expected number of type i outcomes assuming that $H_0$ is
true. The test statistic has an approximate chi-square distribution with $(k - 1)$ degrees of
freedom. The total sample size is n and the rejection region is $\chi^2 \geq \chi^2_{\alpha}$.

Note that the approximation is good on the assumption that the sample size is large
enough so that, for every cell, the expected cell frequency E (ni ) is equal to 5 or more.
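
The general procedure can be sketched in a few lines of Python. The function name `chi_square_gof` is an illustrative assumption. The data are the consumer preference counts of Example 3.3 under the hypothesis of no preference; since the text gives no significance level for that illustration, the sketch reports only the statistic and its degrees of freedom, which would then be compared with the appropriate value from Table IV.

```python
def chi_square_gof(observed, hypothesised_probs):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    n = sum(observed)
    expected = [n * p for p in hypothesised_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, statistic, len(observed) - 1

# Consumer preference counts from Example 3.3 under "no preference".
observed = [61, 53, 36]
expected, chi2, df = chi_square_gof(observed, [1 / 3] * 3)
print([round(e, 1) for e in expected], round(chi2, 2), df)
```

Each expected count here is 50, well above 5, so the large-sample condition noted above is met.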


Example 3.4
The following is the criterion used by a certain firm to decide on annual pay rise of its
employees: employees who score an average above 80 in a series of evaluations will
receive a merit pay rise, those who score between 50 and 80 will receive the standard
pay rise, and those below 50 will receive no pay rise. The firm designed the plan with
the objective that, on the average, 25% of its employees would receive merit pay rise,
65% would receive standard pay rise, and 10% would receive no pay rise. The
distribution of pay rise for 600 employees after the evaluation is given below:
No pay rise Standard pay rise Merit pay rise
42 365 193

Test, at the 0.01 level of significance, whether the data indicate that the distribution of
pay rise differs from those established by the firm.

Solution
Let
p1 = Proportion of employees who receive no pay rise
p 2 = Proportion of employees who receive a standard pay rise
p3 = Proportion of employees who receive a merit pay rise

Then the null and alternative hypotheses are as follows:

$H_0 : p_1 = 0.10$, $p_2 = 0.65$, $p_3 = 0.25$

$H_1$ : At least one of the proportions differs from the firm’s plan

The expected cell frequencies are calculated as follows.

$$E(n_1) = np_{1,0} = 600(0.10) = 60$$

$$E(n_2) = np_{2,0} = 600(0.65) = 390$$

$$E(n_3) = np_{3,0} = 600(0.25) = 150$$


Summarizing the $n_i$ and $E(n_i)$ into a table, we have

                      No pay rise    Standard pay rise    Merit pay rise
Observed, $n_i$       42             365                  193
Expected, $E(n_i)$    60             390                  150

Substituting the $n_i$ and $E(n_i)$ into Equation (3.1), we obtain

$$\chi^2 = \sum_{i=1}^{3} \frac{[n_i - E(n_i)]^2}{E(n_i)} = \frac{(42 - 60)^2}{60} + \frac{(365 - 390)^2}{390} + \frac{(193 - 150)^2}{150} = 19.33$$

From the $\chi^2$ tables, the value of $\chi^2_{0.01}$ with $k - 1 = 2$ degrees of freedom is 9.210.
Therefore, the critical region is $\chi^2 \geq 9.210$. Since 19.33 is greater than 9.210, we
reject the null hypothesis and conclude that the distribution of pay rise differs from the firm’s plan.
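
It can be instructive to see how much each category contributes to the statistic. The following Python sketch, with illustrative variable names that are not part of the text, recomputes the statistic of Example 3.4 cell by cell; the critical value 9.210 is the table value quoted above.

```python
observed = {"No pay rise": 42, "Standard pay rise": 365, "Merit pay rise": 193}
planned = {"No pay rise": 0.10, "Standard pay rise": 0.65, "Merit pay rise": 0.25}

n = sum(observed.values())
contributions = {}
for category, n_i in observed.items():
    expected = n * planned[category]
    contributions[category] = (n_i - expected) ** 2 / expected

chi2 = sum(contributions.values())
CHI2_CRITICAL = 9.210                      # chi-square_0.01 with 2 df, from the tables

for category, value in contributions.items():
    print(f"{category:18s} {value:6.2f}")
print(round(chi2, 2), chi2 >= CHI2_CRITICAL)
```

The merit pay rise category accounts for the largest share of the statistic, which points to where the observed distribution departs most from the plan.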

Example 3.5
The headteacher of a primary school is interested in knowing whether there exist colour
preferences among the pupils of his school. A sample of 100 pupils was drawn from the
school and shown identically shaped objects, coloured red, blue, yellow, green or pink.
When each child was asked to pick the most preferred colour, 30 picked red, 18 blue, 12
yellow, 20 green and 20 pink. Test, at 5% significance level, the hypothesis:
H 0 : there does not exist colour preferences among the pupils
against
H 1 : colour preference does exist.

Solution
If there are no preferences, then the probability of choosing any colour is the same.
Since there are five colours, the probability, $p_i$, of choosing any colour is

$$p_i = \frac{1}{5} = 0.2 \quad (i = 1, 2, \ldots, 5).$$

Thus, mathematically, we are testing


H 0 : p1 = p 2 = p3 = p 4 = p5 = 0.2
against
H 1 : at least one of the pi s ≠ 0.2

We can summarize some of our calculations into a table as follows:

Colour                Red    Blue    Yellow    Green    Pink
Observed, $n_i$       30     18      12        20       20
Expected, $E(n_i)$    20     20      20        20       20
$n_i - E(n_i)$        10     −2      −8        0        0

Thus,

$$\chi^2 = \sum_{i=1}^{5} \frac{[n_i - E(n_i)]^2}{E(n_i)} = \frac{10^2}{20} + \frac{(-2)^2}{20} + \frac{(-8)^2}{20} + \frac{0^2}{20} + \frac{0^2}{20} = 5 + 0.2 + 3.2 = 8.4$$

At the 5% significance level and from the chi-square tables, $\chi^2_{0.05}$ at $df = 5 - 1 = 4$ is
9.49. Therefore, the critical region is $\chi^2 \geq 9.49$. Since $\chi^2 = 8.4 < \chi^2_{0.05}(4) = 9.49$, we
cannot reject the null hypothesis. That is, we do not have sufficient evidence against the
null hypothesis.


Self-Assessment Questions
Exercise 3.2
1. Assume that a die is thrown 60 times and a record is kept of the number of times a
1, 2, 3, 4, 5 or 6 is observed.

Face 1 2 3 4 5 6
Number 13 10 8 10 12 7

Test at 5% significance level, whether or not the die is fair.

2. The Ghana Statistical Service compiles data on the population of Ghana by
religion and publishes its findings in Current Population Reports. The Service has
indicated that 15% of the population are Muslims, 77% are Christians and 8% belong to
other religions. An independent simple random sample of residents in the
Ashanti region gave the following data on religion.

Muslims    Christians    Others
42         166           7

At the 1% significance level, test the hypothesis:

$H_0 : p_1 = 0.15$, $p_2 = 0.77$, $p_3 = 0.08$
against
$H_1$ : At least one of the probabilities differs from its hypothesized value.


SESSION 3: GOODNESS-OF-FIT TESTS FOR THE POISSON, BINOMIAL AND NORMAL DISTRIBUTIONS
In Session 2, you learnt how to perform goodness-of-fit tests for
situations where the probabilities for elements that fall in the various
categories are completely specified.
In this session, you will learn how to formulate hypotheses and conduct goodness-of-fit
tests of a set of data to a specified discrete or continuous probability distribution. We
shall restrict our study to only the Poisson and binomial probability distributions for the
discrete case, and the normal distribution for the continuous case, although other
discrete and continuous probability distributions exist.

Objectives
By the end of the session, you should be able to:
1. formulate null and alternative hypotheses for carrying out goodness-of-fit tests of
a set of data to a Poisson, or a binomial, or a normal distribution;
2. conduct goodness-of-fit tests of a set of data to a Poisson
distribution;
3. conduct goodness-of-fit tests of a set of data to a binomial distribution; and
4. conduct goodness-of-fit tests of a set of data to a normal distribution.

Now read on…

The goodness-of-fit test can be applied to test a sample data set as coming from a
population having a Poisson, or binomial, or normal distribution. Unlike previous
statistical tests, however, the hypothesis of interest is the null hypothesis and not the
alternative. The test statistic for such tests is given by
$$\chi^2 = \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)}, \qquad (3.2)$$

where k is the number of classes; ni the number of observations that fall into class i;
and E (ni ) the expected number of observations that fall into class i. The expected
number of observations that fall into class i, E (ni ) , is given by E (ni ) = n pi , where pi
is the probability of an observation falling into class i and n is the total sample size.
The test statistic in Equation (3.2) is approximately chi-square distributed with degrees
of freedom given by $(k - m - 1)$, where m is the number of independent parameters that
have to be estimated from the sample. The chi-square approximation is particularly
good if each expected frequency is greater than 5. The approximation is acceptable,
however, if no more than 20% of the expected frequencies are less than 5.

The goodness-of-fit test is constructed in such a way that we will reject the null
hypothesis, at a given significance level α , if the observed value of the test statistic is
larger than or equal to the corresponding value, χ α2 , from chi-square tables. That is, we
will reject H 0 if χ 2 ≥ χ α2 .

Examples 3.6, 3.7 and 3.8 illustrate how to perform the calculations involved in testing
the fit of a set of data to a Poisson, a binomial and a normal distribution, respectively.

Example 3.6
The weekly number of power failures reported in a certain district in 50 weeks is
recorded as follows:
Number of failures Number of weeks
0 6
1 8
2 13
3 11
4 7
5 4
6 1
Determine whether the weekly number of power failures in the district follows a
Poisson distribution at the 5% significance level.
Solution
We wish to test the hypotheses:
H 0 : the weekly number of power failures follows a Poisson
distribution.
against
H 1 : the weekly number of power failures does not follow a
Poisson distribution.

We therefore determine the corresponding set of expected frequencies using Poisson
probabilities,

$$p_i = \frac{\lambda^i e^{-\lambda}}{i!}, \quad i = 0, 1, 2, \ldots, 6,$$


where $\lambda$ is the mean of the distribution. In this example, we shall have to estimate the
value of $\lambda$ (note that its value is not given in the problem) by calculating the mean of
the given sample data. Calculating the mean, we have

x        f     fx
0        6     0
1        8     8
2        13    26
3        11    33
4        7     28
5        4     20
6        1     6
Total    50    121

Therefore, the mean $\bar{x}$ is given by

$$\bar{x} = \frac{\sum_{i=1}^{7} f_i x_i}{\sum_{i=1}^{7} f_i} = \frac{121}{50} = 2.42 \approx 2.4$$
Thus, we can calculate the various probabilities as follows.

$$p_0 = \frac{(2.4)^0 e^{-2.4}}{0!} = 0.091$$

$$p_1 = \frac{(2.4)^1 e^{-2.4}}{1!} = 0.218$$

The rest of the calculation is summarized below.

Number of failures    Number of weeks    Poisson probabilities    Expected frequencies
i                     $n_i$              $p_i$                    $E(n_i) = 50p_i$
0                     6                  0.091                    4.55
1                     8                  0.218                    10.90
2                     13                 0.261                    13.05
3                     11                 0.209                    10.45
4                     7                  0.125                    6.25
5                     4                  0.060                    3.00
6                     1                  0.024                    1.20
Note that the values of $p_i$ for $i = 0, 1, 2, \ldots, 6$ can also be obtained from Poisson
probability tables. For example, the value of $p_0$ can be read off as 0.091 where the
row with $\lambda = 2.4$ intersects the column $x = 0$ in Table VI under Statistical Tables at
the end of the module.

In the table above, we see that three out of seven expected frequencies
(i.e. approximately 43%) are less than 5. To satisfy the requirement that no more than
20% of the expected frequencies are less than 5, we follow the common practice of
merging adjacent classes. In this case, we merge the last three classes into one
to obtain five classes, of which only one (20%) has an expected frequency less than 5,
as shown.

Number of    Number of    Poisson          Expected
Failures     Weeks        Probabilities    Frequencies
$i$          $n_i$        $p_i$            $E(n_i) = 50 p_i$
0             6           0.091             4.55
1             8           0.218            10.90
2            13           0.261            13.05
3            11           0.209            10.45
4 – 6        12           0.209            10.45

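Thus, we obtain
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)} \\
&= \frac{(6 - 4.55)^2}{4.55} + \frac{(8 - 10.90)^2}{10.90} + \cdots + \frac{(12 - 10.45)^2}{10.45} \\
&= 0.4621 + 0.7716 + \cdots + 0.0289 + 0.2299 \\
&= 1.49
\end{aligned}
\]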

The critical region for the test is $\chi^2 \ge \chi^2_\alpha(k - m - 1)$. In this problem, $k = 5$ (the
number of classes left after merging) and $m = 1$ (only one parameter, the mean of the
distribution, has been estimated from the sample). Thus, the number of degrees of
freedom for the test is 3, since $5 - 1 - 1 = 3$. Therefore, the critical region is
$\chi^2 \ge \chi^2_{0.05}(3) = 7.815$. Since 1.49 is less than 7.815, we cannot reject the null
hypothesis. That is, we do not have sufficient evidence to believe that the weekly
number of power failures does not follow a Poisson distribution.
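
The arithmetic in Example 3.6 can be checked with a short script. The following is only a sketch under stated assumptions: the variable and function names are illustrative, the probabilities are rounded to three decimal places to mimic values read from tables, and the critical value 7.815 is typed in from the chi-square table rather than computed.

```python
import math

# Observed data from Example 3.6: number of failures -> number of weeks
observed = {0: 6, 1: 8, 2: 13, 3: 11, 4: 7, 5: 4, 6: 1}
n_weeks = sum(observed.values())  # 50

# Estimate lambda by the sample mean, rounded to one decimal place as in the text
lam = round(sum(x * f for x, f in observed.items()) / n_weeks, 1)  # 2.42 -> 2.4


def poisson_prob(i, lam):
    """Poisson probability of exactly i events when the mean is lam."""
    return lam ** i * math.exp(-lam) / math.factorial(i)


# Round to 3 decimals to mimic probabilities read from tables
probs = {i: round(poisson_prob(i, lam), 3) for i in observed}
expected = {i: n_weeks * p for i, p in probs.items()}

# Merge classes 4, 5 and 6 into a single class, as in the worked solution
merged_observed = [observed[i] for i in range(4)] + [sum(observed[i] for i in (4, 5, 6))]
merged_expected = [expected[i] for i in range(4)] + [sum(expected[i] for i in (4, 5, 6))]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(merged_observed, merged_expected))
df = len(merged_observed) - 1 - 1  # k - m - 1, with m = 1 estimated parameter
critical_value = 7.815  # table value for alpha = 0.05 and 3 degrees of freedom

print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {chi_sq >= critical_value}")
```

Because the statistic falls below the table value, the script reaches the same decision as the hand calculation.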

Example 3.7
Four identical six-sided dice, each with faces marked 1 to 6, are rolled 200 times. At
each rolling, a record is made of the number of dice whose scores on the uppermost face
are even. The results are as follows.

Number of even scores, $x_i$     0    1    2    3    4
Frequency, $f_i$                10   41   70   57   22

Test, at the 5% level of significance, whether the number of even scores follows a binomial
distribution with $n = 4$ and $p = 0.5$.

Solution
We wish to test
$H_0$: the number of even scores follows $B(4, 0.5)$
against
$H_1$: the number of even scores does not follow $B(4, 0.5)$.
For a $B(n, p)$ distribution, we have
\[ p(x) = {}^{n}C_{x}\, p^x (1 - p)^{n - x}. \]
Thus, for $B(4, 0.5)$,
\[ p(0) = {}^{4}C_{0}\,(0.5)^0 (0.5)^4 = 0.0625 \]
\[ p(1) = {}^{4}C_{1}\,(0.5)^1 (0.5)^3 = 0.2500 \]
\[ p(2) = {}^{4}C_{2}\,(0.5)^2 (0.5)^2 = 0.3750 \]
\[ p(3) = {}^{4}C_{3}\,(0.5)^3 (0.5)^1 = 0.2500 \]
\[ p(4) = {}^{4}C_{4}\,(0.5)^4 (0.5)^0 = 0.0625 \]

We can now calculate the expected cell frequencies, $E(n_i) = 200\,p(x_i)$, and summarize
them in a table as shown.

$x_i$    $n_i$    $p(x_i)$    $E(n_i)$
0        10       0.0625      12.50
1        41       0.2500      50.00
2        70       0.3750      75.00
3        57       0.2500      50.00
4        22       0.0625      12.50

Note that it is possible to read off the values of binomial probabilities from
binomial distribution tables. See, for example, Table III under Statistical Tables
at the end of the module.


The requirement for the use of the $\chi^2$ distribution is satisfied since no expected
frequency is less than 5. We now evaluate the test statistic as shown.
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)} \\
&= \frac{(10 - 12.50)^2}{12.50} + \cdots + \frac{(22 - 12.50)^2}{12.50} \\
&= 0.500 + 1.620 + \cdots + 0.980 + 7.220 \\
&= 10.653
\end{aligned}
\]
No parameter has been estimated from the sample, so $m = 0$ and the number of degrees of
freedom is $k - m - 1 = 5 - 0 - 1 = 4$. The critical value is therefore
\[ \chi^2_{0.05}(k - m - 1) = \chi^2_{0.05}(4) = 9.488. \]

Since $\chi^2 = 10.653$ is greater than $\chi^2_{0.05}(4) = 9.488$, we reject $H_0$ and conclude that the
number of even scores does not follow the $B(4, 0.5)$ distribution.
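
As a check on Example 3.7, the binomial probabilities, expected frequencies and test statistic can be computed directly. This is a minimal sketch: the variable names are illustrative, and the critical value 9.488 is copied from the chi-square table, not computed.

```python
from math import comb

# Observed frequencies for 0, 1, 2, 3 and 4 even scores (Example 3.7)
observed = [10, 41, 70, 57, 22]
n_trials, p = 4, 0.5          # hypothesised B(4, 0.5)
total = sum(observed)         # 200 rolls

# Binomial probabilities p(x) = nCx * p^x * (1 - p)^(n - x)
probs = [comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)
         for x in range(n_trials + 1)]
expected = [total * q for q in probs]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 0 - 1    # k - m - 1, no parameter estimated
critical_value = 9.488        # table value for alpha = 0.05 and 4 degrees of freedom

print(expected)
print(f"chi-square = {chi_sq:.3f}, df = {df}, reject H0: {chi_sq >= critical_value}")
```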

Example 3.8
The following is the distribution of the readings obtained with a Geiger counter of the
number of particles emitted by a radioactive substance in 100 successive 40-second
intervals:
Number of particles Frequency
5– 9 1
10 – 14 10
15 – 19 37
20 – 24 36
25 – 29 13
30 – 34 2
35 – 39 1

(a) Verify that the mean of this distribution is $\bar{x} = 20$.
(b) Find the probabilities that a random variable having a normal distribution with
mean $\mu = 20$ and standard deviation $\sigma = 5$ will take on a value less than 9.5,
between 9.5 and 14.5, between 14.5 and 19.5, between 19.5 and 24.5, between 24.5
and 29.5, between 29.5 and 34.5, and greater than 34.5.
(c) Find the expected normal curve frequencies for the various classes.
(d) Test, at the 0.05 level of significance, whether the data may be looked upon as a
random sample from a normal population.


Solution
(a) To answer this question, we need to calculate $\sum f x$ as shown.

Class       Frequency ($f$)    Class mark ($x$)    $f x$
5 – 9             1                  7                7
10 – 14          10                 12              120
15 – 19          37                 17              629
20 – 24          36                 22              792
25 – 29          13                 27              351
30 – 34           2                 32               64
35 – 39           1                 37               37
Total       $\sum f = 100$                      $\sum f x = 2000$

Therefore,
\[ \bar{x} = \frac{\sum f x}{\sum f} = \frac{2000}{100} = 20, \]
as required.
(b) Since a normally distributed variable ranges from negative infinity to positive
infinity, the area beyond the class intervals must also be accounted for. Thus, the
area below 9.5 is the area below the $Z$ value
\[ Z = \frac{9.5 - 20}{5} = -2.1. \]
From Table I in the Appendix, the area below $Z = -2.1$ is approximately 0.0179.

To calculate the area between 9.5 and 14.5, we first find the area below 14.5:
\[ Z = \frac{14.5 - 20}{5} = -1.1. \]
From Table I in the Appendix, the area below $Z = -1.1$ is approximately 0.1357. Thus,
the area between 9.5 and 14.5 is the difference between the area below 14.5 and the
area below 9.5, which is $0.1357 - 0.0179 = 0.1178$.

To calculate the area between 14.5 and 19.5, we first find the area below 19.5:
\[ Z = \frac{19.5 - 20}{5} = -0.1. \]
From Table I, the area below $Z = -0.1$ is approximately 0.4602. Thus,
the area between 14.5 and 19.5 is the difference between the area below 19.5 and the
area below 14.5, which is $0.4602 - 0.1357 = 0.3245$.

The area in each of the remaining class intervals is calculated in a similar manner.
The calculations are summarized in the next table.

Classes               Frequency    $p(x)$
x < 9.5                   1        0.0179
9.5 ≤ x < 14.5           10        0.1178
14.5 ≤ x < 19.5          37        0.3245
19.5 ≤ x < 24.5          36        0.3557
24.5 ≤ x < 29.5          13        0.1554
29.5 ≤ x < 34.5           2        0.0268
x ≥ 34.5                  1        0.0019

(c) The expected normal curve frequencies for the various classes are summarized in
the table below.

Classes               Frequency    $p(x)$     $E(n_i) = 100\,p(x)$
x < 9.5                   1        0.0179       1.79
9.5 ≤ x < 14.5           10        0.1178      11.78
14.5 ≤ x < 19.5          37        0.3245      32.45
19.5 ≤ x < 24.5          36        0.3557      35.57
24.5 ≤ x < 29.5          13        0.1554      15.54
29.5 ≤ x < 34.5           2        0.0268       2.68
x ≥ 34.5                  1        0.0019       0.19

To satisfy the requirement that the approximation is acceptable only if no more
than 20% of the expected frequencies are less than 5, we combine the last three
classes to obtain

Classes               Frequency    $p(x)$     $E(n_i) = 100\,p(x)$
x < 9.5                   1        0.0179       1.79
9.5 ≤ x < 14.5           10        0.1178      11.78
14.5 ≤ x < 19.5          37        0.3245      32.45
19.5 ≤ x < 24.5          36        0.3557      35.57
x ≥ 24.5                 16        0.1841      18.41

(d) We evaluate the test statistic as follows:
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{5} \frac{[n_i - E(n_i)]^2}{E(n_i)} \\
&= \frac{(1 - 1.79)^2}{1.79} + \frac{(10 - 11.78)^2}{11.78} + \cdots + \frac{(16 - 18.41)^2}{18.41} \\
&= 0.3487 + 0.2690 + 0.6380 + 0.0052 + 0.3155 \\
&\approx 1.576.
\end{aligned}
\]

After combining classes so that they satisfy the requirement for use of the chi-square
distribution, 5 classes remain. Thus, the number of degrees of freedom for the test is
$5 - 1 - 1 = 3$, since one parameter, the mean of the distribution, is estimated from
the sample. From the chi-square tables and at the 0.05 level of significance, we have
$\chi^2_{0.05}(3) = 7.815$.

Since the calculated chi-square value of 1.576 is less than the table value of
7.815, we fail to reject the null hypothesis. We therefore conclude that the data
may be looked upon as a random sample from a normal population.
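
The steps of Example 3.8 can also be sketched in code. The block below is illustrative only: it computes the normal areas from the error function instead of reading them from Table I, so the result (about 1.57) can differ slightly in the later decimal places from a calculation based on rounded table values. The names used and the typed-in critical value 7.815 are assumptions made for this sketch.

```python
import math


def normal_cdf(x, mu, sigma):
    """Area under the normal curve with mean mu and s.d. sigma to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


mu, sigma = 20, 5
boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5]   # class boundaries
observed = [1, 10, 37, 36, 13, 2, 1]               # class frequencies
n = sum(observed)                                  # 100 intervals

# Cumulative areas at the boundaries, padded with 0 and 1 for the two tails
cdf = [0.0] + [normal_cdf(b, mu, sigma) for b in boundaries] + [1.0]
probs = [cdf[i + 1] - cdf[i] for i in range(len(cdf) - 1)]
expected = [n * q for q in probs]

# Merge the last three classes, as in the worked solution
merged_observed = observed[:4] + [sum(observed[4:])]
merged_expected = expected[:4] + [sum(expected[4:])]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(merged_observed, merged_expected))
df = len(merged_observed) - 1 - 1   # k - m - 1, with the mean estimated (m = 1)
critical_value = 7.815              # table value for alpha = 0.05 and 3 degrees of freedom

print([round(q, 4) for q in probs])
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {chi_sq >= critical_value}")
```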


Self-Assessment Questions
Exercise 3.3
1. The numbers of arrivals per minute at lunch time, recorded over 200 one-minute
periods, are shown below.

Number of arrivals per minute    Frequency
0                                   14
1                                   31
2                                   47
3                                   41
4                                   29
5                                   21
6                                   10
7                                    5
8                                    2

Determine, at the 0.05 level of significance, whether the number of arrivals per
minute follows a Poisson distribution.

2. A farmer kept record of the number of heifer calves born to each of his cows
during the first five years of breeding of each cow. The results are summarized
below.

No. of heifers 0 1 2 3 4 5
No. of Cows 4 19 41 52 26 8

Test, at the 5% level of significance, whether or not the binomial distribution with
parameters n = 5 and p = 0.05 is an adequate model for these data.

3. A random sample of 500 car batteries revealed the following distribution of
battery life in years.
Life (in years) Frequency
0 – under 1 12
1 – under 2 94
2 – under 3 170
3 – under 4 188
4 – under 5 28
5 – under 6 8

For these data, $\bar{x} = 2.80$ and $s = 0.97$. Test, at the 0.05 significance level, whether or
not the battery lives follow a normal distribution.


SESSION 4: GOODNESS-OF-FIT TESTS FOR HOMOGENEITY


So far, you have learnt how to use the goodness-of-fit tests to conduct
hypothesis tests concerning probabilities of various categories of a given
population (Unit 3, Session 2). You can also conduct goodness-of-fit tests for the
Poisson, binomial and normal distributions (Unit 3, Session 3).

In this session, you will learn how to conduct a goodness-of-fit test for homogeneity. This
test is used to determine whether frequency counts for a given variable are distributed
identically across different populations.

Objectives
By the end of the session, you should be able to:
1. formulate null and alternative hypotheses for goodness-of-fit tests for
homogeneity; and
2. conduct goodness-of-fit tests for homogeneity.

Now read on…

The goodness-of-fit test for homogeneity is considered appropriate when the following
conditions are met.

1. The method for selecting a sample from each population is simple random
sampling.
2. The variable under study is categorical.
3. The expected frequency for each cell should be at least 5.

Suppose that data are sampled from $p$ mutually exclusive populations, and that the
categorical variable has $l$ mutually exclusive levels. If $n_{ij}$ denotes the number of
individuals in the sample(s) that fall in row $i$ and column $j$ of the table, that is, in the
$(i, j)$th cell, then the data can be arranged in a two-way table as shown in Table 3.1.

Table 3.1: A $p \times l$ contingency table

                            Variable
Population    1           2           ...    l           Totals
1             $n_{11}$    $n_{12}$    ...    $n_{1l}$    $P_1$
2             $n_{21}$    $n_{22}$    ...    $n_{2l}$    $P_2$
...           ...         ...         ...    ...         ...
p             $n_{p1}$    $n_{p2}$    ...    $n_{pl}$    $P_p$
Totals        $L_1$       $L_2$       ...    $L_l$       $n$

Thus, Table 3.1 is a $p \times l$ contingency table in which the data are sampled from $p$
populations and the variable of interest has $l$ levels.

The number of observations, $n_{ij}$, that fall into each cell is called the observed cell
frequency.
\[ P_i = \sum_{j=1}^{l} n_{ij} \]
is the marginal total for row $i$, whilst
\[ L_j = \sum_{i=1}^{p} n_{ij} \]
is the marginal total for column $j$. Note that
\[ \sum_{i=1}^{p} P_i = \sum_{j=1}^{l} L_j = n \]
is the total sample size.

The null hypothesis states that, at each level of the variable, the $p$ mutually
exclusive populations have the same proportion, against the alternative that at least one
of these statements is false. Thus, with one null statement for each of the $l$ levels,

$H_{01}$: at level 1, Prop. of pop. 1 = Prop. of pop. 2 = ... = Prop. of pop. $p$
$H_{02}$: at level 2, Prop. of pop. 1 = Prop. of pop. 2 = ... = Prop. of pop. $p$
...
$H_{0l}$: at level $l$, Prop. of pop. 1 = Prop. of pop. 2 = ... = Prop. of pop. $p$

against
$H_1$: At least one of the $H_0$ statements is false


The test statistic is given by
\[ \chi^2 = \sum_{i=1}^{p} \sum_{j=1}^{l} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}, \tag{3.3} \]
where $E(n_{ij})$ is the expected cell frequency for the $(i, j)$th cell. It can be shown that
\[ E(n_{ij}) = \frac{P_i \times L_j}{n}. \]
The test statistic in Equation (3.3), under the null hypothesis, has an approximate chi-
square distribution with the number of degrees of freedom given by
\[ df = (p - 1)(l - 1), \]
where $p$ is the number of populations, and $l$ is the number of levels of the categorical
variable in the test.

The critical region for the test, at significance level $\alpha$, is given by
\[ \chi^2 \ge \chi^2_\alpha[(p - 1)(l - 1)]. \]

Example 3.9
In a study of television viewing habits of children, a developmental psychologist selects
a random sample of 300 primary school pupils, 100 boys and 200 girls. Each child is
asked which of the following television programmes they like best: The Talented
Child, The Pulpit, or Math and Science Quiz. The results are shown below.

                      Viewing Preferences
          The Talented Child    The Pulpit    Math and Science Quiz
Boys             50                 30                 20
Girls            50                 80                 70

Do boys’ preferences for the television programmes differ significantly from the girls’
preferences? Use the 0.05 level of significance.

Solution
We can calculate the population totals and television programme totals as shown in the
table.

                      Viewing Preferences
          Talented Child    Pulpit    Math & Science Quiz    Totals
Boys            50            30              20              100
Girls           50            80              70              200
Totals         100           110              90              300

The hypotheses we wish to test are as follows.

$H_{01}$: Proportion of boys who prefer Talented Child =
Proportion of girls who prefer Talented Child

$H_{02}$: Proportion of boys who prefer The Pulpit =
Proportion of girls who prefer The Pulpit

$H_{03}$: Proportion of boys who prefer Math & Science Quiz =
Proportion of girls who prefer Math & Science Quiz
against
$H_1$: At least one of the $H_0$ statements is false

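We can now calculate the expected cell frequencies as follows.
\[ E(n_{11}) = \frac{P_1 \times L_1}{n} = \frac{100 \times 100}{300} = 33.33 \]
\[ E(n_{12}) = \frac{P_1 \times L_2}{n} = \frac{100 \times 110}{300} = 36.67 \]
\[ E(n_{13}) = \frac{P_1 \times L_3}{n} = \frac{100 \times 90}{300} = 30.00 \]
\[ E(n_{21}) = \frac{P_2 \times L_1}{n} = \frac{200 \times 100}{300} = 66.67 \]
\[ E(n_{22}) = \frac{P_2 \times L_2}{n} = \frac{200 \times 110}{300} = 73.33 \]
\[ E(n_{23}) = \frac{P_2 \times L_3}{n} = \frac{200 \times 90}{300} = 60.00 \]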


Substituting the observed and corresponding expected cell frequencies into Equation
(3.3), we obtain
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{p} \sum_{j=1}^{l} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} \\
&= \frac{(50 - 33.33)^2}{33.33} + \cdots + \frac{(70 - 60)^2}{60} \\
&= 8.3375 + \cdots + 1.6667 \\
&= 19.3255
\end{aligned}
\]

The number of degrees of freedom is given by
\[ df = (p - 1)(l - 1) = (2 - 1)(3 - 1) = 2. \]
Now from chi-square tables, the value of chi-square at the 0.05 level of significance,
with 2 degrees of freedom, is 5.99. Since 19.3255 is greater than 5.99, we reject the null
hypothesis and conclude that at least one of the null hypotheses is false; that is, boys'
preferences for the television programmes differ significantly from girls' preferences.
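
The calculation in Example 3.9 follows the same pattern for any two-way table, so it can be sketched as a small function. The function name `chi_square_contingency` and the layout of the table as a list of rows are assumptions made for this illustration. Because the code keeps the expected frequencies unrounded, it gives about 19.32 rather than the 19.3255 obtained from frequencies rounded to two decimal places.

```python
def chi_square_contingency(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table.

    `table` is a list of rows; each row lists the observed cell frequencies.
    Expected frequencies are (row total x column total) / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            chi_sq += (obs - exp) ** 2 / exp

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df


# Example 3.9: rows are boys and girls; columns are the three programmes
viewing = [[50, 30, 20],
           [50, 80, 70]]
chi_sq, df = chi_square_contingency(viewing)
critical_value = 5.99  # table value for alpha = 0.05 and 2 degrees of freedom

print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {chi_sq >= critical_value}")
```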


Self-Assessment Questions
Exercise 3.4
1. The Director for Academic Affairs of the University of Cape Coast was
concerned that males and females were accepted at different rates into the four
different colleges (EDUCATION, CANS, CHLS and HAAS) in the university.
He, therefore, collected the following data on the acceptance of 1200 males and
800 females who applied to the university:

Acceptance EDUC. CANS CHLS HAAS


Males 300 240 300 360
Females 200 160 200 240

Are males and females distributed equally among the various colleges?
(a) State the appropriate null and alternative hypotheses for conducting the test
above.
(b) Conduct the test at the 0.05 level of significance.

2. The head of the surgery department at the University of Cape Coast Medical School
was concerned that Surgical Residents in training applied unnecessary blood
transfusions at a different rate than the more experienced Attending Physicians.
Therefore, he ordered a study of the 49 Attending Physicians and 71 Residents in
Training with privileges at the hospital. For each of the 120 surgeons, the number
of blood transfusions prescribed unnecessarily in a one-year period was recorded.
Based on the number recorded, a surgeon was identified as either prescribing
unnecessary blood transfusion Frequently, Occasionally, Rarely, or Never. The
following is a summary of the resulting data.

Physician    Frequently    Occasionally    Rarely    Never
Attending         2              3           31        13
Resident         15             28           23         5

Are attending physicians and residents in training distributed equally among the
various unnecessary blood transfusion categories?
(a) State the appropriate null and alternative hypotheses for conducting the test
above.
(b) Conduct the test at the 0.05 level of significance.


SESSION 5: GOODNESS-OF-FIT TESTS FOR INDEPENDENCE


In Session 4, you learnt about the goodness-of-fit tests for homogeneity.
Data for such tests were contained in contingency tables whose rows
contained p mutually exclusive populations, and whose columns contained l mutually
exclusive levels of a single categorical variable.

In this session, you will learn how to conduct goodness-of-fit tests for independence,
which are similar to goodness-of-fit tests for homogeneity.

Objectives
By the end of the session, you should be able to:
1. formulate the null and alternative hypotheses for goodness-of-fit tests for
independence; and
2. conduct goodness-of-fit tests for independence.
Now read on…

Whereas goodness-of-fit tests for homogeneity are applied to a single categorical
variable from two or more populations, goodness-of-fit tests for independence are
applied to two categorical variables from a single population.

In general, a hypothesis of independence between two categorical variables, in which
one is classified into $r$ categories and the other into $c$ categories, gives an $r \times c$
contingency table with $r \cdot c$ mutually exclusive cells, where $r$ is the number of rows and $c$
the number of columns. The table is used to examine whether one variable is contingent
(or dependent) on the other.

Table 3.2 is an $r \times c$ contingency table in which variable 1 (in the rows) is classified
into $r$ categories and variable 2 (in the columns) is classified into $c$ categories.

Table 3.2: An $r \times c$ contingency table

                            Variable 2
Variable 1    1           2           ...    c           Totals
1             $n_{11}$    $n_{12}$    ...    $n_{1c}$    $R_1$
2             $n_{21}$    $n_{22}$    ...    $n_{2c}$    $R_2$
...           ...         ...         ...    ...         ...
r             $n_{r1}$    $n_{r2}$    ...    $n_{rc}$    $R_r$
Totals        $C_1$       $C_2$       ...    $C_c$       $n$

The number of observations contained in each cell is called the observed cell frequency.
\[ R_i = \sum_{j=1}^{c} n_{ij} \]
is the marginal total for row $i$, whilst
\[ C_j = \sum_{i=1}^{r} n_{ij} \]
is the marginal total for column $j$. Note that
\[ \sum_{i=1}^{r} R_i = \sum_{j=1}^{c} C_j = n \]
is the total sample size.

The hypotheses that we wish to test here are

$H_0$: Variables are independent

against the alternative

$H_1$: Variables are not independent.

The test statistic is given by
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}, \]
where $E(n_{ij})$ is the expected cell frequency for the $(i, j)$th cell. It can be shown that
\[ E(n_{ij}) = \frac{R_i \times C_j}{n}. \]
The test statistic, under the null hypothesis, is approximately
chi-square distributed with the number of degrees of freedom given by
\[ df = (r - 1)(c - 1). \]

The critical region for the test, at significance level $\alpha$, is
\[ \chi^2 \ge \chi^2_\alpha[(r - 1)(c - 1)]. \]

Note that in tests for independence, both row and column marginal totals are free to
vary although the sample size is fixed. The test for independence is a test of association,
not a test of cause and effect. Thus, the fact that two variables are dependent does not
imply that one causes the other.


Example 3.10
The table below is based on the classification by size and colour of a sample of 120
shirts drawn from a large population.

                      Size
Colour      Small    Medium    Large
Red           10       13        12
Yellow        12       11        14
Green         18       20        10

Test the hypothesis
$H_0$: size and colour are independent
against
$H_1$: size and colour are not independent
at the 5% significance level.

Solution

The row marginal totals are $R_1 = 35$, $R_2 = 37$ and $R_3 = 48$; and the column marginal
totals are $C_1 = 40$, $C_2 = 44$ and $C_3 = 36$. Therefore, the corresponding expected cell
frequencies are obtained by substituting appropriate values into
\[ E(n_{ij}) = \frac{R_i \times C_j}{n}. \]
Thus,
\[ E(n_{11}) = \frac{35 \times 40}{120} = 11.67, \quad E(n_{12}) = \frac{35 \times 44}{120} = 12.83, \quad E(n_{13}) = \frac{35 \times 36}{120} = 10.50, \]
\[ E(n_{21}) = \frac{37 \times 40}{120} = 12.33, \quad E(n_{22}) = \frac{37 \times 44}{120} = 13.57, \quad E(n_{23}) = \frac{37 \times 36}{120} = 11.10, \]
\[ E(n_{31}) = \frac{48 \times 40}{120} = 16.00, \quad E(n_{32}) = \frac{48 \times 44}{120} = 17.60, \quad E(n_{33}) = \frac{48 \times 36}{120} = 14.40. \]


These frequencies are summarized in the table below.

                      Size
Colour      Small    Medium    Large
Red         11.67    12.83     10.50
Yellow      12.33    13.57     11.10
Green       16.00    17.60     14.40

Now, substituting both the observed and their corresponding expected frequencies
into the test statistic, we obtain
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} \\
&= \frac{(10 - 11.67)^2}{11.67} + \frac{(13 - 12.83)^2}{12.83} + \cdots + \frac{(10 - 14.40)^2}{14.40} \\
&= 3.63
\end{aligned}
\]

The number of degrees of freedom is
\[ df = (3 - 1)(3 - 1) = 4. \]

Therefore, from chi-square tables, $\chi^2_{0.05}(4) = 9.49$. Since $\chi^2 = 3.63$ is less than
$\chi^2_{0.05}(4) = 9.49$, we cannot reject the null hypothesis. Therefore, we do not have
sufficient evidence to conclude that size and colour are dependent.

Example 3.11
A public opinion poll surveyed a simple random sample of 1000 voters. Respondents
were classified by gender (male or female) and by voting preference (NPP, NDC, or
PPP). The results are shown in the contingency table below.
              Voting Preferences
Gender      NPP     NDC     PPP
Male        200     150      50
Female      250     300      50

Is there a gender gap? Do the men’s voting preferences differ significantly from the
women’s preferences? Use the 0.05 level of significance.


Solution
We wish to test the hypothesis
$H_0$: Gender and voting preferences are independent
against
$H_1$: Gender and voting preferences are not independent.

The row marginal totals are $R_1 = 400$ and $R_2 = 600$; and the column marginal totals are
$C_1 = 450$, $C_2 = 450$ and $C_3 = 100$. The sample size is 1,000. Therefore, the
corresponding expected cell frequencies are obtained by substituting appropriate values
into
\[ E(n_{ij}) = \frac{R_i \times C_j}{n}. \]
Thus,
\[ E(n_{11}) = \frac{400 \times 450}{1000} = 180, \quad E(n_{12}) = \frac{400 \times 450}{1000} = 180, \quad E(n_{13}) = \frac{400 \times 100}{1000} = 40, \]
\[ E(n_{21}) = \frac{600 \times 450}{1000} = 270, \quad E(n_{22}) = \frac{600 \times 450}{1000} = 270, \quad E(n_{23}) = \frac{600 \times 100}{1000} = 60. \]

These frequencies are summarized in the table below.

              Voting Preferences
Gender      NPP     NDC     PPP
Male        180     180      40
Female      270     270      60

Now, substituting both the observed and their corresponding expected frequencies
into the test statistic, we obtain
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} \\
&= \frac{(200 - 180)^2}{180} + \cdots + \frac{(50 - 60)^2}{60} \\
&= 16.2
\end{aligned}
\]

The number of degrees of freedom is
\[ df = (2 - 1)(3 - 1) = 2. \]

Therefore, from chi-square tables, $\chi^2_{0.05}(2) = 5.99$. Since $\chi^2 = 16.2$ is greater than
$\chi^2_{0.05}(2) = 5.99$, we reject the null hypothesis. Therefore, gender and voting
preferences are not independent.
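
Examples 3.10 and 3.11 can be reproduced with the sketch below, which first builds the table of expected frequencies from the marginal totals and then sums the contributions of all cells. The helper names `expected_frequencies` and `chi_square_statistic` are illustrative assumptions, and the critical values are copied from the chi-square table.

```python
def expected_frequencies(table):
    """Expected cell frequencies R_i * C_j / n under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


# Example 3.10: colour (rows) by size (columns)
shirts = [[10, 13, 12],
          [12, 11, 14],
          [18, 20, 10]]
# Example 3.11: gender (rows) by voting preference (columns)
voters = [[200, 150, 50],
          [250, 300, 50]]

chi_shirts = chi_square_statistic(shirts)   # compare with 9.49 (df = 4)
chi_voters = chi_square_statistic(voters)   # compare with 5.99 (df = 2)

print(expected_frequencies(voters))
print(f"shirts: {chi_shirts:.2f}, reject H0: {chi_shirts >= 9.49}")
print(f"voters: {chi_voters:.2f}, reject H0: {chi_voters >= 5.99}")
```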

Self-Assessment Questions
Exercise 3.5
1. A recent study of educational levels of 1000 voters and their political party
affiliations in Ghana showed the results given in the table. Set up the appropriate
hypotheses and test, at the 0.05 level of significance, if party affiliation is
independent of the educational level of the voters.
Party Affiliation
NPP NDC PPP
JHS 95 80 115
SHS 135 85 105
Tertiary 160 105 120

2. Pollsters have found that the public’s confidence in business is closely tied to the
economic climate. When businesses grow and employment increases, public
confidence goes high. When the opposite occurs, public confidence goes low. A
scholar hypothesized that there is a relationship between level of confidence in
business and job satisfaction, and that this is true for both union and non-union
workers. He analysed sample data collected by the National Opinion Research
Center and shown below.
(Union)
                                          Job Satisfaction
Level of Confidence    Very         Moderately    A Little        Very
in Business            Satisfied    Satisfied     Dissatisfied    Dissatisfied
A Great Deal              26           15              2               1
Only Some                 95           73             16               5
Hardly Any                34           28             10               9

(Non-union)
                                          Job Satisfaction
Level of Confidence    Very         Moderately    A Little        Very
in Business            Satisfied    Satisfied     Dissatisfied    Dissatisfied
A Great Deal             111           52             12               4
Only Some                246          142             37              18
Hardly Any                73           51             19               9

(a) State the appropriate null and alternative hypotheses for these tests.
(b) Conduct the appropriate tests.
(c) The scholar concluded that his hypothesis was not supported by the data. Do you
agree?

3. Consider the data below. At the 0.05 level of significance, would you say that the
level of teaching evaluation is related to rank? Or are full professors more likely
to be judged above average than other ranks?
                                      Rank
Teaching                     Senior      Associate    Full
Evaluation       Lecturer    Lecturer    Professor    Professor
Above Average       36          62          45           50
Average             48          50          35           43
Below Average       30          13          20           35


SESSION 6: COEFFICIENTS OF CONTINGENCY


In Session 5, you learnt that the null hypothesis for goodness-of-fit test
for independence states that the two categorical variables in the
contingency table are independent of each other; whereas the alternative
states that they are dependent. If the null hypothesis in any such test is rejected, then our
conclusion will be that the two variables are related. But just how strong is the
relationship? Is the relationship low, moderate, or strong?

In this session, we focus our attention on the calculation and interpretation of coefficients
of contingency, which are measures of the strength of dependence between two
dependent variables.

Objectives
By the end of this session, you should be able to:
1. explain coefficient of contingency;
2. calculate Pearson’s coefficient of contingency;
3. interpret Pearson’s coefficient of contingency;
4. calculate Cramer’s coefficient of contingency;
5. interpret Cramer’s coefficient of contingency.

Now read on…

6.1 Pearson’s Coefficient of Contingency


One common weakness of the test statistic for the chi-square test of independence is
that it does not give a meaningful description of the degree of dependence (or strength
of association) of the variables. That is, the statistic is only useful for determining
whether there is dependence between the variables. The value of the test statistic
depends on the degrees of freedom as well as on the strength of association between
the variables. Therefore, its interpretation as a measure of strength is not a
straightforward task.

Pearson's coefficient of contingency is one method that provides a measure of the strength
of association between two variables. It is given by
\[ PCoC = \sqrt{\frac{\chi^2}{n + \chi^2}}, \tag{3.4} \]
where
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}, \]
and $n$ is the sample size. Since $\chi^2$ cannot be negative, the coefficient lies between 0
and 1 and is never negative. If Pearson's coefficient of contingency is 0 or close to 0, then
there is little or no association between the variables. The closer the coefficient is to
its largest attainable value, the stronger the association between the variables.

Example 3.12
Refer to Example 3.11. Calculate and interpret Pearson’s coefficient of contingency.

Solution
We already know from Example 3.11 that the two variables, gender and voting
preferences, are dependent. Substituting $\chi^2 = 16.2$ and $n = 1{,}000$ into Equation (3.4),
we have
\[ PCoC = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{16.2}{1{,}000 + 16.2}} = 0.1263. \]

We can, therefore, conclude that the association between gender and voting preferences
is low.

Example 3.13
Refer to Question 3 of Exercise 3.5. Calculate and interpret Pearson’s coefficient of
contingency.

Solution

Substituting $\chi^2 = 17.4354$ and $n = 467$ into Equation (3.4), we have
\[ PCoC = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{17.4354}{467 + 17.4354}} = 0.1897. \]
We can, therefore, conclude that the association between Teaching Evaluation and Rank
is low.

6.2 Cramer’s Coefficient of Contingency


Another major disadvantage of Pearson's coefficient of contingency is that it generally
does not attain the value 1, even if there is a complete association between
the two variables. It can be shown that its largest possible value is $\sqrt{(t - 1)/t}$,
where $t$ is the smaller of the number of rows and the number of columns. For a
$2 \times 2$ table, for example, the largest possible value is 0.707.

For this reason, Cramer's coefficient of contingency is often preferred as a measure of
the strength of association between two variables. It is given by
\[ CCoC = \sqrt{\frac{\chi^2}{n(t - 1)}}, \tag{3.5} \]
where $t$ is the smaller of the number of rows and the number of columns. The value of
CCoC lies in the interval 0 to 1.

Example 3.14
Refer to Example 3.11. Calculate and interpret Cramer’s coefficient of contingency.

Solution
We already know from Example 3.11 that the two variables, gender and voting
preferences, are dependent. We note that $r = 2$ is smaller than $c = 3$. Substituting
$t = 2$, $\chi^2 = 16.2$ and $n = 1{,}000$ into Equation (3.5), we have
\[ CCoC = \sqrt{\frac{\chi^2}{n(t - 1)}} = \sqrt{\frac{16.2}{1{,}000\,(2 - 1)}} = 0.1273. \]

We can, therefore, conclude that the association between gender and voting preferences
is low.

Example 3.15
Refer to Question 3 of Exercise 3.5. Calculate and interpret Cramer’s coefficient of
contingency.

Solution

Here, we note that $r = 3$ is smaller than $c = 4$. Substituting $t = 3$, $\chi^2 = 17.4354$ and
$n = 467$ into Equation (3.5), we have
\[ CCoC = \sqrt{\frac{\chi^2}{n(t - 1)}} = \sqrt{\frac{17.4354}{467\,(3 - 1)}} = 0.1366. \]
We can, therefore, conclude that the association between Teaching Evaluation and Rank
is low.
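
Both coefficients are simple functions of the chi-square statistic, the sample size and the table dimensions, so Examples 3.12 to 3.15 can be checked with a few lines of code. This is a minimal sketch; the function names `pearson_coc` and `cramer_coc` are illustrative and are not taken from any library.

```python
import math


def pearson_coc(chi_sq, n):
    """Pearson's coefficient of contingency: sqrt(chi^2 / (n + chi^2))."""
    return math.sqrt(chi_sq / (n + chi_sq))


def cramer_coc(chi_sq, n, rows, cols):
    """Cramer's coefficient: sqrt(chi^2 / (n (t - 1))), with t = min(rows, cols)."""
    t = min(rows, cols)
    return math.sqrt(chi_sq / (n * (t - 1)))


# Gender by voting preference (Example 3.11): chi-square 16.2, n = 1000, 2 x 3 table
print(round(pearson_coc(16.2, 1000), 4), round(cramer_coc(16.2, 1000, 2, 3), 4))

# Teaching evaluation by rank (Exercise 3.5, Question 3): chi-square 17.4354, n = 467, 3 x 4 table
print(round(pearson_coc(17.4354, 467), 4), round(cramer_coc(17.4354, 467, 3, 4), 4))
```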


Self-Assessment Questions
Exercise 3.6


UNIT 4: SIMPLE LINEAR REGRESSION


UNIT OUTLINE

Session 1: The Line of Best Fit


Session 2: Interpretation of the Regression Model Coefficients
Session 3: The Simple Coefficient of Determination and Correlation
Session 4: Regression Functions
Session 5: Standard Error of Estimate
Session 6: Effect of Variable Transformation on Simple Linear Regression

We have already studied some aspects of simple linear regression
in previous courses. In this unit, we will treat the concept of linear regression
in more detail. The procedures developed in this unit will be good
preparation for multiple regression in Unit 6.

Objectives
By the end of the unit, you should be able to:
1. state the general form of the simple linear regression equation and define the
terms involved;
2. determine an equation for the simple linear model using the method of least
squares and use it to make estimates for the response variable;
3. interpret the coefficients of the simple linear regression model obtained from
given data;
4. determine the types of variation in a given dataset on two related variables;
5. compute the coefficient of determination;
6. compute the coefficient of correlation between two variables;
7. determine regression functions involving two variables and describe the
relationship that exists between them;
8. use the regression coefficients of the two models to obtain the correlation
between the two variables;
9. assess the quality of a regression model by using the standard error of estimate; and
10. relate results of simple linear regression using raw and transformed datasets.


SESSION 1: THE LINE OF BEST FIT


The discussion on how to determine a line of best fit for a given statistical dataset started in previous courses. In this session, we will review this concept and then treat it in more detail. We will first be introduced to simple linear regression.

Objectives
By the end of this session, you should be able to:
1. state the general form of the simple linear regression equation and define the
terms involved;
2. determine an equation for the simple linear model using the method of least
squares and use it to estimate the response variable.

Now read on…

1.1 Introduction to Simple Linear Regression


Very often there is the need to relate a variable of interest Y, called a response variable, to one (or more) predictor variables X in order to build an equation that can be used to describe, predict and control the response variable on the basis of the predictor variable. A linear prediction equation for Y in terms of a single variable X is referred to as a simple linear regression model. Thus, simple linear regression assumes that there is a straight line relationship between the two variables. Such a relationship is tentatively decided by making a scatter plot of Y versus X.

Figure 4.1 shows a scatter plot of data on variables X and Y. The plot shows two
characteristics of the linear relationship:
1. A tendency for Y to decrease in straight line fashion as X increases.
2. A scattering of points around the straight line.

Generally, for a linear relationship, there is a tendency of Y to increase or decrease in a straight line fashion as X increases.


[Scatter plot of Y (vertical axis) against X (horizontal axis)]

Figure 4.1: Scatter plot showing relationship between X and Y

Another observation from the plot is that for a value of X = 10, for example, there are two values of Y. For any specific value x of X, there are, in theory, many Y values, as a result of other factors that affect Y. Thus, for each x, there is a population of Y values. Denote the mean of these Y values by $\mu_{Y|X}$. The straight line tendency observed in the scatter plot can be represented by assuming that $\mu_{Y|X}$ is related to X by

$$\mu_{Y|X} = \beta_o + \beta_1 x \qquad (4.1)$$

Thus, Equation (4.1) may be regarded as a line of means. The values $\beta_o$ and $\beta_1$ are referred to as regression parameters.

If we take into account the factors other than X that affect Y, then the value of Y is the sum of the average value and an error term $\varepsilon$ that represents all the other factors. Thus,

$$Y = \beta_o + \beta_1 x + \varepsilon$$

At any value of X, there is a population of error term values that could potentially occur, describing the different potential effects on Y. We will discuss the behaviour of the error terms in more detail in the next unit.


1.2 The Line of Best Fit


The values of the parameters determine the precise value of $\mu_{Y|X}$ for a given value of X. A reliable method for estimating the parameters is the least squares principle. The method minimizes the sum of the squares of the vertical deviations about the line. Suppose that $b_o$ and $b_1$ are estimates of $\beta_o$ and $\beta_1$ which result in an estimated value $\hat{y}_i$ of Y for each observation $i = 1, 2, \ldots, n$. The general regression equation now becomes

$$\hat{y} = b_o + b_1 x \qquad (4.2)$$

The deviations $e = y - \hat{y}$, called the prediction errors or residuals, evaluate how well the line fits the points on the scatter plot. The residuals are vertical distances between the observed values of Y and the predicted values on the line. It is expected that the residuals would be small if the line fits the data well.

An overall measure of the quality of the fit is given by the sum of squared deviations or errors (SSE). Denote, for now, the SSE by Q. Then

$$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (b_o + b_1 x_i)]^2 \qquad (4.3)$$

To obtain the values of $b_o$ and $b_1$ for which SSE is a minimum, we differentiate the quantity Q partially with respect to each $b_r$ (r = 0, 1) and equate to zero. We obtain the following equations:

$$\frac{\partial Q}{\partial b_o} = -2\sum_{i=1}^{n} [y_i - (b_o + b_1 x_i)] = 0 \;\Rightarrow\; n b_o + b_1 \sum x_i = \sum y_i \qquad (4.4)$$

$$\frac{\partial Q}{\partial b_1} = -2\sum_{i=1}^{n} [y_i - (b_o + b_1 x_i)]\,x_i = 0 \;\Rightarrow\; b_o \sum x_i + b_1 \sum x_i^2 = \sum x_i y_i \qquad (4.5)$$

Equations (4.4) and (4.5) are often called the normal equations. We can solve the two equations for $b_o$ and $b_1$. We write the system of equations as

$$\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \begin{pmatrix} b_o \\ b_1 \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} \qquad (4.6)$$

Using Cramer’s rule, we have


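$$b_1 = \frac{\begin{vmatrix} n & \sum y_i \\ \sum x_i & \sum x_i y_i \end{vmatrix}}{\begin{vmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{vmatrix}} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad (4.7)$$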

Substituting this value of $b_1$ into Equation (4.4), and then dividing through by n, we obtain

$$b_o = \frac{1}{n}\sum y_i - \frac{b_1}{n}\sum x_i$$

or

$$b_o = \bar{y} - b_1 \bar{x} \qquad (4.8)$$

Equation (4.2) obtained in terms of the least squares estimates in (4.7) and (4.8) gives
the least squares regression line. This line is what is referred to as the line of best fit.
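
The solution of the normal equations by Cramer's rule can be sketched in code as follows. This is an illustrative sketch only: the function name `solve_normal_equations` and the small dataset are hypothetical, chosen so that the points lie exactly on the line y = 1 + 2x and the answer can be checked by eye.

```python
def solve_normal_equations(x, y):
    """Solve the 2x2 normal equations (4.6) by Cramer's rule.

    Returns (b_o, b_1), the least squares intercept and slope.
    """
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    # Determinant of the coefficient matrix in (4.6)
    det = n * sum_xx - sum_x ** 2
    if det == 0:
        raise ValueError("all x values are equal; the slope is undefined")

    b0 = (sum_y * sum_xx - sum_x * sum_xy) / det
    b1 = (n * sum_xy - sum_x * sum_y) / det
    return b0, b1

# Hypothetical illustrative data lying exactly on y = 1 + 2x
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
b0, b1 = solve_normal_equations(x, y)
print(b0, b1)  # 1.0 2.0
```

Because the points lie exactly on a line, the residuals are all zero and the fitted line reproduces the data; with real data the same formulas give the line that minimizes Q.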

Example 4.1
The data shown concerns the number of hours spent by ten groups of workers on similar
jobs.
Size of group (x):   5   8   4   6  10   3   5   9  10   7
No. of hours (y):   10   7  13   4   1   7   8   3   2   5

(a) Obtain a plot of the data on y against x.
Comment on the relationship between the two variables.
(b) Find the least squares linear regression line for determining the number of hours
spent on the job by the number of workers involved.
(c) Use your estimated equation to predict the amount of time that would be spent on
the job by engaging four workers.
Solution
(a) The scatter plot of the data is given in Figure 4.1.
From the plot we realize that the hours spent on the job (y) decrease as more
workers are engaged on the job. This relationship also appears to be linear but not
too strong. Thus, a straight line equation would describe the relationship between X
and Y.
(b) The estimates of the regression parameters are given by

$$b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

Now, $\sum x = 67$, $\sum y = 60$, $\sum xy = 335$, $\sum x^2 = 505$ and $\sum y^2 = 486$.

Substituting into the expression with n = 10 gives

$$b_1 = \frac{10(335) - 67(60)}{10(505) - (67)^2} = -\frac{670}{561} = -1.1943$$

Making substitutions into

$$b_o = \frac{1}{n}\sum y_i - \frac{b_1}{n}\sum x_i$$

we have

$$b_o = \frac{1}{10}(60) + \frac{1.1943}{10}(67) = 14.0$$

Therefore, the simple linear model for hours spent on the job by number of workers is

$$\hat{y} = 14.0 - 1.1943x$$

(c) Substituting X = 4 into the equation, we obtain

$$\hat{y} = 14.0 - 1.1943(4) = 9.2228$$

Therefore, when there are 4 workers on the job, the expected time to complete the job would be 9.2 hours.
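
The hand calculation in Example 4.1 can be checked with a short script. This is a minimal sketch that applies Equations (4.7) and (4.8) directly to the data in the table; the names `fit_line`, `size` and `hours` are illustrative.

```python
def fit_line(x, y):
    """Least squares slope (Equation 4.7) and intercept (Equation 4.8)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi ** 2 for xi in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

size = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]    # size of group (x)
hours = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]   # number of hours (y)

b0, b1 = fit_line(size, hours)
predicted = b0 + b1 * 4                    # predicted hours for 4 workers

print(round(b0, 1), round(b1, 4))          # 14.0 -1.1943
print(round(predicted, 1))                 # 9.2
```

The script keeps the coefficients at full precision, so later digits of the prediction may differ slightly from a hand calculation that uses the rounded intercept 14.0.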

Example 4.2
The amount (in Gh¢) of electricity consumed by a household for an average weekly
temperature (in Degrees Celsius) for eight weeks is given in the table shown.

Amount of Elect:   75    71    92    88    108   120   118   126
Temp:              19.0  19.0  23.5  30.0  34.7  36.4  37.0  39.2

(a) Obtain a scatter plot of the data.
Use the plot to comment on the relationship between the two variables.
(b) Determine the least squares regression model for determining the amount of
electricity consumed in terms of temperature.
(c) Estimate the electricity consumption when temperature is 25 °C.

Solution
(a) The scatter plot of the data is as shown in Figure 4.2.

[Scatter plot of Electricity (vertical axis) against Temperature (horizontal axis)]

Figure 4.2: Scatter plot showing relationship between amount of electricity consumption and temperature

From the plot we realize that electricity consumption (Y) increases as temperature (X)
increases, and this relationship appears to be linear and quite strong. Thus, a straight
line equation would describe the relationship between X and Y.

(b) The estimates of the regression parameters are given by

$$b_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

Now, $\sum x = 238.80$, $\sum y = 798$, $\sum xy = 24997.0$, $\sum x^2 = 7609$ and $\sum y^2 = 82738$.

Substituting into the expression with n = 8 gives

$$b_1 = \frac{8(24997) - 238.8(798)}{8(7609) - (238.8)^2} = \frac{9413}{3846.56} = 2.4471$$

Making substitutions into



$$b_o = \frac{1}{n}\sum y_i - \frac{b_1}{n}\sum x_i$$

we have

$$b_o = \frac{1}{8}(798) - \frac{2.4471}{8}(238.8) = 26.7034$$

Therefore, the simple linear model for electricity consumption (Y) in terms of temperature (X) is

$$\hat{y} = 26.7034 + 2.4471x$$

(c) Substituting X = 25 into the equation, we obtain

$$\hat{y} = 26.7034 + 2.4471(25) = 87.8809$$

Therefore, when the temperature is 25 °C, the expected amount of electricity consumption would be Gh¢87.88.

Self-Assessment Questions
Exercise 4.1
1. The profit y, in GH¢, of a certain small scale business establishment in
the xth year of its operation is given in the table.

x 1 2 3 4 5
y 125 140 165 195 230
(a)Find the simple linear regression model for determining the profit for any given
year.
(b) Use your model to determine the profit in the 6th year.

2. A real estate agency collects data on the following variables:


y: sales price of a house (in thousands of GH¢)
x: home size (in tens of square feet)

x 23 11 20 17 15 21 24 13 19 25
y 360 196.2 346.2 273 282 331.8 387 255.6 327 345

(a)Find the simple linear regression model for determining the sales for any given
home size.
(b) Use your model to determine the sales for home size of 180 sq.ft.


TPaste Industries produces various brands of toothpaste. For effective management
of inventory, the company would like to predict more efficiently the demand for
one of its premium products, Toothgate. To develop a prediction model, the
company has gathered data concerning demand for Toothgate over the last 20 sales
periods (where a sales period is defined to be a four-week period). The data are
obtained on the following variables:
y ― the demand for Toothgate (in thousand pieces) in the sales period
x1 ― the price (in GH¢) offered by the company.
x2 ― price difference between price of the company and average industry price of
competitors’ similar product.

Period x1 x2 y Period x1 x2 y
1 4.35 -0.05 5.38 11 4.20 0.40 7.10
2 4.25 0.25 6.51 12 4.25 0.45 6.86
3 4.20 0.60 7.52 13 4.30 0.30 6.87
4 4.20 0.00 5.50 14 4.20 0.50 7.26
5 4.10 0.25 7.33 15 4.30 0.50 7.00
6 4.10 0.20 6.28 16 4.30 -0.05 5.65
7 4.30 0.05 5.87 17 4.05 0.10 6.50
8 4.30 -0.15 5.10 18 4.25 0.00 5.67
9 4.35 0.15 6.00 19 4.30 0.05 5.93
10 4.40 0.20 5.89 20 4.20 0.55 7.26

Use the data to answer Questions 3 and 4.


3. (a) Find the simple linear regression model for determining the demand in terms
of price of the commodity.
(b) Use your model to determine the demand in a period in which price is
GH¢4.15.
4. (a) Find the simple linear regression model for determining the demand in terms
of ‘‘price difference’’.
(b) Use your model to determine the demand in a period in which price difference
is GH¢0.30.


SESSION 2: INTERPRETATION OF THE REGRESSION MODEL COEFFICIENTS

Having obtained the regression model, it is necessary to interpret the coefficients of the model. It is important to note that interpretations must be made in the context of the problem. In this session, we will learn what the parameters of the linear model represent. This will guide how they should be interpreted for a particular problem.

Objective
By the end of this session, you should be able to:
• interpret the coefficients of the simple linear regression model obtained
from given data.

Now read on…

The regression model in Equation (4.1) was given as

$$\mu_{Y|X} = \beta_o + \beta_1 x$$

It has been noted that $\mu_{Y|X}$ represents the mean of the response variable Y when the value of the predictor variable X is x. It has also been noted that $\beta_o$ and $\beta_1$ are the regression parameters.
Now, if X = 0, then $\mu_{Y|X} = \beta_o$. Thus, $\beta_o$ is the mean response when the predictor variable assumes the value 0. Some care should be taken in interpreting the value of $\beta_o$, as its interpretation may not be relevant in the context of the given problem.

We may denote $\mu_{Y|X}$ simply by y. Now let X assume a particular value c. Then at X = c,

$$y_c = \beta_o + \beta_1 c$$

Again, at X = c + 1,

$$y_{c+1} = \beta_o + \beta_1 (c + 1)$$

The difference is

$$y_{c+1} - y_c = \beta_1$$


Thus, $\beta_1$ is the change in the mean of the response variable associated with a unit increase in X. If $\beta_1 > 0$, then the mean value of y increases as x increases. If $\beta_1 < 0$, then the mean value of y decreases as x increases.
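
The fact that the slope is the change in the predicted mean per unit increase in X, whatever the starting value, can be seen numerically. The sketch below uses the fitted line from Example 4.1 as an illustration; the function name `predict` and the chosen starting values are hypothetical.

```python
def predict(x, b0=14.0, b1=-1.1943):
    """Fitted line from Example 4.1: y-hat = b0 + b1 * x."""
    return b0 + b1 * x

# The change in the prediction for a one-unit increase in x is b1,
# whatever the starting value c is.
differences = [predict(c + 1) - predict(c) for c in (3, 5, 8)]
print([round(d, 4) for d in differences])  # [-1.1943, -1.1943, -1.1943]
```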

Example 4.3
Refer to Example 4.1.
Interpret the coefficients of the linear regression model for determining the number of
hours spent on the job in terms of the number of workers involved.

Solution

The linear model was obtained as

$$\hat{y} = 14.0 - 1.1943x$$

From the equation, $b_o = 14.0$.
The value of $b_o$ means that when there is no worker, the expected time taken to complete the job would be 14 hours. However, this interpretation has no practical relevance, as the job would be done only when hands are engaged.

From the equation, $b_1 = -1.1943$.
The value of $b_1$ suggests that for each additional worker involved on the job, the time taken to complete the job is expected to decrease by 1.19 hours (i.e., about 1 hour 12 minutes).

Example 4.4
Refer to Example 4.2.
Interpret the coefficients of the linear regression model for determining the amount of
electricity consumed in terms of temperature.

Solution
The linear model was obtained as

$$\hat{y} = 26.7034 + 2.4471x$$

From the equation, $b_o = 26.7034$.
The value of $b_o$ means that when the temperature is 0 °C, the expected amount spent on electricity would be Gh¢26.70.


From the equation, $b_1 = 2.4471$.
The value of $b_1$ suggests that for any 1 °C increase in temperature, it is expected that the amount spent on electricity would increase by Gh¢2.45.

Self-Assessment Questions
Exercise 4.2

Refer to the Questions in Exercise 4.1.
In each case, interpret the coefficients of the regression model.


SESSION 3: THE SIMPLE COEFFICIENT OF DETERMINATION AND CORRELATION

The simple coefficient of determination is one of the measures for assessing the usefulness of the simple linear regression model, that is, of using the independent variable X. Without the independent variable, we would have to use the mean $\bar{y}$ as the prediction of the dependent variable Y. This would lead to a prediction error of $e = y - \bar{y}$ and a sum of squared prediction errors $\sum_{i=1}^{n} (y_i - \bar{y})^2$, referred to as the total variation in y, SST. If the model is really useful, then it should explain a high proportion of this variation. In this session, we will discuss how to compute the portion of the total variation accounted for by the model and how this is used as a measure of the usefulness of the model. Using this measure, we will also find a measure of the linear relationship between the two variables.

Objectives
By the end of this session, you should be able to:
1. determine the types of variation in a given dataset on two related variables.
2. compute the coefficient of determination.
3. compute the coefficient of correlation between two variables.

Now read on…

3.1 Types of Variation


Figure 4.3 shows the scatter plot in Figure 4.1 that relates two variables X and Y. It also includes a plot of the mean of Y against each of the values of X, so that a horizontal line is drawn through the mean value. By plotting the mean of y against the values of X, we are unable to determine the effect of X in explaining the variation in Y. Thus, the values of Y are predicted using the mean of y. This produces the maximum error in the prediction of values of Y. When we use the predictor variable X, the prediction of $y_i$ is given by $\hat{y}_i = b_o + b_1 x_i$. This results in a sum of squared prediction errors of $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. This error represents the amount of variation in y that is not explained by using X (i.e., the model). We will denote this unexplained variation by SSE.


[Scatter plot of y against x, with a horizontal line drawn at the mean of y]

Figure 4.3: Scatter plot of y against x showing the mean of y

The difference in the measurements

$$\hat{y}_i - \bar{y}$$

represents the amount of decrease in prediction error achieved by using the predictor variable, X. The sum of squares of this reduction in error as a result of the use of X is referred to as the sum of squares regression (SSR) and is given by $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$. It represents the amount of variation in Y explained by the model. The three types of variation are given by the computational formulae

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \qquad (4.9)$$

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - \left[ b_o \sum_{i=1}^{n} y_i + b_1 \sum_{i=1}^{n} x_i y_i \right] \qquad (4.10)$$

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b_o \sum_{i=1}^{n} y_i + b_1 \sum_{i=1}^{n} x_i y_i - n\bar{y}^2 \qquad (4.11)$$


Example 4.5
The amount (in Gh¢) of electricity consumed by a household for an average weekly
temperature (in Degrees Celsius) for eight weeks is given in the table shown.

Amount of Elect:   75    71    92    88    108   120   118   126
Temp:              19.0  19.0  23.5  30.0  34.7  36.4  37.0  39.2

Find:
(a) the total variation in the amount of electricity consumed
(b) variation in the amount of electricity consumed which is accounted for by the
temperature levels.

Solution
(a) The total variation in the amount of electricity is

$$SST = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = 82738 - 8(99.75)^2 = 3137.5$$

(b) The regression equation for determining the amount of electricity bills given temperature was obtained as
Amt = 26.7 + 2.45 Temp

Thus, $b_o = 26.7$ and $b_1 = 2.447$.

The sum of the amounts is $\sum_{i=1}^{n} y_i = 798.0$.
The sum of cross-products is $\sum_{i=1}^{n} x_i y_i = 24997.0$.

The variation explained by Temp (X) is given by


$$\begin{aligned} SSR &= b_o \sum_{i=1}^{n} y_i + b_1 \sum_{i=1}^{n} x_i y_i - n\bar{y}^2 \\ &= 26.7(798) + 2.447(24997.0) - 8(99.75)^2 \\ &= 82474.259 - 79600.5 \\ &= 2873.759 \end{aligned}$$
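
The three types of variation can also be computed directly from their definitions, which gives a check on the identity SST = SSR + SSE. The sketch below is illustrative and keeps the least squares coefficients at full precision; because the hand calculation above uses the rounded values 26.7 and 2.447, the SSR from this script (about 2879) is slightly different from 2873.759. Variable names are assumptions made for the illustration.

```python
temp = [19.0, 19.0, 23.5, 30.0, 34.7, 36.4, 37.0, 39.2]   # x
amount = [75, 71, 92, 88, 108, 120, 118, 126]             # y

n = len(temp)
x_bar = sum(temp) / n
y_bar = sum(amount) / n

# Least squares estimates, kept at full precision
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, amount))
s_xx = sum((x - x_bar) ** 2 for x in temp)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in temp]

sst = sum((y - y_bar) ** 2 for y in amount)               # total variation
sse = sum((y - f) ** 2 for y, f in zip(amount, fitted))   # unexplained variation
ssr = sum((f - y_bar) ** 2 for f in fitted)               # explained variation

print(round(sst, 1))                    # 3137.5
print(abs(sst - (ssr + sse)) < 1e-8)    # True
```

The comparison shows how sensitive SSR is to rounding of the coefficients, since it is a small difference between two large numbers.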

3.2 The Simple Coefficient of Determination


Under sub-session 3.1, we have noted that the variation in Y explained by the independent variable X is the sum of squares regression (SSR). It is more practical to relate this variation to the total variation (SST) in the data. If we find the ratio of SSR to SST, we obtain the proportion of the total variation that is accounted for by the model. This ratio, which is usually expressed as a percentage, is referred to as the simple coefficient of determination. It is denoted by $r^2$ (or $R^2$). Thus,

$$r^2 = \frac{SSR}{SST}$$

For example, for the data in Example 4.5, the percentage of variation in electricity consumption accounted for by the model in terms of temperature is given by

$$r^2 = \frac{2873.759}{3137.5} \times 100 = 91.59\%$$

In Section 3.1, we have expressed SSR in terms of the coefficients of the regression equation. By substituting the expressions for SSR and SST in the expression for $r^2$, and noting that $b_o = \bar{y} - b_1 \bar{x}$, we obtain a more computational expression

$$r^2 = \frac{\left( \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \right)^2}{\left( \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \right) \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)} \qquad (4.12)$$

In terms of only sums and cross-products, we have


$$r^2 = \frac{\left( n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i \right)^2}{\left[ n\sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2 \right] \left[ n\sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right]} \qquad (4.13)$$

Knowledge of the various forms of $r^2$ will be required in subsequent sessions of this unit.
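
The step from the definition $r^2 = SSR/SST$ to Equation (4.12) can be filled in as follows. This is a sketch that assumes the computational form of SSR used in Example 4.5, together with $\sum y_i = n\bar{y}$ and the least squares slope written in terms of means.

```latex
\begin{aligned}
SSR &= b_o \sum y_i + b_1 \sum x_i y_i - n\bar{y}^2 \\
    &= (\bar{y} - b_1\bar{x})\, n\bar{y} + b_1 \sum x_i y_i - n\bar{y}^2 \\
    &= b_1\Big(\sum x_i y_i - n\bar{x}\bar{y}\Big), \\
b_1 &= \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
\;\Longrightarrow\;
SSR = \frac{\big(\sum x_i y_i - n\bar{x}\bar{y}\big)^2}{\sum x_i^2 - n\bar{x}^2}, \\
r^2 &= \frac{SSR}{SST}
     = \frac{\big(\sum x_i y_i - n\bar{x}\bar{y}\big)^2}
            {\big(\sum y_i^2 - n\bar{y}^2\big)\big(\sum x_i^2 - n\bar{x}^2\big)} .
\end{aligned}
```

Multiplying the numerator and the denominator of the last expression by $n^2$ gives the form in terms of sums and cross-products only.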


Example 4.6
The data on the number of hours spent by ten groups of workers on similar jobs in Example 4.1 are reproduced below.

Size of group (x):   5   8   4   6  10   3   5   9  10   7
No. of hours (y):   10   7  13   4   1   7   8   3   2   5

Find the simple coefficient of determination and interpret your result.

Solution
From the table we obtain the following sums:

$\sum x = 67$, $\sum y = 60$, $\sum xy = 335$, $\sum x^2 = 505$ and $\sum y^2 = 486$

Substituting into the expression in Equation (4.13) with n = 10, we have

$$r^2 = \frac{\left( 10(335) - 67(60) \right)^2}{\left[ 10(486) - (60)^2 \right] \left[ 10(505) - (67)^2 \right]} = \frac{448900}{1260(561)} = 0.6351$$
Therefore, 63.5% of variation in number of hours of work is accounted for by the
number of people engaged on the work.

3.3 The Simple Correlation Coefficient


The simple correlation coefficient between two variables X and Y is a measure of the strength of the linear relationship between the variables. It is denoted by r and defined in terms of the coefficient of determination $r^2$ by

$$r = \begin{cases} +\sqrt{r^2}, & b_1 > 0 \\ -\sqrt{r^2}, & b_1 < 0 \end{cases}$$

where $b_1$ is the slope of the least squares line relating Y to X.


For example, for the data in Example 4.6, the correlation coefficient between hours of work and number of people at work is simply −0.7969 (i.e., $-\sqrt{0.6351}$). Note that the value should be negative because we already know that as the number of people at work increases, the time spent on the work decreases. Thus, $b_1 < 0$; we do not actually need to know the value of $b_1$, only its sign, to determine the sign of the correlation coefficient.
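
The values of $r^2$ and r for Example 4.6 can be reproduced from the sums alone. This is a minimal sketch using the form of $r^2$ in terms of sums and cross-products; the sign of r is taken from the numerator $n\sum xy - \sum x \sum y$, which has the same sign as the slope $b_1$. Variable names are illustrative.

```python
import math

# Sums from Example 4.6 (n = 10)
n = 10
sum_x, sum_y = 67, 60
sum_xy, sum_xx, sum_yy = 335, 505, 486

numerator = n * sum_xy - sum_x * sum_y   # same sign as the slope b1
r_squared = numerator ** 2 / (
    (n * sum_yy - sum_y ** 2) * (n * sum_xx - sum_x ** 2)
)

# r is the square root of r_squared, carrying the sign of the slope
r = math.copysign(math.sqrt(r_squared), numerator)

print(round(r_squared, 4))  # 0.6351
print(round(r, 4))          # -0.7969
```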

If we have computed $b_1$ and $r^2$, then the above approach for determining the correlation coefficient is simple. However, we can also obtain r by using the basic definition of the correlation as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right] \left[ \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]}} \qquad (4.14)$$

which simplifies to

$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[ n\sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2 \right] \left[ n\sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right]}} \qquad (4.15)$$

There are a number of observations to make about the correlation coefficient.

1. Since $0 \le r^2 \le 1$, it follows that $-1 \le r \le 1$.
2. If $r \to 1$, then there is a strong tendency for Y and X to move together in a straight line fashion with $b_1 > 0$. Therefore, Y and X are highly related and positively correlated.
3. If $r \to -1$, then there is a strong tendency for Y and X to move together in a straight line fashion with $b_1 < 0$. Therefore, Y and X are highly related and negatively correlated.
4. If r = 0, then little (or zero) linear relationship exists between Y and X, and they are said to be uncorrelated.
5. If X and Y are independent, it implies that r = 0. However, the converse is not necessarily true.
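
The last observation can be illustrated with a small hypothetical dataset in which Y is completely determined by X and yet r = 0. The sketch below computes r from the basic definition; the data and the function name `correlation` are made up for the illustration.

```python
import math

def correlation(x, y):
    """Simple correlation coefficient from the basic definition (4.14)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y is completely determined by x (y = x squared),
# yet there is no linear tendency.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

print(correlation(x, y))  # 0.0
```

Here the relationship is perfectly curved and symmetric about x = 0, so the positive and negative cross-products cancel. A zero correlation therefore rules out a linear relationship only, not every kind of dependence.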


We can portray the relationship described above in charts called scatter diagrams (or
scatter plots). Figure 4.4 shows scatter diagrams for specified values of the correlation
coefficients.
[Scatter plots of Y against X in six panels:]
(a) Perfect positive correlation (r = +1)
(b) Perfect negative correlation (r = –1)
(c) Weak positive correlation (0 < r < +0.5)
(d) Strong positive correlation (0.5 < r < +1)
(e) Weak negative correlation (–0.5 < r < 0)
(f) Strong negative correlation (–1 < r < –0.5)

Figure 4.4: Scatter plots showing various correlation coefficients


Figure 4.4(a), (c), and (d) all have positive correlation coefficients, indicating a direct relationship between X and Y. On the other hand, Figure 4.4(b), (e) and (f) all have negative correlation coefficients, indicating an inverse relationship between X and Y. Figure 4.4(g) has zero correlation, indicating that there is absolutely no linear relationship between X and Y.

[Scatter plot of Y against X: (g) Zero correlation (r = 0)]


The sign (whether positive or negative) only indicates the
direction of change. It has nothing to do with the strength of the relationship. To assess
the strength, we need to consider the absolute values of the coefficients. For example,
coefficients of r = −0.8 and r = 0.8 have equal strength, as their absolute value, |r|, is
equal to 0.8. A coefficient of correlation close to 0 (e.g. − 0.15 or 0.07) indicates that
the relationship is weak. A coefficient of correlation close to 1 (e.g. –0.92 or +0.88)
indicates that the relationship is strong. In the diagrams shown, Figure 4.4(c) and (e)
suggest weaker relationship between X and Y when compared with Figure 4.4(d) and
(f).

The discussion in this session is summarized in Figure 4.5.

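[Scale of the correlation coefficient from –1 to 1, with –0.5, 0 and 0.5 marked. The value –1 is perfect negative correlation, 0 is no correlation and 1 is perfect positive correlation. Moving from –1 to 0, negative correlation runs from strong through moderate to weak; moving from 0 to 1, positive correlation runs from weak through moderate to strong.]

Figure 4.5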


Self-Assessment Questions
Exercise 4.3

1. The data on profit y (in GH¢), of a certain small scale business establishment in the
xth year of its operation in Exercise 4.1, Question 1, is given in the table.

x 1 2 3 4 5
y 125 140 165 195 230
(a) Find the coefficient of determination for the model obtained for profit in terms
of the year.
(b) Deduce the coefficient of correlation between profit and the year of operation.
(c) Comment on your values in (a) and (b).
2. The data on sales price, y, (in thousands of GH¢) of a house and home size, x, (in
tens of square feet) in Exercise 4.1, Question 2, is given in the table.

x 23 11 20 17 15 21 24 13 19 25
y 360 196.2 346.2 273 282 331.8 387 255.6 327 345

(a) Find the coefficient of determination for the model obtained for sales price in
terms of home size.
(b) Deduce the coefficient of correlation between sales price and home size.
(c) Comment on your values in (a) and (b).


SESSION 4: REGRESSION FUNCTIONS


We have so far studied how to obtain a simple linear regression model for determining the value of a dependent variable Y from a given independent variable X. It is also possible to obtain a model for X in terms of Y. In general, the two models will not be the same. In this session we want to examine the relationship between the two functions.

Objectives
By the end of this session, you should be able to:
1. determine regression functions involving two variables and describe the
relationship that exists between them.
2. use the regression coefficients of the two models to obtain the correlation
between the two variables.

Now read on…

4.1 Function of Y on X Reviewed


We have already noted that the regression function of the dependent variable Y in terms of the independent variable X is generally given by

$$y = b_o + b_1 x.$$

We have dropped the cap (hat) on y since it is now understood. For distinction purposes, we will write $b_1$ as $b_{yx}$ to indicate that we are dealing specifically with the regression function of y on x. Recall that $b_{yx}$ is given by

$$b_{yx} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \qquad (4.16)$$

Notice that the numerator is an expression for the covariance between X and Y, denoted by $\sigma_{xy}$, and the denominator is an expression for the variance of X, denoted by $\sigma_x^2$, ignoring their degrees of freedom. So for convenience, we will write

$$b_{yx} = \frac{\sigma_{xy}}{\sigma_x^2} \qquad (4.17)$$

Observe that by multiplying (4.17) by 1 (i.e., $\sigma_y / \sigma_y$), we have

$$b_{yx} = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \times \frac{\sigma_y}{\sigma_x} = \rho\,\frac{\sigma_y}{\sigma_x}$$

So we can obtain the regression coefficient in terms of the population correlation coefficient, $\rho$, and the standard deviations of the two variables.

Example 4.7
The data on electricity consumption and temperature in Example 4.2 is given below.

Amount of Elect (y):   75    71    92    88    108   120   118   126
Temp (x):              19.0  19.0  23.5  30.0  34.7  36.4  37.0  39.2

(a) Find the standard deviations of the two variables.
(b) Find the regression coefficient of temperature in the model for amount spent on
electricity consumption.

Solution
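(a) $\sum x = 238.80$, $\sum y = 798.00$, $\sum x^2 = 7609$, $\sum y^2 = 82738$

$$\text{Var}(X) = \frac{\sum x^2 - \frac{1}{n}\left( \sum x \right)^2}{n - 1} = \frac{7609 - \frac{1}{8}(238.80)^2}{7} = 68.6886$$

$$\text{Var}(Y) = \frac{\sum y^2 - \frac{1}{n}\left( \sum y \right)^2}{n - 1} = \frac{82738 - \frac{1}{8}(798)^2}{7} = 448.2143$$

Therefore, $\sigma_x = \sqrt{68.6886} = 8.29$ and $\sigma_y = \sqrt{448.2143} = 21.17$.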

(b) Recall that r = 0.9570.
Therefore, the regression coefficient of Y on X is

$$b_{yx} = 0.9570 \times \frac{21.17}{8.29} = 0.9570 \times 2.5537 = 2.4439$$

Notice that this value agrees, up to rounding, with the slope 2.4471 obtained earlier in Example 4.2. We note
also from this illustration that the regression coefficient is not the same as the


correlation coefficient.
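
The relation between the slope, the correlation coefficient and the two standard deviations can be verified numerically. The sketch below is illustrative: it computes the sample standard deviations with Python's `statistics.stdev` (divisor n − 1, as in the solution above) and compares the slope from the least squares formula with $r\,\sigma_y / \sigma_x$, using the sample correlation r in place of $\rho$. Variable names are assumptions made for the illustration.

```python
import math
import statistics

temp = [19.0, 19.0, 23.5, 30.0, 34.7, 36.4, 37.0, 39.2]   # x
amount = [75, 71, 92, 88, 108, 120, 118, 126]             # y

x_bar, y_bar = statistics.mean(temp), statistics.mean(amount)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, amount))
s_xx = sum((x - x_bar) ** 2 for x in temp)
s_yy = sum((y - y_bar) ** 2 for y in amount)

sd_x = statistics.stdev(temp)      # sample standard deviation (divisor n - 1)
sd_y = statistics.stdev(amount)
r = s_xy / math.sqrt(s_xx * s_yy)

slope_direct = s_xy / s_xx         # b_yx from the least squares formula
slope_from_r = r * sd_y / sd_x     # b_yx = r * (sd_y / sd_x)

print(round(sd_x, 2), round(sd_y, 2))            # 8.29 21.17
print(math.isclose(slope_direct, slope_from_r))  # True
```

At full precision the two slopes agree exactly (up to floating-point error); small differences in hand calculations come only from rounding intermediate values.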

4.2 Functions of X on Y
Similar to the regression function of Y on X, the regression of X on Y is of the form

$$x = a_o + a_1 y. \qquad (4.18)$$

We will derive expressions for finding the coefficients in (4.18) and also determine other properties of the model.
Now, taking sums of both sides of Equation (4.18) and dividing through by n, we have

$$\sum x = n a_o + a_1 \sum y \;\Rightarrow\; \bar{x} = a_o + a_1 \bar{y}$$

Thus, $(\bar{x}, \bar{y})$ lies on the regression line of X on Y. If we find $a_1$, we can find $a_o$ from the last equation.
Now multiplying (4.18) by y, taking sums and substituting for $a_o$, we obtain

$$\begin{aligned} \sum xy &= a_o \sum y + a_1 \sum y^2 \\ &= (\bar{x} - a_1 \bar{y}) \sum y + a_1 \sum y^2 \\ &= \frac{1}{n} \sum x \sum y - \frac{a_1}{n} \left( \sum y \right)^2 + a_1 \sum y^2 \end{aligned}$$

Making $a_1$ the subject, we obtain

$$a_1 \left[ n\sum y^2 - \left( \sum y \right)^2 \right] = n\sum xy - \sum x \sum y$$

$$a_1 = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - \left( \sum y \right)^2} = \frac{\frac{1}{n}\sum xy - \bar{x}\bar{y}}{\frac{1}{n}\sum y^2 - \bar{y}^2} = \frac{\sigma_{xy}}{\sigma_y^2}$$

Therefore, $a_o = \bar{x} - \dfrac{\sigma_{xy}}{\sigma_y^2}\,\bar{y}$.
Denoting $a_1$ by $b_{xy}$, we see that


σ xy σ xy σ x σ
bxy = = × =ρ x
σ y σ xσ y σ y
2
σy
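Now we make the following observations about the two models Y on X and X on Y.
1. The product of the slopes of the two lines is
$$b_{yx} b_{xy} = \rho \frac{\sigma_y}{\sigma_x} \times \rho \frac{\sigma_x}{\sigma_y} = \rho^2 \qquad (4.19)$$
which is equal to the coefficient of determination.
2. Both lines pass through $(\bar{x}, \bar{y})$.
3. Regression of Y on X is given by
$$\begin{aligned}
y &= b_o + b_1 x \\
&= \bar{y} - \rho \frac{\sigma_y}{\sigma_x} \bar{x} + \rho \frac{\sigma_y}{\sigma_x} x \\
&= \bar{y} + \rho \frac{\sigma_y}{\sigma_x} (x - \bar{x})
\end{aligned}$$
Similarly, regression of X on Y is given as
$$x = \bar{x} + \rho \frac{\sigma_x}{\sigma_y} (y - \bar{y})$$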
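
The observations above can be checked numerically. The sketch below is an illustration only: it uses the data of Example 4.1 (group size and hours taken, reproduced later in this course book), and the variable names are chosen here for convenience.

```python
# Data of Example 4.1: group size (x) and number of hours (y).
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means
# (the common divisor n cancels in every ratio below).
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)

b_yx = s_xy / s_xx                    # slope of the regression of Y on X
b_xy = s_xy / s_yy                    # slope of the regression of X on Y
r_squared = s_xy**2 / (s_xx * s_yy)   # coefficient of determination

# Intercepts, so that each line passes through (mean_x, mean_y).
b0 = mean_y - b_yx * mean_x
a0 = mean_x - b_xy * mean_y

print(round(b_yx * b_xy, 4), round(r_squared, 4))
```

Both printed values agree, which is observation 1; the intercept formulas build in observation 2.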

Example 4.8
Refer to Example 4.7.
(a) Find the regression function of Temperature on Amount of Electricity consumption.
(b) Using regression coefficients, verify the correlation coefficient between the two
variables.
Solution

(a) We already know that for this data the correlation coefficient is $r = 0.9570$, and
$\bar{x} = 29.85$, $\sigma_x = 8.29$, $\bar{y} = 99.75$, $\sigma_y = 21.17$.
Substituting into the expression for the regression of X on Y,
$$x = 29.85 + 0.9570 \left(\frac{8.29}{21.17}\right) (y - 99.75)$$
Simplifying, we obtain
$$x = -7.5317 + 0.3748\, y$$

(b) Now, $b_{xy} = 0.3748$ and $b_{yx} = 2.447$. Taking the product of the two,
$$b_{xy} b_{yx} = 0.3748 \times 2.447 = 0.9171 = r^2$$
Therefore, the correlation coefficient between the two variables is
$r = \sqrt{0.9171} = 0.9577$, which agrees with $r = 0.9570$ up to rounding.

In the next sub-session, we will examine the geometric relationship between the two
regression lines involving two variables. For brevity of presentation henceforth, denote
the regression of Y on X simply as $Y_x$ and that of X on Y as $X_y$.

4.3 Geometric Relationship Between Regression Functions
The general relationship between the graphs of the regression lines $Y_x$ and $X_y$ is shown in
Figure 4.6. Note that this is one of two ways the two lines can relate. The two lines will
always have the same sign of slope. In the case in Figure 4.6, we assume that there is a
positive relationship between the two variables.
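[Figure: the lines $Y_x$ and $X_y$ drawn on x-y axes, intersecting at $P(\bar{x}, \bar{y})$, with angle θ between them and slope angles $\varphi_y$ and $\varphi_x$]

Figure 4.6: Geometry of Relationship between Regressions of Y on X and X on Y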

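In Figure 4.6,
θ is the angle between the two lines,
$\varphi_y$ is the angle of slope of the line $Y_x$, and
$\varphi_x$ is the angle of slope of the line $X_y$.
Then $\theta = \varphi_x - \varphi_y$, with
$$\tan \varphi_y = \rho \frac{\sigma_y}{\sigma_x}, \qquad \tan \varphi_x = \frac{1}{\rho} \frac{\sigma_y}{\sigma_x}$$
Now, from the tangent difference identity of elementary trigonometry, we have
$$\tan \theta = \frac{\tan \varphi_x - \tan \varphi_y}{1 + \tan \varphi_x \tan \varphi_y}
= \frac{\dfrac{1}{\rho} \dfrac{\sigma_y}{\sigma_x} - \rho \dfrac{\sigma_y}{\sigma_x}}{1 + \dfrac{1}{\rho} \dfrac{\sigma_y}{\sigma_x} \cdot \rho \dfrac{\sigma_y}{\sigma_x}}$$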
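Simplifying this gives
$$\tan \theta = \left(\frac{1}{\rho} - \rho\right) \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \qquad (4.20)$$
Now, without loss of generality, suppose that $\sigma_x^2 = 1$, $\sigma_y^2 = 1$. Equation (4.20) then
simplifies to
$$\tan \theta = \frac{1 - \rho^2}{2\rho} \qquad (4.21)$$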

We can use Equation (4.21) to make a number of deductions.


1. If $\theta = 0^\circ$, then the lines coincide. This implies that
$$1 - \rho^2 = 0 \;\Rightarrow\; \rho^2 = 1 \;\Rightarrow\; \rho = \pm 1$$
Thus, if the two lines coincide, then the two variables are perfectly correlated.
2. If $\theta = 90^\circ$, then the lines are perpendicular. This implies that
$$\frac{1 - \rho^2}{2\rho} = \infty \;\Rightarrow\; \rho = 0$$
Thus, if the two lines are perpendicular, then the two variables are uncorrelated.
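
To see how the angle between the two lines shrinks as the correlation strengthens, Equation (4.20) can be evaluated for a few values of ρ. The following is a minimal sketch; the function name `angle_between_lines` and the trial values of ρ are chosen here for illustration only.

```python
import math

def angle_between_lines(rho, sigma_x=1.0, sigma_y=1.0):
    """Angle in degrees between the two regression lines, from Equation (4.20)."""
    if rho == 0:
        return 90.0  # tan(theta) is unbounded, so the lines are perpendicular
    tan_theta = (1 / rho - rho) * sigma_x * sigma_y / (sigma_x**2 + sigma_y**2)
    return math.degrees(math.atan(tan_theta))

for rho in (1.0, 0.9, 0.5, 0.1, 0.0):
    print(rho, round(angle_between_lines(rho), 2))
```

The angle is 0 degrees at ρ = 1 and grows towards 90 degrees as ρ approaches 0, in line with the two deductions above.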


Self-Assessment Questions
Exercise 4.4

Refer to Exercise 4.1.


In each question, find the regression model for determining the independent variable in
terms of the dependent variable.
In each case, verify that b yx bxy = r 2 .


SESSION 5: STANDARD ERROR OF ESTIMATE


A measure for assessing the quality of the regression model, apart from the
coefficient of determination, is the standard error of estimate. The
standard error of estimate measures the spread or scatter of the observed values around
the fitted regression line. It is built on the Sum of Squares Error, which we have already
encountered in Session 4.3. This session seeks to highlight the use of this measure in
describing the performance of the regression model.

Objective

By the end of this session, you should be able to:


• Assess the quality of a regression model by using the standard error of estimate

Now read on…

5.1 The Standard Error of Estimate


It is a rare event to obtain a perfect prediction in practice. We therefore need a measure
that will indicate how precise the prediction of Y is based on X. This measure is given
by the standard error of estimate. The standard error of estimate, S , is obtained by
evaluating

$$S^2 = \frac{\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2}{n - 2} \qquad (4.22)$$

Clearly, if Yi − Yˆi = 0 , then there is no error in prediction and so it makes sense that
S 2 = 0 . In that case, every observed point lies on the fitted line. As the quality of the fit
gets worse, the difference between Yi and Yˆi increases and S 2 gets larger. Therefore,
the standard error of estimate is another useful measure of the quality of the regression
equation.
As observed in Session 4.3, $S^2$ may also be written as
$$S^2 = \frac{SSE}{n - 2}$$
which is also called the Mean Square Error (MSE). Therefore, the standard error of estimate is
given by
$$S = \sqrt{\frac{SSE}{n - 2}}$$
The Error Sum of Squares is given by
$$SSE = \sum_{i=1}^{n} y_i^2 - \left[ b_o \sum_{i=1}^{n} y_i + b_1 \sum_{i=1}^{n} x_i y_i \right]$$
where $b_o$ and $b_1$ are the regression coefficients.

Example 4.9

Refer to the problem on electricity consumption in Example 4.2.


Find
(a) the sum of squares error for predicting electricity consumption in terms of
temperature.
(b) hence, find the standard error of the estimate.

Solution

(a) The regression equation for determining the amount of electricity consumption
given temperature was obtained as
Amt = 26.7 + 2.447 Temp
Thus, $b_o = 26.7$ and $b_1 = 2.447$.
Also, $\sum y_i^2 = 82738$, $\sum y_i = 798$, and $\sum x_i y_i = 24997$.

Substituting values into the expression gives
$$SSE = 82738 - [26.7(798) + 2.447(24997)] = 82738 - 82474.259 = 263.741$$
(b) The standard error of the estimate is given as
$$S = \sqrt{\frac{263.741}{6}} = \sqrt{43.9568} = 6.63$$
In the example above, we cannot tell whether or not the value of the standard error is
low enough to justify that the model is good. If however we have two models for the
same purpose of predicting y, the model with the smaller error is usually preferred.
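
The computation in Example 4.9 can be reproduced as follows. This is a sketch using the rounded coefficients and sums quoted in the example; the variable names are chosen here for illustration.

```python
import math

n = 8
b0, b1 = 26.7, 2.447                   # rounded coefficients used in the example
sum_y2, sum_y, sum_xy = 82738, 798, 24997

sse = sum_y2 - (b0 * sum_y + b1 * sum_xy)   # shortcut formula for SSE
mse = sse / (n - 2)                          # S^2, the mean square error
s = math.sqrt(mse)                           # standard error of estimate

print(round(sse, 3), round(mse, 4), round(s, 2))
```

Because the shortcut formula subtracts two large numbers, its result is sensitive to how the coefficients are rounded, so it is worth carrying as many decimal places as are available.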

5.2 The Standard Error in Terms of Correlation Coefficient


We know that the total variation in Y, given by SST, may be split into the variation
explained by the regression on the independent variable (SSR) and the
variation that cannot be explained (SSE). Thus,
$$SST = SSR + SSE \qquad (4.23)$$
Dividing (4.23) by SST gives
$$\frac{SSR}{SST} + \frac{SSE}{SST} = 1$$
Using the notation $r^2 = \dfrac{SSR}{SST}$, which represents the coefficient of determination, and
denoting $s_{yy} = SST$, we have
$$SSE = s_{yy}(1 - r^2)$$
Therefore, the square of the standard error is given by
$$S^2 = \frac{1}{n - 2}\, s_{yy}(1 - r^2) \qquad (4.24)$$

Equation (4.24) shows that if the correlation coefficient is known and the total variation
in the dependent variable Y is also known, we can find the standard error of estimate of
y.
The result also indicates that if the coefficient of determination is high, then S will be
low.
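
As an illustration of Equation (4.24), the sketch below recomputes the standard error for the electricity data from the total variation and r alone. The sums and r = 0.9570 are the values quoted in the earlier examples; the variable names are chosen here.

```python
import math

n = 8
sum_y, sum_y2 = 798, 82738
r = 0.9570

s_yy = sum_y2 - sum_y**2 / n               # total variation, SST
s_squared = s_yy * (1 - r**2) / (n - 2)    # Equation (4.24)
s = math.sqrt(s_squared)

print(s_yy, round(s, 2))
```

The result is close to the value found in Example 4.9; the two need not agree in every decimal place because r and the regression coefficients are rounded differently.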

Self-Assessment Questions
Exercise 4.5

1. Refer to Exercise 4.1, Question 1 and 2. In each case, find


(a) the coefficient of determination
(b) the standard error of estimate of y.
2. Refer to Exercise 4.1, Question 3 and 4.
(a) Find the standard error of estimate of Demand
(b) Which of the two models would give a better prediction of demand for
Toothgate? Explain.


SESSION 6: EFFECT OF VARIABLE TRANSFORMATION ON SIMPLE LINEAR REGRESSION

It is important to know the nature of the data used for any statistical
analysis as this has implications for clarity of interpretation of results. So
far we have been dealing with the raw data. We can also work with mean-corrected data
or standardized data. By mean-corrected data, the mean of all values on each variable is
subtracted from each of the values on that variable. If the mean-corrected data is then
divided by the standard deviation of the variable, the data is then standardized.
Depending on the nature of the raw data and the objective of the analysis, we may
choose any of the three forms of the data for analysis. In this session, we will consider
how the form of the data affects the regression model.
Objective

By the end of this session, you should be able to:


• relate results of simple linear regression using raw and transformed datasets

Now read on…

6.1 Effect of Standardized Data on Simple Linear Regression


For our purpose in this subsection, let us obtain the regression function for the
standardized data.
Table 4.1 is the standardized data on the amount spent on electricity consumption and
the temperature over a period found in Example 4.2. Recall that the mean and standard
deviation of variable Y (amount of electricity consumption) in that data are,
respectively, $\bar{y} = 99.75$ and $s_y = 21.17$. The mean and standard deviation of variable
X (Temperature) are, respectively, $\bar{x} = 29.85$ and $s_x = 8.29$.


Table 4.1: Standardized Data on Amount of Electricity Bills in Example 4.8

Number    Amount      Temp
1        -1.16905   -1.30923
2        -1.35799   -1.30923
3        -0.36607   -0.76623
4        -0.55500    0.01810
5         0.38968    0.58523
6         0.95649    0.79036
7         0.86203    0.86276
8         1.23990    1.12823

You can check that the data is standardized data by showing that the mean value is 0
and the variance is 1. Using the data in Table 4.1, obtain the simple linear regression of
Amount on Temperature. Your result should be the same as
Amt = 0.958Temp
Notice that in this case, the intercept bo = 0 . Notice also that b1 = 0.958 is the same as
the correlation coefficient between the original variables in Example 4.7. This result is
not peculiar to this data. The result can be generalized to any dataset.
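
The check suggested above can be carried out with a few lines of code. The sketch below fits the least squares line to the standardized values in Table 4.1; the variable names are chosen here for illustration.

```python
# Standardized values from Table 4.1.
amount = [-1.16905, -1.35799, -0.36607, -0.55500, 0.38968, 0.95649, 0.86203, 1.23990]
temp = [-1.30923, -1.30923, -0.76623, 0.01810, 0.58523, 0.79036, 0.86276, 1.12823]
n = len(temp)

mean_t = sum(temp) / n
mean_a = sum(amount) / n
s_tt = sum((t - mean_t) ** 2 for t in temp)
s_ta = sum((t - mean_t) * (a - mean_a) for t, a in zip(temp, amount))

slope = s_ta / s_tt                    # least squares slope of Amount on Temp
intercept = mean_a - slope * mean_t    # least squares intercept
var_t = s_tt / (n - 1)                 # sample variance of the standardized Temp

print(round(mean_t, 4), round(var_t, 4), round(intercept, 4), round(slope, 3))
```

The mean is essentially 0, the variance essentially 1, the intercept essentially 0, and the slope is close to 0.958.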
From the general equation of the regression of Y on X,
$$y = \bar{y} + \rho \frac{\sigma_y}{\sigma_x} (x - \bar{x})$$
Re-arranging in terms of sample estimates, we have
$$\frac{y - \bar{y}}{s_y} = r\, \frac{x - \bar{x}}{s_x}$$
Notice that the expression $\dfrac{y - \bar{y}}{s_y}$ is the standardized value of Y and $\dfrac{x - \bar{x}}{s_x}$ is the
standardized value of X. Denoting each simply by Y and X, we have
$$Y = rX \qquad (4.25)$$
This gives the regression $Y_x$ using standardized values, and the coefficient is the
correlation coefficient. Since $b_{xy} = \rho\, \sigma_x / \sigma_y$, the regression $X_y$ of X on Y in standardized
values is $X = rY$. Drawn on the same axes as $Y_x$, this line has slope $\dfrac{1}{r}$ with respect to the
X-axis; that is, $Y = \dfrac{1}{r} X$.

6.2 Effect of General Linear Transformation on Correlation Coefficient
Standardization is one way of transforming data on a given variable. In the illustration
above, you would realize that working with standardized values could be quite tedious
because of the nature of the mean and standard deviation values of the variable. To
simplify working with data that generally involve large values, we can transform the
data by selecting a working mean and new units. Suppose that for data on x and y, we take a
new working mean at (a, b). Suppose further that the mean-corrected values of x and y
could be reduced by dividing by c and d, respectively. The general transformation
considered here, which gives a new set of data X and Y, is then of the form
$$x = a + cX \quad \text{and} \quad y = b + dY$$
The mean values are obtained as
$$\bar{x} = a + c\bar{X} \quad \text{and} \quad \bar{y} = b + d\bar{Y}$$
The corresponding variances are
$$s_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{c^2}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2. \quad \text{Thus, } s_x^2 = c^2 s_X^2.$$
$$s_y^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{d^2}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2. \quad \text{Thus, } s_y^2 = d^2 s_Y^2.$$
Similarly, the co-variation between x and y is given as
$$s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{cd}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = cd\, s_{XY}$$
The correlation coefficient between x and y in terms of the transformed data X and Y is
therefore given, for positive c and d, as follows:
$$r = \frac{s_{xy}}{s_x s_y} = \frac{cd\, s_{XY}}{cd\, s_X s_Y} = \frac{s_{XY}}{s_X s_Y}$$


The result shows that the value of the correlation coefficient is not affected by the use of
the transformed data.
The slope of the regression line using the transformation is obtained similarly as
follows:
$$b_{yx} = \frac{s_{xy}}{s_x^2} = \frac{cd\, s_{XY}}{c^2 s_X^2} = \frac{d}{c}\, \frac{s_{XY}}{s_X^2}$$

Example 4.10

The data in Exercise 4.1, Question 1 is on the profit y, (in GH¢), of a certain small scale
business establishment in the xth year of its operation. It is given in the table shown.

x 1 2 3 4 5
y 125 140 165 195 230
By using a suitable transformation,
(a) determine the regression model for profit in terms of the year of operation.
(b) find the correlation between x and y.

Solution
(a) Define the linear transformation as follows:
$$x = 3 + u \quad \text{and} \quad y = 165 + 5v$$
That is, $u = x - 3$ and $v = \dfrac{y - 165}{5}$.
Using the transformation, we have the data as follows:

 u     v    uv
−2    −8    16
−1    −5     5
 0     0     0
 1     6     6
 2    13    26

$\sum u = 0$, $\sum v = 6$, $\sum uv = 53$, $\sum u^2 = 10$, and $\sum v^2 = 294$

$$b_{vu} = \frac{s_{uv}}{s_u^2} = \frac{n \sum uv - \sum u \sum v}{n \sum u^2 - \left(\sum u\right)^2} = \frac{5(53) - 0(6)}{5(10) - (0)^2} = 5.3$$
Since $\bar{u} = 0$ and $\bar{v} = 6/5 = 1.2$, the regression line of v on u is
$$v = 1.2 + 5.3(u - 0)$$
Changing back to the original variables, we obtain
$$\frac{y - 165}{5} = 1.2 + 5.3(x - 3)$$
$$y - 165 = 6 + 26.5x - 79.5$$
$$y = 91.5 + 26.5x$$

(b) The correlation between x and y is the same as that between u and v. Now,
$b_{vu} = r \dfrac{s_v}{s_u}$. So $r = \dfrac{s_u}{s_v}\, b_{vu}$.
$$r^2 = b_{vu}^2 \times \frac{\sum u^2 - n\bar{u}^2}{\sum v^2 - n\bar{v}^2} = 28.09 \times \frac{10}{294 - 5(1.2)^2} = \frac{28.09 \times 10}{286.8} = 0.9794$$
Therefore, $r = 0.9897$.
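
The results of Example 4.10 can be confirmed by fitting the line to both the original and the transformed data. The sketch below is an illustration only; the helper `slope_and_r` is a name chosen here, not part of the course text.

```python
import math

def slope_and_r(xs, ys):
    """Least squares slope of ys on xs, and the correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [125, 140, 165, 195, 230]

# Transformation x = a + c*u, y = b + d*v used in the example.
a, c = 3, 1
b, d = 165, 5
u = [(xi - a) / c for xi in x]
v = [(yi - b) / d for yi in y]

b_yx, r_xy = slope_and_r(x, y)
b_vu, r_uv = slope_and_r(u, v)

print(b_vu, b_yx, round(r_xy, 4), round(r_uv, 4))
```

The correlation is unchanged by the transformation, while the slope on the original scale is d/c times the slope on the transformed scale.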


Self-Assessment Questions
Exercise 4.6

1. A linear transformation of the variables x and y is given by


x = a + cX and y = b + dY
Obtain an expression for the slope of the regression line of x on y, bxy , in terms of
X and Y.


UNIT 5: INFERENCE ABOUT SIMPLE LINEAR REGRESSION

Unit Outline
Session 1: Test of Significance involving Simple Linear Regression Model
Session 2: Assumptions and Properties of Least Squares Regression Line
Session 3: Confidence Interval
Session 4: The Mean and Variance of the Simple Linear Regression Estimates
Session 5: The Use of Analysis of Variance in Simple Linear Regression
Session 6: Using Matrix Algebra in Simple Linear Regression

This unit examines the various inferences that could be made when we have derived a
simple linear model for data on two related variables. Inference could be made on the
significance of the parameter estimates of the model and the model itself. We
could also be interested in the confidence intervals of estimates based on the
model. We will also apply the concept of analysis of variance in regression
and introduce ourselves to regression analysis by matrix approach.

Objectives
By the end of the unit you should be able to:
1. conduct a test of significance of the simple linear regression
model;
2. conduct a test of significance of the independent variable in a simple
linear regression model;
3. state and verify whether or not the regression assumptions are satisfied
for a given model;
4. state and verify that a developed simple linear regression model has
the desired properties;
5. calculate the 95 percent confidence interval for the mean value of the
dependent variable in a simple linear regression model;
6. calculate the 95 percent confidence interval for the slope of the
regression line;


7. derive the mean of the simple linear regression parameters;


8. derive the variance of the simple linear regression parameters;
9. present the analysis of variance of a simple regression analysis;
10. obtain the three types of variation using matrix approach;
11. derive the least squares simple linear regression using matrix
approach;
12. obtain the three types of variation using matrix approach.


SESSION 1: TEST OF SIGNIFICANCE INVOLVING SIMPLE LINEAR REGRESSION MODEL

A high value of the coefficient of determination does not necessarily mean
that the regression model is good enough. We need a statistical test of the
significance of the model to confirm whether or not a model is good. This may be
approached in two ways: we can test the significance of the variation explained by the
entire model; or we can test the significance of the independent variable in the model. In
this session, we will discuss these two ways of assessing the relevance of an obtained
model involving two variables.
Objectives
By the end of this session, you should be able to:
1. conduct a test of significance of the simple linear regression model;
2. conduct a test of significance of the independent variable in a simple linear
regression model.

Now read on…

1.1 Test of Significance of Simple Linear Regression Model


In order to determine the significance of the relationship between y and x, we will test
the significance of the contribution of the entire model in accounting for the variation in
the dependent variable Y in terms of the independent variable, X. This means that we
have to assess the amount of the variation in Y that is accounted for by the model (SSR)
in relation to the variation unexplained (SSE). It is also important to take into
consideration the size of the data and the number of variables included in the model as
these affect the size of these variations. Thus, the statistic for the test is given by

$$F = \frac{SSR / 1}{SSE / (n - 2)} = \frac{MSR}{MSE} \qquad (5.1)$$
In Equation (5.1), SSR is divided by 1 since there is only one variable in the model. In
the denominator, SSE is divided by (n − 2) since we have used the data to make two
estimates, bo and b1 . For a good model, we expect explained variation to be large and
unexplained variation to be small. Thus, a large value of F is preferred. The statistic F
has the F-distribution with degrees of freedom 1 and (n − 2) . We therefore reject the
null hypothesis for a large value of F.
Another approach is to examine the significance of X in predicting Y. Thus, the test is
based on the hypothesis


$$H_o: \beta_1 = 0$$
The hypothesis says that there is no change in the mean value of y associated with an
increase in x. That is, the variable X is not important in predicting Y. The alternative
hypothesis is that
$$H_a: \beta_1 \neq 0$$
If we reject $H_o$, we will conclude that X is significantly related to Y and is therefore
relevant in the model. To conduct the test, we have already computed the least squares
estimate $b_1$ of $\beta_1$ from a sample of n observations of the dependent variable Y. Note
that for each value of X, there is an infinite number of Y values that could be observed. So
there is potentially an infinite number of samples that could be obtained, and hence an
infinite population of potential values of estimates of the regression coefficient. Based
on the regression assumptions (Session 2), the population of all values of $b_1$ is normally
distributed with mean $\beta_1$ and standard deviation given by
$$\sigma_{b_1} = s \sqrt{c_{11}}$$
where
$$c_{11} = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{1}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \qquad (5.2)$$
and s is the square root of the mean square error (MSE) associated with the model. It is
given by
$$s^2 = \frac{SSE}{n - 2}.$$
This estimated standard deviation is the standard error of $b_1$, written $se(b_1)$.
The test statistic for the test of $H_o$ is given by
$$t = \frac{b_1 - \beta_1}{se(b_1)}$$
where $\beta_1$ takes the value specified by $H_o$. Under $H_o$ the test statistic becomes
$$t = \frac{b_1}{se(b_1)}$$
which has the t-distribution with $n - 2$ degrees of freedom. A large absolute value of the
statistic compared to $t_{\alpha/2}$ shows a departure of $b_1$ from the hypothesized value. We will
therefore reject $H_o$ in this case.
Example 5.1
Refer to the data in Example 4.2.


(a) Determine whether or not temperature is significant in the model for amount of
electricity consumption.
(b) Determine whether or not the model is significant. Take α = 5% .

Solution
(a) For that data, the model was found to be Amt = 26.7 + 2.447 Temp.
We will determine whether or not Temperature is useful in this model.
The regression sum of squares is SSR = 2873.759 and the total sum of squares is
SST = 3137.5. Therefore, SSE = 3137.5 − 2873.759 = 263.741 and
$$s^2 = \frac{263.741}{6} = 43.9568.$$
The null hypothesis for the test of significance is
$H_o: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$
The test statistic is given by
$$t = \frac{b_1}{se(b_1)}$$
Now,
$$c_{11} = \frac{1}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{1}{7609 - 8(29.85)^2} = 0.00208$$
Thus,
$$t = \frac{b_1}{se(b_1)} = \frac{2.447}{6.63 \sqrt{0.00208}} = \frac{2.447}{0.3023} = 8.0926.$$
From the t-distribution table, we find $t_{0.025,\,6} = 2.447$.
We observe that the value of the test statistic is greater than the table value. We
therefore reject $H_o: \beta_1 = 0$ and conclude that $b_1$ is significantly different from 0.
Therefore, Temperature is significant in the model.
(b) We test the significance of the contribution of the entire model in accounting for
the variation in Y. The test statistic is given as
$$F = \frac{2873.759}{43.9568} = 65.3769$$
From the F table, $F_{0.05,\,1,\,6} = 5.99$.
Since the value of the test statistic is much greater than the table value, we reject
$H_o$ and conclude that the model is significant.


You will notice that the results of assessing the significance of the model by examining
the explained variation and that of examining the significance of X appear to be the
same.
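
The agreement between the two tests is not a coincidence: in simple linear regression the F statistic equals the square of the t statistic for the slope when both are computed from the same unrounded quantities. The sketch below recomputes both statistics from the summary values quoted in Example 5.1; the critical values are the table values quoted there, and the variable names are chosen here for illustration.

```python
import math

n = 8
b1 = 2.447                      # rounded slope quoted in the example
ssr, sst = 2873.759, 3137.5
sum_x2, mean_x = 7609, 29.85

sse = sst - ssr
mse = sse / (n - 2)                      # s^2
c11 = 1 / (sum_x2 - n * mean_x**2)       # Equation (5.2)
se_b1 = math.sqrt(mse * c11)             # standard error of b1

t_stat = b1 / se_b1
f_stat = (ssr / 1) / mse                 # Equation (5.1)

t_crit, f_crit = 2.447, 5.99             # table values quoted in the example
print(round(t_stat, 2), round(t_stat**2, 1), round(f_stat, 1))
print(t_stat > t_crit, f_stat > f_crit)
```

Here the square of t and the value of F differ slightly only because the slope entering the t statistic has been rounded to three decimal places.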

1.2 Test of Significance of the Correlation Coefficient


In Session 3 of Unit 4, we have examined the simple correlation coefficient between
two observed values of X and Y that make up the sample. A similar linear correlation
coefficient can be defined for the population of all possible combinations of values of X
and Y. This coefficient is called the population correlation coefficient, denoted ρ . We
are interested in examining the hypothesis
$$H_o: \rho = 0$$
The hypothesis says that there is no linear relationship between X and Y. The alternative
is that $H_a: \rho \neq 0$. The test statistic is given in terms of the computed r by
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \qquad (5.3)$$
which has the t-distribution with $n - 2$ degrees of freedom. A large absolute value of t indicates
that the correlation is significant.

Example 5.2
In the sample of n = 10 pairs of values drawn on the number of hours taken to complete
a job by groups of workers of various sizes, the observed correlation coefficient
between the variables is r = 0.7969.
(a) Assess the significance of this value.
(b) Find the minimum value of r for a sample of this size that is significant at the 5%
level.
Solution
(a) Our hypothesis is $H_o: \rho = 0$ against $H_a: \rho \neq 0$. Substituting values into
$t = \dfrac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$, we have
$$t = \frac{0.7969 \sqrt{10 - 2}}{\sqrt{1 - 0.7969^2}} = 3.7313$$
From the t table, $t_{0.025,\,8} = 2.306$. Thus, the statistic is greater. So we reject $H_o$ and
conclude that the correlation coefficient is significant.

(b) The required minimum value of r is the one for which the test statistic equals the
table value $t_{0.025,\,8} = 2.306$. Squaring Equation (5.3) with $n - 2 = 8$ gives
$$r^2 \times \frac{8}{1 - r^2} = (2.306)^2 = 5.3176$$
so that $r^2 = \dfrac{5.3176}{13.3176} = 0.3993$, which gives $r = 0.6319$.
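
Both parts of Example 5.2 follow from Equation (5.3), and solving $t^2 = r^2(n - 2)/(1 - r^2)$ for r at the table value gives the smallest significant correlation. The sketch below is an illustration; the function names are chosen here, and the critical value 2.306 is the table value quoted in part (a).

```python
import math

def t_for_r(r, n):
    """Test statistic of Equation (5.3) for a sample correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

def min_significant_r(t_crit, n):
    """Smallest positive r whose test statistic reaches t_crit."""
    return t_crit / math.sqrt(t_crit**2 + n - 2)

n = 10
t_crit = 2.306                  # t table value for 0.025 and 8 degrees of freedom
t_stat = t_for_r(0.7969, n)
r_min = min_significant_r(t_crit, n)

print(round(t_stat, 2), round(r_min, 4))
```

Substituting the minimum r back into the test statistic returns the table value, which is a quick consistency check.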

Self-Assessment Questions
Exercise 5.1

Refer to Exercise 4.3. In each case,


(a) test the significance of the model obtained in Exercise 4.1 using the F test. Verify
your test by conducting the test of significance of β1 .
(b) test the significance of the correlation coefficient.
Does your test of significance of the correlation coefficient agree with the test of
significance of β1 ? Explain.


SESSION 2: ASSUMPTIONS AND PROPERTIES OF LEAST SQUARES REGRESSION LINE

As with all statistical techniques, the method of least squares
regression line is governed by a set of assumptions under which results
hold valid. Under such assumptions, the fitted line would exhibit certain characteristics.
We will examine these assumptions underlying the least squares regression model and
its characteristics. It should be noted that the discussion on this topic applies not only to
simple linear regression, but also to general linear regression such as those in Unit 6.

Objectives
By the end of this session, you should be able to:
1. State and verify whether or not the regression assumptions are satisfied
for a given model;
2. State and verify that a developed simple linear regression model has the desired
properties.

Now read on…

2.1 Regression Assumptions


Consider the data shown on the number of hours spent by ten groups of workers on
similar jobs in Example 4.1.

Size of group (x)    5    8    4    6   10    3    5    9   10    7
No. of Hrs (y)      10    7   13    4    1    7    8    3    2    5

You will notice that at any value of the independent variable, X, the dependent variable,
Y, could assume several values. For example, when X = 5, Y takes the two values 10
and 8. Again, when X = 10, Y takes the values 1 and 2. This suggests that the value of Y
is influenced not only by the value of X but also by other factors. There is
therefore a population of error term values that could potentially occur, describing the
different potential effects on Y of all factors other than X. These explain the variation in
Y values observed when X = x.

The scatter plot of the data with fitted regression line is shown in Figure 5.1.


[Figure: scatter plot of Hrs (vertical axis) against GpSize (horizontal axis) with the fitted line]

Figure 5.1: Plot of data on number of hours spent by ten groups of workers on
similar jobs with fitted line

It is known that the equation of the regression line is
Hrs = 14.00 − 1.1943 GpSize.
The regression assumptions are based on a study of the errors that result from the
use of this model. The error, denoted e, is simply the difference between the observed
value, y, and the fitted value, denoted $\hat{y}$. That is, for the ith observation, $e_i = y_i - \hat{y}_i$.
These errors are also referred to as residuals. In our illustration, for the first observation,
i = 1, X = 5 and the estimated value is $\hat{y}_1 = 14.00 - 1.194(5) = 8.0303$ (computed with
unrounded coefficients). Therefore, the associated residual is $e_1 = 10 - 8.0303 = 1.9697$.
The assumptions are stated as follows:
1. At any given value of X, the population of potential error terms ($\varepsilon$) has mean
equal to zero. That is, $E(\varepsilon_i) = 0$.
2. The variance of the population of error terms is a constant. That is,
$\mathrm{Var}(\varepsilon_i) = \sigma^2, \ \forall i$. It follows that the variance of the response $y_i$ is
$\mathrm{Var}(y_i) = \sigma^2, \ \forall i$.
3. The population of error terms has a normal distribution.
4. Error terms $\varepsilon_i, \varepsilon_j$ ($i \neq j$) are independent. That is, $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$. It follows that
$\mathrm{cov}(y_i, y_j) = 0$.


The four stated assumptions can be summarized into one point: the observations $y_i$
are independently and normally distributed with mean $E(y_i) = \beta_o + \beta_1 x_i$ and
constant variance $\sigma^2$. That is, $y_i \sim N(\beta_o + \beta_1 x_i, \sigma^2)$.

2.2 Properties of Fitted Regression Line


On the basis of the underlying assumptions, we expect the regression model to exhibit
some features. Particularly, the residuals must have some interesting features if the
assumptions should hold valid. Note that, the general fitted model obtained from the
normal equations is given as yˆ = bo + b1 x . The general expression for the error in using
the model is therefore given by y − (bo + b1 x) . We will derive these properties from the
general normal equations, and then illustrate them using the illustrative dataset in
Section 2.1. Before we outline the properties, let us determine the residuals in the use of
the model obtained for the data in Figure 5.1 as given in Table 5.1.

Table 5.1: Fitted and Residual Values Using Estimated Model in Example 4.1

SN    Size of Group (X)    No. of hours (Y)    Fits (Ŷ)    Residual (e)
1            5                   10             8.0303       1.96970
2            8                    7             4.4474       2.55258
3            4                   13             9.2246       3.77540
4            6                    4             6.8360      -2.83601
5           10                    1             2.0588      -1.05882
6            3                    7            10.4189      -3.41889
7            5                    8             8.0303      -0.03030
8            9                    3             3.2531      -0.25312
9           10                    2             2.0588      -0.05882
10           7                    5             5.6417      -0.64171

1. The sum of the residuals is zero. That is, $\sum_{i=1}^{n} e_i = 0$.
$$\begin{aligned}
\sum_{i=1}^{n} e_i &= \sum_{i=1}^{n} (y_i - b_o - b_1 x_i) \\
&= \sum_{i=1}^{n} y_i - n b_o - b_1 \sum_{i=1}^{n} x_i \\
&= 0
\end{aligned}$$
The last line follows from the first of the normal equations. For example, in
Table 5.1, it can be verified that $\sum_{i=1}^{10} e_i = 0$.
2. The sum of squared residuals $\sum_{i=1}^{n} e_i^2$ is a minimum. This is the condition under
which the normal equations were derived. It is therefore a basic requirement to
be satisfied for least squares estimates of the regression parameters.
3. The sum of the observed values $Y_i$ equals the sum of the fitted values. That is,
$\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \hat{Y}_i$. This is implicit in the first normal equation. In that equation we
note that
$$\begin{aligned}
\sum_{i=1}^{n} y_i &= n b_o + b_1 \sum_{i=1}^{n} x_i \\
&= \sum_{i=1}^{n} b_o + \sum_{i=1}^{n} b_1 x_i \\
&= \sum_{i=1}^{n} (b_o + b_1 x_i) \\
&= \sum_{i=1}^{n} \hat{y}_i
\end{aligned}$$
The result also implies that the mean of the fitted values is the same as the mean
of the observed values. In Table 5.1, verify that $\bar{Y} = \bar{\hat{Y}}$ and is equal to 6.00.

4. The sum of weighted residuals is zero when the residual in the ith trial is weighted
by the level of the independent variable in the ith trial. That is, $\sum_{i=1}^{n} x_i e_i = 0$. We
observe that
$$\begin{aligned}
\sum_{i=1}^{n} x_i e_i &= \sum_{i=1}^{n} x_i (y_i - b_o - b_1 x_i) \\
&= \sum_{i=1}^{n} x_i y_i - b_o \sum_{i=1}^{n} x_i - b_1 \sum_{i=1}^{n} x_i^2 \\
&= 0
\end{aligned}$$
The last line follows from the second of the normal equations. Verify that this is
true from Table 5.1.
5. The sum of weighted residuals is zero when the residual in the ith trial is weighted
by the fitted value of the response variable for the ith trial. That is, $\sum_{i=1}^{n} \hat{y}_i e_i = 0$.
Verify this from Table 5.1.
6. The regression line always passes through the point $(\bar{x}, \bar{y})$. Dividing the
first of the normal equations, $\sum_{i=1}^{n} y_i = n b_o + b_1 \sum_{i=1}^{n} x_i$, through by n,
we have $\bar{y} = b_o + b_1 \bar{x}$. This shows that the line $\hat{y} = b_o + b_1 x$ passes through
$(\bar{x}, \bar{y})$. In Table 5.1, find $(\bar{x}, \bar{y})$ and verify that it lies on the regression line.
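
The verifications requested in these properties can be done in one pass. The sketch below refits the line to the data of Example 4.1 and checks properties 1, 3, 4, 5 and 6; the variable names are chosen here for illustration.

```python
# Data of Example 4.1 (see Table 5.1).
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)

b1 = s_xy / s_xx                 # least squares slope
b0 = mean_y - b1 * mean_x        # least squares intercept

fits = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fits)]

sum_e = sum(resid)                                     # property 1
sum_xe = sum(xi * ei for xi, ei in zip(x, resid))      # property 4
sum_fe = sum(fi * ei for fi, ei in zip(fits, resid))   # property 5

print(round(b0, 4), round(b1, 4), round(resid[0], 4))
print(round(sum_e, 10), round(sum_xe, 10), round(sum_fe, 10))
print(sum(fits) / n, mean_y)                           # property 3
```

Apart from floating-point rounding, the three sums are zero and the mean of the fitted values equals the mean of the observed values.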

Self-Assessment Questions
Exercise 5.2

1. Refer to the problem in Exercise 4.1 Question 1.


(a) Verify that the properties of a regression model are met.
(b) Check whether or not the regression assumptions are met.


SESSION 3: CONFIDENCE INTERVAL


We know that a point on the regression line corresponding to a particular
value $x_o$ of the independent variable X is given by $\hat{y} = b_o + b_1 x_o$. Only by
good fortune will $\hat{y}$ be exactly the same as the mean value of y (or the value
of y) when $X = x_o$. As a result, it is proper to obtain a confidence interval within which
the mean value of y can lie. In this session we discuss how to calculate this interval.

Objectives
By the end of this session, you should be able to:
1. calculate the 95 percent confidence interval for the mean value of the
dependent variable in a simple linear regression model.
2. calculate the 95 percent confidence interval for the slope of the regression
line.

Now read on…

3.1 Confidence Interval for Mean Value of y


Suppose that the regression assumptions hold. A 100(1 − α) percent confidence interval
for the mean value of y when $X = x_o$ is given by

\[ [\hat{y} \pm t_{\alpha/2}\,(\text{standard error of the estimate of } \hat{y})] = [\hat{y} \pm t_{\alpha/2}\, s \sqrt{\text{Distance value}}\,] = [\hat{y} \pm t_{\alpha/2}\, s \sqrt{\mathbf{x}_o'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_o}\,] \qquad (5.4) \]

It will be shown in Session 6 that the inverse matrix is

\[ (\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix} \]

The vector $\mathbf{x}_o$ is simply the observed vector $\mathbf{x}_o = (1 \;\; x_o)'$. The product
representing the distance of $X = x_o$ from the previously observed mean value $\bar{x}$ of X is
given by

\[ \mathbf{x}_o'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_o = \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \qquad (5.5) \]

The value $t_{\alpha/2}$ is based on n − 2 degrees of freedom.

Example 5.3

Refer to Example 4.1. The data on the number of hours spent by ten groups of workers
on similar jobs are reproduced below.

Size of group (x)    5    8    4    6   10    3    5    9   10    7
No. of hours (y)    10    7   13    4    1    7    8    3    2    5

Find the 95% confidence interval for the mean number of hours spent by a group of 10
workers.

Solution
The model for y has already been obtained as $\hat{y} = 14.0 - 1.19x$, with

\[ \sum x = 67, \quad \sum y = 60, \quad \sum xy = 335, \quad \sum x^2 = 505, \quad \sum y^2 = 486 \]

We also know that $r^2 = 0.6351$.

The total variation SST is given as
\[ s_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2 = 486 - \frac{1}{10}(60)^2 = 486 - 360 = 126 \]
Using the relation
\[ S^2 = \frac{1}{n-2}\, s_{yy}(1 - r^2) \]
the mean square error of the estimate of y is
\[ S^2 = \frac{1}{8}(126)(1 - 0.6351) = 5.7472 \]
Therefore, the standard error is $S = \sqrt{5.7472} = 2.3973$.
Now, at X = 10,
\[ \mathbf{x}_o'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_o = \frac{1}{10} + \frac{(10 - 6.7)^2}{505 - 10(6.7)^2} = \frac{1}{10} + \frac{10.89}{56.1} = 0.2941 \]
and $\hat{y} = 14.0 - 1.19(10) = 2.1$.
Since S = 2.3973 and $t_{0.025,\,8} = 2.306$, it follows that the desired 95 percent confidence
interval is
\[ [\hat{y} \pm t_{\alpha/2}\, S \sqrt{\text{Distance value}}\,] = [2.1 \pm 2.306(2.3973)\sqrt{0.2941}\,] = [2.1 \pm 2.9980] = [-0.898,\; 5.098] \]
This means that, in repeated sampling, about 95 percent of the intervals calculated from
the given formula would contain the true mean of y at X = 10; the interval
[−0.898, 5.098] is one such interval.
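
The calculation in Example 5.3 can be reproduced with a short script. This is a minimal sketch, assuming the data of Example 4.1 and the table value t = 2.306 quoted in the text; the function name `mean_ci` is hypothetical.

```python
import math

# Data assumed from Example 4.1: group size (x) and hours spent (y).
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]
n = len(x)
T_CRIT = 2.306  # t_{0.025, 8}, the table value quoted in the text

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate: S = sqrt(SSE / (n - 2)).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))


def mean_ci(x0):
    """Confidence interval for the mean of y at X = x0, using Equation (5.5)."""
    y_hat = b0 + b1 * x0
    distance = 1 / n + (x0 - x_bar) ** 2 / s_xx
    half_width = T_CRIT * s * math.sqrt(distance)
    return y_hat - half_width, y_hat + half_width


lower, upper = mean_ci(10)
print(round(lower, 3), round(upper, 3))
```

Because the script keeps the unrounded coefficients instead of 14.0 and −1.19, its limits differ slightly from the hand calculation above.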
Figure 5.2 is a MINITAB output of the 95 percent confidence interval for the problem
in Example 5.3.

[Figure: MINITAB plot of Hrs against GpSize showing the fitted regression line and the 95% CI band]

Figure 5.2: Geometric relationship between a 95-percent confidence interval and the
actual regression line


Note that the shorter the confidence interval, the better the prediction based on the
model. Geometrically, the closer the interval band to the regression line the better the
prediction. From Figure 5.2, what can you say about estimates based on the regression
line?

3.2 Confidence Interval of the Regression slope


Often, it is also important to find a 95 percent confidence interval for the slope of the
regression model. The standard error of the estimate of the slope, $b_1$, in the equation
$\hat{y} = b_o + b_1 x_o$ was given in Session 1 as
\[ se(b_1) = S\sqrt{c_{11}} \]
where $c_{11}$ is defined in Equation (5.1) as
\[ c_{11} = \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{1}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \]

The 95 percent confidence interval can therefore be constructed as

\[ [b_1 \pm t_{\alpha/2}\, S\sqrt{c_{11}}\,] \]

where $t_{\alpha/2}$ and S are as defined earlier.

Example 5.4
Refer to Example 5.3. Calculate the 95 percent confidence interval for the slope of the
regression line. Comment on your result.

Solution
The model is obtained as $\hat{y} = 14.0 - 1.19x$.
\[ s_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2 = 505 - \frac{1}{10}(67)^2 = 505 - 448.9 = 56.1 \]
Thus, $c_{11} = 1/56.1 = 0.0178$. Since S = 2.3973 and $t_{0.025,\,8} = 2.306$, it follows that the desired 95
percent confidence interval is
\[ [b_1 \pm t_{\alpha/2}\, S\sqrt{c_{11}}\,] = [-1.19 \pm 2.306(2.3973)\sqrt{0.0178}\,] = [-1.19 \pm 0.7381] = [-1.9281,\; -0.4519] \]

The interval does not contain zero, the value of $\beta_1$ hypothesized under $H_o: \beta_1 = 0$. This implies that the
slope is significantly different from zero at the 5 percent level. Therefore, the independent variable X is
significant in the model.
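
The arithmetic of Example 5.4 can be checked as follows. This sketch simply plugs in the rounded quantities quoted in the text (b1 = −1.19, S = 2.3973, t = 2.306); the variable names are hypothetical.

```python
import math

# Summary quantities quoted in Examples 5.3 and 5.4.
n = 10
sum_x = 67
sum_x2 = 505
b1 = -1.19        # rounded slope used in the text
s = 2.3973        # standard error of the estimate
t_crit = 2.306    # t_{0.025, 8} from the table

s_xx = sum_x2 - sum_x ** 2 / n      # 505 - 448.9 = 56.1
c11 = 1 / s_xx
half_width = t_crit * s * math.sqrt(c11)

lower = b1 - half_width
upper = b1 + half_width
contains_zero = lower <= 0 <= upper  # False means the slope is significant

print(round(lower, 4), round(upper, 4), contains_zero)
```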

Self-Assessment Questions
Exercise 5.3

Refer to the data in Exercise 4.1, for Questions 3 and 4. In each relevant case, calculate
a 95 percent confidence interval
(a) for the slope of the regression line for demand for Toothgate.
(b) for demand when the price is GH¢4.40.
(c) for demand when the price difference is GH¢0.50.
In each case, comment on your result.


SESSION 4: THE MEAN AND VARIANCE OF THE SIMPLE LINEAR REGRESSION ESTIMATES

There are a number of results that were used in previous sessions but
were not established. These results relate to general properties of the
least squares estimators which enable us to construct confidence intervals and perform
tests of hypotheses on the regression parameters. In this session, our aim is to establish,
in particular, the mean and variance of the estimators of the regression parameters.

Objectives

By the end of this session, you should be able to:


1. Derive the mean of the simple linear regression parameters
2. Derive the variance of the simple linear regression parameters.

Now read on…

4.1 The Mean and Variance of the Slope Parameter


Let $Y_1, Y_2, \ldots, Y_n$ be n observations on a dependent variable Y. In order to determine
the mean and variance of the parameter estimates, we need to establish that $b_1$
is a linear combination of $Y_1, Y_2, \ldots, Y_n$. Recall that the expression for $b_1$ is
given as
\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (5.6) \]

By expanding the numerator, we have

\[ \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} (x_i - \bar{x})Y_i - \bar{Y}\sum_{i=1}^{n} (x_i - \bar{x}) \]
But $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. Therefore,
\[ \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} (x_i - \bar{x})Y_i \]
So Equation (5.6) may be written simply as

\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})Y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (5.7) \]

In a more convenient form, let us further write Equation (5.7) as

\[ b_1 = \sum_{i=1}^{n} k_i Y_i \qquad (5.8) \]
where
\[ k_i = \frac{x_i - \bar{x}}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \]
are fixed quantities since the $x_i$ are fixed.

In Equation (5.8), it is clear that $b_1$ is a linear combination of the observations
$Y_1, Y_2, \ldots, Y_n$. Notice that since $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, it follows that $\sum_{i=1}^{n} k_i = 0$.

Now, taking the expectation of $b_1$ in (5.8), we obtain

\[ E(b_1) = E\left(\sum_{i=1}^{n} k_i Y_i\right) = \sum_{i=1}^{n} k_i E(Y_i) = \sum_{i=1}^{n} k_i(\beta_o + \beta_1 x_i) = \beta_o \sum_{i=1}^{n} k_i + \beta_1 \sum_{i=1}^{n} k_i x_i \]

Thus, since $\sum_{i=1}^{n} k_i = 0$,
\[ E(b_1) = \beta_1 \sum_{i=1}^{n} k_i x_i \]
By the argument preceding Equation (5.7), we have

\[ \sum_{i=1}^{n} k_i x_i = \frac{\sum_{i=1}^{n} (x_i - \bar{x})x_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 1 \]

Therefore, we have shown that

\[ E(b_1) = \beta_1 \]
That is, the mean of the sample estimate $b_1$ is the same as the population value. This
result means that $b_1$ is an unbiased estimator of $\beta_1$.
To find the variance of $b_1$, recall that the observations $Y_i$ are pair-wise
uncorrelated. That is, $cov(Y_i, Y_j) = 0, \; \forall i \neq j$. Now,
\[ Var(b_1) = Var\left(\sum_{i=1}^{n} k_i Y_i\right) = \sum_{i=1}^{n} k_i^2 Var(Y_i) = \sigma^2 \sum_{i=1}^{n} k_i^2 \]

Substituting for $k_i$,
\[ Var(b_1) = \sigma^2 \sum_{i=1}^{n} \left[\frac{x_i - \bar{x}}{\sum_{j=1}^{n} (x_j - \bar{x})^2}\right]^2 = \frac{\sigma^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^2} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Denote the estimate of $Var(b_1)$ by $s^2(b_1)$. Using the mean square error (MSE) as the sample
estimate of $\sigma^2$, we obtain
\[ s^2(b_1) = \frac{MSE}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (5.9) \]
An estimate of the standard error of $b_1$ is the square root of this variance.
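
The weights $k_i$ used in this derivation can be inspected numerically. The sketch below assumes the data of Example 4.1 purely for illustration and checks the three identities the argument relies on.

```python
# Data assumed from Example 4.1: group size (x) and hours spent (y).
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]
n = len(x)

x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Fixed weights k_i = (x_i - x_bar) / sum_j (x_j - x_bar)^2.
k = [(xi - x_bar) / s_xx for xi in x]

sum_k = sum(k)                                    # should be 0
sum_kx = sum(ki * xi for ki, xi in zip(k, x))     # should be 1
sum_k2 = sum(ki ** 2 for ki in k)                 # should equal 1 / s_xx

# The slope as a linear combination of the observations, Equation (5.8).
b1 = sum(ki * yi for ki, yi in zip(k, y))

print(round(sum_k, 10), round(sum_kx, 10), round(sum_k2 * s_xx, 10))
print(round(b1, 4))
```

The identity for the sum of squared weights is what turns $\sigma^2 \sum k_i^2$ into $\sigma^2 / \sum (x_i - \bar{x})^2$.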


4.2 The Mean and Variance of the Intercept


The least squares estimator of the model intercept is
\[ b_o = \bar{Y} - b_1\bar{x} \]
Since it has been shown that $b_1$ is a linear combination of $Y_1, Y_2, \ldots, Y_n$, so is $b_o$.
It is not difficult now to show that
\[ E(b_o) = \beta_o \]
Let us now turn attention to finding $Var(b_o)$. We have
\[ Var(b_o) = Var(\bar{Y} - b_1\bar{x}) \]
Noting that $b_1 = \sum_{i=1}^{n} k_i Y_i$, we have

\[ Var(b_o) = Var\left(\frac{1}{n}\sum_{i=1}^{n} Y_i - \bar{x}\sum_{i=1}^{n} k_i Y_i\right) = Var\left(\sum_{i=1}^{n} \left(\frac{1}{n} - \bar{x}k_i\right)Y_i\right) = \sum_{i=1}^{n} \left(\frac{1}{n} - \bar{x}k_i\right)^2 Var(Y_i) \]
Expanding the bracket and noting that $\sum_{i=1}^{n} k_i = 0$,
\[ Var(b_o) = \sigma^2 \sum_{i=1}^{n} \left(\frac{1}{n^2} - \frac{2}{n}\bar{x}k_i + \bar{x}^2 k_i^2\right) = \sigma^2\left(\frac{1}{n} + \bar{x}^2 \sum_{i=1}^{n} k_i^2\right) \]
\[ = \sigma^2\left(\frac{1}{n} + \bar{x}^2 \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\left[\sum_{j=1}^{n} (x_j - \bar{x})^2\right]^2}\right) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right) \]

Noting again that $\bar{x} = \frac{1}{n}\sum x_i$, so that $\sum (x_i - \bar{x})^2 + n\bar{x}^2 = \sum x_i^2$, the result is further simplified as
\[ Var(b_o) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Thus, replacing $\sigma^2$ by MSE, the estimated variance of $b_o$ is
\[ s^2(b_o) = \frac{MSE \sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (5.10) \]
and an estimate of the standard error of $b_o$ is the square root of this variance.

4.3 Estimate of the Mean Response


It is possible to use the estimated regression line to estimate the mean of the response
variable at a specific value $X = x_o$. For this value, the estimate is

\[ \hat{Y}_o = b_o + b_1 x_o \]

Note that for the same values of x, we expect sample-to-sample variation in $\hat{Y}_o$. We
seek $E(\hat{Y}_o)$ and $Var(\hat{Y}_o)$ at $X = x_o$.

Now, \[ E(\hat{Y}_o) = E(b_o + b_1 x_o) = \beta_o + \beta_1 x_o = E(Y_o) \]

Thus, the estimate $\hat{Y}_o$ is an unbiased estimate of the population mean $E(Y_o)$
at a specific value $X = x_o$.

For the variance, since $b_o = \bar{Y} - b_1\bar{x}$, we have that

\[ Var(\hat{Y}_o) = Var\left(\bar{Y} + b_1(x_o - \bar{x})\right) \]

Making substitutions for $\bar{Y}$ and $b_1$,

\[ Var(\hat{Y}_o) = Var\left(\frac{1}{n}\sum_{i=1}^{n} Y_i + (x_o - \bar{x})\sum_{i=1}^{n} k_i Y_i\right) = Var\left(\sum_{i=1}^{n} \left(\frac{1}{n} + (x_o - \bar{x})k_i\right)Y_i\right) = \sum_{i=1}^{n} \left(\frac{1}{n} + (x_o - \bar{x})k_i\right)^2 Var(Y_i) \]
\[ = \sigma^2\left(\frac{1}{n} + \frac{2(x_o - \bar{x})}{n}\sum_{i=1}^{n} k_i + (x_o - \bar{x})^2 \sum_{i=1}^{n} k_i^2\right) = \sigma^2\left(\frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right) \]

Therefore, an estimate of the standard error of $\hat{Y}_o$ is given by the square root
of
\[ s^2(\hat{Y}_o) = MSE\left(\frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right) \qquad (5.11) \]

In Session 6, you will see that this result is simply the product of MSE (the squared standard error of
the estimate of the regression model) and the distance value of $X = x_o$ from the previously
observed mean value $\bar{x}$ of X.
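
Equations (5.9), (5.10) and (5.11) can be evaluated together. The following is a sketch only, assuming the data of Example 4.1 and the value $x_o = 10$ used in Example 5.3; all names are hypothetical.

```python
import math

# Data assumed from Example 4.1: group size (x) and hours spent (y).
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]
n = len(x)
x0 = 10  # value of X at which the mean response is estimated

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# MSE estimates sigma^2 with n - 2 degrees of freedom.
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

var_b1 = mse / s_xx                                    # Equation (5.9)
var_b0 = mse * sum(xi ** 2 for xi in x) / (n * s_xx)   # Equation (5.10)
var_y0 = mse * (1 / n + (x0 - x_bar) ** 2 / s_xx)      # Equation (5.11)

se_b1, se_b0, se_y0 = (math.sqrt(v) for v in (var_b1, var_b0, var_y0))
print(round(se_b0, 4), round(se_b1, 4), round(se_y0, 4))
```

The intercept variance can equally be computed as `mse * (1 / n + x_bar ** 2 / s_xx)`, the unsimplified form obtained in the derivation.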


Self-Assessment Questions
Exercise 5.4

1. Using the least squares estimator of the model intercept, $b_o = \bar{Y} - b_1\bar{x}$,
show that $b_o$ is an unbiased estimator of $\beta_o$.

2. Given the estimated regression equation for Question 1, Exercise 4.1,
(a) compute estimates of the standard errors of $b_o$ and $b_1$.
(b) determine a 95 percent confidence interval for the true slope.
(c) determine whether or not a linear relationship is statistically discernible
between year of operation and profit margin.
(d) for each year of operation, compute 95 percent confidence interval estimates
for the mean profit and plot the intervals against the estimated regression line.

3. Consider the linear model $Y_i = \beta x_i + \varepsilon_i, \; i = 1, 2, \ldots, n$, and suppose that all
regression assumptions hold.
(a) Determine the least squares estimator b of $\beta$.
(b) Determine whether b is an unbiased estimator of $\beta$.
(c) Show that
\[ Var(b) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2} \]


SESSION 5: THE USE OF ANALYSIS OF VARIANCE IN SIMPLE LINEAR REGRESSION

Most statistical software packages present the analysis of variance as part of the
output of regression analysis. Analysis of variance, often written as
ANOVA, is in this case just a presentation of the three types of variation in the data. In
addition, it indicates the corresponding F test of significance of the regression model.
We conducted the F test in the previous session. This session will formally summarise
the procedure for presenting the ANOVA in regression. We will discuss the general
procedure of ANOVA in the latter part of Unit 6.

Objectives
By the end of this session, you should be able to:
1. Present the analysis of variance of a simple regression analysis
2. Obtain the three types of variation using matrix approach

Now read on…

5.1 Presenting ANOVA in Simple Linear Regression


We noted in Session 1 that the significance of the simple linear regression could be
assessed by examining the explained variation in relation to the unexplained variation.
Another way of assessing the model was to examine the significance of the independent
variable X. We remarked that the results of the two tests appeared to be the same. You
will have noticed that in that session we did not state the hypotheses for the test of the
model itself. The hypotheses for the test based on the F-distribution are
\[ H_o: \beta_1 = 0 \quad \text{against} \quad H_1: \beta_1 \neq 0 \]
which is the same as testing the significance of the variable X.
For analysis of variance, we examine the difference in the errors
\[ (y_i - \bar{y}) - (y_i - \hat{y}_i) = \hat{y}_i - \bar{y} \]
or
\[ y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \]
Squaring both sides and summing over all observations (i = 1, 2, ..., n), we have
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + 2\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) \]

We rewrite the cross-product term as

\[ \sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = \sum_{i=1}^{n} \hat{y}_i(y_i - \hat{y}_i) - \bar{y}\sum_{i=1}^{n} (y_i - \hat{y}_i) \]

From Property 1 of the estimated regression line in Session 2, the second sum is
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i) = \sum_{i=1}^{n} e_i = 0 \]

Rewriting the first sum, we have

\[ \sum_{i=1}^{n} \hat{y}_i(y_i - \hat{y}_i) = \sum_{i=1}^{n} \hat{y}_i e_i = \sum_{i=1}^{n} (b_o + b_1 x_i)e_i = b_o\sum_{i=1}^{n} e_i + b_1\sum_{i=1}^{n} x_i e_i = 0 \]
since $\sum_{i=1}^{n} e_i = 0$ and $\sum_{i=1}^{n} x_i e_i = 0$ by Properties 1 and 4, respectively, of Session 2.

Therefore, we have
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (5.12) \]

Equation (5.12) is the fundamental equation of regression analysis. As pointed out,

$\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares (SST),
$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is the regression sum of squares (SSR),
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the error sum of squares (SSE).

Now if the slope $b_1$ of the estimated regression line is zero, then SSR = 0; under $H_o$ we
therefore expect SSR to be small. On the other hand, if SSR is large and deviates far from 0,
the linear term $b_1 x$ accounts for a large part of the variation in the observations.
An important point to note is that in the computation of these variations, we make use
of sample estimates, which leads to a loss of degrees of freedom. For
example, for SST, we use $\bar{y}$, which comes with a loss of 1 degree of freedom. Thus, for
the SST there are n − 1 degrees of freedom. Similarly, for the SSE, there are n − 2
degrees of freedom. Since the degrees of freedom are additive, from Equation (5.12),
we should have
\[ df(SSR) = df(SST) - df(SSE) \]
Thus, SSR has 1 degree of freedom. It will be observed later that the degrees of freedom
of the SSR are always equal to the number of independent variables in the model.
To conduct the test of $H_o$, we have stated the statistic as
\[ F = \frac{SSR/1}{SSE/(n-2)} = \frac{MSR}{MSE} \]
which has the F distribution with 1 and n − 2 degrees of freedom.
Table 5.2 shows the general layout of the analysis of variance for the simple linear
model.

Table 5.2: ANOVA table for the simple linear model

Source of variation   df      SS                                   MS                   F statistic
Regression            1       SSR = $\sum (\hat{y}_i - \bar{y})^2$      MSR = SSR/1          F = MSR/MSE
Error                 n − 2   SSE = $\sum (y_i - \hat{y}_i)^2$          MSE = SSE/(n − 2)
Total                 n − 1   SST = $\sum (y_i - \bar{y})^2$


Example 5.5
Refer to the data in Example 4.2.
(a) Present the analysis of variance for the model for amount of electricity consumption
in terms of temperature.
(b) Comment on your analysis.

Solution
(a) We test the null hypothesis that there is no linear relationship between amount
spent on electricity and temperature level against the alternative that there is a linear
relationship, at α = 0.05.
We know that
\[ SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = 82738 - \frac{1}{8}(798)^2 = 3137.5 \]
SSE = 263.741,
SSR = 3137.5 − 263.741 = 2873.759

For n = 8, the ANOVA table is as shown.

Source of variation   df   SS         MS         F statistic
Regression            1    2873.759   2873.759   65.377
Error                 6    263.741    43.957
Total                 7    3137.500

$F_{0.05,\,1,\,6} = 5.99$

(b) Since $F = 65.377 > F_{0.05,\,1,\,6} = 5.99$, the null hypothesis of no linear regression is
rejected, and we conclude that the amount spent on electricity is linearly influenced
by temperature.
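
The ANOVA table of Example 5.5 follows mechanically from SST, SSE and n. The sketch below uses the figures quoted in the solution; the function `anova_simple` is a hypothetical helper, and the critical value 5.99 is taken from the text.

```python
def anova_simple(sst, sse, n):
    """Build the ANOVA quantities for simple linear regression (one predictor)."""
    ssr = sst - sse              # regression sum of squares, from Equation (5.12)
    df_reg, df_err = 1, n - 2    # degrees of freedom
    msr = ssr / df_reg
    mse = sse / df_err
    return {
        "SSR": ssr, "SSE": sse, "SST": sst,
        "df": (df_reg, df_err, n - 1),
        "MSR": msr, "MSE": mse, "F": msr / mse,
    }


# Figures quoted in Example 5.5.
table = anova_simple(sst=3137.5, sse=263.741, n=8)
F_CRIT = 5.99  # F_{0.05, 1, 6}, the table value given in the text

reject_h0 = table["F"] > F_CRIT
print(round(table["MSE"], 3), round(table["F"], 3), reject_h0)
```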

5.2 Relationship between the F test and the t test


When we test for the significance of the linear model, we use the F test. However, the
test of significance of the independent variable X is based on the t test (see Session 1).
As already observed, the two tests are based on the same hypothesis. Thus, one would
expect some relationship to exist between the two tests, and there is indeed such a
relationship. To establish it, we note that the
estimated regression line is
\[ \hat{y}_i = \bar{y} + b_1(x_i - \bar{x}) \]
or
\[ \hat{y}_i - \bar{y} = b_1(x_i - \bar{x}) \]
By squaring both sides and summing over all i = 1, 2, ..., n, we obtain
\[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (5.13) \]

From Session 4, the estimated variance of $b_1$ is given as

\[ s^2(b_1) = \frac{MSE}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

Thus, $MSE = s^2(b_1)\sum_{i=1}^{n} (x_i - \bar{x})^2$.

Then, since MSR = SSR/1,
\[ F = \frac{MSR}{MSE} = \frac{b_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{s^2(b_1)\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{b_1^2}{s^2(b_1)} = \left(\frac{b_1}{s(b_1)}\right)^2 \]

Therefore, for a random variable F distributed with 1 and n − 2 degrees of freedom,
and T a Student's t random variable with n − 2 degrees of freedom,
\[ F = T^2 \]
The quantile (or table) values are thus related as
\[ f_{1-\alpha,\,1,\,n-2} = t^2_{1-\alpha/2,\,n-2} \]
The result shows that for simple linear regression, we can infer the significance of the
model from the significance of the independent variable, and vice versa.
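
The identity between the two statistics can be confirmed on a concrete data set. This is an illustrative sketch, assuming the data of Example 4.1; it computes the F statistic from the sums of squares and the t statistic from the slope and its standard error.

```python
import math

# Data assumed from Example 4.1: group size (x) and hours spent (y).
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# F statistic from the ANOVA decomposition.
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
mse = sse / (n - 2)
f_stat = (ssr / 1) / mse

# t statistic for the slope, using Equation (5.9) for its standard error.
t_stat = b1 / math.sqrt(mse / s_xx)

print(round(f_stat, 4), round(t_stat ** 2, 4))
```

The same relation holds for the table values: the t value 2.306 used earlier for 8 degrees of freedom, when squared, gives the corresponding 5 percent F value with 1 and 8 degrees of freedom.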


Self-Assessment Questions
Exercise 5.5
1. Refer to Example 4.1.
The data on the number of hours spent by ten groups of workers on similar jobs are
as shown.

Size of group (x)    5    8    4    6   10    3    5    9   10    7
No. of hours (y)    10    7   13    4    1    7    8    3    2    5

Conduct an analysis of variance on the linear model for hours spent on the job in terms of the
number of workers in the various groups.

2. Refer to the data in Exercise 4.1, Question 1.
The data on profit, in GH¢, of a certain small-scale business establishment in the
xth year of its operation are given in the table.

x     1     2     3     4     5
y   125   140   165   195   230

(a) Conduct an analysis of variance for the linear model for profit in terms of the year
of operation.
(b) Test the significance of the year of operation in the model.
(c) Comment on your results in (a) and (b).


SESSION 6: USING MATRIX ALGEBRA IN SIMPLE LINEAR REGRESSION

Computations in regression analysis can be made a lot easier
when they are carried out using matrix algebra. In fact, this
approach will be the preferred option when we study multiple regression in the next unit. We
will use this session to study how matrix algebra can be used in the case of simple linear
regression. The results extend naturally to multiple regression.

Objectives
By the end of this session, you should be able to:
1. Derive the least squares simple linear regression using matrix approach
2. Obtain the three types of variation using matrix approach

Now read on…

6.1 The Least Squares Regression Parameter Estimates


Suppose n observations $(x_1, x_2, x_3, \ldots, x_n)$ and $(y_1, y_2, y_3, \ldots, y_n)$ are taken on the variables X
and Y. The simple linear regression model for Y in terms of X is
$y = \beta_o + \beta_1 x$. In order to obtain the estimates $b_o$ and $b_1$ of the parameters $\beta_o$ and $\beta_1$,
we write the model for each observation as

\[ y_1 = \beta_o + \beta_1 x_1 \]
\[ y_2 = \beta_o + \beta_1 x_2 \]
\[ \vdots \]
\[ y_n = \beta_o + \beta_1 x_n \]

This set of equations may be written in matrix/vector form as

\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} \qquad (5.14) \]

where

\[ \mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_o \\ \beta_1 \end{pmatrix} \qquad (5.15) \]
Notice that
Y is an n × 1 vector of observed values on Y
X is an n × (k + 1) data matrix, k = 1
β is a (k + 1) × 1 vector of parameters, k = 1
In the definition of the dimensions of the matrices/vectors above, k is the number of
independent variables in the model; in this case, k = 1.
For least squares estimation of the parameters, the normal equations are obtained as
follows:
Pre-multiplying (5.14) by $\mathbf{X}'$, we obtain
\[ (\mathbf{X}'\mathbf{X})\boldsymbol{\beta} = \mathbf{X}'\mathbf{Y} \]
where $\mathbf{X}'\mathbf{X}$ is a 2 × 2 matrix, or generally a (k + 1) × (k + 1) matrix.
If $\mathbf{X}'\mathbf{X}$ has an inverse, we then obtain the solution for the vector β as
\[ \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \qquad (5.16) \]

Now from (5.15),

\[ \mathbf{X}'\mathbf{X} = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ x_1 & x_2 & x_3 & \cdots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \]

By the usual procedure for matrix multiplication, we obtain
\[ \mathbf{X}'\mathbf{X} = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} \qquad (5.17) \]
The inverse of this matrix is

\[ (\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix} \qquad (5.18) \]

Let us take note of the following features of the matrix in (5.18).

1. Each of the estimators $b_o$ and $b_1$ is normally distributed with mean $E(b_j) = \beta_j$ (j = 0, 1)
and variance $Var(b_j) = c_{jj}\sigma^2$, where $c_{jj}$ is the (j + 1)st diagonal element of
$(\mathbf{X}'\mathbf{X})^{-1}$.

2. The covariance $Cov(b_o, b_1) = c_{12}\sigma^2$, where $c_{12}$ is the off-diagonal (row 1, column 2) element of
$(\mathbf{X}'\mathbf{X})^{-1}$.

Example 5.6
The data on the number of hours spent by ten groups of workers on similar jobs are as
shown.

Size of group (x)    5    8    4    6   10    3    5    9   10    7
No. of hours (y)    10    7   13    4    1    7    8    3    2    5

Use the matrix approach to obtain the regression equation for the number of hours spent in
terms of the number of workers in the group.

Solution

Constructing the data in the form in Equation (5.15), we have

\[ \mathbf{X}'\mathbf{X} = \begin{pmatrix} 10 & 67 \\ 67 & 505 \end{pmatrix}, \quad (\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} 0.9002 & -0.1194 \\ -0.1194 & 0.0178 \end{pmatrix}, \quad \mathbf{X}'\mathbf{Y} = \begin{pmatrix} 60 \\ 335 \end{pmatrix} \]

Therefore,
\[ \boldsymbol{\beta} = \begin{pmatrix} 0.9002 & -0.1194 \\ -0.1194 & 0.0178 \end{pmatrix} \begin{pmatrix} 60 \\ 335 \end{pmatrix} = \begin{pmatrix} 14.0018 \\ -1.1943 \end{pmatrix} \]
where the product is evaluated with the unrounded entries of the inverse.

The regression equation is therefore given as

\[ \hat{y} = 14.0018 - 1.1943x \]
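
The matrix computation in Example 5.6 can be sketched in plain Python without any external library. The helper functions `transpose`, `matmul` and `inverse_2x2` below are hypothetical names written for this illustration; in practice a statistical package would perform these steps.

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_2x2(m):
    """Invert a 2 x 2 matrix using the adjugate formula, as in Equation (5.18)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Data of Example 5.6.
x = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]

X = [[1, xi] for xi in x]   # n x 2 data matrix with a column of ones
Y = [[yi] for yi in y]      # n x 1 vector of observations

XtX = matmul(transpose(X), X)
XtY = matmul(transpose(X), Y)
beta = matmul(inverse_2x2(XtX), XtY)   # Equation (5.16)

print(XtX, XtY)
print([round(row[0], 4) for row in beta])
```

Working with the unrounded inverse avoids the loss of accuracy that comes from multiplying the four-decimal entries shown in the solution.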

6.2 Estimation of Variations by Matrix Approach


The Mean Residual Error
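An unbiased estimator of the population variance $\sigma^2$ is the mean square error
(MSE), which has been given earlier as
\[ S^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2} \]
Taking the numerator,
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - 2\sum_{i=1}^{n} y_i\hat{y}_i + \sum_{i=1}^{n} \hat{y}_i^2 \qquad (5.19) \]

Since $\sum \hat{y}_i e_i = 0$ (Property 5 of Session 2), we have $\sum y_i\hat{y}_i = \sum \hat{y}_i^2$, so this can further be simplified as

\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} \hat{y}_i^2 \qquad (5.20) \]

Noting that
\[ \sum_{i=1}^{n} y_i^2 = \mathbf{Y}'\mathbf{Y} \quad \text{and} \quad \sum_{i=1}^{n} \hat{y}_i^2 = (\mathbf{X}\boldsymbol{\beta})'(\mathbf{X}\boldsymbol{\beta}) = \boldsymbol{\beta}'(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} \]

and that $\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, Equation (5.20) now simplifies to

\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \mathbf{Y}'\mathbf{Y} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} \]

Therefore, the mean square error (the square of the standard error) may be written in terms of matrices as

\[ S^2 = \frac{\mathbf{Y}'\mathbf{Y} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{Y}}{n - 2} \qquad (5.21) \]
The estimated variance of the estimate of $\beta_j$ is thus given as $c_{jj}S^2$, (j = 0, 1).

For example, for the data above, we know that

\[ \mathbf{X}'\mathbf{Y} = \begin{pmatrix} 60 \\ 335 \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} 14.0018 \\ -1.1943 \end{pmatrix} \]

Now, $\mathbf{Y}'\mathbf{Y} = 486$ and
\[ \boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} = (14.0018 \;\; -1.1943)\begin{pmatrix} 60 \\ 335 \end{pmatrix} = 440.0178 \]
Therefore,

\[ S^2 = \frac{\mathbf{Y}'\mathbf{Y} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{Y}}{n - 2} = \frac{486 - 440.0178}{8} = 5.7478 \]
This value is the same as that obtained in Session 3, within rounding errors.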

The Sum of Squares Regression

This is denoted as SSR and given by

\[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]

Expanding and simplifying, and using $\sum \hat{y}_i = n\bar{y}$, we obtain
\[ SSR = \sum_{i=1}^{n} \hat{y}_i^2 - n\bar{y}^2 \]

In matrix form,
\[ SSR = \boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 \qquad (5.22) \]
For example, for the data above, $\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} = 440.0178$ and $\sum_{i=1}^{n} y_i = 60$.
Therefore,
\[ SSR = 440.0178 - \frac{1}{10}(60)^2 = 80.0178 \]
When we divide SSR by the number of independent variables in the model, we obtain the regression
mean square (MSR). Note that in simple linear regression, there is only one independent variable, so MSR = SSR.

The Total Sum of Squares


The total variation in the observed values of Y is given by
\[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \]

This expression may be written in matrix form as

\[ SST = \mathbf{Y}'\mathbf{Y} - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 \qquad (5.23) \]
You will find these matrix representations very useful in multiple regression.
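
The three matrix expressions (5.21), (5.22) and (5.23) can be evaluated from the summary quantities of the worked example. The sketch below is illustrative only; it solves the 2 × 2 normal equations in closed form and the variable names are hypothetical.

```python
# Summary quantities for the data of Example 5.6.
n = 10
sum_x, sum_y = 67, 60
sum_x2, sum_xy, sum_y2 = 505, 335, 486

# Solve (X'X) beta = X'Y using the inverse in Equation (5.18).
det = n * sum_x2 - sum_x ** 2
b0 = (sum_x2 * sum_y - sum_x * sum_xy) / det
b1 = (n * sum_xy - sum_x * sum_y) / det

yty = sum_y2                       # Y'Y
bxy = b0 * sum_y + b1 * sum_xy     # beta' X'Y
correction = sum_y ** 2 / n        # (sum of y)^2 / n

sse = yty - bxy                    # numerator of Equation (5.21)
ssr = bxy - correction             # Equation (5.22)
sst = yty - correction             # Equation (5.23)
mse = sse / (n - 2)

print(round(bxy, 4), round(sse, 4), round(ssr, 4), round(sst, 4), round(mse, 4))
```

The three sums of squares satisfy SST = SSR + SSE, in agreement with Equation (5.12).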

6.3 Application to Distance Measure


Let the vector $\mathbf{x}_o$ be the observed vector $\mathbf{x}_o = (1 \;\; x_o)'$. Of interest is the product
$\mathbf{x}_o'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_o$, obtained as follows:

\[ \mathbf{x}_o'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_o = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2} \,(1 \;\; x_o) \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix} \begin{pmatrix} 1 \\ x_o \end{pmatrix} \]
\[ = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2} \left(\sum x_i^2 - x_o\sum x_i \;\;\; -\sum x_i + nx_o\right) \begin{pmatrix} 1 \\ x_o \end{pmatrix} \]
\[ = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2} \left(\sum x_i^2 - x_o\sum x_i - x_o\sum x_i + nx_o^2\right) \]
\[ = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2} \left(\sum x_i^2 - 2x_o\sum x_i + nx_o^2\right) \]
\[ = \frac{\sum (x_i - x_o)^2}{n\sum x_i^2 - \left(\sum x_i\right)^2} \]

Recognize that we can write the numerator of the last result as

\[ \sum (x_i - x_o)^2 = \sum (x_i - \bar{x} + \bar{x} - x_o)^2 = \sum (x_i - \bar{x})^2 + n(x_o - \bar{x})^2 \]

Thus, $\sum (x_i - x_o)^2 = \sum x_i^2 - n\bar{x}^2 + n(x_o - \bar{x})^2$.

We note also that $n\sum x_i^2 - \left(\sum x_i\right)^2 = n\left(\sum x_i^2 - n\bar{x}^2\right)$.

Making substitutions,

\[ \mathbf{x}_o'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_o = \frac{\sum x_i^2 - n\bar{x}^2 + n(x_o - \bar{x})^2}{n\left(\sum x_i^2 - n\bar{x}^2\right)} = \frac{\sum x_i^2 - n\bar{x}^2}{n\left(\sum x_i^2 - n\bar{x}^2\right)} + \frac{n(x_o - \bar{x})^2}{n\left(\sum x_i^2 - n\bar{x}^2\right)} = \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum x_i^2 - n\bar{x}^2} \]

Therefore, the matrix product is

\[ \mathbf{x}_o'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_o = \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum x_i^2 - n\bar{x}^2} \qquad (5.24) \]

This result represents the distance of $X = x_o$ from the previously observed mean value $\bar{x}$
of X.
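
Equation (5.24) can be checked by computing the quadratic form directly and comparing it with the closed form. This is a sketch under the assumption that the data are those of Example 4.1 and that $x_o = 10$, as in Example 5.3; the function names are hypothetical.

```python
# Summary quantities for the data of Example 4.1.
n = 10
sum_x = 67
sum_x2 = 505
x_bar = sum_x / n


def distance_matrix_form(x0):
    """x0'(X'X)^{-1}x0 computed from the inverse in Equation (5.18)."""
    det = n * sum_x2 - sum_x ** 2
    inv = [[sum_x2 / det, -sum_x / det],
           [-sum_x / det, n / det]]
    v = [1, x0]
    # Quadratic form v' * inv * v written out for the 2 x 2 case.
    return sum(v[i] * inv[i][j] * v[j] for i in range(2) for j in range(2))


def distance_closed_form(x0):
    """Right-hand side of Equation (5.24)."""
    return 1 / n + (x0 - x_bar) ** 2 / (sum_x2 - n * x_bar ** 2)


d_matrix = distance_matrix_form(10)
d_closed = distance_closed_form(10)
print(round(d_matrix, 4), round(d_closed, 4))
```

The distance value is smallest, equal to 1/n, when $x_o = \bar{x}$, which is why the confidence band is narrowest at the mean of X.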


Self-Assessment Questions
Exercise 5.6
1. Let the following matrices be defined as follows:
Y is an n × 1 vector of observed values on Y
X is an n × (k + 1) data matrix
β is a (k + 1) × 1 vector of parameters
In the definition of the dimensions of the matrices/vectors above, k is the number of
independent variables in the model. For the simple linear model,
state the dimensions of the following products:
(a) $\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y}$
(b) $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$

2. The data on sales price, y, (in thousands of Ghana Cedis) of a house and home size,
x, (in tens of square feet) in Exercise 4.1, Question 2, are given in the table.

x    23     11      20      17    15    21      24    13      19    25
y    360    196.2   346.2   273   282   331.8   387   255.6   327   345

(a) Obtain the matrix $\mathbf{X}'\mathbf{X}$ and its inverse.
(b) Hence, find the regression coefficients vector β and write down the regression
model for y.
(c) Obtain the values $\mathbf{Y}'\mathbf{Y}$ and $\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y}$. Hence, calculate the mean square residual
for the model in (b).


UNIT 6: MULTIPLE LINEAR REGRESSION AND ANALYSIS OF VARIANCE
UNIT OUTLINE

Session 1: The General Linear Model


Session 2: Inference about the Multiple Linear Regression
Session 3: Dummy Variable Regression
Session 4: Polynomial Regression
Session 5: One-Way Analysis of Variance
Session 6: Two-Way Analysis of Variance

The fundamentals of regression analysis have been established in the
previous two units. This unit considers a generalisation of what has
been treated in those units. In addition, it examines the general concept of one-way and
two-way analysis of variance.

Objectives
By the end of the unit you should be able to:
1. derive the general linear regression for given data on several variables;
2. interpret the coefficients of the derived model;
3. assess a derived regression model by using various measures of quality;
4. assess the quality of a regression model by conducting relevant statistical tests;
5. determine regression model in terms of indicator variables;
6. interpret the coefficients of dummy variable regression;
7. fit a polynomial model to a given data;
8. assess the suitability of the fitted model;
9. perform one-way analysis of variance on a given suitable data;
10. determine exact pairs of classes of samples that are different after establishing
that such differences exist;
11. perform a simple two-way analysis of variance on a given suitable data;
12. perform a two-way analysis of variance with interaction on a given suitable data.


SESSION 1: THE GENERAL LINEAR MODEL


In Units 4 and 5, we established the essentials of simple linear
regression analysis. This session extends these concepts to the general
linear model, which expresses a dependent variable in terms of several independent
variables. Since the computations here can be quite bulky, we may rely on a
statistical computing package such as Minitab. This means that Minitab
or a similar package should be available to you.

Objectives
By the end of this session, you should be able to:
1. Derive the general linear regression for given data on several variables.
2. Interpret the coefficients of the derived model

Now read on…

1.1 The Data and Regression Model


Let $(x_1, x_2, x_3, \ldots, x_p)$ be $p$ independent variables that may have some influence on a
response variable $Y$. The general linear model for $Y$ is of the form
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i \tag{6.1}$$
where
$Y_i$ is the $i$th observation of the response variable, observed under the fixed values
$(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{ip})$ of the predictor variables;
$\varepsilon_i$ is the unobservable random error associated with $Y_i$;
$\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are $p + 1$ unknown linear parameters.
According to the normal assumptions, the $Y_i$ are independent normally distributed random
variables such that
$$E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$$
and
$$\operatorname{Var}(Y_i) = \sigma^2, \quad i = 1, 2, \ldots, n$$
Let us expand Equation (6.1) for each of the $n$ observations of $Y$. Given a random
sample of $n$ observations $(Y_1, Y_2, Y_3, \ldots, Y_n)$, each observed on $(x_1, x_2, x_3, \ldots, x_p)$, the
following $n$ equations result from Equation (6.1).

$$\begin{aligned}
Y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_j x_{1j} + \cdots + \beta_p x_{1p} + \varepsilon_1 \\
Y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_j x_{2j} + \cdots + \beta_p x_{2p} + \varepsilon_2 \\
&\;\;\vdots \\
Y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_j x_{ij} + \cdots + \beta_p x_{ip} + \varepsilon_i \\
&\;\;\vdots \\
Y_n &= \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_j x_{nj} + \cdots + \beta_p x_{np} + \varepsilon_n
\end{aligned}$$

The set of equations may be written in matrix form as


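$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{6.2}$$
where
$$\mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
It is important to note the dimensions of each of the matrices and vectors involved in Equation
(6.2).
Notice that
$\mathbf{Y}$ is an $n \times 1$ vector of observed values of $Y$;
$\mathbf{X}$ is an $n \times (p + 1)$ data matrix;
$\boldsymbol{\beta}$ is a $(p + 1) \times 1$ vector.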
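For least squares estimation of the parameters, the normal equations, in the form of
Equation (5.16), are
$$(\mathbf{X}'\mathbf{X})\mathbf{b} = \mathbf{X}'\mathbf{Y}$$
where $\mathbf{X}'\mathbf{X}$ is a $(p + 1) \times (p + 1)$ matrix.
If $\mathbf{X}'\mathbf{X}$ has an inverse, we obtain the least squares estimate $\mathbf{b}$ of the vector $\boldsymbol{\beta}$ as
$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \tag{6.3}$$
Hence, the estimated regression equation is
$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b} \tag{6.4}$$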
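As a brief sketch of where the normal equations come from, the standard least squares argument can be written in matrix form. The symbol $Q$ for the error sum of squares is an assumption made here, chosen by analogy with its use in the polynomial regression session.

```latex
\begin{aligned}
Q(\mathbf{b}) &= (\mathbf{Y}-\mathbf{X}\mathbf{b})'(\mathbf{Y}-\mathbf{X}\mathbf{b})
  = \mathbf{Y}'\mathbf{Y} - 2\,\mathbf{b}'\mathbf{X}'\mathbf{Y} + \mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}, \\
\frac{\partial Q}{\partial \mathbf{b}} &= -2\,\mathbf{X}'\mathbf{Y} + 2\,\mathbf{X}'\mathbf{X}\mathbf{b} = \mathbf{0}
  \;\Longrightarrow\; (\mathbf{X}'\mathbf{X})\,\mathbf{b} = \mathbf{X}'\mathbf{Y}.
\end{aligned}
```

Setting the vector of partial derivatives to zero gives one equation per parameter, which is why there are $p + 1$ normal equations.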

Let us take note of the following features of the estimates $b_j$ obtained from (6.3).

1. Each $b_j$ is normally distributed with mean $E(b_j) = \beta_j$ $(j = 0, 1, 2, \ldots, p)$
and variance $\operatorname{Var}(b_j) = c_{jj}\sigma^2$, where $c_{jj}$ is the $(j + 1)$st diagonal element of
$(\mathbf{X}'\mathbf{X})^{-1}$.
2. The covariance is $\operatorname{Cov}(b_j, b_k) = c_{jk}\sigma^2$, where $c_{jk}$ is the off-diagonal element in
row $j + 1$ and column $k + 1$ of $(\mathbf{X}'\mathbf{X})^{-1}$.

Example 6.1
Refer to the data in Example 4.1. Suppose it is suspected that the length of experience
of the members of a group working on the given job also influences the time spent on
carrying out the job. The data, including the average years of experience of each group,
are shown in Table 6.1.

Table 6.1: Time spent on completing a piece of job and influential variables
Size of Group (x1)    Av. Years of Experience (x2)    No. of Hrs (y)
 5                    5.5                             10
 8                    7.5                              7
 4                    4.0                             13
 6                    9.0                              4
10                    9.0                              1
 3                    7.0                              7
 5                    5.0                              8
 9                    8.0                              3
10                    7.5                              2
 7                    8.0                              5

Obtain the regression equation for the number of hours spent in terms of the number of
workers in the group and group average years of experience.

Solution

Arranging the data in the matrix form of Equation (6.2) and forming the products in the
normal equations, we have
$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} 10.00 & 67.00 & 70.50 \\ 67.00 & 505.00 & 496.50 \\ 70.50 & 496.50 & 522.75 \end{pmatrix}, \quad
\mathbf{X}'\mathbf{Y} = \begin{pmatrix} 60.00 \\ 335.00 \\ 372.50 \end{pmatrix}$$
$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} 2.0323 & -0.0024 & -0.2718 \\ -0.0024 & 0.0299 & -0.0281 \\ -0.2718 & -0.0281 & 0.0652 \end{pmatrix}$$
Thus, the vector of least squares estimates of the parameters is
$$\mathbf{b} = \begin{pmatrix} 2.0323 & -0.0024 & -0.2718 \\ -0.0024 & 0.0299 & -0.0281 \\ -0.2718 & -0.0281 & 0.0652 \end{pmatrix}
\begin{pmatrix} 60.00 \\ 335.00 \\ 372.50 \end{pmatrix}
= \begin{pmatrix} 19.8875 \\ -0.5861 \\ -1.4129 \end{pmatrix}$$

Therefore, the regression equation is
$$\hat{y} = 19.8875 - 0.5861x_1 - 1.4129x_2$$
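The computation in Example 6.1 can be checked without a statistical package. The following is a minimal sketch in plain Python, assuming the data of Table 6.1; the helper name `invert` and the variable names are illustrative choices, not part of the course text. It forms $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{Y}$, inverts $\mathbf{X}'\mathbf{X}$ by Gauss-Jordan elimination, and applies Equation (6.3).

```python
# Data from Table 6.1: group size (x1), average years of experience (x2), hours (y).
x1 = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
x2 = [5.5, 7.5, 4.0, 9.0, 9.0, 7.0, 5.0, 8.0, 7.5, 8.0]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]

# Design matrix X with a leading column of ones for the intercept.
X = [[1.0, a, b] for a, b in zip(x1, x2)]
k = len(X[0])

# X'X and X'Y, built entry by entry.
XtX = [[sum(row[r] * row[c] for row in X) for c in range(k)] for r in range(k)]
XtY = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(k)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right-hand half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


XtX_inv = invert(XtX)

# b = (X'X)^{-1} X'Y, Equation (6.3).
b = [sum(XtX_inv[r][c] * XtY[c] for c in range(k)) for r in range(k)]

print("X'X =", XtX)
print("X'Y =", XtY)
print("b   =", [round(v, 4) for v in b])
```

Working in full precision, rather than with the four-decimal inverse shown above, should reproduce the coefficients of the example to the accuracy quoted there.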

1.2 Interpreting the Coefficients of the Linear Model


We will use Example 6.1 to demonstrate how to interpret regression coefficients.
Without loss of generality, suppose that the general model in terms of two
variables is
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
Suppose also that $x_1 = 0$ and $x_2 = 0$. Then
$$Y = \beta_0 + \beta_1(0) + \beta_2(0) = \beta_0$$
So $\beta_0$ is the average amount of time spent on the given job when there is no worker
and, of course, no years of experience. The parameter $\beta_0$ is the intercept in the
regression model. Notice that this interpretation is not practical, since a job can
only be done when there are hands engaged on it. Thus, $\beta_0$, and indeed other parameters,
may not have a practical interpretation when the situation being described is not likely
to occur.

To interpret $\beta_1$ and $\beta_2$, suppose that in the first instance the number of workers
involved is $c$ with average experience $d$. Then the average number of hours taken to do the job is
$$Y_{x_1 = c} = \beta_0 + \beta_1 c + \beta_2 d$$

Again, let the next group be made up of $c + 1$ members with average experience $d$. Then
the average time taken on the job is
$$Y_{x_1 = c+1} = \beta_0 + \beta_1 (c + 1) + \beta_2 d$$
The difference between the two is
$$Y_{x_1 = c+1} - Y_{x_1 = c} = \beta_1$$

Thus, $\beta_1$ is interpreted as the change in the average time on the job if one
additional worker is engaged (that is, for a unit increase in the number of workers), whilst
holding the value of the other variable unchanged. Therefore, for our example, if one
additional person is engaged without changing the original average
experience of the group, then the time taken to complete the job is estimated to reduce by
0.5861 hours (about 35 minutes).

Now suppose that the number of workers involved is $q$ with average experience $r$.
Then the average number of hours taken to do the job is
$$Y_{x_2 = r} = \beta_0 + \beta_1 q + \beta_2 r$$

Again, let the next group be made up of $q$ members with average experience $r + 1$.
Then the average time taken on the job is
$$Y_{x_2 = r+1} = \beta_0 + \beta_1 q + \beta_2 (r + 1)$$

The difference between the two is
$$Y_{x_2 = r+1} - Y_{x_2 = r} = \beta_2$$

Thus, $\beta_2$ is interpreted as the change in the average time on the job if one
additional year of experience is gained by the workers engaged, whilst holding the
number of workers unchanged. Therefore, for our example, if the group of workers
gains one additional year of average experience, whilst their number remains unchanged, then
the time taken to complete the job is estimated to reduce by 1.4129 hours.
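The "one unit change, other variable held fixed" reading of the coefficients can be illustrated numerically. The sketch below uses the fitted equation of Example 6.1; the function name `predicted_hours` and the particular values chosen for the group size and experience are illustrative assumptions only.

```python
# Fitted coefficients from Example 6.1.
b0, b1, b2 = 19.8875, -0.5861, -1.4129


def predicted_hours(size, experience):
    """Estimated mean time (hours) for a group of given size and average experience."""
    return b0 + b1 * size + b2 * experience


# Illustrative values for c (group size) and d (average experience); not from the text.
c, d = 6, 7.0

# Add one worker while holding average experience fixed.
effect_of_worker = predicted_hours(c + 1, d) - predicted_hours(c, d)
# Add one year of average experience while holding group size fixed.
effect_of_year = predicted_hours(c, d + 1) - predicted_hours(c, d)

print(f"one more worker, same experience: {effect_of_worker:+.4f} hours")
print(f"one more year, same group size:   {effect_of_year:+.4f} hours")
print(f"b1 expressed in minutes:          {b1 * 60:+.1f}")
```

Whatever starting values are chosen for `c` and `d`, the two differences equal $b_1$ and $b_2$ respectively, which is exactly the argument made above.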


Self-Assessment Questions
Exercise 6.1
1. Refer to the data in Exercise 4.1, Question 2.
Suppose that the rating of the house is also believed to influence the price of the
house. Each house is therefore rated based on its ‘pleasant appearance’ on a scale
of 1 – 10, where 1 represents worst and 10 represents best. The data including the
new variable is presented in Table 6.2.
Table 6.2: Data on sales price of houses and
influential variables
Sales price Home Size Pleasant Rating
360 23 5
196.2 11 2
346.2 20 9
273 17 3
282 15 8
331.8 21 4
387 24 7
255.6 13 6
327 19 7
345 25 2
(a) Obtain the matrix X′X and its inverse.
(b) Hence, find the regression coefficients vector β and write down the regression
model for y.
(c) Interpret all the coefficients of your model.

SESSION 2: INFERENCE ABOUT THE MULTIPLE LINEAR REGRESSION

You will recall that to examine the quality of a simple linear regression
model, we used the coefficient of determination and the standard error.
In this session we extend the assessment of the quality of the simple
linear regression to the case of multiple regression.

Objectives
By the end of this session, you should be able to:
1. assess a derived regression model by using various measures of quality.
2. assess the quality of a regression model by conducting relevant statistical tests.

Now read on…

2.1 The Standard Error of Estimate and the Multiple Coefficient of Determination

The Standard Error of Estimate
An unbiased estimator of the population variance $\sigma^2$ is the mean square error
(MSE), which is given as
$$S^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - (p + 1)}$$
In matrix notation, the MSE may be written as
$$S^2 = \frac{\mathbf{Y}'\mathbf{Y} - \mathbf{b}'\mathbf{X}'\mathbf{Y}}{n - (p + 1)} \tag{6.5}$$
The standard error of the estimate of the regression model is the square root of the
expression in (6.5).
For example, for the problem in Example 6.1, we obtain
$$\mathbf{Y}'\mathbf{Y} = 486, \quad \mathbf{b}'\mathbf{X}'\mathbf{Y} = 470.6175$$
Therefore,
$$S^2 = \frac{\mathbf{Y}'\mathbf{Y} - \mathbf{b}'\mathbf{X}'\mathbf{Y}}{n - 3} = \frac{486 - 470.6175}{7} = \frac{15.3825}{7} = 2.1975$$
Therefore, the standard error of the estimate of the regression model is $S = \sqrt{2.1975} = 1.4824$.

Multiple Coefficient of Determination

The multiple coefficient of determination is an extension of the simple coefficient of
determination to the case of multiple regression. It is given by
$$r^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
which is the ratio of the explained variation to the total variation. It measures the
proportion of the total variation in the $n$ values of $Y$ that is explained by the overall
regression model. The value $r = \sqrt{r^2}$ is the multiple correlation coefficient.
Recall that in matrix form, the SSR and SST are given, respectively, by
$$SSR = \mathbf{b}'\mathbf{X}'\mathbf{Y} - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$
$$SST = \mathbf{Y}'\mathbf{Y} - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$

Example 6.2
Refer to Example 6.1.
Find the multiple coefficient of determination. Comment on your result.

Solution
From Example 6.1, $\mathbf{b}'\mathbf{X}'\mathbf{Y} = 470.6175$ and $\sum_{i=1}^{n} y_i = 60$. Hence
$$SSR = 470.6175 - \frac{1}{10}(60)^2 = 110.6175$$
and
$$SST = 486.0 - \frac{1}{10}(60)^2 = 126$$
Therefore, the multiple coefficient of determination is
$$r^2 = \frac{110.6175}{126} = 0.8779$$

Thus, the entire model explains 87.79 percent of variation in the amount of time taken
to perform a task.
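The two quality measures of this section can also be computed directly from the fitted values rather than from the matrix shortcuts. The sketch below is a minimal illustration, assuming the data of Table 6.1 and the rounded coefficients of Example 6.1; the variable names are illustrative. Because the coefficients are rounded to four decimals, the results should agree with the worked values only to about three decimal places.

```python
import math

# Data from Table 6.1: group size (x1), average years of experience (x2), hours (y).
x1 = [5, 8, 4, 6, 10, 3, 5, 9, 10, 7]
x2 = [5.5, 7.5, 4.0, 9.0, 9.0, 7.0, 5.0, 8.0, 7.5, 8.0]
y = [10, 7, 13, 4, 1, 7, 8, 3, 2, 5]

# Rounded least squares estimates from Example 6.1.
b0, b1, b2 = 19.8875, -0.5861, -1.4129

n, p = len(y), 2
y_bar = sum(y) / n
y_hat = [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]

sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))  # unexplained variation
ssr = sum((fit - y_bar) ** 2 for fit in y_hat)             # explained variation
sst = sum((obs - y_bar) ** 2 for obs in y)                 # total variation

s = math.sqrt(sse / (n - (p + 1)))  # standard error of estimate
r2 = ssr / sst                      # multiple coefficient of determination

print(f"SSE = {sse:.4f}, SSR = {ssr:.4f}, SST = {sst:.4f}")
print(f"s = {s:.4f}, r^2 = {r2:.4f}")
```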

2.2 Assessing the Significance of the General Linear Model


To assess the usefulness of the model, we perform a test of the overall
significance of the model. This is called the overall F test. Let the estimated
regression equation for $Y$ in terms of $(x_1, x_2, x_3, \ldots, x_p)$ be given by
$$\hat{Y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$$
The test of significance of the model is based on the hypothesis
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$$
against the alternative
$$H_a: \beta_j \neq 0 \text{ for some } j = 1, 2, \ldots, p$$
The null hypothesis says that none of the variables $(x_1, x_2, x_3, \ldots, x_p)$ has an effect on the
response variable $Y$. The alternative hypothesis, on the other hand, suggests that at least
one of the variables is influential.
The test statistic for the test is given by
$$F = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 / p}{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 / [n - (p + 1)]}$$
That is, $F$ is the ratio of the mean square regression to the mean square error. The
statistic has the F distribution with $p$ and $n - (p + 1)$ degrees of freedom.
A large value of $F$ indicates that the model accounts for a large part of the variation in $Y$, implying
that the model is useful. Thus, at the 5 percent level of significance, we reject $H_0$ if $F$ exceeds the critical
value $f_{0.05,\, p,\, n-(p+1)}$.

Example 6.3
Refer to Example 6.1.
Test the significance of the overall model for determining the amount of time spent on
the job by the number of workers involved and their average years of experience.

Solution
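The test is based on the hypothesis
$$H_0: \beta_1 = \beta_2 = 0 \quad \text{against} \quad H_a: \beta_j \neq 0 \text{ for some } j = 1, 2$$
We know that the mean square error is
$$S^2 = \frac{\mathbf{Y}'\mathbf{Y} - \mathbf{b}'\mathbf{X}'\mathbf{Y}}{n - 3} = \frac{15.3825}{7} = 2.1975$$
and the mean square regression is
$$MSR = \frac{\mathbf{b}'\mathbf{X}'\mathbf{Y} - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2}{p} = \frac{110.6175}{2} = 55.3088$$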

The results are summarized in the ANOVA table in Table 6.3.

Table 6.3: ANOVA table for general linear model for amount of time spent on completing a task
Source of variation    df    SS          MS         F statistic
Regression             2     110.6175    55.3088    25.1690
Error                  7     15.3825     2.1975
Total                  9     126.0000
$F_{0.05,\, 2,\, 7} = 4.74$

Since $F = 25.1690 > F_{0.05,\, 2,\, 7} = 4.74$, the null hypothesis of no linear regression is
rejected, and we conclude that the time spent on a piece of job is linearly
influenced by the number of workers on the job and the group average experience.
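The arithmetic behind the ANOVA table can be sketched in a few lines. The values below are the sums of squares from Examples 6.2 and 6.3, and the critical value 4.74 is the one read from the F table above; the variable names are illustrative.

```python
# Quantities from Examples 6.2 and 6.3.
n, p = 10, 2
ssr = 110.6175   # regression sum of squares
sst = 126.0      # total sum of squares
sse = sst - ssr  # error sum of squares

msr = ssr / p                # mean square regression, p degrees of freedom
mse = sse / (n - (p + 1))    # mean square error, n - (p + 1) degrees of freedom
f_stat = msr / mse

# Critical value F(0.05; 2, 7) as read from the F table in the text.
f_crit = 4.74

print(f"MSR = {msr:.4f}, MSE = {mse:.4f}, F = {f_stat:.4f}")
print("reject H0" if f_stat > f_crit else "fail to reject H0")
```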

2.3 Assessing the Significance of Independent Variable


If a multiple linear model is significant, it does not mean that all the predictor variables
are also significant. It is therefore important to assess the relevance of each of the
predictor variables in the model. To assess the significance of $X_j$ $(j = 1, 2, \ldots, p)$, we
test the hypothesis
$$H_0: \beta_j = 0 \quad \text{against} \quad H_a: \beta_j \neq 0$$
The test statistic for the test of $H_0$ is given by
$$t = \frac{b_j - \beta_j}{se(b_j)} = \frac{b_j}{se(b_j)} \quad \text{under } H_0,$$
which has the t-distribution with $n - (p + 1)$ degrees of freedom. The standard error is
$se(b_j) = s\sqrt{c_{jj}}$, $j = 0, 1, 2, \ldots, p$, where $c_{jj}$ is as defined earlier. A large absolute value of the
statistic shows a departure of the estimate $b_j$ from the hypothesized value of zero. We
therefore reject $H_0$ in this case.

Example 6.4
Refer to Example 6.1.
Test the significance of each of the independent variables.

Solution
The model is obtained as
$$\hat{y} = 19.8875 - 0.5861x_1 - 1.4129x_2$$

The test of significance of $X_1$ is based on the hypothesis
$$H_0: \beta_1 = 0 \quad \text{against} \quad H_a: \beta_1 \neq 0$$
We have $b_1 = -0.5861$, $s = 1.4824$ and, from the matrix
$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} 2.0323 & -0.0024 & -0.2718 \\ -0.0024 & 0.0299 & -0.0281 \\ -0.2718 & -0.0281 & 0.0652 \end{pmatrix}$$
$c_{11} = 0.0299$. So we obtain the test statistic as
$$t = \frac{-0.5861}{1.4824\sqrt{0.0299}} = -2.2865$$
From the t-distribution table, we find $t_{0.025,\, 7} = 2.365$.
Since $|t| = 2.2865 < t_{0.025,\, 7} = 2.365$, we fail to reject the null hypothesis of no linear
effect of $X_1$. This implies that $X_1$ is not significant in the linear model. We conclude
that the time spent on a piece of job is not significantly influenced linearly by the
number of workers on the job.
The test of significance of $X_2$ is based on the hypothesis
$$H_0: \beta_2 = 0 \quad \text{against} \quad H_a: \beta_2 \neq 0$$
We have $b_2 = -1.4129$, $s = 1.4824$ and, from the $(\mathbf{X}'\mathbf{X})^{-1}$ matrix, $c_{22} = 0.0652$, so we
obtain the test statistic as
$$t = \frac{-1.4129}{1.4824\sqrt{0.0652}} = -3.7327$$
From the t-distribution table, we find $t_{0.025,\, 7} = 2.365$.
Since $|t| = 3.7327 > t_{0.025,\, 7} = 2.365$, we reject the null hypothesis of no linear effect of
$X_2$. This implies that $X_2$ is significant in the linear model. We conclude that the time
spent on a piece of job is significantly influenced linearly by the average years of
experience of the group of workers on the job.

Example 6.4 buttresses the point that if the model is significant, it does not mean that all
the predictor variables are significant. Notice also that X 1 is significant if used alone.
However, the presence of X 2 does not make X 1 relevant any more.
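The two t tests of Example 6.4 follow the same recipe, so they can be sketched as a short loop. The inputs are the rounded values quoted in the example, and the critical value 2.365 is the one read from the t table; the dictionary keys and variable names are illustrative.

```python
import math

s = 1.4824       # standard error of estimate (Section 2.1)
t_crit = 2.365   # t(0.025; 7) as read from the t table in the text

# For each predictor: (estimate b_j, diagonal element c_jj of the inverse of X'X).
coefficients = {
    "x1 (group size)": (-0.5861, 0.0299),
    "x2 (average experience)": (-1.4129, 0.0652),
}

results = {}
for name, (b_j, c_jj) in coefficients.items():
    se = s * math.sqrt(c_jj)   # se(b_j) = s * sqrt(c_jj)
    t = b_j / se               # test statistic under H0: beta_j = 0
    significant = abs(t) > t_crit
    results[name] = (se, t, significant)
    verdict = "reject H0" if significant else "fail to reject H0"
    print(f"{name}: se = {se:.4f}, t = {t:.4f}, {verdict}")
```

The decision rule compares the absolute value of $t$ with the table value, since the alternative hypothesis is two-sided.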

2.4 Confidence Interval for Regression Coefficients

From the discussion above, it is noted that the quantity
$$T = \frac{b_j - \beta_j}{se(b_j)}, \quad j = 0, 1, 2, \ldots, p$$
is a Student's t random variable with $n - (p + 1)$ degrees of freedom. Then a
$100(1 - \alpha)\%$ confidence interval for the parameter $\beta_j$ is given by
$$b_j \pm t_{\alpha/2,\, n-(p+1)}\, se(b_j), \quad j = 0, 1, 2, \ldots, p$$
Since $se(b_j) = s\sqrt{c_{jj}}$, $j = 0, 1, 2, \ldots, p$, the interval is
$$b_j \pm t_{\alpha/2,\, n-(p+1)}\, s\sqrt{c_{jj}}, \quad j = 0, 1, 2, \ldots, p \tag{6.6}$$

Example 6.5
Refer to Example 6.1.
Determine the 95 percent confidence interval for β1 and β 2 .
Comment on your result in each case.
Solution
For the problem in Example 6.1, it is known that the model is
$$\hat{Y} = 19.8875 - 0.5861x_1 - 1.4129x_2$$
and $s = 1.4824$. From the matrix
$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} 2.0323 & -0.0024 & -0.2718 \\ -0.0024 & 0.0299 & -0.0281 \\ -0.2718 & -0.0281 & 0.0652 \end{pmatrix}$$
$c_{11} = 0.0299$ and $c_{22} = 0.0652$, based on $n = 10$.
The 95 percent confidence interval for $\beta_1$ is given as
$$\begin{aligned}
b_1 \pm t_{0.025,\, 7}\, se(b_1) &= -0.5861 \pm 2.365\,(1.4824\sqrt{0.0299}) \\
&= -0.5861 \pm 0.6062 \\
&= (-1.1923,\ 0.0201)
\end{aligned}$$
The interval means that, under repeated sampling, 95 percent of the intervals constructed by
this method would contain the true value of $\beta_1$. Since the interval contains 0, the
hypothesis $H_0: \beta_1 = 0$ cannot be rejected, which implies that the variable $X_1$ (the number of workers
engaged) is not statistically significant in the determination of $Y$ (the amount of time
spent on the job).
The 95 percent confidence interval for $\beta_2$ is given as
$$\begin{aligned}
b_2 \pm t_{0.025,\, 7}\, se(b_2) &= -1.4129 \pm 2.365\,(1.4824\sqrt{0.0652}) \\
&= -1.4129 \pm 0.8952 \\
&= (-2.3081,\ -0.5177)
\end{aligned}$$

The interval means that, under repeated sampling, 95 percent of the intervals constructed by
this method would contain the true value of $\beta_2$. Since the interval does not contain 0,
the hypothesis $H_0: \beta_2 = 0$ is rejected, which implies that the variable $X_2$ (average years of
experience of workers engaged) is statistically significant in the determination of $Y$ (the
amount of time spent on the job).
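Interval (6.6) is simple enough to wrap in a small helper. The sketch below assumes the rounded inputs of Example 6.5 and the table value 2.365; the function name `confidence_interval` is an illustrative choice.

```python
import math


def confidence_interval(b_j, c_jj, s, t_crit):
    """Return (lower, upper) for b_j +/- t_crit * s * sqrt(c_jj), as in (6.6)."""
    margin = t_crit * s * math.sqrt(c_jj)
    return b_j - margin, b_j + margin


s = 1.4824       # standard error of estimate
t_crit = 2.365   # t(0.025; 7) as read from the t table in the text

ci_b1 = confidence_interval(-0.5861, 0.0299, s, t_crit)
ci_b2 = confidence_interval(-1.4129, 0.0652, s, t_crit)

for label, (low, high) in (("beta_1", ci_b1), ("beta_2", ci_b2)):
    contains_zero = low <= 0.0 <= high
    print(f"{label}: ({low:.4f}, {high:.4f}), contains 0: {contains_zero}")
```

Checking whether the interval contains 0 gives the same decision as the two-sided t test at the same level of significance.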

Self-Assessment Questions
Exercise 6.2

1. Refer to the data in Table 6.2 of Exercise 6.1.


(a) Calculate the multiple coefficient of determination. Comment on your result.
(b) Assess the significance of the overall model.
(c) Assess the significance of each of the independent variables.

2. Refer to the data in Table 6.2 of Exercise 6.1.


Construct the 95 percent confidence interval for the regression coefficient of the
independent variables. In each case, comment on your result.


SESSION 3: DUMMY VARIABLE REGRESSION


In all the problems we have considered until now, the predictor variables
used in the regression have been quantitative in nature. There are also
frequent practical problems which involve qualitative variables for which a well-defined
scale is not apparent. Some of these variables are marital status, sex, geographical
location and urban or rural area of residence. Since such qualitative variables are
important in predicting outcomes for regression models, we now examine a way to
include the levels of a qualitative independent variable in regression analysis. In this
case, we have what is referred to as dummy variable regression.

Objectives

By the end of this session, you should be able to:


1. Determine regression model in terms of indicator variables.
2. Interpret the coefficients of dummy variable regression

Now read on…

3.1 Indicator Variables


When we quantify the levels of a qualitative variable for regression, the variable is
referred to as a dummy or indicator variable. For illustration, consider data (in tens of
Ghana Cedis) on monthly income and expenditure of employed fresh graduates in a
certain region given in Table 6.3.

Table 6.3: Monthly Income and Expenditure of Employed Fresh Graduates
SN Income Expenditure Sex
1 220 150 M
2 277 180 M
3 303 150 F
4 326 195 M
5 332 210 F
6 333 125 F
7 339 186 F
8 344 258 M
9 346 223 M
10 352 198 F
11 356 260 M
12 413 255 M
13 418 214 M
14 419 197 M
15 431 200 F
16 431 183 F
17 434 229 M
18 555 294 M

Without using the sex of the individual, the regression model for expenditure in terms
of income is given in Table 6.4.
Table 6.4: Regression Model for Expenditure
Predictor Coef SE Coef T P
Constant 72.42 41.82 1.73 0.103
Income 0.3614 0.1114 3.24 0.005
S = 34.4081   R-Sq = 39.7%   R-Sq(adj) = 35.9%

Notice that this model explains only about 40 percent of variation in the data. In
addition, the standard error is high. These suggest that this model could be improved.


Suppose we have some reason to believe that the sex of the individual also plays a part
in determining expenditure. Our suspicion is based on the scatter plot of the data (see
Figure 6.1) using the level of sex (i.e., male and female) as the grouping variables. In
fact, we should say that we are using Male (M) and Female (F) as indicators of the
variable Sex.

Figure 6.1: Scatter plot of expenditure (Exp) against Income of fresh graduates, with points grouped by Sex (F, M)

The graph shows that males have the tendency to spend more than their female
counterparts. It implies therefore that the level of one's expenditure may be influenced by
one's sex. Based on this, we incorporate sex differences in the model for expenditure. In
order to put the non-quantitative variable, Sex, on an acceptable scale, we have to create
'indicator' variables for it. Since there are only two levels of sex, the inclination will be
to define two indicators. However, this would pose a problem for the inversion of the
matrix $\mathbf{X}'\mathbf{X}$, since the two indicator columns would add up to the column of ones in $\mathbf{X}$.
To avoid this, we create only one indicator, say for female, as
$$F = \begin{cases} 1, & \text{if the individual is a female} \\ 0, & \text{otherwise} \end{cases}$$

In general, if a qualitative variable has $m$ levels, it can be represented by $m - 1$ indicator
variables, each assigned the values 0 and 1.
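The inversion problem mentioned above can be seen concretely. The sketch below is an illustration only: it builds a design matrix with an intercept and both a female and a male indicator for the 18 graduates of Table 6.3 (income is left out to keep the matrix small) and checks that $\mathbf{X}'\mathbf{X}$ sends a non-zero vector to zero, which means it has no inverse. The variable names are assumptions.

```python
# Sex of the 18 graduates in Table 6.3, in serial-number order.
sex = list("MMFMFFFMMFMMMMFFMM")

# One row per person: intercept, female indicator, male indicator.
X = [[1, int(s == "F"), int(s == "M")] for s in sex]

# Every row satisfies 1 - F - M = 0, so the three columns are linearly dependent.
dependent = all(row[0] - row[1] - row[2] == 0 for row in X)

# X'X, built entry by entry.
XtX = [[sum(row[r] * row[c] for row in X) for c in range(3)] for r in range(3)]

# X'X maps the non-zero vector (1, -1, -1) to the zero vector, so it is singular.
v = [1, -1, -1]
XtX_v = [sum(XtX[r][c] * v[c] for c in range(3)) for r in range(3)]

print("columns dependent:", dependent)
print("X'X =", XtX)
print("X'X v =", XtX_v)
```

Dropping one of the two indicators removes the dependence, which is why a variable with $m$ levels is coded with only $m - 1$ indicators.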

The vector $\mathbf{Y}$, the matrix $\mathbf{X}$ and the other quantities needed for fitting the model are
$$\mathbf{Y} = (150,\ 180,\ 150,\ 195,\ 210,\ 125,\ 186,\ 258,\ 223,\ 198,\ 260,\ 255,\ 214,\ 197,\ 200,\ 183,\ 229,\ 294)'$$
$$\mathbf{X} = \begin{pmatrix}
1 & 220 & 0 \\
1 & 277 & 0 \\
1 & 303 & 1 \\
1 & 326 & 0 \\
1 & 332 & 1 \\
1 & 333 & 1 \\
1 & 339 & 1 \\
1 & 344 & 0 \\
1 & 346 & 0 \\
1 & 352 & 1 \\
1 & 356 & 0 \\
1 & 413 & 0 \\
1 & 418 & 0 \\
1 & 419 & 0 \\
1 & 431 & 1 \\
1 & 431 & 1 \\
1 & 434 & 0 \\
1 & 555 & 0
\end{pmatrix}$$
$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} 18 & 6629 & 7 \\ 6629 & 2536697 & 2521 \\ 7 & 2521 & 7 \end{pmatrix}, \quad
\mathbf{X}'\mathbf{Y} = \begin{pmatrix} 3707 \\ 1399384 \\ 1252 \end{pmatrix}$$

The inverse matrix is
$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} 1.5648 & -0.0039 & -0.1434 \\ -0.0039 & 0.0000 & 0.0001 \\ -0.1434 & 0.0001 & 0.2356 \end{pmatrix}$$

Therefore, the regression coefficients are obtained as
$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{Y}) = \begin{pmatrix} 98.2535 \\ 0.3345 \\ -39.8716 \end{pmatrix}$$

Let us test the significance of the indicator variable, $F$.

The hypothesis is that
$$H_0: \beta_2 = 0 \quad \text{against} \quad H_a: \beta_2 \neq 0$$

The mean square error for the model is
$$S^2 = \frac{\mathbf{Y}'\mathbf{Y} - \mathbf{b}'(\mathbf{X}'\mathbf{Y})}{n - (p + 1)} = \frac{794839 - 782430}{18 - 3} = \frac{12409}{15} = 827.2994$$
Therefore, the standard error of the estimate of $y$ is $S = 28.7628$, and the test statistic is
$$t = \frac{b_2}{S\sqrt{c_{33}}} = \frac{-39.8716}{28.7628\sqrt{0.2356}} = -2.8559$$
Since $|t| = 2.8559 > t_{0.025,\, 15} = 2.131$, we reject $H_0$ and conclude that the indicator variable is
significant in the model.
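The hand computation for the dummy variable model can be reproduced in plain Python. The sketch below does not invert the $3 \times 3$ matrix; instead it uses the deviation (corrected sums of squares) form of the normal equations for two predictors, which is algebraically equivalent to Equation (6.3) but is not the layout used in the text. The data are those of Table 6.3 with female coded 1 and male coded 0; all function and variable names are illustrative.

```python
import math

# Table 6.3 data (tens of Ghana Cedis); female = 1 for F, 0 for M.
income = [220, 277, 303, 326, 332, 333, 339, 344, 346,
          352, 356, 413, 418, 419, 431, 431, 434, 555]
expenditure = [150, 180, 150, 195, 210, 125, 186, 258, 223,
               198, 260, 255, 214, 197, 200, 183, 229, 294]
female = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0]

n, p = len(expenditure), 2


def mean(values):
    return sum(values) / len(values)


def corrected_sum(u, v):
    """Sum of products of the deviations of u and v from their means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


s11 = corrected_sum(income, income)
s22 = corrected_sum(female, female)
s12 = corrected_sum(income, female)
s1y = corrected_sum(income, expenditure)
s2y = corrected_sum(female, expenditure)

# Two-predictor least squares estimates in deviation form.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(expenditure) - b1 * mean(income) - b2 * mean(female)

# Standard error of estimate and t statistic for the indicator.
residuals = [e - (b0 + b1 * x + b2 * f)
             for e, x, f in zip(expenditure, income, female)]
sse = sum(r ** 2 for r in residuals)
s = math.sqrt(sse / (n - (p + 1)))
c33 = s11 / det  # diagonal element of the inverse of X'X belonging to the indicator
t = b2 / (s * math.sqrt(c33))

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.4f}")
print(f"s = {s:.4f}, c33 = {c33:.4f}, t = {t:.4f}")
```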

Using MINITAB, we obtain the summary of the regression analysis, in addition to
the test, as shown in Table 6.5.

Table 6.5: Regression Model for Expenditure

Predictor    Coef      SE Coef    T        P
Constant     97.36     35.11      2.77     0.014
Income       0.3369    0.0912     3.69     0.002
F            −40.98    13.62      −2.86    0.009
S = 28.0653   R-Sq = 62.4%   R-Sq(adj) = 57.4%

Analysis of Variance
Source df SS MS F P
Regression 2 18994.0 9791.8 12.43 0.001
Residual Error 15 12409.0 827.299
Total 17 31403.0

Notice that sex and income are both significant in the model. Again, the inclusion of sex
has improved the quality of the model. This shows that sex differences for expenditure
exist and should not be disregarded. Notice also that the results from the use of table
values are confirmed by the p-values associated with the tests.

3.2 Interpretation of Dummy Variable Regression Coefficients and Confidence Intervals

The model in Table 6.5 is given as
$$\text{Exp} = 97.36 + 0.3369\,\text{Income} - 40.98\,F$$
which is of the form
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
We can deduce a separate model for each level of sex from this joint model. Consider
the expenditure for males. For this level, the indicator $x_2 = 0$, and for any amount $h$ of income
the model for the average expenditure is of the form
$$\mu_m = b_0 + b_1 h$$
which is a straight line with slope $b_1$ and intercept $b_0$.
For the expenditure for females, the indicator $x_2 = 1$, and for any amount $h$ of income
the model for the average expenditure is of the form
$$\mu_f = b_0 + b_1 h + b_2 = (b_0 + b_2) + b_1 h$$
which is a straight line with slope $b_1$ and intercept $(b_0 + b_2)$.

Therefore, $b_1$ represents the increase in average expenditure for a unit increase in income,
irrespective of the sex of the individual. Thus, in our illustration, for a one-Cedi increase in
income, expenditure would increase by about 34 pesewas (GHp34).
It can be observed that
$$\mu_f - \mu_m = b_2$$
Thus, $b_2$ represents the amount by which the average expenditure of a female exceeds (or falls
short of) that of a male with the same income. In our illustration, for the same amount of income, expenditure
of females falls short of that of males by 40.98 in the units of Table 6.3 (tens of Ghana Cedis),
that is, by about GH¢409.80.

Notice that $b_0$ is the intercept of the model for males. It implies that $b_0$ (= 97.36, in tens of Ghana Cedis)
represents the average expenditure for a male when no income is earned yet. For
females with no income, the average expenditure would be $97.36 - 40.98 = 56.38$ in the same units.
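The idea that one joint model yields two parallel lines can be sketched as follows. The coefficients are those reported in Table 6.5, in the units stated for Table 6.3 (tens of Ghana Cedis); the function name `mean_expenditure` and the income levels tried are illustrative assumptions.

```python
# Coefficients reported in Table 6.5 (units: tens of Ghana Cedis, as for Table 6.3).
b0, b1, b2 = 97.36, 0.3369, -40.98


def mean_expenditure(income, female):
    """Average expenditure from the joint model; female is 1 for F and 0 for M."""
    return b0 + b1 * income + b2 * female


# Separate intercepts implied by the joint model; both lines share the slope b1.
male_intercept = b0
female_intercept = b0 + b2

# Illustrative income levels (not from the text): the gap is the same at each one.
gaps = []
for h in (300, 400, 500):
    gap = mean_expenditure(h, 1) - mean_expenditure(h, 0)
    gaps.append(gap)
    print(f"income {h}: male {mean_expenditure(h, 0):.2f}, "
          f"female {mean_expenditure(h, 1):.2f}, gap {gap:.2f}")

print(f"male intercept = {male_intercept:.2f}, female intercept = {female_intercept:.2f}")
```

Because the model contains no interaction between income and sex, the female line lies a constant $b_2$ below the male line at every income level.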


Self-Assessment Questions
Exercise 6.3

1. The data on profit, in GH¢y, of a certain small scale business establishment in the
xth year of its operation in Exercise 2 of Section 1 is given in the table.

x 1 2 3 4 5
y 125 140 165 195 230
(a) Find the coefficient of determination for the model obtained for profit in terms
of the year.
(b) Deduce the coefficient of correlation between profit and the year of operation.
(c) Comment on your values in (a) and (b).
2. The data on sales price of a house (in thousands of Ghana Cedis), y, and home size
(in tens of square feet), x, in Exercise 4.1, Question 2, is given in the table.

x 23 11 20 17 15 21 24 13 19 25
y 360 196.2 346.2 273 282 331.8 387 255.6 327 345

(a) Find the coefficient of determination for the model obtained for sales price in
terms of home size.
(b) Deduce the coefficient of correlation between sales price and home size.
(c) Comment on your values in (a) and (b).

3. Refer to the problem on Table 6.3.


Find the confidence interval for the coefficient of the indicator variable. Interpret
your result.


SESSION 4: POLYNOMIAL REGRESSION


Sometimes, we find it necessary to use another form of linear regression
model, one which is polynomial in nature. In this case, the relation between
the two variables is non-linear. The simplest type of non-linear relationship is one in which
the dependent variable is a polynomial function of the independent variable. In this
session, we will examine in particular the quadratic model, since it is frequently
encountered.

Objectives
By the end of this session, you should be able to:
1. Fit a polynomial model to a given data
2. Assess the suitability of the fitted model

Now read on…

4.1 Quadratic Regression Model


The simplest polynomial model is the quadratic model of the form
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \tag{6.5}$$

where the coefficients $\beta_r$ $(r = 0, 1, 2)$ are not all zero. The error term $\varepsilon$ describes the
effects on $y$ of all factors other than $x$ and $x^2$.
In a quadratic model, the mean of $y$ is either
1. increasing at an increasing rate as x increases;
2. increasing at a decreasing rate as x increases;
3. decreasing at an increasing rate as x increases; or
4. decreasing at a decreasing rate as x increases.

The numerical values of $\beta_r$ $(r = 0, 1, 2)$ determine exactly how the mean value of $y$
changes as $x$ increases. Observed values of $y$ would scatter around the curves shown in
Figure 6.2.

Figure 6.2: The mean value of y changing in a quadratic fashion as x increases. Panels (each plotting y against x):
(a) The mean of y increasing at an increasing rate as x increases
(b) The mean of y increasing at a decreasing rate as x increases
(c) The mean of y decreasing at an increasing rate as x increases
(d) The mean of y decreasing at a decreasing rate as x increases

Note that although the relationship between $y$ and $x$ is not linear because of the squared
term $x^2$, the model is a linear model. This is because the expression
$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$
determines $y$ as a linear function of the parameters $\beta_r$ $(r = 0, 1, 2)$. Generally, as long
as the mean of $y$ is a linear function of the regression parameters, we are dealing with a
linear regression model.
The appropriateness of a quadratic model is informed by a scatter plot of $y$ versus $x$. For
example, in the problem in Exercise 4.1, Questions 3 and 4, suppose that a variable on
the expenditure on advertisement ($X_3$) is included to determine the demand for
Toothgate (see Exercise 6.5, Question 2). A plot of demand versus Ad. expenditure is
shown in Figure 6.3. The graph also shows the fitted quadratic curve.

Figure 6.3: Scatter plot showing a quadratic relationship between Demand and Ad. Expenditure (AdExp), with the fitted quadratic curve

We see that a quadratic curve appears to fit the scatter plot quite well. This quite clearly
suggests that a quadratic model might be suitable for the data.

We can use the least squares method to determine the values of the coefficients
$\beta_r$ $(r = 0, 1, 2)$.
Let the error sum of squares be given as
$$Q = \sum_{i=1}^{n} (b_0 + b_1 x_i + b_2 x_i^2 - y_i)^2$$

Differentiating the quantity $Q$ partially with respect to each $b_r$ $(r = 0, 1, 2)$ and
equating to zero, we obtain the normal equations:
$$\frac{\partial Q}{\partial b_0} = 2\sum_{i=1}^{n} (b_0 + b_1 x_i + b_2 x_i^2 - y_i) = 0
\;\Rightarrow\; n b_0 + b_1 \sum x_i + b_2 \sum x_i^2 = \sum y_i$$
$$\frac{\partial Q}{\partial b_1} = 2\sum_{i=1}^{n} (b_0 + b_1 x_i + b_2 x_i^2 - y_i)\, x_i = 0
\;\Rightarrow\; b_0 \sum x_i + b_1 \sum x_i^2 + b_2 \sum x_i^3 = \sum x_i y_i$$
$$\frac{\partial Q}{\partial b_2} = 2\sum_{i=1}^{n} (b_0 + b_1 x_i + b_2 x_i^2 - y_i)\, x_i^2 = 0
\;\Rightarrow\; b_0 \sum x_i^2 + b_1 \sum x_i^3 + b_2 \sum x_i^4 = \sum x_i^2 y_i$$
The system of equations is thus obtained in matrix form as
$$\begin{pmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix}
= \begin{pmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{pmatrix} \tag{6.7}$$

Solving the system, we obtain the least squares regression coefficients.

Let us consider another example.

Example 6.6
Refer to the problem in Exercise 4.1, Question 1, which is on the profit, in GH¢y, of a
certain small scale business establishment in the xth year of its operation. The data is
given in the table.

x 1 2 3 4 5
y 125 140 165 195 230
(a) Fit a least squares quadratic regression model to the data
(b) Obtain a plot of the data with the fitted quadratic curve. Comment on the graph.
(c) Find the coefficient of multiple determination of the model.
(d) Find also the standard error of the estimate of the model.

Solution
(a) Since the data involve large values, let us use the transformation in Example 4.10.
That is,

$$x = 3 + u \ \text{ and } \ y = 165 + 5v \;\Rightarrow\; u = x - 3 \ \text{ and } \ v = \frac{y - 165}{5}$$


In order to solve the normal equations, we generate the relevant values in the table
shown.

u v uv u2 u 2v u3 u4
−2 −8 16 4 − 32 −8 16
−1 −5 5 1 −5 −1 1
0 0 0 0 0 0 0
1 6 6 1 6 1 1
2 13 26 4 52 8 16
Σ 0 6 53 10 21 0 34

The matrix equation in (6.7) in terms of $u$ and $v$ is

$$\begin{pmatrix} 5 & 0 & 10 \\ 0 & 10 & 0 \\ 10 & 0 & 34 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \end{pmatrix}
=
\begin{pmatrix} 6 \\ 53 \\ 21 \end{pmatrix}$$

From the equation, we see that

$$5a_0 + 10a_2 = 6, \quad 10a_1 = 53, \quad \text{and} \quad 10a_0 + 34a_2 = 21$$

Solving gives $a_0 = -0.0857$, $a_1 = 5.3$, and $a_2 = 0.6429$.


Therefore, the quadratic regression equation of $v$ on $u$ is

$$v = -0.0857 + 5.3u + 0.6429u^2$$

Changing back to the original variables in $x$ and $y$ we obtain

$$\frac{y - 165}{5} = -0.0857 + 5.3(x - 3) + 0.6429(x - 3)^2$$

Simplifying gives

$$y = 114.0 + 7.213x + 3.215x^2$$
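
The hand solution of part (a) can be cross-checked numerically. The following is a minimal sketch in plain Python, not part of the original course material: the helper names `solve_linear_system` and `quadratic_normal_equations` are illustrative assumptions. It builds the three normal equations from the data, once for the coded variables u and v and once for the original x and y, and solves them by elimination. The coefficients it prints should agree with the hand calculation up to rounding in the last digit.

```python
# Least squares quadratic fit for the profit data of Example 6.6,
# obtained by building and solving the three normal equations.

def solve_linear_system(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [p - factor * q for p, q in zip(m[r], m[col])]
    # The matrix is now diagonal, so each unknown is read off directly.
    return [m[i][n] / m[i][i] for i in range(n)]

def quadratic_normal_equations(x, y):
    """Return the coefficient matrix and right-hand side of the normal equations."""
    s = [sum(xi ** k for xi in x) for k in range(5)]                # sums of x^0 .. x^4
    t = [sum((xi ** k) * yi for xi, yi in zip(x, y)) for k in range(3)]
    a = [[s[0], s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]
    return a, t

x = [1, 2, 3, 4, 5]
y = [125, 140, 165, 195, 230]

# Coded variables u = x - 3 and v = (y - 165) / 5, as in the solution.
u = [xi - 3 for xi in x]
v = [(yi - 165) / 5 for yi in y]
a0, a1, a2 = solve_linear_system(*quadratic_normal_equations(u, v))

# Direct fit on the original variables, for comparison.
b0, b1, b2 = solve_linear_system(*quadratic_normal_equations(x, y))

print(f"coded:    a0 = {a0:.4f}, a1 = {a1:.4f}, a2 = {a2:.4f}")
print(f"original: b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Working with the coded variables keeps the sums small enough for hand calculation; the direct fit shows that the coding does not change the fitted curve.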


(b) The quadratic curve relating profit to year of operation is given in Figure 6.4.

Figure 6.4: A fitted quadratic model for profit versus year of business operation
(vertical axis: Profit; horizontal axis: Year of operation)

From the graph, the fitted curve passes through all the points almost exactly. This
indicates that there is an almost perfect quadratic relationship between profit and year
of business operation.

(c) Using the quadratic equation, the fitted values are shown along with the original
values of y

SN y ŷ
1 125 124.429
2 140 141.286
3 165 164.571
4 195 194.286
5 230 230.429

From the table,

$$\sum_{i=1}^{n} y_i = 855.00, \quad \sum_{i=1}^{n} y_i^2 = 153375.00 \quad \text{and} \quad \sum_{i=1}^{n} \hat{y}_i^2 = 153372.00$$


The variations are:

$$s_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = 153375.00 - \frac{1}{5}(855)^2 = 7170$$

$$s_{\hat{y}\hat{y}} = \sum_{i=1}^{n} \hat{y}_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = 153372.00 - \frac{1}{5}(855)^2 = 7167$$

The coefficient of multiple determination is

$$r^2 = \frac{s_{\hat{y}\hat{y}}}{s_{yy}} = \frac{7167}{7170} = 0.9996$$

Thus, the model accounts for 99.96 percent of the variation in profit.

(d) The unexplained variation is given by

$$SSE = \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} \hat{y}_i^2 = 153375.00 - 153372.00 = 3.00$$

Therefore, the standard error of the estimate of the quadratic model is

$$S = \sqrt{\frac{SSE}{n - 3}} = \sqrt{\frac{3.00}{2}} = 1.2247$$
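
Parts (c) and (d) can be checked from the residuals directly. The sketch below is plain Python with illustrative variable names (an assumption, not the text's notation); it takes the observed and fitted values from the table in part (c) and computes the unexplained variation as the sum of squared residuals. Computed this way the unexplained variation comes to about 2.86 rather than the rounded 3.00, so the standard error is about 1.20 instead of 1.2247, while the coefficient of determination still rounds to 0.9996.

```python
import math

# Observed profits and the fitted values listed in part (c) of Example 6.6.
y = [125, 140, 165, 195, 230]
y_hat = [124.429, 141.286, 164.571, 194.286, 230.429]

n = len(y)
n_coefficients = 3                      # b0, b1 and b2 in the quadratic model

# Total variation s_yy = sum(y^2) - (sum y)^2 / n.
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n

# Unexplained variation as the sum of squared residuals.
sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))

r_squared = 1 - sse / s_yy
standard_error = math.sqrt(sse / (n - n_coefficients))

print(f"s_yy = {s_yy:.2f}")
print(f"SSE  = {sse:.4f}")
print(f"r^2  = {r_squared:.4f}")
print(f"S    = {standard_error:.4f}")
```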

4.2 Least Squares Estimation of the Polynomial Regression


The general polynomial model of degree $p$ is of the form

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \varepsilon \tag{6.8}$$

where the coefficients $\beta_r$ $(r = 0, 1, 2, \ldots, p)$ are not all zero. If the degree is decided,
the coefficients may be determined by least squares estimation. For brevity, we will
write the fitted (average) value of $y$ in Equation (6.8) as

$$\hat{y} = \sum_{r=0}^{p} b_r x^r$$

The error sum of squares is given by

$$Q = \sum_{i=1}^{n} \left( y_i - \sum_{r=0}^{p} b_r x_i^r \right)^2$$

Differentiating partially with respect to $b_k$ $(k = 0, 1, 2, \ldots, p)$, we have

$$\frac{\partial Q}{\partial b_k} = \sum_{i=1}^{n} \left( y_i - \sum_{r=0}^{p} b_r x_i^r \right)\left(-2x_i^k\right)$$

Equating to zero,

$$\sum_{i=1}^{n} \left( \sum_{r=0}^{p} b_r x_i^{r+k} \right) = \sum_{i=1}^{n} x_i^k y_i \tag{6.9}$$

Equation (6.9) may further be written as

$$\sum_{i=1}^{n} x_i^k \hat{y}_i = \sum_{i=1}^{n} x_i^k y_i$$

Notice that in Equation (6.9), the left-hand side is the kth moment of the fitted
polynomial and the right-hand side is the kth moment of the data. Thus, fitting a
polynomial curve of degree $p$ to a set of data by the method of least squares is
equivalent to equating the moments of order $k = 0, 1, 2, \ldots, p$ of the polynomial to
those of the data.
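
The moment property can be illustrated numerically. The sketch below is plain Python under stated assumptions: the function names are hypothetical, and the profit data of Example 6.6 are reused with p = 2 purely as an illustration. It assembles the p + 1 equations of (6.9), solves them, and then compares the moments of the fitted values with those of the data.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [p - factor * q for p, q in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_polynomial(x, y, p):
    """Coefficients b_0 .. b_p obtained from the p + 1 equations (6.9)."""
    # s[k] is the sum of x_i^k; equation k uses s[k], s[k+1], ..., s[k+p].
    s = [sum(xi ** k for xi in x) for k in range(2 * p + 1)]
    a = [[s[r + k] for r in range(p + 1)] for k in range(p + 1)]
    rhs = [sum((xi ** k) * yi for xi, yi in zip(x, y)) for k in range(p + 1)]
    return solve_linear_system(a, rhs)

x = [1, 2, 3, 4, 5]
y = [125, 140, 165, 195, 230]
p = 2

b = fit_polynomial(x, y, p)
y_hat = [sum(br * xi ** r for r, br in enumerate(b)) for xi in x]

# kth moments of the fitted polynomial and of the data, k = 0 .. p.
fitted_moments = [sum((xi ** k) * fi for xi, fi in zip(x, y_hat)) for k in range(p + 1)]
data_moments = [sum((xi ** k) * yi for xi, yi in zip(x, y)) for k in range(p + 1)]

for k in range(p + 1):
    print(f"k = {k}: fitted {fitted_moments[k]:.4f}, data {data_moments[k]:.4f}")
```

The two columns agree for every k from 0 to p; moments of higher order are not constrained by the fit and need not match.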


Self-Assessment Questions
Exercise 6.4
1. Refer to the problem in Exercise 4.1, Question 1, which is on the profit, in GH¢y,
of a certain small scale business establishment in the xth year of its operation. The
data is given in the table.

x 1 2 3 4 5
y 125 140 165 195 230

(a) Test the significance of the quadratic term in the quadratic model.
(b) Obtain the analysis of variance for both linear and quadratic models.
(c) Obtain a plot of the data with the fitted linear and quadratic curves in separate
panels of the graph. Comment on the graphs.
(d) Which of the two models is more appropriate for the data? Explain.

2. Refer to the problem in Exercise 4.1, Questions 3 and 4. Suppose that a variable on
expenditure on advertisement (X3) is included to determine the demand for
Toothgate. The data on X3 along with the demand (Y) are given in the table shown.

Period x3 y Period x3 y
1 3.50 5.38 11 5.00 7.10
2 4.75 6.51 12 4.90 6.86
3 5.25 7.52 13 4.80 6.87
4 3.50 5.50 14 5.10 7.26
5 5.00 7.33 15 5.00 7.00
6 4.50 6.28 16 4.25 5.65
7 3.25 5.87 17 5.00 6.50
8 3.25 5.10 18 3.75 5.67
9 4.00 6.00 19 3.80 5.93
10 4.50 5.89 20 4.80 7.26

(a) Determine the least squares quadratic model for y in terms of x3.
(b) Find the multiple coefficient of determination of the model.
(c) Find also the standard error of the estimate of the model.



MULTIPLE LINEAR REGRESSION AND ANALYSIS UNIT 6
OF VARIANCE SESSION 5

SESSION 5: ONE-WAY ANALYSIS OF VARIANCE


The test of significance of the variance ratio, called the F test, is just one
application of the analysis of variance (ANOVA). In
ANOVA one seeks to separate the variation in a variable of interest attributable to one
group of causes from the variation attributable to other groups. In one-way ANOVA,
we assume that there is only one independent variable or factor that affects the response
variable. In this session and the next, we discuss the general concept of
separation of variations.

Objectives
By the end of this session, you should be able to:
1. perform a one-way analysis of variance on a suitable set of data;
2. determine the exact pairs of classes of samples that differ, after establishing
that such differences exist.

Now read on…

5.1 Data Representation and Fundamental Equation of One-Way ANOVA

We consider a random sample $(x_1, x_2, x_3, \ldots, x_n)$ of $n$ values of a given variable $X$.
Suppose that these $n$ values are classified into $m$ classes according to some criterion of
classification. Let the ith class contain $n_i$ elements. Thus, $\sum_{i=1}^{m} n_i = n$. Denote the jth
member of the ith class as $x_{ij}$. The layout of the sample is presented in Table 6.6.


Table 6.6: Layout of Data for ANOVA

Class   Element
        1        2        ...   j        ...   n_i
1       x_{11}   x_{12}   ...   x_{1j}   ...   x_{1n_1}
2       x_{21}   x_{22}   ...   x_{2j}   ...   x_{2n_2}
...     ...      ...            ...            ...
i       x_{i1}   x_{i2}   ...   x_{ij}   ...   x_{in_i}
...     ...      ...            ...            ...
m       x_{m1}   x_{m2}   ...   x_{mj}   ...   x_{mn_m}

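Let the mean of the ith class be

$$\bar{x}_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}$$

and the mean of the entire $n$ values be

$$\bar{x}_{..} = \frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij} = \frac{1}{n} \sum_{i=1}^{m} n_i \bar{x}_{i.}$$

When all the classes are of the same size, this reduces to $\bar{x}_{..} = \frac{1}{m} \sum_{i=1}^{m} \bar{x}_{i.}$.

Thus, for each class $i$, the sum of deviations from the class mean is

$$\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.}) = 0$$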

We consider the sum of squared deviations of all values from the general mean. For the ith class,

$$\begin{aligned}
\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2 &= \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.} + \bar{x}_{i.} - \bar{x}_{..})^2 \\
&= \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 + n_i(\bar{x}_{i.} - \bar{x}_{..})^2 + 2(\bar{x}_{i.} - \bar{x}_{..}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.}) \\
&= \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 + n_i(\bar{x}_{i.} - \bar{x}_{..})^2
\end{aligned}$$

Hence,

$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{m} n_i(\bar{x}_{i.} - \bar{x}_{..})^2 + \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 \tag{6.10}$$

Equation (6.10) is the fundamental equation of analysis of variance.

In Equation (6.10), $\sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2$ represents the total variation (SST).

The right-hand side shows a decomposition of the total variation into two components.
The first component, $\sum_{i=1}^{m} n_i(\bar{x}_{i.} - \bar{x}_{..})^2$, represents the variation between classes or treatments (SSTR).

If the values across classes vary widely, then the deviations of the class means from the
general mean are also expected to be large. A large value of the between-class variation
is thus a reflection of wide variability or heterogeneity across classes. Wide variability
is an indication of real differences between classes or treatments.

The second component, $\sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2$, represents the variation within classes (SSE).

If the values within each class do not vary widely, then the deviations of the class
values from their class mean are expected to be small. A small value of the within-class
variation is thus a reflection of low variability or homogeneity within classes.


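Define the following sums:

$$T_{i.} = \sum_{j=1}^{n_i} x_{ij} \quad \text{and} \quad T_{..} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij}$$

More convenient computational formulas for the sums of squares are

$$SST = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij}^2 - \frac{T_{..}^2}{n}$$

$$SSTR = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (\bar{x}_{i.} - \bar{x}_{..})^2 = \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{n}$$

$$SSE = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 = \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i}$$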
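
The fundamental equation and the computational formulas can be checked on any small data set. The sketch below is plain Python; the function name and the three classes of unequal size are hypothetical values chosen only for illustration. It computes the three sums of squares from their definitions and again from the totals, and confirms that the total variation splits into the between-class and within-class parts.

```python
def one_way_sums_of_squares(classes):
    """Return (SST, SSTR, SSE) twice: from the definitions and from the totals."""
    n = sum(len(c) for c in classes)
    grand_total = sum(sum(c) for c in classes)            # T..
    grand_mean = grand_total / n
    class_means = [sum(c) / len(c) for c in classes]

    # Definitional forms: squared deviations from the relevant means.
    sst = sum((x - grand_mean) ** 2 for c in classes for x in c)
    sstr = sum(len(c) * (mean - grand_mean) ** 2
               for c, mean in zip(classes, class_means))
    sse = sum((x - mean) ** 2
              for c, mean in zip(classes, class_means) for x in c)

    # Computational forms based on the class totals T_i. and the grand total T..
    sum_of_squares = sum(x * x for c in classes for x in c)
    correction = grand_total ** 2 / n
    between = sum(sum(c) ** 2 / len(c) for c in classes)
    shortcut = (sum_of_squares - correction,
                between - correction,
                sum_of_squares - between)
    return (sst, sstr, sse), shortcut

# Hypothetical data: three classes with 3, 2 and 4 values.
classes = [[4.0, 6.0, 5.0], [8.0, 9.0], [1.0, 2.0, 3.0, 2.0]]

(sst, sstr, sse), shortcut = one_way_sums_of_squares(classes)
print(f"SST = {sst:.4f}, SSTR = {sstr:.4f}, SSE = {sse:.4f}")
print(f"SSTR + SSE = {sstr + sse:.4f}")
print("shortcut forms:", tuple(round(v, 4) for v in shortcut))
```

Because the classes here have different sizes, the general mean must be the total divided by n; the identity would fail if the unweighted average of the class means were used instead.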

5.2 Testing the Hypothesis for One-Way ANOVA


In Table 6.6, each class constitutes a random sample from the parent population. We
wish to test the hypothesis

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_m \tag{6.11}$$

against

$$H_a: \mu_i \neq \mu_j \ \text{ for some } i \neq j$$

The null hypothesis says that the means of all classes are the same, meaning that there
are no real differences between classes. The alternative implies that some of the means
are not the same. If $H_0$ is false, then $x_{ij}$ is made up of the overall mean plus the effect
of the ith class and a random error. The mathematical model for $x_{ij}$ is given as

$$x_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, m; \ j = 1, 2, \ldots, n_i \tag{6.12}$$

The model given by Equation (6.12) is known as the fixed effect model. An equivalent
hypothesis to Equation (6.11) that involves the effect of the classes is given by

$$H_0: \tau_i = 0, \ \forall i \tag{6.13}$$

This says that there is no class or treatment effect on the responses, indicating that the $m$
population means are the same.


The components of the variations in Equation (6.10) are associated with respective
degrees of freedom. The SST has $n - 1$ degrees of freedom as a result of the constraint
$\sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..}) = 0$. Similarly, SSTR has $m - 1$ degrees of freedom. Then SSE has
$n - m$ degrees of freedom, since

$$df(SSE) = df(SST) - df(SSTR) = (n - 1) - (m - 1) = n - m$$

The mean squares of the variation components are obtained as

$$MSTR = \frac{SSTR}{m - 1} \quad \text{and} \quad MSE = \frac{SSE}{n - m}$$

Now, noting that, under $H_0$, $SSTR/\sigma^2$ and $SSE/\sigma^2$ are two independent chi-square random variables
with the respective degrees of freedom, the ratio

$$\frac{SSTR/(m - 1)}{SSE/(n - m)}$$

gives an appropriate statistic for testing $H_0$. Thus, the statistic for testing $H_0$ is given
by

$$F = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} (\bar{x}_{i.} - \bar{x}_{..})^2 / (m - 1)}{\sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 / (n - m)} \tag{6.14}$$

which has the F distribution with degrees of freedom $m - 1$ and $n - m$. We reject $H_0$
if $F > f_{\alpha,\, m-1,\, n-m}$.

Example 6.7
A car manufacturing company wants to determine the fuel consumption rate of its new
cars. In an experiment, each of three sets of five cars of the same brand is filled with
one of three types of fuel (Fuel A, Fuel B or Fuel C), and the distance covered before
refueling is recorded. Table 6.7 contains the distances covered in the experiment.


Table 6.7: Distance covered by cars on three fuel types

Fuel A   Fuel B   Fuel C
252.5    254.6    251.4
254.1    256.5    252.5
253.0    256.4    253.6
254.9    257.3    250.9
255.4    258.3    254.0

Test whether there are differences in consumption due to the three types of fuel.

Solution
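The null hypothesis is

$$H_0: \mu_A = \mu_B = \mu_C$$

against $H_a$: At least two of $\mu_A$, $\mu_B$ and $\mu_C$ differ.

Computing the three variations, we have

$$\begin{aligned}
SST &= \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij}^2 - \frac{T_{..}^2}{n} \\
&= 970550 - \frac{1}{15}(3815.40)^2 \\
&= 970550 - 970485.144 \\
&= 64.856
\end{aligned}$$

$$\begin{aligned}
SSTR &= \sum_{i=1}^{m} \sum_{j=1}^{n_i} (\bar{x}_{i.} - \bar{x}_{..})^2 = \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{n} \\
&= \frac{1}{5}(1269.90^2 + 1283.10^2 + 1262.40^2) - 970485.144 \\
&= 970529.076 - 970485.144 \\
&= 43.932
\end{aligned}$$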


$$\begin{aligned}
SSE &= \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2 = \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i} \\
&= 970550 - \frac{1}{5}(1269.90^2 + 1283.10^2 + 1262.40^2) \\
&= 970550 - 970529.076 \\
&= 20.924
\end{aligned}$$

Thus, the test statistic is

$$F = \frac{SSTR/(m - 1)}{SSE/(n - m)} = \frac{43.932/2}{20.924/12} = \frac{21.966}{1.7437} = 12.5976$$

From the F tables, $f_{0.05,\,2,\,12} = 3.89$.
Since $F = 12.5976 > f_{0.05,\,2,\,12} = 3.89$, we reject $H_0$. Therefore, we conclude that at
least two of the fuel types differ in performance.
The results are summarized in the ANOVA table.

Source of variation   df   SS       MS        F statistic
Treatment              2   43.932   21.966    12.5976
Error                 12   20.924    1.7437
Total                 14   64.856

$f_{0.05,\,2,\,12} = 3.89$
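
The arithmetic of Example 6.7 can be cross-checked with a short script. The sketch below is plain Python written for this purpose (the variable names are illustrative, not from the text); it works from deviations about the means rather than the shortcut formulas. With the data as printed in Table 6.7, the unrounded sum of squared observations is 970549.76, which the hand calculation rounds to 970550, so the script gives SST of about 64.616, SSE of about 20.684 and F of about 12.74 rather than 64.856, 20.924 and 12.5976. The between-treatment sum of squares and the decision to reject the null hypothesis are unchanged.

```python
# One-way ANOVA for the distances in Table 6.7, using the definitional
# sums of squares so that no intermediate total has to be rounded.

distances = {
    "Fuel A": [252.5, 254.1, 253.0, 254.9, 255.4],
    "Fuel B": [254.6, 256.5, 256.4, 257.3, 258.3],
    "Fuel C": [251.4, 252.5, 253.6, 250.9, 254.0],
}
F_CRITICAL = 3.89   # f(0.05; 2, 12), the table value quoted in the text

samples = list(distances.values())
m = len(samples)                               # number of classes
n = sum(len(s) for s in samples)               # total number of values
grand_mean = sum(sum(s) for s in samples) / n

# Between-class, within-class and total variation.
sstr = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)
sse = sum((x - sum(s) / len(s)) ** 2 for s in samples for x in s)
sst = sum((x - grand_mean) ** 2 for s in samples for x in s)

mstr = sstr / (m - 1)
mse = sse / (n - m)
f_stat = mstr / mse

print(f"SSTR = {sstr:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}")
print(f"MSTR = {mstr:.3f}, MSE = {mse:.4f}, F = {f_stat:.2f}")
print("reject H0" if f_stat > F_CRITICAL else "do not reject H0")
```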


5.3 Pair-wise Comparison


If the one-way ANOVA test leads to the rejection of $H_0$, there is the need to
investigate which pairs of treatments differ and to determine the extent of the differences.
This is done by what is called pair-wise comparison, in which two treatment means are
compared at a time. The hypothesis for the comparison is

$$H_0: \mu_i = \mu_k \quad \text{against} \quad H_a: \mu_i \neq \mu_k$$

By $H_0$, we mean that treatments i and k have the same effect on the mean response. By
$H_a$ we mean that i and k have different effects on the mean response. The test statistic
for the test is given as

$$T = \frac{\bar{x}_i - \bar{x}_k}{\sqrt{MSE\left(\dfrac{1}{n_i} + \dfrac{1}{n_k}\right)}} \tag{6.15}$$

and has the t distribution with $n - m$ degrees of freedom. The null hypothesis is rejected
if $|T|$ exceeds $t_{\alpha/2,\, n-m}$.

Alternatively, we can construct a $100(1 - \alpha)$ percent confidence interval (for example,
95 percent when $\alpha = 0.05$) for the difference $\mu_i - \mu_k$, given by

$$(\bar{x}_i - \bar{x}_k) \pm t_{\alpha/2,\, n-m} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_k}\right)}, \quad i \neq k, \ i, k = 1, 2, \ldots, m \tag{6.16}$$


Example 6.8
Refer to Example 6.7.
If differences in treatment effects are detected, carry out a follow-up test to determine
the types of treatment effects that differ.

Solution

The test in Example 6.7 concluded that differences exist between treatment effects.
Therefore we conduct a test of pair-wise comparison.

The 95 percent confidence interval for the difference between Fuel types A and B is
constructed as follows:

$$H_0: \mu_B - \mu_A = 0 \quad \text{against} \quad H_a: \mu_B - \mu_A \neq 0$$

Since $MSE = 1.7437$, $t_{0.025,\,12} = 2.179$, and $\bar{x}_A = 253.98$, $\bar{x}_B = 256.62$,
we obtain

$$(256.62 - 253.98) \pm 2.179\sqrt{1.7437\left(\frac{1}{5} + \frac{1}{5}\right)} = 2.64 \pm 1.8198 = [0.8202,\ 4.4598]$$

Since the interval does not contain the hypothesized value of $\mu_B - \mu_A = 0$, it implies
that the fuel types A and B differ in their effect on the distance covered. Because the
whole interval lies above zero, we are 95 percent confident that the mean distance
covered with Fuel B is higher than with Fuel A.

The 95 percent confidence interval for the difference between Fuel types A and C is
constructed as follows:

$$H_0: \mu_C - \mu_A = 0 \quad \text{against} \quad H_a: \mu_C - \mu_A \neq 0$$

Since $MSE = 1.7437$, $t_{0.025,\,12} = 2.179$, and $\bar{x}_A = 253.98$, $\bar{x}_C = 252.48$,
we obtain

$$(252.48 - 253.98) \pm 2.179\sqrt{1.7437\left(\frac{1}{5} + \frac{1}{5}\right)} = -1.5 \pm 1.8198 = [-3.3198,\ 0.3198]$$

Since the interval contains the hypothesized value of $\mu_C - \mu_A = 0$, it implies that the
fuel types A and C do not differ in their effects on the distance covered.

The 95 percent confidence interval for the difference between Fuel types B and C is
constructed as follows:

$$H_0: \mu_C - \mu_B = 0 \quad \text{against} \quad H_a: \mu_C - \mu_B \neq 0$$

Since $MSE = 1.7437$, $t_{0.025,\,12} = 2.179$, and $\bar{x}_B = 256.62$, $\bar{x}_C = 252.48$,
we obtain

$$(252.48 - 256.62) \pm 2.179\sqrt{1.7437\left(\frac{1}{5} + \frac{1}{5}\right)} = -4.14 \pm 1.8198 = [-5.9598,\ -2.3202]$$

Since the interval does not contain the hypothesized value of $\mu_C - \mu_B = 0$, it implies
that the fuel types B and C differ in their effects on distance covered. Because the
whole interval lies below zero, we are 95 percent confident that the mean distance
covered with Fuel B is higher than with Fuel C.
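
The three intervals of Example 6.8 follow one pattern, which the sketch below makes explicit. It is plain Python with illustrative names (an assumption, not the text's notation) and takes the class means, the mean square error and the tabulated t value exactly as quoted in the solution; nothing is looked up or re-estimated.

```python
import math
from itertools import combinations

# Quantities quoted in Example 6.8.
means = {"A": 253.98, "B": 256.62, "C": 252.48}
sizes = {"A": 5, "B": 5, "C": 5}
MSE = 1.7437
T_CRITICAL = 2.179          # t(0.025; 12) from the t tables

def pairwise_interval(first, second):
    """95 percent interval for mu_second - mu_first, as in Equation (6.16)."""
    difference = means[second] - means[first]
    margin = T_CRITICAL * math.sqrt(MSE * (1 / sizes[first] + 1 / sizes[second]))
    return difference - margin, difference + margin

intervals = {}
for first, second in combinations(means, 2):
    low, high = pairwise_interval(first, second)
    intervals[(first, second)] = (low, high)
    verdict = "differ" if not (low <= 0 <= high) else "do not differ"
    print(f"{second} - {first}: [{low:.4f}, {high:.4f}]  ->  {first} and {second} {verdict}")
```

An interval that excludes zero corresponds to rejecting the hypothesis of equal means for that pair at the 5 percent level.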

Graphical Display of Pair-wise Comparison


For the data in Table 6.7, a plot of the individual 95 percent confidence intervals for the
means, based on the pooled standard deviation, is given in the Minitab output shown.

                         Individual 95% CIs For Mean Based on Pooled StDev
Level  N    Mean  StDev  ----+---------+---------+---------+-----
A      5  253.98   1.23          (-----*-----)
B      5  256.62   1.36                       (-----*-----)
C      5  252.48   1.34  (-----*------)
                         ----+---------+---------+---------+-----
                           252.0     254.0     256.0     258.0


The plot clearly shows that the confidence interval for treatment B does not overlap
with either of the other two. This suggests that $\mu_B \neq \mu_A$ and $\mu_B \neq \mu_C$. However, there
is some overlap between the confidence intervals for treatments A
and C. Our computations confirm that the difference between these two is not
significant, and therefore A and C are not different.

Self-Assessment Questions
Exercise 6.5

1. Table 6.8 shows the yield for three hybrids of corn A, B, and C from an acre plot.
Perform a one-way analysis of variance and draw conclusions.

Table 6.8: Corn Type


A B C

74.0 65.6 80.6

71.3 62.3 77.3

76.7 64.1 79.0

72.2 63.2 79.4

73.1 65.0 78.2



MULTIPLE LINEAR REGRESSION AND ANALYSIS UNIT 6
OF VARIANCE SESSION 6

SESSION 6: TWO-WAY ANALYSIS OF VARIANCE


As you know, a response variable may be affected by more than one
factor. It is therefore necessary to determine the amount of
variation attributable to each of these influencing variables in relation to the total variation in the
response variable. In this session, we consider the case where there are two independent
variables, leading to two criteria of classification of the response variable.

Objectives
By the end of this session, you should be able to:
1. perform a simple two-way analysis of variance on a given suitable data;
2. perform a two-way analysis of variance with interaction on a given suitable data.

Now read on…

6.1 Data Layout and Fundamental Equation of Simple Two-Way ANOVA

We consider a random sample $(x_1, x_2, x_3, \ldots, x_N)$ of $N$ values of a given variable $X$.
Suppose that these $N$ values are classified according to some factor A into $m$ classes,
and, according to another factor B, into $n$ classes. Denote by $x_{ij}$ the value of $X$
belonging to the ith class of factor A and the jth class of factor B. The layout of the data
for two-way ANOVA is given in Table 6.9.


Table 6.9: Layout of Data for Two-way ANOVA

Factor A   Factor B
           1        2        ...   j        ...   n
1          x_{11}   x_{12}   ...   x_{1j}   ...   x_{1n}
2          x_{21}   x_{22}   ...   x_{2j}   ...   x_{2n}
...        ...      ...            ...            ...
i          x_{i1}   x_{i2}   ...   x_{ij}   ...   x_{in}
...        ...      ...            ...            ...
m          x_{m1}   x_{m2}   ...   x_{mj}   ...   x_{mn}

Note that in Table 6.9, we have assumed for now that there is only one response
value in the ith class of factor A and jth class of factor B.

Let the mean of the ith class of factor A be

$$\bar{x}_{i.} = \frac{1}{n} \sum_{j=1}^{n} x_{ij}$$

the mean of the jth class of factor B be

$$\bar{x}_{.j} = \frac{1}{m} \sum_{i=1}^{m} x_{ij}$$

and the mean of the entire $N = mn$ values be

$$\bar{x}_{..} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}$$

Using a partition similar to that in Session 5, the sum of squared deviations from the
general mean is given as

$$\begin{aligned}
\sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \bar{x}_{..})^2
&= \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ (\bar{x}_{i.} - \bar{x}_{..}) + (\bar{x}_{.j} - \bar{x}_{..}) + (x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}) \right]^2 \\
&= n \sum_{i=1}^{m} (\bar{x}_{i.} - \bar{x}_{..})^2 + m \sum_{j=1}^{n} (\bar{x}_{.j} - \bar{x}_{..})^2 + \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..})^2
\end{aligned} \tag{6.17}$$

Equation (6.17) is the fundamental equation of the two-way analysis of variance. In
Equation (6.17), $\sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \bar{x}_{..})^2$ represents the total variation (SST).

The right-hand side shows a decomposition of the total variation into three components.
The first component, $n \sum_{i=1}^{m} (\bar{x}_{i.} - \bar{x}_{..})^2$, represents the variation due to factor A (SSA).

If the values across the levels of factor A vary widely, then the deviations of the level
means from the general mean are also expected to be large. A large value of the between-
factor A variation is thus a reflection of wide variability or heterogeneity across levels
of the factor. This is an indication of real differences between levels of factor A.

The second component, $m \sum_{j=1}^{n} (\bar{x}_{.j} - \bar{x}_{..})^2$, represents the variation due to factor B (SSB).

If the values across the levels of factor B vary widely, then the deviations of the level
means from the general mean are also expected to be large. A large value of the between-
factor B variation is thus a reflection of wide variability or heterogeneity across levels
of the factor. This is an indication of real differences between levels of factor B.

The third component, $\sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..})^2$, represents the residual variation (SSE) after variations due to
factors A and B have been separated.
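
The reason only three squared terms survive in Equation (6.17) can be written out in one more step. In the sketch below the shorthand $a_i$, $b_j$ and $e_{ij}$ is introduced here for convenience and is not the text's notation: $a_i = \bar{x}_{i.} - \bar{x}_{..}$, $b_j = \bar{x}_{.j} - \bar{x}_{..}$ and $e_{ij} = x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}$.

```latex
\begin{aligned}
x_{ij} - \bar{x}_{..} &= a_i + b_j + e_{ij}, \\[4pt]
\sum_{i=1}^{m}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{..})^2
  &= n\sum_{i=1}^{m} a_i^2 + m\sum_{j=1}^{n} b_j^2 + \sum_{i=1}^{m}\sum_{j=1}^{n} e_{ij}^2
   + 2\sum_{i=1}^{m}\sum_{j=1}^{n} \left( a_i b_j + a_i e_{ij} + b_j e_{ij} \right), \\[4pt]
\sum_{i=1}^{m} a_i = 0, \quad \sum_{j=1}^{n} b_j &= 0, \quad
\sum_{j=1}^{n} e_{ij} = 0 \ \text{for each } i, \quad
\sum_{i=1}^{m} e_{ij} = 0 \ \text{for each } j .
\end{aligned}
```

Each cross-product sum contains one of these zero sums as a factor, so all three vanish and only the squared terms remain.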

6.2 Data Layout and Fundamental Equation of General Two-Way ANOVA

In the general two-way ANOVA, we consider a more practical situation in which the
values of the random sample $(x_1, x_2, x_3, \ldots, x_N)$ on the variable $X$ are classified according to some
factor A into $m$ classes, and, according to another factor B, into $n$ classes. In addition,
corresponding to the ith level of factor A and the jth level of factor B, there are $l$ replicate
values.
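Table 6.10: Layout of Data for General Two-way ANOVA

Factor A    Replicate   Factor B                                                  Row mean
                        1          2          ...   j          ...   n
1           1           x_{111}    x_{121}    ...   x_{1j1}    ...   x_{1n1}      x̄_{1.1}
            ...         ...        ...              ...              ...          ...
            k           x_{11k}    x_{12k}    ...   x_{1jk}    ...   x_{1nk}      x̄_{1.k}
            ...         ...        ...              ...              ...          ...
            l           x_{11l}    x_{12l}    ...   x_{1jl}    ...   x_{1nl}      x̄_{1.l}
            Mean        x̄_{11.}    x̄_{12.}    ...   x̄_{1j.}    ...   x̄_{1n.}      x̄_{1..}
...
i           1           x_{i11}    x_{i21}    ...   x_{ij1}    ...   x_{in1}      x̄_{i.1}
            ...         ...        ...              ...              ...          ...
            k           x_{i1k}    x_{i2k}    ...   x_{ijk}    ...   x_{ink}      x̄_{i.k}
            ...         ...        ...              ...              ...          ...
            l           x_{i1l}    x_{i2l}    ...   x_{ijl}    ...   x_{inl}      x̄_{i.l}
            Mean        x̄_{i1.}    x̄_{i2.}    ...   x̄_{ij.}    ...   x̄_{in.}      x̄_{i..}
...
m           1           x_{m11}    x_{m21}    ...   x_{mj1}    ...   x_{mn1}      x̄_{m.1}
            ...         ...        ...              ...              ...          ...
            k           x_{m1k}    x_{m2k}    ...   x_{mjk}    ...   x_{mnk}      x̄_{m.k}
            ...         ...        ...              ...              ...          ...
            l           x_{m1l}    x_{m2l}    ...   x_{mjl}    ...   x_{mnl}      x̄_{m.l}
            Mean        x̄_{m1.}    x̄_{m2.}    ...   x̄_{mj.}    ...   x̄_{mn.}      x̄_{m..}
Col. mean               x̄_{.1.}    x̄_{.2.}    ...   x̄_{.j.}    ...   x̄_{.n.}      x̄_{...}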

Thus, there are in total $N = lmn$ response values. Denote by $x_{ijk}$ the kth replicate value
of $X$ in the ith class of factor A and the jth class of factor B. The layout of the data for
the general two-way ANOVA is given in Table 6.10.
The treatments in the layout are the combinations of the levels of factor A and factor B. In
addition, we define $\mu_{ij}$ as the mean value of the response variable obtained using level i
of factor A and level j of factor B.
Of interest are the following sums of squares and associated degrees of freedom.

Table 6.11: Sums of Squares in Two-way ANOVA

Source of Variation           Sum of Squares                                                                                      df
Treatment (SSTR)              $l \sum_{i=1}^{m} \sum_{j=1}^{n} (\bar{x}_{ij.} - \bar{x}_{...})^2$                                 $mn - 1$
Factor A                      $nl \sum_{i=1}^{m} (\bar{x}_{i..} - \bar{x}_{...})^2$                                               $m - 1$
Factor B                      $ml \sum_{j=1}^{n} (\bar{x}_{.j.} - \bar{x}_{...})^2$                                               $n - 1$
Interaction between A and B   $l \sum_{i=1}^{m} \sum_{j=1}^{n} (\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}_{...})^2$  $(m - 1)(n - 1)$
Error                         $\sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{l} (x_{ijk} - \bar{x}_{ij.})^2$                          $mn(l - 1)$
Total                         $\sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{l} (x_{ijk} - \bar{x}_{...})^2$                          $lmn - 1$

From Table 6.11, the respective mean squares can be obtained by dividing each sum of
squares by its degrees of freedom.

The fundamental equation for the general case would, however, separate the total
variation into seven variation components. Can you identify the remaining two?


6.3 Testing the Hypothesis for General Two-Way ANOVA


The null hypothesis is
$H_0$: All treatment means are equal
against
$H_a$: At least two of the treatment means differ
The test statistic is the F statistic given by

$$F = \frac{SSTR/(mn - 1)}{SSE/\left(mn(l - 1)\right)}$$

which has the F distribution with $mn - 1$ and $mn(l - 1)$ degrees of
freedom. We reject $H_0$ if $F > f_{\alpha,\, mn-1,\, mn(l-1)}$.

If we reject $H_0$, then we conclude that factor A or factor B or the interaction between
them has a significant effect on the mean response. If this is the case, we proceed to test
the interaction effect in a similar way. If there is no interaction effect, then
the effect of changing the levels of factor A is the same for the different levels of factor B,
and vice versa. When this is the case, we proceed further to test the significance of
each of the factors, that is, to test for the main effects.

Example 6.9
Refer to Example 6.7.
Suppose that, in addition to filling each set of five cars with one of
the three types of fuel, each set must contain one car of each of the five brands that the
company manufactures. Determine the effect of the brand of car and the fuel type on the
distance covered by the fifteen cars.

Table 6.12: Distance covered by cars on three fuel types

Car Brand   Fuel Type
            Fuel A   Fuel B   Fuel C
1           252.5    254.6    251.4
2           254.1    256.5    252.5
3           253.0    256.4    253.6
4           254.9    257.3    250.9
5           255.4    258.3    254.0

Solution
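Let $x_{ij}$ be the distance covered by the car of the ith brand filled with the jth fuel type.
Thus, $i = 1, 2, 3, 4, 5$ and $j = 1, 2, 3$.
Notice that the values involved are large. So we transform the data by $x'_{ij} = x_{ij} - 250$;
subtracting a constant does not change any of the sums of squares.

We use the identities

$$\sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2 - N\bar{x}_{..}^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2 - \frac{T_{..}^2}{N}$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n} (\bar{x}_{i.} - \bar{x}_{..})^2 = \sum_{i=1}^{m} n_i (\bar{x}_{i.} - \bar{x}_{..})^2 = \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{N}$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n} (\bar{x}_{.j} - \bar{x}_{..})^2 = \sum_{j=1}^{n} n_j (\bar{x}_{.j} - \bar{x}_{..})^2 = \sum_{j=1}^{n} \frac{T_{.j}^2}{n_j} - \frac{T_{..}^2}{N}$$

Here $n_i = 3$ is the number of values for each car brand and $n_j = 5$ is the number of
values for each fuel type.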
We obtain the following values (with their squares in parentheses) and sums in the table, using our new origin.

Car Brand       Fuel A        Fuel B        Fuel C        T_{i.}   T_{i.}^2   Σ_j x_{ij}^2
1               2.5 (6.25)    4.6 (21.16)   1.4 (1.96)     8.5      72.25      29.37
2               4.1 (16.81)   6.5 (42.25)   2.5 (6.25)    13.1     171.61      65.31
3               3.0 (9.00)    6.4 (40.96)   3.6 (12.96)   13.0     169.00      62.92
4               4.9 (24.01)   7.3 (53.29)   0.9 (0.81)    13.1     171.61      78.11
5               5.4 (29.16)   8.3 (68.89)   4.0 (16.00)   17.7     313.29     114.05
T_{.j}          19.9          33.1          12.4          65.4     897.76     349.76
T_{.j}^2        396.01        1095.61       153.76
Σ_i x_{ij}^2    85.23         226.55        37.98                             349.76


From the table,

$$\sum_{i} \sum_{j} x_{ij}^2 = 349.76, \quad T_{..} = \sum_{j} T_{.j} = \sum_{i} T_{i.} = 65.4, \quad \sum_{j} T_{.j}^2 = 1645.38, \quad \sum_{i} T_{i.}^2 = 897.76$$

Thus,

$$SST = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2 - \frac{T_{..}^2}{N} = 349.76 - \frac{65.4^2}{15} = 349.76 - 285.144 = 64.616$$

$$SSA = \sum_{i=1}^{m} \frac{T_{i.}^2}{n_i} - \frac{T_{..}^2}{N} = \frac{897.76}{3} - \frac{65.4^2}{15} = 299.2533 - 285.144 = 14.1093$$

$$SSB = \sum_{j=1}^{n} \frac{T_{.j}^2}{n_j} - \frac{T_{..}^2}{N} = \frac{1645.38}{5} - \frac{65.4^2}{15} = 329.076 - 285.144 = 43.932$$

$$SSE = 64.616 - 14.1093 - 43.932 = 6.5747$$

The analysis of variance is given in the table shown.

Source of variation   df   SS        MS        F statistic
Between fuel type      2   43.932    21.966    26.7291
Between car brand      4   14.1093    3.5273    4.2922
Residual               8    6.5747    0.8218
Total                 14   64.616

$f_{0.05,\,2,\,8} = 4.46$, $f_{0.05,\,4,\,8} = 3.84$

Thus, there are significant differences in distance covered due to fuel type and due to
car brand at 5 percent significance level.
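
The two-way table of Example 6.9 can be reproduced from the raw distances. The sketch below is plain Python with illustrative names (assumptions, not the text's notation); it uses the definitional sums of squares on the untransformed data of Table 6.12, which also shows that shifting the origin by 250 leaves the sums of squares unchanged. The two critical values are the ones quoted from the F tables in the solution.

```python
# Two-way ANOVA without replication for Table 6.12.
# Rows: car brands 1 to 5; columns: Fuel A, Fuel B, Fuel C.
table = [
    [252.5, 254.6, 251.4],
    [254.1, 256.5, 252.5],
    [253.0, 256.4, 253.6],
    [254.9, 257.3, 250.9],
    [255.4, 258.3, 254.0],
]
F_CRITICAL_FUEL = 4.46      # f(0.05; 2, 8), as quoted in the text
F_CRITICAL_BRAND = 3.84     # f(0.05; 4, 8), as quoted in the text

m, n = len(table), len(table[0])
grand = sum(map(sum, table)) / (m * n)
brand_means = [sum(row) / n for row in table]
fuel_means = [sum(row[j] for row in table) / m for j in range(n)]

sst = sum((x - grand) ** 2 for row in table for x in row)
ss_brand = n * sum((a - grand) ** 2 for a in brand_means)    # SSA
ss_fuel = m * sum((b - grand) ** 2 for b in fuel_means)      # SSB
sse = sum(
    (table[i][j] - brand_means[i] - fuel_means[j] + grand) ** 2
    for i in range(m) for j in range(n)
)

mse = sse / ((m - 1) * (n - 1))
f_fuel = (ss_fuel / (n - 1)) / mse
f_brand = (ss_brand / (m - 1)) / mse

print(f"SST = {sst:.4f}, SSA = {ss_brand:.4f}, SSB = {ss_fuel:.4f}, SSE = {sse:.4f}")
print(f"F (fuel type) = {f_fuel:.3f}, F (car brand) = {f_brand:.3f}")
```

Small differences in the last decimal of the F statistics arise because the hand calculation rounds the residual mean square to 0.8218 before dividing.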

Example 6.10
Refer to Example 6.9.
Suppose that the manufacturer wants to be very sure of the effects of the two factors and
selects two cars of each brand in the test process. The resulting data is as shown.
Carry out analysis of variance test and draw conclusions.

Table 6.13: Distance covered by cars on three fuel types

Car Brand   Fuel Type
            Fuel A   Fuel B   Fuel C
1           252.5    254.6    251.4
            251.9    254.3    250.7
2           254.1    256.5    252.5
            255.1    255.9    252.2
3           253.0    256.4    253.6
            253.2    257.0    253.0
4           254.9    257.3    250.9
            254.5    257.1    250.5
5           255.4    258.3    254.0
            255.5    260.0    253.8

Solution

The results are summarized in the table shown.

Source of variation   df   SS        MS        F statistic
Between fuel type      2   101.953   50.9763   237.10
Between car brand      4    39.021    9.7553    45.37
Interaction            8    13.451    1.6813     7.82
Residual              15     3.225    0.2150
Total                 29   157.65

It can be verified that all three F statistics are significant. Thus, there are highly
significant differences in distance covered by the cars due to the type of fuel. There are also
significant differences due to the car brand and to the interaction between the car brand and fuel type.
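
The entries of the table in Example 6.10 can be obtained from the sums of squares listed in Table 6.11. The sketch below is plain Python with illustrative names (assumptions, not the text's notation); car brand plays the role of factor A with m = 5 levels, fuel type the role of factor B with n = 3 levels, and there are l = 2 replicates per cell.

```python
# Two-way ANOVA with replication for Table 6.13.
# cells[i][j] holds the l = 2 replicate distances for car brand i + 1 and
# fuel type j (0 = Fuel A, 1 = Fuel B, 2 = Fuel C).
cells = [
    [[252.5, 251.9], [254.6, 254.3], [251.4, 250.7]],
    [[254.1, 255.1], [256.5, 255.9], [252.5, 252.2]],
    [[253.0, 253.2], [256.4, 257.0], [253.6, 253.0]],
    [[254.9, 254.5], [257.3, 257.1], [250.9, 250.5]],
    [[255.4, 255.5], [258.3, 260.0], [254.0, 253.8]],
]
m, n, l = len(cells), len(cells[0]), len(cells[0][0])

def mean(values):
    values = list(values)
    return sum(values) / len(values)

grand = mean(x for row in cells for cell in row for x in cell)
brand_means = [mean(x for cell in row for x in cell) for row in cells]
fuel_means = [mean(x for row in cells for x in row[j]) for j in range(n)]
cell_means = [[mean(cell) for cell in row] for row in cells]

ss_brand = n * l * sum((a - grand) ** 2 for a in brand_means)
ss_fuel = m * l * sum((b - grand) ** 2 for b in fuel_means)
ss_inter = l * sum(
    (cell_means[i][j] - brand_means[i] - fuel_means[j] + grand) ** 2
    for i in range(m) for j in range(n)
)
ss_error = sum(
    (x - cell_means[i][j]) ** 2
    for i in range(m) for j in range(n) for x in cells[i][j]
)
ss_total = sum((x - grand) ** 2 for row in cells for cell in row for x in cell)

ms_error = ss_error / (m * n * (l - 1))
sources = {
    "fuel type": (n - 1, ss_fuel),
    "car brand": (m - 1, ss_brand),
    "interaction": ((m - 1) * (n - 1), ss_inter),
}
f_stats = {}
for source, (df, ss) in sources.items():
    f_stats[source] = (ss / df) / ms_error
    print(f"{source:12s} df = {df:2d}  SS = {ss:8.3f}  MS = {ss / df:8.4f}  F = {f_stats[source]:7.2f}")
print(f"{'residual':12s} df = {m * n * (l - 1):2d}  SS = {ss_error:8.3f}  MS = {ms_error:8.4f}")
print(f"{'total':12s} df = {m * n * l - 1:2d}  SS = {ss_total:8.3f}")
```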


Self-Assessment Questions
Exercise 6.6

1. Refer to Example 6.9.
Determine the exact pairs of levels of each factor that differ.

2. Refer to Example 6.10.
Carry out the details of the computations for all four sources of variation.

3. Conduct a two-way analysis of variance on the following hypothetical data on yield
of various hybrids of rice treated with two types of fertilizer.

Fertilizer   Rice Type
             A      B      C      D
P            18.4   21.8   23.8   22.1
             17.2   22.8   22.6   20.9
             17.8   22.3   23.2   21.5
Q            19.9   24.0   24.8   24.1
             19.4   24.6   24.2   22.3
             20.4   23.4   25.4   23.2



Session 4

1. z = −2.031 . Reject H 0 since z = −2.031 is less than − z 0.025 = −1.96 .

2. z = −1.414 . Fail to reject H 0 since z = −1.414 is greater than − z 0.05 = −1.645 .

Session 5

1. We have kα = 2 . Reject H 0 since x = 1 is less than kα = 2 .

2. We have k α = 0 . Fail to reject H 0 since x = 12 is not less than or equal to k α = 0 .


2 2

3. We have k α = 5 . Reject H 0 since x = 4 is less than k α = 5 .


2 2

Session 6

1. χ 2 = 5.9168 , so fail to reject H 0 since χ 2 = 5.9168 is not less than χ 02.95 = 2.733 .
2. χ 2 = 20.845 , so fail to reject H 0 since χ 2 = 20.845 is neither less than
χ 02.995 = 11.689 nor greater than χ 02.005 = 38.076 .

3. χ 2 = 27.887 , so fail to reject H 0 since χ 2 = 27.887 is not less than χ 02.95 = 13.121 .

4. χ 2 = 55.704 . Fail to reject H 0 since χ 2 = 55.7043 is not greater than χ 02.05 = 55.758 .
ANSWERS TO SELF-ASSESSMENT QUESTIONS

UNIT 1
Session 1
1. C 8. C
2. B 9. C
3. A 10. D
4. A 11. D
5. D 12. B
6. A 13. A
7. D 14. B

Session 2

1. $n = 52$, $\bar{x} = 28.76$, and $s = 12.2647$. Since $z = -0.73$, we are not able to reject $H_0$.

2. $z = 0.6$; we conclude that there is not enough evidence to reject the null hypothesis.

3. (a) Hypotheses: $H_0: \mu = 800$ against $H_1: \mu \neq 800$

   (b) $z = 1.92$; we fail to reject $H_0$ and conclude that the viscosity of the liquid
   averages 800 centistokes at $25^{\circ}$C.
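The statistic in answer 1 can be reproduced from the reported summary values, as in the minimal sketch below. The hypothesised mean of 30 is not stated in this answer key; it is an assumption, back-solved from the reported $z = -0.73$. The two-sided critical value 1.96 (that is, $\alpha = 0.05$) is likewise assumed.

```python
import math


def one_sample_z(xbar, mu0, s, n):
    """Large-sample z statistic for H0: mu = mu0, with s standing in for sigma."""
    return (xbar - mu0) / (s / math.sqrt(n))


# mu0 = 30.0 is an assumed value, not taken from the answer key.
z = one_sample_z(xbar=28.76, mu0=30.0, s=12.2647, n=52)

# Assumed two-sided test at alpha = 0.05, so the critical value is 1.96.
reject = abs(z) > 1.96

print(round(z, 2), "reject H0" if reject else "fail to reject H0")
```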

Session 3

1. $t = -1.95$. Fail to reject $H_0$ since $t = -1.95$ is neither less than
$-t_{0.025}(16) = -2.120$ nor greater than $t_{0.025}(16) = 2.120$.

2. $t = -3.06$. Reject $H_0$ since $t = -3.06$ is less than $-t_{0.01}(9) = -2.821$.

3. $t = -0.49$. Fail to reject $H_0$ since $t = -0.49$ is greater than $-t_{0.05}(4) = -2.132$.

Session 4

1. $z = -2.031$. Reject $H_0$ since $z = -2.031$ is less than $-z_{0.025} = -1.96$.

2. $z = -1.414$. Fail to reject $H_0$ since $z = -1.414$ is greater than $-z_{0.05} = -1.645$.
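The critical values quoted in these two answers can be checked without tables. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names `z_upper` and `lower_tail_decision` are illustrative and not part of the course text.

```python
from statistics import NormalDist


def z_upper(tail_area):
    """Critical value z_a with P(Z > z_a) = tail_area for a standard normal Z."""
    return NormalDist().inv_cdf(1 - tail_area)


def lower_tail_decision(z, tail_area):
    """Compare z with -z_a, the rule applied in the two answers above."""
    return "reject H0" if z < -z_upper(tail_area) else "fail to reject H0"


print(round(z_upper(0.025), 2), lower_tail_decision(-2.031, 0.025))
print(round(z_upper(0.05), 3), lower_tail_decision(-1.414, 0.05))
```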


Session 5

1. We have $k_{\alpha} = 2$. Reject $H_0$ since $x = 1$ is less than $k_{\alpha} = 2$.

2. We have $k_{\alpha/2} = 0$. Fail to reject $H_0$ since $x = 12$ is not less than or equal to
$k_{\alpha/2} = 0$.

3. We have $k_{\alpha/2} = 5$. Reject $H_0$ since $x = 4$ is less than $k_{\alpha/2} = 5$.

Session 6

1. $\chi^2 = 5.9168$, so fail to reject $H_0$ since $\chi^2 = 5.9168$ is not less than
$\chi^2_{0.95} = 2.733$.

2. $\chi^2 = 20.845$, so fail to reject $H_0$ since $\chi^2 = 20.845$ is neither less than
$\chi^2_{0.995} = 11.689$ nor greater than $\chi^2_{0.005} = 38.076$.

3. $\chi^2 = 27.887$, so fail to reject $H_0$ since $\chi^2 = 27.887$ is not less than
$\chi^2_{0.95} = 13.121$.

4. $\chi^2 = 55.704$. Fail to reject $H_0$ since $\chi^2 = 55.7043$ is not greater than
$\chi^2_{0.05} = 55.758$.

UNIT 2
Session 1

1. $z = -2.10$. Reject $H_0$ and conclude that $\mu_1 < \mu_2$.

2. (a) $H_0: \mu_1 - \mu_2 \le 0$ against $H_1: \mu_1 - \mu_2 > 0$

   (b) $z = 1.41$. Do not reject $H_0$ at $\alpha = 0.05$; we cannot conclude that
   $\mu_1 - \mu_2 > 0$.

3. (a) $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 \neq 0$

   (b) $z = -5.06$. Reject $H_0$ at $\alpha = 0.05$; $\mu_1$ and $\mu_2$ differ.

4. $z = 1.08$. Do not reject $H_0$ at $\alpha = 0.05$; we cannot conclude that $\mu_1 - \mu_2 \neq 0.20$.


5. $z = 2.604$. Reject $H_0$ at $\alpha = 0.05$, since $z = 2.604$ is greater than $z_{0.025} = 1.96$.

6. $z = -1.2194$. Fail to reject $H_0$ at $\alpha = 0.05$, since $z = -1.2194$ is greater than
$-z_{0.05} = -1.645$.

Session 2
1. (a) $t = -1.2545$. We fail to reject $H_0$ at the 0.10, 0.05, 0.01, and 0.001
   levels of significance. This is because $t = -1.2545$ is not less than
   $-t_{0.10}(18) = -1.330$, nor $-t_{0.05}(18) = -1.73$, nor $-t_{0.01}(18) = -2.55$,
   nor $-t_{0.001}(18) = -3.610$.

   (b) $t = -1.2545$. We fail to reject $H_0$ at the 0.10 significance level
   because $t = -1.2545$ is neither less than $-t_{0.05}(18) = -1.73$ nor
   greater than $t_{0.05}(18) = 1.73$.

   We also fail to reject $H_0$ at the 0.05 significance level because
   $t = -1.2545$ is neither less than $-t_{0.025}(18) = -2.101$ nor greater
   than $t_{0.025}(18) = 2.101$.

   Again, we fail to reject $H_0$ at the 0.01 significance level
   because $t = -1.2545$ is neither less than $-t_{0.005}(18) = -2.878$
   nor greater than $t_{0.005}(18) = 2.878$.

   Finally, we fail to reject $H_0$ at the 0.001 significance level
   because $t = -1.2545$ is neither less than $-t_{0.0005}(18) = -3.922$
   nor greater than $t_{0.0005}(18) = 3.922$.
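The two-sided decisions in part (b) all follow one rule: reject $H_0$ only when $|t|$ exceeds $t_{\alpha/2}(18)$. A minimal sketch of that rule is given below, using standard t-table values for 18 degrees of freedom. The dictionary and function names are illustrative assumptions, not part of the course text.

```python
T_STAT = -1.2545

# Upper-tail points t_{alpha/2}(18) from a standard t table,
# keyed by the two-sided significance level alpha.
CRITICAL_T_18 = {0.10: 1.734, 0.05: 2.101, 0.01: 2.878, 0.001: 3.922}


def two_sided_decision(t, critical):
    """Reject H0 when t falls below -critical or above +critical."""
    return "reject H0" if abs(t) > critical else "fail to reject H0"


decisions = {alpha: two_sided_decision(T_STAT, crit)
             for alpha, crit in CRITICAL_T_18.items()}

for alpha, decision in decisions.items():
    print(f"alpha = {alpha}: {decision}")
```

Because the critical value grows as $\alpha$ shrinks, a statistic that fails to reach significance at the 0.10 level cannot reach it at any smaller level.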

Session 3
1. (a) $t = -1.6112$. Reject the null hypothesis at the 0.10 level because
   $t = -1.6112$ is less than $-t_{0.10}(11) = -1.363$. However, fail to
   reject the null hypothesis at the 0.05, 0.01 and 0.001 levels,
   respectively. This is because $t = -1.6112$ is not less than
   $-t_{0.05}(11) = -1.796$, nor less than $-t_{0.01}(11) = -2.718$, nor less
   than $-t_{0.001}(11) = -4.025$.


   (b) $t = -1.6112$. We fail to reject $H_0$ at the 0.10 significance level
   because $t = -1.6112$ is neither less than $-t_{0.05}(11) = -1.796$
   nor greater than $t_{0.05}(11) = 1.796$.

   We also fail to reject $H_0$ at the 0.05 level of significance
   because $t = -1.6112$ is neither less than $-t_{0.025}(11) = -2.201$
   nor greater than $t_{0.025}(11) = 2.201$.

   Again, we fail to reject $H_0$ at the 0.01 significance level because
   $t = -1.6112$ is neither less than $-t_{0.005}(11) = -3.106$ nor greater
   than $t_{0.005}(11) = 3.106$.

   Finally, we fail to reject $H_0$ at the 0.001 significance level
   because $t = -1.6112$ is neither less than
   $-t_{0.0005}(11) = -4.437$ nor greater than $t_{0.0005}(11) = 4.437$.

Session 4

1. $t = 1.897$. Reject $H_0$ and conclude that the mean weight of the boxers before they
took the diet is greater than their mean weight after the diet. This implies that the
weight-reducing diet is effective.

2. (b) $\bar{d} = -0.8$

   (c) $s_d^2 = 3.7$

   (d) $s_d = 1.92$

   (e) $t = -0.9317$. We fail to reject $H_0$ and conclude that the
   difference in the means is equal to 0.
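Parts (b) to (e) of answer 2 chain together through the paired-sample formula $t = \bar{d} / (s_d / \sqrt{n})$. The sketch below reproduces the reported statistic. The number of pairs, $n = 5$, is not given in this answer key; it is an assumption, back-solved from the reported values, and the function name is illustrative.

```python
import math


def paired_t(d_bar, s_d, n):
    """Paired-sample t statistic for H0: mean difference = 0."""
    return d_bar / (s_d / math.sqrt(n))


s_d = round(math.sqrt(3.7), 2)           # part (d): 1.92, rounded as in the answer
t = paired_t(d_bar=-0.8, s_d=s_d, n=5)   # n = 5 is an assumed number of pairs

print(s_d, round(t, 4))
```

Using the unrounded standard deviation instead of 1.92 changes the statistic slightly in the third decimal place, which does not affect the decision.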

Session 5

1. (a) $H_0: p_1 - p_2 = 0$ against $H_1: p_1 - p_2 \neq 0$

   (b) $z = 2.13$. Reject $H_0$ at $\alpha = 0.05$, and conclude that $p_1$ and $p_2$
   differ.


2. $z = 0.8086$. Fail to reject $H_0$ since $z = 0.8086$ is not less than $-z_{0.025} = -1.96$
nor greater than $z_{0.025} = 1.96$.

3. (a) $H_0: p_1 - p_2 = 0$ against $H_1: p_1 - p_2 \neq 0$

   (b) $z = 0.505$. Fail to reject $H_0$ at $\alpha = 0.02$, and conclude that $p_1$
   and $p_2$ do not differ.

4. $z = 0.126$. Fail to reject $H_0$ since $z = 0.126$ is not less than $-z_{0.01} = -2.33$ nor
greater than $z_{0.01} = 2.33$.

5. (a) $H_0: p_1 - p_2 \le 0$ against $H_1: p_1 - p_2 > 0$

   (b) $z = 3.00$. Reject $H_0$ at $\alpha = 0.05$, and conclude that $p_1 > p_2$.
   That is, a larger proportion of urban voters plan to vote for the
   incumbent President.

Session 6

1. $F = 5.49$. Reject $H_0$ at $\alpha = 0.02$, and conclude that $H_1: \sigma_1^2 \neq \sigma_2^2$.
That is, the variability of the tensile strength of the two kinds of steel is not the
same.

2. $F = 2.469$. Fail to reject $H_0$ at the 0.10 level of significance, and conclude that
$H_1: \sigma_1^2 \neq \sigma_2^2$. That is, it is not reasonable to assume that the two population
samples have equal variances.

3. $F = 0.39$. Reject $H_0$ at $\alpha = 0.01$, and conclude that $H_1: \sigma_1^2 < \sigma_2^2$.
That is, variability is smaller in the elderly.
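For a lower-tailed test such as the one in answer 3, most F tables list only upper-tail points. Writing $\nu_1$ and $\nu_2$ for the numerator and denominator degrees of freedom (symbols assumed here, since the answer key does not state them), the lower critical value can be obtained from the reciprocal identity

$$F_{1-\alpha}(\nu_1, \nu_2) = \frac{1}{F_{\alpha}(\nu_2, \nu_1)},$$

so that $H_0$ is rejected in favour of $H_1: \sigma_1^2 < \sigma_2^2$ when $F = s_1^2 / s_2^2 < 1 / F_{\alpha}(\nu_2, \nu_1)$.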


UNIT 3

Session 2

1. $\chi^2 = 2.1$. Since $\chi^2 = 2.1$ is less than $\chi^2_{0.05}(5) = 11.070$, we fail to reject $H_0$ and
conclude that the die is fair.

2. $\chi^2 = 8.998$. Since $\chi^2 = 8.998$ is less than $\chi^2_{0.01}(2) = 9.210$, we fail to reject
$H_0$ and conclude that the proportions of Muslims, Christians and Other religions
in the Ashanti Region are 15%, 77% and 8% respectively.

Session 3

1. $\chi^2 = 2.28954$. Since $\chi^2 = 2.28954$ is less than $\chi^2_{0.05}(5) = 11.070$, we do not
reject $H_0$ and conclude that the number of arrivals per minute follows a Poisson
distribution at the 0.05 level of significance.

2. $\chi^2 = 6.306$. Since $\chi^2 = 6.306$ is less than $\chi^2_{0.05}(7) = 14.067$, we do not reject
$H_0$ and conclude that the binomial distribution is an adequate model for the data,
at the 0.05 level of significance.

3. $\chi^2 = 19.530$. Since $\chi^2 = 19.530$ is greater than $\chi^2_{0.05}(5) = 11.070$, we reject $H_0$
and conclude that the battery lives do not follow a normal distribution.


Session 4

1. (a) $H_{01}$: Proportion of Males in Education = Proportion of Females in Education

   $H_{02}$: Proportion of Males in CANS = Proportion of Females in CANS

   $H_{03}$: Proportion of Males in CHLS = Proportion of Females in CHLS

   $H_{04}$: Proportion of Males in HAAS = Proportion of Females in HAAS

   against

   $H_1$: At least one of the $H_{0i}$ is false

   (b) $\chi^2 = 0$. The critical region is $\chi^2 \ge 7.815$ since
   $\chi^2_{0.05}(3) = 7.815$. We fail to reject the null hypotheses since zero
   is less than 7.815. We, therefore, conclude that there are equal
   proportions of males and females in the colleges of the university.

2. (a) $H_{01}$: Proportion of Resident prescribing Frequently = Proportion of Attending prescribing Frequently

   $H_{02}$: Proportion of Resident prescribing Occasionally = Proportion of Attending prescribing Occasionally

   $H_{03}$: Proportion of Resident prescribing Rarely = Proportion of Attending prescribing Rarely

   $H_{04}$: Proportion of Resident Never prescribing = Proportion of Attending Never prescribing

   against

   $H_1$: At least one of the $H_{0i}$ is false

   (b) $\chi^2 = 31.88$. The critical region is $\chi^2 \ge 7.815$ since
   $\chi^2_{0.05}(3) = 7.815$. We reject the null hypotheses in favour of the
   alternative since 31.88 is greater than 7.815. We, therefore,
   conclude that at least one of the null hypotheses is false.
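The statistics in this session come from comparing each observed count with the count expected under equal proportions, $E_{ij} = (\text{row total}_i \times \text{column total}_j) / n$. The sketch below shows the computation on two small tables of hypothetical counts. These are NOT the data from the questions; they were made up only to illustrate why a perfectly proportional table gives $\chi^2 = 0$ while a disproportionate one can exceed the critical value 7.815.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts: 2 groups (rows) by 4 categories (columns).
proportional = [[20, 30, 40, 10],
                [40, 60, 80, 20]]   # second row is exactly twice the first
skewed = [[30, 30, 30, 10],
          [30, 60, 80, 30]]

CRITICAL = 7.815  # chi-square upper 5% point for 3 degrees of freedom

for name, table in (("proportional", proportional), ("skewed", skewed)):
    stat, df = chi_square_statistic(table)
    decision = "reject H0" if stat >= CRITICAL else "fail to reject H0"
    print(f"{name}: chi-square = {stat:.3f}, df = {df}, {decision}")
```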

Session 5

1. The hypotheses are

   $H_0$: Educational Levels and Party Affiliations are independent

   against

   $H_1$: Educational Levels and Party Affiliations are not independent

   $\chi^2 = 8.1227$. Since $\chi^2_{\text{cal}} = 8.1227 < \chi^2_{0.05}(4) = 9.49$, we fail to reject $H_0$ and
   conclude that educational levels and political party affiliations in Ghana are
   independent.

2. (a) For union workers:

   $H_0$: Level of confidence in business and job satisfaction are independent.

   against

   $H_1$: Level of confidence in business and job satisfaction are not independent.

   For non-union workers:

   $H_0$: Level of confidence in business and job satisfaction are independent.

   against


   $H_1$: Level of confidence in business and job satisfaction are not independent.

   (b) For union workers: $\chi^2 = 12.20$. Fail to reject $H_0$ and conclude that the level
   of confidence in business and job satisfaction are independent for unionized
   workers.

   For non-union workers: $\chi^2 = 9.996$. Fail to reject $H_0$ and conclude that the
   level of confidence in business and job satisfaction are independent for
   non-unionized workers as well.

3. $\chi^2 = 17.4354$. Reject $H_0$ and conclude that Teaching Evaluation and Rank are
not independent random variables.
