Stat II Module
DEPARTMENT OF MANAGEMENT
MODULE OF STATISTICS FOR MANAGEMENT II
Course code: MGMT 2073
Credit hours: 3
February, 2023
Bonga, Ethiopia
Table of content
CHAPTER ONE
SAMPLING AND SAMPLING DISTRIBUTION
1.1 Introduction
1.2 Importance of sampling theory
1.3 Probability (Random) and Non-Probability (non-random) Sampling
1.3.1 Probability Samples
1.3.2 Non-Probability Samples
1.4 Bias and Error in Sampling
1.5 SAMPLING DISTRIBUTIONS
1.5.1 SAMPLING DISTRIBUTION OF THE MEAN ($\bar{x}$)
1.5.2 SAMPLING DISTRIBUTION OF THE PROPORTION ($\bar{p}$)
1.5.3 SAMPLING DISTRIBUTION OF THE DIFFERENCE BETWEEN TWO SAMPLE MEANS ($\bar{x}_1 - \bar{x}_2$)
1.5.4 SAMPLING DISTRIBUTION OF THE DIFFERENCE OF TWO PROPORTIONS ($\bar{p}_1 - \bar{p}_2$)
SELF CHECK EXERCISE 1
CHAPTER TWO
STATISTICAL ESTIMATION AND STATISTICAL INFERENCE
2.1 INTRODUCTION TO STATISTICAL ESTIMATION
2.1.1 CRITERIA FOR POINT ESTIMATOR
2.1.2 POINT ESTIMATOR OF THE MEAN
2.1.3 POINT ESTIMATE OF THE POPULATION PROPORTION
2.1.4 POINT ESTIMATE OF THE UNKNOWN POPULATION STANDARD DEVIATION
2.1.5 POINT ESTIMATOR OF STANDARD ERROR OF THE MEAN
2.1.6 A POINT ESTIMATE OF SAMPLE STANDARD ERROR OF THE PROPORTION
2.2 INTERVAL ESTIMATE
2.2.1 INTERVAL ESTIMATE OF POPULATION MEAN
Confidence Interval Estimate of µ, Normal Population, and Standard Deviation
Precision, Confidence and Sample Size
CONFIDENCE ESTIMATE OF µ, NORMAL POPULATION, $\sigma_x$ UNKNOWN
2.2.2 CONFIDENCE INTERVAL ESTIMATE FOR POPULATION PROPORTION
2.3 DETERMINATION OF SAMPLE
2.3.1 Sample Size for Estimating Population Proportion
2.3.2 Sample Size for Estimating a Population Mean
UNIT SUMMARY
CHAPTER THREE
HYPOTHESES TESTING
3.1 Introduction to Hypothesis testing
3.2 Type I and type II errors
3.3 Steps in Hypothesis Testing
3.4 One tail and two tail tests
3.5 HYPOTHESIS TEST OF POPULATION MEAN
3.6 HYPOTHESIS TEST OF PROPORTIONS
3.7 THE DIFFERENCE OF TWO MEANS
3.8 TESTING THE DIFFERENCE OF TWO POPULATION PROPORTIONS
3.9 STUDENT'S T-TEST
3.10 A DIFFERENCE OF TWO MEANS WHEN SAMPLE SIZE IS SMALL AND STANDARD DEVIATION UNKNOWN
UNIT SUMMARY
SELF CHECK EXERCISE 3
UNIT FOUR
CHI-SQUARE DISTRIBUTION
4.1 CHAPTER INTRODUCTION
4.2 GENERAL CHARACTERISTICS OF CHI SQUARE DISTRIBUTION
4.3 TEST FOR INDEPENDENCE AND CELL COUNTS FOR TEST OF INDEPENDENCE
4.4 TESTING THE EQUALITY OF MORE THAN TWO POPULATION PROPORTIONS
4.5 GOODNESS OF FIT TESTS
4.5.1 GOODNESS-OF-FIT TESTS UNIFORM DISTRIBUTION
4.5.2 Goodness of fit Binomial Distribution
4.5.3 Goodness of fit test for Poisson distribution
4.5.4 Goodness of fit test Normal Distribution
UNIT SUMMARY
SELF CHECK EXERCISE 4
CHAPTER FIVE
ANALYSIS OF VARIANCE
5.1 Chapter Introduction
5.2 One way Analysis of variance
5.3 TWO-WAY ANALYSIS OF VARIANCE
UNIT SUMMARY
SELF CHECK EXERCISE 5
Chapter 6
Simple linear Regression and Correlation
6.1 Simple Linear Regression
6.1.1 The Scatter Diagram
6.1.2 The regression Equation
6.2 Correlation
Review Exercises 6
CHAPTER ONE
Unit content
Sampling and sampling theory
Probability (Random) and Non-Probability (non-random) Sampling
Bias and Error in Sampling
Sampling Distributions
Chapter Objectives:
1.1 Introduction
What is sampling?
Data are collected from a target population by means of a survey. If the survey covers every unit of the population it is called a census; if it covers only part of the population it is called a sample survey, and the process of selecting that part is called sampling.
Why is sampling preferable?
Cheaper than a census
Takes less time than a census
Economy of effort, as relatively fewer staff are needed
More detailed information can be collected using a sample
Better quality of interviewing, supervision and other related activities
Limitations of sampling
It fails to provide information on each individual unit
Sampling gives rise to certain errors
Difficult to check for omissions of certain units
Population                            Sample
Parameters                            Statistics
Population size N                     Sample size n
Population mean μ                     Sample mean $\bar{x}$
Population standard deviation σ       Sample standard deviation s
Population proportion π               Sample proportion p
1.2 Importance of sampling theory
When undertaking any survey, it is essential that you obtain data from people that are as
representative as possible of the group that you are studying. Even with the perfect
questionnaire (if such a thing exists), your survey data will only be regarded as useful if it
is considered that your respondents are typical of the population as a whole. For this
reason, an awareness of the principles of sampling is essential to the implementation of
most methods of research, both quantitative and qualitative.
1.3 Probability (Random) and Non-Probability (non-random) Sampling
A probability sample is one in which each member of the population has a known chance of being selected; in the simplest case every member has an equal chance. A random sample is usually a representative sample. There are two methods of ensuring randomness: the lottery method and the use of random numbers.
In the lottery method, each unit of the population is numbered and shown on a chit of paper or a disc. The chits are then folded and put in a box, from which a sample of the predetermined size is drawn.
In the random number method, a table of random numbers is used. The units of the population are numbered from 1 to N, and n of these numbers are selected from the table.
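The random number method can be sketched in Python, with a pseudo-random number generator standing in for the printed table of random numbers. The function name, the population size of 500 and the seed are illustrative assumptions, not part of the module.

```python
import random

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Pick `sample_size` distinct unit numbers from 1..population_size."""
    rng = random.Random(seed)                 # seed only makes the run repeatable
    units = range(1, population_size + 1)     # units numbered 1 to N
    return sorted(rng.sample(units, sample_size))  # drawn without replacement

# Hypothetical population of N = 500 units, sample of n = 10.
sample = draw_simple_random_sample(population_size=500, sample_size=10, seed=1)
print(sample)
```

Drawing without replacement mirrors the lottery method, where a chit that has been drawn is not returned to the box.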
In a non-probability sample, some people have a greater, but unknown, chance than others
of selection.
1.3.1 Probability Samples
There are five main types of probability sample. The choice among them depends on the nature of the research problem, the availability of a good sampling frame, money, time, the desired level of accuracy in the sample and the data collection methods. Each has its advantages and its disadvantages. They are:
Simple random
Systematic
Random route
Stratified
Multi-stage cluster sampling
1. Simple random sampling
This is perhaps an unfortunate term, because it isn't that simple and it isn't done at random in the sense of "haphazardly".
Characteristics:
Procedure:
The possible samples of size two from the population B, C, D & E are BC, BD, BE, CD, CE, DE.
Note that B appears in three of the six samples, so the probability of B being selected is p(B) = 3/6 = ½. Similarly, p(C) = p(D) = p(E) = ½, so (1) each element of the population has the same chance of being chosen. Moreover, (2) each of the possible samples of size two has the same chance of being selected [p(BC) = p(BD) = p(BE) = p(CD) = p(CE) = p(DE) = 1/6]. Consequently, we can say that both conditions for simple random sampling are satisfied.
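The counting argument for the population B, C, D & E can be checked by listing every sample of size two. This is a small sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

population = ["B", "C", "D", "E"]
samples = list(combinations(population, 2))   # all unordered samples of size 2

# Under simple random sampling every sample is equally likely.
p_sample = Fraction(1, len(samples))

# Probability that each element ends up in the selected sample.
p_element = {e: sum(e in s for s in samples) * p_sample for e in population}

print(len(samples))
print(p_element)
```

Exact fractions are used so that the comparison with 1/6 and 1/2 is not affected by floating-point rounding.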
2. Systematic sampling
Similar to simple random sampling, but instead of selecting random numbers from tables, you move through the list (sampling frame) picking every kth name, where k = N/n.
You must first work out the sampling interval k by dividing the population size by the required sample size. E.g. for a population of 500 and a sample of 100, k = 500/100 = 5, so the sampling fraction is 1/5, i.e. you will select one person out of every five in the population. A random number needs to be used only to decide on the starting point. With a sampling fraction of 1/5, the starting point must be within the first 5 people in your list.
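The procedure can be sketched as follows, using the figures from the example (500 people, sample of 100). The function name and the numbered list standing in for a sampling frame are assumptions made for illustration.

```python
import random

def systematic_sample(frame, sample_size, seed=None):
    """Select every k-th unit, starting at a random point among the first k."""
    k = len(frame) // sample_size        # sampling interval k = N / n
    rng = random.Random(seed)
    start = rng.randrange(k)             # random start, 0-based, within the first k units
    return [frame[i] for i in range(start, start + k * sample_size, k)]

frame = list(range(1, 501))              # hypothetical list of 500 numbered people
chosen = systematic_sample(frame, sample_size=100, seed=7)
print(len(chosen), chosen[:5])
```

Only the starting point is random; every later selection is fixed by the interval k.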
3. Random route sampling
Used in market research surveys, mainly for sampling households, shops, garages and other premises in urban areas.
Advantages:
Bias may be reduced because the interviewer has to call at clearly defined addresses and is not able to choose
Problems:
Characteristics of particular areas (e.g. poor / rich) may mean that the sample is not representative
Open to abuse by the interviewer because it is difficult to check that instructions have been fully carried out
4. Stratified Sampling
Dividing a population into non-overlapping groups is called stratification. A stratified random sample is one where the population is divided into non-overlapping sub groups, or strata, and then a simple random sample is selected within each of the strata. Thus a population can be stratified if its members have readily identifiable characteristics that can be used to separate them into sub groups.
For example, we can divide a human population into different strata on the basis of age, sex, occupation, education, religion, region, etc. Notice that stratification doesn't mean absence of randomness. All it means is that the population is first divided into strata and then a simple random sample is selected from each stratum of the population. The advantages of using stratified random sampling are:
It more accurately reflects the characteristics of the population than simple random sampling & systematic random sampling.
It is more cost effective than simple random sampling.
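A minimal sketch of stratified random sampling is given below. It assumes proportional allocation (the same sampling fraction in every stratum), which the module does not specify; the population, the stratum labels and the function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, sampling_fraction, seed=None):
    """Take a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)       # non-overlapping sub groups
    sample = {}
    for name, members in strata.items():
        size = round(len(members) * sampling_fraction)  # proportional allocation (assumed)
        sample[name] = rng.sample(members, size)        # simple random sample per stratum
    return sample

# Hypothetical population: 60 units in stratum "F" and 40 in stratum "M".
people = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]
result = stratified_sample(people, stratum_of=lambda p: p[0],
                           sampling_fraction=0.1, seed=3)
print({name: len(members) for name, members in result.items()})
```

Randomness is kept inside each stratum, which is the point made above: stratification does not mean absence of randomness.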
5. Multi-stage cluster sampling
As the name implies, this involves drawing several different samples. It does so in such a way that the cost of final interviewing is minimized.
Basic procedure: First draw a sample of areas. Initially large areas are selected, then progressively smaller areas within each larger area are sampled. Eventually you end up with a sample of households and use a method of selecting individuals from these selected households.
1.3.2 Non-Probability Samples
Cheaper
Used when sampling frame is not available
Useful when population is so widely dispersed that cluster sampling would
not be efficient
Often used in exploratory studies, e.g. for hypothesis generation
Some research is not interested in working out what proportion of the population gives a particular response, but rather in obtaining an idea of the range of responses or ideas that people have.
1. Purposive Sampling
A purposive sample is one that is selected by the researcher subjectively. The researcher attempts to obtain a sample that appears to him/her to be representative of the population and will usually try to ensure that a range from one extreme to the other is included.
Often used in political polling: districts are chosen because their pattern has in the past provided a good idea of outcomes for the whole electorate.
2. Quota Sampling
Quota sampling involves fixing certain quotas, which are to be fulfilled by the interviewers.
Quota sampling is often used in market research. Interviewers are required to find cases with particular characteristics. They are given quotas of particular types of people to interview, and the quotas are organized so that the final sample should be representative of the population.
Stages:
Complex quotas can be developed so that several characteristics (e.g. age, sex, marital
status) are used simultaneously. By the end of the day, the researcher may be looking for a
widowed man in his nineties who looks as though he might buy a particular brand of
detergent.
Disadvantage of quota sampling: interviewers choose whom they like (within the above criteria) and may therefore select those who are easiest to interview, so bias can result. Also, it is impossible to estimate accuracy (because it is not a random sample).
3. Convenience sampling
A convenience sample is used when you simply stop anybody in the street who is prepared
to stop, or when you wander round a business, a shop, a restaurant, a theatre or whatever,
asking people you meet whether they will answer your questions. In other words, the
sample comprises subjects who are simply available in a convenient way to the researcher.
There is no randomness and the likelihood of bias is high. You can't draw any meaningful
conclusions from the results you obtain.
However, this method is often the only feasible one, particularly for students or others
with restricted time and resources, and can legitimately be used provided its limitations
are clearly understood and stated.
Because it is an extremely haphazard approach, students are often tempted to use the word "random" when describing their sample where they have stopped people in the street, as they see it, "at random". You should avoid using the word "random" when describing anything to do with sampling unless you are absolutely certain that you selected respondents from a sampling frame using truly random methods.
4. Snowball sampling
With this approach, you initially contact a few potential respondents and then ask them
whether they know of anybody with the same characteristics that you are looking for in
your research. For example, if you wanted to interview a sample of vegetarians / cyclists /
people with a particular disability / people who support a particular political party etc.,
your initial contacts may well have knowledge (through e.g. support group) of others.
5. Self-selection
1.4 Bias and Error in Sampling
A sample is expected to mirror the population from which it comes; however, there is no guarantee that any sample will be precisely representative of that population. Chance may dictate that a disproportionate number of untypical observations will be made. In the case of testing fuses, for example, the sample may contain a larger or smaller proportion of faulty fuses than the real population proportion of faulty fuses. In practice, it is rarely known when a sample is unrepresentative and should be discarded.
Sampling error
What can make a sample unrepresentative of its population? One of the most frequent
causes is sampling error.
Sampling error comprises the differences between the sample and the population that are
due solely to the particular units that happen to have been selected.
For example, suppose that a sample of 100 Arbaminch women are measured and all are found to be taller than six feet. It is very clear, even without any statistical proof, that this would be a highly unrepresentative sample leading to invalid conclusions. This is a very unlikely occurrence because such rare cases are naturally spread widely through the population. But it can occur. Luckily, this is a very obvious error and can be detected very easily.
The more dangerous error is the less obvious sampling error, against which nature offers very little protection. An example would be a sample in which the average height is overstated by only an inch or two, rather than by one foot, which would be more obvious. It is the unobvious error that is of most concern.
There are two basic causes for sampling error. One is chance: That is the error that occurs
just because of bad luck. This may result in untypical choices. Unusual units in a population
do exist and there is always a possibility that an abnormally large number of them will be
chosen. The main protection against this kind of error is to use a large enough sample. The
second cause of sampling error is sampling bias.
Sampling bias is a tendency to favour the selection of units that have particular characteristics. Sampling bias is usually the result of a poor sampling plan. The most notable is the bias of non-response, when for some reason some units have no chance of appearing in the sample. For example, take a hypothetical case where a survey was conducted by a Graduate School to find out the level of stress that graduate students were going through. A mail questionnaire was sent to 100 randomly selected graduate students. Only 52 responded, and the results suggested that students were not under stress at that time, when in fact it was the period of highest stress for all students except those who were writing their theses at their own pace. Apparently, this was the group that had the time to respond. The researcher conducting the study went back to the questionnaires to find out what the problem was and found that all those who had responded were third- and fourth-year PhD students. Bias can be very costly and has to be guarded against as much as possible. A means of selecting the units of analysis must be designed to avoid the more obvious forms of bias.
Another example would be where you would like to know the average income of some community and you decide to use telephone numbers to select a sample of the total population, in a locality where only the rich and middle class households have telephone lines. You will end up with a high average income, which will lead to the wrong policy decisions.
Non-sampling error (measurement error)
The other main cause of unrepresentative samples is non-sampling error. This type of error can occur whether a census or a sample is being used. Like sampling error, non-sampling error may either be produced by participants in the statistical study or be an innocent by-product of the sampling plans and procedures.
A non-sampling error is an error that results solely from the manner in which the
observations are made.
Biased observations due to inaccurate measurement can be innocent but very devastating. A story is told of a French astronomer who once proposed a new theory based on spectroscopic measurements of light emitted by a particular star. When his colleagues discovered that the measuring instrument had been contaminated by cigarette smoke, they rejected his findings.
In surveys of personal characteristics, unintended errors may result from:
- The manner in which the response is elicited
- The social desirability of the persons surveyed
- The purpose of the study
- The personal biases of the interviewer or survey writer
No two interviewers are alike, and the same person may provide different answers to different interviewers. The manner in which a question is formulated can also result in inaccurate responses. Individuals tend to provide false answers to particular questions. For example, some people want to appear younger or older for reasons known to them. If you ask such a person their age in years, it is easy for them to lie by overstating or understating their age by one or more years. If you ask instead which year they were born, giving a false date requires a bit of quick arithmetic, so a date of birth is likely to be more accurate.
Respondents might also give incorrect answers to impress the interviewer. This type of error is the most difficult to prevent because it results from outright deceit on the part of the respondent. It is important to acknowledge that certain psychological factors induce incorrect responses, and great care must be taken to design a study that minimizes their effect.
Knowing why a study is being conducted may create incorrect responses. A classic
example is the question: What is your income? If a government agency is asking, a different
figure may be provided than the respondent would give on an application for a home
mortgage. One way to guard against such bias is to camouflage the study’s goals; another
remedy is to make the questions very specific, allowing no room for personal
interpretation. For example, "Where are you employed?" could be followed by "What is
your salary?" and "Do you have any extra jobs?" A sequence of such questions may
produce more accurate information.
The preceding section has covered the most common problems associated with statistical studies. The desirability of a sampling procedure depends on both its vulnerability to error and its cost. However, economy and reliability are competing ends, because reducing error often requires an increased expenditure of resources. Of the two types of statistical errors, only sampling error can be controlled by exercising care in determining the method for choosing the sample. The previous section has shown that sampling error may be due to either bias or chance. The chance component (sometimes called random error) exists no matter how carefully the selection procedures are implemented, and the only way to minimize chance sampling errors is to select a sufficiently large sample (sample size is discussed towards the end of this tutorial). Sampling bias, on the other hand, may be minimized by the wise choice of a sampling procedure.
1.5 SAMPLING DISTRIBUTIONS
Why not use the sample statistic as an estimate of the corresponding population parameter, for instance the sample mean as an estimate of the population mean? The question is how confident we can be in the sample statistic.
For example: If we cast a fair die and take X to be the uppermost number, we know that the population mean (expected value) is µ = 3.5, and that the population median is also m = 3.5. But if we take a sample of, say, four throws, the sample mean may be far from 3.5. Here are the results of 5 such samples of 4 throws (we used a random number generator to obtain these samples):
            X1    X2    X3    X4    Mean ($\bar{x}$)
Sample 1    6     2     5     6     4.75
Sample 2    2     3     1     6     3
Sample 3    1     1     4     6     3
Sample 4    6     2     2     1     2.75
Sample 5    1     5     1     3     2.5
Since each sample consists of 4 throws, we say that the sample size is n = 4. Notice that none of the five samples gave us the correct mean, and that the mean of the first sample is far from the actual mean. The average (mean) of these five sample means is 3.2, which is closer to 3.5 than any of the individual sample means. Thus, although the mean of a particular sample may not be a good predictor of the population mean, we get better results if we take the mean of many sample means. Hence, a sampling distribution is a probability distribution of the possible values of a sample statistic, such as the sample mean or the sample proportion.
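The die experiment can be repeated by simulation. The sketch below assumes a pseudo-random generator with a fixed seed; the numbers it produces will differ from the five samples tabulated above, and the count of 10,000 repetitions is an arbitrary choice.

```python
import random
from statistics import mean

rng = random.Random(0)            # fixed seed so the run is repeatable

def sample_mean(n_throws):
    """Mean of `n_throws` throws of a fair six-sided die."""
    return mean(rng.randint(1, 6) for _ in range(n_throws))

few = [sample_mean(4) for _ in range(5)]          # five samples of size n = 4
many = [sample_mean(4) for _ in range(10_000)]    # many more samples of the same size

print(few)
print(mean(many))   # should lie close to the population mean 3.5
```

Individual sample means scatter widely, while the average of many sample means settles near 3.5.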
1.5.1 SAMPLING DISTRIBUTION OF THE MEAN ($\bar{x}$)
It is the probability distribution of all possible values of the sample mean ($\bar{x}_i$'s). It arises because different samples drawn from the same population give different values of the mean.
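Example: Suppose a population consists of the five values 10, 20, 30, 40 & 50.
A random sample of three is to be selected from this population & its mean computed. Develop the sampling distribution of the mean.
To find how many different samples of size three can be taken from a finite population of size five we use the combinations formula $^{N}C_{n}$, i.e. the number of possible samples of size n from a population of size N. Here $^{5}C_{3} = 10$.
Sample no.    Sample      Sample mean
1             10,20,30    20
2             10,20,40    23.33
3             10,20,50    26.67
4             10,30,40    26.67
5             10,30,50    30
6             10,40,50    33.33
7             20,30,40    30
8             20,30,50    33.33
9             20,40,50    36.67
10            30,40,50    40
Each of the 10 samples is equally likely, with probability 1/10, so the probabilities sum to 1.0.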
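The sampling distribution for the population 10, 20, 30, 40 & 50 can be generated by listing every sample of size three. This is a sketch with illustrative variable names; exact fractions are used to avoid rounding in the means.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [10, 20, 30, 40, 50]
samples = list(combinations(population, 3))        # 5C3 = 10 possible samples
means = [Fraction(sum(s), 3) for s in samples]     # exact sample means

# Sampling distribution: each distinct mean with its probability.
distribution = {m: Fraction(c, len(samples))
                for m, c in sorted(Counter(means).items())}

for m, p in distribution.items():
    print(f"{float(m):6.2f}  {float(p):.1f}")

mean_of_means = sum(m * p for m, p in distribution.items())
print(float(mean_of_means))                        # mean of the sample means
```

The mean of the sample means comes out equal to the population mean of 30.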
The sampling distribution of the mean is described by two parameters: the mean of the sample means, and the standard deviation of the sample means, which is termed the standard error of the mean ($\sigma_{\bar{x}}$). The mean of the sample means ($\bar{\bar{x}}$ or $\mu_{\bar{x}}$) is always equal to the population mean (µ).
$\mu_{\bar{x}} = \bar{\bar{x}} = \mu$
The standard error of the mean is equal to the population standard deviation divided by the square root of the sample size.
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
This works when the population is large and the sample is small relative to it (n < 0.05N). But if the population size is finite and n >= 0.05N, we apply a finite population correction factor, or finite population multiplier:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
If n is large (n >= 30), the sampling distribution of the mean can be approximated by a normal distribution.
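As a check on the finite population multiplier, the sketch below compares the formula, the standard error σ/√n multiplied by √((N - n)/(N - 1)), with the standard deviation of all possible sample means, reusing the population 10, 20, 30, 40 & 50 with n = 3. Treating σ as the population standard deviation (divisor N) is an assumption of this sketch.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

population = [10, 20, 30, 40, 50]
N, n = len(population), 3
sigma = pstdev(population)                 # population standard deviation (divisor N)

# Standard error from the formula with the finite population multiplier.
se_formula = (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))

# Direct route: standard deviation of all possible sample means.
sample_means = [mean(s) for s in combinations(population, n)]
se_direct = pstdev(sample_means)

print(round(se_formula, 4), round(se_direct, 4))
```

Here n/N = 0.6, well above 0.05, so the correction matters: without it the formula would give σ/√n, which is larger than the true spread of the sample means.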
Central Limit Theorem and Sampling Distribution of the mean
1. If the population is normal, the distribution of sample means is normal for any sample size.
2. If the population is not normal, the distribution of sample means will be approximately normal if the sample size n is sufficiently large. The central limit theorem shows the relationship between the shape of the parent population and the shape of the sampling distribution of the mean.
The significance of the central limit theorem is that it permits us to use sample statistics to make inferences about population parameters without knowing anything about the shape of the frequency distribution of that population other than what we can get from the sample.
Example: The distribution of annual earnings of all bank tellers with 5 years' experience is negatively skewed. This distribution has a mean of br. 15,000 and a standard deviation of br. 2,000. If we draw a random sample of 30 tellers, what is the probability that their earnings will average more than birr 15,750 annually?
Steps
1. Calculate $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$
$\mu_{\bar{x}} = \mu = 15{,}000$ (br.)
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2000}{\sqrt{30}} = 365.15$ (br.)
2. Calculate Z
Z = (random variable - mean of the random variable) / (standard deviation of the random variable)
$Z_{\bar{x}_i} = \frac{\bar{x}_i - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$
$Z_{15750} = \frac{15750 - 15000}{365.15} = +2.05$
3. Calculate the area covered by the interval
$P(\bar{x} > 15750) = P(z > 2.05)$
= 0.5 - P(0 to +2.05)
= 0.5 - 0.47982
= 0.02018
4. Interpret the results
There is a 2.02% chance that the average earnings of a group of 30 tellers will be more than birr 15,750 annually.
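The four steps can be reproduced in Python. This sketch takes the standard deviation as 2,000, the figure that appears in the standard error calculation, and uses the standard library's normal distribution in place of a printed Z table, so the tail area differs slightly from the table value because Z is not rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 15_000, 2_000, 30      # population mean, standard deviation, sample size
threshold = 15_750

se = sigma / sqrt(n)                  # step 1: standard error of the mean
z = (threshold - mu) / se             # step 2: Z value
p_above = 1 - NormalDist().cdf(z)     # step 3: upper-tail area beyond z

print(round(se, 2), round(z, 2), round(p_above, 4))
```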
Activity: A production company’s 350 hourly employees average 37.6 years of age with a
standard deviation of 8.3 years. If a random sample of 45 hourly employees is taken, what
is the probability that the sample will have an average age of less than 40 years?
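1.5.2 SAMPLING DISTRIBUTION OF THE PROPORTION ($\bar{p}$)
It is the probability distribution of the sample proportion ($\bar{p}$).
$\bar{p} = \frac{x}{n}$
x - Number of items which carry the specific characteristic
n - Total number of items (sample size)
The sampling distribution of the proportion has two parameters:
Mean of the sample proportion ($\mu_{\bar{p}}$)
$\mu_{\bar{p}} = p$ (population proportion)
The standard error of the proportion ($\sigma_{\bar{p}}$)
$\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}}$
where q = 1 - p. For a finite population,
$\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}}$
where $\sqrt{\frac{N - n}{N - 1}}$ is the finite population multiplier.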
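Step 1. Check that np & nq >= 5
np = 120 x 0.6 = 72
nq = 120 x 0.4 = 48
2. Calculate $\mu_{\bar{p}}$ and $\sigma_{\bar{p}}$
$\mu_{\bar{p}} = p = 0.6$
$\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.6 \times 0.4}{120}} = 0.0447$
3. Calculate Z
$Z = \frac{\bar{p} - p}{\sigma_{\bar{p}}}$
$Z_{0.5} = \frac{0.5 - 0.6}{0.0447} = -2.24$
4. Calculate the area covered by the interval
$P(\bar{p} \le 0.5) = P(z \le -2.24)$
= 0.5 - P(0 to -2.24)
= 0.5 - 0.48745
= 0.01255
5. Interpret the results
The probability of finding that 50% or less of the contractors use this particular brand is 1.255% if we take a random sample of 120.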
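The same steps can be sketched in Python using the values that appear in the calculation (n = 120, p = 0.6, cut-off 0.5). The standard library's normal distribution replaces the Z table, so the result differs slightly from the table value because Z is not rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

n, p = 120, 0.6
q = 1 - p
cutoff = 0.5

normal_ok = n * p >= 5 and n * q >= 5     # step 1: normal approximation check
se = sqrt(p * q / n)                      # step 2: standard error of the proportion
z = (cutoff - p) / se                     # step 3: Z value
prob = NormalDist().cdf(z)                # step 4: lower-tail area up to z

print(normal_ok, round(se, 4), round(z, 2), round(prob, 4))
```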
Activity:
1. If 10% of a population of parts is defective, what is the probability of randomly selecting 80 parts and finding that 12 or more are defective?
2. If the population proportion is 0.28 and the sample size is 140, then for random samples the sample proportion will be less than what value 30% of the time?
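1.5.3 SAMPLING DISTRIBUTION OF THE DIFFERENCE BETWEEN TWO SAMPLE MEANS ($\bar{x}_1 - \bar{x}_2$)
This distribution is concerned with the difference between sample means drawn from two populations. That is, it is used to determine whether the mean of one population is equal to the mean of another.
The sampling distribution of $\bar{x}_1 - \bar{x}_2$ has two parameters:
1. Mean of the difference between two sample means:
$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$
2. Standard error of the difference between two sample means:
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
This holds true when each sample is small relative to its population ($n_1 < 0.05N_1$ and $n_2 < 0.05N_2$). Otherwise the finite population multiplier is applied to each term:
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} \cdot \frac{N_1 - n_1}{N_1 - 1} + \frac{\sigma_2^2}{n_2} \cdot \frac{N_2 - n_2}{N_2 - 1}}$
The Z value is
$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}$
where $\bar{x}_1 - \bar{x}_2$ is the random variable,
$\mu_1 - \mu_2$ is the mean of the random variable, and
$\sigma_{\bar{x}_1 - \bar{x}_2}$ is the standard deviation of the random variable.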
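The calculation for the difference between two sample means can be sketched as a small function. It assumes independent samples, a normal sampling distribution and no finite population correction; the function name and the input figures are hypothetical and are not taken from the module.

```python
from math import sqrt
from statistics import NormalDist

def prob_diff_means_at_most(d, mu1, sigma1, n1, mu2, sigma2, n2):
    """P(xbar1 - xbar2 <= d) for two independent samples (no finite population correction)."""
    mean_diff = mu1 - mu2                                  # mean of the difference
    se_diff = sqrt(sigma1**2 / n1 + sigma2**2 / n2)        # standard error of the difference
    z = (d - mean_diff) / se_diff
    return NormalDist().cdf(z)

# Hypothetical figures, chosen only for illustration.
p = prob_diff_means_at_most(d=0, mu1=50, sigma1=10, n1=25, mu2=45, sigma2=8, n2=25)
print(round(p, 4))
```

Setting d = 0 gives the probability that the first sample mean does not exceed the second, which is the form of question asked in exercises on this distribution.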
Activity: A soft drink factory produces two soft drinks, Apple and Sheweps. The daily production of Apple averages 15,000 bottles and is normally distributed with a standard deviation of 2,000 bottles. Sheweps daily production is also normally distributed, with a mean of 12,500 and a standard deviation of 2,500 bottles. A sample of five randomly selected daily production figures is taken from each of the plants. What is the probability that the sample mean production for Apple will be less than or equal to the sample mean production for Sheweps?
Hint: 1. Calculate the expected value and the standard error
2. Calculate Z
3. Calculate the area covered by the interval
4. Interpret the result
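1.5.4 SAMPLING DISTRIBUTION OF THE DIFFERENCE OF TWO
PROPORTIONS ($\bar{p}_1 - \bar{p}_2$)
Suppose two populations of size $N_1$ and $N_2$ are given. For each sample of size $n_1$ from the first
population, compute the sample proportion $\bar{p}_1$ and standard deviation $\sigma_{\bar{p}_1}$. Similarly, for each
sample of size $n_2$ from the second population, compute the sample proportion $\bar{p}_2$ and
standard deviation $\sigma_{\bar{p}_2}$.
For all combinations of these samples from the two populations, we can obtain a sampling
distribution of the difference $\bar{p}_1 - \bar{p}_2$ of sample proportions. Such a distribution is called the
sampling distribution of the difference of two proportions. The mean and standard deviation
of this distribution are given by
$$\mu_{\bar{p}_1 - \bar{p}_2} = \mu_{\bar{p}_1} - \mu_{\bar{p}_2} = P_1 - P_2$$
and
$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\sigma_{\bar{p}_1}^2 + \sigma_{\bar{p}_2}^2} = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$$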
If the sample sizes $n_1$ and $n_2$ are large ($n_1 \ge 30$ and $n_2 \ge 30$), then the sampling distribution of the
difference of proportions is closely approximated by a normal distribution.
Example: 10% of the machines produced by company A are defective and 5% of those produced
by company B are defective. A random sample of 250 machines is taken from company A
and a random sample of 300 machines from company B. What is the probability that the
difference in sample proportions is less than or equal to 0.02?
Solution
$$\mu_{\bar{p}_1 - \bar{p}_2} = 0.10 - 0.05 = 0.05$$
$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{0.10(0.90)}{250} + \frac{0.05(0.95)}{300}} = 0.0228$$
$$P(\bar{p}_1 - \bar{p}_2 \le 0.02) = P\left(Z \le \frac{0.02 - 0.05}{0.0228}\right) = P(Z \le -1.32) = 0.5000 - 0.4066 = 0.0934$$
Hence the desired probability for the difference in sample proportions is 0.0934.
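The machine example can be reproduced with a few lines of Python. This is a sketch under the assumption that the threshold is 0.02, the value used in the worked calculation; the variable names are illustrative. Because the exact normal curve is used instead of a table, the last digits differ slightly from 0.0934.

```python
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.10, 250   # company A: proportion defective, sample size
p2, n2 = 0.05, 300   # company B: proportion defective, sample size
d = 0.02             # threshold for the difference in sample proportions

# Mean and standard error of the difference of sample proportions.
mean_diff = p1 - p2
std_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Standardise the threshold and read the lower-tail area.
z = (d - mean_diff) / std_error
prob = NormalDist().cdf(z)

print(f"SE = {std_error:.4f}, z = {z:.2f}, probability = {prob:.4f}")
```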
SUMMARY
The sampling distribution of the mean is the distribution of the values of
$\bar{X}$ calculated from all possible samples of the same size selected from a population.
Sampling error is the difference between the value of a sample statistic
calculated from a random sample and the corresponding population parameter. This type
of error occurs due to chance. The errors that occur during the collection, recording, and
tabulation of data are known as non-sampling errors.
A method of selecting a sample in which the population is first divided into
strata and a simple random sample is then taken from each stratum is called stratified
sampling. A method of choosing a sample by randomly selecting one of the first k elements
and then selecting every kth element thereafter is systematic sampling. Cluster sampling is
a method of sampling in which the population is first divided into clusters and then one or
more clusters are selected for sampling.
A non-probabilistic method of sampling whereby elements are selected for
the sample on the basis of convenience is called convenience sampling, whereas judgment
sampling is a non-probabilistic method of sampling whereby elements are selected for the
sample based on the judgment of the person doing the study.
The central limit theorem is the theorem from which it is inferred that for a large
sample size (n ≥ 30), the shape of the sampling distribution of $\bar{X}$ is approximately
normal. Also, by the same theorem, the shape of the sampling distribution of $\bar{p}$ is
approximately normal when np ≥ 5 and nq ≥ 5.
SELF CHECK EXERCISE 1
CHAPTER TWO
_____________________________________________________________________
STATISTICAL ESTIMATION AND STATISTICAL INFERENCE
UNIT OUTLINE
Basic Concepts
Criteria for Estimators
Point Estimators of the Mean & Proportion
Interval Estimators of the Mean & Proportion
Student’s t Distribution.
Determination of Sample Size
Chapter objective
After completing this unit, students will be able to:
Understand estimation as an inferential process
Understand point estimate, estimator and estimation
Distinguish point and interval estimates
Identify characteristics of good estimators
Make point and interval estimation
Dear learner, we have seen the concept of sampling and sampling distribution in chapter
one of this module. Perhaps you may wonder about the need for sampling. Do you
remember why we need to take samples? Yes, a census is costly and sometimes impossible.
Therefore, we need to take part of the entire population (a sample) and infer the
characteristics of the population from the sample we have drawn. Consider the following
statement: the life span of an electric lamp produced by Sahara is 4,500 hours. In this
chapter we continue our discussion of inferential statistics by examining point estimation
and interval estimation.
Brainstorming question
What is estimation and when do we use estimation?
Statistical inference is the process of using limited information, a sample, for the purpose
of reaching a conclusion about a large set of data, the population. Estimation refers to any
procedure where sample information is used to estimate or predict the numerical value of
some population measure (called a parameter) such as the population mean μ.
A statistic is an unbiased estimator of a parameter if the expected value of the statistic
equals the parameter, i.e. if
E(statistic) = Parameter
Any statistic chosen as an estimator is a random variable since the value of the statistic
may differ from sample to sample. The expected value of a random variable may be
interpreted as long-run average. Therefore, the above definition indicates that a statistic is
an unbiased estimator of a parameter if the average value of the statistic is the same as the
parameter value. Thus on average the estimator will be correct.
Efficiency
Unbiasedness alone does not guarantee a good estimator. In fact, some parameters may
have more than one unbiased estimator. Selection among the unbiased estimators is made
on the basis of comparing the variances of the estimators.
If there exists more than one unbiased estimator of a population parameter, the estimator
with the minimum variance is the most efficient.
Even though the average value of an unbiased estimator equals the parameter, an
estimator may yield estimates that are not particularly close to the parameter value. The
efficiency of an estimator is measured by the variance of the estimator. The minimum
variance unbiased estimator is the unbiased estimator with the smallest variance.
Consistency
Another desirable property is that an estimator should produce estimates that have a high
probability of being close to the true value as the sample size increases. An estimator that
has this property is called a consistent estimator. The variance of a consistent estimator
becomes smaller as larger sample sizes are taken.
Sufficiency
A last property of a good estimator is sufficiency. A sufficient estimator is one that
utilizes all the information a sample contains about the parameter estimated. In choosing
among possible candidates for the best estimator of a parameter, it is possible that no one
estimator has all the desirable properties. One estimator may be unbiased but have a large
variance. A biased estimator may have a smaller variance. A consistent estimator may also
be biased. A biased estimator is not necessarily undesirable unless the amount of bias is
large. Consistency indicates that the amount of bias becomes smaller as the sample size
increases. From the discussion, we can see that point estimators are not selected
haphazardly; rather, they are selected on the basis of some well defined criteria.
$$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Where $\bar{X}$ = sample mean
n = sample size
Dear learner, pay attention to the divisor n − 1 (sample size minus one) in the formula. Earlier,
we used the divisor N when computing a population standard deviation $\sigma_x$.
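For the random sample 1, 2, 4, 5, 7, 11, write the symbol for and compute the sample
standard deviation.
Solution
The sample mean is $\bar{X} = 30/6 = 5$, so
$$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{(1-5)^2 + (2-5)^2 + (4-5)^2 + (5-5)^2 + (7-5)^2 + (11-5)^2}{6 - 1}} = \sqrt{\frac{66}{5}} = 3.633$$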
less than 5% of the population size. In our case, the total size of the population is
unknown; therefore it is safer to assume that the sample is less than 5% of the entire
population. Hence, we will use the estimator $S_x$ to estimate the standard error $\sigma_{\bar{X}} = \sigma_x / \sqrt{n}$. The
symbol $S_{\bar{X}}$ is called the sample standard error of the mean. The formula for $S_{\bar{X}}$ is
$$S_{\bar{X}} = \frac{S_x}{\sqrt{n}}$$
Thus, $S_x$ is the estimator for $\sigma_x$, and $S_{\bar{X}}$ is the estimator for $\sigma_{\bar{X}}$.
Dear learner, we have calculated $S_x$ = 3.633 for the random sample of 1, 2, 4, 5, 7, 11. The
sample standard error can be obtained using the formula
$$S_{\bar{X}} = \frac{S_x}{\sqrt{n}} = \frac{3.633}{\sqrt{6}} = 1.483$$
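The two point estimators just computed can be verified with Python's standard library. This is a minimal sketch; `statistics.stdev` uses the divisor n − 1, which matches the sample formula above, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

sample = [1, 2, 4, 5, 7, 11]
n = len(sample)

x_bar = mean(sample)        # sample mean
s_x = stdev(sample)         # sample standard deviation (divisor n - 1)
s_xbar = s_x / sqrt(n)      # sample standard error of the mean

print(f"mean = {x_bar}, S_x = {s_x:.3f}, S_xbar = {s_xbar:.3f}")
```

Using `statistics.pstdev` instead would apply the divisor N and give the population standard deviation, which is not what is wanted for a sample.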
Standard error of the proportion answers how far an unknown population proportion
might be from the sample proportion. The symbol $S_{\bar{p}}$ will be used to mean the standard error of
the proportion.
$$S_{\bar{p}} = \sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
Where $\bar{p}$ = sample proportion of success
$\bar{q} = 1 - \bar{p}$
n = sample size
Example
Let an even number be a success, and suppose a sample of 200 numbers is selected
randomly from a population and contains 120 even numbers. Write the symbol for and
compute the value of the point estimator of the standard error of the proportion.
Solution
Here $\bar{p} = 120/200 = 0.6$ and $\bar{q} = 0.4$, so
$$S_{\bar{p}} = \sqrt{\frac{\bar{p}\,\bar{q}}{n}} = \sqrt{\frac{0.6 \times 0.4}{200}} = 0.0346$$
The following table shows some population parameters and their estimators.
Population parameter                              Sample statistic (estimator)
Mean $\mu$                                        $\bar{X}$
Standard deviation $\sigma_x$                     $S_x$
Variance $\sigma_x^2$                             $S_x^2$
Proportion $p$                                    $\bar{p}$
Standard error of the mean $\sigma_{\bar{X}}$     $S_{\bar{X}}$
conveys the fact that estimation is an uncertain process. The standard error of the point
estimator is used in creating a range of values; thus, a measure of variability is
incorporated into interval estimation. Further, a measure of confidence in the interval
estimator is provided; consequently, interval estimates are also called Confidence
Intervals. For these reasons, interval estimators are considered more desirable than point
estimators.
For the sampling distribution of the mean, the standard normal variable is
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$$
If we want to be 95% confident that the population mean μ falls within the estimate, we
can calculate the range as follows.
1. Find the Z value for the 95% confidence level.
2. Use the obtained Z value to calculate the unknown population parameter.
For example, the Z value for a 95% confidence level is 1.96. Therefore, if we want to be 95%
sure that the true population mean falls within the estimate, we can rearrange the above
formula and get:
$$\bar{X} - 1.96\,\sigma_{\bar{X}} \le \mu \le \bar{X} + 1.96\,\sigma_{\bar{X}}$$
The proportion of correct estimates (0.95 in our illustration) is called the confidence
coefficient C. The number 100C (95% in our illustration) is called the confidence level. The
proportion of incorrect statements is symbolized by the Greek letter α (alpha). The sum of
the proportions of correct and incorrect statements is 1; so
C + α = 1 or α = 1 − C
We can describe C as the chance that the confidence interval is correct, and α as the chance
that the interval is incorrect.
Example
A normal population has a standard deviation of 10. A random sample of size 25 has a mean
of 50. Construct a 95% confidence interval estimate of the population mean.
Solution
To construct the confidence interval,
we first find the Z value for the 95% confidence level and then use the formula
$\bar{X} - Z\sigma_{\bar{X}} \le \mu \le \bar{X} + Z\sigma_{\bar{X}}$ to estimate the interval. The Z value for the 95% confidence level is
1.96. Therefore, the estimate can be given as $\bar{X} - 1.96\,\sigma_{\bar{X}} \le \mu \le \bar{X} + 1.96\,\sigma_{\bar{X}}$. That is:
$$50 - 1.96\left(\frac{10}{\sqrt{25}}\right) \le \mu \le 50 + 1.96\left(\frac{10}{\sqrt{25}}\right)$$
$$50 - 3.92 \le \mu \le 50 + 3.92$$
$$46.08 \le \mu \le 53.92$$
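The same interval can be computed directly. Below is a minimal Python sketch for the case of a normal population with known standard deviation; the function name `z_interval` is an illustrative assumption. The Z value is obtained from the inverse normal curve rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    # Z value leaving alpha/2 in the upper tail (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


lower, upper = z_interval(x_bar=50, sigma=10, n=25, confidence=0.95)
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```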
$$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma_x}{\sqrt{n}}$$
The smaller the value of $Z_{\alpha/2}\,\sigma_x/\sqrt{n}$, the more precise (narrower) is the confidence interval.
Consequently, the smaller $Z_{\alpha/2}$ and $\sigma_x$ are, and the larger n is, the more precise will be the
interval. We conclude that the larger the sample size, the more precise is an interval
estimate. It can also be concluded that the smaller the variability, the more precise the
estimate. The final conclusion that can be drawn from the above relationship is that the lower
the confidence level, the more precise is the interval estimate.
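These three conclusions can be seen by computing the half-width of the interval under different settings. The sketch below reuses σ = 10, n = 25 and C = 0.95 as a baseline; the alternative values (n = 100, σ = 5, C = 0.90) and the function name `half_width` are hypothetical choices made only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def half_width(sigma, n, confidence):
    """Half-width Z_(alpha/2) * sigma / sqrt(n) of the interval for the mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)


base = half_width(sigma=10, n=25, confidence=0.95)
larger_sample = half_width(sigma=10, n=100, confidence=0.95)   # larger n
smaller_sigma = half_width(sigma=5, n=25, confidence=0.95)     # less variability
lower_conf = half_width(sigma=10, n=25, confidence=0.90)       # lower confidence

for label, value in [("baseline", base), ("n = 100", larger_sample),
                     ("sigma = 5", smaller_sigma), ("C = 0.90", lower_conf)]:
    print(f"{label:10s} half-width = {value:.2f}")
```

Each change narrows the interval; quadrupling the sample size halves the half-width because n enters through its square root.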
In the previous case the population is normally
distributed and the population standard deviation is known. In that case we look up the Z value
for α/2 and use the formula $\bar{X} \pm Z_{\alpha/2}\,\sigma_x/\sqrt{n}$ to estimate the interval within which the
population mean lies with confidence coefficient C. However, most of the time the population
standard deviation $\sigma_x$ is unknown, just as the population mean µ is. Therefore, $\sigma_x$ must be estimated
by the sample standard deviation.
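$$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
After calculating the standard deviation, the standard error must be computed using the
following formula.
$$S_{\bar{X}} = \frac{S_x}{\sqrt{n}}$$
When the population standard deviation is known, the interval estimate is based on
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$$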
When the population standard deviation is unknown and is estimated by the sample standard
deviation, the distribution does not follow the
normal distribution. The distribution rather follows a Student's t-distribution, which was
identified for the first time by W. S. Gosset in the 1900s. There is a different t-distribution for
each sample size. The t-distribution is discussed in greater detail under hypothesis testing. In this
chapter we will only illustrate how to make an interval estimate using the t-distribution,
without giving much emphasis to the distribution's characteristics.
Tail areas for the t-distribution are presented according to a parameter called degrees of
freedom. We shall use the symbol ν for degrees of freedom. The degrees of freedom for the
t-distribution can be calculated as ν = n − 1.
Where
ν = degrees of freedom
n = sample size
As ν increases, the tail area decreases, and so does the t-value. As the degrees of freedom increase,
the t-distribution approaches the standard normal distribution. When the degrees of freedom reach
30, the t-distribution is approximately the same as the normal distribution.
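To construct an interval estimate for µ in this situation, we need to use the value of $t_{\alpha/2,\nu}$,
which is read from the statistical table, together with the formula:
$$\bar{X} - t_{\alpha/2,\nu}\,S_{\bar{X}} \le \mu \le \bar{X} + t_{\alpha/2,\nu}\,S_{\bar{X}}$$
Where
$\bar{X}$ = sample mean
n = sample size
ν = n − 1 (degrees of freedom)
$S_x$ = sample standard deviation
$S_{\bar{X}} = S_x/\sqrt{n}$ = sample standard error of the mean
μ = unknown population mean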
Example
The environmental protection officer of a large industrial plant sought to determine the
mean daily amount of sulphur oxide (pollutant) emitted by the plant. Because
measurement costs were high, only a random sample of 10 days' measurements was
obtained. These were, in tons per day,
8 7 10 15 11 6 8 5 13 12
Suppose emissions per day are normally distributed. Estimate μ, the mean amount of
sulphur oxides emitted per day using the confidence interval with a confidence coefficient
of 0.95.
Solution
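$$\bar{X} = \frac{\sum X}{n} = \frac{95}{10} = 9.5$$
$$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{94.5}{9}} = 3.24$$
The confidence level is 95%. Therefore, the significance level is α = 1 − C = 1 − 0.95 = 0.05 and
α/2 = 0.025.
Next, we have to calculate the degrees of freedom for the observations, which are given as
ν = n − 1 = 10 − 1 = 9
The sample standard error is $S_{\bar{X}} = S_x/\sqrt{n} = 3.24/\sqrt{10} = 1.025$, and the table value
$t_{\alpha/2,\nu}$ in this specific case is $t_{0.025,9} = 2.262$. The interval estimate is therefore
$$9.5 - 2.262(1.025) \le \mu \le 9.5 + 2.262(1.025)$$
$$7.18 \le \mu \le 11.82$$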
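The t-based interval for the sulphur oxide example can be sketched in Python as follows. The sketch starts from the summary values in the solution (mean 9.5, standard deviation 3.24, n = 10). The critical value 2.262 is the standard t-table entry for α/2 = 0.025 with 9 degrees of freedom; it is typed in by hand here because the Python standard library has no t-distribution, which is an assumption of this sketch rather than something the module prescribes.

```python
from math import sqrt

n = 10            # sample size
x_bar = 9.5       # sample mean from the solution
s_x = 3.24        # sample standard deviation from the solution
t_crit = 2.262    # t-table value for alpha/2 = 0.025 and nu = n - 1 = 9

# Sample standard error of the mean and margin of error.
s_xbar = s_x / sqrt(n)
margin = t_crit * s_xbar

lower, upper = x_bar - margin, x_bar + margin
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

With a Z value of 1.96 in place of 2.262 the interval would be narrower, which understates the uncertainty for such a small sample.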
normal distribution. Hence, we can use the Central limit theorem to construct interval
estimate for a mean when sample size is greater than or equal to 30.
Often we need to estimate a population proportion, such as the proportion that
supports a given political party. The symbol p represents the population proportion of
success and q = 1 − p. If supporting the political party counts as a success, p is the
population proportion of supporters and q = 1 − p is the proportion who do not support the
political party.
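The sampling distribution of $\bar{p}$ has a mean of p and a standard error of $\sigma_{\bar{p}} = \sqrt{pq/n}$. When
constructing a confidence interval for the unknown value of p, we use the estimator $\bar{p}$ in place
of p in the formula. Next, we compute the sample standard error of the proportion,
$$S_{\bar{p}} = \sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
Then, using $S_{\bar{p}}$ as the estimator of $\sigma_{\bar{p}}$, we can calculate the interval estimate as:
$$\bar{p} - Z_{\alpha/2}\sqrt{\frac{\bar{p}\,\bar{q}}{n}} \le p \le \bar{p} + Z_{\alpha/2}\sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$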
Example
A random sample of 400 members of the labour force in a five-state region showed that 32
were unemployed. Construct the 95% confidence interval for the proportion unemployed
in the region.
Solution
$$\bar{p} = \frac{32}{400} = 0.08$$
With C = 95%, α = 0.05 and α/2 = 0.025.
Find $Z_{0.025}$ from the statistical table. To find the Z value, search for the probability in the main
body of the Z table and read off the corresponding Z score. In our case that will be 1.96.
Therefore, the interval estimate can be calculated as:
$$\bar{p} - Z_{\alpha/2}\sqrt{\frac{\bar{p}\,\bar{q}}{n}} \le p \le \bar{p} + Z_{\alpha/2}\sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
$$0.08 - 1.96\sqrt{\frac{(0.08)(0.92)}{400}} \le p \le 0.08 + 1.96\sqrt{\frac{(0.08)(0.92)}{400}}$$
$$0.053 \le p \le 0.107$$
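The unemployment example can be checked with a short Python sketch. The function name `proportion_interval` is an illustrative assumption; the Z value comes from the inverse normal curve instead of a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    p_bar = successes / n
    q_bar = 1 - p_bar
    # Z value leaving alpha/2 in the upper tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Margin of error: Z times the sample standard error of the proportion.
    margin = z * sqrt(p_bar * q_bar / n)
    return p_bar, p_bar - margin, p_bar + margin


p_bar, lower, upper = proportion_interval(successes=32, n=400, confidence=0.95)
print(f"p_bar = {p_bar:.2f}: {lower:.3f} <= p <= {upper:.3f}")
```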
take a small sample to hold costs down. On the other hand, we want the sample to be large
enough to provide a good estimator of the population proportion. Consequently, the issue is
how large the sample size should be. The size of the sample depends on three factors:
How precise or narrow we want the interval estimate to be
How confident we want to be that the interval estimate is correct
How variable is the population being sampled
The higher the desired precision or level of confidence, the larger will be the sample; also
for a given precision and level of confidence, the larger the population variability is, the
larger will be the sample.
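The interval estimate of a proportion is
$$\bar{p} - Z_{\alpha/2}\sqrt{\frac{\bar{p}\,\bar{q}}{n}} \le p \le \bar{p} + Z_{\alpha/2}\sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
which shows that the interval extends over
$$\bar{p} \pm Z_{\alpha/2}\sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
The interval will be more precise, or narrower, the smaller the term that follows the ± sign. This term
is called the error and is indicated by e.
$$e = Z_{\alpha/2}\sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
If we solve for n, we get the following formula:
$$n = \frac{Z_{\alpha/2}^2\,\bar{p}\,\bar{q}}{e^2}$$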
In fact we are trying here to determine how large our sample size should be, so we do not
have $\bar{p}$ and $\bar{q}$ because the sample has not yet been taken. Therefore, instead of $\bar{p}$ and $\bar{q}$,
we need to use p and q. However, p and q themselves are not known. Therefore, it is safest
to take 0.5 for p, which yields the largest (safest) sample size. If the decision maker has some prior
information about the population proportion, that should be used instead of 0.5. If the
existing information leads to the belief that the population proportion is between two
values:
If both values are on the same side of 0.5, choose p as the value closer to 0.5.
If 0.5 is between the two values, use 0.5 for p.
Example
Suppose we want to estimate a population proportion to within 0.04 with
a confidence coefficient of C = 0.90. How large should the sample be?
Solution
We are given the confidence coefficient and the error. The population proportion that yields the
safest sample size is 0.5. Therefore, it is possible to calculate the sample size using the
formula:
$$n = \frac{Z_{\alpha/2}^2\,pq}{e^2}$$
Next, we need to read the Z value for α/2, where α = 1 − 0.9 = 0.1. Therefore, α/2 = 0.05 and
$$Z_{\alpha/2} = Z_{0.05} = 1.64$$
Therefore,
$$n = \frac{1.64^2\,(0.5)(0.5)}{(0.04)^2} = 420.25$$
Since the sample size must be a whole number, we round up and take n = 421.
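The sample size calculation for a proportion can be sketched as a small Python function. The function name and the default p = 0.5 (the safest choice described above) are illustrative assumptions; the Z value 1.64 is the table value used in the example. The result is rounded up because a fraction of an observation cannot be sampled.

```python
from math import ceil


def sample_size_for_proportion(z, e, p=0.5):
    """Sample size n = z^2 * p * q / e^2, returned exactly and rounded up."""
    q = 1 - p
    n_exact = z**2 * p * q / e**2
    return n_exact, ceil(n_exact)


n_exact, n_required = sample_size_for_proportion(z=1.64, e=0.04)
print(f"n = {n_exact:.2f}, so take a sample of {n_required}")
```

If prior information suggests a proportion away from 0.5, passing it as `p` gives a smaller required sample.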
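The interval estimate of the mean,
$$\bar{X} - Z_{\alpha/2}\,\frac{\sigma_x}{\sqrt{n}} \le \mu \le \bar{X} + Z_{\alpha/2}\,\frac{\sigma_x}{\sqrt{n}}$$
can be rewritten as $\bar{X} \pm Z_{\alpha/2}\,\sigma_x/\sqrt{n}$; this can be expressed as $\bar{X} \pm e$.
Therefore,
$$e = Z_{\alpha/2}\,\frac{\sigma_x}{\sqrt{n}}$$
and solving for n gives
$$n = \frac{Z_{\alpha/2}^2\,\sigma_x^2}{e^2}$$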
As can be seen from the above formula, there is a direct relationship between sample size
and variation in the population: the more the variability, the larger is the sample
size. The variation of the population, however, is usually not known, and no estimate of it is
available prior to sampling. If there is historical evidence of the variance, that can be used. But most
of the time neither the population variance nor the sample variance is known. Hence we
need to estimate the standard deviation using the formula:
$$\sigma \approx \frac{\text{high value} - \text{low value}}{4}$$
Example
A sample is to be taken to estimate the mean salary of plumbers to within birr 500 with a
confidence coefficient of 0.99. A plumbers' union official states that birr 40,000 and birr
26,000 would be unusually large and small salaries for plumbers in the union. What should
the sample size be?
Solution
$$n = \frac{Z_{\alpha/2}^2\,\sigma_x^2}{e^2}$$
It is possible to use this formula to find the sample size, but we first need to find σ.
$$\sigma \approx \frac{\text{official's high value} - \text{official's low value}}{4} = \frac{40{,}000 - 26{,}000}{4} = 3500$$
With C = 0.99, α/2 = 0.005 and $Z_{0.005} = 2.58$. Therefore,
$$n = \left(\frac{2.58(3500)}{500}\right)^2 = 326.16$$
so a sample of 327 plumbers is needed.
UNIT SUMMARY
Estimation is the process of using a sample statistic to infer about the
population parameter. There are two types of estimates: point estimates and interval
estimates.
Point estimate is the process of assigning a single value that we believe the
population parameter takes.
Interval estimate is the process of constructing a range within which the
population parameter lies.
Point estimator for population mean is sample mean, for population
proportion is sample proportion, for population standard deviation is sample standard
deviation, for population standard error is sample standard error, the difference of two
population means is the difference of two sample means and the difference of two
population proportions is the difference of two sample proportions.
The confidence interval shows how certain we are that the interval is
correct. The choice of method used in constructing a confidence interval depends
upon whether or not the population is normal and whether the population standard
deviation $\sigma_x$ is known or unknown.
The narrower the confidence interval is, the more precise it is. And the wider
the interval, the less precise is the interval
We conclude that the larger the sample size, the more precise is an interval
estimate.
The smaller the variability, the more precise the estimate
The lower the confidence level, the more precise is the interval estimate
If population standard deviation is unknown, we need to estimate
population standard deviation with sample standard deviation and the distribution does
not follow normal distribution; it rather follows a student’s t-distribution. There are
different t-distributions for each sample size.
The size of the sample depends on three factors:
How precise or narrow we want the interval estimate to be
How confident we want to be that the interval estimate is correct
How variable is the population being sampled
1. Ford Motor Company introduced a new minibus which has greater fuel
economy than the regular sized minibus. A random sample of 50 minibuses averaged 30
miles per gallon, and had standard deviation of 3 miles per gallon. Construct a 95 percent
confidence interval for the mean miles per gallon for all minibuses.
2. A cattle raiser selected random sample of 10 steers, all of the same age and
fed them special mixture of grains and other ingredients. After a period of time, weight
gains were recorded. The sample mean weight gain, per steer, was 142.6 pounds and
standard deviation was 10.4 pounds. Suppose weight gains are normally distributed.
Construct a 90% confidence interval for the population mean weight gain per steer.
3. The diameters of ball bearings made by an automatic machine are normally
distributed and have a standard deviation of 0.02 mm. The mean of a random sample of four
ball bearings is 6.01 mm. Construct the 95% confidence interval for the mean diameter of all
ball bearings being made by the machine.
4. Interviewers called a random sample of 300 homes while “Ehud Mezinanya”
is being aired. 105 respondents said they were watching the program. Construct a 95%
confidence interval for the proportion of all homes where the program was being watched.
5. The proportion of all consumers favouring a new product might be as low as
0.20 or as high as 0.60. A random sample is to be used to estimate the proportion of the
consumers who favour the new product to within ±0.05, with a confidence coefficient of
90%. To be on the safe (larger sample) side, what sample size should be used?
CHAPTER THREE
HYPOTHESES TESTING
Unit contents
Chapter introduction
Type I and type II errors
Steps in Hypothesis Testing
One tail and two tail tests
Hypothesis test of Population mean
Hypothesis test of population proportion
Hypothesis test of the difference between two means
Hypothesis test of the difference between two proportions
Hypothesis test when population standard deviations are unknown and sample size is small
Unit Objective
At the end of this chapter students will be able to:
Understand and apply hypothesis testing in different managerial problems
Identify type one and type II errors
Identify one tail and two tail tests
Conduct hypothesis test of population mean
Conduct hypothesis test of population proportions
Conduct hypothesis test of the difference of two means
Conduct hypothesis test of the difference of two proportions
Conduct hypothesis test for normally distributed population with unknown
population standard deviation and small sample size
Dear learner, in this chapter, the concept of a statistical test of hypothesis is formally
introduced. Business managers must always be ready to make decisions and take actions
on the basis of the available information. During the process of decision making, managers
form hypotheses that they can scientifically test by using the available information. The
managers then make decisions in the light of the outcomes.
We make assumptions about the population parameter to be tested. The assumptions we
make about population parameters are called hypotheses. Then we take a sample to
estimate the value of the population parameter. If the estimate favours the hypothesis, we
accept the hypothesis as being correct. If the value of the sample statistic calculated as
an estimate of the population parameter does not favour the hypothesis made about the
population, then a decision must be made as to whether the difference is purely a matter of
chance (when the sample statistic and the population
parameter are in fact similar) or whether the difference is significant enough to be
a real difference, so that our assumption about the population parameter is not correct.
Since we are testing whether our hypothesis or assumption is true or not, this field of
decision making is called hypothesis testing.
Dear students, consider the following example. The police found a dead body, suspected
murder, and want to investigate the cause of death. Upon further investigation, a
detective at the scene of the murder makes some assumptions or inferences about the
murder based on the initial observation and analysis of the scene of the crime:
The victim was struck from behind by a left-handed man
The murderer is tall
The detective then makes an assumption about the butler and checks the butler's height and
whether he is right-handed or left-handed.
Let's say the detective assumes the butler is innocent. After checking, however, the detective
finds that the butler is tall and left-handed.
Dear students, does this fact make the butler guilty? In fact, yes: the mere fact that he is
tall and left-handed leads the detective to accept the proposition that the butler is the
killer.
On the other hand, let's assume that the butler is short and right-handed. Do you think
this fact will make him innocent? Definitely, yes: the fact that he is right-handed and
short leads the detective to conclude that the butler is innocent.
Caution
The mere fact that the butler is tall and left-handed does not prove that he killed the man,
and the same is true for the short, right-handed person, because there is a chance that he
deliberately planned to look this way.
Detectives action
Type I error
Type I error is an error made in rejecting the null hypothesis when in fact it is true. In the
above example, if the null hypothesis is that the butler is innocent, what will be the type I error?
The type I error is charging the butler when he is indeed innocent. The probability of a type I
error is denoted by α (alpha), the probability of rejecting a true null
hypothesis. It is known as the level of significance. 1 − α represents the degree of confidence.
Type II error
Type II error is an error made in accepting the null hypothesis when in fact it is false. In
the previous murder case, if the butler who has killed the victim is released, the detective is
said to commit a type II error: declaring a criminal butler innocent.
The probability of a type II error is denoted by β; it is the probability of accepting a false
null hypothesis. The value of β should be as low as possible.
Rejection of the null hypothesis that is being tested implies acceptance of alternative
hypothesis. The two hypotheses represent mutually exclusive and collectively exhaustive
theories about the value of the population parameters such as population mean,
population proportions, and population variance.
Significance level
Significance level is the chance of committing a type I error, that is, the probability of rejecting
a true null hypothesis. Significance level is denoted by α. When the value of a test statistic
leads us to reject Ho, we say that the value is statistically significant. Significance level is
often used by analysts in reporting test results. For example, an analyst may report that a
test result was significant at the 5% level but not significant at the 1% level: that means
the null hypothesis that was tested would be rejected if α = 0.05 but would be accepted if
α = 0.01.
Making a decision
Based on the sample statistic obtained and the critical value, the analyst makes a decision
of whether to accept or reject the null hypothesis developed already.
area of the rejection lies entirely on one extreme of the curve either right or left tail are
known as one tail tests.
[Figure: two-tailed test. The acceptance region (1 − α) lies in the centre of the curve, with a rejection region (α/2) in each tail.]
A two tailed test is called for when we are interested in the population mean μ being
either much larger or much smaller than the specified value μo. For example, a winery may
need to know the average ml of wine per bottle. Too little wine causes customer complaints
and too much reduces profit.
An upper tail test is in order when we are concerned only with whether the population mean
μ is larger than the specified value μo. For example, an insurance company may need to
know the average amount of time it takes to process a claim. Too long a time is unacceptable
to customers.
[Figure: one-tailed test, showing the acceptance region (1 − α).]
A lower tail test is in order when we are concerned only with whether the population mean μ
is smaller than some specified value μo. For example, an electric bulb producer may need
to know if the average life span of the bulbs it produces is less than a given pre-specified
amount. If the life span is too short the customers will complain. But there is no problem if
the life span is too long.
1. Ho: µ≤a
H1:µ>a
2. H0: µ≥a
H1: µ <a
3. Ho: µ=a
H1:µ≠ a
Dear learner, the first two of the above hypothesis lead to one tail test discussed above and
the third one deals with two tail test.
In determining whether a one tail test is appropriate, it is helpful to express the
problem in terms of some phrases that indicate whether a single direction or both
directions of difference away from the parameter value is important. If the basic
question of interest can be expressed as has there been an increase? Is the new
better than the old, is there a decrease, or has there been a decline? Then a one tail
test is appropriate. If the question can be expressed as is there any change or is
there any difference? Then a two tailed test is appropriate.
Illustration one
Assume that the average annual income for government employees in the nation is
reported by the census bureau to be birr 18,750. There was some doubt whether the
average yearly income of government employees in Addis Ababa was representative of
the national average.
A random sample of 100 government employees in Addis Ababa was taken and it was found that
their average salary was birr 19,240 with a standard deviation of birr 2,610. At a level of
significance α = 0.05 (95% confidence level), can we conclude that the average salary of
government employees in Addis Ababa is representative of the national average?
Solution
1- State the hypothesis
H0: µ = 18750
H1: µ ≠ 18750
2- State the level of the significance
Level of significance α = 0.05
3- Determine which test statistic to use
Dear learner, here we need to determine which distribution to use to test
the null hypothesis. As we can see from the problem, the standard deviation is
known and the population is infinite. Moreover, the sample size is more than 30. Therefore, the
relevant test statistic, using the central limit theorem, is Z. The sampling
distribution of sample means is approximately normal with standard error of
the mean $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{19{,}240 - 18{,}750}{2{,}610/\sqrt{100}} = \frac{490}{261} = 1.877$$
Define the critical region. Since α = 0.05 and it is a two-tailed test, the rejection region will
lie in both tails of the curve, in such a way that the rejection area comprises 2.5%
(5%/2) at the end of the right tail and 2.5% (5%/2) at the end of the left tail.
The Z value from the table will be ±1.96.
Decision rule: accept Ho if −1.96 ≤ Z calculated ≤ 1.96.
Since the calculated z (1.877) lies between −1.96 and 1.96, we cannot reject the null hypothesis.
This means the sample gives no evidence that the mean income in Addis Ababa differs from the
national average of birr 18,750.
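The steps of Illustration one can be sketched in Python as follows. The function name `two_tailed_z_test` is an illustrative assumption, and the critical value is computed from the inverse normal curve rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(x_bar, mu0, s, n, alpha):
    """Return (z, critical value, reject H0?) for H0: mu = mu0 vs H1: mu != mu0."""
    # Test statistic: distance of the sample mean from mu0 in standard errors.
    z = (x_bar - mu0) / (s / sqrt(n))
    # Critical value leaving alpha/2 in each tail.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Reject only if z falls in either tail.
    return z, z_crit, abs(z) > z_crit


z, z_crit, reject = two_tailed_z_test(x_bar=19240, mu0=18750, s=2610, n=100, alpha=0.05)
print(f"z = {z:.3f}, critical value = ±{z_crit:.2f}, reject H0: {reject}")
```

Because the calculated z lies inside the acceptance region, the function reports that the null hypothesis is not rejected.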
Activity 5.1
1. Define type I and type II errors.
2. What is meant by the statement that the significance level of a test is α = 0.05?
What is meant by the symbol α?
3. An automatic machine should produce parts that have a mean diameter of
25 mm. Part diameters are normally distributed. The diameters of 10 parts are to be used to
check whether or not the machine is running properly. Perform a hypothesis test at the 5%
level if the mean of the sample is 25.02 mm and the sample standard deviation is 0.024 mm.
Illustration two
The manufacturer of a light bulb claims that the bulb lasts on average 1600 hrs.
We want to test this claim. We will not reject the claim if the average of the sample taken
is considerably more than 1600 hrs, but we will reject the claim if it is considerably
less than 1600 hrs.
A sample of 100 light bulbs was taken at random, and the average bulb life of this sample
was computed to be 1570 hrs with a standard deviation of 120 hrs. At α = 0.01, test the validity
of the claim.
Solution
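Step one
Ho: µ = 1600
Ha: µ < 1600
Step two
Find the significance level
As already stated in the problem, the test is to be done at α = 0.01. For a left-tailed
test at this level the critical value from the table is z = -2.33.
Decision rule
The decision rule is “Reject Ho if z < -2.33”
3. Calculate z
$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{1570 - 1600}{120/\sqrt{100}} = \frac{-30}{12} = -2.5$$
4. Decision: reject Ho, since -2.5 < -2.33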
Interpretation
The average life of the bulb is considerably less than 1600 hrs.
Example 3
An insurance company claims that it takes 2 weeks (14 days) on average to process an
auto-accident claim. The standard deviation is 6 days. To test the validity of the claim, an
investigator randomly selected 36 people who recently filed claims. This sample revealed that
it took the company an average of 16 days to process the claims. At the 99% level of
confidence, check whether it takes the company more than 14 days.
Solution
1. Ho: µ = 14 days
H1: µ > 14 days
2. Decision Rule:
The decision rule is “Reject Ho if Z> 2.33”
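3. Calculate the test statistic
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{16 - 14}{6/\sqrt{36}} = \frac{2}{1} = 2$$
4. Decision
The calculated z (2) is less than the critical value (2.33). Therefore accept Ho: the sample
does not show that processing takes more than 14 days.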
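Both one-tailed examples above (the light bulbs and the insurance claims) follow the same pattern, which can be sketched in Python. The helper one_tailed_z_test and its tail argument are assumptions made for illustration; the critical values come from the standard normal distribution and agree with the table value 2.33 after rounding.

```python
import math
from statistics import NormalDist

def one_tailed_z_test(sample_mean, mu0, std_dev, n, alpha, tail):
    """One-tailed z test for a mean; tail is 'left' or 'right' (hypothetical helper)."""
    z = (sample_mean - mu0) / (std_dev / math.sqrt(n))
    if tail == "left":
        z_crit = NormalDist().inv_cdf(alpha)      # negative critical value
        reject = z < z_crit
    elif tail == "right":
        z_crit = NormalDist().inv_cdf(1 - alpha)  # positive critical value
        reject = z > z_crit
    else:
        raise ValueError("tail must be 'left' or 'right'")
    return z, z_crit, reject

# Light bulbs: Ha is mu < 1600, alpha = 0.01
bulb = one_tailed_z_test(1570, 1600, 120, 100, alpha=0.01, tail="left")
# Insurance claims: H1 is mu > 14, alpha = 0.01
claims = one_tailed_z_test(16, 14, 6, 36, alpha=0.01, tail="right")
print(bulb)
print(claims)
```

The third element of each result is the decision: True for the light bulbs (reject Ho) and False for the insurance claims (do not reject Ho).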
Activity 5.2
1. Government officials have decided to control the price of a product if its
mean price per unit in retail outlets rises above birr 2.5. Perform a hypothesis test at the
0.01 level if the mean price in a random sample of 40 outlets is birr 2.52 and the sample
standard deviation is birr 0.10.
2. Safe Fly company makes parachutes. Safe Fly has been buying snap links from
a manufacturing firm which recently merged with Bridge Corporation. Safe Fly is concerned
that the quality of snap links they receive from Big Deal might not be up to specifications.
Specifically, Safe Fly wants to be convinced that the links will withstand a mean breaking
force of more than 5,000 pounds. Perform a hypothesis test at the 0.005 level if the mean
breaking force for a random sample of 50 links is 5,100 pounds and the sample standard
deviation is 221 pounds.
3. Food Machinery Supplies manufactures automatic cola dispensing machines
that are supposed to pour 8 ounces into a cup. Before shipping a machine, Food Machinery
Supplies makes a sample check to determine if the mean amount poured by the machine is
at least 8 ounces. Perform a hypothesis test at the 0.05 level if a random sample of 60 filled
cups had a mean fill amount of 7.92 ounces and standard deviation of 0.16 ounces.
4. Seniors in a high school of a city in the past had a mean score of 490 on
standard mathematics test. A teacher suggests that the seniors will have a higher mean
score if they attend tutorial sessions before taking the test. Perform a hypothesis test at
0.05 level if the scores of a random sample of 35 tutored seniors who take the
standardized test have a mean of 510 and standard deviation of 85.
5. The Super Tread Tire Company requires that the tires it makes should
withstand a mean pressure of more than 150 psi (pounds per square inch) before bursting.
From each large batch of tires made, a random sample of 10 tires is selected and subject to
increasing pressure until burst. Bursting pressures have been found to be normally
distributed. Only batches of tires that meet the psi requirements are sold under the super
tread Brand name. Perform a hypothesis test at the 1% level for a batch in which the
sample mean psi is 154 and sample standard deviation is 3.7 psi.
3. Decision rule: reject Ho if the calculated z is < -1.96 or > 1.96.
Accept Ho if -1.96 ≤ z ≤ 1.96.
$$Z_{calculated} = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} = \frac{0.575 - 0.5}{\sqrt{0.5(0.5)/400}} = \frac{0.075}{0.025} = 3$$
4. Decision: reject Ho, since 3 > 1.96
Example 2
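The mayor of a city claims that 60% of the people of the city follow him and support his
policies. We want to test whether his claim is valid or not. A random sample of 400
persons was taken and it was found that 220 of these people supported the mayor. At level
of significance α = 0.01, what can we conclude about the mayor’s claim?
Solution
1. State the hypothesis
H0: π ≥ 0.6
H1: π < 0.6
2. Decision Rule
For a left-tailed test at α = 0.01 the critical value from the table is -2.33.
Therefore, the decision rule is: reject Ho if Z cal < -2.33
3. Calculate the test statistic (z)
The sample proportion is p = 220/400 = 0.55. The standard error is computed from the
hypothesised proportion π = 0.6.
$$z = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} = \frac{0.55 - 0.60}{\sqrt{0.60(0.40)/400}} = \frac{-0.05}{0.0245} = -2.04$$
4. Decision: accept Ho, since -2.04 is not less than -2.33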
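A minimal Python sketch of the proportion test for the mayor's claim follows. It assumes, as the figure 0.0245 in the worked solution implies, that the standard error is computed from the hypothesised proportion π = 0.6; the function name z_test_proportion is hypothetical.

```python
import math
from statistics import NormalDist

def z_test_proportion(successes, n, pi0):
    """z statistic for one proportion, standard error based on pi0 (hypothetical helper)."""
    p = successes / n                           # sample proportion
    std_error = math.sqrt(pi0 * (1 - pi0) / n)  # standard error under Ho
    return p, (p - pi0) / std_error

p, z = z_test_proportion(220, 400, 0.60)
z_crit = NormalDist().inv_cdf(0.01)             # left-tail critical value, about -2.33
reject = z < z_crit
print(round(p, 2), round(z, 2), round(z_crit, 2), reject)
```

Because the calculated z is not below the critical value, the decision printed is False (do not reject Ho).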
Activity 5.3
1. Suppose that a sponsor of Television program states that the program
should be cancelled if there is convincing evidence that the program’s share of viewing
audience is less than 25 percent. The sponsor also states that the worst error would be to
cancel the program if its audience share is 25 percent or more. And the chance of making
the worst error is to be only 5 percent. A sample of 1250 TV viewers will be interviewed,
and the sample proportion p of the viewers who watch the TV program will be used to
decide whether or not to cancel the program. Suppose that there are 260
program viewers in the sample taken. Should the sponsor cancel the program?
2. A company has developed a new hair shampoo named Shanta. The
company’s marketing executive has obtained figures for the costs of plant expansion and
new product advertising. Taking the costs in to account, the executive thinks it would be a
mistake to market shanta unless there is substantial evidence that more than 20 percent of
shampoo buyers will choose Shanta rather than the competitive Shampoo. The executive
wants the chance of marketing Shanta to be 0.01 if it does not have more than 20 percent
of the market. The plan is to stock a random sample of stores with Shanta and have a
random sample of 500 customers observed as they select a shampoo to purchase. Perform
a hypothesis test if 110 of the customers in the sample purchase Shanta.
3. During a year-end audit, discrepancies were found in a company invoice
ledger. Consequently, the controller had all invoices for the year checked to determine if
they were correctly recorded in the ledger. The proportion of incorrectly recorded
invoices was found to be 0.04. The controller instituted a new procedure for processing
invoices. Subsequently, a random sample of 500 invoices was checked to determine
whether the proportion of incorrectly recorded had changed from 0.04. Perform a
hypothesis test at the 5% level if 11 of the sample invoices were recorded incorrectly.
4. Microchip Company makes chips used in electric circuits. A random sample
of 125 chips is to be selected from those produced. Production is to be halted for careful
inspection if the sample percent of defective chips is significantly higher than the normal,
where the normal is taken as 5% or less defective. It is required that the chance of halting
normal production is to be 0.02. Perform hypothesis test if the sample contains 10
defective chips.
5. In considering a new proposed personnel policy, executives of Equatorial
Business Group feel it would be a mistake to institute the policy if 25% or more of the
company employees oppose it. The executives want the chance of making this mistake to
be 0.10. Perform a hypothesis test if 38 of a random sample of 200 employees oppose the
policy.
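3. Calculate Z
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_{\bar{x}_1 - \bar{x}_2}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{1500 - 1530}{\sqrt{\dfrac{50^2}{100} + \dfrac{60^2}{100}}} = \frac{-30}{7.81} = -3.841$$
4. Decision: Reject Ho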
Interpretation
The two means differ significantly
Example 2
A civil group in a given city claims that female college graduates earn less than male
college graduates. To test the claim, a survey of the starting salaries of 60 male graduates and 50
female graduates was taken, and it was found that the average starting salary for female
graduates was birr 29,500 with a standard deviation of birr 500 and the average salary for
male graduates was birr 30,000 with a standard deviation of birr 600. At the 1% level of
significance, test the claim of this group.
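3. Calculate Z
Taking the male graduates as sample 1 and the female graduates as sample 2:
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{30{,}000 - 29{,}500}{\sqrt{\dfrac{600^2}{60} + \dfrac{500^2}{50}}} = \frac{500}{104.88} = 4.77$$
4. Since Z cal (4.77) > Z table (2.33 for a one-tailed test at the 1% level), we cannot accept
Ho. The sample supports the claim that female graduates earn less than male graduates.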
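The two examples on the difference between two means use the same formula, so both can be checked with one small Python function. The name z_two_means is an assumption for illustration, and the sample standard deviations are used in place of the population values, as in the worked solutions.

```python
import math

def z_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two large-sample means (hypothetical helper)."""
    std_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / std_error

# Example 1: means 1500 and 1530, standard deviations 50 and 60, n = 100 each
z1 = z_two_means(1500, 50, 100, 1530, 60, 100)
# Example 2: male graduates (sample 1) against female graduates (sample 2)
z2 = z_two_means(30000, 600, 60, 29500, 500, 50)
print(round(z1, 3), round(z2, 2))
```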
Activity 5.4
1. The president of a large automobile sales agency tells her sales manager that
she is pleased with the increase over the last year in number of cars sold. The sales
manager contends that the mean net profit per car sold this year is higher than it was last
year. In as much as detailed accounting is required to obtain a firm figure for net profit
realized on a particular sale, the sales manager has an accountant determine this profit for
random sample of 35 cars sold this year and 35 cars sold the last year. The sales manager
wants to show that the mean profit for the current year, µ1, is greater than last year's mean
profit, µ2. Perform a hypothesis test at the 5% level if this year’s sample has a mean of X̄1 = birr
350 and standard deviation of s1 = birr 25 and last year’s sample has a mean of X̄2 = birr
340 and standard deviation of s2 = birr 30.
2. There are two methods used to assemble a product. Fifty assemblies are
made by each method to determine if the mean assembly times are different. Perform a
hypothesis test at the 1% level if the method one sample has a mean of 10 hrs and standard
deviation of 0.45 hrs, compared with a mean of 9.6 hrs and standard deviation of 0.4 hrs for
the method 2 sample.
3. Two different methods of instruction were used in management training
program for a large group of supervisors at a steel city metals. Supervisors without any
training in the subject matter were randomly assigned to either the personalized
instruction method or the more traditional lecture Method (LM). In the personalized
Instruction method, supervisors used programmed materials and proceeded at their own
pace during schedule periods. The lecture method used training leader in a class room
setting. In order to determine whether the training methods made any difference in
learning, standardized test was given to all participants. For a random selection of 40
personalized instructions method participants, the mean score was 72 and standard
deviation was 15. For 50 lecture method participants, the mean was 81 and standard
deviation was 20. Is the observed difference in the mean scores significant at the 0.05
level?
Example
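A sample of 200 students at AMU revealed that 18% of them were seniors. A similar sample
of 400 students at Debub University revealed that 15% of them were seniors. We want to
test, at the 5% level of significance, whether the difference between these two proportions
is large enough to conclude that the populations are indeed different.
Solution
Steps
1. State the hypothesis
Ho: π1 = π2
H1: π1 ≠ π2
2. Decision Rule
Accept Ho if -1.96 ≤ z ≤ 1.96
3. Calculate the statistic z.
$$z = \frac{p_1 - p_2}{\sigma_{p_1 - p_2}}, \qquad \sigma_{p_1 - p_2} = \sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where
$$\bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{200(0.18) + 400(0.15)}{600} = 0.16$$
$$\sigma_{p_1 - p_2} = \sqrt{0.16(0.84)\left(\frac{1}{200} + \frac{1}{400}\right)} = 0.0317$$
$$z = \frac{0.18 - 0.15}{0.0317} = 0.95$$
4. Decision: since 0.95 lies between -1.96 and 1.96, accept Ho. The difference between the
two proportions is not significant.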
Example 2
An insurance company believes that smokers have a higher incidence of heart disease than
non-smokers among men over 50 years of age. Accordingly, it is considering offering discounts
on its life insurance policies to non-smokers. However, before the discount is made, an
analysis is undertaken. To justify its claim that smokers are at a higher risk of heart
disease than non-smokers, the company randomly selected 200 men, which included 80
smokers and 120 non-smokers. The survey indicated that 18 smokers and 15 non-smokers
suffered from heart disease. At the 5% level of significance, can we justify
the claim of the insurance company that smokers have a higher incidence of heart disease
than non-smokers?
Solution
Steps
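1. State the hypothesis
Ho: π1 ≤ π2
H1: π1 > π2
where π1 is the proportion of smokers with heart disease and π2 that of non-smokers.
2. Decision Rule: this is a one-tailed test at the 5% level, so reject Ho if z calculated > 1.645
3. Calculate z
p1 = 18/80 = 0.225 and p2 = 15/120 = 0.125
$$\bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{18 + 15}{200} = 0.165$$
$$z = \frac{p_1 - p_2}{\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.225 - 0.125}{\sqrt{0.165(0.835)\left(\frac{1}{80} + \frac{1}{120}\right)}} = \frac{0.1}{0.0536} = 1.87$$
4. Since z calculated (1.87) is greater than 1.645, we reject Ho. The sample supports the
claim that smokers have a higher incidence of heart disease than non-smokers.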
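The pooled-proportion calculation used in both examples can be sketched in Python as follows. The function name z_two_proportions is hypothetical, and the counts for the first example (36 and 60 seniors) are derived here from the stated percentages of 18% of 200 and 15% of 400.

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """Pooled proportion and z statistic for two sample proportions (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # same as (n1*p1 + n2*p2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1 - p2) / std_error

# Seniors at the two universities
pooled_a, z_a = z_two_proportions(36, 200, 60, 400)
# Heart disease among smokers (sample 1) and non-smokers (sample 2)
pooled_b, z_b = z_two_proportions(18, 80, 15, 120)
print(round(pooled_a, 3), round(z_a, 2))
print(round(pooled_b, 3), round(z_b, 2))
```

Small differences in the second decimal place against the hand calculation come from rounding the standard error before dividing.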
Activity 5.5
1. A random sample of 1,600 workers in region 1 and 1400 workers in region 2
has been obtained to determine whether the population proportions unemployed in the
two regions are different. Perform a hypothesis test at the 5% level if the numbers
unemployed in the samples were 120 in region 1 and 84 in region 2.
2. A company is considering two different radio advertisements for promotion of a
new product. Management believes that advertisement A is more effective than
advertisement B. Two test market areas with virtually identical consumer characteristics are
selected; advertisement A is used in one area and advertisement B in the other. In a
random sample of 60 customers who heard ad A, 18 tried the product. In a random sample of
100 customers who heard ad B, 22 tried the product. Does this indicate that ad A is more
effective than ad B, if a 0.05 level of significance is used?
3. A manufacturer of stereo cartridges finds that 60 of 100 diamond needles
and 40 of 100 emerald needles met technical specifications after 1000 hrs of play. Test the
hypothesis of no difference in the population meeting specification after this length of
playing time. Use the 0.01 level of significance.
4. Commercial Bank of Ethiopia wants to check two of its branches to see
whether accounts had been overdrawn. Of 100 accounts at the first branch, 20 had
been overdrawn; at the second branch, 30 of 200 had been overdrawn. Test the hypothesis
of no difference in the proportion overdrawn between the two branches. Let α = 0.10.
The t distribution is useful when the sample size is small and the
population standard deviation is not known. The sampling distribution for a large sample from
any population can be approximated by the normal distribution, but a small sample must come
from a normal or near-normal population in order for a t-test to be used.
The t-distribution has the following characteristics:
1 Similar to the z distribution, it is a continuous distribution.
2 Similar to the z distribution, it is bell shaped and symmetrical.
3 Unlike the z distribution, it is not just one distribution but a family of
distributions.
4 The t-curve is lower at the mean than the z curve.
It is more spread out at the centre and higher at the tail ends. As the
sample size increases, the t-distribution approaches the z-distribution.
The t distribution has greater variance (spread) than the z-distribution.
The critical t scores are numerically larger than the z scores for a given
level of significance; the smaller the sample size, the larger the critical t scores.
A t-distribution is identified by its degrees of freedom (d.f.), where d.f. = n-1
Example 1
In order to revise the accident insurance rates for automobiles, an insurance company
wants to assess the damage caused to cars by accidents at a speed of 15 miles/hr. A sample
of 16 new cars was selected at random and the company crashed each one at the speed of
15 miles/hr. The cars so crashed were repaired, and it was found that the average repair
amount was birr 2500 with a standard deviation of birr 950. Assuming that the population
distribution of repair costs under these conditions is normal, estimate the average
damage (in birr) to all cars due to a crash at 15 miles/hr, using a 95% confidence interval.
s = 950
α/2 = 0.025
The interval is centred on x̄ = 2500, with lower limit X1 and upper limit X2:
X1 = x̄ - t sx̄
X2 = x̄ + t sx̄
$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{950}{\sqrt{16}} = \frac{950}{4} = 237.5$$
The value of t for 16-1 = 15 degrees of freedom and 2.5% in each tail can be read from the
table. Go to the statistical table with the heading t distribution; search for 15 in the
degrees of freedom column (the first column) and α = 0.025 in the level of significance row
(the first row of the table). Then find the intersection of the two, which reads
t15, 0.025 = 2.131.
X1 and X2 can now be calculated using the formula mentioned earlier as follows:
X1 = x̄ - t sx̄
X1 = 2,500 - 2.131(237.5)
= 1,993.89
For X2,
X2 = x̄ + t sx̄
X2 = 2,500 + 2.131(237.5)
= 3,006.11
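The interval limits can be verified with a few lines of Python. The function name t_confidence_interval is an assumption; the critical value 2.131 is passed in as read from the t table, because Python's standard library has no t-distribution function.

```python
import math

def t_confidence_interval(sample_mean, s, n, t_table):
    """Confidence limits for a mean using a t value read from the table (hypothetical helper)."""
    std_error = s / math.sqrt(n)   # standard error of the mean
    margin = t_table * std_error
    return sample_mean - margin, sample_mean + margin

# 16 cars, mean repair cost birr 2500, s = 950, t for 15 d.f. and 0.025 = 2.131
lower, upper = t_confidence_interval(2500, 950, 16, 2.131)
print(round(lower, 2), round(upper, 2))
```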
Example2
A gas station repair shop claims that it can do a lubrication job and oil change in 30
minutes. The customer protection department wants to test this claim. A sample of six cars
was sent to the station for an oil change and lubrication. The job took an average of 34
minutes with a standard deviation of 4 minutes. This claim is to be tested at α = 0.05.
Solution
1. State the hypotheses
Ho: µ = 30 minutes
H1: µ > 30 minutes
2. Determine which test statistic to use.
As inferred from the problem, sample size is 6 and population standard deviation is
unknown. Therefore the relevant test statistic to be used is t-distribution.
3. set the decision rule
To set the decision rule we need two quantities: the significance level and the degrees of
freedom. The degrees of freedom for the t-distribution are calculated as n-1; therefore, since
six cars were taken as a sample, the degrees of freedom will be 6-1 = 5. We are conducting a
one-tailed test, so the whole significance level of 0.05 lies in one tail.
t5, 0.05 = 2.015. Hence, the decision rule can be stated as: reject Ho if t-calculated ≥ 2.015.
4. calculate the t value
t-calculated is given by the formula:
$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{34 - 30}{4/\sqrt{6}} = \frac{4}{1.63} = 2.45$$
5. Decision
Based on the decision rule and t-calculated, we have to make a decision either to accept or
to reject the null hypothesis. In our case the t-calculated is greater than the table t-value.
Hence the decision is to reject the null hypothesis.
6. Interpretation
The job of lubrication and oil change takes more than 30 minutes to complete.
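The six steps of this example reduce to a short calculation, sketched below in Python. The function name one_sample_t is hypothetical, and the table value 2.015 is entered by hand as read from the t table for 5 degrees of freedom.

```python
import math

def one_sample_t(sample_mean, mu0, s, n):
    """t statistic and degrees of freedom for one small sample (hypothetical helper)."""
    t = (sample_mean - mu0) / (s / math.sqrt(n))
    return t, n - 1

t, df = one_sample_t(34, 30, 4, 6)
t_table = 2.015                 # t for 5 d.f. at 0.05, one tail, from the table
reject = t >= t_table           # decision rule: reject Ho if t-calculated >= 2.015
print(round(t, 2), df, reject)
```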
Activity 5.6
1. A nationwide survey indicates that children spend an average of 23 hours
per week watching television. A city councilwomen wishes to determine whether the time
that the children in her district spend watching television is significantly different from 23
hours. She obtains a random sample of time spent watching television in a week for 25
children in her district. A summary of results on hours per week is X̄ = 20 and s = 8.9.
Assume that the random variable is normally distributed. Conduct the appropriate test
at α = 0.05.
2. An accountant uses a sampling procedure in auditing clients’ statements of
accounts payable for possible monetary errors in the balance payable. A random sample of
16 accounts is selected, the balance payable on each is confirmed and the sample results
are used to test the null hypothesis that the average monetary error for the population of
accounts µ does not exceed birr 50. The accountant uses α = 0.01. Assume that the
monetary errors in the accounts are normally distributed. For the sample of 16 accounts,
X̄ is birr 56 and S = 8.24. Does this indicate that µ does not exceed birr 50?
3. The state personnel department believes that the average number of days
sick leave requested annually by the employees is 8. A random sample of 15 employees’
record is selected. The sample results are X̄ = 5 and S = 2.1 days. Assume that the variable
is distributed normally. Does the sample result differ significantly from the 8-day belief, if
α = 0.05?
4. Awash winery states that the volume of wine in its standard size bottles
average 750 ml. a state alcoholic beverage control Board examines a random sample of 17
of these bottles, finding an average volume of 721 ml and standard deviation of 48 ml.
Does the ABC board have any reason to suspect that the average volume in all these bottles
is less than 750 ml? Volume is normally distributed; let α = 0.01.
5. The manager of high rise condominium development expresses to his lender
that the average family income of his tenants is birr 42,000. Since the lender also holds
mortgage on large number of these unit, a sample of reported family income can be easily
obtained. A random sample of 20 files finds average family income X =36,000 and
standard deviation S= 16,000. Assume that family income is normally distributed. Has the
manager overstated average family income? Use α = 0.01.
When we have two normally distributed populations whose standard deviations are
unknown but equal, and the independent samples used in testing the means are each less than
or equal to 30, the test statistic to be used follows the student t-distribution.
Using the student t-distribution, the decision rule is made from the statistical table at the
end of this module titled t-distribution, reading t(α, V) for a one-tail test or t(α/2, V) for
a two-tail test, where
V = degrees of freedom = n1 + n2 - 2
The sample t-statistic is calculated as:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
Example
It is desired to find out if there is any significant difference in the average amount of
money carried by male and female students of Arbaminch University. A random sample of
8 male and 10 female students was selected and the amount of money each had was
recorded. We are interested to know if there is any significant difference in the average
amount of money carried by male and female students at the 5% level of significance.
Male Female
n1= 8 n2= 10
X1 = 20.5 X 2 =17
s1= 2 s2= 1.5
Solution
Steps
1. State the Hypothesis
H0: µ1 =µ2
H1: µ1 ≠ µ2
2. Determine which test statistic is to be used. As we can see clearly, the sample
sizes are small and the population standard deviations are unknown. Therefore, the appropriate
test statistic will follow the t-distribution.
3. develop the decision rule
To develop decision rule we have to calculate the degree of freedom and read the value of t
for the specific significance level. Degree of freedom where we have two populations (P1
and P2) is given by:
(n1-1) + (n2-1)
Hence, degree of freedom is (8-1) + (10-1) = 7+9 = 16
The significance level is given as 5%; however, since we have a two-tail test, alpha
is to be divided between the two tails: 5%/2 = 2.5%
Therefore, t16, 0.025= 2.12
This implies that the decision rule is accept Ho if -2.12 ≤ t calculated ≤ 2.12.
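4. calculate the t value
$$t_{calculated} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
$$t_{calculated} = \frac{20.5 - 17}{\sqrt{\dfrac{(8 - 1)2^2 + (10 - 1)1.5^2}{8 + 10 - 2}\left(\dfrac{1}{8} + \dfrac{1}{10}\right)}} = \frac{3.5}{0.8237} = 4.25$$
5. Decision
The calculated t-value (4.25) is greater than the table t-value (2.12) and so falls outside
the acceptance region; Ho must therefore be rejected.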
6. Interpretation
The average amount of money carried by female students and male students differ
significantly.
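The pooled-variance calculation in step 4 is easy to get wrong by hand, so a small Python check may help. The function name pooled_t is an assumption for illustration; the table value 2.12 is the one read in step 3.

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Two-sample t statistic with pooled variance (hypothetical helper)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df

# Male students (n1 = 8, mean 20.5, s1 = 2) and female students (n2 = 10, mean 17, s2 = 1.5)
t, df = pooled_t(20.5, 2, 8, 17, 1.5, 10)
t_table = 2.12                      # t for 16 d.f. at 0.025 per tail, from the table
print(round(t, 2), df, abs(t) > t_table)
```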
Activity 5.7
1. In an effort to promote energy conservation through car pooling by
employees, a company is considering the institution of a rule at all plants requiring at least
three passengers in each car that is allowed free parking. Parking attendant at the south
side plant have provided the results of a random sample of 15 cars. A random sample of 12
cars is obtained at the west End Plant. Letting X1 and X2 represent the passengers per car
at south Side and West End, respectively, the results are:
X 1 =1.8 X 2= 2.9
S1= 1.5 S2= 1.6
Is there a difference in the average number of passengers per car for all cars parking at
these two plants? Assume that the number of passengers per car is normally distributed.
Use α = 0.10.
2. The research department of a historical society has developed a chemical
that they claim will lengthen the life span of papers treated with the chemical. Before
agreeing to allow some old and new valuable manuscripts to be treated with the chemical,
the society’s governing board requested statistical evidence that the paper life is
lengthened by the chemical treatment. Some identical papers are selected for comparison.
Twelve sheets are randomly selected and treated with the chemical. Nine sheets are left
untreated. Then all papers are aged artificially by an oven process. After aging, the papers
are tested for tear resistance by machine that precisely measures the force required to tear
the papers. The force required to tear the treated papers averaged 0.052 grams with
standard deviation of 0.015 grams, for the untreated papers average tear force was 0.036
grams with standard deviation of 0.01 grams. Does the sample evidence indicate that the
treatment with the chemical actually improves paper life as measured by the tear force?
Let α = 0.01. Assume tear force is normally distributed.
3. A costly road testing of random samples of cars was carried out to determine
if mean mileage is greater for model one cars than model 2. The sample data were as
follows:
Model one n1= 8 X 1= 26 S1= 1.4
Model two n2= 10 X 2= 23.6 S2= 1.2
Perform a hypothesis test at the 5% level.
4. Data Flow, a large computer manufacturer, is considering whether method
one or method 2 should be adopted for training technicians who must find and correct
problems arising in its computer systems. Method one relies heavily on a check list process
of elimination, while method 2 concentrates more on teaching fundamentals of the
functions and interrelations of component parts. A random sample of 10 workers is
trained by method 1 and another random sample of 10 workers is trained by method 2.
After training, technicians in each sample are exposed to the same series of problems. The
sample mean times to correct the problems will be used to determine whether the mean
times to solve a problem by the two methods are different. Perform a hypothesis test at the
5% level if the method 1 sample has a mean of 3.2 hours and standard deviation of 0.6
hours; where as the method 2 sample has the mean of 3.8 hours and standard deviation of
0.5 hours.
UNIT SUMMARY
µ > 3 or µ ≠ 5.
Suppose that we want to test the hypothesis that µ ≠ 5. Then we can think of our
opponent suggesting that µ = 5. We call the opponent's hypothesis the null hypothesis and
write:
H0: µ = 5
Our own hypothesis is the alternative hypothesis, written:
H1: µ ≠ 5
For the null hypothesis we always use equality, since we are comparing with a
previously determined mean.
4. Collect data.
6. Utilize the table to determine if the z score falls within the acceptance region.
7. Decide to
b. Fail to reject the null hypothesis and therefore state that there is not
enough evidence to suggest the truth of the alternative hypothesis.
We define a type I error as the event of rejecting the null hypothesis when the null
hypothesis is true. The probability of a type I error (α) is called the significance level.
We define a type II error (with probability β) as the event of failing to reject the null
hypothesis when the null hypothesis is false.
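The meaning of the significance level can be illustrated with a small simulation, sketched here in Python. All the numbers (µ = 50, σ = 10, n = 30, 2000 trials, the random seed) are hypothetical choices made for the illustration. Every sample is drawn from a population in which Ho is true, so each rejection is a type I error, and the share of rejections should come out close to α.

```python
import math
import random
from statistics import NormalDist, mean

def type_one_error_rate(mu0, sigma, n, alpha, trials, seed=0):
    """Share of two-tailed z tests that reject a TRUE null hypothesis (hypothetical helper)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # The sample really does come from a population with mean mu0
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1      # rejecting here is a type I error
    return rejections / trials

rate = type_one_error_rate(mu0=50, sigma=10, n=30, alpha=0.05, trials=2000)
print(rate)
```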
SELF CHECK EXERCISE 3
1. East Africa bottling compilations showed that Coca cola had 42% total soft drink
market share in the past year. East African Bottling has made some modifications in
marketing policy and wants to determine if its market share has changed from 42%.
Perform a hypothesis test at the 1% level of significance if 180 of a random sample of 400
customers consume coca cola.
2. Suppose sugar refinery ships 454 gram boxes of sugar to wholesalers in railroad
carload lots. The weights of boxes are normally distributed and have a standard deviation of
σx = 8 grams. Before a carload is shipped, a random sample of n = 25 boxes is weighed and
the mean is found to be 440 grams. Perform a hypothesis test at the 2% level to
determine whether the mean weight for the carload is less than 454 grams.
3. When properly adjusted, an automatic machine should produce parts that have a
mean diameter of 25 millimetres (mm). Part diameters are normally distributed. The mean
diameter of a sample of 10 parts is to be used to check whether or not the machine is
running properly. Perform a hypothesis test at 5% level if the mean of the sample is 25.02
mm and sample standard deviation is 0.024 mm.
4. A rancher wants to test two food mixtures, mix one and mix 2 on random sample of
steers to determine if there is a difference in mean weight gains for the mixtures. The
sample mean weight gains X 1 and X 2 and other sample data are:
UNIT FOUR
_____________________________________________________________________
CHI- SQUARE DISTRIBUTION
_____________________________________________________________________
Unit Content
General characteristics of chi square distribution
Test of independence and Cell counts for independence
Goodness of fit test
Goodness of fit test uniform distribution
Goodness of fit test Binomial Distribution
Goodness of fit test Poisson distribution
Goodness of fit test Normal distribution
Unit Objective
This unit enables the learners to:
Understand chi-square distribution and its general characteristics
Use chi-square distribution to test if two events are independent or not
Use chi-square distribution to test if a given distribution follows uniform distribution
Use chi-square distribution to test if a given distribution follows Binomial
Distribution
Use chi-square distribution to test if a given distribution follows Poisson
distribution
Use chi-square distribution to test if a given distribution follows Normal distribution
4.2 GENERAL CHARACTERISTICS OF CHI SQUARE DISTRIBUTION
Chi square is a continuous distribution ordinarily derived as the sampling distribution of a
sum of squares of independent standard normal variables.
$$Z_i = \frac{X_i - \mu}{\sigma}$$
where Zi is normally distributed with mean zero and variance 1.
If
$$Y = Z_1^2 + Z_2^2 + Z_3^2 + \dots + Z_n^2 = \sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^2$$
this new variable is distributed as χ² with n degrees of freedom.
This can be rewritten as:
$$\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^2 = \sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2 + n\left(\frac{\bar{X} - \mu}{\sigma}\right)^2 = \frac{(n-1)S_X^2}{\sigma^2} + \left(\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}\right)^2$$
where
$$S_X^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}$$
The variable √n(X̄ - µ)/σ is normally distributed with mean zero and variance 1. It can be
shown that (n-1)S²X/σ² has a χ² distribution with (n-1) degrees of freedom.
The variable χ² cannot be negative. Therefore, χ² curves do not extend to the left of
zero. When V (the degrees of freedom) is more than two, χ² curves have one mode and are
skewed to the right. The skewness is less apparent when V is large, and the curve approaches
the normal curve as V approaches ∞.
When V is very large, the χ² distribution is almost the same as a normal distribution.
χ²(α, V) means the value of χ² such that the distribution with V degrees of freedom has a
right tail area of α.
Uses of chi-square distribution
Dear learner, so far we have been discussing the characteristics of the chi-square distribution.
Let us now see why it is used. Earlier we assumed the probability distribution
of the population. For example, we said: assume the population follows the normal
distribution, the binomial distribution, the Poisson distribution and the like. But a very
serious question we need to answer is how we know whether the population is in fact
normally distributed, and so on.
A chi-square distribution can be used to test whether a population follows one or another
distribution. On top of that, the chi-square distribution can be used to test whether two
variables are independent.
Dear learner, below we will see how to check whether two variables are independent. We will
then see how to check whether a population distribution follows a particular form of distribution.
4.3 TEST FOR INDEPENDENCE AND CELL COUNTS FOR TEST OF
INDEPENDENCE
Brain storming Question
What are independent events?
To test for independence of two variables, we will work with count data that are arranged
in rows and columns. There will be a row category and column category as illustrated
below.
                I      II     III    Row total
A                                    200
B                                    300
Column total    120    200    180    500
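What would the counts in each cell be if the categories were independent?
Dear student, do you remember what independent events mean? If your answer is “two
events are independent if the occurrence of one does not affect the other and is not
affected by the other event”, you are correct. Excellent indeed!
If two events A and B are independent, then P(A/B) = P(A).
In the above example, the probability of A/I is, therefore, given by P(A∩I)/P(I) = P(A),
whereas the probability of A is (Row A total)/(Grand total).
Hence the expected count in cell A∩I is (Row A total × Column I total)/(Grand total) =
(200 × 120)/500 = 48. In the same way, A∩II = 80 and A∩III = 72.
Similarly
B∩I = 72 B∩II = 120 B∩III = 108
The complete entries of the table calculated above and reproduced below are called
expected frequencies, fe.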
The complete entries of the table calculated above and reproduced below is called
expected frequencies fe.
                I      II     III    Row total
A               48     80     72     200
B               72     120    108    300
Column total    120    200    180    500
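The expected frequencies follow from the row and column totals alone, which the following Python sketch makes explicit. The function name expected_frequencies is an assumption for illustration; each cell is (row total × column total)/grand total.

```python
def expected_frequencies(row_totals, col_totals):
    """Expected cell counts under independence (hypothetical helper)."""
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Row totals for A and B, column totals for I, II and III
fe = expected_frequencies([200, 300], [120, 200, 180])
for label, row in zip(["A", "B"], fe):
    print(label, row)
```

Row A comes out as 48, 80 and 72 and row B as 72, 120 and 108, so each row and each column of expected frequencies adds up to the totals it was computed from.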
The actual sample counts are called observed frequencies and are denoted by fo. These two
frequencies, fo and fe, are used to compute a sample statistic for testing the hypothesis that
the row and column categories are independent. The underlying idea is that for the
categories to be independent, the observed frequencies should be close to the expected
frequencies.
If the differences (fo - fe) are large, then we reject the hypothesis of independence.
$$\text{Sample } \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
The distribution of χ² computed from a contingency table is approximated by a chi-square
distribution with V degrees of freedom, where:
V = (r-1) (c-1)
with r the number of rows and c the number of columns.
The chi-square approximation is satisfactory if the expected cell frequencies are not too
small, i.e. fe ≥ 5.
If fe < 5, combine adjacent rows or columns in the contingency table to get fe values of at
least 5 before computing the sample χ²; the degrees of freedom (V) are also computed
after combining rows and columns.
Steps followed
1. State the hypotheses
Ho: The row and column categories are independent
Ha: The row and column categories are not independent
2. State the decision rule: reject Ho if sample $\chi^2 > \chi^2_{\alpha, V}$,
where α = significance level of the test,
V = (r − 1)(c − 1),
and $\chi^2_{\alpha, V}$ is found from the statistical table
3. Compute the sample $\chi^2$
$$\text{Sample } \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
Example
David Gallano, a wine merchant, has collected opinions on grape wine quality from a
random sample of his customers. The customers tasted wines made from grapes grown in
three regions of the country and rated wine quality on a scale of 1 (best) to 4. The sample
data are given in the table below. David wants to know whether quality ratings are
independent of the grape growing region. The test for independence is to be made at the
5% level.
Customer quality ratings for wine, by growing region
Quality rating    I     II    III   Row total
1                 15    10    6     31
2                 7     13    12    32
3                 11    12    8     31
4                 3     8     11    26
Column total      36    43    41    120
Solution
Steps
1. Ho: Quality rating is independent of growing region
Ha: Quality rating is not independent of growing region
2. Develop the decision rule
V = (4 − 1)(3 − 1) = 3(2) = 6
$\chi^2_{\alpha, V} = \chi^2_{0.05, 6} = 12.592$
Reject Ho if sample $\chi^2$ > 12.592
3. Calculate the sample $\chi^2$
Calculate the expected frequencies and use the formula (fo − fe)²/fe to calculate the sample
chi-square. To calculate the expected frequencies we use the cell counting rule. For example,
the expected frequency for the first cell is (31 × 36)/120 = 9.3. Calculate the
rest in the same way. We will get the following results.
Quality rating is not independent of growing region, since the calculated chi-square is
greater than the critical value.
Activity 6.1
1. Sheraton Addis has rooms at high, average, and low price levels. The owner
advertises high quality service in all rooms. Opinions of the service obtained from a
random sample of guests are given in the table below. Are guest ratings independent of
price? Test at the 1% level.
Guest service rating    Room price
                        High    Average    Low
Excellent               14      25         11
Good                    20      66         14
Poor                    6       19         5
A B C D
1 12 30 10 16
2 16 22 11 8
3 14 12 3 6
3. The credit manager of plaza stores obtained data for random sample of credit
customers and recorded the data in the table below. Perform at the 5 % level, a
test of hypothesis that time to pay is independent of residence region.
Example
The following table contains counts for a random sample of 200 workers. It shows, for
example, that 12 workers who had not gone to high school were rated as satisfactory by
their supervisor. We want to test (at the α = 0.05 level) the hypothesis that the proportions
of satisfactory workers in education levels 1, 2, and 3 are equal.
Education level
Supervisor rating    No high school    High school but not complete    Completed high school    Row total
Satisfactory         12                63                              65                       140
Not satisfactory     8                 17                              35                       60
Column total         20                80                              100                      200
Solution
Steps to solve the question
1. Hypotheses
Ho: The cell proportions in any row are equal
Ha: The cell proportions in at least one row are not equal
2. Decision rule
V = (2 − 1)(3 − 1) = 2
With α = 0.05 and 2 degrees of freedom, $\chi^2_{0.05, 2} = 5.991$
Therefore, the decision rule will be:
Reject Ho if sample $\chi^2$ > 5.991
3. Calculate the sample $\chi^2$
Next we must compute expected frequencies. Expected frequency can be calculated in the
same way we have been calculating expected frequencies in test of independence. For
example for the first row first column, expected frequency can be calculated as follows:
$$f_e = \frac{(\text{cell row total})(\text{cell column total})}{\text{grand total}} = \frac{140(20)}{200} = 14$$
Likewise expected frequencies for all cells can be calculated. Dear learner, calculate the
expected frequency for all the remaining cells before you refer to the following table.
There are two numbers in each cell in the table below. The first number is observed
frequency and the number to the right of the observed frequency is expected frequency.
Supervisor rating    No high school    High school but not complete    Completed high school    Row total
                     fo     fe         fo     fe                       fo     fe
Satisfactory         12     14         63     56                       65     70                140
Not satisfactory     8      6          17     24                       35     30                60
Column total         20                80                              100                      200
3. Sample $\chi^2$ = 5.0595
The sample $\chi^2$ does not exceed 5.991 ($\chi^2_{0.05, 2}$).
4. Decision
Accept Ho, since the sample $\chi^2$ is less than the critical value.
Interpretation
The proportion of workers rated satisfactory is the same for all three
educational levels.
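The calculation in this example can be reproduced with a short Python sketch. The function name chi_square_independence is our own, hypothetical helper; it applies the cell counting rule and the sample chi-square formula to the observed counts from the table above, and the critical value 5.991 is the one read from the statistical table in the text.

```python
def chi_square_independence(observed):
    """Return (sample chi-square, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi_sq = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            # Expected frequency from the cell counting rule.
            fe = row_totals[i] * col_totals[j] / grand_total
            chi_sq += (fo - fe) ** 2 / fe

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_sq, df


# Rows: satisfactory, not satisfactory. Columns: the three education levels.
ratings = [[12, 63, 65], [8, 17, 35]]
chi_sq, df = chi_square_independence(ratings)

critical_value = 5.991  # chi-square table value for alpha = 0.05, V = 2
reject_h0 = chi_sq > critical_value
print(f"sample chi-square = {chi_sq:.4f}, V = {df}, reject Ho: {reject_h0}")
```

The same function can be applied to any contingency table, provided the expected frequencies are at least 5.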
$$\text{Sample } \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
A goodness-of-fit test differs from a test of independence in:
1. The method used to compute the expected frequencies
2. The rule for determining the number of degrees of freedom.
In a goodness-of-fit test the method for calculating the expected frequencies depends on the
population assumptions that are made. The number of degrees of freedom in a goodness-of-fit
test is:
V = ne − 1 − g
where ne = number of fe values used in computing the sample $\chi^2$
g = number of population parameters estimated from the sample.
Category    fo
0           41
1           54
2           31
3           39
4           35
5           36
6           56
7           38
8           31
9           39
Total       400
Solution
1. State the hypotheses
Ho: The distribution is uniform
Ha: The distribution is not uniform
2. Calculate the expected frequencies and the sample $\chi^2$
$$f_e = \frac{N}{n} = \frac{400}{10} = 40$$
where N = total number of observations and n = number of categories.
Category    fo    fe    fo − fe    (fo − fe)²    (fo − fe)²/fe
0           41    40    1          1             1/40
1           54    40    14         196           196/40
2           31    40    -9         81            81/40
3           39    40    -1         1             1/40
4           35    40    -5         25            25/40
5           36    40    -4         16            16/40
6           56    40    16         256           256/40
7           38    40    -2         4             4/40
8           31    40    -9         81            81/40
9           39    40    -1         1             1/40
$$\sum \frac{(f_o - f_e)^2}{f_e} = \frac{662}{40}$$
V = ne − 1 − g
= 10 − 1 − 0 = 9
3. Decision rule
$\chi^2_{0.05, 9} = 16.919$
Reject Ho if sample $\chi^2$ > 16.919
4. Sample $\chi^2$ = 662/40 = 16.55
5. Accept Ho, since 16.55 is less than 16.919.
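As a check on the arithmetic, the uniform goodness-of-fit calculation can be sketched in Python as follows. The variable names are our own; the ten observed counts and the critical value 16.919 are taken from the example above.

```python
# Observed counts for the ten categories 0 to 9.
observed = [41, 54, 31, 39, 35, 36, 56, 38, 31, 39]

n_categories = len(observed)
# Under a uniform distribution every category has the same expected count.
fe = sum(observed) / n_categories

chi_sq = sum((fo - fe) ** 2 / fe for fo in observed)
df = n_categories - 1 - 0  # g = 0: no parameter is estimated from the sample

critical_value = 16.919  # chi-square table value for alpha = 0.05, V = 9
reject_h0 = chi_sq > critical_value
print(f"fe = {fe:.0f}, sample chi-square = {chi_sq:.2f}, reject Ho: {reject_h0}")
```

Note how close the sample value is to the critical value: the decision would change with only a slightly larger deviation in one category.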
4.5.2 Goodness of Fit: Binomial Distribution
A football fan keeps track of a football betting pool in her company. In each bet, a player
has to pick the winners of 10 games. In the last season 1,000 bets were placed.
The number of bets for each number of correct picks is tallied in column 2.
Number of correct picks    Number of bets (fo)
0 2
1 8
2 39
3 123
4 207
5 250
6 203
7 115
8 40
9 13
10 0
Solution
To test whether the above distribution is binomial, we calculate the expected
frequencies assuming that the number of correct picks follows a binomial
distribution. We then compare the expected frequencies with the observed
frequencies and see how significant the difference is.
The steps we follow in a goodness-of-fit test for the binomial distribution are similar to
those of the goodness-of-fit test for the uniform distribution.
Number of correct picks    Number of bets (fo)    Expected probability    Expected frequency (fe)
0                          2                      0.001                   1
1                          8                      0.010                   10
2                          39                     0.044                   44
3                          123                    0.117                   117
4                          207                    0.205                   205
5                          250                    0.246                   245
6                          203                    0.205                   205
7                          115                    0.117                   117
8                          40                     0.044                   44
9                          13                     0.010                   10
10                         0                      0.001                   1
Dear learner, do you remember what a binomial distribution is and how to calculate the
probability of r occurrences if we are given the probability of success? Consult the material
on probability distributions and refresh your memory before carrying out these
probability calculations. The probability of each event is given in the above table and
is calculated using the following formula.
$$P(r) = \frac{n!}{r!\,(n - r)!}\, p^r q^{n - r}$$
where n = 10
p = 0.5
q = 0.5
r = number of correct picks
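The binomial probabilities can also be computed directly. The sketch below is illustrative only: binomial_probability is a hypothetical helper name, and the values n = 10, p = 0.5 and 1,000 bets come from the example. Because the table uses probabilities rounded to three decimals, an expected frequency computed this way may differ by about one unit from the tabulated value.

```python
from math import comb


def binomial_probability(r, n, p):
    """P(r successes in n trials) = n! / (r! (n - r)!) * p^r * q^(n - r)."""
    q = 1 - p
    return comb(n, r) * p**r * q ** (n - r)


n_games, p_correct, n_bets = 10, 0.5, 1000

probabilities = [binomial_probability(r, n_games, p_correct)
                 for r in range(n_games + 1)]
expected = [n_bets * prob for prob in probabilities]

for r, (prob, fe) in enumerate(zip(probabilities, expected)):
    print(f"r = {r:2d}  P(r) = {prob:.3f}  fe = {fe:.1f}")
```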
Step 1: State the hypotheses
Ho: The distribution follows a binomial distribution with p = 0.5
Ha: The distribution does not follow a binomial distribution with p = 0.5
Step 2: Calculate the degrees of freedom
To calculate the degrees of freedom we need to check the values in the expected frequency
column. Each expected frequency must be greater than or equal to five. However, in the above
table some expected frequencies (the first and the last) are less than five. Therefore we need
to merge categories so that every expected frequency is at least five. Merging can be done by
combining the first two rows and combining the last two rows.
Then the degrees of freedom can be calculated as V = K − 1 − g
where K = number of categories
g = number of population parameters estimated from the sample.
Number of correct picks    Number of bets (fo)    Expected probability    Expected frequency (fe)
0 or 1                     10                     0.011                   11
2                          39                     0.044                   44
3                          123                    0.117                   117
4                          207                    0.205                   205
5                          250                    0.246                   245
6                          203                    0.205                   205
7                          115                    0.117                   117
8                          40                     0.044                   44
9 or 10                    13                     0.011                   11
The degrees of freedom can now be calculated using the formula stated above (V = K − 1 − g).
After merging, the table has nine categories of data; therefore the degrees of
freedom will be 9 − 1 − 0, since no population parameter is estimated from the sample.
Degrees of freedom (V) = K − 1 − g = 9 − 1 − 0 = 8
Step 3: Decision rule
Next, read the value of $\chi^2_{0.05, 8}$.
Dear student, can you read the value from your statistical table? To find the value:
1. Go to the statistical table and find the table with the heading chi-square
distribution.
2. In the first column look for the degrees of freedom (8)
3. Along the top row look for a tail area of 0.05
4. Find the intersection of the two
Dear learner, have you found that the result is 15.507?
Since the rejection region falls only in the right tail, the null hypothesis developed
earlier will be rejected if the calculated $\chi^2$ is greater than 15.507.
Step 4: Calculate $\chi^2$
Using the merged categories, the sample $\chi^2 = \sum (f_o - f_e)^2 / f_e = 1.869$.
Step 5: Decision
Since the calculated $\chi^2$ (1.869) is less than 15.507, the decision is to accept Ho. This
implies that the distribution follows a binomial distribution.
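The merging of categories and the resulting sample chi-square can be sketched as follows. This is a minimal illustration under stated assumptions: merge_ends is a hypothetical helper of our own, and the observed and expected frequencies are the rounded values given in the table of the example.

```python
# Observed and expected frequencies for 0 to 10 correct picks (from the text).
observed = [2, 8, 39, 123, 207, 250, 203, 115, 40, 13, 0]
expected = [1, 10, 44, 117, 205, 245, 205, 117, 44, 10, 1]


def merge_ends(values):
    """Combine the first two and the last two categories into one each."""
    return [values[0] + values[1]] + values[2:-2] + [values[-2] + values[-1]]


fo = merge_ends(observed)
fe = merge_ends(expected)

chi_sq = sum((o - e) ** 2 / e for o, e in zip(fo, fe))
df = len(fo) - 1 - 0  # K - 1 - g with g = 0

critical_value = 15.507  # chi-square table value for alpha = 0.05, V = 8
reject_h0 = chi_sq > critical_value
print(f"K = {len(fo)}, sample chi-square = {chi_sq:.3f}, reject Ho: {reject_h0}")
```

After merging, every expected frequency is at least 5, which is the condition for the chi-square approximation to be satisfactory.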
Solution
Step 1: State the hypotheses
Ho: The distribution of arrivals follows a Poisson distribution with λ = 2 (96/480)
H1: The distribution of arrivals does not follow a Poisson distribution with λ = 2
Step 2: Calculate expected frequencies
The expected frequency can be calculated as p × n,
where p = probability and
n = number of observations.
$$P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
For example, the probability of zero arrivals can be calculated as
$$P(0) = \frac{2^0 e^{-2}}{0!} = 0.135$$
Dear student, can you calculate the probabilities for the remaining three events (1 arrival,
2 arrivals, and 3 arrivals)? Compare your answers with those given in the table below. Then
the number of hours in which we expect zero patients to arrive can be calculated as the
probability of zero arrivals multiplied by the number of hours over which the observation
was conducted (0.135 × 480 = 65). Dear learner, would you again calculate the expected
number of hours for 1 arrival, 2 arrivals and 3 arrivals? Compare your answers with the
table below.
Number of arrivals       0        1        2        3        Total
Number of hours (fo)     60       140      125      155      480
Probability              0.135    0.271    0.271    0.323    1.00
Number of hours (fe)     65       130      130      155      480
(fo − fe)²/fe            0.38     0.77     0.19     0        1.35
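The entries of this table can be reproduced with a short Python sketch. Two assumptions are made here and should be checked against the original problem: the last category is treated as "3 or more arrivals" (its probability is taken as one minus the sum of the others, which matches the tabulated 0.323), and the expected hours are left unrounded, so the total differs slightly from the rounded 1.35 in the table.

```python
from math import exp, factorial


def poisson_probability(x, lam):
    """P(x) = lam^x * e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)


lam, n_hours = 2, 480
observed = [60, 140, 125, 155]  # hours with 0, 1, 2 and 3 (or more) arrivals

probabilities = [poisson_probability(x, lam) for x in range(3)]
# Assumption: the last class collects 3 or more arrivals.
probabilities.append(1 - sum(probabilities))

expected = [n_hours * p for p in probabilities]
chi_sq = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

for x, (p, fe) in enumerate(zip(probabilities, expected)):
    print(f"x = {x}: probability = {p:.3f}, expected hours = {fe:.0f}")
print(f"sample chi-square = {chi_sq:.2f}")
```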
Find whether the normal distribution gives a close fit at the 5% significance level.
Solution
Step 1: Hypotheses
Ho: The population sampled follows a normal distribution
Ha: The population sampled does not follow a normal distribution
Step 2: Calculate expected frequencies
To calculate the expected frequencies, we first need to calculate the probability that the
values lie within the specified ranges.
The following diagram can serve as an aid.
[Diagram: number line marked at 3, 6, 9, 12 and 15]
We now need to calculate the Z score of each of the stated values and read the probability
from the statistical table. Dear student, do you remember how to calculate Z scores?
Absolutely, $Z = \frac{x - \mu}{\sigma}$, hence $Z(3) = \frac{3 - 9}{3} = -2$.
Likewise, Z(6), Z(9), Z(12) and Z(15) can be calculated. The Z scores are −1, 0, 1 and 2
respectively. To calculate the expected frequencies, we need to find the probability that a
value falls between consecutive Z values, which can be read from the statistical table.
Expected frequency    5    27    68    68    27    5
Step 5: Decision
Accept Ho, since the calculated $\chi^2$ < 7.81
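The expected frequencies for the normal case can be sketched in Python using the standard normal cumulative distribution. The sketch rests on assumptions that are not stated explicitly in the surviving text: the mean of 9 and standard deviation of 3 are read off the Z score calculation above, the sample size of 200 is inferred from the sum of the expected frequencies, and normal_cdf is a hypothetical helper built on math.erf.

```python
from math import erf, sqrt


def normal_cdf(x, mean, sd):
    """Cumulative probability P(X <= x) for a normal distribution."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


mean, sd = 9, 3   # assumed from Z(3) = (3 - 9) / 3
n = 200           # assumed: the expected frequencies in the text sum to 200
boundaries = [3, 6, 9, 12, 15]

# Cumulative probabilities at the class boundaries, with open-ended tails.
cdf_values = [0.0] + [normal_cdf(b, mean, sd) for b in boundaries] + [1.0]
probabilities = [upper - lower
                 for lower, upper in zip(cdf_values, cdf_values[1:])]
expected = [n * p for p in probabilities]

print([round(fe) for fe in expected])
```

Rounded to whole numbers, the six expected frequencies agree with the row given above.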
Activity 6.2
1. The table below contains random sample data on the number of workers absent
from Commercial Bank of Ethiopia. The sample contained the same number (10) of
Mondays, Tuesdays, Wednesdays, and so on. Does it appear that the number of workers
absent is uniformly distributed over the days of the week? Perform a goodness-of-fit test
at the 5 percent level.
Day          Number of workers absent
Monday       15
Tuesday      9
Wednesday    9
Thursday     11
Friday       16
2. When a beer bottle filling machine breaks a bottle, the machine must be
shut down while the broken glass is removed. The production manager at Bedele
Brewery has been using a Poisson distribution with λ = 3 shutdowns per day, on
average, to determine the probabilities of 0, 1, 2, 3, … shutdowns in a day. The
manager has tabulated the number of shutdowns in a random sample of 120
operating days, as shown in table 14.6. We want to test, at the 5% level, the
hypothesis that the number of shutdowns in a day has a Poisson distribution with
λt = λ = 3.
Number of shutdowns in a day (X)    Number of days (fo)
0 3
1 20
2 29
3 22
4 23
5 10
6 or more 13
1. Jacob, a mail order firm, sends out special item advertisements in batches of 50 at a
time. Sampson’s sales manager believes that the probability of receiving an order as a
result of any one advertisement is 0.05. The manager wants to test the hypothesis that
the distribution of the number of orders in a batch of 50 is a binomial distribution with
p = 0.05. Data for a random sample of 120 mailings are given below. Perform a
goodness-of-fit test at the 5% level.
0 16
1 30
2 40
3 20
4 10
5 3
6 1
7 or more 0
2. A seed grower sells early prize corn seed. A random sample of 1000 seeds was planted
to determine how many hours it takes for seeds to germinate. Data for the 1000 seeds
are given in the table below. The sample mean and the sample standard deviation were
150 hours and 12 hours respectively. Does it appear that the hours to germinate are
normally distributed? Perform a goodness of fit test at the 1% level.
Hours to germinate Frequency
3. Given the data below, test goodness of fit test at 1% level if distribution follows normal
distribution. 500and 100
740 or more 7
UNIT SUMMARY
In probability theory and statistics, the chi-square distribution (also chi-squared or χ2
distribution) is one of the most widely used theoretical probability distributions in
inferential statistics, i.e. in statistical significance tests.
It is useful because, under reasonable assumptions, easily calculated quantities can be
proven to have distributions that approximate to the chi-square distribution if the null
hypothesis is true.
Chi-square distribution is used for two main tests:
Tests of independence
Tests of goodness of fit
In a test of independence, chi-square is used to test whether two variables are independent.
The test starts from the assumption of independence. On the basis of this assumption,
expected frequencies are calculated and compared with the actual (observed) frequencies.
If the difference between the expected and observed frequencies is insignificant, the two
variables are said to be independent. Otherwise, the two are dependent. In doing so we need
to find the degrees of freedom for the distribution, which are calculated as (r − 1)(c − 1),
where r = number of rows and c = number of columns.
Chi-square is also used to test whether a distribution follows a given distribution type
(normal, uniform, Poisson, binomial or the like). In a goodness-of-fit test we assume that
the distribution follows the stated distribution and find the expected frequencies under
that assumption. We then compare the expected frequencies we have calculated with the
observed frequencies. If there is a significant difference between the two, the distribution
deviates significantly from the stated distribution; otherwise it follows the stated
distribution. In a goodness-of-fit test the degrees of freedom can be calculated as
ne − 1 − g.
We then need to calculate the sample $\chi^2$:
$$\text{Calculated } \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
We then compare the calculated $\chi^2$ with the table $\chi^2$, which is read as $\chi^2_{\alpha, V}$. If the
each box was recorded. The sample data are given in the table below. Test if
the distribution follows a binomial distribution at 5% level of significance.
X =40 thousands of gallons
Sx= 2.5 thousands of gallons
CHAPTER FIVE
ANALYSIS OF VARIANCE
_____________________________________________________________________
Unit outline
Characteristics of Analysis of Variance
One way analysis of Variance
Two way analysis of Variance
Unit objective
After completing this chapter students will be able to:
Distinguish F distribution from other types of distributions
Know the characteristics of analysis of variance
Use F test to test the hypothesis the mean of more than two population is
equal
Understand and use one way analysis of variance
Understand and use two way analysis of variance
Introduction
One way to compare two population variances, $\sigma_1^2$ and $\sigma_2^2$, is to use the ratio of the
sample variances, $S_1^2/S_2^2$. If $S_1^2/S_2^2$ is nearly equal to 1, you will find little evidence to
indicate that $\sigma_1^2$ and $\sigma_2^2$ are unequal. On the other hand, a very large or a very small value
of $S_1^2/S_2^2$ provides evidence of a difference in the population variances. The assumptions
required for an analysis of variance are similar to those required for Student’s
t-distribution. Analysis of variance is so called because we decide whether to accept or reject
the hypothesis of equal population means by analyzing the variation (variance) in
the sample means. The ANOVA test is performed on simple random samples, one drawn
from each of the several populations. The test assumes that the populations
are normally distributed and have equal variances.
Years of work experience
Student    1 year of experience    2 years of experience    3 years of experience
1          16                      19                       24
2          21                      20                       21
3          18                      21                       22
4          13                      20                       25
Total      68                      80                       92
Mean       17                      20                       23
Specifying Hypotheses
Dear learner, you might have noticed that we have calculated the mean for each of the three
treatments (the three levels of work experience in the table above). We have also calculated
the overall (global) mean of the observations. We now want to test whether these three
sample means were drawn from populations that have identical means. In other words, we
want to test the following null hypothesis:
Ho: $\mu_1 = \mu_2 = \mu_3$ against the alternative hypothesis
To test the null hypothesis that the treatment means are equal, we need to assess two
measures of variability.
1. The variability of the sample within each treatment; this is referred to as within-group
variability.
2. The variability between the m treatments; this is referred to as between-group
variability.
The term variation refers to the sum of squared deviations, which is called the sum of
squares.
$$SST = \sum_{j=1}^{m} n_j (\bar{X}_j - \bar{X})^2$$
where $\bar{X}_j$ = mean of the jth treatment, $n_j$ = number of observations in the jth treatment
and $\bar{X}$ = overall mean.
$n_1(\bar{X}_1 - \bar{X})^2$ = 4(17 − 20)² = 36
$n_2(\bar{X}_2 - \bar{X})^2$ = 4(20 − 20)² = 0
$n_3(\bar{X}_3 - \bar{X})^2$ = 4(23 − 20)² = 36
SST = 72
$$SSW = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$$
SSW = 34 + 2 + 10 = 46
The between sum of squares and the within sum of squares together represent the total
variation of the ANOVA model. We calculate the total variation by adding the squared
deviations of the individual observations about the global mean. The total sum of squares
can be calculated as:
$$TSS = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2$$
Where:
TSS = total sum of squares
Xij = value of the observation in the ith row and jth column
$\bar{X}$ = overall mean
To put it more simply, we obtain the total sum of squares by adding the between-treatments
variation and the within-treatments variation.
TSS = SSW + SST = 46 + 72 = 118
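The three sums of squares can be verified with a short Python sketch. The variable names are our own; the data are the twelve observations from the work experience table, one list per treatment.

```python
# One list per treatment (column of the work experience table).
groups = [
    [16, 21, 18, 13],
    [19, 20, 21, 20],
    [24, 21, 22, 25],
]

all_values = [x for group in groups for x in group]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(group) / len(group) for group in groups]

# Between-treatments sum of squares.
sst = sum(len(group) * (mean - grand_mean) ** 2
          for group, mean in zip(groups, group_means))
# Within-treatments sum of squares.
ssw = sum((x - mean) ** 2
          for group, mean in zip(groups, group_means)
          for x in group)
# Total sum of squares, computed directly about the grand mean.
tss = sum((x - grand_mean) ** 2 for x in all_values)

print(f"SST = {sst:.0f}, SSW = {ssw:.0f}, TSS = {tss:.0f}")
```

Computing TSS directly from the individual observations and comparing it with SST + SSW is a useful check on the arithmetic.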
An unbiased estimate of the between-treatments mean square can be obtained by dividing
SST by (m − 1) degrees of freedom.
MST = SST/(m − 1)
where MST = between-treatment mean square.
In our example the between-treatment mean square is MST = 72/2 = 36. Similarly, an
unbiased estimate of the within-treatment mean square is found by dividing SSW by
(n − m) degrees of freedom.
MSW = SSW/(n − m)
where MSW = within-treatment mean square
= 46/9 = 5.11
We now test the null hypothesis that the population treatment means are equal by
comparing the between-treatment mean square with the within-treatment mean square.
Accept Ho if F(2, 9) < 4.26. The value 4.26 is a critical value that is read from the
statistical table at the end of the module with the heading analysis of variance. To read the
value from the statistical table:
Search for the F distribution table with the 5% significance level. Search for 2 degrees of
freedom in the numerator (top row) and 9 degrees of freedom in the denominator (first
column), and read the value at the intersection of the two, which is 4.26 in this case.
Decision
On the basis of the calculations that we have already made, the calculated F is
7.04, which is greater than the critical F value of 4.26; therefore, the null hypothesis must
be rejected and the alternative hypothesis accepted.
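The mean squares and the F ratio for this example can be sketched as follows. The sums of squares, the number of treatments and the critical value are taken from the text; the variable names are our own.

```python
sst, ssw = 72, 46   # between- and within-treatment sums of squares
m, n = 3, 12        # number of treatments and total number of observations

mst = sst / (m - 1)  # between-treatment mean square
msw = ssw / (n - m)  # within-treatment mean square
f_ratio = mst / msw

critical_value = 4.26  # F table value for (2, 9) degrees of freedom at 5%
reject_h0 = f_ratio > critical_value
print(f"MST = {mst:.2f}, MSW = {msw:.2f}, F = {f_ratio:.2f}, reject Ho: {reject_h0}")
```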
The fourth row represents the southern part of Ethiopia.
The observation made is presented as follows
Year of Experience
Test, at the 5% level, whether the population mean salaries are equal among the various
years of experience and among the various geographical locations.
Solution
Specifying the hypotheses
We have two hypotheses to test:
1. Ho: Population mean salaries among the various years of work experience are
equal
2. Ho: Population mean salaries among the various regions are equal.
H1: Population mean salaries are not equal.
Between and residual sums of squares
The necessary calculation for two-way analysis of variance involves computation of the
following values:
SST = between-treatment sum of squares
SSB = between-block sum of squares
TSS = total sum of squares
SSE = error sum of squares
Dear learner, we have already seen how to calculate the between-treatment sum of squares
and the total sum of squares. In fact we have already computed the two. Do you recall the
way we calculated them?
We now calculate the between-blocks sum of squares and the error sum of squares. The
between-blocks sum of squares can be calculated as:
$$SSB = \sum_{i=1}^{I} JK (\bar{X}_i - \bar{X})^2$$
where i, j and k identify the kth salary observation in the ith row and jth column, and
SSB = between-blocks sum of squares
$\bar{X}_i$ = sample mean of the ith row
Where: I − 1 = degrees of freedom for the between-blocks variance.
The residual (error) variance can be calculated in the same way as:
$$MSE = \frac{SSE}{(J - 1)(I - 1)} = \frac{42.664}{3(2)} = \frac{42.664}{6} = 7.111$$
where (J − 1)(I − 1) = degrees of freedom for the residual or error variance.
To test our null hypothesis about the influence of the various years of work experience, we
must calculate the F ratio.
F(2, 6) = MST/MSE = 36/7.111 = 5.063
The above ratio is the calculated F ratio. To decide whether to accept or reject the null
hypothesis, we need to read the critical value from the statistical table and compare it with
the calculated F value.
The critical value for the above decision is F(2, 6, 0.05) = 5.14.
How to read the critical value from the statistical table:
Find the F distribution table for the stated significance level.
Find 2 degrees of freedom in the numerator, which is found in the first row of the table.
Find 6 degrees of freedom in the denominator, which is found in the first column of the
table.
Find the intersection of the two and read the value found there.
INTERPRETATION
As the critical F value is greater than the calculated F value, we cannot reject the null
hypothesis. This implies that there is no difference between the population mean
salaries associated with the various years of work experience.
In testing the null hypothesis for the influence of geographical location on salaries, we find
that the F ratio is:
$$F(3, 6) = \frac{1.112}{7.111} = 0.156$$
The critical value associated with this test is F(3, 6, 0.05) = 4.76. Since the calculated F
value is smaller than the critical value, we again cannot reject the null hypothesis that the
population mean salaries associated with the geographical locations are equal.
Source of variation    Sum of squares    Degrees of freedom    Mean square
Between treatments     SST = 72          (J − 1) = 2           36
Between blocks         SSB = 3.336       (I − 1) = 3           1.112
Residual               SSE = 42.664      (J − 1)(I − 1) = 6    7.111
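The two-way ANOVA table can be rebuilt from the sums of squares with a small Python sketch. Since the raw salary data are not reproduced here, the sketch starts from the sums of squares given in the text; J = 3 (years-of-experience columns) and I = 4 (regions) are inferred from the stated degrees of freedom, and the dictionary layout is our own.

```python
sums_of_squares = {"treatment": 72, "block": 3.336, "error": 42.664}
n_treatments, n_blocks = 3, 4  # J and I, inferred from the degrees of freedom

degrees_of_freedom = {
    "treatment": n_treatments - 1,
    "block": n_blocks - 1,
    "error": (n_treatments - 1) * (n_blocks - 1),
}
mean_squares = {
    source: sums_of_squares[source] / degrees_of_freedom[source]
    for source in sums_of_squares
}

f_treatment = mean_squares["treatment"] / mean_squares["error"]
f_block = mean_squares["block"] / mean_squares["error"]

# Critical values read from the F table at the 5% level (from the text).
reject_treatment = f_treatment > 5.14  # F(2, 6, 0.05)
reject_block = f_block > 4.76          # F(3, 6, 0.05)
print(f"F treatments = {f_treatment:.3f}, F blocks = {f_block:.3f}")
```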
UNIT SUMMARY
One-way ANOVA evaluates the effect of a single factor on a single response variable. For
example, a clinician may be interested in determining whether there are differences in the
age distribution of patients enrolled in two different study groups. Using ANOVA to make
this comparison requires that several assumptions be satisfied. Specifically, the patients
must be selected randomly from each of the population groups, a value for the response
variable is recorded for each sampled patient, the distribution of the response variable is
normally distributed in each population, and the variance of the response variable is the
same in each population. In the above example, age would represent the response variable,
while the treatment group represents the independent variable, or factor, of interest.
significance based on the F distribution, which tests the null hypothesis (H0) that the
means of the k groups are equal:
H0: μ1 = μ2 = μ3 = … = μk
SELF CHECK EXERCISE 5
1. An investor selected random samples of stock purchases recommended by
three stock brokers a year ago. The investor calculated the percent returns
on each stock during the year, as given below. Perform an ANOVA test at α =
0.05 level to determine if the mean returns for the three advisory firms are
equal.
Percent returns
A B C
7.0 8.7 3.4
2.8 5.2 8.1
5.1 4.9 4.2
4.6 7.0 2.6
3. Three methods for assembling a product are to be tested at the 0.05 level to
determine whether mean times per assembly for the methods are equal.
Random sample assembly times in minutes are given below. Perform the
ANOVA test.
Method one Method two Method three
11 19 19
13 25 14
19 16 13
18 22 14
14 18 20
4. Stock analyst thinks four stock mutual funds generate about the same return.
She collected the accompanying rate of return data on four different mutual
funds during the last 5 years.
A B C D
1988 12 11 13 15
1989 12 17 19 11
1990 13 18 15 12
1991 18 20 25 11
1992 12 19 19 10
Chapter 6
Simple linear Regression and Correlation
Chapter Objective:
Dear reader, after studying this chapter, you will be able to:
Define regression analysis
Define and fit simple linear regression
Predict the population average value of the dependent variable on the basis of
known (fixed) values of the independent variable.
Understand correlation
Compute the Pearsonian and rank correlation coefficients.
The relationship between any two variables may be linear or non-linear. The former
implies a constant absolute change in the dependent variable in response to a unit
change in the independent variable, while the latter implies a varying marginal change in
the dependent variable in response to changes in the independent variable.
Consequently, in this chapter we will confine ourselves to regression involving only two
variables whose relationship is linear. This case is called simple linear regression.
[Scatter diagram: Y plotted against X]
When carefully observed, the scatter diagram at least shows the nature of relationship;
whether positive or negative and whether the curve is linear or non-linear.
When the general course of movement of the paired points is best described by a straight
line, the next task is to fit a regression line which lies as close as possible to every point
on the scatter diagram. This can be done by means of either free hand drawing or the
method of least squares. However, the latter is the most widely used method.
6.1.2. The regression Equation
A regression equation is a statement of equality that defines the relationship between two
variables. The equation of the line used to predict the value of the
dependent variable takes the form Ye = a + bX. The most universally used and
statistically accepted method of fitting such an equation is the method of least squares.
The Method of Least Squares:-
This method requires that the straight line be fitted so that the sum of the squared vertical
deviations of the observed Y values from the straight line (the predicted Y values) is a
minimum.
As shown in fig 6.1, if e1, e2, …, e5 are the vertical deviations of the observed Y values
from the straight line (predicted Y values, Ye), fitting a straight line in keeping with the
above condition requires that (for a sample of size n)
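$$e_1^2 + e_2^2 + \dots + e_n^2 = \sum_{i=1}^{n} e_i^2$$
is a minimum. This can be done by partially differentiating $\sum e_i^2$ with respect to a and b
and setting each derivative equal to zero, where
$$\sum e_i^2 = \sum (Y_i - a - bX_i)^2$$
Differentiating with respect to a:
$$\frac{\partial \sum e_i^2}{\partial a} = \frac{\partial \sum (Y_i - a - bX_i)^2}{\partial a} = 0$$
$$-2 \sum (Y_i - a - bX_i) = 0$$
$$\sum Y_i - na - b \sum X_i = 0$$
$$na = \sum Y_i - b \sum X_i$$
$$a = \frac{\sum Y_i}{n} - b\,\frac{\sum X_i}{n} = \bar{Y} - b\bar{X}$$
Differentiating with respect to b:
$$\frac{\partial \sum e_i^2}{\partial b} = \frac{\partial \sum (Y_i - a - bX_i)^2}{\partial b} = 0$$
$$-2 \sum (Y_i - a - bX_i) X_i = 0$$
$$\sum Y_i X_i - a \sum X_i - b \sum X_i^2 = 0$$
Substituting $a = \bar{Y} - b\bar{X}$:
$$\sum Y_i X_i - (\bar{Y} - b\bar{X}) \sum X_i - b \sum X_i^2 = 0$$
$$\sum X_i Y_i - \bar{Y} \sum X_i - b \left[ \sum X_i^2 - \bar{X} \sum X_i \right] = 0$$
$$b = \frac{\sum X_i Y_i - \bar{Y} \sum X_i}{\sum X_i^2 - \bar{X} \sum X_i}$$
Therefore,
$$b = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - (\sum X_i)^2}$$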
Example 6.1. Suppose we want to study the relationship between input (number of
workers) and output (thousands of Birr) of five factories given in table 6.1 above. To fit
the regression line of Yi (thousands of Birr) on Xi (number of workers), we can employ
the method of least squares as follows:
Solution. Table 6.2.
Arrange the data in tabular form
         Yi    Xi    XiYi    Xi²
Total    40    20    230     120
Mean     8     4
$$b = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - (\sum X_i)^2} = \frac{5(230) - 20(40)}{5(120) - (20)^2} = \frac{350}{200} = 7/4$$
$$a = \bar{Y} - b\bar{X} = 8 - (7/4)(4) = 1$$
Ye = 1 + (7/4) Xi
For Xi = 8:
Ye = 1 + (7/4)(8)
Ye = 15
Consequently, if a factory has 8 workers, its estimated level of output will be 15 thousand ETB.
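The least squares computation can be sketched in Python as follows. The function fit_line is a hypothetical helper of our own. The five (Xi, Yi) pairs are assumed to be those of the five-factory table that appears in section 6.2, whose totals (ΣY = 40, ΣX = 20, ΣXY = 230, ΣX² = 120) match the totals used in this example.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for the line Ye = a + b X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
    a = sum_y / n - b * sum_x / n
    return a, b


workers = [2, 3, 1, 5, 9]   # Xi, number of workers (assumed data, see lead-in)
output = [4, 7, 3, 9, 17]   # Yi, output in thousands of Birr

a, b = fit_line(workers, output)
predicted = a + b * 8
print(f"Ye = {a:.2f} + {b:.2f} X; for 8 workers Ye = {predicted:.1f}")
```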
Example 6.2. In what follows you are provided with sample observations on price
and quantity supplied of a commodity X by a competitive firm.
a) Construct the scatter diagram
b) What is the linear regression of Yi (quantity supplied) on Xi (price of the
commodity X)?
c) Suppose price of the commodity X be 32, what will be the quantity supplied by
the firm?
Tab. 6.3. Data on price and quantity supplied.
a) [Scatter diagram of quantity supplied (Yi) against price (Xi)]
b) $$b = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - (\sum X_i)^2} = \frac{12(27{,}525) - 460(675)}{12(19{,}750) - (460)^2} = 0.7795$$
$$a = \bar{Y} - b\bar{X}$$
$$\bar{Y} = \frac{\sum Y_i}{n} = 675/12$$
$$\bar{X} = \frac{\sum X_i}{n} = 460/12$$
c) Xi = 32
Ye = 26.3718 + 0.7795 Xi
= 26.3718 + 0.7795 (32)
= 26.3718 + 24.944
= 51.3158
If the price of X is 32, the estimated quantity supplied will be approximately equal to 51
units.
6.1.3. Regression of X on Y
In sub-topic 6.1.2 above we explored the regression of Y on X. Sometimes it
is possible, and of interest, to fit the regression of X on Y, i.e., treating Y as the
independent variable and X as the dependent variable.
In such cases, the general form of the equation is given by:
$$X_e = a_0 + b_0 Y_i$$
where Xe = expected value of X
a0 = X-intercept
b0 = slope of the regression
Applying the principle of least squares as before, the constants a0 and b0 are given as
follows:
$$a_0 = \bar{X} - b_0 \bar{Y}$$
$$b_0 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum Y_i^2 - (\sum Y_i)^2}$$
N.B. The regression lines of Y on X and of X on Y intersect at $(\bar{X}, \bar{Y})$.
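The note about the point of intersection can be illustrated numerically. In this sketch the five data pairs are the ones tabulated in section 6.2 (totals ΣY = 40 and ΣX = 20), used here only as an example, and the variable names are our own.

```python
x = [2, 3, 1, 5, 9]
y = [4, 7, 3, 9, 17]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)
mean_x, mean_y = sum_x / n, sum_y / n

# Regression of Y on X: Ye = a + b X
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
a = mean_y - b * mean_x

# Regression of X on Y: Xe = a0 + b0 Y
b0 = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y**2)
a0 = mean_x - b0 * mean_y

# Both lines pass through the point of means (mean_x, mean_y).
y_at_mean_x = a + b * mean_x
x_at_mean_y = a0 + b0 * mean_y
print(f"Y on X gives {y_at_mean_x:.2f} at X = {mean_x:.0f}")
print(f"X on Y gives {x_at_mean_y:.2f} at Y = {mean_y:.0f}")
```

Note that the two slopes b and b0 are generally not reciprocals of each other; the two lines coincide only when the correlation is perfect.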
6.2. Correlation
The correlation coefficient measures the degree to which two variables are related
/associated – simple correlation denoted by r. For more than two variables we have
multiple correlations.
Two variables may have either positive correlation, negative correlation or may not be
correlated. Furthermore, depending on the form of relationship the correlation between
two variables may be linear or non-linear. Therefore, in this section, we shall be
concerned with quantifying the degree of association between two variables with linear
relationship.
Contrary to regression analysis explained in the previous section (6.1), the computation
of coefficient of correlation does not require one variable to be designated as dependent
and the other as independent.
The measure of the degree of relationship between any two variables, known as the
Pearsonian coefficient of correlation and usually denoted by r, is defined as
$$r_{xy} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$
and is termed the product-moment formula. It can be further simplified as
$$r_{xy} = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{\left[ n \sum X_i^2 - (\sum X_i)^2 \right] \left[ n \sum Y_i^2 - (\sum Y_i)^2 \right]}}$$
N.B. The building blocks of this formula are, therefore,
$\sum X_i Y_i$, $\sum X_i$, $\sum Y_i$, $\sum X_i^2$, $\sum Y_i^2$ and n (sample size).
Yi Xi Xi² Yi² XiYi
4 2 4 16 8
7 3 9 49 21
3 1 1 9 3
9 5 25 81 45
17 9 81 289 153
Total 40 20 120 444 230
Yi Xi Xi² Yi² XiYi
40 15 225 1600 600
45 20 400 2025 900
40 25 625 1600 1000
50 30 900 2500 1500
55 35 1225 3025 1925
60 40 1600 3600 2400
60 45 2025 3600 2700
65 50 2500 4225 3250
70 55 3025 4900 3850
75 60 3600 5625 4500
55 40 1600 3025 2200
60 45 2025 3600 2700
Total 675 460 19,750 39,325 27,525
Therefore,
$$r_{xy} = \frac{12(27{,}525) - 675(460)}{\sqrt{\left[ 12(19{,}750) - (460)^2 \right] \left[ 12(39{,}325) - (675)^2 \right]}} = \frac{19{,}800}{20{,}331.872} = 0.974$$
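The same coefficient can be obtained from the raw data with a short Python sketch. The function pearson_r is a hypothetical helper of our own that implements the simplified formula; the twelve price and quantity pairs are the Xi and Yi columns of the table above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient using the simplified formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


price = [15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 40, 45]     # Xi
quantity = [40, 45, 40, 50, 55, 60, 60, 65, 70, 75, 55, 60]  # Yi

r = pearson_r(price, quantity)
print(f"r = {r:.3f}")
```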
Yi Xi Xi² Yi² XiYi
5 3 9 25 15
8 4 16 64 32
4 2 4 16 8
10 6 36 100 60
18 10 100 324 180
Total 45 25 165 529 295
$$r = \frac{5(295) - 45(25)}{\sqrt{\left[5(165) - (25)^2\right]\left[5(529) - (45)^2\right]}} = \frac{350}{352.14} = 0.99$$
Therefore, we have shown that property 4 is true.
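The two five-row tables differ only in that 1 has been added to every Xi and every Yi, so property 4 is assumed here to be the invariance of r under a change of origin. A minimal sketch of that check follows; the helper name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r using the simplified (computational) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    denominator = sqrt(
        (n * sum(x * x for x in xs) - sum_x ** 2)
        * (n * sum(y * y for y in ys) - sum_y ** 2)
    )
    return numerator / denominator


xs = [2, 3, 1, 5, 9]
ys = [4, 7, 3, 9, 17]

# Shift the origin: add 1 to every observation, as in the second table
xs_shifted = [x + 1 for x in xs]
ys_shifted = [y + 1 for y in ys]

r_original = pearson_r(xs, ys)
r_shifted = pearson_r(xs_shifted, ys_shifted)
print(round(r_original, 2), round(r_shifted, 2))  # 0.99 0.99
```

Adding a constant moves every value and the mean by the same amount, so the deviations from the mean, and therefore r, are unchanged.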
Spearman’s Rank Correlation Coefficient
The Pearson coefficient of correlation cannot be used in cases when the direct
quantitative measurement of the phenomenon under study is not possible. In such cases,
we make use of the rank correlation coefficient.
Steps involved in calculating Spearman’s coefficient of rank correlation:
1. Rank the X values among themselves, giving rank 1 to the largest (or smallest)
value, rank 2 to the next largest (or smallest) value, and so on.
2. Rank the Y values among themselves in the same way as for X.
3. When there are ties in rank, i.e., when several values share the same rank,
assign to each of the tied observations the mean of the ranks they jointly occupy;
the next rank(s) are then skipped.
4. Find the sum of the squares of the differences between the ranks of the two variables, $\sum d_i^2$.
5. Apply the formula
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$
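The five steps above can be sketched as follows. This is an illustration under stated assumptions: the function names `average_ranks` and `spearman_rs` are hypothetical, rank 1 is given to the largest value, and the two judges' scores are made-up data, not the data of any example in this module.

```python
def average_ranks(values):
    """Rank values (1 = largest); tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2   # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1                             # the ranks in between are skipped
    return ranks


def spearman_rs(xs, ys):
    """Spearman's rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Hypothetical scores given by two judges to five contestants
judge_a = [9, 8, 7, 6, 5]
judge_b = [8, 9, 6, 7, 5]

rs = spearman_rs(judge_a, judge_b)
print(round(rs, 2))  # 0.8
```

Here the ranks are 1, 2, 3, 4, 5 and 2, 1, 4, 3, 5, so the sum of squared differences is 4 and rs = 1 − 24/120 = 0.8. When many ties are present the formula in step 5 is only approximate.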
Total 4
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6(4)}{5(24)} = 1 - \frac{24}{120} = 0.80$$
Interpretation: Since rs = 0.80, there is a high degree of agreement between the rankings of
Judge A and Judge B.
Review Exercises 6
1. Define and distinguish between:
a) Regression and correlation
b) Simple and multiple regression
c) Linear and non-linear relationship
2. Bring out the relevance of a scatter diagram in regression analysis.
3. Explain the meaning and status of the two constants a and b in the regression
equation Ye = a + bXi.
4. The marks obtained by 10 students in their B.A. degree in management (graduation)
and in the MBA entrance test are given below.
Graduation (Xi) 50 52 55 60 62 65 65 66 70 75
Entrance test (Yi) 52 50 57 65 65 62 65 65 71 75
Find:
a) The two regression equations
b) The correlation coefficient between two sets of marks
5. Obtain the regression equation of X on Y and Y on X for the paired data given
below. Also compute the coefficient of correlation.
Market price of X 26 28 30 31 35
Market price of Y 20 27 28 30 25
6. Ten students got the following marks in Maths and Statistics
Student A B C D E F G H I J
Maths (X) 78 36 98 25 75 82 90 62 65 39
Statistics (Y) 84 51 91 60 68 62 86 58 58 47
Compute the coefficient of Rank correlation and interpret the result.
7. For a certain set of paired data on X and Y, 3Xi + 2Yi – 26 = 0 and 6Xi + Yi
– 31 = 0 are the two regression equations.
a) Find the mean values
S. No. in the interview list 1 2 4 5 8 9 10 11 13 14 15 17 18 19 20
Ranking by the sales manager (Xi) 2 3 1 5 4 6 8 7 9 10 12 11 13 14 15
Ranking by the psychologist (Yi) 1 3 2 4 6 5 7 9 8 11 10 12 14 13 15
Compute the rank correlation coefficient.
Z-distribution table (area under the standard normal curve between 0 and z)
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
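Entries of the table above can be reproduced with the error function from Python's standard library. This is only a sketch for checking a lookup; the function name `area_0_to_z` is hypothetical, and it assumes the table gives the area between 0 and z, which agrees with the printed values.

```python
from math import erf, sqrt


def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z."""
    return 0.5 * erf(z / sqrt(2))


# Row 1.0, column 0.00; row 1.9, column 0.06; row 2.5, column 0.08
for z in (1.00, 1.96, 2.58):
    print(f"z = {z:.2f}  area = {area_0_to_z(z):.4f}")
```

To read the table by hand, split z into its first decimal (the row) and its second decimal (the column); for z = 1.96 the entry is in row 1.9 under column 0.06.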
t-Distribution table
df α = 0.1 0.05 0.025 0.01 0.005 0.001 0.0005
∞ tα=1.282 1.645 1.96 2.326 2.576 3.091 3.291
1 3.078 6.314 12.706 31.821 63.656 318.289 636.578
2 1.886 2.92 4.303 6.965 9.925 22.328 31.6
3 1.638 2.353 3.182 4.541 5.841 10.214 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.61
5 1.476 2.015 2.571 3.365 4.032 5.894 6.869
6 1.44 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.86 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.25 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.93 4.318
13 1.35 1.771 2.16 2.65 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.14
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.12 2.583 2.921 3.686 4.015
17 1.333 1.74 2.11 2.567 2.898 3.646 3.965
18 1.33 1.734 2.101 2.552 2.878 3.61 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.85
21 1.323 1.721 2.08 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.5 2.807 3.485 3.768
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.06 2.485 2.787 3.45 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.689
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.66
30 1.31 1.697 2.042 2.457 2.75 3.385 3.646
60 1.296 1.671 2 2.39 2.66 3.232 3.46
120 1.289 1.658 1.98 2.358 2.617 3.16 3.373
∞ 1.282 1.645 1.96 2.326 2.576 3.091 3.291
Chi-Square Distribution Table
df\area 0.95 0.9 0.75 0.5 0.25 0.1 0.05 0.025 0.01 0.005
1 0.00393 0.01579 0.10153 0.45494 1.3233 2.70554 3.84146 5.02389 6.6349 7.87944
2 0.10259 0.21072 0.57536 1.38629 2.77259 4.60517 5.99146 7.37776 9.21034 10.59663
3 0.35185 0.58437 1.21253 2.36597 4.10834 6.25139 7.81473 9.3484 11.34487 12.83816
4 0.71072 1.06362 1.92256 3.35669 5.38527 7.77944 9.48773 11.14329 13.2767 14.86026
5 1.14548 1.61031 2.6746 4.35146 6.62568 9.23636 11.0705 12.8325 15.08627 16.7496
6 1.63538 2.20413 3.4546 5.34812 7.8408 10.64464 12.59159 14.44938 16.81189 18.54758
7 2.16735 2.83311 4.25485 6.34581 9.03715 12.01704 14.06714 16.01276 18.47531 20.27774
8 2.73264 3.48954 5.07064 7.34412 10.21885 13.36157 15.50731 17.53455 20.09024 21.95495
9 3.32511 4.16816 5.89883 8.34283 11.38875 14.68366 16.91898 19.02277 21.66599 23.58935
10 3.9403 4.86518 6.7372 9.34182 12.54886 15.98718 18.30704 20.48318 23.20925 25.18818
11 4.57481 5.57778 7.58414 10.341 13.70069 17.27501 19.67514 21.92005 24.72497 26.75685
12 5.22603 6.3038 8.43842 11.34032 14.8454 18.54935 21.02607 23.33666 26.21697 28.29952
13 5.89186 7.0415 9.29907 12.33976 15.98391 19.81193 22.36203 24.7356 27.68825 29.81947
14 6.57063 7.78953 10.16531 13.33927 17.11693 21.06414 23.68479 26.11895 29.14124 31.31935
15 7.26094 8.54676 11.03654 14.33886 18.24509 22.30713 24.99579 27.48839 30.57791 32.80132
16 7.96165 9.31224 11.91222 15.3385 19.36886 23.54183 26.29623 28.84535 31.99993 34.26719
17 8.67176 10.08519 12.79193 16.33818 20.48868 24.76904 27.58711 30.19101 33.40866 35.71847
18 9.39046 10.86494 13.67529 17.3379 21.60489 25.98942 28.8693 31.52638 34.80531 37.15645
19 10.11701 11.65091 14.562 18.33765 22.71781 27.20357 30.14353 32.85233 36.19087 38.58226
20 10.85081 12.44261 15.45177 19.33743 23.82769 28.41198 31.41043 34.16961 37.56623 39.99685
21 11.59131 13.2396 16.34438 20.33723 24.93478 29.61509 32.67057 35.47888 38.93217 41.40106
22 12.33801 14.04149 17.23962 21.33704 26.03927 30.81328 33.92444 36.78071 40.28936 42.79565
23 13.09051 14.84796 18.1373 22.33688 27.14134 32.0069 35.17246 38.07563 41.6384 44.18128
24 13.84843 15.65868 19.03725 23.33673 28.24115 33.19624 36.41503 39.36408 42.97982 45.55851
25 14.61141 16.47341 19.93934 24.33659 29.33885 34.38159 37.65248 40.64647 44.3141 46.92789
26 15.37916 17.29188 20.84343 25.33646 30.43457 35.56317 38.88514 41.92317 45.64168 48.28988
27 16.1514 18.1139 21.7494 26.33634 31.52841 36.74122 40.11327 43.19451 46.96294 49.64492
28 16.92788 18.93924 22.65716 27.33623 32.62049 37.91592 41.33714 44.46079 48.27824 50.99338
29 17.70837 19.76774 23.56659 28.33613 33.71091 39.08747 42.55697 45.72229 49.58788 52.33562
30 18.49266 20.59923 24.47761 29.33603 34.79974 40.25602 43.77297 46.97924 50.89218 53.67196
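The column headings of the chi-square table can be read as areas to the right of the tabulated value. One way to check this is the row for df = 2, where the right-tail area has the closed form exp(−x/2), so the critical value is −2 ln(area). The sketch below covers this special case only; the function name `chi2_critical_df2` is hypothetical.

```python
from math import log


def chi2_critical_df2(right_tail_area):
    """Chi-square critical value for df = 2 only: solves exp(-x/2) = right_tail_area."""
    return -2.0 * log(right_tail_area)


# Compare with the df = 2 row of the table
for area in (0.95, 0.5, 0.05, 0.01):
    print(f"area = {area}  critical value = {chi2_critical_df2(area):.5f}")
```

For other degrees of freedom there is no such simple formula, and the table (or statistical software) should be used.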