Stat II Module


COLLEGE OF BUSINESS AND ECONOMICS

DEPARTMENT OF MANAGEMENT
MODULE OF STATISTICS FOR MANAGEMENT II
Course code: MGMT 2073
Credit hours: 3

Prepared by: Kibru Kefay (MBA)


Reviewed By: Mengistu Amese (MBA) and
Kifle Haile (MBA)

February, 2023
Bonga, Ethiopia

Table of Contents

CHAPTER ONE
SAMPLING AND SAMPLING DISTRIBUTION
  1.1 Introduction
  1.2 Importance of sampling theory
  1.3 Probability (Random) and Non-Probability (non-random) Sampling
    1.3.1 Probability Samples
    1.3.2 Non-Probability Samples
  1.4 Bias and Error in Sampling
  1.5 SAMPLING DISTRIBUTIONS
    1.5.1 SAMPLING DISTRIBUTION OF THE MEAN ( X̄ )
    1.5.2 SAMPLING DISTRIBUTION OF THE PROPORTION ( p )
    1.5.3 SAMPLING DISTRIBUTION OF THE DIFFERENCE BETWEEN TWO SAMPLE MEANS ( X̄₁ − X̄₂ )
    1.5.4 SAMPLING DISTRIBUTION OF THE DIFFERENCE OF TWO PROPORTIONS ( p₁ − p₂ )
  SELF CHECK EXERCISE 1
CHAPTER TWO
STATISTICAL ESTIMATION AND STATISTICAL INFERENCE
  2.1 INTRODUCTION TO STATISTICAL ESTIMATION
    2.1.1 CRITERIA FOR POINT ESTIMATOR
    2.1.2 POINT ESTIMATOR OF THE MEAN
    2.1.3 POINT ESTIMATE OF THE POPULATION PROPORTION
    2.1.4 POINT ESTIMATE OF THE UNKNOWN POPULATION STANDARD DEVIATION
    2.1.5 POINT ESTIMATOR OF STANDARD ERROR OF THE MEAN
    2.1.6 A POINT ESTIMATE OF SAMPLE STANDARD ERROR OF THE PROPORTION
  2.2 INTERVAL ESTIMATE
    2.2.1 INTERVAL ESTIMATE OF POPULATION MEAN
      Confidence Interval Estimate of μ, Normal Population, and Standard Deviation
      Precision, Confidence and Sample Size
      CONFIDENCE ESTIMATE OF μ, NORMAL POPULATION, σ UNKNOWN
    2.2.2 CONFIDENCE INTERVAL ESTIMATE FOR POPULATION PROPORTION
  2.3 DETERMINATION OF SAMPLE
    2.3.1 Sample Size for Estimating Population Proportion
    2.3.2 Sample Size for Estimating a Population Mean
  UNIT SUMMARY
CHAPTER THREE
HYPOTHESES TESTING
  3.1 Introduction to Hypothesis testing
  3.2 Type I and type II errors
  3.3 Steps in Hypothesis Testing
  3.4 One tail and two tail tests
  3.5 HYPOTHESIS TEST OF POPULATION MEAN
  3.6 HYPOTHESIS TEST OF PROPORTIONS
  3.7 THE DIFFERENCE OF TWO MEANS
  3.8 TESTING THE DIFFERENCE OF TWO POPULATION PROPORTIONS
  3.9 STUDENT'S T-TEST
  3.10 A DIFFERENCE OF TWO MEANS WHEN SAMPLE SIZE IS SMALL AND STANDARD DEVIATION UNKNOWN
  UNIT SUMMARY
  SELF CHECK EXERCISE 3
UNIT FOUR
CHI-SQUARE DISTRIBUTION
  4.1 CHAPTER INTRODUCTION
  4.2 GENERAL CHARACTERISTICS OF CHI SQUARE DISTRIBUTION
  4.3 TEST FOR INDEPENDENCE AND CELL COUNTS FOR TEST OF INDEPENDENCE
  4.4 TESTING THE EQUALITY OF MORE THAN TWO POPULATION PROPORTIONS
  4.5 GOODNESS OF FIT TESTS
    4.5.1 GOODNESS-OF-FIT TESTS UNIFORM DISTRIBUTION
    4.5.2 Goodness of fit Binomial Distribution
    4.5.3 Goodness of fit test for Poisson distribution
    4.5.4 Goodness of fit test Normal Distribution
  UNIT SUMMARY
  SELF CHECK EXERCISE 4
CHAPTER FIVE
ANALYSIS OF VARIANCE
  5.1 Chapter Introduction
  5.2 One way Analysis of variance
  5.3 TWO-WAY ANALYSIS OF VARIANCE
  UNIT SUMMARY
  SELF CHECK EXERCISE 5
Chapter 6
Simple linear Regression and Correlation
  6.1 Simple Linear Regression
    6.1.1 The Scatter Diagram
    6.1.2 The regression Equation
  6.2 Correlation
  Review Exercises 6

CHAPTER ONE

SAMPLING AND SAMPLING DISTRIBUTION

Unit content
• Sampling and sampling theory
• Probability (Random) and Non-Probability (non-random) Sampling
• Bias and Error in Sampling
• Sampling Distributions
  • Sampling distribution of the mean ( X̄ )
  • Sampling distribution of the proportion ( p )
  • Sampling distribution of the difference between two sample means ( X̄₁ − X̄₂ )
  • Sampling distribution of the difference of two proportions ( p₁ − p₂ )

Chapter Objectives:

After successful completion of the unit, you will be able to:

• Define the concept of sampling and sampling distribution
• Identify the merits and demerits of sampling
• Differentiate random and non-random sampling methods
• Identify causes of sampling error
• Develop the sampling distribution of the mean and proportion
• Calculate the mean and standard deviation of the proportion

1.1 Introduction
What is sampling?
Data are collected from a target population by means of a survey. If the survey covers the
whole population, it is called a census; if it covers only part of the population, it is
called a sample survey, and the process of selecting that part is called sampling.
Why is sampling preferable?
• Cheaper than a census
• Takes less time than a census
• Economy of effort, as relatively few staff are needed
• More detailed information can be collected from a sample
• Better quality of interviewing, supervision and other related activities

Limitations of sampling
• It fails to provide information on each individual unit
• Sampling gives rise to certain errors
• It is difficult to check for omissions of certain units
Parameter and Statistic

When measures such as the mean, median, mode and standard deviation describe the
characteristics of a sample, they are called statistics; when they describe a
population, they are called parameters.

Population (parameters)               Sample (statistics)
Population size: N                    Sample size: n
Population mean: μ                    Sample mean: X̄
Population standard deviation: σ      Sample standard deviation: s
Population proportion: π              Sample proportion: p

One of the objectives of a sample survey is to estimate certain population parameters. A
point to note is that the true value of a population parameter is an unknown constant. It
can be determined only by a complete study of the population. The concept of statistical
inference comes into play whenever this is impossible or not practically feasible. A
statistic, which is a sample-based quantity, must then serve as our source of information
about the value of the parameter. In this context, there are three crucial points.
• As the sample is only part of the population, the numerical value of a statistic is
normally not expected to equal the value of the parameter exactly.
• Since different samples can be drawn from a particular population, the observed
value of the statistic depends on the particular sample that is chosen.
• The value of the statistic will therefore vary from one sample to another.

1.2 Importance of sampling theory

When undertaking any survey, it is essential that you obtain data from people who are as
representative as possible of the group that you are studying. Even with the perfect
questionnaire (if such a thing exists), your survey data will only be regarded as useful if
your respondents are considered typical of the population as a whole. For this
reason, an awareness of the principles of sampling is essential to the implementation of
most methods of research, both quantitative and qualitative.

• Population: the group of people, items or units under investigation
• Census: obtained by collecting information about each member of a population
• Sample: obtained by collecting information about only some members of a population
• Sampling frame: the list of people (or units) from which the sample is taken. It should be
comprehensive, complete and up-to-date. Examples of sampling frames: electoral register,
postcode address file, telephone book

1.3 Probability (Random) and Non-Probability (non-random) Sampling

A probability sample is one in which each member of the population has a known, non-zero
chance of being selected; in simple random sampling that chance is the same for every
member. A random sample is usually a representative sample. There are two
methods of ensuring randomness: the lottery method and the use of random numbers.

In the lottery method, each unit of the population is numbered and written on a chit of
paper or a disc. The chits are then folded and put in a box, from which a sample of
predetermined size is drawn.

In the random number method, a table of random numbers is used. The units of the
population are numbered from 1 to N, and n of them are selected.
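
The random number method can be sketched in a few lines of Python. This is only an illustration: the function name, the population size of 50, the sample size of 5 and the seed are assumptions made for the example, and a computer's pseudo-random generator stands in for the printed table of random numbers.

```python
import random


def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Select sample_size distinct unit numbers from 1..population_size."""
    # A seed is optional; fixing it only makes the draw repeatable.
    rng = random.Random(seed)
    # rng.sample draws without replacement, so no unit is chosen twice.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# Hypothetical example: choose n = 5 units from a population numbered 1..50.
selected_units = draw_simple_random_sample(population_size=50, sample_size=5, seed=1)
print(selected_units)
```

Drawing without replacement mirrors the lottery method, where a chit that has been drawn is not returned to the box.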

In a non-probability sample, some people have a greater, but unknown, chance than others
of selection.

1.3.1 Probability Samples

There are five main types of probability sample. The choice among them depends on the
nature of the research problem, the availability of a good sampling frame, money, time, the
desired level of accuracy in the sample and the data collection methods. Each has its
advantages and its disadvantages. They are:

• Simple random
• Systematic
• Random route
• Stratified
• Multi-stage cluster sampling

1. Simple random sample

This is perhaps an unfortunate term, because it isn't that simple and it isn't done at
random, in the sense of "haphazardly".

Characteristics:

• Each person has the same chance as any other of being selected
• Standard against which other methods are sometimes evaluated
• Suitable where the population is relatively small and where the sampling frame is
complete and up-to-date

Procedure:

1. Obtain a complete sampling frame
2. Give each case a unique number, starting at one
3. Decide on the required sample size
4. Select that many numbers from a table of random numbers or using a computer

For example, the possible samples of size two from a population consisting of B, C, D and E
are BC, BD, BE, CD, CE and DE.

Note that B appears in three of the six samples, so the probability of B being selected is
p(B) = 3/6 = ½. Similarly, p(C) = p(D) = p(E) = ½, so (1) each element of the population
has the same chance of being chosen. Moreover, (2) each of the possible samples of size
two has the same chance of being selected: p(BC) = p(BD) = p(BE) = p(CD) = p(CE) = p(DE) = 1/6.
Consequently, we can say that both conditions for a simple random sample are satisfied.
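
The counting argument for the population B, C, D, E can be checked by listing every sample of size two. The sketch below is illustrative only; the variable names are not from the module, and it assumes every sample is equally likely, as in simple random sampling.

```python
from fractions import Fraction
from itertools import combinations

population = ["B", "C", "D", "E"]

# All unordered samples of size two that can be drawn from the population.
samples = list(combinations(population, 2))

# Under simple random sampling each possible sample is equally likely.
prob_each_sample = Fraction(1, len(samples))

# Probability that a unit is selected = share of samples that contain it.
inclusion_prob = {
    unit: Fraction(sum(unit in s for s in samples), len(samples))
    for unit in population
}

print(["".join(s) for s in samples])  # ['BC', 'BD', 'BE', 'CD', 'CE', 'DE']
print(prob_each_sample)               # 1/6
print(inclusion_prob["B"])            # 1/2
```

Six samples are listed, each unit appears in three of them, and each sample has probability 1/6, which matches the two conditions stated above.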

2. Systematic sampling

Similar to simple random sampling, but instead of selecting random numbers from tables,
you move through the list (sampling frame) picking every kth name, where k = N/n.

You must first work out the sampling interval k by dividing the population size by the
required sample size. E.g. for a population of 500 and a sample of 100, k = 500/100 = 5 and
the sampling fraction is 1/5, i.e. you will select one person out of every five in the
population. A random number needs to be used only to decide on the starting point. With a
sampling fraction of 1/5, the starting point must be within the first 5 people in your list.
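
As a rough sketch of this procedure, the code below picks a random starting point among the first k units and then takes every kth unit. The function name, the numbered frame and the seed are assumptions for illustration, and the sketch assumes the population size is an exact multiple of the sample size.

```python
import random


def systematic_sample(frame, sample_size, seed=None):
    """Take every k-th unit from the frame after a random start, with k = N // n."""
    k = len(frame) // sample_size             # sampling interval
    start = random.Random(seed).randrange(k)  # random start within the first k units (0-based)
    return frame[start::k][:sample_size]


# Hypothetical frame: 500 people numbered 1..500, sample of 100, so k = 5.
frame = list(range(1, 501))
sample = systematic_sample(frame, sample_size=100, seed=7)
print(len(sample), sample[:3])
```

Only one random number is used, the starting point; every later selection is fixed by the interval.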

Disadvantage: Effect of periodicity (bias caused by particular characteristics arising in the
sampling frame at regular intervals). An example of this would occur if you used a sampling
frame of adult residents in an area composed predominantly of couples or young families.
If this list were arranged Husband / Wife / Husband / Wife etc. and every tenth person
were interviewed, everyone selected would be of the same sex (all husbands or all wives,
depending on the starting point).
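
The periodicity problem can be seen in a small sketch. It assumes a hypothetical list of 100 adults in strictly alternating Husband / Wife order and a sampling interval of 10; because the interval is even, every selected position falls on the same place in the pattern, so a given starting point yields only one sex.

```python
# Hypothetical frame of 100 adults listed in strictly alternating order.
frame = ["Husband", "Wife"] * 50
k = 10  # interview every tenth person

# For each possible starting point, record which sexes end up in the sample.
sexes_by_start = {start: set(frame[start::k]) for start in range(k)}

for start, sexes in sexes_by_start.items():
    print(f"start at position {start + 1}: sample contains only {sorted(sexes)}")
```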

3. Random Route Sampling

Used in market research surveys, mainly for sampling households, shops, garages and
other premises in urban areas.

An address is selected at random from the sampling frame (usually the electoral register) as
a starting point. The interviewer is then given instructions to identify further addresses by
taking alternate left- and right-hand turns at road junctions and calling at every nth address
(shop, garage etc.).

Advantages:

• May save time
• Bias may be reduced because the interviewer has to call at clearly defined
addresses and is not able to choose

Problems:

• Characteristics of particular areas (e.g. poor / rich) may mean that the sample is
not representative
• Open to abuse by the interviewer because it is difficult to check that instructions
have been fully carried out

4. Stratified Sampling
Dividing a population into non-overlapping groups is called stratification. A stratified
random sample is one where the population is divided into non-overlapping subgroups,
or strata, and then a simple random sample is selected within each stratum.
A population can be stratified if its members have readily identifiable
characteristics that can be used to separate them into subgroups.

For example, we can stratify a human population on the basis of age, sex, occupation,
education, religion, region, etc. Notice that stratification does not mean absence of
randomness. It means only that the population is first divided into strata and then a simple
random sample is selected from each stratum. The advantages of using
stratified random sampling are:
• It reflects the characteristics of the population more accurately than simple
random sampling or systematic sampling.
• It is more cost effective than simple random sampling.
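
A minimal sketch of stratified random sampling is given below. The text does not say how many units to take from each stratum, so proportional allocation is used here as an assumption; the function name, the two strata and their sizes are also hypothetical.

```python
import random


def stratified_sample(strata, total_sample_size, seed=None):
    """Draw a simple random sample within each stratum (proportional allocation)."""
    rng = random.Random(seed)
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Assumed rule: each stratum's share of the sample equals its share of the population.
        stratum_n = round(total_sample_size * len(units) / population_size)
        sample[name] = rng.sample(units, stratum_n)
    return sample


# Hypothetical population of 1,000 people split into two strata by sex.
strata = {
    "female": [f"F{i}" for i in range(600)],
    "male": [f"M{i}" for i in range(400)],
}
sample = stratified_sample(strata, total_sample_size=50, seed=3)
print({name: len(units) for name, units in sample.items()})  # {'female': 30, 'male': 20}
```

With other stratum sizes the rounded allocations may not add up exactly to the requested total, so the sizes may need a small manual adjustment.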
5. Multi-stage cluster sampling

As the name implies, this involves drawing several different samples. It does so in such a
way that cost of final interviewing is minimized.

Basic procedure: First draw a sample of areas. Initially large areas are selected, then
progressively smaller areas within each larger area are sampled. Eventually you end up with
a sample of households and use a method of selecting individuals from these selected households.
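
The basic procedure can be illustrated with a two-stage sketch: areas are sampled first, then households within each selected area. The number of areas, the households per area, the sample sizes at each stage and the seed are all assumptions chosen for the example; a real design may use more stages and unequal selection probabilities.

```python
import random

rng = random.Random(11)

# Hypothetical frame: 20 areas, each containing 30 households.
areas = {
    f"area_{a}": [f"area_{a}_household_{h}" for h in range(30)]
    for a in range(20)
}

# Stage 1: draw a simple random sample of 4 areas.
selected_areas = rng.sample(sorted(areas), 4)

# Stage 2: within each selected area, draw a simple random sample of 5 households.
selected_households = {
    area: rng.sample(areas[area], 5) for area in selected_areas
}

households = [hh for area in selected_areas for hh in selected_households[area]]
print(len(households))  # 20
```

Interviewers only need to travel to the four selected areas, which is how the method keeps the cost of final interviewing down.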

1.3.2 Non-Probability Samples

It isn't always possible to use a probability method of sampling such as random
sampling. For example, no complete sampling frame is available for certain groups
of the population, e.g. the elderly, people who are attending a football match, or people who
shop in a particular part of town. Another factor to bear in mind is that many of the
probability sampling methods described above may require researchers to
conduct a postal or telephone survey, or to go from house to
house. We will discuss some of the problems of low response rate later on in this
workbook, but you might find that a probability sample with a poor response rate doesn't
in the end give you a particularly good representation of the population being examined.

Advantages of non-probability methods:

• Cheaper
• Can be used when a sampling frame is not available
• Useful when the population is so widely dispersed that cluster sampling would
not be efficient
• Often used in exploratory studies, e.g. for hypothesis generation
• Some research is not interested in working out what proportion of the population
gives a particular response, but rather in obtaining an idea of the range of responses or
ideas that people have

1. Purposive Sampling

A purposive sample is one that is selected subjectively by the researcher. The researcher
attempts to obtain a sample that appears to him/her to be representative of the population
and will usually try to ensure that a range from one extreme to the other is included.

Often used in political polling: districts are chosen because their pattern has in the past
provided a good idea of outcomes for the whole electorate.

2. Quota Sampling

Quota sampling involves setting certain quotas, which are to be filled by the
interviewers.

Quota sampling is often used in market research. Interviewers are required to find cases
with particular characteristics. They are given quotas of particular types of people to
interview, and the quotas are organized so that the final sample should be representative of
the population.

Stages:

• Decide on the characteristic of which the sample is to be representative, e.g. age
• Find out the distribution of this variable in the population and set the quota accordingly.
E.g. if 20% of the population is between 20 and 30 and the sample is to be 1,000, then 200 of
the sample (20%) will be in this age group
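
The quota arithmetic in the second stage is a simple proportion. The sketch below reproduces the 20% example from the text; the function name and the single "other ages" group covering the remaining 80% are assumptions made only so that the shares add up to 100%.

```python
def quota_sizes(population_shares, sample_size):
    """Turn population shares into the number of interviews required per group."""
    return {
        group: round(share * sample_size)
        for group, share in population_shares.items()
    }


# 20% of the population is aged 20 to 30 (from the text); the rest is lumped together.
shares = {"aged 20-30": 0.20, "other ages": 0.80}
print(quota_sizes(shares, sample_size=1000))  # {'aged 20-30': 200, 'other ages': 800}
```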

Complex quotas can be developed so that several characteristics (e.g. age, sex, marital
status) are used simultaneously. By the end of the day, the researcher may be looking for a
widowed man in his nineties who looks as though he might buy a particular brand of
detergent.

Disadvantage of quota sampling: interviewers choose whom they like (within the above
criteria) and may therefore select those who are easiest to interview, so bias can result.
Also, it is impossible to estimate accuracy (because the sample is not random).

3. Convenience sampling

A convenience sample is used when you simply stop anybody in the street who is prepared
to stop, or when you wander round a business, a shop, a restaurant, a theatre or whatever,
asking people you meet whether they will answer your questions. In other words, the
sample comprises subjects who are simply available in a convenient way to the researcher.
There is no randomness and the likelihood of bias is high. You cannot safely generalize
the results you obtain to the wider population.

However, this method is often the only feasible one, particularly for students or others
with restricted time and resources, and can legitimately be used provided its limitations
are clearly understood and stated.

Because it is an extremely haphazard approach, students are often tempted to use the
word "random" when describing a sample for which they have stopped people in the
street, as they see it, "at random". You should avoid using the word "random" when
describing anything to do with sampling unless you are absolutely certain that you
selected respondents from a sampling frame using truly random methods.

4. Snowball sampling

With this approach, you initially contact a few potential respondents and then ask them
whether they know of anybody with the same characteristics that you are looking for in
your research. For example, if you wanted to interview a sample of vegetarians / cyclists /
people with a particular disability / people who support a particular political party etc.,
your initial contacts may well have knowledge (through e.g. support group) of others.

5. Self-selection

Self-selection is perhaps self-explanatory. Respondents themselves decide that they would
like to take part in your survey.

1.4 Bias and Error in Sampling

A sample is expected to mirror the population from which it comes; however, there is no
guarantee that any sample will be precisely representative of that population.
Chance may dictate that a disproportionate number of untypical observations
is made. When testing fuses, for example, the sample may contain a higher or lower
proportion of faulty fuses than the population does. In practice, it is rarely
known when a sample is unrepresentative and should be discarded.

Sampling error

What can make a sample unrepresentative of its population? One of the most frequent
causes is sampling error.

Sampling error comprises the differences between the sample and the population that are
due solely to the particular units that happen to have been selected.

For example, suppose that a sample of 100 Arbaminch women are measured and all are
found to be taller than six feet. It is clear, even without any statistical proof, that this
would be a highly unrepresentative sample leading to invalid conclusions. This is a very
unlikely occurrence because such rare cases are naturally widely distributed among the
population. But it can occur. Luckily, this is a very obvious error and can be detected very
easily.

The more dangerous error is the less obvious sampling error, against which nature offers
very little protection. An example would be a sample in which the average height is
overstated by only an inch or two, rather than by a foot, which would be obvious. It is the
less obvious error that is of most concern.

There are two basic causes for sampling error. One is chance: That is the error that occurs
just because of bad luck. This may result in untypical choices. Unusual units in a population
do exist and there is always a possibility that an abnormally large number of them will be
chosen. The main protection against this kind of error is to use a large enough sample. The
second cause of sampling error is sampling bias.

Sampling bias is a tendency to favour the selection of units that have particular
characteristics. Sampling bias is usually the result of a poor sampling plan. The most
notable is the bias of non-response, when for some reason some units have no chance of
appearing in the sample. For example, take a hypothetical case where a survey was
conducted by a Graduate School to find out the level of stress that graduate
students were experiencing. A mail questionnaire was sent to 100 randomly selected
graduate students. Only 52 responded, and the results indicated that students were not under
stress at that time, when in fact it was the period of highest stress for all
students except those who were writing their theses at their own pace. Apparently, this was
the group that had the time to respond. The researcher who was conducting the study
went back to the questionnaires to find out what the problem was and found that all those
who had responded were third- and fourth-year PhD students. Bias can be very costly and has
to be guarded against as much as possible. A means of selecting the units of analysis must be
designed to avoid the more obvious forms of bias. Another example would be where you
would like to know the average income of some community and you decide to use
telephone numbers to select a sample in a locality where only the
rich and middle-class households have telephone lines. You will end up with an overstated
average income, which will lead to the wrong policy decisions.
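
The telephone example can be made concrete with a small simulation. All figures below are invented for illustration (the share of households with a telephone, the income levels and the sample size are assumptions, not data from any survey); the point is only that a sample drawn from the telephone list overstates the community average.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical community of 10,000 households: about 30% have a telephone,
# and those households tend to have much higher incomes (invented figures).
households = []
for _ in range(10_000):
    has_phone = rng.random() < 0.3
    income = rng.gauss(9000, 1500) if has_phone else rng.gauss(3000, 800)
    households.append((has_phone, income))

true_mean = mean(income for _, income in households)

# Biased frame: only households with a telephone can be selected.
phone_frame = [income for has_phone, income in households if has_phone]
biased_estimate = mean(rng.sample(phone_frame, 200))

# Unbiased comparison: a simple random sample of the same size from everyone.
srs_estimate = mean(rng.sample([income for _, income in households], 200))

print(round(true_mean), round(biased_estimate), round(srs_estimate))
```

Increasing the sample size would not remove the gap between the telephone-based estimate and the true mean, because the error comes from the frame rather than from chance.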

Non-sampling error (measurement error)

The other main cause of unrepresentative samples is non-sampling error. This type of
error can occur whether a census or a sample is being used. Like sampling error,
non-sampling error may either be produced by participants in the statistical study or be an
innocent by-product of the sampling plans and procedures.

A non-sampling error is an error that results solely from the manner in which the
observations are made.

The simplest example of non-sampling error is inaccurate measurement due to
malfunctioning instruments or poor procedures. For example, consider the observation of
human weights. If persons are asked to state their own weights, no two
answers will be of equal reliability. The people will have weighed themselves on different
scales in various states of poor calibration. An individual's weight fluctuates by several
pounds, so the time of weighing will affect the answer. The scale reading will also vary
with the person's state of undress. Responses therefore will not be of comparable
validity unless all persons are weighed under the same circumstances.

Biased observations due to inaccurate measurement can be innocent but very devastating.
A story is told of a French astronomer who once proposed a new theory based on
spectroscopic measurements of light emitted by a particular star. When his colleagues
discovered that the measuring instrument had been contaminated by cigarette smoke,
they rejected his findings.

In surveys of personal characteristics, unintended errors may result from:
• The manner in which the response is elicited
• The social desirability of the persons surveyed
• The purpose of the study
• The personal biases of the interviewer or survey writer

The interviewer’s effect

No two interviewers are alike, and the same person may provide different answers to
different interviewers. The manner in which a question is formulated can also result in
inaccurate responses. Individuals tend to provide false answers to particular questions.
For example, some people want to appear younger or older for reasons known only to them.
If you ask such a person their age in years, it is easy for them to lie by
overstating or understating their age by one or more years. If you ask instead in which year
they were born, a false answer requires a bit of quick arithmetic, so the date of birth
is likely to be more accurate.

The respondent effect

Respondents might also give incorrect answers to impress the interviewer. This type of
error is the most difficult to prevent because it results from outright deceit on the part of
the respondent. It is important to acknowledge that certain psychological factors induce
incorrect responses, and great care must be taken to design a study that minimizes their
effect.

Knowing the study purpose

Knowing why a study is being conducted may create incorrect responses. A classic
example is the question: What is your income? If a government agency is asking, a different
figure may be provided than the respondent would give on an application for a home
mortgage. One way to guard against such bias is to camouflage the study’s goals; another
remedy is to make the questions very specific, allowing no room for personal
interpretation. For example, "Where are you employed?" could be followed by "What is
your salary?" and "Do you have any extra jobs?" A sequence of such questions may
produce more accurate information.

Selecting the Sample

The preceding section has covered the most common problems associated with statistical
studies. The desirability of a sampling procedure depends on both its vulnerability to error
and its cost. However, economy and reliability are competing ends, because reducing
error often requires an increased expenditure of resources. Of the two types of statistical
error, only sampling error can be controlled by exercising care in determining the method
for choosing the sample. The previous section has shown that sampling error may be due
to either bias or chance. The chance component (sometimes called random error) exists no
matter how carefully the selection procedures are implemented, and the only way to
minimize chance sampling error is to select a sufficiently large sample (sample size is
discussed towards the end of this tutorial). Sampling bias, on the other hand, may be
minimized by the wise choice of a sampling procedure.
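
The claim that chance error shrinks as the sample grows can be illustrated by simulation. The population below is invented (20,000 values scattered around 50), and the sample sizes and number of repetitions are arbitrary choices; the sketch simply measures how much the sample mean varies from one random sample to the next.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)

# Hypothetical population of 20,000 measurements (invented for illustration).
population = [rng.gauss(50, 10) for _ in range(20_000)]


def spread_of_sample_means(n, repetitions=500):
    """Standard deviation of the sample mean over repeated random samples of size n."""
    means = [mean(rng.sample(population, n)) for _ in range(repetitions)]
    return pstdev(means)


for n in (10, 100, 1000):
    print(n, round(spread_of_sample_means(n), 2))
```

The printed spread falls as n rises, which is the sense in which a sufficiently large sample protects against chance error; it offers no protection against bias.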

1.5. SAMPLING DISTRIBUTIONS

It is often impossible to measure the mean or standard deviation of an entire population
unless the population is small or we conduct a nationwide census. The population mean and
standard deviation are examples of population parameters: descriptive measurements of
the entire population. Given the impracticality of measuring population parameters, we
instead measure sample statistics: descriptive measurements of a sample. Examples of
sample statistics are the sample mean, sample median, and sample standard deviation.

So why not use the sample statistic as an estimate of the corresponding population
parameter, for instance the sample mean as an estimate of the population mean? The
question is how confident we can be in the sample statistic.

For example: If we cast a fair die and take X to be the uppermost number, we know
that the population mean (expected value) is $\mu = 3.5$, and that the population median
is also m = 3.5. But if we take a sample of, say, four throws, the mean may be far
from 3.5. Here are the results of 5 such samples of 4 throws (we used a random
number generator to obtain these samples):

             X1   X2   X3   X4   Sample mean
Sample 1      6    2    5    6      4.75
Sample 2      2    3    1    6      3.00
Sample 3      1    1    4    6      3.00
Sample 4      6    2    2    1      2.75
Sample 5      1    5    1    3      2.50

Since each sample consists of 4 throws, we say that the sample size is n = 4. Notice that
none of the five samples gave us the population mean of 3.5, and that the mean of the first
sample is far from it. The average (mean) of these five sample means is 3.2. Thus, although
the mean of a particular sample may not be a good predictor of the population mean, we get
better results if we average many sample means. A sampling distribution is the
probability distribution of the possible values of a sample statistic, such as the sample
mean or the sample proportion.
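
The arithmetic in the die example can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the five samples are typed in exactly as listed in the table.

```python
from statistics import mean

# The five samples of four die throws, as listed in the table.
samples = [
    [6, 2, 5, 6],
    [2, 3, 1, 6],
    [1, 1, 4, 6],
    [6, 2, 2, 1],
    [1, 5, 1, 3],
]

# Population mean of a fair die: (1 + 2 + ... + 6) / 6.
population_mean = mean(range(1, 7))

# One mean per sample, then the mean of those means.
sample_means = [mean(s) for s in samples]
mean_of_means = mean(sample_means)

print("population mean:     ", population_mean)
print("sample means:        ", sample_means)
print("mean of sample means:", mean_of_means)
```

With only five samples the mean of the sample means need not equal 3.5; it moves closer as more samples are averaged.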

1.5.1 SAMPLING DISTRIBUTION OF THE MEAN ($\bar{X}$)

It is the probability distribution of all possible values of the sample mean ($\bar{x}_i$'s). It
arises because different samples drawn from the same population give different values of
the mean.

Example: A population consists of the following ages:

10, 20, 30, 40 and 50

A random sample of three is to be selected from this population and its mean computed.
Develop the sampling distribution of the mean.

N = 5 (10, 20, 30, 40, 50) and n = 3

Sampling is to be done without replacement. To find how many different samples of size
three can be taken from a finite population of size five, we use the combinations formula
$_{N}C_{n}$, i.e. the number of possible samples of size n that can be drawn from a population
of size N when order is not important:

$_{5}C_{3} = \frac{5!}{3!\,2!} = 10$

Sample no.     Sampled items     Sample mean
    1          10, 20, 30          20.00
    2          10, 20, 40          23.33
    3          10, 20, 50          26.67
    4          10, 30, 40          26.67
    5          10, 30, 50          30.00
    6          10, 40, 50          33.33
    7          20, 30, 40          30.00
    8          20, 30, 50          33.33
    9          20, 40, 50          36.67
   10          30, 40, 50          40.00

Frequency distribution of sample means

Sample mean ($\bar{X}$)     Frequency (f)
      20.00                   1
      23.33                   1
      26.67                   2
      30.00                   2
      33.33                   2
      36.67                   1
      40.00                   1
      Total                  10

Sampling distribution of the sample mean ($\bar{X}$)

Sample mean ($\bar{X}$)     Probability of $\bar{X}$
      20.00                  0.1
      23.33                  0.1
      26.67                  0.2
      30.00                  0.2
      33.33                  0.2
      36.67                  0.1
      40.00                  0.1
      Total                  1.0
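
The enumeration above can be reproduced with a short Python sketch. It is an illustration rather than part of the module: the variable names are our own, and exact fractions are used so that the probabilities come out without rounding. The sketch also computes the standard deviation of the ten sample means and compares it with the standard error of the mean including the finite population multiplier, $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import sqrt

population = [10, 20, 30, 40, 50]
N = len(population)
n = 3

# Every sample of size n drawn without replacement (order ignored).
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]

# Sampling distribution: each distinct sample mean with its probability.
counts = Counter(sample_means)
distribution = {m: Fraction(c, len(sample_means)) for m, c in sorted(counts.items())}
for m, p in distribution.items():
    print(f"{float(m):6.2f}  {float(p):.1f}")

# Mean and standard deviation of the sampling distribution.
mean_of_means = sum(m * p for m, p in distribution.items())
std_error = sqrt(sum(p * (m - mean_of_means) ** 2 for m, p in distribution.items()))

# Population mean and standard deviation (divisor N), then the formula
# sigma / sqrt(n) * sqrt((N - n) / (N - 1)).
mu = Fraction(sum(population), N)
sigma = sqrt(sum((x - mu) ** 2 for x in population) / N)
std_error_formula = sigma / sqrt(n) * sqrt((N - n) / (N - 1))

print("mean of sample means:", float(mean_of_means), " population mean:", float(mu))
print("standard error (enumerated):", round(std_error, 4))
print("standard error (formula):   ", round(std_error_formula, 4))
```

Because n/N = 0.6 here, the finite population multiplier is needed for the formula to agree with the enumerated value.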

The sampling distribution of the mean is described by two parameters: the mean of the
sample means and the standard deviation of the sample means, which is termed the standard
error of the mean ($\sigma_{\bar{x}}$). The mean of the sample means ($\bar{\bar{x}}$ or
$\mu_{\bar{x}}$) is always equal to the population mean ($\mu$):

$\mu_{\bar{x}} = \bar{\bar{x}} = \mu$

The standard error of the mean is equal to the population standard deviation divided by the
square root of the sample size:

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

This holds when the population is large relative to the sample (n < 0.05N). If the
population is finite and n >= 0.05N, we apply a finite population correction factor (finite
population multiplier):

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$

If, in addition, n is large (n >= 30), the sampling distribution of the mean can be
approximated by the normal distribution.


Central Limit Theorem and Sampling Distribution of the mean

This theorem states that:

1. If the population is normally distributed, the distribution of the sample means is
normal regardless of the sample size.

2. If the population is not normal, the distribution of the sample means will be
approximately normal if the sample size n is sufficiently large. The central limit theorem
thus relates the shape of the parent population to the shape of the sampling distribution of
the mean.

Population distribution          Sampling distribution of the mean
Normal                           Normal (any n)
Not normal, n >= 30              Approximately normal

The significance of the central limit theorem is that it permits us to use sample
statistics to make inferences about population parameters without knowing anything
about the shape of the frequency distribution of that population other than what we can get
from the sample.

Example: The distribution of annual earnings of all bank tellers with 5 years' experience is
negatively skewed. This distribution has a mean of Birr 15,000 and a standard deviation of
Birr 2,000. If we draw a random sample of 30 tellers, what is the probability that their
earnings will average more than Birr 15,750 annually?

Steps
1. Calculate $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$

$\mu_{\bar{x}} = \mu = 15{,}000$ Birr

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2000}{\sqrt{30}} = 365.15$ Birr

2. Calculate Z

Z = (random variable - mean of the random variable) / (standard deviation of the random variable)

$Z = \frac{\bar{x}_i - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$

$Z_{15750} = \frac{15750 - 15000}{365.15} = +2.05$

3. Calculate the area covered by the interval

$P(\bar{x} > 15750) = P(Z > 2.05)$
= 0.5 - P(0 to +2.05)
= 0.5 - 0.47982
= 0.02018

4. Interpret the results
There is a 2.02% chance that the average earnings of a group of 30 tellers will be more
than Birr 15,750 annually.
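
The four steps of this example can be sketched in Python with the standard library's `statistics.NormalDist`. This is a minimal check under the example's own assumptions (mean 15,000, standard deviation 2,000, n = 30); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu = 15_000       # population mean (Birr)
sigma = 2_000     # population standard deviation (Birr)
n = 30            # sample size
threshold = 15_750

# Step 1: mean and standard error of the sampling distribution of the mean.
std_error = sigma / sqrt(n)

# Step 2: standardize the threshold.
z = (threshold - mu) / std_error

# Step 3: upper-tail area under the standard normal curve.
prob = 1 - NormalDist().cdf(z)

print(f"standard error = {std_error:.2f}")
print(f"z = {z:.2f}")
print(f"P(sample mean > {threshold}) = {prob:.4f}")
```

The result differs slightly from the table-based 0.02018 because the code does not round Z to two decimals before looking up the area.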

Activity: A production company’s 350 hourly employees average 37.6 years of age with a
standard deviation of 8.3 years. If a random sample of 45 hourly employees is taken, what
is the probability that the sample will have an average age of less than 40 years?

1.5.2 SAMPLING DISTRIBUTION OF THE PROPORTION ($\bar{p}$)
It is the probability distribution of the sample proportion ($\bar{p}$):

$\bar{p} = \frac{x}{n}$

x = number of items in the sample that carry the specific characteristic
n = total number of items (sample size)

The sampling distribution of the proportion has two parameters:
• Mean of the sample proportion ($\mu_{\bar{p}}$):
  $\mu_{\bar{p}} = p$ (population proportion)
• The standard error of the proportion:
  $\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}}$

Where,
$\sigma_{\bar{p}}$ = standard error of the proportion
p = population proportion
q = 1 - p
n = sample size

If n >= 0.05N, $\sigma_{\bar{p}}$ is calculated as:

$\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}}$

where $\sqrt{\frac{N - n}{N - 1}}$ is the finite population multiplier.

Central limit theorem and Sampling distribution of the Proportion ($\bar{p}$)

The central limit theorem states that
1. The sampling distribution of the proportion is approximately normal if np >= 5 and
nq >= 5, i.e. n is large.
2. The sampling distribution of the proportion is normally distributed regardless of the
sample size if the population is normally distributed.
Example: Suppose that 60% of the electrical contractors in a region use a particular brand
of wire. What is the probability of taking a random sample of size 120 from these electrical
contractors and finding that 50% or less use that brand of wire?

Step 1. Check that np >= 5 and nq >= 5
np = 120 x 0.6 = 72
nq = 120 x 0.4 = 48

2. Calculate $\mu_{\bar{p}}$ and $\sigma_{\bar{p}}$

$\mu_{\bar{p}} = p = 0.6$

$\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.6 \times 0.4}{120}} = 0.0447$

3. Calculate Z

$Z = \frac{\bar{p} - p}{\sigma_{\bar{p}}}$

$Z_{0.5} = \frac{0.5 - 0.6}{0.0447} = -2.24$

4. Calculate the area covered by the interval

$P(\bar{p} \le 0.5) = P(Z \le -2.24)$
= 0.5 - P(0 to -2.24)
= 0.5 - 0.48745
= 0.01255

5. Interpret the results

If we take a random sample of 120 contractors, the probability of finding that 50% or less
of them use this particular brand is 1.255%.
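
The same steps can be sketched in Python as a check on the table lookup. The sketch assumes the question asks for a sample proportion of 0.50 or less, as in the solution; the names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p = 0.60          # population proportion using the brand
q = 1 - p
n = 120           # sample size
p_bar = 0.50      # sample proportion of interest

# Step 1: the normal approximation requires np >= 5 and nq >= 5.
assert n * p >= 5 and n * q >= 5

# Step 2: standard error of the proportion.
std_error = sqrt(p * q / n)

# Steps 3 and 4: standardize and take the lower-tail area.
z = (p_bar - p) / std_error
prob = NormalDist().cdf(z)

print(f"standard error = {std_error:.4f}")
print(f"z = {z:.2f}")
print(f"P(sample proportion <= {p_bar}) = {prob:.4f}")
```

The code keeps Z unrounded, so the probability is marginally different from the 0.01255 obtained from the table at Z = -2.24.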
Activity:
1. If 10% of a population of parts is defective what is the probability of
randomly Selecting 80 parts and finding that 12 or more defective?
2. If a population proportion is 0.28 and if the sample size is 140, 30% of the
time the sample proportion will be less than what value if you are taking random
samples?

1.5.3 SAMPLING DISTRIBUTION OF THE DIFFERENCE BETWEEN
TWO SAMPLE MEANS ($\bar{X}_1 - \bar{X}_2$)
This distribution is concerned with the difference between sample means drawn
from two populations. That is, it is used to determine whether the mean of one
population is equal to the mean of another.
The sampling distribution of $\bar{X}_1 - \bar{X}_2$ has two parameters:
1. Mean of the difference between two sample means:

$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$

2. Standard error of the difference between two sample means:

$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

This holds true if:
i. Sampling is done with replacement
ii. The population is large or infinite
iii. n < 0.05N

But if n >= 0.05N and sampling is done without replacement:

$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1}\left(\frac{N_1 - n_1}{N_1 - 1}\right) + \frac{\sigma_2^2}{n_2}\left(\frac{N_2 - n_2}{N_2 - 1}\right)}$

Central limit theorem and Sampling distribution of $\bar{X}_1 - \bar{X}_2$

The central limit theorem states that:
1. If $n_1$ and $n_2$ are greater than or equal to 30, the distribution of the
difference between two sample means ($\bar{X}_1 - \bar{X}_2$) will be
approximately normal no matter how the original populations are distributed.
2. If the original populations are normally distributed, then the distribution of
$\bar{X}_1 - \bar{X}_2$ is exactly normal for any values of $n_1$ and $n_2$.

We standardize the ($\bar{X}_1 - \bar{X}_2$) values using the formula:

$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}$

Where: $\bar{X}_1 - \bar{X}_2$ is the random variable,
$\mu_1 - \mu_2$ is the mean of the random variable, and
$\sigma_{\bar{X}_1 - \bar{X}_2}$ is the standard deviation of the random variable

Example: Two population measurements are normally distributed with $\mu_1 = 57$ and
$\mu_2 = 25$. The two population standard deviations are $\sigma_1 = 12$ and $\sigma_2 = 6$. Two
independent random samples of $n_1 = n_2 = 36$ are taken from the populations.

a. What is the expected value of the difference in sample means ($\bar{X}_1 - \bar{X}_2$)?

$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2 = 57 - 25 = 32$

b. What is the standard deviation of ($\bar{X}_1 - \bar{X}_2$)?

$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{12^2}{36} + \frac{6^2}{36}} = \sqrt{5} \approx 2.24$
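
As a quick check of parts (a) and (b), the two parameters can be computed directly. This is an illustrative sketch with our own variable names, assuming the large-population formula (no finite population multiplier).

```python
from math import sqrt

mu1, mu2 = 57, 25        # population means
sigma1, sigma2 = 12, 6   # population standard deviations
n1 = n2 = 36             # sample sizes

# (a) Expected value of the difference between the two sample means.
expected_diff = mu1 - mu2

# (b) Standard error of the difference between the two sample means.
std_error_diff = sqrt(sigma1**2 / n1 + sigma2**2 / n2)

print("expected difference:", expected_diff)
print("standard error of the difference:", round(std_error_diff, 3))
```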

Activity: A soft drink factory produces two soft drinks, Apple and Sheweps. The daily
production of Apple averages 15000 bottles and is normally distributed with a standard
deviation of 2000 bottles. Sheweps daily production is also normally distributed with the
mean of 12500 and standard deviation of 2500 bottles. A sample of five randomly selected
daily production figures is taken from each of the plants. What is the probability that the
sample mean production for Apple will be less than or equal to the sample mean
production for sheweps?
Hint: 1. Calculate the expected value and the standard error
2. Calculate Z
3. Calculate the area covered by the interval
4. Interpret the result

1.5.4 SAMPLING DISTRIBUTION OF THE DIFFERENCE OF TWO
PROPORTIONS ($\bar{p}_1 - \bar{p}_2$)

Suppose two populations of size $N_1$ and $N_2$ are given. For each sample of size $n_1$ from the
first population, compute the sample proportion $\bar{p}_1$ and its standard deviation
$\sigma_{\bar{p}_1}$. Similarly, for each sample of size $n_2$ from the second population, compute
the sample proportion $\bar{p}_2$ and $\sigma_{\bar{p}_2}$.

For all combinations of these samples from the two populations, we can obtain a sampling
distribution of the difference $\bar{p}_1 - \bar{p}_2$ of sample proportions. Such a distribution
is called the sampling distribution of the difference of two proportions. The mean and
standard deviation of this distribution are given by

$\mu_{\bar{p}_1 - \bar{p}_2} = \mu_{\bar{p}_1} - \mu_{\bar{p}_2} = p_1 - p_2$

and

$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\sigma_{\bar{p}_1}^2 + \sigma_{\bar{p}_2}^2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$

If the sample sizes $n_1$ and $n_2$ are large ($n_1 \ge 30$ and $n_2 \ge 30$), then the sampling
distribution of the difference of proportions is closely approximated by a normal distribution.

Example: 10% of the machines produced by company A are defective and 5% of those produced
by company B are defective. A random sample of 250 machines is taken from company A
and a random sample of 300 machines from company B. What is the probability that the
difference in sample proportions is less than or equal to 0.02?

Solution: We are given the following information:

$\mu_{\bar{p}_1 - \bar{p}_2} = \mu_{\bar{p}_1} - \mu_{\bar{p}_2} = p_1 - p_2 = 0.10 - 0.05 = 0.05$; $n_1 = 250$ and $n_2 = 300$

Thus the standard error of the difference in sample proportions is given by

$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = \sqrt{\frac{0.10 \times 0.90}{250} + \frac{0.05 \times 0.95}{300}} = \sqrt{0.00052} = 0.0228$

The desired probability for the difference in sample proportions is given by

$P(\bar{p}_1 - \bar{p}_2 \le 0.02) = P\left(Z \le \frac{(\bar{p}_1 - \bar{p}_2) - (p_1 - p_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}}\right)$

$= P\left(Z \le \frac{0.02 - 0.05}{0.0228}\right)$
$= P(Z \le -1.32)$
$= 0.5000 - 0.4066 = 0.0934$

Hence the desired probability for the difference in sample proportions is 0.0934.
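
The calculation can be sketched in Python as follows. The sketch uses the threshold of 0.02 that appears in the solution steps; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.10, 250   # company A: defect rate and sample size
p2, n2 = 0.05, 300   # company B: defect rate and sample size
threshold = 0.02     # difference in sample proportions of interest

# Mean and standard error of the difference of two sample proportions.
mean_diff = p1 - p2
std_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Standardize the threshold and take the lower-tail area.
z = (threshold - mean_diff) / std_error
prob = NormalDist().cdf(z)

print(f"standard error = {std_error:.4f}")
print(f"z = {z:.2f}")
print(f"P(difference <= {threshold}) = {prob:.4f}")
```

The small gap between this value and 0.0934 comes from rounding Z to -1.32 before using the table.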

SUMMARY

• The sampling distribution of $\bar{X}$ is the probability distribution of all the values of
$\bar{X}$ calculated from all possible samples of the same size selected from a population.
• Sampling error is the difference between the value of a sample statistic
calculated from a random sample and the corresponding population parameter. This type
of error occurs due to chance. The errors that occur during the collection, recording, and
tabulation of data are known as non-sampling errors.
• A method of selecting a sample in which the population is first divided into
strata and a simple random sample is then taken from each stratum is called stratified
sampling. A method of choosing a sample by randomly selecting one of the first k elements
and then selecting every kth element thereafter is systematic sampling. Cluster sampling is
a method of sampling in which the population is first divided into clusters and then one or
more clusters are selected for sampling.
• A non-probabilistic method of sampling whereby elements are selected for
the sample on the basis of convenience is called convenience sampling, whereas judgment
sampling is a non-probabilistic method of sampling whereby elements are selected for the
sample based on the judgment of the person doing the study.
• The central limit theorem is the theorem from which it is inferred that for a large
sample size (n >= 30), the shape of the sampling distribution of $\bar{X}$ is approximately
normal. Also, by the same theorem, the shape of the sampling distribution of $\bar{p}$ is
approximately normal when np >= 5 and nq >= 5.

SELF CHECK EXERCISE 1

1. A diameter of a component produced on a semi-automatic machine is known to


be distributed normally with a mean of 10 mm and standard deviation [Link] a
random sample of size 5 is picked, what is the probability that the sample mean will be
between 9.95 mm and 10.5 mm?
2. The strength of the wire produced by company A has a mean of 4,500 kg and a
standard deviation of 200 kg. Company B has a mean of 4,000 kg and a standard deviation of
[Link] 50 wires of company A and 100wires of company B are selected at random and
tested for strength, what is the probability that the sample mean strength of A will be at
least 600kg more than that of B?
3. Assume that 2% of the items produced in an assembly line operation are
defective, but that the firm’s production manager is not aware of this situation. What is the
probability that in a lot of 400 such items, 3% or more will be defective?
4. A manufacturer of bottles has found that on an average 0.04of the bottles
produced are defective. A random sample of 400 bottles is examined for the proportion of
defective bottles. Find the probability that the proportion of defective bottles in the sample
is between 0.02 and 0.05.

CHAPTER TWO
_____________________________________________________________________
STATISTICAL ESTIMATION AND STATISTICAL INFERENCE

UNIT OUTLINE
• Basic Concepts
• Criteria for Estimators
• Point Estimators of the Mean & Proportion
• Interval Estimators of the Mean & Proportion
• Student's t Distribution
• Determination of Sample Size

Chapter objective
After completing this unit, students will be able to:
• Understand estimation as an inferential process
• Understand point estimate, estimator and estimation
• Distinguish point and interval estimates
• Identify characteristics of good estimators
• Make point and interval estimation

2.1 INTRODUCTION TO STATISTICAL ESTIMATION

Dear learner, we have seen the concept of sampling and sampling distribution in chapter
one of this module. Perhaps you may wonder about the need for sampling. Do you
remember why we need to take samples? Yes, a census is costly and sometimes impossible.
Therefore, we need to take part of the entire population (a sample) and infer the
characteristics of the population from the sample we have drawn. Consider the following
statement: the life span of an electric lamp produced by Sahara is 4,500 hours. In this
chapter we continue our discussion of inferential statistics by examining point estimation
and interval estimation.

Brainstorming question
What is estimation and when do we use estimation?

Statistical inference is the process of using limited information, a sample, for the purpose
of reaching conclusions about a larger set of data, the population. Estimation refers to any
procedure where sample information is used to estimate or predict the numerical value of
some population measure (called a parameter) such as the population mean μ.

An estimator is a procedure or function used in estimating a population parameter.
An estimate is the numerical value determined from the estimator.
A parameter is a characteristic of an entire population; a statistic is a summary
measure that is computed to describe a characteristic for only a sample of the
population.

There are two types of estimators. A point estimator of a population parameter is a
procedure that produces a single value as an estimate. The sample mean is a statistic that
may be used as a point estimate of the population mean. An interval estimator of the
population parameter is a procedure that produces a range of values. The range of values
is useful as a measure of the degree of error that may exist in estimation.

2.1.1. CRITERIA FOR POINT ESTIMATOR


In point estimation we seek the sample statistic that is the best estimator of the population
parameter. Many criteria have been developed to describe what is best for a point
estimator. The more general of these are the criteria of unbiasedness, efficiency, and
consistency.
Unbiasedness

A statistic is an unbiased estimator of a parameter if the expected value of the statistic
equals the parameter, i.e. if
E (statistic) = Parameter
Any statistic chosen as an estimator is a random variable since the value of the statistic
may differ from sample to sample. The expected value of a random variable may be
interpreted as a long-run average. Therefore, the above definition indicates that a statistic is
an unbiased estimator of a parameter if the average value of the statistic is the same as the
parameter value. Thus on average the estimator will be correct.
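
The idea that an unbiased estimator is correct on average can be illustrated by listing every possible sample. The sketch below is our own illustration, not part of the module: it takes all 36 equally likely samples of two throws of a fair die and averages three statistics over them, namely the sample mean, the sample variance with divisor n - 1, and the sample variance with divisor n. The divisor choices are assumptions made for the demonstration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
n = 2  # sample size (two throws)

# Population parameters of a fair die, as exact fractions.
mu = Fraction(sum(faces), 6)
sigma2 = sum((x - mu) ** 2 for x in faces) / 6

# All equally likely samples of size n (sampling with replacement).
samples = list(product(faces, repeat=n))


def sample_mean(s):
    return Fraction(sum(s), len(s))


def sample_variance(s, divisor):
    m = sample_mean(s)
    return sum((x - m) ** 2 for x in s) / divisor


# Expected value of each statistic = average over all possible samples.
expected_mean = sum(sample_mean(s) for s in samples) / len(samples)
expected_var_n_minus_1 = sum(sample_variance(s, n - 1) for s in samples) / len(samples)
expected_var_n = sum(sample_variance(s, n) for s in samples) / len(samples)

print("mu =", mu, " E(sample mean) =", expected_mean)
print("sigma^2 =", sigma2)
print("E(variance, divisor n-1) =", expected_var_n_minus_1)
print("E(variance, divisor n)   =", expected_var_n)
```

In this enumeration the sample mean and the variance with divisor n - 1 average out to the population values, while the variance with divisor n averages to a smaller number, so it is biased.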

Efficiency
Unbiasedness alone does not guarantee a good estimator. In fact, some parameters may
have more than one unbiased estimator. Selection among the unbiased estimators is made
on the basis of comparing the variances of the estimators.
If there exists more than one unbiased estimator of a population parameter, the estimator
with the minimum variance is the more efficient.

Even though the average value of an unbiased estimator equals the parameter, an
estimator may yield estimates that are not particularly close to the parameter value. The
efficiency of an estimator is measured by the variance of the estimator. The minimum
variance unbiased estimator is the unbiased estimator with the smallest variance.

Consistency
Another desirable property is that an estimator should produce estimates that have a high
probability of being close to the true value as the sample size increases. An estimator that
has this property is called a consistent estimator. The variance of a consistent estimator
becomes smaller as larger sample sizes are taken.

Sufficiency
A last property of good estimator is sufficiency. A sufficient estimator is the one that
utilizes all the information a sample contains about the parameter estimated. In choosing
among possible candidates for the best estimator of a parameter, it is possible that no one
estimator has all the desirable properties. One estimator may be unbiased but have a large
variance. A biased estimator may have a smaller variance. A consistent estimator may also
be biased. A biased estimator is not necessarily undesirable unless the amount of bias is
large. Consistency indicates that the amount of bias becomes smaller as the sample size
increases. From the discussion, we can see that point estimators are not selected
haphazardly; rather, they are selected on the basis of well-defined criteria.

2.1.2 POINT ESTIMATOR OF THE MEAN


Assume that we have the following random sample of n = 6 elements from a population
whose parameter is not known.

1 2 4 5 7 11

The sample mean is $\bar{X} = \frac{\sum X}{n} = \frac{30}{6} = 5$

The estimator is $\bar{X}$, and 5 is the point estimate of the unknown population mean.

2.1.3 POINT ESTIMATE OF THE POPULATION PROPORTION


The above array contains two even numbers, 2 and 4. Calling the even numbers successes,
the sample proportion of success is:

$\bar{p} = \frac{2}{6} = \frac{1}{3}$

The statistic $\bar{p}$ is an estimator of the unknown population proportion of success, and
$\frac{1}{3}$ is a point estimate of the population proportion.

2.1.4 POINT ESTIMATE OF THE UNKNOWN POPULATION STANDARD DEVIATION

We will use the symbol $S_x$ to mean an estimate of the unknown population standard
deviation $\sigma_x$. The estimator, called the sample standard deviation, is defined by the formula

$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

Where $\bar{X}$ = sample mean
n = sample size

Dear learner, pay attention to the divisor n - 1 (sample size minus one) in the formula. Earlier,
we used the divisor N when computing a population standard deviation $\sigma_x$.

For the random sample 1, 2, 4, 5, 7, 11 write the symbol for and compute the sample
standard deviation.
Solution

$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

$S_x = \sqrt{\frac{(1-5)^2 + (2-5)^2 + (4-5)^2 + (5-5)^2 + (7-5)^2 + (11-5)^2}{6 - 1}} = \sqrt{\frac{66}{5}} = 3.633$

2.1.5 POINT ESTIMATOR OF STANDARD ERROR OF THE MEAN

The standard error of the mean is computed by the formula
$\sigma_{\bar{X}} = \frac{\sigma_x}{\sqrt{n}}$ when the sample size is less than 5% of the
population size. In our case, the total size of the population is unknown; therefore it is
safer to assume that the sample is less than 5% of the entire population. Hence, we will use
the estimator $\frac{S_x}{\sqrt{n}}$ to estimate the standard error $\sigma_{\bar{X}}$. The
symbol $S_{\bar{X}}$ is called the sample standard error of the mean. The formula for
$S_{\bar{X}}$ is

$S_{\bar{X}} = \frac{S_x}{\sqrt{n}}$

Where $S_x$ = sample standard deviation
n = sample size

Thus, $S_x$ is the estimator for $\sigma_x$, and $S_{\bar{X}}$ is the estimator for $\sigma_{\bar{X}}$.

Dear learner, we have calculated $S_x$ = 3.633 for the random sample 1, 2, 4, 5, 7, 11. The
sample standard error can be obtained using the formula

$S_{\bar{X}} = \frac{S_x}{\sqrt{n}} = \frac{3.633}{\sqrt{6}} = 1.483$
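
The point estimates obtained so far for the sample 1, 2, 4, 5, 7, 11 can be verified with Python's `statistics` module, whose `stdev` function uses the divisor n - 1. This is a small illustrative sketch; the variable names are our own.

```python
from math import sqrt
from statistics import mean, stdev

sample = [1, 2, 4, 5, 7, 11]
n = len(sample)

# Point estimate of the population mean.
x_bar = mean(sample)

# Point estimate of the population proportion of even numbers ("successes").
p_bar = sum(1 for x in sample if x % 2 == 0) / n

# Sample standard deviation (divisor n - 1) and sample standard error of the mean.
s_x = stdev(sample)
s_x_bar = s_x / sqrt(n)

print("sample mean:", x_bar)
print("sample proportion of even numbers:", round(p_bar, 4))
print("sample standard deviation:", round(s_x, 3))
print("sample standard error of the mean:", round(s_x_bar, 3))
```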

2.1.6 A POINT ESTIMATE OF SAMPLE STANDARD ERROR OF THE PROPORTION

The standard error of the proportion answers how far an unknown population proportion
might be from the sample proportion. The symbol $S_{\bar{p}}$ will be used to mean the sample
standard error of the proportion.

$S_{\bar{p}} = \sqrt{\frac{\bar{p}\,\bar{q}}{n}}$

Where $\bar{p}$ = sample proportion of success
$\bar{q} = 1 - \bar{p}$
n = sample size

Example
Let an even number be a success, and suppose a random sample of 200 numbers selected
from a population contains 120 even numbers. Write the symbol for and
compute the value of the point estimator of the standard error of the proportion.

$\bar{p} = \frac{120}{200} = 0.6$

$S_{\bar{p}} = \sqrt{\frac{\bar{p}\,\bar{q}}{n}} = \sqrt{\frac{0.6 \times 0.4}{200}} = 0.0346$
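
A short check of this example in Python, assuming that the 120 even numbers are the successes observed in the sample of 200 (names are illustrative):

```python
from math import sqrt

n = 200            # sample size
successes = 120    # even numbers observed in the sample

p_bar = successes / n          # sample proportion of success
q_bar = 1 - p_bar              # sample proportion of failure

# Sample standard error of the proportion.
s_p = sqrt(p_bar * q_bar / n)

print("sample proportion:", p_bar)
print("sample standard error of the proportion:", round(s_p, 4))
```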

The following table shows some population parameters and their estimators.

                              Population parameter     Sample statistic (estimator)
Mean                          $\mu$                    $\bar{X}$
Standard deviation            $\sigma_x$               $S_x$
Variance                      $\sigma_x^2$             $S_x^2$
Proportion                    $p$                      $\bar{p}$
Standard error of the mean    $\sigma_{\bar{X}}$       $S_{\bar{X}}$

2.2 INTERVAL ESTIMATE


Point estimators of population parameters, while useful, do not convey as much
information as the interval estimators. Point estimation produces a single value as an
estimate of unknown population parameter. The estimate may or may not be close to the
parameter value; in other words, the estimate may be incorrect. Unbiasedness guarantees
only that the average value of the estimator determined from repeated samples will equal
the parameter value. An interval estimate, on the other hand, is a range of values that
conveys the fact that estimation is an uncertain process. The standard error of the point
estimator is used in creating a range of values; thus, a measure of variability is
incorporated into interval estimation. Further, a measure of confidence in the interval
estimator is provided; consequently, interval estimates are also called confidence
intervals. For these reasons, interval estimators are considered more desirable than point
estimators.

2.2.1 INTERVAL ESTIMATE OF POPULATION MEAN

An interval estimate of $\mu$ is an interval with end points a and b within which an unknown
population mean is expected to lie. The interval is an inference based upon:
1. The value of the mean $\bar{X}$ of the simple random sample selected from the
population, and
2. Known facts about sampling distributions of the mean
The confidence level shows how certain we are that the interval is correct. The choice of
method used in constructing a confidence interval for $\mu$ depends upon whether or not
the population is normal and whether the population standard deviation $\sigma_x$ is known or
unknown.

[Link] Confidence Interval Estimate of $\mu$, Normal Population, and Standard Deviation Known

Suppose we have a normal population whose mean and standard deviation are $\mu$ and $\sigma_x$.
The sampling distribution of the mean is normal with mean $\mu$ and standard error

$\sigma_{\bar{X}} = \frac{\sigma_x}{\sqrt{n}}$

For the sampling distribution of the mean, the standard normal variable is

$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$

If we want to be 95% confident that the population mean $\mu$ falls within the estimate, we
can calculate the range as follows.
1. Find the Z value for the 95% confidence level
2. Use the obtained Z value to calculate the interval for the unknown population parameter.

For example, the Z value for a 95% confidence interval is 1.96. Therefore, if we want to be 95%
sure that the true population mean falls within the estimate, we can rearrange the above
formula and get:

$\bar{X} - 1.96\,\sigma_{\bar{X}} \le \mu \le \bar{X} + 1.96\,\sigma_{\bar{X}}$

The proportion of correct estimates (0.95 in our illustration) is called the confidence
coefficient C. The number 100C (95% in our illustration) is called the confidence level. The
proportion of incorrect statements is symbolized by the Greek letter α (alpha). The sum of
the proportions of correct and incorrect statements is 1; so
C + α = 1 or α = 1 - C
We can describe C as the chance that the confidence interval is correct, and α as the chance
that the interval is incorrect.

Example
A normal population has standard deviation of 10. a random sample of size 25 has a mean
of 50. Construct a 95% confidence interval estimate of the population mean.

Solution
To construct the confidence interval, we first find the Z value for the 95% confidence level
and then use the formula
$\bar{X} - Z\sigma_{\bar{X}} \le \mu \le \bar{X} + Z\sigma_{\bar{X}}$ to estimate the interval. The Z value
for the 95% confidence level is 1.96. Therefore, the estimate can be given as
$\bar{X} - 1.96\,\sigma_{\bar{X}} \le \mu \le \bar{X} + 1.96\,\sigma_{\bar{X}}$. That is:

$50 - 1.96\left(\frac{10}{\sqrt{25}}\right) \le \mu \le 50 + 1.96\left(\frac{10}{\sqrt{25}}\right)$
$50 - 3.92 \le \mu \le 50 + 3.92$
$46.08 \le \mu \le 53.92$
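
The interval in this example can be reproduced in Python. The sketch below is illustrative: instead of reading 1.96 from a table, it obtains the Z value from `statistics.NormalDist().inv_cdf`, and the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

x_bar = 50        # sample mean
sigma = 10        # known population standard deviation
n = 25            # sample size
confidence = 0.95

# Z value that leaves alpha/2 in each tail of the standard normal curve.
alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)

# Standard error of the mean and the interval end points.
std_error = sigma / sqrt(n)
margin = z * std_error
lower, upper = x_bar - margin, x_bar + margin

print(f"z = {z:.2f}")
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

Changing `confidence` to 0.90 or 0.99 shows how the interval narrows or widens with the confidence level.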

[Link] Precision, Confidence and Sample Size

The narrower the confidence interval is, the more precise it is, and the wider the interval,
the less precise it is. The end points of a confidence interval for µ are:

$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma_x}{\sqrt{n}}$

The smaller the value of $Z_{\alpha/2}\,\frac{\sigma_x}{\sqrt{n}}$, the more precise (narrower) is the
confidence interval. Consequently, the smaller $Z_{\alpha/2}$ and $\sigma_x$ are, and the larger n is,
the more precise will be the interval. We conclude that the larger the sample size, the more
precise is an interval estimate. It can also be concluded that the smaller the variability, the
more precise the estimate. The final conclusion that can be drawn from the above
relationship is that the lower the confidence level, the more precise is the interval estimate.

[Link] CONFIDENCE ESTIMATE OF µ, NORMAL POPULATION, $\sigma_x$ UNKNOWN

Under the previous case we have seen the situation where the population is normally
distributed and the population standard deviation is known. In that case we look up the Z
value for α/2 and use the formula $\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma_x}{\sqrt{n}}$ to estimate the
interval within which the population mean lies with confidence coefficient C. However, most of
the time the population mean µ is unknown, and so is the population standard deviation
$\sigma_x$. Therefore, $\sigma_x$ must be estimated by the sample standard deviation:

$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

After calculating the sample standard deviation, the sample standard error must be computed
using the following formula:

$S_{\bar{X}} = \frac{S_x}{\sqrt{n}}$

When the population standard deviation is known, the interval estimate is based on the
standard normal variable

$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$

However, if the population standard deviation is unknown, we need to estimate it with the
sample standard deviation, and the resulting statistic does not follow the
normal distribution. It follows instead Student's t-distribution, which was
first identified by W. S. Gosset in the early 1900s. There is a different t-distribution for
each sample size. The t-distribution is discussed in greater detail under hypothesis testing. In
this chapter we will only illustrate how to make an interval estimate using the t-distribution,
without giving much emphasis to the distribution's characteristics.

Tail areas for the t-distribution are presented according to a parameter called degrees of
freedom. We shall use the symbol $\nu$ for degrees of freedom. Degrees of freedom for the
t-distribution can be calculated as $\nu = n - 1$.

Where
ν = degrees of freedom
n = sample size

As ν increases, the t-value for a given tail area decreases. As the degrees of freedom
increase, the t-distribution approaches the standard normal distribution. When the degrees
of freedom reach about 30, the t-distribution is approximately the same as the normal
distribution.

To construct an interval estimate for µ in this situation, we use the value of $t_{\alpha/2,\nu}$,
which is read from the statistical table, together with the formula:

$\bar{X} - t_{\alpha/2,\nu}\,\frac{S_x}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2,\nu}\,\frac{S_x}{\sqrt{n}}$

Where

$\bar{X}$ = sample mean
n = sample size
$\nu = n - 1$ (degrees of freedom)
$S_x$ = sample standard deviation
μ = unknown population mean

Example
The environmental protection officer of a large industrial plant sought to determine the
mean daily amount of sulphur oxide (a pollutant) emitted by the plant. Because
measurement costs were high, only a random sample of 10 days' measurements was
obtained. These were, in tons per day,
8 7 10 15 11 6 8 5 13 12
Suppose emissions per day are normally distributed. Estimate μ, the mean amount of
sulphur oxides emitted per day, using a confidence interval with a confidence coefficient
of 0.95.

Solution

$\bar{X} = \frac{\sum X}{n} = \frac{95}{10} = 9.5$

$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{94.5}{9}} = 3.24$

The confidence level is 95%. Therefore, the significance level α = 1 - C = 1 - 0.95 = 0.05 and
α/2 = 0.025.
Next, we calculate the degrees of freedom for the observations, which are given as
ν = n - 1 = 10 - 1 = 9

We can now calculate the interval as

$\bar{X} - t_{\alpha/2,\nu}\,\frac{S_x}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2,\nu}\,\frac{S_x}{\sqrt{n}}$

where, in this specific situation, $t_{\alpha/2,\nu} = t_{0.025,\,9} = 2.26$.

Therefore the interval can be calculated as:

$9.5 - 2.26\left(\frac{3.24}{\sqrt{10}}\right) \le \mu \le 9.5 + 2.26\left(\frac{3.24}{\sqrt{10}}\right)$
$7.2 \le \mu \le 11.8$
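
The interval can be recomputed from the summary figures in the solution (sample mean 9.5, sample standard deviation 3.24, n = 10). Python's standard library has no t-distribution quantile function, so the sketch below simply takes the tabulated value for 9 degrees of freedom, 2.262, as an input; that constant and the variable names are assumptions of the sketch.

```python
from math import sqrt

x_bar = 9.5      # sample mean from the solution
s_x = 3.24       # sample standard deviation from the solution
n = 10           # sample size
nu = n - 1       # degrees of freedom

# Tabulated t value for alpha/2 = 0.025 and 9 degrees of freedom
# (read from a t table; the text rounds it to 2.26).
t_crit = 2.262

# Sample standard error of the mean and the interval end points.
s_x_bar = s_x / sqrt(n)
margin = t_crit * s_x_bar
lower, upper = x_bar - margin, x_bar + margin

print(f"degrees of freedom = {nu}")
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

Using the Z value 1.96 instead of the t value would give a narrower interval, which understates the uncertainty for a sample this small.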

Dear learner, it may sometimes be difficult to know whether the population is normally
distributed or not. Hence, we may need to use an approximation. You may remember the
central limit theorem. Do you remember what the central limit theorem is?
The central limit theorem shows that as the sample size increases, the sampling distribution
of the mean approaches the normal distribution. In fact, for n greater than or equal to 30,
statisticians use the normal distribution. Hence, we can use the central limit theorem to
construct an interval estimate for a mean when the sample size is greater than or equal to 30.

2.2.2 CONFIDENCE INTERVAL ESTIMATE FOR POPULATION


PROPORTION

Most of the time we need to estimate a population proportion, such as the proportion that
supports a given political party. The symbol p represents the population proportion of
successes and q = 1 - p. Those who support the political party count as successes and p is the
population proportion of supporters; q = 1 - p is the proportion who do not support the
political party.

If a random sample of size n is selected from a population that has p as its proportion of
successes, the sample proportion of successes is denoted by $\hat{p}$, and $\hat{q} = 1 - \hat{p}$. The formula is:

$$\hat{p} = \frac{\text{number of successes in a sample of size } n}{n}$$

$$\hat{q} = 1 - \hat{p}$$
To construct an interval estimate for a population proportion we use the normal distribution.
Dear learner, you may recall from probability distributions that the number of successes in
a sample follows the binomial distribution. However, it is difficult to construct an
interval with the binomial distribution. Hence, we will use the normal distribution, with the
assumption that if the sample size is sufficiently large, the distribution of $\hat{p}$ is nearly
normal. The rule of thumb is that n is considered to be large if both np and nq are greater
than 5. The rule assumes that we know the population proportions of successes and failures.
In reality, however, neither is known, and we need to estimate them with the sample
proportions of successes and failures. The rule of thumb we shall follow is that both
$n\hat{p} \ge 15$ and $n\hat{q} \ge 15$.

The sampling distribution of $\hat{p}$ has a mean of p and a standard error of
$\sigma_{\hat{p}} = \sqrt{pq/n}$. When constructing a confidence interval for the unknown
value of p, we use the estimator $\hat{p}$ in place of p in the formula. Next, we compute the
sample standard error of the proportion,

$$S_{\hat{p}} = \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

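Then, using $S_{\hat{p}}$ as an estimator of $\sigma_{\hat{p}}$, we can calculate the interval estimate as:

$$\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

where $\hat{p}$ = sample proportion of successes
$\hat{q} = 1 - \hat{p}$
$\alpha = 1 - C$
n = sample size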

Example
A random sample of 400 members of labour force in a five state region showed that 32
were unemployed. Construct the 95% confidence interval for the proportion unemployed
in the region.
Solution
$$\hat{p} = \frac{32}{400} = 0.08$$

With C of 95%, α = 0.05 and α/2 = 0.025.
Find $Z_{0.025}$ from the statistical table. To find the Z value, search for the probability in the main
body of the Z table and read off the corresponding Z score. In our case that will be 1.96.
Therefore, the interval estimate can be calculated as:

$$\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$0.08 - 1.96\sqrt{\frac{(0.08)(0.92)}{400}} \le p \le 0.08 + 1.96\sqrt{\frac{(0.08)(0.92)}{400}}$$

$$0.053 \le p \le 0.107$$

Consequently, with 95% confidence, we state the population proportion to be between
0.053 and 0.107, that is, between 5.3% and 10.7%.
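The same interval can be reproduced in Python. The sketch below is illustrative only: the function name `proportion_confidence_interval` is an assumption, and the Z value is obtained from the standard library's `statistics.NormalDist` instead of the printed table (it gives about 1.96 for a 95% confidence level).

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """Large-sample (normal approximation) interval for a population proportion."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z value that leaves alpha/2 in the upper tail
    std_error = math.sqrt(p_hat * q_hat / n)  # sample standard error of the proportion
    return p_hat - z * std_error, p_hat + z * std_error

lower, upper = proportion_confidence_interval(successes=32, n=400, confidence=0.95)
print(f"{lower:.3f} <= p <= {upper:.3f}")  # prints "0.053 <= p <= 0.107"
```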
2.3 DETERMINATION OF SAMPLE SIZE
Dear learner, you have seen sampling and the reasons for sampling in the chapter on sampling
distributions. One reason behind sampling is to reduce the cost of data collection. If we
conduct a census study, the cost we incur to collect data will be prohibitively high.
Therefore, we have to take a small sample to hold costs down. On the other hand, we want
the sample to be large enough to provide a good estimate of the population parameter.
Consequently, the issue is: how large should the sample size be? The size of the sample
depends on three factors:
• How precise or narrow we want the interval estimate to be
• How confident we want to be that the interval estimate is correct
• How variable the population being sampled is
The higher the desired precision or level of confidence, the larger the sample will be; also,
for a given precision and level of confidence, the larger the population variability is, the
larger the sample will be.

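2.3.1 Sample Size for Estimating Population Proportion

The confidence interval for p is

$$\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

which shows that the interval extends from $\hat{p} - Z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$ to
$\hat{p} + Z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$, so we can express this as:

$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

The smaller the term that follows the ± sign, the more precise (narrower) the interval will
be. This term is called the error and is denoted by e.

$$e = Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

If we solve for n, we get the following formula:

$$n = \frac{Z_{\alpha/2}^{2}\,\hat{p}\hat{q}}{e^{2}}$$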
In fact we are trying here to determine how large our sample size should be, so we do not
have $\hat{p}$ and $\hat{q}$ because the sample has not yet been taken. Therefore, instead of
$\hat{p}$ and $\hat{q}$, we need to use p and q. However, p and q themselves are not known.
Therefore, it is safest to take 0.5 for p, which yields the largest sample size. If the
decision maker has some prior information about the population proportion, that must be used
instead of 0.5. If the existing information leads to the belief that the population
proportion is between two values:

• If both values are on the same side of 0.5, choose p as the value closer to 0.5.
• If 0.5 is between the two values, use 0.5 for p.

Example
Suppose we want to estimate a population proportion to within ±0.04 and we want
a confidence coefficient of C = 0.90. How large should the sample be?
Solution
We are given the confidence coefficient and the error. The population proportion that yields the
safest sample size is 0.5. Therefore, it is possible to calculate the sample size using the
formula:

$$n = \frac{Z_{\alpha/2}^{2}\,pq}{e^{2}}$$

Next, we need to read the Z value for α/2, where α = 1 - 0.9 = 0.1. Therefore, α/2 = 0.05 and

$$Z_{\alpha/2} = Z_{0.05} = 1.64$$

Therefore,

$$n = \frac{1.64^{2}(0.5)(0.5)}{(0.04)^{2}} = 420.25$$

Rounding up, a sample of 421 should be taken.
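The sample size formula and the rule for choosing a planning value of p can be sketched in Python as follows. The function names `planning_proportion` and `sample_size_for_proportion` are hypothetical, and the Z value 1.64 is the table value used in the example, passed in as an argument rather than computed.

```python
import math

def planning_proportion(low=None, high=None):
    """Pick the planning value of p: the value in [low, high] closest to 0.5."""
    if low is None or high is None:
        return 0.5                       # no prior information: safest choice
    if low <= 0.5 <= high:
        return 0.5                       # 0.5 lies between the two values
    return high if high < 0.5 else low   # both on one side: take the value nearer 0.5

def sample_size_for_proportion(e, z, p=0.5):
    """Return (unrounded n, n rounded up) for error e and table value z."""
    n = z ** 2 * p * (1 - p) / e ** 2
    return n, math.ceil(n)

raw, n = sample_size_for_proportion(e=0.04, z=1.64)
print(round(raw, 2), n)   # about 420.25, rounded up to 421
```

Rounding up rather than to the nearest integer is a deliberate choice here: a smaller sample would give an error slightly larger than the one requested.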

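2.3.2 Sample Size for Estimating a Population Mean

The confidence interval estimate of μ,

$$\bar{X} - Z_{\alpha/2}\frac{\sigma_x}{\sqrt{n}} \le \mu \le \bar{X} + Z_{\alpha/2}\frac{\sigma_x}{\sqrt{n}}$$

can be rewritten as $\bar{X} \pm Z_{\alpha/2}\frac{\sigma_x}{\sqrt{n}}$, and this can be expressed as $\bar{X} \pm e$.

Therefore,

$$e = Z_{\alpha/2}\frac{\sigma_x}{\sqrt{n}}$$

From the above formula it is possible to solve for n; hence,

$$n = \frac{Z_{\alpha/2}^{2}\,\sigma_x^{2}}{e^{2}}$$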

As can be seen from the above formula, there is a direct relationship between sample size
and variation in the population: the greater the variability, the larger the sample size.
The variation of the population, however, is usually not known before sampling, and no
sample estimate of it is available yet. If there is historical evidence about the variance,
it can be used. Most of the time, though, neither the population variance nor a sample
variance is known. Hence we need to estimate σ from a knowledgeable official's judgement of
unusually high and low values, using the formula:

$$\sigma \approx \frac{\text{official's high value} - \text{official's low value}}{4}$$

Example
A sample is to be taken to estimate the mean salary of plumbers to within ± birr 500 with a
confidence coefficient of 0.99. A plumbers’ union official states that birr 40,000 and birr
26,000 would be unusually large and small salaries for plumbers in the union. What should
the sample size be?

Solution

$$n = \frac{Z_{\alpha/2}^{2}\,\sigma_x^{2}}{e^{2}}$$

It is possible to use the formula to find the sample size, but we first need to find σ.

$$\sigma \approx \frac{\text{official's high value} - \text{official's low value}}{4} = \frac{40{,}000 - 26{,}000}{4} = 3500$$

With C = 0.99, α = 0.01 and α/2 = 0.005, so $Z_{0.005} = 2.58$. Therefore,

$$n = \left(\frac{2.58(3500)}{500}\right)^{2} = 326.16$$

Rounding up, the sample size should be 327.
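The two steps of this calculation (estimating σ from the official's range, then solving for n) can be sketched as below. The function names are made up for illustration, and the Z value is an assumption stated explicitly: for a confidence coefficient of 0.99 the two-sided table value $Z_{0.005}$ is about 2.58, which is the value passed in here.

```python
import math

def sigma_from_range(high, low):
    """Rough planning estimate of sigma: (unusually high - unusually low) / 4."""
    return (high - low) / 4

def sample_size_for_mean(e, sigma, z):
    """Return (unrounded n, n rounded up) from n = (z * sigma / e) ** 2."""
    n = (z * sigma / e) ** 2
    return n, math.ceil(n)

sigma = sigma_from_range(high=40_000, low=26_000)            # 3500.0
raw, n = sample_size_for_mean(e=500, sigma=sigma, z=2.58)    # z = Z(0.005), assumed table value
print(sigma, round(raw, 2), n)
```

Because n grows with the square of z and of σ, even a small change in the table value or in the official's range changes the required sample size noticeably.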

UNIT SUMMARY
Estimation is the process of using a sample statistic to infer about the
population parameter. There are two types of estimates: point estimates and interval
estimates.
Point estimate is the process of assigning a single value that we believe the
population parameter takes.
Interval estimate is the process of constructing a range within which the
population parameter lies.
Point estimator for population mean is sample mean, for population
proportion is sample proportion, for population standard deviation is sample standard
deviation, for population standard error is sample standard error, the difference of two
population means is the difference of two sample means and the difference of two
population proportions is the difference of two sample proportions.
The confidence level shows how certain we are that the interval is
correct. The choice of method used in constructing a confidence interval for μ depends
upon whether or not the population is normal and whether the population standard
deviation $\sigma_X$ is known or unknown.
The narrower the confidence interval is, the more precise it is, and the wider
the interval, the less precise it is.
We conclude that the larger the sample size, the more precise is an interval
estimate.
The smaller the variability, the more precise the estimate.
The lower the confidence level, the more precise is the interval estimate.
If the population standard deviation is unknown, we need to estimate
it with the sample standard deviation, and the sampling distribution does
not follow the normal distribution; it rather follows Student’s t-distribution. There is a
different t-distribution for each sample size.
The size of the sample depends on three factors:
How precise or narrow we want the interval estimate to be
How confident we want to be that the interval estimate is correct
How variable is the population being sampled

SELF CHECK EXERCISE 4

1. Ford Motor Company introduced a new minibus which has greater fuel
economy than the regular sized minibus. A random sample of 50 minibuses averaged 30
miles per gallon, and had standard deviation of 3 miles per gallon. Construct a 95 percent
confidence interval for the mean miles per gallon for all minibuses.
2. A cattle raiser selected random sample of 10 steers, all of the same age and
fed them special mixture of grains and other ingredients. After a period of time, weight
gains were recorded. The sample mean weight gain, per steer, was 142.6 pounds and
standard deviation was 10.4 pounds. Suppose weight gains are normally distributed.
Construct a 90% confidence interval for the population mean weight gain per steer.
3. The diameters of ball bearings made by an automatic machine are normally
distributed and have a standard deviation of 0.02 mm. The mean of a random sample of four
ball bearings is 6.01 mm. Construct the 95 percent confidence interval for the mean diameter
of all ball bearings being made by the machine.
4. Interviewers called a random sample of 300 homes while “Ehud Mezinanya”
is being aired. 105 respondents said they were watching the program. Construct a 95%
confidence interval for the proportion of all homes where the program was being watched.
5. The proportion of all consumers favouring a new product might be as low as
0.20 or as high as 0.60. A random sample is to be used to estimate the proportion of the
consumers who favour the new product to within ±0.05, with a confidence coefficient of
90%. To be on the safe (larger sample) side, what sample size should be used?

CHAPTER THREE

HYPOTHESES TESTING

Unit contents
• Chapter introduction
• Type I and type II errors
• Steps in Hypothesis Testing
• One tail and two tail tests
• Hypothesis test of population mean
• Hypothesis test of population proportion
• Hypothesis test of the difference between two means
• Hypothesis test of the difference between two proportions
• Hypothesis test when population standard deviations are unknown and sample size is small

Unit Objective
At the end of this chapter students will be able to:
• Understand and apply hypothesis testing in different managerial problems
• Identify type I and type II errors
• Identify one tail and two tail tests
• Conduct hypothesis test of population mean
• Conduct hypothesis test of population proportions
• Conduct hypothesis test of the difference of two means
• Conduct hypothesis test of the difference of two proportions
• Conduct hypothesis test for normally distributed population with unknown
population standard deviation and small sample size

3.1 Introduction to Hypothesis testing

Dear learner, in this chapter, the concept of a statistical test of hypothesis is formally
introduced. Business managers must always be ready to make decisions and take actions
on the basis of the available information. During the process of decision making, managers
form hypotheses that they can scientifically test by using the available information. The
managers then make decisions in the light of the outcomes.

We make assumptions about the population parameter to be tested. The assumptions we
make about population parameters are called hypotheses. Then we take a sample to
estimate the value of the population parameter. If the estimate favours the hypothesis, we
accept the hypothesis as being correct. If the value of the sample statistic calculated as
an estimate of the population parameter does not favour the hypothesis made about the
population, then a decision must be made as to whether the difference is purely a matter of
chance (when in fact the sample statistic and the population parameter are similar) or
whether the difference is significant enough to be a real difference, so that our
assumption about the population parameter is not correct.
Since we are testing whether our hypothesis or assumption is true or not, this field of
decision making is called hypothesis testing.

Dear students, consider the following example. The police found a dead body, suspected
murder, and want to investigate the cause of death. Upon further investigation, a
detective at the scene makes some assumptions or inferences about the
murder based on the initial observation and analysis of the scene of the crime.
• The victim was struck from behind by a left-handed man
• The murderer is tall
The detective makes an assumption about the butler and checks the butler’s height and
whether he is right-handed or left-handed.
Let’s say the detective assumes the butler is innocent. After checking, however, it turns
out that the butler is tall and left-handed.
Dear students, does this fact make the butler guilty? On this evidence, yes. The mere fact
that he is tall and left-handed leads the detective to accept the proposition that the
butler is the killer.

On the other hand, let’s assume that the butler is short and right-handed. Would you think
this fact makes him innocent? On this evidence, yes. The mere fact that he is right-handed
and short leads the detective to conclude that the butler is innocent.

Caution
The fact that the butler is tall and left-handed does not by itself mean that he killed the
man, and the same is true of the short and right-handed person, because there is a chance
that he deliberately made it look this way.

3.2 Type I and type II errors


Dear student, can you identify the potential errors the above detective can make? Here is
the summary of the situation. There are four possible outcomes of the investigation.
• The butler is accused by the detective when in fact he did commit the crime: a correct
decision
• The butler is accused by the detective when in fact he did not commit the crime: a
wrong decision
• The butler is considered innocent and in fact he did not commit the crime: a right decision
• The butler is considered innocent when in fact he did commit the crime: a wrong
decision
The above condition can be presented in tabular format in a more convenient way:

                          Detective's action
Actual condition          Charge him      Release him
Butler did it             Correct         Error
Butler did not do it      Error           Correct

Type I error
A Type I error is the error made in rejecting the null hypothesis when in fact it is true. In
the above example, if the null hypothesis is that the butler is innocent, what will be the
Type I error? The Type I error is charging the butler when he is indeed innocent. The
probability of a Type I error is denoted by α (alpha), the probability of rejecting a true
hypothesis. It is known as the level of significance. 1 - α represents the degree of confidence.
Type II error
A Type II error is the error made in accepting the null hypothesis when in fact it is false.
In the previous murder case, if the butler who has killed the victim is released, the
detective is said to commit a Type II error: declaring a criminal butler innocent.
The probability of a Type II error is denoted by β; it is the probability of accepting a
false hypothesis. The value of β should be as low as possible.
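The meaning of α as a long-run error rate can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not part of the module's method: the population values (mean 100, standard deviation 15), the sample size, the seed and the function name `type_one_error_rate` are all arbitrary. Samples are drawn from a population where the null hypothesis is true, and we count how often a two-tailed Z test at α = 0.05 rejects it anyway.

```python
import math
import random

def type_one_error_rate(mu0, sigma, n, z_crit, trials, seed=0):
    """Share of samples, drawn with H0 true, for which |Z| exceeds z_crit."""
    rng = random.Random(seed)
    std_error = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / std_error
        if abs(z) > z_crit:          # H0 is rejected although it is true
            rejections += 1
    return rejections / trials

rate = type_one_error_rate(mu0=100, sigma=15, n=30, z_crit=1.96, trials=5000)
print(rate)   # close to 0.05, the chosen level of significance
```

Roughly one true null hypothesis in twenty is rejected, which is exactly what a 5% level of significance allows.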

3.3. Steps in Hypothesis Testing


All hypothesis testing involves the following steps. Dear learners, we will apply the
following steps throughout this chapter.
• State the null hypothesis as well as the alternative hypothesis
• Establish the level of significance prior to sampling
• Determine the suitable test statistic
• Define the acceptance and rejection (critical) regions
• Collect data and analyse the sample
• Make the decision based on the decision rule established

Null and alternative Hypothesis


The null hypothesis (H0) is an assertion about the population parameter that is being tested
using the sample statistic obtained, whereas the alternative hypothesis (Ha or H1) is a claim
about the population parameter that is accepted when the null hypothesis is rejected.

Rejection of the null hypothesis that is being tested implies acceptance of alternative
hypothesis. The two hypotheses represent mutually exclusive and collectively exhaustive
theories about the value of the population parameters such as population mean,
population proportions, and population variance.

Simple hypothesis and composite hypothesis


A simple hypothesis is a hypothesis in which we specify only a single value for the
population parameter, whereas a composite hypothesis is one in which we specify a
range of values for the parameter.

Significance level
Significance level is the chance of committing a type I error: that is, the probability of
rejecting a true null hypothesis. Significance level is denoted by α. When the value of a
test statistic leads us to reject H0, we say that the value is statistically significant.
Significance level is often used by analysts in reporting test results. For example, an
analyst may report that a test result was significant at the 5% level but not significant at
the 1% level: that means the null hypothesis that was tested would be rejected if α = 0.05
but would be accepted if α = 0.01.

Determine suitable test statistic


Depending on the situation, a manager needs to select which of the available tests s/he
should use for analysis. For example, if the population is normally distributed and the
population standard deviation is known, the manager may use the Z distribution. If the sample
size is small and the population standard deviation is unknown, the analyst needs to use the
t-statistic. To test for independence of two variables, or to test for goodness of fit, the
chi-square distribution must be used, and to test whether the means of more than two
populations are equal, analysis of variance (the F distribution) must be used.

Define the rejection or critical region


Once the test statistic is determined, the analyst needs to determine the critical value
beyond which the null hypothesis is rejected. This is obtained using the test statistic
selected and the significance level needed.

Data collection and sample analysis


It is the process of collecting data to test the null hypothesis. If, for instance,
the null hypothesis is that the mean life of electric bulbs produced by Sara P Ltd Company is
400 hrs, then a sample of 1000 electric bulbs can be tested to see if their mean life is
actually 400 hrs.

Making a decision
Based on the sample statistic obtained and the critical value, the analyst makes a decision
on whether to accept or reject the null hypothesis developed already.

3.4 One tail and two tail tests


In some hypothesis tests the null hypothesis is rejected if the sample statistic is either too
far above or too far below the hypothesized population parameter. The rejection area is on
both sides of the parameter. Tests of this type are called two tailed tests. Situations in
which the rejection area lies entirely on one extreme of the curve, either the right or the
left tail, are known as one tail tests.

Figure 5.1 Two tailed test (rejection regions of area α/2 in each tail; acceptance region of area 1-α between them)

A two tailed test is called for when we are interested in the population mean, μ, being
either much larger or much smaller than the specified value, μ0. For example, a winery may
need to know the average ml of wine per bottle. Too little wine causes customer complaints
and too much reduces profit.

Figure 5.2 Right tail test (rejection region of area α in the right tail; acceptance region of area 1-α to its left)

An upper tail test is in order when we are concerned only with whether the population mean
μ is larger than the specified value μ0. For example, an insurance company may need to
know the average amount of time it takes to process a claim. Too long a time is unacceptable
to customers.

Figure 5.3 Left tail test (rejection region of area α in the left tail; acceptance region of area 1-α to its right)

A lower tail test is in order when we are concerned only with whether the population mean μ
is smaller than some specified value μ0. For example, an electric bulb producer may need
to know if the average life span of the bulbs it produces is less than a pre-specified
amount. If the life span is too short the customers will complain, but there is no problem if
the life span is too long.

3.5 HYPOTHESIS TEST OF POPULATION MEAN

A researcher may need to know if a population parameter is (statistically) reasonably equal
to a computed sample statistic, given a known population standard deviation and a
significance level. In such a case three possible null hypotheses about the population and
three corresponding alternative hypotheses arise.

1. H0: µ ≤ a
   H1: µ > a
2. H0: µ ≥ a
   H1: µ < a
3. H0: µ = a
   H1: µ ≠ a
Dear learner, the first two of the above hypotheses lead to the one tail tests discussed
above and the third one leads to a two tail test.
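The link between the form of the alternative hypothesis and the rejection region can be sketched in Python. The names `critical_values` and `reject_h0`, and the labels "greater", "less" and "two-sided", are illustrative assumptions; the critical Z values come from the standard library's `statistics.NormalDist` rather than from the printed table.

```python
from statistics import NormalDist

def critical_values(alpha, alternative):
    """Return (lower, upper) critical Z values; None means no boundary on that side.

    alternative: "greater"   for H1: mu > a   (right tail test)
                 "less"      for H1: mu < a   (left tail test)
                 "two-sided" for H1: mu != a  (two tail test)
    """
    std_normal = NormalDist()
    if alternative == "greater":
        return None, std_normal.inv_cdf(1 - alpha)
    if alternative == "less":
        return std_normal.inv_cdf(alpha), None
    if alternative == "two-sided":
        c = std_normal.inv_cdf(1 - alpha / 2)   # alpha is split between the two tails
        return -c, c
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

def reject_h0(z, alpha, alternative):
    """True when the calculated z falls in the rejection region."""
    lower, upper = critical_values(alpha, alternative)
    return (lower is not None and z < lower) or (upper is not None and z > upper)

print(critical_values(0.05, "two-sided"))   # roughly (-1.96, 1.96)
print(critical_values(0.01, "less"))        # roughly (-2.33, None)
```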

In determining whether a one tail test is appropriate, it is helpful to express the
problem in terms of phrases that indicate whether a single direction or both
directions of difference away from the parameter value are important. If the basic
question of interest can be expressed as "Has there been an increase?", "Is the new
better than the old?", "Is there a decrease?" or "Has there been a decline?", then a one
tail test is appropriate. If the question can be expressed as "Is there any change?" or
"Is there any difference?", then a two tailed test is appropriate.

Illustration one
Assume that the average annual income for government employees in the nation is
reported by the census bureau to be birr 18,750. There was some doubt whether the
average yearly income of government employees in Addis Ababa was representative of
the national average.

A random sample of 100 government employees in Addis Ababa was taken and it was found that
their average salary was birr 19,240 with standard deviation of birr 2,610. At a level of
significance α = 0.05 (95% confidence level), can we conclude that the average salary of
government employees in Addis Ababa is representative of the national average?

Solution
1- State the hypothesis
H0: µ = 18750
H1: µ ≠ 18750
2- State the level of the significance
Level of significance α = 0.05
3- Determine which test statistic to use
Dear learner, here we need to determine which of the distributions we have to use to test
the null hypothesis. As we can see from the problem, the standard deviation is
known and the population is infinite. Moreover, the sample size is more than 30. Therefore,
the relevant test statistic, using the central limit theorem, is the Z statistic. The
sampling distribution of sample means is approximately normal with standard error of
the mean $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.

$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{19{,}240 - 18{,}750}{2{,}610/\sqrt{100}} = \frac{490}{261} = 1.877$$

4- Define the critical region
Since α = 0.05 and it is a two tailed test, the rejection region will be on both tails of
the curve, in such a way that the rejection area comprises 2.5% (5%/2) at the end of the
right tail and 2.5% (5%/2) at the end of the left tail.
The Z value from the table will be ±1.96.
Decision rule: accept H0 if -1.96 ≤ Z calculated ≤ 1.96
5- Decision
Since the calculated Z (1.877) is less than 1.96, we cannot reject the null hypothesis. This
means the sample gives no significant evidence that the mean differs from 18,750, so the
average salary in Addis Ababa can be regarded as representative of the national average.

Activity 5.1
1. Define type I and type II errors.
2. What is meant by the statement "the significance level of a test is α = 0.05"?
What is meant by the symbol α?
3. An automatic machine should produce parts that have a mean diameter of
25 mm. Part diameters are normally distributed. The diameters of 10 parts are to be used to
check whether or not the machine is running properly. Perform a hypothesis test at the 5%
level if the mean of the sample is 25.02 mm and the sample standard deviation is 0.024 mm.

Illustration two
A manufacturer of light bulbs claims that its light bulbs last on average 1600 hrs.
We want to test this claim. We will not reject the claim if the average of the sample taken
lasts considerably more than 1600 hrs. But we will reject the claim if it lasts considerably
less than 1600 hrs.
A sample of 100 light bulbs was taken at random and the average bulb life of this sample
was computed to be 1570 hrs with standard deviation of 120 hrs. At α = 0.01, test the
validity of the claim.

Solution

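Step one
H0: µ = 1600
Ha: µ < 1600
Step two
Find the significance level
As already stated in the problem, the test is to be done at α = 0.01. For a left tail
test the critical value is Z = -2.33.

Decision rule
The decision rule is “Reject H0 if Z < -2.33”

3. Calculate the test statistic

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{1570 - 1600}{120/\sqrt{100}} = \frac{-30}{12} = -2.5$$

4. Decision: Reject H0, since -2.5 < -2.33
Interpretation
The average life of the bulb is significantly less than 1600 hrs.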

Example 3
An insurance company claims that it takes 2 weeks (14 days), on average, to process an
auto-accident claim. The standard deviation σ is 6 days. To test the validity of the claim,
an investigator randomly selected 36 people who recently filed claims. This sample revealed
that it took the company an average of 16 days to process the claims. At the 99% level of
confidence, check if it takes the company more than 14 days.
Solution
1. Ho: µ = 14 days
H1: µ > 14 days

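2. Decision rule
At the 99% level of confidence, α = 0.01 lies entirely in the right tail, so the critical
value is Z = 2.33.
The decision rule is “Reject H0 if Z > 2.33”
3. Calculate the test statistic

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{16 - 14}{6/\sqrt{36}} = 2$$

4. Decision
The calculated z is less than the critical value. Therefore accept H0.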
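The three worked tests of a population mean above follow the same pattern, so they can be reproduced with one short Python sketch. The function names `z_for_mean` and `decide`, and the tail labels "two", "left" and "right", are illustrative assumptions; the critical values 1.96 and 2.33 are the table values used in the solutions.

```python
import math

def z_for_mean(x_bar, mu0, sd, n):
    """Test statistic Z = (x_bar - mu0) / (sd / sqrt(n))."""
    return (x_bar - mu0) / (sd / math.sqrt(n))

def decide(z, critical, tail):
    """Apply the decision rule for a 'two', 'left' or 'right' tail test."""
    if tail == "two":
        reject = abs(z) > critical
    elif tail == "left":
        reject = z < -critical
    elif tail == "right":
        reject = z > critical
    else:
        raise ValueError("tail must be 'two', 'left' or 'right'")
    return "reject H0" if reject else "do not reject H0"

# (label, sample mean, hypothesized mean, standard deviation, n, table value, tail)
examples = [
    ("Illustration one", 19240, 18750, 2610, 100, 1.96, "two"),
    ("Illustration two", 1570, 1600, 120, 100, 2.33, "left"),
    ("Example 3", 16, 14, 6, 36, 2.33, "right"),
]
for label, x_bar, mu0, sd, n, critical, tail in examples:
    z = z_for_mean(x_bar, mu0, sd, n)
    print(label, round(z, 3), decide(z, critical, tail))
```

Only the second test rejects H0: its statistic of -2.5 lies beyond the left tail critical value of -2.33, whereas 1.877 and 2.0 both fall inside their acceptance regions.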

Activity 5.2
1. Government officials have decided to control the price of a product if its
mean price per unit in retail outlets rises above birr 2.5. Perform a hypothesis test at the
0.01 level if the mean price in the random sample of 40 outlets is birr 2.52 and sample
standard deviation is birr 0.10.
2. Safe Fly company makes parachutes. Safe Fly has been buying snap links from
a manufacturing firm which recently merged with bridge cooperation. Safe Fly is concerned
that the quality of snap links they receive from Big deal might not be up to specifications.
Specifically, Safe Fly wants to be convinced that the links will withstand a mean breaking
force of more than 5,000 pounds. Perform a hypothesis test at the 0.005 level if the mean
breaking force for a random sample of 50 links is 5,100 pounds and the sample standard
deviation is 221 pounds.
3. Food Machinery Supplies manufactures automatic cola dispensing machines
that are supposed to pour 8 ounces into a cup. Before shipping a machine, Food Machinery
Supplies makes a sample check to determine if the mean amount poured by the machine is
at least 8 ounces. Perform a hypothesis test at the 0.05 level if a random sample of 60 filled
cups had a mean fill amount of 7.92 ounces and standard deviation of 0.16 ounces.
4. Seniors in a high school of a city in the past had a mean score of 490 on
standard mathematics test. A teacher suggests that the seniors will have a higher mean
score if they attend tutorial sessions before taking the test. Perform a hypothesis test at
0.05 level if the scores of a random sample of 35 tutored seniors who take the
standardized test have a mean of 510 and standard deviation of 85.
5. The Super Tread Tire Company requires that the tires it makes should
withstand a mean pressure of more than 150 psi (pounds per square inch) before bursting.
From each large batch of tires made, a random sample of 10 tires is selected and subject to
increasing pressure until burst. Bursting pressures have been found to be normally
distributed. Only batches of tires that meet the psi requirements are sold under the super
tread Brand name. Perform a hypothesis test at the 1% level for a batch in which the
sample mean psi is 154 and sample standard deviation is 3.7 psi.

3.6 HYPOTHESIS TEST OF PROPORTIONS


Example
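The sponsor of a TV show believes that his studio audience is divided equally between men
and women. Out of 400 persons attending the show one day, there were 230 men. At α = 0.05,
test if the belief of the sponsor is correct.
Solution
1. State the hypothesis
H0: π = 0.5
H1: π ≠ 0.5
2. Decision rule
This is a two tailed test at α = 0.05, so the critical values are Z = ±1.96.
Reject H0 if the calculated Z is < -1.96 or > 1.96
Accept H0 if -1.96 ≤ Z ≤ 1.96
3. Calculate Z
The sample proportion is $\hat{p} = 230/400 = 0.575$, and the standard error is computed
from the hypothesized proportion π = 0.5.

$$Z = \frac{\hat{p} - \pi}{\sigma_{\hat{p}}} = \frac{0.575 - 0.5}{\sqrt{\dfrac{0.5(0.5)}{400}}} = \frac{0.075}{0.025} = 3$$

4. Decision: Reject H0, since 3 > 1.96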
Example 2
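The mayor of a city claims that 60% of the people of the city follow him and support his
policies. We want to test whether his claim is valid or not. A random sample of 400
persons was taken and it was found that 220 of these people supported the mayor. At level
of significance α = 0.01, what can we conclude about the mayor’s claim?

Solution
1. State the hypothesis
H0: π ≥ 0.6
H1: π < 0.6
2. Decision rule
Level of significance = 0.01. This is a left tail test, so the critical value is Z = -2.33.
Therefore, the decision rule is: Reject H0 if Z calculated < -2.33
3. Calculate the test statistic (Z)
The sample proportion is $\hat{p} = 220/400 = 0.55$, and the standard error is computed
from the hypothesized proportion π = 0.6.

$$Z = \frac{\hat{p} - \pi}{\sigma_{\hat{p}}} = \frac{0.55 - 0.6}{\sqrt{\dfrac{0.6(0.4)}{400}}} = \frac{-0.05}{0.0245} = -2.04$$

4. Decision: Accept H0, since -2.04 is not less than -2.33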
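Both proportion tests can be checked with the short sketch below. It assumes that the standard error is computed from the hypothesized proportion, which is the usual convention and is what reproduces the printed values 0.025, 0.0245, 3 and -2.04; the function name `z_for_proportion` is made up for illustration.

```python
import math

def z_for_proportion(successes, n, pi0):
    """Z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n), standard error taken under H0."""
    p_hat = successes / n
    std_error = math.sqrt(pi0 * (1 - pi0) / n)
    return (p_hat - pi0) / std_error

# TV show audience: H0: pi = 0.5, two tailed test with critical values +/-1.96
z_show = z_for_proportion(successes=230, n=400, pi0=0.5)
print(round(z_show, 2), "reject H0" if abs(z_show) > 1.96 else "accept H0")

# Mayor's claim: H0: pi >= 0.6, left tail test with critical value -2.33
z_mayor = z_for_proportion(successes=220, n=400, pi0=0.6)
print(round(z_mayor, 2), "reject H0" if z_mayor < -2.33 else "accept H0")
```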

Activity 5.3
1. Suppose that a sponsor of Television program states that the program
should be cancelled if there is convincing evidence that the program’s share of viewing
audience is less than 25 percent. The sponsor also states that the worst error would be to
cancel the program if its audience share is 25 percent or more. And the chance of making
the worst error is to be only 5 percent. A sample of 1250 TV viewers will be interviewed,
and the sample proportion p of the viewers who watch the TV program will be used to
decide whether or not to cancel the program. Suppose that there are 260
program viewers in the sample taken. Should the sponsor cancel the program?
2. A company has developed a new hair shampoo named Shanta. The
company’s marketing executive has obtained figures for the costs of plant expansion and
new product advertising. Taking the costs into account, the executive thinks it would be a
mistake to market Shanta unless there is substantial evidence that more than 20 percent of
shampoo buyers will choose Shanta rather than the competitive shampoo. The executive
wants the chance of marketing Shanta to be 0.01 if it does not have more than 20 percent
of the market. The plan is to stock a random sample of stores with Shanta and have a
random sample of 500 customers observed as they select a shampoo to purchase. Perform
a hypothesis test if 110 of the customers in the sample purchase Shanta.
3. During a year-end audit, discrepancies were found in a company invoice
ledger. Consequently, the controller had all invoices for the year checked to determine if
they were correctly recorded in the ledger. The proportion of incorrectly recorded
invoices was found to be 0.04. The controller instituted a new procedure for processing
invoices. Subsequently, a random sample of 500 invoices was checked to determine
whether the proportion of incorrectly recorded had changed from 0.04. Perform a
hypothesis test at the 5% level if 11 of the sample invoices were recorded incorrectly.
4. Microchip Company makes chips used in electric circuits. A random sample
of 125 chips is to be selected from those produced. Production is to be halted for careful
inspection if the sample percent of defective chips is significantly higher than the normal,
where the normal is taken as 5% or less defective. It is required that the chance of halting
normal production is to be 0.02. Perform hypothesis test if the sample contains 10
defective chips.

5. In considering a new proposed personnel policy, executives of Equatorial
Business Group feel it would be a mistake to institute the policy if 25% or more of the
company employees oppose it. The executives want the chance of making this mistake to
be 0.10. Perform a hypothesis test if 38 of a random sample of 200 employees oppose the
policy.

3.7 THE DIFFERENCE OF TWO MEANS


A potential buyer of electric bulbs bought 100 bulbs each of two famous brands, A and B.
Upon testing both samples he found that brand A had a mean life of 1500 hrs with standard
deviation of 50 hrs, whereas brand B had an average life of 1530 hrs with standard deviation
of 60 hrs. Can it be concluded at the 5% level of significance that the two brands differ
significantly in quality?
1. State the hypothesis
H0: µ1 = µ2
H1: µ1 ≠ µ2
2. Decision rule
Accept H0 if -1.96 ≤ Z ≤ 1.96

3. Calculate Z

$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_{(\bar{x}_1 - \bar{x}_2)}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^{2}}{n_1} + \dfrac{\sigma_2^{2}}{n_2}}} = \frac{1500 - 1530}{\sqrt{\dfrac{50^{2}}{100} + \dfrac{60^{2}}{100}}} = \frac{-30}{7.81} = -3.841$$

4. Decision: Reject H0, since -3.841 < -1.96
Interpretation
The two means differ significantly

Example 2

A civil group in a given city claims that female college graduates earn less than male
college graduates. To test the claim, a survey of the starting salaries of 60 male graduates and 50
female graduates was taken. It found that the average starting salary for female
graduates was birr 29,500 with a standard deviation of birr 500, and the average starting salary for
male graduates was birr 30,000 with a standard deviation of birr 600. At the 1% level of
significance, test the claim of this group.

                          Male      Female
Sample mean (x̄)           30,000    29,500
Standard deviation (s)    600       500
Sample size (n)           60        50

1. State the hypotheses
H0: µ male = µ female
H1: µ male > µ female
2. Decision rule
Reject H0 if the calculated Z is greater than 2.33.

3. Calculate Z
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{30{,}000 - 29{,}500}{\sqrt{\dfrac{600^2}{60} + \dfrac{500^2}{50}}} = \frac{500}{104.88} = 4.77$$
4. Decision
Since the calculated Z (4.77) is greater than the critical Z (2.33), we cannot accept H0. The sample
supports the claim that female graduates earn less than male graduates.
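
The arithmetic in the two examples above can be checked with a few lines of Python. This is only a minimal sketch: the function name `two_sample_z` is made up for illustration, and the critical values 1.96 and 2.33 are simply the table values quoted in the examples.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for the difference of two means (large samples)."""
    # Standard error of the difference between the two sample means
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Example 1: bulbs of brand A versus brand B (two-tailed test)
z_bulbs = two_sample_z(1500, 50, 100, 1530, 60, 100)
print(round(z_bulbs, 3), abs(z_bulbs) > 1.96)   # True means reject H0

# Example 2: male versus female starting salaries (one-tailed test)
z_salary = two_sample_z(30000, 600, 60, 29500, 500, 50)
print(round(z_salary, 2), z_salary > 2.33)      # True means reject H0
```

The same function can be reused for the problems in the activity that follows by changing the six inputs and the critical value.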

Activity 5.4
1. The president of a large automobile sales agency tells her sales manager that
she is pleased with the increase over the last year in the number of cars sold. The sales
manager contends that the mean net profit per car sold this year is higher than it was last
year. Inasmuch as detailed accounting is required to obtain a firm figure for the net profit
realized on a particular sale, the sales manager has an accountant determine this profit for
a random sample of 35 cars sold this year and 35 cars sold last year. The sales manager
wants to show that the mean profit for the current year, µ1, is greater than last year's mean
profit, µ2. Perform a hypothesis test at the 5% level if this year's sample has a mean of x̄1 = birr
350 and a standard deviation of s1 = birr 25, and last year's sample has a mean of x̄2 = birr
340 and a standard deviation of s2 = birr 30.
2. There are two methods used to assemble a product. Fifty assemblies are
made by each method to determine whether the mean assembly times are different. Perform a
hypothesis test at the 1% level if the method 1 sample has a mean of 10 hrs and a standard
deviation of 0.45 hrs, compared with a mean of 9.6 hrs and a standard deviation of 0.4 hrs for
the method 2 sample.
3. Two different methods of instruction were used in a management training
program for a large group of supervisors at Steel City Metals. Supervisors without any
training in the subject matter were randomly assigned to either the personalized
instruction method or the more traditional lecture method (LM). In the personalized
instruction method, supervisors used programmed materials and proceeded at their own
pace during scheduled periods. The lecture method used a training leader in a classroom
setting. In order to determine whether the training methods made any difference in
learning, a standardized test was given to all participants. For a random selection of 40
personalized instruction method participants, the mean score was 72 and the standard
deviation was 15. For 50 lecture method participants, the mean was 81 and the standard
deviation was 20. Is the observed difference in the mean scores significant at the 0.05
level?

3.8 TESTING THE DIFFERENCE OF TWO POPULATION PROPORTIONS

Example

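A sample of 200 students at AMU revealed that 18% of them were seniors. A similar sample
of 400 students at Debub University revealed that 15% of them were seniors. We want to
test, at the 5% level of significance, whether the difference between these two proportions is
large enough to conclude that the two populations are indeed different.

Solution
Steps
1. State the hypotheses
H0: π1 = π2
H1: π1 ≠ π2
2. Decision rule
Accept H0 if -1.96 ≤ Z ≤ 1.96
3. Calculate the statistic Z
$$Z = \frac{p_1 - p_2}{\sigma_{p_1 - p_2}}, \qquad \sigma_{p_1 - p_2} = \sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where the pooled proportion is
$$\bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{200(0.18) + 400(0.15)}{600} = 0.16$$
$$\sigma_{p_1 - p_2} = \sqrt{0.16(0.84)\left(\frac{1}{200} + \frac{1}{400}\right)} = 0.0317$$
$$Z = \frac{p_1 - p_2}{\sigma_{p_1 - p_2}} = \frac{0.18 - 0.15}{0.0317} = 0.95$$
4. Decision
Since -1.96 ≤ Z ≤ 1.96, we cannot reject H0. The difference between the two sample proportions is not significant.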

Example 2
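An insurance company believes that smokers have a higher incidence of heart disease than
non-smokers among men over 50 years of age. Accordingly, it is considering offering discounts
on its life insurance policies to non-smokers. However, before the discount is offered, an
analysis is undertaken. To justify its claim that smokers are at a higher risk of heart
disease than non-smokers, the company randomly selected 200 men, of whom 80 were
smokers and 120 were non-smokers. The survey indicated that 18 smokers and 15 non-smokers
suffered from heart disease. At the 5% level of significance, can we justify
the claim of the insurance company that smokers have a higher incidence of heart disease
than non-smokers?

Solution
Steps
1. State the hypotheses
H0: π1 = π2
H1: π1 > π2
where π1 is the proportion for smokers and π2 the proportion for non-smokers.
2. Decision rule
Because the alternative hypothesis is one-sided, this is a one-tail test, and the critical value at the 5% level is 1.645.
Reject H0 if the calculated Z > 1.645.
3. Calculate Z
The sample proportions are p1 = 18/80 = 0.225 and p2 = 15/120 = 0.125. The pooled proportion is
$$\bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{18 + 15}{200} = 0.165$$
$$Z = \frac{p_1 - p_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.225 - 0.125}{\sqrt{0.165(0.835)\left(\dfrac{1}{80} + \dfrac{1}{120}\right)}} = \frac{0.1}{0.0536} = 1.87$$
4. Decision
Since the calculated Z (1.87) is greater than 1.645, we reject H0. At the 5% level, the sample supports the claim that smokers have a higher incidence of heart disease than non-smokers.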
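
As a cross-check on the two worked examples in this section, the pooled-proportion Z statistic can be computed directly. The sketch below uses a hypothetical helper, `two_proportion_z`, that takes counts rather than percentages, so Example 1 is entered as 36 of 200 (18%) and 60 of 400 (15%).

```python
import math

def two_proportion_z(successes1, n1, successes2, n2):
    """Z statistic for the difference of two proportions, using the pooled proportion."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

# Example 1: seniors at two universities
z_seniors = two_proportion_z(36, 200, 60, 400)
print(round(z_seniors, 3))

# Example 2: heart disease among smokers and non-smokers
z_smokers = two_proportion_z(18, 80, 15, 120)
print(round(z_smokers, 3))
```

Carrying full precision gives a value a little below 0.95 for Example 1, because the hand calculation rounds the standard error to 0.0317 before dividing. The computed Z is then compared with the critical value from the Z table that matches the number of tails and the significance level of the test.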

Activity 5.5
1. A random sample of 1,600 workers in region 1 and 1,400 workers in region 2
has been obtained to determine whether the population proportions unemployed in the
two regions are different. Perform a hypothesis test at the 5% level if the numbers
unemployed in the samples were 120 in region 1 and 84 in region 2.
2. A company is considering two different radio advertisements for the promotion of a
new product. Management believes that advertisement A is more effective than
advertisement B. Two test market areas with virtually identical consumer characteristics are
selected; advertisement A is used in one area and advertisement B in the other. In a
random sample of 60 customers who heard ad A, 18 tried the product. In a random sample of
100 customers who heard ad B, 22 tried the product. Does this indicate that ad A is more
effective than ad B, if a 0.05 level of significance is used?
3. A manufacturer of stereo cartridges finds that 60 of 100 diamond needles
and 40 of 100 emerald needles met technical specifications after 1000 hrs of play. Test the
hypothesis of no difference in the population proportions meeting specifications after this length of
playing time. Use the 0.01 level of significance.
4. Commercial Bank of Ethiopia wants to check two of its branches to see
whether accounts have been overdrawn. Of 100 accounts at the first branch, 20 had
been overdrawn; at the second branch, 30 of 200 had been overdrawn. Test the hypothesis
of no difference in the proportion overdrawn between the two branches. Let α = 0.10.

3.9 STUDENTS T – TEST


The Student t-distribution was developed by the British statistician W. S. Gosset, an employee of the
Guinness Brewery in Dublin, Ireland, who published under the pen name "Student"; the distribution is
consequently called the Student t-distribution.

The t-distribution is useful when the sample size is small and the
population standard deviation is not known. A large sample from any population can be
approximated by a normal distribution, but a small sample must come from a normal or near
normal population in order for a t-test to be used.
The t-distribution has the following characteristics:
1. Similar to the z distribution, it is a continuous distribution.
2. Similar to the z distribution, it is bell shaped and symmetrical.
3. Unlike the z distribution, it is not just one distribution but a family of
distributions.
4. The t curve is lower at the mean than the z curve.
• It is more spread out than the z curve and it is higher at the tail ends. As the
sample size increases, the t-distribution approaches the z distribution.
• The t-distribution has greater variance (spread) than the z distribution.
• The critical t scores are numerically larger than the z scores for a given
level of significance; the smaller the sample size, the larger the critical t score.
• The t-distribution is identified by its degrees of freedom (d.f.), where d.f. = n - 1.

Example 1

In order to revise the accident insurance rates for automobiles, an insurance company
wants to assess the damage caused to cars by accidents at a speed of 15 miles/hr. A sample
of 16 new cars was selected at random and the company crashed each one at a speed of
15 miles/hr. The crashed cars were repaired, and the average repair
cost was found to be birr 2,500 with a standard deviation of birr 950. Assuming that the
population distribution of repair costs under these conditions is normal, estimate, with 95%
confidence, the average damage (repair cost) to all cars due to a crash at 15 miles/hr.

Given: x̄ = 2,500, s = 950, n = 16, α/2 = 0.025

The lower and upper confidence limits, X1 and X2, are:
X1 = x̄ - t s_x̄
X2 = x̄ + t s_x̄
where the standard error of the mean is
$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{950}{\sqrt{16}} = \frac{950}{4} = 237.5$$
The value of t for 16 - 1 = 15 degrees of freedom and 2.5% can be read from the table. Go to the statistical table with the heading t
distribution, and search for 15 in the degrees of freedom column (the first column) and
α = 0.025 in the level of significance row (the first row of the table). Then find the intersection
of the two, which is read as t15, 0.025 = 2.131.
X1 and X2 can now be calculated using the formulas given above:
X1 = x̄ - t s_x̄
X1 = 2,500 - 2.131(237.5)
   = 1,993.89
For X2,
X2 = x̄ + t s_x̄
X2 = 2,500 + 2.131(237.5)
   = 3,006.11
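
The interval in Example 1 can be reproduced as follows. This is a small sketch under the assumption that the critical t value (2.131 for 15 degrees of freedom and 0.025 in each tail) is read from the t table and typed in by hand; the function name `t_confidence_interval` is hypothetical.

```python
import math

def t_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence limits for a mean when the population standard deviation is unknown."""
    standard_error = sample_sd / math.sqrt(n)   # s divided by the square root of n
    margin = t_critical * standard_error
    return sample_mean - margin, sample_mean + margin

# Example 1: repair cost of 16 crashed cars, t(15, 0.025) = 2.131 from the table
lower, upper = t_confidence_interval(2500, 950, 16, 2.131)
print(round(lower, 2), round(upper, 2))
```

With the table value 2.131 the limits come out at roughly birr 1,994 and birr 3,006.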

Example 2
A gas station repair shop claims that it can do a lubrication job and oil change in 30
minutes. The customer protection department wants to test this claim. A sample of six cars
was sent to the station for an oil change and lubrication. The job took an average of 34
minutes with a standard deviation of 4 minutes. The claim is to be tested at α = 0.05.

Solution
1. State the hypotheses
H0: µ = 30 minutes
H1: µ > 30 minutes
2. Determine which test statistic to use
As inferred from the problem, the sample size is 6 and the population standard deviation is
unknown. Therefore the relevant test statistic is t.
3. Set the decision rule
To set the decision rule we need two parameters: the significance level and the degrees of
freedom. The degrees of freedom for the t-distribution are calculated as n - 1; since
six cars were taken as a sample, the degrees of freedom are 6 - 1 = 5. We are conducting a
one-tail test, so the whole significance level of 0.05 is placed in one tail.
t5, 0.05 = 2.015. Hence, the decision rule is: reject H0 if the calculated t ≥ 2.015.
4. Calculate the t value
The calculated t is given by the formula:
$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}} = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{34 - 30}{4 / \sqrt{6}} = \frac{4}{1.63} = 2.45$$
5. Decision
Based on the decision rule and the calculated t, we have to decide either to accept or
to reject the null hypothesis. In our case the calculated t (2.45) is greater than the table value (2.015).
Hence the decision is to reject the null hypothesis.
6. Interpretation
The job of lubrication and oil change takes more than 30 minutes to complete.
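
The calculation in step 4 can be verified with a short script. It is only an illustrative sketch: `one_sample_t` is a made-up name, and the critical value 2.015 is the table value for 5 degrees of freedom quoted in step 3.

```python
import math

def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t statistic for a single mean when the population standard deviation is unknown."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Six cars, mean 34 minutes, standard deviation 4 minutes, claim of 30 minutes
t_calculated = one_sample_t(34, 30, 4, 6)
degrees_of_freedom = 6 - 1
T_CRITICAL = 2.015   # one-tail value for 5 d.f. at the 0.05 level, from the t table

print(round(t_calculated, 2), degrees_of_freedom)
print(t_calculated >= T_CRITICAL)   # True means reject H0
```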

Activity 5.6
1. A nationwide survey indicates that children spend an average of 23 hours
per week watching television. A city councilwoman wishes to determine whether the time
that the children in her district spend watching television is significantly different from 23
hours. She obtains a random sample of the time spent watching television in a week for 25
children in her district. A summary of the results in hours per week is x̄ = 20 and s = 8.9.
Assume that the random variable is normally distributed. Conduct the appropriate test
at α = 0.05.
2. An accountant uses a sampling procedure in auditing clients' statements of
accounts payable for possible monetary errors in the balance payable. A random sample of
16 accounts is selected, the balance payable on each is confirmed, and the sample results
are used to test the null hypothesis that the average monetary error for the population of
accounts, µ, does not exceed birr 50. The accountant uses α = 0.01. Assume that the
monetary errors in the accounts are normally distributed. For the sample of 16 accounts,
x̄ is birr 56 and s = 8.24. Does this indicate that µ does not exceed birr 50?
3. The state personnel department believes that the average number of days of
sick leave requested annually by employees is 8. A random sample of 15 employees'
records is selected. The sample results are x̄ = 5 and s = 2.1 days. Assume that the variable
is normally distributed. Does the sample result differ significantly from the belief of 8 days, if α =
0.05?
4. Awash Winery states that the volume of wine in its standard size bottles
averages 750 ml. A state Alcoholic Beverage Control (ABC) Board examines a random sample of 17
of these bottles, finding an average volume of 721 ml and a standard deviation of 48 ml.
Does the ABC Board have any reason to suspect that the average volume in all these bottles
is less than 750 ml? Volume is normally distributed; let α = 0.01.
5. The manager of a high-rise condominium development states to his lender
that the average family income of his tenants is birr 42,000. Since the lender also holds
mortgages on a large number of these units, a sample of reported family incomes can be easily
obtained. A random sample of 20 files finds an average family income of x̄ = 36,000 and a
standard deviation of s = 16,000. Assume that family income is normally distributed. Has the
manager overstated average family income? Use α = 0.01.

3.10 A DIFFERENCE OF TWO MEANS WHEN SAMPLE SIZE IS SMALL AND STANDARD DEVIATION UNKNOWN

When we have two normally distributed populations whose standard deviations are
unknown but equal, and the independent samples used in testing the means each contain 30
or fewer observations, the test statistic to be used is Student's t.

Using the Student t-distribution, the decision rule is read from the statistical table titled
t-distribution at the end of this module: t α,v for a one-tail test or t α/2,v for a two-tail test, where
v = degrees of freedom = n1 + n2 - 2
The sample t statistic is calculated as:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Example
It is desired to find out whether there is any significant difference in the average amount of
money carried by male and female students of Arbaminch University. A random sample of
8 male and 10 female students was selected and the amount of money each had was
recorded. Test, at the 5% level of significance, whether there is any significant difference in the average
amount of money carried by male and female students.

                          Male          Female
Sample size               n1 = 8        n2 = 10
Sample mean               x̄1 = 20.5     x̄2 = 17
Standard deviation        s1 = 2        s2 = 1.5

Solution
Steps
1. State the hypotheses
H0: µ1 = µ2
H1: µ1 ≠ µ2
2. Determine which test statistic is to be used. As we can see clearly, the sample
sizes are small and the population standard deviations are unknown. Therefore, the appropriate
test statistic is t.
3. Develop the decision rule
To develop the decision rule we have to calculate the degrees of freedom and read the value of t
for the specific significance level. The degrees of freedom when we have two samples
are given by:
(n1 - 1) + (n2 - 1)
Hence, the degrees of freedom are (8 - 1) + (10 - 1) = 7 + 9 = 16
The significance level is given as 5%; however, since we have a two-tail test, alpha
is divided between the two tails: 5%/2 = 2.5%
Therefore, t16, 0.025 = 2.12
This implies that the decision rule is: accept H0 if -2.12 ≤ calculated t ≤ 2.12.
4. Calculate the t value
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sigma_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
$$t = \frac{20.5 - 17}{\sqrt{\dfrac{(8 - 1)2^2 + (10 - 1)1.5^2}{8 + 10 - 2}\left(\dfrac{1}{8} + \dfrac{1}{10}\right)}} = \frac{3.5}{0.824} = 4.25$$

5. Decision
The calculated t (4.25) lies outside the acceptance region from -2.12 to 2.12, which implies that
H0 must be rejected.
6. Interpretation
The average amounts of money carried by female students and male students differ
significantly.
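
The pooled-variance calculation in this example is easy to get wrong by hand, so a short check may help. The sketch below assumes equal population variances, as the section does; the name `pooled_t` is hypothetical, and 2.12 is the table value for 16 degrees of freedom used in step 3.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and degrees of freedom for two small independent samples."""
    degrees_of_freedom = n1 + n2 - 2
    # Pooled estimate of the common population variance
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / degrees_of_freedom
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, degrees_of_freedom

# Money carried by 8 male and 10 female students
t_money, df_money = pooled_t(20.5, 2, 8, 17, 1.5, 10)
print(round(t_money, 2), df_money)
print(abs(t_money) > 2.12)   # True means reject H0 in the two-tail test
```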
Activity 5.7
1. In an effort to promote energy conservation through car pooling by
employees, a company is considering the institution of a rule at all plants requiring at least
three passengers in each car that is allowed free parking. Parking attendants at the South
Side plant have provided the results of a random sample of 15 cars. A random sample of 12
cars is obtained at the West End plant. Letting X1 and X2 represent the passengers per car
at South Side and West End, respectively, the results are:
x̄1 = 1.8     x̄2 = 2.9
s1 = 1.5     s2 = 1.6
Is there a difference in the average number of passengers per car for all cars parking at
these two plants? Assume that the number of passengers per car is normally distributed.
Use α = 0.10.
2. The research department of a historical society has developed a chemical
that they claim will lengthen the life span of papers treated with it. Before
agreeing to allow some old and valuable manuscripts to be treated with the chemical,
the society's governing board requested statistical evidence that paper life is
lengthened by the chemical treatment. Some identical papers are selected for comparison.
Twelve sheets are randomly selected and treated with the chemical. Nine sheets are left
untreated. Then all papers are aged artificially by an oven process. After aging, the papers
are tested for tear resistance by a machine that precisely measures the force required to tear
the papers. The force required to tear the treated papers averaged 0.052 grams with a
standard deviation of 0.015 grams; for the untreated papers the average tear force was 0.036
grams with a standard deviation of 0.01 grams. Does the sample evidence indicate that
treatment with the chemical actually improves paper life as measured by the tear force?
Let α = 0.01. Assume tear force is normally distributed.
3. A costly road test of random samples of cars was carried out to determine
whether mean mileage is greater for model 1 cars than for model 2 cars. The sample data were as
follows:
Model 1:   n1 = 8     x̄1 = 26      s1 = 1.4
Model 2:   n2 = 10    x̄2 = 23.6    s2 = 1.2
Perform a hypothesis test at the 5% level.
4. Dataflow, a large computer manufacturer, is considering whether method
1 or method 2 should be adopted for training technicians who must find and correct
problems arising in its computer systems. Method 1 relies heavily on a checklist process
of elimination, while method 2 concentrates more on teaching the fundamentals of the
functions and interrelations of component parts. A random sample of 10 workers is
trained by method 1 and another random sample of 10 workers is trained by method 2.
After training, the technicians in each sample are exposed to the same series of problems. The
sample mean times to correct the problems will be used to determine whether the mean
times to solve a problem by the two methods are different. Perform a hypothesis test at the
5% level if the method 1 sample has a mean of 3.2 hours and a standard deviation of 0.6
hours, whereas the method 2 sample has a mean of 3.8 hours and a standard deviation of
0.5 hours.

UNIT SUMMARY

Whenever we have a decision to make about a population characteristic, we make a
hypothesis. Some examples are:

µ > 3 or µ ≠ 5.

Suppose that we want to test the hypothesis that µ ≠ 5. Then we can think of our
opponent suggesting that µ = 5. We call the opponent's hypothesis the null hypothesis and
write:

H0: µ = 5

and we call our hypothesis the alternative hypothesis and write:

H1: µ ≠ 5

For the null hypothesis we always use equality, since we are comparing µ with a
previously determined mean.

For the alternative hypothesis, we have the choices: <, >, or ≠.

Procedures in Hypothesis Testing

When we test a hypothesis we proceed as follows:

1. Formulate the null and alternative hypotheses.
2. Choose a level of significance.
3. Determine the sample size (same as for confidence intervals).
4. Collect data.
5. Calculate the z (or t) score.
6. Use the table to determine whether the z (or t) score falls within the acceptance region.
7. Decide to
   a. Reject the null hypothesis and therefore accept the alternative hypothesis, or
   b. Fail to reject the null hypothesis and therefore state that there is not
      enough evidence to suggest the truth of the alternative hypothesis.

Errors in Hypothesis Tests

We define a type I error as the event of rejecting the null hypothesis when the null
hypothesis is true. The probability of a type I error (α) is called the significance level.

We define a type II error (with probability β) as the event of failing to reject the null
hypothesis when the null hypothesis is false.
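
The meaning of the significance level can be illustrated by simulation. The sketch below is an assumption-laden illustration, not part of the module's method: it repeatedly draws samples from a normal population for which H0: µ = 5 is true (with a known standard deviation of 1, chosen arbitrarily), applies a two-tail z test with critical value 1.96, and counts how often H0 is wrongly rejected. All names and settings are hypothetical.

```python
import math
import random

def type_one_error_rate(trials=5000, n=30, mu0=5.0, sigma=1.0, z_critical=1.96, seed=1):
    """Estimate how often a two-tail z test rejects a null hypothesis that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 really holds
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - mu0) / (sigma / math.sqrt(n))
        if abs(z) > z_critical:   # z falls outside the acceptance region
            rejections += 1
    return rejections / trials

rate = type_one_error_rate()
print(rate)   # expected to land near the significance level of 0.05
```

The proportion of wrong rejections should settle close to 0.05, which is the probability of a type I error built into the critical value 1.96.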

SELF CHECK EXERCISE 3
1. East Africa Bottling compilations showed that Coca-Cola had a 42% share of the total soft drink
market in the past year. East Africa Bottling has made some modifications in its
marketing policy and wants to determine whether its market share has changed from 42%.
Perform a hypothesis test at the 1% level of significance if 180 of a random sample of 400
customers consume Coca-Cola.
2. Suppose a sugar refinery ships 454-gram boxes of sugar to wholesalers in railroad
carload lots. The weights of the boxes are normally distributed and have a standard deviation of
σx = 8 grams. Before a carload is shipped, a random sample of n = 25 boxes is weighed and
the mean is found to be 440 grams. Perform a hypothesis test at the 2% level to
determine whether the mean weight for the carload is less than 454 grams.
3. When properly adjusted, an automatic machine should produce parts that have a
mean diameter of 25 millimetres (mm). Part diameters are normally distributed. The mean
diameter of a sample of 10 parts is to be used to check whether or not the machine is
running properly. Perform a hypothesis test at the 5% level if the mean of the sample is 25.02
mm and the sample standard deviation is 0.024 mm.
4. A rancher wants to test two food mixtures, mix 1 and mix 2, on random samples of
steers to determine whether there is a difference in mean weight gains for the mixtures. The
sample mean weight gains x̄1 and x̄2 and other sample data are:

Mix 1:   n1 = 12 steers    x̄1 = 140 pounds    s1 = 6 pounds
Mix 2:   n2 = 15 steers    x̄2 = 124 pounds    s2 = 9 pounds

Perform the hypothesis test at the 2% level.
5. Random samples of 500 men and 500 women have been selected to determine whether
the proportion of women favouring a political candidate is greater than the proportion of men
favouring the candidate. Carry out a hypothesis test at the 1% level if, in the samples, 40
women and 20 men favour the candidate.

UNIT FOUR
_____________________________________________________________________
CHI- SQUARE DISTRIBUTION
_____________________________________________________________________
Unit Content
• General characteristics of the chi-square distribution
• Test of independence and cell counts for independence
• Goodness of fit test
• Goodness of fit test: uniform distribution
• Goodness of fit test: binomial distribution
• Goodness of fit test: Poisson distribution
• Goodness of fit test: normal distribution
Unit Objective
This unit enables the learners to:
• Understand the chi-square distribution and its general characteristics
• Use the chi-square distribution to test whether two events are independent
• Use the chi-square distribution to test whether a given distribution follows a uniform distribution
• Use the chi-square distribution to test whether a given distribution follows a binomial distribution
• Use the chi-square distribution to test whether a given distribution follows a Poisson distribution
• Use the chi-square distribution to test whether a given distribution follows a normal distribution

4.1 CHAPTER INTRODUCTION


Dear learner, in the previous chapter we saw how to test a hypothesis, that is, how to
test whether an unknown population parameter is statistically equal to a stated value,
using a statistic calculated from a selected sample. Bear in mind that the population
parameter is not known, but it is stated to be a given amount, which is to be proved or
disproved.
In this chapter we are going to discuss the technique used to test whether two variables are
independent. Dear student, do you remember what we mean by independent events? In
this chapter, we are also going to test whether a given distribution follows a stated pattern. For
example, we are going to test whether a distribution is normal, uniform, binomial, or Poisson.

4.2 GENERAL CHARACTERISTICS OF CHI SQUARE DISTRIBUTION
Chi square is a continuous distribution ordinarily derived as the sampling distribution of a
sum of squares of independent standard normal variables. Let
$$Z_i = \frac{X_i - \mu}{\sigma_x}$$
where $Z_i$ is normally distributed with a mean of zero and a variance of 1. If
$$Y = Z_1^2 + Z_2^2 + Z_3^2 + \dots + Z_n^2 = \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma_x}\right)^2$$
then this new variable is distributed as $\chi^2$ with n degrees of freedom.
This can be rewritten as:
$$\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma_x^2} = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma_x^2} + \frac{n(\bar{X} - \mu)^2}{\sigma_x^2}$$
where
$$S_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \qquad \text{so that} \qquad \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma_x^2} = \frac{(n - 1)S_x^2}{\sigma_x^2}$$
$\bar{X}$ is normally distributed with mean $\mu$ and variance $\sigma_x^2 / n$, so
$$\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma_x}$$
is normally distributed with mean zero and variance 1. It can be shown that
$$\frac{n(\bar{X} - \mu)^2}{\sigma_x^2}$$
has a $\chi^2$ distribution with 1 degree of freedom, and that
$$\frac{(n - 1)S_x^2}{\sigma_x^2}$$
has a $\chi^2$ distribution with (n - 1) degrees of freedom.

The variable $\chi^2$ cannot be negative. Therefore, $\chi^2$ curves do not extend to the left of
zero. When the number of degrees of freedom, $\nu$, is more than two, $\chi^2$ curves have one mode and are skewed to the right.
The skewness becomes less apparent as $\nu$ increases, and the distribution approaches the normal as $\nu$ tends to infinity.
When $\nu$ is very large, the $\chi^2$ distribution is almost the same as a normal distribution having a
mean equal to $\nu$ and a standard deviation equal to $\sqrt{2\nu}$.

The significance level, α, will be a right tail area of the $\chi^2$ distribution. Therefore the symbol
$\chi^2_{\alpha,\nu}$ means the value of $\chi^2$ such that the distribution with $\nu$ degrees of freedom has a
right tail area of α.
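
The definition above (a sum of squared independent standard normal variables) can be explored by simulation. The following sketch is illustrative only: it draws 20,000 such sums with 5 degrees of freedom, a number chosen arbitrarily, and checks that the values are never negative and that their mean and variance are close to ν and 2ν. The function name and settings are hypothetical.

```python
import random
import statistics

def simulate_chi_square(degrees_of_freedom, draws, seed=0):
    """Simulate chi-square values as sums of squared standard normal variables."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0, 1) ** 2 for _ in range(degrees_of_freedom))
        for _ in range(draws)
    ]

values = simulate_chi_square(degrees_of_freedom=5, draws=20000)

print(min(values) >= 0)                        # the variable cannot be negative
print(round(statistics.mean(values), 2))       # should be near 5 (the degrees of freedom)
print(round(statistics.variance(values), 2))   # should be near 10 (twice the degrees of freedom)
```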

Uses of chi-square distribution
Dear learner, so far we have been discussing the characteristics of the chi-square distribution.
Let us now see why it is used. Earlier we assumed the probability distribution
of the population. For example, we said "assume the population follows a normal
distribution, a binomial distribution, a Poisson distribution" and the like. But a very serious
question that we need to answer is how we know whether the population is in fact
normally distributed, and so on.
A chi-square distribution can be used to test whether a population follows one or another
distribution. In addition, the chi-square distribution can be used to test whether two variables are
independent.
Dear learner, below we will see how to check whether two variables are independent. We will then
see how to test whether a population distribution follows a particular form of distribution.
4.3 TEST FOR INDEPENDENCE AND CELL COUNTS FOR TEST OF
INDEPENDENCE
Brainstorming question
What are independent events?

To test for independence of two variables, we work with count data that are arranged
in rows and columns. There is a row category and a column category, as illustrated
below.

                I      II     III    Row total
A                                    200
B                                    300
Column total    120    200    180    500

What would the count in each cell be if the categories were independent?
Dear student, do you remember what independent events are? If your answer is "two
events are independent if the occurrence of one does not affect, and is not
affected by, the occurrence of the other", you are correct. Excellent indeed!
If two events A and B are independent, then P(A|B) = P(A).
In the above example, the probability of A given I is, therefore,
$$P(A \mid I) = \frac{P(A \cap I)}{P(I)} = P(A)$$
whereas the probability of A is
$$P(A) = \frac{\text{Row A total}}{\text{Grand total}}$$
In terms of counts, this means
$$\frac{A \cap I}{\text{Column I total}} = \frac{\text{Row A total}}{\text{Grand total}}$$
so that, for any cell,
$$\text{Cell count} = \frac{\text{Cell row total} \times \text{Cell column total}}{\text{Grand total}}$$
$$A \cap I = \frac{\text{Row A total} \times \text{Column I total}}{\text{Grand total}} = \frac{200 \times 120}{500} = 48$$
$$A \cap II = \frac{\text{Row A total} \times \text{Column II total}}{\text{Grand total}} = \frac{200 \times 200}{500} = 80$$
$$A \cap III = \frac{\text{Row A total} \times \text{Column III total}}{\text{Grand total}} = \frac{200 \times 180}{500} = 72$$
Similarly,
B∩I = 72     B∩II = 120     B∩III = 108
The complete set of entries calculated above, reproduced in the table below, is called the
expected frequencies, fe.

                I      II     III    Row total
A               48     80     72     200
B               72     120    108    300
Column total    120    200    180    500
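
The cell-count rule (row total times column total, divided by the grand total) can be applied to every cell at once. The following is a minimal sketch with a hypothetical function name, using the row totals 200 and 300 and the column totals 120, 200 and 180 from the illustration above.

```python
def expected_frequencies(row_totals, column_totals):
    """Expected cell counts under independence: row total x column total / grand total."""
    grand_total = sum(row_totals)
    if grand_total != sum(column_totals):
        raise ValueError("row totals and column totals must give the same grand total")
    return [
        [row_total * column_total / grand_total for column_total in column_totals]
        for row_total in row_totals
    ]

fe_table = expected_frequencies([200, 300], [120, 200, 180])
for row in fe_table:
    print(row)
```

Each row of the output should add up to its row total and each column to its column total, which is a quick way to catch arithmetic slips in a hand-built table.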

Determination of degrees of freedom for chi-square

Once the row and column totals are fixed, only some of the cell counts need to be computed
by the rule above; the rest can be found by subtraction from the totals.
The number of counts that must be obtained by the rule is the number of rows minus one
times the number of columns minus one. For the table above:
(2-1)(3-1) = 2

The actual sample counts are called observed frequencies and are denoted by fo. The two
frequencies, fo and fe, are used to compute a sample statistic for testing the hypothesis that
the row and column categories are independent. The underlying idea is that, for the
categories to be independent, the observed frequencies should be close to the expected
frequencies.
If the differences (fo - fe) are large, then we reject the hypothesis of independence.
$$\text{Sample } \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
The distribution of the $\chi^2$ computed from a contingency table is approximated by a chi-square
distribution with v degrees of freedom, where:
v = (r-1)(c-1)
The chi-square approximation is satisfactory if the expected cell frequencies are not too
small, i.e. fe ≥ 5.

If fe < 5, combine adjacent rows or columns in the contingency table to get fe values of at
least 5 before computing the sample $\chi^2$; the degrees of freedom (v) are then computed
after combining the rows or columns.

Steps followed
1. State the hypotheses
H0: The row and column categories are independent
Ha: The row and column categories are not independent
2. State the decision rule
Reject H0 if sample $\chi^2 > \chi^2_{\alpha,v}$
where α = significance level of the test,
v = (r-1)(c-1),
and $\chi^2_{\alpha,v}$ is found from the statistical table.
3. Compute the sample $\chi^2$
$$\text{Sample } \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
4. Accept or reject H0 based on the decision rule

Example

David Gallano, a wine merchant, has collected opinions on grape wine quality from a
random sample of his customers. The customers tasted wines made from grapes grown in
three regions of the country; they rated wine quality on a scale of 1 (best) to 4. The sample
data are given in the table below. David wants to know whether quality ratings are or are
not independent of the grape growing region. The test for independence is to be made at the
5% level.

Customer quality ratings for wine, by growing region
Quality rating    I     II    III    Row total
1                 15    10    6      31
2                 7     13    12     32
3                 11    12    8      31
4                 3     8     15     26
Column total      36    43    41     120

Solution
Steps
1. State the hypotheses
H0: Quality rating is independent of growing region
Ha: Quality rating is not independent of growing region
2. Develop the decision rule
v = (4-1)(3-1) = 3(2) = 6
$\chi^2_{\alpha,v} = \chi^2_{0.05,6} = 12.592$
Reject H0 if sample $\chi^2$ > 12.592
3. Calculate the sample $\chi^2$
Calculate the expected frequencies and use the formula (fo - fe)²/fe to calculate the sample chi
square. To calculate the expected frequencies we use the cell count rule. For example,
for the first cell the expected frequency is (31 x 36)/120 = 9.3. Calculate the
rest in the same way. We get the following results.

fo     fe      fo - fe    (fo - fe)²    (fo - fe)²/fe
15     9.3     5.7        32.49         3.4935
10     11.1    -1.1       1.21          0.1090
6      10.6    -4.6       21.16         1.9962
7      9.6     -2.6       6.76          0.7042
13     11.5    1.5        2.25          0.1957
12     10.9    1.1        1.21          0.1110
11     9.3     1.7        2.89          0.3108
12     11.1    0.9        0.81          0.0730
8      10.6    -2.6       6.76          0.6377
3      7.8     -4.8       23.04         2.9538
8      9.3     -1.3       1.69          0.1817
15     8.9     6.1        37.21         4.1809
Sample $\chi^2$ = 14.9475

4. Decision
Since the sample $\chi^2$ (14.9475) > 12.592, reject H0.

Interpretation
Quality rating is not independent of growing region, since the calculated chi square is greater
than the critical value.
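
The whole test for the wine example can be run from the observed counts alone. The sketch below is an illustration with a hypothetical function name. Two assumptions are worth stating: the last cell for rating 4 is entered as 15, the value used in the computation table and implied by the row and column totals, and the critical value 12.592 is typed in from the chi-square table rather than computed.

```python
def chi_square_independence(observed):
    """Sample chi-square and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * column_totals[j] / grand_total   # expected frequency
            statistic += (fo - fe) ** 2 / fe

    degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, degrees_of_freedom

# Rows are quality ratings 1 to 4; columns are growing regions I, II, III
wine_ratings = [
    [15, 10, 6],
    [7, 13, 12],
    [11, 12, 8],
    [3, 8, 15],
]
CRITICAL_VALUE = 12.592   # chi-square table value for 6 d.f. at the 5% level

chi_square, df = chi_square_independence(wine_ratings)
print(round(chi_square, 2), df)
print(chi_square > CRITICAL_VALUE)   # True means reject H0
```

Because the script does not round the expected frequencies to one decimal place, its statistic differs slightly from the hand calculation (it comes out a little higher), but the decision is the same.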

Activity 6.1
1. Sheraton Addis has rooms at high, average, and low price levels. The owner
advertises high quality service in all rooms. Opinions of the service obtained from a
random sample of guests are given in the table below. Are guest ratings independent of
price? Test at the 1% level.

Guest service rating    Room price
                        High    Average    Low
Excellent               14      25         11
Good                    20      66         14
Poor                    6       19         5

2. Vota Company makes equipment for controlling the operations of machines.
Vota equipment defects are of four types: A, B, C, and D. Vota operates three shifts each day,
shifts 1, 2, and 3. Data for random samples of defects by shift and type of defect are given
in the following table. Perform, at the 2.5% level, a test of the hypothesis that defect type and
shift are independent.

Shift    Defect type
         A     B     C     D
1        12    30    10    16
2        16    22    11    8
3        14    12    3     6

3. The credit manager of Plaza Stores obtained data for a random sample of credit
customers and recorded the data in the table below. Perform, at the 5% level, a
test of the hypothesis that time to pay is independent of residence region.

Days until payment    Customer residence region
of last bill          Urban    Suburban    Rural
Under 15              31       61          28
15-30                 72       87          41
Over 30               37       32          11

4.4 TESTING THE EQUALITY OF MORE THAN TWO POPULATION PROPORTIONS
The chi-square distribution can be used to test the equality of two or more population proportions. The test for equal proportions is carried out in exactly the same way as the test for independence. In fact, the test for independence is the same as a test that, in any row, the ratios of cell frequencies to column totals (pr) are equal and, in any column, the ratios of cell frequencies to row totals are equal:

In any row:      cell frequency / column total   must be equal across columns
In any column:   cell frequency / row total      must be equal across rows

The hypotheses for a test of equal proportions can be stated either in terms of equal row proportions or in terms of equal column proportions.
We will use the row proportions (pr) in our hypotheses:

H0: the proportions pr in any row are equal
Ha: the proportions pr in at least one row are not equal

Example

The following table contains counts for a random sample of 200 workers. It shows, for example, that 12 workers who had not gone to high school were rated as satisfactory by their supervisor. We want to test (at the α = 0.05 level) the hypothesis that the proportions of satisfactory workers in education levels 1, 2, and 3 are equal.

                                  Education level
Supervisor          No high     High school but     Completed       Row
rating              school      not complete        high school     total
Satisfactory          12             63                 65          140
Not satisfactory       8             17                 35           60
Column total          20             80                100          200

Solution
Steps to solve the question
1. Hypotheses
Ho: The cell proportions pr in any row are equal
Ha: The cell proportions in at least one row are not equal
2. Decision Rule
V = (2-1)(3-1) = 2
With α = 0.05 and 2 degrees of freedom, χ²(0.05, 2) = 5.991
Therefore, the decision rule will be:
Reject Ho if sample χ² > 5.991
3. Calculate sample χ²
Next we must compute expected frequencies. Expected frequencies are calculated in the same way as in the test of independence. For example, for the first row and first column, the expected frequency is:

$$ f_e = \frac{(\text{cell row total})(\text{cell column total})}{\text{grand total}} = \frac{140(20)}{200} = 14 $$

Likewise expected frequencies for all cells can be calculated. Dear learner, calculate the
expected frequency for all the remaining cells before you refer to the following table.
There are two numbers in each cell in the table below. The first number is observed
frequency and the number to the right of the observed frequency is expected frequency.

Supervisor          No high school    High school but    Completed         Row total
rating                                not complete       high school
                    fo      fe        fo      fe         fo      fe
Satisfactory        12      14        63      56         65      70        140
Not satisfactory     8       6        17      24         35      30         60
Column total        20                80                 100               200

Sample χ² can now be calculated as follows:

fo      fe      fo-fe   (fo-fe)²   (fo-fe)²/fe
12      14      -2       4         0.2857
63      56       7      49         0.8750
65      70      -5      25         0.3571
 8       6       2       4         0.6667
17      24      -7      49         2.0417
35      30       5      25         0.8333
Calculated χ² = Σ(fo − fe)²/fe = 5.0595

Sample χ² = 5.0595
The sample χ² does not exceed the critical value χ²(0.05, 2) = 5.991.

4. Decision
Accept Ho since the sample χ² is less than the critical value.
Interpretation
• The proportion of workers rated satisfactory is the same for all three education levels.
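
The expected frequencies and the sample χ² of this example can be reproduced programmatically. The sketch below is illustrative only: the function name `chi_square_table` is a hypothetical helper, not something defined in this module, and the critical value 5.991 is taken from the decision rule above.

```python
def chi_square_table(observed):
    """Return (expected, chi_square, dof) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # fe = (row total)(column total) / grand total for every cell.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi_square = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return expected, chi_square, dof


# Rows: satisfactory, not satisfactory. Columns: the three education levels.
ratings = [[12, 63, 65], [8, 17, 35]]
expected, chi_square, dof = chi_square_table(ratings)

critical_value = 5.991  # chi-square table value for alpha = 0.05, 2 d.f.
print(expected, round(chi_square, 4), dof, chi_square > critical_value)
```

The same helper applies to a test of independence, since both tests compute expected frequencies from the row and column totals in the same way.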

4.5 GOODNESS OF FIT TESTS


Goodness of fit tests use sample data as a basis for accepting or rejecting assumptions about a population distribution. A goodness of fit test is performed by first computing the frequencies that would be expected if Ho is true; then the differences between the frequencies observed in the sample (fo) and the expected frequencies (fe) are used to calculate:

$$ \text{Sample } \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} $$

A goodness of fit test differs from a test of independence in:
1. The method used to compute expected frequencies
2. The rule for determining the number of degrees of freedom

In a goodness of fit test the method for calculating the expected frequencies depends on the population assumptions that are made; the number of degrees of freedom in a goodness of fit test is:
V = ne - 1 - g
Where ne = number of fe values used in computing the sample χ²
g = number of population parameters estimated from the sample.

4.5.1 GOODNESS OF FIT TEST FOR UNIFORM DISTRIBUTION

A goodness of fit test for a uniform distribution tests whether or not the population follows a uniform distribution. A uniform distribution is the case where all expected frequencies are equal.
Example

A winning number in the Massachusetts lottery is a four-digit number such as 1084 or 1416. The digits in a winning number are assumed to be drawn at random. It is assumed that the winning digit population has a uniform distribution; that is, each of the N = 10 integers has a probability of 1/10 (1/N) of being selected for each place in a winning number. Does the winning digit population have a uniform distribution?
The following table gives sample data of 400 observations.
Digit     Observed
0          41
1          54
2          31
3          39
4          35
5          36
6          56
7          38
8          31
9          39
Total     400

Solution
1. State the hypotheses
Ho: The distribution is uniform
Ha: The distribution is not uniform
2. Calculate expected frequencies and sample χ²

$$ f_e = \frac{n}{N} = \left(\frac{1}{N}\right) n = \frac{400}{10} = 40 $$

Digit   Observed   Expected   fo-fe   (fo-fe)²   (fo-fe)²/fe
0        41         40          1        1         1/40
1        54         40         14      196       196/40
2        31         40         -9       81        81/40
3        39         40         -1        1         1/40
4        35         40         -5       25        25/40
5        36         40         -4       16        16/40
6        56         40         16      256       256/40
7        38         40         -2        4         4/40
8        31         40         -9       81        81/40
9        39         40         -1        1         1/40
Σ(fo − fe)²/fe = 662/40

V = ne - 1 - g
  = 10 - 1 - 0 = 9
3. Decision Rule
χ²(0.05, 9) = 16.919
Reject Ho if sample χ² > 16.919
4. Sample χ² = 662/40 = 16.55
5. Decision: accept Ho, since 16.55 < 16.919.
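
The uniform goodness of fit calculation can be sketched in code as follows. This is an illustration under the example's own assumptions: the observed counts are copied from the digit table (they add up to 400), no parameter is estimated from the sample (g = 0), and 16.919 is the table value quoted above.

```python
# Observed counts for digits 0..9, copied from the example.
observed = [41, 54, 31, 39, 35, 36, 56, 38, 31, 39]

n = sum(observed)   # total number of observed digits
N = len(observed)   # number of equally likely categories
fe = n / N          # under a uniform distribution every fe is n/N

chi_square = sum((fo - fe) ** 2 / fe for fo in observed)

g = 0               # no population parameter estimated from the sample
dof = N - 1 - g

critical_value = 16.919  # chi-square table value for alpha = 0.05, 9 d.f.
reject_h0 = chi_square > critical_value

print(n, fe, round(chi_square, 2), dof, reject_h0)
```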

4.5.2 Goodness of fit test for Binomial distribution

A football fan keeps track of a football betting pool in her company. In each bet, a player has to pick the winners of 10 games. In the last season 1,000 bets were placed.
The number of correct picks is tallied in column 2.
Number of correct picks     Number of bets (fo)
0 2
1 8
2 39
3 123
4 207
5 250
6 203
7 115
8 40
9 13
10 0

Solution
To test whether or not the above distribution is binomial, we calculate the expected frequencies assuming that the number of correct picks follows a binomial distribution. We then compare the expected frequencies with the observed frequencies and see how significant the difference is.
The steps for a goodness of fit test for the binomial distribution are similar to those for the uniform distribution.

Number of correct    Number of bets    Expected       Expected
picks                (fo)              probability    frequency (fe)
0                      2               0.001            1
1                      8               0.010           10
2                     39               0.044           44
3                    123               0.117          117
4                    207               0.205          205
5                    250               0.246          245
6                    203               0.205          205
7                    115               0.117          117
8                     40               0.044           44
9                     13               0.010           10
10                     0               0.001            1

Dear learner, do you remember what a binomial distribution is and how to calculate the probability of r occurrences if we are given the probability of success? Consult the material on probability distributions and refresh your memory before carrying out these probability calculations. The probabilities of the events are given in the above table and are calculated using the following formula.

$$ P(r) = \frac{n!}{r!\,(n-r)!}\, p^{r} q^{\,n-r} $$

Where n = 10
p = 0.5
q = 0.5
r = number of correct picks
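
The binomial probabilities in the table can be generated directly from this formula. The snippet below is a small sketch using Python's standard `math.comb` for n!/(r!(n-r)!); the variable names are illustrative. Note that the exact expected frequency for 5 correct picks comes out at about 246.1, whereas the table lists 245.

```python
from math import comb

n, p = 10, 0.5        # 10 games per bet, probability 0.5 of a correct pick
q = 1 - p
total_bets = 1000

# P(r) = C(n, r) * p^r * q^(n - r) for r = 0, 1, ..., n
probabilities = [comb(n, r) * p**r * q ** (n - r) for r in range(n + 1)]

# Expected frequency = probability * number of bets
expected = [total_bets * prob for prob in probabilities]

for r, (prob, fe) in enumerate(zip(probabilities, expected)):
    print(r, round(prob, 3), round(fe, 1))
```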
Step 1: State the hypotheses
Ho: The distribution follows a binomial distribution with p = 0.5
Ha: The distribution does not follow a binomial distribution with p = 0.5
Step 2: Calculate degrees of freedom
To calculate the degrees of freedom we need to check the values in the expected frequency column. Each expected frequency must be greater than or equal to five. However, in the above table some expected frequencies (the first and the last) are less than five. Therefore we need to merge cells so that every expected frequency is at least five. Merging can be done by combining the first two rows and combining the last two rows.
Then, the degrees of freedom can be calculated as V = K-1-g
Where: K = number of categories
g = number of population parameters estimated from the sample.


The above operation now yields the following summarized table.

Number of correct    Number of bets    Expected       Expected
picks                (fo)              probability    frequency (fe)
0 & 1                 10               0.011           11
2                     39               0.044           44
3                    123               0.117          117
4                    207               0.205          205
5                    250               0.246          245
6                    203               0.205          205
7                    115               0.117          117
8                     40               0.044           44
9 & 10                13               0.011           11

The degrees of freedom can now be calculated using the formula stated above (V = K-1-g).
In the above table we have nine categories of data; therefore the degrees of freedom will be 9-1-0, since no population parameter is estimated from a sample statistic.
Degrees of freedom (V) = K-1-g = 9-1-0 = 8
Step 3: Decision Rule
Next, read the value of χ²(8, 0.05).
Dear student, can you read the value from your statistical table? To find the value:
1. Go to the statistical table and find the table with the heading chi-square distribution.
2. In the first column look for the degrees of freedom (8)
3. Horizontally look for a tail area of 0.05
4. Find the intersection of the two
Dear learner, have you found that the result is 15.507?
Since the rejection region falls only in the right tail, the null hypothesis developed earlier will be rejected if the calculated χ² is greater than 15.507.
Step 4: Calculate χ²

To calculate the chi-square value we use χ² = Σ(fo − fe)²/fe

fo      fe      fo-fe   (fo-fe)²   (fo-fe)²/fe
 10      11     -1       1         0.090909
 39      44     -5      25         0.568182
123     117      6      36         0.307692
207     205      2       4         0.019512
250     245      5      25         0.102041
203     205     -2       4         0.019512
115     117     -2       4         0.034188
 40      44     -4      16         0.363636
 13      11      2       4         0.363636
Calculated χ² = 1.869309

Step 5: Decision

Since the calculated χ² (1.869) is less than 15.507, the decision is to accept Ho. This implies that the distribution follows a binomial distribution.
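
The merging of small cells and the resulting χ² can be sketched as below. The helper `pool_small_tails` is a hypothetical name introduced for this illustration; it only pools categories at the two ends of the table, which is all this example needs. The expected frequencies are the rounded values printed in the text.

```python
def pool_small_tails(fo, fe, minimum=5):
    """Merge end categories into their neighbours while an end fe < minimum."""
    fo, fe = list(fo), list(fe)
    while len(fe) > 1 and fe[0] < minimum:      # pool from the lower end
        fo[1] += fo[0]
        fe[1] += fe[0]
        del fo[0], fe[0]
    while len(fe) > 1 and fe[-1] < minimum:     # pool from the upper end
        fo[-2] += fo[-1]
        fe[-2] += fe[-1]
        del fo[-1], fe[-1]
    return fo, fe


# Observed and expected frequencies for 0..10 correct picks, from the text.
observed = [2, 8, 39, 123, 207, 250, 203, 115, 40, 13, 0]
expected = [1, 10, 44, 117, 205, 245, 205, 117, 44, 10, 1]

fo, fe = pool_small_tails(observed, expected)
chi_square = sum((o - e) ** 2 / e for o, e in zip(fo, fe))
dof = len(fe) - 1 - 0          # g = 0: p = 0.5 was assumed, not estimated

critical_value = 15.507        # chi-square table value for alpha = 0.05, 8 d.f.
print(fo, fe, round(chi_square, 4), dof, chi_square > critical_value)
```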

4.5.3 Goodness of fit test for Poisson distribution

A goodness of fit test for the Poisson distribution is used to test whether or not a given distribution follows a Poisson distribution.
Example

Suppose a hospital has kept track of the number of patients arriving at the emergency room within a given hour over the last 480 hours (20 days). It was found that 960 patients came to the emergency room during that period. The observed distribution of patient arrivals is given below.

Number of patients arriving    0      1      2      3 or more    Total
Number of hours (fo)           60     140    125    155          480

Solution
Step 1: State the hypotheses
Ho: The distribution of arrivals follows a Poisson distribution with λ = 2 (960/480)
H1: The distribution of arrivals does not follow a Poisson distribution with λ = 2
Step 2: Calculate expected frequencies
The expected frequency is calculated as p × n
Where p = probability and
n = number of observations

$$ P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!} $$

For example, the probability of zero arrivals can be calculated as

$$ P(0) = \frac{2^{0} e^{-2}}{0!} = 0.135 $$

Dear student, can you calculate the probability for the remaining three events (1 arrival, 2 arrivals, and 3 or more arrivals)? Compare your answer with the one given in the table below. The number of hours in which we expect zero patients to arrive is then the probability of zero arrivals multiplied by the number of hours observed (0.135 × 480 ≈ 65). Dear learner, would you likewise calculate the expected number of hours for 1 arrival, 2 arrivals and 3 or more arrivals? Compare your answer with the table below.

Number of patients arriving    0        1        2        3 or more    Total
Observed hours (fo)            60       140      125      155          480
Probability                    0.135    0.271    0.271    0.323        1.00
Expected hours (fe)            65       130      130      155          480
(fo-fe)²/fe                    0.38     0.77     0.19     0            1.35

Step 3: Develop the decision rule

To develop the decision rule we rely on the degrees of freedom and the significance level. The degrees of freedom are calculated as V = k-1-g, where g = number of parameters estimated from the sample. In this case we have used the sample of 480 hours to estimate the average rate of arrival (λ). Hence g takes a value of 1.
V = 4-1-1 = 2
α = 0.05

Hence χ²(2, 0.05) = 5.991

Therefore, the decision rule is: reject Ho if the calculated χ² > 5.991.
The calculated χ² (1.35) is less than 5.991. Hence Ho should be accepted, which means the distribution follows a Poisson distribution.
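
The Poisson example can be checked with the sketch below. It assumes, as the probability 0.323 in the table implies, that the last class means "3 or more" arrivals, so its probability is taken as the remainder after P(0), P(1) and P(2). Expected frequencies are left unrounded here, so the statistic differs slightly from the hand-rounded 1.35.

```python
from math import exp, factorial

# Hours with 0, 1, 2 and (assumed) 3 or more arrivals, from the example.
observed = [60, 140, 125, 155]
hours = sum(observed)            # 480 hours observed
lam = 960 / hours                # estimated mean arrivals per hour


def poisson_pmf(x, lam):
    """P(x) = lam^x * e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)


probabilities = [poisson_pmf(x, lam) for x in range(3)]
probabilities.append(1 - sum(probabilities))   # last class: 3 or more

expected = [p * hours for p in probabilities]
chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

g = 1                            # lambda was estimated from the sample
dof = len(observed) - 1 - g

critical_value = 5.991           # chi-square table value for alpha = 0.05, 2 d.f.
reject_h0 = chi_square > critical_value
print([round(p, 3) for p in probabilities], round(chi_square, 2), dof, reject_h0)
```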

4.5.4 Goodness of fit test for Normal distribution

Suppose we are given the following observed distribution, to which a normal distribution with n = 200, mean = 9 and standard deviation (s) = 3 has been fitted.

Values                 <3     3-6    6-9    9-12    12-15    >15
Observed frequency      7     30     61     73      25        4

Does the normal distribution give a close fit at the 5% significance level?

Solution
Step 1: Hypotheses
Ho: The population sampled follows a normal distribution
H1: The population sampled does not follow a normal distribution
Step 2: Calculate expected frequencies
To calculate the expected frequencies, we first need to calculate the probability that a value lies within each of the specified ranges.
The following diagram can help us as an aid.

[Figure: normal curve with the class boundaries 3, 6, 9, 12 and 15 marked on the horizontal axis]

We now need to calculate the Z score of each of the stated values and read the probability from the statistical table. Dear student, do you remember how to calculate Z scores?

$$ Z = \frac{x - \mu}{\sigma}, \qquad \text{hence} \qquad Z(3) = \frac{3 - 9}{3} = -2 $$

Likewise, Z(6), Z(9), Z(12) and Z(15) can be calculated. The Z scores are -1, 0, 1 and 2 respectively. To calculate the expected frequencies, we need to find the probability that a value falls between successive Z values, which can be read from the statistical table, and multiply it by n = 200.

Values                 <3        3-6       6-9       9-12      12-15     >15
Observed frequency      7        30        61        73        25         4
Probability            0.0228    0.1358    0.3413    0.3413    0.1358    0.0228
Expected frequency      5        27        68        68        27         5

Step 3: Calculate degrees of freedom and develop the decision rule

All the expected frequencies are greater than or equal to five. Therefore, no merging of cells is needed. The degrees of freedom can be calculated as:
V = k-1-g
Dear learner, two parameters are estimated from sample statistics: the mean and the standard deviation. Hence g = 2.
V = 6-1-2 = 3
χ²(3, 0.05) = 7.81
The decision rule, therefore, is: reject Ho if the calculated χ² > 7.81.
Step 4: Calculate the test statistic

fo      fe      fo-fe   (fo-fe)²   (fo-fe)²/fe
 7       5       2       4         0.80
30      27       3       9         0.33
61      68      -7      49         0.72
73      68       5      25         0.37
25      27      -2       4         0.15
 4       5      -1       1         0.20
Calculated χ² = 2.57

Step 5: Decision
Accept Ho since the calculated χ² (2.57) < 7.81
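
The normal goodness of fit steps can be sketched with Python's standard `statistics.NormalDist`, which plays the role of the Z table here. The class boundaries, mean 9 and standard deviation 3 come from the example; the expected frequencies are rounded to whole numbers as in the text, which is a choice made to match the hand calculation (unrounded expected frequencies give a somewhat larger statistic that is still below 7.81).

```python
from statistics import NormalDist

observed = [7, 30, 61, 73, 25, 4]    # classes: <3, 3-6, 6-9, 9-12, 12-15, >15
boundaries = [3, 6, 9, 12, 15]
n = sum(observed)                     # 200 observations

dist = NormalDist(mu=9, sigma=3)

# Cumulative probabilities at the class boundaries, padded with 0 and 1.
cdf = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]
probabilities = [hi - lo for lo, hi in zip(cdf, cdf[1:])]

# Expected frequencies, rounded to whole numbers as in the worked example.
expected = [round(p * n) for p in probabilities]

chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

g = 2                                 # mean and standard deviation estimated
dof = len(observed) - 1 - g

critical_value = 7.81                 # chi-square table value for alpha = 0.05, 3 d.f.
reject_h0 = chi_square > critical_value
print(expected, round(chi_square, 2), dof, reject_h0)
```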

Activity 6.2
1. The table below contains random sample data on the number of workers absent from Commercial Bank of Ethiopia. The sample contained the same number (10) of Mondays, Tuesdays, Wednesdays, and so on. Does it appear that the number of workers absent is uniformly distributed over the days of the week? Perform a goodness-of-fit test at the 5 percent level.

Day Number of Workers absent

Monday 15

Tuesday 9

Wednesday 9

Thursday 11

Friday 16

2. When a beer bottle filling machine breaks a bottle, the machine must be shut down while the broken glass is removed. The production manager at Bedele Brewery has been using a Poisson distribution with λ = 3 shutdowns per day, on average, to determine the probabilities of 0, 1, 2, 3, ... shutdowns in a day. The manager has tabulated the number of shutdowns in a random sample of 120 operating days, as shown in table 14.6. We want to test, at the 5% level, the hypothesis that the number of shutdowns in a day has a Poisson distribution with λt = λ = 3.
Number of shut downs in a day (X) Number of days (f0)

0 3

1 20

2 29

3 22

4 23

5 10

6 or more 13

1. Jacob, a mail order firm, sends out special item advertisements in batches of 50 at a time. Sampson’s sales manager believes that the probability of receiving an order as a result of any one advertisement is 0.05. The manager wants to test the hypothesis that the distribution of the number of orders in a batch of 50 is a binomial distribution with p = 0.05. Data for a random sample of 120 mailings are given below. Perform a goodness of fit test at the 5% level.

Number of orders received from a mailing    Observed frequency
list of 50 advertisements

0 16

1 30

2 40

3 20

4 10

5 3

6 1

7 or more 0

2. A seed grower sells early prize corn seed. A random sample of 1000 seeds was planted
to determine how many hours it takes for seeds to germinate. Data for the 1000 seeds
are given in the table below. The sample mean and the sample standard deviation were
150 hours and 12 hours respectively. Does it appear that the hours to germinate are
normally distributed? Perform a goodness of fit test at the 1% level.

Hours to germinate Frequency

Less than 120 10

120 and under 126 21

126 and under 132 50

132 and under 138 110

138 and under 144 160

144 and under 150 200

150 and under 156 210

156 and under 162 120

162 and under 168 60

168 and under 174 36

174 and under 180 19

180 and more 4

3. Given the data below, test at the 1% level whether the distribution follows a normal distribution with μ = 500 and σ = 100.

Test score interval Observed frequency

Less than 260 3

260 and under 340 5

340 and under 420 35

420 and under 500 63

500 and under 580 51

580 and under 660 28

660 and under 740 8

740 or more 7

UNIT SUMMARY
In probability theory and statistics, the chi-square distribution (also chi-squared or χ² distribution) is one of the most widely used theoretical probability distributions in inferential statistics, i.e. in statistical significance tests.
It is useful because, under reasonable assumptions, easily calculated quantities can be proven to have distributions that approximate the chi-square distribution if the null hypothesis is true.
The chi-square distribution is used for two main tests:
• Tests of independence
• Tests of goodness of fit
In a test of independence, chi-square is used to test whether or not two variables are independent. The test starts from the assumption of independence. On the basis of this assumption, expected frequencies are calculated and compared with the observed frequencies. If the difference between the expected and observed frequencies is insignificant, the two variables are said to be independent; otherwise, they are dependent. The number of degrees of freedom for the test is (r − 1)(c − 1), where r = number of rows and c = number of columns.
Chi-square is also used to test whether a distribution follows a given distribution type (normal, uniform, Poisson, binomial or the like). In a goodness of fit test we assume that the distribution follows the stated distribution and find the frequencies that would be expected under that assumption. We then compare the expected frequencies with the observed frequencies. If there is a significant difference between the two, the distribution deviates significantly from the stated distribution; otherwise it follows the stated distribution. In a goodness of fit test the degrees of freedom are calculated as ne − 1 − g,

where ne = number of categories,

g = number of parameters estimated from the sample.

We then calculate the sample χ²:

$$ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} $$

We then compare the calculated χ² with the table value of χ² for significance level α and ν degrees of freedom, written χ²(α, ν). If the computed χ² is greater than χ²(α, ν), we reject the null hypothesis that the distribution follows the stated distribution; otherwise we accept the null hypothesis (H0).

SELF CHECK EXERCISE 4


1. A security analyst classified a random sample of stocks in four industries as
being high, average, or low in terms of safety to buyers of the stock. The data
is presented in the table below. Perform a hypothesis test of independence of
safety ratings and industry classification. Use α = 0.01.

Stock safety       Industry category
rating             I      II     III    IV
High 25 18 29 12
Average 32 30 42 20
Low 13 32 14 33

2. A company is considering four areas as possible locations for a manufacturing plant. Each area has about the same number of workers. The company needs skilled workers and wants to determine whether the proportions of skilled workers in the areas are the same. Random sample data are given in the table below. At the 5% level, perform a test of the hypothesis that the four areas have the same proportion of skilled manpower.

                          Area
Number of                 A      B      C      D
Skilled workers           89     99     103    82
Unskilled workers         161    151    148    168

3. A manufacturer packages drinking glasses in boxes of 50. All glasses from a sample of 100 boxes were examined, and the number of defective glasses in each box was recorded. The sample data are given in the table below. Test whether the distribution follows a binomial distribution at the 5% level of significance.

Number of defectives in a box    Number of boxes
0 69
1 22
2 4
3 1
4 3
5 1

4. Planes landing at Bole International Airport are classified as scheduled or unscheduled. A random sample of hourly observations, given in the table below, shows the number of unscheduled plane landings. Does the number of landings per hour appear to have a Poisson distribution?
Number of planes landing in 1 hour    Frequency
0 24
1 35
2 43
3 25
4 16
5 4
6 2
7 0
8 1
9 or more 0

5. For inventory planning and control purposes, Awash Chemical Company wants to know if its sales of a liquid chemical are normally distributed. Sales for a random sample of 200 days are given in the table below. The sample mean and sample standard deviation calculated from the 200 daily sales figures are:

X̄ = 40 thousand gallons
Sx = 2.5 thousand gallons

Sales in thousands of gallons     Number of days, fo
Less than 34.0 0
34.0 and under 35.5 13
35.5 and under 37 20
37.0 and under 38.5 35
38.5 and under 40.0 43
40.0 and under 41.5 51
41.5 and under 43.0 27
43.0 and under 44.5 10
44.5 and under 46 1
46.0 and more 0
Total 200

CHAPTER FIVE

ANALYSIS OF VARIANCE
_____________________________________________________________________
Unit outline
• Characteristics of Analysis of Variance
• One way analysis of Variance
• Two way analysis of Variance
Unit objective
After completing this chapter students will be able to:
• Distinguish the F distribution from other types of distributions
• Know the characteristics of analysis of variance
• Use the F test to test the hypothesis that the means of more than two populations are equal
• Understand and use one way analysis of variance
• Understand and use two way analysis of variance

5.1 Chapter Introduction

Procedures for determining whether or not two populations have equal means were discussed under hypothesis testing. However, management problems often involve more than two populations, and decision makers want to know whether or not the means of these populations are equal.
The responses that are generated in an experimental situation always exhibit a certain amount of variability. In an analysis of variance, we divide the total variation in the response measurements into portions that may be attributed to various factors of interest to the experimenter.

Introduction
One way to compare two population variances, σ₁² and σ₂², is to use the ratio of the sample variances, s₁²/s₂². If s₁²/s₂² is nearly equal to 1, you will find little evidence to indicate that σ₁² and σ₂² are unequal. On the other hand, a very large or a very small value of s₁²/s₂² provides evidence of a difference in the population variances. The assumptions required for an analysis of variance are similar to those required for tests based on Student’s t-distribution. Analysis of variance is so called because we decide whether to accept or reject the hypothesis of equal population means by analyzing the variation (variance) in the sample means. The ANOVA test is performed on simple random samples, one drawn from each of the several populations. The test assumes that the populations are normally distributed and have equal variances.

5.2 One way Analysis of variance

In the analysis of variance, the F statistic is used to test whether the means of two or more groups are significantly different. It operates by breaking down the variance of the two or more groups into components. These components are then used to construct the sample statistic.
An ANOVA based on group data that are defined by a single classification is called one-way ANOVA, and an ANOVA based on group data that are defined by a dual classification is called two-way ANOVA.
Suppose we want to test whether the number of years of work experience since graduation has an effect on beginning salary for management graduates. The following table shows salaries for graduates with different years of experience. Test at the 5% level whether the average salaries of the different work experience categories are different.
The three treatments or groups are:
Treatment 1: Bachelor’s degree with no work experience
Treatment 2: Bachelor’s degree with one year of experience
Treatment 3: Bachelor’s degree with two years of experience
We also assume that all students in the sample graduated from one university and specialized in one field of study. In order to simplify the necessary computations, a random sample of only 12 observations is used: 3 samples of 4 graduates each, one sample from each treatment.

                  Years of work experience
Student     Treatment 1         Treatment 2     Treatment 3
            (no experience)     (1 year)        (2 years)
1               16                  19              24
2               21                  20              21
3               18                  21              22
4               13                  20              25
Total           68                  80              92
Mean            17                  20              23

(Global) Overall mean = (17+20+23)/3 = 20

Specifying Hypotheses
Dear learner, you might have noticed that we have calculated the mean for each of the three treatments: no experience, one year of experience and two years of experience. We have also calculated the overall (global) mean of the observations. We now want to test whether these three sample means were drawn from populations that have identical means. In other words, we want to test the following null hypothesis:
Ho: μ1 = μ2 = μ3 against the alternative hypothesis

H1: At least two population means are not equal.

Thus we are testing whether the differences between the sample means are too large to be attributed solely to chance. If the test results indicate that the sample means are significantly different, then we can conclude that the different years of work experience have an impact on beginning salaries. Note that we make inferences about the means of more than two populations.
As we can see from the above table, there are m populations and n observations in total. Each of the m populations is a treatment. The top row indicates that we will be testing the equality of m different means. Within each column there are nj individual observations taken from the jth treatment. In developing the one way analysis of variance model, our purpose is to specify the underlying relationships among the various treatments. Hence the first step is to calculate the sample means from the random observations taken from each of the m treatments.

• To test the null hypothesis that the treatment means are equal, we need to assess two measures of variability:
1. The variability of the observations within each treatment, referred to as within group variability
2. The variability between the m treatments, referred to as between group variability

• The term variation refers to the sum of squared deviations, which is called the sum of squares.

$$ SST = \sum_{j=1}^{m} n_j \left(\bar{X}_j - \bar{X}\right)^2 $$

Where SST = between treatment sum of squares
nj = sample size of the jth treatment
X̄j = sample mean of the jth treatment
X̄ = overall mean
n1(X̄1 − X̄)² = 4(17-20)² = 36
n2(X̄2 − X̄)² = 4(20-20)² = 0
n3(X̄3 − X̄)² = 4(23-20)² = 36

SST = 72

Within Treatment Sum of Squares

The within treatment sum of squares measures the variation of the observations around their own treatment means. It indicates the unexplained variability that is due to the random sampling process. The within treatment sum of squares is calculated as follows.

$$ SSW = \sum_{j=1}^{m} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)^2 $$

          Treatment 1        Treatment 2        Treatment 3
          (16-17)² = 1       (19-20)² = 1       (24-23)² = 1
          (21-17)² = 16      (20-20)² = 0       (21-23)² = 4
          (18-17)² = 1       (21-20)² = 1       (22-23)² = 1
          (13-17)² = 16      (20-20)² = 0       (25-23)² = 4
Total     34                 2                  10

SSW = 34+2+10 = 46
The between treatment sum of squares and the within treatment sum of squares together represent the total variation of the ANOVA model. We calculate the total variation by adding the squared deviations of the individual observations about the global mean. The total sum of squares can be calculated as:

$$ TSS = \sum_{j=1}^{m} \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}\right)^2 $$

Where:
TSS = total sum of squares
Xij = value of the observation in the ith row and jth column
X̄ = overall mean
To put it more simply, we obtain the total sum of squares by adding the between treatment variation and the within treatment variation.
TSS = SSW + SST = 46 + 72 = 118
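
The three sums of squares can be verified from the raw salary data. The following is a minimal sketch; the variable names are illustrative, and the data are the 12 observations from the table of work experience. It also confirms that the total sum of squares computed directly from the overall mean equals SSW + SST = 46 + 72.

```python
# Beginning salaries by treatment (columns of the data table).
treatments = [
    [16, 21, 18, 13],   # treatment 1
    [19, 20, 21, 20],   # treatment 2
    [24, 21, 22, 25],   # treatment 3
]

all_values = [x for group in treatments for x in group]
overall_mean = sum(all_values) / len(all_values)
group_means = [sum(group) / len(group) for group in treatments]

# Between treatment sum of squares: n_j * (group mean - overall mean)^2
sst = sum(len(g) * (m - overall_mean) ** 2 for g, m in zip(treatments, group_means))

# Within treatment sum of squares: (observation - its own group mean)^2
ssw = sum((x - m) ** 2 for g, m in zip(treatments, group_means) for x in g)

# Total sum of squares: (observation - overall mean)^2
tss = sum((x - overall_mean) ** 2 for x in all_values)

print(group_means, overall_mean, sst, ssw, tss)
```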

Between treatment and within treatment mean squares

The number of degrees of freedom associated with the between treatment variation is (m-1), whereas the number of degrees of freedom associated with the within treatment variation is (n-m),
where m = number of treatments and n = number of observations.
Returning to the example, there are 3 treatments and 12 observations, so m = 3 and n = 12. Thus, the degrees of freedom are:
Degrees of freedom for between treatment variation = 3-1 = 2
Degrees of freedom for within treatment variation = 12-3 = 9

Mean square of the between treatments

The test of the null hypothesis is based on the assumption that all m treatments have a common variance. If the null hypothesis is in fact true, then SST and SSW can each be used as a basis for an estimate of this common variance. To calculate these estimates, we divide each of the variability measures by its number of degrees of freedom. Hence the unbiased estimate of the between treatment mean square is obtained by dividing SST by (m-1) degrees of freedom.
MST = SST/(m-1)
Where MST = between treatment mean square
In our example the between treatment mean square is MST = 72/2 = 36. Similarly, the unbiased estimate of the within treatment mean square is found by dividing SSW by (n-m) degrees of freedom.
MSW = SSW/(n-m)
Where MSW = within treatment mean square
MSW = 46/9 = 5.11
We now test the null hypothesis that the population treatment means are equal by comparing the between treatment mean square with the within treatment mean square.

The Test statistic

The comparison of the between treatment mean square and the within treatment mean square is performed by computing a ratio:
The test statistic (F) = MST/MSW
If the null hypothesis that the population treatment means are equal were true, the ratio (F) would tend to be close to 1. Alternatively, if the null hypothesis were not true, the ratio would tend to be greater than 1, which implies that the treatment means do differ because the between treatment variance exceeds the within treatment variance. The ratio for the above example can be calculated as:
F calculated = MST/MSW
F calculated = 36/5.11 = 7.04
Summary table for one way ANOVA
Source of variation     Sum of squares    Degrees of freedom    Mean square
Between treatments      72 (SST)          2 (m-1)               36
Within treatments       46 (SSW)          9 (n-m)               5.11
F(2, 9) = 36/5.11 = 7.04
F(2, 9, 0.05) = 4.26
Decision Rule
Accept H0 if F(2, 9) < 4.26; otherwise reject H0. The value 4.26 is a critical value read from the statistical table at the end of the module with the heading analysis of variance. To read the value from the statistical table:
Search for the F distribution table with the 5% significance level. Search for 2 degrees of freedom in the numerator (on top) and 9 degrees of freedom in the denominator (first column) and read the intersection of the two, which is 4.26 in this case.
Decision
On the basis of the calculations we have already made, the calculated F is 7.04, which is greater than the critical F value of 4.26; therefore, the null hypothesis must be rejected and the alternative hypothesis must be accepted.
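
The mean squares, the F ratio and the decision can be put together in a few lines. This sketch starts from the sums of squares found earlier (SST = 72, SSW = 46) and uses the critical value 4.26 read from the F table; it does not compute the critical value itself.

```python
sst, ssw = 72, 46        # between and within treatment sums of squares
m, n = 3, 12             # number of treatments and total observations

mst = sst / (m - 1)      # between treatment mean square, 2 d.f.
msw = ssw / (n - m)      # within treatment mean square, 9 d.f.
f_stat = mst / msw       # test statistic F = MST / MSW

critical_value = 4.26    # F table value for alpha = 0.05 with (2, 9) d.f.
reject_h0 = f_stat > critical_value

print(mst, round(msw, 2), round(f_stat, 2), reject_h0)
```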

5.3 TWO-WAY ANALYSIS OF VARIANCE


In this section we extend one-way ANOVA to two ways. A two-way analysis of variance allows
a more in-depth interpretation than the one-way analysis. In the previous example our
interest focused on a single factor (years of experience), but it is possible that another factor
also affects the outcome. In the one-way analysis of variance, we concluded that the number
of years of experience had a significant impact on starting salary. However, we may suspect
that some of the variability is due to the geographic location of the job. Hence, we now need
not only to look at the treatment effect of the number of years of work experience but also
to isolate the impact of geographic location on the starting salaries of the graduates. By
setting up a two-way ANOVA problem, we design a more accurate test to explain the
differences in the population means of the treatments.
Our new model must be constructed in such a way as to test for the influence that a second
factor may have on starting salary. In the data from the previous table, which is repeated
below, the 4 rows represent 4 geographic locations in Ethiopia. Hence, we will be able to
acquire information about the various years of work experience as well as about the
geographic location of the job. This new factor in our analysis (region) is called the blocking
factor. Each cell contains only a single observation. Let's assume that
 the first row represents Western part of Ethiopia
 the second row represents the Eastern part of Ethiopia
 the third row represents the northern part of Ethiopia

 The fourth row represents the southern part of Ethiopia.
The observations are presented as follows:

                      Years of experience
Region            1      2      3      Row sum    Row mean
1                 16     19     24     59         19.667
2                 21     20     21     62         20.667
3                 18     21     22     61         20.333
4                 13     20     25     58         19.333
Column sum        68     80     92     240
Column mean       17     20     23

Test if population mean salaries among various years of experience and among the various
geographical locations are equal at 5% level.
Solution
Specifying the hypotheses
We have two null hypotheses to test:
1. H0: population mean salaries among the various years of work experience are
equal.
2. H0: population mean salaries among the various regions are equal.
In each case, H1: the population means are not all equal.
Between and residual sums of squares
The necessary calculation for two-way analysis of variance involves computation of the
following values:
SST = between-treatments sum of squares
SSB = between-blocks sum of squares
TSS = total sum of squares
SSE = error sum of squares

Dear learner, we have already seen how to calculate the between-treatments sum of squares
and the total sum of squares. In fact, we have already computed the two. Do you recall the
way we calculated them?
We now calculate the between-blocks sum of squares and the error sum of squares. With I
rows (blocks), J columns (treatments) and K observations per cell (here K = 1), the
between-blocks sum of squares is:
$$SSB = \sum_{i=1}^{I} JK(\bar{X}_i - \bar{X})^2$$
Where,
SSB = between-blocks sum of squares
$\bar{X}_i$ = sample mean of the ith row
$\bar{X}$ = overall mean

The between blocks sum of square can be calculated as follows.

$J(\bar{X}_1 - \bar{X})^2 = 3(19.667 - 20)^2 = 0.333$
$J(\bar{X}_2 - \bar{X})^2 = 3(20.667 - 20)^2 = 1.335$
$J(\bar{X}_3 - \bar{X})^2 = 3(20.333 - 20)^2 = 0.333$
$J(\bar{X}_4 - \bar{X})^2 = 3(19.333 - 20)^2 = 1.335$
The between-blocks sum of squares (SSB) is the sum of these values, which is 3.336.
The error sum of squares, in turn, is calculated by subtracting the between-treatments sum
of squares and the between-blocks sum of squares from the total sum of squares.
SSE = TSS – SST – SSB
= 118 – 72 – 3.336 = 42.664
The next step is to determine the degrees of freedom for the between-blocks and residual
(error) variation. The number of degrees of freedom associated with the between-blocks
variation is (I-1), whereas the number associated with the residual variation is (J-1)(I-1).
Between-blocks variance and error variance
It is now possible to obtain unbiased estimates of the between-blocks variance and the
residual variance. The between-blocks variance is calculated as:
$$MSB = \frac{SSB}{I-1} = \frac{3.336}{4-1} = 1.112$$
Where: I-1 = degrees of freedom for the between-blocks variance.
The residual (error) variance is calculated in the same way:
$$MSE = \frac{SSE}{(J-1)(I-1)} = \frac{42.664}{(2)(3)} = \frac{42.664}{6} = 7.111$$
Where (J-1)(I-1) = degrees of freedom for the residual (error) variance.
To test our null hypothesis about the influence of the various years of work experience, we
must calculate the F ratio:
F(2, 6) = MST/MSE = 36/7.111 = 5.063
The above ratio is the calculated F ratio. To decide whether to reject the null hypothesis, we
need to read the critical value from the statistical table and compare it with the calculated
F value.
The critical value for the above decision is F(2, 6, 0.05) = 5.14.
How to read the critical value from the statistical table:
 Find the F distribution table for the stated significance level.
 Find 2 degrees of freedom for the numerator in the first row of the table.
 Find 6 degrees of freedom for the denominator in the first column of the table.
 Read the value found at the intersection of the two.
INTERPRETATION
As the critical F value is greater than the calculated F value, we cannot reject the null
hypothesis; the data do not provide sufficient evidence of a difference between the
population mean salaries associated with the various years of work experience.
In testing the null hypothesis for the influence of geographical location on salaries, we find
that the F ratio is:
F(3, 6) = MSB/MSE = 1.112/7.111 = 0.156
The critical value associated with this test is F(3, 6, 0.05) = 4.76. Since 0.156 is less than
4.76, we again cannot reject the null hypothesis that the population mean salaries
associated with the geographical locations are equal.

Two-way ANOVA summary table

Source of variation      Sum of squares    Degrees of freedom       Mean square
Between treatments       SST = 72          J - 1 = 2                36
Between blocks           SSB = 3.336       I - 1 = 3                1.112
Residual                 SSE = 42.664      (J - 1)(I - 1) = 6       7.111

F(2, 6) = 36/7.111 = 5.063
F(2, 6, 0.05) = 5.14
F(3, 6) = 1.112/7.111 = 0.156
F(3, 6, 0.05) = 4.76
At the 5% level, neither years of experience nor region shows a significant effect on salary.
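The two-way calculation (one observation per cell) can be reproduced along the same lines. The sketch below is illustrative plain Python; the variable names are assumptions, and the critical values 5.14 and 4.76 are copied from the text. Because it keeps the row means unrounded, it gives SSB of about 3.333 and SSE of about 42.667 rather than the hand-rounded 3.336 and 42.664; the F ratios and the decisions are unaffected.

```python
# Two-way ANOVA without replication for the salary table:
# rows are regions (blocks), columns are years of experience (treatments).
salary = [
    [16, 19, 24],  # region 1
    [21, 20, 21],  # region 2
    [18, 21, 22],  # region 3
    [13, 20, 25],  # region 4
]
n_rows, n_cols = len(salary), len(salary[0])
grand_mean = sum(sum(row) for row in salary) / (n_rows * n_cols)
row_means = [sum(row) / n_cols for row in salary]
col_means = [sum(row[j] for row in salary) / n_rows for j in range(n_cols)]

tss = sum((x - grand_mean) ** 2 for row in salary for x in row)  # total
sst = n_rows * sum((m - grand_mean) ** 2 for m in col_means)     # treatments
ssb = n_cols * sum((m - grand_mean) ** 2 for m in row_means)     # blocks
sse = tss - sst - ssb                                            # residual

mst = sst / (n_cols - 1)
msb = ssb / (n_rows - 1)
mse = sse / ((n_cols - 1) * (n_rows - 1))
f_treatments = mst / mse
f_blocks = msb / mse

# Critical values read from the F table in the text (assumptions, not computed).
print(f"F treatments = {f_treatments:.3f}, reject H0: {f_treatments > 5.14}")
print(f"F blocks     = {f_blocks:.3f}, reject H0: {f_blocks > 4.76}")
```

Both comparisons come out False, which matches the conclusion that neither null hypothesis can be rejected at the 5% level.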

UNIT SUMMARY

Analysis of Variance (ANOVA) is a statistical technique used to determine whether samples
from two or more groups come from populations with equal means. Analysis of variance
employs one dependent measure, whereas multivariate analysis of variance compares
samples on two or more dependent measures.

Analysis of variance (ANOVA) is a statistical technique that can be used to evaluate
whether there are differences between the average values, or means, across several
population groups. With this model, the response variable is continuous in nature,
whereas the predictor variables are categorical. For example, in a clinical trial of
hypertensive patients, ANOVA methods could be used to compare the effectiveness of
three different drugs in lowering blood pressure. Alternatively, ANOVA could be used to
determine whether infant birth weight is significantly different among mothers who
smoked during pregnancy relative to those who did not.

One-way ANOVA evaluates the effect of a single factor on a single response variable. For
example, a clinician may be interested in determining whether there are differences in the
age distribution of patients enrolled in two different study groups. Using ANOVA to make
this comparison requires that several assumptions be satisfied. Specifically, the patients
must be selected randomly from each of the population groups, a value for the response
variable is recorded for each sampled patient, the distribution of the response variable is
normally distributed in each population, and the variance of the response variable is the
same in each population. In the above example, age would represent the response variable,
while the treatment group represents the independent variable, or factor, of interest.

As indicated through its designation, ANOVA compares means by using estimates of
variance. Specifically, the sampled observations can be described in terms of the variation
of the individual values around their group means, and of the variation of the group means
around the overall mean. These measures are frequently referred to as sources of "within-
groups" and "between-groups" variability, respectively. If the variability within the k
different populations is small relative to the variability between the group means, this
suggests that the population means are different. This is formally tested using a test of
significance based on the F distribution, which tests the null hypothesis (H0) that the
means of the k groups are equal:

H0: μ1 = μ2 = μ3 = … = μk

An F-test is constructed by taking the ratio of the "between-groups" variation to the
"within-groups" variation. If n represents the total number of sampled observations, this
ratio has an F distribution with k-1 and n-k degrees of freedom in the numerator and
denominator, respectively. Under the null hypothesis, the "within-groups" and
"between-groups" variances both estimate the same underlying population variance and
the F ratio is close to one. If the between-groups variance is much larger than the
within-groups variance, the F ratio becomes large and the associated p-value becomes
small. This leads to rejection of the null hypothesis, thereby concluding that the means of
the groups are not all equal. When interpreting the results from the ANOVA procedures it
is helpful to comment on the strength of the observed association, as significant
differences may result simply from having a very large sample.

Multi-way analysis of variance is an extension of the one-way model that allows for the
inclusion of additional independent nominal variables. It should not be confused with
multivariate analysis of variance (MANOVA), which compares samples on two or more
dependent measures. In some analyses, researchers may wish to adjust for group
differences in a variable that is continuous in nature. For example, in the example cited
above, when evaluating the effectiveness of antihypertensive agents administered to three
groups, we may wish to control for group differences in the age of the patients. The
addition of a continuous variable to an existing ANOVA model is referred to as analysis of
covariance (ANCOVA).

SELF CHECK EXERCISE 5
1. An investor selected random samples of stock purchases recommended by
three stock brokers a year ago. The investor calculated the percent returns
on each stock during the year, as given below. Perform an ANOVA test at α =
0.05 level to determine if the mean returns for the three advisory firms are
equal.
Percent returns
A B C
7.0 8.7 3.4
2.8 5.2 8.1
5.1 4.9 4.2
4.6 7.0 2.6

2. Instruments for correcting a power plant malfunction are mounted on a
control panel. Three panels were designed, with the instruments arranged
differently on each panel. Then three random samples of four control
engineers each were selected. Each sample was assigned to one panel. The
times in seconds taken by the engineers to correct a simulated malfunction
are given below. Perform an ANOVA test at the 0.05 level to determine if the
mean times to correct the malfunction are the same for the three panels.
Time in seconds
Panel A Panel B Panel C
17 9 13
12 16 8
15 11 14
20 12 9

3. Three methods for assembling a product are to be tested at the 0.05 level to
determine whether mean times per assembly for the methods are equal.
Random sample assembly times in minutes are given below. Perform the
ANOVA test.

Method one Method two Method three
11 19 19
13 25 14
19 16 13
18 22 14
14 18 20

4. A stock analyst thinks four stock mutual funds generate about the same return.
She collected the accompanying rate of return data on four different mutual
funds during the last 5 years.

         A     B     C     D
1988     12    11    13    15
1989     12    17    19    11
1990     13    18    15    12
1991     18    20    25    11
1992     12    19    19    10

a) Conduct a one-way ANOVA to decide whether the funds give different
performance. Use the 5% level of significance.
b) Conduct a two-way ANOVA to decide whether the funds give different
performance. Use the 5% level of significance.
5. The following table gives the sales made by four salesmen in four zones of
Ethiopia. At the 5% level of significance, conduct a two-way ANOVA (Analysis
of Variance) to test whether the mean sales among the salesmen are the same.
               North   East   West   South
Salesman A     8       6      5      4
Salesman B     6       6      7      6
Salesman C     5       6      8      9
Salesman D     4       8      7      9

Chapter 6
Simple linear Regression and Correlation
Chapter Objective:
Dear reader, after studying this chapter, you will be able to:
 Define regression analysis
 Define and fit simple linear regression
 Predict the population average value of the dependent variable on the basis of
known (fixed) values of the independent variable.
 Understand correlation
 Compute the Pearsonian and rank correlation coefficients.

6.1. Simple Linear Regression


In the preceding chapters we have been dealing with data on a single variable. Here we
shall focus on methods of dealing with paired data, which may be related in some way.
Regression analysis is concerned with describing and evaluating the relationship
between a dependent variable and one or more independent variables. Regression is
therefore used to bring out the nature of the relationship and to use it to estimate the
value of one variable from the other. In what follows, we deal with the problem of
estimating and/or predicting the population mean (average) value of the dependent
variable on the basis of known values of the independent variable(s).
The variable whose value is to be estimated or predicted is known as the dependent
variable, while the variables which help us determine the value of the dependent variable
are known as independent variables.
A regression equation which involves only two variables, one dependent and one
independent, is referred to as simple regression. This model assumes that the dependent
variable is influenced by only one systematic variable and the error term. However, when
more than two variables are included in the model (one dependent and two or more
independent), it is called multiple regression.

The relationship between any two variables may be linear or non-linear. The former
implies a constant absolute change in the dependent variable in response to a unit
change in the independent variable, while the latter implies a varying marginal change in
the dependent variable in response to changes in the independent variable.
In this chapter we confine ourselves to regression involving only two variables whose
relationship is linear. This is called simple linear regression.

6.1.1. The Scatter Diagram


Consider the following data collected by taking a sample of five industries in a given
industrial sector on their input (number of workers) and output (thousands of birr).
Table 6.1.

Industry    Output, Yi (thousands of Birr)    Input, Xi (no. of workers)    Paired data (Xi, Yi)
1           4                                 2                             (2, 4)
2           7                                 3                             (3, 7)
3           3                                 1                             (1, 3)
4           9                                 5                             (5, 9)
5           17                                9                             (9, 17)

Output level (Yi) is believed to depend on the number of workers (Xi). Accordingly, Yi is the
dependent variable and Xi is the independent variable.
In order to visualize the form of the regression, we plot these points on a graph as shown in
Fig. 6.1. What we get is a scatter diagram.

[Fig. 6.1. Scatter diagram of output Y (vertical axis) against number of workers X (horizontal axis)]
When carefully observed, the scatter diagram at least shows the nature of relationship;
whether positive or negative and whether the curve is linear or non-linear.
When the general course of movement of the paired points is best described by a straight
line, the next task is to fit a regression line which lies as close as possible to every point
on the scatter diagram. This can be done by means of either free hand drawing or the
method of least squares. However, the latter is the most widely used method.
6.1.2. The regression Equation
Regression equation is a statement of equality that defines the relationship between two
variables. The equation of the line which is to be used in predicting the value of the
dependent variable takes the form Ye = a + bx. The most universally used and
statistically accepted method of fitting such an equation is the method of least squares.
The Method of Least Squares
This method requires that the straight line be fitted so that the sum of the squared vertical
deviations of the observed Y values from the line (the predicted Y values) is a minimum.
As shown in Fig. 6.1, if e1, e2, ..., e5 are the vertical deviations of the observed Y values
from the straight line (the predicted Y values, Ye), fitting a straight line in keeping with the
above condition requires that (for sample size n)
$$e_1^2 + e_2^2 + \dots + e_n^2 = \sum_{i=1}^{n} e_i^2$$
is a minimum. This can be done by partially differentiating $\sum e_i^2$ with respect to a and b
and equating the derivatives to zero.
$e_i$ is the error made when taking $Y_e$ instead of $Y_i$. Therefore, $e_i = Y_i - Y_e$, and
$$\sum e_i^2 = \sum (Y_i - Y_e)^2 = \sum (Y_i - a - bX_i)^2$$
Differentiating with respect to a:
$$\frac{\partial \sum e_i^2}{\partial a} = \frac{\partial \sum (Y_i - a - bX_i)^2}{\partial a} = 0$$
$$\Rightarrow -2\sum (Y_i - a - bX_i) = 0$$
$$\Rightarrow \sum Y_i - na - b\sum X_i = 0$$
$$\Rightarrow \frac{na}{n} = \frac{\sum Y_i}{n} - b\,\frac{\sum X_i}{n}$$
$$\Rightarrow a = \bar{Y} - b\bar{X}$$
Differentiating with respect to b:
$$\frac{\partial \sum e_i^2}{\partial b} = \frac{\partial \sum (Y_i - a - bX_i)^2}{\partial b} = 0$$
$$\Rightarrow -2\sum (Y_i - a - bX_i)X_i = 0$$
$$\Rightarrow \sum X_iY_i - a\sum X_i - b\sum X_i^2 = 0$$
$$\Rightarrow \sum X_iY_i - (\bar{Y} - b\bar{X})\sum X_i - b\sum X_i^2 = 0$$
$$\Rightarrow \sum X_iY_i - \bar{Y}\sum X_i - b\left[\sum X_i^2 - \bar{X}\sum X_i\right] = 0$$
Therefore,
$$b = \frac{\sum X_iY_i - \bar{Y}\sum X_i}{\sum X_i^2 - \bar{X}\sum X_i}$$
Or equivalently, multiplying both the numerator and denominator by n, we get:
$$b = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{n\sum X_i^2 - (\sum X_i)^2}$$

Example 6.1. Suppose we want to study the relationship between input (number of
workers) and output (thousands of Birr) of the five factories given in Table 6.1 above. To fit
the regression line of Yi (thousands of Birr) on Xi (number of workers), we can employ
the method of least squares as follows:
Solution.
Arrange the data in tabular form.
Table 6.2.

          Yi     Xi     XiYi    Xi²
          4      2      8       4
          7      3      21      9
          3      1      3       1
          9      5      45      25
          17     9      153     81
Σ         40     20     230     120
Mean      8      4

Where Σ = summation (total), n = sample size = 5, mean of Xi = ΣXi/n and
mean of Yi = ΣYi/n.
Substituting these values in the above equations, we get
$$b = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{n\sum X_i^2 - (\sum X_i)^2} = \frac{5(230) - 20(40)}{5(120) - (20)^2} = \frac{1150 - 800}{600 - 400} = \frac{350}{200} = \frac{7}{4}$$
$$a = \bar{Y} - b\bar{X} = 8 - \frac{7}{4}(4) = 1$$
Therefore, the least squares regression equation is:
$$Y_e = 1 + \frac{7}{4}X_i$$
 Estimate the output (in thousands of Birr) of a factory that has 8 workers.
Xi = 8
$$Y_e = 1 + \frac{7}{4}(8) = 15$$
Consequently, if a factory has 8 workers, its estimated level of output will be 15 thousand ETB.
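The least squares formulas for a and b translate directly into a few lines of code. The following is a minimal sketch in plain Python using the Table 6.1 data; the function name `fit_line` and the variable names are illustrative assumptions, not part of the module.

```python
def fit_line(xs, ys):
    """Least-squares line Ye = a + b*X; returns (a, b)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b = (n*ΣXY - ΣX*ΣY) / (n*ΣX² - (ΣX)²)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = mean(Y) - b * mean(X)
    a = sum_y / n - b * sum_x / n
    return a, b

workers = [2, 3, 1, 5, 9]   # Xi, number of workers
output = [4, 7, 3, 9, 17]   # Yi, output in thousands of Birr

a, b = fit_line(workers, output)
predicted = a + b * 8       # estimated output for a factory with 8 workers
print(f"Ye = {a:g} + {b:g} X; estimate at X = 8: {predicted:g}")
```

Keeping the sums in named variables mirrors the columns of Table 6.2, which makes it easy to compare each intermediate total with the hand calculation.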
Example 6.2. The following are sample observations on the price and quantity supplied
of a commodity X by a competitive firm.
a) Construct the scatter diagram.
b) Find the linear regression of Yi (quantity supplied) on Xi (price of
commodity X).
c) Suppose the price of commodity X is 32. What will be the quantity supplied by
the firm?
Table 6.3. Data on price and quantity supplied.

Yi (quantity) Xi (price) XiYi Xi²


40 15 600 225
45 20 900 400
40 25 1000 625
50 30 1500 900
55 35 1925 1225
60 40 2400 1600
60 45 2700 2025
65 50 3250 2500
70 55 3850 3025
75 60 4500 3600
55 40 2200 1600
60 45 2700 2025
Total 675 460 27,525 19,750

a) [Figure: scatter diagram of quantity supplied Y (vertical axis) against price X (horizontal axis)]
b)
$$b = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{n\sum X_i^2 - (\sum X_i)^2} = \frac{12(27525) - 460(675)}{12(19750) - (460)^2} = \frac{19800}{25400} = 0.7795$$
$$\bar{Y} = \frac{\sum Y_i}{n} = \frac{675}{12}, \qquad \bar{X} = \frac{\sum X_i}{n} = \frac{460}{12}$$
$$a = \bar{Y} - b\bar{X} = \frac{675}{12} - 0.7795\left(\frac{460}{12}\right) = 26.3692$$
Therefore, the estimated supply function is
Ye = 26.3692 + 0.7795 Xi

c) Xi = 32
Ye = 26.3692 + 0.7795 Xi
= 26.3692 + 0.7795 (32)
= 26.3692 + 24.944
= 51.3132
If the price of X is 32, the estimated quantity supplied will be approximately equal to 51
units.
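As a cross-check on Example 6.2, the slope can also be computed from deviations about the means, b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)², which is algebraically equivalent to the formula used above. The sketch below is illustrative plain Python with assumed variable names; it carries full floating-point precision, so its intercept can differ from a hand calculation in the second or third decimal place.

```python
# Table 6.3 data: price (Xi) and quantity supplied (Yi).
price = [15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 40, 45]
quantity = [40, 45, 40, 50, 55, 60, 60, 65, 70, 75, 55, 60]

n = len(price)
x_bar = sum(price) / n
y_bar = sum(quantity) / n

# Deviation form of the least-squares slope.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(price, quantity))
s_xx = sum((x - x_bar) ** 2 for x in price)
b = s_xy / s_xx
a = y_bar - b * x_bar

estimate = a + b * 32  # quantity supplied when the price is 32
print(f"b = {b:.4f}, a = {a:.2f}, estimate at price 32 = {estimate:.1f}")
```

The script gives a slope of about 0.7795, an intercept of about 26.37 and an estimated quantity of about 51.3 units, in line with the worked solution.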

6.1.3. Regression of X on Y
In sub-topic 6.1.2 above we explored the regression of Y on X. Sometimes it is possible
and of interest to fit the regression of X on Y, i.e., treating Y as the independent variable
and X as the dependent variable.
In such cases, the general form of the equation is given by:
$$X_e = a_0 + b_0Y_i$$
Where Xe = expected value of X
a0 = X-intercept
b0 = slope of the regression
Applying the principle of least squares as before, the constants a0 and b0 are given as
follows:
$$a_0 = \bar{X} - b_0\bar{Y}$$
$$b_0 = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{n\sum Y_i^2 - (\sum Y_i)^2}$$
N.B. The regression line of Y on X and the regression line of X on Y intersect at the point
of means $(\bar{X}, \bar{Y})$.
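To illustrate the two regression lines, the sketch below fits both Y on X and X on Y to the Table 6.1 data and checks that each line passes through the point of means. It is plain Python with illustrative names (`least_squares` is an assumption, not a library function). It also evaluates the relation r = √(b·b0) quoted as a hint in Review Exercise 7, taking the positive root because both slopes are positive here.

```python
from math import sqrt

def least_squares(explanatory, response):
    """Return (intercept, slope) of the line predicting response from explanatory."""
    n = len(explanatory)
    s_e, s_r = sum(explanatory), sum(response)
    s_er = sum(e * r for e, r in zip(explanatory, response))
    s_e2 = sum(e * e for e in explanatory)
    slope = (n * s_er - s_e * s_r) / (n * s_e2 - s_e ** 2)
    return s_r / n - slope * s_e / n, slope

x = [2, 3, 1, 5, 9]    # Xi from Table 6.1
y = [4, 7, 3, 9, 17]   # Yi from Table 6.1
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)

a, b = least_squares(x, y)      # Y on X:  Ye = a + b*X
a0, b0 = least_squares(y, x)    # X on Y:  Xe = a0 + b0*Y

# Each line passes through the point of means (x_bar, y_bar).
on_first_line = abs(a + b * x_bar - y_bar) < 1e-9
on_second_line = abs(a0 + b0 * y_bar - x_bar) < 1e-9

r = sqrt(b * b0)  # positive root: both slopes are positive for these data
print(f"b = {b:.4f}, b0 = {b0:.4f}, r = {r:.4f}")
```

Swapping the roles of the two lists is all that is needed to move from one regression to the other, which is why a single helper serves both fits.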
6.2. Correlation
The correlation coefficient measures the degree to which two variables are related
/associated – simple correlation denoted by r. For more than two variables we have
multiple correlations.
Two variables may have either positive correlation, negative correlation or may not be
correlated. Furthermore, depending on the form of relationship the correlation between
two variables may be linear or non-linear. Therefore, in this section, we shall be
concerned with quantifying the degree of association between two variables with linear
relationship.
Contrary to regression analysis explained in the previous section (6.1), the computation
of coefficient of correlation does not require one variable to be designated as dependent
and the other as independent.
The measure of the degree of linear relationship between any two variables, known as the
Pearsonian coefficient of correlation and usually denoted by r, is defined as
$$r_{xy} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$
and is termed the product-moment formula. It can be further simplified as
$$r_{xy} = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{\sqrt{\left[n\sum X_i^2 - (\sum X_i)^2\right]\left[n\sum Y_i^2 - (\sum Y_i)^2\right]}}$$
N.B. The building blocks of this formula are, therefore,
$\sum X_iY_i$, $\sum X_i$, $\sum Y_i$, $\sum X_i^2$, $\sum Y_i^2$ and n (sample size).

Properties of Pearson coefficient of correlation


1. −1 ≤ r ≤ 1.
2. When r = 0, there is no linear correlation.
3. When r = 1 or r = −1, there is perfect positive or perfect negative correlation,
respectively.
4. Adding a constant to each value of X and Y, or multiplying each value by a
positive constant, does not affect the value of r.
5. The closeness of the relationship is not proportional to the value of r.
6. When r is positive and close to 1, there is high positive correlation, while
when it is close to zero it shows low positive correlation. Similarly, when r is
negative and close to -1, there is high negative correlation, while when it is
close to zero it shows low negative correlation.
7. It is free of any units used.
Example 6.3. Find the Pearson coefficient of correlation for the two variables in the
data of table 6.1.
Solution
Table 6.4.

          Yi     Xi     Xi²     Yi²     XiYi
          4      2      4       16      8
          7      3      9       49      21
          3      1      1       9       3
          9      5      25      81      45
          17     9      81      289     153
Total     40     20     120     444     230

$$r_{xy} = \frac{5(230) - 40(20)}{\sqrt{[5(120) - (20)^2][5(444) - (40)^2]}} = \frac{1150 - 800}{\sqrt{(600 - 400)(2220 - 1600)}} = \frac{350}{352.14} = 0.99$$

Interpretation: r = 0.99 implies a strong positive linear relationship between X and Y.


Example 6.4. Find the Pearsonian coefficient of correlation for the two variables in the
data of Table 6.3.
Solution: Table 6.5.

          Yi     Xi     Xi²       Yi²       XiYi
          40     15     225       1600      600
          45     20     400       2025      900
          40     25     625       1600      1000
          50     30     900       2500      1500
          55     35     1225      3025      1925
          60     40     1600      3600      2400
          60     45     2025      3600      2700
          65     50     2500      4225      3250
          70     55     3025      4900      3850
          75     60     3600      5625      4500
          55     40     1600      3025      2200
          60     45     2025      3600      2700
Total     675    460    19,750    39,325    27,525

Therefore,
$$r_{xy} = \frac{12(27525) - 675(460)}{\sqrt{[12(19750) - (460)^2][12(39325) - (675)^2]}} = \frac{19800}{20331.872} = 0.974$$

Interpretation: It implies a strong positive linear relationship between X and Y.


Example 6.5. Add a constant, say 1, to each value of X and Y given in Table 6.1 and
show that property 4 holds true.
Solution
Table 6.6.

          Yi     Xi     Xi²     Yi²     XiYi
          5      3      9       25      15
          8      4      16      64      32
          4      2      4       16      8
          10     6      36      100     60
          18     10     100     324     180
Total     45     25     165     529     295

$$r = \frac{5(295) - 45(25)}{\sqrt{[5(165) - (25)^2][5(529) - (45)^2]}} = \frac{350}{352.14} = 0.99$$
This is the same value as in Example 6.3. Therefore, we have shown that property 4 is true.
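Property 4 can also be checked numerically. The sketch below is plain Python with an illustrative helper, `pearson_r`, that implements the simplified product-moment formula; it compares r for the Table 6.1 data with r after adding 1 to every value and after multiplying by positive constants (the multipliers 3 and 10 are arbitrary choices for illustration).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from the sums used in the simplified formula."""
    n = len(xs)
    s_x, s_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_x2 = sum(x * x for x in xs)
    s_y2 = sum(y * y for y in ys)
    return (n * s_xy - s_x * s_y) / sqrt(
        (n * s_x2 - s_x ** 2) * (n * s_y2 - s_y ** 2)
    )

x = [2, 3, 1, 5, 9]    # Xi from Table 6.1
y = [4, 7, 3, 9, 17]   # Yi from Table 6.1

r_original = pearson_r(x, y)
r_shifted = pearson_r([v + 1 for v in x], [v + 1 for v in y])    # add 1 to each value
r_scaled = pearson_r([3 * v for v in x], [10 * v for v in y])    # positive multipliers

print(f"{r_original:.4f} {r_shifted:.4f} {r_scaled:.4f}")
```

All three values agree at about 0.99, since shifting or positively rescaling the data changes the numerator and the denominator of the formula by the same factor.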

Spearman's Rank Correlation Coefficient
The Pearson coefficient of correlation cannot be used when direct quantitative
measurement of the phenomenon under study is not possible. In such cases,
we make use of the rank correlation coefficient.
Steps to calculate Spearman's coefficient of rank correlation:
1. Rank the X values among themselves, giving rank 1 to the largest (or smallest)
value, rank 2 to the next largest (or smallest) value, and so on.
2. Rank the Y values among themselves in the same way as for X.
3. When there are ties in rank, i.e., when values share the same rank, assign to
each of the tied observations the mean of the ranks they jointly occupy; the
ranks so used are skipped for the next observation.
4. Find the sum of the squares of the differences between the ranks of the two variables.
5. Apply the formula
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
where n = number of pairs of observations
di = ith difference between the ranks of X and Y
As the steps above indicate, rs may also be calculated for numerical data after ranking the
values according to numerical size.
Example 6.6. Consider the ranks given by two judges to five ladies in a beauty contest:
Table 6.7

Ladies      Judge A (RA)    Judge B (RB)
AZEB        1               2
TIZITA      3               4
FATUMA      4               3
LEMLEM      2               1
CHALTU      5               5

Solution:

Ladies      di = RB - RA    di²
AZEB        1               1
TIZITA      1               1
FATUMA      -1              1
LEMLEM      -1              1
CHALTU      0               0
Total                       4

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(4)}{5(24)} = 1 - 0.2 = 0.80$$
Interpretation: Since rs = 0.80, there is strong agreement between the rankings of
Judge A and Judge B.
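The ranking steps and the rank correlation formula can be sketched as follows. This is illustrative plain Python: `average_ranks` and `spearman_rs` are assumed helper names, ranks are assigned with 1 for the smallest value, and ties receive the mean of the ranks they jointly occupy as described in step 3. Evaluating the formula for the judges' ranks gives 1 − 6(4)/(5 × 24) = 0.80.

```python
def average_ranks(values):
    """Rank values (1 = smallest); tied values share the mean of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # first rank position occupied by v
        count = ordered.count(v)       # number of observations tied at v
        ranks.append(first + (count - 1) / 2)
    return ranks

def spearman_rs(rank_x, rank_y):
    """Spearman's rs = 1 - 6*sum(d^2) / (n*(n^2 - 1)) from two lists of ranks."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks from Table 6.7 (already ranks, so no ranking step is needed).
judge_a = [1, 3, 4, 2, 5]
judge_b = [2, 4, 3, 1, 5]
rs = spearman_rs(judge_a, judge_b)

# Ranking raw scores with a tie (hypothetical numbers, for illustration only).
tied_ranks = average_ranks([10, 20, 20, 30])
print(rs, tied_ranks)
```

For raw marks, such as those in Review Exercise 6, the two lists would first be passed through `average_ranks` and the resulting ranks then given to `spearman_rs`.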

Review Exercises 6
1. Define and distinguish between;
a) Regression and correlation
b) Simple and multiple regression
c) Linear and non-linear relationship
2. Bring out the relevance of a scatter diagram in regression analysis.
3. Explain the meaning and status of the two constants a and b in the regression
equation Ye = a + bXi.
4. The marks obtained by 10 students in their graduation with B.A. degree in
management and the MBA entrance test were found as given below.
Graduation (Xi) 50 52 55 60 62 65 65 66 70 75
Entrance test (Yi) 52 50 57 65 65 62 65 65 71 75
Therefore, find
a) The two regression equations
b) The correlation coefficient between two sets of marks
5. Obtain the regression equation of X on Y and Y on X for the paired data given
below. Also compute the coefficient of correlation.
Market price of X 26 28 30 31 35
Market price of Y 20 27 28 30 25
6. Ten students got the following marks in Maths and Statistics
Student A B C D E F G H I J
Maths (X) 78 36 98 25 75 82 90 62 65 39
Statistics (Y) 84 51 91 60 68 62 86 58 58 47
Compute the coefficient of Rank correlation and interpret the result.
7. For a certain set of paired data on X and Y, 3Xi + 2Yi – 26 = 0 and 6Xi + Yi
– 31 = 0 are the two regression equations.
a) Find the mean values
b) Find the coefficient of correlation [Hint: $r = \sqrt{b \cdot b_0}$, with r taking the sign of the slopes]


8. A leading company engaged in the production of detergents has 10 vacancies for
salesmen, for which 15 (n) persons were called for personal interviews. The
interview board consisted of the sales manager and a psychologist. The ranks
given by the two to all 15 candidates who attended the interview are given below.

S. No. in the interview list           1 2 4 5 8 9 10 11 13 14 15 17 18 19 20
Ranking by the sales manager (Xi)      2 3 1 5 4 6 8 7 9 10 12 11 13 14 15
Ranking by the psychologist (Yi)       1 3 2 4 6 5 7 9 8 11 10 12 14 13 15
Compute the rank correlation coefficient.

Z-distribution table
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990

t- Distribution table
df α = 0.1 0.05 0.025 0.01 0.005 0.001 0.0005
∞ tα=1.282 1.645 1.96 2.326 2.576 3.091 3.291
1 3.078 6.314 12.706 31.821 63.656 318.289 636.578
2 1.886 2.92 4.303 6.965 9.925 22.328 31.6
3 1.638 2.353 3.182 4.541 5.841 10.214 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.61
5 1.476 2.015 2.571 3.365 4.032 5.894 6.869
6 1.44 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.86 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.25 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.93 4.318
13 1.35 1.771 2.16 2.65 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.14
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.12 2.583 2.921 3.686 4.015
17 1.333 1.74 2.11 2.567 2.898 3.646 3.965
18 1.33 1.734 2.101 2.552 2.878 3.61 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.85
21 1.323 1.721 2.08 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.5 2.807 3.485 3.768
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.06 2.485 2.787 3.45 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.689
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.66
30 1.31 1.697 2.042 2.457 2.75 3.385 3.646
60 1.296 1.671 2 2.39 2.66 3.232 3.46
120 1.289 1.658 1.98 2.358 2.617 3.16 3.373
∞ 1.282 1.645 1.96 2.326 2.576 3.091 3.291

Chi-Square Distribution Table
df\area 0.95 0.9 0.75 0.5 0.25 0.1 0.05 0.025 0.01 0.005
1 0.00393 0.01579 0.10153 0.45494 1.3233 2.70554 3.84146 5.02389 6.6349 7.87944
2 0.10259 0.21072 0.57536 1.38629 2.77259 4.60517 5.99146 7.37776 9.21034 10.59663
3 0.35185 0.58437 1.21253 2.36597 4.10834 6.25139 7.81473 9.3484 11.34487 12.83816
4 0.71072 1.06362 1.92256 3.35669 5.38527 7.77944 9.48773 11.14329 13.2767 14.86026
5 1.14548 1.61031 2.6746 4.35146 6.62568 9.23636 11.0705 12.8325 15.08627 16.7496
6 1.63538 2.20413 3.4546 5.34812 7.8408 10.64464 12.59159 14.44938 16.81189 18.54758
7 2.16735 2.83311 4.25485 6.34581 9.03715 12.01704 14.06714 16.01276 18.47531 20.27774
8 2.73264 3.48954 5.07064 7.34412 10.21885 13.36157 15.50731 17.53455 20.09024 21.95495
9 3.32511 4.16816 5.89883 8.34283 11.38875 14.68366 16.91898 19.02277 21.66599 23.58935
10 3.9403 4.86518 6.7372 9.34182 12.54886 15.98718 18.30704 20.48318 23.20925 25.18818
11 4.57481 5.57778 7.58414 10.341 13.70069 17.27501 19.67514 21.92005 24.72497 26.75685
12 5.22603 6.3038 8.43842 11.34032 14.8454 18.54935 21.02607 23.33666 26.21697 28.29952
13 5.89186 7.0415 9.29907 12.33976 15.98391 19.81193 22.36203 24.7356 27.68825 29.81947
14 6.57063 7.78953 10.16531 13.33927 17.11693 21.06414 23.68479 26.11895 29.14124 31.31935
15 7.26094 8.54676 11.03654 14.33886 18.24509 22.30713 24.99579 27.48839 30.57791 32.80132
16 7.96165 9.31224 11.91222 15.3385 19.36886 23.54183 26.29623 28.84535 31.99993 34.26719
17 8.67176 10.08519 12.79193 16.33818 20.48868 24.76904 27.58711 30.19101 33.40866 35.71847
18 9.39046 10.86494 13.67529 17.3379 21.60489 25.98942 28.8693 31.52638 34.80531 37.15645
19 10.11701 11.65091 14.562 18.33765 22.71781 27.20357 30.14353 32.85233 36.19087 38.58226
20 10.85081 12.44261 15.45177 19.33743 23.82769 28.41198 31.41043 34.16961 37.56623 39.99685
21 11.59131 13.2396 16.34438 20.33723 24.93478 29.61509 32.67057 35.47888 38.93217 41.40106
22 12.33801 14.04149 17.23962 21.33704 26.03927 30.81328 33.92444 36.78071 40.28936 42.79565
23 13.09051 14.84796 18.1373 22.33688 27.14134 32.0069 35.17246 38.07563 41.6384 44.18128
24 13.84843 15.65868 19.03725 23.33673 28.24115 33.19624 36.41503 39.36408 42.97982 45.55851
25 14.61141 16.47341 19.93934 24.33659 29.33885 34.38159 37.65248 40.64647 44.3141 46.92789
26 15.37916 17.29188 20.84343 25.33646 30.43457 35.56317 38.88514 41.92317 45.64168 48.28988
27 16.1514 18.1139 21.7494 26.33634 31.52841 36.74122 40.11327 43.19451 46.96294 49.64492
28 16.92788 18.93924 22.65716 27.33623 32.62049 37.91592 41.33714 44.46079 48.27824 50.99338
29 17.70837 19.76774 23.56659 28.33613 33.71091 39.08747 42.55697 45.72229 49.58788 52.33562
30 18.49266 20.59923 24.47761 29.33603 34.79974 40.25602 43.77297 46.97924 50.89218 53.67196
