INTRODUCTION TO
DATA ANALYTICS
Class #11
Statistical Inference - II
Dr. Sreeja S R
Assistant Professor
Indian Institute of Information Technology
IIIT Sri City
IIITS: IDA - M2021 1
IN THIS PRESENTATION…
• Errors in hypothesis testing
• Case Study 1: Coffee Sale
• Case Study 2: Machine Testing
• Summary of Sampling Distributions in Hypothesis Testing
Calculating α
• Assume that we have the results of a random sample. We can then use the characteristics of the sampling distribution to calculate the probability of making either a Type I or a Type II error.
Example 6.6:
Suppose the two hypotheses in a statistical test are:
$H_0: \mu = a$
$H_1: \mu \neq a$
Also, assume that for the given sample the population follows a normal distribution. A threshold limit, say δ, is used to decide whether the sample mean is significantly different from a.
Calculating α
• Here, the shaded region represents the probability that $\bar{x} < a - \delta$ or $\bar{x} > a + \delta$.
[Figure: normal curve centred at a, with a−δ, a and a+δ marked on the axis and both tails shaded]
Thus the null hypothesis is to be rejected if the mean value is less than a−δ or greater than a+δ.
If $\bar{x}$ denotes the sample mean, then the probability of a Type I error is
$\alpha = P(\bar{x} < a - \delta \text{ or } \bar{x} > a + \delta \mid \mu = a)$
THE REJECTION REGION
• The rejection region comprises the values of the test statistic for which
1. the probability when the null hypothesis is true is less than or equal to the specified α, and
2. the probabilities when H1 is true are greater than they are under H0.
[Figure: Rejection region for H0 for a given value of α, with critical values a′ < a < a″. Reject H0 (μ ≠ a) if the test statistic falls below a′ or above a″; do not reject H0 (μ = a) in between.]
Two-Tailed Test
• For a two-tailed hypothesis test, the hypotheses take the form
$H_0: \mu = a$
$H_1: \mu \neq a$
In other words, to reject the null hypothesis, the sample mean must satisfy $\bar{x} < a'$ or $\bar{x} > a''$ for the given α, where a′ and a″ are the lower and upper critical values.
Thus, in a two-tailed test, there are two rejection regions (also known as critical regions), one on each tail of the sampling distribution curve.
Two-Tailed Test
[Figure: Acceptance and rejection regions in case of a two-tailed test with 5% significance level. Acceptance region: the central 95% of the area around µH0; accept H0 if the sample mean falls in this region. Rejection regions: the two tails, each with 0.025 of the area; reject H0 if the sample mean falls in either of these regions.]
One-Tailed Test
• A one-tailed test would be used when we are to test, say, whether the population mean is either lower than or higher than the hypothesized value.
Symbolically,
$H_0: \mu = a$, $H_1: \mu < a$ (left-tailed test), or
$H_0: \mu = a$, $H_1: \mu > a$ (right-tailed test),
wherein there is only one rejection region, on the left tail (or the right tail).
[Figure: Left-tailed test and right-tailed test. In each, the rejection region is a single tail with 0.05 of the area, and the rest of the curve is the acceptance region.]
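The boundaries of the acceptance and rejection regions can be computed rather than read from a table. The following is a minimal sketch, assuming a test statistic that follows the standard normal distribution and using Python's standard-library `statistics.NormalDist`; the function name `critical_values` is illustrative and does not come from the slides.

```python
from statistics import NormalDist


def critical_values(alpha, tail="two"):
    """Return the z critical value(s) bounding the rejection region.

    alpha: level of significance (total area of the rejection region).
    tail:  "two", "left" or "right".
    """
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "two":
        # alpha is split equally between the two tails
        c = z.inv_cdf(1 - alpha / 2)
        return (-c, c)
    if tail == "left":
        return (z.inv_cdf(alpha),)
    if tail == "right":
        return (z.inv_cdf(1 - alpha),)
    raise ValueError("tail must be 'two', 'left' or 'right'")


if __name__ == "__main__":
    for tail in ("two", "left", "right"):
        print(tail, [round(c, 3) for c in critical_values(0.05, tail)])
```

At the 5% level, the two-tailed boundaries are roughly ±1.96 (0.025 of the area in each tail), while a one-tailed test places the whole 0.05 in one tail and so uses a boundary of roughly ±1.645.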
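EXAMPLE 6.7: CALCULATING α
• Consider the two hypotheses:
The null hypothesis is $H_0: \mu = 8$
The alternative hypothesis is $H_1: \mu \neq 8$
Assume a sample of size 16, a standard deviation of 0.2, and that the sample follows a normal distribution.
EXAMPLE 6.7: CALCULATING α
• We can decide the rejection region as follows.
Suppose the null hypothesis is to be rejected if the mean value is less than 7.9 or greater than 8.1.
If $\bar{x}$ is the sample mean, then the probability of a Type I error is
$\alpha = P(\bar{x} < 7.9 \mid \mu = 8) + P(\bar{x} > 8.1 \mid \mu = 8)$
Given that the standard deviation is 0.2 and that the distribution is normal, the standard error of the mean is $0.2/\sqrt{16} = 0.05$.
Thus,
$P(\bar{x} < 7.9 \mid \mu = 8) = P\left(Z < \frac{7.9 - 8}{0.05}\right) = P(Z < -2) = 0.0228$
and
$P(\bar{x} > 8.1 \mid \mu = 8) = P\left(Z > \frac{8.1 - 8}{0.05}\right) = P(Z > 2) = 0.0228$
Hence,
$\alpha = 0.0228 + 0.0228 = 0.0456$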
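The Type I error probability of Example 6.7 can be checked numerically. This is a minimal sketch, assuming a hypothesized mean of 8 (the midpoint of the rejection limits 7.9 and 8.1) and using the standard-library `statistics.NormalDist` for the sampling distribution of the mean; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 8.0      # mean under the null hypothesis (assumed: midpoint of 7.9 and 8.1)
sigma = 0.2    # standard deviation given in the example
n = 16         # sample size given in the example
lower, upper = 7.9, 8.1  # reject H0 outside [lower, upper]

# Sampling distribution of the sample mean when H0 is true
sampling = NormalDist(mu=mu0, sigma=sigma / sqrt(n))

# alpha = P(mean < 7.9 | H0) + P(mean > 8.1 | H0)
alpha = sampling.cdf(lower) + (1 - sampling.cdf(upper))

if __name__ == "__main__":
    print(f"alpha = {alpha:.4f}")
```

The exact value is about 0.0455; adding the two table values 0.0228 + 0.0228 gives 0.0456 because each table entry is already rounded.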
Example 6.8: Calculating α and β
There are two identical-looking boxes of chocolates. Box A contains 60 red and 40 black chocolates, whereas Box B contains 40 red and 60 black chocolates. There is no label on either box. One box is placed on the table. We are to test the hypothesis that “Box B is on the table”.
To test the hypothesis, an experiment is planned as follows:
• Draw five chocolates at random from the box.
• Replace each chocolate before selecting a new one.
• The number of red chocolates in the experiment is taken as the sample statistic.
Note: Since the draws are independent of each other, we can assume that the sample statistic follows a binomial probability distribution.
Example 6.8: Calculating α
• Let us express the population parameter as p, the proportion of red chocolates in the box, and let X be the number of red chocolates in the five draws.
The hypotheses of the problem can be stated as:
$H_0: p = 0.4$ // Box B is on the table
$H_1: p = 0.6$ // Box A is on the table
Calculating α
In this example, the null hypothesis specifies that the probability of drawing a red chocolate is 0.4.
This means that a lower proportion of red chocolates in the observations favors the null hypothesis.
In other words, drawing all red chocolates provides sufficient evidence to reject the null hypothesis. Then, the probability of making a Type I error is the probability of getting five red chocolates in a sample of five from Box B. That is,
$\alpha = P(X = 5 \mid p = 0.4)$
Using the binomial distribution,
$\alpha = \binom{5}{5}(0.4)^5(0.6)^0 = 0.01024$
Thus, the probability of rejecting a true null hypothesis is about 0.01. That is, there is approximately a 1% chance that Box B will be mislabeled as Box A.
Example 6.8: Calculating β
• A Type II error occurs if we fail to reject the null hypothesis when it is not true. For the current illustration, such a situation occurs if Box A is on the table but we do not get the five red chocolates required to reject the hypothesis that Box B is on the table.
The probability of a Type II error is then the probability of getting four or fewer red chocolates in a sample of five from Box A.
That is,
$\beta = P(X \le 4 \mid p = 0.6)$
Using the probability rule $P(X \le 4) = 1 - P(X = 5)$,
that is,
$\beta = 1 - P(X = 5 \mid p = 0.6)$
Now,
$P(X = 5 \mid p = 0.6) = \binom{5}{5}(0.6)^5(0.4)^0 = 0.07776$
Hence,
$\beta = 1 - 0.07776 = 0.92224$
That is, the probability of making a Type II error is over 92%. This means that, if Box A is on the table, the probability that we will be unable to detect it is about 0.92.
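Both error probabilities of Example 6.8 follow directly from the binomial distribution. This is a minimal sketch, assuming five independent draws with replacement and the decision rule "reject the hypothesis that Box B is on the table only if all five chocolates are red"; the helper `binom_pmf` and the variable names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 5          # number of draws
p_box_b = 0.4  # proportion of red chocolates in Box B (null hypothesis)
p_box_a = 0.6  # proportion of red chocolates in Box A (alternative)

# Type I error: Box B is on the table, yet all five draws are red
alpha = binom_pmf(5, n, p_box_b)

# Type II error: Box A is on the table, yet four or fewer draws are red
beta = sum(binom_pmf(k, n, p_box_a) for k in range(5))

if __name__ == "__main__":
    print(f"alpha = {alpha:.5f}, beta = {beta:.5f}")
```

The very small α comes at the price of a very large β: a rule that demands five reds out of five almost never rejects Box B wrongly, but it also almost never detects Box A.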
CASE STUDY 1: COFFEE SALE
A coffee vendor near Kharagpur railway station has had average sales of 500 cups per day. Because of the development of a bus stand nearby, the vendor expects sales to increase. During the first 12 days after the inauguration of the bus stand, the daily sales were as follows:
550 570 490 615 505 580 570 460 600 580 530 526
On the basis of this sample information, can we conclude that the sales of coffee have increased?
Use a 5% level of significance.
HYPOTHESIS TESTING : 5 STEPS
• The following five steps are followed when testing a hypothesis:
1. Specify H0 and H1, the null and alternative hypotheses, and an acceptable level of α.
2. Determine an appropriate sample-based test statistic and the rejection region for the specified H0.
3. Collect the sample data and calculate the test statistic.
4. Make a decision to either reject or fail to reject H0.
5. Interpret the result in common language suitable for practitioners.
CASE STUDY 1: STEP 1
• Step 1: Specification of hypotheses and acceptable level of α
Let us consider the hypotheses for the given problem as follows.
$H_0: \mu = 500$ cups per day
The null hypothesis is that sales average 500 cups per day and have not increased.
$H_1: \mu > 500$ cups per day
The alternative hypothesis is that the sales have increased.
The given acceptable level is α = 0.05 (5% level of significance).
CASE STUDY 1: STEP 2
• Step 2: Sample-based test statistic and the rejection region for the specified H0
Given the sample as
550 570 490 615 505 580 570 460 600 580 530 526
Since the sample size is small and the population standard deviation is not known, we shall use the t-test, assuming a normal population. The test statistic is
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
To find $\bar{x}$ and $s$, we make the following computations.
$\bar{x} = \frac{\sum x_i}{n} = \frac{6576}{12} = 548$
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{23978}{11}} \approx 46.69$
Hence,
$t = \frac{548 - 500}{46.69/\sqrt{12}} \approx 3.56$
Note:
A statistical table for the t-distribution gives the t-value for a given n, the degrees of freedom, and α, the level of significance, and vice versa.
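The sample mean, sample standard deviation and t statistic for the coffee data can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; `statistics.stdev` divides by n − 1, which matches the small-sample estimate used with the t-test, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Daily sales during the first 12 days after the bus stand opened
sales = [550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526]
mu0 = 500  # hypothesized mean under H0 (cups per day)

n = len(sales)
x_bar = mean(sales)   # sample mean
s = stdev(sales)      # sample standard deviation (divides by n - 1)
t_obs = (x_bar - mu0) / (s / sqrt(n))  # one-sample t statistic

if __name__ == "__main__":
    print(f"n = {n}, mean = {x_bar}, s = {s:.2f}, t = {t_obs:.2f}")
```

The statistic is compared with the tabulated t-value for n − 1 = 11 degrees of freedom; the standard library has no t-distribution, so the critical value still has to come from a table or from a statistics package.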
Case Study 1: Step 3
• Step 3: Collect the sample data and calculate the test statistic
As H1 is one-tailed, we shall determine the rejection region by applying a one-tailed test in the right tail (because H1 is of the “more than” type) at the 5% level of significance.
Using the table of the t-distribution for 11 degrees of freedom and a 5% level of significance, the rejection region is
$R: t > 1.796$
Case Study 1: Step 4
• Step 4: Make a decision to either reject or fail to reject H0
The observed value of t is about 3.56, which is in the rejection region (t > 1.796), and thus H0 is rejected at the 5% level of significance.
Case Study 1: Step 5
Step 5: Final comment and interpretation of the result
We can conclude that the sample data indicate that coffee sales have increased.
CASE STUDY 2: MACHINE TESTING
• A medicine production company packages medicine in tubes of 8 ml with a standard deviation of σ = 0.2. To maintain control of the amount of medicine in the tubes, it uses a machine. To monitor this control, a sample of 16 tubes is taken from the production line at random time intervals and their contents are measured precisely. The mean amount of medicine in these 16 tubes is used to test the hypothesis that the machine is indeed working properly. The given sample has a sample mean of 7.89, and the sample follows a normal distribution.
CASE STUDY 2: STEP 1
• Step 1: Specification of hypotheses and acceptable level of α
The hypotheses are given in terms of the population mean of medicine per tube.
The null hypothesis is $H_0: \mu = 8$
The alternative hypothesis is $H_1: \mu \neq 8$
We assume α, the significance level in our hypothesis testing, to be 0.05.
(This signifies that the probability of concluding that the machine needs to be adjusted when it is in fact working properly is at most 5%.)
CASE STUDY 2: STEP 2
• Step 2: Sample-based test statistic and the rejection region for the specified H0
Rejection region: given α = 0.05, reject H0 if $|z| > 1.96$ (obtained from the standard normal distribution for a two-tailed test).
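CASE STUDY 2: STEP 3
• Step 3: Collect the sample data and calculate the test statistic
Sample results: $n = 16$, $\bar{x} = 7.89$, $\sigma = 0.2$
With the sample, the test statistic is
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{7.89 - 8}{0.2/\sqrt{16}} = -2.20$
Hence,
$|z| = 2.20 > 1.96$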
CASE STUDY 2: STEP 4
• Step 4: Make a decision to either reject or fail to reject H0
[Figure: standard normal curve with −2.20, −1.96, 0, 1.96 and 2.20 marked on the axis]
Since $|z| = 2.20 > 1.96$, we reject H0.
CASE STUDY 2: STEP 5
• Step 5: Final comment and interpretation of the result
We conclude that the mean amount per tube differs from 8 ml and recommend that the machine be adjusted.
CASE STUDY 2: ALTERNATIVE TEST
• Suppose that in our initial setup of the hypothesis test we choose α = 0.01 instead of 0.05. Then the test can be summarized as:
1. $H_0: \mu = 8$, $H_1: \mu \neq 8$, α = 0.01
2. Reject H0 if $|z| > 2.576$
3. Sample result: n = 16, σ = 0.2, $\bar{x}$ = 7.89, $z = \frac{7.89 - 8}{0.2/\sqrt{16}} = -2.20$, $|z| = 2.20$
4. Since $|z| = 2.20 < 2.576$, we fail to reject H0: μ = 8
5. We do not recommend that the machine be readjusted.
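The effect of the significance level on the decision can be seen by running the same two-tailed z-test at two levels. This is a minimal sketch, assuming σ = 0.2, n = 16, a sample mean of 7.89 and a hypothesized mean of 8, with 0.01 taken as the stricter alternative level (an assumption of this sketch); the function name `two_tailed_z_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(x_bar, mu0, sigma, n, alpha):
    """Return (z, z_crit, reject) for a two-tailed z-test with known sigma."""
    z = (x_bar - mu0) / (sigma / sqrt(n))          # observed test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # upper critical value
    return z, z_crit, abs(z) > z_crit


if __name__ == "__main__":
    # alpha = 0.01 is an assumed stricter level used for comparison
    for alpha in (0.05, 0.01):
        z, z_crit, reject = two_tailed_z_test(7.89, 8.0, 0.2, 16, alpha)
        decision = "reject H0" if reject else "fail to reject H0"
        print(f"alpha = {alpha}: z = {z:.2f}, critical = {z_crit:.3f}, {decision}")
```

The same sample leads to opposite recommendations at the two levels, which is why α has to be fixed in Step 1, before the data are examined.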
Hypothesis Testing Strategies
• Hypothesis testing determines the validity of an assumption (technically described as the null hypothesis), with a view to choosing between two conflicting hypotheses about the value of a population parameter.
• There are two types of tests of hypotheses:
Non-parametric tests (also called distribution-free tests of hypotheses)
Parametric tests (also called standard tests of hypotheses).
Parametric Tests : Applications
• Usually assume certain properties of the population from which we draw samples:
• Observations come from a normal population
• Sample size is small
• Assumptions about population parameters like mean, variance, etc. hold good.
• Require measurement equivalent to interval-scaled data.
Parametric Tests
• Important Parametric Tests
The widely used sampling distributions for parametric tests are
• z-test
• t-test
• χ²-test
• F-test
Note:
All these tests are based on the assumption of normality (i.e., the source of data is considered to be normally distributed).
Parametric Tests : Z-test
• z-test: This is the most frequently used test in statistical analysis.
• It is based on the normal probability distribution.
• Used for judging the significance of several statistical measures, particularly the mean.
• It is used even when the binomial distribution or the t-distribution is applicable, on the condition that such a distribution tends to the normal distribution when n becomes large.
• Typically it is used for comparing the mean of a sample to some hypothesized mean for the population in the case of a large sample, or when the population variance is known.
Parametric Tests : t-test
• t-test: It is based on the t-distribution.
• It is considered an appropriate test for judging the significance of a sample mean, or for judging the significance of the difference between the means of two samples, in the case of
• small sample(s)
• population variance not known (in this case, we use the variance of the sample as an estimate of the population variance)
Parametric Tests : χ²-test
• χ²-test: It is based on the chi-squared distribution.
• It is used for comparing a sample variance to a theoretical population variance.
Parametric Tests : F-test
• F-test: It is based on the F-distribution.
• It is used to compare the variances of two independent samples.
• This test is also used in the context of analysis of variance (ANOVA) for judging the significance of more than two sample means.
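The two variance-based statistics can be illustrated with a short computation. This is a minimal sketch, assuming the standard textbook forms χ² = (n − 1)s²/σ₀² for comparing a sample variance with a theoretical variance σ₀², and F = s₁²/s₂² for comparing two sample variances; neither formula is written out on the slides, and the function names and the data below are hypothetical.

```python
from statistics import variance


def chi_square_variance_stat(sample, sigma0_sq):
    """Chi-squared statistic (n - 1) * s^2 / sigma0^2, with n - 1 degrees of freedom."""
    n = len(sample)
    return (n - 1) * variance(sample) / sigma0_sq


def f_variance_ratio(sample1, sample2):
    """F statistic s1^2 / s2^2, with (n1 - 1, n2 - 1) degrees of freedom."""
    return variance(sample1) / variance(sample2)


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only
    a = [10, 12, 9, 11, 13]
    b = [10, 14, 8, 12, 16]
    print("chi-squared statistic:", chi_square_variance_stat(a, sigma0_sq=2.0))
    print("F statistic:", f_variance_ratio(a, b))
```

Each statistic is then compared with a critical value from the corresponding table (chi-squared or F) at the chosen level of significance.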
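Hypothesis Testing : Assumptions
• Case 1: Normal population, population infinite, sample size may be large or small, variance of the population is known.
$z = \frac{\bar{x} - \mu_{H_0}}{\sigma/\sqrt{n}}$
Case 2: Population normal, population finite (of size N), sample size may be large or small, variance is known.
$z = \frac{\bar{x} - \mu_{H_0}}{(\sigma/\sqrt{n})\sqrt{(N-n)/(N-1)}}$
Case 3: Population normal, population infinite, sample size is small and variance of the population is unknown.
$t = \frac{\bar{x} - \mu_{H_0}}{s/\sqrt{n}}$ with $n - 1$ degrees of freedom
and
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$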
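Hypothesis Testing
• Case 4: Population normal, population finite (of size N), sample size is small and variance of the population is unknown.
$t = \frac{\bar{x} - \mu_{H_0}}{(s/\sqrt{n})\sqrt{(N-n)/(N-1)}}$ with $n - 1$ degrees of freedom, where $s$ is the sample standard deviation
Note: If the variance of the population is known, replace $s$ by $\sigma$.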
Hypothesis Testing : Non-Parametric Test
• Non-parametric tests
Do not rely on any assumption about the population distribution
Assume only nominal or ordinal data
Note: Non-parametric tests need the entire population (or a very large sample size)
Any questions?