
Inferential Statistical Analysis

© 2023 Aptech Limited


All rights reserved.
No part of this book may be reproduced or copied in any form or by any means – graphic, electronic, or mechanical, including photocopying, recording, taping, or storing in an information retrieval system – or sent or transferred without the prior written permission of the copyright owner, Aptech Limited.
All trademarks acknowledged.

APTECH LIMITED
Contact E-mail: ov-support@[Link]
Edition 1 – 2023

Preface

In this Learner’s Guide, we delve into the heart of inferential statistics, guiding you through the fundamental concepts, methodologies, and applications that empower us to make informed decisions in the face of uncertainty. Whether you are eager to grasp the intricacies of statistical inference, want to enhance your decision-making skills, or are simply intrigued by the magic of numbers, this book is designed to cater to your intellectual curiosity.
The book provides a comprehensive exploration of key topics in statistics, offering a structured learning journey. Beginning with an introduction to statistics, the progression includes fundamental concepts such as probability, correlation, and regression, leading into more advanced discussions on inferential statistics. It also delves into more precise topics, exploring the exact sampling distribution and the intricacies of analysis of variance, collectively building a solid foundation in statistical methods.

This book is the result of a concentrated effort of the Design Team, which is continuously striving to bring you the best and the latest in Information Technology. The process of design has been a part of the ISO 9001 certification for Aptech-IT Division, Education Support Services. As part of Aptech’s quality drive, this team does intensive research and curriculum enrichment to keep it in line with industry trends.

We will be glad to receive your suggestions.

Design Team

Table of Contents
Session 1: Introduction to Statistics
Session 2: Introduction to Probability
Session 3: Correlation and Regression
Session 4: Inferential Statistics
Session 5: Exact Sampling Distribution
Session 6: Analysis of Variance
Session 01
Introduction to Statistics
This session introduces the fundamental concepts of statistics and the measures of central tendency.
In this session, you will learn to:
Define statistics
Describe the applications of statistics in Data Science
Describe descriptive statistics
Explain measures of central tendency
1.1 OVERVIEW OF STATISTICS
Every day, people come across terms such as average, range, frequency, data, population, sample, and many more. It is worth considering the depth of these terms and how widely they are used. The term STATISTICS originated from the Italian word ‘Statista’, which means ‘political state’. In ancient times, governments used to collect data on the population to gauge the country’s manpower and introduce new taxes.
1.1.1 DEFINITION OF STATISTICS
Statistics is a branch of mathematics in which data is collected, analyzed, and interpreted. It provides a framework for making decisions and predictions based on data. To use statistics effectively, it is important to understand the different types of data, the measures of central tendency and variability, and the basic principles of probability.
1.1.2 APPLICATIONS OF STATISTICS IN DATA SCIENCE
Statistics plays a crucial role in Data Science, which is the interdisciplinary
field that uses statistical
and computational methods to extract insights and knowledge from data. It has a
wide range of
applications across many fields, including science, business, economics,
medicine, engineering,
and more.
Here are some common applications of statistics:
 Quality Control: Statistics is often used to ensure the quality of products or services. Quality control techniques such as statistical process control and acceptance sampling can help identify defects or variations in a production process.
 Market Research: Statistics is frequently used in market research to gather and analyze data about consumer behavior and preferences. Techniques such as surveys and focus groups are often used to collect data, and statistical analysis is used to identify trends and patterns in the data.
 Epidemiology: Statistics plays a key role in epidemiology, the study of the spread and control of diseases. Epidemiologists use statistical techniques to analyze data on the incidence and prevalence of diseases and to identify risk factors and causes.
 Finance: Statistics is widely used in finance for risk management and investment analysis. Techniques such as regression analysis, time series analysis, and Monte Carlo simulation are commonly used to analyze financial data and make investment decisions.
 Social Sciences: Statistics is used extensively in the social sciences to analyze data on human behavior and social phenomena. Techniques such as regression analysis, correlation analysis, and factor analysis are used to identify patterns and relationships in the data.
 Sports: Statistics is often used in sports to evaluate player performance and make strategic decisions. In baseball, for example, statistics such as batting average and on-base percentage are used to evaluate player performance and make decisions about player selection and strategy.
 Probability Theory: Probability theory provides a framework for understanding the likelihood of different outcomes in a given situation. In Data Science, probability is used to model uncertain events, such as predicting the likelihood of a customer churning (leaving) a subscription service.
 Experimental Design: Experimental design is used to design experiments that can test hypotheses and make causal inferences. In Data Science, experimental design is used to test the effectiveness of different marketing strategies or to determine the impact of a new product feature on customer behavior.
Overall, statistics is a foundational tool for Data Science and plays a crucial
role in all aspects of
the field, from data collection and cleaning to modelling and analysis. It is
divided into two types:
 Descriptive Statistics
 Inferential Statistics
1.2 DESCRIPTIVE STATISTICS
Descriptive statistics deals with the analysis and interpretation of data. It involves the use of numerical and graphical techniques to summarize and describe the main features of a dataset. Descriptive statistics is used to provide a concise summary of a large database and to identify patterns, trends, and relationships within the data.
Following are the main measures used in descriptive statistics:

Central Tendency: Measures of central tendency are used to describe the center of a distribution. The most used measures of central tendency are the mean, median, and mode. Suppose a teacher wants to calculate the average score of her students on a recent math exam. In this case, the mean would be an appropriate measure. The mean is calculated by adding up all the scores and dividing by the total number of students, which gives a single number that represents the average score of the class.

Dispersion: Measures of dispersion are used to describe the spread of a distribution. The most used measures of dispersion are the range, variance, standard deviation, skewness, and kurtosis. For example, the skewness of an income distribution is usually calculated to determine appropriate tax policies or to design social programs to reduce income inequality.

1.2.1 MEASURES OF CENTRAL TENDENCY

Measures of central tendency cover mainly the mean, median, and mode. Following explains each term:
 Mean: The mean is also known as the average value. It is the most used summary statistic to describe a set of values (or a population). It is calculated by adding up all the values in a dataset and then dividing by the total number of values. The formula for calculating the mean is:

𝑿̅ = ∑𝒙𝒊 / 𝒏

Where 𝒙𝒊 are the data values and 𝒏 is the number of values in the data.
Example : Consider the dataset of 5 values: 4,6,8,10,12.
Mean = (4+6+8+10+12)/5
= 40/5
= 8
Here, the mean of the data is 8.
The mean is a useful measure of central tendency. It considers all the values in a dataset and gives a single value that represents the ‘typical’ value in the dataset. However, it can be sensitive to extreme values or outliers, which can skew the value of the mean. In such cases, other measures of central tendency, such as the median or mode, may be more appropriate.
 Median: The median is a statistical measure of central tendency that represents
the middle
value of a dataset arranged in order from smallest to largest (or largest to
smallest). In other
words, it is the value that separates the lower half of the dataset from the upper
half. To
calculate the median, first, the data should be arranged in order from smallest to
largest (or
largest to smallest). If there is an odd number of data points, the median is the
middle value.
If there is an even number of data points, the median is the average of the two middle values.
Example : Consider the dataset: 1, 3, 5, 7, 9.
To find the median, first arrange the data in order: 1, 3, 5, 7, 9
Since there are five data points, which is an odd number, the median is the middle
value, which
is 5.
Now, consider another dataset: 2, 4, 6, 8, 10, 12
To find the median, first arrange the data in order: 2, 4, 6, 8, 10, 12
Since there are six data points, which is an even number, the median is the average of the two middle values, which are 6 and 8. Therefore, the median is (6+8)/2 = 7.
 Mode: The mode is a statistical measure of central tendency that represents the
most
frequently occurring value in a dataset. It is the value that appears most
frequently in the
dataset.
To calculate the mode, identify the value that appears most frequently in the
dataset. In some
cases, a dataset may have more than one mode if there are multiple values that
occur with
the same highest frequency.
Example : Consider the following dataset: 3, 4, 5, 5, 6, 7, 7, 7, 8.
In this dataset, the value 7 appears most frequently, occurring three times, while
all other
values occur only once or twice. Therefore, the mode of this dataset is 7.
Now, consider another dataset: 1, 2, 2, 3, 4, 4, 5
In this dataset, both the values 2 and 4 occur twice, which means this dataset has two modes: 2 and 4.
It is important to note that not all datasets have a mode. For example, in the dataset 1, 3, 5, 7, 9, no value appears more than once, so this dataset does not have a mode.
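As an illustrative sketch (added here, not part of the original text), the three measures can be computed with Python’s standard `statistics` module, using the datasets from the examples above:

```python
import statistics

# Mean of the dataset 4, 6, 8, 10, 12
mean_value = statistics.mean([4, 6, 8, 10, 12])
print(mean_value)

# Median of a dataset with an even number of points:
# the average of the two middle values
median_value = statistics.median([2, 4, 6, 8, 10, 12])
print(median_value)  # 7.0

# multimode returns every value with the highest frequency,
# since a dataset can have more than one mode
modes = statistics.multimode([1, 2, 2, 3, 4, 4, 5])
print(modes)  # [2, 4]
```

Note that `statistics.mode` raises an error on older Python versions when the mode is not unique, which is why `multimode` is the safer choice here.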

Figure 1.1 shows a visual representation of mean, median, and mode.

Figure 1.1: Visual Representation of Mean, Median, and Mode

1.2.2 MEASURES OF DISPERSION


Measures of dispersion, also known as measures of variability, are statistical
measures that
describe how spread out or dispersed a set of data is. These measures provide
information about
the spread or variability of the data around a central value, such as the mean or
median.
Figure 1.2 shows the dispersion of two datasets having the same mean. It shows that the data values of the red graph are less diverse than those of the blue graph.

Figure 1.2: Dispersion Graph of Two Datasets

Some common measures of dispersion include:


 Range: The range is the difference between the maximum and minimum values in a dataset. It provides a simple measure of the spread of the data, but it is sensitive to outliers.
Example: Consider a dataset with the values: 13, 19, 30, 50, 20, 32, 17, 28, 33, 15, 34, 44.
Max value in the data: 50
Min value in the data: 13
Range = Max value - Min value = 50 - 13 = 37
 Interquartile Range (IQR): The IQR is the range of the middle 50% of the data. It is less sensitive to outliers than the range and provides a measure of the spread of the ‘typical’ values in the data. It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data:
IQR = Q3 - Q1
Example: Consider the dataset: 13, 19, 30, 50, 20, 32, 17, 28, 33, 15, 34, 44
Arrange the data in ascending order: 13, 15, 17, 19, 20, 28, 30, 32, 33, 34, 44, 50
Q1 = 25th percentile value: 17
Q3 = 75th percentile value: 33
IQR = Q3 - Q1 = 33 - 17 = 16
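A quick sketch (added for illustration): Python’s `statistics.quantiles` computes the quartiles, although its default interpolation differs slightly from the simple convention used in the text above, so its IQR is close to, but not exactly, 16:

```python
import statistics

data = [13, 19, 30, 50, 20, 32, 17, 28, 33, 15, 34, 44]

# Range: difference between the maximum and minimum values
data_range = max(data) - min(data)
print(data_range)  # 37

# Quartiles via the default 'exclusive' method;
# the cut points are Q1, Q2 (the median), and Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
print(iqr)  # 16.25 under this interpolation method
```

Different quantile conventions (inclusive vs. exclusive, nearest-rank, and so on) give slightly different quartiles on small samples, which is why the result does not match the textbook’s simpler convention exactly.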

 Variance: The variance is the average of the squared deviations of each data point from the mean. It provides a measure of the spread of the data, but is influenced by extreme values. The formula to calculate variance is:

𝝈² = ∑(𝒙𝒊 − 𝒙̅)² / 𝒏

Where 𝒙𝒊 are the data values, 𝒙̅ is the mean of the data, and 𝒏 is the number of values in the data.
Example: Consider the data values: 2, 3, 5, 6, 8, 10, 12, 18
Mean of data: (2+3+5+6+8+10+12+18)/8 = 64/8 = 8
Variance = [(2−8)² + (3−8)² + (5−8)² + (6−8)² + (8−8)² + (10−8)² + (12−8)² + (18−8)²]/8
= (36 + 25 + 9 + 4 + 0 + 4 + 16 + 100)/8
= 194/8 = 24.25
 Standard Deviation: The standard deviation is the square root of the variance. It provides a measure of the spread of the data in the same units as the original data and is a commonly used measure of dispersion. The formula to calculate standard deviation is:

𝝈 = √(∑(𝒙𝒊 − 𝒙̅)² / 𝒏)

Example: In the example of variance, the variance of the data is 24.25.
Hence, the standard deviation of the same data is √24.25 = 4.92.
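As an illustrative sketch, the population variance and standard deviation from the example above can be reproduced with Python’s `statistics` module; `pvariance` and `pstdev` divide by n, matching the formula in the text:

```python
import statistics

data = [2, 3, 5, 6, 8, 10, 12, 18]

variance = statistics.pvariance(data)  # population variance: divides by n
std_dev = statistics.pstdev(data)      # square root of the population variance

print(variance)           # 24.25
print(round(std_dev, 2))  # 4.92
```

The sample versions, `statistics.variance` and `statistics.stdev`, divide by n − 1 instead and would give slightly larger values.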

1.2.3 SKEWNESS AND KURTOSIS


Skewness and kurtosis are both measures of the shape of a distribution.
Skewness measures the degree of asymmetry of a distribution. If a distribution is perfectly symmetric, its skewness is zero. If the tail of a distribution extends further to the right than to the left, the distribution is positively skewed and its skewness value is positive (Refer to Figure 1.3). If the tail of a distribution extends further to the left than to the right, the distribution is said to be negatively skewed and its skewness value is negative. Skewness can be calculated using different formulas, such as Pearson's skewness coefficient or the sample skewness coefficient. The formula to calculate skewness is:

𝑺𝒌𝒆𝒘𝒏𝒆𝒔𝒔 = 𝟑(𝑴𝒆𝒂𝒏 − 𝑴𝒆𝒅𝒊𝒂𝒏) / 𝑺𝒕𝒂𝒏𝒅𝒂𝒓𝒅 𝑫𝒆𝒗𝒊𝒂𝒕𝒊𝒐𝒏

If Mean > Median, then the data is positively skewed.
If Mean < Median, then the data is negatively skewed.
If Mean = Median, then the data has no skewness.

Figure 1.3 shows the graph of a positively skewed, negatively skewed, and no skewed
data.

Figure 1.3: Positively Skewed, No Skew, and Negatively Skewed Graphs

Image Courtesy: [Link] Relationship_between_mean_and_median_under_different_skewness.png

Example: Consider the following dataset: 10, 50, 30, 20, 10, 20, 70
Mean of data = 30
Median of data = 20
Standard deviation = 20.70
Skewness = 3*(30−20)/20.70 = 30/20.70 = 1.45
This shows that the data is positively skewed because the result is positive.
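A small sketch (added for illustration) of Pearson’s skewness formula from the text, using only Python’s standard library:

```python
import statistics

data = [10, 50, 30, 20, 10, 20, 70]

mean = statistics.mean(data)       # 30
median = statistics.median(data)   # 20
std_dev = statistics.pstdev(data)  # population standard deviation, about 20.70

# Pearson's second skewness coefficient: 3 * (mean - median) / standard deviation
skewness = 3 * (mean - median) / std_dev
print(round(skewness, 2))  # 1.45 -> positive, so the data is positively skewed
```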

Kurtosis measures the degree of peakedness of a distribution. Using the excess kurtosis convention, a normal distribution has a kurtosis value of 0, indicating a moderate degree of peakedness, also known as mesokurtic. A distribution with kurtosis greater than 0 is said to be leptokurtic, meaning it has a sharper peak and heavier tails than a normal distribution. A distribution with kurtosis less than 0 is said to be platykurtic, meaning it has a flatter peak and lighter tails than a normal distribution. The formula to calculate kurtosis is:

𝑲𝒖𝒓𝒕𝒐𝒔𝒊𝒔 = 𝒏 ∗ ∑(𝒙𝒊 − 𝒙̅)⁴ / (∑(𝒙𝒊 − 𝒙̅)²)²

This formula gives the raw kurtosis, for which a normal distribution has a value of 3; subtracting 3 gives the excess kurtosis used in the definitions above.
Figure 1.4 shows three types of kurtosis graphs.

Figure 1.4: Types of Kurtosis Graphs


Consider the dataset: 10, 50, 30, 20, 10, 20, 70
Mean of data = 30
∑(𝒙𝒊 − 𝒙̅)⁴ = (10−30)⁴ + (50−30)⁴ + (30−30)⁴ + (20−30)⁴ + (10−30)⁴ + (20−30)⁴ + (70−30)⁴
= 160000 + 160000 + 0 + 10000 + 160000 + 10000 + 2560000
= 3060000
(∑(𝒙𝒊 − 𝒙̅)²)² = [(10−30)² + (50−30)² + (30−30)² + (20−30)² + (10−30)² + (20−30)² + (70−30)²]²
= [400 + 400 + 0 + 100 + 400 + 100 + 1600]²
= 3000² = 9000000
Raw kurtosis = 7 * (3060000/9000000) = 2.38
Excess kurtosis = 2.38 − 3 = −0.62, which is less than 0, so the distribution is platykurtic.
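The computation of the raw kurtosis formula can be sketched in Python (added for illustration; the function name is mine, not from the text):

```python
def raw_kurtosis(data):
    """Raw kurtosis: n * sum((x - mean)^4) / (sum((x - mean)^2))^2."""
    n = len(data)
    mean = sum(data) / n
    fourth = sum((x - mean) ** 4 for x in data)
    squared = sum((x - mean) ** 2 for x in data)
    return n * fourth / squared ** 2

data = [10, 50, 30, 20, 10, 20, 70]
k = raw_kurtosis(data)
print(round(k, 2))      # 2.38
print(round(k - 3, 2))  # -0.62 -> excess kurtosis below 0: platykurtic
```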

1.3 Summary
 In Statistics, data is collected, analyzed, and interpreted.
 Statistical methods are used in a wide range of fields, including science, business, economics, engineering, and social sciences.
 Statistics has two branches: Descriptive Statistics and Inferential Statistics.
 Descriptive statistics includes measures of central tendency, such as mean, median, and mode, and measures of dispersion, such as range, variance, standard deviation, skewness, and kurtosis.
 Mean is the average value of the data, while median is the positional average.
 Mode is the most frequently occurring value in the data.
 Variance provides a measure of the spread of the data, but is influenced by extreme values.
 Standard Deviation is the square root of variance.
 Skewness measures the level of asymmetry in the data, whereas kurtosis measures the level of peakedness in the data.

1.4 Check Your Progress

1. Which one of the following is a measure of central tendency?
A. Mean B. Median
C. Standard Deviation D. Skewness

2. Which one of the following is a positional average?
A. Mean B. Median
C. Mode D. Variance

3. To which of these statistics is the interquartile range related?
A. Mean B. Range
C. Kurtosis D. Median

4. ___________ is the name of the graph in the kurtosis theory where the kurtosis > 0.
A. Mesokurtic B. Leptokurtic
C. Platykurtic D. Mountkurtic

5. When a graph is negatively skewed, then ____________.
A. Mean > Median B. Mean < Median
C. Mean < Mode D. Mean = Median

6. The formula for calculating the interquartile range is:
A. Q2 – Q1 B. Q3 – Q1
C. Q1 – Q3 D. Q1 – Q2
1.4.1 Answers for Check Your Progress

1. A
2. B
3. D
4. B
5. B
6. B

Try It Yourself
1. Find the mean of the first 15 whole numbers.
2. Find the mean of the following data: 2.2, 10.2, 14.7, 5.9, 4.9, 11.1, 10.5.
3. Find the median of the first 15 natural numbers.
4. Find the median of the following data: 1, 7, 2, 4, 5, 9, 8, 3.
5. The weights in kg of 10 students are as follows: 39, 43, 36, 38, 46, 51, 33, 44, 44, 43. Find the mode of this data. Is there more than one mode? If yes, why?
6. Consider the following table:

Denomination:    10  20   50   5  100
Number of notes: 40  30  100  50   10

Find the mode of the data.
7. The following observations are arranged in ascending order. The median of the data is 25. Find the value of x: 17, x, 24, x + 7, 35, 36, 46
8. The mean of 6, 8, x + 2, 10, 2x - 1, and 2 is 9. Find the value of x and also the values of the observations in the data.
9. Find the variance and standard deviation of the following data: 173, 149, 165, 157, 164.
10. Given the following information, find the variance: Mean = 179, n = 3000, SD = 9
11. Determine the skewness of the following data: 12, 13, 54, 56, 25. In addition, identify whether the data is positively skewed, negatively skewed, or zero skewed.

12. Determine the kurtosis of the following data: 61, 64, 67, 70, 73. Identify whether the graph of the data is platykurtic, leptokurtic, or mesokurtic.

Session 02
Introduction to Probability
This session delves deeper into probability and its distributions and various other concepts of probability, such as random variables and the Central Limit Theorem.
In this session, you will learn to:
Describe classical probability and probability distributions
Explain random variables and their expectation
Explain the Central Limit Theorem
2.1 EXPLORING PROBABILITY
Probability is a concept in statistics that quantifies the chance of occurrence of an event. It is expressed as a numerical value ranging from zero to one. A probability of zero signifies that an event is impossible, while a probability of one signifies that an event is guaranteed to occur. For example, if a fair coin is tossed, there are two possible outcomes: heads or tails. The probability of getting heads is 1/2 or 0.5, because getting heads is one favorable outcome out of two possible outcomes (getting heads or tails).

Probability can be calculated using the following formula:

𝑷(𝑬) = 𝑵𝒖𝒎𝒃𝒆𝒓 𝒐𝒇 𝒐𝒖𝒕𝒄𝒐𝒎𝒆𝒔 𝒊𝒏 𝒇𝒂𝒗𝒐𝒖𝒓 𝒐𝒇 𝑬 / 𝑻𝒐𝒕𝒂𝒍 𝒏𝒖𝒎𝒃𝒆𝒓 𝒐𝒇 𝒑𝒐𝒔𝒔𝒊𝒃𝒍𝒆 𝒐𝒖𝒕𝒄𝒐𝒎𝒆𝒔

Where P(E) represents the probability of event E occurring.
For example:
The probability of getting a 5 when a fair dice is rolled can be calculated by:
P(Getting a 5) = Number of 5s on a dice / Total number of possible outcomes = 1/6
The probability of getting a queen from a deck of 52 cards can be calculated by:
P(Getting a queen) = Number of queens in the deck / Total number of cards in the deck = 4/52 = 1/13
The probability of getting the letter B from a bag containing the letters A to Z can be calculated by:
P(Getting a B) = Number of letter Bs in the bag / Total number of letters in the bag = 1/26
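A minimal sketch (added for illustration; the function name is mine) of the classical probability formula, using Python’s `fractions` module to keep the results exact:

```python
from fractions import Fraction

def classical_probability(favorable, total):
    """P(E) = number of favorable outcomes / total number of possible outcomes."""
    return Fraction(favorable, total)

print(classical_probability(1, 6))   # 1/6  -> rolling a 5 on a fair dice
print(classical_probability(4, 52))  # 1/13 -> drawing a queen from a 52-card deck
print(classical_probability(1, 26))  # 1/26 -> drawing the letter B from A-Z
```

`Fraction` reduces 4/52 to 1/13 automatically, mirroring the simplification in the text.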
Probability is used extensively in fields such as mathematics, statistics, science,
and engineering
to analyze and make predictions about uncertain events. It is an essential concept
in
understanding and modeling random phenomena and is the foundation of many important
applications, such as risk analysis, decision-making, and Data Science.
2.1.1 BASICS OF PROBABILITY
Following are the basic probability terms:
 Trial and Event: A trial is an experiment performed under identical conditions. It gives distinct outcomes and results in any of the several possible outcomes of the experiment. For example, rolling a dice is a trial and the possible outcomes are 1, 2, 3, 4, 5, and 6. Each time a dice is rolled it may give a different outcome, but the outcome will be from 1 to 6.
An event is the outcome of a trial. If the dice is rolled and the outcome is a 2, this is an event.
 Sample Space: The set of possible outcomes of an experiment is known as the sample space. For example, the sample space of a dice is {1, 2, 3, 4, 5, 6}.
 Exhaustive Events: Exhaustive events together include all possible outcomes of an experiment; if a set of events is exhaustive, it covers every outcome that could happen. For example, suppose a dice is rolled. The possible outcomes are 1, 2, 3, 4, 5, and 6. If one defines the event ‘rolling an even number’, then its complementary event is ‘rolling an odd number’, and together these two events are exhaustive.
 Mutually Exclusive Events: Mutually exclusive events cannot occur simultaneously.
If one of the events occurs, then the other event(s) cannot occur at the same time. For example, suppose a dice is rolled; the possible outcomes are 1, 2, 3, 4, 5, or 6. The events ‘rolling an even number’ and ‘rolling an odd number’ are mutually exclusive because a number on the dice cannot be both even and odd at the same time.
It is important to note that mutually exclusive events are not the same as
independent
events. Two events can be mutually exclusive but dependent, meaning that the
occurrence
of one event affects the probability of the other event occurring. Conversely, two
events can
be independent, but not mutually exclusive, meaning that both events can occur at
the same
time.
 Equally Likely Events: Equally likely events have the same probability of occurring. If there are n possible outcomes of an experiment and each of these outcomes is equally likely to occur, then the probability of each outcome is 1/n. For example, if a fair six-sided dice is rolled, the possible outcomes are 1, 2, 3, 4, 5, and 6, and each outcome has the same probability of 1/6, or approximately 16.7%.
Since each outcome is equally likely, the number of favorable outcomes for each
event is the
same. The total number of possible outcomes is simply the number of outcomes in the
sample space.
 Independent Events: Independent events have no influence on each other. The probability of one event occurring does not affect the probability of the other event occurring. For example, flipping a coin and rolling a dice are independent events because the outcome of one does not affect the outcome of the other. The probability of getting heads on the coin flip is 1/2. The probability of rolling a 6 on the dice is 1/6. These probabilities do not change based on whether the other event occurred or not.
Mathematically, the probability of two independent events A and B occurring together is equal to the product of their individual probabilities: P(A and B) = P(A) × P(B).
For instance, suppose two dice are rolled and it is required to find the probability of getting a 3 on the first dice and a 6 on the second dice. Since the events of getting a 3 on the first dice and getting a 6 on the second dice are independent, their individual probabilities can be multiplied, that is, P(3 and 6) = P(3) × P(6) = 1/6 × 1/6 = 1/36.
Knowing whether events are independent or dependent is important in calculating
probabilities and making predictions in various fields, such as economics, finance,
and
science.
 INTERSECTION AND UNION: In probability theory, the union and intersection of events are important concepts.
The union of two events A and B, denoted by A ∪ B, is the event that either A or B or both occur. In other words, A ∪ B occurs if either A occurs, or B occurs, or both occur.
The intersection of two events A and B, denoted by A ∩ B, is the event that both A and B occur. In other words, A ∩ B occurs if and only if both A and B occur simultaneously.

The probability of the union of two events can be calculated using the formula:

𝑷(𝑨 ∪ 𝑩) = 𝑷(𝑨) + 𝑷(𝑩) − 𝑷(𝑨 ∩ 𝑩)

Where P(A) is the probability of event A, P(B) is the probability of event B, and P(A ∩ B) is the probability of the intersection of events A and B.
Note that the formula subtracts the probability of the intersection of events A and B to avoid double counting it.
The probability of the intersection of two events can be calculated using the formula:

𝑷(𝑨 ∩ 𝑩) = 𝑷(𝑨|𝑩) ∗ 𝑷(𝑩)

Where P(B) is the probability of event B and P(A|B) is the conditional probability of event A given that event B has occurred.
Note that the conditional probability P(A|B) is the probability of event A occurring given that event B has already occurred.
These concepts of union and intersection are fundamental in probability theory and
are used
extensively in various applications such as statistics, machine learning, and
decision-making.
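The union formula can be checked with a small Python sketch (added for illustration; the event choices and helper name are mine) over the sample space of a dice, where A is ‘rolling an even number’ and B is ‘rolling 3 or less’:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # rolling an even number
B = {1, 2, 3}  # rolling 3 or less

def prob(event):
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(len(event), len(sample_space))

# P(A or B) = P(A) + P(B) - P(A and B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)  # 5/6 5/6
```

Representing events as sets makes the correspondence direct: set union `|` is the event "A or B" and set intersection `&` is the event "A and B".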
2.1.2 CONDITIONAL PROBABILITY – BAYES THEOREM
Conditional probability is the probability that an event occurs given that another event has already occurred. It is denoted by P(A|B), which means the probability of event A given that event B has already occurred. The formula for conditional probability is:

𝑷(𝑨|𝑩) = 𝑷(𝑨 ∩ 𝑩) / 𝑷(𝑩)

Where P(A ∩ B) is the probability of both events A and B occurring and P(B) is the probability of event B occurring.
For example, there is a bag with 10 marbles, six of which are red and four of which are blue. If a marble is randomly selected from the bag, the probability of getting a red marble is:
P(Red) = 6/10 = 0.6
Now, consider that the red marble is put back in the bag and another marble is randomly selected. The probability of getting a blue marble on the second selection, given that the first marble was red, is:
P(Blue|Red) = P(Blue ∩ Red) / P(Red)
Since the red marble is put back in the bag, the two selections are independent, so P(Blue ∩ Red) = P(Red) × P(Blue) = 0.6 × 0.4 = 0.24. So:
P(Blue|Red) = 0.24 / 0.6 = 0.4
This means that given that a red marble was selected first, the probability of selecting a blue marble on the second selection is still 0.4 or 40%, which equals P(Blue), exactly as expected for independent selections.

 BAYES THEOREM: Bayes' theorem is a fundamental concept in probability theory that describes how to update our beliefs about an event based on new evidence or information. It states that the probability of an event A given that event B has occurred, P(A|B), is equal to the probability of event B given that event A has occurred, P(B|A), multiplied by the prior probability of event A, P(A), and divided by the prior probability of event B, P(B):

𝑷(𝑨|𝑩) = 𝑷(𝑩|𝑨) ∗ 𝑷(𝑨) / 𝑷(𝑩)

Where,
P(A|B) is the conditional probability of event A given event B.
P(B|A) is the conditional probability of event B given event A.
P(A) is the prior probability of event A.
P(B) is the prior probability of event B.
Here is an example of how Bayes' theorem can be applied in a real-world scenario:


Suppose a doctor is trying to diagnose a patient with a rare disease. The disease
affects about 1%
of the population, so the prior probability of someone having the disease is 0.01.
A test is run that is 95% accurate. This means if someone has the disease, there is
a 95% chance
the test will be positive. If someone does not have the disease, there is a 95%
chance the test
will be negative.
A patient takes the test and the result comes back positive. What is the
probability that the
patient actually has the disease?
Bayes' theorem can be used to answer this question.
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the probability that the patient has the disease given that the test is positive.
P(B|A) is the probability that the test is positive given that the patient has the disease.
P(A) is the prior probability that the patient has the disease.
P(B) is the probability that the test is positive.
Using the information given in the problem, the values can be plugged in and calculated as shown:
P(A|B) = 0.95 * 0.01 / (0.95 * 0.01 + 0.05 * 0.99)
P(A|B) ≈ 0.16
Therefore, the probability that the patient actually has the disease given a positive test result is only about 16%. This may seem counterintuitive, but it is because the test is not 100% accurate and there is a relatively low prior probability of someone having the disease.
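The calculation can be sketched in Python (added for illustration; the function and parameter names are mine, not from the text):

```python
def bayes_posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem.

    P(B) is expanded with the law of total probability:
    P(positive) = P(pos|disease)*P(disease) + P(pos|no disease)*P(no disease)
    """
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 95% accurate test (sensitivity = specificity = 0.95)
posterior = bayes_posterior(prior=0.01, sensitivity=0.95, specificity=0.95)
print(round(posterior, 2))  # 0.16
```

Expanding P(B) with the law of total probability is exactly the step performed in the denominator 0.95 * 0.01 + 0.05 * 0.99 above.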

2.2 RANDOM VARIABLE


A random variable is a variable that can take on different values based on the outcome of a random event. It is a function that maps the outcomes of a random experiment to a set of numerical values.
For example, when tossing a fair coin twice, the outcome of each toss can be either heads or tails. A random variable X can be defined as the number of heads obtained in the two coin tosses. In this case, X can take on the values 0, 1, or 2, depending on the outcome of the tosses.
Random variables can be discrete or continuous, depending on the set of possible values that they can take.
© Aptech Limited

EXPECTATION OF A RANDOM VARIABLE: The expectation of a random variable is a measure of the central tendency of the variable. It is also known as the expected value or the mean of the random variable.
The expectation of a discrete random variable X is defined as the sum of the products of the possible values of X and their corresponding probabilities:

𝑬(𝑿) = ∑ 𝒙 ∗ 𝑷(𝑿 = 𝒙)

Where the summation is taken over all possible values of X.


For a continuous random variable X, the expectation is defined as the integral of the product of the variable and its probability density function:

𝑬(𝑿) = ∫ 𝒙 ∗ 𝒇(𝒙) 𝒅𝒙

Where the integral is taken over the entire range of X.

Here is an example of finding the expectation of a discrete random variable:


Suppose X is a random variable that represents the number on a fair six-sided dice.
That is, X can take on the values {1, 2, 3, 4, 5, 6}, each with probability 1/6.
To find the expectation of X, denoted E(X), the following formula is used:
E(X) = Σ[x * P(X = x)], where the sum is taken over all possible values x that X
can take on.
In this case:
E(X) = (1 * 1/6) + (2 * 1/6) + (3 * 1/6) + (4 * 1/6) + (5 * 1/6) + (6 * 1/6)
= 3.5
Therefore, the expectation of X is 3.5. This means that if the dice is rolled
many times and the average value is taken, the expected average would be
approximately 3.5.
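This calculation can be verified with a short Python sketch (the `pmf` dictionary and variable names are illustrative, not from the text):

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face 1..6 has probability 1/6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum over x of x * P(X = x)
expectation = sum(x * p for x, p in pmf.items())

print(expectation)         # 7/2
print(float(expectation))  # 3.5
```

Using exact fractions avoids any floating-point rounding in the intermediate sums.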
Discrete random variables can take on only a countable number of values. Examples
of discrete random variables include the number of cars sold by a dealership in a
month and the number of defects in a batch.
Continuous random variables can take on any value within a certain range. Examples
of continuous random variables include the weight of a product, the height of a
person, or the temperature of a room.
The probability distribution of a random variable describes the probabilities
associated with each possible value of the random variable. For discrete random
variables, the probability distribution is often given in the form of a Probability
Mass Function (PMF). For continuous random variables, it is often given in the form
of a Probability Density Function (PDF).
2.3 DISCRETE PROBABILITY DISTRIBUTION
A discrete probability distribution is a probability distribution that can take on
a countable
number of possible values. The distribution is defined by a PMF, which specifies
the probability
of each possible value.
The PMF is defined as follows:
For a random variable X that can take on values x1, x2, x3, ..., the PMF is a
function that maps each possible value xi to its probability, denoted by P(X=xi).
The PMF must satisfy the following conditions:
Non-negative values: P(X=xi) ≥ 0 for all xi.
Summing to 1: ∑ P(X=xi) = 1 over all possible values of X.
Examples of discrete probability distributions include the Bernoulli distribution,


binomial
distribution, and the Poisson distribution. These distributions are often used to
model random
events that have a finite or countably infinite number of possible outcomes. This
could be the
number of successes in a fixed number of trials or the number of events that occur
in a given
time interval.
The properties of discrete probability distributions, such as expected value,
variance, and
standard deviation, can be calculated using the PMF. These properties can provide
useful
information about the distribution and can be used to make predictions and
decisions in
real-world applications.

2.3.1 BERNOULLI DISTRIBUTION


Bernoulli trials are a sequence of independent experiments or trials, each of which
can have only two outcomes, usually denoted as success (S) and failure (F), with
probabilities of success and failure denoted as p and q = 1 - p, respectively. The
distribution of a single such trial is known as the Bernoulli distribution.
Each trial has the same probability of success and the outcome of each trial does
not affect the outcome of subsequent trials. Examples of Bernoulli trials include
flipping a coin, rolling a die to obtain a certain number, or shooting a basketball
to make a shot.
The probability of getting exactly k successes in n Bernoulli trials is given by
the binomial distribution, which is a discrete probability distribution. The
binomial distribution is a generalization of the Bernoulli distribution to
multiple trials.
For example, drawing a card from a deck of cards can also be considered a Bernoulli
trial. If one is interested in drawing a red card, then the probability of success
is 26/52 (assuming a standard deck of cards) and the probability of failure is
26/52.
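A minimal simulation sketch of this Bernoulli trial in Python (the variable names and the number of simulated trials are illustrative assumptions):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

p = 26 / 52  # probability of success: drawing a red card

# Simulate 100,000 independent Bernoulli trials
trials = [1 if random.random() < p else 0 for _ in range(100_000)]

# The observed success rate should be close to p
success_rate = sum(trials) / len(trials)
print(round(success_rate, 2))  # ≈ 0.5
```

Because the trials are independent and identically distributed, the observed rate converges toward p as the number of trials grows.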

2.3.2 BINOMIAL DISTRIBUTION


The binomial distribution is a discrete probability distribution. It describes the
probability of a
certain number of successes in a fixed number of independent trials. Here, each
trial has only
two possible outcomes (success or failure) and the probability of success remains
constant
across all trials.

The formula for the binomial distribution is:

P(X = k) = C(n, k) * p^k * q^(n-k)

Where:
P(X = k) is the probability of getting k successes in n trials
C(n, k) is the binomial coefficient, also known as "n choose k", which represents
the number of ways to choose k successes from n trials and is calculated as:
n! / (k! * (n - k)!)
p is the probability of success in a single trial
q = (1 - p) is the probability of failure in a single trial
k is the number of successes in n trials

Figure 2.1 shows probability curve of binomial distribution.

Figure 2.1: Probability Curve of a Binomial Distribution


Here is an example of binomial distribution.
Suppose a survey is conducted on proportion of people who prefer cats over dogs as
pets. 100
people are randomly surveyed and their responses are recorded. Assume that the true
proportion of people who prefer cats over dogs in the population is 0.6.
This situation can be modeled using the Binomial distribution. Let X be the number
of people in
the sample who prefer cats. Then, X follows a Binomial distribution with n = 100
and p = 0.6,
where, n is the sample size and p is the true proportion of people who prefer cats.
Suppose one wants to find the probability that at most 50 people in the sample
prefer cats. This
probability can be calculated using the Cumulative Distribution Function (CDF) of
the Binomial
distribution:
P(X <= 50) = Σ (from k = 0 to 50) of C(100, k) * (0.6)^k * (1 - 0.6)^(100 - k)
Using a calculator or a software package, this sum can be evaluated to find that
P(X <= 50) is approximately 0.027. This means that the probability of getting a
sample with at most 50 people who prefer cats is quite low, given that the true
proportion of people who prefer cats is 0.6.
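The CDF sum above can be evaluated directly with Python's standard library (a sketch; variable names are illustrative and no statistics package is required):

```python
from math import comb

n, p = 100, 0.6

def binom_pmf(k: int) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# CDF: P(X <= 50) = sum of P(X = k) for k = 0..50
prob_at_most_50 = sum(binom_pmf(k) for k in range(51))
print(round(prob_at_most_50, 3))  # ≈ 0.027
```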

2.3.3 POISSON DISTRIBUTION


The Poisson distribution is a discrete probability distribution. It describes the
probability of a
given number of events occurring in a fixed interval of time or space. It assum es
that these
events occur independently and at a constant average rate.

The probability mass function of the Poisson distribution is given by:

P(X = k) = (λ^k * e^(-λ)) / k!

Where λ is the average rate of occurrence of the events in the interval and k is
the number of events that occur in that interval.
Some key properties of the Poisson distribution include:
The mean and variance of the distribution are both equal to λ.
The Poisson distribution is a limiting case of the binomial distribution. It occurs
when the number of trials goes to infinity and the probability of success in each
trial goes to zero, while the product np = λ remains constant.
The Poisson distribution is often used as an approximation for the binomial
distribution when the number of trials is large and the probability of success in
each trial is small, but the product np = λ is moderate to large.
Figure 2.2 shows Poisson distribution.

Figure 2.2: Poisson Distribution


Here is an example of how to use the Poisson distribution to calculate the
probability of a certain
number of events occurring within a given time period.
Suppose a call center receives an average of four customer calls per hour. What is
the probability
that it will receive exactly six calls in the next hour?
To solve this problem, first, identify the values of the Poisson distribution's
parameters:
λ (lambda) represents the average rate of occurrence of the event (in this case,
customer calls
per hour).
x represents the number of events to calculate the probability for (in this case,
six calls in the
next hour).
Using these values, one can calculate the probability of receiving exactly six
calls in the next hour
with the Poisson distribution formula:
P(X = x) = (e^(-λ) * λ^x) / x!
Where e is the mathematical constant e (approximately equal to 2.71828) and x!
represents the factorial of x (that is, x * (x-1) * (x-2) * ... * 2 * 1).
Plugging in the values from the example:
 P(X = 6) = (e^(-4) * 4^6) / 6!
 P(X = 6) = (0.01832 * 4096) / 720
 P(X = 6) ≈ 0.1042
The probability of receiving exactly six calls in the next hour is approximately
10.42%.
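The same calculation as a short Python sketch (variable names are illustrative):

```python
from math import exp, factorial

lam = 4  # average rate: calls per hour
k = 6    # number of calls whose probability we want

# Poisson PMF: P(X = k) = (lambda^k * e^(-lambda)) / k!
prob = (lam**k * exp(-lam)) / factorial(k)
print(round(prob, 4))  # 0.1042
```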

2.4 CONTINUOUS PROBABILITY DISTRIBUTION


A continuous probability distribution is a mathematical function that describes the
probabilities
of different outcomes in a continuous random variable.
Continuous probability distributions are used to model real-world phenomena that
involve continuous measurements, such as time, distance, or temperature. Examples
of continuous probability distributions include the normal distribution and the
uniform distribution.
Continuous probability distributions are important in statistical inference and
hypothesis testing,
as they allow us to calculate probabilities of events and make predictions about
future
outcomes.
2.4.1 UNIFORM DISTRIBUTION
The uniform distribution is a continuous probability distribution that models
random variables
with a constant probability density function over a specified interval. This means
that any value
within the interval is equally likely to occur.

The PDF of a uniform distribution is given by:

f(x) = 1 / (b - a) for a ≤ x ≤ b

Where a and b are the lower and upper bounds of the interval and f(x) represents
the probability density function at any point x within the interval.
Some key properties of the uniform distribution include:
The mean of the distribution is (a + b) / 2.
The variance of the distribution is (b - a)^2 / 12.
The CDF of a uniform distribution is given by:
F(x) = 0 for x < a
F(x) = (x - a) / (b - a) for a ≤ x ≤ b
F(x) = 1 for x > b

Here is an example of calculating the probability density function for a uniform
distribution.
Suppose there is a continuous random variable X that follows a uniform distribution
on the interval [a, b], where a = 2 and b = 8. The probability density function of
X is:
f(x) = 1/(b - a) = 1/(8 - 2) = 1/6, for a ≤ x ≤ b
The uniform distribution is often used in simulations and models where randomness
is required but there is no preference for any particular outcome. Examples include
generating random numbers, selecting random elements from a set, or modeling the
distribution of measurements that are subject to random error.
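The PDF and CDF formulas above can be sketched in Python as follows (the helper function names are illustrative):

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Probability density of Uniform(a, b) at x."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x: float, a: float, b: float) -> float:
    """P(X <= x) for X ~ Uniform(a, b)."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2, 8
print(round(uniform_pdf(5, a, b), 4))  # 1/6 -> 0.1667
print(uniform_cdf(5, a, b))            # (5-2)/(8-2) = 0.5
```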
2.4.2 NORMAL DISTRIBUTION
The normal distribution is a continuous probability distribution. It is also known
as the Gaussian
distribution. It is widely used in statistics, probability theory, and other fields
to model random
variables that have a bell-shaped probability density function.
The normal distribution is distinguished by its parameters, mean (μ) and standard
deviation (σ),
which determine the shape, center, and spread of the distribution. The mean is the
central value
of the distribution, while the standard deviation measures the variability or
spread of the data
around the mean.
The probability density function of the normal distribution is given by the
following equation:

f(x) = (1 / (σ√(2π))) * e^(-(1/2)((x - μ)/σ)²)

Where x is the random variable, μ is the mean, σ is the standard deviation, π is
the mathematical constant pi and e is the base of the natural logarithm.
Figure 2.3 shows a normal distribution graph.

Figure 2.3: Normal Distribution

Here is an example of how to calculate the PDF of a normal distribution.
Suppose there is a normally distributed variable X with mean μ = 10 and standard
deviation σ = 2. One needs to find the PDF at a specific point x = 12.
The PDF of a normal distribution is given by the formula:

f(x) = (1 / (σ√(2π))) * e^(-(1/2)((x - μ)/σ)²)

Plugging in the values:

f(12) = (1 / (2√(2π))) * e^(-(1/2)((12 - 10)/2)²)

Simplifying the exponent:
f(12) ≈ 0.120985
So, the PDF of the normal distribution at x = 12 is approximately 0.120985. For a
continuous variable, the probability of X being exactly equal to 12 is zero; the
density value instead indicates that values near 12 are relatively likely.
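The same evaluation as a Python sketch (the `normal_pdf` helper is illustrative):

```python
from math import sqrt, pi, exp

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma) ** 2)

# mu = 10, sigma = 2, evaluated at x = 12
print(round(normal_pdf(12, 10, 2), 6))  # 0.120985
```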
The normal distribution has several important properties, such as the empirical
rule or 68-95-99.7 rule. This states that approximately 68% of the observations in
a normal distribution are within one standard deviation of the mean, about 95% of
the observations are within two standard deviations of the mean, and almost all
observations (99.7%) are within three standard deviations of the mean.
Figure 2.4 shows the visual representation of 68- 95-99.7 rule.

Figure 2.4: 68-95-99.7 rule


Image Courtesy:
[Link]
File:Empirical_rule_histogram.svg
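The three percentages of the empirical rule can be recovered from the standard library's error function (a sketch; `within_k_sigma` is an illustrative helper):

```python
from math import erf, sqrt

def within_k_sigma(k: float) -> float:
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```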
The normal distribution is widely used in statistical inference, hypothesis
testing, and confidence
interval estimation, as well as in many other applications in science, engineering,
finance, and
social sciences.

2.5 CENTRAL LIMIT THEOREM
The concept of Central Limit Theorem (CLT) describes the behavior of the means of a
large
number of independent random variables. In simple terms, the CLT states that if
many random
samples are taken from any population, then the means of those samples will
approximate a
normal distribution. This is regardless of the shape of the population's original
distribution.
Figure 2.5 shows a visual representation of the central limit theorem.

Figure 2.5: Visual Representation of CLT


Image Courtesy:
[Link]
[Link]
The theorem has broad applicability in many areas of science and engineering,
including finance,
physics, engineering, and biology. It is often used in hypothesis testing,
confidence intervals, and
statistical inference.
The CLT has several important implications. Firstly, it suggests that the
distribution of sample
means will be less variable than the distribution of individual observations. This
makes it easier
to draw conclusions about the population from a representative sample. Secondly, it
implies that
the normal distribution is a good approximation for many types of data, making it
easier to
calculate probabilities and perform statistical analyses. Finally, the CLT
highlights the importance
of large sample sizes in statistical analysis, since smaller samples may not
accurately reflect the
properties of the population.
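A small simulation sketch illustrating the CLT with a heavily skewed (exponential) population; the sample size and number of samples are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n: int) -> float:
    """Mean of n draws from an exponential population (mean 1, sd 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Collect the means of 2,000 samples of size 50 each
means = [sample_mean(50) for _ in range(2_000)]

# By the CLT, the sample means cluster around the population mean (1.0)
# with spread close to sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.14,
# even though the population itself is far from normal.
print(round(statistics.fmean(means), 2))  # close to 1.0
print(round(statistics.stdev(means), 2))  # close to 0.14
```

Plotting a histogram of `means` would show the familiar bell shape despite the skewed population.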

Assumptions of CLT:
Assumptions of the central limit theorem include:
Independence: The sample data should be collected independently of each other.
Sample Size: The sample size should be large enough to ensure that the sample mean
is approximately normally distributed.
Identical Distribution: The population from which the sample is drawn should have
an identical distribution.
Finite Variance: The population from which the sample is drawn should have a
finite variance.
It is important to note that the central limit theorem applies to a wide range of
distributions, regardless of whether the underlying distribution is normal or not.
Additionally, the central limit theorem holds even when the sample size is small,
provided that the underlying distribution is approximately symmetric and not too
skewed.
2.6 Summary

 Probability is a measure of the chance of an event occurring. It is a numerical
value between zero and one. Zero indicates that an event is impossible and one
indicates that an event is certain to happen.
 Trial is an experiment conducted under identical conditions whose result is any
one of the several possible outcomes of the experiment.
 Event is the outcome of a trial.
 The set of possible outcomes of an experiment is known as a sample space.
 Mutually exclusive events cannot occur simultaneously. If one of the events
occurs, then the other event(s) cannot occur at the same time.
 Independent events have no influence on each other. The probability of one event
occurring
does not affect the probability of the other event occurring.
 Conditional probability is the probability of an event occurring given that
another event has
already occurred.
 A random variable can take on different values based on the outcome of a random
event. It
is a function that maps the outcomes of a random experiment to a set of numerical
values.
 Discrete random variables can take on only a countable number of values, whereas
continuous random variables can take on any value within a certain range.
 The expectation of a random variable is a measure of the central tendency of the
variable. It is also known as the expected value or the mean of the random
variable.
 Bernoulli trials are a sequence of independent experiments or trials that can
have only two
outcomes, usually denoted as success (S) and failure (F).
 The binomial distribution is a discrete probability distribution that describes
the probability
of a certain number of successes in a fixed number of independent trials. Each
trial has only
two possible outcomes (success or failure) and the probability of success remains
constant
across all trials.
 The Poisson distribution describes the probability of a given number of events
occurring in a fixed interval of time or space. It assumes that these events occur
independently and at a constant average rate.
 The uniform distribution is a continuous probability distribution that models
random
variables with a constant probability density function over a specified interval.
This means
that any value within the interval is equally likely to occur.
 The normal distribution is characterized by its mean (μ) and standard deviation
(σ), which
determine the shape, center, and spread of the distribution. The mean is the
central value of
the distribution, while the standard deviation measures the variability or spread
of the data
around the mean.
 Central Limit Theorem states that if many random samples are taken from any
population,
the means of those samples will approximate a normal distribution. This will be
regardless of
the shape of the population's original distribution.

2.7 Check Your Progress

1. Which of the following is not a discrete probability distribution?

A. Bernoulli Trials B. Binomial Distribution


C. Normal Distribution D. Poisson Distribution

2. What is the probability of rolling a five on a standard six-sided dice?

A. 1/6 B. 1/5
C. 1/4 D. 1/3

3. A coin is flipped three times. What is the probability of getting exactly two
heads?

A. 1/8 B. 3/8
C. 1/4 D. 3/4

4. Which of the following is not a property of the normal distribution?

A. Symmetric B. Unimodal
C. Skewed D. Bell-Shaped
5. Which of the following is a discrete probability distribution?
A. Exponential B. Gamma
C. Weibull D. Bernoulli

6. Which of the following statements is TRUE about the Central Limit Theorem?

A. It guarantees that the sample mean will always be close to the population mean.
B. It applies to any population distribution, regardless of its shape.
C. It states that the distribution of the sample means will be identical to the
population distribution.
D. It requires that the sample size be less than 30.
7. Which of the following statements is TRUE about the Central Limit Theorem?

A. 5 B. 10
C. 50 D. 100

2.7.1 Answers for Check Your Progress

1. C
2. A
3. B
4. C
5. D
6. B
7. C
Try It Yourself
1. A jar contains four red balls and six black balls. Two balls are drawn at random
without
replacement. What is the probability that both balls are red?
2. In a class of 30 students, 18 are boys and 12 are girls. If a student is
selected at random,
what is the probability that the student is a boy?
3. A card is drawn at random from a standard deck of 52 cards. What is the
probability that it
is a face card (jack, queen, or king)?
4. A certain disease affects one in 1,000 people in a population. A test for the
disease correctly identifies 99% of people who have the disease and correctly
identifies 99% of people who do not have the disease. If a person tests positive
for the disease, what is the probability that they actually have the disease?
5. A company employs two types of workers: skilled and unskilled. 60% of the
skilled workers and 40% of the unskilled workers are union members. The union
represents 45% of the total workforce. If a worker is chosen at random from the
company and it is known that the worker is a union member, what is the probability
that the worker is skilled?
6. A company manufactures light bulbs and it is known that 2% of the bulbs are
defective. If a sample of 100 bulbs is randomly selected, what is the probability
that exactly three of them are defective?
7. A basketball player has a 70% success rate in free throws. If he attempts 10
free throws, what is the probability that he will make at least eight of them?
8. A manufacturer of electronic components claims that 5% of its components are
defective. If a sample of 200 components is randomly selected, what is the
probability that less than 10 of them are defective?
9. A company produces light bulbs, and the lifetimes of the bulbs follow a normal
distribution with a mean of 1,000 hours and a standard deviation of 100 hours. If
the company wants to guarantee that at least 90% of the bulbs last for at least
800 hours, what minimum lifetime should the bulbs be designed for?
10. Suppose that the time it takes for a machine to complete a task is uniformly
distributed between five and ten minutes. What is the probability that the machine
will take between six and eight minutes to complete the task?
11. A bag contains ten red balls and eight blue balls. Two balls are drawn at
random without replacement. If the first ball drawn is red, what is the
probability that the second ball drawn is blue?
Session 03
Correlation and Regression

This session introduces the concept of correlation and regression and how they
play a significant role in statistics.
In this session, you will learn to:
Explain correlation
Describe the rank of correlation
Explain regression and its types
3.1 WHAT IS CORRELATION?
Correlation describes the relationship between two variables. It is often used in
data science and
research. Correlation indicates the degree to which the variables are associated
with each other .
It can be either positive or negative. A positive correlation implies that as one
variable goes up,
the other variable typically rises as well. A negative correlation suggests that
when one variable
increases, the other tends to decrease . For example, there is a positive
correlation between
smoking and the risk of lung cancer. As the number of cigarettes smoked per day
increases, the
risk of developing lung cancer also increases. On the other hand, there is a
negative correlation
between exercise and body weight.
When exercise levels rise, body weight tends to drop.
The strength of the correlation can range from -1 to +1. A value of -1 indicates a
perfect negative correlation, +1 indicates a perfect positive correlation, and 0
indicates no correlation at all.
3.1.1 SCATTER CHART
A scatter chart displays the relationship between two numerical variables. It
consists of a set of
data points, each of which represents the values of the two variables for a single
observation.
The values of one variable are plotted along the horizontal axis, while the values
of the other variable are plotted along the vertical axis.
In a scatter chart, each data point is represented by a dot. The position of the
dot on the chart
corresponds to the values of the two variables for that observation. The chart can
be used to
identify patterns or trends in the data, as well as to identify outliers or unusual
observations.
Scatter charts are commonly used in scientific research, engineering, economics,
and other fields
to explore relationships between variables and to identify patterns in data. They
can also be
useful for visualizing data sets with a large number of observations.
Figure 3.1 shows an illustration of a scatter plot.
Figure 3.1: Scatter Plot
3.1.2 COEFFICIENT OF CORRELATION
The coefficient of correlation is denoted by r. It shows the strength and direction
of the linear relationship between two variables. It is a value that ranges from -1
to 1. A value of -1 signifies a complete negative correlation, where an increase in
one variable is matched by a decrease in the other; 0 indicates no correlation; and
1 indicates a perfect positive correlation, where an increase in one variable is
matched by an increase in the other. There are different methods used to calculate
the correlation between variables. The most commonly used are Karl Pearson's
coefficient of correlation and Spearman's rank correlation coefficient.

Karl Pearson’s Coefficient of Correlation


The formula for calculating the coefficient of correlation, r, between two
variables x and y with n observations is:

r = Cov(x, y) / (σx * σy)

Where x and y are the two variables, Cov(x, y) is the covariance between x and y,
and σx and σy are the standard deviations of x and y, respectively.
Here is an example on how to calculate Karl Pearson's coefficient of correlation.
Consider following data:
X Y
2 3
3 5
4 4
5 6
6 7

The formula for calculating the covariance is:

Cov(x, y) = Σ[(xi - x̄) * (yi - ȳ)] / n

The formula for calculating the standard deviation is:

σx = √(Σ(xi - x̄)² / n)

Calculations required by the formula will be the following:

Mean of x (x̄) = (2+3+4+5+6)/5 = 20/5 = 4
Mean of y (ȳ) = (3+5+4+6+7)/5 = 25/5 = 5

Table 3.1 refers to the calculations required to calculate the covariance according
to the given formula. Here, xi and yi are the values of X and Y from the data.

The calculation for one row is given as follows:
The data has 5 records. Hence, i will take the values 1, 2, 3, 4, 5.
Here x1 = 2 and x̄ = 4, so x1 - x̄ = 2 - 4 = -2
Similarly, y1 = 3 and ȳ = 5, so y1 - ȳ = 3 - 5 = -2
Hence, [x1 - x̄] * [y1 - ȳ] = -2 * -2 = 4
In a similar manner, all other values can be calculated.

X 𝒙𝒊− 𝒙̅ (𝒙𝒊−𝒙̅)𝟐 Y 𝒚𝒊− 𝒚̅ (𝒚𝒊−𝒚̅)𝟐 [𝒙𝒊− 𝒙̅]∗[𝒚𝒊− 𝒚̅]


2 -2 4 3 -2 4 4
3 -1 1 5 0 0 0
4 0 0 4 -1 1 0
5 1 1 6 1 1 1
6 2 4 7 2 4 4
Sum = 10 Sum= 10 Sum = 9
Table 3.1: Calculation of Covariance
Using the values in the formula, following will be the calculation:
Cov(X, Y) = 9/5 = 1.8
σx = (10/5)^(1/2) = 1.414
σy = (10/5)^(1/2) = 1.414
r = 1.8 / (1.414 * 1.414) = 0.9
So, the correlation coefficient between X and Y is 0.9. This indicates a strong
positive correlation between the two variables.
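The worked example can be verified with a short Python sketch (variable names are illustrative):

```python
from math import sqrt

x = [2, 3, 4, 5, 6]
y = [3, 5, 4, 6, 7]
n = len(x)

mean_x = sum(x) / n  # 4.0
mean_y = sum(y) / n  # 5.0

# Population covariance and standard deviations, as in the worked example
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
sd_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sd_y = sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

r = cov / (sd_x * sd_y)
print(round(r, 2))  # 0.9
```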
3.1.3 RANK CORRELATION
Rank correlation is used to assess the strength and direction of the relationship
between two variables by comparing their rankings. There are several types of rank
correlation measures, but the most commonly used one is Spearman's rank
correlation coefficient.

SPEARMAN’S RANK CORRELATION COEFFICIENT
Spearman's rank correlation coefficient is calculated using the following formula:

r = 1 - (6 Σd²) / (n(n² - 1))

Where Σd² is the sum of the squared differences between the ranks of the two
variables and n is the number of observations.
To calculate r, first assign ranks to the observations of each variable, then
calculate the differences between the ranks of the two variables. Finally,
substitute the values into the formula. The resulting value of r will range from
-1 to +1, with a value of 0 indicating no correlation and values closer to -1 or
+1 indicating stronger correlations.

Spearman's rank correlation coefficient measures the degree to which the ranks of
two variables are linearly related. It is denoted by the symbol ρ.
It ranges from -1 to +1. -1 indicates a perfect negative correlation, 0 indicates
no correlation, and +1 indicates a perfect positive correlation.
Spearman's rank correlation coefficient is based on the difference between the
ranks of the two variables and is robust to outliers.
Here is an example on how to calculate Spearman’s Rank Correlation Coefficient:
Consider following data for two variables, X and Y:

X Y
10 12
7 5
5 8
3 1
6 7
8 10
2 3
4 4
9 11
1 2

To calculate Spearman's rank correlation coefficient, first rank the values of each
variable, from
lowest to highest. Assign ranks by sorting the values and then assign ranks
according to their
position in the sorted list. If there are ties (that is, multiple values with the
same rank), assign the
average rank to those values.
It can be seen from the table that the elements are sorted in ascending order and
ranks are assigned starting from 1 to n (n = 10 here).
Table 3.2 shows the ranks for X and Y.

X Rank of X Y Rank of Y
10 10 12 10
7 7 5 5
5 5 8 7
3 3 1 1
6 6 7 6
8 8 10 8
2 2 3 3
4 4 4 4
9 9 11 9
1 1 2 2
Table 3.2: Ranks for X and Y
Table 3.3 shows the differences between the ranks of X and Y, which are
calculated, squared, and added.

X Rank of X Y Rank of Y Difference (d) = (Rank of X – Rank of Y) d2
10 10 12 10 0 0
7 7 5 5 2 4
5 5 8 7 -2 4
3 3 1 1 2 4
6 6 7 6 0 0
8 8 10 8 0 0
2 2 3 3 -1 1
4 4 4 4 0 0
9 9 11 9 0 0
1 1 2 2 -1 1
Sum(d2) = 14
Table 3.3: Differences Between Ranks of X and Y
Here is an example of one row of calculation:
Rank of first element of X = 10
Rank of first element of Y = 10
Rank of X – Rank of Y (d) = 0
d2 = 0
In a similar manner, the other rows can be calculated.
Hence, Sum of d2 = 14
Next, the formula for Spearman's rank correlation coefficient can be used:
r = 1 - (6 * 14) / (10(10² - 1)) = 1 - 84/990 ≈ 0.915
This shows that there is a strong positive correlation between the two variables.
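The ranking and the coefficient can be verified with a Python sketch (the `ranks` helper is illustrative and assumes no ties, as in this data):

```python
def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes all values are distinct."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

x = [10, 7, 5, 3, 6, 8, 2, 4, 9, 1]
y = [12, 5, 8, 1, 7, 10, 3, 4, 11, 2]

rx, ry = ranks(x), ranks(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
n = len(x)

r = 1 - (6 * d2) / (n * (n**2 - 1))
print(d2)           # 14
print(round(r, 3))  # 0.915
```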
Figure 3.2 shows the examples of scatter plots with various correlation
coefficients.

Figure 3.2: Scatter Plot with Various Correlation Coefficients


Image Courtesy:
[Link]
File:Pearson_Correlation_Coefficient_and_associated_scatterplots.png

PROBABLE ERROR OF RANK CORRELATION

The probable error of a rank correlation coefficient can be estimated using the
following formula:

PE = 0.6745 * (1 - r²) / √n

Where PE is the probable error, r is the sample rank correlation coefficient, and
n is the sample size.
This formula assumes that the underlying data are normally distributed and that
the sample is representative of the population. It also assumes that the sample
size is large enough for the central limit theorem to apply.
Suppose there are two variables, X and Y, and the ranks obtained for each variable
are as follows:
X: 3, 7, 6, 2, 5, 1, 4
Y: 2, 5, 3, 1, 7, 6, 4
To calculate the rank correlation coefficient (Spearman's rank), the method
illustrated earlier in this section can be used. The rank differences are
d = 1, 2, 3, 1, -2, -5, 0, so Σd² = 44 and:
r = 1 - (6 * 44) / (7(7² - 1)) = 1 - 264/336 ≈ 0.214

The probable error of a rank correlation coefficient measures the likely amount of
error or uncertainty in the sample estimate of the true rank correlation
coefficient. It is a measure of the variability of the sample rank correlation
coefficient that would be expected if the study were repeated many times.

Now, this value of r can be used in the formula to calculate the probable error:
Probable error = 0.6745 * (1 - 0.214²) / √7 ≈ 0.243
Therefore, the probable error of the rank correlation coefficient is approximately
0.243. On repeating the experiment with different samples of the same size, the
true value of the rank correlation coefficient would be expected to fall within
±0.243 of the observed value about 50% of the time.

3.2 REGRESSION
Regression is used to determine the relationship between a dependent variable
(also
known as the response variable) and one or more independent variables (also known
as
predictor variables).
The goal of regression analysis is to create a mathematical model that describes
the relationship between the variables . It can also be used to make predictions
about the dependent variable based on values of the independent variables.
There are different types of regression analysis, including linear regression,
logistic regression, polynomial regression, and others.
Linear regression is the most commonly used type of regression analysis and it
involves finding a straight line that best fits the data points in a scatter plot.
Regression analysis can be used in many fields, including economics, finance,
biology,
engineering, and social sciences, to study the relationship between variables and
make
predictions about future outcomes.
Figure 3.3 illustrates the graph of linear regression.

Figure 3.3: Scatter Plot for Linear Regression

3.2.1 LINEAR REGRESSION


Linear regression is used to model the relationship between a dependent variable
(often
denoted as Y) and one or more independent variables (often denoted as X). The goal
of linear
regression is to find the best-fitting straight line that can explain the
relationship between the
variables.
There are two types of linear regression: Simple Linear Regression and Multiple Linear Regression.

Following are the formulas for calculating the coefficients of simple linear regression:

b0 = [(∑y)(∑x²) − (∑x)(∑xy)] / [n(∑x²) − (∑x)²]
b1 = [n(∑xy) − (∑x)(∑y)] / [n(∑x²) − (∑x)²]

Simple Linear Regression
In simple linear regression, there is only one independent variable X. The relationship between X and Y is modeled using a straight-line equation: Y = b0 + b1X. Here, b0 and b1 are the intercept and slope coefficients, respectively.
•For example: To predict the weight of a person from his height, height is the independent variable X and weight is the dependent variable Y. The equation hence becomes Weight = b0 + b1 * Height.

Multiple Linear Regression
In multiple linear regression, there are two or more independent variables X1, X2, ..., Xn. The relationship between these variables and Y is modeled using an equation of the form Y = b0 + b1X1 + b2X2 + ... + bnXn. Here, b0, b1, b2, ..., bn are the coefficients.
•For example: To predict the weight of a person from height and age, height and age are the independent variables X1 and X2, and weight is the dependent variable Y. The equation hence becomes Weight = b0 + b1 * Height + b2 * Age.
© Aptech Limited
Consider the following data containing the age and glucose level of six patients. Here, age is the independent variable (x), which will be used to predict the glucose level, the dependent variable (y). The equation hence becomes:
Y = b0 + b1X
Table 3.4 shows the calculation of terms used in calculating the regression
coefficients.

X Y XY X2 Y2
41 97 3977 1681 9409
22 66 1452 484 4356
23 79 1817 529 6241
44 73 3212 1936 5329
57 87 4959 3249 7569
59 81 4779 3481 6561
Sum = 246 Sum = 483 Sum = 20196 Sum = 11360 Sum = 39465
Table 3.4: Calculation of Regression Coefficients
Putting values in the formula:
b0 = (5486880 – 4968216)/(68160 - 60516) = 518664/7644 = 67.85
b1 = (121176 - 118818)/ (68160 - 60516) = 2358/7644 = 0.30
Hence the final equation becomes:
Y= 67.85 + 0.30*X
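The hand calculation above can be checked with a short Python sketch (the variable names are illustrative):

```python
# Age (x) and glucose level (y) of the six patients from Table 3.4.
x = [41, 22, 23, 44, 57, 59]
y = [97, 66, 79, 73, 87, 81]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Formulas for the simple linear regression coefficients.
denom = n * sum_x2 - sum_x ** 2
b0 = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # intercept
b1 = (n * sum_xy - sum_x * sum_y) / denom       # slope

print(round(b0, 2), round(b1, 2))  # 67.85 and roughly 0.31
```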

3.2.2 NON-LINEAR REGRESSION

A non-linear equation (of degree 4) could be of the form:

y = ax⁴ + bx³ + cx² + dx + e

Non-linear regression is a powerful tool for analyzing complex data. It requires careful attention to the model assumptions and the selection of appropriate techniques for fitting the model.
Figure 3.4 illustrates an example of non-linear regression between two variables,
petal length
and petal width. It can be seen that the graph is a curve which shows non-linear
regression.

Figure 3.4: Scatter Plot of Non-linear Regression

•Non-linear regression is used to model a relationship between a dependent variable and one or more independent variables that is not linear.
•Linear regression assumes a linear relationship between the dependent variable and the independent variables. Non-linear regression models use a nonlinear function to describe the relationship.
•Non-linear regression can be used to model a wide variety of phenomena, such as growth curves, enzyme kinetics, and dose-response relationships.
•The choice of the appropriate nonlinear function to use depends on the nature of the data being analyzed and the underlying theoretical model.
3.2.3 REGRESSION COEFFICIENTS

Regression coefficients are numerical values. They


represent the relationship between a dependent
variable and one or more independent variables in a
regression model. They are estimated using statistical
methods, such as ordinary least squares regression.
They are used to quantify the strength and direction
of the relationship between the variables.
In a simple linear regression model with one independent variable, the regression coefficient represents the change in the dependent variable for a one-unit change in the independent variable. For example, if the regression coefficient is 2.5, then a one-unit increase in the independent variable is associated with a 2.5-unit increase in the dependent variable.
In a multiple regression model with two or more independent variables, each regression coefficient represents the change in the dependent variable for a one-unit change in the corresponding independent variable, with all other variables held constant. Consider a multiple regression model with two independent variables X1 and X2. A regression coefficient of 2.5 for X1 means that a one-unit increase in X1 is associated with a 2.5-unit increase in the dependent variable, holding X2 constant.
Regression coefficients are important for interpreting the results of a
regression analysis, as they provide information about the direction and
magnitude of the relationship between variables. They can also be used
to make predictions about the dependent variable based on the values
of the independent variables.
3.3 Summary
 Correlation serves as a statistical metric elucidating the connection between two
variables. It
gauges the extent of their association, revealing whether the relationship is
positive or
negative.
 The strength of the correlation can range from -1 to +1, where
o -1 indicates a perfect negative correlation
o +1 indicates a perfect positive correlation
o 0 indicates no correlation at all
 A scatter chart is a type of chart used to display the relationship between two
numerical
variables.
 The coefficient of correlation is a statistical measure that indicates the
strength and direction
of the linear relationship between two variables.
 The most commonly used methods to calculate correlation between variables are Karl Pearson's coefficient of correlation and Spearman's Rank correlation coefficient.
 The probable error of a rank correlation coefficient measures the likely amount
of error or
uncertainty in the sample estimate of the true rank correlation coefficient.
 Regression is a statistical method used to determine the relationship between a
dependent
variable and one or more independent variables.
 In simple linear regression, the relationship between X and Y is modeled using a
straight-line
equation. This equation is of the form Y = b0 + b1X where b0 and b1 are the
intercept and
slope coefficients, respectively.
 In multiple linear regression, there are two or more independent variables X1,
X2, ..., Xn. The
relationship between these variables and Y is modeled using an equation of the form
Y = b0 +
b1X1 + b2X2 + ... + bn Xn, where b0, b1, b2, ..., bn are the coefficients.
 Non-linear regression is a statistical method used to model a relationship
between a
dependent variable and one or more independent variables that is not linear.

3.4 Check Your Progress

1. Which of the following statements is true about the correlation coefficient?

A. Correlation coefficient measures the strength and direction of the linear relationship between two variables
B. Correlation coefficient always ranges from -1 to 1
C. A correlation coefficient of 0 indicates that there is no relationship between two variables
D. All of these

2. Which of the following statements is true about regression analysis?

A. Regression analysis can be used to predict the value of one variable based on the value of another variable
B. Regression analysis can only be used if the variables are strongly correlated
C. Regression analysis can be used to establish causality between two variables
D. All of these

3. Which of the following statements is true about simple linear regression?

A. Simple linear regression involves only one independent variable and one dependent variable
B. Simple linear regression is used when there is a curvilinear relationship between two variables
C. Simple linear regression can be used to determine the relationship between more than two variables
D. All of these

4. What is Karl Pearson's coefficient of correlation?

A. A measure of the strength of the relationship between two variables
B. A measure of the difference between two variables
C. A measure of the variability within a single variable
D. A measure of the probability of a relationship between two variables

5. What is the purpose of the coefficient of determination (R-squared) in the context of multiple linear regression?

A. It measures the strength of the relationship between the dependent and independent variables
B. It indicates the average value of the independent variables
C. It quantifies the proportion of variance in the dependent variable explained by the independent variables
D. It calculates the slope of the regression line
3.4.1 Answers for Check Your Progress
1. D
2. A
3. A
4. A
5. C

Try It Yourself
1. Calculate Pearson’s correlation coefficient and Spearman’s rank correlation
coefficient.
X: 2, 4, 6, 8, 10
Y: 1, 3, 5, 7, 9

X: 1, 2, 3, 4, 5
Y: 5, 4, 3, 2, 1
2. We have a 0.6 correlation coefficient and 30 pairs of samples. Calculate the
probable error
in this example.
3. Calculate probable error of rank correlation for following data:
X: 70, 76, 71, 98, 88, 61, 79
Y: 45, 44, 72, 67, 48, 35, 40
4. Calculate the regression coefficients and form the regression equation with the following data:
X: 1,2,3,4,5,6,7
Y: 9,8,10, 12, 11, 13, 14

Price : 10, 12, 13, 12, 16, 15


Amount demanded : 40, 38, 43, 45, 37, 43

Session 04
Inferential Statistics

This session introduces the basics of inferential statistics by diving deep into sampling theory and hypothesis testing.
In this session, you will learn to:
 Explain hypothesis testing
 Describe sampling theory
 Describe confidence intervals and level of significance
4.1 INTRODUCTION TO INFERENTIAL STATISTICS
Inferential statistics is a branch of statistics that deals with making inferences
or conclusions
about a population based on information obtained from a sample. It uses sample data to draw inferences about a larger population.
Inferential statistics uses probability theory to make these inferences. It helps
researchers to
make generalizations about a population by studying a subset of individuals or
objects from the
population.
To carry out inferential statistics, researchers typically begin with a hypothesis or claim about a population. Next, they collect a sample of data from the population and use statistical methods to analyze the sample data. Based on the results of the analysis, they make inferences about the population.
Inferential statistics can be used in a wide range of applications, including
marketing research,
medical studies, social sciences, and more. It allows researchers to draw
conclusions about a
population with a certain level of confidence and can help decision-makers make
informed
decisions based on data.
4.2 SAMPLING THEORY
Sampling theory is a field of statistics that deals with the selection of a subset of individuals or objects from a larger population. The goal of sampling is to gather information about the
about the
population by studying the characteristics of the sample. Sampling theory provides
a framework
for selecting a sample that is representative of the population and for estimating
the parameters
of interest, such as the mean, variance, or proportion.
In practical terms, sampling theory is used in a variety of fields, such as market
research, social
sciences, public health, and manufacturing. For example, a market research firm may
use
sampling theory to select a representative sample of consumers to survey about
their
preferences for a new product. In public health, researchers may use sampling
theory to select a
sample of patients from a population to study the prevalence of a disease.
Sampling theory involves a variety of techniques for selecting a sample, including
random
sampling, stratified sampling, cluster sampling, and systematic sampling. The
choice of sampling
technique depends on the characteristics of the population and the research
objectives. Once a
sample is selected, statistical methods can be used to estimate population
parameters and to
quantify the precision and accuracy of the estimates.
Figure 4.1 shows a visual representation of sampling.

Figure 4.1: Visual Representation of Sampling

4.2.1 TYPES OF SAMPLING


There are several types of sampling methods that are commonly used in research and
statistical
analysis. Here are some of the most common types of sampling:

Simple Random
Sampling•In this approach, every individual in the population has an equal
likelihood of being selected. This usually entails assigning
numerical values to each member and then employing a random
number generator to make the selections.
Stratified Sampling•Here, the population is categorized into subgroups (or strata)
based on specific characteristics like age, gender, or income.
Participants are then randomly chosen from each subgroup, aiming
for a sample that accurately represents the entire population.
Cluster Sampling•Cluster Sampling involves dividing the population into clusters or
groups, then randomly selecting clusters, and including all
members within those chosen clusters in the sample.
Systematic Sampling•This method involves organizing the population into a sequence
or
list. A random starting point is selected, and then every nth
member in the sequence is included in the sample.
Convenience
Sampling•Convenience sampling selects participants who are easily available
and willing to take part, like students in a class or customers in a
store. However, it may not always yield a representative sample.
Snowball Sampling•This method relies on participants identifying and referring
others.
For instance, researchers might start with a couple of participants
and ask them to suggest additional willing participants. It is often
used when the target population is challenging to identify or
access.
Every sampling method has its unique advantages and drawbacks. The selection of the
appropriate method hinges on factors such as the precise research question, the
characteristics
of the target population, and the resources at the researcher's disposal.
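Several of these schemes can be sketched with Python's standard `random` module. This is an illustrative toy example; the population of 100 member IDs and the two-stratum split are assumptions made for the sketch, not values from the text:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))  # toy population of 100 member IDs

# Simple random sampling: every member has an equal chance of selection.
simple = random.sample(population, 10)

# Systematic sampling: random start, then every nth member of the list.
interval = 10
start = random.randrange(interval)
systematic = population[start::interval]

# Stratified sampling: split into strata, then sample within each stratum.
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

print(len(simple), len(systematic), len(stratified))  # 10 10 10
```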

4.2.2 USE CASES OF SAMPLING


Sampling is a common technique used in various fields to select a subset of
individuals or
observations from a larger population. The selected subset is used to make
inferences about the
larger population.

Here are some common use cases of sampling:

 Market research: In market research, a sample of consumers may be selected from a larger population to gather information about their preferences, behaviors, and attitudes towards a product or service.
 Medical research: In medical research, a sample of patients may be selected from a larger population to study the effectiveness of a treatment or intervention.
 Opinion polling: In opinion polling, a sample of voters may be selected from a larger population to predict the outcome of an election or to measure public opinion on a particular issue.
 Quality control: In quality control, a sample of products may be selected from a larger batch to test for defects or to ensure that they meet certain quality standards.
 Environmental monitoring: In environmental monitoring, a sample of soil, water, or air may be taken from a larger area to determine the level of pollutants or contaminants.
 Auditing: In auditing, a sample of financial transactions may be selected from a larger dataset to check for errors or fraudulent activity.

4.3 PARAMETER AND STATISTICS

Parameters and statistics are both important concepts in the field of statistics.
Parameters refer to numerical characteristics that describe a population. They are usually denoted by Greek letters such as µ (mu) or σ (sigma). Parameters are often used to make inferences about a population based on a sample. For example, to know the average height of all people in a certain country, one can take a sample of people and calculate the sample mean. The sample mean is an estimate of the population mean, which is the parameter of interest.
Statistics, on the other hand, are numerical characteristics that describe a sample. They are usually denoted by Latin letters such as x̄ (x-bar) or s. Statistics are used to summarize and analyze data, and to make inferences about the population. For example, if one takes a sample of people and calculates the sample mean, that sample mean is a statistic. The statistic can be used to make inferences about the population mean, which is the parameter of interest.
In summary, parameters are used to describe populations, while statistics are used to describe samples. Parameters are often estimated using statistics, which allow inferences about the population based on the sample.
4.3.1 SAMPLING DISTRIBUTION
In statistics, the sampling distribution refers to the probability
distribution of a statistic obtained from a large number of
samples drawn from a population. It describes the distribution of the values of a statistic (such as the mean or standard deviation) that would be obtained if many random samples were taken from a population and the statistic were computed for each sample.
The sampling distribution is important because it allows one to make inferences about the population based on the sample data. By examining the sampling distribution of a statistic, one
can estimate the population parameter and determine the
likelihood that a particular sample statistic was drawn from the
population.
For example, suppose the mean height of all students in a university must be estimated. A sample of, say, 100 students can be taken and their mean height computed. However, this sample mean might not be exactly equal to the true population mean. The sampling distribution of the mean indicates how likely it is that the sample mean is close to the true population mean.
The central limit theorem is an important concept in the theory
of sampling distributions. It states that, under certain
conditions, the sampling distribution of a statistic (such as the
mean) will be approximately normally distributed, regardless of
the underlying population distribution. This makes it possible to
use inferential statistics to make predictions about the population based on a sample.
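The central limit theorem can be illustrated with a small simulation. The sketch below draws repeated samples from a uniform (clearly non-normal) population; the sample size and number of samples are arbitrary choices made for illustration:

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Draw 2,000 samples of size 50 from Uniform(0, 1) and record each mean.
sample_size = 50
sample_means = [
    statistics.mean(random.random() for _ in range(sample_size))
    for _ in range(2000)
]

# The sample means cluster around the population mean (0.5) and their
# spread is close to sigma / sqrt(n) = sqrt(1/12) / sqrt(50), about 0.041,
# even though the underlying population is not normal.
grand_mean = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
print(round(grand_mean, 2), round(spread, 2))
```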
Figure 4.2 illustrates the sampling distribution with different sample sizes.

Figure 4.2: Sampling Distribution


Image Courtesy: Sampling_Distributions_of_the_Sample_Mean_from_a_Normal_Popula…

4.3.2 STANDARD ERROR (SE)


The SE is a statistical term that measures the variability of a sample statistic,
such as the mean or
standard deviation, from its population parameter. It is the standard deviation of
the sampling
distribution of a statistic, which is an estimate of the population parameter based
on a sample of
observations.
In other words, the standard error measures the amount of variation or uncertainty
in the
estimate of the population parameter based on a sample. The greater the precision
of the
estimate, the smaller the standard error tends to be. Conversely, when the standard
error is
larger, the estimate is likely to be less precise.
The formula for SE depends on the type of statistic being used. Here are the
formulas for three
common statistics:

Standard error of the mean (SEM):

SEM = s / √n

Where, s is the standard deviation of the sample and n is the sample size.

Standard error of the proportion (SE):

SE = √(p(1 − p) / n)

Where, p is the proportion of successes in the sample and n is the sample size.

Standard error of the difference between means (SED):

SED = √(s₁²/n₁ + s₂²/n₂)

Where, s₁ and s₂ are the standard deviations of the two samples and n₁ and n₂ are the sample sizes.
Consider an example.
Suppose the average height of all students at a school needs to be estimated, and there is a sample of 20 students. The height of each student in the sample is measured and the mean height is calculated to be 170 cm. The standard deviation of the sample is 5 cm.
To calculate the standard error of the mean, the following formula is used:
SE = s / sqrt(n)
Where, s is the standard deviation of the sample and n is the sample size.
In this example, s = 5 cm and n = 20, so:
SE = 5 / sqrt(20) = 5 / 4.472 = 1.118 cm
Therefore, the standard error of the mean is 1.118 cm. This means that if the mean height were calculated for many samples of the same size from the same population, the standard deviation of those sample means would be approximately 1.118 cm.
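The same calculation in Python, a minimal sketch using the numbers from the example:

```python
import math

s = 5.0  # sample standard deviation (cm)
n = 20   # sample size

# Standard error of the mean: SEM = s / sqrt(n)
sem = s / math.sqrt(n)
print(round(sem, 3))  # 1.118
```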

4.4 HYPOTHESIS TESTING


Hypothesis testing is a statistical method used to determine whether a hypothesis
about a
population is supported by sample data. The process involves formulating a
hypothesis about a
population parameter, collecting data, and then using statistical methods to
analyze the data
and draw conclusions about the hypothesis.

4.4.1 NULL AND ALTERNATIVE HYPOTHESIS


There are two types of hypotheses in hypothesis testing: the null hypothesis and
the alternative
hypothesis. The null hypothesis is a statement about the population parameter that
is assumed
to be true until proven otherwise. The alternative hypothesis is a statement that
contradicts the
null hypothesis and is what the researcher hopes to show is true.

For example:
 Null Hypothesis: There is no difference in test scores between students who study for two hours and those who study for four hours.
Alternative Hypothesis: Students who study for four hours will have higher test scores than those who study for two hours.
 Null Hypothesis: There is no relationship between caffeine consumption and heart rate.
Alternative Hypothesis: Increased caffeine consumption leads to an increase in heart rate.
 Null Hypothesis: The new medication has no effect on reducing the duration of the common cold.
Alternative Hypothesis: The new medication reduces the duration of the common cold.
 Null Hypothesis: There is no difference in the average income between men and women.
Alternative Hypothesis: Men earn more on average than women.

The process of hypothesis testing involves four steps:
1. State the null and alternative hypothesis.
2. Choose an appropriate level of significance.
3. Collect data and calculate a test statistic.
4. Compare the test statistic to the critical value or p-value to determine whether to reject or fail to reject the null hypothesis.

If the test statistic falls in the rejection region (more extreme than the critical value, or the p-value is less than the chosen level of significance), then the null hypothesis is rejected in favor of the alternative hypothesis. If the test statistic does not fall in the rejection region (not more extreme than the critical value, or the p-value is greater than the chosen level of significance), then the null hypothesis cannot be rejected.
4.4.2 TYPES OF ERROR
There are several types of errors that can occur in hypothesis testing:

Type 1 Error
•In statistical hypothesis testing, a Type 1 error (also known as a false positive) occurs when a null hypothesis is rejected even though it is actually true.
•In other words, it is the error of concluding that there is a significant difference or effect when there is no such difference or effect.
•Type 1 error is usually represented by the symbol α (alpha). It is commonly set at a significance level of 0.05 or 0.01, which means that the probability of making a Type 1 error is 5% or 1%, respectively.
•Type 1 errors can be reduced by increasing the sample size, choosing a lower significance level, or by using more rigorous statistical tests.

Type 2 Error
•Type 2 error, also known as a false negative, is a statistical term used to describe a situation where a hypothesis test fails to reject a null hypothesis that is false. In other words, a Type 2 error occurs when it is concluded that there is no significant difference between two groups or variables, when in fact, there is a difference.
•Type 2 errors are common in hypothesis testing, particularly when the sample size is small or the effect size is weak. The probability of making a Type 2 error is denoted by the symbol β (beta) and is related to the level of significance α (alpha) of the hypothesis test.
•To reduce the likelihood of making a Type 2 error, one can increase the sample size or use more sensitive statistical tests with higher power to detect differences between groups or variables. However, reducing the risk of a Type 2 error usually comes at the cost of increasing the risk of a Type 1 error (rejecting a null hypothesis that is true).

4.5 CRITICAL REGION

In statistical hypothesis testing, the critical region is the set of values of the test statistic that leads to rejection of the null hypothesis. The level of significance, denoted by alpha (α), is the maximum probability of making a Type I error, which is the rejection of a true null hypothesis.
The significance level establishes the critical region. For example, if the level of significance is set at 0.05 (or 5%), a 5% chance of making a Type I error is accepted. Therefore, the critical region will be the set of values that fall in the extreme 5% of the distribution, in the tails of the distribution.
The critical region is defined based on the null hypothesis, the alternative hypothesis, and the chosen level of significance. If the test statistic falls in the critical region, the null hypothesis is rejected and it is concluded that there is evidence to support the alternative hypothesis. If the test statistic falls outside the critical region, one fails to reject the null hypothesis.
Figure 4.3 shows the region of rejection and region of acceptance.

Figure 4.3: Critical Region and Acceptance Region

4.5.1 CONFIDENCE INTERVAL


A confidence interval is a statistical range of values that is likely to contain
the true value of an
unknown population parameter with a certain level of confidence.
It is an estimate of the range of values within which a population parameter, such as a
mean or
proportion, is likely to fall, based on a sample from that population. The
confidence interval is
calculated using a confidence level, which is typically expressed as a percentage,
such as 90%,
95%, or 99%.
For example, to estimate the mean height of a population based on a sample of heights, a 95% confidence interval could be calculated. This would give a range of heights within which one can be 95% confident that the true mean height of the population lies.

The formula for calculating a confidence interval is:

CI = X̄ ± (critical value * SEM)

Where, X̄ is the sample mean and SEM is the standard error of the mean.


Example: Suppose a 95% confidence interval for the mean weight of a certain population of dogs needs to be calculated. There is a sample of 50 dogs, and it is known that the sample mean weight is 25 pounds, with a standard deviation of three pounds.

To calculate the confidence interval, first find the Standard Error of the Mean
(SEM), which is the
standard deviation of the sample mean:
SEM = σ / sqrt(n)
Where, σ is the population standard deviation (which one does not know, so the
sample
standard deviation can be used as an estimate), n is the sample size.

So in this case:
SEM = 3 / sqrt(50) = 0.424
Next, one needs to find the critical value for a 95% confidence interval. Look this up in a standard normal distribution table or use a calculator. For a two-tailed 95% confidence interval, the critical value is 1.96.
Now, the confidence interval can be calculated:
CI = X ± (critical value * SEM)
So in the example, the 95% confidence interval is:
CI = 25 ± (1.96 * 0.424) = [24.17, 25.83]

This means one can be 95% confident that the true mean weight of the population of
dogs lies
somewhere between 24.17 and 25.83 pounds, based on the sample data.
Confidence intervals are useful in statistical inference, as they provide a measure
of the precision
of estimates and help make inferences about the population based on the sample
data.
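The dog-weight interval above can be reproduced with a short Python sketch:

```python
import math

n, mean, s = 50, 25.0, 3.0  # sample size, sample mean, sample std. dev.
z = 1.96                    # critical value for a two-tailed 95% interval

sem = s / math.sqrt(n)  # standard error of the mean
lower = mean - z * sem
upper = mean + z * sem
print(round(lower, 2), round(upper, 2))  # 24.17 25.83
```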

4.5.2 ONE TAIL, TWO TAIL TESTS


A one-tailed test, also known as a directional test, is a statistical hypothesis test in which the researcher or analyst makes a specific prediction about the direction of the relationship between two variables. In a one-tailed test, the null hypothesis is rejected if the sample statistic falls either in
statistic falls either in
the upper or lower tail of the sampling distribution, depending on the direction of
the
hypothesis.
For example, suppose a researcher wants to test the hypothesis that a new drug will
increase the
average lifespan of patients. If the researcher predicts that the drug will increase
the lifespan, the
one-tailed test will be a right-tailed test, and the null hypothesis will be that
the drug has no
effect. In this case, the researcher will reject the null hypothesis if the sample
mean of the
lifespan is significantly higher than the expected mean.
Alternatively, if the researcher predicts that the drug will decrease the lifespan, the one-tailed test will
one-tailed test will
be a left-tailed test and the null hypothesis will be that the drug has no effect.
In this case, the
researcher will reject the null hypothesis if the sample mean of the lifespan is
significantly lower
than the expected mean.
© Aptech Limited
A two-tailed test is a statistical hypothesis test in which the null hypothesis is tested against an alternative hypothesis that allows for a difference in either direction, either less than or greater than the value stated in the null hypothesis.
In other words, in a two-tailed test, the researcher wants to determine whether there is a significant difference between the two groups being compared, but without specifying the direction of the difference.
For example, to compare the mean heights of men and women, a two-tailed test checks for a significant difference in height between them, without specifying whether men are taller or shorter than women.
A two-tailed test is commonly used in statistical analysis when the direction of the difference is uncertain, or when the researcher wants to avoid the possibility of bias in one direction or the other.
Figures 4.4 and 4.5 show one-tail and two-tail tests.

Figure 4.4: One-Tail Test

Figure 4.5: Two-Tail Test

4.6 MAKING A DECISION ON A POPULATION
To make a statistical decision on a population, one typically needs to follow a structured process that involves several steps:

 Define the problem: Clearly define the problem to solve and the related population. This may involve identifying a specific research question or hypothesis.
 Choose a statistical test: Identify the appropriate statistical test to use based on the type of data and the research question. Some common statistical tests include t-tests, ANOVA, regression analysis, and chi-square tests.
 Set the significance level: Determine the level of significance for the decision. This is typically set at 0.05, which means that there is a 5% chance that the result occurred by chance.
 Collect data: Collect data from the population using appropriate sampling methods to ensure that the sample is representative of the population.
 Calculate statistics: Calculate the appropriate statistics based on the chosen test and the data collected.
 Interpret results: Interpret the results of the statistical analysis, considering the significance level set and any other relevant factors. Determine whether the results support or reject the research question or hypothesis.
 Draw conclusions: Draw conclusions based on the statistical analysis and communicate them clearly to others, considering any limitations or assumptions that may have affected the results.

It is important to note that making a statistical decision on a population requires careful planning, data collection, and analysis. It is also important to be aware of potential biases or limitations in your data and to interpret your results within the broader context of your research question or hypothesis.

4.6.1 CRITICAL VALUE METHOD

The critical value method is a statistical approach used to make decisions based on a test statistic and a predetermined critical value. The critical value represents the value that the test statistic must exceed to reject the null hypothesis.
Here are the steps to use the critical value method:
1. Formulate the null hypothesis and alternative hypothesis.
2. Choose an appropriate test statistic and calculate its value based on the sample data.
3. Determine the level of significance (α) for the test.
4. Look up the critical value for the chosen level of significance and the test statistic's degrees of freedom.
5. Compare the calculated test statistic value with the critical value.
6. If the calculated test statistic value is greater than the critical value, reject the null hypothesis. If it is less than or equal to the critical value, fail to reject the null hypothesis.

For example, suppose the claim that the mean weight of apples is greater than 50 grams needs to be tested. Take a sample of 30 apples and calculate the sample mean weight to be 52 grams with a standard deviation of 4 grams. Now, calculate the critical value at a level of significance of 0.05 to test the null hypothesis that the mean weight is at most 50 grams.
Frame the null and alternative hypothesis.
H0: Mean weight of the apples <= 50 grams
H1: Mean weight of the apples > 50 grams
This constitutes an upper-tail test, so the area of the critical region on the right side is α = 0.05, which means that the area up to the upper critical value is 1 - 0.05 = 0.95. The z-value corresponding to a cumulative probability of 0.95 must therefore be found. From the z-table, the z-value comes out to be 1.65. The formula for calculating the critical value is:

CV = μ ± Zc * σ/√n

Since this is an upper-tail test, only the + sign is used, which gives the upper critical value:
Critical Value (CV) = 50 + 1.65*(4/5.477) = 51.2
The sample mean of 52 grams is greater than the calculated critical value of 51.2 grams, which means the sample mean lies in the rejection region. Hence, the null hypothesis is rejected.
Alternatively, one can compare the test statistic itself with its critical value. The calculated test statistic is (52 - 50)/(4/√30) ≈ 2.74. Using a t-distribution table or calculator with 29 degrees of freedom (n - 1), the critical value at a level of significance of 0.05 is found to be 1.699. Since 2.74 is greater than 1.699, the null hypothesis is again rejected.
Overall, the critical value method is a useful tool in statistical decision-making.
It provides a clear
threshold for rejecting or failing to reject a null hypothesis based on the test
statistic and level of
significance.
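The apples example can be reproduced with a short program. The following is a minimal, standard-library-only Python sketch; the function name and the hard-coded z value of 1.645 (the upper 5% point of the standard normal distribution) are illustrative choices, not part of any statistics package.

```python
import math

def upper_critical_value(mu0, sigma, n, z_crit):
    """Upper-tail critical value for the sample mean under H0: mu <= mu0."""
    return mu0 + z_crit * sigma / math.sqrt(n)

# Apples example: H0: mu <= 50, H1: mu > 50, alpha = 0.05 (z_crit ~ 1.645)
cv = upper_critical_value(mu0=50, sigma=4, n=30, z_crit=1.645)

# The sample mean of 52 grams exceeds the critical value, so H0 is rejected
reject = 52 > cv
```

Running this gives a critical value of about 51.2 grams, matching the worked example.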
4.6.2 p-VALUE METHOD
The p-value method is a commonly used statistical technique for making decisions
based on
data. In this method, a hypothesis is formulated and then tested using a
statistical test. The p-
value is calculated, which represents the probability of obtaining a result as
extreme or more
extreme than the observed result, assuming the null hypothesis is true.
If the p-value is less than a predetermined significance level, typically 0.05, the
null hypothesis is
rejected, and the alternative hypothesis is accepted. If the p-value is greater
than the
significance level, the null hypothesis is not rejected, and the alternative
hypothesis is not
accepted.
It is important to note that rejecting the null hypothesis does not necessarily
mean that the
alternative hypothesis is true, only that the data provide evidence against the
null hypothesis.
When using the p-value method to make a decision, it is important to carefully
consider the
assumptions and limitations of the statistical test being used. It is also
important to consider the
context of the data and the potential consequences of the decision. Additionally,
it is important
to consider other factors beyond statistical significance, such as effect size and
practical
significance.

Steps to make a decision using the p-value method:

1. Calculate the value of the z-score for the sample mean point on the distribution.
2. Calculate the p-value from the z-table: for the given z-score, the tail probability is 1 minus the cumulative probability (multiply it by 2 for a two-tailed test).
3. Make a decision on the basis of the p-value with respect to the given value of α (significance value).

Example of the p-value method:

Consider the hypothesis that the mean height of people in a certain town is 170 cm. A random sample of 49 people from that town is taken and their heights are measured. The mean height of this sample is found to be 172 cm and the standard deviation is 5 cm. Test the hypothesis at a significance level of 0.05.
Here, the hypotheses are:
H0: Mean height µ = 170
H1: Mean height µ ≠ 170
The formula to calculate the z-score is: z = (x̄ - µ)/(σ/√n)
Where, x̄ = 172, σ = 5, n = 49
z = (172 - 170)/(5/√49) = 2.8
From the z-table, the cumulative probability for z = 2.8 is 0.9974, so the upper-tail probability is 1 - 0.9974 = 0.0026. Since this is a two-tailed test, the p-value is 2 × 0.0026 = 0.0052.
Since the p-value (0.0052) is less than the significance level (0.05), the null hypothesis is rejected.
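The height example can be checked with a short stdlib-only Python sketch that follows these steps; `norm_cdf` is a hand-rolled helper built on `math.erf`, not a library function.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_p_value(xbar, mu0, sigma, n):
    """z-score and two-tailed p-value for a single sample mean."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    p = 2.0 * (1.0 - norm_cdf(abs(z)))   # twice the tail probability
    return z, p

# Height example: H0: mu = 170 cm, sample of n = 49 with xbar = 172, sigma = 5
z, p = two_tailed_p_value(xbar=172, mu0=170, sigma=5, n=49)
reject = p < 0.05
```

Here z comes out to 2.8 and the p-value to roughly 0.005, well below 0.05.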
4.7 Summary
 Inferential statistics is a branch of statistics that deals with making
inferences or conclusions
about a population based on information obtained from a sample.
 Sampling theory, a subset of statistics, focuses on the process of choosing a
subgroup of
individuals or objects from a larger population .
 The goal of sampling is to gather information about the population by studying
the
characteristics of the sample.
 Parameters are numerical features that define a population and are often used to
make
inferences about a population based on a sample.
 Statistics are numerical characteristics that describe a sample and are used to
summarize
and analyze data, and to make inferences about the population.
 The sampling distribution pertains to the likelihood distribution of a statistic
derived from
numerous samples taken from a population.
 The Standard Error (SE) quantifies the variability of a sample statistic, such as
the mean or
standard deviation, in comparison to its corresponding population parameter.
 Hypothesis testing is a statistical approach employed to assess whether sample
data
supports a hypothesis regarding a population.
 The null hypothesis posits a presumption about the population parameter,
considered true
until proven otherwise. Conversely, the alternative hypothesis presents a statement
that
opposes the null hypothesis.
 A type 1 error (also known as a false positive) occurs when a null hypothesis is
rejected even
though it is actually true.
 Type 2 error, also known as a false negative, describes a situation where a
hypothesis test
fails to reject a null hypothesis that is false.
 In statistical hypothesis testing, the critical region is the set of values that
would lead us to
reject the null hypothesis.
 The level of significance, denoted by alpha (α), is the maximum probability of
making a type I
error. This marks the rejection of a true null hypothesis.
 A confidence interval is a statistical range of values that is likely to contain
the true value of
an unknown population parameter with a certain level of confidence.
 A one-tailed test is a statistical hypothesis test in which the researcher or
analyst makes a
specific prediction about the direction of the relationship between two variables.
 A two-tailed test is a statistical hypothesis test where the null hypothesis is
tested against an
alternative hypothesis that can be in either direction.

4.8 Check Your Progress

1. Which of the following is NOT a type of probability sampling?
A. Stratified random sampling
B. Simple random sampling
C. Convenience sampling
D. Systematic random sampling

2. In which type of sampling technique is the population divided into non-overlapping groups and then a random sample is taken from each group?
A. Simple random sampling
B. Stratified random sampling
C. Cluster sampling
D. Systematic random sampling

3. Which of the following statements about sampling distribution is correct?
A. It is a distribution of sample data
B. It is a distribution of population data
C. It is a distribution of sample statistics
D. It is a distribution of population statistics

4. The formula for calculating standard error of the mean is:
A. (standard deviation of the sample)/√(sample size)
B. (standard deviation of the population)/√(sample size)
C. (mean of the sample)/√(sample size)
D. (mean of the population)/√(sample size)

5. Standard error is used to calculate which of the following?
A. Confidence interval
B. P-value
C. Z-score
D. None of these

6. What is a p-value?
A. The probability of obtaining the observed data if the null hypothesis is true
B. The probability of obtaining the observed data if the alternative hypothesis is true
C. The probability of obtaining a sample mean that is different from the population mean
D. The probability of making a type I error

7. What does the level of significance represent?
A. The probability of making a Type I error
B. The probability of making a Type II error
C. The probability of making both Type I and Type II errors
D. The probability of making no errors

4.8.1 Answers for Check Your Progress

1. C
2. B
3. C
4. B
5. A
6. A
7. A
Try It Yourself
1. A researcher is interested in estimating the average weight of a certain type of
fruit. A
sample of 50 fruits is selected and the average weight is found to be 250 grams
with a
standard deviation of 20 grams. Calculate a 95% confidence interval for the
population
mean weight of this type of fruit.
2. A manufacturer claims that the mean weight of a box of cereal is 12 ounces. A
consumer
advocacy group suspects that the mean weight is actually less than 12 ounces. A
random
sample of 36 boxes is selected, and the mean weight is found to be 11.8 ounces with
a
standard deviation of 0.6 ounces. Test the hypothesis that the mean weight is less
than
12 ounces at a significance level of 0.05 using p-value method.
3. A company claims that their new product has a mean lifespan of at least five
years. A
sample of 25 products is tested, and the sample mean lifespan is found to be 4.8
years
with a standard deviation of 0.7 years. Test the hypothesis that the mean lifespan
is less
than five years at a significance level of 0.01 using p-value method.
4. Suppose you want to test the null hypothesis that the population mean is equal
to 50,
against the alternative hypothesis that it is greater than 50. You take a random
sample
of size 25 from the population and obtain a sample mean of 52 and a sample standard
deviation of 5. Use the critical value method with a significance level of 0.05 to
test the
hypothesis.
5. Suppose you want to test the null hypothesis that the population proportion is
equal to
0.4, against the alternative hypothesis that it is less than 0.4. You take a random
sample
of size 100 from the population and obtain 35 successes. Use the critical value
method
with a significance level of 0.01 to test the hypothesis.
6. A company manufactures a certain type of product. A sample of 50 products was
taken
and the mean weight was found to be 500 grams with a standard deviation of 20
grams.
What is the standard error of the mean weight?
7. A survey of 1000 adults was conducted to estimate the average number of hours
they
spend watching TV per day. The mean number of hours was found to be 3.5 hours with
a standard deviation of 1.2 hours. What is the standard error of the mean?
8. A survey of 500 people was conducted to determine the proportion of people who
support a particular political candidate. Of the 500 people surveyed, 250 said they
support the candidate. Calculate the 99% confidence interval for the true
proportion of
people who support the candidate.
Session 05
Exact Sampling Distribution

This session introduces various parametric tests that can be executed on a population to identify patterns in them.
In this session, you will learn to:
 Explain Chi-Square test
 Explain T-test
 Explain Z-test
 Explain F-test
5.1 INTRODUCTION TO EXACT SAMPLING DISTRIBUTIONS
Exact sampling distributions refer to the probability distribution of a statistic
that is obtained
through a process of repeated sampling from a population. The exact sampling
distribution of a
statistic is a theoretical distribution .
For example, suppose the mean weight of all dogs in a particular city needs to be
estimated . One
could take a sample of dogs from the city and calculate the mean weight of the
sample. One
could then repeat this process many times, each time taking a new random sample of
dogs, and
calculate the mean weight of each sample. The resulting distribution of means would
be the
exact sampling distribution of the sample mean.
Exact sampling distributions can be derived mathematically using probability theory
and
statistical methods. They can provide valuable insights into the properties of
statistical
estimators and hypothesis tests. They are often used to make inferences about
population
parameters based on sample statistics, such as estimating confidence intervals or
conducting
hypothesis tests.
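The repeated-sampling idea can be simulated directly. The sketch below, in plain Python, uses a made-up dog-weight population (mean 25 kg, standard deviation 6 kg are illustrative numbers only) and builds an approximate sampling distribution of the mean:

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# Hypothetical population of 10,000 dog weights in kilograms
population = [random.gauss(25, 6) for _ in range(10_000)]

# Draw many samples of 40 dogs and record each sample's mean weight
sample_means = [statistics.mean(random.sample(population, 40))
                for _ in range(2_000)]

# The simulated sampling distribution of the mean
center = statistics.mean(sample_means)   # close to the population mean
spread = statistics.stdev(sample_means)  # close to sigma / sqrt(n)
```

The spread of the simulated means is far smaller than the spread of individual weights, in line with the standard error formula σ/√n.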
5.2 CHI-SQUARE TEST
The Chi-square test is a statistical test used to determine the association between
two
categorical variables. It is used to test whether there is a significant difference
between the
observed frequencies and the expected frequencies in a contingency table.
In a contingency table, the rows represent one categorical variable and the columns
represent
another categorical variable. Each cell in the table contains the frequency of the
occurrence of a
combination of the two variables. The Chi-square test determines if there is a
significant
difference between the observed frequencies in the table and the expected
frequencies,
assuming that the variables are independent.
The Chi-square test is calculated by comparing the observed frequencies in each cell of the contingency table to the expected frequencies, which are calculated based on the assumption of independence between the two variables. The formula for the Chi-square test is:

χ² = Σ (observed frequency − expected frequency)² / expected frequency

Where, χ² is the test statistic, and the sum is taken over all the cells in the contingency table.
One real-life example where chi-square test is commonly used is in analyzing the
results of a
survey or poll. For instance, consider a political poll where a sample of 1,000
voters were asked
to choose between two candidates in an upcoming election. The data collected could
be
arranged in a contingency table with the rows representing the two candidates and
the columns
representing the responses from the sample of voters.
Table 5.1 shows an example of contingency tables.
Candidate A Candidate B
Observed 450 550
Expected 650 350
Table 5.1: Contingency Table
The question now is whether the difference between the responses for the two
candidates is
statistically significant. To determine this, a chi-square test can be performed on
the data in the
contingency table.
The chi-square test will generate a p-value, which gives the probability of getting a difference between the two candidates as extreme as the one observed, assuming that there is no real difference between them in the population. If the p-value is less than a predetermined significance level (such as 0.05), one can reject the null hypothesis that there is no difference between the candidates and conclude that there is a statistically significant difference.
Consider another example.
Suppose there is a study, where 100 participants were administered two different
medications
(A and B) for a specific medical condition. The effect of these two medications is
to be compared.
The null hypothesis for this test is that there is no significant difference between the effects of the two medications. One can record the number of participants who improved with each medication.
Also, it is expected that Medication A and Medication B might have a positive
effect on 65
people and 35 people out of 100 people, respectively. However, a test was conducted
on the
patients and it was observed that Medication A cured 40 people and Medication B
cured 60
people.

Table 5.2 shows the obtained data.

Medication A Medication B
Observed 40 60
Expected 65 35
Table 5.2: Obtained Data
To perform a chi-square test on this data, first calculate the numerator of the chi-square formula. This is done by subtracting the expected values from the observed values. Table 5.3 shows the calculation of the numerator of the Chi-square formula.

Medication A Medication B Row Total
Observed 40 60 100
Expected 65 35 100
(Observed − Expected)² (−25)² (25)²
Table 5.3: Numerator of Chi-Square Formula

Here, the expected frequencies (the denominators of the chi-square formula) are 65 and 35, and the squared differences (the numerators) are (−25)² and (25)².
Now, putting these values in the formula for chi-square:
χ² = [(−25)²/65] + [(25)²/35]
= [625/65] + [625/35]
= 9.62 + 17.86
≈ 27.47
Finally, determine the Degrees of Freedom (DF) for the test.
For a 2x2 contingency table, DF = (number of rows − 1) x (number of columns − 1) = 1 x 1 = 1.
Figure 5.1 shows a Chi-square distribution chart which will help in finding the critical value.

Figure 5.1: Chi-square Distribution

Figure Courtesy: [Link] is-chi-square-test-how-it-works/
To find the p-value associated with this chi-square statistic and degrees of freedom, consider a significance level of 0.05. From the chi-square distribution chart with one DF, the critical value is found to be 3.84. Since the calculated chi-square value (27.47) is greater than the critical value, the null hypothesis is rejected. One can conclude that there is a statistically significant difference between the two medications in terms of their effect on the condition.
The Chi-square test is used in many areas of research, such as biology, psychology,
and social
sciences, to test hypotheses about the relationship between categorical variables.
It is a
powerful tool for identifying patterns and associations in data and is widely used
in statistical
analysis.
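The medication example reduces to a few lines of arithmetic. The following plain-Python sketch computes the statistic directly; the critical value 3.84 (chi-square distribution, 1 degree of freedom, α = 0.05) is hard-coded from a standard table.

```python
# Chi-square statistic for the medication example: observed vs. expected counts
observed = [40, 60]
expected = [65, 35]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With 1 degree of freedom, the critical value at alpha = 0.05 is 3.84
reject = chi_sq > 3.84
```

The statistic comes out to about 27.47, far beyond 3.84, so the null hypothesis is rejected.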

5.3 Z-test
A z-test is a statistical test used to determine whether two population means are
different when
the sample size is large and the population variance is known. It is a hypothesis
test that
compares the means of two samples, usually from normally distributed populations.
The z-test is named after the standard normal distribution, which is a probability
distribution
with a mean of 0 and a standard deviation of 1. The test works by transforming the
sample
means into z-scores, which represent the number of standard deviations that the
sample mean
is from the population mean. The z-score is then compared to a critical value from
the standard
normal distribution to determine whether to reject or fail to reject the null hypothesis.
The null hypothesis in a z-test is that the means of the two populations being compared are equal. The alternative hypothesis states that the means are unequal. The test statistic is calculated using the formula:

z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

Where, x̄1 and x̄2 are the sample means, σ1 and σ2 are the population standard deviations, and n1 and n2 are the sample sizes. The resulting z-score is compared to a critical value from the standard normal distribution based on the desired level of significance and the number of tails of the test.

If the calculated z-score falls within the rejection region (that is, its absolute value is greater than the critical value), one can reject the null hypothesis and conclude that there is a statistically significant difference between the means of the two populations. If the z-score falls outside the rejection region, one fails to reject the null hypothesis and concludes that there is insufficient evidence to support a difference between the means.

For example, suppose a company wants to test whether their new manufacturing process has improved the productivity of their workers. The company randomly selects 10 workers and records their productivity before and after implementing the new process. The data is normally distributed and has a known population standard deviation of 10 units.
The null hypothesis is that there is no difference in the mean productivity of the workers before and after implementing the new process. The alternative hypothesis is that the mean productivity after implementing the new process is different from the mean productivity before.
The sample mean productivity before implementing the new process is 80 units and the sample mean productivity after implementing the new process is 90 units.
Hence, x̄1 = 90, x̄2 = 80
σ1 = 10, σ2 = 10
n1 = 10, n2 = 10
One can use the z-test to determine whether the difference between these means is statistically significant. The test statistic is calculated as follows:
z = (90 − 80) / √(10²/10 + 10²/10) = 10/√20 ≈ 2.24
For a two-tailed test with a significance level of 0.05, the significance level is split between the two tails, so each tail corresponds to a probability of 0.05/2 = 0.025. The critical z-values are therefore approximately ±1.96.
Since the calculated z-value (2.24) is greater than the critical value (1.96), one can reject the null hypothesis and conclude that the mean productivity after implementing the new process is statistically significantly different from, and in fact greater than, the mean productivity before.
Therefore, the company can conclude that the new manufacturing process has improved worker productivity.
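A minimal Python sketch of the two-sample z computation, using only the standard library; the critical value 1.96 (two-tailed test at α = 0.05) is hard-coded from the z-table.

```python
import math

def two_sample_z(x1, x2, sigma1, sigma2, n1, n2):
    """Two-sample z statistic with known population standard deviations."""
    return (x1 - x2) / math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Productivity example: means 90 vs. 80, sigma = 10 for both groups, n = 10 each
z = two_sample_z(90, 80, 10, 10, 10, 10)
reject = abs(z) > 1.96   # two-tailed test at alpha = 0.05
```

This reproduces z ≈ 2.24 and the decision to reject the null hypothesis.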
5.4 t-test
A t-test is a statistical test used to determine whether two population means are
different when
the sample size is small or the population variance is unknown. It is a hypothesis
test that
compares the means of two samples, usually from normally distributed populations.
The t-test is named after the t-distribution, which is a probability distribution
that is similar to
the standard normal distribution but with heavier tails. The test works by
calculating a t-statistic,
which measures the difference between the sample means relative to the variability
within the
samples.
The null hypothesis in a t-test is that the means of the two populations being compared are equal. The alternative hypothesis states that the means are unequal. The test statistic for comparing two sample means is calculated using the formula:

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

Where, x̄1 and x̄2 are the sample means, s1 and s2 are the sample standard deviations, and n1 and n2 are the sample sizes. The obtained t-score is gauged by comparing it to a critical value from the t-distribution, determined by the chosen significance level and the test's number of tails.
If the calculated t-score falls within the rejection region (that is, its absolute
value is > the critical
value), one can reject the null hypothesis. One can conclude that there is a
statistically significant
difference between the means of the two populations. If the t-score falls outside
the rejection
region, one can fail to reject the null hypothesis and conclude that there is
insufficient evidence
to support a difference between the means.
There are two types of t-tests: the independent samples t-test and the paired samples t-test. The independent samples t-test compares the means of two independent samples. The paired samples t-test compares the means of two related samples (for example, before and after measurements from the same individuals).
Consider an example. Suppose a nutritionist wants to test whether a new diet plan
results in a
significant weight loss for participants. The nutritionist randomly selects 20
participants and
divides them into two groups: a control group and a treatment group. The control
group follows
their usual diet plan, while the treatment group follows the new diet plan. The
nutritionist
records the weight of each participant at the beginning and end of the study. The
data is
distributed normally and the population variance is unknown.
The null hypothesis is that there is no difference in the mean weight loss between
the control
group and the treatment group. The alternative hypothesis is that the mean weight
loss in the
control group is less than the mean weight loss in the treatment group.
The sample mean weight loss for the control group is two pounds with a standard
deviation of
1.5 pounds. The sample mean weight loss for the treatment group is five pounds with
a standard
deviation of two pounds.
To test the hypothesis, the nutritionist performs a two-sample independent t-test, with 10 participants in each group. The test statistic is calculated as follows:
t = (5 − 2) / √(2²/10 + 1.5²/10) ≈ 3.79
Assuming a significance level of 0.05 and degrees of freedom of 18 (n1 + n2 − 2), one can look up the critical value from the t-distribution table or use statistical software; it is 1.734 for a one-tailed test.
Since the calculated t-value (3.79) is greater than the critical value (1.734), one can reject the null hypothesis and conclude that the mean weight loss in the treatment group is statistically significantly greater than the mean weight loss in the control group.
Therefore, the nutritionist can conclude that the new diet plan results in a
significant weight loss
for participants compared to the control group.
The t-test is widely used in many fields, including social sciences, engineering,
and business, to
test hypotheses about means and to compare the effectiveness of different
treatments or
interventions.
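The diet-plan example can be sketched the same way. The snippet below uses the unpooled two-sample t statistic from the worked example (with 10 participants per group); the one-tailed critical value 1.734 for 18 degrees of freedom is taken from a t-table.

```python
import math

def two_sample_t(x1, x2, s1, s2, n1, n2):
    """Two-sample t statistic with unpooled sample variances."""
    return (x1 - x2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# Diet example: treatment mean 5 lb (s = 2), control mean 2 lb (s = 1.5), n = 10 each
t = two_sample_t(5, 2, 2, 1.5, 10, 10)
reject = t > 1.734   # one-tailed critical value, df = 18, alpha = 0.05
```

The statistic is about 3.79, so the null hypothesis of no difference is rejected.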

5.5 F-test
An F-test is a statistical test used to compare the variances of two or more populations.
It is a hypothesis test that determines whether the variability between two or more
groups is
significantly different or not. The F-test is named after the F-distribution, which
is a probability
distribution that arises when comparing the variances of two normally distributed
populations.
The null hypothesis in an F-test is that the variances of the populations being
compared are
equal. The alternative hypothesis states that at least one of the variances is
different. The test
statistic for an F-test is calculated by dividing the variance of one group by the
variance of
another group. If there are more than two groups, the F-test calculates the ratio
of the variances
of the largest and smallest groups.
If the calculated F-value falls within the rejection region (that is, it is > the
critical value), one can
reject the null hypothesis. One can conclude that there is a statistically
significant difference
between the variances of the groups. If the F-value falls outside the rejection
region, one can fail
to reject the null hypothesis and conclude that there is insufficient evidence to
support a
difference between the variances.
The F-test is often used in Analysis of Variance (ANOVA), a statistical technique
that compares
means across multiple groups. ANOVA is used to determine whether the means of two
or more
populations are equal, and the F-test is used to determine whether the variances
are equal.
The F-test is also used in regression analysis to test the overall significance of
the regression
model. The F-test is used to determine whether the variance in the dependent
variable
explained by the model is significantly greater than the variance that cannot be
explained by the
model.
Here is a real-life example of an F-test:
Suppose a car manufacturer wants to compare the variances of fuel efficiency (in
miles per
gallon) of three different models of cars: Model A, Model B, and Model C. The
manufacturer
randomly selects 10 cars of each model and measures their fuel efficiency. The data
is normally
distributed.
The null hypothesis is that the variances of fuel efficiency for all three models are equal. The alternative hypothesis is that at least one of the variances is different.
To test the hypothesis, the manufacturer performs an F-test. The test statistic is calculated as the ratio of the largest variance to the smallest variance:

F = s²(largest) / s²(smallest)

Where, s² represents the sample variance.
Consider a significance level of 0.05. For a ratio of two sample variances, the degrees of freedom are (n1 − 1, n2 − 1) = (9, 9), since 10 cars of each model were measured. One can look up the critical value from the F-distribution table or use statistical software; it is approximately 3.18.

Suppose the sample variances are as follows:

Model A: 4.5
Model B: 5.2
Model C: 3.9

Then, the F-value is calculated as:
F = 5.2 / 3.9 ≈ 1.33
Since the calculated F-value (1.33) is less than the critical value (3.18), one fails to reject the null hypothesis and concludes that there is insufficient evidence to support a difference in the variances of fuel efficiency between the three car models.
Therefore, the car manufacturer can conclude that the variances of fuel efficiency
are
statistically similar for all three car models. This information can be useful for
future product
development and marketing decisions.
Overall, the F- test is a powerful statistical tool that helps to determine whether
the variability
between two or more groups is significantly different or not. It is used in a
variety of fields,
including biology, physics, economics, and engineering.
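The fuel-efficiency example needs only a ratio. This plain-Python sketch hard-codes an approximate critical value of 3.18 (F-distribution with 9 and 9 degrees of freedom at α = 0.05, from a standard table):

```python
# F ratio of the largest to the smallest sample variance (fuel-efficiency example)
variances = {"Model A": 4.5, "Model B": 5.2, "Model C": 3.9}

f_ratio = max(variances.values()) / min(variances.values())

# Critical value for (9, 9) degrees of freedom at alpha = 0.05 is about 3.18
reject = f_ratio > 3.18
```

The ratio 5.2/3.9 ≈ 1.33 is below the critical value, so the null hypothesis of equal variances is not rejected.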

5.6 Summary
 The sampling distribution of a statistic represents the range of possible values
for that
statistic when repeatedly sampled from a population under specific conditions.
 The Chi-square test is a statistical test employed to ascertain the relationship between two categorical variables.
categorical variables.
 A z-test is a statistical test used to determine whether two population means are
different
when the sample size is large and the population variance is known.
 A t-test is a statistical test used to determine whether two population means are
different
when the sample size is small or the population variance is unknown.
 An F-test compares variances in two or more populations, checking if the
differences in
variability between groups are statistically significant.

5.7 Check Your Progress

1. Which of the following statements is true about the chi-square test?
A. It is used to test the independence of two categorical variables
B. It is used to compare the means of two continuous variables
C. It is used to test the significance of a correlation coefficient
D. It is used to test the normality of a distribution

2. What is the p-value in a chi-square test?
A. The probability of observing a test statistic as extreme as the one computed from the data, assuming the null hypothesis is true
B. The probability of making a type I error
C. The probability of rejecting a true null hypothesis
D. To compare the correlations of two groups

3. What is a t-test used for?
A. To compare the medians of two groups
B. To compare the means of two groups
C. To compare the variances of two groups
D. It is the distribution of the population statistics

4. When is a paired samples t-test used?
A. When the two groups being compared are independent
B. When the two groups being compared have equal variances
C. When the two groups being compared are related or paired
D. When the two groups being compared have different sample sizes

5. What is the null hypothesis in a z-test?
A. The sample mean is greater than the population mean
B. The sample mean is less than the population mean
C. The population mean is unknown
D. The sample mean is equal to the population mean

6. Which of the following is NOT a requirement for conducting a z-test?
A. The population standard deviation must be known
B. The sample size must be greater than 30
C. The population distribution must be normal
D. The sample must be randomly selected from the population

7. What is the interpretation of the p-value in an F-test?
A. The probability of obtaining the observed F-statistic or a more extreme value if the null hypothesis is true
B. The probability of obtaining the observed F-statistic or a more extreme value if the alternative hypothesis is true
C. The probability of obtaining the observed difference between the two means if the null hypothesis is true
D. The probability of obtaining the observed difference between the two means if the alternative hypothesis is true

5.7.1 Answers for Check Your Progress

1. A
2. A
3. B
4. C
5. D
6. B
7. A

Try It Yourself
1. A researcher wants to test whether there is a significant difference in the
distribution of
hair color among men and women. They survey 200 men and 200 women and find that
60 men and 80 women have blonde hair. Calculate the chi-square statistic and
degrees
of freedom for this test.
2. A biologist crosses two different strains of fruit flies and counts the number
of offspring
with each combination of traits. The results are as follows:
o Trait 1 and Trait 2: 120
o Trait 1 and not Trait 2: 80
o Not Trait 1 and Trait 2: 80
o Not Trait 1 and not Trait 2: 120
Help the biologist check if there is a significant association between two genetic
traits in
a population of fruit flies.
3. A manufacturer claims that the average lifespan of their product is five years.
A sample
of 50 products is taken and the average lifespan is found to be 4.5 years with a
standard
deviation of 1.2 years. Test whether the manufacturer's claim is true at a
significance
level of 0.05.
4. A company wants to test whether there is a significant difference in the mean
sales
revenue per day between two stores: Store A and Store B. Store A has an average
sales
revenue of $500 with a standard deviation of $50. Store B has an average sales
revenue
of $550 with a standard deviation of $70. Conduct a two-sample t-test at the 1%
significance level.
5. A study wants to test whether a new weight loss pill is effective in reducing
weight. The
study includes 20 participants who took the pill for four weeks and lost an average
of
five pounds with a standard deviation of five pounds. Conduct a
one-sample t-test at the 5% significance level to determine if the weight loss is
significant.

7. A company is interested in comparing the performance of three different
marketing
strategies. Using data given in Table 5.4, perform an F-test to determine if there
is a
significant difference between the means of the three groups at a significance
level of
0.05. The population size here is 10. The five different domains where the
marketing
strategy is applied are: Fast Moving Consumer Durables (FMCD), Fast Moving Consumer
Goods (FMCG), Retail, E-Commerce, and Real Estate. The scores given are a count of
positive responses received after the marketing strategies are applied.
             FMCD   FMCG   Retail   E-commerce   Real Estate
Strategy 1    15     18      12         14           16
Strategy 2    10     22      15         14           13
Strategy 3    25     20      18         23           21
Table 5.4: Data for F-test
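For exercise 2 above, the chi-square test of independence can be sketched with plain standard-library Python (a minimal sketch; the variable names are illustrative, not from the text):

```python
# Observed counts from exercise 2, arranged as a 2x2 contingency table:
# rows: Trait 1 present / absent; columns: Trait 2 present / absent.
observed = [[120, 80],
            [80, 120]]

row_totals = [sum(row) for row in observed]        # [200, 200]
col_totals = [sum(col) for col in zip(*observed)]  # [200, 200]
grand_total = sum(row_totals)                      # 400

# Under independence, each expected count is (row total * column total) / N.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)

# Degrees of freedom for an r x c table: (r - 1)(c - 1).
df = (2 - 1) * (2 - 1)

print(chi_square, df)  # 16.0 1
```

With 1 degree of freedom, the 0.05 critical value of chi-square is 3.841, so a statistic of 16.0 points to a significant association between the two traits.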


Session 06
Analysis of Variance

This session introduces the methods to analyze differences in the variance of populations.
In this session, you will learn to:
 Explain ANOVA
 Explain MANOVA
 Describe one-way and two-way classification
6.1 INTRODUCTION TO ANALYSIS OF VARIANCE
Analysis of Variance (ANOVA) is a statistical method used to compare the means of
two or more groups. It is a parametric test that assumes normal distribution of the
data and equal variances across the groups.
ANOVA works by partitioning the total variation in the data into two components: the variation
between the groups and the variation within the groups. The between-group variation
represents the differences in the means of the groups being compared, while the
within-group
variation represents the random variability within each group.
The ANOVA test generates an F-statistic, which is the ratio of the between-group
variation to the
within-group variation. If the F-value is large enough, it indicates that there is
a significant
difference between the means of the groups being compared.
There are different types of ANOVA tests, depending on the number of independent
variables
and the design of the experiment. One-way ANOVA is used to compare the means of
three or
more groups, while two-way ANOVA is used to analyze the effects of two independent
variables
on the dependent variable.
ANOVA is a powerful and widely used statistical tool in research and data analysis.
It allows for
the comparison of means across multiple groups simultaneously, providing insight
into the
sources of variability in the data.
6.2 ONE-WAY AND TWO-WAY CLASSIFICATION
One-way and two-way classification are two different methods used in statistical
analysis to
group and analyze data based on one or two factors, respectively.
One-way classification, also known as one-factor Analysis of Variance (ANOVA), is
used when
there is only one categorical independent variable, or factor, that is being used
to classify the
data. This method is used to test if there is a significant difference between the
means of three
or more groups. For example, one-way ANOVA could be used to determine if there is a
significant difference in the average height of trees between three different types
of soil.
Two-way classification, also known as two-factor ANOVA, is used when there are two
categorical
independent variables, or factors, that are being used to classify the data. This
method is used to
test if there is a significant difference between the means of groups that are
created based on
the combination of the two factors. For example, two-way ANOVA could determine if
there is a
significant difference in the average height of trees based on both the type of
soil and amount of
fertilizer used.
Overall, one-way classification is used to analyze data based on one categorical
factor, while
two-way classification is used to analyze data based on two categorical factors.
6.3 ANOVA

ANOVA is a statistical method used to determine whether there are significant
differences between the means of three or more groups. It is used to test hypotheses
and determine the sources of variation in a dataset. The ANOVA test calculates the
F-statistic, which compares the variation between the means of the groups to the
variation within the groups. If the variation between the groups is significantly
larger than the variation within the groups, then there is evidence to suggest that
there are significant differences between the means of the groups.
There are several types of ANOVA tests, including one-way ANOVA, two-way ANOVA, and
repeated measures ANOVA. One-way ANOVA is used when there is one independent
variable (for example, treatment group). Two-way ANOVA is used when there are two
independent variables (for example, treatment group and gender). Repeated measures
ANOVA is used when the same participants are measured under different conditions or
at varied times.
Table 6.1 shows the formulas used to calculate the F-test score of ANOVA, where k is the number of groups, nⱼ is the number of observations in group j, and n is the total number of observations:

Source of variation    Sum of Squares                         Degrees of freedom   Mean Squares          F-score
Between groups (SSG)   SSG = Σⱼ nⱼ(X̄ⱼ − X̄)²                  k − 1                MSG = SSG / (k − 1)   F = MSG/MSW
Within groups (SSW)    SSW = Σⱼ Σᵢ (Xᵢⱼ − X̄ⱼ)²               n − k                MSW = SSW / (n − k)
Total (SST)            SST = Σⱼ Σᵢ (Xᵢⱼ − X̄)² = SSG + SSW    n − 1

Table 6.1: Formula to Calculate F-test Score

To understand this, here is an example:


Suppose there is data on the test scores of students in three different schools (A,
B, and C). It is
to be determined if there are any significant differences in the mean test scores
among the three
schools. Table 6.2 shows the test scores for each school.

School   Test Scores
A        80   85   90   95
B        75   80   85   90
C        70   75   80   85
Table 6.2: Test Scores
To perform an ANOVA, first calculate the overall mean of all the scores:
Overall mean = (80+85+90+95+75+80+85+90+70+75+80+85)/12 = 82.5
The group means are 87.5 (School A), 82.5 (School B), and 77.5 (School C).
Next, calculate the Sum of Squares between Groups (SSG) by summing the squared deviation of each group mean from the overall mean, weighted by the group size:
SSG = 4(87.5 - 82.5)² + 4(82.5 - 82.5)² + 4(77.5 - 82.5)² = 4(25) + 4(0) + 4(25) = 200
Then, calculate the Sum of Squares within groups (SSW) by finding the sum of squares of the deviation of each score from its group mean:
SSW = (80-87.5)² + (85-87.5)² + (90-87.5)² + (95-87.5)² + (75-82.5)² + (80-82.5)² + (85-82.5)² + (90-82.5)² + (70-77.5)² + (75-77.5)² + (80-77.5)² + (85-77.5)² = 375
Finally, calculate the Degrees of Freedom (DF) and Mean Squares (MS) for both SSG and SSW:
df_G = 3 - 1 = 2; df_W = 12 - 3 = 9
Mean Square between Groups (MSG) = SSG/df_G = 200/2 = 100
Mean Square Within groups (MSW) = SSW/df_W = 375/9 ≈ 41.67
Now the F-ratio can be calculated:
F = MSG/MSW = 100/41.67 = 2.4
To determine whether this F-ratio is statistically significant, it is compared with the F-distribution with df_G = 2 and df_W = 9. Assuming a significance level of 0.05, the critical F-value for this test is 4.2565. Since the calculated F-value (2.4) is less than the critical F-value (4.2565), the null hypothesis cannot be rejected. One concludes that the data does not provide evidence of a significant difference in the mean test scores among the three schools.
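A one-way ANOVA like this can be reproduced in plain Python to check the arithmetic (a minimal sketch; the variable names are illustrative, not from the text):

```python
# Test scores for the three schools from Table 6.2.
groups = {
    "A": [80, 85, 90, 95],
    "B": [75, 80, 85, 90],
    "C": [70, 75, 80, 85],
}

all_scores = [x for scores in groups.values() for x in scores]
overall_mean = sum(all_scores) / len(all_scores)

# Between-groups sum of squares: group size times the squared deviation
# of each group mean from the overall mean.
ssg = sum(
    len(s) * (sum(s) / len(s) - overall_mean) ** 2
    for s in groups.values()
)

# Within-groups sum of squares: squared deviation of every score
# from its own group mean.
ssw = sum(
    (x - sum(s) / len(s)) ** 2
    for s in groups.values() for x in s
)

k = len(groups)       # number of groups
n = len(all_scores)   # total number of observations
msg = ssg / (k - 1)   # mean square between groups
msw = ssw / (n - k)   # mean square within groups
f_ratio = msg / msw

print(ssg, ssw, round(f_ratio, 2))  # 200.0 375.0 2.4
```

Since 2.4 is below the 0.05 critical value of F(2, 9) = 4.2565, the null hypothesis of equal means is not rejected.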
ANOVA is a powerful tool for analyzing data and can be used in a variety of fields,
including
psychology, sociology, biology, and economics. It is important to use caution when
interpreting
results of an ANOVA test, as there are many factors that influence the results,
such as sample
size, outliers, and non-normal distributions.

6.4 MANOVA
MANOVA stands for Multivariate
Analysis of Variance. It is a statistical
technique used to analyze the
relationships among two or more
continuous dependent variables and
one or more independent variables.
MANOVA is an extension of the
ANOVA technique, which only
analyzes one dependent variable. The main objective of MANOVA is to
determine whether there are any
significant differences among the
groups formed by the independent
variables in terms of the dependent
variables. MANOVA allows researchers
to analyze multiple dependent
variables simultaneously, which can
provide a more comprehensive
understanding of the relationship
between the independent and
dependent variables.
MANOVA can be used in various fields
such as social sciences, medicine, and
engineering to analyze the effects of
different variables on a system.
MANOVA requires certain assumptions
to be met, such as normality of the
data and equality of covariance
matrices across groups. If these
assumptions are not met, alternative
techniques such as non-parametric
tests may be used.
6.5 Summary
 ANOVA compares means in multiple groups, assuming normal distribution and equal
variances.
 One-way and two-way classification are two different methods to group and analyze
data
based on one or two factors, respectively.
 One-way classification is used when there is only one categorical independent variable
being used to classify the data.
 Two-way classification is used when there are two categorical independent variables
being used to classify the data.
 MANOVA is used to analyze the relationships between two or more continuous
dependent
variables and one or more independent variables.

6.6 Check Your Progress

1. In ANOVA, if the F-statistic is larger than the critical value, then:


A. Reject the null hypothesis
B. Fail to reject the null hypothesis
C. Cannot make a decision
D. Accept the null hypothesis

2. One-way ANOVA is used when:


A. There is one independent variable and one dependent variable
B. There are multiple independent variables and one dependent variable
C. There is one independent variable and multiple dependent variables
D. To compare the correlations of two groups

3. Two-way ANOVA is used when:


A. There is one independent variable and one dependent variable
B. There are multiple independent variables and one dependent variable
C. There is one independent variable and multiple dependent variables
D. There are two independent variables and one dependent variable

4. What does MANOVA stand for?

A. Multiple Analysis of Variance
B. Multivariate Analysis of Variance
C. Multi-Analysis of Variability
D. Multidimensional Analysis of Variance

5. What is the main difference between ANOVA and MANOVA?

A. ANOVA tests the differences between groups on one dependent variable, while MANOVA tests differences on two or more dependent variables
B. ANOVA tests differences between groups on two or more dependent variables, while MANOVA tests differences on one dependent variable
C. ANOVA can only be used with two groups, while MANOVA can be used with three or more groups
D. ANOVA assumes homogeneity of variance, while MANOVA assumes homoscedasticity

6. Which of the following is true about MANOVA?

A. It is a statistical technique used for analyzing the relationship between two variables
B. It is a statistical technique used for analyzing the relationship between three or more variables
C. It is a statistical technique used for analyzing the relationship between two or more categorical variables
D. None of these

7. ANOVA is used to test:

A. Differences between two population means
B. Differences between two sample means
C. Differences between multiple population means
D. Differences between multiple sample means

8. Which of the following is an example of one-way classification design?

A. A study comparing the effectiveness of two different treatments on depression
B. A study comparing the attitudes of men and women towards climate change
C. A study measuring the effects of two different levels of caffeine on reaction time
D. A study comparing the reading ability of children with and without dyslexia
6.6.1 Answers for Check Your Progress

1. A
2. A
3. D
4. B
5. A
6. B
7. C
8. D

Try It Yourself
1. A researcher conducts a study comparing the performance of three different types
of
fertilizer on the growth of tomato plants. The data collected shows the mean
heights for
the plants in each group: Group 1: 10 inches, Group 2: 12 inches, and Group 3: 15
inches.
Conduct an ANOVA to determine if there is a significant difference in plant growth
between the three groups.
2. A study was conducted to investigate the effect of three different types of diet
(A, B, and C) on the level of three different blood biomarkers (X, Y, and Z).
The data
collected are shown in Table 6.3. Conduct a MANOVA to determine if there is a
significant difference in the levels of three blood biomarkers based on the type of
diet.
      Diet A   Diet B   Diet C
X       10        8        5
Y       12       14       16
Z        8        6        4
Table 6.3: Sample Data
3. A researcher wants to compare the mean blood pressure of three groups: a control
group, a low-dose group, and a high-dose group. She measures the blood pressure of
10
participants in each group. The ANOVA F-test produces a p-value of 0.001. What can
be
concluded from this result?
4. A teacher wants to determine the significant difference in the mean scores of
math
exams of three groups of students, Group A, Group B, and Group C. Each group has
10
students. The list of scores of these students in each group are:
Group A: 70,75,80,85,90,95,100,105,110,115
Group B: 65,70,75,80,85,90,95,100,105,110
Group C: 60,65,70,75,80,85,90,95,100,105
Conduct an ANOVA and interpret your results.
5. A chef wants to compare the average cooking time of three different ovens. He
randomly selects 12 recipes and bakes each recipe in all three ovens. The cooking
time
(in minutes) is recorded. What is the alternative hypothesis of this study?
6. A study was conducted to determine if there is a difference in the mean scores
of three
different groups on a standardized test. The ANOVA F-test produced a p-value of
0.5.
What does this result indicate?
7. A researcher conducted a MANOVA with two independent variables, each with two
levels, and four dependent variables. What is the Degrees of Freedom for the Wilks'
Lambda statistic?
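For exercise 4, the one-way ANOVA can be sketched the same way as in Section 6.3 (a minimal pure-Python sketch; the variable names are illustrative, not from the text):

```python
# Math exam scores for the three groups in exercise 4.
groups = [
    [70, 75, 80, 85, 90, 95, 100, 105, 110, 115],  # Group A
    [65, 70, 75, 80, 85, 90, 95, 100, 105, 110],   # Group B
    [60, 65, 70, 75, 80, 85, 90, 95, 100, 105],    # Group C
]

all_scores = [x for g in groups for x in g]
overall_mean = sum(all_scores) / len(all_scores)   # 87.5

# Between-groups sum of squares (weighted by group size)
# and within-groups sum of squares.
ssg = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups)
ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

k, n = len(groups), len(all_scores)                # 3 groups, 30 scores
f_ratio = (ssg / (k - 1)) / (ssw / (n - k))

print(ssg, ssw, round(f_ratio, 3))  # 500.0 6187.5 1.091
```

The resulting F-ratio (about 1.09) is well below the 0.05 critical value F(2, 27) ≈ 3.35, so this data would not show a significant difference between the mean scores of the three groups.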
