APTECH LIMITED
Contact E-mail: ov-support@[Link]
Edition 1 – 2023
Preface
In this Learner’s Guide, we delve into the heart of inferential statistics, guiding you through the fundamental concepts, methodologies, and applications that empower us to make informed decisions in the face of uncertainty. Whether you are eager to grasp the intricacies of statistical inference, keen to enhance your decision-making skills, or simply intrigued by the magic of numbers, this book is designed to cater to your intellectual curiosity.
The book provides a comprehensive exploration of key topics in statistics, offering a structured learning journey. Beginning with an introduction to statistics, the progression includes fundamental concepts such as probability, correlation, and regression, leading into more advanced discussions on inferential statistics. It also delves into specialized topics, exploring exact sampling distributions and the intricacies of analysis of variance, collectively building a solid foundation in statistical methods.
This book is the result of a concentrated effort of the Design Team, which is continuously striving to bring you the best and the latest in Information Technology. The process of design has been a part of the ISO 9001 certification for Aptech-IT Division, Education Support Services. As part of Aptech’s quality drive, this team does intensive research and curriculum enrichment to keep it in line with industry trends.
Design Team
Table of Contents
Session 1: Introduction to Statistics
Session 2: Introduction to Probability
Session 3: Correlation and Regression
Session 4: Inferential Statistics
Session 5: Exact Sampling Distribution
Session 6: Analysis of Variance
Session 01
Introduction to Statistics
This session introduces the fundamental concepts of statistics and the measures of central tendency.
In this session, you will learn to:
Define statistics
Describe the applications of statistics in Data Science
Describe descriptive statistics
Explain measures of central tendency
1.1 OVERVIEW OF STATISTICS
Every day, people come across terms such as average, range, frequency, data, population, sample, and many more. What is worth wondering about is the depth of these terms and how widely they are used. The term STATISTICS originated from the Italian word ‘Statista’, which means ‘political state’. In ancient times, the government used to collect data on its population to have an idea of the country’s manpower and to introduce new taxes.
1.1.1 DEFINITION OF STATISTICS
Statistics is a branch of mathematics. Here, after data is collected, it is analyzed and interpreted.
It provides a framework for making decisions and predictions based on data.
In order to use
statistics effectively, it is important to understand the different types of
data, the measures of
central tendency and variability, and the basic principles of probability.
1.1.2 APPLICATIONS OF STATISTICS IN DATA SCIENCE
Statistics plays a crucial role in Data Science, which is the interdisciplinary
field that uses statistical
and computational methods to extract insights and knowledge from data. It has a
wide range of
applications across many fields, including science, business, economics,
medicine, engineering,
and more.
Here are some common applications of statistics:

1.2 DESCRIPTIVE STATISTICS
Central Tendency
•Measures of central tendency are used to describe the center of a distribution. The most used measures of central tendency are the mean, median, and mode. Suppose a teacher wants to calculate the average score of her students on a recent math exam. In this case, the mean would be an appropriate measure. The mean is calculated by adding up all the scores and dividing by the total number of students. This gives a single number that represents the average score of the class.
Dispersion
•Measures of dispersion are used to describe the spread of a distribution. The most used measures of dispersion are the range, variance, standard deviation, skewness, and kurtosis. Skewness of income distribution is usually calculated to determine the appropriate tax policies or design social programs to reduce income inequality.
Mean: The mean is the average of all the values in a dataset. The formula to calculate the mean is:
X̄ = (Σxᵢ) / n
Where xᵢ are the data values and n is the number of values.
Example : Consider the dataset of 5 values: 4,6,8,10,12.
Mean = (4+6+8+10+12)/5
= 40/5
= 8
Here, the mean of the data is 8.
The mean is a useful measure of central tendency. It considers all the values in a dataset and gives a single value that represents the ‘typical’ value in the dataset. However, it can be sensitive to extreme values or outliers, which can skew the value of the mean. In such cases, other measures of central tendency, such as the median or mode, may be more appropriate.
Median: The median is a statistical measure of central tendency that represents
the middle
value of a dataset arranged in order from smallest to largest (or largest to
smallest). In other
words, it is the value that separates the lower half of the dataset from the upper
half. To
calculate the median, first, the data should be arranged in order from smallest to
largest (or
largest to smallest). If there is an odd number of data points, the median is the middle value. If there is an even number of data points, the median is the average of the two middle values.
Example : Consider the dataset: 1, 3, 5, 7, 9.
To find the median, first arrange the data in order: 1, 3, 5, 7, 9
Since there are five data points, which is an odd number, the median is the middle
value, which
is 5.
Now, consider another dataset: 2, 4, 6, 8, 10, 12
To find the median, first arrange the data in order: 2, 4, 6, 8, 10, 12
Since there are six data points, which is an even number, the median is the average
of the two
middle values, which are 6 and 8. Therefore, the median is (6+8)/2 = 7 .
Mode: The mode is a statistical measure of central tendency that represents the
most
frequently occurring value in a dataset. It is the value that appears most
frequently in the
dataset.
To calculate the mode, identify the value that appears most frequently in the
dataset. In some
cases, a dataset may have more than one mode if there are multiple values that
occur with
the same highest frequency.
Example : Consider the following dataset: 3, 4, 5, 5, 6, 7, 7, 7, 8.
In this dataset, the value 7 appears most frequently, occurring three times, while
all other
values occur only once or twice. Therefore, the mode of this dataset is 7. Now,
consider
another dataset: 1, 2, 2, 3, 4, 4, 5, 5
In this dataset, the values 2, 4, and 5 each occur twice, which means this dataset has three modes: 2, 4, and 5.
It is important to note that not all datasets have a mode. For example, in the dataset 1, 3, 5, 7, 9, no value appears more than once, so this dataset does not have a mode.
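These three measures can also be computed with Python's built-in statistics module. The following is a minimal sketch using the datasets from the examples above; statistics.multimode lists every value tied for the highest frequency.

```python
import statistics

data = [3, 4, 5, 5, 6, 7, 7, 7, 8]           # dataset from the mode example

print(statistics.mean(data))                  # 5.777... (sum of values / count)
print(statistics.median(data))                # 6 (middle value of the sorted data)
print(statistics.mode(data))                  # 7 (most frequently occurring value)

# A dataset can have several modes; multimode lists all of them.
print(statistics.multimode([1, 2, 2, 3, 4, 4, 5, 5]))   # [2, 4, 5]
```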
Figure 1.1 shows a visual representation of mean, median, and mode.
Variance : The variance is the average of the squared deviations of each data
point from the
mean. It provides a measure of the spread of the data, but is influenced by extreme
values.
The formula to calculate variance is:
σ² = Σ(xᵢ − x̄)² / n
Where xᵢ are the data values, x̄ is the mean of the data, and n is the number of values in the data.
Example : Consider the data values: 2,3,5,6,8,10,12,18
Mean of data : (2+3+5+6+8+10+12+18)/8 = 64/8 = 8
Variance = [(2 − 8)² + (3 − 8)² + (5 − 8)² + (6 − 8)² + (8 − 8)² + (10 − 8)² + (12 − 8)² + (18 − 8)²]/8
= (36 + 25 + 9 + 4 + 0 + 4 + 16 + 100)/8
= 194/8 = 24.25
Standard Deviation: The standard deviation is the square root of the variance. It
provides a
measure of the spread of the data in the same units as the original data and is a
commonly
used measure of dispersion. The formula to calculate standard deviation is:
σ = √[Σ(xᵢ − x̄)² / n] = [Σ(xᵢ − x̄)² / n]^(1/2)
Example: In the example of variance, the variance of the data is 24.25. Hence, the standard deviation of the same data is (24.25)^(1/2) = 4.92.
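As a quick check of the two worked examples above, the population variance and standard deviation can be computed with a short Python sketch:

```python
import math

data = [2, 3, 5, 6, 8, 10, 12, 18]
n = len(data)
mean = sum(data) / n                                # 64 / 8 = 8

variance = sum((x - mean) ** 2 for x in data) / n   # 194 / 8 = 24.25
std_dev = math.sqrt(variance)                       # about 4.92

print(variance, std_dev)
```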
Skewness measures the asymmetry of the data. The formula to calculate skewness is:
Skewness = 3(Mean − Median) / Standard Deviation
If Mean > Median, then the data is positively skewed.
If Mean < Median, then the data is negatively skewed.
If Mean = Median, then the data has no skewness.
Figure 1.3 shows the graphs of positively skewed, negatively skewed, and non-skewed data.
Kurtosis measures the peakedness of the data. The formula to calculate kurtosis is:
Kurtosis = [n * Σ(xᵢ − x̄)⁴] / [Σ(xᵢ − x̄)²]²
Figure 1.4 shows three types of kurtosis graphs.
Example: Consider the data values: 10, 50, 30, 20, 10, 20, 70.
Mean of data = 210/7 = 30 and Median of data = 20.
Σ(xᵢ − x̄)⁴ = (10 − 30)⁴ + (50 − 30)⁴ + (30 − 30)⁴ + (20 − 30)⁴ + (10 − 30)⁴ + (20 − 30)⁴ + (70 − 30)⁴
= 160000 + 160000 + 0 + 10000 + 160000 + 10000 + 2560000 = 3060000
[Σ(xᵢ − x̄)²]² = [(10 − 30)² + (50 − 30)² + (30 − 30)² + (20 − 30)² + (10 − 30)² + (20 − 30)² + (70 − 30)²]²
= [400 + 400 + 0 + 100 + 400 + 100 + 1600]² = 3000² = 9000000
Kurtosis = 7 * (3060000/9000000) = 2.38
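Both dispersion formulas above can be applied to the same data with a small Python sketch:

```python
import statistics

data = [10, 50, 30, 20, 10, 20, 70]
n = len(data)
mean = sum(data) / n                        # 30
median = statistics.median(data)            # 20

dev2 = sum((x - mean) ** 2 for x in data)   # 3000
dev4 = sum((x - mean) ** 4 for x in data)   # 3060000
std_dev = (dev2 / n) ** 0.5                 # about 20.7

skewness = 3 * (mean - median) / std_dev    # about 1.45 -> positively skewed
kurtosis = n * dev4 / dev2 ** 2             # about 2.38

print(skewness, kurtosis)
```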
1.3 Summary
In Statistics, data is collected, analyzed, and interpreted.
Statistical methods are used in a wide range of fields, including science, business, economics, engineering, and social sciences.
Statistics has two branches, Descriptive Statistics and Inferential Statistics.
Descriptive statistics include measures of central tendency, such as the mean, median, and mode, and measures of dispersion, such as the range, variance, standard deviation, skewness, and kurtosis.
Mean is the average value of the data, while median is the positional average.
Mode is the most frequently occurring value in the data.
Variance provides a measure of the spread of the data but is influenced by extreme values.
Standard Deviation is the square root of variance.
Skewness measures the level of asymmetry in the data, whereas kurtosis measures the level of peakedness in the data.
1.4 Check Your Progress
A. Mean B. Median
C. Standard Deviation D. Skewness
A. Mean B. Range
C. Kurtosis D. Median
4. ___________ is the name of the graph in the kurtosis theory where the kurtosis >
0.
A. Q2 – Q1 B. Q3 – Q1
C. Q1 – Q3 D. Q1 – Q2
A. Mean B. Median
C. Mode D. Variance
A. Mesokurtic B. Leptokurtic
C. Platykurtic D. Mountkurtic
A. Mean > Median B. Mean < Median
C. Mean < Mode D. Mean = Median
1.4.1 Answers for Check Your Progress
1. A
2. B
3. D
4. B
5. B
6. B
Try It Yourself
1. Find the mean of the first 15 whole numbers.
2. Find the mean of the following data: 2.2, 10.2, 14.7, 5.9, 4.9, 11.1, 10.5 .
3. Find the median of first 15 natural numbers.
4. Find the median of the following data: 1, 7, 2, 4, 5, 9, 8, 3.
5. The weights in kg of 10 students are as follows: 39, 43, 36, 38, 46, 51, 33, 44,
44, 43. Find
the mode of this data. Is there more than 1 mode? If yes, why?
6. Consider the following table:
Denominations: 10, 20, 50, 5, 100
Number of notes: 40, 30, 100, 50, 10
Find the mode of the data.
7. Following observations are arranged in ascending order. The median of the data
is 25 . Find
the value of x. 17, x, 24, x + 7, 35, 36, 46
8. The mean of 6, 8, x + 2, 10, 2x - 1, and 2 is 9. Find the value of x and also
the value of the
observation in the data.
9. Find the variance and standard deviation of the following data:
173, 149, 165, 157, 164.
10. Given following information, find the variance:
Mean = 179, n= 3000, SD = 9
11. Determine the skewness of the following data:
12, 13, 54, 56, 25
In addition, identify whether the data is positively skewed, negatively skewed, or has zero skew.
12. Determine the kurtosis of the following data:
61, 64, 67, 70, 73
Identify whether the graph of the data is platykurtic, leptokurtic, or mesokurtic.
This session delves deeper into probability and its distributions, along with other concepts of probability such as random variables and the Central Limit Theorem.
In this session, you will learn to:
Describe classical probability and probability distributions
Explain random variable and its expectation
Explain Central Limit Theorem

Session 02
Introduction to Probability
2.1 EXPLORING PROBABILITY
Probability is a concept in statistics that quantifies the chance of an event occurring. It is expressed as a numerical value ranging from zero to one. A probability of zero signifies that an event is impossible, while a probability of one signifies that an event is guaranteed to occur. For example, if a fair coin is tossed, there are two possible outcomes: heads or tails. The probability of getting heads is 1/2 or 0.5. Getting heads is one favorable outcome out of two possible outcomes (getting heads or tails).
The probability of the union of two events A and B is calculated as:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Where P(A) is the probability of event A, P(B) is the probability of event B, and P(A ∩ B) is the probability of the intersection of events A and B.
Note that the formula subtracts the probability of the intersection of events A and
B to avoid
double counting the probability of the intersection.
The probability of the intersection of two events can be calculated using the formula:
P(A ∩ B) = P(A) * P(B|A)
Where P(A) is the probability of event A and P(B|A) is the conditional probability of event B given that event A has occurred.
Note that the conditional probability P(B|A) is the probability of event B
occurring given that
event A has already occurred.
These concepts of union and intersection are fundamental in probability theory and
are used
extensively in various applications such as statistics, machine learning, and
decision-making.
2.1.2 CONDITIONAL PROBABILITY – BAYES THEOREM
Conditional probability is the probability of an event occurring given that another event has already occurred. It is denoted by P(A|B), which means the probability of event A given that event B has already occurred. The formula for conditional probability is:
P(A|B) = P(A ∩ B) / P(B)
Where, P(A ∩ B) is the probability of both events A and B occurring and P(B) is the probability of event B occurring.
For example, there is a bag with 10 marbles, six of which are red and four of which
are blue. If a
marble is randomly select ed from the bag, the probability of getting a red marble
is:
P(Red) = 6/10 = 0.6
Now, consider that the red marble is put back in the bag and another marble is randomly selected. Because the first marble is replaced, the two selections are independent, so:
P(Blue ∩ Red) = P(Red) * P(Blue) = (6/10) * (4/10) = 0.24
Applying the conditional probability formula:
P(Blue|Red) = P(Blue ∩ Red) / P(Red) = 0.24 / 0.6 = 0.4
This means that given that a red marble was selected on the first draw, the probability of selecting a blue marble on the second draw is 0.4 or 40%, the same as the unconditional probability of drawing a blue marble, as expected when the draws are independent.
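A minimal Python sketch of this conditional-probability calculation for the marble example (draws made with replacement):

```python
# 10 marbles: 6 red and 4 blue; the first marble is put back before the second draw.
p_red_first = 6 / 10                              # P(Red on the first draw)
p_blue_second = 4 / 10                            # P(Blue on the second draw)

p_blue_and_red = p_red_first * p_blue_second      # P(Red first AND Blue second) = 0.24
p_blue_given_red = p_blue_and_red / p_red_first   # P(Blue | Red) = 0.4

print(p_blue_given_red)
```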
Bayes' theorem relates these conditional probabilities as follows:
P(A|B) = [P(B|A) * P(A)] / P(B)
Where,
P(A|B) is the conditional probability of event A given event B.
P(B|A) is the conditional probability of event B given event A.
P(A) is the prior probability of event A.
P(B) is the prior probability of event B.
For example, suppose a disease affects 1% of a population. A diagnostic test detects the disease in 95% of people who have it, but it also gives a positive result for 5% of people who do not have it. If a patient tests positive (event B), what is the probability that the patient actually has the disease (event A)?
Using this information, the values can be plugged into Bayes' theorem and calculated as shown:
P(A|B) = P(B|A) * P(A) / P(B)
P(A|B) = 0.95 * 0.01 / (0.95 * 0.01 + 0.05 * 0.99)
P(A|B) = 0.16
Therefore, the probability that the patient actually has the disease given a
positive test result is
only 16%. This may seem counterintuitive, but it is due to the fact that the test
is not 100 %
accurate and there is a relatively low prior probability of someone having the
disease.
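The same calculation can be written as a short Python sketch, using the values from the example (1% prevalence, 95% detection rate, 5% false-positive rate):

```python
p_disease = 0.01              # P(A): prior probability of having the disease
p_pos_given_disease = 0.95    # P(B|A): probability of a positive test given the disease
p_pos_given_healthy = 0.05    # probability of a positive test without the disease

# P(B): total probability of a positive test result
p_positive = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_pos, 2))   # 0.16
```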
The expectation (or expected value) of a random variable X is its long-run average value. For a discrete random variable, it is the sum of the products of the possible values of X and their corresponding probabilities:
E(X) = Σ x * P(X = x)
For a continuous random variable, it is the integral of the product of the variable and its probability density function:
E(X) = ∫ x * f(x) dx
A commonly used discrete distribution is the binomial distribution. Its probability mass function is given by:
P(X = k) = C(n, k) * p^k * q^(n − k)
Where:
P(X = k) is the probability of getting k successes in n trials
C(n, k) is the binomial coefficient, also known as "n choose k", which represents the number of ways to choose k successes from n trials and is calculated as: n! / (k! * (n - k)!)
p is the probability of success in a single trial
q = (1 - p) is the probability of failure in a single trial
k is the number of successes in n trials
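A minimal Python sketch of the binomial probability mass function, illustrated with the probability of getting exactly two heads in three tosses of a fair coin:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)"""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binomial_pmf(2, 3, 0.5))   # 0.375, that is 3/8
```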
The probability mass function of the Poisson distribution is given by:
P(X = k) = (λ^k * e^(−λ)) / k!
Where λ is the average rate of occurrence of the events in the interval and k is the number of events that occur in that interval.
Some key properties of the Poisson distribution include:
The PDF of a uniform distribution is given by:
f(x) = 1 / (b − a) for a ≤ x ≤ b
Where a and b are the lower and upper bounds of the interval and f(x) represents the probability density function at any point x within the interval.
Some key properties of the uniform distribution include:
The normal distribution's probability density function is given by the equation:
f(x) = (1 / (σ√(2π))) * e^(−(1/2) * ((x − μ)/σ)²)
Where x is the random variable, μ is the mean, σ is the standard deviation, π is
the mathematical
constant pi and e is the base of the natural logarithm.
Figure 2.3 shows a normal distribution graph.
For example, consider a normal distribution with mean μ = 10 and standard deviation σ = 2, and suppose the value of the PDF at x = 12 is needed. The PDF of a normal distribution is given by the following formula:
f(x) = (1 / (σ√(2π))) * e^(−(1/2) * ((x − μ)/σ)²)
Plugging in the values, the following is arrived at:
f(12) = (1 / (2√(2π))) * e^(−(1/2) * ((12 − 10)/2)²)
Simplifying the exponent:
f(12) ≈ 0.120985
So the PDF of our normal distribution at x = 12 is approximately 0.120985. This
means that the
probability of x being exactly equal to 12 is very small, but the probability of x
being near 12 is
relatively high.
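A small Python sketch that evaluates the normal PDF at x = 12 for a distribution with mean 10 and standard deviation 2, reproducing the value above:

```python
import math

def normal_pdf(x, mu, sigma):
    """f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-0.5 * ((x - mu) / sigma) ** 2)"""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    return coefficient * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

print(normal_pdf(12, 10, 2))   # approximately 0.120985
```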
The normal distribution has several important properties, such as the empirical rule or 68-95-99.7 rule. This states that approximately 68% of the observations in a normal distribution are within one standard deviation of the mean, about 95% of the observations are within two standard deviations of the mean, and almost all observations (99.7%) are within three standard deviations of the mean.
Figure 2.4 shows the visual representation of the 68-95-99.7 rule.
2.5 CENTRAL LIMIT THEOREM
The Central Limit Theorem (CLT) describes the behavior of the means of a large number of independent random variables. In simple terms, the CLT states that if many random samples are taken from any population, then the means of those samples will approximate a normal distribution, regardless of the shape of the population's original distribution.
Figure 2.5 shows a visual representation of the central limit theorem.
Assumptions of CLT:
The assumptions of the central limit theorem include the following:
Independence: The sample data should be collected independently of each other.
Sample Size: The sample size should be large enough to ensure that the sample mean is normally distributed.
Identical Distribution: The population from which the sample is drawn should have an identical distribution.
Finite Variance: The population from which the sample is drawn should have a finite variance.
It is important to note that the central limit theorem applies to a wide range of distributions, regardless of whether the underlying distribution is normal or not. Additionally, the central limit theorem holds even when the sample size is small, provided that the underlying distribution is approximately symmetric and not too skewed.
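The behaviour described by the CLT can be seen in a small simulation. The sketch below repeatedly samples from a strongly skewed (exponential) population and records each sample mean; the distribution of those means is approximately normal, centred on the population mean:

```python
import random
import statistics

random.seed(0)

sample_size = 50
num_samples = 2000

# Each entry is the mean of one sample of 50 draws from an exponential
# distribution whose population mean is 1.0.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

print(statistics.mean(sample_means))    # close to 1.0
print(statistics.stdev(sample_means))   # close to 1.0 / sqrt(50), about 0.14
```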
2.6 Summary
2.7 Check Your Progress
3. A coin is flipped three times. What is the probability of getting exactly two
heads?
A. 1/8 B. 3/8
C. 1/4 D. 3/4
A. Symmetric B. Unimodal
C. Skewed D. Bell-Shaped
5. Which of the following is a discrete probability distribution?
A. Exponential B. Gamma
C. Weibull D. Bernoulli
6. Which of the following statements is TRUE about the Central Limit Theorem?
A. 1/6 B. 1/5
C. 1/4 D. 1/3
A. It guarantees that the sample
mean will always be close to the
population mean. B. It applies to any population
distribution, regardless of its
shape.
C. It states that the distribution of the
sample means will be identical to
the population distribution. D. It requires that the sample size
be less than 30.
7. Which of the following statements is TRUE about the Central Limit Theorem?
A. 5 B. 10
C. 50 D. 100
1. C
2. A
3. B
4. C
5. D
6. B
7. C
Try It Yourself
1. A jar contains four red balls and six black balls. Two balls are drawn at random
without
replacement. What is the probability that both balls are red?
2. In a class of 30 students, 18 are boys and 12 are girls. If a student is
selected at random,
what is the probability that the student is a boy?
3. A card is drawn at random from a standard deck of 52 cards. What is the
probability that it
is a face card (jack, queen, or king)?
4. A certain disease affects one in 1,000 people in a population. A test for the disease is 99% accurate: it correctly identifies 99% of people who have the disease and, in addition, correctly identifies 99% of people who do not have the disease. If a person tests positive for the disease, what is the probability that they actually have the disease?
5. A company employs two types of workers: skilled and unskilled. 60% of the
skilled workers
and 40% of the unskilled workers are union members. The union represents 45% of the
total workforce. If a worker is chosen at random from the company and it is known that the worker is a union member, what is the probability that the worker is skilled?
6. A company manufactures light bulbs and it is known that 2% of the bulbs are defective. If a sample of 100 bulbs is randomly selected, what is the probability that exactly three of them are defective?
Also: A basketball player has a 70% success rate in free throws. If he attempts 10 free throws, what is the probability that he will make at least eight of them?
7. A manufacturer of electronic components claims that 5% of its components are
defective. If a
sample of 200 components is randomly selected, what is the probability that less
than 10 of
them are defective?
8. A company produces light bulbs, and the lifetimes of the bulbs follow a normal
distribution
with a mean of 1,000 hours and a standard deviation of 100 hours. If the company
wants to
guarantee that at least 90% of the bulbs last for at least 800 hours, what minimum
lifetime
should the bulbs be designed for?
9. Suppose that the time it takes for a machine to complete a task is uniformly
distributed
between five and ten minutes. What is the probability that the machine will take
between
six and eight minutes to complete the task?
10. A bag contains ten red balls and eight blue balls. Two balls are drawn at random without replacement. If the first ball drawn is red, what is the probability that the second ball drawn is blue?
This session introduces the concepts of correlation and regression and how they play a significant role in statistics.
In this session, you will learn to:
Explain correlation
Describe the rank of correlation
Explain regression and its types

Session 03
Correlation and Regression
3.1 WHAT IS CORRELATION?
Correlation describes the relationship between two variables. It is often used in
data science and
research. Correlation indicates the degree to which the variables are associated
with each other .
It can be either positive or negative. A positive correlation implies that as one
variable goes up,
the other variable typically rises as well. A negative correlation suggests that
when one variable
increases, the other tends to decrease . For example, there is a positive
correlation between
smoking and the risk of lung cancer. As the number of cigarettes smoked per day
increases, the
risk of developing lung cancer also increases. On the other hand, there is a
negative correlation
between exercise and body weight.
When exercise levels rise, body weight tends to drop.
The strength of the correlation can range from -1 to +1. A value of -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no correlation at all.
3.1.1 SCATTER CHART
A scatter chart displays the relationship between two numerical variables. It
consists of a set of
data points, each of which represents the values of the two variables for a single
observation.
The values of one variable are plotted along the horizontal axis, while the values of the other variable are plotted along the vertical axis.
In a scatter chart, each data point is represented by a dot. The position of the
dot on the chart
corresponds to the values of the two variables for that observation. The chart can
be used to
identify patterns or trends in the data, as well as to identify outliers or unusual
observations.
Scatter charts are commonly used in scientific research, engineering, economics,
and other fields
to explore relationships between variables and to identify patterns in data. They
can also be
useful for visualizing data sets with a large number of observations.
Figure 3.1 shows an illustration of a scatter plot.
Figure 3.1: Scatter Plot
3.1.2 COEFFICIENT OF CORRELATION
The coefficient of correlation is denoted by r. It shows the strength and direction
of the linear
relationship between two variables. It is a value that ranges from -1 to 1. A value of -1 signifies a complete negative correlation, where an increase in one variable is matched by a decrease in the other. A value of 0 indicates no correlation, and a value of 1 indicates a perfect positive correlation, where an increase in one variable is matched by an increase in the other. There are different methods used to
calculate
correlation between variables. The most used are Karl Pearson’s coefficient of
correlation and
Spearman’s Rank correlation coefficient.
The formula for Karl Pearson's coefficient of correlation for a set of observations is:
r = Cov(x, y) / (σx * σy)
Where, x and y are two variables, Cov(x, y) is the covariance between x and y, and σx and σy are the standard deviations of x and y respectively.
Here is an example of how to calculate Karl Pearson's coefficient of correlation.
Consider the following data:
X Y
2 3
3 5
4 4
5 6
6 7
The calculation for one row is given as follows:
Here x1= 2, 𝒙̅ = 4, x1 - 𝒙̅ = 2 - 4 = -2
The data has 5 records; hence, the index i takes the values 1, 2, 3, 4, 5.
Similarly, y1= 3, 𝒚̅ = 5, y1 - 𝒚̅ = 3 - 5 = -2
Hence, [ x1 - 𝒙̅] ∗ [y1 - 𝒚̅] = -2 * -2 = 4
In similar manner all other values can be calculated.
𝝈𝒙 = (10/5)1/2 = 1.414
Cov(X,Y) = 9/5 = 1.8
𝝈𝒚 = (10/5)1/2 = 1.414
r = 1.8/(1.414 * 1.414) = 0.9
So, the correlation coefficient between X and Y is 0.9 . This indicates a strong
positive correlation
between the two variables.
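The full calculation for this example can be reproduced with a short Python sketch:

```python
import math

x = [2, 3, 4, 5, 6]
y = [3, 5, 4, 6, 7]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n                            # 4 and 5
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n   # 9 / 5 = 1.8
std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)           # about 1.414
std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)           # about 1.414

r = cov / (std_x * std_y)
print(round(r, 2))                                                 # 0.9
```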
3.1.3 RANK CORRELATION
Rank correlation is used to assess the strength and direction of the relationship
between two
variables by comparing their rankings. There are several types of rank correlation measures, but the most commonly used one is Spearman's rank correlation coefficient.
SPEARMAN’S RANK CORRELATION COEFFICIENT
Spearman's rank correlation coefficient is calculated using the following formula:
r = 1 − (6 Σd²) / [n(n² − 1)]
Where Σd² is the sum of the squared differences between the ranks of the two variables and n is the number of observations. The formula is sometimes written with ΣR² in place of Σd², where ΣR² denotes the same sum of squared rank differences.
To calculate r, first assign ranks to the observations of each variable, then calculate the differences between the ranks of the two variables. Finally, substitute these values into the formula. The resulting value of r will range from -1 to +1, with a value of 0 indicating no correlation and values closer to -1 or +1 indicating stronger correlations.
Consider the following data:
X Y
10 12
7 5
5 8
3 1
6 7
8 10
2 3
4 4
9 11
1 2
To calculate Spearman's rank correlation coefficient, first rank the values of each
variable, from
lowest to highest. Assign ranks by sorting the values and then assign ranks
according to their
position in the sorted list. If there are ties (that is, multiple values with the
same rank), assign the
average rank to those values.
It can be seen from the table that the elements are sorted in ascending order and ranks are assigned from 1 to n (n = 10 here).
Table 3.2 shows the ranks for X and Y.
X Rank of X Y Rank of Y
10 10 12 10
7 7 5 5
5 5 8 7
3 3 1 1
6 6 7 6
8 8 10 8
2 2 3 3
4 4 4 4
9 9 11 9
1 1 2 2
Table 3.2 : Ranks for X and Y
Table 3.3 shows how the differences between the ranks of X and Y are calculated, squared, and added.
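The rank differences for this example can also be computed and substituted into the formula with a short Python sketch (the helper function below assumes there are no tied values, as in Table 3.2):

```python
def rank(values):
    """Assign rank 1 to the smallest value and rank n to the largest (no ties assumed)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

x = [10, 7, 5, 3, 6, 8, 2, 4, 9, 1]
y = [12, 5, 8, 1, 7, 10, 3, 4, 11, 2]

# Sum of squared differences between the ranks of X and Y
d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank(x), rank(y)))   # 14
n = len(x)

r_s = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(d_squared, round(r_s, 3))   # 14 and 0.915 for this data
```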
The probable error of the coefficient is calculated as:
PE = 0.6745 * (1 − r²) / √n
Where PE is the probable error, r is the sample rank correlation coefficient, and n is the sample size.
This formula assumes that the underlying data are normally distributed and that the
sample is
representative of the population. It also assumes that the sample size is large
enough for the
central limit theorem to apply.
Suppose there are two variables, X and Y and the ranks obtained for each variable
are as follows:
X: 3, 7, 6, 2, 5, 1, 4
Y: 2, 5, 3, 1, 7, 6, 4
To calculate the rank correlation coefficient (Spearman's rank) and the probable error of this coefficient, the procedure illustrated earlier in this section can be used to find the value of r.
Therefore, r = 0.452
The probable error of a rank correlation coefficient measures the likely amount of
error or
uncertainty in the sample estimate of the true rank correlation coefficient. It is
a measure
of the variability of the sample rank correlation coefficient that would be
expected if the
study were repeated many times.
Now, this value of r can be used in the formula to calculate the probable error:
Probable error = 0.6745 * (1 − 0.452²) / sqrt(7) ≈ 0.20
Therefore, the probable error of the rank correlation coefficient is approximately 0.20. If the experiment were repeated with different samples of the same size, the true value of the rank correlation coefficient would be expected to fall within +/- 0.20 of the observed value about 50% of the time.
3.2 REGRESSION
Regression is used to determine the relationship between a dependent variable
(also
known as the response variable) and one or more independent variables (also known
as
predictor variables).
The goal of regression analysis is to create a mathematical model that describes
the relationship between the variables . It can also be used to make predictions
about the dependent variable based on values of the independent variables.
There are different types of regression analysis, including linear regression,
logistic regression, polynomial regression, and others.
Linear regression is the most commonly used type of regression analysis and it
involves finding a straight line that best fits the data points in a scatter plot.
Regression analysis can be used in many fields, including economics, finance,
biology,
engineering, and social sciences, to study the relationship between variables and
make
predictions about future outcomes.
Figure 3.3 illustrates the graph of linear regression.
The regression coefficients b0 and b1 are calculated using the formulas:
b0 = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
b1 = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
Simple Linear Regression
•In simple linear regression, there is only one independent variable X. The relationship between X and Y is modeled using a straight-line equation: Y = b0 + b1X. Here, b0 and b1 are the intercept and slope coefficients, respectively.
•For example: To predict the weight of a person using height, the height is the independent variable X and weight is the dependent variable Y. The equation hence becomes Weight = b0 + b1 * Height.
Multiple Linear Regression
•In multiple linear regression, there are two or more independent variables X1, X2, ..., Xn. The relationship between these variables and Y is modeled using the equation Y = b0 + b1X1 + b2X2 + ... + bnXn. Here, b0, b1, b2, ..., bn are the coefficients.
•For example: To predict the weight of a person using height and age, the height and age are the independent variables X1 and X2 and weight is the dependent variable Y. The equation hence becomes Weight = b0 + b1 * Height + b2 * Age.
Consider the following data containing the age and glucose level of six patients. Here, age is the independent variable (x), with the help of which the glucose level, the dependent variable (y), will be predicted. The equation hence becomes:
Y = b0 + b1X
Table 3.4 shows the calculation of terms used in calculating the regression
coefficients.
X Y XY X² Y²
41 97 3977 1681 9409
22 66 1452 484 4356
23 79 1817 529 6241
44 73 3212 1936 5329
57 87 4959 3249 7569
59 81 4779 3481 6561
Sum = 246 Sum = 483 Sum = 20196 Sum = 11360 Sum = 39465
Table 3.4: Calculation of Regression Coefficients
Putting values in the formula:
b0 = (5486880 – 4968216)/(68160 - 60516) = 518664/7644 = 67.85
b1 = (121176 - 118818)/ (68160 - 60516) = 2358/7644 = 0.30
Hence the final equation becomes:
Y= 67.85 + 0.30*X
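The regression coefficients for this example can be verified with a small Python sketch:

```python
x = [41, 22, 23, 44, 57, 59]    # age
y = [97, 66, 79, 73, 87, 81]    # glucose level
n = len(x)

sum_x, sum_y = sum(x), sum(y)                    # 246 and 483
sum_xy = sum(a * b for a, b in zip(x, y))        # 20196
sum_x2 = sum(a * a for a in x)                   # 11360

denominator = n * sum_x2 - sum_x ** 2                   # 7644
b0 = (sum_y * sum_x2 - sum_x * sum_xy) / denominator    # intercept, about 67.85
b1 = (n * sum_xy - sum_x * sum_y) / denominator         # slope, about 0.3085 (0.30 above)

print(round(b0, 2), round(b1, 2))
```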
3.2.2 NON-LINEAR REGRESSION
Non-linear regression is a powerful tool for analyzing complex data. This requires
careful
attention to the model assumptions and selection of appropriate techniques for
fitting the
model.
Figure 3.4 illustrates an example of non-linear regression between two variables,
petal length
and petal width. It can be seen that the graph is a curve which shows non-linear
regression.
3.4 Check Your Progress
5. What is the purpose of the coefficient of determination (R-squared) in the context of multiple linear regression?
Try It Yourself
1. Calculate Pearson’s correlation coefficient and Spearman’s rank correlation
coefficient.
X: 2, 4, 6, 8, 10
Y: 1, 3, 5, 7, 9
X: 1, 2, 3, 4, 5
Y: 5, 4, 3, 2, 1
2. We have a 0.6 correlation coefficient and 30 pairs of samples. Calculate the
probable error
in this example.
3. Calculate probable error of rank correlation for following data:
X: 70, 76, 71, 98, 88, 61, 79
Y: 45, 44, 72, 67, 48, 35, 40
4. Calculate the regression coefficients and form the regression equation with
following data:
X: 1,2,3,4,5,6,7
Y: 9,8,10, 12, 11, 13, 14
This session introduces the basics of inferential statistics by diving deep into
sampling theory and hypothesis testing.
In this session, you will learn to:
Explain hypothesis testing
Describe sampling theory
Describe confidence intervals & level of significance
Session 04
Inferential Statistics
4.1 INTRODUCTION TO INFERENTIAL STATISTICS
Inferential statistics is a branch of statistics that deals with making inferences
or conclusions
about a population based on information obtained from a sample. It uses sample data to draw inferences about a larger population.
inferences about a larger population.
Inferential statistics uses probability theory to make these inferences. It helps
researchers to
make generalizations about a population by studying a subset of individuals or
objects from the
population.
To carry out inferential statistics, researchers typically begin with a hypothesis or claim about a population. Next, they collect a sample of data from the population and use statistical methods to analyze the sample data. Based on the results of the analysis, they make inferences about the population.
Inferential statistics can be used in a wide range of applications, including
marketing research,
medical studies, social sciences, and more. It allows researchers to draw
conclusions about a
population with a certain level of confidence and can help decision-makers make
informed
decisions based on data.
4.2 SAMPLING THEORY
Sampling theory is a field of statistics that deals with the selection of a subset of individuals or objects from a larger population. The goal of sampling is to gather information about the
about the
population by studying the characteristics of the sample. Sampling theory provides
a framework
for selecting a sample that is representative of the population and for estimating
the parameters
of interest, such as the mean, variance, or proportion.
In practical terms, sampling theory is used in a variety of fields, such as market
research, social
sciences, public health, and manufacturing. For example, a market research firm may
use
sampling theory to select a representative sample of consumers to survey about
their
preferences for a new product. In public health, researchers may use sampling
theory to select a
sample of patients from a population to study the prevalence of a disease.
Sampling theory involves a variety of techniques for selecting a sample, including
random
sampling, stratified sampling, cluster sampling, and systematic sampling. The
choice of sampling
technique depends on the characteristics of the population and the research
objectives. Once a
sample is selected, statistical methods can be used to estimate population
parameters and to
quantify the precision and accuracy of the estimates.
Figure 4.1 shows a visual representation of sampling.
Simple Random Sampling
•In this approach, every individual in the population has an equal likelihood of being selected. This usually entails assigning numerical values to each member and then employing a random number generator to make the selections.
Stratified Sampling
•Here, the population is categorized into subgroups (or strata) based on specific characteristics like age, gender, or income. Participants are then randomly chosen from each subgroup, aiming for a sample that accurately represents the entire population.
Cluster Sampling
•Cluster sampling involves dividing the population into clusters or groups, then randomly selecting clusters, and including all members within those chosen clusters in the sample.
Systematic Sampling
•This method involves organizing the population into a sequence or list. A random starting point is selected, and then every nth member in the sequence is included in the sample.
Convenience Sampling
•Convenience sampling selects participants who are easily available and willing to take part, like students in a class or customers in a store. However, it may not always yield a representative sample.
Snowball Sampling
•This method relies on participants identifying and referring others. For instance, researchers might start with a couple of participants and ask them to suggest additional willing participants. It is often used when the target population is challenging to identify or access.
Every sampling method has its unique advantages and drawbacks. The selection of the
appropriate method hinges on factors such as the precise research question, the
characteristics
of the target population, and the resources at the researcher's disposal.
In summary, parameters are used to describe populations, while statistics are used
to describe
samples. Parameters are often estimated using statistics, which allow inferences
about the
population based on the sample.
Standard error of the mean (SEM):
SEM = s / √n
Where, s is the standard deviation of the sample and n is the sample size.
Standard error of the proportion (SE):
SE = √[p * (1 − p) / n]
Where, p is the proportion of successes in the sample and n is the sample size.
Standard error of the difference between means (SED):
SED = √(s1²/n1 + s2²/n2)
Where, s1 and s2 are the standard deviations of the two samples and n1 and n2 are the sample sizes.
Consider an example.
Suppose there is a sample of 20 students and the average height of all students at a school needs to be estimated. The height of each student in the sample is measured and the mean height is calculated to be 170 cm. The standard deviation of the sample is 5 cm.
To calculate the standard error of the mean, following formula is used:
SE = s / sqrt(n)
Where, s is the standard deviation of the sample and n is the sample size.
In our example, s = 5 cm and n = 20, so:
SE = 5 / sqrt(20) = 5 / 4.472 = 1.118 cm
Therefore, the standard error of the mean is 1.118 cm. This means that if the mean height were calculated for many samples of the same size drawn from the same population, the standard deviation of those sample means would be approximately 1.118 cm.
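The three standard-error formulas can be expressed in a few lines of Python. The sketch below uses the height example for the SEM; the proportion and difference-of-means values are illustrative numbers, not taken from the text:

```python
import math

# Standard error of the mean: s = 5 cm, n = 20 (height example above)
sem = 5 / math.sqrt(20)                          # about 1.118

# Standard error of a proportion (illustrative: p = 0.4 in a sample of 100)
se_prop = math.sqrt(0.4 * (1 - 0.4) / 100)       # about 0.049

# Standard error of the difference between means (illustrative values)
s1, n1, s2, n2 = 4, 30, 5, 40
sed = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)     # about 1.08

print(sem, se_prop, sed)
```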
For example:
The process of hypothesis testing involves four steps:
If the test statistic falls in the rejection region (it is more extreme than the critical value, or the p-value is less than the chosen level of significance), then the null hypothesis is rejected in favor of the alternative hypothesis. If the test statistic does not fall in the rejection region (it is not more extreme than the critical value, or the p-value is greater than the chosen level of significance), then the null hypothesis cannot be rejected.
Type 1 Error
•In statistical hypothesis testing, a Type
1 error (also known as a false positive)
occurs when a null hypothesis is rejected
even though it is actually true.
•In other words, it is the error of concluding
that there is a significant difference or
effect when there is no such difference or
effect.
•Type 1 error is usually represented by the
symbol α (alpha). It is commonly set at a
significance level of 0.05 or 0.01 , which
means that the probability of making a
type 1 error is 5% or 1%, respectively.
•Type 1 errors can be reduced by increasing the sample size, choosing a lower significance level, or by using more rigorous statistical methods.
Type 2 Error
•Type 2 error, also known as false negative,
is a statistical term used to describe a
situation where a hypothesis test fails to
reject a null hypothesis that is false. In
other words, a Type 2 error occurs when it
is concluded that there is no significant
difference between two groups or
variables, when in fact, there is a
difference.
•Type 2 errors are common in hypothesis
testing, particularly when the sample size is
small, or the effect size is weak. The
probability of making a Type 2 error is
denoted by the symbol β (beta) and is related to the level of significance α (alpha) of the hypothesis test.
•To reduce the likelihood of making a Type 2 error, one can increase the sample size or use more sensitive statistical tests with higher power to detect differences between groups or variables. However, reducing the risk of a Type 2 error usually comes at the cost of increasing the risk of a Type 1 error (rejecting a null hypothesis that is true).
Figure 4.3 shows the region of rejection and region of acceptance.
A confidence interval provides a range of values that is likely to contain the true population parameter. For example, suppose a sample of 50 dogs has a mean weight of 25 pounds with a sample standard deviation of 3 pounds, and a 95% confidence interval for the population mean weight is required.
To calculate the confidence interval, first find the Standard Error of the Mean (SEM), which is the standard deviation of the sample mean:
SEM = σ / sqrt(n)
Where, σ is the population standard deviation (which is not known here, so the sample standard deviation can be used as an estimate) and n is the sample size.
So in this case:
SEM = 3 / sqrt(50) = 0.424
Next, one needs to find the critical value for a 95% confidence interval. This can be looked up in a standard normal distribution table or found using a calculator. For a two-tailed 95% confidence interval, the critical value is 1.96.
Now, the confidence interval can be calculated:
CI = X ± (critical value * SEM)
So in the example, the 95% confidence interval is:
CI = 25 ± (1.96 * 0.424) = [24.17, 25.83]
This means one can be 95% confident that the true mean weight of the population of
dogs lies
somewhere between 24.17 and 25.83 pounds, based on the sample data.
Confidence intervals are useful in statistical inference, as they provide a measure
of the precision
of estimates and help make inferences about the population based on the sample
data.
4.6 MAKING A DECISION ON A POPULATION
To make a statistical decision on a population, one typically needs to follow a
structured process
that involves several steps:
It is important to note that making a statistical decision on a population requires
careful
planning, data collection, and analysis. It is also important to be aware of
potential biases or
limitations in your data and to interpret your results within the broader context
of your research
question or hypothesis.
For example, suppose the claim that the mean weight of apples is greater than 50 grams needs to be tested. Take a sample of 30 apples and calculate the sample mean weight to be 52 grams with a standard deviation of 4 grams. Now, calculate the critical value at a level of significance of 0.05 to test the null hypothesis that the mean weight is 50 grams.
Frame the null and alternative hypothesis.
H0: Mean weight of the apples <= 50 grams
H1: Mean weight of the apples > 50 grams
This constitutes an upper-tail test, so the critical region of area α = 0.05 lies on the right side. This means that the area up to the upper critical value is 1 − 0.05 = 0.95, so the z-value corresponding to a probability of 0.95 is needed. From the z-table, this value comes out to be 1.65. The formula for calculating the critical value is:
CV = μ ± Zc * (σ/√n)
Since this is a one-tail test, instead of using the ± symbol, only the + symbol is used, which gives the upper critical value.
Critical Value (CV) = 50 + 1.65*(4/5.477) = 51.2
Since the sample mean of 52 grams is greater than the calculated critical value of 51.2, the sample mean lies in the rejection region. Hence, the null hypothesis is rejected.
In general, making a decision with the critical value method involves the following steps:
Formulate the null hypothesis and alternative hypothesis.
Choose an appropriate test statistic and calculate its value based on the sample data.
Determine the level of significance (α) for the test.
Look up the critical value for the chosen level of significance and the test statistic's degrees of freedom.
Compare the calculated test statistic value with the critical value.
If the calculated test statistic value is greater than the critical value, reject the null hypothesis. If it is less than or equal to the critical value, fail to reject the null hypothesis.
Equivalently, the calculated test statistic is (52 − 50)/(4/sqrt(30)) = 2.74. Using a t-distribution table or calculator with 29 degrees of freedom (n − 1), the critical value is 1.699 at a level of significance of 0.05. Since 2.74 is greater than 1.699, the null hypothesis is again rejected.
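The comparison for the apple example can be sketched in a few lines of Python, using the sample values given above:

```python
import math

# H0: mean weight <= 50 g, H1: mean weight > 50 g (upper-tail test)
sample_mean, hypothesized_mean = 52, 50
s, n = 4, 30
critical_t = 1.699     # one-tailed critical value at alpha = 0.05 with 29 degrees of freedom

t_statistic = (sample_mean - hypothesized_mean) / (s / math.sqrt(n))   # about 2.74

if t_statistic > critical_t:
    print("Reject the null hypothesis")          # 2.74 > 1.699
else:
    print("Fail to reject the null hypothesis")
```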
Overall, the critical value method is a useful tool in statistical decision-making.
It provides a clear
threshold for rejecting or failing to reject a null hypothesis based on the test
statistic and level of
significance.
4.6.2 p VALUE METHOD
The p-value method is a commonly used statistical technique for making decisions
based on
data. In this method, a hypothesis is formulated and then tested using a
statistical test. The p-
value is calculated, which represents the probability of obtaining a result as
extreme or more
extreme than the observed result, assuming the null hypothesis is true.
If the p-value is less than a predetermined significance level, typically 0.05, the
null hypothesis is
rejected, and the alternative hypothesis is accepted. If the p-value is greater
than the
significance level, the null hypothesis is not rejected, and the alternative
hypothesis is not
accepted.
It is important to note that rejecting the null hypothesis does not necessarily
mean that the
alternative hypothesis is true, only that the data provide evidence against the
null hypothesis.
When using the p-value method to make a decision, it is important to carefully
consider the
assumptions and limitations of the statistical test being used. It is also
important to consider the
context of the data and the potential consequences of the decision. Additionally,
it is important
to consider other factors beyond statistical significance, such as effect size and
practical
significance.
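A minimal sketch of the p-value method for a lower-tail z-test, using assumed illustrative values (hypothesized mean 50, sample mean 49, s = 3, n = 36) rather than an example from the text:

```python
import math

def standard_normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# H0: mu = 50, H1: mu < 50 (lower-tail test) with assumed sample values
sample_mean, hypothesized_mean, s, n = 49, 50, 3, 36
z = (sample_mean - hypothesized_mean) / (s / math.sqrt(n))   # -2.0

p_value = standard_normal_cdf(z)                             # about 0.023
alpha = 0.05
print(p_value, "reject H0" if p_value < alpha else "fail to reject H0")
```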
Steps to make a decision using p-value method:
4.8 Check Your Progress
6. What is a p-value?
4.8.1 Answers for Check Your Progress
1. C
2. B
3. C
4. B
5. A
6. A
7. A
Try It Yourself
1. A researcher is interested in estimating the average weight of a certain type of
fruit. A
sample of 50 fruits is selected and the average weight is found to be 250 grams
with a
standard deviation of 20 grams. Calculate a 95% confidence interval for the
population
mean weight of this type of fruit.
2. A manufacturer claims that the mean weight of a box of cereal is 12 ounces. A
consumer
advocacy group suspects that the mean weight is actually less than 12 ounces. A
random
sample of 36 boxes is selected, and the mean weight is found to be 11.8 ounces with
a
standard deviation of 0.6 ounces. Test the hypothesis that the mean weight is less
than
12 ounces at a significance level of 0.05 using p-value method.
3. A company claims that their new product has a mean lifespan of at least five
years. A
sample of 25 products is tested, and the sample mean lifespan is found to be 4.8
years
with a standard deviation of 0.7 years. Test the hypothesis that the mean lifespan
is less
than five years at a significance level of 0.01 using p-value method.
4. Suppose you want to test the null hypothesis that the population mean is equal
to 50,
against the alternative hypothesis that it is greater than 50. You take a random
sample
of size 25 from the population and obtain a sample mean of 52 and a sample standard
deviation of 5. Use the critical value method with a significance level of 0.05 to
test the
hypothesis.
5. Suppose you want to test the null hypothesis that the population proportion is
equal to
0.4, against the alternative hypothesis that it is less than 0.4. You take a random
sample
of size 100 from the population and obtain 35 successes. Use the critical value
method
with a significance level of 0.01 to test the hypothesis.
6. A company manufactures a certain type of product. A sample of 50 products was
taken
and the mean weight was found to be 500 grams with a standard deviation of 20
grams.
What is the standard error of the mean weight?
7. A survey of 1000 adults was conducted to estimate the average number of hours
they
spend watching TV per day. The mean number of hours was found to be 3.5 hours with
a standard deviation of 1.2 hours. What is the standard error of the mean?
8. A survey of 500 people was conducted to determine the proportion of people who
support a particular political candidate. Of the 500 people surveyed, 250 said they
support the candidate. Calculate the 99% confidence interval for the true
proportion of
people who support the candidate.
This session introduces various parametric tests that can be executed on a population to identify patterns in it.
In this session, you will learn to:
Explain Chi-Square test
Explain T-test
Explain Z-test
Explain F-test
Session 05
Exact Sampling Distribution
5.1 INTRODUCTION TO EXACT SAMPLING DISTRIBUTIONS
Exact sampling distributions refer to the probability distribution of a statistic
that is obtained
through a process of repeated sampling from a population. The exact sampling
distribution of a
statistic is a theoretical distribution .
For example, suppose the mean weight of all dogs in a particular city needs to be
estimated . One
could take a sample of dogs from the city and calculate the mean weight of the
sample. One
could then repeat this process many times, each time taking a new random sample of
dogs, and
calculate the mean weight of each sample. The resulting distribution of means would
be the
exact sampling distribution of the sample mean.
Exact sampling distributions can be derived mathematically using probability theory
and
statistical methods. They can provide valuable insights into the properties of
statistical
estimators and hypothesis tests. They are often used to make inferences about
population
parameters based on sample statistics, such as estimating confidence intervals or
conducting
hypothesis tests.
5.2 CHI-SQUARE TEST
The Chi-square test is a statistical test used to determine the association between
two
categorical variables. It is used to test whether there is a significant difference
between the
observed frequencies and the expected frequencies in a contingency table.
In a contingency table, the rows represent one categorical variable and the columns
represent
another categorical variable. Each cell in the table contains the frequency of the
occurrence of a
combination of the two variables. The Chi-square test determines if there is a
significant
difference between the observed frequencies in the table and the expected
frequencies,
assuming that the variables are independent.
The Chi-square test is calculated by comparing the observed frequencies in each
cell of the
contingency table to the expected frequencies. These are calculated based on the assumption of independence between the variables. The formula for the Chi-square statistic is:
χ² = Σ [(observed frequency − expected frequency)² / expected frequency]
Where, χ² is the test statistic, and the sum is taken over all the cells in the contingency table.
One real-life example where chi-square test is commonly used is in analyzing the
results of a
survey or poll. For instance, consider a political poll where a sample of 1,000
voters were asked
to choose between two candidates in an upcoming election. The data collected could
be
arranged in a contingency table with the rows representing the two candidates and
the columns
representing the responses from the sample of voters.
Table 5.1 shows an example of contingency tables.
Candidate A Candidate B
Observed 450 550
Expected 650 350
Table 5.1: Contingency Table
The question now is whether the difference between the responses for the two
candidates is
statistically significant. To determine this, a chi-square test can be performed on
the data in the
contingency table.
The chi-square test will generate a p-value. This gives the probability of getting a difference between the two candidates as extreme as the one observed, assuming that there is no real difference between them in the population. If the p-value is less than a predetermined significance level (such as 0.05), one can reject the null hypothesis that there is no difference between the candidates and conclude that there is a statistically significant difference.
Consider another example.
Suppose there is a study, where 100 participants were administered two different
medications
(A and B) for a specific medical condition. The effect of these two medications is
to be compared.
The null hypothesis for this test is that there is no significant difference between the effects of the two medications. One can record the number of participants who improved with each medication.
Also, it is expected that Medication A and Medication B might have a positive
effect on 65
people and 35 people out of 100 people, respectively. However, a test was conducted
on the
patients and it was observed that Medication A cured 40 people and Medication B
cured 60
people.
Medication A Medication B
Observed 40 60
Expected 65 35
Table 5.2: Obtained Data
To perform a chi-square test on this data, first calculate the numerator of the
chi-square
formula. This is done by subtracting the expected values from the observed values.
Table 5.3
shows the calculation of numerator of Chi-Square formula.
Medication A Medication B Row Total
Observed 40 60 100
Expected 65 35 100
(Observed − Expected)² (−25)² (25)²
Table 5.3: Numerator of Chi-Square Formula
Here, the expected frequencies (denominators of the chi-square formula) are 65 and 35.
The (Observed − Expected)² values (numerators of the chi-square formula) are (−25)² and (25)².
Now, putting these values in the formula for chi-square:
χ² = [(−25)²/65] + [(25)²/35]
= [625/65] + [625/35]
= 9.61 + 17.85
= 27.46
Finally, determine the Degrees of Freedom (DF) for the test.
For a 2x2 contingency table, DF = (number of rows − 1) x (number of columns − 1) = 1 x 1 = 1.
Figure 5.1 shows a Chi-square distribution chart which will help in finding the
critical value.
5.3 Z-TEST
A z-test is a statistical test used to determine whether two population means are
different when
the sample size is large and the population variance is known. It is a hypothesis
test that
compares the means of two samples, usually from normally distributed populations.
The z-test is named after the standard normal distribution, which is a probability
distribution
with a mean of 0 and a standard deviation of 1. The test works by transforming the
sample
means into z-scores, which represent the number of standard deviations that the
sample mean
is from the population mean. The z-score is then compared to a critical value from
the standard
normal distribution to determine whether to reject or fail to reject the null hypothesis.
The null hypothesis in a z-test is that the means of the two populations being
compared are
equal. The alternative hypothesis states that the means are unequal. The test statistic is calculated using the formula:
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
Where, x̄1 and x̄2 are the sample means, σ1 and σ2 are the population standard deviations, and n1 and n2 are the sample sizes. The resulting z-score is compared to a critical value from the standard normal distribution based on the desired level of significance and the number of tails of the test.
The sample mean productivity before implementing the new process is 80 units and the sample mean productivity after implementing the new process is 90 units. Hence, x̄1 = 90 and x̄2 = 80.
5.4 T-TEST
A t-test is a statistical test used to compare means when the population variance is unknown. The test statistic is calculated using the formula:
t = (x̄1 − x̄2) / (s / √n)
Where, x̄1 and x̄2 are the sample means, s is the sample standard deviation, and n is the sample size. The obtained t-score is gauged by comparing it to a critical value from the t-distribution, determined by the chosen significance level and the test's number of tails.
If the calculated t-score falls within the rejection region (that is, its absolute
value is > the critical
value), one can reject the null hypothesis. One can conclude that there is a
statistically significant
difference between the means of the two populations. If the t-score falls outside
the rejection
region, one can fail to reject the null hypothesis and conclude that there is
insufficient evidence
to support a difference between the means.
There are two types of t-tests: the independent samples t-test and the paired samples t-test. The independent samples t-test compares the means of two independent samples. The paired samples t-test compares the means of two related samples (for example, before and after measurements from the same individuals).
Consider an example. Suppose a nutritionist wants to test whether a new diet plan
results in a
significant weight loss for participants. The nutritionist randomly selects 20
participants and
divides them into two groups: a control group and a treatment group. The control
group follows
their usual diet plan, while the treatment group follows the new diet plan. The
nutritionist
records the weight of each participant at the beginning and end of the study. The
data is
distributed normally and the population variance is unknown.
The null hypothesis is that there is no difference in the mean weight loss between
the control
group and the treatment group. The alternative hypothesis is that the mean weight
loss in the
control group is less than the mean weight loss in the treatment group.
The sample mean weight loss for the control group is two pounds with a standard
deviation of
1.5 pounds. The sample mean weight loss for the treatment group is five pounds with
a standard
deviation of two pounds.
To test the hypothesis, the nutritionist performs a two-sample independent t-test. With 10 participants in each group, the test statistic is calculated as follows:
t = (5 - 2) / sqrt((2^2/10) + (1.5^2/10)) = 3.79
Assuming a significance level of 0.05 and degrees of freedom of 18 (n1 + n2 - 2), one can look up the critical value from the t-distribution table or obtain it from statistical software; for a one-tailed test, the critical value is 1.734.
Since the calculated t-value (3.79) is greater than the critical value (1.734), one can reject the null hypothesis. One can conclude that the mean weight loss in the treatment group is statistically significantly greater than the mean weight loss in the control group.
Therefore, the nutritionist can conclude that the new diet plan results in a
significant weight loss
for participants compared to the control group.
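These figures can be checked in a few lines of Python. The following is a minimal sketch built from the summary statistics above, assuming SciPy is available; ttest_ind_from_stats works directly from means, standard deviations, and sample sizes.

# Two-sample t-test from summary statistics, assuming SciPy is available.
from scipy.stats import t, ttest_ind_from_stats

# Treatment group: mean 5 lb loss, SD 2 lb; control group: mean 2 lb loss, SD 1.5 lb; 10 per group.
t_stat, p_two_sided = ttest_ind_from_stats(
    mean1=5, std1=2.0, nobs1=10,
    mean2=2, std2=1.5, nobs2=10,
    equal_var=True,
)

# One-tailed critical value at the 5% significance level with 18 degrees of freedom.
critical = t.ppf(1 - 0.05, df=18)

print(f"t = {t_stat:.2f}, one-tailed critical value = {critical:.3f}")
if t_stat > critical:
    print("Reject the null hypothesis: the treatment group lost significantly more weight.")
else:
    print("Fail to reject the null hypothesis.")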
The t-test is widely used in many fields, including social sciences, engineering,
and business, to
test hypotheses about means and to compare the effectiveness of different
treatments or
interventions.
5.5 F-test
An F-test is a statistical test used to compare the variances of two or more populations.
It is a hypothesis test that determines whether the variability between two or more
groups is
significantly different or not. The F-test is named after the F-distribution, which
is a probability
distribution that arises when comparing the variances of two normally distributed
populations.
The null hypothesis in an F-test is that the variances of the populations being
compared are
equal. The alternative hypothesis states that at least one of the variances is
different. The test
statistic for an F-test is calculated by dividing the variance of one group by the
variance of
another group. If there are more than two groups, the F-test uses the ratio of the largest group variance to the smallest group variance.
If the calculated F-value falls within the rejection region (that is, it is > the
critical value), one can
reject the null hypothesis. One can conclude that there is a statistically
significant difference
between the variances of the groups. If the F-value falls outside the rejection
region, one can fail
to reject the null hypothesis and conclude that there is insufficient evidence to
support a
difference between the variances.
The F-test is often used in Analysis of Variance (ANOVA), a statistical technique
that compares
means across multiple groups. ANOVA is used to determine whether the means of two
or more
populations are equal, and the F-test is used to determine whether the variances
are equal.
The F-test is also used in regression analysis to test the overall significance of
the regression
model. The F-test is used to determine whether the variance in the dependent
variable
explained by the model is significantly greater than the variance that cannot be
explained by the
model.
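As an illustration of this use, the following is a minimal sketch on hypothetical data, assuming statsmodels and NumPy are installed; the fitted ordinary least squares results expose the overall F-statistic and its p-value.

# Overall F-test for a regression model on hypothetical data, assuming statsmodels is installed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))                                       # two hypothetical predictors
y = 3.0 + 1.5 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=0.5, size=50)

model = sm.OLS(y, sm.add_constant(x)).fit()

# The F-statistic tests whether the predictors jointly explain a significant share of the variance.
print(f"F = {model.fvalue:.2f}, p-value = {model.f_pvalue:.4g}")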
Here is a real-life example of an F-test:
Suppose a car manufacturer wants to compare the variances of fuel efficiency (in
miles per
gallon) of three different models of cars: Model A, Model B, and Model C. The
manufacturer
randomly selects 10 cars of each model and measures their fuel efficiency. The data
is normally
distributed.
The null hypothesis is that the variances of fuel efficiency for all three models are equal. The alternative hypothesis is that at least one of the variances is different.
To test the hypothesis, the manufacturer performs an F-test. The test statistic is calculated as the ratio of the largest variance to the smallest variance:

$F = \dfrac{s^2_{\text{largest}}}{s^2_{\text{smallest}}}$

Where, $s^2$ represents the sample variance.
Consider a significance level of 0.05 and degrees of freedom of 27 (n - k, where n is the total sample size and k is the number of groups). One can look up the critical value from the F-distribution table or obtain it from statistical software; for a two-tailed test, the critical value is 3.01.
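A variance-ratio F-test of this kind can be sketched as follows with hypothetical fuel-efficiency samples, assuming SciPy is available; here, one common convention compares the ratio against an F critical value whose numerator and denominator degrees of freedom are each group's sample size minus one.

# Variance-ratio F-test on hypothetical fuel-efficiency samples, assuming SciPy is available.
import numpy as np
from scipy.stats import f

# Hypothetical miles-per-gallon measurements for three car models (10 cars each).
model_a = np.array([30, 32, 31, 29, 33, 30, 31, 32, 29, 30], dtype=float)
model_b = np.array([25, 28, 24, 27, 26, 29, 23, 28, 25, 27], dtype=float)
model_c = np.array([35, 34, 36, 35, 33, 36, 34, 35, 36, 33], dtype=float)

# Sample variances (ddof=1 gives the unbiased sample variance).
variances = {name: g.var(ddof=1) for name, g in
             [("A", model_a), ("B", model_b), ("C", model_c)]}

f_ratio = max(variances.values()) / min(variances.values())

# Critical value using n - 1 degrees of freedom for each of the two groups being compared.
critical = f.ppf(0.95, dfn=9, dfd=9)

print(f"F = {f_ratio:.2f}, critical value = {critical:.2f}")
if f_ratio > critical:
    print("Reject the null hypothesis of equal variances.")
else:
    print("Fail to reject the null hypothesis.")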
5.6 Summary
• The sampling distribution of a statistic represents the range of possible values for that statistic when repeatedly sampled from a population under specific conditions.
• The Chi-square test is a statistical test employed to ascertain the relationship between two categorical variables.
• A z-test is a statistical test used to determine whether two population means are different when the sample size is large and the population variance is known.
• A t-test is a statistical test used to determine whether two population means are different when the sample size is small or the population variance is unknown.
• An F-test compares variances in two or more populations, checking if the differences in variability between groups are statistically significant.
5.7 Check Your Progress
5.7.1 Answers for Check Your Progress
1. A
2. A
3. B
4. C
5. D
6. B
7. A
Try It Yourself
1. A researcher wants to test whether there is a significant difference in the
distribution of
hair color among men and women. They survey 200 men and 200 women and find that
60 men and 80 women have blonde hair. Calculate the chi-square statistic and
degrees
of freedom for this test.
2. A biologist crosses two different strains of fruit flies and counts the number of offspring with each combination of traits. The results are as follows:
o Trait 1 and Trait 2: 120
o Trait 1 and not Trait 2: 80
o Not Trait 1 and Trait 2: 80
o Not Trait 1 and not Trait 2: 120
Help the biologist check if there is a significant association between two genetic
traits in
a population of fruit flies.
3. A manufacturer claims that the average lifespan of their product is five years.
A sample
of 50 products is taken and the average lifespan is found to be 4.5 years with a
standard
deviation of 1.2 years. Test whether the manufacturer's claim is true at a
significance
level of 0.05.
4. A company wants to test whether there is a significant difference in the mean
sales
revenue per day between two stores: Store A and Store B. Store A has an average
sales
revenue of $500 with a standard deviation of $50 . Store B has an average sales
revenue
of $550 with a standard deviation of $70. Conduct a two-sample t-test at the 1%
significance level.
5. A study wants to test whether a new weight loss pill is effective in reducing
weight. The
study includes 20 participants who took the pill for four weeks and lost an average
of
five pounds with a standard deviation of five pounds. Conduct a
one-sample t-test at the 5% significance level to determine if the weight loss is
significant.
7. A company is interested in comparing the performance of three different
marketing
strategies. Using data given in Table 5.4, perform an F-test to determine if there
is a
significant difference between the means of the three groups at a significance
level of
0.05. The population size here is 10. The five different domains where the
marketing
strategy is applied are: Fast Moving Consumer Durables (FMCD), Fast Moving Consumer
Goods (FMCG), Retail, E-Commerce, and Real Estate. The scores given are a count of
positive responses received after the marketing strategies are applied.
             FMCD   FMCG   Retail   E-commerce   Real Estate
Strategy 1   15     18     12       14           16
Strategy 2   10     22     15       14           13
Strategy 3   25     20     18       23           21
Table 5.4: Data for F-test
Source of variation   Sum of Squares                                            Degrees of freedom   Mean Squares              F-score
Between groups        $SSG = \sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X})^2$          k - 1                $MSG = SSG/(k-1)$         $F = MSG/MSW$
Within groups         $SSW = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_j)^2$   n - k            $MSW = SSW/(n-k)$
Total                 $SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X})^2$     n - 1
Table 6.1: Formula to Calculate F-test Score
Then, calculate the Sum of Squares Within groups (SSW) by finding the sum of squares of the deviation of each score from its group mean:
SSW = (80-87.5)² + (85-87.5)² + (90-87.5)² + (95-87.5)² + (75-82.5)² + (80-82.5)² + (85-82.5)² + (90-82.5)² + (70-77.5)² + (75-77.5)² + (80-77.5)² + (85-77.5)² = 375
Finally, calculate the Degrees of Freedom (DF) and Mean Squares (MS) for both SSG and SSW:
df_G = 3 - 1 = 2, df_W = 12 - 3 = 9
Mean Square Group (MSG) = SSG/df_G = 699.21875/2 = 349.609375
Mean Square Within (MSW) = SSW/df_W = 375/9 = 41.67 (approximately)
Now, we can calculate the F-ratio:
F = MSG/MSW = 349.609375/41.67 = 8.39 (approximately)
To determine whether this F-ratio is statistically significant, it is compared to
the F-distribution
with DF_G=2 and DF_W=9. Assuming a significance level of 0.05, the critical F-value
for this test
is 4.2565. Since the calculated F-value (8.39) is greater than the critical F-value (4.2565), one
can reject the null hypothesis. One can then conclude that there are significant
differences in the
mean test scores among the three schools.
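The same ANOVA steps can be scripted. The following is a minimal sketch on hypothetical scores (not the worked example above), assuming NumPy and SciPy are available; it mirrors the SSG, SSW, MSG, MSW, and F sequence and cross-checks the result with scipy.stats.f_oneway.

# One-way ANOVA on hypothetical scores, assuming NumPy and SciPy are available.
import numpy as np
from scipy import stats

groups = [
    np.array([12.0, 15.0, 14.0, 11.0]),   # hypothetical group 1
    np.array([18.0, 20.0, 17.0, 19.0]),   # hypothetical group 2
    np.array([10.0, 13.0, 12.0, 11.0]),   # hypothetical group 3
]

grand_mean = np.concatenate(groups).mean()
k = len(groups)                            # number of groups
n = sum(len(g) for g in groups)            # total number of observations

# Between-group (SSG) and within-group (SSW) sums of squares.
ssg = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

msg = ssg / (k - 1)                        # mean square between groups
msw = ssw / (n - k)                        # mean square within groups
f_ratio = msg / msw

# Cross-check against SciPy's built-in one-way ANOVA.
f_scipy, p_value = stats.f_oneway(*groups)
print(f"manual F = {f_ratio:.4f}, scipy F = {f_scipy:.4f}, p-value = {p_value:.4f}")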
ANOVA is a powerful tool for analyzing data and can be used in a variety of fields,
including
psychology, sociology, biology, and economics. It is important to use caution when interpreting the results of an ANOVA test, as factors such as sample size, outliers, and non-normal distributions can influence the results.
6.4 MANOVA
MANOVA stands for Multivariate Analysis of Variance. It is a statistical technique used to analyze the relationships among two or more continuous dependent variables and one or more independent variables. MANOVA is an extension of the ANOVA technique, which only analyzes one dependent variable.

The main objective of MANOVA is to determine whether there are any significant differences among the groups formed by the independent variables in terms of the dependent variables. MANOVA allows researchers to analyze multiple dependent variables simultaneously, which can provide a more comprehensive understanding of the relationship between the independent and dependent variables.

MANOVA can be used in various fields such as social sciences, medicine, and engineering to analyze the effects of different variables on a system. MANOVA requires certain assumptions to be met, such as normality of the data and equality of covariance matrices across groups. If these assumptions are not met, alternative techniques such as non-parametric tests may be used.
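In Python, one way to run a MANOVA is through statsmodels. The following is a minimal sketch on a hypothetical dataset, assuming pandas and statsmodels are installed; the column names (group, score1, score2) are illustrative only.

# MANOVA on hypothetical data, assuming pandas and statsmodels are installed.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical dataset: two continuous dependent variables measured across three groups.
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "score1": [10, 12, 11, 13, 12, 15, 16, 14, 17, 15, 20, 19, 21, 22, 20],
    "score2": [30, 32, 31, 29, 33, 35, 36, 34, 37, 36, 40, 41, 39, 42, 40],
})

# Model both dependent variables against the grouping factor.
fit = MANOVA.from_formula("score1 + score2 ~ group", data=df)
print(fit.mv_test())   # reports Wilks' lambda, Pillai's trace, and related statistics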
6.5 Summary
• ANOVA compares means in multiple groups, assuming normal distribution and equal variances.
• One-way and two-way classification are two different methods to group and analyze data based on one or two factors, respectively.
• One-way classification is used when there is only one categorical independent variable being used to classify the data.
• Two-way classification is used when there are two categorical independent variables being used to classify the data.
• MANOVA is used to analyze the relationships between two or more continuous dependent variables and one or more independent variables.
6.6 Check Your Progress
6. Which of the following is true about MANOVA?
6.6.1 Answers for Check Your Progress
1. A
2. A
3. D
4. B
5. A
6. B
7. C
8. D
Try It Yourself
1. A researcher conducts a study comparing the performance of three different types
of
fertilizer on the growth of tomato plants. The data collected shows the mean
heights for
the plants in each group: Group1: 10 inches, Group 2: 12 inches, and Group 3: 15
inches.
Conduct an ANOVA to determine if there is a significant difference in plant growth
between the three groups.
2. A study was conducted to investigate the effect of three different types of diet
(A, B, and C) on the level of three different blood biomarkers (X, Y, and Z).
The data
collected are shown in Table 6.3. Conduct a MANOVA to determine if there is a
significant difference in the levels of three blood biomarkers based on the type of
diet.
    Diet A   Diet B   Diet C
X   10       8        5
Y   12       14       16
Z   8        6        4
Table 6.3: Sample Data
3. A researcher wants to compare the mean blood pressure of three groups: a control
group, a low-dose group, and a high-dose group. She measures the blood pressure of
10
participants in each group. The ANOVA F-test produces a p-value of 0.001. What can
be
concluded from this result?
4. A teacher wants to determine the significant difference in the mean scores of
math
exams of three groups of students, Group A, Group B, and Group C. Each group has
10
students. The list of scores of these students in each group are:
Group A: 70,75,80,85,90,95,100,105,110,115
Group B: 65,70,75,80,85,90,95,100,105,110
Group C: 60,65,70,75,80,85,90,95,100,105
Conduct an ANOVA and interpret your results.
5. A chef wants to compare the average cooking time of three different ovens. He
randomly selects 12 recipes and bakes each recipe in all three ovens. The cooking
time
(in minutes) is recorded. What is the alternative hypothesis of this study?
6. A study was conducted to determine if there is a difference in the mean scores
of three
different groups on a standardized test. The ANOVA F-test produced a p-value of
0.5.
What does this result indicate?
7. A researcher conducted a MANOVA with two independent variables, each with two levels, and four dependent variables. What are the Degrees of Freedom for the Wilks' Lambda statistic?