University of Mumbai
Program – Bachelor of Engineering in
Computer Science and Engineering (Artificial Intelligence
and Machine Learning)
Class - T.E.
Course Code – CSDLO5011
Course Name – Statistics for Artificial Intelligence and Data Science
By
Prof. A.V.Phanse
Module 3 - Statistical Experiments and Significance Testing
Statistical Experiments
A statistical experiment is a process or set of procedures that generate data,
allowing you to observe the outcome of a random variable.
Experiments are designed to test hypotheses by manipulating one or more
independent variables to observe their effect on a dependent variable.
Steps in Conducting an Experiment:
1. Define the Hypothesis: Establish a clear, testable hypothesis (null and
alternative).
2. Design the Experiment: Plan how to manipulate variables and control
extraneous factors.
3. Collect Data: Perform the experiment and gather data.
4. Analyze Data: Use statistical methods to analyze the data.
5. Draw Conclusions: Determine whether the results support or reject the
hypothesis.
A/B Testing
An A/B test is an experiment with two groups to establish which of two
treatments, products, or procedures is superior to the other. A/B tests are
common in web design and marketing.
A/B testing is used to compare two versions of a webpage, app, or other digital
product to determine which one performs better in terms of a specific metric,
like conversion rate, click-through rate, or user engagement.
When you run an A/B test, you compare one page (the control) against one or
more variations that each contain one major difference in an element of the
control page.
After a set amount of time, or number of visits, you compare the results to see
how the change affected your chosen metric.
Every visitor will see one version of the page or another, and you’ll measure
conversions from each set of visitors.
A/B tests allow you to test one version of copy, images, forms, etc. against
another.
How A/B Testing Works:
Create Two Variants (A and B):
Variant A is typically the current version (the control).
Variant B is the modified version (the treatment).
Randomly Split the Audience:
Users are randomly assigned to either Variant A or Variant B. This
randomization helps ensure that any differences in performance are due to
the changes made, rather than external factors.
Measure Performance:
Performance of both versions is tracked against the desired metric (e.g., sign-
ups, purchases etc.).
Analyze Results:
After collecting enough data, statistical analysis is performed to determine
whether the differences in performance between A and B are statistically
significant. This helps in deciding whether the new version (B) should replace
the current version (A).
To conclude this example (an illustrative test of an orange versus a green button):
It appears quite likely that the “A” variant (orange button) has a higher
conversion rate than the “B” variant (green button).
Decision: Keep the orange button
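The “analyze results” step is usually a two-proportion z-test on the conversion rates. Below is a minimal Python sketch (assuming SciPy is available); the visitor and conversion counts for the orange and green buttons are hypothetical, not taken from the slides.

```python
# Minimal sketch: two-proportion z-test for an A/B test.
# The counts below are hypothetical illustration values.
from math import sqrt
from scipy.stats import norm

n_a, conv_a = 1000, 120   # variant A (orange button, control): visitors, conversions
n_b, conv_b = 1000, 90    # variant B (green button, treatment)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under H0 (no difference)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
p_value = 2 * (1 - norm.cdf(abs(z)))                   # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the two conversion rates differ significantly.")
else:
    print("Fail to reject H0: no significant difference detected.")
```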
Applications of A/B Testing:
Web Design: Testing different layouts, button colors, or call-to-action text.
Marketing: Comparing different ad copy or email subject lines.
Product Development: Assessing new features or changes to existing ones.
Significance Testing
Significance testing is a statistical method used to determine if the observed
results in a sample are strong enough to infer that they apply to a larger
population.
A sample is used to test a condition and make a statement about the whole
population.
Steps of Hypothesis Testing
Hypotheses can be framed as follows:
1. Null Hypothesis (H₀): states that there is no effect, no difference, or no association.
2. Alternative Hypothesis (H₁): states that an effect, difference, or association exists.
The alternative hypothesis can take several forms:
Difference Hypothesis – predicts a difference between groups.
Example – students taught with method A score differently from students taught with method B.
Association Hypothesis – predicts a relationship between variables.
Example – hours of study are related to exam scores.
Undirected (non-directional) Hypothesis – states that a difference or association exists without specifying its direction; it is tested with a two-tailed test.
Example – the average study time of freshmen is different from 20 hours per week.
Directed (directional) Hypothesis – specifies the direction of the difference or association (greater than or less than); it is tested with a one-tailed test.
Example – the average study time of freshmen is more than 20 hours per week.
A hypothesis is a proposed explanation or prediction about a phenomenon or
a relationship between variables, often used as a starting point for further
investigation.
It is typically framed in a way that can be tested through experimentation,
observation, or analysis.
There are two main types of hypotheses:
Null Hypothesis (H₀): This assumes that
there is no relationship between the
variables being studied, or no effect or
difference exists. Researchers usually
test against the null hypothesis to see if
it can be rejected.
Alternative Hypothesis (H₁ or Ha): This
proposes that there is a relationship,
effect, or difference between the
variables being studied. It is the opposite
of the null hypothesis.
Hypothesis testing is a method for testing a claim about a parameter in a
population using data measured in a sample.
Hypothesis Testing is a type of statistical analysis in which you put your
assumptions about a population parameter to the test.
It is used to assess the relationship between two statistical variables.
Few examples of statistical hypothesis from real-life -
1. A teacher assumes that 60% of his college's students come from lower-middle-
class families.
2. A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for
diabetic patients.
In "classical" inferential statistics, it is always the null hypothesis that is
tested with a hypothesis test, i.e., the hypothesis that there is no difference
or no relationship.
Strictly speaking, a hypothesis test can only ever reject or fail to reject the
null hypothesis H0.
The non-rejection of H0 is not a sufficient reason to conclude that H0 is true.
Therefore, the wording "H0 was not rejected" is preferable to "H0 was retained."
Developing Null and Alternative Hypotheses
In statistical hypothesis testing, there are always two hypotheses.
The hypothesis to be tested is called the null hypothesis and given the symbol
H0.
The null hypothesis states that there is no difference between a hypothesized
population mean and a sample mean.
For example, if we were to test the hypothesis that college freshmen study 20
hours per week, we would express our null hypothesis as:
H₀: μ = 20
In this example, our alternative hypothesis would express that freshmen do not
study 20 hours per week:
H₁: μ ≠ 20
Deciding Whether to Reject or Not Reject the Null Hypothesis
The alternative hypothesis can be supported only by rejecting the null
hypothesis.
To reject the null hypothesis means to find a large enough difference between
your sample mean and the hypothesized (null) mean.
If the difference between the hypothesized mean and the sample mean is very
large, we reject the null hypothesis.
If the difference is very small, we do not reject the null hypothesis.
In each hypothesis test, we have to decide in advance what the magnitude of
that difference must be to allow us to reject the null hypothesis.
If we fail to find a large enough difference to reject, we fail to reject the null
hypothesis.
Why is there a probability of error in a hypothesis test?
Each time you take a sample, you of course get a different one, which means
that the results are different every time.
In the worst case, a sample is taken that happens to deviate very strongly from
the population and the wrong statement is made.
Therefore there is always a probability of error for every statement or
hypothesis.
Level of significance
The probability that the test statistic will fall in the critical region when the
null hypothesis is true is called the level of significance, denoted α.
The significance level is used to decide whether the null hypothesis should be
rejected or not.
If the p-value is smaller than the significance level, the null hypothesis is to be
rejected; otherwise, it is not to be rejected.
It is important to note that the significance level is always set before the test
and may not be changed afterwards in order to obtain the "desired" statement
after all. To ensure a certain degree of comparability, the significance level is
usually 5% or 1%.
If a significance level of 5% is set, it means that it is 5% likely to reject the null
hypothesis even though it is actually true. Similarly, If a significance level of 1% is
set, it means that it is 1% likely to reject the null hypothesis even though it is
actually true.
p-value
In hypothesis testing, the p-value is a probability that measures the strength of
evidence against the null hypothesis.
It tells us how likely it is to observe a test statistic as extreme as (or more
extreme than) the one obtained, assuming that the null hypothesis is true.
Assumption –
In the population, there is no difference in salary between men and women.
p-Value –
How likely is it to draw a sample in which the salaries of men and women differ
by more than 250 Euros?
A hypothesis test can be either one-tailed or two-tailed.
Two-tailed Hypothesis Tests
The examples discussed in previous slides indicate that the average study time
is either 20 hours per week, or it is not. Computer use averages 3.2 hours per
week, or it does not. We do not specify whether we believe the true mean to be
higher or lower than the hypothesized mean. We just believe it must be
different.
In a two-tailed test, you will reject the null hypothesis if your sample mean falls
in either tail of the distribution.
For this reason, the alpha level (let’s assume 0.05) is split across the two tails.
The curve shows the critical regions for a two-tailed test. These are the regions
under the normal curve that, together, sum to a probability of 0.05.
Each tail has a probability of 0.025.
The z-scores that designate the start of the critical region are called the critical
values.
If the sample mean taken from the population falls within these critical regions,
or "rejection regions," we would conclude that there was too much of a
difference and we would reject the null hypothesis.
However, if the mean from the sample falls in the middle of the distribution (in
between the critical regions) we would fail to reject the null hypothesis.
One-Tailed Hypothesis Test
We would use a single-tail hypothesis test when the direction of the results is
anticipated or we are only interested in one direction of the results.
When performing a single-tail hypothesis test, our alternative hypothesis looks
a bit different. We use the symbols of greater than or less than.
A single-tail hypothesis test also means that we have only one critical region
because we put the entire critical region into just one side of the distribution.
When the alternative hypothesis is that the sample mean is greater, the critical
region is on the right side of the distribution.
When the alternative hypothesis is that the sample mean is smaller, the critical
region is on the left side of the distribution.
Critical Values –
Level of significance      1%          5%          10%
Two-tailed test            ±2.576      ±1.96       ±1.645
Right-tailed test          +2.326      +1.645      +1.282
Left-tailed test           −2.326      −1.645      −1.282
For 5% level of significance in case of a two tailed test, the shaded area under
the curve on both tails is considered the critical region.
Since this is a two-tailed test, half of 5% i.e. 2.5% of the values would be in
the left tail, and the other 2.5% would be in the right tail.
Looking up the Z-score associated with 0.025 on a reference table, we find
1.96.
Therefore, +1.96 is the critical value of the right tail and -1.96 is the critical
value of the left tail.
The critical value for a 95% confidence level is Z = +/−1.96.
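These critical values can be reproduced from the standard normal quantile function; a minimal sketch (assuming SciPy is available):

```python
# Minimal sketch: reproducing the z critical values in the table above.
from scipy.stats import norm

for alpha in (0.01, 0.05, 0.10):
    two_tail = norm.ppf(1 - alpha / 2)   # two-tailed critical value (use +/-)
    one_tail = norm.ppf(1 - alpha)       # right-tailed critical value (negate for left tail)
    print(f"alpha = {alpha:.2f}: two-tailed ±{two_tail:.3f}, one-tailed {one_tail:.3f}")
```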
Type I and Type II Errors
When we decide to reject or not reject the null hypothesis, we have four possible
scenarios:
a. A true hypothesis is rejected. (Type I error)
b. A true hypothesis is not rejected. (correct decision)
c. A false hypothesis is not rejected. (Type II error)
d. A false hypothesis is rejected. (correct decision)
Rejecting a true null hypothesis (a) is a Type I error; failing to reject a false
null hypothesis (c) is a Type II error.
The probability of committing a Type I error is denoted by the symbol α (alpha),
which is typically set at a significance level, such as 0.05.
This means there’s a 5% risk of rejecting the null hypothesis when it should not
be rejected.
Understanding Type I errors is crucial in research, as they can lead to misleading
conclusions.
The probability of committing a Type II error is denoted by the symbol β (beta).
Committing type II error means failing to detect an effect or difference that truly
exists.
The power of a test is the probability of correctly rejecting a false null
hypothesis, which is calculated as 1−β. Higher power means a lower chance of
making a Type II error.
Minimizing Type II errors is crucial for ensuring that real differences or effects
are identified.
Selection of test statistic
Sample size     Population standard deviation known     Population standard deviation unknown
n ≥ 30          z-test                                   z-test
n < 30          z-test                                   t-test
When conducting a hypothesis test, we are asking ourselves whether the
information in the sample is consistent, or inconsistent, with the null
hypothesis about the population.
We follow a series of four basic steps:
1. State the null and alternative hypotheses.
2. Select the appropriate significance level and check the test assumptions.
3. Analyze the data and compute the test statistic.
4. Interpret the result
If we reject the null hypothesis, we are saying that the difference between the
observed sample mean and the hypothesized population mean is too great to be
attributed to chance.
When we fail to reject the null hypothesis, we are saying that the difference
between the observed sample mean and the hypothesized population mean is
small enough to be attributable to chance if the null hypothesis is true.
Example A
A researcher claims that black horses are, on average, more than 30 lbs heavier
than white horses, which average 1100 lbs. What is the null hypothesis, and what
kind of test is this?
Solution :
The null hypothesis would be notated H0 : µ ≤ 1130 lbs
This is a right-tailed test, since the tail of the graph would be on the right.
Example B
A package of gum claims that the flavor lasts more than 39 minutes. What would be
the null hypothesis of a test to determine the validity of the claim? What sort of
test is this?
Solution
The null hypothesis would be notated as H0 : µ ≤ 39.
This is a right-tailed test, since the rejection region would consist of values greater
than 39.
z test
A z-test is a statistical test used to determine whether there is a significant
difference between a sample statistic (such as a sample mean) and a population
parameter (such as a population mean) or between two sample statistics.
The test is based on the z-statistic, which measures how many standard
deviations a sample statistic is from the population parameter.
When to Use z-Test:
Large Sample Size (n ≥ 30): The sample size should be sufficiently large (at
least 30).
Known Population Standard Deviation (σ): The population standard deviation
is known.
Types of z-Tests:
1. One-Sample Z-Test
•Purpose: To test whether the mean of a single sample is significantly different
from a known population mean.
•When to Use: When you have one sample and want to compare its mean to a
known population mean, with a known population variance.
•Example: You want to determine if the average weight of apples from one city is
different from the national average weight.
Formula:
z = (x̄ − μ) / (σ / √n)
where x̄ is the sample mean, μ the hypothesized population mean, σ the known
population standard deviation, and n the sample size.
2. Two-Sample Z-Test
Purpose : To test if the means of two independent samples are significantly
different from each other
When to Use: When you want to compare the means of two groups, assuming
both groups have known population variances and are independent of each other.
Example: You want to compare the average heights of men and women in a
population to see if there is a significant difference.
Formula:
z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
where x̄₁ and x̄₂ are the two sample means, σ₁² and σ₂² the known population
variances, and n₁ and n₂ the sample sizes.
Example: One-Sample Z-Test
Suppose the average weight of apples in a population is μ=150 grams with a
standard deviation of σ=20 grams. You collect a sample of 40 apples and find the
sample mean weight is 155 grams. At a 5% significance level, does this sample
provide enough evidence to suggest that the average apple weight is different
from 150 grams?
Solution :
H₀: μ = 150   H₁: μ ≠ 150 (two-tailed test, α = 0.05, critical values ±1.96)
z = (155 − 150) / (20/√40) = 5 / 3.162 ≈ 1.58
Since |z| = 1.58 < 1.96, the test statistic does not fall in the critical region, so
we fail to reject the null hypothesis.
Conclusion: There is not enough evidence to suggest that the average apple weight
is different from 150 grams.
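A minimal Python check of this calculation, using the same numbers (assuming SciPy is available):

```python
# Minimal sketch: one-sample z-test for the apple-weight example.
from math import sqrt
from scipy.stats import norm

mu0, sigma = 150, 20      # hypothesized mean and known population standard deviation
n, x_bar = 40, 155        # sample size and sample mean

z = (x_bar - mu0) / (sigma / sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))        # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
# z ≈ 1.58 and p ≈ 0.11 > 0.05, so we fail to reject H0 at the 5% level.
```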
Example: Two-Sample Z-Test
Compare the average test scores of two classes to see if their average scores differ
significantly.
Solution :
Step 5: Conclusion
There is not enough evidence to suggest that the average test scores of Class A
and Class B are significantly different at the 5% significance level.
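The original class data are not reproduced on the slide, so the sketch below uses made-up summary statistics purely to illustrate the two-sample z-test formula:

```python
# Minimal sketch: two-sample z-test with hypothetical class statistics.
from math import sqrt
from scipy.stats import norm

x1, sigma1, n1 = 75.0, 10.0, 50   # Class A: sample mean, known population std dev, size (hypothetical)
x2, sigma2, n2 = 72.0, 12.0, 60   # Class B (hypothetical)

z = (x1 - x2) / sqrt(sigma1**2 / n1 + sigma2**2 / n2)
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
# With these made-up numbers |z| < 1.96, so H0 would not be rejected,
# matching the slide's conclusion of "not enough evidence".
```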
Example A
The school nurse thinks the average height of 7th graders has increased. The
average height of a 7th grader five years ago was 145 cm with a standard deviation
of 20 cm. She takes a random sample of 200 students and finds that the average
height of her sample is 147 cm. Are 7th graders now taller than they were before?
Conduct a single-tailed hypothesis test using a .05 significance level to evaluate the
null and alternative hypotheses.
Solution :
First, we develop our null and alternative hypotheses:
H₀: μ ≤ 145   H₁: μ > 145
Choose α = .05.
The critical value for this one-tailed test is z = 1.645; a z-score of 1.645 cuts off
5% in the single (right) tail.
Any test statistic greater than 1.645 will be in the rejection region.
Next, we calculate the test statistic for the sample:
z = (147 − 145) / (20/√200) = 2 / 1.414 ≈ 1.414
The calculated z-score of 1.414 is smaller than 1.645 and thus does not fall in the
critical region.
Our decision is to fail to reject the null hypothesis and conclude that a sample
mean of 147 is likely to have occurred by chance.
Example B
A farmer is trying out a planting technique that he hopes will increase the yield on
his pea plants. The average number of pods on one of his pea plants is 145 pods
with a standard deviation of 100 pods. This year, after trying his new planting
technique, he takes a random sample of 144 plants and finds the average number
of pods to be 147. He wonders whether or not this is a statistically significant
increase. What are his hypotheses and the test statistic?
Solution :
First, we develop our null and alternative hypotheses:
H₀: μ ≤ 145   H₁: μ > 145
The alternative hypothesis uses > since he believes that there might be a gain in
the number of pods.
Next, we calculate the test statistic for the sample of pea plants:
z = (147 − 145) / (100/√144) = 2 / 8.33 ≈ 0.24
If we choose α = 0.05, the critical value will be 1.645 for a one-tailed test.
We will reject the null hypothesis if the test statistic is greater than 1.645.
The test statistic of 0.24 is less than 1.645, so our decision is to fail to reject the
null hypothesis.
Based on our sample, we do not have evidence that the new planting technique has
increased the mean number of pods.
t test
A t-test is a statistical test used to compare the means of two groups or a
sample to a population when the population variance is unknown.
Unlike the Z-test, the t-test is more commonly used when the sample size is
small (n < 30) and the population standard deviation is not known.
Types of t-tests:
1. One-Sample t-Test:
Purpose: To test if the mean of a single sample is significantly different from a
known or hypothesized population mean.
When to Use: When you want to compare the mean of one sample to a known
population mean, but the population standard deviation is unknown.
Example: Testing whether the average weight of apples in a sample differs from a
hypothesized population mean of 150 grams.
Formula:
t = (x̄ − μ) / (s / √n),   with df = n − 1
where s is the sample standard deviation.
2. Independent Two-Sample t-Test:
•Purpose: To compare the means of two independent groups to determine if there
is a statistically significant difference between them.
•When to Use: When you have two independent samples (e.g., different groups of
people) and want to test whether their means are significantly different.
•Example: Comparing the average test scores of two different classes to see if there
is a significant difference in performance.
Formula (assuming equal variances):
t = (x̄₁ − x̄₂) / ( sₚ √(1/n₁ + 1/n₂) ),   where   sₚ² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)   and   df = n₁ + n₂ − 2
3. Paired t-Test:
•Purpose: To compare the means of two related groups, such as measurements
taken from the same group at different times.
•When to Use: When you have paired or matched samples, like before-and-after
measurements or the same participants undergoing two treatments.
•Example: Measuring the weight of individuals before and after a diet program to
test if there is a significant change in weight.
Formula:
t = d̄ / (s_d / √n),   with df = n − 1
where d̄ is the mean of the paired differences, s_d their standard deviation, and
n the number of pairs.
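A minimal sketch of a paired t-test in Python (assuming SciPy); the before/after weights below are hypothetical:

```python
# Minimal sketch: paired t-test on hypothetical before/after diet weights (kg).
from scipy.stats import ttest_rel

before = [82, 75, 90, 68, 77, 85, 92, 70]
after  = [79, 74, 86, 67, 75, 81, 88, 70]

t_stat, p_value = ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
# A p-value below 0.05 would suggest a significant change in weight.
```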
Back in the early 1900s, William Sealy Gosset, a chemist at a brewery in Ireland,
discovered that when he was working with very small samples, the distributions
of the mean differed significantly from the normal distribution.
He noticed that as his sample sizes changed, the shape of the distribution
changed as well.
He published his results under the pseudonym ‘Student’ and this concept and
the distributions for small sample sizes are now known as “Student’s
t−distributions.”
The differences between the t-distribution and the normal distribution are more
exaggerated when there are fewer data points, and therefore fewer degrees of
freedom.
Degrees of freedom are essentially the number of observations that are free to
vary without changing the sample mean.
If you were conducting a two-tailed hypothesis test on a sample of 25 students,
your df = 25-1 = 24
Example A
The high school athletic director is asked if football players are doing as well
academically as the other student athletes. We know from a previous study that the
average GPA for the student athletes is 3.10. After an initiative to help improve the
GPA of student athletes, the athletic director randomly samples 20 football players
and finds that the average GPA of the sample is 3.18 with a sample standard
deviation of 0.54. Is there a significant improvement? Use a 0.05 significance level.
Solution :
Step 1: Clearly state the null and alternative hypotheses.
H₀: μ = 3.10   H₁: μ ≠ 3.10
Step 2: Identify the appropriate significance level and confirm the test assumptions.
We were told that we should use a 0.05 significance level. The size of the sample
also helps here, as we have 20 players. So, we can conclude that the assumptions
for the single sample t-test have been met.
Step 3: Analyze the data
We use our t-test formula:
t = (3.18 − 3.10) / (0.54/√20) = 0.08 / 0.121 ≈ 0.66
We know that we have 20 observations, so our degrees of freedom for this test is 19.
Nineteen degrees of freedom at the 0.05 significance level gives us a critical value of
± 2.093.
Step 4: Interpret your results
Since our calculated t-test value is lower than our t-critical value, we fail to
reject the Null Hypothesis.
Therefore, the average GPA of the sample of football players is not significantly
different from the average GPA of student athletes.
Thus, the athletic director can conclude that the mean academic performance
of football players does not differ from the mean performance of other student
athletes.
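A minimal Python check of this example, working from the summary statistics above (assuming SciPy is available):

```python
# Minimal sketch: one-sample t-test from summary statistics (GPA example).
from math import sqrt
from scipy.stats import t as t_dist

mu0, x_bar, s, n = 3.10, 3.18, 0.54, 20

t_stat = (x_bar - mu0) / (s / sqrt(n))          # ≈ 0.66
t_crit = t_dist.ppf(1 - 0.05 / 2, df=n - 1)     # ≈ 2.093 for df = 19

print(f"t = {t_stat:.2f}, critical value = ±{t_crit:.3f}")
# |t| < 2.093, so we fail to reject H0, as in the worked solution.
```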
Example B
Duracell manufactures batteries that the CEO claims will last an average of 300
hours under normal use. A researcher randomly selected 20 batteries from the
production line and tested these batteries. The tested batteries had a mean life
span of 270 hours with a standard deviation of 50 hours. Do we have enough
evidence to suggest that the claim of an average lifetime of 300 hours is false?
Solution :
Step 1: Clearly state the Null and Alternative Hypothesis
H₀: μ = 300   H₁: μ ≠ 300
Step 2: Identify the appropriate significance level and confirm the test
assumptions.
We’ll use the standard significance level of 0.05, and we assume a normal
population distribution.
Step 3: Analyze the data and compute the test statistic
t = (270 − 300) / (50/√20) = −30 / 11.18 ≈ −2.68
We know that we have 20 batteries, so our degrees of freedom for this test is
(20 − 1) = 19.
Nineteen degrees of freedom at the 0.05 significance level gives us a critical
value of ± 2.093.
Step 4: Interpret your results
Since the absolute value of our calculated t statistic (2.68) is greater than our
t-critical value (2.093), it lies in the critical region; therefore, we reject the
Null Hypothesis.
The average battery life of the sample is significantly different from the average
battery life claimed by the CEO. Therefore, the claim of an average lifetime of 300
hours is not supported.
Example : Independent Two Sample Test
A researcher wants to determine if two different diets have different effects on
weight loss. The researcher takes a random sample of 10 people from each group:
Group 1 (Diet A): Their weight losses in pounds are: 5, 7, 6, 9, 8, 4, 7, 5, 6, 8
Group 2 (Diet B): Their weight losses in pounds are: 8, 10, 6, 9, 12, 11, 9, 10, 8, 11
The researcher wonders if there is a significant difference in the average weight loss
between Diet A and Diet B.
Solution :
Null hypothesis (H₀): There is no difference in the means of the two groups.
H0:μ1=μ2
Alternative hypothesis (H₁): There is a difference in the means of the two groups.
H1:μ1≠μ2
This is a two-tailed test since we are testing for any difference between the
means, not specifically an increase or decrease.
Sample means: x̄₁ = 6.5 (Diet A) and x̄₂ = 9.4 (Diet B); sample variances: s₁² = 2.5 and s₂² ≈ 3.16.
Pooled variance: sₚ² = (9 × 2.5 + 9 × 3.16) / 18 ≈ 2.83, so sₚ ≈ 1.68.
t = (6.5 − 9.4) / (1.68 × √(1/10 + 1/10)) ≈ −2.9 / 0.75 ≈ −3.86
With df = 10 + 10 − 2 = 18, the two-tailed critical value at α = 0.05 is 2.101.
Since |t| ≈ 3.86 is greater than the critical value of 2.101, we reject the null hypothesis.
There is significant evidence to suggest that the average weight loss between the two diets is
different.
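A minimal Python check of this example using SciPy's pooled (equal-variance) two-sample t-test on the data given above:

```python
# Minimal sketch: independent two-sample t-test for the diet example.
from scipy.stats import ttest_ind

diet_a = [5, 7, 6, 9, 8, 4, 7, 5, 6, 8]
diet_b = [8, 10, 6, 9, 12, 11, 9, 10, 8, 11]

t_stat, p_value = ttest_ind(diet_a, diet_b, equal_var=True)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
# |t| ≈ 3.86 exceeds the critical value 2.101 (df = 18), so H0 is rejected.
```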
Chi square test
The Chi-Squared test is a statistical method used to examine the relationships
between categorical variables.
It compares observed results with expected outcomes to determine whether
differences between these two are due to chance or if they signify a statistically
significant pattern.
There are two primary types of Chi-Squared tests:
1. Chi-Squared Test for Independence
•Purpose: Determines if there is a significant association between two categorical
variables.
•Example: You could test if there is an association between gender (male, female)
and voting preference (candidate A, candidate B).
Test statistic:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency and Eᵢ is the expected frequency.
The degrees of freedom (df) for this test are calculated as:
df = (r − 1)(c − 1)
where r is the number of rows and c is the number of columns in the contingency table.
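A minimal sketch of the test for independence in Python (assuming SciPy); the gender/voting counts below are hypothetical:

```python
# Minimal sketch: chi-squared test of independence on a hypothetical 2x2 table.
from scipy.stats import chi2_contingency

# Rows: male, female; columns: candidate A, candidate B (hypothetical counts)
observed = [[40, 60],
            [50, 50]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p-value = {p_value:.3f}")
# A p-value below 0.05 would indicate an association between gender and voting preference.
```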
2. Chi-Squared Goodness-of-Fit Test
Purpose: Tests whether a sample data matches a population with a specific
distribution.
Example: Testing if a die is fair by comparing the observed outcomes with the
expected frequencies (equal probability for each face).
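A minimal sketch of the die example in Python (assuming SciPy); the observed roll counts are made up:

```python
# Minimal sketch: chi-squared goodness-of-fit test for a fair die.
from scipy.stats import chisquare

observed = [18, 22, 16, 14, 19, 31]         # hypothetical counts for faces 1-6 (120 rolls)
expected = [sum(observed) / 6] * 6          # 20 per face if the die is fair

chi2, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")
# A small p-value (< 0.05) would suggest the die is not fair.
```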
Important things to note about the chi-square test:
At the 5% level of significance, if the calculated chi-square value is less than the
tabulated (critical) chi-square value, the null hypothesis is not rejected; if it
exceeds the tabulated value, the null hypothesis is rejected.
Fisher's Exact Test
Fisher's Exact Test is a statistical significance test used to determine if there are
nonrandom associations between two categorical variables in a contingency
table, typically 2x2.
It is particularly useful when sample sizes are small, and the assumptions of the
more common Chi-square test might not be valid.
Key Features:
Exact: Unlike the Chi-square test, which is an approximation, Fisher's Exact Test
calculates the exact probability of obtaining a given distribution of the data,
assuming the null hypothesis is true.
Small Sample Sizes: It's especially valuable when dealing with small sample sizes
(e.g., when an expected cell count is less than 5) because it doesn't rely on
large-sample approximations.
Example Use:
Suppose you are testing whether a new treatment works better than a standard
one:
Fisher's Exact Test can determine if the success rate of the new drug is
significantly different from that of the standard treatment, without needing large
sample sizes.
How It Works:
It computes the probability of the observed contingency table under the null
hypothesis by calculating hypergeometric probabilities for all possible tables with
the same marginal totals and compares them with the observed table.
When to Use:
When you have a small sample size (e.g., fewer than 5 expected observations in
some cell of the table).
When you need to evaluate the significance of associations between two
categorical variables.
Example:
Let’s say you have a table that compares the outcome of a new drug versus a
placebo for patient recovery:
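The original table is not reproduced here, so the sketch below uses hypothetical drug/placebo recovery counts to show how the test is run in Python (assuming SciPy):

```python
# Minimal sketch: Fisher's Exact Test on a hypothetical 2x2 recovery table.
from scipy.stats import fisher_exact

#            recovered  not recovered
table = [[8, 2],    # new drug (hypothetical counts)
         [3, 7]]    # placebo

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
# A p-value below 0.05 would indicate a significant association between treatment and recovery.
```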
Thank You…