Module 5

University of Mumbai

Program – Bachelor of Engineering in Computer Science and Engineering (Artificial Intelligence and Machine Learning)

Class – T.E.
Course Code – CSDLO5011
Course Name – Statistics for Artificial Intelligence & Data Science

By
Prof. A. V. Phanse
The Analysis of Variance
 ANOVA (Analysis of Variance) is a statistical method used to compare the
means of three or more groups to determine if at least one of the group means
is significantly different from the others.

 It is commonly used in experiments where researchers are interested in
understanding how different treatments, conditions, or factors influence a
dependent variable.

Why Use ANOVA?

 When comparing the means of two groups, a t-test is appropriate.


 When comparing three or more groups, performing multiple t-tests increases
the chance of Type I errors (false positives).
 ANOVA solves this by using a single test to compare all groups at once.
Types of ANOVA:

One-way ANOVA: Compares means across groups defined by a single factor
(independent variable).
Example: Comparing the average test scores of students taught by three different
teaching methods.

Two-way ANOVA: Examines the influence of two different factors on the
dependent variable, and also tests for interaction between the two factors.
Example: Studying the effect of both teaching method (factor 1) and student
gender (factor 2) on test scores.

Repeated Measures ANOVA: Used when the same subjects are measured under
different conditions or at different time points.
Example: Studying the effect of three different teaching methods on students'
performance, where the same group of students takes three tests, each
corresponding to one teaching method.
[Figure: repeated measures design — a population of people with high blood
pressure is measured before medication and after medication, and the difference
in measurement is analyzed.]
Assumptions in ANOVA:
 Independence: observations are independent within and across groups.
 Normality: the data in each group are approximately normally distributed.
 Homogeneity of variance: all groups have approximately equal variances.
Components of ANOVA:
ANOVA partitions the total variability in the data into two types:
Between-group variability: The variation caused by differences between the
group means.
Within-group variability: The variation due to differences within each group
(random variation or noise).
F-Statistic:

 The F-statistic compares the variance between groups to the variance within
groups:

F = (between-group variability) / (within-group variability) = MSB / MSW

 If the between-group variability is significantly larger than the within-group
variability, it suggests that the group means are not all the same.
Consider data on smartphone usage (in hours) for 3 groups.

Null Hypothesis : There is no difference in the average smartphone usage of the 3 groups.

Alternate Hypothesis : At least one group's average smartphone usage differs from the others.
 For 2 degrees of freedom between the groups and 21 degrees of freedom
within the groups, Ftab = 3.47

 Fcal is compared with 3.47 (Ftab), and then the decision is taken whether or
not to reject the null hypothesis.

 As Fcal (35.39) is greater than 3.47 (Ftab), the null hypothesis is rejected.
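The calculation in this example can be sketched in plain Python. The slide's smartphone-usage table is an image, so the numbers below are hypothetical and the resulting F differs from the slide's 35.39 — only the method is the same.

```python
# One-way ANOVA by hand (hypothetical data; the slide's table is an image).

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of independent samples."""
    k = len(groups)                        # number of groups
    n_total = sum(len(g) for g in groups)  # total observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: variation of group means around grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: variation inside each group
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

    df_between, df_within = k - 1, n_total - k
    f = (ssb / df_between) / (ssw / df_within)
    return f, df_between, df_within

usage = [  # hours of smartphone usage per day, 3 hypothetical groups
    [2.1, 2.5, 1.8, 2.3, 2.0],
    [3.9, 4.2, 3.5, 4.0, 3.8],
    [5.1, 4.8, 5.5, 5.0, 5.2],
]
f, dfb, dfw = one_way_anova(usage)
print(round(f, 2), dfb, dfw)
```

With these made-up groups the between-group variability dominates, so F is large and the null hypothesis would be rejected at the usual critical values.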
Practice Problem from University Exam
Problem of multiple comparisons

 The problem of multiple comparisons arises when a statistical analysis involves
testing several hypotheses simultaneously.
 As the number of comparisons increases, so does the likelihood of obtaining a
significant result purely by chance.
 This increases the risk of Type I errors (false positives), where we incorrectly
reject a true null hypothesis.

Example of the Problem:

 Suppose you are comparing the effectiveness of four different medications on
lowering blood pressure.
 You could perform t-tests to compare each pair of medications (A vs B, A vs C,
etc.), resulting in a total of six pairwise comparisons.
 If each test is performed at a significance level of 0.05, there's a 5% chance of
incorrectly finding a significant difference for each test.
 Across six tests, the overall probability of making at least one Type I error is
greater than 5%.
Family-Wise Error Rate (FWER)

 The family-wise error rate (FWER) is the probability of making at least one
Type I error across all the tests in a family of comparisons.

 The problem of multiple comparisons is about controlling the FWER so that the
chance of making false discoveries does not increase with the number of tests.

 If we perform m independent tests, each at a significance level of α (e.g. 0.05),
the probability of making at least one Type I error can be calculated as:

FWER = 1 − (1 − α)^m

For m = 6 comparisons at α = 0.05, the probability of making at least one Type I
error is:

FWER = 1 − (1 − 0.05)^6 = 1 − (0.95)^6 ≈ 0.265

This means that there's a 26.5% chance of making a false discovery with six
comparisons, which is much higher than the desired 5%.
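The FWER calculation above can be checked with a quick sketch (the formula assumes the m tests are independent, which real pairwise comparisons usually are not exactly):

```python
# Family-wise error rate for m independent tests at level alpha.
alpha = 0.05
m = 6  # pairwise comparisons among 4 medications: C(4, 2) = 6
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 3))  # ≈ 0.265
```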
Solutions to the Multiple Comparisons Problem:
Bonferroni Correction
 One of the simplest and most widely used methods is the Bonferroni
correction. It controls the FWER by adjusting the significance level.
 The Bonferroni correction divides the original significance level (α) by the
number of comparisons (m). The new significance level for each test is:

α_adjusted = α / m

So, if you are performing 6 tests and want to maintain an overall significance level
of 0.05, the adjusted level for each test is:

α_adjusted = 0.05 / 6 ≈ 0.0083

This means that for each individual comparison, you would reject the null
hypothesis only if the p-value is less than 0.0083.
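A minimal sketch of applying the Bonferroni rule; the p-values below are made-up for illustration:

```python
# Bonferroni correction: test each comparison at alpha / m.
alpha, m = 0.05, 6
alpha_adj = alpha / m  # ≈ 0.0083

p_values = [0.001, 0.020, 0.004, 0.300, 0.049, 0.012]  # hypothetical
rejected = [p < alpha_adj for p in p_values]
print(round(alpha_adj, 4), rejected)
```

Note that 0.020 and 0.049 would have been "significant" at the raw 0.05 level but survive neither the adjusted threshold — exactly the false discoveries Bonferroni guards against.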
A Nonparametric Method—The Kruskal-Wallis Test

 The Kruskal-Wallis test is a nonparametric alternative to the one-way ANOVA.

 It is used to determine if there are significant differences between the medians
of three or more independent groups.

 Since it is nonparametric, it does not require the assumption of normality and
can be used with ordinal or continuous data that do not meet the normality
assumption required by ANOVA.

When to Use the Kruskal-Wallis Test:

 When the assumptions of ANOVA (e.g., normal distribution, homogeneity of
variance) are not met.

 When you have three or more independent groups.

 When the data is ordinal, not normally distributed, or has outliers.


Hypotheses:

Null Hypothesis (H₀): The medians of all groups are equal (there is no difference
between the groups).

Alternative Hypothesis (H₁): At least one group has a different median.


How the Kruskal-Wallis Test Works:

1. Rank the Data: The test works by converting the data into ranks across all
groups. Each observation is assigned a rank, with the smallest value receiving
rank 1, the next smallest rank 2, and so on.

2. Sum of Ranks for Each Group: The ranks are then summed within each group,
and the test statistic is based on comparing these rank sums between groups.

3. H-Statistic: The Kruskal-Wallis test produces an H-statistic, which is a function
of the sum of ranks, group sizes, and total sample size. Under the null
hypothesis, H approximately follows a chi-square distribution.

The formula for the H-statistic is:

H = [12 / (N(N + 1))] × Σ (Rᵢ² / nᵢ) − 3(N + 1)

where N is the total sample size, Rᵢ is the sum of ranks in group i, and nᵢ is the
size of group i, summed over the g groups.

The H-statistic is compared to a chi-square critical value with (g − 1) degrees of freedom.
Example : With the following data on content (in ml) of potassium per bottle in
brands of a medicine, determine if there is a significant difference in the potassium
content between brands.

Solution :

Rank the Data: The test works by converting the data into ranks across all groups.

Brand  Content  Rank
A      4.7      2
A      3.2      1
A      5.1      4
A      5.2      5
A      5.0      3
B      5.3      6
B      6.4      9
B      7.3      14
B      6.8      11
B      7.2      13
C      6.3      8
C      8.2      15
C      6.2      7
C      7.1      12
C      6.6      10

Rank sums: R1 = 15 (Brand A), R2 = 53 (Brand B), R3 = 52 (Brand C).
Finally note that n1 = n2 = n3 = 5 and N = 15.
For a 5% level of significance and 2 (g − 1) degrees of freedom, the chi-square
critical value is 5.991.

The H-statistic is:

H = [12 / (15 × 16)] × (15²/5 + 53²/5 + 52²/5) − 3 × 16 = 57.38 − 48 = 9.38

Conclusion :
As the calculated H statistic value is greater than the chi square critical value, the
null hypothesis will be rejected.

Interpretation :

The potassium content of at least one of the brands is different. Since R1 is far less
than the rank sums of the other two brands, we know that Brand A is different
before we do any kind of post hoc testing.
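The worked example above can be reproduced in plain Python (this data set has no ties, so simple integer ranks suffice; a full implementation would average tied ranks):

```python
# Kruskal-Wallis H for the potassium-content example (no ties in this data).

def kruskal_wallis_h(groups):
    """Return (H, rank sums) for a list of independent samples without ties."""
    pooled = sorted(x for g in groups for x in g)
    rank = {x: i + 1 for i, x in enumerate(pooled)}  # rank 1 = smallest value
    n_total = len(pooled)
    rank_sums = [sum(rank[x] for x in g) for g in groups]
    h = 12 / (n_total * (n_total + 1)) * sum(
        r ** 2 / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    return h, rank_sums

brand_a = [4.7, 3.2, 5.1, 5.2, 5.0]
brand_b = [5.3, 6.4, 7.3, 6.8, 7.2]
brand_c = [6.3, 8.2, 6.2, 7.1, 6.6]
h, rank_sums = kruskal_wallis_h([brand_a, brand_b, brand_c])
print(round(h, 2), rank_sums)  # 9.38 [15, 53, 52]
```

Since 9.38 exceeds the chi-square critical value 5.991 (df = 2, α = 0.05), this reproduces the slide's rejection of the null hypothesis.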
Example :
A researcher wants to test whether three different teaching methods lead to
different exam scores. The exam scores (out of 100) for students under each
method are as follows:
Method A: 70, 62, 78, 65, 80
Method B: 68, 75, 60, 85, 72
Method C: 90, 85, 88, 92, 84

Solution :
State the Hypotheses:

Null Hypothesis (H₀): The distributions of exam scores for all three methods are
the same.

Alternative Hypothesis (H₁): At least one method has a different distribution of
exam scores.
The degrees of freedom (df) for the Kruskal-Wallis test is the number of groups
minus one (g - 1). Here, we have 3 groups, so:
Using a chi-square distribution table and a significance level (α) of 0.05, the critical
value for χ² with 2 degrees of freedom is 5.991.

 If the test statistic H = 8.35 is greater than the critical value of 5.991, we
reject the null hypothesis.
 Since 8.35 > 5.991, we reject the null
hypothesis.
 This suggests that at least one of the
teaching methods leads to a
significantly different distribution of
exam scores.
Practice Problem from University Exam
Two way ANOVA

 A Two-Way ANOVA is a statistical method used to examine the effect of two
independent factors on a dependent variable.
 It also tests for interactions between these two factors. This type of ANOVA is
useful when the data is categorized by two factors, and you want to see how
these factors, individually and in combination, affect the outcome.

Collection of Data

Two-way ANOVA answers the following three questions:
1. Does factor 1 have a significant effect on the dependent variable?
2. Does factor 2 have a significant effect on the dependent variable?
3. Is there a significant interaction between the two factors?

Using the F distribution table, the critical values of F are found and then compared
with the calculated values of F.
A Nonparametric Method—Friedman‘s Test
 Friedman's Test is a nonparametric statistical test used to detect differences in
treatments across multiple test attempts.
 It is often used when the assumptions of parametric tests, such as normality
and homogeneity of variance, are not met.
 This test is particularly useful for analyzing randomized block designs with
repeated measures or matched samples.
Steps to Conduct Friedman's Test

Data Arrangement: Organize the data in a matrix format where rows represent
blocks (subjects) and columns represent treatments.

Rank the Data: Within each block (row), rank the treatment responses. If there are
ties, assign the average rank to tied values.

Calculate Test Statistic: For each treatment (column), sum the ranks across blocks.
Calculate the Friedman statistic Q using the formula:

Q = [12 / (n k(k + 1))] × Σ Rⱼ² − 3n(k + 1)

where n is the number of blocks, k is the number of treatments, and Rⱼ is the sum
of ranks for treatment j.
Determine the Degrees of Freedom: The degrees of freedom for Friedman's test
is k−1, where k is the number of treatments.
Critical Value:

Compare the calculated Q statistic to the critical value from the Chi-squared
distribution table with k−1 degrees of freedom at the desired significance level
(e.g., α=0.05).

Make a Decision:

 If Q exceeds the critical value, reject the null hypothesis, indicating that there
are significant differences among the treatments.

 If Q does not exceed the critical value, fail to reject the null hypothesis.
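The steps above can be sketched as follows. The slide's call-time table is an image, so the block data here are made-up; this simple sketch also does not average tied ranks.

```python
# Friedman's Q statistic (hypothetical data; ties not handled in this sketch).

def friedman_q(data):
    """Return Friedman's Q for rows = blocks (subjects), columns = treatments."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        # Rank the treatment responses within each block (1 = smallest).
        order = sorted(range(k), key=lambda j: row[j])
        for pos, j in enumerate(order):
            rank_sums[j] += pos + 1
    return 12.0 * sum(r ** 2 for r in rank_sums) / (n * k * (k + 1)) - 3 * n * (k + 1)

# 3 blocks, 3 treatments; treatment 3 is consistently ranked highest.
data = [[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]]
q = friedman_q(data)
print(q)  # ≈ 6.0, which exceeds the chi-square critical value 5.991 at df = 2
```

Because every block ranks the treatments the same way, the rank sums are as unequal as possible and Q just clears the 5.991 critical value, so the null hypothesis would be rejected for this toy data.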
Consider the call time of 7 people at 3 different time zones.

As the calculated Q (2.57) is less than the chi-square critical value (5.991), the
null hypothesis will not be rejected.
Thank You…
