STAT-7213
BIO-STATISTICS
M.PHIL STATISTICS
YEAR-1 (SEMESTER-II)
Submitted to:
Dr. Jamal Abdul Nasir
PRESENTATION
PRESENTED BY
MUBEEN ASGHAR (0557)
LAIBA SUBHANI (0559)
NOOR-E-AMNA (0561)
RABIA SAIF (0563)
3
TOPIC
ANALYSIS OF CATEGORICAL DATA
CATEGORICAL DATA
Categorical data are data that classify an observation as belonging
to one or more categories. For example, an item might be judged as
good or bad, or a response to a survey might include categories
such as agree, disagree, or no opinion.
5
ANALYSIS OF CATEGORICAL DATA
Categorical data analysis is the analysis of data where the
response variable has been grouped into a set of mutually exclusive
ordered (such as age group) or unordered (such as eye color)
categories.
6
BASIC ANALYSIS
The Goodness-of-Fit Test
The 2 by 2 Contingency Table
The r by c Contingency Table
Multiple 2 by 2 Contingency Tables
7
GOODNESS-OF-FIT TEST
The goodness-of-fit test is a statistical hypothesis test of how
well sample data fit a hypothesized distribution, for example a
population with a normal distribution.
8
USES OF GOF TEST
• Goodness-of-fit tests are statistical methods often used to make
inferences about observed values.
• These tests determine how related actual values are to the predicted
values in a model, and when used in decision-making, goodness-of-
fit tests can help predict future trends and patterns.
• Goodness-of-fit tests are commonly used to test for the normality of
residuals or to determine whether two samples are gathered from
identical distributions.
9
METHODS OF GOODNESS-OF-FIT TEST
There are multiple methods for determining goodness-of-fit. Some
of the most popular methods used in statistics include the
• Chi-square
• The Kolmogorov-Smirnov test
• The Anderson-Darling test
• The Shapiro-Wilk test
10
KEY POINTS
• Goodness-of-fit tests are statistical tests aiming to determine whether a set of
observed values match those expected under the applicable model.
• There are multiple types of goodness-of-fit tests, but the most common is the
chi-square test.
• The chi-square test determines whether a relationship exists between categorical variables.
• The Kolmogorov-Smirnov test, used for large samples, determines whether a
sample comes from a specific distribution of a population.
• Goodness-of-fit tests can show you whether your sample data fit an expected set
of data, for example from a population with a normal distribution.
11
CHI-SQUARE TEST
The chi-square independence test is a procedure for
testing whether two categorical variables are related in some
population.
12
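TEST STATISTIC
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where O is an observed frequency and E is an estimated expected frequency,
$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$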
13
DEGREES OF FREEDOM
The degrees of freedom is a number that determines the exact
shape of the chi-square distribution. The figure
illustrates this point.
For a table with r rows and c columns, the degrees of freedom (df) are calculated as
df = (r-1)*(c-1)
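As a minimal sketch of the statistic and degrees of freedom described above, the following Python computes the Pearson chi-square for a 2 by 2 table. The function name `chi_square_2x2` and the counts are hypothetical, chosen only for illustration; the p-value uses the fact that, for df = 1, the upper tail of the chi-square distribution equals `erfc(sqrt(x/2))`.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square, df, and p-value for the 2x2 table [[a, b], [c, d]]."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected frequency: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    df = (2 - 1) * (2 - 1)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, df, p_value

# Hypothetical counts, not taken from the presentation
stat, df, p_value = chi_square_2x2(30, 10, 20, 40)
print(f"chi-square = {stat:.4f}, df = {df}, p = {p_value:.6f}")
```

A large statistic relative to the chi-square distribution with the stated df points towards an association between the row and column variables.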
14
PROCEDURE
1. State Null and Alternative Hypothesis.
2. Level of Significance.
3. Test Statistic.
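$\chi^2 = \sum \frac{(O - E)^2}{E}$
4. Computation.
5. Critical Region:
reject $H_0$ if $\chi^2_{cal} > \chi^2_{\alpha,\,df}$
6. Conclusion.
If $\chi^2_{cal}$ falls in the critical region, reject $H_0$; otherwise, fail to reject $H_0$.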
15
GOVERNMENT COLLEGE UNIVERSITY
EXAMPLE
Popularity of psychology professors: given the number of students enrolled with
each professor in a college, test at the 0.05 significance level whether
students enroll at random.
16
SOLUTION
• State Null and Alternative Hypothesis.
• Level of Significance.
• Test Statistic.
$\chi^2 = \sum \frac{(O - E)^2}{E}$
17
PROCEDURE
• Computation.
• Critical Region:
• Conclusion.
As the calculated $\chi^2$ falls in the critical region, we reject $H_0$ and conclude that students do not enroll at random.
18
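The six-step procedure can be sketched in Python as follows. The enrolment counts are hypothetical (the slide's data table is not reproduced here), the null hypothesis is assumed to be equal enrolment across professors, and `CRITICAL_05` holds standard upper 5% points of the chi-square distribution.

```python
# Upper 5% critical values of the chi-square distribution, keyed by df
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def goodness_of_fit(observed, expected_props=None):
    """Chi-square goodness-of-fit statistic and df for a list of category counts."""
    n = sum(observed)
    k = len(observed)
    if expected_props is None:
        # H0: every category is equally likely (random enrolment)
        expected = [n / k] * k
    else:
        expected = [n * p for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, k - 1

# Hypothetical number of students enrolled with each of five professors
enrolled = [30, 25, 20, 15, 10]
stat, df = goodness_of_fit(enrolled)
reject = stat > CRITICAL_05[df]
print(f"chi-square = {stat:.3f}, df = {df}, critical value = {CRITICAL_05[df]}")
print("Reject H0" if reject else "Fail to reject H0")
```

With these made-up counts the statistic exceeds the critical value, which mirrors the form of the conclusion reached on the slide.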
CONTINGENCY TABLE
A contingency table (also known as a cross tabulation or crosstab)
is a type of table in a matrix format that displays the
(multivariate) frequency distribution of the variables.
19
TYPES OF CONTINGENCY TABLE
• 2×2 Contingency table
• r × c Contingency table
• Multiple 2×2 Contingency table
20
2×2 CONTINGENCY TABLE
The two by two or fourfold contingency
table represents two classifications of a set of counts or frequencies.
The rows represent two classifications of one variable (e.g.,
outcome positive/outcome negative) and the columns
represent two classifications of another variable (e.g.,
intervention/no intervention).
21
TEST STATISTIC
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where, for r rows and c columns of n observations, O is an observed frequency and E
is an estimated expected frequency,
$E = \frac{\text{row total} \times \text{column total}}{n}$
22
FISHER EXACT TEST
Fisher's exact test is a statistical significance test used in the
analysis of contingency tables.
Although in practice it is employed when sample sizes are small, it is valid for all sample sizes.
23
CRITERIA FOR FISHER EXACT TEST
Both variables are dichotomous qualitative (2 cross 2 table).
When the overall total of the table (sample size) is small (about 30 or fewer).
When any one expected cell value is less than 5.
24
ASSUMPTIONS
The data consist of A sample observations from population 1 and
B sample observations from population 2.
The samples are random and independent.
Each observation can be categorized as one of two mutually
exclusive types.
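A minimal sketch of Fisher's exact test for a 2 by 2 table is given below. It assumes the common two-sided definition (sum the hypergeometric probabilities of all tables, with the same margins, that are no more probable than the observed one); the function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the (1,1) cell equals x given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # A small relative tolerance keeps tables with tied probabilities in the sum
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(total, 1.0)

# Hypothetical small-sample table
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"two-sided p-value = {p_value:.4f}")
```

Because the p-value is computed from exact probabilities rather than a chi-square approximation, the small expected cell counts are not a concern here.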
25
26
27
ODDS RATIO
28
Difference Between ODDS AND ODDS RATIO
ODDS
DEF: The odds for success are the ratio of the probability of success to the
probability of failure.
The odds of being a case (having disease) to being a control (not having
disease) among subjects with the risk factor is [a/(a+b)]/[b/(a+b)] = a/b
The odds of being a case (having disease) to being a control (not having
disease) among subjects without the risk factor is [c/(c+d)]/[d/(c+d)] = c/d
ODDS RATIO
The odds ratio is the measure that we may compute from the data of a retrospective study.
We use the symbol OR to indicate that the measure is computed from sample data and
used as an estimate of the population odds ratio: OR = (a/b)/(c/d) = ad/bc
29
PROPERTIES
Equal to any non-negative number
The odds of success are higher in row 1 as compared to row 2 when OR>1
When one cell has zero probability, OR equals 0 or ∞
30
INTERPRETATION
A value of 1 indicates no association between the risk factor and disease status.
A value less than 1 indicates reduced odds of the disease among subjects with
the risk factor.
A value greater than 1 indicates increased odds of having the disease among
subjects in whom the risk factor is present
31
EXAMPLE
To compute the odds of receiving a death penalty for each group:
32
The odds of a death sentence if the defendant was black = 28/45 = 0.6222
The odds of a death sentence if the defendant was non-black = 22/52 = 0.4231
The impact of being black on receiving the death penalty is measured by the odds ratio:
OR = 0.6222/0.4231 = 1.47
INTERPRET
The odds of a death sentence are 47% higher for blacks as compared to
non-blacks.
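The arithmetic of this example can be checked with a few lines of Python. The counts 28, 45, 22 and 52 come from the example; the function name `odds_ratio` and the variable names are illustrative only.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a/b) / (c/d) for the 2x2 table [[a, b], [c, d]]."""
    return (a / b) / (c / d)

# Death sentence vs. no death sentence, by race of defendant (counts from the example)
odds_black = 28 / 45
odds_nonblack = 22 / 52
ratio = odds_ratio(28, 45, 22, 52)

print(f"odds (black)     = {odds_black:.4f}")
print(f"odds (non-black) = {odds_nonblack:.4f}")
print(f"odds ratio       = {ratio:.2f}")
print(f"percent increase in odds = {(ratio - 1) * 100:.0f}%")
```

The same value is obtained from the cross-product form ad/bc, here (28 × 52)/(45 × 22).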
33
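YATES' METHOD
Cochran suggests that the chi-square test should not be used if n is small and
an expected frequency is less than 5.
Yates (1934) proposed a continuity correction for the 2×2 table, that is,
$\chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E}$
OR
$\chi^2 = \frac{n\,(|ad - bc| - n/2)^2}{(a+b)(c+d)(a+c)(b+d)}$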
34
CRITERIA FOR YATES' CORRECTION
Both variables are dichotomous qualitative (2 cross 2 table).
When the overall total of the table (sample size) is small (about 30 or fewer).
When any one expected cell value is less than 5.
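To illustrate the effect of Yates' correction, the sketch below computes the 2 by 2 chi-square statistic with and without the correction, using the shortcut form n(|ad - bc| - n/2)^2 / [(a+b)(c+d)(a+c)(b+d)]. The table is hypothetical and the function name is an assumption made for this example.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for the 2x2 table [[a, b], [c, d]], optionally Yates-corrected."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: shrink |ad - bc| by n/2, never below zero
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical small table
uncorrected = chi_square_2x2(8, 4, 3, 9)
corrected = chi_square_2x2(8, 4, 3, 9, yates=True)
print(f"uncorrected chi-square     = {uncorrected:.4f}")
print(f"Yates-corrected chi-square = {corrected:.4f}")
```

The corrected statistic is never larger than the uncorrected one, so the correction makes the test more conservative.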
35
36
37
38
MATCHED-PAIR STUDIES
A matched pairs design is an experimental design that is used when
an experiment only has two treatment conditions. The subjects in
the experiment are grouped together into pairs based on some
variable they “match” on, such as age or gender. Then, within each
pair, subjects are randomly assigned to different treatments.
39
EXAMPLE
Pairs with the same exposure status for both case and control (the diagonal cells)
are called concordant pairs (c1 and c2), and pairs with different exposures (the
off-diagonal cells) are called discordant pairs (d1 and d2).
40
EXAMPLE
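Let $\pi$ be the probability that a discordant pair has an exposed case. Then, from the
preceding table, $\pi$ can be estimated by the following proportion,
$\hat{\pi} = \frac{d_1}{d_1 + d_2}$
where $d_1$ is taken to be the number of discordant pairs with an exposed case.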
41
HYPOTHESIS
Under the null hypothesis of no association between the risk factor and the
disease, each discordant pair is just as likely to have a case exposed as to have a
control exposed. Thus, the null hypothesis can be written as
$H_0: \pi = 0.5$, where $\pi$ is the probability that a discordant pair has an exposed case.
42
APPROXIMATION
For large samples, we can use the normal approximation.
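One way to sketch the normal approximation for matched pairs is shown below. It assumes that d1 counts the discordant pairs with an exposed case and d2 those with an exposed control, tests the hypothesis that the probability of an exposed case among discordant pairs is 0.5, and applies no continuity correction. The counts are hypothetical.

```python
import math

def matched_pairs_z(d1, d2):
    """Normal-approximation test of H0: P(exposed case | discordant pair) = 0.5."""
    m = d1 + d2                      # number of discordant pairs
    p_hat = d1 / m                   # estimated proportion with an exposed case
    # Under H0 the standard error of p_hat is sqrt(0.5 * 0.5 / m)
    z = (p_hat - 0.5) / math.sqrt(0.25 / m)
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_hat, z, p_value

# Hypothetical discordant-pair counts
p_hat, z, p_value = matched_pairs_z(25, 10)
print(f"p_hat = {p_hat:.4f}, z = {z:.4f}, two-sided p = {p_value:.4f}")
```

Algebraically the z statistic reduces to (d1 - d2) / sqrt(d1 + d2), so only the discordant pairs contribute to the test.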
43
44
45
R × C CONTINGENCY TABLE
We now consider the more general situation where two
classification variables have more than two categories. First, we
consider the situation where both variables are nominal followed by
the situation when one of the variables is ordinal.
46
R × C CONTINGENCY TABLE
Testing Hypothesis of No Association
The same ideas used in the 2 by 2 table still apply to the r by c
contingency table. If there is no association between a row variable
and a column variable, the ratio of the expected cell frequency in
the ith row and jth column, mij, to the ith row total, ni⋅, should
equal the ratio of the jth column total, n⋅j, to the overall total.
47
R × C CONTINGENCY TABLE
There are (r - 1)(c - 1) degrees of freedom for the r by c table because once we
know the frequencies of any (r - 1)(c - 1) cells, we can find the values of the
other frequencies by subtraction from the row and column totals. The hypothesis
of no association between the row and column variables is tested using the
chi-square goodness-of-fit statistic. Most statisticians perform no adjustment to the
test statistic when used with tables other than the 2 by 2 table. If the test statistic
is greater than the critical value of the chi-square distribution with (r - 1)(c - 1)
degrees of freedom, we reject the hypothesis of no association in favor
of the alternative that the row and column variables are related. If the test statistic
is less than the critical value, we fail to reject the null hypothesis.
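The expected frequencies under no association follow from the ratio argument above: m_ij / n_i. = n_.j / n, so m_ij = n_i. × n_.j / n. The sketch below applies this to a hypothetical 2 by 3 table and compares the statistic with a 5% critical value; the function names and the small lookup of critical values are assumptions made for illustration.

```python
# Upper 5% critical values of the chi-square distribution, keyed by df
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 6: 12.592}

def expected_counts(table):
    """Expected cell frequencies m_ij = (row total i) * (column total j) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_rxc(table):
    """Pearson chi-square statistic and df for an r x c table of counts."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical 2 x 3 table of counts
table = [[20, 30, 50], [30, 30, 40]]
stat, df = chi_square_rxc(table)
reject = stat > CRITICAL_05[df]
print(f"chi-square = {stat:.4f}, df = {df}, critical value = {CRITICAL_05[df]}")
print("Reject H0" if reject else "Fail to reject H0")
```

No continuity adjustment is applied, in line with the usual practice for tables larger than 2 by 2.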
48
49
50
MULTIPLE 2×2 CONTINGENCY
TABLE
Here we focus on the relationship between 2 factors in the
presence of a third factor. So far, we have examined the relationship between 2
categorical variables (factors).
51
EXAMPLE
For example, we might be interested in the relationship between
smoking and lung cancer, and how this relationship may change
with gender (a third factor). We observe that the apparent
(combined) relationship between 2 factors may switch or change
its direction and magnitude depending on the third factor.
52
EXAMINE THE RELATIONSHIP
We will test for such a dependency, and, if we don’t
seem to find one, we will analyze the aggregated data; if we do find
such a dependency, then it is appropriate to examine the relationship
of the 2 factors of interest separately for each of the levels of the
third factor (don’t aggregate).
We will focus on 2 factors each with 2 levels, including a third
factor with possibly several (g) levels; thus, we will be working
with multiple 2x2 contingency tables.
53
A study is conducted to determine whether there is any association between the occurrence of upper respiratory infections (URI) in young children and outdoor
air pollution. Several other variables (e.g., dust, traffic, smoke) could affect the relationship between the occurrence of infections and outdoor air pollution.
Hypothetical data for this situation are based on an article by Jaakkola et al. (1991) and are shown in the table.
54
EXAMPLE
55
EXAMPLE
56
EXAMPLE
57
SOLUTION
One way of taking the passive smoke variable into account is to analyze each 2 by 2 table
separately. We then have two tables: one for homes in which someone smoked and one for homes in which no one smoked.
Table 1.
Passive smoke   City       URI    URI    Total
in the home     polluted   some   none
Yes             High       100    20     120
Yes             Low        124    40     164
Total                      224    60     284
58
SOLUTION
Calculations:
By using the chi-square and odds ratio formulas, the Yates-corrected chi-square statistic ($X^2_{YC}$) is 2.039
and its p-value is 0.1533 for homes in which someone smoked. The odds ratio for these data
is 1.613. The 95 percent confidence interval for the odds ratio is from 0.887 to 2.933.
Table 2.
Passive smoke   City       URI    URI    Total
in the home     polluted   some   none
No              High       128    62     190
No              Low        166    119    285
Total                      294    181    475
59
SOLUTION
Calculations
The Yates-corrected chi-square ($X^2_{YC}$) value is 3.645, and its p-value is 0.0562 for those without passive smoke
in the home. The odds ratio for these data is 1.480. The 95 percent confidence interval for
the odds ratio is from 1.007 to 2.171.
Interpretation
The first confidence interval, a much wider interval than the second interval, includes the
value of one, which suggests that there is no relation between the two variables. The second
interval barely misses including one. The second interval's smaller size reflects the larger
sample size associated with the homes in which there was no passive smoke. Neither of these
tables has a statistically significant association between the outdoor air pollution and the
occurrence of URI at the 0.05 level based on the test statistics. The conclusion from the
analyses of the separate tables is different from that of the combined table.
A problem with the use of the separate tables is that the analyses are based on the smaller
sample sizes associated with each sub-table, not on the sample size of the combined table.
This makes it difficult to find the presence of small but consistent trends across tables.
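The figures quoted for Table 1 and Table 2 can be approximately reproduced with the sketch below. It assumes the chi-square statistic is Yates-corrected and that the confidence interval is the log-based (Woolf) interval, exp(ln OR ± 1.96 × sqrt(1/a + 1/b + 1/c + 1/d)); the slides do not state which interval method was used, so the limits may differ slightly in the third decimal place.

```python
import math

def yates_chi_square(a, b, c, d):
    """Yates-corrected chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    return numerator / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Sample odds ratio with an assumed log-based (Woolf) 95% interval."""
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Cell counts (URI some, URI none) for high- and low-pollution cities
tables = {
    "smoker in home": (100, 20, 124, 40),       # Table 1
    "no smoker in home": (128, 62, 166, 119),   # Table 2
}

results = {}
for label, cells in tables.items():
    chi2 = yates_chi_square(*cells)
    results[label] = (chi2, p_value_df1(chi2), *odds_ratio_with_ci(*cells))
    chi2, p, or_hat, lower, upper = results[label]
    print(f"{label}: chi-square = {chi2:.3f}, p = {p:.4f}, "
          f"OR = {or_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Running both sub-tables through the same functions makes the comparison of interval widths in the interpretation easy to verify.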
61
COCHRAN-MANTEL-HAENSZEL TEST
Two biostatisticians, Nathan Mantel and William Haenszel, developed a method in 1959
for examining the relation between two categorical variables while controlling for another
categorical variable (Mantel and Haenszel 1959).
This method, like a method published by William Cochran in 1954, uses all the data in the
combined table and produces one overall test statistic. The test is designed to detect the
consistent effect of the independent variable on the dependent variable across the levels of
the extraneous variable.
Thus, this method should only be used when the estimated odds ratios in the Sub-tables
are similar to one another. One very attractive feature of this test is that it can be used with
extremely small sample sizes.
62
PROPERTIES
For large samples, when H0 is true, CMH has chi-squared distribution with df = 1.
If all θ(AB(k))=1, then CMH is close to zero
If some or all θ(AB(k))>1, then CMH is large
If some or all θ(AB(k))<1, then CMH is large
If some θ(AB(k))<1 and others θ(AB(k))>1, then CMH is NOT an appropriate test;
that is, the test works well if the conditional odds ratios are in the same direction and
comparable in size.
This test has also been generalized for application to three-way tables of size other than 2
by 2 by k (Landis, Heyman, and Koch 1978)
63
WHEN TO USE
Use the Cochran–Mantel–Haenszel test (which is sometimes called the Mantel–
Haenszel test) for repeated tests of independence. The most common situation is
that you have multiple 2×2 tables of independence; we're analyzing the kind of
experiment that we had to analyze with a test of independence, and we have done
the experiment multiple times or at multiple locations. There are three nominal
variables: the two variables of the 2×2 test of independence, and the third
nominal variable that identifies the repeats (such as different times, different
locations, or different studies).
64
CMH
We have one Z* test statistic, but because we are dealing with discrete variables, we should use
the continuity correction with Z*. However, instead of using the continuity-corrected
Z* statistic, we would prefer to use a chi-square statistic, since all the other tests
associated with contingency tables use a chi-square statistic. This poses no problem, since
the square of a standard normal variable follows a chi-square distribution with one degree
of freedom. Thus, the statistic to be used to test the hypothesis of no association between
air pollution and the occurrence of upper respiratory problems is the Cochran-Mantel-
Haenszel chi-square statistic.
65
CMH
Also called the Mantel-Haenszel statistic, it is defined by
$X^2_{CMH} = \frac{(|O - E| - 0.5)^2}{V}$
where $O_i$ and $E_i$ are the observed and expected values in the (1,1) cell in the ith
sub-table.
In terms of the entries in the ith table, $E_i$ is defined as
$E_i = \frac{(a_i + b_i)(a_i + c_i)}{n_i}$
66
VARIANCE
$V_i$, the variance of $O_i - E_i$, can be written as
$V_i = \frac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 (n_i - 1)}$
In $X^2_{CMH}$, O, E, and V are defined as the sums of the $O_i$, the $E_i$ and the $V_i$
over the k subtables. If $X^2_{CMH}$ is greater than the chi-square table value, we
reject the hypothesis of no association between air
pollution and the occurrence of upper respiratory infections. Otherwise we fail to
reject the null hypothesis.
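A minimal sketch of the CMH computation for the two passive-smoke sub-tables follows. It assumes the continuity-corrected form (|O - E| - 0.5)^2 / V, with E_i = (a_i + b_i)(a_i + c_i) / n_i and V_i = (a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i) / [n_i^2 (n_i - 1)] for the (1,1) cell of each sub-table; the function name is hypothetical.

```python
def cmh_statistic(tables):
    """Continuity-corrected Cochran-Mantel-Haenszel chi-square over 2x2 sub-tables."""
    total_o = total_e = total_v = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        total_o += a                                   # observed (1,1) cell
        total_e += (a + b) * (a + c) / n               # expected (1,1) cell
        total_v += ((a + b) * (c + d) * (a + c) * (b + d)) / (n ** 2 * (n - 1))
    return (abs(total_o - total_e) - 0.5) ** 2 / total_v

# Sub-tables from the example: (URI some, URI none) for high and low pollution
sub_tables = [
    (100, 20, 124, 40),    # passive smoke in the home
    (128, 62, 166, 119),   # no passive smoke in the home
]

cmh = cmh_statistic(sub_tables)
critical_value = 3.841  # chi-square, df = 1, alpha = 0.05
print(f"CMH chi-square = {cmh:.3f}")
print("Reject H0" if cmh > critical_value else "Fail to reject H0")
```

Because the statistic pools the observed, expected, and variance terms over both sub-tables, it uses the full sample rather than each sub-table's smaller sample.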
67
EXAMPLE
68
EXAMPLE
69
EXAMPLE
70
MANTEL-HAENSZEL COMMON ODDS RATIO
Mantel and Haenszel also showed how to combine the data from the separate sub-tables to
form a common odds ratio for the data. Again, this should only be done when the
estimated odds ratios in the sub-tables are similar. If the estimated odds ratios for the
sub-tables are not similar (for example, some are less than one and some are greater than one),
the common odds ratio would not be very useful. The relation between the independent
and dependent variable would depend on the level of the extraneous variable, and the use
of a common odds ratio would mask this. The Mantel-Haenszel
estimator of the common odds ratio, θ, is
$\hat{\theta}_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i}$
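The Mantel-Haenszel estimator, the sum of a_i d_i / n_i divided by the sum of b_i c_i / n_i over the sub-tables, can be sketched in Python as follows, using the two passive-smoke sub-tables from the example. The function name is an assumption for illustration.

```python
def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio over 2x2 sub-tables given as (a, b, c, d)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

sub_tables = [
    (100, 20, 124, 40),    # passive smoke in the home
    (128, 62, 166, 119),   # no passive smoke in the home
]

common_or = mantel_haenszel_odds_ratio(sub_tables)
separate_ors = [(a * d) / (b * c) for a, b, c, d in sub_tables]
print("sub-table odds ratios:", [round(x, 3) for x in separate_ors])
print(f"Mantel-Haenszel common odds ratio = {common_or:.3f}")
```

The pooled estimate lies between the two sub-table odds ratios, which is what one would hope for when the sub-table estimates are similar and in the same direction.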
71
DISADVANTAGES
• There is a limit to the kind of statistical analysis that can be performed on
categorical data.
• The options in categorical data do not have a standardized interval scale.
Therefore, respondents are not able to effectively gauge their options before
responding.
• Quantitative analysis cannot be performed on categorical data. Therefore,
numerical or arithmetic operations can not be performed.
72
REFERENCES
• https://www3.nd.edu/~rwilliam/stats1/x51.pdf
• https://www.investopedia.com/terms/g/goodness-of-fit.asp
• https://www.statsdirect.com/help/chi_square_tests/22.htm
• https://onlinestatbook.com/2/chi_square/contingency.html
• https://www2.stat.duke.edu/courses/Spring02/sta102/chap16.pdf
73
RECOMMENDATION
• https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Contingency_Tables-Crosstabs-Chi-Square_Test.pdf
74
THE END
75