10/7/24, 4:22 PM about:blank
Chi-Square Test for Categorical Variables
Introduction
The chi-square test is a statistical method used to determine if there is a significant association between two categorical variables. This test is widely used in various
fields, including social sciences, marketing, and healthcare, to analyze survey data, experimental results, and observational studies.
Concept
The chi-square test is non-parametric: it works directly on counts and does not assume that the data follow a particular distribution such as the normal. It evaluates whether the
observed frequencies deviate from the frequencies that would be expected if the variables were independent. The test is grounded in the chi-square distribution,
which the test statistic approximately follows under independence, and this makes it possible to judge whether the observed deviations could have arisen by random chance.
Null Hypothesis and Alternative Hypothesis
The chi-square test involves formulating two hypotheses:
Null Hypothesis ($H_0$) - There is no association between the categorical variables; any observed differences are due to random chance.
Alternative Hypothesis ($H_1$) - There is an association between the variables; the observed differences are not due to chance alone.
Formula
The chi-square statistic is calculated using the formula:
\[ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \]
where
$O_i$ is the observed frequency for category $i$.
$E_i$ is the expected frequency for category $i$, calculated as:
\[ E_i = \frac{\text{row total} \times \text{column total}}{\text{grand total}} \]
The sum is taken over all cells in the contingency table.
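The formula can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names `expected_frequencies` and `chi_square_statistic` and the 2 x 3 table of counts are hypothetical, chosen only to show the arithmetic.

```python
def expected_frequencies(observed):
    """Return the expected count for every cell, assuming independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E = row total * column total / grand total, cell by cell.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum (O - E)^2 / E over all cells of the contingency table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical counts (2 rows x 3 columns), for illustration only.
table = [[10, 20, 30],
         [20, 20, 20]]

expected = expected_frequencies(table)
statistic = chi_square_statistic(table)
print(expected)
print(round(statistic, 3))
```

Because both rows have the same total here, every row gets the same expected counts; the statistic then measures how far each observed row departs from that shared pattern.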
The calculated chi-square statistic is then compared to a critical value from the chi-square distribution table. This table provides critical values for different degrees
of freedom ($df$) and significance levels ($\alpha$).
The degrees of freedom for the test are calculated as:
\[ df = (r - 1) \times (c - 1) \]
where $r$ is the number of rows and $c$ is the number of columns in the table.
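As a rough sketch of the table lookup, the critical value can be computed directly for the two smallest cases using only Python's standard library. It relies on two standard facts: with one degree of freedom a chi-square variable is the square of a standard normal variable, and with two degrees of freedom it is exponential with mean 2. The helper names `degrees_of_freedom` and `critical_value` are hypothetical, and larger values of $df$ would need a printed table or a statistics library.

```python
import math
from statistics import NormalDist


def degrees_of_freedom(n_rows, n_cols):
    """df = (r - 1) * (c - 1) for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def critical_value(df, alpha=0.05):
    """Upper-tail critical value; this sketch covers only df = 1 and df = 2."""
    if df == 1:
        # chi-square(1) is Z^2, so square the two-sided normal quantile.
        return NormalDist().inv_cdf(1 - alpha / 2) ** 2
    if df == 2:
        # chi-square(2) is exponential with mean 2: P(X > x) = exp(-x / 2).
        return -2 * math.log(alpha)
    raise NotImplementedError("use a chi-square table or a statistics library")


df_2x2 = degrees_of_freedom(2, 2)
df_2x3 = degrees_of_freedom(2, 3)
print(df_2x2, round(critical_value(df_2x2), 3))
print(df_2x3, round(critical_value(df_2x3), 3))
```

The critical value grows with the degrees of freedom, so a larger table needs a larger statistic to reach the same significance level.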
Applications
1. Market Research: Analyzing the association between customer demographics and product preferences.
2. Healthcare: Studying the relationship between patient characteristics and disease incidence.
3. Social Sciences: Investigating the link between social factors (e.g., education level) and behavioral outcomes (e.g., voting patterns).
4. Education: Examining the connection between teaching methods and student performance.
5. Quality Control: Assessing the association between manufacturing conditions and product defects.
Practical Example - Weak Association
Suppose a researcher wants to determine if there is an association between gender (male, female) and preference for a new product (like, dislike). The researcher
surveys 100 people and records the following data:
Category   Like   Dislike   Total
Male        20      30        50
Female      25      25        50
Total       45      55       100
Step 1: Calculate Expected Frequencies
Using the formula for expected frequencies:
\[ E_{\text{Male, Like}} = \frac{50 \times 45}{100} = 22.5 \]
\[ E_{\text{Male, Dislike}} = \frac{50 \times 55}{100} = 27.5 \]
\[ E_{\text{Female, Like}} = \frac{50 \times 45}{100} = 22.5 \]
\[ E_{\text{Female, Dislike}} = \frac{50 \times 55}{100} = 27.5 \]
Step 2: Compute Chi-Square Statistic
\[ \chi^2 = \frac{(20 - 22.5)^2}{22.5} + \frac{(30 - 27.5)^2}{27.5} + \frac{(25 - 22.5)^2}{22.5} + \frac{(25 - 27.5)^2}{27.5} \]
\[ \chi^2 = \frac{(2.5)^2}{22.5} + \frac{(2.5)^2}{27.5} + \frac{(2.5)^2}{22.5} + \frac{(2.5)^2}{27.5} \]
\[ \chi^2 = \frac{6.25}{22.5} + \frac{6.25}{27.5} + \frac{6.25}{22.5} + \frac{6.25}{27.5} \]
\[ \chi^2 \approx 0.278 + 0.227 + 0.278 + 0.227 \]
\[ \chi^2 \approx 1.010 \]
Step 3: Determine Degrees of Freedom
\[ df = (2 - 1) \times (2 - 1) = 1 \]
Step 4: Interpret the Result
Using a chi-square distribution table, we compare the calculated chi-square value (1.010) with the critical value at one degree of freedom and a chosen significance level
(e.g., 0.05). The critical value from the table is approximately 3.841.
Since 1.010 < 3.841, we fail to reject the null hypothesis. Thus, the sample provides no statistically significant evidence of an association between gender and product preference.
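The four steps of this example can be checked with a short script. It is a sketch that uses only the standard library; the variable names are illustrative, and the critical value 3.841 is taken from the text above rather than computed.

```python
# Observed counts from the survey table: rows are Male, Female;
# columns are Like, Dislike.
observed = [[20, 30],
            [25, 25]]

row_totals = [sum(row) for row in observed]          # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]    # [45, 55]
grand_total = sum(row_totals)                        # 100

# Step 1: expected count = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 2: sum (O - E)^2 / E over the four cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 3: degrees of freedom for a 2 x 2 table.
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 4: compare with the tabulated critical value for df = 1, alpha = 0.05.
CRITICAL_VALUE = 3.841
reject_null = chi_square > CRITICAL_VALUE

print(expected)
print(round(chi_square, 3), df, reject_null)
```

Carried at full precision, the statistic comes out at about 1.01, well below 3.841, so the script reaches the same decision: the null hypothesis is not rejected.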
Practical Example - Strong Association
Consider a study investigating the relationship between smoking status (smoker, non-smoker) and the incidence of lung disease (disease, no disease). The researcher
collects data from 200 individuals and records the following information:
Category     Disease   No Disease   Total
Smoker          50         30          80
Non-Smoker      20        100         120
Total           70        130         200
Step 1: Calculate Expected Frequencies
Using the formula for expected frequencies:
\[ E_{\text{Smoker, Disease}} = \frac{80 \times 70}{200} = 28 \]
\[ E_{\text{Smoker, No Disease}} = \frac{80 \times 130}{200} = 52 \]
\[ E_{\text{Non-Smoker, Disease}} = \frac{120 \times 70}{200} = 42 \]
\[ E_{\text{Non-Smoker, No Disease}} = \frac{120 \times 130}{200} = 78 \]
Step 2: Compute Chi-Square Statistic
\[ \chi^2 = \frac{(50 - 28)^2}{28} + \frac{(30 - 52)^2}{52} + \frac{(20 - 42)^2}{42} + \frac{(100 - 78)^2}{78} \]
\[ \chi^2 = \frac{(22)^2}{28} + \frac{(22)^2}{52} + \frac{(22)^2}{42} + \frac{(22)^2}{78} \]
\[ \chi^2 = \frac{484}{28} + \frac{484}{52} + \frac{484}{42} + \frac{484}{78} \]
\[ \chi^2 \approx 17.286 + 9.308 + 11.524 + 6.205 \]
\[ \chi^2 \approx 44.32 \]
Step 3: Determine Degrees of Freedom
\[ df = (2 - 1) \times (2 - 1) = 1 \]
Step 4: Interpret the Result
Using a chi-square distribution table, we compare the calculated chi-square value (44.32) with the critical value at one degree of freedom and a chosen significance level
(e.g., 0.05), which is approximately 3.841. Since 44.32 > 3.841, we reject the null hypothesis. This indicates a statistically significant association between smoking status and the
incidence of lung disease in this sample.
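For a 2 x 2 table, the same statistic can be cross-checked with the well-known shortcut $\chi^2 = n(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]$, where $a, b, c, d$ are the four cell counts and $n$ is the grand total. The sketch below applies it to the smoking data and also computes a p-value, using the fact that with one degree of freedom the upper-tail probability equals `erfc(sqrt(x / 2))`. The function names are hypothetical, and the code assumes a 2 x 2 table with no continuity correction.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Smoker: 50 disease, 30 no disease; Non-Smoker: 20 disease, 100 no disease.
chi_square = chi_square_2x2(50, 30, 20, 100)
p_value = p_value_df1(chi_square)

print(round(chi_square, 2))
print(p_value < 0.05)
```

The shortcut gives about 44.3, in agreement with the cell-by-cell sum, and the p-value falls far below 0.05, which matches the decision to reject the null hypothesis.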
Conclusion
The chi-square test is a powerful tool for analyzing the relationship between categorical variables. By comparing observed and expected frequencies, researchers can
determine if there is a statistically significant association, providing valuable insights in various fields of study.