STAT 200
Chapter 2 Displaying and Describing Categorical Data
Displaying Data for Categorical Variables
For categorical data, the key is to group similar things together.
I. Frequency tables / Relative frequency tables
A frequency table lists all the categories of a categorical variable
together with their frequencies (counts). A relative frequency table
shows relative frequencies instead: each frequency expressed as a
percentage or proportion of the total. For non-overlapping categories,
the percentages should add up to 100% (apart from rounding).
Eugenia Yu, UBC Department of Statistics. Not to be copied, used, or revised without explicit written permission from the copyright owner.
Location of earthquakes in Canada in 2020:
Region       AB    BC     NB    NL    NS    NT    NU    ON    PE    QC     YT    Total
Frequency    7     1631   55    38    16    170   243   89    1     380    197   2827

Region       AB    BC     NB    NL    NS    NT    NU    ON    PE    QC     YT
Percentage   0.25  57.69  1.95  1.34  0.57  6.01  8.60  3.15  0.04  13.44  6.97
========================================
Season of occurrence:
Season       Fall    Spring   Summer   Winter   Total
Frequency    691     697      652      787      2827

Season       Fall    Spring   Summer   Winter
Percentage   24.44   24.66    23.06    27.84
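The percentages above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original notes: the function name `relative_frequencies` is our own choice, and the counts are copied from the season table.

```python
# Seasonal earthquake counts copied from the frequency table above.
season_counts = {"Fall": 691, "Spring": 697, "Summer": 652, "Winter": 787}


def relative_frequencies(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


season_pct = relative_frequencies(season_counts)

# Print category, frequency, and relative frequency side by side.
for season, pct in season_pct.items():
    print(f"{season:<8}{season_counts[season]:>6}{pct:>8.2f}")
```

Because the four seasons do not overlap and cover every earthquake, the unrounded percentages sum to 100.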
II. Bar Charts
A bar chart shows rectangular bars each representing a category. The
bars have the same width, and their heights represent the frequency or
relative frequency.
Because the bars all have the same width, the area of each bar is
proportional to its height. This satisfies the area principle: the area
occupied by a part of the graph should correspond to the magnitude of
the value it represents.
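To see the area principle at work, here is a rough text-only sketch of a bar chart for the season counts. It is an illustration under our own assumptions: the bars are drawn horizontally, so bar length plays the role of height, and both the function name `text_bar_chart` and the scale of 20 earthquakes per `#` are arbitrary choices.

```python
# Seasonal earthquake counts from the frequency table in these notes.
season_counts = {"Fall": 691, "Spring": 697, "Summer": 652, "Winter": 787}


def text_bar_chart(counts, scale=20):
    """Build one text bar per category.

    Every bar is exactly one text row thick, so its area is proportional
    to its length. `scale` is the count represented by a single '#'.
    """
    lines = []
    for category, n in counts.items():
        bar = "#" * round(n / scale)
        lines.append(f"{category:<8}{bar} {n}")
    return lines


for line in text_bar_chart(season_counts):
    print(line)
```

If the bars had different thicknesses, a thick bar could look larger than a thin one with a greater count, which is exactly what the area principle rules out.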
[Figure: location of earthquakes in Canada in 2020]
[Figure: season of occurrence]
III. Pie Charts
A pie chart shows categories as slices in a circle. The area of each
slice is proportional to the fraction of the whole for the category it
represents.
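For a circle of fixed radius, the area of a slice is proportional to its central angle, so drawing a pie chart amounts to giving each category its fraction of 360 degrees. The sketch below illustrates this with the season counts from these notes; the function name `slice_angles` is our own.

```python
# Seasonal earthquake counts from the frequency table in these notes.
season_counts = {"Fall": 691, "Spring": 697, "Summer": 652, "Winter": 787}


def slice_angles(counts):
    """Return the central angle (in degrees) of each category's slice."""
    total = sum(counts.values())
    return {category: 360 * n / total for category, n in counts.items()}


angles = slice_angles(season_counts)
for season, angle in angles.items():
    print(f"{season:<8}{angle:>7.1f} degrees")
```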
IV. Contingency Tables
• useful for showing the relationship between 2 categorical variables
• show the breakdown of the data by 2 variables: each cell shows the
frequency or the percentage (of the row total, column total, or grand total)
• can also include the row totals, column totals, and grand total
Example 1: Does season of occurrence differ across regions?
Season (numbers are frequencies)
Region    Fall   Spring   Summer   Winter
-----------------------------------------
AB           2        2        2        1
BC         400      408      413      410
NB          11       11       14       19
NL           8        7       18        5
NS           3        5        5        3
NT          57       21       30       62
NU          57       75       43       68
ON          30       22       21       16
PE           0        0        0        1
QC          88      110       79      103
YT          35       36       27       99
-----------------------------------------
Total      691      697      652      787
Season (numbers are PERCENTAGES out of region total)
Region    Fall   Spring   Summer   Winter
-----------------------------------------
AB        28.6     28.6     28.6     14.3
BC        24.5     25.0     25.3     25.1
NB        20.0     20.0     25.5     34.5
NL        21.1     18.4     47.4     13.2
NS        18.8     31.2     31.2     18.8
NT        33.5     12.4     17.6     36.5
NU        23.5     30.9     17.7     28.0
ON        33.7     24.7     23.6     18.0
PE         0.0      0.0      0.0    100.0
QC        23.2     28.9     20.8     27.1
YT        17.8     18.3     13.7     50.3
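Each row of the percentage table is obtained by dividing the row's frequencies by that region's total. The following sketch redoes this for three of the regions; the helper name `row_percentages` is our own, and the counts are copied from the frequency table in Example 1.

```python
seasons = ["Fall", "Spring", "Summer", "Winter"]

# Frequencies for three regions, copied from the table in Example 1.
counts_by_region = {
    "BC": [400, 408, 413, 410],
    "NL": [8, 7, 18, 5],
    "YT": [35, 36, 27, 99],
}


def row_percentages(row):
    """Express each cell as a percentage of its row (region) total."""
    total = sum(row)
    return [round(100 * n / total, 1) for n in row]


for region, row in counts_by_region.items():
    print(region, dict(zip(seasons, row_percentages(row))))
```

Comparing rows this way removes the effect of region size: BC has far more earthquakes than YT, yet its seasonal split is nearly even while about half of YT's earthquakes occurred in winter.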
Example 2
A study looks at the relationship between diet type (high versus low
cholesterol diet) and presence of coronary heart disease. Here are data
collected on 23 individuals:
                          Having heart disease?
                          Yes       No       Total
High cholesterol diet      11        4          15
Low cholesterol diet        2        6           8
Total                      13       10          23
Some questions of interest:
1. What percentage of individuals do not have coronary heart disease?
To answer this question, we look at the distribution of presence
of coronary heart disease:
Having heart disease?
Yes                No                 Total
13                 10                 23
(13/23 = 56.5%)    (10/23 = 43.5%)    (100%)
Answer: 43.5%
2. What proportion of individuals have a high cholesterol diet?
We look at the distribution of diet type:
High cholesterol diet    Low cholesterol diet    Total
15                       8                       23
(15/23 = 65.2%)          (8/23 = 34.8%)          (100%)
Answer: 65.2%
Note: In each of questions 1 and 2, we look at the distribution of one
variable while collapsing over the other variable. These distributions
are called marginal distributions (the frequencies are taken from the
margins of the table).
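The two marginal distributions can be computed by summing the table along its rows and along its columns. The sketch below does this for the Example 2 counts; the nested-dictionary layout and the function name `marginal_distributions` are our own choices for illustration.

```python
# Counts from Example 2: rows are diet types, columns are heart disease status.
table = {
    "High cholesterol diet": {"Yes": 11, "No": 4},
    "Low cholesterol diet": {"Yes": 2, "No": 6},
}


def marginal_distributions(table):
    """Return (row marginal, column marginal) as percentages of the grand total."""
    row_totals = {diet: sum(cells.values()) for diet, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for status, n in cells.items():
            col_totals[status] = col_totals.get(status, 0) + n
    grand_total = sum(row_totals.values())

    def to_pct(totals):
        return {key: round(100 * n / grand_total, 1) for key, n in totals.items()}

    return to_pct(row_totals), to_pct(col_totals)


diet_pct, disease_pct = marginal_distributions(table)
print(disease_pct)  # distribution used in question 1
print(diet_pct)     # distribution used in question 2
```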
3. What percentage of individuals with a low cholesterol diet have
coronary heart disease?
Here we look specifically at individuals who have a low cholesterol diet.
We are interested in finding what percentage of these individuals
have coronary heart disease. We look at the distribution of presence
of coronary heart disease among those with a low cholesterol diet:
Having heart disease?
Yes            No             Total
2              6              8
(2/8 = 25%)    (6/8 = 75%)    (100%)
Answer: 25%
4. What percentage of individuals with a high cholesterol diet have
coronary heart disease?
This time we look specifically at individuals who have a high
cholesterol diet. We are interested in finding what percentage of
these individuals have coronary heart disease. We look at the
distribution of presence of coronary heart disease among those with
a high cholesterol diet:
Having heart disease?
Yes                No                Total
11                 4                 15
(11/15 = 73.3%)    (4/15 = 26.7%)    (100%)
Answer: 73.3%
5. What proportion of individuals with coronary heart disease have a
high cholesterol diet?
Notice that this question is different from the previous question.
We look at individuals who have coronary heart disease and are
interested in finding the percentage of these individuals who have a
high cholesterol diet. We look at the distribution of diet type
among those who have heart disease:
High cholesterol diet    Low cholesterol diet    Total
11                       2                       13
(11/13 = 84.6%)          (2/13 = 15.4%)          (100%)
Answer: 84.6%
Note: For questions 3, 4 and 5, the distributions presented are called
conditional distributions. We fix a condition on one variable and look
at the distribution of the other variable. For example, in question 5
we focus on diseased individuals (i.e., we fix the condition "having
heart disease" on the presence of heart disease variable) and obtain the
distribution of the other variable, diet type.
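In terms of the table, conditioning on a diet type means working within one row, and conditioning on disease status means working within one column. The sketch below shows both directions for the Example 2 counts; the function names `disease_given_diet` and `diet_given_disease` are hypothetical names chosen for this illustration.

```python
# Counts from Example 2: rows are diet types, columns are heart disease status.
table = {
    "High cholesterol diet": {"Yes": 11, "No": 4},
    "Low cholesterol diet": {"Yes": 2, "No": 6},
}


def disease_given_diet(table, diet):
    """Distribution of heart disease status among people on one diet (a row)."""
    row = table[diet]
    total = sum(row.values())
    return {status: round(100 * n / total, 1) for status, n in row.items()}


def diet_given_disease(table, status):
    """Distribution of diet type among people with one disease status (a column)."""
    column = {diet: cells[status] for diet, cells in table.items()}
    total = sum(column.values())
    return {diet: round(100 * n / total, 1) for diet, n in column.items()}


print(disease_given_diet(table, "Low cholesterol diet"))   # question 3
print(disease_given_diet(table, "High cholesterol diet"))  # question 4
print(diet_given_disease(table, "Yes"))                    # question 5
```

Questions 4 and 5 use the same cell (11) but different denominators (15 versus 13), which is why their answers differ.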
Recall that we want to explore the relationship between diet type and
presence of coronary heart disease. Are the two variables associated?
Think: If the two were not associated, the percentage of diseased
individuals would be the same within each diet type. Is this the case?
Let’s compare the conditional distributions presented in questions 3 and 4.
73.3% of individuals with a high cholesterol diet have heart disease,
while only 25% of those with a low cholesterol diet do.
The two percentages (73.3% versus 25%) are different. This suggests that
diet type and presence of coronary heart disease are associated.
Simpson’s Paradox
The following are data on a total of 1300 surgeries performed in two
hospitals. The number of patients who had a delayed discharge (due to
excessive bleeding, infection and other post-surgical complications) is
given separately for each hospital.
                       Large hospital     Small hospital
Total # surgeries      1000               300
# delayed discharge    130                30
% delayed discharge    130/1000 = 13%     30/300 = 10%
The small hospital advertises that it has a lower rate of delayed discharge
(10% in the small hospital compared to the 13% in the large hospital).
Does the data truly speak in favour of the small hospital?
What if we look at major and minor surgeries separately?
Large hospital
                       Major surgery      Minor surgery
Total # surgeries      800                200
# delayed discharge    120                10
% delayed discharge    120/800 = 15%      10/200 = 5%

Small hospital
                       Major surgery      Minor surgery
Total # surgeries      50                 250
# delayed discharge    10                 20
% delayed discharge    10/50 = 20%        20/250 = 8%
Now the data suggest that the small hospital has a higher rate of delayed
discharge for both major and minor surgeries!
This phenomenon, a reversal in the direction of association between
two categorical variables when a third variable (e.g. type of surgery)
is accounted for, is called Simpson’s paradox. Looking only at the
overall percentage of delayed discharge, ignoring the surgery type,
could lead to misleading conclusions.
How can we explain these conflicting results?
One would expect delayed discharge to be more common after major
surgeries, possibly due to surgical complications. Although the large
hospital has a lower delayed discharge rate for both surgery types (major
and minor), it handles relatively more major surgeries, and this pulls
up its overall delayed discharge rate.
                               Large hospital                    Small hospital
% major vs. minor surgeries:   80% vs. 20%                       17% vs. 83%
Major surgeries:               120/800 (15%)                 <   10/50 (20%)
Minor surgeries:               10/200 (5%)                   <   20/250 (8%)
Combined:                      (120 + 10)/(800 + 200) (13%)  >   (10 + 20)/(50 + 250) (10%)
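The reversal can be checked numerically. Each hospital's combined rate is a weighted average of its major and minor surgery rates, with weights equal to the share of each surgery type, so a hospital that does mostly major surgeries ends up with a combined rate close to its (higher) major surgery rate. The sketch below is an illustration using the counts above; the data layout and function names are our own.

```python
# (number of delayed discharges, number of surgeries) per hospital and surgery type.
surgeries = {
    "Large": {"Major": (120, 800), "Minor": (10, 200)},
    "Small": {"Major": (10, 50), "Minor": (20, 250)},
}


def rate(delayed, total):
    """Delayed discharge rate as a percentage."""
    return 100 * delayed / total


def overall_rate(strata):
    """Combined rate: pool all surgeries, ignoring surgery type."""
    delayed = sum(d for d, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return rate(delayed, total)


def weighted_average(strata):
    """Average of the per-type rates, weighted by each type's share of surgeries."""
    total = sum(t for _, t in strata.values())
    return sum((t / total) * rate(d, t) for d, t in strata.values())


for hospital, strata in surgeries.items():
    per_type = {kind: rate(d, t) for kind, (d, t) in strata.items()}
    print(hospital, per_type, "combined:", overall_rate(strata))
```

For the large hospital the weights are 0.8 and 0.2, giving 0.8 × 15% + 0.2 × 5% = 13%; for the small hospital they are 50/300 and 250/300, giving 10%.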