Stats: Data and Models
Third Canadian Edition
Chapter 2
Displaying and Describing
Categorical Data
Copyright © 2019 Pearson Canada Inc. 2-1
Important Knowledge Points
• Analyze categorical variables in R.
• Make a frequency table and a relative frequency table of
a single categorical variable.
• Explore the relationship between two categorical
variables with a contingency table.
• Contingency table: joint distribution, marginal
distribution, and conditional distribution.
• Produce a contingency table from raw data using R.
Variable Sample Distribution
• Not all observations (observed values of the
variables) are the same: there is variation across
observations.
• The pattern of variation of a variable is called its
distribution. Distribution is a summary of the values
a variable takes and how often it takes the different
values.
• Example: the distribution of the gender variable in our first
data set (number of girls and number of boys).
Titanic Example – Raw Data
• Who: 2201 people on the Titanic
• What: four variables (Survived, Age, Sex, Class)
• Distribution of the Class variable
• R commands:
> table(Class)
Class
  Crew  First Second  Third
   885    325    285    706
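The call above assumes that a vector named Class is already available in the R session. A minimal sketch of one way to get there, assuming the raw data are rebuilt from the counts on this slide (the data frame name titanic is hypothetical and is not the course data file):

```r
# Hypothetical reconstruction: one row per person, built from the slide's counts.
titanic <- data.frame(
  Class = rep(c("Crew", "First", "Second", "Third"),
              times = c(885, 325, 285, 706))
)

# Number of people in the data set.
n_people <- nrow(titanic)

# Distribution of Class: how often each category occurs.
class_counts <- table(titanic$Class)
print(class_counts)
```

Because the values are plain character strings, table() lists the categories in alphabetical order, which is why Crew comes first.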
Frequency Tables (1 of 2)
• To display a categorical variable, we need to organize the
number of observations in each category
• A frequency table records the totals for each category
• R command: table(Class)
Class     Count
First       325
Second      285
Third       706
Crew        885
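The table above lists the classes as First, Second, Third, Crew, while table() sorts character values alphabetically. One possible way to get the slide's order is to store the variable as a factor with explicit levels. This is a sketch that assumes the raw values are rebuilt from the counts (class_raw is a hypothetical name):

```r
# Raw values rebuilt from the slide's counts (an assumption for illustration).
class_raw <- rep(c("Crew", "First", "Second", "Third"),
                 times = c(885, 325, 285, 706))

# The factor levels control the order of the categories in the table.
Class <- factor(class_raw, levels = c("First", "Second", "Third", "Crew"))

freq <- table(Class)
print(freq)
```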
Frequency Tables (2 of 2)
• A relative frequency table displays the proportions or
percentages, rather than the counts, of the values
in each category.
Class         %
First     14.77
Second    12.95
Third     32.08
Crew      40.21
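The percentages in this table can be reproduced by dividing each count by the total. A minimal sketch, assuming the counts are typed in by hand rather than read from the raw data:

```r
# Counts taken from the frequency table.
counts <- c(First = 325, Second = 285, Third = 706, Crew = 885)

# Proportions: each count divided by the total number of people.
rel_freq <- prop.table(counts)

# Percentages rounded to two decimals, as in the table.
pct <- round(100 * rel_freq, 2)
print(pct)
```

With a table object built from raw data, the same idea would be prop.table(table(Class)).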
Bar Charts (1 of 2)
• A bar chart displays the
distribution of a categorical
variable, showing the
counts for each category
next to each other for easy
comparison.
Bar Charts (2 of 2)
• A relative frequency bar
chart displays the
proportion (or percentage) of
cases in each category.
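Both kinds of bar chart can be drawn with base R graphics. The following is a sketch only: the counts are typed in from the Titanic frequency table, and the axis labels and titles are illustrative choices, not the ones used in the original figures.

```r
# Counts from the Titanic frequency table.
counts <- c(First = 325, Second = 285, Third = 706, Crew = 885)

# Bar chart of counts.
barplot(counts, xlab = "Class", ylab = "Count",
        main = "People on the Titanic by class")

# Relative frequency bar chart: same shape, but the axis shows percentages.
percents <- 100 * counts / sum(counts)
barplot(percents, xlab = "Class", ylab = "Percent of people",
        main = "People on the Titanic by class (%)")
```

The two charts have identical shapes; only the scale of the vertical axis changes.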
Pie Charts
• When you are interested in
parts of the whole, a pie chart
might be your display of
choice.
• Pie charts show the whole
group of cases as a pie.
• They slice the pie into pieces
whose size is proportional to
the fraction of each category
in the whole.
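A pie chart of the Titanic classes could be sketched with the base R function pie(). The counts are typed in from the frequency table, and the label format is an illustrative choice:

```r
# Counts from the Titanic frequency table.
counts <- c(First = 325, Second = 285, Third = 706, Crew = 885)

# Label each slice with its category and its share of the whole.
pie_labels <- paste0(names(counts), " (",
                     round(100 * counts / sum(counts), 1), "%)")

# Slice sizes are proportional to the counts.
pie(counts, labels = pie_labels, main = "People on the Titanic by class")
```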
Titanic Example Again
Of the 2201 people on the Titanic, make a relative
frequency table of the distribution of the variable
Survived.
Survival Count Percentage
Dead 1490
Alive 711
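One way to fill in the Percentage column is to divide each count by the total of 2201. A minimal sketch, assuming the two counts are entered by hand (survived_table is a hypothetical name):

```r
# Counts from the table above.
survived_counts <- c(Dead = 1490, Alive = 711)

# Relative frequency table: counts next to percentages of the total.
survived_table <- data.frame(
  Survival = names(survived_counts),
  Count = as.vector(survived_counts),
  Percentage = round(100 * as.vector(survived_counts) / sum(survived_counts), 1)
)
print(survived_table)
```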
Exploring the relationship between two
categorical variables
• Sometimes we want to look at two categorical
variables together. We want to see how
observations are distributed along each variable,
contingent on the value of the other variable.
• Titanic example: we can examine the distribution of
survival contingent on class (for example, First class), or the
distribution of class contingent on survival.
Exploring the relationship between two
categorical variables
Distribution of survival contingent on class.
> table(Survived[Class=="First"])
Alive Dead
203 122
> table(Survived[Class=="Second"])
Alive Dead
118 167
> table(Survived[Class=="Third"])
Alive Dead
178 528
> table(Survived[Class=="Crew"])
Alive Dead
212 673
Exploring the relationship between two
categorical variables
Distribution of class contingent on survival.
> table(Class[Survived=="Alive"])
Crew First Second Third
212 203 118 178
> table(Class[Survived=="Dead"])
Crew First Second Third
673 122 167 528
Exploring the relationship between two
categorical variables
• A contingency table organizes all of these counts
in one place. It shows how individuals
are distributed along each variable, contingent on
the value of the other variable.
Survival   First Class   Second Class   Third Class   Crew Class   Total
Alive              203            118           178          212     711
Dead               122            167           528          673    1490
Total              325            285           706          885    2201
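In R, a contingency table like this one can be produced from raw data by passing both variables to table(). The sketch below assumes a raw data frame rebuilt from the counts above, with one row per person (the names titanic and class_levels are hypothetical):

```r
# Hypothetical raw data: one row per person, rebuilt from the counts above.
class_levels <- c("First", "Second", "Third", "Crew")
titanic <- data.frame(
  Survived = rep(c("Alive", "Dead"), times = c(711, 1490)),
  Class = c(rep(class_levels, times = c(203, 118, 178, 212)),  # survivors
            rep(class_levels, times = c(122, 167, 528, 673)))  # non-survivors
)

# Cross-tabulate the two categorical variables.
tab <- with(titanic, table(Survived, Class))

# Add the row and column totals (the margins).
tab_with_totals <- addmargins(tab)
print(tab_with_totals)
```

The columns come out in alphabetical order (Crew, First, Second, Third), and addmargins() labels the totals "Sum" rather than "Total".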
Contingency Tables
• Each cell of the table gives the count for one
combination of values of the two
categorical variables.
– In the previous table, there are 2 x 4 = 8 cells.
• The margins of the table, both on the right (2 of
them) and at the bottom (4 of them), give the totals
for each value of a single variable.
Joint and Marginal Distributions
(percentages, obtained by dividing each count by the grand total of
2201)
Survival   First Class   Second Class   Third Class   Crew Class   Total
Alive              9.2            5.4           8.1          9.6    32.3
Dead               5.5            7.6          24.0         30.6    67.7
Total             14.8           12.9          32.1         40.2   100
Joint and Marginal Distributions
• The joint distribution shows the count (or percentage) of
observations for which BOTH categorical variables take
specific values.
Among all 2201 people on the Titanic:
203 (or 9.2%) had “Class=First” AND “Survived=Alive”;
167 (or 7.6%) had “Class=Second” AND “Survived=Dead”.
• The marginal distribution shows the count (or percentage)
of observations for which ONE categorical variable takes a specific
value, regardless of the other variable.
Among all 2201 people on the Titanic:
325 (or 14.8%) had “Class=First”;
1490 (or 67.7%) had “Survived=Dead”.
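The joint and marginal percentages can be computed from the table of counts with prop.table() and margin.table(). This is a sketch that assumes the counts are entered by hand as a 2 x 4 table (the names tab, joint_pct, survived_pct and class_pct are illustrative):

```r
# Contingency table of counts, typed in from the Titanic example.
tab <- as.table(matrix(
  c(203, 118, 178, 212,
    122, 167, 528, 673),
  nrow = 2, byrow = TRUE,
  dimnames = list(Survived = c("Alive", "Dead"),
                  Class = c("First", "Second", "Third", "Crew"))
))

# Joint distribution: every cell divided by the grand total.
joint_pct <- round(100 * prop.table(tab), 1)

# Marginal distributions: row totals and column totals over the grand total.
survived_pct <- round(100 * margin.table(tab, 1) / sum(tab), 1)
class_pct <- round(100 * margin.table(tab, 2) / sum(tab), 1)

print(joint_pct)
print(survived_pct)
print(class_pct)
```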
Conditional Distributions
• A conditional distribution shows the distribution of
one variable for just the individuals who satisfy
some condition on another variable.
– The following is the conditional distribution of Class,
conditional on “Survived=Alive” and conditional on
“Survived=Dead”:
Survival   Class First   Class Second   Class Third   Class Crew   Total
Alive              203            118           178          212     711
                 28.6%          16.6%         25.0%        29.8%    100%
Dead               122            167           528          673    1490
                  8.2%          11.2%         35.4%        45.2%    100%
Conditional Distributions
• Conditional on Alive, we only look at the 711 people and see
among them the percentages of first, second, third, and crew
classes.
• Conditional on Dead, we only look at the 1490 people and
see among them the percentages of first, second, third, and
crew classes.
• We see that the distribution of Class for the survivors is different
from that of the non-survivors.
Survival   Class First   Class Second   Class Third   Class Crew   Total
Alive              203            118           178          212     711
                 28.6%          16.6%         25.0%        29.8%    100%
Dead               122            167           528          673    1490
                  8.2%          11.2%         35.4%        45.2%    100%
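Conditioning on survival means dividing each row of counts by its own row total. In R this corresponds to prop.table() with margin = 1. A minimal sketch, assuming the counts are typed in by hand (the names tab and class_given_survival are illustrative):

```r
# Contingency table of counts, typed in from the Titanic example.
tab <- as.table(matrix(
  c(203, 118, 178, 212,
    122, 167, 528, 673),
  nrow = 2, byrow = TRUE,
  dimnames = list(Survived = c("Alive", "Dead"),
                  Class = c("First", "Second", "Third", "Crew"))
))

# Conditional distribution of Class given Survived:
# margin = 1 divides each row by its row total (711 or 1490).
class_given_survival <- round(100 * prop.table(tab, margin = 1), 1)
print(class_given_survival)
```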
Conditional Distributions
• Conditional on each of the four classes, finish this table
and compare with the next slide.
           Class First   Class Second   Class Third   Class Crew   Total
Alive              203            118           178          212     711
                  ____           ____          ____         ____
Dead               122            167           528          673    1490
                  ____           ____          ____         ____
Total              325            285           706          885    2201
                  100%           100%          100%         100%
Conditional Distributions
• Conditional on each of the four classes, the completed table
is shown below.
• We can see that the chance of surviving differs by class.
           Class First   Class Second   Class Third   Class Crew   Total
Alive              203            118           178          212     711
                   62%            41%           25%          24%
Dead               122            167           528          673    1490
                   38%            59%           75%          76%
Total              325            285           706          885    2201
                  100%           100%          100%         100%
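Conditioning on class means dividing each column of counts by its own column total. In R this corresponds to prop.table() with margin = 2. A minimal sketch, assuming the counts are typed in by hand (the names tab and survival_given_class are illustrative):

```r
# Contingency table of counts, typed in from the Titanic example.
tab <- as.table(matrix(
  c(203, 118, 178, 212,
    122, 167, 528, 673),
  nrow = 2, byrow = TRUE,
  dimnames = list(Survived = c("Alive", "Dead"),
                  Class = c("First", "Second", "Third", "Crew"))
))

# Conditional distribution of Survived given Class:
# margin = 2 divides each column by its column total (325, 285, 706, 885).
survival_given_class <- round(100 * prop.table(tab, margin = 2), 0)
print(survival_given_class)
```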
Exercise (Section 2.2 exercise #6)
Contingency Table vs. Raw Data
Exercise (Section 2.2 exercise #6)
• The previous contingency table is made from a
raw data set with:
• Observations: 29572
• Two categorical variables
– Age with four categories
– Employment status with three categories
Exercise (Section 2.2 exercise #6)
For each of the following questions, identify the type of
distribution and calculate the percentage.
1. What percentage of the population were unemployed?
2. What percentage of the population were unemployed and
aged 25 to 54 years?
3. What percentage of 15- to 24-year-olds were unemployed?
4. What percentage of the employed were aged 65 years and
over?
5. What percentage of the unemployed were aged 15 to 24
years?
Exercise (Section 2.2 exercise #6)
1. What percentage of the population were unemployed?
Marginal: 1261/29572
2. What percentage of the population were unemployed and aged 25 to
54 years?
Joint: 675/29572
3. What percentage of 15- to 24-year-olds were unemployed?
Conditional (on being 15 to 24 years old): 380/4387
4. What percentage of the employed were aged 65 years and over?
Conditional (on being employed): 745/18410
5. What percentage of the unemployed were aged 15 to 24 years?
Conditional (on being unemployed): 380/1261
Compare 3 and 5: the numerator is the same (380), but the condition
differs, so the denominator differs.
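The fractions above can be turned into percentages with a few lines of R. This is a sketch only: the counts are the ones quoted in the answers, the variable names are hypothetical, and the meaning attached to each count is inferred from the question it answers.

```r
# Counts quoted in the answers above.
n_total <- 29572        # everyone in the table
n_unemployed <- 1261    # unemployed, all ages
n_unemp_25_54 <- 675    # unemployed and aged 25 to 54
n_unemp_15_24 <- 380    # unemployed and aged 15 to 24
n_15_24 <- 4387         # aged 15 to 24, any employment status
n_emp_65_plus <- 745    # employed and aged 65 and over
n_employed <- 18410     # employed, all ages

# Helper: percentage of `part` within `whole`, rounded to two decimals.
pct <- function(part, whole) round(100 * part / whole, 2)

answers <- c(
  q1_marginal = pct(n_unemployed, n_total),
  q2_joint = pct(n_unemp_25_54, n_total),
  q3_conditional_on_age = pct(n_unemp_15_24, n_15_24),
  q4_conditional_on_employed = pct(n_emp_65_plus, n_employed),
  q5_conditional_on_unemployed = pct(n_unemp_15_24, n_unemployed)
)
print(answers)
```

Questions 3 and 5 share the numerator 380 but give very different percentages, because the group being conditioned on is different.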
Conditional Distributions and Independence
• We see that the conditional distributions of Survival differ by
Class.
• This leads us to believe that Class and Survival
are associated (a first-class passenger was more likely to
survive than a second-class passenger), and that they
are not independent.
Conditional Distributions and Independence
• In a contingency table, when the distribution of one
variable is the same for all categories of another, we
say that the two variables are independent.
• If the variables “Class” and “Survival” were
independent, then within each class we would have
observed about 32.3% alive and 67.7% dead (the
marginal distribution of Survival).
• Under independence, a person would have had the same
chance of dying regardless of class.
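To make the comparison concrete, the counts that independence would imply can be set next to the observed ones. The sketch below assumes the usual rule (row total x column total) / grand total for each cell; the names tab, expected, expected_pct_by_class and observed_pct_by_class are illustrative.

```r
# Observed contingency table, typed in from the Titanic example.
tab <- as.table(matrix(
  c(203, 118, 178, 212,
    122, 167, 528, 673),
  nrow = 2, byrow = TRUE,
  dimnames = list(Survived = c("Alive", "Dead"),
                  Class = c("First", "Second", "Third", "Crew"))
))

# Counts we would expect if Class and Survival were independent:
# (row total x column total) / grand total for each cell.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Under independence every class has the same survival split.
expected_pct_by_class <- round(100 * prop.table(expected, margin = 2), 1)

# Observed split by class, for comparison.
observed_pct_by_class <- round(100 * prop.table(tab, margin = 2), 1)

print(round(expected, 1))
print(expected_pct_by_class)
print(observed_pct_by_class)
```

Every column of expected_pct_by_class repeats the marginal split of Survival, while the observed columns differ from class to class.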
End of chapter
• Example of overlapping categories, which are not
allowed in contingency tables.
• Now it is time to do the chapter review and solve
the practice problems for Chapter 2.