Graphing Displays of Information
How to make a good Histogram
Data Types:
- Quantitative Data: Numeric Data (e.g., height, weight)
- Categorical Data: Group-based data (e.g., colours, brands)
Bin Width (Width of Each Bar):
- Use at least 5 intervals, but no more than 15
- Calculate bin width:
Bin Width = Range / Number of Intervals
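The formula above can be sketched in Python as follows. This is only an illustration: the function name bin_width is made up for this example, the 5-to-15 check simply mirrors the guideline in these notes, and the sample values are the Question 1 data from the quiz section.

```python
def bin_width(values, intervals):
    """Bin width = range / number of intervals."""
    # Guideline from the notes: use at least 5 but no more than 15 intervals.
    if not 5 <= intervals <= 15:
        raise ValueError("choose between 5 and 15 intervals")
    return (max(values) - min(values)) / intervals

# Question 1 data from the quiz section of these notes
data = [13, 7, 5, 7, 9, 10, 5, 11, 8, 7, 9, 10, 10,
        11, 14, 10, 6, 12, 6, 9, 7, 12, 9, 10, 6]

width = bin_width(data, 5)
print(width)
```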
Choosing the Right Bin Width
Too Small:
- Data is not grouped well
- Difficult to identify patterns or trends
Too Large:
- Over-groups data
- Reduces visible structure of the data
Just Right:
- The distribution of data is clear and understandable
Different Distributions of Data
Symmetric Distribution:
- Data is evenly spread out
Not Symmetric
Uniform Distribution: Flat and evenly distributed across values
U-Shaped Distribution: Peaks at the ends, dips in the middle
Mound-Shaped Distribution: A single peak, resembling a bell curve
Left-Skewed Distribution: Peak is on the right, tail extends left
Right-Skewed Distribution: Peak is on the left, tail extends right
Bimodal Distribution: Two distinct peaks
Note: Imperfections in data can alter these shapes
Measures of Central Tendency
Measures of Central Tendency
A value that represents the center of a dataset
The main measures are:
Mean: Average of all values
Median: Middle value when data is ordered
Mode: Most frequently occurring value
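As a quick check of these three definitions, here is a minimal sketch using Python's built-in statistics module. The data list is the Question 1 data from the quiz section of these notes; nothing else is assumed.

```python
import statistics

# Question 1 data from the quiz section of these notes
data = [13, 7, 5, 7, 9, 10, 5, 11, 8, 7, 9, 10, 10,
        11, 14, 10, 6, 12, 6, 9, 7, 12, 9, 10, 6]

mean = statistics.mean(data)      # sum of the values / number of values
median = statistics.median(data)  # middle value once the data is sorted
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)
```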
Finding Measures of Central Tendency in Histograms
Mean:
- Find midpoints of each interval
- Use frequencies of those midpoints to calculate the weighted mean
Median:
- Total frequency / 2 = Halfway Point
- Use cumulative frequencies (the running sum of frequencies from left to right) to locate the interval containing the median
- Median = midpoint of the identified interval
Mode:
- The tallest bar in the histogram represents the mode
How They Relate to Data Distributions
Mound-Shaped Distribution:
Mean = Median = Mode
Right-Skewed Distribution:
Mean > Median > Mode
Left-Skewed Distribution:
Mean < Median < Mode
Key Concepts About Outliers
Resistant Measures:
- A measure is resistant if it is not greatly affected by outliers
- Median is more resistant than the mean
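A small sketch of what "resistant" means in practice. The first list is the books-per-month data from Question 3(d); the second list is a hypothetical copy, invented for this illustration, where the largest value is replaced by an outlier of 100.

```python
import statistics

books = [2, 4, 6, 8, 10]           # data from Question 3(d)
with_outlier = [2, 4, 6, 8, 100]   # hypothetical: 10 replaced by an outlier

mean_before = statistics.mean(books)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(books)
median_after = statistics.median(with_outlier)

# The mean is pulled toward the outlier; the median does not move.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```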
Measures of Spread
What is a Measure of Spread
Definition:
Describes how data is spread out from a central value
Key Points:
Less spread = narrower range, more consistency
More spread = values are more dispersed
Small spread = increases confidence in where values lie
Other Names: Dispersion, Variation
Basic Measure of Spread
Formula for Range:
- Range = Maximum - Minimum
Quartiles (Q1, Q2, Q3)
Dividing data into four equal groups:
Q1: Median of the lower half of the data (25% of data below it)
Q2: Middle value (50% of data below it, and 50% of data above it)
Q3: Median of the upper half of the data (75% of data below it)
Interquartile Range (IQR):
IQR = Q3 - Q1
- Represents the spread of the middle 50% of data
- Less sensitive to outliers, often paired with the median
Standard Deviation (SD)
Definition:
Measures how far data points are from the mean
Formula:
- Variance is calculated first by averaging the squared deviations from the mean
- SD is the square root of variance
Properties:
Low SD = Data is close to the mean
High SD = Data is spread out
Sensitive to Outliers
Properties of Standard Deviation
- σ ≥ 0; if σ = 0 then there is no variation.
(No values deviate from the mean, which says that all values in a distribution are the same)
- σ is more meaningful with larger data sets
- the bigger σ is, the more variation there is in the distribution from the mean
- a measure is resistant if it is not very prone to changes from outliers; standard deviation is not
resistant, because outliers will change it drastically
- measures variation from the mean, so it is paired with the mean in the discussion of the measure of
center and spread in data
Which Measure of Center to Use
Skewed Data: Median
Symmetric Data: Mean or Median
Multiple Modes: Mode
Normal Distributions
Normal Distribution Properties
- Symmetrical and bell-shaped
- The mean, median and mode are the same
Approximately:
- 68% of the data lies within 1 standard deviation of the mean
- 95% of data lies within 2 standard deviations of the mean
- 99.7% of data lies within 3 standard deviations of the mean
Standard Deviation
Measures the spread of data from the mean.
- Small Standard Deviation indicates that the data is closely packed around the mean
- Large Standard Deviation suggests more variation
Z-Scores
A measure of how many standard deviations a data point is from the mean
Formula: Z = (x - µ) / σ
µ = the mean
σ = the standard deviation
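The z-score formula translates directly into code. This is a minimal sketch: the function name z_score is chosen for this example, and the sample numbers come from the cat-shelter scenario in Question 4 (mean 4.5 kg, standard deviation 0.8 kg).

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# A 3.5 kg cat in the Question 4 scenario (mean 4.5 kg, SD 0.8 kg)
z = z_score(3.5, 4.5, 0.8)
print(z)
```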
Percentiles
The percentage of data points below a specific value in a dataset.
- Percentiles indicate the percentage of data below a specific value
- Higher percentiles don’t always mean better
Ex: The 25th percentile is the first quartile (Q1), while the 75th percentile is the third quartile (Q3)
Understanding Data with Normal Distribution
Histograms and Data Shape:
- Normal distributions appear as mound-shaped distributions (bell curves)
- Skewed data may approximate normality with larger sample sizes
Z-Scores and Percentiles
Z-Scores
As mentioned above, the z-score is calculated from the mean and standard deviation using the formula
Z = (x - µ) / σ
Interpretation
- Positive Z-Scores: Data Value above the mean
- Negative Z-Scores: Data Value below the mean
Mathematical Indices
Mathematical Indices
- Arbitrarily defined numbers that measure scale or represent collections of data
- Used for comparison but may not always represent actual measurements
Properties
- Indices simplify complex datasets into single metrics for easier interpretation
- Each index has a specific purpose, limitations and calculation method
Examples of Mathematical Indices
Body Mass Index (BMI):
Formula: BMI = mass (kg) / height (m)^2
Categories:
Underweight: < 18.5
Normal: 18.5 - 24.9
Overweight: 25 - 29.9
Obese: ≥ 30
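The BMI formula and categories can be sketched as below. The function names are made up for this example, and the cut-offs are treated as continuous boundaries (under 18.5, under 25, under 30, otherwise obese), which is an assumption made to close the small gaps between the listed ranges.

```python
def bmi(mass_kg, height_m):
    """BMI = mass (kg) / height (m)^2."""
    return mass_kg / height_m ** 2

def bmi_category(value):
    # Assumption: ranges are read as continuous cut-offs,
    # so 24.95 counts as Normal and exactly 30 counts as Obese.
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal"
    if value < 30:
        return "Overweight"
    return "Obese"

# The man from Question 6(c): 80 kg and 180 cm (1.80 m)
value = bmi(80, 1.80)
print(round(value, 2), bmi_category(value))
```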
Other Examples of Mathematical Indices:
- Consumer Price Index (CPI)
- Charlson Comorbidity Index (CCI)
- Shannon’s Index (SI)
- Citation Index (CI)
- Expected Goals (xG)
Chapter 3 Questions - For Quiz
1. (Section 3.1: Graphical Displays of Information)
a) For each distribution listed below, match them with a scenario:
i) Uniform distribution
- rolling a fair die
- flipping a fair coin
- drawing a card from a deck
ii) Mound-shaped distribution
- heights of people
- test scores for a large population of students
- rolling a pair of 6-sided dice and finding the sums
iii) Left-skewed distribution
Examples:
- retirement ages
- scores on competitive video games
- dice rolls
iv) Right-skewed distribution
- rolling a pair of 6-sided dice and finding the difference
- wait times at a fast food restaurant
- property values in a city with luxury homes
v) U-shaped distribution
- popularity of car colours
- age of people visiting a theme park
- weather temperatures in a year
b) Create a histogram with five uniform intervals, based on the following data:
13, 7, 5, 7, 9, 10, 5, 11, 8, 7, 9, 10, 10, 11, 14, 10, 6, 12, 6, 9, 7, 12, 9, 10, 6
Bin width = range / 5 = (14 - 5) / 5 = 1.8, so the intervals are 5-6.8, 6.8-8.6, 8.6-10.4, 10.4-12.2, 12.2-14
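To check the bar heights for this histogram, the values can be sorted into the five intervals with a short script. This is a sketch only: the function name histogram_counts is made up, and the rule for boundary values (they go to the higher interval, except the maximum, which stays in the last one) is an assumption, since no data value here actually lands on a boundary.

```python
def histogram_counts(values, intervals):
    """Return (interval edges, frequency of each interval)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / intervals
    counts = [0] * intervals
    for x in values:
        # Assumption: a boundary value goes to the higher interval,
        # but the maximum value is kept in the last interval.
        index = min(int((x - lo) / width), intervals - 1)
        counts[index] += 1
    # Round the edges to hide floating-point noise such as 10.400000000000002
    edges = [round(lo + i * width, 10) for i in range(intervals + 1)]
    return edges, counts

data = [13, 7, 5, 7, 9, 10, 5, 11, 8, 7, 9, 10, 10,
        11, 14, 10, 6, 12, 6, 9, 7, 12, 9, 10, 6]

edges, counts = histogram_counts(data, 5)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left}-{right}: {'#' * count} ({count})")
```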
c) Describe the distribution shape of the histogram drawn in part (b).
The distribution is roughly symmetric and mound-shaped, with a single peak in the 8.6-10.4 interval; it is slightly uneven, but not clearly skewed.
2. (Section 3.2: Measures of Central Tendency)
a) Calculate the mean, median, and mode for the data in Question 1. Which measure best describes
the central tendency of the data? Why?
From the data alone, we will first organize the data in ascending order (n = 25)
5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 12, 12, 13, 14
i) The mean is x̄ = (5 + 5 + 6 + ... + 12 + 13 + 14) / 25 = 8.92
ii) The median is 9
iii) The mode is 10
Since mean ≈ median, both are equally valid measures of central tendency to describe the data, due to the
symmetry of the distribution.
From the histogram (and frequency table), we find the mean, median and mode slightly differently.
i) We find the weighted mean. First, find the midpoint of each interval:
For 5-6.8, this is (5 + 6.8) / 2 = 5.9 (frequency 5)
For 6.8-8.6, this is (6.8 + 8.6) / 2 = 7.7 (frequency 5)
For 8.6-10.4, this is (8.6 + 10.4) / 2 = 9.5 (frequency 9)
For 10.4-12.2, this is (10.4 + 12.2) / 2 = 11.3 (frequency 4)
For 12.2-14, this is (12.2 + 14) / 2 = 13.1 (frequency 2)
Then find the weighted mean:
x̄ = [5.9(5) + 7.7(5) + 9.5(9) + 11.3(4) + 13.1(2)] / (5 + 5 + 9 + 4 + 2) = 224.9 / 25 = 8.996
ii) The median is found by first finding half of the total frequency: (5 + 5 + 9 + 4 + 2) / 2 = 12.5
Then calculate cumulative frequencies (until they are greater than or equal to 12.5)
For interval 5-6.8, we start at 5 since there are five numbers in that interval (5, 5, 6, 6, 6)
For interval 6.8-8.6, we add 5 + 5 = 10 (5, 5, 6, 6, 6, 7, 7, 7, 7, 8)
For interval 8.6-10.4, we add to the previous total: 10 + 9 = 19 (5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10)
We stop here, and take the midpoint of the interval 8.6-10.4, so 9.5, as the median
iii) The mode is also 9.5 (the midpoint of the interval with the tallest bar)
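The histogram-based steps above (midpoints, weighted mean, cumulative frequencies, tallest bar) can be sketched in Python. The edges and frequencies are the ones from the Question 1 histogram; the variable names are chosen just for this example.

```python
# Interval edges and frequencies from the Question 1 histogram
edges = [5, 6.8, 8.6, 10.4, 12.2, 14]
freqs = [5, 5, 9, 4, 2]

# Midpoint of each interval: (left edge + right edge) / 2
midpoints = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
total = sum(freqs)

# Weighted mean: sum of midpoint * frequency, divided by total frequency
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / total

# Median: midpoint of the first interval whose cumulative frequency
# reaches half of the total frequency
cumulative = 0
for m, f in zip(midpoints, freqs):
    cumulative += f
    if cumulative >= total / 2:
        grouped_median = m
        break

# Mode: midpoint of the interval with the tallest bar
grouped_mode = midpoints[freqs.index(max(freqs))]

print(round(grouped_mean, 3), round(grouped_median, 1), round(grouped_mode, 1))
```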
b) If the data in Question 1 was based on the number of hours high school students spend on
homework per week, interpret what the mean, median and mode represent in this scenario.
Mean: On average, the number of hours high school students spend on homework per week is about 8.92
(or 8.996, or just under 9) hours.
Median: The middle value for the number of hours high school students spend on homework per week is 9 (or 9.5)
hours. This means about 50% of these students spend less than 9 hours doing homework per week, while the
other half spend more than 9 hours.
Mode: The most frequent number of hours high school students spend on homework per week is 10 hours
(or 9.5). This could indicate a preference or a coincidence.
3. (Section 3.3: Measures of Spread)
a) What is the range in the data of Question 1?
The range is max - min. From the data, we know the lowest and highest values are 5 and 14:
Range = 14 - 5
= 9
b) What are Q1, Q3 and IQR of the data in Question 1? Interpret the meaning of the IQR based on
the scenario specified in Question 2b.
5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 12, 12, 13, 14
The median is the 13th value (9). Q1 is the median of the 12 values below it, and Q3 is the median of the 12 values above it.
Q1 = (7 + 7) / 2 = 7
Q3 = (10 + 11) / 2 = 10.5
- IQR = Q3 - Q1 = 10.5 - 7 = 3.5
- The middle half of the data lies between 7 and 10.5. Contextually, as in Question 2b, the middle 50% of high
school students spend between 7 and 10.5 hours doing homework each week
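The quartile steps can be sketched as below, using the "median of each half" method from this answer (for an odd number of values, the overall median is left out of both halves). The function name quartiles is made up for this example. Note that calculators and software may use slightly different quartile conventions, so their results can differ a little from this method.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [13, 7, 5, 7, 9, 10, 5, 11, 8, 7, 9, 10, 10,
        11, 14, 10, 6, 12, 6, 9, 7, 12, 9, 10, 6]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```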
c) Use technology to calculate the (population) standard deviation (denominator is n, not n - 1).
Interpret the standard deviation based on the scenario specified in Question 2b.
Using technology to find the population standard deviation, we get σ = 2.43. This means that, on average,
data values differ from the mean (of 8.92) by about 2.43. Contextually, as in
Question 2b, this means that typically, most high school students will spend between 6.49 and 11.35 hours
doing homework each week.
Another way to word this: the number of hours spent doing homework each week typically differs from the
mean (8.92 hours) by about 2.43 hours.
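One possible choice of "technology" here is Python's built-in statistics module, sketched below. statistics.pstdev is the population version (it divides by n); statistics.stdev would divide by n - 1 instead, which is not what this question asks for.

```python
import statistics

data = [13, 7, 5, 7, 9, 10, 5, 11, 8, 7, 9, 10, 10,
        11, 14, 10, 6, 12, 6, 9, 7, 12, 9, 10, 6]

mean = statistics.mean(data)
sigma = statistics.pstdev(data)   # population SD: denominator is n

# One standard deviation either side of the mean
low, high = mean - sigma, mean + sigma
print(round(sigma, 2), round(low, 2), round(high, 2))
```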
d) Here is a different scenario: The number of books read by five students in a month is as follows:
2, 4, 6, 8, 10 Calculate the standard deviation, and interpret the meaning of it in this scenario.
i) Find the mean
x̄ = (2 + 4 + 6 + 8 + 10) / 5 = 6
ii) Find the deviations, the squared deviations, and the sum of squared deviations:
Deviation | Squared deviation
2 - 6 = -4 | (-4)^2 = 16
4 - 6 = -2 | (-2)^2 = 4
6 - 6 = 0 | (0)^2 = 0
8 - 6 = 2 | (2)^2 = 4
10 - 6 = 4 | (4)^2 = 16
Sum of squared deviations: 16 + 4 + 0 + 4 + 16 = 40
Calculate the SD:
σ = √(40 / 5)
= √8
= 2.83
The standard deviation is σ = 2.83
iii) In the given scenario, σ means this:
Most students will read between 6 - 2.83 = 3.17 and 6 + 2.83 = 8.83 books per month on average.
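The hand calculation for this question can be followed step by step in code. This is a minimal sketch of the population standard deviation (dividing by n, as in Question 3c); the variable names are chosen for this example.

```python
import math

books = [2, 4, 6, 8, 10]

mean = sum(books) / len(books)                 # step i: the mean
deviations = [x - mean for x in books]         # each value minus the mean
squared = [d ** 2 for d in deviations]         # squared deviations
variance = sum(squared) / len(books)           # average squared deviation
sigma = math.sqrt(variance)                    # SD = square root of variance

print(mean, sum(squared), variance, round(sigma, 2))
```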
4. (Section 3.4: Normal Distribution)
The mass of 50 cats in a shelter is normally distributed with a mean of 4.5 kg and a standard deviation of
0.8 kg.
a) Approximately how many cats weigh
i) Between 3.7 and 5.3 kg?
68% of cats have a mass between 3.7 and 5.3 kg. So roughly 50(0.68) = 34 cats weigh between 3.7 and 5.3 kg.
ii) Less than 2.9 kg?
About 2.35% (could also use 2.5% to compensate for the missing 0.3 of 99.7) of cats weigh less than 2.9 kg. This
means about 50(0.0235) = 1.175 (or just 1) cat weighs less than 2.9 kg.
iii) More than 5.3 kg?
About 13.5 + 2.35 (or 13.5 + 2.5) = 15.85% of cats weigh over 5.3kg. This means about 50 (0.1585) = 7.925
(round down to 7) cats weigh over 5.3kg
b) If the shelter can only adopt out cats weighing between 2.1 and 5.3 kg, what percentage of the cats
are eligible for adoption?
We are looking at 2.35 + 13.5 + 34 + 34 = 83.85% of cats that are eligible for adoption. (that means 50
(0.8385) = 41.925, or 41 cats are eligible for adoption)
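The band-adding used in Question 4 can be sketched as a small lookup. The band percentages (2.35, 13.5, 34, 34, 13.5, 2.35) are the ones used in these answers; the names BANDS, percent_between and whole_z are made up for this example, and the sketch only handles limits that sit a whole number of standard deviations from the mean.

```python
# Approximate percentage of data in each one-SD band from z = -3 to z = +3.
# The small tails beyond 3 SD are ignored, as in the worked answers.
BANDS = {
    (-3, -2): 2.35, (-2, -1): 13.5, (-1, 0): 34.0,
    (0, 1): 34.0, (1, 2): 13.5, (2, 3): 2.35,
}

def percent_between(z_low, z_high):
    """Add up the bands lying between two whole-number z-scores."""
    return sum(p for (a, b), p in BANDS.items() if a >= z_low and b <= z_high)

def whole_z(x, mean, sd):
    """z-score of x, rounded to the nearest whole number of SDs."""
    return round((x - mean) / sd)

MEAN, SD, CATS = 4.5, 0.8, 50

# Question 4(b): cats between 2.1 kg (z = -3) and 5.3 kg (z = +1)
eligible = percent_between(whole_z(2.1, MEAN, SD), whole_z(5.3, MEAN, SD))
print(f"{eligible:.2f}% eligible, about {CATS * eligible / 100:.3f} cats")
```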
5. (Section 3.5: Applying the Normal Distribution: Z-Scores)
a) Draw the normal curve for the scenario in Question 4, and interpret the meaning of a z-score of z
= -1.25 in this scenario. Indicate where the data is on the normal distribution.
A z-score of z = -1.25 means a data value that is exactly 1.25 standard deviations below the mean. In this
context, a cat that weighs 4.5 - 1.25(0.8) = 3.5 kg is what we are looking for.
b) Determine the percentile the cat in part (a) would be at. Interpret the meaning of the percentile in
this scenario.
The percentage of cats that weigh less than 3.5 kg (or that have a z-score of z = -1.25) can be found from the
z-score table on pp. 398-399 of the textbook. This proportion is 0.1056, or 10.56%. Round up for the
percentile. The cat would be at the 11th percentile. This means that about 11% (specifically 10.56%) of cats weigh
under 3.5 kg.
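If the textbook table is not at hand, the same lookup can be sketched with NormalDist from Python's statistics module (available in Python 3.8 and later). Its cdf method gives the proportion of a normal distribution below a value, which is what the z-score table provides.

```python
from statistics import NormalDist

# Cat masses from Question 4: mean 4.5 kg, standard deviation 0.8 kg
cats = NormalDist(mu=4.5, sigma=0.8)

share_below = cats.cdf(3.5)            # proportion of cats under 3.5 kg
percentile = round(share_below * 100)  # rounded to a whole percentile

print(round(share_below, 4), percentile)
```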
c) Determine the percentage of cats that weigh between 3.0 and 6.0 kg.
A cat that weighs 3.0 kg has a z-score of:
z = (x - µ) / σ
= (3.0 - 4.5) / 0.8
= -1.875
≈ -1.88
so 3.01% of cats weigh under 3.0 kg.
A cat that weighs 6.0 kg has a z-score of:
z = (x - µ) / σ
= (6.0 - 4.5) / 0.8
= 1.875
≈ 1.88
so 96.99% of cats weigh under 6.0 kg.
96.99 - 3.01 = 93.98% of cats weigh between 3.0 and 6.0 kg
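The "area between two values" step is a subtraction of two below-the-value proportions. Here is a sketch with NormalDist from Python's statistics module, done two ways: once with the z-scores rounded to two decimals as a printed table requires, and once without rounding. The two results differ slightly in the second decimal place of the percentage, which is only a rounding effect.

```python
from statistics import NormalDist

standard = NormalDist()                  # standard normal: mean 0, SD 1
cats = NormalDist(mu=4.5, sigma=0.8)     # cat masses from Question 4

# Table-style: z-scores rounded to two decimals (-1.88 and 1.88)
table_style = standard.cdf(1.88) - standard.cdf(-1.88)

# No rounding of the z-scores (z = -1.875 and 1.875)
unrounded = cats.cdf(6.0) - cats.cdf(3.0)

print(f"table-style: {table_style:.2%}, unrounded: {unrounded:.2%}")
```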
d) Is the 99th percentile a good thing in this scenario? Explain your reasoning.
Depending on the species of cat, or their age, being at the 99th percentile could simply be an abnormality.
Cats that weigh over 6.9kg can be considered overweight or even obese in most cases, so typically we can
say that being in the 99th percentile is not good here.
6. (Section 3.6: Mathematical Indices)
Recall that the Body-Mass Index (BMI) is a mathematical index that measures whether a person's weight
is appropriate for their height, helping to indicate if they are underweight, normal weight, overweight, or
obese. In a similar light, an animal’s Body Condition Score (BCS) is a mathematical index that assesses the
body fat and overall body condition of the animal. While it is not calculated with a formula, a numerical
value is derived from observations of an animal’s profile and body conditions around the ribs, back, spine,
hips, and abdomen.
a) The scale works as follows: if the cat’s BCS is:
i) 1-4, then it is underweight
ii) 5-6, then it is at a healthy weight
iii) 7-9, then it is considered overweight, with higher numbers leading to obesity.
Based on this scale, how might you score a cat’s BCS if their mass is at the 99th percentile (as mentioned
in Question 5d)? Explain your reasoning.
While it may be hard to say, the fact that the mass of cats in the shelter is normally distributed tells us
that we may not have to depend so much on other variables like species or age. It is a safe assumption,
therefore, to say that a cat whose mass is at the 99th percentile is considered overweight (a BCS of 7-9,
maybe closer to 9).
b) What are some limitations the BCS might have? (e.g. think about different species of cats)
BCS does not take into consideration different species of cats, where higher weights are more expected.
The age of each cat may also not be taken into account.
c) Back to humans: calculate the BMI of a man who weighs 80 kg and who is 180 cm tall. Interpret
the BMI of the man according to the scale below.
BMI = mass / height^2 = 80 / 1.80^2 = 80/ 3.24 = 24.69kg/m^2
The body mass index of this man is 24.69kg/m^2, which suggests he is healthy (has a normal weight for his
height, considering the given range of 18.5 - 24.9).