Univariate Descriptive Statistics
Displays: pie charts, bar graphs, box plots, histograms,
density estimates, dot plots, stem-leaf plots, tables, lists.
Example: sea urchin sizes
[Figure: four panels of the urchin data: Boxplot, Histogram, Dot Plot, Density. Size axes: Urchin Size (mm). Histogram vertical axis: Number of Urchins. Density plot vertical axis: Density; its size axis runs from −20 to 80.]
Points:
1) Useful for quantitative variables.
2) Boxplot shows five number summary: minimum,
first quartile, median, third quartile, maximum.
3) Dot plot is illegible with 250 data points. (1
dot for each size plotted on line.)
4) Histogram and density plot serve similar purposes.
5) Density estimate extends below 0: bad, since
sizes cannot be negative.
6) Histogram doesn’t show the clustering that the
density plot shows.
Example: Categorical: Weather in Central
Park
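[Figure: two panels, Pie Chart and Bar Graph, of cloud cover in three categories: clear, a middle category whose label was lost in extraction, and cloudy. Bar graph vertical axis runs from 0 to 10.]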
Pie chart harder to read.
General summary: Pie Charts are bad.
More useful with more categories.
Ordering of categories important for nominal
variables.
Cloudiness is ordinal.
Pie charts: wedge has area proportional to #
of individuals in category.
Bar chart: bar has height equal to # of
individuals in category.
Density estimates not discussed in this course.
Histogram:
1) divide range of values into intervals.
2) count number of individuals in each interval.
3) bar AREA is proportional to # of individuals
in interval; width is length of interval.
4) equal width bars best – then height is
proportional to # of individuals.
5) label x-axis; include units.
6) label y-axis.
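
The histogram steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sizes and interval edges are made up (they are not the urchin data), and `histogram_heights` is a hypothetical helper name. Heights are scaled so that each bar's area equals the fraction of individuals in its interval.

```python
from bisect import bisect_right

def histogram_heights(values, edges):
    """Return (counts, heights) for intervals [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Only count values that fall inside the covered range.
        if edges[0] <= v < edges[-1]:
            counts[bisect_right(edges, v) - 1] += 1
    n = len(values)
    # Height = fraction in interval / width, so that area = fraction.
    heights = [c / (n * (hi - lo))
               for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, heights

# Made-up sizes (mm), purely for illustration.
sizes = [3, 7, 8, 12, 14, 15, 18, 22, 35, 48]
edges = [0, 10, 20, 50]          # note the unequal widths

counts, heights = histogram_heights(sizes, edges)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:])]

print(counts)    # number of individuals per interval
print(heights)   # bar heights (density scale)
print(areas)     # bar areas = fraction of individuals per interval
```

With unequal widths the tallest bar need not hold the most individuals; it is the area, not the height, that carries the count.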
Example: Personal Income for BC (ages 15+).
(For those with income.) Source: 2001 Census.
[Figure: histogram “Adult Personal Income (BC)”. x-axis: Income ($000s), 0 to 100; vertical axis on density scale, 0.00 to 0.03.]
Points
1) Bar widths unequal – census tables given
that way.
2) So take width times height to get area =
fraction of population in that income group.
3) Last group on right open ended – artificially
cut off at $100,000 by me.
4) Plot is “long-tailed to the right” or “skewed
to the right”.
5) Based on 20% sample of 1,523,720 people
aged 15+ in BC on census day, 2001.
6) Income is for previous year – 2000.
Comparison of 1995, 2000 (1996 and 2001 censuses).
[Figure: two stacked histograms, “1996 Income” and “2001 Income”. x-axis: 0 to 100; vertical axis: Density.]
Comparison of 2000, 2005.
[Figure: “BC Individual Income 2000 and 2005”: overlaid curves labelled 2000 and 2005. x-axis: BC Individual Income 2000, 0 to 100; vertical axis: Density, 0.000 to 0.030.]
Summarizing the pictures.
Purposes: less space in text than a graph;
precise numerical comparison between groups.
Summarizing a histogram:
Where is centre of the x-axis values? Jargon:
location or centre.
How far do the x values extend on either side?
Jargon: spread, variation, width.
Is the picture symmetric or does it extend
farther to right than left?
Location and number of bumps.
Measures of location:
Mean, Arithmetic Mean, Average, Arithmetic
Average: total of x-values divided by
number of x values.
Histogram balances at mean. (First Moment
in physics.) Think of See-Saw: small kid far
from centre balances big kid close to centre.
Formula: data $X_1, \ldots, X_n$.
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Utility of summation notation in this course:
NIL. But X̄ is standard notation for average of
X.
Median: number such that 1/2 of X values at
least that large, and 1/2 of X values at least
that small.
Sort list: if n is odd median is middle of sorted
list. If n is even take average of two middle
values.
Numerical examples: ages in my family:
50, 50, 20, 15, 8, 8.
$$\bar{A} = \frac{50 + 50 + 20 + 15 + 8 + 8}{6} = \frac{151}{6} \approx 25.2$$
Median age: middle numbers are 15, 20.
Halfway between is median = 17.5.
Mode: most common value. Not useful concept
in most cases. Location of tallest bar in
histogram (affected by definition of classes).
Mode of ages is not unique: 50 or 8. Not
useful summary of centre.
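
The three measures of location for the family ages can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative only.

```python
from statistics import mean, median, multimode

ages = [50, 50, 20, 15, 8, 8]

avg = mean(ages)          # total 151 divided by n = 6
med = median(ages)        # n is even: average of the two middle values
modes = multimode(ages)   # every value tied for most common

print(round(avg, 1))      # mean, rounded to one decimal
print(med)                # median
print(modes)              # more than one mode: not unique
```

`multimode` returns all tied values, which makes the non-uniqueness of the mode visible directly.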
Comparison:
Advantages of mean:
1) if your average weekly income is $100 you
know how you will do in the long run; not so
if median weekly income is $100.
2) Same point: average and sample size together
tell you the total.
3) Has simpler mathematical behaviour than
median.
Advantages of median:
Not influenced by extreme members of list.
Median income, for instance, gives more
information about typical person.
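
A small sketch of the comparison, using hypothetical weekly incomes invented for illustration (they come from no data set in these notes). Adding one extreme value moves the mean a great deal and the median hardly at all, while the mean times the sample size still recovers the total.

```python
from statistics import mean, median

# Hypothetical weekly incomes in dollars, made up for this example.
incomes = [80, 90, 100, 110, 120]
with_extreme = incomes + [5000]   # add one extreme member to the list

print(mean(incomes), median(incomes))             # both at the centre
print(mean(with_extreme), median(with_extreme))   # mean jumps, median barely moves

# Average and sample size together give the total.
total = mean(incomes) * len(incomes)
print(total, sum(incomes))
```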
Measures of spread:
Standard Deviation
Interquartile Range
Mean Absolute Deviation.
Deviations from the mean: subtract mean from
each number in list: $X_i - \bar{X}$. For my family
deviations are
24.8, 24.8, −5.2, −10.2, −17.2, −17.2.
Summarize size of deviations:
Average is 0. Not useful as measure of size
since pluses cancel minuses.
Mean absolute deviation: take absolute values
(ignore − signs) and average:
$$\frac{24.8 + 24.8 + 5.2 + 10.2 + 17.2 + 17.2}{6} = 16.6 \text{ years}$$
Standard deviation: square deviations, average,
take square root:
$$s = \sqrt{\frac{(24.8)^2 + \cdots + (-17.2)^2}{5}} = 19.8 \text{ years.}$$
WARNING: notice the 5 not 6. This is Traditional.
Not important in large data sets.
Jargon: variance is $s^2$:
$$s^2 = \frac{(24.8)^2 + \cdots + (-17.2)^2}{5} = 390.6 \text{ years}^2$$
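
The spread calculations for the family ages can be reproduced as a sketch in Python. The standard library's `stdev` and `variance` divide by n − 1 (here 5), matching the convention above; the mean absolute deviation is computed by hand, dividing by n = 6.

```python
from statistics import mean, stdev, variance

ages = [50, 50, 20, 15, 8, 8]
a_bar = mean(ages)

# Deviations from the mean; they sum to zero (up to rounding error).
deviations = [a - a_bar for a in ages]

# Mean absolute deviation: average of the absolute deviations (divide by n).
mad = mean([abs(d) for d in deviations])

# Standard deviation and variance with the n - 1 divisor.
sd = stdev(ages)
var = variance(ages)

print([round(d, 1) for d in deviations])
print(round(mad, 1), round(sd, 1), round(var, 1))
```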
Interquartile Range:
First define quartiles, quintiles, etc.
First, second and third quartiles split list into
4 equal pieces.
One quarter of list below first quartile, two
quarters below second, three quarters below
third.
Second quartile is median.
Interquartile range is third quartile minus first
quartile.
Book gives method to find quartiles.
Quintiles split list into 5 equal parts.
Percentiles split list into 100 equal parts.
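
As an illustration, quartiles and the interquartile range for the family ages can be computed with `statistics.quantiles`. This is a sketch under an assumption: there are several conventions for locating quartiles, and the `"inclusive"` interpolation used here is the library's choice, not necessarily the book's method, so the first and third quartiles may differ slightly from a hand calculation.

```python
from statistics import median, quantiles

ages = [50, 50, 20, 15, 8, 8]

# n=4 asks for the three cut points that split the list into quarters.
# method="inclusive" is one of several conventions (an assumption here).
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")

iqr = q3 - q1                                      # third quartile minus first
five_number = (min(ages), q1, q2, q3, max(ages))   # what a boxplot draws

print(q1, q2, q3)
print(iqr)
print(five_number)
```

Changing `n` to 5 or 100 gives quintiles or percentiles in the same way.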
Comparison:
Advantages of IQR: like median not influenced
by extremes.
Easily related to proportions of population.
But: rather than use 2 number summary (median,
IQR) typically use 3 number summary
(quartiles) or 5 number summary (min, max,
quartiles).
Boxplot is graph of 5 number summary.
Advantages of Mean Absolute Deviation.
Seems intuitive.
Less influenced by extremes than Standard
Deviation.
But: poor mathematical properties.
We mostly use Standard Deviation.
Why the Standard Deviation?
Usual explanation: squares nicer mathematically
than absolute values.
Real explanation (WARNING: personal view):
ONLY the SD works in normal approximations
for sums.
Normal approximations? A common summary
for curves.
Rule of thumb: in many lists of data about
2/3 of the observations are within 1 SD of the
mean, about 95% within 2 SDs of the mean
and almost all within 3 SDs of the mean.
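
The rule of thumb can be checked on simulated data. The sketch below assumes bell-shaped data drawn with `random.gauss`; the centre of 40 and spread of 10 are arbitrary illustrative choices, not values from these notes. Real lists that are strongly skewed can depart from the rule.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

# Simulated measurements (an assumption: roughly bell-shaped data).
data = [random.gauss(40, 10) for _ in range(10_000)]
m, s = mean(data), stdev(data)

def fraction_within(k):
    """Fraction of observations within k SDs of the mean."""
    return sum(abs(x - m) <= k * s for x in data) / len(data)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 3))
```

The printed fractions should sit near 2/3, 95% and almost all, respectively.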
NEXT TOPIC: the normal curve. (bell curve,
Gaussian)