Unit 1 Notes
What is Statistics?
Statistics is the set of methods for obtaining,
organizing, summarizing, presenting and analyzing
data.
U1-3
Some Definitions
A variable is a characteristic or property of an
individual.
§ Gender
§ Reason for taking this course
§ Favourite television show
§ Eye Colour
U1-5
Some Definitions
In some cases, there is a logical ordering to the values
of a categorical variable:
§ Placing in a hockey tournament (1st, 2nd, 3rd, etc.)
§ Service Rating at a restaurant (Good, Fair, Poor)
§ Letter Grade in a course (A+, …, F)
If ordering makes sense for the values of a categorical
variable, it is called categorical and ordinal.
Otherwise, it is called categorical and nominal
(like all variables on the previous slide).
U1-6
Some Definitions
Quantitative data represent values of quantitative
variables for which arithmetic operations such as
adding and averaging make sense:
U1-7
Classifying Data Types
Do the data take numerical values for which arithmetic
operations make sense?
§ Yes: Quantitative
§ No: Does ordering make sense for the values?
    § Yes: Categorical & Ordinal
    § No: Categorical & Nominal
U1-8
Example
State whether each of the following variables is
(i) Quantitative
(ii) Categorical and Nominal
(iii) Categorical and Ordinal
U1-9
Some Definitions
U1-11
Categorical Data
Pie charts give us a visual representation of the
relative frequency of the observed values for a
categorical variable.
[Pie chart with slices for Bloc Quebecois, Conservative, Liberal, NDP, Green and Other]
U1-12
Example
Test scores for a class of 40 chemistry students are
ordered and shown below:
31 37 40 44 49 50 51 53 56 56
62 64 67 67 68 68 69 70 71 72
73 73 74 75 77 78 78 81 82 84
84 87 89 89 92 92 94 95 96 98
U1-13
Quantitative Data
Consider the following frequency distribution for
the data:
Score Frequency
30-40 2
40-50 3
50-60 5
60-70 7
70-80 10
80-90 7
90-100 6
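The table above can be reproduced with a short Python sketch. The counts are consistent with each class including its lower endpoint and excluding its upper endpoint (so a score of 40 is counted in 40-50); that convention is inferred from the table rather than stated in the notes, and the function name `frequency_distribution` is only illustrative.

```python
# Chemistry test scores from the example above.
scores = [
    31, 37, 40, 44, 49, 50, 51, 53, 56, 56,
    62, 64, 67, 67, 68, 68, 69, 70, 71, 72,
    73, 73, 74, 75, 77, 78, 78, 81, 82, 84,
    84, 87, 89, 89, 92, 92, 94, 95, 96, 98,
]


def frequency_distribution(data, start, stop, width):
    """Count the values falling in each class [lower, lower + width)."""
    counts = {}
    for lower in range(start, stop, width):
        upper = lower + width
        # Assumed convention: lower endpoint included, upper endpoint excluded.
        counts[(lower, upper)] = sum(1 for x in data if lower <= x < upper)
    return counts


freq = frequency_distribution(scores, 30, 100, 10)
for (lower, upper), count in freq.items():
    print(f"{lower}-{upper}: {count}")
```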
U1-14
Quantitative Data
U1-15
Quantitative Data
U1-17
Quantitative Data
Note that there are in fact two types of quantitative
variables.
U1-19
Quantitative Data
By dividing the number of data values in each class
by the total number of data values (i.e., 40), we get
the relative frequency, or proportion of individuals
in each interval.
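As a small illustration, the relative frequencies for the test score classes can be computed as follows. The class counts are taken from the frequency distribution given earlier in these notes; the variable names are only illustrative.

```python
# Class frequencies for the 40 chemistry test scores.
frequencies = {
    "30-40": 2,
    "40-50": 3,
    "50-60": 5,
    "60-70": 7,
    "70-80": 10,
    "80-90": 7,
    "90-100": 6,
}

total = sum(frequencies.values())  # total number of data values

# Relative frequency = class frequency / total number of data values.
relative = {cls: count / total for cls, count in frequencies.items()}

for cls, proportion in relative.items():
    print(f"{cls}: {proportion:.3f}")
```

The relative frequencies always add up to 1, which gives a quick check on the arithmetic.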
U1-21
Quantitative Data
U1-23
Symmetry & Skewness
U1-24
Symmetry & Skewness
A distribution is said to be skewed to the right if the
right side of the histogram (the larger half of the data
values) extends much further out than the left side.
[Histogram of a right-skewed distribution; horizontal axis from 0 to 14]
U1-25
Symmetry & Skewness
If the histogram is displayed vertically, we must be
careful interpreting the shape:
[Figure: histograms displayed vertically; axis ticks from 0 to 14]
U1-26
Quantitative Data
U1-27
Quantitative Data
Stem and leaf plots are used for the same reasons as a
histogram.
U1-29
Example
5 04566799
6 0012233445566668889
7 0123377
U1-30
Example
5 04 (0, 1, 2, 3 or 4)
5 566799 (5, 6, 7, 8 or 9)
6 001223344
6 5566668889
7 01233
7 778
U1-31
Example
We may also encounter a problem where we have too
many stems. In this case we can trim the leaves,
which involves removing one or more digits from
each data value to make them less precise. Consider
the following stemplot for the annual salaries of 25
teachers (rounded down to the nearest thousand $):
3 478
4 0223457889
5 1256679
6 255
7 1
8 0
U1-32
Quantitative Data
Note that a histogram and a stemplot are very similar.
They both display the same information, except the
stemplot has the advantage of also displaying the data
values.
[Histogram of the teachers’ salaries; horizontal axis from 20 to 90 (thousands of dollars)]
3 478
4 0223457889
5 1256679
6 255
7 1
8 0
U1-33
Quantitative Data
Like a histogram, a stemplot enables us to describe
the shape of a data distribution:
3 17
4 049
5 01366
6 2477889
7 0123345788
8 1244799
9 224568
skewed to the left
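The stemplot above matches the 40 chemistry test scores listed earlier in these notes. A minimal Python sketch of how such a plot can be built is shown below; it assumes non-negative whole numbers with the tens digit as the stem and the ones digit as the leaf, and the function name `stemplot` is only illustrative.

```python
from collections import defaultdict

# Chemistry test scores from the earlier example.
scores = [
    31, 37, 40, 44, 49, 50, 51, 53, 56, 56,
    62, 64, 67, 67, 68, 68, 69, 70, 71, 72,
    73, 73, 74, 75, 77, 78, 78, 81, 82, 84,
    84, 87, 89, 89, 92, 92, 94, 95, 96, 98,
]


def stemplot(data):
    """Return one text row per stem: the stem, a space, then the ordered leaves."""
    leaves = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)  # tens digit(s) and ones digit
        leaves[stem].append(str(leaf))
    rows = []
    # Include every stem between the smallest and largest, so gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} {''.join(leaves.get(stem, []))}")
    return rows


rows = stemplot(scores)
print("\n".join(rows))
```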
U1-34
Quantitative Data
Like a histogram, a stemplot enables us to describe
the shape of a data distribution:
5 04
5 566799
6 001223344
6 5566668889
7 01233
7 778
approximately symmetric
U1-35
Quantitative Data
Like a histogram, a stemplot enables us to describe
the shape of a data distribution:
3 478
4 0223457889
5 1256679
6 255
7 1
8 0
skewed to the right
U1-36
Quantitative Data
Consider the following back-to-back stemplot, which
simultaneously displays the height distributions (in
inches) for a sample of males and females:
Females  Stem  Males
      1   6          (0 and 1)
     33   6          (2 and 3)
  55444   6    55    (4 and 5)
  77766   6    677   (6 and 7)
     98   6    9999  (8 and 9)
      0   7    0011
          7    223
          7    45
          7    6
          7    8
U1-37
Quantitative Data
A back-to-back stemplot allows us to simultaneously
display data for some variable for two different
samples. This allows us to compare the two
distributions.
Here we see that males are generally taller than
females. The distribution of heights for females is
approximately symmetric and the distribution of
heights for males is skewed to the right. The spread
of the distribution for males is higher than that for
females.
U1-38
Quantitative Data
Time plots are used for plotting time series data,
which are values for some variable measured over
time. Time is plotted on the x-axis, while variable
values are plotted on the y-axis. Data values are
represented by points, which are connected to better
illustrate the pattern or trend.
U1-39
Quantitative Data
The timeplot below displays the average monthly
temperature in Winnipeg over a one-year period.
[Time plot: average monthly temperature in Winnipeg, January to December]
This type of trend is called seasonal variation.
U1-40
Describing Distributions With Numbers
So far, we have seen some visual displays of data
using bar charts, frequency distributions, histograms,
stemplots and time plots.
U1-41
Describing Distributions With Numbers
Location is determined by where the center of our
data falls.
U1-43
Describing Distributions With Numbers
The median is the middle value in an ordered data
set. Half of the data values are as small or smaller
than the median and half of the values are as large or
larger.
U1-44
Describing Distributions With Numbers
To find the median:
§ Order the data from smallest to largest.
§ Count n, the number of data values, and compute (n + 1)/2.
§ Count (n + 1)/2 data values up from the lowest value.
U1-45
Describing Distributions With Numbers
If n is an odd number, the median is the data value in
position (n + 1)/2.
If n is an even number, (n + 1)/2 falls halfway between two
positions, and the median is the average of the two data
values on either side of it.
Note: The median is not equal to (n + 1)/2; it is the data
value in that position.
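The procedure for finding the median can be sketched in Python as follows. This is only an illustration of the (n + 1)/2 position rule; the function name `median` is our own, and the two data sets are the test scores and the five values 3, 8, 14, 15, 20 that appear in these notes.

```python
def median(data):
    """Median via the (n + 1)/2 position rule (positions counted from 1)."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2        # a whole number when n is odd
        return ordered[position - 1]   # convert to a 0-based index
    # n even: (n + 1)/2 lies halfway between positions n/2 and n/2 + 1.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


scores = [
    31, 37, 40, 44, 49, 50, 51, 53, 56, 56,
    62, 64, 67, 67, 68, 68, 69, 70, 71, 72,
    73, 73, 74, 75, 77, 78, 78, 81, 82, 84,
    84, 87, 89, 89, 92, 92, 94, 95, 96, 98,
]

print(median([3, 8, 14, 15, 20]))  # n = 5, position 3
print(median(scores))              # n = 40, average of positions 20 and 21
```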
U1-46
Example
For the earthquake data, n = 35, so the median is in
position (35 + 1)/2 = 18 of the ordered data.
U1-47
Example
For the test score data, n = 40, so the median is in
position (40 + 1)/2 = 20.5 of the ordered data.
The median is the average of the data values in
positions 20 and 21: (72 + 73)/2 = 72.5.
31 37 40 44 49 50 51 53 56 56
62 64 67 67 68 68 69 70 71 72
73 73 74 75 77 78 78 81 82 84
84 87 89 89 92 92 94 95 96 98
3, 8, 14, 15, 20
U1-49
Describing Distributions With Numbers
U1-51
Describing Distributions With Numbers
When our mean comes from a sample, we denote it
as $\bar{x}$. The sample mean is calculated as follows:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The Greek symbol sigma ($\Sigma$) is used to indicate a
summation. The formula tells us to add all the data
values $x_1 + x_2 + \cdots + x_n$ and then divide by the sample
size n.
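As a quick illustration of the formula, the sketch below computes the sample mean of the five values 3, 8, 14, 15, 20 used in these notes. The function name `sample_mean` is only illustrative.

```python
def sample_mean(data):
    """Add all the data values and divide by the sample size n."""
    n = len(data)
    if n == 0:
        raise ValueError("mean of an empty data set is undefined")
    return sum(data) / n


values = [3, 8, 14, 15, 20]
x_bar = sample_mean(values)  # (3 + 8 + 14 + 15 + 20) / 5 = 60 / 5
print(x_bar)
```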
U1-52
Example
3, 8, 14, 15, 20
U1-53
Example
U1-55
Describing Distributions With Numbers
[Dot plot: five data points (X) on a number line from 1 to 20, with the mean $\bar{x}$ marked]
This is where a teeter-totter would exactly balance if
five people of equal mass were sitting in the positions
of the five data points.
U1-56
Example
The mean age of the six males in a class is 23.2
years. The mean age of the four females in the class
is 21.7 years. What is the mean age for the whole
class?
U1-57
Solution
Since there are more males than females, we can’t
simply take the average of the two means. We note
that the mean age of the class is calculated as
$$\bar{x}_c = \frac{\sum x_c}{n_c} = \frac{\sum x_m + \sum x_f}{n_c}$$
U1-58
Solution
Now we must find the total ages of the males and the
females separately:
$$\bar{x}_m = \frac{\sum x_m}{n_m} \;\Rightarrow\; \sum x_m = n_m \bar{x}_m = 6(23.2) = 139.2$$
Similarly,
$$\sum x_f = n_f \bar{x}_f = 4(21.7) = 86.8$$
U1-59
Solution
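And so the mean age of the whole class is
$$\bar{x}_c = \frac{139.2 + 86.8}{10} = \frac{226}{10} = 22.6 \text{ years}$$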
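The class-mean calculation in this solution can be sketched in Python. The idea is to recover each group's total from its size and mean, add the totals, and divide by the combined size. The function name `combined_mean` is only illustrative.

```python
def combined_mean(groups):
    """Mean of several groups pooled together.

    `groups` is a list of (group size, group mean) pairs.
    """
    total = sum(n * mean for n, mean in groups)  # sum of all data values
    count = sum(n for n, _ in groups)            # combined sample size
    return total / count


# Six males with mean age 23.2 and four females with mean age 21.7.
class_mean = combined_mean([(6, 23.2), (4, 21.7)])
naive_average = (23.2 + 21.7) / 2  # ignores the unequal group sizes

print(round(class_mean, 1))
print(round(naive_average, 2))
```

The simple average of the two means differs from the class mean because it treats the two groups as if they were the same size.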
[Figure: a distribution for which mean = median = 4]
U1-61
Describing Distributions With Numbers
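In a skewed distribution, the mean is pulled toward
the tail, so it lies closer to the tail than the median does.
[Histogram of a skewed distribution with the mean and median marked; horizontal axis from 0 to 14]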
U1-62
Weighted Mean
In some cases, when calculating the mean, some data
values are given more weight than others. This may
be due to the fact that some values are observed more
frequently, or because some values are in some sense
more important than others. In such cases, we
calculate the weighted mean:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
U1-63
Example
A small restaurant has nine employees. Each of the two
chefs earns an annual salary of $45,000. Each of the six
servers earns $35,000. The restaurant manager earns
$60,000. What is the mean annual salary of all of the
restaurant’s employees?
U1-64
Solution
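We could calculate this using the formula for the
regular sample mean. We would add the nine data
values and divide by nine. It is simpler in this case to
use the formula for the weighted mean, with the number
of employees at each salary as the weights:
$$\bar{x}_w = \frac{2(45{,}000) + 6(35{,}000) + 1(60{,}000)}{2 + 6 + 1} = \frac{360{,}000}{9} = 40{,}000$$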
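As a rough check, the weighted mean for the restaurant salaries can be sketched in Python, using the number of employees at each salary as the weights. The function name `weighted_mean` is only illustrative.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


salaries = [45_000, 35_000, 60_000]  # chef, server, manager
counts = [2, 6, 1]                   # number of employees at each salary

mean_salary = weighted_mean(salaries, counts)

# The regular sample mean of all nine salaries gives the same answer.
all_salaries = [45_000] * 2 + [35_000] * 6 + [60_000]
regular_mean = sum(all_salaries) / len(all_salaries)

print(mean_salary, regular_mean)
```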
U1-65
Example
A student’s Grade Point Average is also calculated as
a weighted mean. For example, consider one student
who took the following courses one year:
Example
To calculate the student’s GPA for the year, we can’t
just take the regular average, because it does not
account for the fact that some courses (i.e., those
worth more credit hours) count more than others
towards a student’s GPA. Instead, we take the
weighted average, using credit hours as weights:
U1-67
Describing Distributions With Numbers
How do the following two distributions differ?
U1-68
Describing Distributions With Numbers
They are both approximately symmetric, and so the
mean and median are approximately equal, but the
variability differs.
U1-69
Describing Distributions With Numbers
R = maximum – minimum
U1-71
Describing Distributions With Numbers
The interquartile range of a data set measures the
length of an interval which covers the middle 50%
of the ordered observations. It does not consider
outliers (or in fact, any of the 50% of the observations
furthest in position from the median).
U1-72
Describing Distributions With Numbers
The first quartile Q1 of a data set is the ordered
observation such that at least 25% of the data values
are as small or smaller and at least 75% of the
values are as large or larger.
U1-73
Describing Distributions With Numbers
In general, the pth percentile of a data set is the
ordered observation such that at least p% of the
data values are as small or smaller and at least
(100 – p)% of the values are as large or larger.
U1-75
Example
Consider again the earthquake data:
U1-77
Example
The interquartile range is therefore
U1-79
Example
The five-number summary divides the data into four
equal parts:
[Diagram: the IQR covers the middle 50% of the data; the range covers 100%]
U1-80
Example
Consider again the test score data:
31 37 40 44 49 50 51 53 56 56
62 64 67 67 68 68 69 70 71 72
73 73 74 75 77 78 78 81 82 84
84 87 89 89 92 92 94 95 96 98
The minimum data value is 31 and the maximum is
98. We previously calculated the median to be 72.5.
U1-81
Example
There are 20 data values less than the median. Q1 is
therefore in position (20 + 1)/2 = 10.5 of the ordered
data, so we take the average of the 10th and 11th
ordered observations. Thus, Q1 = (56 + 62)/2 = 59.
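Similarly, there are 20 data values greater than the
median, so Q3 is the average of the 30th and 31st
ordered observations. Thus, Q3 = (84 + 84)/2 = 84.
IQR = Q3 – Q1 = 84 – 59 = 25
The five-number summary (minimum, Q1, median, Q3,
maximum) is:
31 59 72.5 84 98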
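The quartile calculation used here (Q1 is the median of the values below the median, Q3 the median of the values above it) can be sketched in Python. For an odd number of data values this sketch leaves the median itself out of both halves; that choice is an assumption, since these notes only work through the even case. The function names are illustrative.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum); needs at least two values."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n, the median is excluded from both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (
        ordered[0],
        median_of_sorted(lower),
        median_of_sorted(ordered),
        median_of_sorted(upper),
        ordered[-1],
    )


scores = [
    31, 37, 40, 44, 49, 50, 51, 53, 56, 56,
    62, 64, 67, 67, 68, 68, 69, 70, 71, 72,
    73, 73, 74, 75, 77, 78, 78, 81, 82, 84,
    84, 87, 89, 89, 92, 92, 94, 95, 96, 98,
]

minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1
data_range = maximum - minimum
print(minimum, q1, med, q3, maximum, iqr, data_range)
```

Statistical software often uses slightly different quartile rules, so its output may not agree exactly with this hand method.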
U1-83
Describing Distributions With Numbers
The five-number summary describes the center, shape
and spread of our data. We can use the five-number
summary to get a “picture” of the data.
U1-84
Boxplots
Below the histogram we see a boxplot for the earthquake
data. This type of boxplot is called a quantile boxplot.
U1-85
Boxplots
A quantile boxplot consists of:
§ a line at the median
§ a box that covers the IQR
§ lines (“whiskers”) that extend from the box out
to the minimum and maximum
U1-87
Boxplots
Like a histogram, a boxplot enables us to characterize
the shape of a distribution:
[Boxplot: a short left half and a long right half, each containing 50% of the data values]
Half of the data values fall within a relatively short
interval on the left, while the other half covers a large
interval on the right. This distribution is skewed to
the right.
U1-88
Boxplots
Analogous to our use of back-to-back stemplots, we
can create side-by-side boxplots:
U1-89
Boxplots
The side-by-side boxplots for the height data are
shown below:
U1-90
Boxplots
We see that the median height for males is higher
than that for females.
U1-91
Describing Distributions With Numbers
As we have seen, extreme observations can affect our
interpretation of the data if a numerical measure is
not resistant to the effect of outliers.
U1-93
Describing Distributions With Numbers
We would like to construct a modified boxplot that
takes these extreme observations into account.
The middle three lines (the box) are the same for an
outlier boxplot: Q1, median, Q3.
U1-95
Describing Distributions With Numbers
Instead, we construct a “fence” with lower (LF) and
upper (UF) values:
U1-97
Describing Distributions With Numbers
An outlier is considered to be any data point falling
outside the fences (i.e., less than the lower fence or
greater than the upper fence). As such, the lines
extend out to the “new” minimum and maximum,
after we remove the outliers from consideration.
Consider the following ordered data set:
6 9 10 10 12 13
14 14 15 15 17 28
The five-number summary is:
6 10 13.5 15 28
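The outlier rule can be sketched in Python for the twelve data values above. The fence formulas are not reproduced in these notes, so the sketch assumes the common convention LF = Q1 – 1.5 × IQR and UF = Q3 + 1.5 × IQR; the multiplier 1.5 and the function name `fences` are assumptions, not taken from the notes.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper fences, assuming the usual k = 1.5 multiplier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [6, 9, 10, 10, 12, 13, 14, 14, 15, 15, 17, 28]
q1, q3 = 10, 15  # quartiles from the five-number summary 6, 10, 13.5, 15, 28

lower_fence, upper_fence = fences(q1, q3)

# Outliers are the data points falling outside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The whiskers extend to the "new" minimum and maximum once outliers are set aside.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```

Under this assumed rule, 28 is the only outlier, and the right whisker stops at 17 instead of extending to 28.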
U1-99
Example
We calculate the fences as follows:
U1-101
Example
The outlier boxplot for these data is shown below:
[Outlier boxplot with LF, Q1, median, Q3 and UF labelled; axis from 0 to 30]
U1-102
Example
Note that it is not necessary to include the labels or to
draw the fences when constructing the modified
boxplot.
[Modified boxplot without labels or fences; axis from 0 to 30]
U1-103
Example
Consider what would happen if we had instead
constructed a quantile boxplot for these data:
[Quantile boxplot for the same data; axis from 0 to 30]
U1-104
Example
Not only would we be given the impression that there
is high variability in the data…
[Quantile boxplot; axis from 0 to 30]
U1-105
Example
but we may also conclude that the distribution is
skewed to the right…
[Quantile boxplot; axis from 0 to 30]
U1-106
Example
when we see from the modified boxplot that it is in
fact skewed to the left…
[Modified boxplot; axis from 0 to 30]
U1-107
Describing Distributions With Numbers
Note that we only had a problem with the quantile
boxplot when there were extreme values in our data.
When there are no outliers, the quantile boxplot is
identical to the outlier boxplot (since the “new”
minimum and maximum are the same as the “old”
minimum and maximum).
U1-108
Describing Distributions With Numbers
Back to the idea of variability…
U1-109
Describing Distributions With Numbers
The variance of a set of data values (also referred to
as the sample variance) is defined as:
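$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The standard deviation is the square root of the variance:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$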
U1-111
Describing Distributions With Numbers
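To calculate the variance $s^2$ for a sample of size n:
1) Calculate the sample mean $\bar{x}$.
2) Calculate the deviation $x_i - \bar{x}$ for each data value.
3) Square each deviation to get $(x_i - \bar{x})^2$.
4) Add up the squared deviations.
5) Divide the sum by n – 1:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$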
U1-112
Example
The number of goals scored by each of the five NHL
Pacific Division teams in the 2009/10 regular season
is shown below:
Team Goals
238
237
241
225
264
U1-113
Describing Distributions With Numbers
We calculate 1) $\bar{x} = 241$

$x_i$    2) $x_i - \bar{x}$    3) $(x_i - \bar{x})^2$
238          –3                    9
237          –4                   16
241           0                    0
225         –16                  256
264          23                  529
          sum = 0          4) sum = 810
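The table above can be reproduced step by step in Python, and the remaining step (dividing the sum of squared deviations by n – 1) then gives the variance and the standard deviation. This is only a sketch of the hand calculation; the variable names are illustrative.

```python
import math

goals = [238, 237, 241, 225, 264]
n = len(goals)

mean = sum(goals) / n                        # 1) sample mean
deviations = [x - mean for x in goals]       # 2) deviations from the mean
squared = [d ** 2 for d in deviations]       # 3) squared deviations
sum_of_squares = sum(squared)                # 4) sum of squared deviations

variance = sum_of_squares / (n - 1)          # divide by n - 1, not n
std_dev = math.sqrt(variance)                # square root of the variance

print(mean, sum(deviations), sum_of_squares)
print(variance, round(std_dev, 2))
```

The deviations from the mean always add up to zero, which is why they are squared before being summed.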
U1-115
Describing Distributions With Numbers
Note that we calculated $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ for these data.
The units for $\bar{x}$ are the same as the units for the
individual x’s.
U1-117
Describing Distributions With Numbers
The units for variance are the squared units of the
individual observations.
U1-119
Describing Distributions With Numbers
We do, however, need to be careful when interpreting
the meaning of the standard deviation.
U1-121
Describing Distributions With Numbers
Suppose we have some other variable Y that shares a
linear relationship with the variable X, i.e.,
$$y = a + bx$$
Then the mean and standard deviation of Y are
$$\bar{y} = a + b\bar{x} \quad \text{and} \quad s_y = |b| \, s_x$$
U1-122
Example
The daily high temperature (in °C) is recorded each
day for one week in Grand Forks, North Dakota:
U1-123
Example
Now suppose we want to express the data in °F.
The relationship between the Fahrenheit and Celsius
scales is
y = 32 + 1.8x, where a = 32 and b = 1.8.
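The effect of a linear transformation y = a + bx on the mean and standard deviation can be checked numerically. The Celsius readings in the sketch below are hypothetical values made up for illustration (they are not the Grand Forks data); the check is that the mean of y equals a + b times the mean of x, and that the standard deviation of y equals |b| times the standard deviation of x.

```python
import math
import statistics

# Hypothetical daily highs in degrees Celsius, for illustration only.
celsius = [10.0, 12.0, 15.0, 11.0, 9.0, 14.0, 13.0]

a, b = 32, 1.8
fahrenheit = [a + b * x for x in celsius]  # y = a + bx for every data value

mean_c = statistics.mean(celsius)
sd_c = statistics.stdev(celsius)      # sample standard deviation (n - 1)
mean_f = statistics.mean(fahrenheit)
sd_f = statistics.stdev(fahrenheit)

# Adding a shifts the mean but not the spread; multiplying by b scales both.
mean_matches = math.isclose(mean_f, a + b * mean_c)
sd_matches = math.isclose(sd_f, abs(b) * sd_c)

print(mean_matches, sd_matches)
```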
U1-125
Describing Distributions With Numbers
We have seen two different numerical summaries for
a given data set:
U1-127
Describing Distributions With Numbers
However, for skewed distributions, or in the presence
of strong outliers, the five-number summary should
be reported, since the mean and standard deviation are
strongly affected by extreme values.