Unit 1 Notes

Uploaded by

Hongru Li

U1-1

What is Statistics?
Statistics is the set of methods for obtaining,
organizing, summarizing, presenting and analyzing
data.

Our data come from characteristics measured on
individuals or units. These can be people, animals,
places, things, etc.

The population is the totality of individuals about
which we want information.
U1-2
Some Definitions
A sample is a subset of the units in a population that
we actually examine in order to gather information.

§ 1000 voters are asked which candidate they
support in the upcoming election.
§ 50 arthritis patients are given a new treatment.
§ 20,000 televisions in the U.S. are monitored for
Nielsen Ratings.

What is the population in each of these cases?

U1-3
Some Definitions
A variable is a characteristic or property of an
individual.

§ Time until a light bulb burns out
§ Heart Rate of smokers vs. non-smokers
§ Number of Heads in five tosses of a quarter
§ Hair Colour
§ Your Grade in this course
U1-4
Some Definitions
There are two broad classifications of data:

Categorical data represent values of categorical
variables (also called qualitative variables) that
place individuals into one of several groups.

§ Gender
§ Reason for taking this course
§ Favourite television show
§ Eye Colour

U1-5
Some Definitions
In some cases, there is a logical ordering to the values
of a categorical variable:
§ Placing in a hockey tournament (1st, 2nd, 3rd, etc.)
§ Service Rating at a restaurant (Good, Fair, Poor)
§ Letter Grade in a course (A+, …, F)
If ordering makes sense for the values of a categorical
variable, it is called categorical and ordinal.
Otherwise, it is called categorical and nominal
(like all variables on the previous slide).
U1-6
Some Definitions
Quantitative data represent values of quantitative
variables for which arithmetic operations such as
adding and averaging make sense:

§ Final Exam Score
§ Height
§ Volume of air in a balloon

U1-7
Classifying Data Types
Do the data take numerical values for which arithmetic
operations make sense?

Yes: Quantitative
No: Does it make sense to put the data in order?
    Yes: Categorical & Ordinal
    No: Categorical & Nominal
U1-8
Example
State whether each of the following variables is
(i) Quantitative
(ii) Categorical and Nominal
(iii) Categorical and Ordinal

(a) Political Party you support:
Liberal, Conservative, NDP, Green, etc.
(b) Speed of a passing vehicle
(c) Month of birth

U1-9
Some Definitions

The distribution of a data set tells us what values a
variable takes and how often it takes these values.

Note: In this sense, “value” doesn’t necessarily have
to be quantitative. For example, “Female” is a value
of the variable Gender.

There are several techniques we can use to display
data in order to better study their features.
U1-10
Categorical Data
Bar charts display variable values on one axis and
frequencies on the other.

Note: There are spaces between bars so as not to
imply continuity.

[Bar chart: % of votes in the 2008 election (vertical axis, 0 to 40)
by party (Bloc Quebecois, Conservative, Liberal, NDP, Green, Other)]

U1-11
Categorical Data
Pie charts give us a visual representation of the
relative frequency of the observed values for a
categorical variable.

[Pie chart with one slice per party: Bloc Quebecois, Conservative,
Liberal, NDP, Green, Other]
U1-12
Example
Test scores for a class of 40 chemistry students are
ordered and shown below:

31 37 40 44 49 50 51 53 56 56
62 64 67 67 68 68 69 70 71 72
73 73 74 75 77 78 78 81 82 84
84 87 89 89 92 92 94 95 96 98

U1-13
Quantitative Data
Consider the following frequency distribution for
the data:
Score Frequency
30-40 2
40-50 3
50-60 5
60-70 7
70-80 10
80-90 7
90-100 6
U1-14
Quantitative Data

A frequency distribution is a count of how many of
our data values fall into various predetermined
classes or intervals.

We choose the intervals ourselves. There is no
correct choice of intervals, but we construct them in
such a manner that we get a “nice” picture of our
data. We usually use about 5 – 10 intervals.

U1-15
Quantitative Data

Our first interval must include the lowest data value
(the minimum) and our last interval must include the
highest data value (the maximum). All intervals
should be of equal length.

Note that each interval includes the left endpoint but
not the right, i.e., 40 is included in the second
interval, not the first.
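
The counting rule can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_distribution` and its parameters are made up for this sketch, and the intervals are assumed to be of equal width with the left endpoint included.

```python
# Test scores from the example above.
scores = [31, 37, 40, 44, 49, 50, 51, 53, 56, 56,
          62, 64, 67, 67, 68, 68, 69, 70, 71, 72,
          73, 73, 74, 75, 77, 78, 78, 81, 82, 84,
          84, 87, 89, 89, 92, 92, 94, 95, 96, 98]

def frequency_distribution(data, start, width, n_classes):
    """Count values in n_classes intervals [lo, lo + width).

    Hypothetical helper: each interval includes its left endpoint
    but not its right endpoint.
    """
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)  # index of the interval containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

counts = frequency_distribution(scores, start=30, width=10, n_classes=7)
for k, c in enumerate(counts):
    lo = 30 + 10 * k
    print(f"{lo}-{lo + 10}: {c}")
```

A score of exactly 40 lands in the interval 40-50 because the index is computed with floor division from the left endpoint.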


U1-16
Quantitative Data
To avoid having to decide which interval an endpoint
belongs to, we could create non-overlapping intervals,
i.e., 30-39, 40-49, etc., but we would like the
frequency distribution to reflect the continuity of
the data.

For example, if a student got a score of 59.5, there
would be no interval containing this value.

U1-17
Quantitative Data
Note that there are in fact two types of quantitative
variables.

A continuous variable can take any value within a
given range. Examples of continuous variables
include weight, age and distance.

A discrete variable can only take a countable
number of values. Examples of discrete variables
include the number of children in a family, the
number of days of rain in a month, and the highest
denomination of bill in someone’s wallet.
U1-18
Quantitative Data
A frequency distribution can be converted into a
relative frequency distribution as follows:
Score Frequency Relative Frequency
30-40 2 2/40 = 0.050
40-50 3 3/40 = 0.075
50-60 5 5/40 = 0.125
60-70 7 7/40 = 0.175
70-80 10 10/40 = 0.250
80-90 7 7/40 = 0.175
90-100 6 6/40 = 0.150
sum = 40 sum = 1

U1-19
Quantitative Data
By dividing the number of data values in each class
by the total number of data values (i.e., 40), we get
the relative frequency, or proportion of individuals
in each interval.

Proportions are values between 0 and 1, inclusive,
and are decimal representations of fractions.
Proportions convert to percentages when multiplied
by 100. Note that the proportions for all intervals
must add up to 1, since 100% of the data values are
counted in this distribution.
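
As a small illustration, the conversion from counts to proportions might look like this in Python. The dictionary layout and variable names are assumptions of this sketch; the counts are the ones from the frequency table for the test score data.

```python
# Frequencies from the test score table (class label -> count).
frequencies = {"30-40": 2, "40-50": 3, "50-60": 5, "60-70": 7,
               "70-80": 10, "80-90": 7, "90-100": 6}

total = sum(frequencies.values())  # total number of data values

# Relative frequency = class count / total count.
relative = {label: count / total for label, count in frequencies.items()}

for label, p in relative.items():
    print(f"{label}: {p:.3f} ({100 * p:.1f}%)")
```

The proportions add up to 1, up to floating-point rounding.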
U1-20
Quantitative Data

If a frequency distribution is a representation of
quantitative data, we can construct a more useful and
more commonly used display known as a histogram.

Histograms are useful when we need to work with a
large amount of data. They are graphical displays of
the count (or proportion) of data values falling into
each of several intervals.

U1-21
Quantitative Data

The following is a histogram for the test score data:


U1-22
Quantitative Data

A histogram is a form of bar graph (with no spaces
between bars so as to reflect the continuity of the
data). The base of each rectangle represents the
length of the interval (all intervals should be of equal
length). The height represents the frequency (or
relative frequency) of data values falling in the
corresponding interval.

U1-23
Symmetry & Skewness

A distribution is said to be symmetric if its center
divides it into two approximate mirror images.

[Histogram: symmetric distribution, horizontal axis from 2 to 8]
U1-24
Symmetry & Skewness
A distribution is said to be skewed to the right if the
right side of the histogram (the larger half of the data
values) extends much further out than the left side.

[Histogram: right-skewed distribution, horizontal axis from 0 to 14]

The definition of left skewness follows analogously.

U1-25
Symmetry & Skewness
If the histogram is displayed vertically, we must be
careful interpreting the shape:
[Figure: two histograms drawn with the value axis (0 to 14) running vertically]
U1-26
Quantitative Data

The following is a stem and leaf plot of the test
score data:
3 17
4 049
5 01366
6 2477889
7 0123345788
8 1244799
9 224568

U1-27
Quantitative Data
Stem and leaf plots are used for the same reasons as a
histogram.

Numbers are considered to consist of a stem and a
leaf. The leaf consists of only one digit (the last one)
and the stem consists of everything else. Note that
the leaves are displayed in numerical order.

§ 36 consists of stem 3 and leaf 6
§ 102 consists of stem 10 and leaf 2
§ 5.86 consists of stem 58 and leaf 6
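
A minimal Python sketch of the stem/leaf split for whole-number data follows. The function name `stemplot` is hypothetical, and the sketch skips stems that have no leaves, which a hand-drawn stemplot would normally still show.

```python
from collections import defaultdict

scores = [31, 37, 40, 44, 49, 50, 51, 53, 56, 56,
          62, 64, 67, 67, 68, 68, 69, 70, 71, 72,
          73, 73, 74, 75, 77, 78, 78, 81, 82, 84,
          84, 87, 89, 89, 92, 92, 94, 95, 96, 98]

def stemplot(values):
    """Return {stem: leaves} for whole numbers; the leaf is the last digit."""
    rows = defaultdict(str)
    for v in sorted(values):          # sorting keeps leaves in numerical order
        stem, leaf = divmod(v, 10)    # e.g. 36 -> stem 3, leaf 6
        rows[stem] += str(leaf)
    return dict(rows)

rows = stemplot(scores)
for stem in sorted(rows):
    print(stem, rows[stem])
```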
U1-28
Example
The California Seismic Safety Commission maintains
a list of significant damaging earthquakes in the state.
The sizes of the last 35 earthquakes (measured on the
Richter scale) are ordered and shown below:

5.0 5.4 5.5 5.6 5.6 5.7 5.9
5.9 6.0 6.0 6.1 6.2 6.2 6.3
6.3 6.4 6.4 6.5 6.5 6.6 6.6
6.6 6.6 6.8 6.8 6.8 6.9 7.0
7.1 7.2 7.3 7.3 7.7 7.7 7.8

U1-29
Example

A regular stemplot for these data does not give us a
“nice” picture (i.e., we don’t get much information
about the distribution):

5 04566799
6 0012233445566668889
7 01233778
U1-30
Example

The problem is that we have too few stems. We can
solve this problem by splitting the stems:

5 04 (leaves 0, 1, 2, 3 or 4)
5 566799 (leaves 5, 6, 7, 8 or 9)
6 001223344
6 5566668889
7 01233
7 778
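
Splitting stems can be sketched in Python as follows. The helper `split_stemplot` is a made-up name, and the sketch assumes data recorded to one decimal place, so that the leaf is the tenths digit.

```python
quakes = [5.0, 5.4, 5.5, 5.6, 5.6, 5.7, 5.9,
          5.9, 6.0, 6.0, 6.1, 6.2, 6.2, 6.3,
          6.3, 6.4, 6.4, 6.5, 6.5, 6.6, 6.6,
          6.6, 6.6, 6.8, 6.8, 6.8, 6.9, 7.0,
          7.1, 7.2, 7.3, 7.3, 7.7, 7.7, 7.8]

def split_stemplot(values):
    """Return (stem, leaves) rows, with each stem split into 0-4 and 5-9."""
    rows = {}
    for v in sorted(values):
        tenths = round(v * 10)             # 6.5 -> 65, avoids float digit issues
        stem, leaf = divmod(tenths, 10)    # 65 -> stem 6, leaf 5
        half = 0 if leaf <= 4 else 1       # lower or upper half of the stem
        rows[(stem, half)] = rows.get((stem, half), "") + str(leaf)
    return [(stem, leaves) for (stem, half), leaves in sorted(rows.items())]

for stem, leaves in split_stemplot(quakes):
    print(stem, leaves)
```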

U1-31
Example
We may also encounter a problem where we have too
many stems. In this case we can trim the leaves,
which involves removing one or more digits from
each data value to make them less precise. Consider
the following stemplot for the annual salaries of 25
teachers (rounded down to the nearest thousand $):
3 478
4 0223457889
5 1256679
6 255
7 1
8 0
U1-32
Quantitative Data
Note that a histogram and a stemplot are very similar.
They both display the same information, except the
stemplot has the advantage of also displaying the data
values.

[Figure: a histogram shown alongside the stemplot below]

3 478
4 0223457889
5 1256679
6 255
7 1
8 0

U1-33
Quantitative Data
Like a histogram, a stemplot enables us to describe
the shape of a data distribution:

3 17
4 049
5 01366
6 2477889
7 0123345788
8 1244799
9 224568
skewed to the left
U1-34
Quantitative Data
Like a histogram, a stemplot enables us to describe
the shape of a data distribution:

5 04
5 566799
6 001223344
6 5566668889
7 01233
7 778
approximately symmetric

U1-35
Quantitative Data
Like a histogram, a stemplot enables us to describe
the shape of a data distribution:

3 478
4 0223457889
5 1256679
6 255
7 1
8 0
skewed to the right
U1-36
Quantitative Data
Consider the following back-to-back stemplot, which
simultaneously displays the height distributions (in
inches) for a sample of males and females:

Females  Stem  Males
      1   6          (leaves 0 and 1)
     33   6          (leaves 2 and 3)
  55444   6    55    (leaves 4 and 5)
  77766   6    677   (leaves 6 and 7)
     98   6    9999  (leaves 8 and 9)
      0   7    0011
          7    223
          7    45
          7    6
          7    8

U1-37
Quantitative Data
A back-to-back stemplot allows us to simultaneously
display data for some variable for two different
samples. This allows us to compare the two
distributions.
Here we see that males are generally taller than
females. The distribution of heights for females is
approximately symmetric and the distribution of
heights for males is skewed to the right. The spread
of the distribution for males is higher than that for
females.
U1-38
Quantitative Data
Time plots are used for plotting time series data,
which are values for some variable measured over
time. Time is plotted on the x-axis, while variable
values are plotted on the y-axis. Data values are
represented by points, which are connected to better
illustrate the pattern or trend.

U1-39
Quantitative Data
The time plot below displays the average monthly
temperature in Winnipeg over a one-year period.

[Time plot: average monthly temperature (vertical axis) against
month, Jan to Dec (horizontal axis)]

This type of trend is called seasonal variation.
U1-40
Describing Distributions With Numbers
So far, we have seen some visual displays of data
using bar charts, frequency distributions, histograms,
stemplots and time plots.

We will now examine some numerical summaries of
a data set.

Two important features of a data set are its location
and variability.

U1-41
Describing Distributions With Numbers
Location is determined by where the center of our
data falls.

We look at three measures of location:

The mode is the most frequently observed data value.


U1-42
Describing Distributions With Numbers
The mode for the earthquake data is 6.6, as it is
observed four times, more than any other value.

5.0 5.4 5.5 5.6 5.6 5.7 5.9
5.9 6.0 6.0 6.1 6.2 6.2 6.3
6.3 6.4 6.4 6.5 6.5 6.6 6.6
6.6 6.6 6.8 6.8 6.8 6.9 7.0
7.1 7.2 7.3 7.3 7.7 7.7 7.8
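
One possible way to find the mode in Python is to tally the values with the standard library's `collections.Counter`. The variable names are assumptions of this sketch; if several values were tied for most frequent, this code would report only one of them.

```python
from collections import Counter

quakes = [5.0, 5.4, 5.5, 5.6, 5.6, 5.7, 5.9,
          5.9, 6.0, 6.0, 6.1, 6.2, 6.2, 6.3,
          6.3, 6.4, 6.4, 6.5, 6.5, 6.6, 6.6,
          6.6, 6.6, 6.8, 6.8, 6.8, 6.9, 7.0,
          7.1, 7.2, 7.3, 7.3, 7.7, 7.7, 7.8]

# most_common(1) returns a list with the single (value, count) pair
# that has the highest count.
mode_value, mode_count = Counter(quakes).most_common(1)[0]
print(mode_value, mode_count)
```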

U1-43
Describing Distributions With Numbers
The median is the middle value in an ordered data
set. Half of the data values are as small or smaller
than the median and half of the values are as large or
larger.
U1-44
Describing Distributions With Numbers
To find the median:
§ Order the data from smallest to largest.
§ Count n, the number of data values, and compute
(n + 1)/2.
§ Count (n + 1)/2 data values up from the lowest value.

U1-45
Describing Distributions With Numbers
If n is an odd number, the median is the data value in
position (n + 1)/2.
If n is an even number, (n + 1)/2 is not a whole
number, and the median is the average of the two
values in the positions on either side of it, i.e.,
positions n/2 and n/2 + 1.
Note: The median is not equal to (n + 1)/2; it is the
data value in that position.
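
The position rule can be written as a short Python function. This is a sketch under the assumption of a non-empty list of numbers; the function name `median_by_position` is made up here, and Python's standard library already offers `statistics.median` for everyday use.

```python
def median_by_position(data):
    """Median via the (n + 1)/2 position rule; data must be non-empty."""
    ordered = sorted(data)            # order from smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2       # 1-based position of the middle value
        return ordered[position - 1]  # Python lists are 0-based
    # n even: average the values in positions n/2 and n/2 + 1
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median_by_position([3, 8, 14, 15, 20]))   # odd n
print(median_by_position([1, 2, 3, 4]))         # even n
```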
U1-46
Example
For the earthquake data, n = 35, so the median is in
position (35 + 1)/2 = 18 of the ordered data.

5.0 5.4 5.5 5.6 5.6 5.7 5.9
5.9 6.0 6.0 6.1 6.2 6.2 6.3
6.3 6.4 6.4 6.5 6.5 6.6 6.6
6.6 6.6 6.8 6.8 6.8 6.9 7.0
7.1 7.2 7.3 7.3 7.7 7.7 7.8

The median is therefore equal to 6.5.

U1-47
Example
For the test score data, n = 40, so the median is in
position (40 + 1)/2 = 20.5 of the ordered data.
The median is the average of the data values in
positions 20 and 21.

31 37 40 44 49 50 51 53 56 56
62 64 67 67 68 68 69 70 71 72
73 73 74 75 77 78 78 81 82 84
84 87 89 89 92 92 94 95 96 98

The median is therefore equal to (72 + 73)/2 = 72.5.


U1-48
Example
Consider the small data set

3, 8, 14, 15, 20

The median is 14, located in the third position.

Now consider the adjusted data set

3, 8, 14, 15, 200

Despite the extreme value, the median is still 14!

U1-49
Describing Distributions With Numbers

Extreme values (also known as outliers) do not
affect the value of the median.

For this reason we say the median is resistant to the
effect of outliers.
U1-50
Describing Distributions With Numbers

The third and most commonly used measure of center
is the mean.

The mean, or average, is found by adding all of the
n data values and then dividing by n.

U1-51
Describing Distributions With Numbers
When our mean comes from a sample, we denote it
as $\bar{x}$. The sample mean is calculated as follows:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

The Greek symbol sigma ($\Sigma$) is used to indicate a
summation. The formula tells us to add all the data
values $x_1 + x_2 + \cdots + x_n$ and then divide by the sample
size n.
U1-52
Example

Consider again the small data set

3, 8, 14, 15, 20

The sample mean is equal to

\[ \bar{x} = \frac{\sum_{i=1}^{5} x_i}{n} = \frac{3 + 8 + 14 + 15 + 20}{5} = \frac{60}{5} = 12 \]

U1-53
Example

Now consider again the adjusted data set

3, 8, 14, 15, 200

The sample mean is equal to

\[ \bar{x} = \frac{\sum_{i=1}^{5} x_i}{n} = \frac{3 + 8 + 14 + 15 + 200}{5} = \frac{240}{5} = 48 \]
U1-54
Describing Distributions With Numbers

The value of the mean is strongly affected by the
presence of the extreme value.

Therefore, the sample mean is not resistant to
outliers.
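
The contrast between the two measures can be checked with Python's standard `statistics` module. The variable names are assumptions of this sketch; the two data sets are the ones from the examples above.

```python
from statistics import mean, median

original = [3, 8, 14, 15, 20]
adjusted = [3, 8, 14, 15, 200]   # same data, largest value replaced by 200

# The median looks only at the middle position, so it does not move.
print(median(original), median(adjusted))

# The mean uses every value, so the single extreme value pulls it upward.
print(mean(original), mean(adjusted))
```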

The median measures the center of the data because it
divides the data set into two halves of equal size. But
how is the mean a measure of center?

U1-55
Describing Distributions With Numbers

The mean is the “center of mass” or the “balance
point” of the data:

[Dot plot: five data points marked X on a number line from 1 to 20,
with the mean $\bar{x}$ marked below the axis]
This is where a teeter-totter would exactly balance if
five people of equal mass were sitting in the positions
of the five data points.
U1-56
Example
The mean age of the six males in a class is 23.2
years. The mean age of the four females in the class
is 21.7 years. What is the mean age for the whole
class?

U1-57
Solution
Since there are more males than females, we can’t
simply take the average of the two means. We note
that the mean age of the class is calculated as

\[ \bar{x}_c = \frac{\sum x_c}{n_c} = \frac{\sum x_m + \sum x_f}{n_c} \]
U1-58
Solution
Now we must find the total ages of the males and the
females separately:
\[ \bar{x}_m = \frac{\sum x_m}{n_m} \;\Rightarrow\; \sum x_m = n_m \bar{x}_m = 6(23.2) = 139.2 \]

Similarly,

\[ \sum x_f = n_f \bar{x}_f = 4(21.7) = 86.8 \]

U1-59
Solution
And so

\[ \bar{x}_c = \frac{\sum x_c}{n_c} = \frac{\sum x_m + \sum x_f}{n_c} = \frac{139.2 + 86.8}{10} = \frac{226}{10} = 22.6 \]
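
The same calculation can be sketched in Python for any number of groups. The function name `combined_mean` and the (size, mean) pair layout are assumptions of this sketch.

```python
def combined_mean(groups):
    """Overall mean from a list of (group size, group mean) pairs."""
    total = sum(n * m for n, m in groups)   # recover each group's sum as n * mean
    count = sum(n for n, _ in groups)       # total number of individuals
    return total / count

# Six males with mean age 23.2, four females with mean age 21.7.
class_mean = combined_mean([(6, 23.2), (4, 21.7)])
simple_average = (23.2 + 21.7) / 2          # ignores the unequal group sizes

print(round(class_mean, 2), round(simple_average, 2))
```

The simple average of the two means comes out lower than the class mean because it gives the smaller group of females as much say as the larger group of males.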
U1-60
Describing Distributions With Numbers
In a symmetric distribution, the mean and median
are equal.

mean = median = 4

U1-61
Describing Distributions With Numbers
In a skewed distribution, the mean is closer to
the tail.

[Histogram: right-skewed distribution, horizontal axis from 0 to 14,
with the positions of the median and the mean marked]
U1-62
Weighted Mean
In some cases, when calculating the mean, some data
values are given more weight than others. This may
be due to the fact that some values are observed more
frequently, or because some values are in some sense
more important than others. In such cases, we
calculate the weighted mean:
\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

where $w_i$ is the weight given to the ith data value.

U1-63
Example
A small restaurant has nine employees. The annual
salary of the two chefs is $45,000. The annual salary
of the six servers is $35,000. The annual salary of
the restaurant manager is $60,000. What is the mean
annual salary of all of the restaurant’s employees?
U1-64

Solution
We could calculate this using the formula for the
regular sample mean. We would add the nine data
values and divide by nine. It is simpler in this case to
use the formula for the weighted mean:

\[ \bar{x}_w = \frac{2(45{,}000) + 6(35{,}000) + 1(60{,}000)}{2 + 6 + 1} = \frac{90{,}000 + 210{,}000 + 60{,}000}{9} = \frac{360{,}000}{9} = \$40{,}000 \]

U1-65

Example
A student’s Grade Point Average is also calculated as
a weighted mean. For example, consider one student
who took the following courses one year:

Course       Credit Hours   Grade
STAT 1000    3              B+
PHED 3840    2              A+
PHIL 1200    6              C+
HIST 2050    3              B
U1-66

Example
To calculate the student’s GPA for the year, we can’t
just take the regular average, because it does not
account for the fact that some courses (i.e., those
worth more credit hours) count more than others
towards a student’s GPA. Instead, we take the
weighted average, using credit hours as weights:

\[ \text{GPA} = \bar{x}_w = \frac{3(3.5) + 2(4.5) + 6(2.5) + 3(3.0)}{3 + 2 + 6 + 3} = \frac{43.5}{14} = 3.107 \]
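
Both weighted-mean examples can be reproduced with one small Python function. The name `weighted_mean` is made up for this sketch, and the grade points (B+ = 3.5, A+ = 4.5, C+ = 2.5, B = 3.0) are read off the calculation above rather than from any official scale.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Restaurant salaries: weights are the number of employees at each salary.
salary = weighted_mean([45_000, 35_000, 60_000], [2, 6, 1])

# GPA: weights are credit hours, values are grade points.
gpa = weighted_mean([3.5, 4.5, 2.5, 3.0], [3, 2, 6, 3])

print(salary, round(gpa, 3))
```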

U1-67
Describing Distributions With Numbers
How do the following two distributions differ?
U1-68
Describing Distributions With Numbers
They are both approximately symmetric, and so the
mean and median are approximately equal, but the
variability differs.

U1-69
Describing Distributions With Numbers

One simple measure of variability, or spread, is the
range of the data

R = maximum – minimum

More than any descriptive measure we’ve seen so
far, R is most heavily affected by extreme values
(outliers).
U1-70
Describing Distributions With Numbers
In many cases, outliers arise from errors in
measurement. In other cases, the outliers may be
legitimate observations, but we may not be interested
in including these extreme values in our numerical
summary of the data. We would like a measure of
spread that excludes outliers.

U1-71
Describing Distributions With Numbers
The interquartile range of a data set measures the
length of an interval which covers the middle 50%
of the ordered observations. It does not consider
outliers (or in fact, any of the 50% of the observations
furthest in position from the median).
U1-72
Describing Distributions With Numbers
The first quartile Q1 of a data set is the ordered
observation such that at least 25% of the data values
are as small or smaller and at least 75% of the
values are as large or larger.

The third quartile Q3 of a data set is the ordered
observation such that at least 75% of the data values
are as small or smaller and at least 25% of the
values are as large or larger.

U1-73
Describing Distributions With Numbers
In general, the pth percentile of a data set is the
ordered observation such that at least p% of the
data values are as small or smaller and at least
(100 – p)% of the values are as large or larger.

As such, the first quartile is the 25th percentile and the
third quartile is the 75th percentile.
U1-74
Describing Distributions With Numbers
To find Q1: Take the median of all data values lower
in position than the median.

To find Q3: Take the median of all data values higher
in position than the median.

U1-75
Example
Consider again the earthquake data:

5.0 5.4 5.5 5.6 5.6 5.7 5.9
5.9 6.0 6.0 6.1 6.2 6.2 6.3
6.3 6.4 6.4 6.5 6.5 6.6 6.6
6.6 6.6 6.8 6.8 6.8 6.9 7.0
7.1 7.2 7.3 7.3 7.7 7.7 7.8
The minimum data value is 5.0 and the maximum is
7.8. We previously calculated the median to be 6.5.
U1-76
Example
To find Q1, we take the median of all observations
lower in position than the median. The median is in
position 18 of the ordered data, so Q1 is the median
of the first 17 ordered observations, i.e., position
(17 + 1)/2 = 9. Therefore, Q1 = 6.0.

Q3 is the median of all observations higher in position
than the median. We can either count nine positions
up from the median, or nine positions down from the
maximum. Therefore, Q3 = 6.9.

U1-77
Example
The interquartile range is therefore

IQR = Q3 – Q1 = 6.9 – 6.0 = 0.9

We now have five numbers that describe this data
distribution: the minimum, Q1, the median, Q3 and
the maximum. Together, these are known as the
five-number summary and offer a good numerical
description of the distribution of data values.
U1-78
Example
For the earthquake data, the five-number summary is

5.0 6.0 6.5 6.9 7.8

5.0 5.4 5.5 5.6 5.6 5.7 5.9
5.9 6.0 6.0 6.1 6.2 6.2 6.3
6.3 6.4 6.4 6.5 6.5 6.6 6.6
6.6 6.6 6.8 6.8 6.8 6.9 7.0
7.1 7.2 7.3 7.3 7.7 7.7 7.8

U1-79
Example
The five-number summary divides the data into four
equal parts:

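[Diagram on an axis from 5 to 8: min, Q1, med, Q3 and max marked in
order. Each of the four gaps between them holds 25% of the data, the
IQR (Q1 to Q3) covers 50% and the range (min to max) covers 100%.]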
U1-80
Example
Consider again the test score data:

31 37 40 44 49 50 51 53 56 56
62 64 67 67 68 68 69 70 71 72
73 73 74 75 77 78 78 81 82 84
84 87 89 89 92 92 94 95 96 98
The minimum data value is 31 and the maximum is
98. We previously calculated the median to be 72.5.

U1-81
Example
There are 20 data values less than the median. Q1 is
therefore in position (20 + 1)/2 = 10.5 of the ordered
data, so we take the average of the 10th and 11th
ordered observations. Thus, Q1 = (56 + 62)/2 = 59.

Similarly, Q3 is the average of the 10th and 11th
ordered values above the median (or equivalently, the
average of the 10th and 11th values down from the
maximum). Thus, Q3 = (84 + 84)/2 = 84.
U1-82
Example
The interquartile range is therefore

IQR = Q3 – Q1 = 84 – 59 = 25

and the five-number summary is

31 59 72.5 84 98
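
The quartile method used in these notes (the median of the values below the median position, and of the values above it) can be sketched in Python as follows. The function names are made up for this sketch. Statistical software often uses other quartile conventions, so results from a library routine may differ slightly from this hand method.

```python
def median_sorted(ordered):
    """Median of an already ordered, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(data):
    """(min, Q1, median, Q3, max) using the notes' method.

    When n is odd, the median itself is excluded from both halves.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                    # below the median position
    upper = ordered[half + 1:] if n % 2 else ordered[half:]   # above the median position
    return (ordered[0], median_sorted(lower), median_sorted(ordered),
            median_sorted(upper), ordered[-1])

scores = [31, 37, 40, 44, 49, 50, 51, 53, 56, 56,
          62, 64, 67, 67, 68, 68, 69, 70, 71, 72,
          73, 73, 74, 75, 77, 78, 78, 81, 82, 84,
          84, 87, 89, 89, 92, 92, 94, 95, 96, 98]

summary = five_number_summary(scores)
iqr = summary[3] - summary[1]
print(summary, iqr)
```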

U1-83
Describing Distributions With Numbers
The five-number summary describes the center, shape
and spread of our data. We can use the five-number
summary to get a “picture” of the data.
U1-84
Boxplots
Below the histogram we see a boxplot for the earthquake
data. This type of boxplot is called a quantile boxplot.

[Figure: histogram of the earthquake data above a quantile boxplot, both
on an axis from 5 to 8, with min, Q1, med, Q3 and max labelled on the
boxplot]

U1-85
Boxplots
A quantile boxplot consists of:
§ a line at the median
§ a box that covers the IQR
§ lines (“whiskers”) that extend from the box out
to the minimum and maximum

[Quantile boxplot of the earthquake data on an axis from 5 to 8]


U1-86
Boxplots
A boxplot can be displayed either horizontally or
vertically. Below we see the quantile boxplot for the
test score data in vertical orientation.

U1-87
Boxplots
Like a histogram, a boxplot enables us to characterize
the shape of a distribution:

[Boxplot: 50% of the data values lie on each side of the median line]
Half of the data values fall within a relatively short
interval on the left, while the other half covers a large
interval on the right. This distribution is skewed to
the right.
U1-88
Boxplots
Analogous to our use of back-to-back stemplots, we
can create side-by-side boxplots:

§ used for plotting the same variable for samples
representing different populations
§ enable us to compare the distributions with
respect to center, shape and spread
§ must be constructed with a uniform scale to
legitimize the comparison

U1-89
Boxplots
The side-by-side boxplots for the height data are
shown below:
U1-90
Boxplots
We see that the median height for males is higher
than that for females.

The variability in heights is also greater for males
(both the range and IQR are larger).

The distribution of heights for females is
approximately symmetric and the distribution
for males is skewed to the right.

U1-91
Describing Distributions With Numbers
As we have seen, extreme observations can affect our
interpretation of the data if a numerical measure is
not resistant to the effect of outliers.

We can have a similar problem with graphical tools.
Since the lines in our boxplot extend out to the
minimum and maximum, we might not get an
accurate picture of our data.
U1-92
Describing Distributions With Numbers
The two distributions shown below will have
similar-looking quantile boxplots, even though they are
quite different in terms of variability.

[Figure: two histograms on an axis from 2 to 8, with a quantile boxplot
labelled min, Q1, med, Q3 and max]

U1-93
Describing Distributions With Numbers
We would like to construct a modified boxplot that
takes these extreme observations into account.

An outlier boxplot is also based on quantiles.


U1-94
Describing Distributions With Numbers
In a quantile boxplot, the five lines correspond to the
values in the five-number summary.

The middle three lines (the box) are the same for an
outlier boxplot: Q1, median, Q3.

In this case, however, the lines coming out from the
box (the whiskers) do not extend out to the minimum
and maximum values.

U1-95
Describing Distributions With Numbers
Instead, we construct a “fence” with lower (LF) and
upper (UF) values:

LF = Q1 – 1.5(IQR) = Q1 – 1.5(Q3 – Q1)

UF = Q3 + 1.5(IQR) = Q3 + 1.5(Q3 – Q1)


U1-96
Describing Distributions With Numbers
The line coming out from the left side of the box
extends out to the lowest data value which is still
greater than the lower fence.

Similarly, the line coming out from the right side of
the box extends to the highest data value which is still
less than the upper fence.

U1-97
Describing Distributions With Numbers
An outlier is considered to be any data point falling
outside the fences (i.e., less than the lower fence or
greater than the upper fence). As such, the lines
extend out to the “new” minimum and maximum,
after we remove the outliers from consideration.

The outliers are included in the boxplot as points
along the axis.
U1-98
Example
The number of speeding violations recorded over the
weekend at twelve red-light cameras are ordered and
shown below:

6 9 10 10 12 13
14 14 15 15 17 28

The five-number summary for these data is:

6 10 13.5 15 28

U1-99
Example
We calculate the fences as follows:

LF = Q1 – 1.5(IQR) = 10 – 1.5(15 – 10)
   = 10 – 7.5 = 2.5

UF = Q3 + 1.5(IQR) = 15 + 1.5(15 – 10)
   = 15 + 7.5 = 22.5
U1-100
Example
There are no data values less than LF, so there are no
outliers on the left. The data value closest to but not
less than LF is 6 (i.e., the “new minimum” is the
same as the “old minimum”).

There is one value greater than UF, 28 > 22.5, so 28
is an outlier. The data value closest to but not greater
than UF is 17 (i.e., the “new maximum”).
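
The fence calculation and the search for the whisker ends can be sketched in Python. The function name `outlier_boxplot_stats` and the returned dictionary are assumptions of this sketch; Q1 and Q3 are passed in so that the quartile method used earlier in the notes stays in charge of those values.

```python
def outlier_boxplot_stats(data, q1, q3):
    """Fences, whisker ends and outliers for an outlier boxplot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values outside the fences are outliers; whiskers stop at the most
    # extreme values that are still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "LF": lower_fence,
        "UF": upper_fence,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

violations = [6, 9, 10, 10, 12, 13, 14, 14, 15, 15, 17, 28]
stats = outlier_boxplot_stats(violations, q1=10, q3=15)
print(stats)
```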

U1-101
Example
The outlier boxplot for these data is shown below:

[Outlier boxplot on an axis from 0 to 30, with LF, Q1, med, Q3 and UF
labelled. The box spans the IQR and each fence lies 1.5(IQR) beyond
the box.]
U1-102
Example
Note that it is not necessary to include the labels or to
draw the fences when constructing the modified
boxplot.

[Outlier boxplot on an axis from 0 to 30, without labels or fences]

U1-103
Example
Consider what would happen if we had instead
constructed a quantile boxplot for these data:

[Quantile boxplot on an axis from 0 to 30]
U1-104
Example
Not only would we be given the impression that there
is high variability in the data…

[Quantile boxplot on an axis from 0 to 30]

U1-105
Example
but we may also conclude that the distribution is
skewed to the right…

[Quantile boxplot on an axis from 0 to 30]
U1-106
Example
when we see from the modified boxplot that it is in
fact skewed to the left…

[Outlier boxplot on an axis from 0 to 30]

U1-107
Describing Distributions With Numbers
Note that we only had a problem with the quantile
boxplot when there were extreme values in our data.
When there are no outliers, the quantile boxplot is
identical to the outlier boxplot (since the “new”
minimum and maximum are the same as the “old”
minimum and maximum).
U1-108
Describing Distributions With Numbers
Back to the idea of variability…

The median gave us a measure of center, but we saw
that the mean is a better measure of center in some
cases, as it includes information about all data values.

Similarly, the range is a very crude description of the
variability of a data set. Can we find another way to
describe the spread numerically?

U1-109
Describing Distributions With Numbers
The variance of a set of data values (also referred to
as the sample variance) is defined as:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Loosely speaking, the variance can be thought of as
the “average squared deviation” from the mean.
U1-110
Describing Distributions With Numbers
The standard deviation of a data set is defined as the
positive square root of the variance:

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

U1-111
Describing Distributions With Numbers
To calculate the variance $s^2$ for a sample of size n:

1) Calculate the sample mean $\bar{x}$.
2) Calculate the n deviations $x_i - \bar{x}$.
3) Square the deviations $(x_i - \bar{x})^2$.
4) Add the squared deviations $\sum_{i=1}^{n} (x_i - \bar{x})^2$.
5) Divide by n – 1:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
U1-112
Example
The numbers of goals scored by the five NHL
Pacific Division teams in the 2009/10 regular season
are shown below:

Goals: 238, 237, 241, 225, 264

U1-113
Describing Distributions With Numbers
We calculate 1) $\bar{x} = 241$

$x_i$    2) $x_i - \bar{x}$    3) $(x_i - \bar{x})^2$
238      -3                    9
237      -4                    16
241      0                     0
225      -16                   256
264      23                    529
         sum = 0               4) sum = 810

5) The sample variance is therefore equal to

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{810}{4} = 202.5 \]
U1-114
Describing Distributions With Numbers
We could calculate the sample variance without a
table as follows:
\[ s^2 = \frac{(238 - 241)^2 + (237 - 241)^2 + (241 - 241)^2 + (225 - 241)^2 + (264 - 241)^2}{4} = \frac{9 + 16 + 0 + 256 + 529}{4} = \frac{810}{4} = 202.5 \]

To get the sample standard deviation, we simply take
the square root of the variance:

\[ s = \sqrt{s^2} = \sqrt{202.5} = 14.23 \]
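
The five steps translate directly into Python. This is a sketch with made-up names (`sample_variance`, `goals`); the standard library's `statistics.variance` and `statistics.stdev` use the same n – 1 divisor and could be used instead.

```python
from math import sqrt

def sample_variance(data):
    """Sample variance with the n - 1 divisor, following the five steps."""
    n = len(data)
    xbar = sum(data) / n                       # 1) sample mean
    deviations = [x - xbar for x in data]      # 2) deviations from the mean
    squared = [d ** 2 for d in deviations]     # 3) squared deviations
    total = sum(squared)                       # 4) sum of squared deviations
    return total / (n - 1)                     # 5) divide by n - 1

goals = [238, 237, 241, 225, 264]
s2 = sample_variance(goals)
s = sqrt(s2)                                   # standard deviation
print(s2, round(s, 2))
```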

U1-115
Describing Distributions With Numbers
Note that we calculated $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ for these data.

In fact, this is always the case, as shown below:

\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x}) &= (x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x}) \\
&= (x_1 + x_2 + \cdots + x_n) - n\bar{x} \\
&= (x_1 + x_2 + \cdots + x_n) - n\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right) \\
&= (x_1 + x_2 + \cdots + x_n) - (x_1 + x_2 + \cdots + x_n) \\
&= 0
\end{align*}
U1-116
Describing Distributions With Numbers
What are the units for the standard deviation and
variance?

The units for $\bar{x}$ are the same as the units for the
individual x’s.

For example, the mean hourly wage of a group of
employees is still expressed in dollars.

U1-117
Describing Distributions With Numbers
The units for variance are the squared units of the
individual observations.

The sample variance of hourly wages for the group of
employees is expressed in (dollars)$^2$.

Note: A variance of 10 dollars$^2$ does not mean that
$s^2 = 100$. It means that $s^2 = 10$ and the units are
dollars$^2$.
U1-118
Describing Distributions With Numbers
The units for the standard deviation are again the
same as the units of the individual observations (in
this case, dollars).

This is one reason why standard deviations are more
commonly reported than variances: they are
easier to interpret in terms of units we understand.

U1-119
Describing Distributions With Numbers
We do, however, need to be careful when interpreting
the meaning of the standard deviation.

Although the variance is loosely defined as the
average squared deviation of an observation from the
mean, the standard deviation is not the average
deviation from the mean.
U1-120
Describing Distributions With Numbers
What happens to the values of the mean and standard
deviation when the units of measurement are
changed?

Suppose we have data from a sample for some
variable X for which the mean and standard deviation
are $\bar{x}$ and $s_x$, respectively.

U1-121
Describing Distributions With Numbers
Suppose we have some other variable Y that shares a
linear relationship with the variable X, i.e.,

y = a + bx

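Then it can be shown that the mean and standard
deviation of Y are, respectively,

\[ \bar{y} = a + b\bar{x} \quad \text{and} \quad s_y = |b|\, s_x \]

The absolute value is needed because a standard
deviation can never be negative, even when b is.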
U1-122
Example
The daily high temperature (in °C) is recorded each
day for one week in Grand Forks, North Dakota:

24.3 22.7 25.1 18.6 20.4 26.6 23.1

It can be calculated that

$\bar{x}$ = 22.97 °C and $s_x$ = 2.75 °C

U1-123
Example
Now suppose we want to express the data in °F.
The relationship between the Fahrenheit and Celsius
scales is

y = 32 + 1.8x     (a = 32, b = 1.8)

where y is the temperature in °F and x is the
temperature in °C.
U1-124
Example
It follows that the mean and standard deviation in °F
are, respectively,

\[ \bar{y} = 32 + 1.8\,\bar{x} = 32 + 1.8(22.97) = 73.35 \text{ °F} \]

\[ s_y = 1.8\, s_x = 1.8(2.75) = 4.95 \text{ °F} \]

Note that only the multiplicative term b affects the
value of the standard deviation. Adding a constant to
each data value does not affect the spread of the data.
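
The two rules can be checked numerically in Python by converting every data value and recomputing the summaries. The variable names are assumptions of this sketch; `mean` and `stdev` come from the standard `statistics` module, and `stdev` uses the n – 1 divisor.

```python
from statistics import mean, stdev

celsius = [24.3, 22.7, 25.1, 18.6, 20.4, 26.6, 23.1]
a, b = 32, 1.8                                   # y = a + b x

fahrenheit = [a + b * x for x in celsius]        # convert each observation

mean_c, sd_c = mean(celsius), stdev(celsius)
mean_f, sd_f = mean(fahrenheit), stdev(fahrenheit)

# Summaries of the converted data agree with the shortcut formulas.
print(round(mean_f, 2), round(a + b * mean_c, 2))
print(round(sd_f, 2), round(abs(b) * sd_c, 2))
```

Working from unrounded values, the standard deviation in °F comes out at about 4.94; the 4.95 above results from multiplying the already rounded 2.75 by 1.8.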

U1-125
Describing Distributions With Numbers
We have seen two different numerical summaries for
a given data set:

(1) the five-number summary

(2) the sample mean and standard deviation

Both provide us with descriptions of the center and
variability of a data distribution.
U1-126
Describing Distributions With Numbers
So how do we know which one to report?

It depends on the shape of the data distribution!

Recall that the mean and median are equal for
symmetric distributions. Therefore, if the distribution
of data values is reasonably symmetric with no
outliers, we should report the mean and standard
deviation, since they contain more information about
the sample than do the median, IQR and range.

U1-127
Describing Distributions With Numbers
However, for skewed distributions, or in the presence
of strong outliers, the five-number summary should
be reported, since the mean and standard deviation are
strongly affected by extreme values.
