Unit 2 Notes
Examining Relationships
In statistics, we often encounter problems that
involve more than one variable. Sometimes we want to
compare two (or more) different populations with
respect to the same variable. At other times, we want to
examine the relationship between two different variables
measured on the same individuals.
Explanatory and Response Variables
In this second case, one of the variables is an
explanatory variable (which we denote by X) and
the other is a response variable (denoted by Y).
Example
Are students who excel in English also good at
mathematics, or are most people strictly left- or
right-brained? A psychology professor at a university
locates 450 students who have taken the same
introductory English course and the same Math
course and compares their percentage grades in the
two courses at the end of the semester.
Example
Consider the relationship between a country’s fertility
rate (average number of children per adult female) and
life expectancy (average lifespan for its citizens). The
table below gives the values of both variables for a
sample of eight countries.
Example
Country       Fertility Rate   Life Expectancy
Bangladesh         3.2              62.5
Brazil             1.9              71.9
Colombia           2.5              72.3
Haiti              4.9              53.2
Italy              1.3              79.8
Lithuania          1.2              74.4
Pakistan           4.0              63.4
Rwanda             5.4              47.3
Example
The scatterplot for these data is shown below:
[Scatterplot: Life Expectancy vs. Fertility Rate for the eight countries]
Examining Relationships
We look for four things when examining a scatterplot:
1) Direction
§ In this case, there is a negative association
between the two variables. An above-average
value of Fertility Rate tends to be accompanied
by a below-average value of Life Expectancy,
and vice-versa. If the pattern of points slopes
upward from left to right, we say there is a
positive association.
Examining Relationships
2) Form
§ A straight line would do a fairly good job
approximating the relationship between the two
variables. It is therefore not unreasonable to
assume that these two variables share a linear
relationship.
Examining Relationships
3) Strength
§ The strength of the relationship is determined by
how close the points lie to a simple form such as
a straight line. In our example, if we draw a line
which roughly approximates the relationship
between the two variables, all points will fall
quite close to the line. As such, the linear
relationship is quite strong.
Examining Relationships
3) Strength (cont’d)
§ Not all relationships are linear in form. They
can be quadratic, logarithmic or exponential, to
name a few. Sometimes the points appear to be
“randomly scattered”, in which case many of
them will fall far from a line used to approximate
the relationship. In this case, we say the linear
relationship between the two variables is weak.
Examining Relationships
4) Outliers
§ There are several types of outliers for bivariate
data. An observation may be outlying in either
the x- or y-directions (or both). Another type of
outlier occurs when an observation simply falls
outside the general pattern of points, even if it is
extreme in neither the x- nor y-directions. Some
types of outliers have more of an impact on our
analysis than others, as we will discuss shortly.
Strength of Linear Relationship
The STAT 1000 and STAT 2000 percentage grades for
a sample of students who have taken both courses are
displayed in the scatterplot below:
[Scatterplot: STAT 2000 grade vs. STAT 1000 grade]
Strength of Linear Relationship
The scatterplot shows a moderately strong positive
linear relationship. Does the relationship for the data
in the following scatterplot appear stronger?
[Scatterplot: the same STAT 2000 vs. STAT 1000 data, plotted on axes running from 0 to 140]
Strength of Linear Relationship
It might, but these are the same data; the scatterplots
are just constructed with different scales!
[Side-by-side scatterplots: the same STAT 1000/STAT 2000 data on the two different scales]
Strength of Linear Relationship
This example shows that our eyes are not the best
tools for assessing the strength of the relationship
between two variables.
Correlation
The correlation r measures the direction and strength
of a linear relationship between two quantitative
variables.
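The defining formula (consistent with the computation worked out below) averages the products of the standardized values of the two variables:

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$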
Correlation
For the Fertility Rate and Life Expectancy example,
(i) $\bar{x} = 3.05$, $\bar{y} = 65.6$, $s_x = 1.60$, $s_y = 11.13$

(ii) the deviations from the means, (iii) their products, and (iv) the sum of those products:

x_i     y_i     x_i − x̄    y_i − ȳ    (x_i − x̄)(y_i − ȳ)
3.2    62.5      0.15       −3.1         −0.465
1.9    71.9     −1.15        6.3         −7.245
2.5    72.3     −0.55        6.7         −3.685
4.9    53.2      1.85      −12.4        −22.940
1.3    79.8     −1.75       14.2        −24.850
1.2    74.4     −1.85        8.8        −16.280
4.0    63.4      0.95       −2.2         −2.090
5.4    47.3      2.35      −18.3        −43.005
               sum = 0     sum = 0   (iv) sum = −120.56
Correlation
(v) Finally,

$$ r = \frac{1}{(n-1)\,s_x s_y} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{-120.56}{7(1.60)(11.13)} = -0.9671 $$
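As a quick check, here is a minimal Python sketch (NumPy only, with the table's data hard-coded) that reproduces this correlation; the small discrepancy from −0.9671 arises because the slide rounds $s_x$ and $s_y$ to two decimals:

```python
import numpy as np

# Fertility rates (x) and life expectancies (y) for the eight countries
x = np.array([3.2, 1.9, 2.5, 4.9, 1.3, 1.2, 4.0, 5.4])
y = np.array([62.5, 71.9, 72.3, 53.2, 79.8, 74.4, 63.4, 47.3])

# r from the defining formula, using unrounded sample standard deviations
n = len(x)
r = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

print(round(r, 4))                        # -0.9657
print(round(np.corrcoef(x, y)[0, 1], 4))  # -0.9657 (same value via the built-in)
```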
Association vs. Causation
We must be careful when interpreting correlation.
Despite the very strong negative correlation, we
cannot conclude that having more children causes
a shorter life expectancy.
Association vs. Causation
Regardless of the existence of identifiable lurking
variables, we must remember that correlation
measures only the linear association between two
quantitative variables. It gives us no information
about the causal nature of the relationship.
Correlation
Some properties of correlation (cont’d):
§ r has no units.
§ The correlation makes no distinction between X
and Y. As such, an explanatory and response
variable are not necessary.
§ Changing the units of X and Y has no effect on the
correlation. i.e., It doesn’t matter if we measure a
variable in pounds or kilograms, feet or meters,
dollars or cents, etc.
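To illustrate the unit-invariance property, here is a minimal sketch with hypothetical heights and weights (the numbers are made up purely for illustration), showing that converting pounds to kilograms leaves r unchanged:

```python
import numpy as np

# Hypothetical heights (inches) and weights (pounds), made up for illustration
height_in = np.array([63.0, 66.0, 68.0, 70.0, 72.0, 75.0])
weight_lb = np.array([120.0, 145.0, 150.0, 165.0, 180.0, 210.0])
weight_kg = weight_lb * 0.453592   # change of units: pounds to kilograms

r_lb = np.corrcoef(height_in, weight_lb)[0, 1]
r_kg = np.corrcoef(height_in, weight_kg)[0, 1]
print(np.isclose(r_lb, r_kg))      # True: rescaling a variable does not change r
```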
Correlation
Some properties of correlation (cont’d):
§ r measures only the strength of a linear
relationship. In other cases, it is a useless measure.
§ Because the correlation is a function of several
measures that are affected by outliers, r is itself
strongly affected by outliers.
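A minimal sketch of the outlier property, again with made-up numbers: a single point far from the pattern can change r dramatically.

```python
import numpy as np

# Ten points lying almost exactly on a line (made up for illustration)
x = np.arange(1.0, 11.0)
y = 2.0 * x + np.array([0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1])
print(round(np.corrcoef(x, y)[0, 1], 3))    # ~1.0

# One point far outside the pattern drags r down substantially
x2 = np.append(x, 10.0)
y2 = np.append(y, 2.0)
print(round(np.corrcoef(x2, y2)[0, 1], 3))  # ~0.63
```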
Regression
When a relationship appears to be linear in nature,
we often wish to estimate this relationship between
variables with a single straight line.
Regression
Given a value of X, we would like to predict the
corresponding value of Y. Unless there is a perfect
relationship, we won’t know the exact value of Y,
because Y is a variable.
Regression
We will use a sample to estimate the true relationship
between the two variables. Our estimate of the true
line is
$$ \hat{y} = b_0 + b_1 x $$
Regression
We would like to find the line that fits our data the
best. That is, we need to find the appropriate values
of b0 and b1.
We choose the line that minimizes the sum of squared vertical deviations of the observed values $y_i$ from the predicted values $\hat{y}_i$:

$$ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

[Scatterplot: a fitted line with the vertical deviation between an observed $y_i$ and its predicted $\hat{y}_i$ marked]
Least Squares Regression
The values of b0 and b1 that give us the line that
minimizes this sum of squared deviations are:
$$ b_1 = r\,\frac{s_y}{s_x} \qquad \text{and} \qquad b_0 = \bar{y} - b_1 \bar{x} $$

[Scatterplot: the fitted line, with the slope shown as the rise over a run of $\Delta x$]
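A minimal sketch of these two formulas in Python; `least_squares_line` is a hypothetical helper name, and the usage line plugs in the fertility/life expectancy summaries from the correlation example:

```python
def least_squares_line(x_bar, y_bar, s_x, s_y, r):
    """Slope and intercept of the least squares line from summary statistics."""
    b1 = r * s_y / s_x        # slope: r rescaled by the ratio of the sds
    b0 = y_bar - b1 * x_bar   # intercept: forces the line through (x_bar, y_bar)
    return b0, b1

# Fertility rate / life expectancy summaries from the correlation example:
b0, b1 = least_squares_line(3.05, 65.6, 1.60, 11.13, -0.9671)
print(round(b1, 2), round(b0, 2))   # -6.73 86.12
```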
Least Squares Regression
The intercept of the regression line, b0 , is defined as
the predicted value of y when x = 0.
[Scatterplot: the fitted line crossing the y-axis at height $b_0$]
Least Squares Regression
Some variability in Y is accounted for by the fact that,
as X changes, it pulls Y along with it. The remaining
variation is accounted for by other factors (which we
usually don’t know).
Least Squares Regression
If r = –1 or 1, then r2 = 1. That is, we can predict Y
exactly for any value of X, as regression on X
accounts for all of the variation in Y.
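For example, with the fertility and life expectancy data,

$$ r^2 = (-0.9671)^2 = 0.9353, $$

so about 93.5% of the variation in Life Expectancy is accounted for by the regression on Fertility Rate.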
Example
The mercury concentration of fish in a river is of interest.
A sample of fish will be taken from the river and a regression
will be run to predict, from their lengths, the mercury
concentration of fish that will be caught in the future.
This must be done because a fish cannot be sold or eaten
after it has been tested for mercury.
Example
A sample of ten fish is collected and the lengths X (in inches)
and mercury concentrations Y are as follows:
X 5.5 6.1 6.7 7.0 7.5 7.9 8.6 9.2 9.8 10.3
Y 0.11 0.19 0.24 0.37 0.36 0.49 0.59 0.60 0.81 0.78
Example
The scatterplot for these data is shown below:
[Scatterplot: Concentration (ppm) vs. Length (inches) for the ten fish]
Example
We see a strong positive linear relationship between
Length and Concentration. From the data, we
calculate
$\bar{x} = 7.860$, $\bar{y} = 0.454$, $s_x = 1.597$, $s_y = 0.241$, $r = 0.985$
And so
$$ b_1 = r\,\frac{s_y}{s_x} = 0.985 \left( \frac{0.241}{1.597} \right) = 0.149 $$

and

$$ b_0 = \bar{y} - b_1 \bar{x} = 0.454 - 0.149(7.860) = -0.717 $$
[Scatterplot: Concentration vs. Length with the least squares line overlaid]
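As a check on the arithmetic, here is a minimal Python sketch that fits the line from the raw data (NumPy only); the slides' intercept of −0.717 comes from rounding $b_1$ to 0.149 before computing $b_0$:

```python
import numpy as np

# Lengths (inches) and mercury concentrations (ppm) for the ten fish
length = np.array([5.5, 6.1, 6.7, 7.0, 7.5, 7.9, 8.6, 9.2, 9.8, 10.3])
conc = np.array([0.11, 0.19, 0.24, 0.37, 0.36, 0.49, 0.59, 0.60, 0.81, 0.78])

r = np.corrcoef(length, conc)[0, 1]
b1 = r * conc.std(ddof=1) / length.std(ddof=1)   # slope
b0 = conc.mean() - b1 * length.mean()            # intercept
print(round(r, 3), round(b1, 3), round(b0, 3))   # 0.985 0.149 -0.716

# The same line from NumPy's built-in degree-1 polynomial fit
print(np.polyfit(length, conc, 1))               # [slope, intercept]
```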
Example
The slope b1 = 0.149 tells us that, as the length of a
fish increases by one inch, we predict the mercury
concentration to increase by 0.149 ppm.
The intercept b0 = – 0.717 is statistically meaningless
in this case. A fish cannot have a length of 0 inches,
and a negative concentration is impossible.
Example
Suppose we want to predict the mercury concentration of a fish
that is 7 inches long. Our prediction is

$$ \hat{y} = -0.717 + 0.149(7) = 0.326 \text{ ppm.} $$

We call this the predicted value of Y when X = 7.
[Scatterplot: the fitted line with the predicted concentration at X = 7 marked]
Residuals
Note that there is a fish in the sample that is 7.0 inches
long. How does the actual mercury concentration for
this fish compare with the predicted concentration?
$$ y_4 - \hat{y}_4 = 0.37 - 0.326 = 0.044 \text{ ppm} $$
Residuals
residual = actual value of y – predicted value of y
[Scatterplot: a point's actual concentration above the fitted line and its predicted concentration on the line]
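A minimal sketch computing all ten residuals from the fitted line (the arrays repeat the fish data from the earlier snippet):

```python
import numpy as np

length = np.array([5.5, 6.1, 6.7, 7.0, 7.5, 7.9, 8.6, 9.2, 9.8, 10.3])
conc = np.array([0.11, 0.19, 0.24, 0.37, 0.36, 0.49, 0.59, 0.60, 0.81, 0.78])

predicted = -0.717 + 0.149 * length   # fitted values from the line above
residuals = conc - predicted          # residual = actual - predicted

print(round(residuals[3], 3))    # 0.044   (the 7.0-inch fish, above the line)
print(round(residuals[7], 4))    # -0.0538 (the 9.2-inch fish, below the line)
```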
Residuals
A positive residual indicates that an observation falls
above the regression line and a negative residual indicates
that it falls below the line. As an example, check that
the residual for the 9.2 inch fish in the sample is equal
to – 0.0538.
Note that it was in fact the sum of squared residuals that is
minimized in calculating the least squares regression line.
What if we want to predict the mercury concentration for a
fish that is 12 inches long? Our prediction is
Extrapolation
Mathematically, there is no problem with making this
prediction. However, there is a statistical problem.
Our range of values for X is from 5.5 to 10.3 inches.
We have good evidence of a linear relationship within
this range of values. However, we have no fish in our
sample as long as 12 inches, and so we have no idea
whether this relationship continues to hold outside our
range of data.
The process of predicting a value of Y for a value of X
outside our range of data is known as extrapolation, and
should be avoided if at all possible.
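A minimal sketch of a prediction helper that flags extrapolation; `predict_concentration` is a hypothetical name, and the interval endpoints are simply our sample's minimum and maximum lengths:

```python
def predict_concentration(length_in, x_min=5.5, x_max=10.3):
    """Predict mercury concentration (ppm) and flag extrapolation."""
    if not (x_min <= length_in <= x_max):
        print(f"Warning: {length_in} in. is outside the observed range "
              f"[{x_min}, {x_max}] -- this prediction is an extrapolation.")
    return -0.717 + 0.149 * length_in

print(round(predict_concentration(7), 3))    # 0.326
print(round(predict_concentration(12), 3))   # warning, then 1.071
```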
Outliers
We have seen that an outlier can be defined as a point
that is far from the other data points in the x-direction
or the y-direction, or if it falls outside the general
pattern of points.
Outliers
Point # 1 is an outlier in the y-direction. It generally
has little effect on the regression line.
[Scatterplot: point #1 lies far above the other points in the y-direction]
Outliers
Point # 2 is not an outlier in either the x- or
y-directions, but falls outside the pattern of points.
[Scatterplot: point #2 falls below the pattern of points, though it is extreme in neither the x- nor y-direction]
Outliers
A bivariate outlier such as this generally has little
effect on the regression line.
[Scatterplot: the same data with the regression line drawn, showing little change caused by point #2]
Outliers
Point # 3 is an outlier in the x-direction. It has a
strong effect on the regression line.
[Scatterplot: point #3 lies far to the right in the x-direction and pulls the regression line toward it]
Influential Observations
An observation is called influential if removing it
from the data set would dramatically alter the position
of the regression line (and the value of r2).
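A minimal sketch (the points are made up purely for illustration) showing how removing one x-outlier changes the fitted slope, which is the hallmark of an influential observation:

```python
import numpy as np

# Nine made-up points near the line y = x, plus one x-outlier off the pattern
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 5.0])

slope_all, _ = np.polyfit(x, y, 1)             # fit with the outlier included
slope_trim, _ = np.polyfit(x[:-1], y[:-1], 1)  # fit with the outlier removed
print(round(slope_all, 2), round(slope_trim, 2))   # ~0.23 vs ~0.99
```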
Outliers
We see that, with the outlier included, the regression
line is a less accurate description of the relationship.
[Scatterplot: Concentration vs. Length with an outlier included, comparing the LSR lines fitted with and without the outlier]
Least Squares Regression
One property of the least squares regression line is
that it always passes through the point $(\bar{x}, \bar{y})$.
Consider our previous example for the regression of
Concentration vs. Length of a fish. The mean length
of the fish in the sample was 7.860 inches. The
predicted value for a fish of this length is
$$ \hat{y} = -0.717 + 0.149(7.860) = 0.454 = \bar{y} $$
Association vs. Causation
Recall our discussion of association vs. causation.
The former does not imply the latter. In the fish
example, there was a strong positive relationship
between the length of a fish and its mercury
concentration. However, this doesn’t mean that
getting longer causes a fish’s mercury concentration
to increase. In fact, we know in this case that it is the
age of a fish that causes its concentration to increase,
as it has been exposed to the mercury for a longer
period of time. As such, the age of a fish is a lurking
variable.
Experiment vs. Observational Study
The best way to avoid lurking variables is to perform
an experiment rather than an observational study.
Experiment vs. Observational Study
If we simply ask people how much they smoke (say,
in number of cigarettes per week) and measure their
weight, we might see a positive correlation between
the two variables. But this does not imply that
smoking causes weight gain. (In fact, many people
believe smoking causes weight loss!)
Experiment vs. Observational Study
The reason is that, in a randomized experiment, the groups
being compared are similar with respect to all possible
lurking variables, so an observed association can be
attributed to the explanatory variable itself.
Categorical Variables on a Scatterplot
Sometimes a scatterplot may actually be displaying
two or more distinct relationships.
For example, the Average Driving Distance X and
the Average Score Y are recorded for a sample of
professional golfers. (A “drive” is a golfer’s first
shot on a golf hole.)
Categorical Variables on a Scatterplot
The data are plotted on the scatterplot below. The
relationship does not appear to be linear, but…

[Scatterplot: Average Score vs. Average Driving Distance for the sample of golfers]
Categorical Variables on a Scatterplot
This scatterplot is actually displaying two distinct
linear relationships, one for male golfers and one for
female golfers.
[Scatterplot: the same data with male and female golfers distinguished, each group showing its own linear trend]
Categorical Variables on a Scatterplot
This example illustrates that we should be careful
when examining a relationship to make sure that the
data belong to only one population.