Unit 2 Notes

U2-1

Examining Relationships
In statistics, we often encounter problems which
involve more than one variable. We often want to
compare two (or more) different populations with
respect to the same variable.

We use tools such as side-by-side boxplots and back-to-back stemplots to make comparisons between the groups.
U2-2
Examining Relationships
Often, however, we wish to examine relationships
between several variables for the same population.

When we are interested in examining the relationship between two variables, we may find ourselves in one of two situations:
§ We may simply be interested in the nature of the
relationship.
§ One of the variables may be thought to explain or
cause changes in the other.

U2-3
Explanatory and Response Variables
In this second case, one of the variables is an
explanatory variable (which we denote by X) and
the other is a response variable (denoted by Y).

A response variable takes values representing the outcome of a study, while an explanatory variable helps explain this outcome.
U2-4
Example
Does the caffeine in coffee really help keep you
awake? Researchers interviewed 300 adults and
asked them how many cups of coffee they drink on
an average day, as well as how many hours of sleep
they get at night.

The response variable Y is the hours of sleep, while the explanatory variable X is the number of cups of coffee per day.

U2-5
Example
Are students who excel in English also good at
mathematics, or are most people strictly left- or
right-brained? A psychology professor at a university
locates 450 students who have taken the same
introductory English course and the same Math
course and compares their percentage grades in the
two courses at the end of the semester.

In this case, there is no explanatory or response variable; we are simply interested in the nature of the relationship.
U2-6
Graphical Displays
The best way to display the relationship between two
quantitative variables is with a scatterplot.

A scatterplot displays the values of two different quantitative variables measured on the same individuals. The data for each individual (for both variables) appear as a single point on the scatterplot.
If there is an explanatory and a response variable, they
should be plotted on the x- and y-axes, respectively.
Otherwise the choice of axes is arbitrary.

U2-7
Example
Consider the relationship between a country’s fertility
rate (average number of children per adult female) and
life expectancy (average lifespan for its citizens). The
table on the following page gives the values for both
variables for a sample of eight countries.
U2-8
Example
Country Fertility Rate Life Expectancy
Bangladesh 3.2 62.5
Brazil 1.9 71.9
Colombia 2.5 72.3
Haiti 4.9 53.2
Italy 1.3 79.8
Lithuania 1.2 74.4
Pakistan 4.0 63.4
Rwanda 5.4 47.3

U2-9
Example
The scatterplot for these data is shown below:
[Scatterplot: Life Expectancy (y-axis, 40 to 90) vs. Fertility Rate (x-axis, 0 to 6), showing a clear downward trend.]
U2-10
Examining Relationships
We look for four things when examining a scatterplot:

1) Direction
§ In this case, there is a negative association
between the two variables. An above-average
value of Fertility Rate tends to be accompanied
by a below-average value of Life Expectancy,
and vice-versa. If the pattern of points slopes
upward from left to right, we say there is a
positive association.

U2-11
Examining Relationships
2) Form
§ A straight line would do a fairly good job
approximating the relationship between the two
variables. It is therefore not unreasonable to
assume that these two variables share a linear
relationship.
U2-12
Examining Relationships
3) Strength
§ The strength of the relationship is determined by
how close the points lie to a simple form such as
a straight line. In our example, if we draw a line
which roughly approximates the relationship
between the two variables, all points will fall
quite close to the line. As such, the linear
relationship is quite strong.

U2-13
Examining Relationships
3) Strength (cont’d)
§ Not all relationships are linear in form. They
can be quadratic, logarithmic or exponential, to
name a few. Sometimes the points appear to be
“randomly scattered”, in which case many of
them will fall far from a line used to approximate
the relationship. In this case, we say the linear
relationship between the two variables is weak.
U2-14
Examining Relationships
4) Outliers
§ There are several types of outliers for bivariate
data. An observation may be outlying in either
the x- or y-directions (or both). Another type of
outlier occurs when an observation simply falls
outside the general pattern of points, even if it is
extreme in neither the x- nor y-directions. Some
types of outliers have more of an impact on our
analysis than others, as we will discuss shortly.

U2-15
Strength of Linear Relationship
The STAT 1000 and STAT 2000 percentage grades for
a sample of students who have taken both courses are
displayed in the scatterplot below:
[Scatterplot: STAT 2000 grade (y-axis, 10 to 100) vs. STAT 1000 grade (x-axis, 40 to 100).]
U2-16
Strength of Linear Relationship
The scatterplot shows a moderately strong positive
linear relationship. Does the relationship for the data
in the following scatterplot appear stronger?
[Scatterplot: the same data replotted with both axes running from 0 to 140.]

U2-17
Strength of Linear Relationship
It might, but these are the same data; the scatterplots
are just constructed with different scales!
[The two scatterplots side by side: STAT 2000 vs. STAT 1000 on the original scale (40 to 100) and on the wider scale (0 to 140).]
U2-18
Strength of Linear Relationship
This example shows that our eyes are not the best
tools to assess the strength of relationship between two
variables.

Can we find a numerical measure that will give us a concrete description of the strength of a linear relationship between two quantitative variables?

The measure we use is called correlation.

U2-19
Correlation
The correlation r measures the direction and strength
of a linear relationship between two quantitative
variables.

Suppose the values of two quantitative variables X and Y have been measured for n individuals. Then

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right)\left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{1}{(n-1)\, s_x s_y} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
U2-20
Correlation
We will use the second version of the formula, which
is computationally simpler. To calculate the
correlation r:
(i) Calculate $\bar{x}$, $\bar{y}$, $s_x$ and $s_y$.
(ii) Calculate the deviations $x_i - \bar{x}$ and $y_i - \bar{y}$.
(iii) Multiply the corresponding deviations for x and y: $(x_i - \bar{x})(y_i - \bar{y})$.
(iv) Add the n products: $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.
(v) Divide by $(n-1)\, s_x s_y$:

$$r = \frac{1}{(n-1)\, s_x s_y} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
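The five steps above translate directly into a few lines of code. Below is a minimal sketch in Python (standard library only; all variable names are my own) applied to the Fertility Rate and Life Expectancy data from the next slides; it reproduces the correlation computed there up to rounding.

```python
from statistics import mean, stdev

# Fertility Rate (X) and Life Expectancy (Y) for the eight countries
x = [3.2, 1.9, 2.5, 4.9, 1.3, 1.2, 4.0, 5.4]
y = [62.5, 71.9, 72.3, 53.2, 79.8, 74.4, 63.4, 47.3]

n = len(x)
x_bar, y_bar = mean(x), mean(y)      # step (i): means
s_x, s_y = stdev(x), stdev(y)        # step (i): sample standard deviations

# steps (ii)-(iv): deviations, their products, and the sum of the products
total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = total / ((n - 1) * s_x * s_y)    # step (v)
print(round(r, 4))  # -0.9657; the slides' -0.9671 uses the rounded s_x, s_y
```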

U2-21
Correlation
For the Fertility Rate and Life Expectancy example,
(i) $\bar{x} = 3.05$, $\bar{y} = 65.6$, $s_x = 1.60$, $s_y = 11.13$

 x_i    y_i   (ii) x_i − x̄   y_i − ȳ   (iii) (x_i − x̄)(y_i − ȳ)
 3.2   62.5        0.15       −3.1          −0.465
 1.9   71.9       −1.15        6.3          −7.245
 2.5   72.3       −0.55        6.7          −3.685
 4.9   53.2        1.85      −12.4         −22.940
 1.3   79.8       −1.75       14.2         −24.850
 1.2   74.4       −1.85        8.8         −16.280
 4.0   63.4        0.95       −2.2          −2.090
 5.4   47.3        2.35      −18.3         −43.005
                 sum = 0     sum = 0   (iv) sum = −120.56
U2-22
Correlation
(v) $r = \dfrac{1}{(n-1)\, s_x s_y} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \dfrac{-120.56}{7(1.60)(11.13)} = -0.9671$

U2-23
Association vs. Causation
We must be careful when interpreting correlation.
Despite the very strong negative correlation, we
cannot conclude that having more children causes
a shorter life expectancy.

There are many other variables that could help explain the strong relationship between Fertility Rate and Life Expectancy. One such variable is the Wealth of a country.
U2-24
Association vs. Causation
Women in richer industrialized countries tend to
have fewer children than women in poor third-world
countries. We also know that life expectancy is higher
in richer countries, because of better health care,
education, resources, etc.

The Wealth of a country in this example is known as a lurking variable. A lurking variable is one that helps explain the relationship between variables in a study, but which is not itself included in the study.

U2-25
Association vs. Causation
Regardless of the existence of identifiable lurking
variables, we must remember that correlation
measures only the linear association between two
quantitative variables. It gives us no information
about the causal nature of the relationship.

Association does not imply causation!


U2-26
Correlation
Some properties of correlation:
§ Positive values of r indicate a positive association
and negative values indicate a negative association.
§ r falls between –1 and 1, inclusive. Values of r
close to –1 or 1 indicate a strong linear association
(negative or positive, respectively). A correlation
of –1 or 1 is obtained only in the case of a perfect
linear relationship; i.e., when all points fall on a
straight line. Values of r close to zero indicate a
weak linear relationship.

U2-27
Correlation
Some properties of correlation (cont’d):
§ r has no units.
§ The correlation makes no distinction between X
and Y. As such, an explanatory and response
variable are not necessary.
§ Changing the units of X and Y has no effect on the correlation; i.e., it doesn't matter whether we measure a variable in pounds or kilograms, feet or meters, dollars or cents, etc. A quick numerical check of this property follows below.
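This last property is easy to verify. A quick sketch (Python, names my own) reusing the fertility data from earlier: rescaling Life Expectancy from years to months leaves r unchanged.

```python
from statistics import mean, stdev

def corr(x, y):
    """Sample correlation r, computed from the definition."""
    n, x_bar, y_bar = len(x), mean(x), mean(y)
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / ((n - 1) * stdev(x) * stdev(y))

x = [3.2, 1.9, 2.5, 4.9, 1.3, 1.2, 4.0, 5.4]         # fertility rate
y = [62.5, 71.9, 72.3, 53.2, 79.8, 74.4, 63.4, 47.3]  # life expectancy, years

print(corr(x, y))                       # about -0.9657
print(corr(x, [12 * yi for yi in y]))   # identical: expectancy in months
```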
U2-28
Correlation
Some properties of correlation (cont’d):
§ r measures only the strength of a linear
relationship. In other cases, it is a useless measure.
§ Because the correlation is a function of several
measures that are affected by outliers, r is itself
strongly affected by outliers.

U2-29
Regression
When a relationship appears to be linear in nature,
we often wish to estimate this relationship between
variables with a single straight line.

A regression line is a straight line that describes how a response variable Y changes as an explanatory variable X changes. This line is often used to predict values of Y for given values of X.
U2-30
Regression
Note that with correlation, we didn’t require a
response variable and an explanatory variable.

Usually in regression, we have an explanatory variable X and a response variable Y, but this need not be the case.

U2-31
Regression
Given a value of X, we would like to predict the
corresponding value of Y. Unless there is a perfect
relationship, we won’t know the exact value of Y,
because Y is a variable.
U2-32
Regression
We will use a sample to estimate the true relationship
between the two variables. Our estimate of the true
line is
$$\hat{y} = b_0 + b_1 x$$

$\hat{y}$ is the predicted value of Y for a given value of X. $b_0$ is the intercept of the line and $b_1$ is the slope.

We will use this regression line to make our predictions.

U2-33
Regression
We would like to find the line that fits our data the
best. That is, we need to find the appropriate values
of b0 and b1.

But there are infinitely many possible lines. Which one is the “best” line?

Since we are using X to predict Y, we would like the line to lie as close to the points as possible in the vertical direction.
U2-34
Regression
The line we will use is the line that minimizes the
sum of squared deviations in the vertical direction:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

[Scatterplot: a fitted line through the points, with one observed value $y_i$ and its predicted value $\hat{y}_i$ marked; the vertical gap between them is the deviation being squared.]

U2-35
Least Squares Regression
The values of b0 and b1 that give us the line that
minimizes this sum of squared deviations are:

$$b_1 = r\,\frac{s_y}{s_x} \qquad \text{and} \qquad b_0 = \bar{y} - b_1 \bar{x}$$

The line $\hat{y} = b_0 + b_1 x$ is called the least squares regression line, for obvious reasons.
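Given the summary statistics, the fit itself is two lines of arithmetic. A small sketch (Python; the function name is my own, and applying it to the fertility data is purely illustrative, since the slides do not fit that particular line):

```python
def least_squares(r, x_bar, y_bar, s_x, s_y):
    """Slope and intercept of the least-squares line y-hat = b0 + b1*x."""
    b1 = r * s_y / s_x        # b1 = r * (s_y / s_x)
    b0 = y_bar - b1 * x_bar   # b0 = y-bar - b1 * x-bar
    return b0, b1

# Fertility / Life Expectancy summary statistics from slides U2-21 and U2-22:
b0, b1 = least_squares(r=-0.9671, x_bar=3.05, y_bar=65.6, s_x=1.60, s_y=11.13)
print(b0, b1)  # about 86.12 and -6.73
```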
U2-36
Least Squares Regression
The slope of the regression line, b1, is defined as the predicted change in y when x increases by one unit.
$$b_1 = \frac{\Delta \hat{y}}{\Delta x}$$

[Plot: the regression line with a horizontal run $\Delta x$ and the corresponding vertical rise $\Delta \hat{y}$ marked.]

U2-37
Least Squares Regression
The intercept of the regression line, b0 , is defined as
the predicted value of y when x = 0.
[Plot: the regression line crossing the y-axis at height $b_0$.]
U2-38
Least Squares Regression
Some variability in Y is accounted for by the fact that,
as X changes, it pulls Y along with it. The remaining
variation is accounted for by other factors (which we
usually don’t know).

The value of r² has a special meaning in least-squares regression: it is the fraction of variation in Y that is accounted for by its regression on X.

U2-39
Least Squares Regression
If r = −1 or 1, then r² = 1. That is, we can predict Y exactly for any value of X, as regression on X accounts for all of the variation in Y.

If r = 0, then r² = 0, and so regression on X tells us nothing about the value of Y.

Otherwise, r² is between 0 and 1.


U2-40
Example
Fishermen are concerned that the pollution from a pulp and paper plant near
a river is contaminating the river’s fish with mercury, which makes the fish
unsafe to eat. They would like to use the age of a fish to predict the
mercury concentration in its system. The age of a fish is a good predictor
for mercury concentration, as older fish have been exposed to the mercury
longer and will have higher levels of contamination.

It is impossible to measure the age of a fish, so it is decided that the length of the fish will be used as the explanatory variable, as length and age are highly correlated.

A sample of fish will be taken from the river and a regression will be run to
predict the mercury concentration of fish that will be caught in the future.
This must be done because a fish cannot be sold or eaten after it has been
tested for mercury.

U2-41
Example
A sample of ten fish is collected and the lengths X (in inches)
and mercury concentrations Y are as follows:

X 5.5 6.1 6.7 7.0 7.5 7.9 8.6 9.2 9.8 10.3
Y 0.11 0.19 0.24 0.37 0.36 0.49 0.59 0.60 0.81 0.78
U2-42
Example
The scatterplot for these data is shown below:
[Scatterplot: Concentration (y-axis, 0 to 1) vs. Length (x-axis, 5 to 11 inches), showing a strong upward linear pattern.]

U2-43
Example
We see a strong positive linear relationship between
Length and Concentration. From the data, we
calculate
$$\bar{x} = 7.860, \quad \bar{y} = 0.454, \quad s_x = 1.597, \quad s_y = 0.241, \quad r = 0.985$$

And so

$$b_1 = r\,\frac{s_y}{s_x} = 0.985 \left( \frac{0.241}{1.597} \right) = 0.149$$

$$b_0 = \bar{y} - b_1 \bar{x} = 0.454 - 0.149(7.860) = -0.717$$
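A minimal sketch (Python, standard library only, names my own) reproducing this calculation from the raw data on slide U2-41:

```python
from statistics import mean, stdev

length = [5.5, 6.1, 6.7, 7.0, 7.5, 7.9, 8.6, 9.2, 9.8, 10.3]           # inches
conc   = [0.11, 0.19, 0.24, 0.37, 0.36, 0.49, 0.59, 0.60, 0.81, 0.78]  # ppm

n = len(length)
x_bar, y_bar = mean(length), mean(conc)
s_x, s_y = stdev(length), stdev(conc)
r = sum((xi - x_bar) * (yi - y_bar)
        for xi, yi in zip(length, conc)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x       # slope, about 0.149
b0 = y_bar - b1 * x_bar  # intercept, about -0.716 (-0.717 with the rounded slope)
print(round(b0, 3), round(b1, 3), round(r, 3))  # -0.716 0.149 0.985
```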


U2-44
Example
The equation of the least squares regression line is therefore $\hat{y} = -0.717 + 0.149x$. The line is shown on the scatterplot below:

[Scatterplot: Concentration vs. Length with the fitted line superimposed.]

U2-45
Example
The slope b1 = 0.149 tells us that, as the length of a
fish increases by one inch, we predict the mercury
concentration to increase by 0.149 ppm.
The intercept b0 = −0.717 is statistically meaningless in this case. A fish cannot have a length of 0 inches, and a negative concentration is impossible.

We also see that r² = (0.985)² = 0.97, which tells us that 97% of the variation in mercury concentration is accounted for by its regression on the length of a fish.
U2-46
Example
We can now use this line to predict the mercury
concentration for a fish of a given length.

To do this, we simply plug the length of a fish into the equation of the least-squares regression line. For example, the predicted mercury concentration for a fish that is 7.0 inches long is

$$\hat{y} = -0.717 + 0.149(7.0) = 0.326 \text{ ppm}$$

U2-47
Example
We call this the predicted value of Y when X = 7.

[Scatterplot: the fitted line with the predicted concentration at Length = 7.0 marked.]
U2-48
Residuals
Note that there is a fish in the sample that is 7.0 inches
long. How does the actual mercury concentration for
this fish compare with the predicted concentration?
$$y_4 - \hat{y}_4 = 0.37 - 0.326 = 0.044 \text{ ppm}$$

The value $y_i - \hat{y}_i$ is called the residual for the ith observation. The residual for any value of X reflects the error of our prediction.
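Continuing the sketch from slide U2-43, the residuals for the whole sample can be computed in one pass (using the slides' rounded coefficients, so small rounding differences are expected):

```python
b0, b1 = -0.717, 0.149  # rounded coefficients from slide U2-44

length = [5.5, 6.1, 6.7, 7.0, 7.5, 7.9, 8.6, 9.2, 9.8, 10.3]
conc   = [0.11, 0.19, 0.24, 0.37, 0.36, 0.49, 0.59, 0.60, 0.81, 0.78]

for x, y in zip(length, conc):
    y_hat = b0 + b1 * x  # predicted concentration at this length
    print(f"length {x:4.1f}: actual {y:.2f}, "
          f"predicted {y_hat:.3f}, residual {y - y_hat:+.4f}")
# The 7.0-inch fish has residual +0.0440; the 9.2-inch fish has -0.0538.
```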

U2-49
Residuals
residual = actual value of y – predicted value of y
[Scatterplot: the fitted line with the actual and predicted values at one length marked; the residual is the vertical gap between them.]
U2-50
Residuals
A positive residual indicates that an observation falls above the regression line, and a negative residual indicates that it falls below the line. As an example, check that the residual for the 9.2-inch fish in the sample is equal to −0.0538.

Note that it is in fact the sum of squared residuals that is minimized in calculating the least squares regression line.

What if we want to predict the mercury concentration for a fish that is 12 inches long? Our prediction is

$$\hat{y} = -0.717 + 0.149(12) = 1.071 \text{ ppm}$$

U2-51
Extrapolation
Mathematically, there is no problem with making this
prediction. However, there is a statistical problem.
Our range of values for X is from 5.5 to 10.3 inches.
We have good evidence of a linear relationship within
this range of values. However, we have no fish in our
sample as long as 12 inches, and so we have no idea
whether this relationship continues to hold outside our
range of data.
The process of predicting a value of Y for a value of X
outside our range of data is known as extrapolation, and
should be avoided if at all possible.
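One way to respect this rule in code is to have the prediction function refuse values of X outside the observed range. A tiny sketch (Python; the function name and behavior are my own illustration):

```python
def predict_concentration(x, b0=-0.717, b1=0.149, x_min=5.5, x_max=10.3):
    """Predict mercury concentration from length, refusing to extrapolate."""
    if not x_min <= x <= x_max:
        raise ValueError(f"length {x} is outside the observed range "
                         f"[{x_min}, {x_max}]; refusing to extrapolate")
    return b0 + b1 * x

print(predict_concentration(7.0))   # about 0.326 ppm
predict_concentration(12.0)         # raises ValueError
```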
U2-52
Outliers
We have seen that an outlier can be defined as a point that is far from the other data points in the x-direction or the y-direction, or one that falls outside the general pattern of points.

We now examine the effect of each of these three types of outliers.

U2-53
Outliers
Point # 1 is an outlier in the y-direction. It generally
has little effect on the regression line.
[Scatterplot: Point #1 lies far above the other points in the y-direction; the fitted line is nearly unchanged.]
U2-54
Outliers
Point # 2 is not an outlier in either the x- or
y-directions, but falls outside the pattern of points.
[Scatterplot: Point #2 falls well below the upward pattern of the other points, though it is extreme in neither the x- nor the y-direction.]

U2-55
Outliers
A bivariate outlier such as this generally has little
effect on the regression line.
[Scatterplot: the same data with Point #2 included; the fitted line is nearly unchanged.]
U2-56
Outliers
Point # 3 is an outlier in the x-direction. It has a
strong effect on the regression line.
[Scatterplot: Point #3 lies far to the right in the x-direction and pulls the fitted line toward itself.]

U2-57
Influential Observations
An observation is called influential if removing it
from the data set would dramatically alter the position
of the regression line (and the value of r²).

In the above illustration, Point #3 is an influential observation, which is often the case for outliers in the x-direction.
U2-58
Outliers
In our example, suppose the length of the tenth fish in
our sample had been 16 inches instead of 10.3 inches.
The equation of the regression line changes to
$$\hat{y} = -0.102 + 0.066x$$

In addition, the value of r² reduces to 0.663. The outlying value has had a large effect on the equation of the line and the value of r².
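This is easy to confirm numerically. A short sketch (Python, names my own) refitting the line with the tenth length changed from 10.3 to 16 inches:

```python
from statistics import mean, stdev

def fit(x, y):
    """Return (b0, b1, r squared) of the least-squares line."""
    n, x_bar, y_bar = len(x), mean(x), mean(y)
    r = sum((xi - x_bar) * (yi - y_bar)
            for xi, yi in zip(x, y)) / ((n - 1) * stdev(x) * stdev(y))
    b1 = r * stdev(y) / stdev(x)
    return y_bar - b1 * x_bar, b1, r * r

length = [5.5, 6.1, 6.7, 7.0, 7.5, 7.9, 8.6, 9.2, 9.8, 10.3]
conc   = [0.11, 0.19, 0.24, 0.37, 0.36, 0.49, 0.59, 0.60, 0.81, 0.78]

print(fit(length, conc))                # about (-0.716, 0.149, 0.970)
print(fit(length[:-1] + [16.0], conc))  # about (-0.102, 0.066, 0.663)
```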

U2-59
Outliers
We see that, with the outlier included, the regression
line is a less accurate description of the relationship.
[Scatterplot: the data with the 16-inch fish included; the LSR line with the outlier excluded follows the original points closely, while the LSR line with the outlier included is pulled toward the outlying point.]
U2-60
Least Squares Regression
One property of the least squares regression line is
that it always passes through the point $(\bar{x}, \bar{y})$.
Consider our previous example for the regression of
Concentration vs. Length of a fish. The mean length
of the fish in the sample was 7.860 inches. The
predicted value for a fish of this length is
$$\hat{y} = -0.717 + 0.149(7.860) = 0.454,$$

which is exactly equal to the mean mercury concentration of the fish in the sample.

U2-61
Association vs. Causation
Recall our discussion of association vs. causation.
The former does not imply the latter. In the fish
example, there was a strong positive relationship
between the length of a fish and its mercury
concentration. However, this doesn’t mean that
getting longer causes a fish’s mercury concentration
to increase. In fact, we know in this case that it is the
age of a fish that causes its concentration to increase,
as it has been exposed to the mercury for a longer
period of time. As such, the age of a fish is a lurking
variable.
U2-62
Experiment vs. Observational Study
The best way to avoid lurking variables is to perform
an experiment rather than an observational study.

In an experiment, the value of the explanatory variable is randomly “assigned” to the sample units, rather than being simply observed prior to the study.

For example, consider the relationship between smoking and weight. Does smoking cause weight gain?

U2-63
Experiment vs. Observational Study
If we simply ask people how much they smoke (say,
in number of cigarettes per week) and measure their
weight, we might see a positive correlation between
the two variables. But this does not imply that
smoking causes weight gain. (In fact, many people
believe smoking causes weight loss!)

There are other lurking variables that we are not considering. For example, a person who smokes is generally more likely to have other unhealthy habits, such as eating a bad diet or not exercising as much as they should.
U2-64
Experiment vs. Observational Study
To eliminate these effects, we could perform an
experiment where we actually assign a group of
former non-smokers a certain number of cigarettes to
start smoking each week. After a given amount of
time, we can measure each person’s weight gain (or
loss) and calculate the correlation and the regression
line. If we still see a positive association, we can then
say that smoking does in fact cause weight gain.

U2-65
Experiment vs. Observational Study
The reason for this is that random assignment balances all possible lurking variables across the groups assigned different levels of smoking.

For example, some people with bad diets will be assigned to smoke many cigarettes per week and some will be assigned to smoke very few.
U2-66
Experiment vs. Observational Study
This example provides a good illustration that it is not
always possible to perform an experiment rather than
an observational study. It is not realistic to expect to
find a group of non-smokers willing to start smoking
(much less the amount of cigarettes we tell them to
smoke!).

Note, however, that this doesn't mean observational studies are “bad”. We must just remember that association does not imply causation!

U2-67
Categorical Variables on a Scatterplot
Sometimes a scatterplot may actually be displaying
two or more distinct relationships.
For example, the Average Driving Distance X and
the Average Score Y are recorded for a sample of
professional golfers. (A “drive” is a golfer’s first
shot on a golf hole).
U2-68
Categorical Variables on a Scatterplot
The data are plotted on the scatterplot below. The relationship does not appear to be linear, but…

[Scatterplot: Score (y-axis, 70 to 74) vs. Distance (x-axis, 220 to 300); the points do not follow a single straight line.]

U2-69
Categorical Variables on a Scatterplot
This scatterplot is actually displaying two distinct
linear relationships, one for male golfers and one for
female golfers.

[Scatterplot: the same data identified by gender, showing two distinct negative linear patterns, one for male and one for female golfers.]
U2-70
Categorical Variables on a Scatterplot
This example illustrates that we should be careful
when examining a relationship to make sure that the
data belong to only one population.

In this case, a separate regression line should be fit to the data for the male and female golfers, as sketched below.
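As a sketch of what fitting a separate line per group might look like (Python, names my own; the golfer numbers below are hypothetical placeholders, since the slides show only the plot):

```python
from statistics import mean, stdev

def fit(x, y):
    """Slope and intercept of the least-squares line for one group."""
    n, x_bar, y_bar = len(x), mean(x), mean(y)
    r = sum((xi - x_bar) * (yi - y_bar)
            for xi, yi in zip(x, y)) / ((n - 1) * stdev(x) * stdev(y))
    b1 = r * stdev(y) / stdev(x)
    return y_bar - b1 * x_bar, b1

# Hypothetical (Distance, Score) data, one list per group
golfers = {
    "female": ([230, 238, 245, 251, 258], [73.6, 73.1, 72.7, 72.4, 72.0]),
    "male":   ([268, 275, 283, 290, 298], [71.9, 71.4, 71.0, 70.6, 70.1]),
}
for group, (dist, score) in golfers.items():
    b0, b1 = fit(dist, score)
    print(f"{group}: predicted score = {b0:.2f} + ({b1:.4f}) * distance")
```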
