Quant Notes 9-7-21

The document discusses correlation, focusing on the relationship between paired scores and how to visualize them using scatterplots. It explains the concept of correlation coefficients, particularly Pearson's r, which measures the strength and direction of a linear relationship between two variables. Additionally, it emphasizes the importance of understanding that correlation does not imply causation and introduces the concept of regression for predicting values based on these relationships.

Correlation

Up until this point, we have been looking at a single variable.
Correlation
Looking at paired scores
Correlation
Note: Each dot represents a pair of scores and is considered as a coordinate.
Correlation
Do the units of the scores matter for you to tell if there is a pattern?
Correlation
What do we call this type of data plot? A scatterplot.
Correlation

Scatterplot: A graph containing a cluster of dots that represents all pairs of scores. Each dot signifies a coordinate.
Correlation
Your book provides an example using greeting cards.
Is there a relationship between the number of cards
you send and the number you receive?
Correlation
Positive Relationship: Pairs of scores tend to occupy similar relative positions (high with high and low with low) in their respective distributions.

Negative Relationship: Pairs of scores tend to occupy dissimilar relative positions (high with low and vice versa) in their respective distributions.
Correlation
Correlation: A measure of how things are related. More specifically, it is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It’s a common tool for describing simple relationships without making a statement about cause and effect.
Correlation
How can we quantify the strength of a correlation?

We can employ a correlation coefficient.

Correlation Coefficient: A value used to measure how strong a relationship is between two variables.
Correlation

Pearson’s Correlation Coefficient (also called Pearson’s r) is the most common correlation coefficient and is commonly used in linear regression.
Correlation
Properties of Pearson’s r:

1. The sign of r indicates the type of linear relationship, whether positive or negative.
2. The numerical value of r, without regard to sign, indicates the strength of the linear relationship. (r ranges from -1 to +1.)
Correlation
Pearson’s r

Helps you to see which relationships are imaginary and which ones are real.

To what extent can you use x or y to estimate the other?
Correlation
Caution: Be careful when interpreting the
actual numerical value of r. An r of .70 for
height and weight doesn’t signify that the
strength of this relationship equals either .70
or 70 percent of the strength of a perfect
relationship. The value of r can’t be interpreted
as a proportion or percentage of some perfect
relationship.
Correlation
Coefficient of Determination ($r^2$):

The squared correlation coefficient ($r^2$) represents the proportion of the total variability in one variable that is predictable from its relationship with the other variable, that is, the proportion of shared variance. $r^2$ ranges from 0 to 1: $0 \le r^2 \le 1$.
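To make the difference between r and the squared correlation concrete, here is a minimal Python sketch. The value .70 is the height and weight example from the caution slide; the other values of r are arbitrary illustrations, and the helper name `r_squared` is hypothetical.

```python
# Squared correlation (coefficient of determination) for a few values of r.
# 0.70 is the height/weight example from the slides; the rest are made up.

def r_squared(r: float) -> float:
    """Proportion of variability in one variable predictable from the other."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return r * r


for r in (-0.90, -0.30, 0.0, 0.50, 0.70, 1.0):
    print(f"r = {r:+.2f}  ->  r^2 = {r_squared(r):.2f}")
```

With r = .70, the squared value is .49, so roughly half of the variability is shared, not 70 percent.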
Correlation
Careful with Cause-and-Effect!

Remember: A correlation coefficient, regardless of size, never provides information about whether an observed relationship reflects a simple cause-effect relationship or some more complex state of affairs.
Note:

Conclusions are trustworthy only if they are derived from premises (assumptions) that are true.
Correlation
Let’s go back to our definition of the squared correlation coefficient ($r^2$):

$r^2$ represents the proportion of the total variability in one variable that is predictable from its relationship with the other variable, also known as the proportion of shared variance ($0 \le r^2 \le 1$).
Correlation

*** Remember: ***

Predictability does not imply causality!


Correlation
How can we calculate r?
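The formula shown on this slide did not survive extraction, so the following is only a sketch of one standard way to compute Pearson's r from deviation scores (sum of products divided by the square root of the product of the two sums of squares). The function name `pearson_r` and the five pairs of card counts are illustrative assumptions, not values taken from the slides.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r for paired scores, computed from deviation scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squares.
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)


# Illustrative data: cards sent and cards received by five people.
sent = [13, 9, 7, 5, 1]
received = [14, 18, 12, 10, 6]

r = pearson_r(sent, received)
print(f"r = {r:.2f}")  # r = 0.80
```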
Correlation

Careful with outliers!


Example Strength of Relationship Interpretation
Note: If you use something like this, make sure you get it from a reputable source and provide an appropriate reference.
Example Correlation Matrix

RR = Reading readiness test at beginning of 1st grade; CA = Age; Y = End of 1st year reading performance
Linear Relationship:
A relationship that can be described best with a straight line.

What is another type of relationship that you might encounter?
Curvilinear Relationship:
A relationship that can be described best with a curved line. (Eta Coefficient)

Eta, η, is a letter in the Greek alphabet and looks like the letter “n” when you didn’t stop soon enough (per Larimore).

You can square η just like you can r. Larimore always called η the older brother of r.
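The slides do not give a formula for eta. One common definition, assumed here, is the correlation ratio: eta squared is the between-group sum of squares divided by the total sum of squares, with the y scores grouped by their x value. The sketch below uses made-up, U-shaped data to show that r can be near zero while eta is close to 1; the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviation scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)


def eta(xs, ys):
    """Correlation ratio of y on x: sqrt(SS_between / SS_total).

    Assumes x takes a small number of repeated values so that the
    y scores can be grouped by x.
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    grand_mean = sum(ys) / len(ys)
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return sqrt(ss_between / ss_total)


# Made-up U-shaped data: y is high at both ends of x and low in the middle.
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [4.1, 3.9, 1.2, 0.8, 0.1, -0.1, 0.9, 1.1, 4.0, 4.0]

r = pearson_r(xs, ys)
e = eta(xs, ys)
print(f"|r| = {abs(r):.3f}, eta = {e:.3f}, eta^2 = {e ** 2:.3f}")
```

A straight line captures almost none of this pattern, which is why the scatterplot should be inspected before settling on r.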
Curvilinear Data Examples
How do we know to use a straight (r) or curved (η) line?

Look at the scatterplot
Correlation Range Restriction
Except for special circumstances, the value of
the correlation coefficient declines whenever
the range of possible X or Y scores is restricted.
Range restriction is analogous to magnifying a
subset of the original dot cluster and, in the
process, losing much of the orderly and
predictable pattern in the original dot cluster.
Correlation Range Restriction
• If variables co-vary, they must first be allowed to vary.
• Anything you do to restrict the range of a variable will suppress the relationship.
• You may have colleagues recommend, “What if you just look at the low or high students?” Just understand what this could do to any underlying relationships.
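The effect of range restriction can be sketched with simulated data. Everything below is an assumption for illustration: the scores are generated from a simple linear relationship plus random noise, the seed is fixed so the run is repeatable, and the cutoff of 1.0 for "high" scorers is arbitrary.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviation scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)


random.seed(0)  # fixed seed so the simulation is repeatable
n = 2000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + random.gauss(0, 1) for x in xs]  # linear relationship plus noise

r_full = pearson_r(xs, ys)

# Restrict the range: keep only the "high" scorers on X.
high = [(x, y) for x, y in zip(xs, ys) if x > 1.0]
r_high = pearson_r([x for x, _ in high], [y for _, y in high])

print(f"full range:      n = {n}, r = {r_full:.2f}")
print(f"high scorers:    n = {len(high)}, r = {r_high:.2f}")
```

The same underlying relationship produces a noticeably smaller r once only part of the X range is allowed to vary.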
Correlation Range Restriction

Dr. Larimore’s example about a guy selling a creativity test to school districts.
Correlation Range Restriction

Always check for any possible restrictions on the ranges of X or Y scores, whether by design or accident, that could lower the value of r.
The part of the relationship that is predictable can be estimated with a regression line.
Correlation and regression go together.
Regression
Regression Line Equation

$Y' = bX + a$

The above equation is from your text. The $Y'$ will sometimes be represented as $\hat{Y}$.

You may recall this is the simple formula for a straight line. In P-12, the above equation is often represented as y = mx + b.
$Y' = bX + a$

From the graph: rise = 10, run = 1.75, Y-intercept = 52.

$b = 10 / 1.75 \approx 5.71$
$a = 52$

$Y' = 5.71X + 52$
Grade = 5.71(# Books Read) + 52
The Least Squares (Regression) Line

$Y' = bX + a$
• A good line is one that minimizes the sum of squared differences between the points and the line. Minimizing the sum of squared differences avoids the arithmetic standoff of zero always produced by adding positive and negative predictive errors (associated with errors above and below the regression line, respectively). This is why the regression line is often referred to as the least squares regression line.
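The two ideas on this slide (raw errors cancel to zero, squared errors do not) can be checked numerically. The sketch below fits a line with the usual closed-form least squares solution and compares its sum of squared errors with a few nearby lines. The books and grades data are made up, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least squares line Y' = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    return b, a


def sum_errors(xs, ys, b, a):
    """Sum of raw predictive errors (positive and negative cancel)."""
    return sum(y - (b * x + a) for x, y in zip(xs, ys))


def sum_squared_errors(xs, ys, b, a):
    """Sum of squared predictive errors (the quantity being minimized)."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


# Made-up data: number of books read and final grade for six students.
books = [0, 1, 2, 3, 4, 5]
grades = [50, 60, 62, 71, 74, 82]

b, a = fit_line(books, grades)
sse_best = sum_squared_errors(books, grades, b, a)
print(f"least squares line: Y' = {b:.2f}X + {a:.2f}")
print(f"sum of raw errors:     {sum_errors(books, grades, b, a):.6f}")
print(f"sum of squared errors: {sse_best:.2f}")

# Any other line gives a larger sum of squared errors.
for db, da in ((0.5, 0.0), (0.0, 1.0), (-0.5, -1.0)):
    sse = sum_squared_errors(books, grades, b + db, a + da)
    print(f"shifted line (b{db:+}, a{da:+}): SSE = {sse:.2f}")
```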
Regression
Least Squares
The Least Squares (Regression) Line
$Y' = bX + a$

• The smaller the sum of squared differences, the better the fit of the regression line to the data.
• Results in minimizing the errors in predicting y from x.
The Least Squares (Regression) Line

$Y' = bX + a$

Note: Use of the linear regression equation requires that the underlying relationship be linear.
Remember our earlier example (rise = 10, run = 1.75, Y-intercept = 52)?

$Y' = 5.71X + 52$
Grade = 5.71(# Books Read) + 52
The Least Squares (Regression) Line
$Y' = bX + a$

$\hat{y}_i = \bar{Y} + r_{xy} \frac{s_y}{s_x} (x_i - \bar{X})$
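The slope-intercept form and the form written with means, standard deviations, and r describe the same line. Expanding the second form shows which pieces play the roles of b and a; this is a standard identity, written here with the symbols used on the slides:

```latex
\begin{aligned}
\hat{y}_i &= \bar{Y} + r_{xy}\,\frac{s_y}{s_x}\,(x_i - \bar{X}) \\
          &= \underbrace{r_{xy}\,\frac{s_y}{s_x}}_{b}\; x_i
             \;+\; \underbrace{\left(\bar{Y} - r_{xy}\,\frac{s_y}{s_x}\,\bar{X}\right)}_{a}
\end{aligned}
```

So the slope is $b = r_{xy}\, s_y / s_x$ and the intercept is $a = \bar{Y} - b\bar{X}$, which also means the line always passes through the point $(\bar{X}, \bar{Y})$.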
Standard Error of Measurement
How can we estimate the amount of error associated with our least squares regression line estimate?

$s_{y|x} = s_y \sqrt{1 - r_{xy}^2}$
Standard Error of Measurement

The Standard Error of Measurement ($s_{y|x}$), which many texts call the standard error of estimate, is a rough measure of the average amount of predictive error.

$s_{y|x} = s_y \sqrt{1 - r_{xy}^2}$
Standard Error of Measurement
Use of the Standard Error of Measurement ($s_{y|x}$) assumes that, except for chance, the dots in the original scatterplot will be dispersed equally about all segments of the regression line. This is known as the assumption of homoscedasticity.
Standard Error of Measurement

Assumption of Homoscedasticity
Standard Error of Measurement
What is one way to check for homoscedasticity?

Look at the scatterplot.
Standard Error of Measurement

Note: Don’t get overly concerned about violating the assumption of homoscedasticity unless the scatterplot reveals a dramatically different type of dot cluster as shown on the following slides.
Standard Error of Measurement

$s_{y|x} = s_y \sqrt{1 - r_{xy}^2}$

The equation below is from your text. Notice anything interesting?

$s_{y|x} = \sqrt{\frac{SS_y (1 - r^2)}{n - 2}}$
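One way to explore the question is to compute both versions on the same data. The sketch below assumes that $s_y$ in the first equation is the sample standard deviation with n - 1 in the denominator; under that assumption the two results differ only by the factor $\sqrt{(n-1)/(n-2)}$, which shrinks toward 1 as n grows. The eight pairs of scores are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviation scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)


# Made-up paired scores.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 1, 4, 3, 7, 5, 8, 6]

n = len(ys)
mean_y = sum(ys) / n
ss_y = sum((y - mean_y) ** 2 for y in ys)  # sum of squares for y
s_y = sqrt(ss_y / (n - 1))                 # assumption: n - 1 denominator
r = pearson_r(xs, ys)

slide_version = s_y * sqrt(1 - r ** 2)
text_version = sqrt(ss_y * (1 - r ** 2) / (n - 2))
ratio = text_version / slide_version

print(f"s_y * sqrt(1 - r^2)           = {slide_version:.4f}")
print(f"sqrt(SS_y (1 - r^2) / (n - 2)) = {text_version:.4f}")
print(f"ratio = {ratio:.4f}, sqrt((n-1)/(n-2)) = {sqrt((n - 1) / (n - 2)):.4f}")
```

If $s_y$ were instead computed with n in the denominator, the factor would be $\sqrt{n/(n-2)}$.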
Standard Error of Measurement

• Standard error can be thought of as the standard deviation for a conditional distribution,
• where a conditional distribution is the distribution of y for a particular value of x.
Standard Error of Measurement
• For example, if Jim Bob has a score of 75 (on x),
what is our best guess of how he will do on y? Our
specific guess is from the least squares regression
line, but our best guess of his expected range,
based on an imposed confidence interval, includes
the standard error term.

Standard Error of Measurement

• e.g., We estimate that Jim Bob will score an 81 on his end-of-course test, but we are 68% confident (1 standard error) that he will score between 76 and 87.
• Or… We estimate that Jim Bob will score an 81 on his end-of-course test, but we are 99.7% confident (3 standard errors) that he will score between 64 and 99.
Example Problem #1
Step #1: Prediction

$\hat{y}_i = \bar{Y} + r_{xy} \frac{s_y}{s_x} (x_i - \bar{X})$

$\hat{y}_i = 2.60 + 0.546 \left(\frac{1.06}{0.51}\right) (x_i - 3.31)$
For $x_i = 2.5$:

$\hat{y}_i = 2.60 + 0.546 \left(\frac{1.06}{0.51}\right) (2.5 - 3.31) = 1.68$
Step #2: Calculate Standard Error of Measurement

$s_{y|x} = s_y \sqrt{1 - r_{xy}^2}$

$s_{y|x} = 1.06 \sqrt{1 - (0.546)^2} = 0.89$
Step #3: Calculate Confidence Intervals

$\mathrm{C.I.}_{x\%} = \hat{y}_i \pm t_{\mathrm{C.I.}} \times s_{y|x}$

$\mathrm{C.I.}_{68\%} = 1.68 \pm 1 \times 0.89 = (0.79, 2.57)$

$\mathrm{C.I.}_{95\%} = 1.68 \pm 1.96 \times 0.89 = (-0.06, 3.42)$

$\mathrm{C.I.}_{99\%} = 1.68 \pm 2.58 \times 0.89 = (-0.62, 3.97)$
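The three steps of Example Problem #1 can be reproduced in a few lines of Python. This is a sketch using the summary statistics shown in Step #1 (means 3.31 and 2.60, standard deviations 0.51 and 1.06, r = 0.546, and x = 2.5); the variable and function names are hypothetical. Small last-digit differences from the slide values can come from rounding intermediate results.

```python
from math import sqrt

# Summary statistics as they appear in Step #1 of Example Problem #1.
mean_x, s_x = 3.31, 0.51
mean_y, s_y = 2.60, 1.06
r_xy = 0.546
x_i = 2.5


def predict(x):
    """Step 1: predicted y from the regression line."""
    return mean_y + r_xy * (s_y / s_x) * (x - mean_x)


def interval(center, multiplier, std_error):
    """Step 3: center plus or minus multiplier times the standard error."""
    half_width = multiplier * std_error
    return center - half_width, center + half_width


y_hat = predict(x_i)
s_yx = s_y * sqrt(1 - r_xy ** 2)  # Step 2: standard error term

print(f"predicted y = {y_hat:.2f}")
print(f"s_y|x       = {s_yx:.2f}")

# Multipliers used on the slides for 68%, 95%, and 99%.
intervals = {}
for label, mult in (("68%", 1.0), ("95%", 1.96), ("99%", 2.58)):
    intervals[label] = interval(y_hat, mult, s_yx)
    low, high = intervals[label]
    print(f"{label} interval: ({low:.2f}, {high:.2f})")
```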
Example Problem #2
You are given the following information for a sample of students:

Mean SAT Score ($\bar{X}$) = 500
SAT Standard Deviation ($s_x$) = 100

Mean GPA ($\bar{Y}$) = 2.00
GPA Standard Deviation ($s_y$) = 0.75

Pearson’s Correlation ($r_{xy}$) = 0.60
Example Problem
1. Compute your best estimate of grade point average ($y_{gpa}$) for a prospective student having an SAT score ($x_{sat}$) of 625.

2. Establish a band of error, or confidence interval, around the estimate in which you would be confident the student’s true score would fall
   a. approximately 68 times in 100
   b. approximately 95 times in 100
Example Problem

[Figure: Estimated GPA from regression line]

[Figure: 68% confidence interval]
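The worked solution on these slides is not recoverable from the text, so the following is a sketch that applies the same three steps as Example Problem #1 to the SAT and GPA values given in Example Problem #2. The multipliers 1 and 1.96 for the 68% and 95% bands follow Example Problem #1; all variable names are hypothetical.

```python
from math import sqrt

# Values given in Example Problem #2.
mean_sat, s_sat = 500.0, 100.0
mean_gpa, s_gpa = 2.00, 0.75
r_xy = 0.60
sat = 625.0

# Part 1: best estimate of GPA from the regression line.
gpa_hat = mean_gpa + r_xy * (s_gpa / s_sat) * (sat - mean_sat)

# Standard error term used for the band of error.
s_yx = s_gpa * sqrt(1 - r_xy ** 2)

# Part 2: bands of error around the estimate.
band_68 = (gpa_hat - 1.0 * s_yx, gpa_hat + 1.0 * s_yx)
band_95 = (gpa_hat - 1.96 * s_yx, gpa_hat + 1.96 * s_yx)

print(f"estimated GPA = {gpa_hat:.2f}")   # 2.56
print(f"s_y|x         = {s_yx:.2f}")      # 0.60
print(f"68% band: ({band_68[0]:.2f}, {band_68[1]:.2f})")
print(f"95% band: ({band_95[0]:.2f}, {band_95[1]:.2f})")
```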
Multiple Regression Model
Consider a least squares equation that contains more than one predictor or X variable.

Remember our basic equation for a single predictor?

$y = mx + b$
Multiple Regression Model
A least squares equation that contains more than one predictor or X variable.

$y = mx + b$

$y = m_1 x_1 + m_2 x_2 + \cdots + m_N x_N + b$

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N + \varepsilon$
Multiple Regression Model

A multiple regression model with k independent variables fits a regression “surface” in (k + 1)-dimensional space (which cannot be visualized once k is greater than 2).
Multiple Regression Model
Let’s pretend that you have data for a sample that includes information for the school year. The data you have include the number of books each student read, how many days of class they missed, and their final reading class exam score.

You want to know if the combination of books read and class absences might impact the exam score.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$

Grade = 37.379 + 4.037 (Books Read) + 1.283 (Classes Attended)
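Using the fitted equation is a matter of plugging in one value per predictor. In the sketch below the three coefficients are copied from the equation above; the function name `predict_grade` and the example student (5 books read, 20 classes attended) are hypothetical.

```python
# Coefficients copied from the fitted equation on the slide.
INTERCEPT = 37.379
B_BOOKS = 4.037      # change in predicted grade per additional book read
B_ATTENDED = 1.283   # change in predicted grade per additional class attended


def predict_grade(books_read: float, classes_attended: float) -> float:
    """Predicted grade from the two-predictor regression equation."""
    return INTERCEPT + B_BOOKS * books_read + B_ATTENDED * classes_attended


# Hypothetical student: 5 books read, 20 classes attended.
print(round(predict_grade(5, 20), 3))  # 83.224

# Holding attendance fixed, one more book raises the prediction by B_BOOKS.
print(round(predict_grade(6, 20) - predict_grade(5, 20), 3))  # 4.037
```

Each coefficient describes the change in the predicted grade for a one-unit change in its predictor while the other predictor is held constant.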


Regression Toward the Mean
A tendency for scores, particularly extreme scores, to shrink toward the mean.

Note: Observed regression toward the mean occurs for individuals or subsets of individuals, not for entire groups.
Regression
The Regression Fallacy is committed whenever
regression toward the mean is interpreted as a
real, rather than a chance, effect, such as in
the previous Israeli Air Force pilot training
example.
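The regression line itself shows why extreme scores are predicted to fall closer to the mean. Rewriting the prediction formula in standard-deviation units gives predicted z on y = r times z on x, so whenever r is less than 1 in absolute value the prediction sits closer to its mean than the predictor does. The sketch below checks this with the SAT and GPA values from Example Problem #2; the function name is hypothetical.

```python
# Values from Example Problem #2.
mean_sat, s_sat = 500.0, 100.0
mean_gpa, s_gpa = 2.00, 0.75
r_xy = 0.60


def z_scores(sat):
    """Return (z of the SAT score, z of the predicted GPA)."""
    z_x = (sat - mean_sat) / s_sat
    gpa_hat = mean_gpa + r_xy * (s_gpa / s_sat) * (sat - mean_sat)
    z_y_hat = (gpa_hat - mean_gpa) / s_gpa
    return z_x, z_y_hat


for sat in (300.0, 625.0, 700.0):
    z_x, z_y_hat = z_scores(sat)
    print(f"SAT {sat:.0f}: z_x = {z_x:+.2f}, predicted z_y = {z_y_hat:+.2f}")
```

A student 1.25 standard deviations above the SAT mean is predicted to be only 0.75 standard deviations above the GPA mean. This reflects the imperfect correlation, not a real effect pulling scores down.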
