Correlation
A bit about Pearson’s r
Questions
• Why does the maximum value of r equal 1.0?
• What does it mean when a correlation is positive? Negative?
• What is the purpose of the Fisher r to z transformation?
• What is range restriction? Range enhancement? What do they do to r?
• Give an example in which data properly analyzed by ANOVA cannot be used to infer causality.
• Why do we care about the sampling distribution of the correlation coefficient?
• What is the effect of reliability on r?
Basic Ideas
• Nominal vs. continuous IV
• Degree (direction) & closeness
(magnitude) of linear relations
– Sign (+ or -) for direction
– Absolute value for magnitude
• Pearson product-moment correlation
coefficient
$r = \frac{\sum z_X z_Y}{N}$
Illustrations
[Scatterplot: Weight by Height]
[Scatterplot: Errors by Study Time]
[Scatterplot: SAT-V by Toe Size]
Examples of positive, negative, and zero correlations.
Simple Formulas
$r = \frac{\sum xy}{N S_X S_Y}$, where $x = X - \bar{X}$, $y = Y - \bar{Y}$, and $S_X = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$.

Use either N throughout or else N - 1 throughout (in the SDs and in the denominator); the result is the same as long as you are consistent.

$\mathrm{Cov}(X, Y) = \frac{\sum xy}{N}$

$r = \frac{\sum z_X z_Y}{N}$

Pearson's r is the average cross product of z scores: the product of (standardized) moments from the means.
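As a check on these formulas, a small SAS sketch (the height/weight numbers below are made up for illustration) that gets r from PROC CORR and also as the average cross product of z scores:

/* Sketch: made-up height/weight data just to check the formulas. */
data htwt;
  input ht wt;
  datalines;
60 110
63 135
66 140
69 155
72 170
75 185
78 200
;

proc corr data=htwt;                                 /* Pearson r reported directly */
  var ht wt;
run;

proc standard data=htwt mean=0 std=1 out=zscores;    /* convert to z scores */
  var ht wt;
run;

data _null_;
  set zscores end=last;
  sumzz + ht*wt;                     /* accumulate cross products of z scores */
  if last then do;
    r = sumzz/(_n_ - 1);             /* PROC STANDARD uses the N-1 SD, so divide by N-1 */
    put 'r as the average cross product of z scores: ' r 6.3;
  end;
run;

Either divisor works, as long as the SDs and the denominator use the same one.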
Graphic Representation
[Scatterplots: Weight by Height in raw units (means: 66.8 inches, 150.7 lbs) and the same data in z-scores, with the quadrants labeled by the sign of the cross product]
1. Conversion from raw scores to z scores.
2. Points & quadrants: positive & negative cross products.
3. Correlation is the average of the cross products. The sign & magnitude of r depend on where the points fall.
4. The products are at a maximum (average = 1) when the points fall on the line where $z_X = z_Y$.
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
Ht 10 60.00 78.00 69.0000 6.05530
Wt 10 110.00 200.00 155.0000 30.27650
Valid N (listwise) 10
r = 1.0 when all the points fall exactly on a straight line.
Leave X alone and add error to Y: r drops to .99.
Add more error: r drops to .91.
With 2 variables, the correlation is the z-score slope (the slope of the line relating z_Y to z_X).
Review
• Why does the maximum value of r
equal 1.0?
• What does it mean when a correlation is
positive? Negative?
Sampling Distribution of r
Statistic is r, parameter is ρ (rho). In general, r is slightly
biased.

[Plot: sampling distributions of r for ρ = -.5, ρ = 0, and ρ = .5]

The sampling variance is approximately $\sigma_r^2 \approx \frac{(1 - \rho^2)^2}{N}$.

Sampling variance depends both on N and on ρ.
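For example, with ρ = .5 and N = 100 the approximation gives (1 - .25)²/100 ≈ .0056, a standard error of about .075; with ρ = 0 and N = 100 it gives 1/100, a standard error of .10.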
Empirical Sampling Distributions of the Correlation Coefficient
[Box plots of r from repeated samples under four conditions: ρ = .5, N = 100; ρ = .5, N = 50; ρ = .7, N = 100; ρ = .7, N = 50. The spread of r is larger for smaller N and for smaller ρ.]
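A small SAS simulation sketch (illustrative settings, not the run behind the plot above) that builds this kind of empirical sampling distribution for ρ = .5 and N = 50:

/* Sketch: simulate 1000 samples of N = 50 from a bivariate normal with rho = .5 */
data sim;
  call streaminit(20240101);
  rho = 0.5;
  do sample = 1 to 1000;
    do i = 1 to 50;
      x = rand('normal');
      y = rho*x + sqrt(1 - rho**2)*rand('normal');   /* y correlates .5 with x */
      output;
    end;
  end;
run;

proc corr data=sim noprint outp=corrs;   /* one correlation matrix per sample */
  by sample;
  var x y;
run;

data rvals;                              /* keep just the x-y correlation */
  set corrs;
  if _TYPE_ = 'CORR' and _NAME_ = 'x';
  r = y;
  keep sample r;
run;

proc means data=rvals mean std min max;  /* mean should be near .5 */
  var r;
run;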
Fisher’s r to z Transformation
$z = .5\,\ln\!\left(\frac{1 + r}{1 - r}\right)$

r     z
.10   .10
.20   .20
.30   .31
.40   .42
.50   .55
.60   .69
.70   .87
.80   1.10
.90   1.47

[Plot: z (output) as a function of r (sample value input)]

The sampling distribution of z is approximately normal and becomes more nearly normal as N increases. The transformation stretches the short tail of the distribution of r, producing a better (more normal) distribution. The sampling variance of z is 1/(N - 3), which does not depend on ρ.
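The r-to-z table above can be reproduced with a short SAS data step (sketch):

data fisherz;                          /* z = .5*ln((1+r)/(1-r)) for r = .1 to .9 */
  do r = 0.1 to 0.9 by 0.1;
    z = 0.5*log((1 + r)/(1 - r));
    output;
  end;
run;

proc print data=fisherz noobs; run;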
Hypothesis test 1: $H_0: \rho = 0$

$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$   The result is compared to t with (N - 2) df for significance.

Say r = .25, N = 100:

$t = \frac{.25\sqrt{98}}{\sqrt{1 - .25^2}} = \frac{.25(9.899)}{.968} = 2.56$, p < .05

t(.05, 98) = 1.984.
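A SAS data-step sketch of this test (values from the example above):

data test1;                            /* H0: rho = 0 */
  r = 0.25; n = 100;
  t = r*sqrt(n - 2)/sqrt(1 - r**2);    /* about 2.56 */
  p = 2*(1 - probt(abs(t), n - 2));    /* two-tailed p value */
  tcrit = tinv(0.975, n - 2);          /* about 1.984 */
run;

proc print data=test1 noobs; run;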
Hypothesis test 2: $H_0: \rho = $ a specified value

$z = \frac{.5\ln\left(\frac{1 + r}{1 - r}\right) - .5\ln\left(\frac{1 + \rho}{1 - \rho}\right)}{\sqrt{1/(N - 3)}}$

A one-sample z test where r is the sample value and ρ is the hypothesized population value.

Say N = 200, r = .54, and ρ is .30:

$z = \frac{.5\ln\left(\frac{1.54}{.46}\right) - .5\ln\left(\frac{1.30}{.70}\right)}{\sqrt{1/(200 - 3)}} = \frac{.60 - .31}{.07} = 4.13$

Compare to the unit normal: 4.13 > 1.96, so it is significant. Our sample was not drawn from a population in which ρ is .30.
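A SAS data-step sketch of this one-sample z test (values from the example above):

data test2;                                  /* H0: rho = .30 with N = 200, r = .54 */
  r = 0.54; rho0 = 0.30; n = 200;
  zr   = 0.5*log((1 + r)/(1 - r));           /* Fisher z of the sample r   */
  zrho = 0.5*log((1 + rho0)/(1 - rho0));     /* Fisher z of the null value */
  z = (zr - zrho)/sqrt(1/(n - 3));           /* about 4.13                 */
  p = 2*(1 - probnorm(abs(z)));              /* two-tailed p value         */
run;

proc print data=test2 noobs; run;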
Hypothesis test 3: $H_0: \rho_1 = \rho_2$

Testing the equality of correlations from 2 INDEPENDENT samples.

$z = \frac{.5\ln\left(\frac{1 + r_1}{1 - r_1}\right) - .5\ln\left(\frac{1 + r_2}{1 - r_2}\right)}{\sqrt{1/(N_1 - 3) + 1/(N_2 - 3)}}$

Say N1 = 150, r1 = .63, N2 = 175, r2 = .70:

$z = \frac{.5\ln\left(\frac{1.63}{.37}\right) - .5\ln\left(\frac{1.70}{.30}\right)}{\sqrt{1/(150 - 3) + 1/(175 - 3)}} = \frac{.74 - .87}{.11} = -1.18$, n.s.
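A SAS data-step sketch of the two-sample test (values from the example above):

data test3;                            /* H0: rho1 = rho2, independent samples */
  r1 = 0.63; n1 = 150;
  r2 = 0.70; n2 = 175;
  z1 = 0.5*log((1 + r1)/(1 - r1));
  z2 = 0.5*log((1 + r2)/(1 - r2));
  z  = (z1 - z2)/sqrt(1/(n1 - 3) + 1/(n2 - 3));   /* about -1.1 with unrounded z's
                                                      (the slide's -1.18 uses rounded
                                                      values); n.s. either way */
  p  = 2*(1 - probnorm(abs(z)));
run;

proc print data=test3 noobs; run;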
Hypothesis test 4: $H_0: \rho_1 = \rho_2 = \ldots = \rho_k$

Testing the equality of any number of independent correlations.

$\bar{z} = \frac{\sum_{i=1}^{k}(n_i - 3)\,z_i}{\sum_i (n_i - 3)}$    $Q = \sum_i (n_i - 3)(z_i - \bar{z})^2$

Compare Q to chi-square with k - 1 df.

Study   r    n     z     (n-3)z   zbar   (z-zbar)^2   (n-3)(z-zbar)^2
1       .2   200   .20    39.94   .41    .0441         8.69
2       .5   150   .55    80.75   .41    .0196         2.88
3       .6    75   .69    49.91   .41    .0784         5.64
Sum          425          170.6                        17.21 = Q

Chi-square at .05 with 2 df = 5.99. Not all the ρ's are equal.
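A SAS data-step sketch that reproduces z-bar and Q for the three studies in the table:

data qtest;
  array rr[3] _temporary_ (0.2 0.5 0.6);     /* r's from the three studies */
  array nn[3] _temporary_ (200 150  75);     /* sample sizes               */
  sumwz = 0; sumw = 0; Q = 0;
  do i = 1 to 3;
    z = 0.5*log((1 + rr[i])/(1 - rr[i]));    /* Fisher z for each study    */
    sumwz = sumwz + (nn[i] - 3)*z;
    sumw  = sumw  + (nn[i] - 3);
  end;
  zbar = sumwz/sumw;                         /* weighted mean z, about .41 */
  do i = 1 to 3;
    z = 0.5*log((1 + rr[i])/(1 - rr[i]));
    Q = Q + (nn[i] - 3)*(z - zbar)**2;       /* Q about 17                 */
  end;
  p = 1 - probchi(Q, 2);                     /* chi-square with k-1 = 2 df */
  keep zbar Q p;
run;

proc print data=qtest noobs; run;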
Hypothesis test 5: dependent r's

$H_0: \rho_{12} = \rho_{13}$    (Hotelling-Williams test)

$t(N - 3) = (r_{12} - r_{13})\sqrt{\frac{(N - 1)(1 + r_{23})}{2\frac{N - 1}{N - 3}|R| + \bar{r}^2(1 - r_{23})^3}}$

where $\bar{r} = (r_{12} + r_{13})/2$ and $|R| = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2(r_{12})(r_{13})(r_{23})$.

Say N = 101, r12 = .4, r13 = .6, r23 = .3:

$\bar{r} = (.4 + .6)/2 = .5$

$|R| = 1 - .4^2 - .6^2 - .3^2 + 2(.4)(.6)(.3) = .534$

$t(98) = (.4 - .6)\sqrt{\frac{(100)(1 + .3)}{2\frac{100}{98}(.534) + .5^2(1 - .3)^3}} = -2.1$

t(.05, 98) = 1.98, so the two dependent correlations differ significantly.
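A SAS data-step sketch of the Hotelling-Williams computation (values from the example above):

data hwtest;                                 /* Hotelling-Williams, dependent r's */
  n = 101; r12 = 0.4; r13 = 0.6; r23 = 0.3;
  rbar = (r12 + r13)/2;
  detR = 1 - r12**2 - r13**2 - r23**2 + 2*r12*r13*r23;    /* |R| = .534 */
  t = (r12 - r13)*sqrt( ((n - 1)*(1 + r23)) /
        ( 2*((n - 1)/(n - 3))*detR + rbar**2*(1 - r23)**3 ) );   /* about -2.1 */
  p = 2*(1 - probt(abs(t), n - 3));          /* compare to t with N-3 df */
run;

proc print data=hwtest noobs; run;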
$H_0: \rho_{12} = \rho_{34}$: see my notes.
Review
• What is the purpose of the Fisher r to z
transformation?
• Test the hypothesis that $\rho_1 = \rho_2$
  – given that r1 = .50, N1 = 103,
  – and r2 = .60, N2 = 128, and the samples are independent.
• Why do we care about the sampling
distribution of the correlation
coefficient?
Range Restriction/Enhancement
Range restriction (sampling only a narrow slice of X or Y) typically shrinks r toward zero; range enhancement (e.g., sampling only the extremes) typically inflates r relative to its value in the full range of scores.
Reliability
Reliability sets the ceiling for validity. Measurement error
attenuates correlations.
$\rho_{XY} = \rho_{T_X T_Y}\sqrt{\rho_{XX'}\,\rho_{YY'}}$

If the correlation between true scores is .7 and the reliabilities of X and Y are both .8, the observed correlation is .7·√(.8·.8) = .7(.8) = .56.
Disattenuated correlation
$\rho_{T_X T_Y} = \rho_{XY}\,/\,\sqrt{\rho_{XX'}\,\rho_{YY'}}$

If our observed correlation is .56 and the reliabilities of both X and Y are .8, our estimate of the correlation between true scores is .56/.8 = .70.
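A tiny SAS sketch of the attenuation and disattenuation formulas (reliabilities of .8 and a true-score correlation of .7, as in the examples above):

data reliab;
  rel_x = 0.8; rel_y = 0.8;
  r_true = 0.7;
  r_obs  = r_true*sqrt(rel_x*rel_y);   /* attenuated:    .56 */
  r_hat  = r_obs/sqrt(rel_x*rel_y);    /* disattenuated: .70 */
run;

proc print data=reliab noobs; run;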
Review
• What is range restriction? Range
enhancement? What do they do to r?
• What is the effect of reliability on r?
SAS Power Estimation
proc power;
  onecorr dist=fisherz
    corr = 0.35
    nullcorr = 0.2
    sides = 1
    ntotal = 100
    power = .;
run;

Computed Power:
  Actual alpha = .05
  Power = .486

proc power;
  onecorr
    corr = 0.35
    nullcorr = 0
    sides = 2
    ntotal = .
    power = .8;
run;

Computed N Total:
  Alpha = .05
  Actual Power = .801
  Ntotal = 61
Power for Correlations
Rho   N required (against null: rho = 0)
.10   782
.15   346
.20   193
.25   123
.30    84
.35    61
Sample sizes required for powerful conventional
significance tests for typical values of the correlation
coefficient in psychology. Power = .8, two tails,
alpha is .05.
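PROC POWER accepts a list of values for corr=, so something like the following sketch should reproduce the table in one run:

proc power;
  onecorr
    corr = 0.10 0.15 0.20 0.25 0.30 0.35   /* typical values of rho in psychology */
    nullcorr = 0
    sides = 2
    ntotal = .
    power = 0.8;
run;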