Correlation & Linear Regression Guide

This document discusses statistical methods for correlation and simple linear regression. It provides information on calculating Pearson's correlation coefficient and interpreting the results. Examples are given to demonstrate how to carry out correlation analysis in SPSS and interpret the output. Caveats about correlation are also outlined, including that correlation does not imply causation and the importance of checking assumptions.


community project

encouraging academics to share statistics support resources


All stcp resources are released under a Creative Commons licence

Statistical Methods
11. Correlation and
Simple Linear Regression

Based on materials provided by Coventry University and Loughborough University under a National HE STEM Programme Practice Transfer Adopters grant

Peter Samuels, Birmingham City University
Reviewer: Ellen Marshall, University of Sheffield
Workshop outline
We will consider:
 Correlation coefficients:
 Pearson’s correlation
 Spearman’s rank correlation
 Simple linear regression
 The importance of outliers and
residuals

Correlation
 Correlation is a measure of the strength of
the linear association between two scale
variables
 Correlation is often used inappropriately
when "association" is meant
 The correct terminology is “correlation
coefficient”
 We usually calculate Pearson’s correlation coefficient

Pearson’s correlation
 Symbols used for the correlation coefficient
 ρ ('rho') is used for the population value
 r is used for its estimate
 ρ and r are always between -1 and +1
 Positive ρ or r implies y increases as x increases
 Negative ρ or r implies y decreases as x increases
 The correlation coefficient is sensitive to outliers. Look at a scatter plot using the ‘Graphs’ menu in SPSS

Correlation coefficient: examples
 The sign of r depends on the direction of the fitted line (upward or downward), not on how steep it is
 The magnitude of r depends on how closely the data points are aligned (all on a line means the coefficient is +1 or -1)
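The two points above can be checked numerically. The following is a minimal sketch, not part of the workshop materials: the helper `pearson_r` and all data values are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
steep_up = [10 * v + 3 for v in x]     # exact line, slope +10
shallow_up = [0.1 * v + 3 for v in x]  # exact line, slope +0.1
down = [20 - 2 * v for v in x]         # exact line, slope -2
noisy_up = [2, 1, 4, 3, 6]             # upward trend with scatter

r_steep = pearson_r(x, steep_up)
r_shallow = pearson_r(x, shallow_up)
r_down = pearson_r(x, down)
r_noisy = pearson_r(x, noisy_up)

# Steep and shallow lines both give r = +1: steepness does not matter.
# The downward line gives r = -1; scatter pulls |r| below 1.
print(r_steep, r_shallow, r_down, round(r_noisy, 3))
```

Both exact upward lines give the same coefficient despite a hundredfold difference in slope, which is the sense in which r reflects direction and alignment rather than steepness.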
Activity
Estimate r and describe the correlation in each
diagram
[Figure: four scatter plots, each with a horizontal axis from 0 to 6 and a vertical axis from 0 to 25]
Strong negative correlation: r = -0.8
Weak positive correlation: r = +0.4
No correlation: r = 0
Perfect negative correlation: r = -1

Interpretation of r

|r| Interpretation
0.1 Small
0.3 Medium
0.5 Large

Source: Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Erlbaum.
Assumptions
1. Both variables are normally distributed.
2. The relationship between the two variables is linear.
3. The relationship between the two variables is homoscedastic (i.e. the variance of one variable is the same for all the values of the other variable).
We can test 2 and 3 by looking at the scatter plot and observing whether the data points form a “roughly symmetrical, cigar-shaped pattern” about the regression line.
If these assumptions or robust exceptions (see later) are not met, we should use Spearman’s rank correlation (see later).
Example 1: Forest measurement
A study of the relationship
between basal growth and
crown volume of 62 trees is
reported by Avery and
Burkhart (1994):
 Basal Growth is the change
in cross sectional area (in
square feet) at chest height in
one year
 Crown Volume is the
increase in the total volume (in
cubic feet) of the tree above
the first branch
Set up the data file
 Open the Excel file BasalCrownData.xlsx associated with this presentation
 Copy the data into an SPSS data window
 Set up the variable names as in the Excel file
 Set the measures to ordinal, scale and scale
respectively
 Set the number of decimal places to 0, 2 and 0
respectively
 Save the file as BasalCrown.sav

Create a scatter plot
 Select Graphs – Chart
Builder
 Select Scatter/Dot from
the Choose from: menu
on the Gallery tab
 Click and drag the first
scatter plot into the chart
preview area
 Click and drag
BasalGrowth onto the x-
axis and CrownVolume
onto the y-axis

[Figure: scatter plot of CrownVolume against BasalGrowth]
The dispersion of points is ‘cigar shaped’.
The data values appear to meet the assumptions for a correlation analysis.

Carry out the correlation analysis
 Select Analyze > Correlate > Bivariate
 Select the two variables
 The default correlation analysis is Pearson
 The default significance is 2-tailed

 Returns a correlation coefficient r of 0.871
 Analysis is repeated because it is comparing each variable on the list with each other in turn – generally look at the cells below the leading diagonal when there are more than 2 variables
 Significance level is actually 0.001, not 0.01, as indicated by the footnote, because the p-value of ‘0.000’ actually means < 0.0005, so it is clearly also < 0.001
 The null hypothesis is ρ = 0
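To see why the p-value is so small, here is a rough sketch of the usual t test for H0: ρ = 0. It uses r = 0.871 and n = 62 from this example. The critical value is copied from a standard t table rather than computed, and the sketch is only assumed to match the test SPSS applies.

```python
import math

r = 0.871  # Pearson coefficient reported in the SPSS output
n = 62     # number of trees in the study

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Two-tailed critical value for alpha = 0.001 and df = 60,
# taken from a standard t table (assumption: not computed here)
t_crit_0_001 = 3.460

print(f"t = {t_stat:.2f} on {df} degrees of freedom")
print("significant at the 0.001 level:", t_stat > t_crit_0_001)
```

The statistic is far beyond the tabulated critical value, which is consistent with SPSS displaying the p-value as ‘0.000’.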

Correlation caveats (1)
 Correlation does not imply causation:
 Both x and y could be influenced by z
 E.g. there is a positive correlation between wearing
a waistcoat watch (x) and heart disease (y) because
of the influence of wealth and diet (z)
 With large samples even small correlation
coefficients can be statistically significant:
 Think about what would be of practical importance
 Beware of outliers!
 Correlation coefficients are sensitive to outliers
 However, outliers should never be removed without
a valid reason being given
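The sensitivity to outliers can be illustrated with a small sketch. The data are invented for the purpose and the helper `pearson_r` is hypothetical, not part of SPSS or the workshop files.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Six points with no clear linear trend (invented values)
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 2, 3, 1, 2]
r_without = pearson_r(x, y)

# The same data plus a single extreme point at (20, 20)
r_with = pearson_r(x + [20], y + [20])

print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
```

One added point turns a weak negative coefficient into a very strong positive one, which is why the scatter plot should be inspected before r is interpreted.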
Correlation caveats (2)
r = 0 indicates no linear association:
 Low absolute values of r do not necessarily
mean that the variables are not related – any
relationship may be non-linear
 r² (a.k.a. R²) indicates the amount of variability ‘explained’ by the relationship between the two variables
 r = 0.7 gives r² = 0.49, i.e. only 49% of the variability is explained by the relationship
 Different absolute values of r are meaningful
in different contexts
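Two of these caveats can be checked with a short sketch. The data are invented (an exact quadratic relationship) and the helper `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# y is completely determined by x, but the relationship is not linear
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
r_quadratic = pearson_r(x, y)
print("r for y = x squared:", r_quadratic)

# Proportion of variability 'explained' when r = 0.7
r = 0.7
r_squared = r ** 2
print(f"r = {r} gives r squared = {r_squared:.2f}")
```

Here r is zero even though y depends entirely on x, so a low |r| rules out only a linear association.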
Robustness exceptions
 Correlation calculations are not robust to violations of homoscedasticity – the data could be transformed in this case
 The hypothesis test for ρ = 0 is robust to extreme violations of normality. However, Spearman’s rank correlation (see later) is sometimes a more powerful test.
 Interpretations of the value of r can be completely meaningless if the joint distribution of the two variables is too different from a binormal distribution
References:
Asuero, A., Sayago, A. & González, A. (2006) The Correlation Coefficient: An Overview, Critical Reviews in Analytical Chemistry, 36: 41–59
Fowler, R. L. (1987) Power and robustness in product-moment correlation, Applied Psychological Measurement, 11(4): 419-428
Example 2: Advertising CDs
A record company decided to advertise 200 different CDs and measure the sales of the CDs the week after they were advertised (in thousands) against the amount spent on advertising (thousands of pounds).

Source: (Field, 2013: Section 8.3)

Set up the data file
 Open the Excel file CDSalesData.xlsx
associated with this presentation
 Copy the data into an SPSS data window
 Set up the variable names
 Set the measures to scale and scale
 Set the number of decimal places to 2 and 0
 Save the file as CDSales.sav
 Create a scatter plot of Sales against Adverts

 Distribution is clearly heteroscedastic
 But variables are clearly associated
 Need to use Spearman’s rank correlation


Spearman’s rank correlation
 Similar to Pearson’s correlation but can be applied to ordinal data as well as scale data
 Measures how consistently one variable increases or decreases as a second variable increases (monotonic)
 Represented by the symbol r_s
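One common way to compute Spearman's coefficient is to rank each variable and apply Pearson's formula to the ranks. The sketch below assumes that approach, with tied values sharing their average rank. The data and helper names are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# A monotonic but strongly curved relationship (invented data)
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]

r_pearson = pearson_r(x, y)
rs = spearman_rs(x, y)
print(f"Pearson r = {r_pearson:.3f}, Spearman rs = {rs:.3f}")
```

Because y always increases with x, the rank coefficient is exactly 1 even though the points do not lie on a straight line.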

Spearman’s rank in SPSS
 Select Analyze – Correlate –
Bivariate…
 Select both variables
 Select Spearman for the
correlation coefficients

Returns a Spearman correlation coefficient of 0.554
Probability < 0.0005 so the association is significant at the 0.001 level

Regression analysis
 Uses data to build a statistical model to
describe the relationship between different
quantities or variables
 Simple linear regression describes a linear relationship between two scale (continuous) variables:
 The x variable is the independent, or predictor,
variable
 The y variable is the dependent, or outcome,
variable

Simple linear regression
The model is: y = b0 + b1x
Where:
 b0 is the intercept or constant term
 b1 is the slope of the line
 b0 and b1 are known as coefficients
 (You may be more familiar with the notation
y = a + bx or y = mx + c)
 Linear regression fits the ‘best’ straight line to the
data
 There are different ways of defining ‘best’
 The most common method is called least squares
Simple linear regression – graphical view
[Figure: the line y = b0 + b1x plotted against x]
 b0 is the intercept of the line: y = b0 when x = 0
 b1 is the slope of the line: a one unit change in x gives a change of b1 units in y
Residuals
 For “real” data sets there will always be a
difference between what we observe and
what our model predicts
 We adjust for this difference by adding an
error term in the model:
y = b0 + b1x + e
 Residuals are then defined as:
e = y - (b0 + b1x)
= observed value - predicted value
= error
Fitting a regression line
[Figure: five data points y1 to y5 scattered about a fitted line, with residuals e1 to e5 marked between each point and the line]
yi = b0 + b1xi + ei
Least squares linear regression
 Regression analysis fits a line through the
data using the method of least squares
 The least squares method minimises the
sum of squares of the vertical distances of
the observed data from the fitted line
 I.e. least squares minimises the sum of the squared residuals
 This is like drawing squares next to each
residual and minimising the sum of the area
of these squares
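The least squares idea can be sketched using the standard closed-form estimates b1 = Sxy / Sxx and b0 = mean(y) - b1 × mean(x). The data are invented and `sse` is a hypothetical helper name.

```python
# Invented data lying close to a straight line
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least squares estimates
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sxy / sxx              # slope
b0 = mean_y - b1 * mean_x   # intercept

def sse(intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"sum of squared residuals at the fitted line: {sse(b0, b1):.3f}")
# Nudging either coefficient away from the estimate increases the sum
print(f"intercept shifted by 0.1: {sse(b0 + 0.1, b1):.3f}")
print(f"slope shifted by 0.1:     {sse(b0, b1 + 0.1):.3f}")
```

Any other line through the same data gives a larger total area of squares, which is what ‘best’ means under least squares.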

Least squares linear regression
From http://cast.massey.ac.nz/core/index.html?book=general Section 3.4.4

Model assumptions
1. The mean response (y - e) varies linearly with
predictor (x)
2. The unexplained variation (e) is normally and
independently distributed with constant variance (i.e.
independent of x, or it is homoscedastic)

These are both shown by a ‘cigar shaped’ scatter plot around a straight line

Activity
 Fit a least squares linear regression model to
the BasalCrown data set:
CrownVolume = b0 + b1 × BasalGrowth

 Check the model for significance


 What are the estimated values of b0 and b1?
 Write out the model

With the file BasalCrown.sav:
 Select Analyze > Regression > Linear
 Choose CrownVolume as the dependent variable and BasalGrowth as the independent variable

b0 = -5.452
b1 = 127.273
The constant coefficient (b0) estimate is not significant (H0: b0 = 0) – this is normally ignored
The BasalGrowth coefficient (b1) estimate is highly significant (p < 0.001) (H0: b1 = 0)

Therefore the model is:
CrownVolume = -5.452 + 127.273 × BasalGrowth
The fitted model
CrownVolume = -5.452 + 127.273 × BasalGrowth

Confidence intervals for b0 and b1
 Redo the analysis
 Select Statistics…
 Select Confidence intervals
 Gives upper and lower
bounds for the confidence
intervals for the two
coefficients
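As a rough sketch of where such bounds come from, the usual 95% interval is estimate ± t × standard error, with n - 2 degrees of freedom. The data below are invented (not the BasalCrown data), and the t value for 3 degrees of freedom is copied from a standard t table rather than computed.

```python
import math

# Invented data (not the BasalCrown data set)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residual mean square on n - 2 degrees of freedom
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Standard errors of the two coefficients
se_b1 = math.sqrt(mse / sxx)
se_b0 = math.sqrt(mse * (1 / n + mean_x ** 2 / sxx))

# 97.5th percentile of t with 3 degrees of freedom, from a t table
# (assumption: hard-coded for this sample size only)
t_crit = 3.182

ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)

print(f"b1 = {b1:.2f}, 95% CI ({ci_b1[0]:.2f}, {ci_b1[1]:.2f})")
print(f"b0 = {b0:.2f}, 95% CI ({ci_b0[0]:.2f}, {ci_b0[1]:.2f})")
```

In this invented data set the interval for b0 includes zero while the interval for b1 does not, a pattern similar to the significance results reported for the BasalCrown model.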

 Red line: original fitted model
 Blue lines: model with upper and lower confidence intervals for b0
 Green lines: model with upper and lower confidence intervals for b1
 Purple lines: model with upper and lower confidence intervals for both b0 and b1
Note: b0 and b1 are estimated from the sample so the confidence intervals do not relate to the population
Simulation of confidence interval of b1

See http://cast.massey.ac.nz/core/index.html?book=general Section 12.2.6

The process of regression analysis
Step 1: Get to know your data
 Create a scatter plot, calculate descriptive statistics and
look for outliers
Step 2: Formulate a model
 Based on examination of the results of Step 1,
hypothesize a model that might explain the data
relationships
Step 3: Fit the model to the data
 Examine the regression coefficients
Step 4: Check the fit of the model
 Coming next
Step 5: Report, interpret, and apply the model
Step 4: Check the model fit
A. Calculate the adjusted R square coefficient
B. Check the significance of the ANOVA model
C. Check the model assumptions, including
robustness assumptions

Check the model fit:
A. Adjusted R square
 R square (R²) is the percentage of variation in y explained by the regression on x
 Adjusted R Square (R²adj) is the percentage of variation adjusted for the sample size and the number of coefficients in the regression model

BasalGrowth predicts 75.5% of the variation of CrownVolume
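The 75.5% figure can be reproduced from values given earlier, assuming the usual adjustment formula R²adj = 1 - (1 - R²)(n - 1)/(n - p - 1), with n = 62 trees, p = 1 predictor and r = 0.871 from the correlation analysis. This is a sketch of the arithmetic only; it is assumed, not confirmed, that SPSS uses the same formula.

```python
r = 0.871  # Pearson coefficient from the correlation analysis
n = 62     # number of trees
p = 1      # number of predictors in simple linear regression

# In simple linear regression, R squared equals r squared
r_squared = r ** 2
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(f"R squared          = {r_squared:.3f}")
print(f"Adjusted R squared = {adj_r_squared:.3f}")
```

With a sample of 62 and a single predictor the adjustment is small: about 76% unadjusted against 75.5% adjusted.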

Check the model fit:
B: Check the ANOVA model for
significance
 Generated automatically
 This one is fine (p < 0.001)

Check the model fit:
C. Check the assumptions
 The regression assumptions are checked
through analysis of the residuals
 The analysis of residuals is a subjective
process of examining:
 Standardised residuals v. standardised predicted
values
 Normal probability plot
 If the model fails then also check the
robustness assumptions (equivalent to
normal probability plot fit not being too bad)
 Select Analyze >
Regression > Linear
 Choose the dependent and
independent variable as
before
 Select Plots…
 Select *ZRESID for the Y
variable (standardised
residual)
 Select *ZPRED for the X
variable (standardised
predicted value)
 Select Normal probability
plot
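Outside SPSS, the quantities behind these two plots can be sketched as follows. The definitions are assumptions about what *ZRESID and *ZPRED represent: residuals divided by the residual standard error, and predicted values converted to z-scores. The probability plot is a quantile-quantile version that may differ in detail from the plot SPSS draws. The data are invented.

```python
import math
from statistics import NormalDist, mean, stdev

# Invented data (not the BasalCrown data set)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1]

n = len(x)
mean_x, mean_y = mean(x), mean(y)
b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
      / sum((a - mean_x) ** 2 for a in x))
b0 = mean_y - b1 * mean_x

predicted = [b0 + b1 * a for a in x]
residuals = [b - p for b, p in zip(y, predicted)]

# Standardised residuals: residual / residual standard error (n - 2 df)
resid_se = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
z_resid = [e / resid_se for e in residuals]

# Standardised predicted values: predicted values as z-scores
mean_pred = mean(predicted)
sd_pred = stdev(predicted)
z_pred = [(p - mean_pred) / sd_pred for p in predicted]

# Points for a residual plot: (z_pred, z_resid) pairs
residual_plot = list(zip(z_pred, z_resid))

# Points for a normal probability plot: normal quantiles against
# the sorted standardised residuals
normal_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
probability_plot = list(zip(normal_q, sorted(z_resid)))

for zp, zr in residual_plot:
    print(f"{zp:6.2f} {zr:6.2f}")
```

Plotting `residual_plot` should show no pattern if the model is adequate, and `probability_plot` should lie close to a straight line if the residuals are roughly normal.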

Standardised residuals v.
standardised predicted values
 These should be scattered randomly
 Any discernible pattern (such as a ‘U’ shape) indicates a problem, e.g. not linear
 This one is fine

Normal probability plot
 Normality is indicated by a roughly linear plot
 Any strong systematic curvature suggests some degree of non-normality
 This one is fine

Robustness exceptions
 Homoscedasticity is mandatory
 Linearity is mandatory
 “Normality is not necessary for the least-squares
fitting of the regression model but it is required in
general for inference making” (e.g. calculating the
p-values and confidence intervals of b0 and b1)
“only extreme departures of the distribution of Y
from normality yield spurious results”
Source: Kleinbaum, D., Kupper, L., Muller, K. & Nizam, A. (1998) Applied Regression Analysis and Multivariable Methods. 3rd ed. Pacific Grove, CA: Duxbury, p. 117
Step 5: Report, Interpret and
Apply
 Report the results of your work in the
appropriate context
 Interpret the model
 Explain the meaning of the coefficients in
practical terms
 Apply the model
 Where appropriate, use the model for prediction

Don’t extrapolate your model too far beyond the range of your data


Application to our example
Report:
 From the regression analysis output, basal growth explains 76%
of the variation in crown volume increase (ANOVA model
significant at 0.001 level)
 The model is:
CrownVolume = -5.45 + 127.27 × BasalGrowth
Interpret:
 For every one square foot increase in basal growth, the predicted increase in crown volume rises by 127.3 cubic feet.
Apply:
 We may use the model to predict future values of crown growth:
For a BasalGrowth of 0.43 square feet,
CrownVolume = -5.45 + 127.27 × 0.43 = 49.3 cubic feet increase
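The prediction step can be written as a short sketch using the rounded coefficients reported above. The function name `predict_crown_volume` is made up for illustration.

```python
def predict_crown_volume(basal_growth):
    """Predicted crown volume increase (cubic feet) from the fitted model.

    basal_growth: change in cross-sectional area in square feet.
    """
    b0 = -5.45    # intercept from the regression output (rounded)
    b1 = 127.27   # slope from the regression output (rounded)
    return b0 + b1 * basal_growth

prediction = predict_crown_volume(0.43)
print(f"{prediction:.1f} cubic feet")  # prints 49.3 cubic feet

# The slope is the change in prediction per extra square foot
slope_check = predict_crown_volume(1.43) - predict_crown_volume(0.43)
```

The function applies the fitted line to any input, so it is only meaningful for basal growth values within the range observed in the sample.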
Regression caveats
 Always plot your data first (Step 1)
 Don’t infer that x "causes" y
 Be cautious about predicting beyond the range
of x (extrapolation)
 Beware of outliers
 Examine plots of residuals carefully

Recap
We have discussed:
 Meaning and computation of Pearson and
Spearman correlation coefficients
 Pearson correlation assumptions (including
robust exceptions)
 Simple linear models in regression
 The process of simple linear regression analysis
 Simple linear regression assumptions (including
robust exceptions)
 Using residuals to check regression assumptions

Bibliography
Avery, T. & Burkhart, H. (1994) Forest Measurements. 5th ed. New York: McGraw-Hill.
Abraham, B. & Ledolter, J. (2006) Introduction to Regression Modelling. Belmont, CA: Thomson Brooks/Cole.
Field, A. (2013) Discovering Statistics using SPSS: (And sex and drugs and rock 'n' roll), 4th ed., London: SAGE, Sections 8.1 - 8.4.
statstutor (n.d.) Pearson Correlation Coefficient resources. Available at: http://www.statstutor.ac.uk/topics/correlation/pearsons-correlation-coefficient/ [Accessed 8/01/14].
statstutor (n.d.) Simple Linear Regression resources. Available at: http://www.statstutor.ac.uk/topics/regression-and-model-building/simple-linear-regression/ [Accessed 8/01/14].
statstutor (n.d.) Spearmans Correlation Coefficient resources. Available at: http://www.statstutor.ac.uk/topics/correlation/spearmans-correlation-coefficient/ [Accessed 8/01/14].
Stirling, W. D. (2013) Welcome to the General CAST e-book. Available at: http://cast.massey.ac.nz/core/index.html?book=general [Accessed 8/01/14], Sections 3 and 12.