Correlation and Regression
Libeeth B. Guevarra
Department of Mathematics and Natural
Sciences
August 31, 2018
Data Management 1
Correlation is a statistical method used to
determine whether a relationship between
variables exists.
Regression is a statistical method used to
describe the nature of the relationship between
variables, that is, positive or negative, linear or
nonlinear.
A scatter plot is a graph of the ordered pairs
(x, y) of numbers consisting of the independent
variable x and the dependent variable y.
Example
Construct a scatter plot for the data shown for
car rental companies in City A for a recent year.
Company   Cars (in ten thousands)   Revenue (in billions)
A         63.0                      7.0
B         29.0                      3.9
C         20.8                      2.1
D         19.1                      2.8
E         13.4                      1.4
F         8.5                       1.5
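As a rough illustration of the example above, the sketch below places the six (cars, revenue) pairs on a character grid using only the Python standard library. In practice a plotting library would normally be used. The function name `text_scatter` and the grid size are assumptions made for this sketch, not part of the original slides.

```python
# Data from the example: x = cars (ten thousands), y = revenue (billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def text_scatter(xs, ys, width=40, height=10):
    """Return a crude character-grid scatter plot as a list of strings.

    Hypothetical helper for illustration only.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the grid; x goes right, y goes up.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    return ["|" + "".join(line) for line in grid] + ["+" + "-" * width]


plot_rows = text_scatter(cars, revenue)
print("\n".join(plot_rows))
```

The points rise from the lower left to the upper right, which suggests a positive relationship between the number of cars and revenue.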
The correlation coefficient measures the
strength and direction of a linear relationship
between two variables.
The range of the correlation coefficient is from
−1 to +1.
Formula for the Correlation Coefficient r
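$$r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}$$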
where n is the number of data pairs.
Example
Compute the correlation coefficient for the data:
Company   Cars (in ten thousands)   Revenue (in billions)
A         63.0                      7.0
B         29.0                      3.9
C         20.8                      2.1
D         19.1                      2.8
E         13.4                      1.4
F         8.5                       1.5
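One way to carry out this computation is sketched below. It applies the formula for r term by term to the six data pairs. The function and variable names are illustrative assumptions, not taken from the slides.

```python
from math import sqrt

# Data from the example: x = cars (ten thousands), y = revenue (billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def correlation_coefficient(xs, ys):
    """Correlation coefficient r computed from the raw sums in the formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation_coefficient(cars, revenue)
print(f"r = {r:.3f}")
```

For this data the sums are Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, and Σy² = 80.67, giving r of about 0.982, a strong positive linear relationship.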
If the value of the correlation coefficient is
significant, the next step is to determine the
equation of the regression line, which is the
data’s line of best fit.
This enables the researcher to see the trend
and make predictions on the basis of the data.
The equation of the least-squares line for the
ordered pairs (x1, y1), (x2, y2), ..., (xn, yn) is the
line
y − ȳ = m(x − x̄)
where:
x̄ = mean of variable x
ȳ = mean of variable y
m = slope of the line, given by
$$m = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$$
Example
Find the equation of the regression line for the
data
Company   Cars (in ten thousands)   Revenue (in billions)
A         63.0                      7.0
B         29.0                      3.9
C         20.8                      2.1
D         19.1                      2.8
E         13.4                      1.4
F         8.5                       1.5
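A minimal sketch of this example, using the point-slope form y − ȳ = m(x − x̄) and the slope formula based on the means. The names `x_bar`, `y_bar`, `slope`, and `predict` are assumptions chosen for this illustration.

```python
# Data from the example: x = cars (ten thousands), y = revenue (billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]

n = len(cars)
x_bar = sum(cars) / n      # mean of x
y_bar = sum(revenue) / n   # mean of y

# Slope: m = (sum(xy) - n*x_bar*y_bar) / (sum(x^2) - n*x_bar^2)
sum_xy = sum(x * y for x, y in zip(cars, revenue))
sum_x2 = sum(x * x for x in cars)
slope = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)


def predict(x):
    """Point on the least-squares line: y = y_bar + m * (x - x_bar)."""
    return y_bar + slope * (x - x_bar)


print(f"x_bar = {x_bar:.3f}, y_bar = {y_bar:.3f}, m = {slope:.3f}")
print(f"intercept = {predict(0.0):.3f}")
```

With x̄ ≈ 25.633 and ȳ ≈ 3.117, the slope comes out near 0.106, so the line is roughly y = 0.396 + 0.106x.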
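Another formula for the regression line is
y = a + bx,
where
$$a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}$$
$$b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}$$
and a is the y-intercept and b is the slope of the line.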
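The intercept-slope form can be sketched in the same way. The code below applies the formulas for a and b to the car rental data used in the earlier examples. The function name `regression_line` is an assumption made for this illustration.

```python
# Data from the earlier examples: x = cars (ten thousands), y = revenue (billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def regression_line(xs, ys):
    """Return (a, b) for the line y = a + b*x using the raw-sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2  # shared by a and b
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b


a, b = regression_line(cars, revenue)
print(f"y = {a:.3f} + {b:.3f}x")
```

For this data the result is approximately y = 0.396 + 0.106x, the same line obtained from the form based on the means.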
The coefficient of determination is the
proportion of the variation in the dependent
variable that is explained by the regression line
and the independent variable. The symbol for the
coefficient of determination is r^2. If r = 0.90,
then r^2 = 0.81, which is equivalent to 81%. This
result means that 81% of the variation in the
dependent variable is accounted for by the
variation in the independent variable. The rest
of the variation, 0.19, or 19%, is unexplained.
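To make the "explained versus unexplained" split concrete, the sketch below computes the coefficient of determination for the car rental data from the earlier examples as 1 minus the ratio of unexplained to total variation. The slope is written in the algebraically equivalent deviation form, and all variable names are assumptions made for this illustration.

```python
# Data from the earlier examples: x = cars (ten thousands), y = revenue (billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]

n = len(cars)
x_bar = sum(cars) / n
y_bar = sum(revenue) / n

# Least-squares line y = intercept + slope * x (deviation form of the slope).
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(cars, revenue))
         / sum((x - x_bar) ** 2 for x in cars))
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * x for x in cars]

# Total variation of y about its mean, and the part the line leaves unexplained.
total_variation = sum((y - y_bar) ** 2 for y in revenue)
unexplained_variation = sum((y - p) ** 2 for y, p in zip(revenue, predicted))

r_squared = 1 - unexplained_variation / total_variation
print(f"r^2 = {r_squared:.3f}")
print(f"unexplained share = {1 - r_squared:.3f}")
```

For this data r^2 is about 0.964, so roughly 96.4% of the variation in revenue is accounted for by the number of cars and about 3.6% is unexplained.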