CORRELATION & REGRESSION
Prepared By
Nitin Varshney
Assistant Professor
Agricultural Statistics
CoA, NAU, Waghai.
The study of the characteristics of only one variable,
such as height, weight, age, marks, wages, etc., is known as
univariate analysis.
The study of the relationship between two variables,
such as height and weight, is known as bivariate analysis.
CORRELATION
When we study two or more variables simultaneously, we
observe that movements in one variable are accompanied by
movements in the other variable.
Example:
Husband’s age and wife’s age move together.
Scores on IQ tests move with scores in university
examinations.
Relation between income and household expenditure.
Relation between price and demand of a commodity.
Meaning of Correlation
In a bivariate distribution (study of two variables), we are
interested in finding out whether there is any correlation or
covariation between the two variables.
If a change in one variable is accompanied by a change in the
other variable, the variables are said to be correlated.
Types of Correlation
Positive and negative
Linear and Non-linear
Multiple and Partial
Positive and negative correlation
If the two variables deviate in the same direction, i.e. if an
increase (or decrease) in one results in a corresponding
increase (or decrease) in the other, the correlation is said to be
direct or positive.
Example: Correlation between
Heights & weights of a group of persons
Income & expenditure
If the two variables deviate in opposite directions, i.e. if an
increase (or decrease) in one results in a corresponding
decrease (or increase) in the other, the correlation is said to be
inverse or negative.
Example: Correlation between
Price & demand of a commodity
Volume & pressure of a perfect gas
Linear and non-linear correlation
If the ratio of change between the two variables is constant,
there is linear correlation between them. Consider the
following example:
X 2 4 6 8 10 12 14 16
Y 3 6 9 12 15 18 21 24
Here the ratio of change between the two variables is the same.
If we plot these points on a graph, we get a straight line.
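As a quick check in Python (using the X and Y values from the table above), every Y/X ratio is the same constant, so the points lie on the straight line Y = 1.5X:

```python
# X and Y values from the table above
X = [2, 4, 6, 8, 10, 12, 14, 16]
Y = [3, 6, 9, 12, 15, 18, 21, 24]

# the ratio Y/X is the same (1.5) for every pair, so the
# points (X, Y) lie on the straight line Y = 1.5 * X
ratios = [y / x for x, y in zip(X, Y)]
print(ratios)  # every entry is 1.5
```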
If the amount of change in one variable does not bear a
constant ratio to the change in the other variable, there is
curvilinear or non-linear correlation between them.
X 2 4 6 8 10 12 14 16 18 20 22 24
Y 2 6 8 12 18 24 36 44 54 67 75 89
Multiple and Partial Correlation
When there are interrelationships between many variables
and the value of one variable is influenced by many other
variables, e.g. the yield of a crop per acre (X1) may depend
upon the quality of seed (X2), fertility of soil (X3), fertilizer
used (X4), irrigation facilities (X5), weather conditions (X6),
and so on.
Whenever we are interested in studying the joint effect of
a group of variables upon a variable, the correlation
is known as multiple correlation.
The correlation between only two variables X1 and X2, while
eliminating the linear effect of the other variables, is known as
partial correlation.
SCATTER DIAGRAM
It is the simplest way of representing bivariate data
diagrammatically.
For a bivariate distribution (xi, yi); i=1, 2, …, n, if the values
of the variables X and Y are plotted along the x-axis and y-
axis respectively in the x-y plane, the diagram of dots so
obtained is known as scatter diagram.
From the scatter diagram, we can form a fairly good idea
of whether the variables are correlated or not, e.g.
If the points are very dense (very close to each other):
There is good correlation between variables
If the points are widely scattered: There is poor correlation
between variables.
Karl Pearson’s coefficient of correlation
Karl Pearson developed a formula called correlation
coefficient as a measure of intensity or degree of linear
relationship between two variables.
The correlation coefficient between two random variables X
and Y, usually denoted by r(X, Y) or rXY, is a numerical
measure of the linear relationship between the two variables.
It is defined as:
$$ r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y} $$
It provides a measure of the linear relationship between X and
Y.
If (xi, yi); i=1, 2, …, n is the bivariate distribution, then
$$ \operatorname{Cov}(X, Y) = \sigma_{XY} = E[\{X - E(X)\}\{Y - E(Y)\}] = \frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y}) $$
$$ \text{Variance } \sigma_X^2 = E\{X - E(X)\}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2 $$
$$ \text{Variance } \sigma_Y^2 = E\{Y - E(Y)\}^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2 $$
Expanding the squares and products gives the computational forms:
$$ \sigma_X^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2 = \frac{1}{n}\sum_i \left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2 $$
$$ \operatorname{Cov}(X, Y) = \frac{1}{n}\sum_i (x_i y_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}) = \frac{1}{n}\sum_i x_i y_i - \bar{x}\bar{y} $$
$$ \sigma_Y^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2 = \frac{1}{n}\sum_i y_i^2 - \bar{y}^2 $$
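These computational forms can be verified with a short Python sketch; the dataset below is made up purely for illustration and is not from the text:

```python
from math import sqrt

# illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Cov(X, Y) = (1/n) * sum(x_i * y_i) - mean_x * mean_y
cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
# sigma_X^2 = (1/n) * sum(x_i^2) - mean_x^2, similarly for Y
var_x = sum(xi ** 2 for xi in x) / n - mean_x ** 2
var_y = sum(yi ** 2 for yi in y) / n - mean_y ** 2

# r = Cov(X, Y) / (sigma_X * sigma_Y)
r = cov_xy / (sqrt(var_x) * sqrt(var_y))
print(round(r, 4))  # 0.7746
```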
PROPERTIES OF CORRELATION COEFFICIENT
Its range is -1 to +1.
r is independent of change of origin and scale.
Two independent variables are uncorrelated (r = 0); the
converse need not hold, since r = 0 does not imply independence.
Interpretation of correlation coefficient
when r=1, there is perfect positive correlation b/w variables.
when r=-1, there is perfect negative correlation b/w variables.
when r=0, there is no linear relation between the variables.
when the value of r lies between -1 and +1, it signifies that
there is some correlation between the variables.
when the value of r is close to +1 or -1, it signifies high
positive or high negative correlation between the variables.
when the value of r is close to 0, it signifies very weak
correlation between the variables.
RANK CORRELATION
This method is useful for studying qualitative characteristics
(attributes) like honesty, intelligence, color, beauty, morality,
etc.
This method is based on the ranks of the characters under study.
A group of individuals is arranged in order of merit or
proficiency with respect to two characters A and B.
Example: Suppose we want to find the relation between
intelligence (character A) and beauty (character B).
Let the ranks of the n individuals be xi for A and yi for B,
i=1, 2, 3, …, n.
The Pearsonian coefficient of correlation between the ranks xi
and yi is called the rank correlation coefficient between A and B
for that group of individuals.
SPEARMAN'S RANK CORRELATION COEFFICIENT
This method was developed by Charles Edward Spearman.
Spearman’s formula for the rank correlation coefficient is
given by
$$ \rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$
where di = xi - yi is the difference between the ranks of the
i-th individual.
The range of the rank correlation coefficient is also -1 to +1.
Q.2. In a marketing survey, the prices of tea and coffee in a town
were recorded based on quality. The data are given as follows; find
the relation between the price of tea and the price of coffee.
Price of Tea 88 90 95 70 60 75 50
Price of Coffee 120 134 150 115 110 140 100
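A worked sketch of Q.2 in Python. The document does not specify the ranking direction; ascending ranks are assumed here (the smallest price gets rank 1), which does not change ρ as long as both series use the same convention, and there are no ties in these data:

```python
def ranks(values):
    """Rank in ascending order: smallest value gets rank 1 (no ties here)."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + 1 for v in values]

tea    = [88, 90, 95, 70, 60, 75, 50]
coffee = [120, 134, 150, 115, 110, 140, 100]

rx, ry = ranks(tea), ranks(coffee)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
n = len(tea)

# Spearman's rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rho, 4))  # 6 0.8929
```

The high positive value (ρ ≈ 0.89) indicates that higher tea prices tend to go with higher coffee prices.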
REGRESSION
The term “regression” literally means “stepping back towards
the average”.
It was introduced by Sir Francis Galton.
Galton found that the offspring of abnormally tall or short
parents tend to regress or step back towards the average
population height.
Regression analysis is a mathematical measure of the
average relationship between two or more variables in
terms of the original units of the data.
In regression analysis there are two types of variables.
Dependent Variable (also called the regressed, explained, or
predicted variable): the variable whose value is influenced or
is to be predicted.
Independent Variable (also called the regressor, explanatory,
or predictor variable): the variable which influences the values
or is used for prediction.
LINEAR REGRESSION
If the variables in a bivariate distribution are related
(means variables are correlated), we will find that the
points in the scatter diagram will cluster round some
curve called the “curve of regression”.
If the curve is a straight line, it is called the “line of
regression”, and there is said to be linear regression between
the variables; otherwise the regression is curvilinear.
Linear Regression Equation
Let us suppose that in the bivariate distribution (xi, yi);
i=1, 2, 3, …, n; Y is dependent variable and X is
independent variable. Let the line of regression of Y on X
be
Y=a+bX (a, b are constants)
There are two regression lines.
If Y is the dependent variable and X is the independent variable,
it is called the line of regression of Y on X:
Y = a + byx X
where byx is the regression coefficient (slope) of the regression
line of Y on X.
If X is the dependent variable and Y is the independent variable,
it is called the line of regression of X on Y:
X = a + bxy Y
where bxy is the regression coefficient (slope) of the regression
line of X on Y.
The line of regression is the line which gives the best estimate
to the value of one variable for any specific value of the other
variable.
Thus the line of regression is the line of best fit.
It is obtained by the principle of least squares.
PRINCIPLE OF LEAST SQUARES
Let the line of regression of Y on X be
Y= a+ byx X
ei = yi - (a + byx xi) is called the error of estimate or residual
for yi.
According to the principle of least squares, we determine a and
byx so that
$$ E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b_{yx} x_i)^2 $$
is minimum.
Setting the partial derivatives of E with respect to a and byx
equal to zero gives the two normal equations for estimating a
and byx:
$$ \sum_{i=1}^{n} y_i = na + b_{yx} \sum_{i=1}^{n} x_i \qquad \text{(i)} $$
$$ \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b_{yx} \sum_{i=1}^{n} x_i^2 \qquad \text{(ii)} $$
If we divide eqn. (i) by n, we get
$$ \bar{y} = a + b_{yx}\,\bar{x} $$
Thus the line of regression of Y on X passes through the point
$(\bar{x}, \bar{y})$.
So the regression coefficient (slope) of the line of regression of Y
on X is given by
$$ b_{yx} = \frac{\operatorname{Cov}(x, y)}{V(x)} = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad a = \bar{y} - b_{yx}\,\bar{x} $$
Similarly, the regression coefficient (slope) of the line of
regression of X on Y is given by
$$ b_{xy} = \frac{\operatorname{Cov}(x, y)}{V(y)} = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}}, \qquad a = \bar{x} - b_{xy}\,\bar{y} $$
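As a sketch, the computational formulas for byx and a can be applied to a small dataset; the numbers below are made up for demonstration and are not from the text:

```python
# illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi ** 2 for xi in x)

# b_yx = [sum(xy) - (sum x)(sum y)/n] / [sum(x^2) - (sum x)^2 / n]
b_yx = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)
a = sy / n - b_yx * (sx / n)  # a = y_bar - b_yx * x_bar
print(round(b_yx, 4), round(a, 4))  # 0.6 2.2
```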
Since byx is the slope of the line of regression of Y on X, and
since the line passes through the point $(\bar{x}, \bar{y})$, its
equation is
$$ Y - \bar{y} = b_{yx}(X - \bar{x}) = \frac{\operatorname{Cov}(X, Y)}{V(X)}(X - \bar{x}) = r\,\frac{\sigma_Y}{\sigma_X}(X - \bar{x}) $$
Similarly, for the line of X on Y:
$$ X - \bar{x} = b_{xy}(Y - \bar{y}) = \frac{\operatorname{Cov}(X, Y)}{V(Y)}(Y - \bar{y}) = r\,\frac{\sigma_X}{\sigma_Y}(Y - \bar{y}) $$
These follow because
$$ r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \;\Rightarrow\; \operatorname{Cov}(X, Y) = r\,\sigma_X \sigma_Y, \qquad b_{YX} = \frac{\operatorname{Cov}(X, Y)}{V(X)} \;\Rightarrow\; \operatorname{Cov}(X, Y) = b_{YX}\,\sigma_X^2 $$
$$ r\,\sigma_X \sigma_Y = b_{YX}\,\sigma_X^2 \;\Rightarrow\; b_{YX} = r\,\frac{\sigma_Y}{\sigma_X}, \qquad \text{similarly } b_{XY} = r\,\frac{\sigma_X}{\sigma_Y} $$
PROPERTIES OF REGRESSION COEFFICIENT
1. Fundamental Property: Correlation coefficient is the geometric
mean between the regression coefficients.
$$ b_{XY}\, b_{YX} = \left(r\,\frac{\sigma_X}{\sigma_Y}\right)\left(r\,\frac{\sigma_Y}{\sigma_X}\right) = r^2 \quad\Rightarrow\quad r = \pm\sqrt{b_{XY}\, b_{YX}} $$
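As a numerical check of the fundamental and signature properties, with a small made-up dataset (illustrative only, not from the text):

```python
from math import sqrt, copysign

# illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
vx = sum((a - mx) ** 2 for a in x) / n
vy = sum((b - my) ** 2 for b in y) / n

b_yx = cov / vx        # slope of Y on X
b_xy = cov / vy        # slope of X on Y
r = cov / sqrt(vx * vy)

# r is the geometric mean of the two regression coefficients,
# taken with their common sign
gm = copysign(sqrt(b_yx * b_xy), b_yx)
print(round(r, 4), round(gm, 4))  # both 0.7746
```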
2. Signature Property: Sign of correlation coefficient is the same as
that of regression coefficients. Thus if the regression coefficients
are positive then correlation coefficient will be positive and vice-
versa.
3. Magnitude Property: If one of the regression coefficients is
greater than unity, the other must be less than unity.
$$ \text{If } |b_{YX}| > 1 \text{ then } |b_{XY}| < 1 $$
4. Mean Property: The modulus value of the arithmetic mean of the
regression coefficients is not less than the modulus value of
correlation coefficient r.
$$ \frac{1}{2}\,|b_{XY} + b_{YX}| \ge |r| $$
5. Regression coefficients are independent of the change of
origin but not of scale.
6. Angle between two lines of regression: If θ is the acute
angle between the two lines of regression, then
$$ \tan\theta = \frac{1 - r^2}{|r|} \cdot \frac{\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2} $$
If r=0, tan θ = ∞, so θ=90°. Thus if the two variables are uncorrelated,
the lines of regression are perpendicular to each other.
If r=±1, tan θ = 0, so θ=0° or 180°. Thus if the two variables are
perfectly correlated, the two lines of regression coincide.
Q.3. From a paddy field, 15 plants were selected randomly. The length of
panicle (cm) and number of grains per panicle were recorded. Fit the
regression line for the given dataset and compute the estimated number of
grains per panicle if the panicle length is 25.2 cm.
Length of Panicle (cm): 22.4 23.3 24.1 24.3 23.5 23.1 21.0 20.6 26.4 25.4 23.4 21.4 23.6 24.5 22.5
No. of grains per panicle: 95 109 133 132 136 116 94 85 143 138 129 88 127 142 110
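A worked sketch of Q.3 in Python: fit the line of regression of grain count (Y) on panicle length (X) using the least-squares slope and intercept from the formulas above, then predict at X = 25.2 cm:

```python
length = [22.4, 23.3, 24.1, 24.3, 23.5, 23.1, 21.0, 20.6,
          26.4, 25.4, 23.4, 21.4, 23.6, 24.5, 22.5]
grains = [95, 109, 133, 132, 136, 116, 94, 85,
          143, 138, 129, 88, 127, 142, 110]
n = len(length)

mx = sum(length) / n              # mean panicle length (23.3 cm)
my = sum(grains) / n              # mean grain count (about 118.47)

sxy = sum((x - mx) * (y - my) for x, y in zip(length, grains))
sxx = sum((x - mx) ** 2 for x in length)

b_yx = sxy / sxx                  # slope of the line of Y on X
a = my - b_yx * mx                # intercept: a = y_bar - b_yx * x_bar

y_hat = a + b_yx * 25.2           # estimated grains at 25.2 cm
print(round(b_yx, 2), round(a, 2), round(y_hat, 1))  # 11.76 -155.44 140.8
```

So the fitted line is roughly Y = -155.44 + 11.76 X, giving about 141 grains for a 25.2 cm panicle.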