FYBA – SEM 1, Unit 3 :- Correlation Analysis
When we study two quantitative variables simultaneously, we get observations in pairs
(X, Y). Such data is called bivariate data.
Either of the two variables can be taken as X; the other is then denoted by Y. Bivariate
data can be given with or without frequencies.
Bivariate data without frequency is shown below, where X and Y are both discrete
variables. There are 3 pairs of X and Y, hence n = 3.

X    10   14   20
Y    12   20   30

A bivariate frequency table (data with frequency) is shown below, where both X and Y are
grouped variables. N = total number of observations.

Height (X) / Weight (Y)    40-50        50-60        60-70        fx
145-155                    4            2            0            4+2+0 = 6
155-165                    1            2            2            1+2+2 = 5
165-175                    0            4            5            0+4+5 = 9
fy                         4+1+0 = 5    2+2+4 = 8    0+2+5 = 7    N = 6+5+9 = 5+8+7 = 20
We can also have X as a grouped variable and Y as a discrete variable, or X as a discrete
variable and Y as a grouped variable. Any such combination of X and Y can occur.
A sample of size n is drawn from the bivariate population, and the observations on the
variables X and Y for each item in the sample are denoted by (Xi, Yi); i = 1, 2, …, n.
Data collected and used in this way is called bivariate data.
Here we can have a frequency for each pair of observations (X, Y), or the data can be raw
pairs (X, Y).
Given bivariate data, we can convert it to univariate data (the separate distributions of
X and of Y) as follows.

C.I. of X          145-155   155-165   165-175   Total
X (class mark)     150       160       170
fx                 6         5         9         20

C.I. of Y          40-50     50-60     60-70     Total
Y (class mark)     45        55        65
fy                 5         8         7         20
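The row and column totals of the bivariate frequency table can be checked with a short Python sketch. The names `joint` and `marginals` are illustrative and are not part of the notes; the cell frequencies are the ones in the table above.

```python
# Cell frequencies of the bivariate table:
# one row per height class (X), one column per weight class (Y).
joint = [
    [4, 2, 0],
    [1, 2, 2],
    [0, 4, 5],
]

def marginals(table):
    """Return (row totals fx, column totals fy, grand total N)."""
    fx = [sum(row) for row in table]
    fy = [sum(col) for col in zip(*table)]
    return fx, fy, sum(fx)

fx, fy, N = marginals(joint)
print(fx, fy, N)  # [6, 5, 9] [5, 8, 7] 20
```

Both sets of totals add up to the same N, which is a quick check that the table has been read correctly.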
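Then we can find the mean and variance of X and of Y:
\[ \bar{x} = \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n}, \qquad \operatorname{cov}(x,y) = \frac{\sum (x-\bar{x})(y-\bar{y})}{n} = \frac{\sum xy}{n} - \bar{x}\,\bar{y} \]
\[ \text{s.d.}(x) = \sqrt{V(x)} = \sqrt{\frac{\sum (x-\bar{x})^2}{n}} = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}, \qquad \text{s.d.}(y) = \sqrt{V(y)} = \sqrt{\frac{\sum (y-\bar{y})^2}{n}} = \sqrt{\frac{\sum y^2}{n} - \bar{y}^2} \]
Then find
\[ r = \frac{\operatorname{cov}(x,y)}{\text{s.d.}(x)\,\text{s.d.}(y)} \]
which is a numerical measure of correlation between the variables X and Y. (For data with
frequencies, each term in the sums is multiplied by its frequency f and n is replaced by N.)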
Examples of bivariate data:
rainfall (X) and agricultural production (Y)
demand (X) and supply (Y) for a particular commodity
demand (X) and price (Y) for a certain product
height (X) and weight (Y) of a person
income (X) and expenditure (Y)
imports and exports of a country
performance of operators and the day's production
temperature of the surroundings and life of a medicine
age of a person and blood pressure level
number of lectures attended and marks obtained.
If we study two variables (X, Y) simultaneously, we have observations in pairs (X, Y),
which are represented by a bivariate table as above. Given bivariate data, we can find the
univariate distributions of X and Y.
Given data on the variables (X, Y), we study whether there is a relation between the two
variables. Like the mean and standard deviation for a single variable, we define a new
measure for a pair of variables, the covariance.
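Covariance :- It is defined as the mean value of the product of the deviations of the two
variables from their respective means.
\[ \operatorname{cov}(X,Y) = E(XY) - E(X)\,E(Y) \]
where E(X) = mean of X, denoted by \bar{x},
E(Y) = mean of Y, denoted by \bar{y}, and
E(XY) = \frac{\sum xy}{n} = mean of the product of the X and Y values. Hence
\[ \operatorname{cov}(X,Y) = \frac{\sum (x-\bar{x})(y-\bar{y})}{n} = \frac{\sum xy}{n} - \bar{x}\,\bar{y} \]
Cov(Y, X) = Cov(X, Y).
It measures how much the two variables in a bivariate pair (X, Y) change together.
The sign of the covariance shows the direction of the linear relationship between the
variables X and Y: it is positive when X and Y tend to move in the same direction and
negative when they tend to move in opposite directions.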
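The definition (mean product of deviations) and the shortcut form (mean of the products minus the product of the means) give the same covariance. The sketch below checks this numerically on the three raw pairs X = 10, 14, 20 and Y = 12, 20, 30 from the start of the unit; the function name `covariance_both_ways` is illustrative only.

```python
def covariance_both_ways(xs, ys):
    """Return the covariance computed by the definition and by the shortcut."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Definition: mean of the products of deviations from the means.
    by_definition = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Shortcut: mean of the products minus the product of the means.
    by_shortcut = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    return by_definition, by_shortcut

a, b = covariance_both_ways([10, 14, 20], [12, 20, 30])
print(round(a, 4), round(b, 4))  # 30.2222 30.2222
```

The covariance is positive here because Y increases whenever X increases.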
There are two ways of studying the correlation between two variables.
1. Scatter Diagram :- A graphical way of checking whether two variables are correlated.
2. Mathematical way :- Checking whether two variables are correlated by using Karl
Pearson’s Coefficient of Correlation or Spearman’s Coefficient of Correlation.
Karl Pearson’s Coefficient of correlation
Karl Pearson defined the coefficient of correlation as a measure of the intensity or degree
of linear relationship between two variables.
Let X and Y be the two variables with n pairs of observations (x_i, y_i), i = 1, 2, …, n. Then
\[ r = \frac{\dfrac{\sum x_i y_i}{n} - \bar{x}\,\bar{y}}{\sqrt{\dfrac{\sum x_i^2}{n} - \bar{x}^2}\;\sqrt{\dfrac{\sum y_i^2}{n} - \bar{y}^2}} \]
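Karl Pearson's coefficient is the covariance divided by the product of the two standard deviations, all computed with divisor n. A minimal Python sketch of that calculation follows; the function name `pearson_r` and the small data sets are hypothetical and chosen only so that the answer is obvious.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r from the shortcut sums (divisor n throughout)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    sd_x = sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sd_y = sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    return cov_xy / (sd_x * sd_y)

# Hypothetical data: y = 2x (exactly linear, increasing) and y = 10 - 2x (decreasing).
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 4))
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 4))
```

The function assumes neither variable is constant; if all x values (or all y values) are equal, the standard deviation is zero and r is undefined.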
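Spearman’s Rank Correlation Coefficient
It is a numerical measure of the degree of correlation between ordinal
types of data, i.e. ranked data. It is given by:
\[ R = 1 - \frac{6\sum d^2}{n(n^2-1)} \]
where d = R1 - R2 is the difference between the two ranks of an item and n is the number
of pairs.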
Eg1 :- For the following pairs of observed values of x and y, find correlation coefficient and
comment. (-1,1) ; (0,0) (1,1)
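Solution :-
                        Total
X      -1    0    1     \sum X = 0
Y       1    0    1     \sum Y = 2
XY     -1    0    1     \sum XY = 0
X^2     1    0    1     \sum X^2 = 2
Y^2     1    0    1     \sum Y^2 = 2
n = 3, \bar{x} = 0/3 = 0, \bar{y} = 2/3
\[ \operatorname{cov}(X,Y) = \frac{\sum XY}{n} - \bar{x}\,\bar{y} = \frac{0}{3} - 0 \times \frac{2}{3} = 0 \]
\[ \text{s.d.}(x) = \sqrt{\frac{\sum X^2}{n} - \bar{x}^2} = \sqrt{\frac{2}{3} - 0} = 0.8165 \quad \text{(for raw data)} \]
\[ \text{s.d.}(y) = \sqrt{\frac{\sum Y^2}{n} - \bar{y}^2} = \sqrt{\frac{2}{3} - \frac{4}{9}} = 0.4714 \quad \text{(for raw data)} \]
\[ r = \frac{\operatorname{cov}(X,Y)}{\text{s.d.}(x)\,\text{s.d.}(y)} = \frac{0}{0.8165 \times 0.4714} = 0 \]
It indicates that the two variables are uncorrelated.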
Eg 2- Graphically and mathematically, obtain the nature of correlation for following data.
X 102 82 63 72 39 30
Y 120 100 75 90 50 40
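The notes give no worked answer for Eg 2. For the graphical part, a scatter diagram of the six points rises from lower left to upper right, because the larger X values go with the larger Y values. For the mathematical part, the sketch below builds the usual column sums and then r; the variable names are illustrative.

```python
from math import sqrt

X = [102, 82, 63, 72, 39, 30]
Y = [120, 100, 75, 90, 50, 40]
n = len(X)

# Column totals of the working table.
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

mean_x, mean_y = sum_x / n, sum_y / n
cov_xy = sum_xy / n - mean_x * mean_y
sd_x = sqrt(sum_x2 / n - mean_x ** 2)
sd_y = sqrt(sum_y2 / n - mean_y ** 2)
r = cov_xy / (sd_x * sd_y)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 388 475 34795 28702 42225
print(round(r, 3))                           # 0.998
```

A value of r this close to +1 indicates a very strong positive linear correlation between X and Y.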
Examples on data as Ranks:-
Eg3:- The following table shows ranks given by two judges to 10 participants in singing
competition. Find Spearman’s rank correlation coefficient.
Judge 1 1 9 4 2 8 5 7 3 10 6
Judge 2 3 10 6 1 8 2 4 5 9 7
Answer :-
Judge 1 (R1)     1    9    4    2    8    5    7    3   10    6    Total
Judge 2 (R2)     3   10    6    1    8    2    4    5    9    7
d = R1 - R2     -2   -1   -2    1    0    3    3   -2    1   -1
d^2              4    1    4    1    0    9    9    4    1    1    34
\[ R = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6 \times 34}{10 \times 99} = 1 - \frac{204}{990} = 0.7939 \]
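The Eg 3 arithmetic can be checked with a short Python sketch, assuming the ranks contain no ties. The function name `spearman_no_ties` is illustrative.

```python
def spearman_no_ties(rank1, rank2):
    """Return (sum of d^2, R) where R = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(rank1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))
    return sum_d2, 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge1 = [1, 9, 4, 2, 8, 5, 7, 3, 10, 6]
judge2 = [3, 10, 6, 1, 8, 2, 4, 5, 9, 7]
sum_d2, R = spearman_no_ties(judge1, judge2)
print(sum_d2, round(R, 4))  # 34 0.7939
```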
If the data is not in ranks and Spearman’s correlation coefficient is to
be calculated, we first convert the data to ranks.
Eg4 Obtain Spearman’s rank correlation between performance in
Botany and Zoology.
Botany 56 75 45 71 62 64 58 80 76 61
Zoology 66 70 40 60 65 56 59 77 67 63
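Ans :- Convert the data to ranks, giving rank 1 to the highest mark.
Botany (R1)   Zoology (R2)   d = R1 - R2   d^2
9             4               5            25
3             2               1             1
10            10              0             0
4             7              -3             9
6             5               1             1
5             9              -4            16
8             8               0             0
1             1               0             0
2             3              -1             1
7             6               1             1
Total                                      54
\[ R = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6 \times 54}{10 \times 99} = 1 - \frac{324}{990} = 0.67 \]
Comment :- We see positive correlation between the marks in Botany and
Zoology.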
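The conversion of marks to ranks in Eg 4 can be sketched in Python as below. The helper `ranks_descending` is an illustrative name and assumes that no two marks in a subject are equal, which holds for this data.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

botany = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
zoology = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

r1 = ranks_descending(botany)
r2 = ranks_descending(zoology)
n = len(botany)
sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
R = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(r1)                   # [9, 3, 10, 4, 6, 5, 8, 1, 2, 7]
print(r2)                   # [4, 2, 10, 7, 5, 9, 8, 1, 3, 6]
print(sum_d2, round(R, 4))  # 54 0.6727
```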
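Eg 5 :- Given r = 0.4, \sum (x-\bar{x})(y-\bar{y}) = 108, s.d.(y) = 3 and \sum (x-\bar{x})^2 = 900, find the number of pairs (x, y).
Answer :-
\[ r = \frac{\operatorname{cov}(x,y)}{\text{s.d.}(x)\,\text{s.d.}(y)} = 0.4, \qquad \operatorname{cov}(x,y) = \frac{\sum (x-\bar{x})(y-\bar{y})}{n} = \frac{108}{n} \]
\[ 0.4 = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \,\sum (y-\bar{y})^2}} = \frac{108}{\sqrt{900 \sum (y-\bar{y})^2}} \]
Squaring both sides, we get
\[ 0.16 = \frac{108^2}{900 \sum (y-\bar{y})^2} = \frac{12.96}{\sum (y-\bar{y})^2}, \qquad \text{so } \sum (y-\bar{y})^2 = \frac{12.96}{0.16} = 81 \]
Since \sum (y-\bar{y})^2 = n V(y) = 9n, we get n = 81/9 = 9.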
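Eg 6 :- R = 0.5 and the sum of squares of differences between ranks is 10. Find
the number of observations used in calculating the rank correlation
coefficient.
Given R = 0.5 and \sum d^2 = 10; to find n.
\[ R = 1 - \frac{6\sum d^2}{n(n^2-1)} \]
\[ 0.5 = 1 - \frac{6 \times 10}{n(n^2-1)} \;\Rightarrow\; \frac{60}{n(n^2-1)} = 0.5 \;\Rightarrow\; n(n^2-1) = 120 \]
So n(n+1)(n-1) = 120. Obtain n by substituting various values of n: 4 × 5 × 6 = 120,
hence n = 5.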
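The trial-and-error step in Eg 6 can be sketched as a small search. The function name `solve_n` and the search limit `max_n` are assumptions made for illustration; the sketch requires R < 1 so that the division is defined.

```python
def solve_n(R, sum_d2, max_n=1000):
    """Return the n >= 2 with 1 - 6*sum_d2 / (n(n^2 - 1)) = R, or None if no integer fits."""
    target = 6 * sum_d2 / (1 - R)        # n(n^2 - 1) must equal this value
    for n in range(2, max_n + 1):
        if abs(n * (n * n - 1) - target) < 1e-9:
            return n
    return None

print(solve_n(0.5, 10))  # 5
```

Because n(n^2 - 1) increases with n, at most one value of n can satisfy the equation.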
Eg 7 :-
The following information is available on bivariate data.
n = 10, \sum x = 150, \sum x^2 = 2410, \sum y = 200, \sum y^2 = 4250, \sum xy = 3063
Later it was observed that one observation (17, 19) was wrongly used instead of (7, 9).
Obtain the correct value of the correlation coefficient.
Ans :- Since all the values above were calculated using the wrong pair (17, 19), they are
to be corrected as follows.
We first obtain all the corrected sums.
Correct \sum x = wrong \sum x - 17 + 7 = 150 - 17 + 7 = 140
Correct \sum y = wrong \sum y - 19 + 9 = 200 - 10 = 190
Correct \sum x^2 = wrong \sum x^2 - 17^2 + 7^2 = 2410 - 289 + 49 = 2170
Correct \sum y^2 = wrong \sum y^2 - 19^2 + 9^2 = 4250 - 361 + 81 = 3970
Correct \sum xy = wrong \sum xy - (17 × 19) + (7 × 9) = 3063 - 323 + 63 = 2803
Correct \bar{x} = 140/10 = 14, correct \bar{y} = 190/10 = 19
Correct covariance = (correct \sum xy)/n - (correct \bar{x})(correct \bar{y})
= 2803/10 - 14 × 19 = 280.3 - 266 = 14.3
Corrected variance of x, V(x) = (2170/10) - 14^2 = 217 - 196 = 21
Corrected variance of y, V(y) = (3970/10) - 19^2 = 397 - 361 = 36
Corrected coefficient of correlation is
\[ r = \frac{14.3}{\sqrt{21}\,\sqrt{36}} = \frac{14.3}{27.4955} = 0.5201 \]
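The correction procedure in Eg 7 (subtract the wrong pair's contribution from each sum, add the right pair's) can be sketched as follows. The dictionary layout and names are illustrative. The uncorrected sum of y squared, 4250, is an inferred value: it is the number that gives the corrected 3970 after subtracting 19^2 and adding 9^2.

```python
from math import sqrt

n = 10
# Sums first computed with the wrong pair (17, 19).
# "y2": 4250 is inferred from the corrected value 3970 = 4250 - 19**2 + 9**2.
wrong_sums = {"x": 150, "y": 200, "x2": 2410, "y2": 4250, "xy": 3063}
wx, wy = 17, 19   # pair used by mistake
cx, cy = 7, 9     # pair that should have been used

correct_sums = {
    "x": wrong_sums["x"] - wx + cx,
    "y": wrong_sums["y"] - wy + cy,
    "x2": wrong_sums["x2"] - wx ** 2 + cx ** 2,
    "y2": wrong_sums["y2"] - wy ** 2 + cy ** 2,
    "xy": wrong_sums["xy"] - wx * wy + cx * cy,
}

mean_x = correct_sums["x"] / n
mean_y = correct_sums["y"] / n
cov_xy = correct_sums["xy"] / n - mean_x * mean_y
var_x = correct_sums["x2"] / n - mean_x ** 2
var_y = correct_sums["y2"] / n - mean_y ** 2
r = cov_xy / sqrt(var_x * var_y)

print(correct_sums)
print(round(cov_xy, 1), round(var_x, 1), round(var_y, 1), round(r, 4))  # 14.3 21.0 36.0 0.5201
```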
Eg 8 :- The marks scored by 12 students in Mathematics (x) and in Statistics (y) are as
follows. Obtain the rank correlation coefficient and comment.
Marks in mathematics (x)   15  16  19  17  17  15  18  16  18  18  14  10
Marks in statistics (y)    10  12  12  13  11   9  11  13  11  12   8   7

Tied marks receive the average of the ranks they would take if there were no tie. For
example, the three 18s in maths would take ranks 2, 3 and 4, so each gets rank 3; the two
13s in stats would take ranks 1 and 2, so each gets rank 1.5.

Maths (x)   Stats (y)   R1 = rank of maths   R2 = rank of stats   d = R1 - R2   d^2
15          10          9.5                  9                     0.5           0.25
16          12          7.5                  4                     3.5          12.25
19          12          1                    4                    -3             9
17          13          5.5                  1.5                   4            16
17          11          5.5                  7                    -1.5           2.25
15           9          9.5                  10                   -0.5           0.25
18          11          3                    7                    -4            16
16          13          7.5                  1.5                   6            36
18          11          3                    7                    -4            16
18          12          3                    4                    -1             1
14           8          11                   11                    0             0
10           7          12                   12                    0             0
Total                                                                          109

Some ranks are repeated, so we calculate the correction factor as below.
Repeated value   No. of times repeated (m)   Correction factor = m(m^2-1)/12
18 in maths      3                           2
17 in maths      2                           0.5
16 in maths      2                           0.5
15 in maths      2                           0.5
13 in stats      2                           0.5
12 in stats      3                           2
11 in stats      3                           2
Total \sum m(m^2-1)/12 = 8
\[ R = 1 - \frac{6\left[\sum d^2 + \sum \frac{m(m^2-1)}{12}\right]}{n(n^2-1)} = 1 - \frac{6(109 + 8)}{12 \times 143} = 1 - \frac{702}{1716} = 0.5909 \]
This shows weak positive correlation between the two variables, marks in maths
and statistics.
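The tied-rank working of Eg 8 can be sketched in Python as below. The helper names `average_ranks` and `tie_correction` are illustrative. The maths marks are taken from the rows of the working table, where 18 occurs three times and 16 twice, which is what the correction-factor table also assumes.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 to the largest value; tied values share the mean of their positions."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for pos, v in enumerate(ordered, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value that is repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

maths = [15, 16, 19, 17, 17, 15, 18, 16, 18, 18, 14, 10]
stats = [10, 12, 12, 13, 11, 9, 11, 13, 11, 12, 8, 7]

r1, r2 = average_ranks(maths), average_ranks(stats)
sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
cf = tie_correction(maths) + tie_correction(stats)
n = len(maths)
R = 1 - 6 * (sum_d2 + cf) / (n * (n * n - 1))

print(sum_d2, cf, round(R, 4))  # 109.0 8.0 0.5909
```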
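Eg 9 :- Correlation for the bivariate frequency table of height (X) and weight (Y), using
u = (x - 160)/10 and v = (y - 55)/10. Each cell shows the frequency f, with f·u·v in brackets.

Height (X) / Weight (Y)   40-50    50-60    60-70    fu    x (class mark)   u = (x-160)/10   u·fu           u^2·fu
145-155                   4 (4)    2 (0)    0 (0)    6     150              -1               -6             6
155-165                   1 (0)    2 (0)    2 (0)    5     160               0                0             0
165-175                   0 (0)    4 (0)    5 (5)    9     170               1                9             9
fv                        5        8        7        20                                      \sum u·fu = 3   \sum u^2·fu = 15
y (class mark)            45       55       65
v = (y-55)/10             -1       0        1
v·fv                      -5       0        7        \sum v·fv = 2
v^2·fv                    5        0        7        \sum v^2·fv = 12

\sum fuv = 4 + 5 = 9, n = 20
\[ \bar{u} = \frac{\sum u f_u}{n} = \frac{3}{20} = 0.15, \qquad \bar{v} = \frac{\sum v f_v}{n} = \frac{2}{20} = 0.1 \]
\[ \operatorname{cov}(u,v) = \frac{\sum fuv}{n} - \bar{u}\,\bar{v} = \frac{9}{20} - \frac{3}{20} \times \frac{2}{20} = 0.435 \]
\[ \text{s.d.}(u) = \sqrt{\frac{\sum u^2 f_u}{n} - \bar{u}^2} = \sqrt{\frac{15}{20} - (0.15)^2} = 0.8529 \]
\[ \text{s.d.}(v) = \sqrt{\frac{\sum v^2 f_v}{n} - \bar{v}^2} = \sqrt{\frac{12}{20} - (0.1)^2} = 0.7681 \]
\[ r_{uv} = \frac{\operatorname{cov}(u,v)}{\text{s.d.}(u)\,\text{s.d.}(v)} = \frac{0.435}{0.8529 \times 0.7681} = 0.6640 = r_{xy} \]
The correlation coefficient is not affected by a change of origin and scale, so r for (x, y)
equals r for (u, v).
Comment :- There is weak positive correlation between the two variables
x and y.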
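The grouped-data working of Eg 9 can be checked with the sketch below. The coding u = (x - 160)/10 is the one in the table; v = (y - 55)/10 is inferred from the class marks 45, 55, 65 and the listed v values -1, 0, 1. Variable names are illustrative.

```python
from math import sqrt

# Cell frequencies: rows are height classes, columns are weight classes.
freq = [
    [4, 2, 0],
    [1, 2, 2],
    [0, 4, 5],
]
u = [-1, 0, 1]   # (x - 160)/10 for class marks 150, 160, 170
v = [-1, 0, 1]   # (y - 55)/10 for class marks 45, 55, 65 (inferred coding)

fu = [sum(row) for row in freq]
fv = [sum(col) for col in zip(*freq)]
n = sum(fu)

sum_fu = sum(f * a for f, a in zip(fu, u))
sum_fu2 = sum(f * a * a for f, a in zip(fu, u))
sum_fv = sum(f * b for f, b in zip(fv, v))
sum_fv2 = sum(f * b * b for f, b in zip(fv, v))
sum_fuv = sum(freq[i][j] * u[i] * v[j] for i in range(3) for j in range(3))

mean_u, mean_v = sum_fu / n, sum_fv / n
cov_uv = sum_fuv / n - mean_u * mean_v
sd_u = sqrt(sum_fu2 / n - mean_u ** 2)
sd_v = sqrt(sum_fv2 / n - mean_v ** 2)
r = cov_uv / (sd_u * sd_v)

print(n, sum_fu, sum_fu2, sum_fv, sum_fv2, sum_fuv)                 # 20 3 15 2 12 9
print(round(cov_uv, 3), round(sd_u, 4), round(sd_v, 4), round(r, 4))  # 0.435 0.8529 0.7681 0.664
```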
For data given as ranks, the measure of correlation is Spearman’s rank correlation coefficient, as used in the rank examples above.
Definitions :-
1. Correlation :- It is a statistical measure of the relationship between two variables that
indicates the extent to which two or more variables are related. We are going to study
simple linear correlation.
The correlation coefficient signifies the strength and direction of the relationship between
two variables. The strength is expressed as a number; the direction is signified by the sign
of the correlation coefficient.
The relationship between more than two variables is called multiple correlation.
Eg :- Consider the yield of a crop, which depends on various factors like sunlight, rain,
type of soil etc. We can then find the effect of these factors on the yield of the crop.
If the yield of the crop is represented by the variable Y, then rain, type of soil or
fertilization will be called X.
Since we consider only bivariate data, only two variables can be studied simultaneously.
This is called simple correlation; studying many factors simultaneously would be multiple
correlation.
For two related variables (X, Y) under study, if we consider only the first power of X and
Y, and not higher powers such as X^2, X^3 or Y^2, Y^3, then the relationship between X and Y
is called linear correlation.
If instead we consider different powers of X and Y and find the relationship between, say,
X^2 and Y, X^3 and Y, Y^2 and X, or Y^3 and X, then this type of relationship between the
variables is called non-linear correlation.
X and Y can be given with or without frequency. In these situations we are interested in
examining the relationship between the two variables; the extent of linear relationship
between the two variables is called correlation.
Our study is confined to the simultaneous study of two variables in which no variable has
power more than one. Hence it is called bivariate (simple) linear correlation.
If two variables are such that a change in the value of one variable affects the value of
the other variable, then we say the variables are correlated with each other.
Types of correlation :-
1. Positive correlation :- It is a relationship between two variables in which both
variables move in the same direction. Both of them either increase or decrease
in value together.
Eg :- If income (X) increases (decreases), expenditure (Y) also increases (decreases); hence
the change in both variables is in the same direction. Such variables as income and
expenditure are called positively correlated variables.
Further, it can be classified as strong positive, weak positive or perfect positive
correlation.
2. Negative correlation :- If two variables are such that one variable increases while the
other decreases, then the relationship between them is called negative
correlation. Such variables as price and demand are called negatively correlated
variables.
Eg :- If price (X) increases and demand (Y) decreases, then X and Y are said to be negatively
correlated.
Further, it can be classified as strong negative, weak negative or perfect negative correlation.
********