Statistical Computing and R Programming V Semester BCA,
Correlation and Regression
Correlation
Meaning:
* Correlation is a statistical technique to ascertain the association or
relationship between two or more variables.
* Correlation analysis is a statistical technique to study the degree and
direction of relationship between two or more variables.
* A correlation coefficient is a statistical measure of the degree to
which changes in the value of one variable predict changes in the
value of another. When the fluctuation of one variable reliably
predicts a similar fluctuation in another variable, there is often a
tendency to think that the change in one causes the change in the
other; correlation by itself does not establish such a cause.
Uses of correlations:
1. Correlation analysis helps in deriving precisely the degree and the direction
of the relationship between variables.
2. The effect of correlation is to reduce the range of uncertainty of our
prediction. A prediction based on correlation analysis will be more reliable
and nearer to reality.
3. The coefficient of correlation is a relative measure of change.
Types of Correlation:
Correlation is described or classified in several different ways. Three of the
most important are
I. Positive and Negative
II. Simple, Partial and Multiple
III. Linear and Non-linear
I. Positive, Negative and Zero Correlation:
Whether correlation is positive (direct) or negative (inverse) depends
upon the direction of change of the variables.
Positive Correlation: If both the variables vary in the same direction, the
correlation is said to be positive. That is, if one variable is increasing and the other
on average is also increasing, or if one variable is decreasing and the other on
average is also decreasing, the correlation is positive.
For example, the correlation between the heights and weights of a group of persons
is a positive correlation.
Height (cm): X | 156 | 160 | 163 | 166 | 168 | 171 | 174 | 176
Weight (kg): Y | 60 | 62 | 64 | 65 | 67 | 69 | 71 | 72
Negative Correlation: If the variables vary in opposite directions, the
correlation is said to be negative. That is, if one variable increases while the
other decreases, or one variable decreases while the other
increases, the correlation is negative. For example,
the correlation between the price of a product and its demand is a negative
correlation.
Price of Product (Rs. per Unit): X | 6 | 5 | 4 | 3 | 2 | 4
Demand (in Units): Y | 75 | 120 | 175 | 250 | 215 | 400
Zero Correlation: Strictly speaking this is not a type of correlation, but it is still called
zero or no correlation. When we do not find any relationship between the
variables, the correlation is said to be zero. It means a change in the value of one
variable does not influence or change the value of the other variable. For example,
the correlation between the weight of a person and intelligence is a zero or no
correlation.
II. Simple, Partial and Multiple Correlations:
The distinction between simple, partial and multiple correlation is based upon
the number of variables studied.
Simple Correlation: When only two variables are studied, it is a case of simple
correlation. For example, when one studies relationship between the marks
secured by student and the attendance of student in class, it is a problem of
simple correlation.
Partial Correlation: In partial correlation one studies three or more
variables but considers only two of them to be influencing each other, with the
effect of the other influencing variables held constant. For example, in the above
example of the relationship between student marks and attendance, the other
influencing variables, such as effective teaching and the use of teaching aids
like computers and smart boards, are assumed to be constant.
Multiple Correlation: When three or more variables are studied together, it is a case of
multiple correlation. For example, if the above study covers the
relationship between student marks, attendance of students, effectiveness of the
teacher, use of teaching aids, etc., it is a case of multiple correlation.
III. Linear and Non-linear Correlation:
Depending upon the constancy of the ratio of change between the variables, the
correlation may be linear or non-linear.
Linear Correlation: If the amount of change in one variable bears a constant
ratio to the amount of change in the other variable, then the correlation is said to be
linear. If such variables are plotted on graph paper, all the plotted points
fall on a straight line. For example, if producing one unit of
finished product requires 10 units of raw material, then producing
2 units of finished product requires double that amount.
Raw material: X | 10 | 20 | 30 | 40 | 50 | 60
Finished product: Y | 2 | 4 | 6 | 8 | 10 | 12
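The "constant ratio of change" test for linear correlation can be checked mechanically. The following is a minimal Python sketch, not part of the original notes; the function name `change_ratios` is hypothetical, and the fourth Y value in the table is read as 8 (an assumption based on the pattern of the other values).

```python
# Raw material (X) and finished product (Y) from the table above.
# Assumption: the fourth Y value is 8.
raw_material = [10, 20, 30, 40, 50, 60]
finished_product = [2, 4, 6, 8, 10, 12]


def change_ratios(xs, ys):
    """Return (change in Y) / (change in X) for each pair of consecutive points."""
    return [
        (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    ]


ratios = change_ratios(raw_material, finished_product)

# The correlation is linear when every ratio equals the first one.
is_linear = all(abs(r - ratios[0]) < 1e-9 for r in ratios)

print(ratios)
print(is_linear)
```

If the ratios differ from one pair of points to the next, the points lie on a curve and the correlation is non-linear.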
Non-linear Correlation: If the amount of change in one variable does not bear
a constant ratio to the amount of change to the other variable, then correlation is
said to be non-linear. If such variables are plotted on a graph, the points would
fall on a curve and not on a straight line. For example, if we double the amount
of advertisement expenditure, then sales volume would not necessarily be
doubled.
Advertisement expenses: X | 10 | 20 | 30 | 40 | 50
Sales volume: Y | 2 | 5 | 8 | 4 | 12
Illustration 01:
State in each case whether there is (a) Positive Correlation (b) Negative
Correlation (c) No Correlation
Sl. No. | Particulars | Solution
1 | Price of commodity and its demand | Negative
2 | Yield of crop and amount of rainfall | Positive
3 | No. of fruits eaten and hunger of a person | Negative
4 | No. of units produced and fixed cost per unit | Negative
5 | No. of girls in the class and marks of boys | No Correlation
6 | Ages of husbands and wives | Positive
7 | Temperature and sale of woollen garments | Negative
8 | Number of cows and milk produced | Positive
9 | Weight of person and intelligence | No Correlation
10 | Advertisement expenditure and sales volume | Positive
Methods of measurement of correlation:
Quantification of the relationship between variables is very essential to take the
benefit of study of correlation. For this, we find there are various methods of
measurement of correlation, which can be represented as given below:
Methods of Measurement of Correlation
Graphic Method:
1. Scatter Diagram
2. Graph Method
Algebraic Method:
1. Karl Pearson's Coefficient of Correlation
2. Spearman's Rank Coefficient of Correlation
3. Concurrent Deviation Method
4. Method of Least Squares
Among these methods we will discuss only the following methods:
1. Scatter Diagram
Scatter Diagram:
This is a graphic method of measurement of correlation. It is a diagrammatic
representation of bivariate data to ascertain the relationship between two
variables. Under this method the given data are plotted on graph paper in the
form of dots, i.e. for each pair of X and Y values we put a dot and thus obtain
as many points as the number of observations. Usually the independent variable is
shown on the X-axis whereas the dependent variable is shown on the Y-axis.
Once the values are plotted on the graph, it reveals the type of correlation
between variables X and Y. A scatter diagram reveals whether the movements in
one series are associated with those in the other series.
* Perfect Positive Correlation: The points fall on a straight line
rising from the lower left-hand corner to the upper right-hand corner.
* Perfect Negative Correlation: The points fall on a straight
line declining from the upper left-hand corner to the lower right-hand corner.
* High Degree of Positive Correlation: The plotted points fall in a
narrow band and show a rising tendency from the lower left-hand
corner to the upper right-hand corner.
* High Degree of Negative Correlation: The plotted points fall in a
narrow band and show a declining tendency from the upper left-hand
corner to the lower right-hand corner.
* Low Degree of Positive Correlation: The points are widely scattered over the
diagram but rise from the lower left-hand corner to the upper right-hand
corner.
* Low Degree of Negative Correlation: The points are widely scattered over
the diagram but decline from the upper left-hand corner to
the lower right-hand corner.
* Zero (No) Correlation: When the plotted points are scattered over the graph
haphazardly, it indicates that there is no correlation or zero correlation
between the two variables.
[Figure: scatter diagrams (Diagrams I to V) with panels titled Perfect Positive Correlation, Perfect Negative Correlation, High Positive Correlation, High Negative Correlation, Low Positive Correlation, Low Negative Correlation and No Correlation]
Illustration 02:
Given the following pairs of values:
Capital Employed (Rs. in Crore) | 1 | 2 | 3 | 4 | 5 | 7 | 8 | ? | 11 | 12
Profit (Rs. in Lakhs) | 3 | 5 | 4 | 7 | 9 | 8 | 10 | 11 | 12 | 14
(a) Draw a scatter diagram. (b) Do you think that there is any correlation between
profits and capital employed? Is it positive or negative? Is it high or low?
Solution:
From the scatter diagram we can say that the variables are
positively correlated. In the diagram the points trend upward, rising from
the lower left-hand corner to the upper right-hand corner, hence it is a positive
correlation. The plotted points lie in a narrow band, which indicates that it is a case of
a high degree of positive correlation.
[Figure: scatter diagram of Profit (Rs. in Lakhs) against Capital Employed (Rs. in Crore)]
Correlation coefficient
DIRECT METHOD
Interpretation of coefficient of correlation
A positive value of r indicates positive correlation.
A negative value of r indicates negative correlation.
If r = +1 then the correlation is perfect positive.
If r = -1 then the correlation is perfect negative.
If r = 0 then the variables are non-correlated.
If r >= 0.5 then the correlation is a high degree of positive.
If r <= -0.5 then the correlation is a high degree of negative.
If 0 < r < 0.5 then the correlation is a low degree of positive.
If -0.5 < r < 0 then the correlation is a low degree of negative.
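The interpretation rules can be written as a small lookup function. This is a hedged Python sketch rather than anything prescribed by the notes: the function name `describe_correlation` is hypothetical, and the negative thresholds are taken to mirror the positive ones (r <= -0.5 is "high degree of negative"), which is an assumption about the intended rule.

```python
def describe_correlation(r):
    """Return the verbal label for a correlation coefficient r.

    Assumption: thresholds are symmetric around zero, with |r| >= 0.5
    counted as a high degree of correlation.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 1.0:
        return "perfect positive"
    if r == -1.0:
        return "perfect negative"
    if r == 0.0:
        return "no correlation"
    if r >= 0.5:
        return "high degree of positive"
    if r > 0.0:
        return "low degree of positive"
    if r <= -0.5:
        return "high degree of negative"
    return "low degree of negative"


for value in (1.0, 0.8, 0.3, 0.0, -0.3, -0.75, -1.0):
    print(value, "->", describe_correlation(value))
```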
From the following information find the correlation coefficient between advertisement expenses
and sales volume using Karl Pearson's coefficient of correlation method.
Firm | 1 | 2 | 3 | 4 | 5 | … | 9 | 10
Advertisement Exp. (Rs. in Lakhs) | 11 | 13 | 14 | 16 | 16 | … | 13 | 13
Sales Volume (Rs. in Lakhs) | 50 | 50 | 55 | 60 | 65 | … | 60 | 50
Calculation of Karl Pearson's coefficient of correlation
Examples on Karl Pearson's coefficient of correlation:
PRODUCT MOMENT COEFFICIENT OF CORRELATION
Product moment coefficient of correlation is,
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$
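Calculate Karl Pearson's product moment coefficient of correlation.
Student | 1 | 2 | 3 | 4 | 5 | 6
Statistics (x) | 7 | 4 | 6 | 9 | 3 | 8
Mathematics (y) | 8 | 5 | 4 | 8 | 3 | 6
Solution:
x | y | x² | y² | xy
7 | 8 | 49 | 64 | 56
4 | 5 | 16 | 25 | 20
6 | 4 | 36 | 16 | 24
9 | 8 | 81 | 64 | 72
3 | 3 | 9 | 9 | 9
8 | 6 | 64 | 36 | 48
Σx = 37 | Σy = 34 | Σx² = 255 | Σy² = 214 | Σxy = 229
Product moment coefficient of correlation is,
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{6 \times 229 - 37 \times 34}{\sqrt{[6 \times 255 - (37)^2][6 \times 214 - (34)^2]}} = \frac{116}{\sqrt{161 \times 128}}$$
r = 0.8081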
There exists a high degree of positive correlation between x and y.
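The hand calculation can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the marks as they appear in the solution table (x = 7, 4, 6, 9, 3, 8 and y = 8, 5, 4, 8, 3, 6); the function name `pearson_r` is hypothetical and simply applies the product moment formula using the five sums.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment coefficient of correlation from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


statistics_marks = [7, 4, 6, 9, 3, 8]
mathematics_marks = [8, 5, 4, 8, 3, 6]

r = pearson_r(statistics_marks, mathematics_marks)
print(round(r, 4))  # 0.8081
```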
[Table: Computation of r for the Economics Example (12 daily observations of x and y with their squares and products)]
Correlation matrix
A correlation matrix is a table showing correlation coefficients between
variables. Each cell in the table shows the correlation between two
variables. A correlation matrix is used to summarize data, as an input into
a more advanced analysis, and as a diagnostic for advanced analyses.
An example of a correlation matrix
Typically, a correlation matrix is “square”, with the same variables
shown in the rows and columns. I've shown an example below. This
shows correlations between the stated importance of various things to
people. The line of 1.00s going from the top left to the bottom right is
the main diagonal, which shows that each variable always perfectly
correlates with itself. This matrix is symmetrical, with the
correlations shown above the main diagonal being a mirror image of
those below the main diagonal.
The table below is a correlation matrix between different
bonds issued by the Government with different residual maturities, stated in
years in both the horizontal and vertical buckets. It shows, for example,
that a bond with 0.25 years to maturity and a bond with 0.5 years to maturity
have a correlation coefficient of 0.97 in their price movements, and similarly for
other maturity bonds.
Maturity Buckets (In Years); cells marked ? are illegible in the source
     | 0.25 | 0.50 | 1 | 2 | 3 | 5 | 10 | 15 | 20 | 30
0.25 | 1
0.50 | 0.97 | 1
1 | ? | ? | 1
2 | ? | ? | ? | 1
3 | 0.72 | 0.86 | 0.94 | 0.00 | 1
5 | 0.57 | 0.76 | 0.89 | 0.96 | 0.28 | 1
10 | 0.40 | 0.57 | 0.76 | 0.89 | 0.93 | ? | 1
15 | 0.40 | 0.42 | 0.55 | 0.82 | 0.89 | 0.84 | 0.99 | 1
20 | 0.40 | 0.40 | 0.57 | 0.76 | 0.24 | 0.01 | 0.97 | 0.99 | 1
30 | 0.40 | 0.40 | 0.42 | 0.66 | 0.76 | 0.26 | 0.94 | 0.97 | 0.99 | 1
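The two structural properties of a correlation matrix described earlier (a main diagonal of 1s and symmetry about it) can be seen by building one directly. The sketch below is illustrative only: the three series Y, X1 and X2 are made-up numbers, not data from these notes, and the helper names `pearson_r` and `correlation_matrix` are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment coefficient of correlation from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def correlation_matrix(columns):
    """Return a nested dict holding r for every pair of named series."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }


# Hypothetical data, chosen only to illustrate the layout of the matrix.
data = {
    "Y": [3, 5, 4, 7, 9],
    "X1": [1, 2, 3, 4, 5],
    "X2": [9, 7, 8, 4, 2],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Each printed row corresponds to one variable; the entry where a variable meets itself is 1, and the entry for (Y, X1) equals the entry for (X1, Y).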
Applications of a correlation matrix
There are three broad reasons for computing a correlation matrix:
1. To summarize a large amount of data where the goal is to see patterns. In
our example above, the observable pattern is that all the variables highly
correlate with each other.
2. To input into other analyses. For example, people commonly use correlation
matrices as inputs for exploratory factor analysis, confirmatory factor analysis,
structural equation models, and linear regression when excluding missing values
pairwise.
3. As a diagnostic when checking other analyses. For example, with linear
regression, a high amount of correlation among the variables suggests that the linear regression
estimates will be unreliable.
REGRESSION
Meaning:
Regression analysis is a statistical tool to study the nature and extent
of functional relationship between two or more variables and to estimate
(or predict) the unknown values of dependent variable from the known
values of independent variable.
The variable that forms the basis for predicting another variable is known
as the Independent Variable, and the variable that is predicted is known as the
Dependent Variable. For example, if we know that two variables price (X) and
demand (Y) are closely related we can find out the most probable value of X for
a given value of Y or the most probable value of Y for a given value of X.
Similarly, if we know that the amount of tax and the rise in the price of a
commodity are closely related, we can find out the expected price for a certain
amount of tax levy.
Uses of Regression Analysis:
1. It provides estimates of values of the dependent variables from values of
independent variables.
2. It is used to obtain a measure of the error involved in using the regression line
as a basis for estimation.
3. With the help of regression analysis, we can obtain a measure of the degree of
association or correlation that exists between the two variables.
4. It is a highly valuable tool in economics and business research, since most of
the problems of economic analysis are based on cause and effect
relationships.
Linear Regression
Regression lines and regression equations are used synonymously. Regression
equations are algebraic expression of the regression lines. Let us consider two
variables, X and Y. If Y depends on X, the relationship takes the form of simple
regression. With two variables X and Y, we have two
regression lines: the regression line of X on Y and the regression line of Y on X.
The regression line of Y on X gives the most probable value of Y for a given
value of X, and the regression line of X on Y gives the most probable value of X
for a given value of Y. However, when there
is either perfect positive or perfect negative correlation between the two
variables, the two regression lines coincide, i.e. we have one line. If the
variables are independent, r is zero and the lines of regression are at right angles,
i.e. parallel to the X axis and the Y axis.
Therefore, with the help of the simple linear regression model we have the following two
regression lines:
1. Regression line of Y on X: This line gives the probable value of Y (dependent
variable) for any given value of X (independent variable).
Regression line of Y on X: $Y - \bar{Y} = b_{yx}(X - \bar{X})$
OR: $Y = a + bX$
2. Regression line of X on Y: This line gives the probable value of X (dependent
variable) for any given value of Y (independent variable).
Regression line of X on Y: $X - \bar{X} = b_{xy}(Y - \bar{Y})$
OR: $X = a + bY$
In the above two regression lines or regression equations, there are two
regression parameters, “a” and “b”. Here “a” is an unknown constant,
and “b”, which is also denoted as $b_{yx}$ or $b_{xy}$, is another unknown
constant popularly called the regression coefficient. Hence, “a” and “b” are
two unknown constants (fixed numerical values) which determine the position
of the line completely. If the value of either or both of them is changed, another
line is determined. The parameter “a” determines the level of the fitted line (i.e.
the distance of the line directly above or below the origin). The parameter “b”
determines the slope of the line (i.e. the change in Y for a unit change in X).
If the values of constants “a” and “b” are obtained, the line is completely
determined. But the question is how to obtain these values. The answer is
provided by the method of least squares. With a little algebra and differential
calculus, it can be shown that the following two normal equations, if solved
simultaneously, will yield the values of the parameters “a” and “b”.
Two normal equations:
X on Y: $\sum X = Na + b\sum Y$ and $\sum XY = a\sum Y + b\sum Y^2$
Y on X: $\sum Y = Na + b\sum X$ and $\sum XY = a\sum X + b\sum X^2$
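Solving the two normal equations simultaneously is a 2 x 2 linear system, so it can be done in closed form. The following Python sketch uses Cramer's rule for the Y on X pair; the function name `solve_normal_equations` and the sample points are hypothetical and chosen only so the answer is easy to verify by eye (the points lie exactly on Y = 1 + 2X).

```python
def solve_normal_equations(xs, ys):
    """Solve  sum(Y)  = N*a + b*sum(X)
              sum(XY) = a*sum(X) + b*sum(X^2)
    for a and b using Cramer's rule, giving the line Y = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    determinant = n * sum_x2 - sum_x ** 2
    if determinant == 0:
        raise ValueError("all X values are equal, so the slope is undefined")

    a = (sum_y * sum_x2 - sum_x * sum_xy) / determinant
    b = (n * sum_xy - sum_x * sum_y) / determinant
    return a, b


# Hypothetical points lying exactly on Y = 1 + 2X.
a, b = solve_normal_equations([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```

Calling the same function with the arguments swapped, `solve_normal_equations(ys, xs)`, solves the X on Y pair of normal equations instead.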
The above method is popularly known as the direct method; it becomes quite
cumbersome when the values of X and Y are large. The work can be simplified
if, instead of dealing with the actual values of X and Y, we take the deviations of the X
and Y series from their respective means. In that case:
Regression equation Y on X:
$Y = a + bX$ will change to $(Y - \bar{Y}) = b_{yx}(X - \bar{X})$
Regression equation X on Y:
$X = a + bY$ will change to $(X - \bar{X}) = b_{xy}(Y - \bar{Y})$
In this new form of the regression equation, we need to compute only one parameter, i.e. “b”.
This “b”, denoted either $b_{yx}$ or $b_{xy}$, is called the regression coefficient.
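In the deviation form only the slope has to be computed, from deviations about the means. The sketch below assumes the standard expression b_yx = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², which is not written out in the notes at this point; the function name `deviation_form` and the sample points are hypothetical.

```python
def deviation_form(xs, ys):
    """Return (mean_x, mean_y, b_yx) for the line (Y - mean_y) = b_yx * (X - mean_x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each series from its own mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    b_yx = sum(u * v for u, v in zip(dx, dy)) / sum(u * u for u in dx)
    return mean_x, mean_y, b_yx


# Hypothetical points lying exactly on Y = 1 + 2X.
mean_x, mean_y, b_yx = deviation_form([1, 2, 3, 4], [3, 5, 7, 9])
print(mean_x, mean_y, b_yx)  # 2.5 6.0 2.0
```

Expanding (Y - 6) = 2(X - 2.5) gives Y = 1 + 2X, the same line the direct method would produce for these points.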
Illustration 01:
Find the two regression equations of X on Y and Y on X from the following data:
X | 10 | 12 | 16 | 11 | 15 | 14 | 20 | 22
Y | 15 | 18 | 23 | 14 | 20 | 17 | 25 | 28
Calculation of Regression Equation
X | Y | X² | Y² | XY
10 | 15 | 100 | 225 | 150
12 | 18 | 144 | 324 | 216
16 | 23 | 256 | 529 | 368
11 | 14 | 121 | 196 | 154
15 | 20 | 225 | 400 | 300
14 | 17 | 196 | 289 | 238
20 | 25 | 400 | 625 | 500
22 | 28 | 484 | 784 | 616
ΣX = 120 | ΣY = 160 | ΣX² = 1,926 | ΣY² = 3,372 | ΣXY = 2,542
Here N = number of elements in either series X or series Y = 8.
Now we will proceed to compute the regression equations using the normal equations.
Regression equation of X on Y: X = a + bY
The two normal equations are:
$\sum X = Na + b\sum Y$
$\sum XY = a\sum Y + b\sum Y^2$
Substituting the values in the above normal equations, we get
120 = 8a + 160b ... (i)
2542 = 160a + 3372b ... (ii)
Let us solve equations (i) and (ii) by the simultaneous equation method.
Multiplying equation (i) by 20, we get 2400 = 160a + 3200b
Now rewriting these equations and subtracting:
2400 = 160a + 3200b
2542 = 160a + 3372b
-142 = -172b
Therefore we have -142 = -172b, which can be rewritten as 172b = 142
Now, b = 142/172 = 0.8256 (rounded off)
Substituting the value of b in equation (i), we get
120 = 8a + (160 × 0.8256)
120 = 8a + 132 (rounded off)
8a = 120 - 132
8a = -12
a = -12/8
a = -1.5
Thus we get the values a = -1.5 and b = 0.8256
Hence the required regression equation of X on Y:
X = a + bY => X = -1.5 + 0.8256Y
Regression equation of Y on X: Y = a + bX
The two normal equations are:
$\sum Y = Na + b\sum X$
$\sum XY = a\sum X + b\sum X^2$
Substituting the values in the above normal equations, we get
160 = 8a + 120b ... (iii)
2542 = 120a + 1926b ... (iv)
Let us solve equations (iii) and (iv) by the simultaneous equation method.
Multiplying equation (iii) by 15, we get 2400 = 120a + 1800b
Now rewriting these equations and subtracting:
2400 = 120a + 1800b
2542 = 120a + 1926b
-142 = -126b
Therefore we have -142 = -126b, which can be rewritten as 126b = 142
Now, b = 142/126 = 1.127 (rounded off)
Substituting the value of b in equation (iii), we get
160 = 8a + (120 × 1.127)
160 = 8a + 135.24
8a = 160 - 135.24
8a = 24.76
a = 24.76/8
a = 3.095
Thus we get the values a = 3.095 and b = 1.127
Hence the required regression equation of Y on X:
Y = a + bX => Y = 3.095 + 1.127X
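Both regression equations of this illustration can be cross-checked in a few lines. This is a minimal Python sketch under the assumption that the data are X = 10, 12, 16, 11, 15, 14, 20, 22 and Y = 15, 18, 23, 14, 20, 17, 25, 28 (as in the calculation table); the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(us, vs):
    """Fit v = a + b*u by least squares and return (a, b)."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    b = (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)
    a = (sum_v - b * sum_u) / n
    return a, b


X = [10, 12, 16, 11, 15, 14, 20, 22]
Y = [15, 18, 23, 14, 20, 17, 25, 28]

a_yx, b_yx = least_squares_line(X, Y)  # Y on X: Y = a + bX
a_xy, b_xy = least_squares_line(Y, X)  # X on Y: X = a + bY

print(round(a_yx, 3), round(b_yx, 3))  # 3.095 1.127
print(round(a_xy, 3), round(b_xy, 3))  # -1.512 0.826
```

The X on Y intercept comes out as about -1.512 rather than -1.5 because the hand calculation rounds 160 × 0.8256 = 132.096 down to 132 before solving for a.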
Illustration 02:
Compute the regression equation of y on x from the following data.
x | 2 | 4 | 5 | 6 | 8 | 11
y | 18 | 12 | 10 | 8 | 7 | 5
Solution:
x | y | x² | xy
2 | 18 | 4 | 36
4 | 12 | 16 | 48
5 | 10 | 25 | 50
6 | 8 | 36 | 48
8 | 7 | 64 | 56
11 | 5 | 121 | 55
Σx = 36 | Σy = 60 | Σx² = 266 | Σxy = 293
$\bar{x} = \frac{\sum x}{n} = \frac{36}{6} = 6$
$\bar{y} = \frac{\sum y}{n} = \frac{60}{6} = 10$
$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{6 \times 293 - 36 \times 60}{6 \times 266 - (36)^2} = \frac{-402}{300} = -1.34$
Regression equation of y on x is,
$y - \bar{y} = b_{yx}(x - \bar{x})$
i.e., (y - 10) = -1.34(x - 6)
i.e., y = -1.34x + 8.04 + 10
i.e., y = -1.34x + 18.04
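The slope in this illustration is the quotient -402/300, so it is worth evaluating it exactly. The sketch below uses Python's `fractions.Fraction` to avoid any rounding; the data are taken from the solution table (x = 2, 4, 5, 6, 8, 11 and y = 18, 12, 10, 8, 7, 5), and the variable names are illustrative.

```python
from fractions import Fraction

x = [2, 4, 5, 6, 8, 11]
y = [18, 12, 10, 8, 7, 5]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

# Slope of y on x as an exact fraction: (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2).
b_yx = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)

# Intercept from the means: a = mean_y - b_yx * mean_x.
intercept = Fraction(sum_y, n) - b_yx * Fraction(sum_x, n)

print(float(b_yx))       # -1.34
print(float(intercept))  # 18.04
```

Evaluated exactly, -402/300 is -1.34, which gives the line y = -1.34x + 18.04.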
Illustration 03:
Find the regression equation of x on y and predict the value of x when y is 9.
x | 3 | 6 | 5 | 4 | 4 | 6 | 7 | 5
y | 3 | 2 | 3 | 5 | 3 | 6 | 6 | 4
Solution:
x | 3 | 6 | 5 | 4 | 4 | 6 | 7 | 5 | Σx = 40
y | 3 | 2 | 3 | 5 | 3 | 6 | 6 | 4 | Σy = 32
y² | 9 | 4 | 9 | 25 | 9 | 36 | 36 | 16 | Σy² = 144
xy | 9 | 12 | 15 | 20 | 12 | 36 | 42 | 20 | Σxy = 166
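$\bar{x} = \frac{\sum x}{n} = \frac{40}{8} = 5$
$\bar{y} = \frac{\sum y}{n} = \frac{32}{8} = 4$
$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{8 \times 166 - 40 \times 32}{8 \times 144 - (32)^2} = \frac{48}{128} = 0.375$
Regression equation of x on y is,
$x - \bar{x} = b_{xy}(y - \bar{y})$
i.e., (x - 5) = 0.375(y - 4)
i.e., x = 0.375y - 1.5 + 5
i.e., x = 0.375y + 3.5
When y = 9,
x = 0.375 × 9 + 3.5 = 6.875 ≈ 7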
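Once the regression line of x on y is known, prediction is a single evaluation of the line. The following Python sketch reproduces the steps of this illustration; the function names `regression_x_on_y` and `predict_x` are hypothetical.

```python
def regression_x_on_y(xs, ys):
    """Return (a, b) for the line x = a + b*y fitted by least squares."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_y2 = sum(y * y for y in ys)
    b_xy = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)
    a = sum_x / n - b_xy * (sum_y / n)
    return a, b_xy


def predict_x(a, b_xy, y_value):
    """Evaluate the fitted line x = a + b*y at a given y."""
    return a + b_xy * y_value


x = [3, 6, 5, 4, 4, 6, 7, 5]
y = [3, 2, 3, 5, 3, 6, 6, 4]

a, b_xy = regression_x_on_y(x, y)
print(b_xy, a)                # 0.375 3.5
print(predict_x(a, b_xy, 9))  # 6.875
```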
Illustration 04:
Find the two regression lines from the following data.
x | 55 | 57 | 58 | 59 | 59 | 60 | 61 | 62 | 64
y | 74 | 77 | 78 | 75 | 78 | 82 | 82 | 79 | 81
Solution:
x | y | x² | y² | xy
55 | 74 | 3025 | 5476 | 4070
57 | 77 | 3249 | 5929 | 4389
58 | 78 | 3364 | 6084 | 4524
59 | 75 | 3481 | 5625 | 4425
59 | 78 | 3481 | 6084 | 4602
60 | 82 | 3600 | 6724 | 4920
61 | 82 | 3721 | 6724 | 5002
62 | 79 | 3844 | 6241 | 4898
64 | 81 | 4096 | 6561 | 5184
Σx = 535 | Σy = 706 | Σx² = 31861 | Σy² = 55448 | Σxy = 42014
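$\bar{x} = \frac{535}{9} = 59.4444$, $\bar{y} = \frac{706}{9} = 78.4444$
$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{9 \times 42014 - 535 \times 706}{9 \times 55448 - (706)^2} = 0.698$
$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{9 \times 42014 - 535 \times 706}{9 \times 31861 - (535)^2} = 0.7939$
Regression equation of x on y is,
$x - \bar{x} = b_{xy}(y - \bar{y})$
i.e., (x - 59.4444) = 0.698(y - 78.4444)
i.e., x = 0.698y - 54.7542 + 59.4444
i.e., x = 0.698y + 4.6902
Regression equation of y on x is,
$y - \bar{y} = b_{yx}(x - \bar{x})$
i.e., (y - 78.4444) = 0.7939(x - 59.4444)
i.e., y = 0.7939x - 47.1929 + 78.4444
i.e., y = 0.7939x + 31.2515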
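Because both regression lines are written in deviation form, each one passes through the point of means (x̄, ȳ). The sketch below recomputes the two regression coefficients for this illustration and checks that property; the variable names are illustrative, and small differences in the last decimal places of the intercepts relative to the hand calculation come from rounding the slopes before multiplying.

```python
x = [55, 57, 58, 59, 59, 60, 61, 62, 64]
y = [74, 77, 78, 75, 78, 82, 82, 79, 81]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

mean_x, mean_y = sum_x / n, sum_y / n

# Both coefficients share the numerator n*Sxy - Sx*Sy.
numerator = n * sum_xy - sum_x * sum_y
b_xy = numerator / (n * sum_y2 - sum_y ** 2)  # slope of x on y
b_yx = numerator / (n * sum_x2 - sum_x ** 2)  # slope of y on x

# Intercepts of x = a_xy + b_xy*y and y = a_yx + b_yx*x.
a_xy = mean_x - b_xy * mean_y
a_yx = mean_y - b_yx * mean_x

print(round(b_xy, 4), round(a_xy, 2))
print(round(b_yx, 4), round(a_yx, 2))

# Each line evaluated at the other variable's mean returns its own mean.
print(abs(a_xy + b_xy * mean_y - mean_x) < 1e-9)
print(abs(a_yx + b_yx * mean_x - mean_y) < 1e-9)
```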