
DSJ BMS Unit3 16102020

This document discusses various methods for analyzing the relationship between two variables, including scatter diagrams, correlation coefficients, and linear regression. It defines correlation as how strongly two variables are related. Simple correlation coefficient measures the intensity of a linear relationship between -1 and 1, with 0 indicating no correlation. Spearman's rank correlation coefficient is a non-parametric measure of statistical dependence between two variables.

Uploaded by

Parmeet Kumar
Copyright
© © All Rights Reserved

Unit 3 – Descriptive Analysis of Bivariate Data

1. Methods and measures of studying relationship between two variables: Scatter Diagrams, Simple correlation coefficient, Rank correlation coefficient, Linear Regression, Coefficient of determination.
2. Estimation of simple and exponential trends for Time Series.

Correlation:

Correlation shows how strongly a pair of variables is related, e.g. height and weight, or rainfall and rice farming.

Scatter Diagrams:

The scatter diagram is the simplest way to represent bivariate data diagrammatically. It is obtained by plotting each pair of values as a point, with the variable X along the x-axis and the variable Y along the y-axis.

a.

Units Price
1 10
2.5 15
3 25
4.5 30
5 35
6 38
6.5 42
7 46
b.

Units Price
1 46
2.5 42
3 38
4.5 35
5 30
6 25
6.5 15
7 10
c.

Units Price
1 35
2.5 15
3 46
4.5 10
5 25
6 42
6.5 30
7 38

a. Positive Correlation
b. Negative Correlation
c. No Correlation

e.g.
1. Motorcycle speed and number of accidents.
2. Temperature and sales of AC.
Simple correlation coefficient (Karl Pearson’s coefficient of correlation):

It measures the intensity or degree of the linear relationship between two variables. It is not suitable for non-linear relationships.

The correlation coefficient between two random variables X and Y is denoted by $r(X,Y)$ or $r_{XY}$.

$$r(X,Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\left[\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2\right]\left[\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2\right]}}$$


Covariance:
$$\sigma_{XY} = \mathrm{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}$$

Standard Deviation:

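$$\sigma_X = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$$

$$\sigma_Y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2}$$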

Range: [-1, 1]

a. Positive Correlation: $0 < r(X,Y) \le 1$
b. Negative Correlation: $-1 \le r(X,Y) < 0$
c. No Correlation: $r(X,Y) = 0$
Q1. Find $r(X,Y)$ for the following data.
Units Price
1 10
2.5 15
3 25
4.5 30
5 35
6 38
6.5 42
7 46

S.N.   Units (X)  Price (Y)  X^2     Y^2    X*Y
1      1          10         1       100    10
2      2.5        15         6.25    225    37.5
3      3          25         9       625    75
4      4.5        30         20.25   900    135
5      5          35         25      1225   175
6      6          38         36      1444   228
7      6.5        42         42.25   1764   273
8      7          46         49      2116   322
Total  35.5       241        188.75  8399   1255.5

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{8}(35.5) = 4.44, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{8}(241) = 30.13$$

$$\sigma_X = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \sqrt{\frac{1}{8}(188.75) - 4.44^2} = \sqrt{23.59 - 19.71} = \sqrt{3.88} = 1.97$$

$$\sigma_Y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2} = \sqrt{\frac{1}{8}(8399) - 30.13^2} = \sqrt{1049.88 - 907.82} = \sqrt{142.06} = 11.92$$

$$\sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} = \frac{1}{8}(1255.5) - 4.44 \times 30.13 = 156.94 - 133.78 = 23.16$$

$$r(X,Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{23.16}{1.97 \times 11.92} = \frac{23.16}{23.48} = 0.9864$$

$r(X,Y) = 0.9864$
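
The Q1 calculation can be checked with a short Python sketch. The function name `pearson_r` and the variable names are illustrative, not part of the notes; the function uses the same population (1/n) formulas for covariance and standard deviation as above. Because it carries full precision, it gives about 0.987, which differs from the hand-computed 0.9864 only through intermediate rounding.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the 1/n covariance and standard deviation formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sigma_XY = (1/n) * sum(x*y) - mean_x * mean_y
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    # sigma_X and sigma_Y = sqrt((1/n) * sum(v^2) - mean^2)
    sd_x = math.sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    return cov_xy / (sd_x * sd_y)

units = [1, 2.5, 3, 4.5, 5, 6, 6.5, 7]
price = [10, 15, 25, 30, 35, 38, 42, 46]   # Q1 data

r_q1 = pearson_r(units, price)
print(round(r_q1, 3))
```

Passing the same prices in reverse order reproduces the strongly negative case.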
Q2.
Units Price
1 46
2.5 42
3 38
4.5 35
5 30
6 25
6.5 15
7 10

$r(X,Y) = -0.952$
Q3.
Units Price
1 35
2.5 15
3 46
4.5 10
5 25
6 42
6.5 30
7 38

$r(X,Y) = 0.128$
Spearman Rank Correlation Coefficient/ Spearman’s Rho (ρ)

Data must be ordinal, interval or ratio.

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

Where,
$d_i$ = difference of ranks
$n$ = number of observations
Range: [-1, 1]

a. Positive Correlation: $0 < \rho \le 1$
b. Negative Correlation: $-1 \le \rho < 0$
c. No Correlation: $\rho = 0$
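
A minimal Python sketch of the rank correlation formula, applied to the physics and mathematics marks used in the worked example of this section. The helper names `ranks_descending` and `spearman_rho` are illustrative, and the sketch assumes there are no tied values (rank 1 goes to the largest value, as in the notes).

```python
def ranks_descending(values):
    """Rank 1 for the largest value. Assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), with d_i the rank difference."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

phy = [35, 23, 47, 17, 10, 43, 9, 6, 28]
math_marks = [30, 33, 45, 23, 8, 49, 12, 4, 31]

rho = spearman_rho(phy, math_marks)
print(round(rho, 2))
```

The sign of each $d_i$ does not matter because only $d_i^2$ enters the formula.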

e.g.
Phy (X)   35 23 47 17 10 43 9 6 28
Math (Y)  30 33 45 23 8 49 12 4 31

Phy (X)  Math (Y)  Rank (X)  Rank (Y)  d_i  d_i^2
35       30        3         5         -2   4
23       33        5         3         2    4
47       45        1         2         -1   1
17       23        6         6         0    0
10       8         7         8         -1   1
43       49        2         1         1    1
9        12        8         7         1    1
6        4         9         9         0    0
28       31        4         4         0    0
Total                                       12

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 12}{9(9^2 - 1)} = 1 - \frac{72}{9(80)} = 1 - \frac{72}{720} = 1 - 0.1 = 0.9$$

$\rho = 0.9$
Find $\rho$ for the following data.
Units Price
1 46
2.5 42
3 38
4.5 35
5 30
6 25
6.5 15
7 10

Units (X)  Price (Y)  Rank (Y)  Rank (X)  d_i  d_i^2
1          46         1         8         -7   49
2.5        42         2         7         -5   25
3          38         3         6         -3   9
4.5        35         4         5         -1   1
5          30         5         4         1    1
6          25         6         3         3    9
6.5        15         7         2         5    25
7          10         8         1         7    49
Total                                          168

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 168}{8(8^2 - 1)} = 1 - \frac{1008}{8(63)} = 1 - \frac{1008}{504} = 1 - 2 = -1$$

$\rho = -1$
Find $\rho$ for the following data.
Units Price
1 35
2.5 15
3 46
4.5 10
5 25
6 42
6.5 30
7 38

Units (X)  Price (Y)  Rank (X)  Rank (Y)  d_i  d_i^2
1          35         8         4         4    16
2.5        15         7         7         0    0
3          46         6         1         5    25
4.5        10         5         8         -3   9
5          25         4         6         -2   4
6          42         3         2         1    1
6.5        30         2         5         -3   9
7          38         1         3         -2   4
Total                                          68

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 68}{8(8^2 - 1)} = 1 - \frac{408}{8(63)} = 1 - \frac{408}{504} = 1 - 0.81 = 0.19$$

$\rho = 0.19$
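Tied Ranks:

When two or more observations share the same value, each of them receives the average of the ranks they would otherwise occupy (the adjusted rank), and the formula becomes

$$\rho = 1 - \frac{6\left(\sum_{i=1}^{n} d_i^2 + T_x + T_y\right)}{n(n^2 - 1)}$$

Where,

$$T_x = \frac{1}{12}\sum_{i=1}^{t} m_i(m_i^2 - 1), \qquad T_y = \frac{1}{12}\sum_{i=1}^{t} m_i(m_i^2 - 1)$$

Here $m_i$ is the number of observations sharing the $i$-th repeated value and $t$ is the number of such groups, counted separately for X and for Y.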

e.g.
X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70

X    Rank (X)  adj. Rank (X)  Y    Rank (Y)  adj. Rank (Y)  d_i  d_i^2
68   4         4              62   5         5              -1   1
64   5         6              58   7         7              -1   1
75   2         2.5            68   3         3.5            -1   1
50   9         9              45   10        10             -1   1
64   5         6              81   1         1              5    25
80   1         1              60   6         6              -5   25
75   2         2.5            68   4         3.5            -1   1
40   10        10             48   9         9              1    1
55   8         8              50   8         8              0    0
64   5         6              70   2         2              4    16
Total                                                            72

Using old formula:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 72}{10(10^2 - 1)} = 1 - \frac{432}{990} = 1 - 0.436 = 0.564$$

$\rho = 0.564$

Using formula for tied ranks:

$$T_x = \frac{1}{12}\sum_{i=1}^{t} m_i(m_i^2 - 1) = \frac{1}{12}\left[1(1^2 - 1) + 2(2^2 - 1) + 3(3^2 - 1)\right] = \frac{1}{12}\left[0 + 6 + 24\right] = \frac{30}{12} = 2.5$$

$$T_y = \frac{1}{12}\sum_{i=1}^{t} m_i(m_i^2 - 1) = \frac{1}{12}\left[1(1^2 - 1) + 2(2^2 - 1)\right] = \frac{1}{12}\left[0 + 6\right] = \frac{6}{12} = 0.5$$

$$\rho = 1 - \frac{6\left(\sum_{i=1}^{n} d_i^2 + T_x + T_y\right)}{n(n^2 - 1)} = 1 - \frac{6(72 + 2.5 + 0.5)}{10(10^2 - 1)} = 1 - \frac{6 \times 75}{990} = 1 - \frac{450}{990} = 1 - 0.455 = 0.545$$

$\rho = 0.545$

Question for practice:

X 35 23 47 17 10 47 9 6 28
Y 30 33 49 23 8 49 12 4 31
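
The tied-rank calculation can be sketched in Python as follows, using the ten-observation X and Y example above. The helper names `average_ranks`, `tie_correction` and `spearman_rho_tied` are illustrative. Groups of size 1 contribute zero to the correction, so summing over every distinct value gives the same $T_x$ and $T_y$ as summing over the tied groups only.

```python
from collections import Counter

def average_ranks(values):
    """Descending ranks; tied values share the mean of the positions they occupy."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """T = (1/12) * sum of m * (m^2 - 1) over groups of m equal values."""
    return sum(m * (m ** 2 - 1) for m in Counter(values).values()) / 12

def spearman_rho_tied(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    t_x, t_y = tie_correction(xs), tie_correction(ys)
    return 1 - 6 * (sum_d2 + t_x + t_y) / (n * (n ** 2 - 1))

x = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
y = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]

rho_tied = spearman_rho_tied(x, y)
print(tie_correction(x), tie_correction(y), round(rho_tied, 3))
```

The same function can be applied to the practice data, where 47 and 49 are each repeated.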
Linear Regression

Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data.
Y(popularity)= content + timing + presentation + …

Uses

• Prediction of future values
• Consumer behaviour
• Pricing and promotions on sales of a product
• Assess risk in financial services

Simple Linear Regression:

1 dependent variable, 1 independent variable.

$$y = a + bx$$

1. $\sum_{i=1}^{n} y_i = na + b\sum_{i=1}^{n} x_i$
2. $\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2$

Equations 1 and 2 are known as the normal equations. They are obtained by the method of least squares.
Grades (y)  Time on social media (x)
8           1
7           4
7           5
6           8
5           5

1. $\sum_{i=1}^{n} y_i = na + b\sum_{i=1}^{n} x_i$
2. $\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2$

1. 33 = 5a + 23b
2. 144 = 23a + 131b

(33 = 5a + 23b) × 23:    759 = 115a + 529b
(144 = 23a + 131b) × 5:  720 = 115a + 655b
Subtracting: 759 - 720 = 115a - 115a + 529b - 655b
39 = -126b
b = -39/126 = -0.3095

Substituting b in equation 1:
33 = 5a + 23(-0.3095)
33 = 5a - 7.12
5a = 40.12
a = 40.12/5
a = 8.02

a = 8.02
b = -0.31

$$\hat{y} = 8.02 - 0.31x$$

X = 5; Y = 6.47
X = 3; Y = 7.09
X = 7; Y = 5.85
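
As a check on the elimination steps, the two normal equations can be solved directly in Python. This is a minimal sketch; `fit_line` and the variable names are illustrative. Solving 33 = 5a + 23b and 144 = 23a + 131b exactly gives b = -39/126 ≈ -0.31 and a ≈ 8.02.

```python
def fit_line(xs, ys):
    """Least-squares a (intercept) and b (slope) from the two normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Eliminating a from the normal equations gives b; equation 1 then gives a.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

time_on_social_media = [1, 4, 5, 8, 5]   # x
grades = [8, 7, 7, 6, 5]                 # y

a, b = fit_line(time_on_social_media, grades)
for x in (5, 3, 7):
    print(x, round(a + b * x, 2))
```

The same function fits Y on X2 or Y on X3 by passing a different predictor column.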
Y  X1  X2  X3
8  1   10  9
7  4   4   10
7  5   3   6
6  8   3   5
5  5   2   5

Fit a regression line for Y & X2 and for Y & X3.


Y      X2  X2^2  X2.Y
8      10  100   80
7      4   16    28
7      3   9     21
6      3   9     18
5      2   4     10
Total
33     22  138   157

33 = 5a + 22b; 157 = 22a + 138b
a = 5.34; b = 0.29
Fitted regression is Y = 5.34 + 0.29 X2

Y      X3  X3^2  X3.Y
8      9   81    72
7      10  100   70
7      6   36    42
6      5   25    30
5      5   25    25
Total
33     35  267   239

33 = 5a + 35b; 239 = 35a + 267b
a = 4.05; b = 0.36
Fitted regression is Y = 4.05 + 0.36 X3
Coefficient of determination ($r^2$)
$r^2$ is used to analyse the extent to which differences in one variable explain the differences in another variable.

$$r^2 = \left[r(X,Y)\right]^2 = \left[\frac{\sigma_{XY}}{\sigma_X \sigma_Y}\right]^2 = \frac{\left[\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}\right]^2}{\left[\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2\right]\left[\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2\right]}$$

Range: $0 \le r^2 \le 1$

Simple Linear Regression:

$$y = a + bx$$

Non-Linear Regression (Not a straight line)
Fitting of Exponential curve.

$$y = ab^x$$

Transformation using log:

$$y = ab^x$$
$$\log y = \log(ab^x)$$
$$\log y = \log a + \log b^x$$
$$\log y = \log a + x \log b$$
$$Y = A + xB$$
$$Y = A + Bx$$

where $Y = \log y$,
$A = \log a$; $a = \mathrm{antilog}(A)$
$B = \log b$; $b = \mathrm{antilog}(B)$

Normal Equations:

1. $\sum_{i=1}^{n} Y_i = nA + B\sum_{i=1}^{n} x_i$
2. $\sum_{i=1}^{n} x_i Y_i = A\sum_{i=1}^{n} x_i + B\sum_{i=1}^{n} x_i^2$
Note:
$\log(ab) = \log a + \log b$; $\log(a/b) = \log a - \log b$; $\log(a^b) = b \log a$
Logarithms are taken to base 10: log 10 = 1, log 100 = 2, log 212 = 2.33.
If A = log a = 1, then a = antilog(1) = $10^1$ = 10.
e.g.
Fit an exponential curve to the given data:
y      x   x^2  Y = log y  x.Y
8      10  100  0.90       9.03
7      4   16   0.85       3.38
7      3   9    0.85       2.54
6      3   9    0.78       2.33
5      2   4    0.70       1.40
Total
33     22  138  4.07       18.68

Normal equations:

1. $\sum_{i=1}^{n} Y_i = nA + B\sum_{i=1}^{n} x_i$, i.e. 4.07 = 5A + 22B
2. $\sum_{i=1}^{n} x_i Y_i = A\sum_{i=1}^{n} x_i + B\sum_{i=1}^{n} x_i^2$, i.e. 18.68 = 22A + 138B

A = 0.732
B = 0.019
a = antilog(A) = antilog(0.732) = 5.395
b = antilog(B) = antilog(0.019) = 1.045
The fitted exponential curve is:
$$y = ab^x$$
$$\hat{y} = (5.395)(1.045)^x$$
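
The log transformation and the normal equations can be sketched in Python as follows. The function name `fit_exponential` is illustrative. Base-10 logarithms are used, as in the notes, and the results agree with the hand calculation up to the rounding of the table entries.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on Y = log10(y) = A + B*x."""
    n = len(xs)
    log_y = [math.log10(y) for y in ys]
    sum_x, sum_Y = sum(xs), sum(log_y)
    sum_xx = sum(x * x for x in xs)
    sum_xY = sum(x * Y for x, Y in zip(xs, log_y))
    # Solve the two normal equations for A and B.
    B = (n * sum_xY - sum_x * sum_Y) / (n * sum_xx - sum_x ** 2)
    A = (sum_Y - B * sum_x) / n
    # a = antilog(A), b = antilog(B)
    return 10 ** A, 10 ** B

x = [10, 4, 3, 3, 2]
y = [8, 7, 7, 6, 5]

a, b = fit_exponential(x, y)
print(round(a, 2), round(b, 3))
```

Fitting on the log scale minimises squared errors in log y rather than in y itself, which is the approach taken in these notes.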

Question for practice:

y      x   x^2  Y = log y  x.Y
8      9   81   0.90       8.13
7      10  100  0.85       8.45
7      6   36   0.85       5.07
6      5   25   0.78       3.89
5      5   25   0.70       3.49
Total
33     35  267  4.07       29.03

4.07 = 5A + 35B
29.03 = 35A + 267B
A = 0.642; a = antilog(0.642) = 4.39
B = 0.025; b = antilog(0.025) = 1.06
The fitted exponential curve is:
$$y = ab^x$$
$$\hat{y} = (4.39)(1.06)^x$$
Solution for the practice Question: Fit Regression line:
y      y^2  x1  x1^2  x1.y  x2  x2^2  x2.y  x3  x3^2  x3.y
3      9    12  144   36    13  169   39    5   25    15
6      36   8   64    48    15  225   90    7   49    42
9      81   16  256   144   16  256   144   11  121   99
8      64   14  196   112   15  225   120   12  144   96
7      49   8   64    56    16  256   112   9   81    63
10     100  9   81    90    19  361   190   8   64    80
Total
43     339  67  805   486   94  1492  695   52  484   395

For x1: 43 = 6a + 67b; 486 = 67a + 805b; a = 6.02, b = 0.103
For x2: 43 = 6a + 94b; 695 = 94a + 1492b; a = -10.121, b = 1.104
For x3: 43 = 6a + 52b; 395 = 52a + 484b; a = 1.36, b = 0.67

Fitted regression line is:
$$\hat{y} = 6.02 + 0.103\,x_1$$
$$\hat{y} = -10.121 + 1.104\,x_2$$
$$\hat{y} = 1.36 + 0.67\,x_3$$
For Coefficient of Determination:

$$\bar{y} = \frac{1}{6}(43) = 7.167, \quad \bar{x}_1 = \frac{1}{6}(67) = 11.167, \quad \bar{x}_2 = \frac{1}{6}(94) = 15.667, \quad \bar{x}_3 = \frac{1}{6}(52) = 8.667$$

$$\sigma_{x_1} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_{1i}^2 - \bar{x}_1^2} = \sqrt{\frac{1}{6}(805) - 11.167^2} = \sqrt{134.167 - 124.702} = \sqrt{9.465} = 3.076$$

$$\sigma_{x_2} = \sqrt{\frac{1}{6}(1492) - 15.667^2} = \sqrt{248.667 - 245.455} = \sqrt{3.212} = 1.792$$

$$\sigma_{x_3} = \sqrt{\frac{1}{6}(484) - 8.667^2} = \sqrt{80.667 - 75.117} = \sqrt{5.55} = 2.356$$

$$\sigma_{y} = \sqrt{\frac{1}{6}(339) - 7.167^2} = \sqrt{56.5 - 51.366} = \sqrt{5.134} = 2.266$$

$$\sigma_{x_1 y} = \frac{1}{n}\sum_{i=1}^{n} x_{1i} y_i - \bar{x}_1 \bar{y} = \frac{1}{6}(486) - 11.167 \times 7.167 = 81 - 80.034 = 0.966$$

$$\sigma_{x_2 y} = \frac{1}{6}(695) - 15.667 \times 7.167 = 115.833 - 112.285 = 3.548$$

$$\sigma_{x_3 y} = \frac{1}{6}(395) - 8.667 \times 7.167 = 65.833 - 62.116 = 3.717$$

$$r(x_1, y) = \frac{\sigma_{x_1 y}}{\sigma_{x_1}\sigma_y} = \frac{0.966}{3.076 \times 2.266} = \frac{0.966}{6.97} = 0.139$$

$$r(x_2, y) = \frac{\sigma_{x_2 y}}{\sigma_{x_2}\sigma_y} = \frac{3.548}{1.792 \times 2.266} = \frac{3.548}{4.061} = 0.874$$

$$r(x_3, y) = \frac{\sigma_{x_3 y}}{\sigma_{x_3}\sigma_y} = \frac{3.717}{2.356 \times 2.266} = \frac{3.717}{5.339} = 0.696$$

$$r^2(x_1, y) = 0.139^2 = 0.019, \quad r^2(x_2, y) = 0.874^2 = 0.764, \quad r^2(x_3, y) = 0.696^2 = 0.484$$

It can be concluded that 76.4% of the variation in y is explained by x2, compared with 1.9% for x1 and 48.4% for x3.
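
The three coefficients of determination can be compared with a short Python sketch. The function name `r_squared` and the dictionary layout are illustrative; the computation squares the Pearson coefficient built from the 1/n formulas used throughout this unit.

```python
import math

def r_squared(xs, ys):
    """Coefficient of determination as the square of Pearson's r (1/n formulas)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    sd_x = math.sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sd_y = math.sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    return (cov_xy / (sd_x * sd_y)) ** 2

y = [3, 6, 9, 8, 7, 10]
predictors = {
    "x1": [12, 8, 16, 14, 8, 9],
    "x2": [13, 15, 16, 15, 16, 19],
    "x3": [5, 7, 11, 12, 9, 8],
}

scores = {name: r_squared(xs, y) for name, xs in predictors.items()}
best = max(scores, key=scores.get)   # predictor explaining the most variation in y
for name, value in scores.items():
    print(name, round(value, 3))
```

Small differences in the third decimal place relative to the hand calculation come from rounding the means and standard deviations before dividing.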
Solution for the practice Question: Fit Exponential Curve:
y      Y = log y  x1  x1^2  x1.Y   x2  x2^2  x2.Y   x3  x3^2  x3.Y
3      0.48       12  144   5.73   13  169   6.20   5   25    2.39
6      0.78       8   64    6.23   15  225   11.67  7   49    5.45
9      0.95       16  256   15.27  16  256   15.27  11  121   10.50
8      0.90       14  196   12.64  15  225   13.55  12  144   10.84
7      0.85       8   64    6.76   16  256   13.52  9   81    7.61
10     1.00       9   81    9.00   19  361   19.00  8   64    8.00
Total
43     4.96       67  805   55.62  94  1492  79.21  52  484   44.77

For x1: 4.96 = 6A + 67B; 55.62 = 67A + 805B; A = 0.781, a = 6.04; B = 0.004, b = 1.01
For x2: 4.96 = 6A + 94B; 79.21 = 94A + 1492B; A = -0.392, a = 0.41; B = 0.078, b = 1.2
For x3: 4.96 = 6A + 52B; 44.77 = 52A + 484B; A = 0.363, a = 2.31; B = 0.054, b = 1.13

The fitted exponential curve is:

$$\hat{y} = (6.04)(1.01)^{x_1}$$
$$\hat{y} = (0.41)(1.2)^{x_2}$$
$$\hat{y} = (2.31)(1.13)^{x_3}$$
Estimation of simple and exponential trends for Time Series.

Time Series:
A time series may be defined as a collection of observations belonging to different
time periods, of some economic variables or any other phenomenon. The period
can be seconds, minutes, hours, days, weeks, months, years, decades, etc.
Mathematically, a time series is defined as:

$$y_t = f(t)$$

where $y_t$ is the value of the phenomenon (or variable) under study at time t.
e.g.
1. Population ($y_t$) of a country in different years (t)
2. Number of coronavirus cases ($y_t$) on different days (t)
3. Rainfall ($y_t$) in different months (t)
4. Temperature ($y_t$) of a place in different years (t)

Components of Time Series:

1. Secular Trend or Long-term movement
2. Periodic Changes or Short-term fluctuations
   a. Seasonal variations
   b. Cyclic variations
3. Random or Irregular Movements.

Trend: The general tendency of the data to increase or decrease over a long period of time, i.e. an upward or downward movement. Examples: population, number of deaths, literacy rate, standard of living, etc.
Estimation of trend for Time Series:

Graphical method:

Year (t)  Profit (y_t)
1996      12.6
1997      14.8
1998      18.6
1999      14.8
2000      16.6
2001      21.2
2002      18
2003      17.4
2004      15.8

[Figure: line chart of Profit (y_t) against Year (t), 1996 to 2004]

Method of Semi-average:

$$t_1 = \frac{1996 + 1997 + 1998 + 1999 + 2000}{5} = 1998, \qquad Y_{t_1} = \frac{12.6 + 14.8 + 18.6 + 14.8 + 16.6}{5} = 15.48$$

$$t_2 = \frac{2001 + 2002 + 2003 + 2004}{4} = 2002.5, \qquad Y_{t_2} = \frac{21.2 + 18 + 17.4 + 15.8}{4} = 18.1$$

Semi average points (1998, 15.48) and (2002.5, 18.1).
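
A minimal Python sketch of the semi-average calculation. The function name `semi_average_points` is illustrative, and the split follows the example above: with an odd number of observations the first part takes the extra year (5 and 4 here), which is an assumption taken from this worked example rather than a general rule.

```python
def semi_average_points(times, values):
    """Average time and value over the first and second parts of the series."""
    first_len = (len(times) + 1) // 2          # first part is longer when n is odd
    second_len = len(times) - first_len
    first = (sum(times[:first_len]) / first_len,
             sum(values[:first_len]) / first_len)
    second = (sum(times[first_len:]) / second_len,
              sum(values[first_len:]) / second_len)
    return first, second

years = list(range(1996, 2005))
profit = [12.6, 14.8, 18.6, 14.8, 16.6, 21.2, 18, 17.4, 15.8]

point_1, point_2 = semi_average_points(years, profit)
# Slope of the trend line joining the two semi-average points.
slope = (point_2[1] - point_1[1]) / (point_2[0] - point_1[0])
print(point_1, point_2, round(slope, 3))
```

The straight line through the two points is the semi-average trend line.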


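Method of curve fitting:

1. Fitting of simple trend line (straight line):

$$y_t = a + bt$$

Normal equations:
1. $\sum_{i=1}^{n} y_{t_i} = na + b\sum_{i=1}^{n} t_i$
2. $\sum_{i=1}^{n} t_i y_{t_i} = a\sum_{i=1}^{n} t_i + b\sum_{i=1}^{n} t_i^2$

$$\hat{y}_t = a + bt$$

2. Fitting of Exponential trend line (curve):

$$y_t = ab^t$$
$$Y_t = A + Bt$$

where $Y_t = \log y_t$, $A = \log a$ and $B = \log b$.

Normal equations:
1. $\sum_{i=1}^{n} Y_{t_i} = nA + B\sum_{i=1}^{n} t_i$
2. $\sum_{i=1}^{n} t_i Y_{t_i} = A\sum_{i=1}^{n} t_i + B\sum_{i=1}^{n} t_i^2$

$$\hat{y}_t = ab^t$$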

Fit simple and exponential trend lines for the time series data and estimate the trends for the year 2000 and 2005.

Year (t)  Profit (y_t)  x = (t-2000)  x^2  x.y_t  Y = log y_t  x.Y
1996      12.6          -4            16   -50.4  1.10         -4.40
1997      14.8          -3            9    -44.4  1.17         -3.51
1998      18.6          -2            4    -37.2  1.27         -2.54
1999      14.8          -1            1    -14.8  1.17         -1.17
2000      16.6          0             0    0      1.22         0.00
2001      21.2          1             1    21.2   1.33         1.33
2002      18            2             4    36     1.26         2.51
2003      17.4          3             9    52.2   1.24         3.72
2004      15.8          4             16   63.2   1.20         4.79
Total     149.8         0             60   25.8   10.95        0.73

Simple trend line

149.8 = 9a + 0b; a = 16.64
25.8 = 0a + 60b; b = 0.43
x = t - 2000
$$\hat{y}_t = 16.64 + 0.43x$$
Trend for year 2000: $\hat{y}_{2000} = 16.64 + 0.43 \times (2000 - 2000) = 16.64$
Trend for year 2005: $\hat{y}_{2005} = 16.64 + 0.43 \times (2005 - 2000) = 16.64 + 0.43 \times 5 = 18.79$

Exponential trend line

10.95 = 9A + 0B
0.73 = 0A + 60B
A = 1.217; a = 16.482
B = 0.012; b = 1.028
x = t - 2000
$$\hat{y}_t = (16.482)(1.028)^x$$
Trend for year 2000: $\hat{y}_{2000} = (16.482)(1.028)^{(2000-2000)} = 16.482$
Trend for year 2005: $\hat{y}_{2005} = (16.482)(1.028)^{(2005-2000)} = (16.482)(1.028)^5 = 18.922$
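
Both trend fits can be reproduced with a short Python sketch. The names `fit_line`, `linear_trend` and `exponential_trend` are illustrative. Shifting the origin to 2000 makes the sum of x equal to zero, so the normal equations decouple, but the general solution below works for any origin. The results are close to the hand-computed values; small differences come from rounding the logarithms to two decimals in the table.

```python
import math

def fit_line(xs, ys):
    """Solve the two normal equations for intercept a and slope b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

years = list(range(1996, 2005))
profit = [12.6, 14.8, 18.6, 14.8, 16.6, 21.2, 18, 17.4, 15.8]
origin = 2000
x = [t - origin for t in years]

# Simple (straight-line) trend: y_t = a + b*x
a, b = fit_line(x, profit)
# Exponential trend: log10(y_t) = A + B*x, so y_t = 10**A * (10**B)**x
A, B = fit_line(x, [math.log10(v) for v in profit])
a_exp, b_exp = 10 ** A, 10 ** B

def linear_trend(t):
    return a + b * (t - origin)

def exponential_trend(t):
    return a_exp * b_exp ** (t - origin)

for t in (2000, 2005):
    print(t, round(linear_trend(t), 2), round(exponential_trend(t), 2))
```

The same code applies to the practice data by replacing `years` and `profit`.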

Question for practice:


Fit simple and exponential trend lines for the time series data and estimate the
trends for the year 2000 and 2005.
t y
1990 17
1991 20
1992 19
1993 26
1994 24
1995 40
1996 35
1997 55
1998 51
1999 74
2000 79
