Unit 3 – Descriptive Analysis of Bivariate Data
1. Methods and measures of studying relationship between two
variables: Scatter Diagrams, Simple correlation coefficient, Rank
correlation coefficient, Linear Regression, Coefficient of
determination.
2. Estimation of simple and exponential trends for Time Series.
Correlation:
Correlation shows how strongly a pair of variables is related, e.g. height and weight,
rainfall and rice farming, etc.
Scatter Diagrams:
A scatter diagram is the simplest way to represent bivariate data diagrammatically. It is
obtained by plotting each pair (x, y) as a point, with X along the x-axis and Y along the y-axis.
a.
Units Price
1 10
2.5 15
3 25
4.5 30
5 35
6 38
6.5 42
7 46
b.
Units Price
1 46
2.5 42
3 38
4.5 35
5 30
6 25
6.5 15
7 10
c.
Units Price
1 35
2.5 15
3 46
4.5 10
5 25
6 42
6.5 30
7 38
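Plotted as scatter diagrams, the three data sets show:
a. Positive Correlation (Price rises as Units rise)
b. Negative Correlation (Price falls as Units rise)
c. No Correlation (no clear pattern)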
e.g.
1. Motorcycle speed and number of accidents.
2. Temperature and sales of AC.
Simple correlation coefficient (Karl Pearson’s coefficient of correlation):
It measures intensity or degree of linear relationship between two variables.
Not suitable for non-linear relationships.
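The correlation coefficient between two random variables X and Y is denoted by
$r(X, Y)$ or $r_{XY}$.

$$r(X, Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2}}$$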
Covariance:

$$\sigma_{XY} = \mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}$$
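Standard Deviation:

$$\sigma_X = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$$

$$\sigma_Y = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2}$$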
Range: [-1, 1]
a. Positive Correlation: $0 < r(X, Y) \le 1$
b. Negative Correlation: $-1 \le r(X, Y) < 0$
c. No Correlation: $r(X, Y) = 0$
Q1.
Units Price
1 10
2.5 15
3 25
4.5 30
5 35
6 38
6.5 42
7 46
r(X, Y) = ?
S.N.    Units (X)   Price (Y)   X^2      Y^2     X*Y
1       1           10          1        100     10
2       2.5         15          6.25     225     37.5
3       3           25          9        625     75
4       4.5         30          20.25    900     135
5       5           35          25       1225    175
6       6           38          36       1444    228
7       6.5         42          42.25    1764    273
8       7           46          49       2116    322
Total   35.5        241         188.75   8399    1255.5
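$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{8}(35.5) = 4.44 \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{8}(241) = 30.13$$

$$\sigma_X = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \sqrt{\frac{1}{8}(188.75) - 4.44^2} = \sqrt{23.59 - 19.71} = \sqrt{3.88} = 1.97$$

$$\sigma_Y = \sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2} = \sqrt{\frac{1}{8}(8399) - 30.13^2} = \sqrt{1049.88 - 907.82} = \sqrt{142.06} = 11.92$$

$$\sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} = \frac{1}{8}(1255.5) - (4.44)(30.13) = 156.94 - 133.78 = 23.16$$

$$r(X, Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{23.16}{(1.97)(11.92)} = \frac{23.16}{23.48} = 0.9864$$

r(X, Y) = 0.9864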
Q2.
Units Price
1 46
2.5 42
3 38
4.5 35
5 30
6 25
6.5 15
7 10
r(X, Y) = -0.952
Q3.
Units Price
1 35
2.5 15
3 46
4.5 10
5 25
6 42
6.5 30
7 38
r(X, Y) = 0.128
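As a quick check on Q1 to Q3, the population (1/n) formulas for covariance and standard deviation can be sketched in Python. The function name `pearson_r` and the variable names are illustrative assumptions, not part of the notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from the 1/n formulas: cov(X, Y) / (sd(X) * sd(Y))."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: mean of products minus product of means
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    # Standard deviations: sqrt(mean of squares minus squared mean)
    sd_x = sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
    sd_y = sqrt(sum(y * y for y in ys) / n - mean_y ** 2)
    return cov_xy / (sd_x * sd_y)


units = [1, 2.5, 3, 4.5, 5, 6, 6.5, 7]
price_q1 = [10, 15, 25, 30, 35, 38, 42, 46]
price_q2 = [46, 42, 38, 35, 30, 25, 15, 10]
price_q3 = [35, 15, 46, 10, 25, 42, 30, 38]

for label, price in [("Q1", price_q1), ("Q2", price_q2), ("Q3", price_q3)]:
    print(label, round(pearson_r(units, price), 3))
```

Because the code does not round intermediate values, the Q1 result should come out near 0.987 rather than the hand-computed 0.9864; Q2 and Q3 should agree with -0.952 and 0.128 to three decimals.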
Spearman Rank Correlation Coefficient/ Spearman’s Rho (ρ)
Data must be ordinal, interval or ratio.
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

Where,
$d_i$ = difference of ranks
$n$ = number of observations
Range: [-1, 1]
a. Positive Correlation: $0 < \rho \le 1$
b. Negative Correlation: $-1 \le \rho < 0$
c. No Correlation: $\rho = 0$
e.g.
Phy (X) 35 23 47 17 10 43 9 6 28
Math (Y) 30 33 45 23 8 49 12 4 31
Phy (X)   Math (Y)   Rank (X)   Rank (Y)   d_i   d_i^2
35        30         3          5          -2    4
23        33         5          3          2     4
47        45         1          2          -1    1
17        23         6          6          0     0
10        8          7          8          -1    1
43        49         2          1          1     1
9         12         8          7          1     1
6         4          9          9          0     0
28        31         4          4          0     0
Total                                            12

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 12}{9(9^2 - 1)} = 1 - \frac{72}{9(80)} = 1 - \frac{72}{720} = 1 - 0.1 = 0.9$$

$\rho = 0.9$
Units Price
1 46
2.5 42
3 38
4.5 35
5 30
6 25
6.5 15
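7 10
ρ = ?

Units (X)   Price (Y)   Rank (Y)   Rank (X)   d_i   d_i^2
1           46          1          8          -7    49
2.5         42          2          7          -5    25
3           38          3          6          -3    9
4.5         35          4          5          -1    1
5           30          5          4          1     1
6           25          6          3          3     9
6.5         15          7          2          5     25
7           10          8          1          7     49
Total                                               168

Here d_i is taken as Rank (Y) minus Rank (X); the sign of d_i does not matter because only d_i^2 enters the formula.

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 168}{8(8^2 - 1)} = 1 - \frac{1008}{8(63)} = 1 - \frac{1008}{504} = 1 - 2 = -1$$

$\rho = -1$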
Units Price
1 35
2.5 15
3 46
4.5 10
5 25
6 42
6.5 30
7 38
ρ = ?

Units (X)   Price (Y)   Rank (X)   Rank (Y)   d_i   d_i^2
1           35          8          4          4     16
2.5         15          7          7          0     0
3           46          6          1          5     25
4.5         10          5          8          -3    9
5           25          4          6          -2    4
6           42          3          2          1     1
6.5         30          2          5          -3    9
7           38          1          3          -2    4
Total                                               68

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 68}{8(8^2 - 1)} = 1 - \frac{408}{8(63)} = 1 - \frac{408}{504} = 1 - 0.81 = 0.19$$

$\rho = 0.19$
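The three rank-correlation examples can be reproduced with a short Python sketch. It assumes there are no tied values and ranks the largest value as 1, as in the tables; the helper names `ranks_descending` and `spearman_rho` are illustrative, not from the notes.

```python
def ranks_descending(values):
    """Rank 1 = largest value. Assumes all values are distinct (no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no tie correction."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


phy = [35, 23, 47, 17, 10, 43, 9, 6, 28]
math_marks = [30, 33, 45, 23, 8, 49, 12, 4, 31]
units = [1, 2.5, 3, 4.5, 5, 6, 6.5, 7]
price_b = [46, 42, 38, 35, 30, 25, 15, 10]
price_c = [35, 15, 46, 10, 25, 42, 30, 38]

print(round(spearman_rho(phy, math_marks), 2))
print(round(spearman_rho(units, price_b), 2))
print(round(spearman_rho(units, price_c), 2))
```

Ranking the smallest value as 1 instead would give the same rho, as long as both variables are ranked in the same direction.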
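Tied Ranks:
When some values are tied, each tied value receives the average of the ranks it would otherwise occupy, and a correction term is added for each variable:

$$\rho = 1 - \frac{6\left(\sum_{i=1}^{n} d_i^2 + T_x + T_y\right)}{n(n^2 - 1)}$$

Where,

$$T_x = \frac{1}{12}\sum_{i=1}^{t} m_i(m_i^2 - 1) \quad \text{(computed from the X values)}$$

$$T_y = \frac{1}{12}\sum_{i=1}^{t} m_i(m_i^2 - 1) \quad \text{(computed from the Y values)}$$

$m_i$ = number of observations in the i-th group of equal values (a group of size 1 contributes 0)
$t$ = number of such groups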
e.g.
X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70
X    Rank (X)   Adj. Rank (X)   Y    Rank (Y)   Adj. Rank (Y)   d_i   d_i^2
68   4          4               62   5          5               -1    1
64   5          6               58   7          7               -1    1
75   2          2.5             68   3          3.5             -1    1
50   9          9               45   10         10              -1    1
64   5          6               81   1          1               5     25
80   1          1               60   6          6               -5    25
75   2          2.5             68   4          3.5             -1    1
40   10         10              48   9          9               1     1
55   8          8               50   8          8               0     0
64   5          6               70   2          2               4     16
Total                                                                 72
Using old formula:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 72}{10(10^2 - 1)} = 1 - \frac{432}{990} = 1 - 0.436 = 0.564$$

$\rho = 0.564$
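Using formula for tied ranks:

$$T_x = \frac{1}{12}\sum_{i=1}^{t} m_i(m_i^2 - 1) = \frac{1}{12}\left[1(1^2 - 1) + 2(2^2 - 1) + 3(3^2 - 1)\right] = \frac{1}{12}(0 + 6 + 24) = \frac{30}{12} = 2.5$$

$$T_y = \frac{1}{12}\sum_{i=1}^{t} m_i(m_i^2 - 1) = \frac{1}{12}\left[1(1^2 - 1) + 2(2^2 - 1)\right] = \frac{1}{12}(0 + 6) = \frac{6}{12} = 0.5$$

$$\rho = 1 - \frac{6\left(\sum_{i=1}^{n} d_i^2 + T_x + T_y\right)}{n(n^2 - 1)} = 1 - \frac{6(72 + 2.5 + 0.5)}{10(10^2 - 1)} = 1 - \frac{6 \times 75}{990} = 1 - \frac{450}{990} = 1 - 0.455 = 0.545$$

$\rho = 0.545$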
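The tied-rank calculation can be sketched in Python as follows. The helper names (`average_ranks`, `tie_correction`, `spearman_rho_tied`) are illustrative assumptions; the correction term is the $\frac{1}{12}\sum m_i(m_i^2 - 1)$ expression from the formula above, with rank 1 given to the largest value.

```python
from collections import Counter


def average_ranks(values):
    """Rank 1 = largest; tied values share the mean of the positions they occupy."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for pos, v in enumerate(ordered, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """T = (1/12) * sum of m * (m^2 - 1) over groups of equal values."""
    return sum(m * (m * m - 1) for m in Counter(values).values()) / 12


def spearman_rho_tied(xs, ys):
    """Spearman's rho with the tie-correction terms added to sum(d^2)."""
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    corrected = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * corrected / (n * (n * n - 1))


x = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
y = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]

print(average_ranks(x))
print(tie_correction(x), tie_correction(y))
print(round(spearman_rho_tied(x, y), 3))
```

For the example data this should give adjusted ranks matching the table, correction terms of 2.5 and 0.5, and a coefficient of about 0.545.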
Question for practice:
X 35 23 47 17 10 47 9 6 28
Y 30 33 49 23 8 49 12 4 31
Linear Regression
Regression analysis is a mathematical measure of the average relationship
between two or more variables in terms of the original units of the data.
Y(popularity)= content + timing + presentation + …
Uses
Prediction of future values
Consumer behaviour
Pricing and promotions on sales of a product
Assess risk in financial services
Simple Linear Regression:
1 dependent variable, 1 independent variable.

$$y = a + bx$$

$$1.\quad \sum_{i=1}^{n} y_i = na + b\sum_{i=1}^{n} x_i$$

$$2.\quad \sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2$$

Equations 1 and 2 are known as the normal equations. They are obtained by the method of least
squares.
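A brief sketch of where the normal equations come from: least squares chooses a and b to minimise the sum of squared errors. The symbol $S$ is introduced here for illustration and does not appear in the notes.

```latex
\begin{align*}
S(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial S}{\partial a} = -2\sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \;&\Rightarrow\; \sum_{i=1}^{n} y_i = na + b\sum_{i=1}^{n} x_i \\
\frac{\partial S}{\partial b} = -2\sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
  \;&\Rightarrow\; \sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2
\end{align*}
```

Setting both partial derivatives to zero gives two linear equations in the two unknowns a and b, which can then be solved by elimination.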
Grades (y)   Time on social media (x)
8            1
7            4
7            5
6            8
5            5
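For this data n = 5, Σy = 33, Σx = 23, Σxy = 144 and Σx² = 131, so the normal equations become:
1. 33 = 5a + 23b
2. 144 = 23a + 131b
Multiply equation 1 by 23 and equation 2 by 5:
759 = 115a + 529b
720 = 115a + 655b
Subtract the second from the first:
759 - 720 = 115a - 115a + 529b - 655b
39 = -126b
b = -39/126 = -0.3095 ≈ -0.31
Substitute b into equation 1:
33 = 5a + 23(-0.3095)
33 = 5a - 7.12
5a = 40.12
a = 40.12/5 = 8.02
a = 8.02
b = -0.31
Fitted regression line: $\hat{y} = 8.02 - 0.31x$
x = 5: y = 6.47
x = 3: y = 7.09
x = 7: y = 5.85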
Y   X1   X2   X3
8   1    10   9
7   4    4    10
7   5    3    6
6   8    3    5
5   5    2    5
Fit a regression line for Y on X2 and for Y on X3.
Y       X2   X2^2   X2*Y
8       10   100    80
7       4    16     28
7       3    9      21
6       3    9      18
5       2    4      10
Total:
33      22   138    157
Normal equations: 33 = 5a + 22b; 157 = 22a + 138b
a = 5.34; b = 0.29
Fitted regression is Y = 5.34 + 0.29 X2
Y       X3   X3^2   X3*Y
8       9    81     72
7       10   100    70
7       6    36     42
6       5    25     30
5       5    25     25
Total:
33      35   267    239
Normal equations: 33 = 5a + 35b; 239 = 35a + 267b
a = 4.05; b = 0.36
Fitted regression is Y = 4.05 + 0.36 X3
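Solving the two normal equations by hand is the same as applying their closed-form solution, which can be sketched in Python. The function name `fit_line` is an illustrative assumption; it returns the intercept a and slope b of y = a + bx.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x from the two normal equations."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Eliminating a from the normal equations gives the slope b
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom
    # Back-substitute into equation 1: sum_y = n*a + b*sum_x
    a = (sum_y - b * sum_x) / n
    return a, b


y = [8, 7, 7, 6, 5]
x2 = [10, 4, 3, 3, 2]
x3 = [9, 10, 6, 5, 5]

for label, xs in [("X2", x2), ("X3", x3)]:
    a, b = fit_line(xs, y)
    print(label, round(a, 2), round(b, 2))
```

For Y on X2 and Y on X3 this should reproduce a = 5.34, b = 0.29 and a = 4.05, b = 0.36 respectively.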
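Coefficient of determination ($r^2$)
$r^2$ measures the proportion of the variation in one variable that is explained by the
variation in the other variable.

$$r^2(X, Y) = \left(\frac{\sigma_{XY}}{\sigma_X \sigma_Y}\right)^2 = \left(\frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2}}\right)^2$$

Range: $0 \le r^2 \le 1$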
Simple Linear Regression:
$$y = a + bx$$
Non-Linear Regression (Not a straight line)
Fitting of Exponential curve.
$$y = ab^x$$
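Transformation using log:

$$y = ab^x$$
$$\log y = \log(ab^x)$$
$$\log y = \log a + \log b^x$$
$$\log y = \log a + x \log b$$
$$Y = A + Bx$$

where $Y = \log y$, $A = \log a$ so that $a = \mathrm{antilog}(A)$, and $B = \log b$ so that $b = \mathrm{antilog}(B)$.

Normal Equations:

$$1.\quad \sum_{i=1}^{n} Y_i = nA + B\sum_{i=1}^{n} x_i$$

$$2.\quad \sum_{i=1}^{n} x_i Y_i = A\sum_{i=1}^{n} x_i + B\sum_{i=1}^{n} x_i^2$$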
Note:
$\log(ab) = \log a + \log b$; $\log(a/b) = \log a - \log b$; $\log(a^b) = b \log a$
Logarithms here are to base 10: $\log 10 = 1$, $\log 100 = 2$, $\log 212 = 2.33$
If $A = \log a = 1$, then $a = \mathrm{antilog}(1) = 10^1 = 10$.
e.g.
Fit an exponential curve to the given data:
y       x    x^2   Y = log y   x*Y
8       10   100   0.90        9.03
7       4    16    0.85        3.38
7       3    9     0.85        2.54
6       3    9     0.78        2.33
5       2    4     0.70        1.40
Total:
33      22   138   4.07        18.68
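Normal equations:

$$1.\quad \sum_{i=1}^{n} Y_i = nA + B\sum_{i=1}^{n} x_i \quad\Rightarrow\quad 4.07 = 5A + 22B$$

$$2.\quad \sum_{i=1}^{n} x_i Y_i = A\sum_{i=1}^{n} x_i + B\sum_{i=1}^{n} x_i^2 \quad\Rightarrow\quad 18.68 = 22A + 138B$$

A = 0.732
B = 0.019
$a = \mathrm{antilog}(A) = \mathrm{antilog}(0.732) = 5.395$
$b = \mathrm{antilog}(B) = \mathrm{antilog}(0.019) = 1.045$
The fitted exponential curve is:
$$y = ab^x$$
$$\hat{y} = (5.395)(1.045)^x$$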
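The log-transform approach can be sketched in Python: fit a straight line to (x, log10 y), then take antilogs of the intercept and slope. The function name `fit_exponential` is an illustrative assumption, and base-10 logarithms are used to match the worked example.

```python
from math import log10


def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on Y = log10(y) = A + B*x."""
    if any(y <= 0 for y in ys):
        raise ValueError("all y values must be positive to take logarithms")
    n = len(xs)
    log_ys = [log10(y) for y in ys]
    sum_x = sum(xs)
    sum_Y = sum(log_ys)
    sum_xx = sum(x * x for x in xs)
    sum_xY = sum(x * Y for x, Y in zip(xs, log_ys))
    # Solve the two normal equations for A and B
    B = (n * sum_xY - sum_x * sum_Y) / (n * sum_xx - sum_x ** 2)
    A = (sum_Y - B * sum_x) / n
    # Antilog (base 10) recovers a and b
    return 10 ** A, 10 ** B


x = [10, 4, 3, 3, 2]
y = [8, 7, 7, 6, 5]
a, b = fit_exponential(x, y)
print(round(a, 2), round(b, 2))
```

The values should be close to a = 5.395 and b = 1.045; small differences in the third decimal are expected because the hand calculation rounds the logarithms to two decimals.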
Question for practice:
y       x    x^2   Y = log y   x*Y
8       9    81    0.90        8.13
7       10   100   0.85        8.45
7       6    36    0.85        5.07
6       5    25    0.78        3.89
5       5    25    0.70        3.49
Total:
33      35   267   4.07        29.03
Normal equations:
4.07 = 5A + 35B
29.03 = 35A + 267B
A = 0.642, a = antilog(0.642) = 4.39
B = 0.025, b = antilog(0.025) = 1.06
The fitted exponential curve is:
$$y = ab^x \qquad \hat{y} = (4.39)(1.06)^x$$
Solution for the practice Question: Fit Regression line:
y       y^2   x1   x1^2   x1*y   x2   x2^2   x2*y   x3   x3^2   x3*y
3       9     12   144    36     13   169    39     5    25     15
6       36    8    64     48     15   225    90     7    49     42
9       81    16   256    144    16   256    144    11   121    99
8       64    14   196    112    15   225    120    12   144    96
7       49    8    64     56     16   256    112    9    81     63
10      100   9    81     90     19   361    190    8    64     80
Total:
43      339   67   805    486    94   1492   695    52   484    395

y on x1: 43 = 6a + 67b; 486 = 67a + 805b; a = 6.02, b = 0.103
y on x2: 43 = 6a + 94b; 695 = 94a + 1492b; a = -10.121, b = 1.103
y on x3: 43 = 6a + 52b; 395 = 52a + 484b; a = 1.36, b = 0.67

Fitted regression lines:
$$\hat{y} = 6.02 + 0.103\,x_1 \qquad \hat{y} = -10.121 + 1.103\,x_2 \qquad \hat{y} = 1.36 + 0.67\,x_3$$
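For Coefficient of Determination:

Means:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{6}(43) = 7.167$$

$$\bar{x}_1 = \frac{1}{6}(67) = 11.167 \qquad \bar{x}_2 = \frac{1}{6}(94) = 15.667 \qquad \bar{x}_3 = \frac{1}{6}(52) = 8.667$$

Standard deviations:

$$\sigma_{x_1} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_{1i}^2 - \bar{x}_1^2} = \sqrt{\frac{1}{6}(805) - 11.167^2} = \sqrt{134.167 - 124.702} = \sqrt{9.465} = 3.076$$

$$\sigma_{x_2} = \sqrt{\frac{1}{6}(1492) - 15.667^2} = \sqrt{248.667 - 245.455} = \sqrt{3.212} = 1.792$$

$$\sigma_{x_3} = \sqrt{\frac{1}{6}(484) - 8.667^2} = \sqrt{80.667 - 75.117} = \sqrt{5.55} = 2.356$$

$$\sigma_{y} = \sqrt{\frac{1}{6}(339) - 7.167^2} = \sqrt{56.5 - 51.366} = \sqrt{5.134} = 2.266$$

Covariances:

$$\sigma_{x_1 y} = \frac{1}{n}\sum_{i=1}^{n} x_{1i} y_i - \bar{x}_1 \bar{y} = \frac{1}{6}(486) - (11.167)(7.167) = 81 - 80.034 = 0.966$$

$$\sigma_{x_2 y} = \frac{1}{6}(695) - (15.667)(7.167) = 115.833 - 112.285 = 3.548$$

$$\sigma_{x_3 y} = \frac{1}{6}(395) - (8.667)(7.167) = 65.833 - 62.116 = 3.717$$

Correlation coefficients:

$$r(x_1, y) = \frac{\sigma_{x_1 y}}{\sigma_{x_1}\sigma_y} = \frac{0.966}{(3.076)(2.266)} = \frac{0.966}{6.97} = 0.139$$

$$r(x_2, y) = \frac{\sigma_{x_2 y}}{\sigma_{x_2}\sigma_y} = \frac{3.548}{(1.792)(2.266)} = \frac{3.548}{4.061} = 0.874$$

$$r(x_3, y) = \frac{\sigma_{x_3 y}}{\sigma_{x_3}\sigma_y} = \frac{3.717}{(2.356)(2.266)} = \frac{3.717}{5.339} = 0.696$$

Coefficients of determination:

$$r^2(x_1, y) = 0.139^2 = 0.019 \qquad r^2(x_2, y) = 0.874^2 = 0.764 \qquad r^2(x_3, y) = 0.696^2 = 0.484$$

It can be concluded that 76.4% of the variation in y is explained by x2, compared with 48.4% for x3 and 1.9% for x1.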
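The comparison of the three predictors can be sketched in Python. The function name `r_squared` is an illustrative assumption; it squares the covariance and divides by the product of the two variances, which is algebraically the same as squaring r.

```python
def r_squared(xs, ys):
    """Coefficient of determination: cov(X, Y)^2 / (var(X) * var(Y)), using 1/n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    return cov_xy ** 2 / (var_x * var_y)


y = [3, 6, 9, 8, 7, 10]
predictors = {
    "x1": [12, 8, 16, 14, 8, 9],
    "x2": [13, 15, 16, 15, 16, 19],
    "x3": [5, 7, 11, 12, 9, 8],
}

for name, xs in predictors.items():
    print(name, round(r_squared(xs, y), 3))

# The predictor with the largest r^2 explains the most variation in y
best = max(predictors, key=lambda name: r_squared(predictors[name], y))
print("best single predictor:", best)
```

The values should agree with the hand calculation to about two decimals; the third decimal can differ slightly because the code does not round intermediate results.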
Solution for the practice Question: Fit Exponential Curve:
y       Y = log y   x1   x1^2   x1*Y    x2   x2^2   x2*Y    x3   x3^2   x3*Y
3       0.48        12   144    5.73    13   169    6.20    5    25     2.39
6       0.78        8    64     6.23    15   225    11.67   7    49     5.45
9       0.95        16   256    15.27   16   256    15.27   11   121    10.50
8       0.90        14   196    12.64   15   225    13.55   12   144    10.84
7       0.85        8    64     6.76    16   256    13.52   9    81     7.61
10      1.00        9    81     9.00    19   361    19.00   8    64     8.00
Total:
43      4.96        67   805    55.62   94   1492   79.21   52   484    44.77

y on x1: 4.96 = 6A + 67B; 55.62 = 67A + 805B; A = 0.781, a = 6.04; B = 0.004, b = 1.01
y on x2: 4.96 = 6A + 94B; 79.21 = 94A + 1492B; A = -0.392, a = 0.41; B = 0.078, b = 1.2
y on x3: 4.96 = 6A + 52B; 44.77 = 52A + 484B; A = 0.363, a = 2.31; B = 0.054, b = 1.13

The fitted exponential curves are:
$$\hat{y} = (6.04)(1.01)^{x_1} \qquad \hat{y} = (0.41)(1.2)^{x_2} \qquad \hat{y} = (2.31)(1.13)^{x_3}$$
Estimation of simple and exponential trends for Time Series.
Time Series:
A time series may be defined as a collection of observations belonging to different
time periods, of some economic variables or any other phenomenon. The period
can be seconds, minutes, hours, days, weeks, months, years, decades, etc.
Mathematically, a time series is defined as:
$$y_t = f(t)$$
where $y_t$ is the value of the phenomenon (or variable) under study at time t.
e.g.
1. Population ($y_t$) of a country in different years (t)
2. Number of coronavirus cases ($y_t$) on different days (t)
3. Rainfall ($y_t$) in different months (t)
4. Temperature ($y_t$) of a place in different years (t)
Components of Time Series:
1. Secular Trend or Long-term movement
2. Periodic Changes or Short-term fluctuations
a. Seasonal variations
b. Cyclic variations
3. Random or Irregular Movements.
Trend: The general tendency of the data to increase or decrease over a long period
of time, i.e. an upward or downward movement. Examples: population, number of deaths,
literacy rate, standard of living, etc.
Estimation of trend for Time Series:
Graphical method:
Year (t)   Profit (yt)
1996       12.6
1997       14.8
1998       18.6
1999       14.8
2000       16.6
2001       21.2
2002       18
2003       17.4
2004       15.8
[Figure: line chart of Profit (yt) against Year, 1996 to 2004; vertical axis from 0 to 25.]
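Method of Semi-average:

$$t_1 = \frac{1996 + 1997 + 1998 + 1999 + 2000}{5} = 1998 \qquad Y_{t_1} = \frac{12.6 + 14.8 + 18.6 + 14.8 + 16.6}{5} = 15.48$$

$$t_2 = \frac{2001 + 2002 + 2003 + 2004}{4} = 2002.5 \qquad Y_{t_2} = \frac{21.2 + 18 + 17.4 + 15.8}{4} = 18.1$$

Semi-average points: (1998, 15.48) and (2002.5, 18.1).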
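The semi-average calculation can be sketched in Python. The function name `semi_average_points` is an illustrative assumption, and for an odd number of observations it puts the extra observation in the first part, which is the split used in this example.

```python
def semi_average_points(times, values):
    """Split the series into two parts and return (mean time, mean value) for each.

    Assumption: with an odd number of observations, the first part
    receives the extra observation, as in the worked example.
    """
    n = len(times)
    split = (n + 1) // 2
    parts = [(times[:split], values[:split]), (times[split:], values[split:])]
    return [(sum(ts) / len(ts), sum(vs) / len(vs)) for ts, vs in parts]


years = [1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004]
profit = [12.6, 14.8, 18.6, 14.8, 16.6, 21.2, 18, 17.4, 15.8]

for t, y in semi_average_points(years, profit):
    print(t, round(y, 2))
```

Joining the two resulting points with a straight line gives the semi-average trend line.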
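Method of curve fitting:
1. Fitting of simple trend line (straight line):

$$y_t = a + bt$$

Normal equations:

$$1.\quad \sum_{i=1}^{n} y_{t_i} = na + b\sum_{i=1}^{n} t_i$$

$$2.\quad \sum_{i=1}^{n} t_i y_{t_i} = a\sum_{i=1}^{n} t_i + b\sum_{i=1}^{n} t_i^2$$

Fitted trend line:

$$\hat{y}_t = a + bt$$

2. Fitting of Exponential trend line (curve):

$$y_t = ab^t$$

Taking logarithms, with $Y_t = \log y_t$, $A = \log a$ and $B = \log b$:

$$Y_t = A + Bt$$

Normal equations:

$$1.\quad \sum_{i=1}^{n} Y_{t_i} = nA + B\sum_{i=1}^{n} t_i$$

$$2.\quad \sum_{i=1}^{n} t_i Y_{t_i} = A\sum_{i=1}^{n} t_i + B\sum_{i=1}^{n} t_i^2$$

Fitted trend curve:

$$\hat{y}_t = ab^t$$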
Fit simple and exponential trend lines for the time series data and estimate the
trends for the year 2000 and 2005.
Year (t)   Profit (yt)   x = t - 2000   x^2   x*yt    Y = log yt   x*Y
1996       12.6          -4             16    -50.4   1.10         -4.40
1997       14.8          -3             9     -44.4   1.17         -3.51
1998       18.6          -2             4     -37.2   1.27         -2.54
1999       14.8          -1             1     -14.8   1.17         -1.17
2000       16.6          0              0     0       1.22         0.00
2001       21.2          1              1     21.2    1.33         1.33
2002       18            2              4     36      1.26         2.51
2003       17.4          3              9     52.2    1.24         3.72
2004       15.8          4              16    63.2    1.20         4.79
Total      149.8         0              60    25.8    10.95        0.73
Simple trend line
149.8 = 9a + 0b, so a = 16.64
25.8 = 0a + 60b, so b = 0.43
With x = t - 2000:
$$\hat{y}_t = 16.64 + 0.43x$$
Trend for year 2000: $\hat{y}_{2000} = 16.64 + 0.43(2000 - 2000) = 16.64$
Trend for year 2005: $\hat{y}_{2005} = 16.64 + 0.43(2005 - 2000) = 16.64 + 0.43 \times 5 = 18.79$
Exponential trend line
10.95 = 9A + 0B
0.73 = 0A + 60B
A = 1.217, a = antilog(1.217) = 16.482
B = 0.012, b = antilog(0.012) = 1.028
With x = t - 2000:
$$\hat{y}_t = (16.482)(1.028)^x$$
Trend for year 2000: $\hat{y}_{2000} = (16.482)(1.028)^{(2000 - 2000)} = 16.482$
Trend for year 2005: $\hat{y}_{2005} = (16.482)(1.028)^{(2005 - 2000)} = (16.482)(1.028)^5 = 18.922$
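Both trend fits can be sketched in Python with the same shift of origin, x = t - 2000. The helper names (`fit_line`, `linear_trend`, `exponential_trend`) and the constant `ORIGIN` are illustrative assumptions; the exponential trend is fitted on log10 of the profit values.

```python
from math import log10

ORIGIN = 2000  # origin of the shifted time variable x = t - ORIGIN


def fit_line(xs, ys):
    """Least-squares line y = a + b*x from the two normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


years = [1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004]
profit = [12.6, 14.8, 18.6, 14.8, 16.6, 21.2, 18, 17.4, 15.8]
xs = [t - ORIGIN for t in years]

# Simple (straight-line) trend on the original values
a_lin, b_lin = fit_line(xs, profit)

# Exponential trend: straight line on log10(y), then antilog
A, B = fit_line(xs, [log10(y) for y in profit])
a_exp, b_exp = 10 ** A, 10 ** B


def linear_trend(t):
    return a_lin + b_lin * (t - ORIGIN)


def exponential_trend(t):
    return a_exp * b_exp ** (t - ORIGIN)


for t in (2000, 2005):
    print(t, round(linear_trend(t), 2), round(exponential_trend(t), 2))
```

The straight-line results should match the hand calculation. The exponential estimates can differ from it in the first or second decimal, because the table rounds the logarithms to two decimals before solving.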
Question for practice:
Fit simple and exponential trend lines for the time series data and estimate the
trends for the year 2000 and 2005.
t y
1990 17
1991 20
1992 19
1993 26
1994 24
1995 40
1996 35
1997 55
1998 51
1999 74
2000 79