Introducing Multiple Regression
Simple Regression
Cause (independent variable) → Effect (dependent variable)
Simple Regression
Oil Prices → Government Bond Yields
One cause, one effect
Multiple Regression
Causes (independent variables) → Effect (dependent variable)
Multiple Regression
Oil Prices, S&P 500 Share Index → Government Bond Yields
Many causes, one effect
Simple Regression
[Scatter plot: points (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) in the X-Y plane, with the regression line y = A + Bx drawn through them]
Represent all n points as (xi, yi), where i = 1 to n
Multiple Regression
[3-D scatter plot: points (x1, y1, z1), (x2, y2, z2), (x3, y3, z3), …, (xn, yn, zn) with the regression plane y = A + Bx + Cz]
Represent all n points as (xi, yi, zi), where i = 1 to n
Multiple Regression
Causes (Dow Jones index, price of oil) → Effect (Exxon stock)
Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C OILt

In matrix form, stacking all n days:
[E1, E2, E3, …, En]ᵀ = A · [1, 1, 1, …, 1]ᵀ + B · [D1, D2, D3, …, Dn]ᵀ + C · [O1, O2, O3, …, On]ᵀ + [e1, e2, e3, …, en]ᵀ

where Ei = % return on Exxon stock on day i, Di = % return of the Dow Jones index on day i, and Oi = % change in the price of oil on day i.
Multiple Regression
Regression Equation:
y = A + Bx + Cz

In matrix form:
[y1, y2, y3, …, yn]ᵀ = A · [1, 1, 1, …, 1]ᵀ + B · [x1, x2, x3, …, xn]ᵀ + C · [z1, z2, z3, …, zn]ᵀ + [e1, e2, e3, …, en]ᵀ
Multiple Regression
Regression Equation:
y = A + Bx + Cz

Collecting the coefficients into a single vector:
[y1, y2, y3, …, yn]ᵀ = X · [A, B, C]ᵀ + [e1, e2, e3, …, en]ᵀ
where X is the matrix whose i-th row is [1, xi, zi].

Dimensions: y is n rows × 1 column; X is n rows × 3 columns; the coefficient vector is 3 rows × 1 column; e is n rows × 1 column.
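As a quick sketch of this set-up (illustrative data and names, not from the course), the design matrix can be assembled with NumPy and its shape checked against the dimensions above:

import numpy as np

n = 100
rng = np.random.default_rng(0)
x = rng.normal(size=n)                 # first explanatory variable
z = rng.normal(size=n)                 # second explanatory variable
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(scale=0.1, size=n)

# Design matrix X: a column of ones for the intercept, then one column per x variable
X = np.column_stack([np.ones(n), x, z])
print(y.shape)   # (100,)    -> n rows, 1 column
print(X.shape)   # (100, 3)  -> n rows, 3 columns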
Multiple Regression
2 Causes (Dow Jones index, price of oil) → 1 Effect (Exxon stock)
Multiple Regression
k Causes (Dow Jones index, price of oil, bond yields…) → 1 Effect (Exxon stock)
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

In matrix form:
[y1, y2, y3, …, yn]ᵀ = C1 · [1, 1, 1, …, 1]ᵀ + C2 · [x11, x21, x31, …, xn1]ᵀ + … + Ck+1 · [x1k, x2k, x3k, …, xnk]ᵀ + [e1, e2, e3, …, en]ᵀ

where xij is the value of the j-th explanatory variable at the i-th data point.
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

Collecting the coefficients into a single vector:
[y1, y2, y3, …, yn]ᵀ = X · [C1, C2, …, Ck+1]ᵀ + [e1, e2, e3, …, en]ᵀ
where X is the matrix whose i-th row is [1, xi1, …, xik].

Dimensions: y is n rows × 1 column; X is n rows × (k+1) columns; the coefficient vector is k+1 rows × 1 column; e is n rows × 1 column.
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk
Multiple regression involves finding
k+1 coefficients, k for the explanatory
variables, and 1 for the intercept
Estimation Methods in Multiple Regression
Method of moments, maximum likelihood estimation, method of least squares
The method of least squares works for
multiple regression too
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk
The “best fit” line is the one where
the sum of the squares of the
lengths of the errors is minimised
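To make this concrete, here is a minimal sketch (made-up data, not from the course) of fitting a multiple regression by least squares with NumPy, continuing the y = A + Bx + Cz example:

import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(scale=0.5, size=n)

# Least squares: find the coefficient vector minimising the sum of squared errors
X = np.column_stack([np.ones(n), x, z])
coeffs, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
A, B, C = coeffs
print(A, B, C)   # close to the true values 1, 2, 3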
Risks in Multiple Regression
Simple and Multiple Regression
Simple Regression: data in 2 dimensions.
Multiple Regression: data in more than 2 dimensions.
Simple and Multiple Regression
Simple Regression: risks exist, but can usually be mitigated by analysing R2 and residuals.
Multiple Regression: risks are more complicated, and require interpreting regression statistics.
Risks in Simple Regression
No cause-effect relationship: regression on completely unrelated data series.
Mis-specified relationship: a non-linear (exponential or polynomial) fit.
Incomplete relationship: multiple causes exist, we have captured just one.
Diagnosing Risks in Simple Regression
No cause-effect relationship: low R2; plot of X ~ Y has no pattern.
Mis-specified relationship: low R2; residuals are not independent of x.
Incomplete relationship: high R2; residuals are not independent of each other.
Mitigating Risks in Simple Regression
No cause-effect relationship: wrong choice of X and Y - back to the drawing board.
Mis-specified relationship: transform X and Y - convert to logs or returns.
Incomplete relationship: add X variables (move to multiple regression).
The big new risk with multiple
regression is multicollinearity: X
variables containing the same
information
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ckxk-1

In matrix form:
[y1, y2, y3, …, yn]ᵀ = X · [C1, C2, …, Ck]ᵀ + [e1, e2, e3, …, en]ᵀ
where X is the matrix whose i-th row is [1, xi1, …, xi,k-1].

Dimensions: y is n × 1; X is n × k; the coefficient vector is k × 1.
Bad News: Multicollinearity Detected
[Scatter plot of X1 against Xk: high R2 - highly correlated explanatory variables]
Good News: No Multicollinearity Detected
[Scatter plot of X1 against Xk: low R2 - uncorrelated explanatory variables]
Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C OILt
(in matrix form as before, where Ei = % return on Exxon stock on day i, Di = % return of the Dow Jones index on day i, and Oi = % change in the price of oil on day i)
Good News: No Multicollinearity Detected
[Scatter plot of DOW returns against OIL: low R2 - uncorrelated explanatory variables]
Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C NASDAQt
(in matrix form as before, where Ei = % return on Exxon stock on day i, Di = % return of the Dow Jones index on day i, and Ni = % return of the NASDAQ index on day i)
Bad News: Multicollinearity Detected
[Scatter plot of DOW returns against NASDAQ returns: high R2 - highly correlated explanatory variables]
Multicollinearity Kills Regression’s Usefulness
Explaining Variance: the R2 as well as the regression coefficients are not very reliable.
Making Predictions: the regression model will perform poorly with out-of-sample data.
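One common diagnostic (my illustration, not from the course) is the variance inflation factor: regress each x variable on all the others and see how much of it they explain. With statsmodels, a sketch might look like this:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 500
dow = rng.normal(size=n)
nasdaq = 0.9 * dow + rng.normal(scale=0.2, size=n)   # deliberately collinear with dow
oil = rng.normal(size=n)

X = np.column_stack([np.ones(n), dow, nasdaq, oil])
for i, name in enumerate(["const", "dow", "nasdaq", "oil"]):
    print(name, variance_inflation_factor(X, i))
# dow and nasdaq get large VIFs; a common rule of thumb flags values above 5-10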
Multicollinearity: Prevention and Cure
Common Sense: big-picture understanding of the data.
Nuts and Bolts: setting up the data right.
Heavy Lifting: factor analysis, principal components analysis (PCA).

Common Sense
Think deeply about each x variable. Eliminate closely related ones. Dig down to underlying causes.
Multiple Regression
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt
where EXXONt = % return on Exxon stock on day t, DOWt = % return of the Dow Jones index on day t, NASDAQt = % return of the NASDAQ index on day t, and OILt = % change in the price of oil on day t.
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt
Dow Jones Industrial Average: 30 large-cap US stocks
NASDAQ 100 Index: 100 large tech stocks
Oil Prices: price of a barrel of oil
Common Sense
Do we really need both Dow and NASDAQ returns as explanatory variables?
If yes - consider keeping one, and constructing a new explanatory variable from their difference.
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C OILt
Dow Jones Industrial Average: 30 large-cap US stocks
Oil Prices: price of a barrel of oil
What underlying factors drive both US large-cap
stocks and the price of oil?
Common Sense
GDP growth, interest rates, US dollar strength, seasonality
What underlying factors drive both US large-cap
stocks and the price of oil?
Multiple Regression
Original Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt
Revised Regression Equation:
EXXONt = A + B DOWt + C INTERESTt + D GDPt
Nuts and Bolts
‘Standardise’ the variables. Rely on adjusted-R2, not plain R2. Set up dummy variables right. Distribute lags.
Heavy Lifting
Find underlying factors that drive the correlated x variables. Principal Component Analysis (PCA) is a great tool.
Multiple Regression
Proposed Regression Equation:
HOMEt = A + B 5-YEARt + C 10-YEARt + D 2-YEARt + E 1-YEARt + F 3-MONTHt + G 1-DAYt + …

where HOMEt = % change in home prices in month t, 5-YEARt = yield on the 5-year bond in month t, 10-YEARt = yield on the 10-year bond in month t, and so on.
Bad News: Multicollinearity Detected
[Time-series chart: the yields Xi move together over time - highly correlated explanatory variables]
Factor Analysis
Interest rates on a wide variety of fixed-income instruments: 1-day (overnight) money market, 3-month government bonds, 1-year government bonds, 5-year government bonds, 30-year government bonds, 5-year swap rate (inter-bank).
Factor Analysis on Interest Rates
Level: how high are interest rates?
Slope: how steep is the yield curve?
Twist: how convex is the yield curve?

Three uncorrelated factors explain most variation in all interest rates.
Factor Analysis
The factors identified are guaranteed to be uncorrelated. However, they may not have an intuitive interpretation. Principal Component Analysis is one procedure for factor analysis.
Dimensionality Reduction via Factor Analysis
20-dimensional data → 3-dimensional data

Factor Analysis
1 column for each interest rate out there → 3 columns, 1 for each factor
Dimensionality Reduction via Factor Analysis
[x11 … x1,k-1]        [f11 f12 f13]
[x21 … x2,k-1]   →    [f21 f22 f23]
[ …       …  ]        [ …   …   … ]
[xn1 … xn,k-1]        [fn1 fn2 fn3]

The n × (k-1) matrix of individual interest rates is reduced to an n × 3 matrix of factor values.
Factor Analysis is a dimensionality-
reduction technique to identify a
few underlying causes in data
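As an illustrative sketch (synthetic data, not the course's), PCA with scikit-learn can compress many correlated rate series into a few factors:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n, k = 500, 20
# Synthetic yields: 20 maturities, all driven by 3 hidden factors plus noise
maturities = np.linspace(0.25, 30, k) / 30
factors = rng.normal(size=(n, 3))                        # hidden level, slope, twist
loadings = np.column_stack([np.ones(k), maturities, maturities ** 2])
rates = factors @ loadings.T + rng.normal(scale=0.05, size=(n, k))

pca = PCA(n_components=3)
reduced = pca.fit_transform(rates)      # n x 3 matrix of factor values
print(reduced.shape)                    # (500, 3): 20 columns reduced to 3
print(pca.explained_variance_ratio_)    # first 3 components capture almost all variance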
Multiple Regression
Proposed Regression Equation:
HOMEt = A + B 5-YEARt + C 10-YEARt + D 2-YEARt + E 1-YEARt + F 3-MONTHt + G 1-DAYt + …

Applying Principal Component Analysis:
Revised Regression Equation:
HOMEt = A + B LEVELt + C SLOPEt + D TWISTt
Benefits of Multiple Regression
Simple Regression Is a Great Tool
Powerful: perfectly suited to two common use-cases.
Versatile: easily extended to non-linear relationships.
Deep: the first "crossover hit" from Machine Learning.
Multiple Regression Is Even Better
Powerful: also controls for the effects of different causes.
Versatile: also works with categorical data.
Deep: especially if combined with factor analysis.
Two Common Applications of Regression
Explaining Variance: how much variation in one data series is caused by another?
Making Predictions: how much does a move in one series impact another?
Controlling For Different Causes
Proposed Regression Equation:
EXXONt = A + B DOWt + C OILt
where EXXONt = % return on Exxon stock on day t, DOWt = % return of the Dow Jones index on day t, and OILt = % change in the price of oil on day t.
All else being equal, how much will Exxon stock
move by if oil prices increase by 1%?
Controlling For Different Causes
EXXONt = A + B DOWt + C OILt
EXXON? = A + B DOWt + C (OILt + 1%)

Subtracting the first equation from the second:
Change in EXXON = EXXON? - EXXONt = C x 1%

All else being equal, if oil prices increase by 1%, Exxon stock moves by C x 1% - the coefficient C is the answer.
Regression coefficients tell us how much y changes for a unit change in each predictor, with all other predictors held constant.
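A quick sketch of this "all else equal" reading (hypothetical coefficient values; predict the same point twice, nudging only one input):

# Hypothetical fitted coefficients for EXXON = A + B*DOW + C*OIL
A, B, C = 0.01, 0.8, 0.3

def predict(dow, oil):
    """Value of the fitted regression line."""
    return A + B * dow + C * oil

base = predict(dow=0.5, oil=2.0)
bumped = predict(dow=0.5, oil=3.0)   # oil up by one unit, DOW held constant
print(bumped - base)                 # 0.3 (= C): the partial effect of oil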
Interpreting the Results of a
Regression Analysis
Interpreting Results of a Simple Regression
R2: measures overall quality of fit - the higher the better (up to a point).
Residuals: check whether regression assumptions are violated.

Standard errors of individual coefficients are usually of little significance.
Interpreting Results of a Multiple Regression
R2, Adjusted R2, Residuals, F-statistic, and standard errors of coefficients
e = y - y'
=> y = y' + e
=> Variance(y) = Variance(y' + e)
=> Variance(y) = Variance(y') + Variance(e) + 2 Covariance(y', e)

A Not-Very-Important Intermediate Step
Variance of the dependent variable decomposes into the variance of the regression fitted values, that of the residuals, and a covariance term.

A Leap of Faith
Covariance(y', e) is always = 0. This is important - more on why in a bit.

Variance(y) = Variance(y') + Variance(e)
Variance Explained
Variance of the dependent variable can be decomposed into the variance of the regression fitted values and that of the residuals:
Variance(y) = Variance(y') + Variance(e)

Total Variance: TSS = Variance(y) - a measure of how volatile the dependent variable is, and of how much it moves around.
Explained Variance: ESS = Variance(y') - a measure of how volatile the fitted values are; these come from the regression line.
Residual Variance: RSS = Variance(e) - the variance in the dependent variable that cannot be explained by the regression.

TSS = ESS + RSS
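A small numerical check of this decomposition (illustrative data; note the slides' convention of defining these quantities via variances):

import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coeffs
resid = y - fitted

tss = y.var()       # Variance(y)
ess = fitted.var()  # Variance(y')
rss = resid.var()   # Variance(e)
print(np.isclose(tss, ess + rss))  # True: TSS = ESS + RSS
print(ess / tss)                   # this ratio is the R2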
R2 = ESS / TSS

R2 is the percentage of total variance explained by the regression. Usually, the higher the R2, the better the quality of the regression (the upper bound is 100%).
In multiple regression, adding explanatory variables always increases R2, even if those variables are irrelevant and increase the danger of multicollinearity.

Adjusted-R2 = R2 x (penalty for adding irrelevant variables)

Adjusted-R2 increases if irrelevant* variables are deleted.
(*irrelevant variables = any group whose F-ratio < 1)
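A sketch of this effect (made-up data): add a pure-noise regressor and watch R2 creep up while the adjusted R2 penalises it. statsmodels reports both:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=n)  # irrelevant variable

m1 = sm.OLS(y, sm.add_constant(x)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x, noise]))).fit()

print(m1.rsquared, m2.rsquared)          # R2 never goes down when a variable is added
print(m1.rsquared_adj, m2.rsquared_adj)  # adjusted R2 penalises the junk variable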
Extending Multiple Regression to
Categorical Variables
A Simple Regression
Proposed Regression Equation:
y = A + Bx
where y = height of an individual and x = average height of the parents
A Simple Regression
[Scatter plot of y against x: male and female points form two separate bands, with a single regression line y = A + Bx running between them]
Not a great fit - the regression line is far from all the points!
A Simple Regression
[Same scatter plot, with the line y = A1 + Bx drawn through the male points]
We can easily plot a great fit for males…
A Simple Regression
[Same scatter plot, with the line y = A2 + Bx drawn through the female points]
…and another great fit for females
A Simple Regression
[Scatter plot with both lines drawn: y = A1 + Bx for males, y = A2 + Bx for females]
Two lines - same slope, different intercepts
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:
y = A1 + (A2 - A1)D + Bx
where D = 0 for males and 1 for females
Adding A Dummy Variable
Combined Regression Line:
y = A1 + (A2 - A1)D + Bx

For males, D = 0:
y = A1 + (A2 - A1)(0) + Bx = A1 + Bx
Adding A Dummy Variable
Combined Regression Line:
y = A1 + (A2 - A1)D + Bx

For females, D = 1:
y = A1 + (A2 - A1)(1) + Bx = A2 + Bx
Adding A Dummy Variable
Original Regression Equation:
y = A + Bx
where y = height of an individual and x = average height of the parents

Combined Regression Line:
y = A1 + (A2 - A1)D + Bx
where D = 0 for males and 1 for females
Adding A Dummy Variable
The data contained 2 groups, so we added 1 dummy variable.
Given data with k groups, set up k-1
dummy variables, else
multicollinearity occurs
Dummy and Other Categorical Variables
Dummy Variables: binary - 0 or 1.
Categorical Variables: a finite set of values - e.g. days of the week, months of the year…

To include non-binary categorical variables, simply add more dummies.
Testing for Seasonality
Proposed Regression Equation:
y = A + BQ1 + CQ2 + DQ3
where y = average stock returns and Q1, Q2, Q3 indicate the quarter of the year.

The data contains 4 groups, so we added 3 dummy variables:
Q1 = 1 for Jan, Feb, Mar; 0 for other quarters
Q2 = 1 for Apr, May, Jun; 0 for other quarters
Q3 = 1 for Jul, Aug, Sep; 0 for other quarters
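A sketch of setting this up in Python (invented data; pandas builds the k-1 dummies automatically):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "returns": rng.normal(size=120),
    "quarter": np.tile(["Q1", "Q2", "Q3", "Q4"], 30),
})

# 4 groups -> 3 dummy columns; drop_first makes Q1 the baseline, absorbed by the
# intercept (the slide drops Q4 instead, but the idea is identical)
dummies = pd.get_dummies(df["quarter"], drop_first=True, dtype=float)
X = sm.add_constant(dummies)
model = sm.OLS(df["returns"], X).fit()
print(model.params)   # intercept plus one coefficient per non-baseline quarter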
Different Groups, Different Slopes
[Scatter plot: male points fit y = A1 + B1x, female points fit y = A2 + B2x - different intercepts and different slopes]
Dummy variables can also be extended for use where groups have different slopes.
Adding A Dummy Variable
Regression Line For Males: y = A1 + B1x
Regression Line For Females: y = A2 + B2x

Combined Regression Line:
y = A1 + (A2 - A1)D1 + B1x + (B2 - B1)D2
where D1 = 0 for males, 1 for females; D2 = 0 for males, x for females
Adding A Dummy Variable
For males, D1 = 0 and D2 = 0:
y = A1 + (A2 - A1)(0) + B1x + (B2 - B1)(0) = A1 + B1x
Adding A Dummy Variable
For females, D1 = 1 and D2 = x:
y = A1 + (A2 - A1)(1) + B1x + (B2 - B1)x = A2 + B2x
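In practice D2 is just an interaction term (the group dummy times x). A sketch with statsmodels formulas (hypothetical column names), where group-specific intercepts and slopes come out directly:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 200
sex = rng.choice(["male", "female"], size=n)
x = rng.normal(loc=170, scale=8, size=n)           # average parental height
d = (sex == "female").astype(float)
y = 30 + 5 * d + 0.8 * x - 0.05 * d * x + rng.normal(size=n)

df = pd.DataFrame({"y": y, "x": x, "sex": sex})
# 'C(sex) * x' expands to: intercept, sex dummy, x, and the sex-by-x interaction,
# i.e. different intercepts AND different slopes for the two groups
model = smf.ols("y ~ C(sex) * x", data=df).fit()
print(model.params)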
Dummy Variables
Dummies as X variables: linear regression. A dummy as the Y variable: logistic regression.
Normal Distribution
N(μ, σ): the average (mean) is μ and the standard deviation is σ.
[Bell curve centred at μ]
Standard Errors
[Two bell curves: the sampling distribution of α centred at E(α) = A with spread SE(α), and the sampling distribution of β centred at E(β) = B with spread SE(β)]
α and β are the population parameters; A and B are the sample estimates.
Standard error of a regression parameter is the
standard deviation of the sampling distribution
Strong Cause-effect Relationship
[Scatter plot tightly clustered around the regression line: high R2]
Residuals are small, standard errors are small.
Weak Cause-effect Relationship
[Scatter plot loosely scattered around the regression line: low R2]
Residuals are large, standard errors are large.
Standard Errors and Residuals
Low Standard Error: high confidence that the parameter coefficient is well estimated.
High Standard Error: low confidence that the parameter coefficient is well estimated.
The smaller the residuals, the smaller the standard
errors and the better the quality of the regression
Sample Regression Line
Regression Equation:
y = A + Bx

y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
…
yn = A + Bxn + en

The terms e1, …, en are the residuals.
Residual Variance (RSS)
RSS = Variance(e) - easily calculated from the regression residuals.

Estimate Standard Errors from RSS
SE(α) and SE(β) can be found from RSS; the exact formulae are not important - they are reported by Excel, R…
The smaller the residuals, the smaller
the standard errors and the better
the quality of the regression
Null Hypothesis
What if the population parameter α were actually zero? Call this the null hypothesis H0.
Null Hypothesis: α = 0
[Bell curve of the sampling distribution centred at E(α) = 0, with the estimate α = A lying t x SE(α) away from the centre]
If this were actually true, how likely is it that our sample regression would yield the estimate α = A?
Why Zero?
[Two scatter plots: the sample regression line y = A + Bx and the population regression line y = α + βx]
If α = 0, it is adding no value in the regression line
and should just be excluded
Null Hypothesis: α = 0
[Same bell curve: the estimate A lies t x SE(α) from E(α) = 0]
The farther from the mean, the more unlikely it is that α = 0.
t-Statistics
t-stat(α) = A / SE(α) = 0.85
t-stat(β) = B / SE(β) = 9.01
[Bell curves centred at E(α) = 0 and E(β) = 0, with A lying 0.85 standard errors away and B lying 9.01 standard errors away]
We are now testing a hypothesis: that the population parameter is actually zero.
Is an individual estimate of A or B ‘adding value’ at all? High t-statistic => Yes
The higher the t-statistic of a
coefficient, the higher our confidence
in our estimate of that coefficient
p-Values
[Same bell curves centred at zero: A lies 0.85 x SE(α) from 0, B lies 9.01 x SE(β) from 0]
p-value(α) = 0.39: low t-stat, high p-value
p-value(β) = 2 x 10^-15 ~ 0: high t-stat, low p-value
Is an individual estimate of α or β ‘adding value’ at all?
low p-value => Yes
The lower the p-value of a coefficient,
the higher our confidence in our
estimate of that coefficient
SER = RSS / (n - 2)

Standard Error of Regression (SER)
n is the number of points in the regression. SER provides an unbiased estimator of the error variance σ².
RSS / σ² ~ χ² (the χ² distribution)
Never mind the fine print about degrees of freedom for now.
Null Hypothesis
What if all population parameters were zero, i.e. β = α = 0? Call this the null hypothesis H0.
Null Hypothesis: β = α = 0
[Distribution under the null, centred at β = 0, α = 0, with the estimates β = B, α = A some distance away; that distance is measured by the F-statistic]
If this were actually true, how likely is it that our sample regression would yield the estimates β = B, α = A?
Why Zero?
[Two scatter plots: the sample regression line y = A + Bx and the population regression line y = α + βx]
If α = β = 0, our regression line is not adding any
value at all
Null Hypothesis: β = α = 0
[Same distribution: the farther the estimates lie from the peak, the more unlikely it is that α = β = 0]
F-Statistic
[Distribution under the null (β = 0, α = 0) with the estimates (β = B, α = A) marked at a distance given by the F-statistic]
Does our regression as a whole ‘add value’ at all?
High F-statistic => Yes
p-values and t-statistics tell us whether individual parameter coefficients are ‘good’. The F-statistic tells us whether an entire regression line is ‘good’.
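All of these statistics come out of one fitted model. A sketch with statsmodels (made-up data), pulling out exactly the quantities discussed above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 250
x = rng.normal(size=n)
y = 0.05 + 1.2 * x + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.bse)       # standard errors of the coefficients
print(model.tvalues)   # t-statistics for the intercept and slope
print(model.pvalues)   # p-values for each coefficient
print(model.fvalue)    # F-statistic for the regression as a whole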
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:
y = A1D1 + A2D2 + Bx
where D1 = 1 for males, 0 for females; D2 = 1 for females, 0 for males
Adding A Dummy Variable
Combined Regression Line:
y = A1D1 + A2D2 + Bx

For males, D1 = 1 and D2 = 0:
y = A1(1) + A2(0) + Bx = A1 + Bx
Adding A Dummy Variable
Combined Regression Line:
y = A1D1 + A2D2 + Bx

For females, D1 = 0 and D2 = 1:
y = A1(0) + A2(1) + Bx = A2 + Bx
Adding A Dummy Variable
Original Regression Equation:
y = A + Bx
where y = height of an individual and x = average height of the parents

Combined Regression Line:
y = A1D1 + A2D2 + Bx
where D1 = 1 for males, 0 for females; D2 = 1 for females, 0 for males
Given data with k groups, set up k-1
dummy variables and an intercept, or
k dummy variables with no intercept
Regression Without Intercept
Regression R2 can go negative; Excel, Python and R all adjust the R2 formula in this case. The Python statsmodels R2 sometimes differs, while Excel and R usually agree.