Introducing Multiple Regression
Simple Regression
Cause (independent variable) → Effect (dependent variable)
Simple Regression
Oil Prices → Government Bond Yields
One cause, one effect
Multiple Regression
Causes (independent variables) → Effect (dependent variable)
Multiple Regression
Oil Prices, S&P 500 Share Index → Government Bond Yields
Many causes, one effect
Simple Regression
[Scatter plot: points (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) in the X-Y plane, with the regression line y = A + Bx drawn through them]
Represent all n points as (xi, yi), where i = 1 to n
Multiple Regression
[3-D scatter plot: points (x1, y1, z1), (x2, y2, z2), (x3, y3, z3), …, (xn, yn, zn) with the regression plane y = A + Bx + Cz]
Represent all n points as (xi, yi, zi), where i = 1 to n
Multiple Regression
Causes (Dow Jones index, price of oil) → Effect (Exxon stock)
Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C OILt

In matrix form, stacking all n days:
[E1, E2, E3, …, En]ᵀ = A · [1, 1, 1, …, 1]ᵀ + B · [D1, D2, D3, …, Dn]ᵀ + C · [O1, O2, O3, …, On]ᵀ + [e1, e2, e3, …, en]ᵀ

where Ei = % return on Exxon stock on day i, Di = % return of the Dow Jones index on day i, and Oi = % change in the price of oil on day i.
Multiple Regression
Regression Equation:
y = A + Bx + Cz

In matrix form:
[y1, y2, y3, …, yn]ᵀ = A · [1, 1, 1, …, 1]ᵀ + B · [x1, x2, x3, …, xn]ᵀ + C · [z1, z2, z3, …, zn]ᵀ + [e1, e2, e3, …, en]ᵀ
Multiple Regression
Regression Equation:
y = A + Bx + Cz

Collecting the coefficients into a single vector:
[y1, y2, y3, …, yn]ᵀ = X · [A, B, C]ᵀ + [e1, e2, e3, …, en]ᵀ
where X is the matrix whose i-th row is [1, xi, zi].

Dimensions: y is n rows × 1 column; X is n rows × 3 columns; the coefficient vector is 3 rows × 1 column; e is n rows × 1 column.
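As a quick sketch of this set-up (illustrative data and names, not from the course), the design matrix can be assembled with NumPy and its shape checked against the dimensions above:

import numpy as np

n = 100
rng = np.random.default_rng(0)
x = rng.normal(size=n)                 # first explanatory variable
z = rng.normal(size=n)                 # second explanatory variable
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(scale=0.1, size=n)

# Design matrix X: a column of ones for the intercept, then one column per x variable
X = np.column_stack([np.ones(n), x, z])
print(y.shape)   # (100,)    -> n rows, 1 column
print(X.shape)   # (100, 3)  -> n rows, 3 columns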
Multiple Regression
2 Causes (Dow Jones index, price of oil) → 1 Effect (Exxon stock)
Multiple Regression
k Causes (Dow Jones index, price of oil, bond yields…) → 1 Effect (Exxon stock)
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

In matrix form:
[y1, y2, y3, …, yn]ᵀ = C1 · [1, 1, 1, …, 1]ᵀ + C2 · [x11, x21, x31, …, xn1]ᵀ + … + Ck+1 · [x1k, x2k, x3k, …, xnk]ᵀ + [e1, e2, e3, …, en]ᵀ

where xij is the value of the j-th explanatory variable at the i-th data point.
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

Collecting the coefficients into a single vector:
[y1, y2, y3, …, yn]ᵀ = X · [C1, C2, …, Ck+1]ᵀ + [e1, e2, e3, …, en]ᵀ
where X is the matrix whose i-th row is [1, xi1, …, xik].

Dimensions: y is n rows × 1 column; X is n rows × (k+1) columns; the coefficient vector is k+1 rows × 1 column; e is n rows × 1 column.
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk
Multiple regression involves finding
k+1 coefficients, k for the explanatory
variables, and 1 for the intercept
Estimation Methods in Multiple Regression
Method of moments, maximum likelihood estimation, method of least squares
The method of least squares works for
multiple regression too
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk
The “best fit” line is the one where
the sum of the squares of the
lengths of the errors is minimised
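To make this concrete, here is a minimal sketch (made-up data, not from the course) of fitting a multiple regression by least squares with NumPy, continuing the y = A + Bx + Cz example:

import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(scale=0.5, size=n)

# Least squares: find the coefficient vector minimising the sum of squared errors
X = np.column_stack([np.ones(n), x, z])
coeffs, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
A, B, C = coeffs
print(A, B, C)   # close to the true values 1, 2, 3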
Risks in Multiple Regression
Simple and Multiple Regression
Simple Regression: data in 2 dimensions.
Multiple Regression: data in more than 2 dimensions.
Simple and Multiple Regression
Simple Regression: risks exist, but can usually be mitigated by analysing R2 and residuals.
Multiple Regression: risks are more complicated, and require interpreting regression statistics.
Risks in Simple Regression
No cause-effect relationship: regression on completely unrelated data series.
Mis-specified relationship: a non-linear (exponential or polynomial) fit.
Incomplete relationship: multiple causes exist, we have captured just one.
Diagnosing Risks in Simple Regression
No cause-effect relationship: low R2; plot of X ~ Y has no pattern.
Mis-specified relationship: low R2; residuals are not independent of x.
Incomplete relationship: high R2; residuals are not independent of each other.
Mitigating Risks in Simple Regression
No cause-effect relationship: wrong choice of X and Y - back to the drawing board.
Mis-specified relationship: transform X and Y - convert to logs or returns.
Incomplete relationship: add X variables (move to multiple regression).
The big new risk with multiple
regression is multicollinearity: X
variables containing the same
information
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ckxk-1

In matrix form:
[y1, y2, y3, …, yn]ᵀ = X · [C1, C2, …, Ck]ᵀ + [e1, e2, e3, …, en]ᵀ
where X is the matrix whose i-th row is [1, xi1, …, xi,k-1].

Dimensions: y is n × 1; X is n × k; the coefficient vector is k × 1.
Bad News: Multicollinearity Detected
[Scatter plot of X1 against Xk: high R2 - highly correlated explanatory variables]
Good News: No Multicollinearity Detected
[Scatter plot of X1 against Xk: low R2 - uncorrelated explanatory variables]
Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C OILt
(in matrix form as before, where Ei = % return on Exxon stock on day i, Di = % return of the Dow Jones index on day i, and Oi = % change in the price of oil on day i)
Good News: No Multicollinearity Detected
[Scatter plot of DOW returns against OIL: low R2 - uncorrelated explanatory variables]
Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C NASDAQt
(in matrix form as before, where Ei = % return on Exxon stock on day i, Di = % return of the Dow Jones index on day i, and Ni = % return of the NASDAQ index on day i)
Bad News: Multicollinearity Detected
[Scatter plot of DOW returns against NASDAQ returns: high R2 - highly correlated explanatory variables]
Multicollinearity Kills Regression’s Usefulness
Explaining Variance: the R2 as well as the regression coefficients are not very reliable.
Making Predictions: the regression model will perform poorly with out-of-sample data.
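One common diagnostic (my illustration, not from the course) is the variance inflation factor: regress each x variable on all the others and see how much of it they explain. With statsmodels, a sketch might look like this:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 500
dow = rng.normal(size=n)
nasdaq = 0.9 * dow + rng.normal(scale=0.2, size=n)   # deliberately collinear with dow
oil = rng.normal(size=n)

X = np.column_stack([np.ones(n), dow, nasdaq, oil])
for i, name in enumerate(["const", "dow", "nasdaq", "oil"]):
    print(name, variance_inflation_factor(X, i))
# dow and nasdaq get large VIFs; a common rule of thumb flags values above 5-10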
Multicollinearity: Prevention and Cure
Common Sense: big-picture understanding of the data.
Nuts and Bolts: setting up the data right.
Heavy Lifting: factor analysis, principal components analysis (PCA).

Common Sense
Think deeply about each x variable. Eliminate closely related ones. Dig down to underlying causes.
Multiple Regression
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt
where EXXONt = % return on Exxon stock on day t, DOWt = % return of the Dow Jones index on day t, NASDAQt = % return of the NASDAQ index on day t, and OILt = % change in the price of oil on day t.
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt
Dow Jones Industrial Average: 30 large-cap US stocks
NASDAQ 100 Index: 100 large tech stocks
Oil Prices: price of a barrel of oil
Common Sense
Do we really need both Dow and NASDAQ returns as explanatory variables?
If yes - consider keeping one, and constructing a new explanatory variable from their difference.
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C OILt
Dow Jones Industrial Average: 30 large-cap US stocks
Oil Prices: price of a barrel of oil
What underlying factors drive both US large-cap
stocks and the price of oil?
Common Sense
GDP growth, interest rates, US dollar strength, seasonality
What underlying factors drive both US large-cap
stocks and the price of oil?
Multiple Regression
Original Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt
Revised Regression Equation:
EXXONt = A + B DOWt + C INTERESTt + D GDPt
Nuts and Bolts
‘Standardise’ the variables. Rely on adjusted-R2, not plain R2. Set up dummy variables right. Distribute lags.
Heavy Lifting
Find underlying factors that drive the correlated x variables. Principal Component Analysis (PCA) is a great tool.
Multiple Regression
Proposed Regression Equation:
HOMEt = A + B 5-YEARt + C 10-YEARt + D 2-YEARt + E 1-YEARt + F 3-MONTHt + G 1-DAYt + …

where HOMEt = % change in home prices in month t, 5-YEARt = yield on the 5-year bond in month t, 10-YEARt = yield on the 10-year bond in month t, and so on.
Bad News: Multicollinearity Detected
[Time-series chart: the yields Xi move together over time - highly correlated explanatory variables]
Factor Analysis
Interest rates on a wide variety of fixed-income instruments: 1-day (overnight) money market, 3-month government bonds, 1-year government bonds, 5-year government bonds, 30-year government bonds, 5-year swap rate (inter-bank).
Factor Analysis on Interest Rates
Level: how high are interest rates?
Slope: how steep is the yield curve?
Twist: how convex is the yield curve?

Three uncorrelated factors explain most variation in all interest rates.
Factor Analysis
The factors identified are guaranteed to be uncorrelated. However, they may not have an intuitive interpretation. Principal Component Analysis is one procedure for factor analysis.
Dimensionality Reduction via Factor Analysis
20-dimensional data → 3-dimensional data

Factor Analysis
1 column for each interest rate out there → 3 columns, 1 for each factor
Dimensionality Reduction via Factor Analysis
[x11 … x1,k-1]        [f11 f12 f13]
[x21 … x2,k-1]   →    [f21 f22 f23]
[ …       …  ]        [ …   …   … ]
[xn1 … xn,k-1]        [fn1 fn2 fn3]

The n × (k-1) matrix of individual interest rates is reduced to an n × 3 matrix of factor values.
Factor Analysis is a dimensionality-
reduction technique to identify a
few underlying causes in data
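As an illustrative sketch (synthetic data, not the course's), PCA with scikit-learn can compress many correlated rate series into a few factors:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n, k = 500, 20
# Synthetic yields: 20 maturities, all driven by 3 hidden factors plus noise
maturities = np.linspace(0.25, 30, k) / 30
factors = rng.normal(size=(n, 3))                        # hidden level, slope, twist
loadings = np.column_stack([np.ones(k), maturities, maturities ** 2])
rates = factors @ loadings.T + rng.normal(scale=0.05, size=(n, k))

pca = PCA(n_components=3)
reduced = pca.fit_transform(rates)      # n x 3 matrix of factor values
print(reduced.shape)                    # (500, 3): 20 columns reduced to 3
print(pca.explained_variance_ratio_)    # first 3 components capture almost all variance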
Multiple Regression
Proposed Regression Equation:
HOMEt = A + B 5-YEARt + C 10-YEARt + D 2-YEARt + E 1-YEARt + F 3-MONTHt + G 1-DAYt + …

Applying Principal Component Analysis:
Revised Regression Equation:
HOMEt = A + B LEVELt + C SLOPEt + D TWISTt
Benefits of Multiple Regression
Simple Regression Is a Great Tool
Powerful: perfectly suited to two common use-cases.
Versatile: easily extended to non-linear relationships.
Deep: the first "crossover hit" from Machine Learning.
Multiple Regression Is Even Better
Powerful: also controls for the effects of different causes.
Versatile: also works with categorical data.
Deep: especially if combined with factor analysis.
Two Common Applications of Regression
Explaining Variance: how much variation in one data series is caused by another?
Making Predictions: how much does a move in one series impact another?
Controlling For Different Causes
Proposed Regression Equation:
EXXONt = A + B DOWt + C OILt
where EXXONt = % return on Exxon stock on day t, DOWt = % return of the Dow Jones index on day t, and OILt = % change in the price of oil on day t.
All else being equal, how much will Exxon stock
move by if oil prices increase by 1%?
Controlling For Different Causes
EXXONt = A + B DOWt + C OILt
EXXON? = A + B DOWt + C (OILt + 1%)

Subtracting the first equation from the second:
Change in EXXON = EXXON? - EXXONt = C x 1%

All else being equal, if oil prices increase by 1%, Exxon stock moves by C x 1% - the coefficient C is the answer.
Regression coefficients tell us how much y changes for a unit change in each predictor, with all other predictors held constant.
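A quick sketch of this "all else equal" reading (hypothetical coefficient values; predict the same point twice, nudging only one input):

# Hypothetical fitted coefficients for EXXON = A + B*DOW + C*OIL
A, B, C = 0.01, 0.8, 0.3

def predict(dow, oil):
    """Value of the fitted regression line."""
    return A + B * dow + C * oil

base = predict(dow=0.5, oil=2.0)
bumped = predict(dow=0.5, oil=3.0)   # oil up by one unit, DOW held constant
print(bumped - base)                 # 0.3 (= C): the partial effect of oil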
Interpreting the Results of a
Regression Analysis
Interpreting Results of a Simple Regression
R2: measures overall quality of fit - the higher the better (up to a point).
Residuals: check whether regression assumptions are violated.

Standard errors of individual coefficients are usually of little significance.
Interpreting Results of a Multiple Regression
R2, Adjusted R2, Residuals, F-statistic, and standard errors of coefficients
e = y - y'
=> y = y' + e
=> Variance(y) = Variance(y' + e)
=> Variance(y) = Variance(y') + Variance(e) + 2 Covariance(y', e)

A Not-Very-Important Intermediate Step
Variance of the dependent variable decomposes into the variance of the regression fitted values, that of the residuals, and a covariance term.

A Leap of Faith
Covariance(y', e) is always = 0. This is important - more on why in a bit.

Variance(y) = Variance(y') + Variance(e)
Variance Explained
Variance of the dependent variable can be decomposed into the variance of the regression fitted values and that of the residuals:
Variance(y) = Variance(y') + Variance(e)

Total Variance: TSS = Variance(y) - a measure of how volatile the dependent variable is, and of how much it moves around.
Explained Variance: ESS = Variance(y') - a measure of how volatile the fitted values are; these come from the regression line.
Residual Variance: RSS = Variance(e) - the variance in the dependent variable that cannot be explained by the regression.

TSS = ESS + RSS
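A small numerical check of this decomposition (illustrative data; note the slides' convention of defining these quantities via variances):

import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coeffs
resid = y - fitted

tss = y.var()       # Variance(y)
ess = fitted.var()  # Variance(y')
rss = resid.var()   # Variance(e)
print(np.isclose(tss, ess + rss))  # True: TSS = ESS + RSS
print(ess / tss)                   # this ratio is the R2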
R2 = ESS / TSS

R2 is the percentage of total variance explained by the regression. Usually, the higher the R2, the better the quality of the regression (the upper bound is 100%).
In multiple regression, adding explanatory variables always increases R2, even if those variables are irrelevant and increase the danger of multicollinearity.

Adjusted-R2 = R2 x (penalty for adding irrelevant variables)

Adjusted-R2 increases if irrelevant* variables are deleted.
(*irrelevant variables = any group whose F-ratio < 1)
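A sketch of this effect (made-up data): add a pure-noise regressor and watch R2 creep up while the adjusted R2 penalises it. statsmodels reports both:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=n)  # irrelevant variable

m1 = sm.OLS(y, sm.add_constant(x)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x, noise]))).fit()

print(m1.rsquared, m2.rsquared)          # R2 never goes down when a variable is added
print(m1.rsquared_adj, m2.rsquared_adj)  # adjusted R2 penalises the junk variable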
Extending Multiple Regression to
Categorical Variables
A Simple Regression
Proposed Regression Equation:
y = A + Bx
where y = height of an individual and x = average height of the parents
A Simple Regression
[Scatter plot of y against x: male and female points form two separate bands, with a single regression line y = A + Bx running between them]
Not a great fit - the regression line is far from all the points!
A Simple Regression
[Same scatter plot, with the line y = A1 + Bx drawn through the male points]
We can easily plot a great fit for males…
A Simple Regression
[Same scatter plot, with the line y = A2 + Bx drawn through the female points]
…and another great fit for females
A Simple Regression
[Scatter plot with both lines drawn: y = A1 + Bx for males, y = A2 + Bx for females]
Two lines - same slope, different intercepts
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:
y = A1 + (A2 - A1)D + Bx
where D = 0 for males and 1 for females
Adding A Dummy Variable
Combined Regression Line:
y = A1 + (A2 - A1)D + Bx

For males, D = 0:
y = A1 + (A2 - A1)(0) + Bx = A1 + Bx
Adding A Dummy Variable
Combined Regression Line:
y = A1 + (A2 - A1)D + Bx

For females, D = 1:
y = A1 + (A2 - A1)(1) + Bx = A2 + Bx
Adding A Dummy Variable
Original Regression Equation:
y = A + Bx
where y = height of an individual and x = average height of the parents

Combined Regression Line:
y = A1 + (A2 - A1)D + Bx
where D = 0 for males and 1 for females
Adding A Dummy Variable
The data contained 2 groups, so we added 1 dummy variable.
Given data with k groups, set up k-1
dummy variables, else
multicollinearity occurs
Dummy and Other Categorical Variables
Dummy Variables: binary - 0 or 1.
Categorical Variables: a finite set of values - e.g. days of the week, months of the year…

To include non-binary categorical variables, simply add more dummies.
Testing for Seasonality
Proposed Regression Equation:
y = A + BQ1 + CQ2 + DQ3
where y = average stock returns and Q1, Q2, Q3 indicate the quarter of the year.

The data contains 4 groups, so we added 3 dummy variables:
Q1 = 1 for Jan, Feb, Mar; 0 for other quarters
Q2 = 1 for Apr, May, Jun; 0 for other quarters
Q3 = 1 for Jul, Aug, Sep; 0 for other quarters
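A sketch of setting this up in Python (invented data; pandas builds the k-1 dummies automatically):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "returns": rng.normal(size=120),
    "quarter": np.tile(["Q1", "Q2", "Q3", "Q4"], 30),
})

# 4 groups -> 3 dummy columns; drop_first makes Q1 the baseline, absorbed by the
# intercept (the slide drops Q4 instead, but the idea is identical)
dummies = pd.get_dummies(df["quarter"], drop_first=True, dtype=float)
X = sm.add_constant(dummies)
model = sm.OLS(df["returns"], X).fit()
print(model.params)   # intercept plus one coefficient per non-baseline quarter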
Different Groups, Different Slopes
[Scatter plot: male points fit y = A1 + B1x, female points fit y = A2 + B2x - different intercepts and different slopes]
Dummy variables can also be extended for use where groups have different slopes.
Adding A Dummy Variable
Regression Line For Males: y = A1 + B1x
Regression Line For Females: y = A2 + B2x

Combined Regression Line:
y = A1 + (A2 - A1)D1 + B1x + (B2 - B1)D2
where D1 = 0 for males, 1 for females; D2 = 0 for males, x for females
Adding A Dummy Variable
For males, D1 = 0 and D2 = 0:
y = A1 + (A2 - A1)(0) + B1x + (B2 - B1)(0) = A1 + B1x
Adding A Dummy Variable
For females, D1 = 1 and D2 = x:
y = A1 + (A2 - A1)(1) + B1x + (B2 - B1)x = A2 + B2x
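In practice D2 is just an interaction term (the group dummy times x). A sketch with statsmodels formulas (hypothetical column names), where group-specific intercepts and slopes come out directly:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 200
sex = rng.choice(["male", "female"], size=n)
x = rng.normal(loc=170, scale=8, size=n)           # average parental height
d = (sex == "female").astype(float)
y = 30 + 5 * d + 0.8 * x - 0.05 * d * x + rng.normal(size=n)

df = pd.DataFrame({"y": y, "x": x, "sex": sex})
# 'C(sex) * x' expands to: intercept, sex dummy, x, and the sex-by-x interaction,
# i.e. different intercepts AND different slopes for the two groups
model = smf.ols("y ~ C(sex) * x", data=df).fit()
print(model.params)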
Dummy Variables
Dummies as X variables: linear regression. A dummy as the Y variable: logistic regression.
Normal Distribution
N(μ, σ): the average (mean) is μ and the standard deviation is σ.
[Bell curve centred at μ]
Standard Errors
[Two bell curves: the sampling distribution of α centred at E(α) = A with spread SE(α), and the sampling distribution of β centred at E(β) = B with spread SE(β)]
α and β are the population parameters; A and B are the sample estimates.
Standard error of a regression parameter is the
standard deviation of the sampling distribution
Strong Cause-effect Relationship
[Scatter plot tightly clustered around the regression line: high R2]
Residuals are small, standard errors are small.
Weak Cause-effect Relationship
[Scatter plot loosely scattered around the regression line: low R2]
Residuals are large, standard errors are large.
Standard Errors and Residuals
Low Standard Error: high confidence that the parameter coefficient is well estimated.
High Standard Error: low confidence that the parameter coefficient is well estimated.
The smaller the residuals, the smaller the standard
errors and the better the quality of the regression
Sample Regression Line
Regression Equation:
y = A + Bx

y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
…
yn = A + Bxn + en

The terms e1, …, en are the residuals.
Residual Variance (RSS)
RSS = Variance(e) - easily calculated from the regression residuals.

Estimate Standard Errors from RSS
SE(α) and SE(β) can be found from RSS; the exact formulae are not important - they are reported by Excel, R…
The smaller the residuals, the smaller
the standard errors and the better
the quality of the regression
Null Hypothesis
What if the population parameter α were actually zero? Call this the null hypothesis H0.
Null Hypothesis: α = 0
[Bell curve of the sampling distribution centred at E(α) = 0, with the estimate α = A lying t x SE(α) away from the centre]
If this were actually true, how likely is it that our sample regression would yield the estimate α = A?
Why Zero?
[Two scatter plots: the sample regression line y = A + Bx and the population regression line y = α + βx]
If α = 0, it is adding no value in the regression line
and should just be excluded
Null Hypothesis: α = 0
[Same bell curve: the estimate A lies t x SE(α) from E(α) = 0]
The farther from the mean, the more unlikely it is that α = 0.
t-Statistics
t-stat(α) = A / SE(α) = 0.85
t-stat(β) = B / SE(β) = 9.01
[Bell curves centred at E(α) = 0 and E(β) = 0, with A lying 0.85 standard errors away and B lying 9.01 standard errors away]
We are now testing a hypothesis: that the population parameter is actually zero.
Is an individual estimate of A or B ‘adding value’ at all? High t-statistic => Yes
The higher the t-statistic of a
coefficient, the higher our confidence
in our estimate of that coefficient
p-Values
[Same bell curves centred at zero: A lies 0.85 x SE(α) from 0, B lies 9.01 x SE(β) from 0]
p-value(α) = 0.39: low t-stat, high p-value
p-value(β) = 2 x 10^-15 ~ 0: high t-stat, low p-value
Is an individual estimate of α or β ‘adding value’ at all?
low p-value => Yes
The lower the p-value of a coefficient,
the higher our confidence in our
estimate of that coefficient
SER = RSS / (n - 2)

Standard Error of Regression (SER)
n is the number of points in the regression. SER provides an unbiased estimator of the error variance σ².
RSS / σ² ~ χ² (the χ² distribution)
Never mind the fine print about degrees of freedom for now.
Null Hypothesis
What if all population parameters were zero, i.e. β = α = 0? Call this the null hypothesis H0.
Null Hypothesis: β = α = 0
[Distribution under the null, centred at β = 0, α = 0, with the estimates β = B, α = A some distance away; that distance is measured by the F-statistic]
If this were actually true, how likely is it that our sample regression would yield the estimates β = B, α = A?
Why Zero?
[Two scatter plots: the sample regression line y = A + Bx and the population regression line y = α + βx]
If α = β = 0, our regression line is not adding any
value at all
Null Hypothesis: β = α = 0
[Same distribution: the farther the estimates lie from the peak, the more unlikely it is that α = β = 0]
F-Statistic
[Distribution under the null (β = 0, α = 0) with the estimates (β = B, α = A) marked at a distance given by the F-statistic]
Does our regression as a whole ‘add value’ at all?
High F-statistic => Yes
p-values and t-statistics tell us whether individual parameter coefficients are ‘good’. The F-statistic tells us whether an entire regression line is ‘good’.
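All of these statistics come out of one fitted model. A sketch with statsmodels (made-up data), pulling out exactly the quantities discussed above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 250
x = rng.normal(size=n)
y = 0.05 + 1.2 * x + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.bse)       # standard errors of the coefficients
print(model.tvalues)   # t-statistics for the intercept and slope
print(model.pvalues)   # p-values for each coefficient
print(model.fvalue)    # F-statistic for the regression as a whole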
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:
y = A1D1 + A2D2 + Bx
where D1 = 1 for males, 0 for females; D2 = 1 for females, 0 for males
Adding A Dummy Variable
Combined Regression Line:
y = A1D1 + A2D2 + Bx

For males, D1 = 1 and D2 = 0:
y = A1(1) + A2(0) + Bx = A1 + Bx
Adding A Dummy Variable
Combined Regression Line:
y = A1D1 + A2D2 + Bx

For females, D1 = 0 and D2 = 1:
y = A1(0) + A2(1) + Bx = A2 + Bx
Adding A Dummy Variable
Original Regression Equation:
y = A + Bx
where y = height of an individual and x = average height of the parents

Combined Regression Line:
y = A1D1 + A2D2 + Bx
where D1 = 1 for males, 0 for females; D2 = 1 for females, 0 for males
Given data with k groups, set up k-1
dummy variables and an intercept, or
k dummy variables with no intercept
Regression Without Intercept
Regression R2 can go negative; Excel, Python and R all adjust the R2 formula in this case. The Python statsmodels R2 sometimes differs, while Excel and R usually agree.