LINEAR REGRESSION
Nguyen Quang qua
[email protected]
THE IDEA BEHIND REGRESSION
• We want to relate two different variables: how does one affect the other?
• In particular, we want to know how much Y changes when X increases or decreases by 1 unit.
• To do so, we need a function of the form
  Y = βX
  which tells us that when X increases by 1 unit, Y changes by β.
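Since the slides later use R, the interpretation of β can be sketched in a few lines with made-up numbers (the values below are purely hypothetical):

```r
# Hypothetical illustration: if Y = beta * X exactly with beta = 2,
# every one-unit increase in X changes Y by exactly beta.
x <- c(1, 2, 3, 4, 5)
beta <- 2
y <- beta * x       # y = 2, 4, 6, 8, 10
diff(y)             # successive one-unit steps in x each change y by beta
```

Here `diff(y)` returns the change in Y for each one-unit step in X, which is β in every position.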
REGRESSION ANALYSIS
• Example:
  • What is your monthly income?
  • How much do you spend on bubble milk tea?
• Below is the data from a sample of 100 students.
• How much more does a student spend on bubble tea each month if his/her income increases by 1 mil. VND?
REGRESSION ANALYSIS
• How do we find the value of β in this case?
• By fitting a line to the data.
• In particular, we try to find the line of best fit.
• What does best fit mean?
REGRESSION FUNCTION
• The most basic regression does exactly this.
• The method of ordinary least squares (OLS) minimizes the sum of the squared "distances".
THE LINEAR REGRESSION MODEL (LRM)
• The general form of the LRM is:
  Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ⋯ + βₖXₖᵢ + eᵢ
• Or, written in short form:
  Yᵢ = βXᵢ + eᵢ
• Y is the regressand, or dependent/explained variable.
• X is a vector of regressors, or independent/explanatory variables.
• e is an error term/residual.
REGRESSION COEFFICIENTS
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ⋯ + βₖXₖᵢ + eᵢ
• β₀ is the intercept/constant.
• β₁ to βₖ are the slope coefficients.
• In general, the β are the regression coefficients or regression parameters. THEY ARE WHAT WE NEED TO ESTIMATE!
• Each slope coefficient measures the (partial) rate of change in the mean value of Y for a unit change in the value of a regressor, ceteris paribus.
• Roughly speaking: β₁ tells us that when X₁ increases by one unit, Y changes by β₁, other things (all other Xs) unchanged.
METHOD OF ORDINARY LEAST SQUARES
• The method of ordinary least squares (OLS) searches for the coefficients that minimize the residual sum of squares (RSS):
  RSS = Σ eᵢ²
• We need a data set of Y and X to find β.
• Finding β is an optimization problem.
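The optimization view can be made concrete in R. This is a minimal sketch with simulated data (the true coefficients 1 and 0.5 are assumptions of the example): it minimizes the RSS numerically with `optim` and compares the result with R's built-in OLS routine `lm`.

```r
set.seed(1)                       # hypothetical simulated data
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100)

# RSS as a function of the coefficients: b[1] = intercept, b[2] = slope
rss <- function(b) sum((y - b[1] - b[2] * x)^2)

# Solve the optimization problem numerically...
fit_num <- optim(c(0, 0), rss)

# ...and compare with R's built-in OLS estimator
fit_ols <- lm(y ~ x)
round(fit_num$par, 3)             # numeric minimizer of RSS
round(coef(fit_ols), 3)           # OLS coefficients: essentially the same
```

Both approaches land on the same coefficients, because `lm` is solving exactly this minimization problem (in closed form rather than numerically).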
GOODNESS OF FIT: R²
• R², the coefficient of determination, is an overall measure of the goodness of fit of the estimated regression line.
• R² gives the percentage of the total variation in the dependent variable explained by the regressors:
  • Total Sum of Squares: TSS = Σ (Yᵢ − Ȳ)²
  • Explained Sum of Squares: ESS = Σ (Ŷᵢ − Ȳ)²
  • Residual Sum of Squares: RSS = Σ eᵢ²
  • Then: R² = ESS/TSS = 1 − RSS/TSS
• R² is a value between 0 (no fit) and 1 (perfect fit); a higher R² indicates a better fit.
• When R² = 1, RSS = Σ eᵢ² = 0.
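The decomposition above can be verified in R by computing the three sums of squares by hand (on simulated data, since the slides' data set is not reproduced here) and checking them against the R² that `summary.lm` reports:

```r
set.seed(1)                          # hypothetical data
x <- runif(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

TSS <- sum((y - mean(y))^2)          # total sum of squares
ESS <- sum((fitted(fit) - mean(y))^2) # explained sum of squares
RSS <- sum(resid(fit)^2)             # residual sum of squares

R2 <- ESS / TSS                      # equivalently 1 - RSS/TSS
all.equal(R2, 1 - RSS / TSS)
all.equal(R2, summary(fit)$r.squared)
```

Note that TSS = ESS + RSS holds here because the model includes an intercept; the two formulas for R² then agree.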
DEGREES OF FREEDOM
• n is the total number of observations
• k is the total number of estimated coefficients
• df for RSS = n − k
GOODNESS OF FIT: ADJUSTED R-SQUARED
• R² is higher when more regressors are added.
• Sometimes researchers play the game of "maximizing" R² (some think the higher the R², the better the model. BUT THIS IS NOT NECESSARILY TRUE!)
• To avoid this temptation, R² should take into account the number of regressors.
• Such an R² is called the adjusted R², denoted R̄² (R-bar squared), and is computed from the original (unadjusted) R² as follows:
  R̄² = 1 − (1 − R²) · (n − 1)/(n − k)
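The adjustment formula is easy to check in R. In this sketch (simulated data; the irrelevant regressor x2 is an assumption of the example), the hand-computed R̄² matches the adjusted R² reported by `summary.lm`:

```r
set.seed(1)
n <- 50                              # hypothetical sample
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + rnorm(n)         # x2 is irrelevant by construction

fit <- lm(y ~ x1 + x2)
R2 <- summary(fit)$r.squared
k  <- length(coef(fit))              # number of estimated coefficients (here 3)

R2_adj <- 1 - (1 - R2) * (n - 1) / (n - k)
all.equal(R2_adj, summary(fit)$adj.r.squared)
```

Adding the useless regressor x2 still raises the plain R² slightly, while the adjusted version penalizes it.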
ILLUSTRATION
DATA
A survey of 20,306 individuals in the U.S.
• male: 1 = male; 2 = female
• age: age (years)
• wage: wage (US$/hour)
• tenure: # years working for current employer
• union: 1 = union member, 0 otherwise
• edu: years of schooling (years)
• married: 1 = married or living together with a partner, 0 otherwise
Data file: lrm.xlsx
IMPORTING DATA
PREPARING AND DESCRIBING DATA
DESCRIBING DUMMY VARIABLES
• For dummy variables, the mean and sd do not make a lot of sense.
• We present the frequency of each outcome instead.
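In R, such frequencies come from `table` and `prop.table`. A minimal sketch with a made-up dummy (in the slides' data this would be, e.g., `Z$union` or `Z$married`):

```r
# Hypothetical 0/1 dummy standing in for a column like Z$union
union <- c(1, 0, 0, 1, 0, 0, 0, 1)

table(union)              # absolute frequency of each outcome
prop.table(table(union))  # relative frequencies (shares)
mean(union)               # for a 0/1 dummy, the mean IS the share of 1s
```

The last line shows the one case where the mean of a dummy is meaningful: it equals the proportion of observations coded 1.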
MORE DETAILED DESCRIPTION: HISTOGRAM
hist(Z$wage, main = "Histogram of wage", xlab = "Wage", col = "yellow", breaks = 100, freq = TRUE)
MORE DETAILED DESCRIPTION: HISTOGRAM
Limit the range of the x axis to (0,100):
hist(Z$wage, main = "Histogram of wage", xlab = "Wage", col = "yellow", breaks = 1000, xlim = c(0,100))
SCATTER PLOT
plot(Z$edu,Z$wage, ylab = "Wage (US$/hour)", xlab = "Schooling years")
SCATTER PLOT
plot(Z$age,Z$wage, ylab = "Wage (US$/hour)", xlab = "Age (years)")
SCATTER PLOT
plot(Z$age,Z$wage, ylab = "Wage (US$/hour)", xlab = "Age (years)", ylim = c(0,100))
COMPARING WAGE BETWEEN GROUPS
REGRESSION RESULTS
One more schooling year results in a
wage increase of about US$2/hour.
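The regression behind this statement is a call to `lm`. Since lrm.xlsx is not reproduced here, the sketch below simulates a stand-in data frame Z with a true slope of 2 (an assumption of the example, chosen to mimic the reported US$2/hour result):

```r
# Simulated stand-in for the slides' data frame Z (wage, edu)
set.seed(1)
Z <- data.frame(edu = sample(8:18, 200, replace = TRUE))
Z$wage <- -5 + 2 * Z$edu + rnorm(200, sd = 4)  # true slope 2 by construction

fit <- lm(wage ~ edu, data = Z)
coef(fit)["edu"]   # estimated wage gain per extra schooling year (about 2)
```

With the real data the call is identical, `lm(wage ~ edu, data = Z)`, and the coefficient on edu is the slide's "about US$2/hour".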
REGRESSION WITHOUT OUTLIERS
ASSUMPTIONS OF THE CLASSICAL LRM
1. Linear in parameters
2. Full rank
3. Regressors X are fixed (non-stochastic)
4. Exogeneity of X
5. Normal distribution of the error term
6. Homoskedasticity of the error term
7. No autocorrelation
8. No specification error
ASSUMPTIONS OF THE CLASSICAL LRM
• A1: The model is linear in the parameters.
• A2: The number of observations must be greater than the number of parameters, and there is no perfect multicollinearity, i.e., no perfect linear relationships among the X variables.
• A3: Regressors X are fixed or nonstochastic.
• A4: No correlation between X and e, or E(eX) = 0.
• A5: Given X, the expected value of the error term is zero, or E(eᵢ|X) = 0, and the errors follow N(0, σ²).
• A6: Homoskedastic, or constant, variance of eᵢ: var(eᵢ|X) = σ² is a constant.
• A7: No autocorrelation: cov(eᵢ, eⱼ|X) = 0, i ≠ j.
• A8: No specification bias.
GAUSS–MARKOV THEOREM
• On the basis of assumptions A1 to A8, the OLS method gives best linear unbiased estimators (BLUE):
• (1) The estimators are linear functions of the dependent variable Y.
• (2) The estimators are unbiased: over repeated applications of the method, the estimators equal their true values on average.
• (3) In the class of linear estimators, OLS estimators have minimum variance; i.e., they are efficient, or the "best" estimators.
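Unbiasedness, property (2), can be illustrated with a small Monte Carlo sketch in R (the true slope 0.7 and all data are assumptions of the simulation): holding the regressors fixed as in A3, we redraw the errors many times and check that the OLS slope estimates average out to the true value.

```r
# Monte Carlo sketch of unbiasedness: over repeated samples,
# the average OLS slope estimate is (approximately) the true slope.
set.seed(1)
true_beta <- 0.7
x <- runif(100, 0, 10)            # regressors held fixed across samples (A3)

slopes <- replicate(1000, {
  y <- 1 + true_beta * x + rnorm(100)  # new draw of the error term
  unname(coef(lm(y ~ x))[2])
})

mean(slopes)                      # close to true_beta = 0.7
```

No single estimate equals 0.7, but their average does, which is exactly what "unbiased" means.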
HYPOTHESIS TESTING
• Testing an individual coefficient: t-test
• Testing multiple coefficients: F-test
TESTING INDIVIDUAL COEFFICIENT: T-TEST
• To test the following hypothesis:
  • H₀: βₖ = 0
  • H₁: βₖ ≠ 0
• Calculate the following statistic and use the t table to obtain the critical value with n − k degrees of freedom for a given level of significance (α, conventionally chosen at 10%, 5%, or 1%):
  t = bₖ / se(bₖ)
• If the absolute value of this statistic is greater than the critical t value, we can reject H₀.
TESTING INDIVIDUAL COEFFICIENT: T-TEST
• Step 1: Form hypotheses
  • H₀: βₖ = 0
  • H₁: βₖ ≠ 0
• Step 2: Determine the confidence interval, critical value t*, region of rejection, and region of acceptance.
• Step 3: Calculate the test statistic:
  t = β̂ₖ / se(β̂ₖ)
• Step 4: Decide.
TESTING INDIVIDUAL COEFFICIENT: T-TEST
• If |t| > t*(α/2, n−k): reject H₀ at significance level α.
• If p-value < α: reject H₀ at significance level α.
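These four steps can be reproduced by hand in R (on simulated data, since the slides' data set is not available here) and checked against the t value and p-value that `summary.lm` prints:

```r
set.seed(1)                          # hypothetical data
x <- runif(100)
y <- 1 + 0.3 * x + rnorm(100)
fit <- lm(y ~ x)

b  <- coef(summary(fit))["x", "Estimate"]
se <- coef(summary(fit))["x", "Std. Error"]

t_stat  <- b / se                    # Step 3: t = estimate / standard error
df      <- fit$df.residual           # n - k degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value

t_crit <- qt(0.975, df)              # critical value for alpha = 5%
abs(t_stat) > t_crit                 # Step 4: TRUE => reject H0
```

The hand-computed t_stat and p_value match the "t value" and "Pr(>|t|)" columns of `summary(fit)` exactly.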
TESTING INDIVIDUAL COEFFICIENT: T-TEST
The hypothesis that schooling years have no impact on wage is rejected at 10% (even at
TESTING MULTIPLE COEFFICIENTS: F-TEST
• Step 1: Form hypotheses
  • H₀: βₘ₊₁ = βₘ₊₂ = ⋯ = βₖ = 0
  • H₁: At least one β is different from 0
• Step 2: Calculate the test statistic (F):
  F = [(RSS_R − RSS_UR)/(df_R − df_UR)] / [RSS_UR/df_UR]
  df_UR = n − k
  df_R = n − m
TESTING MULTIPLE COEFFICIENTS: F-TEST
• Step 3: Determine the critical value F*(k−m, n−k) at level α
  • (k − m) degrees of freedom for the numerator
  • (n − k) degrees of freedom for the denominator
• Step 4: Decide
  • If the computed F > F*, or
  • if p-value = P(F > computed F) < α
  => reject H₀ at the significance level α.
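The restricted-vs-unrestricted F statistic can be computed by hand in R and checked against `anova`, which performs the same comparison. A minimal sketch with simulated data (the model and which regressors are restricted are assumptions of the example):

```r
set.seed(1)
n <- 200                                   # hypothetical data
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 0.5 * x1 + 0.4 * x2 + rnorm(n)

unrestricted <- lm(y ~ x1 + x2 + x3)
restricted   <- lm(y ~ x1)                 # H0: coefficients on x2 and x3 are 0

RSS_r  <- sum(resid(restricted)^2)
RSS_ur <- sum(resid(unrestricted)^2)
df_r   <- restricted$df.residual           # n - m
df_ur  <- unrestricted$df.residual         # n - k

F_stat <- ((RSS_r - RSS_ur) / (df_r - df_ur)) / (RSS_ur / df_ur)
p_val  <- pf(F_stat, df_r - df_ur, df_ur, lower.tail = FALSE)

anova(restricted, unrestricted)            # reports the same F and p-value
```

`anova(restricted, unrestricted)` is the idiomatic way to run this test in practice; the manual computation above just makes the formula from Step 2 explicit.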
TESTING MULTIPLE COEFFICIENTS: F-TEST
The hypothesis that the coefficients on male, married, and age are equal to zero simultaneously is rejected at 1%.
F-TEST FOR OVERALL SIGNIFICANCE
… is the F-test for the null hypothesis that all coefficients are equal to zero simultaneously.
F-TEST FOR OVERALL SIGNIFICANCE
The hypothesis that all coefficients are equal to zero simultaneously is rejected at 1%.
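In R, this overall F statistic and its degrees of freedom are stored in `summary(fit)$fstatistic`; a sketch on simulated data (all values here are hypothetical) shows how to recover the corresponding p-value:

```r
set.seed(1)
n <- 150                                  # hypothetical data
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.6 * x1 + 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
fs <- summary(fit)$fstatistic             # overall F statistic and its dfs

p_overall <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
p_overall < 0.01                          # TRUE => reject H0 at 1%
```

This is the same F statistic printed in the last line of `summary(fit)`; `pf` with `lower.tail = FALSE` gives the p-value the printout reports.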