LINEAR
REGRESSION
Nguyen Quang
[email protected]
THE IDEA BEHIND REGRESSION
• We want to relate two different variables: how does one affect the other?
• In particular, we want to know how much $Y$ changes when $X$ increases or decreases by 1 unit.
• To do so, we need a function of the form
$Y = \beta X$
which tells us that when $X$ increases by 1 unit, $Y$ changes by $\beta$.
REGRESSION ANALYSIS
• Example: a survey asks students two questions:
  • What is your monthly income?
  • How much do you spend on bubble milk tea?
• Below is the data from a sample of 100 students.
• How much more does a student spend on bubble tea each month if his/her income increases by 1 mil. VND?
REGRESSION ANALYSIS
• How do we find the value of $\beta$ in this case?
• By fitting a line to the data.
• In particular, we try to find the line of best fit.
• What does best fit mean?
REGRESSION FUNCTION
• The most basic regression does exactly this.
• The method of ordinary least squares (OLS) minimizes the sum of the squared "distances" (residuals).
THE LINEAR REGRESSION MODEL (LRM)
• The general form of the LRM is:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + e_i$
• Or, in short form:
$Y_i = \beta X_i + e_i$
• $Y$ is the regressand, or dependent/explained variable
• $X$ is a vector of regressors, or independent/explanatory variables
• $e$ is the error term/residual
REGRESSION COEFFICIENTS
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + e_i$
• $\beta_0$ is the intercept/constant
• $\beta_1$ to $\beta_k$ are the slope coefficients
• In general, the $\beta$s are the regression coefficients or regression parameters. THEY ARE WHAT WE NEED TO ESTIMATE!
• Each slope coefficient measures the (partial) rate of change in the mean value of $Y$ for a unit change in the value of a regressor, ceteris paribus.
• Roughly speaking: $\beta_1$ tells us that when $X_1$ increases by one unit, $Y$ changes by $\beta_1$, other things (all other $X$s) unchanged.
METHOD OF ORDINARY LEAST SQUARES
• The method of ordinary least squares (OLS) searches for the coefficients that minimize the residual sum of squares (RSS):
$RSS = \sum e_i^2$
• We need a data set of $Y$ and $X$ to find $\beta$.
• Finding $\beta$ is an optimization problem (see the sketch below).
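To make this concrete, here is a minimal sketch (on simulated data, not the slides' data set): minimizing the RSS with a general-purpose optimizer recovers the same coefficients as R's built-in lm().

# Minimal sketch: OLS as an optimization problem, on simulated data
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)                         # true intercept 2, true slope 3
rss <- function(b) sum((y - b[1] - b[2] * x)^2)   # residual sum of squares
optim(c(0, 0), rss)$par                           # numerical minimization of RSS
coef(lm(y ~ x))                                   # same answer (up to precision) via lm()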
GOODNESS OF FIT: R²
• $R^2$, the coefficient of determination, is an overall measure of the goodness of fit of the estimated regression line.
• $R^2$ gives the percentage of the total variation in the dependent variable explained by the regressors:
  • Explained Sum of Squares: $ESS = \sum (\hat{Y}_i - \bar{Y})^2$
  • Residual Sum of Squares: $RSS = \sum e_i^2$
  • Total Sum of Squares: $TSS = \sum (Y_i - \bar{Y})^2$
• Then: $R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS}$
• It is a value between 0 (no fit) and 1 (perfect fit); a higher $R^2$ indicates a better fit.
• When $R^2 = 1$, $RSS = 0$ and $\sum e_i^2 = 0$.
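Continuing the simulated example above, a short sketch can verify these identities against R's built-in value:

# Sketch: R^2 computed from its components (reusing x, y from above)
fit <- lm(y ~ x)
TSS <- sum((y - mean(y))^2)                # total sum of squares
RSS <- sum(residuals(fit)^2)               # residual sum of squares
ESS <- sum((fitted(fit) - mean(y))^2)      # explained sum of squares
ESS / TSS                                  # R^2 as ESS/TSS
1 - RSS / TSS                              # R^2 as 1 - RSS/TSS
summary(fit)$r.squared                     # matches the built-in value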
DEGREES OF FREEDOM
• $n$ is the total number of observations
• $k$ is the total number of estimated coefficients
• $df$ for $RSS$ is $n - k$
GOODNESS OF FIT: ADJUSTED R²
• $R^2$ increases when more regressors are added.
• Sometimes researchers play the game of "maximizing" $R^2$ (some think that the higher the $R^2$, the better the model. BUT THIS IS NOT NECESSARILY TRUE!)
• To avoid this temptation, $R^2$ should take into account the number of regressors.
• Such an $R^2$ is called the adjusted $R^2$, denoted $\bar{R}^2$ (R-bar squared), and is computed from the original (unadjusted) $R^2$ as follows:
$\bar{R}^2 = 1 - (1 - R^2)\dfrac{n-1}{n-k}$
ILLUSTRATION
DATA
A survey of 20,306 individuals in the US (data file: lrm.xlsx):
• male: 1 = male; 2 = female
• age: age (years)
• wage: wage (US$/hour)
• tenure: number of years working for current employer
• union: 1 = union member, 0 otherwise
• edu: years of schooling
• married: 1 = married or living together with a partner, 0 otherwise
• race: 1 = white; 2 = black; 3 = others
IMPORTING DATA
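The slides do not show the import code itself; one common way to read lrm.xlsx into the data frame Z used below is the readxl package (an assumption here, any xlsx reader works):

# Assumption: the readxl package; the slides call the imported data Z
library(readxl)
Z <- read_excel("lrm.xlsx")
head(Z)                                    # quick look at the first rows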
PREPARING AND DESCRIBING DATA
DESCRIBING DISCRETE VARIABLES
• For discrete variables, the mean and standard deviation do not make a lot of sense.
• We present the frequency of each outcome instead, as in the sketch below.
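A minimal sketch of such frequency tables, using base R's table():

# Frequencies of each outcome for discrete variables
table(Z$union)                             # union members vs non-members
table(Z$race)                              # counts per race category
prop.table(table(Z$race))                  # the same as shares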
MORE DETAILED DESCRIPTION: HISTOGRAM
hist(Z$wage, main = "Histogram of wage", xlab = "Wage ($/hour)", col = "yellow", breaks = 100, freq = T)
Limit the range of the x axis to (0, 100):
hist(Z$wage, main = "Histogram of wage", xlab = "Wage", col = "yellow", breaks = 1000, xlim = c(0,100))
SCATTER PLOT
plot(Z$edu, Z$wage, ylab = "Wage (US$/hour)", xlab = "Schooling years")
plot(Z$age, Z$wage, ylab = "Wage (US$/hour)", xlab = "Age (years)")
plot(Z$age, Z$wage, ylab = "Wage (US$/hour)", xlab = "Age (years)", ylim = c(0,100))
COMPARING WAGE BETWEEN GROUPS
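The grouping variable behind this slide is not shown; as an illustration (union membership is an assumption here), group means and a boxplot make the comparison:

# Sketch: comparing wage across groups (union chosen as an example)
tapply(Z$wage, Z$union, mean)              # mean wage per group
boxplot(wage ~ union, data = Z,
        ylab = "Wage (US$/hour)", xlab = "Union member")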
REGRESSION RESULTS
• One more schooling year results in a wage increase of about US$2/hour.
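The exact specification is not printed on the slide; below is a sketch of the simple wage-on-schooling regression that matches the stated interpretation:

# Sketch: wage regressed on schooling years
m1 <- lm(wage ~ edu, data = Z)
summary(m1)                                # coefficient on edu: US$/hour per extra year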
REGRESSION WITHOUT OUTLIERS
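The slide's outlier rule is not shown; dropping wages above US$100/hour (the range used for the earlier histogram) is only an assumption:

# Sketch: re-estimating without extreme wages (cutoff is an assumption)
m2 <- lm(wage ~ edu, data = subset(Z, wage < 100))
summary(m2)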
DUMMY VARIABLE AS A REGRESSOR
• The coefficient of a dummy regressor should be interpreted as the difference in the mean of the dependent variable between the two groups defined by the dummy.
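A hedged sketch with union as the dummy (the slide's actual dummy is not shown):

# Sketch: a 0/1 dummy next to edu; its coefficient is the mean wage
# difference between members and non-members, holding edu fixed
m3 <- lm(wage ~ edu + union, data = Z)
summary(m3)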
QUALITATIVE
REGRESSORS
Dummy variable as a regressor
Transforming categorical variables into dummies
INTRODUCING A CATEGORICAL VARIABLE
• Recall the categorical variable race:
  • race = 1 if white;
  • race = 2 if black;
  • race = 3 if others.
• How do we include this variable in the wage function?
• We can't introduce it directly into the regression function.
• Instead, we create a set of corresponding dummy variables.
TRANSFORMING A CATEGORICAL VARIABLE TO DUMMIES
Race categorization:
• white: race = 1
• black: race = 2
• others: race = 3 (all other races)
Dummy variables:
• white: 1 if race = 1, 0 otherwise
• black: 1 if race = 2, 0 otherwise
Regression inclusion:
• Include white and black as regressors
• "Others" serves as the base category
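A sketch of this transformation in R; note that lm() would build the same dummies automatically if race were passed as a factor:

# Creating the dummies; "others" (race = 3) is the omitted base category
Z$white <- as.numeric(Z$race == 1)
Z$black <- as.numeric(Z$race == 2)
m4 <- lm(wage ~ edu + white + black, data = Z)
summary(m4)
# Equivalent shortcut: lm(wage ~ edu + factor(race), data = Z)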
THE WAGE FUNCTION WITH CATEGORICAL VARIABLES
• The $\beta$ of white/black indicates the difference in wage between white/black and the base category ("others").
HYPOTHESIS
TESTING
Testing individual coefficient: t test
Testing multiple coefficients: F test
TESTING AN INDIVIDUAL COEFFICIENT: T-TEST
• To test the following hypothesis:
  • $H_0: \beta_k = 0$
  • $H_1: \beta_k \neq 0$
• Calculate the following statistic and use the $t$ table to obtain the critical value with $n - k$ degrees of freedom for a given level of significance ($\alpha$):
$t = \dfrac{\hat{\beta}_k}{se(\hat{\beta}_k)}$
• If the absolute value of this statistic is greater than the critical $t$ value, we can reject $H_0$.
TESTING AN INDIVIDUAL COEFFICIENT: T-TEST
• If $|t| > t_{\alpha/2,\,n-k}$: reject $H_0$ at the significance level $\alpha$
• If $p\text{-value} < \alpha$: reject $H_0$ at the significance level $\alpha$
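A sketch reproducing the t statistic and two-sided p-value for edu by hand, from the model m1 estimated earlier:

# Manual t test for the edu coefficient (matches summary() output)
cf  <- summary(m1)$coefficients            # estimates, std. errors, t, p
t_v <- cf["edu", "Estimate"] / cf["edu", "Std. Error"]
df  <- df.residual(m1)                     # n - k degrees of freedom
p_v <- 2 * pt(-abs(t_v), df)               # two-sided p-value
c(t = t_v, p = p_v)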
TESTING AN INDIVIDUAL COEFFICIENT: T-TEST
• The hypothesis that schooling years have no impact on wage is rejected at the 10% level.
TESTING MULTIPLE COEFFICIENTS: F-TEST
• Step 1: Form the hypotheses
  • $H_0$: the coefficients of the $k - m$ tested regressors are all equal to zero
  • $H_a$: at least one of the tested $\beta$s is different from 0
• Step 2: Calculate the test statistic $F$
$F = \dfrac{(RSS_R - RSS_{UR})/(df_R - df_{UR})}{RSS_{UR}/df_{UR}}$
where $df_{UR} = n - k$, $df_R = n - m$, and $df_R - df_{UR} = k - m$ ($m$ is the number of coefficients in the restricted model).
TESTING MULTIPLE COEFFICIENTS: F-TEST
• Step 3: Determine the critical value $F^*_{k-m,\,n-k}(\alpha)$
  • $(k - m)$ degrees of freedom for the numerator
  • $(n - k)$ degrees of freedom for the denominator
• Step 4: Decide; reject $H_0$ (at the significance level $\alpha$) if
  • $F > F^*$, or
  • $p\text{-value} = P(F^*_{k-m,\,n-k} > F) < \alpha$
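In R, the restricted/unrestricted comparison can be done with anova(); the unrestricted specification below is an assumption, chosen to match the slide's test of male, married and age:

# Sketch: joint F test that the coefficients of male, married and age
# are all zero (specification assumed, not shown on the slide)
unrestricted <- lm(wage ~ edu + male + married + age, data = Z)
restricted   <- lm(wage ~ edu, data = Z)
anova(restricted, unrestricted)            # reports F statistic and p-value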
TESTING MULTIPLE COEFFICIENTS: F-TEST
• The hypothesis that the coefficients of male, married and age are simultaneously equal to zero is rejected at the 10% level.
F TEST FOR OVERALL SIGNIFICANCE
• The F test for overall significance tests the null hypothesis that all slope coefficients are simultaneously equal to zero.
• The hypothesis that all coefficients are simultaneously equal to zero is rejected at the 10% level.
ASSUMPTIONS OF THE CLASSICAL LRM
1. Linear in parameters
2. Full rank
3. Regressors X are fixed (non-stochastic)
4. Exogeneity of X
5. Normal distribution of the error term
6. Homoskedasticity of the error term
7. No autocorrelation
8. No specification error
ASSUMPTIONS OF THE CLASSICAL LRM
• A1: The model is linear in the parameters.
• A2: The number of observations must be greater than the number of parameters, and there is no perfect multicollinearity, i.e., no perfect linear relationships among the $X$ variables.
• A3: Regressors $X$ are fixed or nonstochastic.
• A4: No correlation between $X$ and $e$, or $E(e|X) = 0$.
• A5: Given $X$, the expected value of the error term is zero, $E(e_i|X) = 0$, and the error term follows $N(0, \sigma^2)$.
ASSUMPTIONS OF THE CLASSICAL LRM
• A6: Homoskedastic, or constant, variance of $e_i$: $var(e_i|X) = \sigma^2$ is a constant.
• A7: No autocorrelation: $cov(e_i, e_j|X) = 0,\ i \neq j$.
• A8: No specification bias.
GAUSS–MARKOV THEOREM
• On the basis of assumptions A1 to A8, the OLS method gives the best linear unbiased estimators (BLUE):
• (1) The estimators are linear functions of the dependent variable $Y$.
• (2) The estimators are unbiased: in repeated applications of the method, the estimators equal their true values on average.
• (3) In the class of linear unbiased estimators, OLS estimators have minimum variance; i.e., they are efficient, or the "best" estimators.