Multiple Linear Regression (MLR): Simplified & Structured Notes
1. Introduction to Multiple Linear Regression (MLR)
Definition and Purpose
• MLR is used to study the impact of multiple explanatory variables on a single dependent
variable.
• It extends Simple Linear Regression (SLR), which involves only one explanatory variable.
• MLR helps analyze the individual and combined effects of multiple variables on the outcome,
including how the effects of correlated variables overlap.
Comparison: MLR vs. SLR
• SLR Equation: Y = β0 + β1X1 + ε
• One explanatory variable.
• Other variables' effects are absorbed into the error term (ε).
• MLR Equation: Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
• Multiple explanatory variables.
• Each variable's effect is estimated explicitly.
MLR Model Components
• Y: Response variable
• X1, X2, ..., Xk: Explanatory variables
• β0, β1, ..., βk: Coefficients
• ε: Error term
Error Term Assumptions
• Errors are independent
• Errors have equal variance (homoscedasticity)
• Errors are normally distributed
• E[ε] = 0
Expected Value of Y
• E[Y | X1, ..., Xk] = β0 + β1X1 + β2X2 + ... + βkXk
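A minimal sketch of estimating the model in Python (the data is invented for illustration; np.linalg.lstsq computes the least-squares estimates of the β's):

```python
import numpy as np

# Invented data: n = 6 observations, k = 2 explanatory variables
X1 = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 9.0])
X2 = np.array([1.0, 3.0, 2.0, 5.0, 6.0, 8.0])
Y  = np.array([3.1, 6.0, 6.2, 9.8, 11.1, 13.0])

# Design matrix: a column of 1s for the intercept β0, then X1 and X2
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares estimates (b0, b1, b2) of (β0, β1, β2)
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta          # fitted values, i.e. estimated E[Y | X1, X2]
resid = Y - Y_hat         # residuals, the estimates of the error term ε
print("b0, b1, b2 =", beta)
```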
2. Key Concepts in MLR
Adjusted R-squared (R̄²)
• Definition: Adjusted R² corrects R² for the number of predictors (k) relative to the sample size (n):
R̄² = 1 − (1 − R²)(n − 1)/(n − k − 1)
• Purpose: Prevents misleading increases in R² by penalizing unnecessary variables (R² never falls when a variable is added; R̄² can).
• Key Points:
• Adjusted R² is always less than or equal to R² (and can even be negative).
• Higher Adjusted R² = Better model.
Standard Error (Se)
• Definition: Estimate of the population standard deviation of the error terms (σε):
Se = √(SSE / (n − k − 1))
• Smaller Se = Better model.
• Se falls when an added variable genuinely improves the fit; it can rise when a useless variable is added, because the degrees of freedom (n − k − 1) shrink faster than SSE.
• R̄² and Se² move in exactly opposite directions: R̄² = 1 − Se²/s_Y², where s_Y² is the sample variance of Y.
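A minimal sketch of these formulas in Python, on synthetic data, showing that adding a useless variable raises R² while R̄² and Se typically worsen:

```python
import numpy as np

def fit_stats(X, Y):
    """Return (R2, adjusted R2, Se) for an OLS fit of Y on the columns of X."""
    n, p = X.shape                      # p columns include the intercept, so n - p = n - k - 1
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sse = np.sum((Y - X @ beta) ** 2)   # sum of squared errors
    sst = np.sum((Y - Y.mean()) ** 2)   # total sum of squares
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    se = np.sqrt(sse / (n - p))         # standard error of the regression
    return r2, adj_r2, se

rng = np.random.default_rng(0)
n = 30
X1 = rng.normal(size=n)
noise = rng.normal(size=n)                            # a variable unrelated to Y
Y = 2 + 1.5 * X1 + rng.normal(scale=2.0, size=n)

base = np.column_stack([np.ones(n), X1])
print(fit_stats(base, Y))                             # (R2, adj R2, Se)
print(fit_stats(np.column_stack([base, noise]), Y))   # R2 never falls; adj R2 and Se typically worsen
```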
Coefficient of Correlation (R)
• In SLR: R = correlation between X and Y.
• In MLR: R = correlation between observed Y and predicted Y (Ŷ).
Marginal vs. Partial Slopes
• Marginal Slope (SLR): Total effect of a variable on Y, ignoring other variables.
• Partial Slope (MLR): Effect of a variable on Y holding other variables constant.
• Marginal and partial slopes coincide only when the explanatory variables are uncorrelated (rare in practice).
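A simulation sketch of the difference (synthetic data; the last line verifies the exact identity marginal = partial1 + a12 × partial2, where a12 is the slope of X2 regressed on X1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(scale=0.6, size=n)   # X2 correlated with X1
Y = 1.0 + 0.5 * X1 + 0.7 * X2 + rng.normal(size=n)

# Marginal slope: SLR of Y on X1 alone (absorbs X2's effect)
marginal = np.polyfit(X1, Y, 1)[0]

# Partial slopes: MLR of Y on X1 and X2 together
X = np.column_stack([np.ones(n), X1, X2])
_, partial_b1, partial_b2 = np.linalg.lstsq(X, Y, rcond=None)[0]

a12 = np.polyfit(X1, X2, 1)[0]        # slope of X2 on X1
print(marginal)                       # ≈ 0.5 + 0.8 * 0.7 ≈ 1.06, not 0.5
print(partial_b1)                     # ≈ 0.5, the effect holding X2 fixed
print(partial_b1 + a12 * partial_b2)  # equals the marginal slope exactly
```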
3. Collinearity / Multicollinearity
• Definition: High correlation among explanatory variables.
• Effect: Makes MLR results hard to interpret.
• Tools: Path diagram, Variance Inflation Factor (VIF).
4. Path Diagram: Direct and Indirect Effects
• Visualizes relationships among explanatory variables and with Y.
• Direct Effect: From X1 to Y (partial slope).
• Indirect Effect: From X1 → X2 → Y.
• Total Effect = Direct Effect + Indirect Effect
• Total Effect = Marginal Slope from SLR (an exact identity for least-squares fits; small differences in worked examples are rounding).
5. Example: CGPA Prediction (Business School Admissions)
Dataset
• 15 Students
• Y: CGPA
• X1: Entrance Exam Score (0-10)
• X2: Interview Score (0-10)
Correlations
• CGPA and Entrance: 0.74
• CGPA and Interview: 0.76
• Entrance and Interview: 0.54 → Sign of multicollinearity
SLR: X1 on Y
• R: 0.74, R²: 0.55, Se: 0.785
• Coefficient: 0.72 (Marginal slope)
• p-value: 0.001 → Significant
SLR: X2 on Y
• R: 0.763, R²: 0.58, Se: 0.741
• Coefficient: 0.934 (Marginal slope)
• p-value: 0.0001 → Significant
MLR: X1 and X2 on Y
• Multiple R: 0.86, R²: 0.74, Adjusted R²: 0.69, Se: 0.628
• p-value: 0.0003 → Model is significant
Coefficients (Partial Slopes):
• Intercept: -0.7
• X1: 0.455 (p = 0.019, CI: 0.10 to 0.81)
• X2: 0.622 (p = 0.010, CI: 0.15 to 1.08)
Regression Equation: CGPA = -0.7 + 0.455(Entrance) + 0.622(Interview)
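For example, a hypothetical applicant with Entrance = 8 and Interview = 7 gets a predicted CGPA of −0.7 + 0.455 × 8 + 0.622 × 7 ≈ 7.29.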
Path Diagram Quantification
• X1 → X2: Coefficient = 0.42
• Indirect Effect (X1): 0.42 * 0.622 = 0.26
• Total Effect (X1): 0.455 + 0.26 = 0.715 ≈ 0.72 (Marginal)
• X2 → X1: Coefficient = 0.68
• Indirect Effect (X2): 0.68 * 0.455 = 0.31
• Total Effect (X2): 0.622 + 0.31 = 0.932 ≈ 0.934 (Marginal)
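The same arithmetic, as a quick script using the coefficients reported above:

```python
# Path-diagram decomposition for X1 (Entrance) in the CGPA example
direct_x1  = 0.455          # partial slope of Entrance in the MLR
partial_x2 = 0.622          # partial slope of Interview in the MLR
a_x1_to_x2 = 0.42           # slope of Interview regressed on Entrance

indirect_x1 = a_x1_to_x2 * partial_x2    # 0.26
total_x1 = direct_x1 + indirect_x1       # 0.72 ≈ marginal slope from the SLR
print(indirect_x1, total_x1)
# The X2 direction works the same way: 0.622 + 0.68 * 0.455 ≈ 0.93
```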
6. Variance Inflation Factor (VIF)
Definition
• Measures how much of the variance in Xi is explained by the other explanatory variables.
• Formula: VIF(Xi) = 1 / (1 − Ri²)
• Ri²: R-squared from the regression of Xi on all the other explanatory variables.
Interpretation
• VIF = 1: No collinearity
• VIF > 1: Some collinearity is present; common rules of thumb treat VIF above 5 (or 10) as a serious problem.
Impact on Standard Error
• SE(bi) under collinearity = SE(bi) without collinearity × √VIF(Xi)
• Higher VIF → Larger SE → Smaller t-statistic → Larger p-value
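A sketch of the auxiliary-regression computation on synthetic data (statsmodels also ships this as variance_inflation_factor in statsmodels.stats.outliers_influence):

```python
import numpy as np

def vif(X, i):
    """VIF of column i of X, via the auxiliary regression of X[:, i] on the rest."""
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(A, X[:, i], rcond=None)
    resid = X[:, i] - A @ beta
    r2_i = 1 - resid @ resid / np.sum((X[:, i] - X[:, i].mean()) ** 2)
    return 1 / (1 - r2_i)

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.54 * x1 + rng.normal(scale=0.85, size=100)    # roughly 0.54 correlation with x1
X = np.column_stack([x1, x2])
print(vif(X, 0))   # ≈ 1 / (1 − 0.54²) ≈ 1.4 when the sample correlation is near 0.54
```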
CGPA Example
• Correlation X1 & X2: 0.54
• R² from auxiliary regression: 0.29 (= 0.54², since there is only one other predictor) → VIF = 1/(1 − 0.29) = 1.41
• √1.41 ≈ 1.19 → the slope SEs are inflated by about 19%
7. Case Study: Apartment Price Prediction
Variables
• Y: Price
• X1: Area (sq. ft.)
• X2: Bedrooms
• X3: Parking Lots
• Data: 20 apartments
SLR Results
• All variables: Significant individually (p < 0.05)
• Area marginal slope = 0.32
MLR Results
• Multiple R = 0.7, R² = 0.49, Model p-value = 0.01
But:
• Area (X1):
• Partial slope ≈ 0.05
• p = 0.7 → Not significant
• CI includes 0
• Parking Lots (X3):
• p = 0.11 → Not significant
• CI includes 0
Conclusion: Although each variable is significant on its own in SLR, Area and Parking Lots become
insignificant in MLR because multicollinearity inflates their standard errors.
VIF Values
• Area: 1.53
• Bedrooms: 1.34
• Parking Lots: 1.23
8. Signs of Multicollinearity
• R² increases only slightly with more variables
• Marginal vs. Partial slopes differ drastically
• Strong overall F-statistic, but weak individual t-tests
• Partial slope SE > Marginal slope SE
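A simulation sketch of the "strong F, weak t" symptom (synthetic data; assumes the statsmodels package):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
y = 1 + x1 + x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.f_pvalue)   # overall F-test: the model as a whole is highly significant
print(fit.pvalues)    # individual t-tests: x1 and x2 may each look insignificant
```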
9. Remedies for Multicollinearity
1. Remove Redundant Variables
• Drop variables that add little unique value.
2. Re-express Variables
• Combine correlated variables into one (e.g., economic status); a minimal sketch follows this list.
3. Do Nothing (if variables are still significant)
• If p-values are low and estimates are stable, collinearity might be acceptable.
• Example: In the CGPA model, both variables had significant partial slopes despite the correlation.
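A hypothetical sketch of remedy 2, re-expressing two correlated scores as a single composite (all names and numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
entrance  = rng.normal(6, 1.5, size=n)
interview = 0.5 * entrance + rng.normal(3, 1.0, size=n)   # correlated with entrance
cgpa = 0.4 * entrance + 0.5 * interview + rng.normal(scale=0.7, size=n)

# Replace the two correlated scores with one combined "admission score"
composite = (entrance + interview) / 2
X = np.column_stack([np.ones(n), composite])
beta, *_ = np.linalg.lstsq(X, cgpa, rcond=None)
print(beta)   # one stable slope on the composite; collinearity is gone by construction
```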