🔍 What is Multicollinearity?
Multicollinearity refers to the situation in a multiple regression model where two or more explanatory
(independent) variables are highly linearly related.
If the correlation is perfect (i.e., exact linear relationship), it's called perfect multicollinearity.
If the correlation is very high but not perfect, it's called imperfect (or high) multicollinearity.
❗ Why is the Assumption of No Perfect Multicollinearity Important?
The assumption of no perfect multicollinearity is crucial because:
1. OLS Estimators Cannot Be Computed:
If perfect multicollinearity exists, the cross-product matrix X'X of the explanatory variables becomes non-invertible, and the OLS (Ordinary Least Squares) estimators cannot be computed.
2. Loss of Unique Solutions:
Perfect multicollinearity implies that one variable is an exact linear function of others. Hence, the model
cannot separate out the individual effect of each variable on the dependent variable.
⚠️Consequences of Multicollinearity
Even high (but not perfect) multicollinearity can cause:
1. Inflated Standard Errors: Coefficients become less precise.
2. Unstable Estimates: Small changes in data can lead to large changes in estimates.
3. Insignificant t-values: Even important variables may appear statistically insignificant.
4. Difficulty in Interpretation: It becomes hard to distinguish the effect of one variable from another.
🔎 Detection of Multicollinearity
Common methods include:
1. Correlation Matrix: Check pairwise correlations among independent variables.
2. Variance Inflation Factor (VIF):
o VIF > 10 indicates high multicollinearity.
3. Tolerance:
o Tolerance = 1/VIF; a low value (close to 0) suggests multicollinearity.
4. Condition Index: High values indicate multicollinearity.
5. Eigenvalues of X'X Matrix: Near-zero eigenvalues indicate multicollinearity.
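As an illustration of methods 1, 4 and 5, here is a minimal sketch using NumPy on hypothetical data (the variable names and numbers are invented for demonstration):

```python
import numpy as np

# Hypothetical data: wealth is nearly a multiple of income -> near collinearity
rng = np.random.default_rng(0)
n = 100
income = rng.normal(50, 10, n)
wealth = 5 * income + rng.normal(0, 2, n)
price = rng.normal(20, 5, n)
X = np.column_stack([income, wealth, price])

# Method 1: pairwise correlation matrix of the regressors
print(np.corrcoef(X, rowvar=False).round(3))

# Methods 4 and 5: eigenvalues and condition index of X'X (standardized X)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(Z.T @ Z)
print("eigenvalues:", eigvals.round(4))        # a near-zero eigenvalue signals collinearity
print("condition index:", round(float(np.sqrt(eigvals.max() / eigvals.min())), 2))
```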
Remedial Measures
1. Drop One of the Correlated Variables: If two variables are redundant, remove one.
2. Combine Variables: Use indices or principal component analysis.
3. Centering Variables: Especially for interaction terms (subtract the mean).
4. Collect More Data: Sometimes additional data helps reduce the multicollinearity.
5. Use Ridge Regression: Regularization techniques can handle multicollinearity better.
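For measure 2, a minimal sketch on hypothetical data (scikit-learn assumed available) that combines two collinear regressors into a single principal component:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical collinear regressors: wealth moves almost one-for-one with income
rng = np.random.default_rng(1)
income = rng.normal(50, 10, 100)
wealth = 5 * income + rng.normal(0, 2, 100)
X = np.column_stack([income, wealth])

# Replace the two collinear variables with a single principal component
Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize before PCA
pc1 = PCA(n_components=1).fit_transform(Z)
print(pc1.shape)                               # (100, 1): one combined index regressor
```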
📘 Types of Multicollinearity
Multicollinearity refers to the degree of linear relationship among the explanatory variables in a regression
model. It is broadly divided into:
1️⃣ Perfect Multicollinearity
Definition: Occurs when one explanatory variable is an exact linear function of one or more other
explanatory variables.
⚠️Consequences
Cannot compute unique OLS estimates for all original parameters.
Estimation and hypothesis testing on individual coefficients not possible.
The X'X matrix is not invertible, violating the assumptions of OLS.
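A tiny numerical illustration of the last point, on made-up data: when one column of X is an exact multiple of another, X'X is singular:

```python
import numpy as np

# Hypothetical design matrix with an exact linear dependency: x3 = 2 * x2
x1 = np.ones(5)                      # intercept column
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x3 = 2 * x2                          # perfect multicollinearity
X = np.column_stack([x1, x2, x3])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))    # 2, although there are 3 columns -> X'X is singular
print(np.linalg.det(XtX))            # (numerically) zero determinant
# Inverting X'X is impossible (np.linalg.inv would fail or give meaningless values),
# so unique OLS estimates cannot be computed.
```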
2️⃣ Imperfect (or Near) Multicollinearity
Definition: Occurs when two or more explanatory variables are highly, but not perfectly, linearly related.
Common in real-world data; perfect multicollinearity is rare.
Also called high collinearity.
✅ In this case:
OLS can still be applied.
Unique estimates of parameters are possible, but:
o Standard errors may be inflated.
o Some coefficients may appear statistically insignificant, even if they are actually important.
o Estimates can become unstable or highly sensitive to small data changes.
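A small simulation sketch of these symptoms (hypothetical data, statsmodels assumed available): both regressors genuinely affect y, yet near collinearity inflates their standard errors:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical simulation: both regressors truly matter, but they are nearly collinear
rng = np.random.default_rng(2)
n = 50
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)        # x3 is almost identical to x2
y = 1 + 2 * x2 + 2 * x3 + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
print(res.summary())
# Typical outcome: high R² and a significant F-statistic, yet large standard errors
# and small t-ratios on the individual coefficients of x2 and x3.
```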
📝 Key Distinction Table
| Feature | Perfect Multicollinearity | Imperfect Multicollinearity |
|---|---|---|
| Degree of correlation | Exact (correlation coefficient = ±1) | Very high, but not exact |
| Can OLS be applied? | ❌ No | ✅ Yes |
| Can you estimate all parameters? | ❌ No unique estimators | ✅ Yes, but with caution |
| Common in real data? | ❌ Rare | ✅ Very common |
| Matrix invertibility | ❌ X'X not invertible | ✅ X'X invertible |
| Statistical inference possible? | ❌ No | ✅ Yes, but results may be misleading |
📉 Consequences of Multicollinearity in Regression Analysis
Even though OLS estimators remain BLUE (Best Linear Unbiased Estimators) under imperfect multicollinearity,
the presence of multicollinearity—especially high or near multicollinearity—causes several problems in
estimation and inference:
🔸 (a) Multicollinearity is a Sample Problem
The explanatory variables might not be correlated in the population, but may appear highly correlated in
the sample.
Thus, multicollinearity can arise due to sampling variations, not population-level relationships.
🔸 (b) Increased Variance and Standard Errors
High multicollinearity leads to large variances and higher standard errors of OLS estimates.
This reduces precision in estimating the true values of coefficients.
🔸 (c) Wider Confidence Intervals
Standard errors are inflated, leading to wider confidence intervals for slope coefficients.
This lowers confidence in the accuracy of estimates.
🔸 (d) Insignificant t-ratios
t-statistic: t = b₂ / se(b₂), i.e., the estimated coefficient divided by its standard error.
With a high standard error, the t-ratio becomes small, often failing to reject the null hypothesis H₀: β₂ = 0, even if the variable is actually important.
🔸 (e) High R² but Few Significant t-values
Example: In Equation (10.6),
R² = 0.97778, i.e., the model explains about 98% of the variation.
However, most t-values are not significant, except for the price variable.
This leads to a contradiction:
o The F-test (overall significance) may reject H₀,
o while the t-tests (individual significance) fail to reject H₀.
This indicates a potential multicollinearity issue.
🔸 (f) Sensitivity to Small Changes in Data
OLS estimates and their standard errors become unstable.
Even small changes in the sample can substantially alter the regression results.
🔸 (g) Wrong Signs of Coefficients
One major effect of multicollinearity is that estimated coefficients may have unexpected or contradictory
signs.
Example: If the income coefficient is negative (as in Equation 10.6), it violates economic logic unless the good
is inferior.
This is due to the confounding influence of other correlated variables.
| Consequence | Description |
|---|---|
| Sample Problem | Multicollinearity may exist only in the sample, not in the population. |
| Large Standard Errors | Reduces precision of estimates. |
| Wider Confidence Intervals | Leads to less precise inference. |
| Insignificant t-values | Variables may wrongly appear unimportant. |
| High R² but low t-stats | Contradiction between overall and individual significance. |
| Sensitivity to Data Changes | Small data changes ⇒ big result shifts. |
| Wrong Signs | Estimated coefficients may defy theoretical expectations. |
🔍 Detection of Multicollinearity
Multicollinearity is not always obvious in a regression model, so it requires specific tests and indicators to be
detected. The most common methods are:
1️⃣ High R² and Few Significant t-ratios
Classic symptom of multicollinearity:
o Overall regression appears strong (high R², e.g., > 0.8).
o But individual variables are not statistically significant (low t-values).
Contradiction:
o F-test may reject the null (suggesting model is significant),
o while t-tests fail to reject null hypotheses on individual coefficients.
📌 Interpretation: Suggests that explanatory variables are collinear.
2️⃣ High Pair-wise Correlations Among Explanatory Variables
High correlation coefficients (e.g., > 0.8 or 0.9) between independent variables suggest multicollinearity.
⚠️Caution: High pairwise correlation is a sufficient but not a necessary condition for multicollinearity.
o Even if all pairwise correlations are low, (near-)perfect multicollinearity can still exist (e.g., when one variable is a linear combination of several others).
Use partial correlation coefficients (e.g., r₂₃.₄) for more accuracy:
o Measures the correlation between X₂ and X₃, holding X₄ constant.
o Example:
r₂₃ = 0.90, but
r₂₃.₄ = 0.43 → indicates weak partial correlation.
🔎 Conclusion: High pairwise correlation may suggest multicollinearity, but partial correlation gives a more reliable
picture.
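A minimal sketch (hypothetical data) computing a first-order partial correlation from the pairwise correlations, using the standard formula r₂₃.₄ = (r₂₃ − r₂₄ r₃₄) / √((1 − r₂₄²)(1 − r₃₄²)):

```python
import numpy as np

# Hypothetical regressors: X2 and X3 are both driven partly by X4
rng = np.random.default_rng(3)
x4 = rng.normal(size=200)
x2 = 0.8 * x4 + rng.normal(scale=0.5, size=200)
x3 = 0.8 * x4 + rng.normal(scale=0.5, size=200)

R = np.corrcoef(np.column_stack([x2, x3, x4]), rowvar=False)
r23, r24, r34 = R[0, 1], R[0, 2], R[1, 2]

# First-order partial correlation between X2 and X3, holding X4 constant
r23_4 = (r23 - r24 * r34) / np.sqrt((1 - r24**2) * (1 - r34**2))
print(round(r23, 2), round(r23_4, 2))   # pairwise value is high, partial value much lower
```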
3️⃣ Auxiliary (Subsidiary) Regressions
Regress each explanatory variable on all other explanatory variables:
o For example: X₁ = α₀ + α₂X₂ + α₃X₃ + ⋯ + u
Compute Rᵢ² for each auxiliary regression.
Rule of thumb:
If Rᵢ² (from an auxiliary regression) is greater than the R² of the main model, multicollinearity may be present.
📉 Limitation: Time-consuming if there are many variables.
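A sketch of this procedure on hypothetical data (statsmodels assumed available): regress each regressor on the others and compare each auxiliary Rᵢ² with the main model's R² (Klein's rule of thumb):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: X3 is nearly a linear combination of X1 and X2
rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=(n, 3))
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n)
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

main_r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    aux_r2 = sm.OLS(X[:, i], sm.add_constant(others)).fit().rsquared
    # Rule of thumb: an auxiliary R² above the main model's R² signals trouble
    print(f"X{i + 1}: auxiliary R² = {aux_r2:.3f}, main R² = {main_r2:.3f}")
```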
4️⃣ Variance Inflation Factor (VIF)
One of the most commonly used and reliable detection tools.
Defined as: VIF = 1 / (1 − Rⱼ²),
where Rⱼ² is the R² from the auxiliary regression of variable Xⱼ on the other independent variables.
Interpretation:
o VIF = 1: No multicollinearity.
o 1 < VIF < 10: Moderate multicollinearity.
o VIF ≥ 10: Serious multicollinearity.
Effect on variance: var(bⱼ) = (σ² / ∑xⱼ²) · VIF.
As Rⱼ² → 1, VIF → ∞, and so does the variance of bⱼ.
📌 Note: High VIF means inflated standard errors, leading to:
Lower t-values,
Wider confidence intervals.
⚠️Important caveat: Even if Rⱼ² is high, if the error variance σ² is small or the denominator ∑xⱼ² is large, the overall variance of bⱼ may still remain low, and t-values may be high.
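A minimal sketch computing VIFs with statsmodels' variance_inflation_factor on hypothetical data (the design matrix passed in should include the constant):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: wealth is almost proportional to income
rng = np.random.default_rng(5)
n = 100
income = rng.normal(50, 10, n)
wealth = 5 * income + rng.normal(0, 5, n)
price = rng.normal(20, 5, n)

X = sm.add_constant(np.column_stack([income, wealth, price]))
names = ["const", "income", "wealth", "price"]
for j in range(1, X.shape[1]):                     # skip the constant itself
    print(names[j], round(variance_inflation_factor(X, j), 1))
# Expect very large VIFs for income and wealth, and a VIF close to 1 for price.
```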
📝 Summary Table
| Method | Description | Limitation |
|---|---|---|
| High R² + low t-values | Overall model is strong, but variables look insignificant individually | Only an indirect indicator |
| Pairwise correlation | High correlation (> 0.8) between variables | Doesn't capture multivariable relationships |
| Partial correlation | Accounts for influence of other variables | More accurate, but complex |
| Auxiliary regressions | Regress each variable on others, check Rᵢ² | Time-consuming |
| Variance Inflation Factor (VIF) | Measures how much variance is inflated due to multicollinearity | Best general indicator |
🔧 Remedial Measures of Multicollinearity
Multicollinearity is not always a problem—if the objective is prediction, and collinearity is stable across samples, it
may not affect the forecast. However, if the goal is to estimate individual regression coefficients reliably, then
multicollinearity can severely distort results due to inflated standard errors and wider confidence intervals,
potentially leading to statistical insignificance of coefficients.
✅ Key Remedial Measures:
1. Dropping a Variable from the Model
Description: Remove one of the collinear variables.
Caution: May introduce model specification error and bias if the variable is theoretically important.
Guideline: Do not drop a variable if the absolute value of its t-statistic exceeds 1, since it still contributes to explanatory power (it raises the adjusted R²).
2. Acquiring Additional Data
Description: Increase sample size.
Effect: Increases ∑x² → lowers the variance and standard errors of the estimators.
Formula insight: in the two-regressor case, var(b₂) = σ² / (∑x₂²(1 − r₂₃²)); a larger sample raises ∑x₂², so the variance falls even if the correlation r₂₃ stays the same.
3. Re-specifying the Model
Description: Modify model structure—maybe due to omitted variables or wrong functional form.
Example Fix: Use log-linear or semi-log models to reduce multicollinearity.
4. Using Prior Information
Description: Employ estimated parameter values from previous studies as a guide.
Usefulness: Helps constrain and validate parameter values.
5. Variable Transformation
Description: Apply transformations (e.g., logs, differences) to variables.
Objective: Reduce linear relationships between regressors.
6. Ridge Regression
When to Use: Severe multicollinearity; especially with many explanatory variables.
How It Works:
o Standardize variables (mean = 0, SD = 1).
o Add a small constant k to diagonal of the correlation matrix.
o Reduces variance of estimators at the cost of introducing slight bias.
Purpose: Stabilizes estimates when R² is very high (a code sketch follows the list below).
7. Other Techniques
Combine Time Series & Cross-sectional Data: Enlarges sample & variation.
Principal Component or Factor Analysis: Combines correlated variables into fewer uncorrelated
components.
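For measure 6, a minimal ridge-regression sketch on hypothetical data (scikit-learn assumed available); the penalty alpha corresponds to the small constant k added to the diagonal:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical, nearly collinear regressors with true coefficients 2 and 2
rng = np.random.default_rng(6)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)
y = 1 + 2 * x2 + 2 * x3 + rng.normal(size=n)

X = StandardScaler().fit_transform(np.column_stack([x2, x3]))   # mean 0, SD 1

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)     # alpha plays the role of the constant k on the diagonal
print("OLS coefficients:  ", ols.coef_.round(2))    # often unstable and far apart
print("Ridge coefficients:", ridge.coef_.round(2))  # shrunk and more stable, at the cost of bias
```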
📌 Polynomial Regression & Multicollinearity
Example: Cubic cost function: Yᵢ = β₁ + β₂Xᵢ + β₃Xᵢ² + β₄Xᵢ³ + uᵢ, where Y is total cost and X is output.
Key Point: Model is linear in parameters, so OLS is still valid.
Risk: X, X², X³ can be highly correlated → multicollinearity risk.
Economic Theory Suggestion (for U-shaped cost curves): theory requires β₁, β₂, β₄ > 0, β₃ < 0, and β₃² < 3β₂β₄.
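A small sketch (made-up output levels) showing how strongly X, X² and X³ correlate, and how centering X first (remedial measure 3 above) weakens part of that correlation:

```python
import numpy as np

X = np.linspace(1, 20, 50)                 # hypothetical output levels
powers = np.column_stack([X, X**2, X**3])

c = X - X.mean()                           # centering (remedial measure 3 above)
centered = np.column_stack([c, c**2, c**3])

print(np.corrcoef(powers, rowvar=False).round(2))    # off-diagonal correlations near 1
print(np.corrcoef(centered, rowvar=False).round(2))  # correlations with the squared term drop sharply
```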