Decoding Multicollinearity in OLS Regression
Dr. Abhijit Biswas
1. What is Multicollinearity?
Multicollinearity occurs in Ordinary Least Squares (OLS) regression when two or more
independent variables (the predictors) are highly correlated with each other. This means
that one predictor can be almost completely explained using the other predictor(s).
OLS Regression: A statistical method that estimates how one dependent variable (what
you are trying to predict) is related to one or more independent variables (what you use
to make the prediction) by choosing the coefficients that minimize the sum of squared
residuals.
Example to Understand Multicollinearity:
Imagine you are trying to predict house prices using:
• Square Footage (X1): The total size of the house.
• Number of Bedrooms (X2): A related feature of the house.
Since larger houses generally have more bedrooms, these two variables will be highly
correlated. This correlation causes a problem for the regression model when trying to
separate the individual effects of square footage and bedrooms on house price.
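To make this concrete, here is a minimal Python sketch with invented numbers (the variable names and coefficients are purely illustrative) that simulates such a housing dataset and confirms that the two predictors move together.

```python
# Hypothetical housing data: bedroom count is largely driven by square footage,
# so the two predictors end up strongly correlated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sqft = rng.normal(1800, 400, size=500)                          # X1: square footage
bedrooms = np.round(sqft / 600 + rng.normal(0, 0.5, size=500))  # X2: tracks house size
price = 50_000 + 120 * sqft + 8_000 * bedrooms + rng.normal(0, 20_000, size=500)

df = pd.DataFrame({"sqft": sqft, "bedrooms": bedrooms, "price": price})
print(df[["sqft", "bedrooms"]].corr())   # correlation around 0.8: a red flag
```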
2. Why is Multicollinearity a Problem?
Multicollinearity makes it harder to estimate the effects of the independent variables
accurately. Here’s how:
a. Increased Standard Error of Coefficients
• Standard Error: A measure of how precise the estimate of a regression coefficient
is. Smaller standard errors mean more confidence in the estimate, while larger
ones mean less confidence.
• With multicollinearity, the model struggles to decide how much each predictor
contributes to the dependent variable, which leads to larger standard errors. The
coefficient estimates therefore become unreliable and fluctuate more from sample to
sample, as the simulation sketch below illustrates.
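A hedged simulation of this effect (all numbers are made up): the same model is fit twice, once with an unrelated second predictor and once with a nearly collinear one, and the reported standard errors are compared.

```python
# Compare coefficient standard errors with uncorrelated vs. nearly collinear predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)

def fit_and_report(x2, label):
    y = 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)   # true coefficients: 2.0 and 1.5
    X = sm.add_constant(np.column_stack([x1, x2]))
    res = sm.OLS(y, X).fit()
    print(label, "std errors:", np.round(res.bse[1:], 3))

fit_and_report(rng.normal(size=n), "uncorrelated")            # small standard errors
fit_and_report(x1 + 0.05 * rng.normal(size=n), "collinear")   # much larger standard errors
```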
b. Difficulty in Interpreting Coefficients
When variables are highly correlated, it’s hard to determine how much each variable
uniquely contributes to the outcome. For example, is house price more influenced by
square footage or number of bedrooms? Multicollinearity makes it unclear.
c. Insignificant Variables
Variables that should be important may appear statistically insignificant (their
coefficients have p-values higher than a significance threshold, like 0.05). This happens
because of the inflated standard errors caused by multicollinearity.
d. Unstable Coefficients
The regression coefficients become unstable, meaning small changes in the data can
lead to big swings in their values. This instability makes the model unreliable for
predictions.
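The short resampling sketch below (synthetic data, illustrative numbers) makes this visible: refitting the same model on bootstrap resamples of the rows produces a wide spread of estimates for a coefficient whose true value never changes.

```python
# Bootstrap the regression and watch the coefficient on x1 swing between resamples.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)            # nearly collinear with x1
y = 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2]))

coef_x1 = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)           # resample rows with replacement
    coef_x1.append(sm.OLS(y[idx], X[idx]).fit().params[1])

print(np.std(coef_x1))   # large spread even though the true coefficient is fixed at 2.0
```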
3. How to Detect Multicollinearity
a. Correlation Matrix
A correlation matrix shows the relationships between all pairs of independent variables.
Correlations close to ±1 indicate potential multicollinearity.
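A minimal sketch, assuming the predictors sit in a pandas DataFrame (the column names and values below are placeholders):

```python
# Pairwise Pearson correlations between hypothetical predictors.
import pandas as pd

X = pd.DataFrame({
    "sqft":     [1500, 2100, 1800, 2500, 1200, 2800],
    "bedrooms": [3, 4, 3, 5, 2, 5],
    "age":      [30, 5, 12, 8, 40, 3],
})
print(X.corr())   # entries close to +1 or -1 flag potential multicollinearity
```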
b. Variance Inflation Factor (VIF)
• VIF measures how much the variance of a regression coefficient is inflated by
multicollinearity. For predictor j, VIF_j = 1 / (1 − R_j²), where R_j² comes from
regressing X_j on all of the other predictors.
• A VIF above 5 (some texts use 10 as the cutoff) generally signals a problematic
level of multicollinearity; see the sketch below.
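A hedged sketch using the variance_inflation_factor helper from statsmodels, applied to the same hypothetical predictor table as above:

```python
# Compute one VIF per predictor; the constant's own entry can be ignored.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "sqft":     [1500, 2100, 1800, 2500, 1200, 2800],
    "bedrooms": [3, 4, 3, 5, 2, 5],
    "age":      [30, 5, 12, 8, 40, 3],
})
X_const = sm.add_constant(X)   # include an intercept before computing VIFs
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns)}
print(vifs)   # predictor VIFs above roughly 5 are worth investigating
```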
4. Remedies for Multicollinearity
When you detect multicollinearity, here’s how you can address it:
a. Remove One of the Correlated Variables
If two variables are highly correlated, consider removing one of them. For instance, if
square footage and the number of bedrooms are highly correlated, you might choose to
keep only square footage.
b. Combine Variables
You can create a new variable that combines the information from the correlated
predictors. For example, you could create a “size index” by combining square footage and
the number of bedrooms into one variable.
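One hedged way to do this in Python (the equal weighting is a modeling choice, not a fixed rule): standardize the two correlated predictors and average them into a single index.

```python
# Build a single "size index" from two correlated predictors.
import pandas as pd

df = pd.DataFrame({"sqft": [1500, 2100, 1800, 2500], "bedrooms": [3, 4, 3, 5]})
z = (df - df.mean()) / df.std()          # put both columns on a comparable scale
df["size_index"] = z.mean(axis=1)        # use this one column in place of both
print(df)
```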
c. Use Principal Component Analysis (PCA)
PCA transforms the variables into a new set of uncorrelated components. These
components can then be used in the regression model.
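A minimal scikit-learn sketch on synthetic data: the correlated predictors are standardized, collapsed into principal components, and the components are used as regressors.

```python
# Replace two nearly collinear predictors with one principal component.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)                 # highly correlated with x1
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=200)

X_scaled = StandardScaler().fit_transform(np.column_stack([x1, x2]))  # PCA is scale-sensitive
components = PCA(n_components=1).fit_transform(X_scaled)
model = LinearRegression().fit(components, y)
print(model.coef_, model.score(components, y))       # one stable coefficient, high R^2
```

The trade-off is interpretability: each component is a blend of the original variables rather than either one alone.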
d. Collect More Data
Collecting more data, especially over a wider range of conditions, can lower the sample
correlation between the predictors, and a larger sample shrinks standard errors overall,
which offsets the inflation caused by multicollinearity.
e. Standardize Variables
Standardizing variables (converting predictors measured in very different units, e.g.,
dollars and percentages, to a common scale) improves the numerical stability of the fit.
Note, however, that rescaling does not change the correlation between two distinct
predictors; centering is most useful when the model includes polynomial or interaction
terms built from the same variable.
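A small sketch of that last point, with invented numbers: centering a predictor before squaring it sharply reduces the correlation between the variable and its square.

```python
# Centering removes most of the correlation between x and x**2.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(10, 20, size=500)
print(np.corrcoef(x, x**2)[0, 1])                     # close to 1: structural multicollinearity
x_centered = x - x.mean()
print(np.corrcoef(x_centered, x_centered**2)[0, 1])   # close to 0 after centering
```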
5. Real-World Example
Imagine you’re analyzing sales data to predict revenue (Y) using:
1. Advertising Budget (X1): Total amount spent on ads.
2. Online Ad Spend (X2): A subset of the advertising budget focused on digital ads.
• The Problem: Online ad spend is part of the overall advertising budget, so these
two variables are highly correlated.
• Impact: The regression model can’t distinguish how much revenue is driven by
overall advertising vs. online ads. Coefficients for both variables become unstable
and have large standard errors.
• Solution: Remove one variable (e.g., keep only the total advertising budget), or
replace the overlapping pair with non-overlapping variables such as online spend and
offline spend (total budget minus online spend).
Key Takeaways
1. Multicollinearity occurs when predictors are highly correlated, making it difficult
for the regression to estimate their unique effects.
2. Why it’s a Problem: It inflates standard errors, makes coefficients unstable, and
reduces the interpretability and reliability of the regression model.
3. Solutions: Detect multicollinearity using correlation matrices, VIF, or condition
numbers, and address it by removing variables, combining variables, or using
advanced techniques like PCA or regularization.
By understanding and addressing multicollinearity, you ensure that your regression
models are both accurate and interpretable.