Visualizing Multi-collinearity in Python
Multi-collinearity Business Situations
• To analyze how company size and revenue relate to stock prices
in a regression model, market capitalization and revenue are
used as independent variables
• A company’s market capitalization and its total revenue are
strongly correlated: as a company earns more revenue it also
grows in size, which leads to a multi-collinearity problem
What is Multi-collinearity?
• Multi-collinearity is present when
- two or more independent features are correlated with each other
• Correlation between independent and dependent features is desired
• Multi-collinearity among independent features
- is less desirable in some settings
• Collinear features can be omitted as they are not necessarily more informative
- than the features they are correlated with
• Identifying these features is a form of feature selection
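A minimal sketch of this kind of feature selection, assuming synthetic data and a hypothetical helper `highly_correlated_columns` with an arbitrary 0.9 threshold (none of these names or values come from the slides). It flags every feature that is strongly correlated with a feature listed before it and drops it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features (illustrative only): market_cap is built to track revenue.
revenue = rng.normal(100.0, 20.0, n)
market_cap = 5.0 * revenue + rng.normal(0.0, 10.0, n)
employees = rng.normal(50.0, 10.0, n)  # unrelated to the other two
features = pd.DataFrame(
    {"revenue": revenue, "market_cap": market_cap, "employees": employees}
)


def highly_correlated_columns(df, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


to_drop = highly_correlated_columns(features)
reduced = features.drop(columns=to_drop)
print("Dropped:", to_drop)
print("Kept:", list(reduced.columns))
```

Which member of a correlated pair is dropped depends only on column order here; in practice that choice is usually guided by domain knowledge.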
What is Multi-collinearity?
• Before training predictive models on a dataset
- it is key to identify and understand multi-collinearity
• We need to limit highly collinear features
- as they can lead to misleading conclusions when explaining models
Why visualize Multi-collinearity?
• Checking correlation between independent and dependent features
- is typically done during EDA
• It provides insight into
- which features are informative for prediction
• For feature selection
- it is not always necessary to visually inspect feature correlations
• VIF (Variance Inflation Factor) can be used to detect multi-collinearity
• With multi-collinearity
- regression coefficient estimates are still consistent
- but not reliable since their standard errors are inflated
• It means that the model’s predictive power is not reduced
- but coefficients may not be statistically significant [Type II error (FN)]
• A typical sign of multi-collinearity is
- a high coefficient of determination ($R^2$) together with statistically insignificant coefficients
• Correlation between features is visualized using
- correlation matrix and corresponding heatmap
• If the dataset has a large number of features
- it becomes hard to extract any information from the heatmap
• With 50 features
- we have a matrix of shape 50 x 50
• There must be a better way
- clustermap
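The point above about inflated standard errors can be checked with a small simulation. This is a sketch on synthetic data, assuming a hypothetical helper `ols_standard_errors` that fits ordinary least squares with NumPy. It fits the same model twice, once with an independent second feature and once with a nearly duplicated one, and compares the standard error of the first coefficient.

```python
import numpy as np


def ols_standard_errors(X, y):
    """Fit OLS with an intercept; return (coefficients, standard errors)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    # Unbiased residual variance: divide by n minus number of parameters.
    sigma2 = resid @ resid / (A.shape[0] - A.shape[1])
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return beta, np.sqrt(np.diag(cov))


rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + 0.05 * rng.normal(size=n)  # almost a copy of x1

y_ind = 1.0 + 2.0 * x1 + 3.0 * x2_independent + noise
y_col = 1.0 + 2.0 * x1 + 3.0 * x2_collinear + noise

_, se_ind = ols_standard_errors(np.column_stack([x1, x2_independent]), y_ind)
_, se_col = ols_standard_errors(np.column_stack([x1, x2_collinear]), y_col)

# Index 0 is the intercept, index 1 is the coefficient of x1.
print("SE of x1 coefficient, independent features:", se_ind[1])
print("SE of x1 coefficient, collinear features:  ", se_col[1])
```

The noise level and sample size are the same in both fits, so any difference in the standard error of the `x1` coefficient comes from the correlation between the two features.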
Variance Inflation Factor
• $\mathrm{VIF}_i = 1 / (1 - R_i^2)$
• $R_i^2$ is the unadjusted coefficient of determination obtained by
regressing the ith independent variable on the remaining ones
• The reciprocal of VIF is known as tolerance
• Calculation of VIF [Refer to the attached slides]
• If $R_i^2 = 0$, the variance of the ith independent variable cannot be
predicted from the remaining independent variables
• When VIF or tolerance = 1, the ith independent variable is not correlated
with the remaining ones, which means multi-collinearity does not exist [Here
the variance of the ith regression coefficient is not inflated]
• VIF > 4 or tolerance < 0.25 indicates that multi-collinearity might exist
and further investigation is required
• When VIF > 10 or tolerance < 0.1 there is significant multi-collinearity
which needs to be addressed
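A minimal NumPy sketch of the VIF calculation, using the standard definition VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the remaining features. The function name `vif` and the synthetic data are assumptions made for illustration, not taken from the slides.

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor of every column of the 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for i in range(k):
        target = X[:, i]
        # Regress feature i on all other features plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)  # strongly correlated with x1
x3 = rng.normal(size=n)             # unrelated to x1 and x2

vifs = vif(np.column_stack([x1, x2, x3]))
tolerance = 1.0 / vifs
for name, v, t in zip(["x1", "x2", "x3"], vifs, tolerance):
    print(f"{name}: VIF = {v:.2f}, tolerance = {t:.3f}")
```

statsmodels also ships a `variance_inflation_factor` helper in `statsmodels.stats.outliers_influence`; the hand-written version is used here only to keep the sketch dependency-free and to make the definition visible.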
There are situations where high VIFs can be safely ignored without
suffering from multi-collinearity. The following are three such situations:
• High VIFs exist only in control variables and not in the variables of interest.
Here the variables of interest are not collinear with each other or with the
control variables [Their regression coefficients are not affected]
• When high VIFs result from including products or
powers of other variables, multi-collinearity has no negative
impact [e.g. a regression model that includes both x and $x^2$ as
independent variables]
• When dummy variables that represent a categorical variable with more than
two categories have high VIFs, multi-collinearity does not necessarily exist
[The dummy variables will always have high VIFs if only a small proportion of
cases falls in the reference category, regardless of whether the categorical
variable is correlated with other variables]
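The second situation (powers of a variable) can be illustrated with a short sketch on synthetic data; all names and values below are assumptions. It compares the VIF of x and x^2 before and after centering x, and checks that the fitted values of the regression are the same in both cases, which suggests the high VIF does not harm the model as a whole.

```python
import numpy as np


def vif_two_features(a, b):
    """With exactly two features, both share the same VIF: 1 / (1 - r^2)."""
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 / (1.0 - r ** 2)


def fitted_values(x_lin, x_sq, y):
    """Fit y on an intercept, x_lin and x_sq by OLS and return the fitted values."""
    A = np.column_stack([np.ones(len(y)), x_lin, x_sq])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta


rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, 500)
y = 2.0 + 0.5 * x + 0.3 * x ** 2 + rng.normal(size=500)

xc = x - x.mean()  # centered copy of x

vif_raw = vif_two_features(x, x ** 2)
vif_centered = vif_two_features(xc, xc ** 2)
fit_raw = fitted_values(x, x ** 2, y)
fit_centered = fitted_values(xc, xc ** 2, y)

print("VIF with raw x and x^2:     ", vif_raw)
print("VIF with centered x and x^2:", vif_centered)
print("Same fitted values:", np.allclose(fit_raw, fit_centered))
```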
Correction of Multi-collinearity
• Remove one (or more) of the highly correlated variables
• Use principal components analysis (PCA)
• Both aim to
- minimize information loss
- improve model predictability
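A minimal sketch of the PCA option, computed with NumPy's SVD on standardized synthetic features (the data and variable names are illustrative assumptions). The resulting component scores are uncorrelated with each other, so they can replace the collinear features as regression inputs.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Standardize so that every feature contributes on the same scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows of Vt are the principal directions; s holds the singular values.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                       # data expressed in principal components
explained = s ** 2 / np.sum(s ** 2)     # share of variance per component

components_corr = np.corrcoef(scores, rowvar=False)
reduced = scores[:, :2]                 # keep the two leading components

print("Explained variance ratio:", np.round(explained, 4))
print("Correlation between components:\n", np.round(components_corr, 6))
```

Here the last component carries almost no variance because x1 and x2 are close to duplicates, so dropping it loses little information. The price is interpretability: each component is a mix of the original features.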
Visualizing strongly correlated S&P500 stocks
• S&P500 stock data (01/01/2020 - 31/12/2021)
- to visualize collinear stocks
- Yahoo Finance data via the yfinance package in Python
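A sketch of how such data might be pulled with yfinance. The ticker list is a small illustrative subset chosen here, not the full S&P500 list used in the slides, and the helper names are hypothetical. Correlating daily returns instead of raw prices is also an assumption of this sketch; the slides may correlate the prices directly.

```python
import pandas as pd

# Illustrative subset only; the slides use the full S&P500 constituent list.
TICKERS = ["AAPL", "MSFT", "JPM", "BAC", "XOM", "CVX"]


def download_close_prices(tickers, start="2020-01-01", end="2022-01-01"):
    """Download daily closing prices, one column per ticker (needs network access)."""
    import yfinance as yf  # imported here so the rest of the module works offline

    # yfinance treats the end date as exclusive, hence 2022-01-01 for 31/12/2021.
    data = yf.download(list(tickers), start=start, end=end)
    return data["Close"]


def return_correlations(prices):
    """Correlation matrix of daily percentage returns."""
    returns = prices.pct_change().dropna(how="all")
    return returns.corr()


# Typical usage (commented out because it requires an internet connection):
# prices = download_close_prices(TICKERS)
# corr = return_correlations(prices)
# print(corr.round(2))
```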
Daily price data of S&P500 stocks
Heatmap
Clustermap
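A sketch of the two plots with seaborn, using synthetic returns in place of the S&P500 data (the three hidden "sector" factors and the stock names are invented for illustration). The columns are deliberately interleaved so that the plain heatmap looks scattered, while the clustermap reorders rows and columns so that correlated stocks sit next to each other.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
n_days = 250

# Three hidden factors drive groups of synthetic stocks (illustrative data only).
factors = rng.normal(size=(n_days, 3))
columns = {}
for j in range(12):
    sector = j % 3  # interleave sectors so neighbours in the table are unrelated
    columns[f"S{j:02d}"] = factors[:, sector] + 0.5 * rng.normal(size=n_days)
returns = pd.DataFrame(columns)
corr = returns.corr()

# Plain heatmap: keeps the original column order.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)

# Clustermap: hierarchical clustering reorders rows and columns.
grid = sns.clustermap(corr, cmap="coolwarm", vmin=-1, vmax=1, figsize=(6, 6))

# Row order chosen by the clustering.
ordered = [corr.index[i] for i in grid.dendrogram_row.reordered_ind]
print("Clustered order:", ordered)

# fig.savefig("heatmap.png"); grid.savefig("clustermap.png")  # optional output
```

With many features, the blocks along the diagonal of the clustermap point to groups of collinear features that are hard to spot in the unordered heatmap.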