This post covers the following questions related to multicollinearity:
- What is multicollinearity?
- How is multicollinearity related to correlation?
- What problems does multicollinearity cause?
- What is the best way to detect multicollinearity in a model?
- How do we handle/remove multicollinearity from a model?
We will work through each of these questions one by one.
Multicollinearity occurs in a multiple linear regression model, where we have more than one predictor variable. Multicollinearity exists when one predictor variable (note: not the target variable) can be linearly predicted from the other predictor variables with a significant degree of accuracy. It often means that two or more predictor variables are highly correlated, but the converse does not hold: multicollinearity may exist even when the pairwise correlations among predictors are low.
In more generic terms:
When the information carried by the predictor variables is not all independent, we say that multicollinearity exists.
Hence, by removing multicollinearity we can get a reduced set of predictors that still contains most of the information. The “stepAIC” function (from R's MASS package) does all of this for us: it reduces multicollinearity and produces a final, optimal set of predictors that retains most of the information and builds a significant model.
Multicollinearity vs Correlation
The correlation coefficient tells us the extent to which two variables vary together, whether in the same direction or in opposite directions. In other words, it tells us whether a linear relationship exists between two variables, and its absolute value tells us how strong that linear relationship is. A correlation coefficient of zero means no linear relationship exists, although the variables may still be related non-linearly.
High correlation implies multicollinearity; however, multicollinearity may exist even when correlations are low, so correlation is not conclusive evidence of multicollinearity. The reason is that correlation is always measured between two variables, whereas multicollinearity exists when one predictor variable can be predicted from a set of other predictor variables. This is what is meant by “not all of the information contained in a variable is independent: some of it can be determined from other sets of predictor variables.”
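To see that low pairwise correlation does not rule out multicollinearity, here is a minimal sketch (in Python with numpy, using made-up data rather than anything from the article): five independent predictors plus a sixth that is exactly their sum. Every pairwise correlation with the sixth variable is only about 1/√5 ≈ 0.45, yet the sixth variable is perfectly predictable from the other five.

```python
import numpy as np

# Five independent predictors, plus x6 defined as their exact sum.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 5))
x6 = X.sum(axis=1)

# Each pairwise correlation with x6 is modest (about 0.45)...
pairwise = [abs(np.corrcoef(X[:, i], x6)[0, 1]) for i in range(5)]
print(round(max(pairwise), 2))

# ...yet regressing x6 on the five predictors gives R^2 of essentially 1:
# x6 carries no information beyond what the other predictors already hold.
A = np.column_stack([np.ones(len(x6)), X])
coef, *_ = np.linalg.lstsq(A, x6, rcond=None)
r2 = 1.0 - ((x6 - A @ coef) ** 2).sum() / ((x6 - x6.mean()) ** 2).sum()
print(round(r2, 6))   # -> 1.0
```

No single pairwise correlation flags the problem; only regressing each predictor on all the others does, which is exactly what the VIF approach later in the post formalizes.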
Let's understand this with the help of an example on the mtcars data set. We can draw a correlation plot using the cor(data.frame) function in R, where all variables should be numeric.
In the correlation plot above, a darker and bigger circle represents a higher correlation between the two variables.
In the matrix above, the correlation between “disp” and “cyl” is 90.28%, which is very high: as the cyl value increases, the disp value also increases. We are using 10 variables to predict mpg, but it looks like whatever information is contained in the “disp” variable is also contained in the “cyl” variable, so there is no reason to include both variables in our prediction model; we can choose just one.
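The article computes the correlation matrix with R's cor(); here is an equivalent sketch in Python with numpy. The data below are made-up stand-ins for the mtcars columns discussed (cyl and disp moving together, mpg moving the opposite way), not the real mtcars values.

```python
import numpy as np

# Illustrative stand-ins for mtcars columns: disp tracks cyl closely,
# while mpg decreases as cyl increases. Values are synthetic.
rng = np.random.default_rng(42)
cyl = rng.choice([4, 6, 8], size=50).astype(float)
disp = 40.0 * cyl + rng.normal(0, 15, size=50)
mpg = 35.0 - 2.5 * cyl + rng.normal(0, 1.5, size=50)

# np.corrcoef treats each row as one variable, like R's cor(data.frame).
corr = np.corrcoef(np.vstack([mpg, cyl, disp]))
print(np.round(corr, 2))
```

As in the mtcars plot, the cyl/disp entry comes out very high and the mpg/cyl entry strongly negative, which is the pattern the text describes.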
Problems with Multicollinearity
When multicollinearity exists in the model, the regression coefficients cannot be estimated with confidence: there can be multiple, very different sets of coefficients that fit the data equally well, so the individual coefficients lose their statistical meaning.
Let's take a very simple example to understand this point. Consider a small table in which x2 is always exactly equal to x1 and y = 2x1. What could the linear equation to predict y from this table be?
Possible options would be:
- y = x1 + x2?
- y = 2x1?
- y = 2.5x1 − 0.5x2?
All of these fit the data equally well, and the same situation occurs whenever we have multicollinearity. Now let's see it on a real problem set.
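The ambiguity above can be checked directly. A minimal sketch (Python with numpy, using a made-up table where x2 equals x1, matching the candidate equations above):

```python
import numpy as np

# A tiny table with perfect collinearity: x2 is always equal to x1,
# and the target is y = 2 * x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = x1.copy()
y = 2.0 * x1
X = np.column_stack([x1, x2])

# Three different coefficient vectors, one per candidate equation:
candidates = [
    np.array([1.0, 1.0]),    # y = x1 + x2
    np.array([2.0, 0.0]),    # y = 2x1
    np.array([2.5, -0.5]),   # y = 2.5x1 - 0.5x2
]
for b in candidates:
    # Every candidate reproduces y exactly, so the data cannot
    # distinguish between them.
    assert np.allclose(X @ b, y)
print("all three coefficient vectors fit the data exactly")
```

Because the data cannot distinguish between these solutions, any individual coefficient estimate is arbitrary, which is precisely the problem the text describes.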
Best way to detect multicollinearity
Stepwise regression prevents the multicollinearity problem to a great extent; however, the best way to know whether multicollinearity exists is to calculate the variance inflation factor (VIF).
We calculate a VIF for each predictor (independent variable). In this approach, we temporarily ignore the target variable and try to predict each predictor variable from the remaining set of predictors.
Suppose there are three predictors x1, x2, x3. then regression equations will be like below:
- x1 = b1 + b2x2 + b3x3
- x2 = b1 + b2x1 + b3x3
- x3 = b1 + b2x1 + b3x2
Each of the above equations gives back an R-squared value, and the VIF for predictor i is: VIF_i = 1 / (1 − R_i²).
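The procedure above can be sketched directly. This is a Python/numpy version of the same computation R's vif() performs (a sketch on made-up data, not the mtcars result shown below): regress each predictor on all the others, take that regression's R², and report 1/(1 − R²).

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i of X on all remaining columns (plus an intercept)."""
    n, k = X.shape
    out = []
    for i in range(k):
        yi = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, yi, rcond=None)
        resid = yi - others @ coef
        r2 = 1.0 - resid.var() / yi.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors: x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
print(np.round(vif(X), 1))   # x1 and x3 get large VIFs, x2 stays near 1
```

The collinear pair x1/x3 shows VIFs far above the usual cutoff of 10, while the independent predictor x2 stays near 1.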
So the general rule is: if VIF > 10 (equivalently, R-squared > 0.90), severe multicollinearity exists. Again, on the mtcars data set, if we run vif(model) (from R's car package) we get the following result:
So we can say that the variables “cyl”, “disp”, “wt”, “gear” and “carb” cause multicollinearity.
So what next? Shall we immediately remove those variables and start building the model? The answer is no; this is not the whole story. Although multicollinearity exists, removing those variables blindly will still hurt accuracy. This is where stepAIC comes in: it selects the final set of variables by considering both model performance and the multicollinearity issue. You may refer to my article on stepAIC here to learn more about it with an example.
So that's all about multicollinearity. I hope this gives you a good understanding of what multicollinearity is and how to handle it. In case you have any doubts or ideas, feel free to share them in the comment section below.
This article first appeared on the “Tech Tunnel” blog at https://ashutoshtripathi.com/2019/06/07/feature-selection-techniques-in-regression-model/
Thank you for reading.
Related Articles on Machine Learning: