Multiple Linear Regression Assumptions
First, multiple linear regression requires the relationship between each independent variable and the dependent variable to be linear. Scatterplots are the simplest way to check the linearity assumption. The following two examples depict a curvilinear relationship (top image) and a linear relationship (bottom image).
Second, multiple linear regression analysis requires that the errors between observed and predicted values (i.e., the residuals of the regression) follow a normal distribution. One can check this assumption by looking at a histogram or a Q-Q plot of the residuals. One can also check normality with a goodness of fit test (e.g., Kolmogorov-Smirnov) on the residuals.
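As a rough illustration of the residual checks described above, the following Python sketch fits a regression by least squares, builds the points of a normal Q-Q plot, and computes a Kolmogorov-Smirnov distance for the standardized residuals. The data are synthetic and the helper names (ols_residuals, qq_points, ks_statistic) are hypothetical, chosen only for this example. Because the mean and standard deviation are estimated from the residuals themselves, the usual Kolmogorov-Smirnov critical values are only approximate here, so the statistic is best read as a descriptive measure.

```python
import numpy as np
from statistics import NormalDist


def ols_residuals(X, y):
    """Fit y = b0 + X b by least squares and return the residuals."""
    X1 = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta


def qq_points(residuals):
    """Return (theoretical, sample) quantiles for a normal Q-Q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions in (0, 1)
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, r


def ks_statistic(residuals):
    """Largest gap between the empirical CDF of the standardized
    residuals and the standard normal CDF."""
    r = np.sort(np.asarray(residuals, dtype=float))
    z = (r - r.mean()) / r.std(ddof=1)
    n = len(z)
    cdf = np.array([NormalDist().cdf(v) for v in z])
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)


# Synthetic data (an assumption made purely for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

res = ols_residuals(X, y)
theo, samp = qq_points(res)
qq_corr = np.corrcoef(theo, samp)[0, 1]  # close to 1 when Q-Q points fall on a line
d = ks_statistic(res)                    # small when residuals look normal
print(f"Q-Q correlation: {qq_corr:.4f}, KS distance: {d:.4f}")
```

Plotting theo against samp gives the Q-Q plot itself; points that bend away from a straight line at either end suggest heavy tails or skew in the residuals.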
Third, multiple linear regression assumes that there is no multicollinearity in the data. Multicollinearity occurs when the independent variables become too highly correlated with each other.
You can check for multicollinearity in multiple ways:
1) Correlation matrix – When computing a matrix of Pearson’s bivariate correlations among all independent variables, the magnitude of the correlation coefficients should be less than .80.
2) Variance Inflation Factor (VIF) – The VIFs of the linear regression indicate the degree to which the variances of the regression estimates are inflated by multicollinearity. VIF values higher than 10 indicate that multicollinearity is a problem.
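Both checks in the list above can be sketched in a few lines of Python. The example below uses synthetic predictors in which the second column is nearly a copy of the first; the function name vif and the variable names are assumptions made for this illustration. Each VIF is computed as 1 / (1 - R²), where R² comes from regressing one independent variable on all the others.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X.

    Column j is regressed on the remaining columns (plus an intercept);
    VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic predictors (illustrative assumption): x2 is almost a copy of x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

# 1) Pearson correlation matrix among the independent variables
corr = np.corrcoef(X, rowvar=False)
flagged_pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)
                 if abs(corr[i, j]) >= 0.80]

# 2) VIF for each independent variable
vifs = vif(X)
flagged_vif = [j for j, v in enumerate(vifs) if v > 10]

print("pairs with |r| >= .80:", flagged_pairs)
print("columns with VIF > 10:", flagged_vif)
```

In this constructed example the two checks agree, but the VIF can also flag a variable that is well explained by a combination of several others even when no single pairwise correlation is large.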
If multicollinearity is found in the data, one possible solution is to center the data, which mainly helps when the correlated predictors include interaction or polynomial terms. To center the data, subtract the mean score from each observation for each independent variable. However, the simplest solution is to identify the variables causing multicollinearity issues (i.e., through correlations or VIF values) and remove those variables from the regression.
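A minimal sketch of centering, assuming a model that contains both a predictor and its square (a case where centering tends to help). The data and the helper name center are hypothetical. The raw variable and its square are almost perfectly correlated, while the centered variable and its square are not.

```python
import numpy as np


def center(X):
    """Subtract the mean from each observation of each variable."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)


# Synthetic predictor with a mean far from zero (illustrative assumption)
rng = np.random.default_rng(2)
x = rng.uniform(10, 20, size=500)

# Correlation between the predictor and its square, before centering
raw_corr = np.corrcoef(x, x**2)[0, 1]

# Same correlation after centering the predictor first
xc = center(x)
centered_corr = np.corrcoef(xc, xc**2)[0, 1]

print(f"before centering: {raw_corr:.3f}, after centering: {centered_corr:.3f}")
```

Centering does not change the correlation between two distinct predictors, so when the collinearity comes from two separate variables measuring nearly the same thing, removing or combining variables is the more relevant remedy.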
The last assumption of multiple linear regression is homoscedasticity. A scatterplot of residuals versus predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution; if there is a cone-shaped pattern (as shown below), the data are heteroscedastic.
If the data are heteroscedastic, a non-linear data transformation or addition of a quadratic term might fix the problem.
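The sketch below illustrates the idea on synthetic data with multiplicative noise, where the residual spread grows with the predicted value. It compares a fit on the raw dependent variable with a fit on its logarithm, one example of a non-linear transformation. The helper spread_trend is a hypothetical, crude numeric companion to the residual scatterplot (the correlation between absolute residuals and predicted values), not a formal test.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (predicted, residuals)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    fitted = X1 @ beta
    return fitted, y - fitted


def spread_trend(fitted, residuals):
    """Correlation between |residual| and predicted value.

    Values near 0 are consistent with an even spread; clearly positive
    values suggest a cone that widens as predictions grow.
    """
    return np.corrcoef(fitted, np.abs(residuals))[0, 1]


# Synthetic data (illustrative assumption): noise scales with the mean
rng = np.random.default_rng(3)
x = rng.uniform(0, 4, size=400)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.3, size=400))

fitted_raw, res_raw = fit_ols(x, y)          # regression on y
fitted_log, res_log = fit_ols(x, np.log(y))  # regression on log(y)

trend_raw = spread_trend(fitted_raw, res_raw)
trend_log = spread_trend(fitted_log, res_log)

print(f"raw y: {trend_raw:.3f}, log y: {trend_log:.3f}")
```

Plotting res_raw against fitted_raw and res_log against fitted_log shows the same contrast visually. Whether a logarithm, another transformation, or a quadratic term is appropriate depends on the data at hand.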
