Stepwise Regression: What is it and should you use it?

When conducting a multiple linear regression, there are several approaches to entering predictors (i.e., independent variables) into your model. The simplest approach is to enter all of your predictors in a single step; this is commonly referred to as the “standard” method of regression. Another approach is to enter your predictors in multiple, predetermined steps. This is generally known as “hierarchical regression” and is appropriate when you have meaningful groups of predictors. For instance, your predictors might include a few demographic variables (such as gender and age) and a few personality characteristics (such as extraversion and neuroticism). In this case, it might make sense to enter your demographic variables in one step and your personality variables in the next. This would allow you to see how much variance in your outcome (dependent) variable the personality characteristics explain above and beyond the demographic variables.
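To make the "above and beyond" idea concrete, here is a minimal sketch of a two-step hierarchical regression in Python with NumPy. The data are simulated and the variable names (`demographics`, `personality`) and effect sizes are assumptions made purely for illustration; the point is only to show how the change in R² between steps is computed and tested.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

# Simulated data (hypothetical): two demographic and two personality predictors.
rng = np.random.default_rng(0)
n = 200
demographics = rng.normal(size=(n, 2))  # numeric stand-ins for gender and age
personality = rng.normal(size=(n, 2))   # stand-ins for extraversion and neuroticism
y = 0.3 * demographics[:, 1] + 0.5 * personality[:, 0] + rng.normal(size=n)

# Step 1: demographics only. Step 2: demographics plus personality.
r2_step1 = r_squared(demographics, y)
r2_step2 = r_squared(np.column_stack([demographics, personality]), y)
delta_r2 = r2_step2 - r2_step1

# F statistic for the R^2 change: q predictors added, k predictors in the full model.
q, k = 2, 4
f_change = (delta_r2 / q) / ((1.0 - r2_step2) / (n - k - 1))

print(f"Step 1 R^2 = {r2_step1:.3f}")
print(f"Step 2 R^2 = {r2_step2:.3f}")
print(f"R^2 change = {delta_r2:.3f}, F({q}, {n - k - 1}) = {f_change:.2f}")
```

The R² change is the share of outcome variance attributable to the second block after the first block has been accounted for. Most statistics packages report this value and its F test directly when predictors are entered in blocks.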

Stepwise Regression

Stepwise regression is a special case of hierarchical regression in which statistical algorithms determine which predictors end up in your model. This approach has three basic variations: forward selection, backward elimination, and stepwise (bidirectional) selection. In forward selection, the model starts with no predictors and successively enters the most significant remaining predictor until no remaining predictor meets a statistical entry criterion. In backward elimination, the model starts with all possible predictors and successively removes the least significant predictor until every predictor left in the model meets the criterion for staying. The stepwise method combines these two approaches, adding and removing predictors as it builds the model.
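As a rough illustration of forward selection, the sketch below adds one predictor at a time and stops when the best remaining candidate no longer clears an entry threshold. The function names (`rss`, `forward_select`) and the F-to-enter value of 4.0 are assumptions chosen for this example, not the defaults of any particular statistics package, and the code assumes there are far fewer predictors than observations.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)

def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection; returns column indices in order of entry."""
    n, p = X.shape
    selected = []
    current_rss = rss(X[:, selected], y)  # intercept-only model
    while len(selected) < p:
        # Try each remaining predictor and keep the one that lowers RSS the most.
        candidates = [j for j in range(p) if j not in selected]
        trial_rss = {j: rss(X[:, selected + [j]], y) for j in candidates}
        best = min(trial_rss, key=trial_rss.get)
        # Partial F statistic for adding one predictor to the current model.
        df_resid = n - (len(selected) + 1) - 1
        f_stat = (current_rss - trial_rss[best]) / (trial_rss[best] / df_resid)
        if f_stat < f_to_enter:
            break  # stopping criterion: no candidate clears the threshold
        selected.append(best)
        current_rss = trial_rss[best]
    return selected

# Simulated example (hypothetical): only columns 0 and 3 truly affect y.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=150)
chosen = forward_select(X, y)
print("Predictors entered, in order:", chosen)
```

Backward elimination reverses the loop: it starts from the full model and drops the predictor with the smallest partial F until every remaining predictor clears a removal threshold. The bidirectional variant alternates an entry check and a removal check at each step.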

Disadvantages

Stepwise regression may seem like a convenient option. However, researchers and statisticians have identified numerous statistical problems with it, including overfitting the data, biased estimates, and inflated Type I error (see Harrell, 2015 for a detailed discussion). Statistical pitfalls aside, there are other important limitations to stepwise regression. Most notably, stepwise regression relies on a computer program to pick the variables for you, without any consideration for what they measure or how they fit into the theoretical framework that guides your study. It is usually more appropriate to use theory and previous research to decide which variables are important to include in your model. Ronan Conroy, a biostatistician, once said, “Personally, I would no more let an automatic routine select my model than I would let some best fit procedure pack my suitcase.” In other words, the computer program will just pick the things that fit into the suitcase the best, regardless of what they are or whether you need them for your trip.
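The inflated Type I error can be seen in a small simulation. The sketch below is an illustration under assumed settings (100 observations, 20 candidate predictors, 500 simulated datasets, and an F-to-enter threshold of 4.0), not a reproduction of any analysis in Harrell (2015). Every predictor is pure noise, so any predictor that enters at the first step of forward selection is a false positive.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_candidates, n_datasets = 100, 20, 500
f_to_enter = 4.0  # assumed threshold, close to p < .05 for a single test at this n

false_entries = 0
for _ in range(n_datasets):
    X = rng.normal(size=(n, n_candidates))  # predictors unrelated to the outcome
    y = rng.normal(size=n)                  # outcome is pure noise
    # Correlation of each candidate with y, then its single-predictor F statistic.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc @ yc))
    f_stats = r ** 2 * (n - 2) / (1.0 - r ** 2)
    # Forward selection enters the best candidate if it clears the threshold.
    if f_stats.max() >= f_to_enter:
        false_entries += 1

false_entry_rate = false_entries / n_datasets
print(f"A noise predictor entered the model in {false_entry_rate:.0%} of datasets")
```

With 20 independent candidates each tested at roughly the .05 level, the chance that at least one clears the threshold by luck is about 1 - 0.95^20, or roughly 64%, far above the nominal 5%. The more candidates the procedure screens, the worse this becomes.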

When should you use it?

There are, however, situations in which stepwise regression may be appropriate. For example, if you have a very large number of potential predictors, stepwise regression can be used to reduce them to a more manageable set. Even then, it is usually better to narrow down the variables in your study based on the specific problem you are investigating and the background literature and theories surrounding the topic. Alternatively, if your research is purely exploratory and there is no existing theoretical foundation to guide the selection of variables, stepwise regression may be applied as an exploratory analysis.

Final Remarks

To conclude, it is generally not advisable to use stepwise regression, especially if your research questions are theoretical. If you have a very large number of potential variables to use in your model, we recommend revisiting the literature to narrow your options down. This may ultimately lead you to a more focused study that does not rely on automatic variable selection.

References
Harrell, F. (2015). Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis (2nd ed.). New York, NY: Springer.