Nominal variables are common in research, but they cannot be entered into a regression in their raw categorical form. Dummy coding provides a standard workaround for incorporating these variables into regression analysis. However, it often causes confusion, even among dissertation committees. As such, researchers must understand dummy coding to defend its correct and necessary use.
Dummy coding helps integrate nominal variables into regression models, and its purpose becomes clear once you understand the model. Typically, regressions are used to predict an outcome (such as GPA) using continuous variables (like hours spent studying). For instance, we might find that increased study time correlates with higher GPAs.
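Before turning to nominal predictors, it helps to see the continuous case in code. Here is a minimal sketch of the study-time example, using hypothetical numbers (the hours and GPA values are made up for illustration):

```python
import numpy as np

# Hypothetical data: weekly hours studied and the resulting GPA
hours = np.array([2, 5, 8, 10, 12, 15])
gpa = np.array([2.4, 2.9, 3.1, 3.3, 3.6, 3.8])

# Fit GPA = intercept + slope * hours (degree-1 polynomial fit)
slope, intercept = np.polyfit(hours, gpa, 1)

# A positive slope means more study time is associated with higher GPA
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Because hours studied is continuous, the slope has a clear meaning: the expected change in GPA for each additional hour of study. That interpretation is exactly what breaks down for a nominal variable.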
Now, what if we wanted to also know whether favorite class (e.g., science, math, or language) corresponded with an increased GPA? Let’s say we coded this so that science = 1, math = 2, and language = 3. Looking at the nominal favorite class variable, we can see that there is no such thing as an increase in favorite class – math is not higher than science, and is not lower than language either. This property is often referred to as directionality: for a predictor to work in regression, a higher or lower score has to carry meaning. Luckily, there’s a solution: dummy coding.
Dummy coding lets us convert categories into binary values, where 1 means an observation belongs to a category and 0 means it does not. This gives directionality to the variable, allowing the regression to compare two categories instead of expecting a continuous increase. For example, consider the favorite class variable, where we originally coded science as 1, math as 2, and language as 3.
To make the regression work, we create a separate column (or variable) for each category. These columns show whether each category was a student’s favorite. If a student has a (1) in the science column, it means science is their favorite; if they have a (0), it means science is not their favorite. The same logic applies to the other dummy variables. Below is an example of how this works:
| Student | Favorite class | Science (dummy) | Math (dummy) | Language (dummy) |
|---------|----------------|-----------------|--------------|------------------|
| 1       | Science        | 1               | 0            | 0                |
| 2       | Science        | 1               | 0            | 0                |
| 3       | Language       | 0               | 0            | 1                |
| 4       | Math           | 0               | 1            | 0                |
| 5       | Language       | 0               | 0            | 1                |
| 6       | Math           | 0               | 1            | 0                |
Now, looking at this, you can see that knowing the values of two of the dummy variables tells us what the third must be. Take student 1: we know they can only have one favorite class, so if science = 1 and math = 0, then language has to be 0 as well. The same goes for student 5: science is not their favorite, nor is math, so language has to be a yes (or 1).
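In practice, you rarely build these columns by hand. Here is one way to generate them with pandas, using the hypothetical six students from the table above:

```python
import pandas as pd

# Hypothetical data matching the table above
df = pd.DataFrame({
    "student": [1, 2, 3, 4, 5, 6],
    "favorite_class": ["Science", "Science", "Language", "Math", "Language", "Math"],
})

# One dummy column per category; 1 = that class is the student's favorite
dummies = pd.get_dummies(df["favorite_class"]).astype(int)
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each row gets exactly one 1 across the dummy columns, which is precisely the redundancy discussed next.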
For this reason, we do not enter all three dummy variables into a regression. Doing so would give the regression redundant information, produce perfect multicollinearity, and break the model. This means we have to leave one category out, and we call this omitted category the reference category. All results are then interpreted relative to that category. For example, if you included the science dummy and used language as the reference, the coefficient for science tells you how those students compare to students with language as their favorite class. The reference category is usually chosen based on how you want to interpret the results, so if you would rather compare students to those with math as their favorite class, simply include the other two dummies instead.
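One way to pick the reference category in pandas is to order the categories yourself and drop the first one. This is a sketch using the same hypothetical students, with math chosen as the reference:

```python
import pandas as pd

# Hypothetical data: six students' favorite classes
fav = ["Science", "Science", "Language", "Math", "Language", "Math"]

# Put the desired reference category (Math) first in the category order;
# drop_first=True then omits it, so the remaining dummies compare
# Science and Language students against Math students.
cats = pd.Categorical(fav, categories=["Math", "Science", "Language"])
X = pd.get_dummies(cats, drop_first=True).astype(int)
print(X.columns.tolist())  # ['Science', 'Language']
```

Only two columns remain, so the redundancy is gone, and every coefficient is read as a comparison against the math group.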
Now that we have covered one of the most common data transformations used in regression, next time we will cover a more general interpretation of linear regression. You can also learn more about interpreting binary logistic regression here!
If you’re like others, you’ve invested a lot of time and money developing your dissertation or project research. Finish strong by learning how our dissertation specialists support your efforts to cross the finish line.