Easy Implementation of Dummy Coding/One-Hot Coding in R

Martinqiu
Published in CodeX
5 min read · Oct 5, 2021


This article summarizes some easy approaches to quickly convert categorical variables into 0–1 dummy variables, all at once, while keeping the other variables in the dataframe for next-step analysis.

We don’t use the tedious ifelse or case_when function.

I will construct a general task: in a dataframe we have an ID variable, a numerical variable, N1, two categorical variables, C1 and C2, which we want to code as dummies, and one categorical variable, C3, which we leave as is.

We create some data first. The function set.seed is used to guarantee the results can be replicated each time we run the code.

set.seed(451)
df=data.frame(
  id=1:8,
  N1=round(runif(8,14,20),2),
  C1=sample(c("E","F","G","H"),8,replace=TRUE),
  C2=sample(c("Ideal","Good","Average"),8,replace=TRUE),
  C3=sample(c("Group 1","Group 2"),8,replace=TRUE),
  stringsAsFactors=TRUE)

I create a dataframe with 5 variables and 8 rows. If you use 451 for set.seed(), you should get the same dataframe as I did.

As stated in the beginning, we need to dummy code C1 and C2 only. There are 4 different levels for C1 (E, F, G, and H), and 3 different levels for C2 (Good, Ideal, and Average): 4+3=7. If we disregard full rank, we create 7 dummies, one for each level.

But if we want to maintain full rank and avoid something called the “dummy trap,” we should have one intercept (the constant, namely, a column of 1s) and 5 dummies, or 6 dummies without an intercept. I leave it to readers to figure out what the dummy trap is.
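As a numerical hint, the sketch below checks the rank of two design matrices in base R. It uses a small hand-made data frame (toy, a hypothetical stand-in for df, so the result does not depend on the random draw). The names d1, d2, X_trap, and X_full are my own choices for illustration.

```r
# Hypothetical toy data: every level of C1 and C2 appears at least once.
toy <- data.frame(
  C1 = factor(c("E", "F", "G", "H", "E", "F", "G", "H")),
  C2 = factor(c("Ideal", "Good", "Average", "Good",
                "Average", "Ideal", "Ideal", "Average"))
)

# One dummy per level: 4 columns for C1 and 3 columns for C2.
d1 <- model.matrix(~ C1 - 1, toy)
d2 <- model.matrix(~ C2 - 1, toy)

# Intercept plus all 7 dummies: 8 columns, but they are linearly dependent
# because the dummies of each factor add up to the column of 1s.
X_trap <- cbind(intercept = 1, d1, d2)
ncol(X_trap)     # 8 columns
qr(X_trap)$rank  # rank 6, so two columns are redundant

# Intercept plus 5 dummies (one reference level dropped per factor).
X_full <- model.matrix(~ C1 + C2, toy)
ncol(X_full)     # 6 columns
qr(X_full)$rank  # rank 6, so nothing is redundant
```

When the rank is smaller than the number of columns, a regression cannot separate the effects of the redundant columns. That is the situation a full-rank coding avoids.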

Below I provide three approaches. Each is just one line of code using the pipe. You can copy my code and make small changes to do your work.

First, I employ the pivot_wider function to complete dummy coding.

library(tidyverse)
df1=df %>% mutate(value1=1,value2=1) %>% pivot_wider(names_from =C1,values_from=value1,values_fill = 0) %>% pivot_wider(names_from = C2,values_from=value2,values_fill = 0);df1

The trick is to create two constant columns of 1s (called value1 and value2), and convert the data into a wide shape via pivot_wider, twice. Each time we fill the NAs with zero using values_fill = 0. We now complete a thorough dummy transformation for only C1 and C2. In total 7 dummy variables are created.

To obtain a full-rank transformation, we can either remove one of the 7 dummies, or add an intercept and remove one dummy from each of the two factors. The factor levels represented by the two removed dummies serve as the reference levels. In the code below, I choose the dummies E and Ideal as references and remove them accordingly when adding an intercept.

df2=df1 %>% mutate(intercept=1) %>% select(-c(E,Ideal));df2

Sometimes people want to track where the new dummies come from (which column they originally belonged to, C1 or C2). They want to add a prefix or suffix to the dummy names, such as “C1.E” or “Good_C2.” My approach is to change the values of C1 and C2 before pivoting them.

df1=df %>% mutate(value1=1,value2=1,C1=str_c("C1_",C1),C2=paste0(C2,".C2")) %>% pivot_wider(names_from =C1,values_from=value1,values_fill = 0) %>%  pivot_wider(names_from = C2,values_from=value2,values_fill = 0);df1

Either paste0 or str_c can change the values of C1 and C2 systematically (both are shown above). If you find a better way to do it, please let me know.
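One candidate, offered only as a sketch: pivot_wider has its own naming arguments, names_prefix and names_glue, so the renaming can happen during the pivot instead of inside mutate. The data frame toy below is a hypothetical stand-in for df, and wide is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for df.
toy <- data.frame(
  id = 1:4,
  C1 = c("E", "F", "G", "H"),
  C2 = c("Ideal", "Good", "Average", "Good")
)

wide <- toy %>%
  mutate(value1 = 1, value2 = 1) %>%
  # names_prefix puts "C1_" in front of each new column name.
  pivot_wider(names_from = C1, values_from = value1,
              values_fill = 0, names_prefix = "C1_") %>%
  # names_glue builds each name from a template; here it appends ".C2".
  pivot_wider(names_from = C2, values_from = value2,
              values_fill = 0, names_glue = "{C2}.C2")

names(wide)
```

This leaves the original values of C1 and C2 untouched until the pivot, which may be handy if they are needed elsewhere in the same pipeline.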

pivot_wider’s opposite, pivot_longer, can easily convert back a set of dummies into a categorical variable. You can find the answer from my previous articles.
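The earlier articles are not reproduced here, but a minimal sketch of the idea could look like the code below. It assumes a hypothetical data frame wide that holds the dummies E, F, G, and H; the helper column flag and the result name back are also my own choices.

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame with four dummy columns for C1.
wide <- data.frame(
  id = 1:4,
  E = c(1, 0, 0, 0),
  F = c(0, 1, 0, 0),
  G = c(0, 0, 1, 0),
  H = c(0, 0, 0, 1)
)

back <- wide %>%
  # Stack the dummy columns: one row per (id, level) pair.
  pivot_longer(cols = all_of(c("E", "F", "G", "H")),
               names_to = "C1", values_to = "flag") %>%
  # Keep only the level that is switched on for each id.
  filter(flag == 1) %>%
  select(-flag)

back
```

This only works cleanly when exactly one dummy equals 1 in each row. If a reference level was dropped for a full-rank coding, rows belonging to that level have no 1 at all and would disappear after the filter step.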

Another approach is to use dummyVars in the caret package. The dummyVars syntax is a bit strange, as it takes two steps to get the dataframe we need.

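library(caret)
# Step 1: build the dummy-coding object (C3 is excluded in the formula)
dv=dummyVars( ~.-C3, data = df,sep = "_")
# Step 2: apply it to the data and store the result as a dataframe
df1=data.frame(predict(dv, newdata = df))
# Add the untouched C3 column back
df1$C3=df$C3;df1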

Using pipe,

df1=df %>% dummyVars(~.-C3,data=., sep="_") %>% predict(newdata=df) %>% as.data.frame()%>%mutate(C3=df$C3)

dummyVars gives you 7 dummy variables, as does pivot_wider. The nice thing about dummyVars is that it automatically uses the original column names (C1 and C2) as prefixes in the names of the dummy variables.

The problem with dummyVars is that it converts every categorical variable in the dataframe to dummies unless we exclude C3 from conversion (by adding -C3 in the formula). Thus, C3 is removed from the new dataframe, and we have to add it back manually via mutate function.

The formula is also tricky: “~.-C3”, “~C1+.-C3”, “~C2+.-C3”,“~C1+C2+.-C3”, and “~C1+C2+N1+id-C3” all give the same results but with different orders of columns. The dot in the formula means to include all variables in the dataframe.
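To see how these formulas expand, base R's terms() can list the variables a formula refers to once the dot is resolved against a data frame. This is only a sketch: toy is a hypothetical data frame with the same column names as df, term_order is a helper name I made up, and I assume (without checking caret's internals) that dummyVars follows the same term order.

```r
# Hypothetical toy data with the same column names as df.
toy <- data.frame(
  id = 1:2,
  N1 = c(14.5, 18.2),
  C1 = factor(c("E", "F")),
  C2 = factor(c("Good", "Ideal")),
  C3 = factor(c("Group 1", "Group 2"))
)

# Helper (hypothetical name): term labels after the dot is expanded.
term_order <- function(f) attr(terms(f, data = toy), "term.labels")

term_order(~ . - C3)        # the dot expands to every column, then C3 is dropped
term_order(~ C1 + . - C3)   # C1 is named first, so it moves to the front
term_order(~ C2 + . - C3)   # C2 moves to the front instead
```

In each case the same four variables survive; only their order changes, which is consistent with the different column orders mentioned above.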

How to obtain a full-rank transformation when using dummyVars? dummyVars has an option fullRank=T to achieve this objective.

df1=df %>% dummyVars(~.-C3,data=., sep="_",fullRank=T) %>% predict(newdata=df) %>% as.data.frame()%>%mutate(intercept=1,C3=df$C3)

In total, 5 dummies are created. We then need to add an intercept to the dataframe via mutate.

If, instead, we don’t want an intercept but still prefer a full-rank transformation, we can modify the formula argument of dummyVars in the code above.

df1=df %>% dummyVars(~.-C3-1,data=., sep="_",fullRank=T) %>% predict(newdata=df) %>% as.data.frame()%>%mutate(C3=df$C3)

The “-1” in the formula tells R not to account for an intercept during the transformation. As a result, 6 dummies are created.

The last approach is to use model.matrix.

df1=as.data.frame(model.matrix(~C1+.-C3,df));df1

If we don’t want the intercept, we can use

as.data.frame(model.matrix(~C1+.-1-C3,df))

The nice thing about model.matrix is that the conversion is always full rank (either 5 dummies with one intercept, or 6 dummies without an intercept). The downsides are that you cannot customize the names of the newly created dummy variables, and you still lose the categorical variables that you don’t want to convert to dummies. In addition, the output of model.matrix still needs to be converted to a dataframe.
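The column counts can be checked on a small example. The sketch below uses base R only and a hypothetical data frame toy in which every level of C1 and C2 appears; with_int and no_int are illustrative names. The column names are whatever model.matrix generates, namely the variable name pasted together with the level.

```r
# Hypothetical toy data (all levels of C1 and C2 are present).
toy <- data.frame(
  id = 1:8,
  N1 = c(14.2, 15.1, 16.8, 17.3, 18.9, 19.5, 15.7, 16.1),
  C1 = factor(c("E", "F", "G", "H", "E", "F", "G", "H")),
  C2 = factor(c("Ideal", "Good", "Average", "Good",
                "Average", "Ideal", "Ideal", "Average")),
  C3 = factor(rep(c("Group 1", "Group 2"), 4))
)

# With an intercept: 3 dummies for C1 and 2 for C2 (5 in total).
with_int <- as.data.frame(model.matrix(~ C1 + . - C3, toy))
names(with_int)

# Without an intercept: C1 keeps all 4 dummies, C2 still gets 2 (6 in total).
no_int <- as.data.frame(model.matrix(~ C1 + . - 1 - C3, toy))
names(no_int)

# Re-attach the categorical column that was left out of the formula.
no_int$C3 <- toy$C3
```

Because C1 is named first in the formula, it is the factor that receives the full set of dummies when the intercept is removed.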

Using pipe, we have

df1=df %>% model.matrix(~.-C3,.) %>% as.data.frame()%>%mutate(C3=df$C3)

The second dot in model.matrix(~.-C3,.) indicates the data argument in the pipe.

To sum up, all three approaches have their pros and cons, but they are far more convenient than ifelse or case_when. If we are preparing data for linear regression or glm, we can use model.matrix without worrying about full rank or the dummy trap. In other situations, dummyVars can quickly handle a data set with many categorical variables at once (such as product features), and pivot_wider allows for customization if you intend to dummy code only a few selected categorical variables.


Martinqiu
CodeX

I am a marketing professor and I teach BDMA (big data and marketing analytics) at Lazaridis School of Business, Wilfrid Laurier University, in Waterloo, Canada.