Dealing With Missing Values in R: From Deletion to Replacing to Machine Learning

Martinqiu · Published in Geek Culture · May 21, 2021

We frequently encounter datasets with missing values (represented as NAs in the data frame). Missing values render part of the data unusable. Why those values are missing is a different story and is beyond the scope of this article; here we only talk about treatment.

The primary treatment is either to delete the rows with missing values (reducing the number of observations) or to remove the columns with missing values (giving up some information). Some people are not happy with a reduced dataset, so they replace missing values with summary statistics such as the mean or median of the available values. A more sophisticated option is to estimate the missing values together with the model parameters in a Bayesian framework. This approach is popular in econometric circles (among Bayesians, of course), but I don’t see wide adoption outside that field (I am not a big fan of Bayesian methods, so I will skip this approach).

More recently, people have used machine learning algorithms to “guess” the missing information and fill the guesses into the data as if they had been there in the first place. I am in no position to judge the pros and cons of this approach; this article intends to give R learners an idea of how to implement it. We will also summarize some useful R functions for dealing with NAs.

We use the diamonds data in the ggplot2 package one more time for easy replicability. We sample 5,000 diamonds from it, out of which we randomly select 500 diamonds to replace their carat values with NAs, 200 diamonds to replace their price with NAs, and 100 diamonds to replace their color with NAs.

To maintain tractability, we keep only the four Cs variables: carat, color, clarity, and cut, plus the price. If we run a linear regression, the four Cs are the RHS variables and the price is the LHS.

library(dplyr)
data(diamonds, package = "ggplot2")
set.seed(450)
df = diamonds %>% sample_n(5000) %>% select(1:4, 7)
set.seed(450)
df$carat[sample(1:5000, 500)] = NA
df$price[sample(1:5000, 200)] = NA
df$color[sample(1:5000, 100)] = NA

The five variables are selected by their column IDs via select(1:4, 7) in the pipe above. The set.seed() function makes the sampling reproducible each time we rerun the code; you can use any positive integer inside set.seed().

The following two sapply calls detect which columns have missing values and count how many missing values each column has, respectively.

sapply(df, function(x) any(is.na(x)))
sapply(df, function(x) sum(is.na(x)))

We have the outputs as follows.

##   carat     cut   color clarity   price
##    TRUE   FALSE    TRUE   FALSE    TRUE
##   carat     cut   color clarity   price
##     500       0     100       0     200

Without any treatment of missing values, we can run linear regression and many other analyses. Rows with NAs are simply excluded without notice.

lm.model=lm(price~.,data=df)
summary(lm.model)

That’s one of the problems caused by NAs: you don’t even realize that rows with NAs were removed until you check the degrees of freedom in the summary output.
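If you want to confirm how many rows were silently dropped, a quick check (not part of the article’s original code) is to compare the number of rows in the data with the number of observations the model actually used. Since every column of df enters this model, the difference should equal the number of incomplete rows.

nrow(df) - nobs(lm.model)   # rows silently dropped by lm()
sum(!complete.cases(df))    # rows containing at least one NA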

Some other analyses or operations may not proceed at all if NAs are detected; we have to tell R to remove the NAs. For example, to compute the mean of carat, we have to add na.rm = T to the call:

mean(df$carat,na.rm=T)

In those cases, we have to clean up NAs in the data frame first before we feed the data to analyses. Below I summarize three approaches.

Delete NAs

We first list some code that removes rows with missing values.

df1=na.omit(df)
df1=df %>% filter(complete.cases(df))

If there are multiple columns with missing values, we can remove rows by the missing values of selected columns.

df1=df %>% filter(!is.na(carat))
df1=df %>% drop_na(carat)   # drop_na() comes from the tidyr package

To remove columns with NA values (so you keep the full set of rows but with fewer variables), you can use select(-variable_name) to drop the columns with missing information, as in the sketch below.
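For example, a minimal sketch (assuming we are willing to give up the carat column entirely):

df1 = df %>% select(-carat)   # drop the column with NAs; all 5,000 rows remain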

Replace NAs

For numerical variables, we can replace NAs with given values, for example, the mean, as depicted in the following code.

mean.carat=mean(df$carat,na.rm=T)
df1=df %>% mutate(carat=ifelse(is.na(carat),mean.carat,carat))

Or, written in one pipe:

df1=df %>% mutate(carat=ifelse(is.na(carat),mean(df$carat,na.rm=T),carat))

Using the mean to replace missing values preserves the mean of carat after the replacement. If the distribution of the variable with missing values is highly skewed, e.g., there are many outliers that drag the mean away from the median, we would rather use the median to replace the missing values.
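A sketch of the median version (my addition, mirroring the code above):

median.carat = median(df$carat, na.rm = TRUE)
df1 = df %>% mutate(carat = ifelse(is.na(carat), median.carat, carat))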

For factor variables, we have to designate a new factor level for the missing values, but we cannot directly assign it to the variable. We need to convert the factor to a character variable first, replace the missing values, and then convert it back to a factor.

df$color=as.character(df$color)
df1=df %>% mutate(color=ifelse(is.na(color),"Other",color))
flevel=row.names(as.matrix(table(df1$color)))
df1$color=factor(df1$color,levels=flevel)
table(df1$color)

flevel extracts the factor levels from the table output (by converting it to a matrix first) to save us the trouble of typing them out. The last line verifies that “Other” has been added as a new factor level.

If we want to replace values in multiple columns with NAs, we can use replace_na. The syntax is as follows. replace_na also keeps the labels of the factor levels, which saves us the trouble of extracting them.

df1=df %>% replace_na(list(carat=mean.carat, price=0, color="Not Available")) %>% mutate(color=factor(color))
str(df1$color)

In the code above, we replace the missing values of carat with the mean carat, the missing values of price with zero, and the missing values of color with the string “Not Available.” A list argument is needed for multiple-value replacement.

If we have many numerical columns with missing values and want to replace them with the corresponding column means without too much typing, the mutate_if function comes in handy.

df1=df %>% mutate_if(is.numeric, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

In the line above, carat and price each have their missing values replaced by the corresponding column mean.
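For reference, in newer versions of dplyr the same replacement can be written with across() instead of mutate_if — a minimal sketch, assuming dplyr 1.0 or later:

df1 = df %>% mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))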

Guess NAs

Now consider the case where we want to guess the missing carat information via machine learning prediction.

To keep the article readable, we remove all previous results and create a new data frame of diamonds with missing values only on carat. We sample 10,000 diamonds and set 1,000 diamonds’ carat values to NA. We will predict them via machine learning.

rm(list=ls(pattern="df"))   # execute with caution
df=diamonds %>% sample_n(10000) %>% mutate(id=1:10000)
set.seed(451)
df$carat[sample(1:10000, 1000)]=NA

For the remaining 9,000 diamonds with carat values, we split them into training and test datasets with a 7:3 ratio and name them df.train and df.test, respectively.

set.seed(450)
df.train=df %>% filter(!is.na(carat)) %>% sample_n(6300) %>% arrange(id)
df.test=df %>% filter((!is.na(carat))&(!id%in%df.train$id)) %>% arrange(id)

We then run a series of linear regressions using whatever we have in the training data (excluding price and id) to explain the carat values in the training data (it sounds a bit odd, but that is what R-squared is here for). When we feel we have found a model with sufficient explanatory power, we use it to predict the carat values in the test data and measure the RMSE against the actual carat values. We go back and forth a few times, comparing RMSE values, until we are confident we have the “best” model.

lm1=lm(carat~cut+color+clarity,df.train)
summary(lm1)
pred=predict(lm1,df.test)
rmse1=sqrt(mean((df.test$carat-pred)^2))
lm2=lm(carat~.-id-price,df.train)
summary(lm2)
pred2=predict(lm2,df.test)
rmse2=sqrt(mean((df.test$carat-pred2)^2))

For illustration purposes, I only trained two linear models, but we may need to train dozens of models before we find the ideal one. In that case, functional programming is extremely helpful so we don’t need to repeat most of the code above; see the sketch below.
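As a rough illustration of that idea (using purrr, which the article itself does not load), the two candidate models above could be fit and scored in one pass:

library(purrr)
formulas = list(carat ~ cut + color + clarity,
                carat ~ . - id - price)
rmse.all = map_dbl(formulas, function(f) {
  m = lm(f, data = df.train)
  pred = predict(m, df.test)
  sqrt(mean((df.test$carat - pred)^2))   # test RMSE for this candidate
})
rmse.all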

Suppose we decide to use model lm2 to predict the missing values of carat. Using the code from the Replace NAs section, we replace the NAs with the predicted carat values.

df1=df %>% mutate(carat=ifelse(is.na(carat),predict(lm2,.),carat))

We now have a new data frame df1 with no missing values, ready for any following analysis.

The linear regression approach can be replaced with any other machine learning algorithm suitable for predicting numerical values. Below I use KNN.

df2 = df %>%
  mutate_if(is.factor, as.character) %>%
  model.matrix(~color+cut+clarity-1, .) %>%
  cbind(select(df, -c(cut,color,clarity))) %>%
  select(carat, everything()) %>%
  mutate_at(names(.)[2:25], scale) %>%
  as.data.frame()

Through a sequence of pipe operations, I convert all the factor variables in df to 0–1 dummies, because knn in R uses Euclidean distance, so every variable must be numeric. (I will write a separate article on model.matrix and other R functions that facilitate the 0–1 conversion.) As a result, all variables are numeric. I then standardize all of them except carat and id (again for the sake of computing Euclidean distance) and call this new data frame df2.

After excluding the rows with missing carat values from df2, I split the remaining rows into training and test sets as usual.

set.seed(450)   
df2.train=df2 %>% filter(!is.na(carat)) %>% sample_n(6300) %>% arrange(id)
df2.test=df2 %>% filter((!is.na(carat))&(!id%in%df2.train$id)) %>% arrange(id)

Load the necessary packages and apply the KNN algorithm.

library(caret)
library(class)
knn.pred = knn(df2.train[,2:25], df2.test[,2:25], df2.train$carat, k=3) %>% as.character() %>% as.numeric()
rmse.knn = sqrt(mean((knn.pred - df2.test$carat)^2))

I tried different values for the number of neighbors and chose a small k = 3 based on RMSE. Be aware that small values of k can lead to overfitting. This article is for demo purposes, so I won’t go too deep into model tuning; the sketch below shows what such a comparison could look like.
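A minimal sketch of that kind of tuning loop (my addition, reusing the same training and test sets):

ks = c(3, 5, 10, 20)
rmse.by.k = sapply(ks, function(k) {
  pred = knn(df2.train[,2:25], df2.test[,2:25], df2.train$carat, k = k) %>%
    as.character() %>% as.numeric()
  sqrt(mean((pred - df2.test$carat)^2))   # test RMSE for this k
})
data.frame(k = ks, rmse = rmse.by.k)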

The next step is to use the KNN-predicted carat values to replace the missing carat values in df2. To do so, we create a subset data frame called df2.guess, composed of the 1,000 diamonds with missing carat information.

df2.guess = df2 %>% filter(is.na(carat))
knn.pred.carat = knn(df2.train[,2:25], df2.guess[,2:25], df2.train$carat, k=3) %>% as.character() %>% as.numeric()
df2.guess = df2.guess %>% mutate(carat=knn.pred.carat)

The final step is to insert the predicted carat values stored in df2.guess back into df.

df3=df %>%  mutate(carat = replace(carat, match(df2.guess$id, id), df2.guess$carat))

We have prepared another data frame with no missing values, df3, which is ready for future analysis.


I am a marketing professor and I teach BDMA (big data and marketing analytics) at Lazaridis School of Business, Wilfrid Laurier University, in Waterloo, Canada.