Two Weapons to Instantly Buff Your R Coding Power-Part 1

Martinqiu · Geek Culture · 6 min read · May 1, 2021

Yes, that’s correct. R beginners can quickly upgrade their R programming capabilities by mastering two things: pipe operations (%>%) and writing your own functions, for plotting, data wrangling, regression-result extraction, and much more.

Suppose you have just started to learn how to put R to work, like some of my MBA students who heard of MSE for the first time from me, and you are now happy that the code you write in RStudio works, e.g., running a logistic-regression-based classification. In that case, it is time to accelerate your R coding skills. By that I mean doing the same job with easier, shorter, clearer, less error-prone, more interpretable code. Unless you are paid by the line of code you write, you should seriously consider following what this article and the next one tell you to do. They will significantly enhance your coding technique.

This article is about using pipes, a style of chained operations supported jointly by two packages, dplyr and tidyr. As usual, I will focus on some of the most common scenarios.

What is a pipe operation? Beginning with a dataframe as the initial input, pipe operations use the result of the previous operation as the input for the next, continuing until you get the desired outcome (e.g., a modified dataframe, a plot). The pipe sign %>% connects the operations like a pipe, making the data flow smoothly without creating temporary objects.
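For instance, here is a minimal sketch of the pattern, using the built-in mtcars data purely for illustration (the 25 mpg threshold is arbitrary):

library(dplyr)   # load dplyr to get %>% and its verbs
# count the cars that get more than 25 mpg, without creating any temporary objects
mtcars %>% filter(mpg > 25) %>% nrow()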

Dataframe manipulation

First, we use a straightforward example to illustrate pipe operations. Using the diamonds data, select the 4Cs and price, keep diamonds with price > 10,000 and carat bigger than 1, compute the ratio of carat to price as a new variable, and store the result in a new dataframe called big.stone.df. Without piping, you would probably write code like this:

data("diamonds",package="ggplot2")
df1=diamonds[,c("carat","cut","clarity","color","price")]
big.stone.df=df1[df1$price>10000&df1$carat>1,]
big.stone.df$ratio=big.stone.df$carat/big.stone.df$price

Or you can use the names() function to look up the column IDs and refer to the columns by position.

names(diamonds)               # look up the column positions
df1 = diamonds[, c(1:4, 7)]   # columns 1-4 are the 4Cs; column 7 is price

Using pipes, it’s just one line (assuming dplyr is loaded with library(dplyr)).

diamonds %>% select(c(1:4, 7)) %>% filter(price > 10000 & carat > 1) %>% mutate(ratio = carat / price) -> big.stone.df

Note that the assignment of the result to a new dataframe can also be placed at the beginning. I prefer = to ->. Same result.

big.stone.df = diamonds %>% select(c(1:4, 7)) %>% filter(price > 10000 & carat > 1) %>% mutate(ratio = carat / price)

FYI, pipe operations also make it easy to debug your code and see where the “clog” is. Start from the beginning, highlight only the part you want to run in RStudio, and execute the chain pipe by pipe, as illustrated below.
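For example, you can run the chain from the example above in stages (a sketch of the debugging process):

diamonds %>% select(c(1:4, 7))                                           # stage 1: check the column selection
diamonds %>% select(c(1:4, 7)) %>% filter(price > 10000 & carat > 1)     # stage 2: add the row filter
diamonds %>% select(c(1:4, 7)) %>% filter(price > 10000 & carat > 1) %>% mutate(ratio = carat / price)   # stage 3: full chain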

FYII, pay attention to the functions used in the example: select() trims the columns, filter() trims the rows, and mutate() adds columns. These three functions are perhaps the most commonly used in pipe operations.

Practice by yourself. Suppose I want to trim the dataframe as follows: add an ID column that is just the row ID, keep only x, y, and z (yes, the last three columns in the diamonds data are called x, y, and z), and keep the rows where x > 5 or y < 6. How would you write the code?
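One possible answer, sketched below; row_number() supplies the row ID, and keeping the ID column alongside x, y, and z is my own reading of the exercise:

diamonds %>% mutate(ID = row_number()) %>% select(ID, x, y, z) %>% filter(x > 5 | y < 6) -> practice.df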

Data wrangling using pipe

The following example creates pivot-table-like output using pipes. I used a pipe example in my first Medium article on hooking up R and Excel, and I got several inquiries about it. Here is how it works. Suppose we want to know the average price and the average carat for each color type in the diamonds data.

diamonds %>% group_by(color) %>% summarize(mean.carat = mean(carat), mean.price = mean(price))

Use group_by() to apply the grouping, and then summarize() to compute the statistics. mean() can be replaced with other statistical functions of your choice (median(), sd(), etc.).
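For example, a sketch that reports the median price and its standard deviation by color instead:

diamonds %>% group_by(color) %>% summarize(med.price = median(price), sd.price = sd(price))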

If you want to get the frequency/count in each subgroup,

diamonds %>% group_by(color) %>% count(color)
diamonds %>% group_by(color) %>% summarize(freq=n())

The two lines above are equivalent.

If you want to count how many diamonds there are for each cut and color combination, just add cut to the group_by() arguments. Note that any variables not listed in group_by() are dropped from the output.

diamonds %>% group_by(color,cut) %>% summarize(freq=n())

This output is in “long” shape; use the spread() or pivot_wider() function to convert it to “wide” shape. For a detailed explanation of long and wide data and the operations on them, please refer to this article.

library(tidyr)   # pivot_wider() comes from tidyr
diamonds %>% group_by(color, cut) %>% summarize(freq = n()) %>% pivot_wider(names_from = cut, values_from = freq, values_fill = 0)

The example above illustrates the convenience of pipe operations: just add another pipe for each new manipulation. The result can be saved to a dataframe by adding -> dataframe.name at the end.
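For instance, the wide table above can be kept for later use like this (cut.color.table is just an illustrative name):

diamonds %>% group_by(color, cut) %>% summarize(freq = n()) %>% pivot_wider(names_from = cut, values_from = freq, values_fill = 0) -> cut.color.table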

Plotting and Piping

The third example is piping into a plot. Assume we want to create a scatter plot of price against carat for a random sample of 3,000 diamonds, colored by cut, and add a simple regression line to fit the data.

library(ggplot2)   # ggplot() and the diamonds data come from ggplot2
set.seed(450)
diamonds %>% sample_n(3000) %>%
  ggplot() +
  geom_point(aes(x = carat, y = price, color = cut)) +
  geom_smooth(aes(x = carat, y = price), color = "sandybrown", method = "lm", se = T)

I use a simple ggplot to illustrate the pipe operations. After the manipulation, the data is passed as input to ggplot(). The rest is the usual ggplot routine. You can add other plotting elements and aesthetic styles just as you would in normal ggplot code.

FYIII, sample_n() is a convenient way to draw a random sample within pipe operations. set.seed(450) guarantees that you get the same random sample each time, so you can replicate your results.

Piping out regression results

The last example pipes out regression results. Suppose I want to study the linear relationship between price and x, y, z for selected diamonds, and I only need the coefficient estimates and p-values.

set.seed(450)
diamonds %>% sample_n(3000) %>% filter(carat > 1 | color == "E") %>% lm(price ~ x + y + z, data = .)

Above is our initial attempt. We pipe directly into the lm() function. Note the data argument in lm(): data = .

The . stands in for the piped input. Pipe operations work best with functions that accept the data as the first argument, e.g., select(data, col.names), filter(data, conditions). For functions where the data argument is not the first, such as lm(formula, data), we use . as a placeholder for whatever has been piped in. You cannot write data = diamonds here, because what is passed to lm() is no longer the original diamonds data; it has already been sampled and filtered.
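To make the contrast concrete, here is a small sketch: for a data-first function the pipe supplies the input automatically, while lm() needs the . placeholder:

diamonds %>% filter(carat > 1)              # equivalent to filter(diamonds, carat > 1)
diamonds %>% lm(price ~ carat, data = .)    # lm() takes a formula first, so the piped data goes in via .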

We can get a more detailed result by adding summary() at the end of the pipe. Some people may be satisfied with this, but it is still not quite what we want.

set.seed(450)
diamonds %>% sample_n(3000) %>% filter(carat > 1 | color == "E") %>% lm(price ~ x + y + z, data = .) %>% summary()

We can save the summary as an object, lm.sum, and play with it.

set.seed(450)
diamonds %>% sample_n(3000) %>% filter(carat > 1 | color == "E") %>% lm(price ~ x + y + z, data = .) %>% summary() -> lm.sum
lm.sum   # check
unlist(lm.sum$coefficients[, c(1, 4)]) %>% as.data.frame() %>% setNames(c("Coefficient", "p-value"))

If you are familiar with the summary output of a linear regression, you know the coefficient estimates are stored in column 1 and the p-values in column 4 of the coefficients table. A simple unlist() extracts them, and another pipe converts them to a data.frame (useful for export if that is your aim). setNames() lets you customize the column names.

FYIV, students in my big data marketing course learn pipe operations on Day 1. They can do it, and so can you.

If you feel you have learned something from this article, please continue to the next level: Part II, building your own R functions.

Article series devoted to the tools of the trade.


I am a marketing professor and I teach BDMA (big data and marketing analytics) at Lazaridis School of Business, Wilfrid Laurier University, in Waterloo, Canada.