Functional Programming in R: A Weapon Instantly Buffing Your Coding Power

Martinqiu
6 min read · May 2, 2021

I introduced pipe operations in Part 1. Here I walk through the other powerful weapon in R programming: functions. The idea is that anything you expect to do multiple times is a candidate for a function, e.g., running multiple regressions or making a series of similar plots. With pipe operations embedded within functions, you take full advantage of R programming.

Let's begin with a simple example. Suppose you are given a small arithmetic homework by your kid to find out the sum of squares of all the integers between a given number, say 4, and 100. You don't need a function to get the answer:

x=4
value=seq(from=x,to=100,by=1)
sum(value^2)

But what if your kid's teacher asks for the same computation for every starting number from 5 to 20? Or for any integer less than 100? You certainly can repeat the above code 16 times or more. But R functions offer an easier approach.

We now formally delegate that small job to a function. We will use the example to explain what is needed to write an R function.

myfunc1=function(x){
value=seq(x,to=100,by=1) # every integer from x to 100
y=sum(value^2)
return(y)
}

To create a function, you need to specify a function name (myfunc1 here) and, in the parentheses, an argument, called x here. You can call the argument whatever you want, but make sure the same name is used inside the function body, or the function will not work. Between the two braces ({}) you write the code that you want the function to execute based on your input. In the above example, x is used to generate a sequence of integers. You don't use punctuation to separate different lines of code; use soft or hard returns. When you have the function ready, load it into R by moving the cursor to the very end of the closing brace (}) and clicking "Run" or pressing "Ctrl"+"Enter". Your function is now ready to be called.

Function myfunc1 is now created and loaded. To use myfunc1 to do the homework, you supply it with the arguments: myfunc1(5), myfunc1(6), all the way to myfunc1(20). Again, that's not a good approach since 1) we are still typing something repetitively, and 2) the results are not put together.

The code below is the way to go; the explanation follows.

v1=5:20
v2=sapply(v1,function(x)myfunc1(x))

sapply allows a self-defined R function, myfunc1, to be applied to every element of a vector, v1, which sequences from 5 to 20. The result is stored in a new vector called v2. You can use cbind(v1,v2) to create a matrix, which can be further converted to a dataframe using as.data.frame:

as.data.frame(cbind(v1,v2))
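
As a side note, the same table can be built in one step. The sketch below is one possible variation, not the only way: it redefines the sum-of-squares helper under the hypothetical name sum_sq_to_100 so the block runs on its own, and it uses vapply, a stricter cousin of sapply that checks that every result is a single number. The column names start and sum_sq are my own choices.

```r
# Sum of squares of the integers from x to 100 (same job as myfunc1)
sum_sq_to_100 <- function(x) {
  value <- seq(from = x, to = 100, by = 1)
  sum(value^2)
}

v1 <- 5:20
# vapply() works like sapply() but insists every result is one numeric value
v2 <- vapply(v1, sum_sq_to_100, numeric(1))

# data.frame() builds the table directly, with readable column names
result <- data.frame(start = v1, sum_sq = v2)
head(result, 3)
```

Passing the function name directly (sum_sq_to_100) does the same thing as wrapping it in function(x) sum_sq_to_100(x); the wrapper is only needed when extra arguments have to be fixed.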

The above example illustrates the basic syntax of R functions and how to use them. Next, we move a bit further into functional programming in R.

1. Adding warning messages

Now suppose you want myfunc1 to send a warning message: if x is not an integer, or is 100 or greater, stop the execution of the rest of the code in the function. We add a stop statement to the function. The message is raised as an error, similar to those you come across when a built-in R function is called incorrectly.

myfunc2=function(x){
if(round(x)!=x|x>=100) stop('your entered value is not correct!')
value=seq(from=x,to=100,by=1)
sum(value^2)
}
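
To see the check in action without halting a whole script, one option is to wrap the call in tryCatch. This is a minimal sketch: the function body repeats myfunc2 from above, and safe_call is a hypothetical helper name of mine that returns the error text instead of stopping.

```r
myfunc2 <- function(x) {
  if (round(x) != x | x >= 100) stop("your entered value is not correct!")
  value <- seq(from = x, to = 100, by = 1)
  sum(value^2)
}

# Hypothetical helper: run myfunc2 and hand back the error message on failure
safe_call <- function(x) {
  tryCatch(myfunc2(x), error = function(e) conditionMessage(e))
}

safe_call(98)    # valid input: 98^2 + 99^2 + 100^2
safe_call(4.5)   # not an integer: returns the error message
safe_call(150)   # not below 100: returns the error message
```

Without tryCatch, the second and third calls would stop with an error, which is usually what you want while developing.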

2. Multiple function outputs

Sometimes we expect functions to give multiple outputs. For example, you want to generate the mean and the standard deviation in addition to the sum of squares. Instead of having multiple functions, one for sum, one for mean, and one for sd, we can add a return command and create a list of multiple outputs.

myfunc3=function(x){
if(round(x)!=x|x>=100) stop('your entered value is not correct')
value=seq(from=x,to=100,by=1)
output1=mean(value^2)
output2=sd(value^2)
output3=sum(value^2)
return(list(output1,output2,output3))
}

To apply myfunc3 to vector v1, and have the result stored nicely in a dataframe, we use the following code.

library(dplyr) # provides the %>% pipe
v1 %>% sapply(function(x)myfunc3(x)) %>% t() %>% as.data.frame() %>% setNames(c("mean","sd","sum"))

t() is used to transpose the output from sapply so the results are given by columns. setNames is used to change the column names within the pipe. You can execute the code pipe by pipe, as I illustrated in Part 1, to see the flow of data. The result is a dataframe with one row for each value in v1 and three columns: mean, sd, and sum.
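
One variation worth considering: if the list elements are named inside the function, the names travel with the results and setNames is no longer needed. The sketch below is a rewrite under my own assumptions (myfunc3_named is a hypothetical name) and uses base R only.

```r
myfunc3_named <- function(x) {
  if (round(x) != x | x >= 100) stop("your entered value is not correct")
  value <- seq(from = x, to = 100, by = 1)
  # Name each element so callers can write out$mean, out$sd, out$sum
  list(mean = mean(value^2), sd = sd(value^2), sum = sum(value^2))
}

out <- myfunc3_named(98)
out$sum

# Stack the results for several inputs into one dataframe, row by row
v1 <- 5:20
rows <- lapply(v1, function(x) as.data.frame(myfunc3_named(x)))
result <- do.call(rbind, rows)
head(result, 3)
```

Here each call produces a one-row dataframe, and do.call(rbind, rows) binds them together, so the columns are plain numeric vectors.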

3. Functions for running regressions

R functions can take objects, rather than single values as in the above examples, as inputs. Suppose I want to run a regression on different datasets to get the best fit. In practice, that is called data mining and is not recommended. But for illustration purposes, we use the diamonds data that ships with ggplot2. We want to run a regression model of diamond price on diamond features for multiple datasets.

reg.func=function(df){
reg=lm(price~carat+cut+clarity+color,data=df)
sum=summary(reg)
result= unlist(sum$coefficients[,c(1,3,4)]) %>% as.data.frame() %>% round(3)%>% setNames(c("Coefficient","t stats","p-value"))
return(result)
}

I now have the function reg.func written. It takes a dataframe as input and returns another dataframe as output. I randomly sample 500 diamonds and apply reg.func. You can check the results.

library(dplyr) # sample_n() and the %>% pipe
library(ggplot2) # the diamonds data
set.seed(450)
dfx=sample_n(diamonds, 500)
reg.func(dfx)
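
Since the stated goal is to run the same model on multiple datasets, here is one hedged way to do that with lapply. The sketch assumes ggplot2 is installed (for the diamonds data), redefines a pared-down version of the function under the hypothetical name reg_table so the block runs alone, and draws the samples with base R rather than sample_n.

```r
library(ggplot2)  # only needed for the diamonds data

# Pared-down version of reg.func: coefficient, t statistic, p-value
reg_table <- function(df) {
  reg <- lm(price ~ carat + cut + clarity + color, data = df)
  coefs <- summary(reg)$coefficients[, c(1, 3, 4)]
  result <- round(as.data.frame(coefs), 3)
  setNames(result, c("Coefficient", "t stats", "p-value"))
}

set.seed(450)
# Draw three random samples of 500 diamonds each
samples <- lapply(1:3, function(i) diamonds[sample(nrow(diamonds), 500), ])

# One regression table per sample, returned as a list
tables <- lapply(samples, reg_table)
length(tables)
head(tables[[1]], 3)
```

The result is a list with one table per sample, so tables[[2]] gives the second regression, and so on.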

Other objects, such as formulas and variable names, can also be used as function inputs. Suppose we want to run a regression of price on one categorical variable (either cut, color, or clarity) to compare the influence of different feature dimensions on price. Again, repetitive work calls for a function. (The following code is a bit of a jump in difficulty.)

reg.func2=function(x,df){
lm.formula=as.formula(paste0("price~",x))
reg=lm(lm.formula,data=df)
sum=summary(reg)
result=unlist(sum$coefficients[,c(1,3,4)]) %>% as.data.frame() %>% round(3)%>% setNames(c("Coefficient","t stats","p-value"))
return(result)
}
reg.func2("cut", dfx)#"cut" is quoted.

In the code above, you notice that the function takes two arguments (x, the feature we use in the regression, and df, the dataframe on which the regression is run). You also notice that I need to specify lm.formula for the linear regression, which is created from the function argument x via paste0 and as.formula. The rest is the same. The output has the same three columns as before, with one row per coefficient.
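
To make the formula step concrete, the sketch below builds the formula on its own, outside any regression. It also shows base R's reformulate, which is an alternative I am adding for comparison, not something the function above uses. The object names f1, f2, and f3 are hypothetical.

```r
x <- "cut"

# The approach used in reg.func2: paste a string, then convert it
f1 <- as.formula(paste0("price~", x))

# A base R alternative: reformulate() builds the formula from names
f2 <- reformulate(x, response = "price")

f1
f2

# reformulate() also accepts several predictors at once
f3 <- reformulate(c("carat", "cut"), response = "price")
f3
```

Both f1 and f2 print as price ~ cut, so either can be passed to lm.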

Suppose someone does not want to use a string-type input in a function and would rather pass the argument without quotation marks. We just add a line, deparse(substitute(x)), to extract the string value from the input.

reg.func3=function(x,df){
x_name <- deparse(substitute(x))
formula=as.formula(paste0("price~",x_name))
reg=lm(formula,data=df)
sum=summary(reg)
result=unlist(sum$coefficients[,c(1,3,4)]) %>% as.data.frame() %>% round(3)%>% setNames(c("Coefficient","t stats","p-value"))
return(result)
}
reg.func3(color,dfx)#color has no quotation marks

We use color as the input for x in this new function to have some variety for the output.
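
To see what deparse(substitute(x)) does by itself, here is a tiny sketch. The helper name arg_name is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: return the argument exactly as the caller typed it
arg_name <- function(x) {
  deparse(substitute(x))
}

arg_name(color)      # gives "color"; no object called color needs to exist
arg_name("color")    # a quoted input keeps its quotes inside the string
```

The second call shows why a function written this way expects an unquoted name: a quoted input would carry its quotation marks into the formula text.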

The string value of a function argument is useful to customize plotting elements such as title, which leads to the last example of this article.

4. Functions for creating plots

Suppose we want to select a categorical feature/variable, group by its levels, and count how many diamonds are in each group. Then we use a bar plot to illustrate the counts with a title saying “Diamond Counts by Different Groups of xxx”, where xxx is the feature we select. We want to create one plot for each of the three categorical features, cut, color, and clarity, and save them to a local drive.

library(dplyr) # group_by(), summarize(), and the %>% pipe
library(ggplot2)
plot.func=function(x){
x_name <- deparse(substitute(x))
plot= diamonds %>% group_by({{x}}) %>% summarize(count=n()) %>% ggplot()+ geom_col(aes(x={{x}},y=count,fill={{x}}))+
labs(title=paste("Diamond Counts by Different Groups of" ,toupper(x_name)),x=x_name,y="count")
return(plot)
}

We first nail the plot job using the above function plot.func. Note that x_name is a string extracted from the function argument x, and it enters the title and x-axis label arguments for ggplot2. To refer to the function argument in the pipe operations on the dataframe, use {{x}}.
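
The {{x}} step can be tried in isolation, without the plot. The sketch below assumes dplyr and ggplot2 are installed; count_by is a hypothetical helper name, and passing the dataframe as an argument is my own addition.

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds data

# Hypothetical helper: count rows per level of an unquoted column name
count_by <- function(df, x) {
  df %>%
    group_by({{ x }}) %>%      # {{ }} passes the caller's column through
    summarize(count = n())
}

counts <- count_by(diamonds, cut)
counts
```

Calling count_by(diamonds, color) or count_by(diamonds, clarity) works the same way, which is the point of writing the function.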

Here I also provide the version where the function argument is a quoted string, together with the code that saves the plot to a file.

plot.func1=function(x){
plot= diamonds %>% group_by(!!sym(x)) %>% summarize(count=n()) %>% ggplot()+ geom_col(aes(x=reorder(!!sym(x),-count),y=count,fill=!!sym(x)))+
labs(title=paste("Diamond Counts by Different Groups of" ,toupper(x)),x=x,y="count")
ggsave(paste0("F:/count.plot_",x,".jpg"),plot,width = 8, height = 6, units = "in")
}

In this version of the function, x is a string, so it can directly enter into plot title and labels. In addition, it also enters the file name in the ggsave argument. However, to refer to the name of the variable it represents, we need to use !!sym(x). This is the trickiest part in this article. It opens the door to the generation of dynamic variable names in R functions, which is beyond the scope of this article.
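
As a possible alternative to !!sym(x), dplyr and ggplot2 also accept the .data pronoun, which looks a column up by its string name. The sketch below is a variation under that assumption, not the function used in this article: count_by_name and count_plot are hypothetical names, the bars are not reordered, and nothing is saved to disk.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: count diamonds per level of a column given as a string
count_by_name <- function(x) {
  diamonds %>%
    group_by(.data[[x]]) %>%
    summarize(count = n())
}

# Hypothetical helper: bar plot of those counts, labelled with the string
count_plot <- function(x) {
  count_by_name(x) %>%
    ggplot() +
    geom_col(aes(x = .data[[x]], y = count, fill = .data[[x]])) +
    labs(title = paste("Diamond Counts by Different Groups of", toupper(x)),
         x = x, y = "count", fill = x)
}

counts <- count_by_name("clarity")
p <- count_plot("clarity")
```

Printing p draws the plot; it can also be handed to ggsave in the same way as in plot.func1.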

The final step is to create a vector of diamond features and use sapply to apply the function we just wrote to it.

feature=c("cut","clarity","color")
sapply(feature,plot.func1)

You should see the three plots quietly lying in your F drive.

This article is part of a series devoted to the tools of the trade. I hope you find it useful. When you master pipe operations and functional programming, writing R code will turn into something that you enjoy and miss.

Martinqiu

I am a marketing professor and I teach BDMA (big data and marketing analytics) at Lazaridis School of Business, Wilfrid Laurier University, in Waterloo, Canada.