Juggling Wide and Long Types Data in R: A Nice and Easy Approach

Martinqiu
CodeX
Published in
9 min readMay 18, 2021

--

You may hear people talking about wide and long type data. What is wide type data? Here is an example of the investment portfolio of five individuals. The first column records individual IDs (1 to 5), and the rest record the amount invested in the three types of assets: stock.market, mutual.fund, and bank.deposit by each individual.

Wide type is the common type of data. It is probably the default data type if we want to summarize similar multiple dimensional information across individuals (or, in a more general sense, subjects). A sales manager may use wide data to trace the monthly performance of salespersons, a teacher may use wide data to record student grades in different tests, and so on.

Sometimes data analyses require that the data shall be provided in a long type. A long data records the same amount of information but presents the information differently. By its name it is longer than a wide data but narrower. If we present the investment portfolio in a long data; it looks like this:

The first three rows belong to individual one. Her gender and age information repeats three times (why three times?), and so for every individual. We can see two new columns are created: one column stores the type of assets, and the other stores the investment amount in that asset.

Why bothers the transformation? There are many reasons. Many analyses in R require the data to be long type; some ggplot tasks also require a long data for the sake of co-plotting multiple variables and legend generation. Or you only have wide data, and a programming language accepts only long type, and you have to wrangle the data in R to export.

For example, how to use the above data to plot a bar plot that shows the relative investment portion of each asset? Without wide/long transofrmation the job is difficult.

Hence, at this stage of R learning, it is highly desirable to master a nice and easy trick to convert wide to long, and long to wide. It solves the headache that shouldn’t be a headache.

In R, people used to employ dcast and melt functions in Reshape 2 package, and later spread and gather functions in tidyr package to do the job. Each have their pros and cons. With the introduction of pivot _longer and pivot_wider functions in the new tidyr package, we finally have this nice and easy tool.

The above investment portfolio data is an ideal case for transformation since every individual invests in all three assets. Using pipe operations, it is easy to replicate the following result on your own.

Copy the whole code below to your R Studio to import the wide data, call it df.wide. Some illustrations on read.table function can be found in this article.

library(tidyr)df.wide=read.table(text="
ID Gender Age Stock.market mutual.fund Bank.deposit
1 F 45 55 22 32
2 M 41 15 50 45
3 F 56 27 21 14
4 M 60 12 63 58
5 F 39 37 34 20
",header =T)
df.wide %>% pivot_longer(4:6,names_to=”Asset.type”,values_to=”Amount”)->df.long

In the code above, 4:6 is the column IDs that we want R to make the transformation; furthermore, we designated a new column using names_to argument, which means you want to put column names of 4:6 (stock.market, mutual.fund etc) to this new column, and call it Asset.type. We also designate a new column using values_to argument to store the corresponding amount and name this new column Amount.

Correspondingly, pivot_wider transforms back long data to wide, with the switching arguments names_from and values_from.

df.long %>% pivot_wider(names_from=”Asset.type”,values_from=”Amount”)

In the line above, the quotation marks can be removed. But for the consistency with pivot_longer, we’d better keep them.

I didn’t save the above result to a new data.frame since I already have the wide data. And you can check that the transformation is complete.

Nice and easy, right? Now let’s look at some more complicated scenarios.

Incomplete data. They create headaches when we transform from long to wide. For example, below we remove the last two rows of df.long. Hence, there is no information about individual 5’s investments in mutual fund and bank deposit. We call this new long data df.long2,

df.long2=read.table(text="Customer.ID Gender Age Asset.type Amount
1 F 45 Stock.market 55
1 F 45 mutual.fund 22
1 F 45 Bank.deposit 32
2 M 41 Stock.market 15
2 M 41 mutual.fund 50
2 M 41 Bank.deposit 45
3 F 56 Stock.market 27
3 F 56 mutual.fund 21
3 F 56 Bank.deposit 14
4 M 60 Stock.market 12
4 M 60 mutual.fund 63
4 M 60 Bank.deposit 58
5 F 39 Stock.market 37
",header =T)

which looks like this:

Customer.ID Gender Age   Asset.type amount
1 1 F 45 Stock.market 55
2 1 F 45 mutual.fund 22
3 1 F 45 Bank.deposit 32
4 2 M 41 Stock.market 15
5 2 M 41 mutual.fund 50
6 2 M 41 Bank.deposit 45
7 3 F 56 Stock.market 27
8 3 F 56 mutual.fund 21
9 3 F 56 Bank.deposit 14
10 4 M 60 Stock.market 12
11 4 M 60 mutual.fund 63
12 4 M 60 Bank.deposit 58
13 5 F 39 Stock.market 37

if we transform directly from long to wide using the code above, we end up with this dataframe:

Customer.ID Gender   Age Stock.market mutual.fund Bank.deposit
<int> <chr> <int> <int> <int> <int>
1 1 F 45 55 22 32
2 2 M 41 15 50 45
3 3 F 56 27 21 14
4 4 M 60 12 63 58
5 5 F 39 37 NA NA

The wide data is still created but R uses NA to replace missing values. If we are certain that individual five invests zeros in those two assets, we can replace NA with zero. In the code we add to pivot_wider with the argument of values_fill=0.

df.long2 %>% pivot_wider(names_from="Asset.type",values_from="Amount",values_fill=0)

You can replace zero with any values you want.

Customer.ID Gender   Age Stock.market mutual.fund Bank.deposit
<int> <chr> <int> <int> <int> <int>
1 1 F 45 55 22 32
2 2 M 41 15 50 45
3 3 F 56 27 21 14
4 4 M 60 12 63 58
5 5 F 39 37 0 0

Another complicated scenario is the opposite: an individual has multiple rows mapping to one column. For example, in the above investment portfolio dataframe, individual one has two rows of bank.deposit, one credited with an amount equal to 32, and the other with 70.

df.long3=read.table(text="Customer.ID Gender Age Asset.type Amount
1 F 45 Stock.market 55
1 F 45 mutual.fund 22
1 F 45 Bank.deposit 32
1 F 45 Bank.deposit 70

2 M 41 Stock.market 15
2 M 41 mutual.fund 50
2 M 41 Bank.deposit 45
3 F 56 Stock.market 27
3 F 56 mutual.fund 21
3 F 56 Bank.deposit 14
4 M 60 Stock.market 12
4 M 60 mutual.fund 63
4 M 60 Bank.deposit 58
5 F 39 Stock.market 37
",header =T)

If we use pivot_wider directly, we get error messages. This is because we need to answer a question first: in the transformed wide data, what information do we want to put down for individual one’s bank deposit? It is a sum of 32 and 7, or the average of 32 and 70, or the maximum of 32 and 70? We can proceed only after we tell pivot_wider about our decision.

Suppose we want to get the sum of the bank deposit amounts. To do so we add values_fn=sum to tell R to conduct summation over across all individuals and their asset types.

df.long3 %>% pivot_wider(names_from="Asset.type",values_from="Amount",values_fill=0,values_fn=sum)

Of course only one summation is executed: adding the two bank deposits of individual one and storing the value (32+70=102) to the Bank.deposit column in the newly transformed wide data.

Customer.ID Gender   Age Stock.market mutual.fund Bank.deposit
<int> <chr> <int> <int> <int> <int>
1 1 F 45 55 22 102
2 2 M 41 15 50 45
3 3 F 56 27 21 14
4 4 M 60 12 63 58
5 5 F 39 37 0 0

Next, we consider a more complicated case where you intend to tranform a wide data with two categories. To form the example, we add three columns (Grocery, Gas, and Rent) to the investment porfolio dataframe to represent those individuals’ spending.

We want the transformed data to look like this.

Here is the code. We use pivot_longer twice. And each time we specify the columns to store the item names and values, respectively. Also it is useful to execute the first pivot_longer function first to identify the column IDs for grocery, gas, and rent.

df.wide2=read.table(text="ID Gender Age Stock.market mutual.fund Bank.deposit Grocery Gas Rent
1 F 45 55 22 32 4 1 3
2 M 41 15 50 45 4 2 1
3 F 56 27 21 14 2 1 6
4 M 60 12 63 58 2 1 1
5 F 39 37 34 20 3 1 4
",header =T)
df.wide2 %>% pivot_longer(4:6,names_to="Asset.type",values_to="Amount") %>% pivot_longer(4:6,names_to="Spending.type",values_to="Spending")

You shall have the following wide data frame generated.

ID Gender   Age Asset.type   Amount Spending.type Spending
<int> <chr> <int> <chr> <int> <chr> <int>
1 1 F 45 Stock.market 55 Grocery 4
2 1 F 45 Stock.market 55 Gas 1
3 1 F 45 Stock.market 55 Rent 3
4 1 F 45 mutual.fund 22 Grocery 4
5 1 F 45 mutual.fund 22 Gas 1
6 1 F 45 mutual.fund 22 Rent 3
7 1 F 45 Bank.deposit 32 Grocery 4
8 1 F 45 Bank.deposit 32 Gas 1
9 1 F 45 Bank.deposit 32 Rent 3
10 2 M 41 Stock.market 15 Grocery 4
# ... with 35 more rows

Finally, sometimes we have multiple column values in a hybrid long-wide type that we need to transform, as depicted below.

In the dataframe above, there are 2019 and 2020 values for each asset type, presented in two separate columns (wide type). Yet for asset type each ID occupies two rows (long type).

Let’s import the data.

df.long4=read.table(text="ID Gender Age Asset.type Amount.2020 Amount.2019
1 F 45 Stock.market 55 50
1 F 45 mutual.fund 22 20
2 M 41 Stock.market 15 10
2 M 41 mutual.fund 50 45
",header =T)

We first rename the two columns, Amount.2020 and Amount.2019 to Y2020 and Y2019, respectively to create easy-to-interpret names for new columns. Then we use pivot_wider and specify a vector of values to be transformed (Y2020 andY2019). Furthermore, we specify the format of the new column names as Asset.type_Year using names_glue.

df.wide3=df.long4%>% rename(Y2020=Amount.2020,Y2019=Amount.2019)%>% pivot_wider(names_from=Asset.type,values_from=c(Y2020,Y2019),names_glue = "{Asset.type}_{.value}")

The newly generated wide data looks as below

We can further transform this wide data back to a “complete” long data with each ID occupying four rows. We use “_” to separate names_to argument into multiple columns to make the summary clearer.

df.wide3%>%
pivot_longer(contains("_"),names_to=c("Asset","Year"),values_to="Amount",names_sep="_")

Last of last, regarding the plot appearing at the beginning of the article. Having learned what I have described in this article, you can plot it, nice and easy!

library(ggplot2)df.wide %>% pivot_longer(4:6,names_to=”Asset.type”,values_to=”Amount”)%>% ggplot() + geom_col(aes(x=ID,y=Amount,fill=Asset.type),position=”stack”) +scale_fill_manual(values=c(“sandybrown”,”darkblue”,”turquoise”))

#Hope this article makes long/wide transformation never an issue for you.

--

--

Martinqiu
CodeX

I am a marketing professor and I teach BDMA (big data and marketing analytics) at Lazaridis School of Business, Wilfrid Laurier University, in Waterloo, Canada.