How do I calculate new variables in R long tables?

Question

I have a long table with multiple variables (CPI - Workers, CPI - Consumers, (Seas) Unemployment Level (thous) and many more, but am truncating the data set to 3 variables and 6 time periods for brevity. I want to create a new variable, which is a combination of the first two. Let's call it CPI - Average which of course is simply the average of the first two or (CPI - Workers + CPI - Consumers) / 2. This is a simple calculation in a wide table, however, in order to satisfy ggplot, I have stored my data in long form.

Note, I am storing all of my variables in one long table. When I need to visualize the trend, I filter to the desired variable or variables in my ggplot command.

My question is how do I create the new variable without first converting the data to wide format?

First, here is my dataset:

DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date 
1994  M01    January 143.8 "CPI - Workers" 1994-01-01
1994  M02   February 144.0 "CPI - Workers" 1994-02-01
1994  M03      March 144.3 "CPI - Workers" 1994-03-01
1994  M04      April 144.5 "CPI - Workers" 1994-04-01
1994  M05        May 144.8 "CPI - Workers" 1994-05-01
1994  M06       June 145.3 "CPI - Workers" 1994-06-01
1994  M01    January 146.3 "CPI - Consumers" 1994-01-01
1994  M02   February 146.7 "CPI - Consumers" 1994-02-01
1994  M03      March 147.1 "CPI - Consumers" 1994-03-01
1994  M04      April 147.2 "CPI - Consumers" 1994-04-01
1994  M05        May 147.5 "CPI - Consumers" 1994-05-01
1994  M06       June 147.9 "CPI - Consumers" 1994-06-01
1994  M01    January  8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994  M02   February  8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994  M03      March  8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994  M04      April  8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994  M05        May  7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994  M06       June  7927 "(Seas) Unemployment Level (thous)" 1994-06-01
'))

Second, here is what the output of the calculation should look like:

DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date 
1994  M01    January 143.8 "CPI - Workers" 1994-01-01
1994  M02   February 144.0 "CPI - Workers" 1994-02-01
1994  M03      March 144.3 "CPI - Workers" 1994-03-01
1994  M04      April 144.5 "CPI - Workers" 1994-04-01
1994  M05        May 144.8 "CPI - Workers" 1994-05-01
1994  M06       June 145.3 "CPI - Workers" 1994-06-01
1994  M01    January 146.3 "CPI - Consumers" 1994-01-01
1994  M02   February 146.7 "CPI - Consumers" 1994-02-01
1994  M03      March 147.1 "CPI - Consumers" 1994-03-01
1994  M04      April 147.2 "CPI - Consumers" 1994-04-01
1994  M05        May 147.5 "CPI - Consumers" 1994-05-01
1994  M06       June 147.9 "CPI - Consumers" 1994-06-01
1994  M01    January  8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994  M02   February  8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994  M03      March  8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994  M04      April  8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994  M05        May  7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994  M06       June  7927 "(Seas) Unemployment Level (thous)" 1994-06-01
1994  M01    January 145.05 "CPI - Average" 1994-01-01
1994  M02   February 145.35 "CPI - Average" 1994-02-01
1994  M03      March 145.70 "CPI - Average" 1994-03-01
1994  M04      April 148.85 "CPI - Average" 1994-04-01
1994  M05        May 146.15 "CPI - Average" 1994-05-01
1994  M06       June 146.60 "CPI - Average" 1994-06-01
'))

The fourth variable (CPI - Average) takes the average of the first two for each date. Please ignore the fact that this average makes no sense economically, I just wanted a simple calculation for the example.

Such calculations are quite straight forward in wide format. So let's first transform the data to wide and then make the calculation.

DT_wide <- DT_long %>% pivot_wider(names_from = variable_name, values_from = value)

DT_wide_with_average <- DT_wide %>% mutate(`CPI - Average` = (`CPI - Workers` + `CPI - Consumers`) / 2)

This takes the wide table and adds a new column with the calculated results:

DT_wide_with_average <- as.data.table(read.table(header=TRUE, text='year period periodName date `CPI - Workers` `CPI - Consumers` `(Seas) Unemployment Level (thous)` `CPI - Average`
1994 M01  January  1994-01-01  144.    146.       8630        145.
1994 M02  February 1994-02-01  144     147.       8583        145.
1994 M03  March    1994-03-01  144.    147.       8470        146.
1994 M04  April    1994-04-01  144.    147.       8331        146.
1994 M05  May      1994-05-01  145.    148.       7915        146.
1994 M06  June     1994-06-01  145.    148.       7927        147.
'))

Please ignore the fact that the decimals have been truncated by pivot_wider.

Working in wide mode, creating variables, analyzing them, revising the calculations, reordering the column orders, deleting unneeded columns is the way we mere humans think when analyzing simple data tables.

Unfortunately, ggplot requires the long format, considered "tidy" by the gods of R, but is quite messy in the eyes of us mere mortals. Sorry for the dig, but if I were to stack my couch, table, chairs, lamp and rug in one corner of the room, it would be quite messy, while if I left them as I normally keep my room, they would be quite tidy. In the real world, I might stack the furniture in one corner in order to paint the room or sand the floors. This would be useful for the task at hand, but it would be considered messy and not useful for ordinary living. So, considering long tables as tidy and wide tables as messy is counterintuitive. It took me a long time to figure out this counterintuitive logic when I was first introduced to tidyverse. Sorry for the rant, but hopefully it is useful customer feedback to the gods of R. At a minimum, it would be helpful to R learners if the gods would admit to the counterintuitive nomenclature. If I am warned, before entering the bathroom, that the faucet handle with an "H" is the cold water and the one with a "C" is the hot water, I am less likely to scald my hands!

Data analysis is iterative. I don't want to have to take the following steps for each iteration:

pivot_wider
calculate new variable
pivot_longer
examine trend in ggplot

I would rather:

calculate new variable
examine trend in ggplot

In short, I want to focus on my economic analysis, rather than on unnecessary R programming.

So, how can I select a subset of variables from my long format table, use them in a calculation to create a new variable and ensure that the new variable is rbind-ed to the end of my long table...without having to convert to wide format?

Thanks for your help!

Jon Spring Jon Spring · Accepted Answer · 2021-02-16T21:43:44

How about this?

bind_rows(
  DT_long,
  DT_long %>%
    filter(variable_name %>% str_detect("CPI")) %>%
    group_by(year, period, periodName, date) %>%
    summarize(value = mean(value)) %>%
    mutate(variable_name = "CPI - Average")
)

In this case the math can be accomplished by a mean across the group, but this assumes that both Workers / Consumer CPI are present, and each only once within each group, and you want them evenly weighted. It could get more convoluted, and in many cases you're absolutely right that many calculations that involve relationships between variables are much more straightforward in wide format.

(Especially in a case like this where it's a gray area about whether these different data points are truly different observations or different dimensions of the same "economic snapshot" observation, so arguably your wide version is already "tidy.")

How do I calculate new variables in R long tables?

3 Answers

Wide Table Calculations Across Columns, within each Row, Applied to Long Tables

Wide Table Calculations Across Rows, within each Column, Applied to Long Tables