2
votes

I have a long table with multiple variables (CPI - Workers, CPI - Consumers, (Seas) Unemployment Level (thous) and many more, but am truncating the data set to 3 variables and 6 time periods for brevity. I want to create a new variable, which is a combination of the first two. Let's call it CPI - Average which of course is simply the average of the first two or (CPI - Workers + CPI - Consumers) / 2. This is a simple calculation in a wide table, however, in order to satisfy ggplot, I have stored my data in long form.

Note, I am storing all of my variables in one long table. When I need to visualize the trend, I filter to the desired variable or variables in my ggplot command.

My question is how do I create the new variable without first converting the data to wide format?

First, here is my dataset:

DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date 
1994  M01    January 143.8 "CPI - Workers" 1994-01-01
1994  M02   February 144.0 "CPI - Workers" 1994-02-01
1994  M03      March 144.3 "CPI - Workers" 1994-03-01
1994  M04      April 144.5 "CPI - Workers" 1994-04-01
1994  M05        May 144.8 "CPI - Workers" 1994-05-01
1994  M06       June 145.3 "CPI - Workers" 1994-06-01
1994  M01    January 146.3 "CPI - Consumers" 1994-01-01
1994  M02   February 146.7 "CPI - Consumers" 1994-02-01
1994  M03      March 147.1 "CPI - Consumers" 1994-03-01
1994  M04      April 147.2 "CPI - Consumers" 1994-04-01
1994  M05        May 147.5 "CPI - Consumers" 1994-05-01
1994  M06       June 147.9 "CPI - Consumers" 1994-06-01
1994  M01    January  8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994  M02   February  8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994  M03      March  8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994  M04      April  8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994  M05        May  7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994  M06       June  7927 "(Seas) Unemployment Level (thous)" 1994-06-01
'))

Second, here is what the output of the calculation should look like:

DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date 
1994  M01    January 143.8 "CPI - Workers" 1994-01-01
1994  M02   February 144.0 "CPI - Workers" 1994-02-01
1994  M03      March 144.3 "CPI - Workers" 1994-03-01
1994  M04      April 144.5 "CPI - Workers" 1994-04-01
1994  M05        May 144.8 "CPI - Workers" 1994-05-01
1994  M06       June 145.3 "CPI - Workers" 1994-06-01
1994  M01    January 146.3 "CPI - Consumers" 1994-01-01
1994  M02   February 146.7 "CPI - Consumers" 1994-02-01
1994  M03      March 147.1 "CPI - Consumers" 1994-03-01
1994  M04      April 147.2 "CPI - Consumers" 1994-04-01
1994  M05        May 147.5 "CPI - Consumers" 1994-05-01
1994  M06       June 147.9 "CPI - Consumers" 1994-06-01
1994  M01    January  8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994  M02   February  8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994  M03      March  8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994  M04      April  8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994  M05        May  7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994  M06       June  7927 "(Seas) Unemployment Level (thous)" 1994-06-01
1994  M01    January 145.05 "CPI - Average" 1994-01-01
1994  M02   February 145.35 "CPI - Average" 1994-02-01
1994  M03      March 145.70 "CPI - Average" 1994-03-01
1994  M04      April 148.85 "CPI - Average" 1994-04-01
1994  M05        May 146.15 "CPI - Average" 1994-05-01
1994  M06       June 146.60 "CPI - Average" 1994-06-01
'))

The fourth variable (CPI - Average) takes the average of the first two for each date. Please ignore the fact that this average makes no sense economically, I just wanted a simple calculation for the example.

Such calculations are quite straight forward in wide format. So let's first transform the data to wide and then make the calculation.

DT_wide <- DT_long %>% pivot_wider(names_from = variable_name, values_from = value)

DT_wide_with_average <- DT_wide %>% mutate(`CPI - Average` = (`CPI - Workers` + `CPI - Consumers`) / 2)

This takes the wide table and adds a new column with the calculated results:

DT_wide_with_average <- as.data.table(read.table(header=TRUE, text='year period periodName date `CPI - Workers` `CPI - Consumers` `(Seas) Unemployment Level (thous)` `CPI - Average`
1994 M01  January  1994-01-01  144.    146.       8630        145.
1994 M02  February 1994-02-01  144     147.       8583        145.
1994 M03  March    1994-03-01  144.    147.       8470        146.
1994 M04  April    1994-04-01  144.    147.       8331        146.
1994 M05  May      1994-05-01  145.    148.       7915        146.
1994 M06  June     1994-06-01  145.    148.       7927        147.
'))

Please ignore the fact that the decimals have been truncated by pivot_wider.

Working in wide mode, creating variables, analyzing them, revising the calculations, reordering the column orders, deleting unneeded columns is the way we mere humans think when analyzing simple data tables.

Unfortunately, ggplot requires the long format, considered "tidy" by the gods of R, but is quite messy in the eyes of us mere mortals. Sorry for the dig, but if I were to stack my couch, table, chairs, lamp and rug in one corner of the room, it would be quite messy, while if I left them as I normally keep my room, they would be quite tidy. In the real world, I might stack the furniture in one corner in order to paint the room or sand the floors. This would be useful for the task at hand, but it would be considered messy and not useful for ordinary living. So, considering long tables as tidy and wide tables as messy is counterintuitive. It took me a long time to figure out this counterintuitive logic when I was first introduced to tidyverse. Sorry for the rant, but hopefully it is useful customer feedback to the gods of R. At a minimum, it would be helpful to R learners if the gods would admit to the counterintuitive nomenclature. If I am warned, before entering the bathroom, that the faucet handle with an "H" is the cold water and the one with a "C" is the hot water, I am less likely to scald my hands!

Data analysis is iterative. I don't want to have to take the following steps for each iteration:

  1. pivot_wider
  2. calculate new variable
  3. pivot_longer
  4. examine trend in ggplot

I would rather:

  1. calculate new variable
  2. examine trend in ggplot

In short, I want to focus on my economic analysis, rather than on unnecessary R programming.

So, how can I select a subset of variables from my long format table, use them in a calculation to create a new variable and ensure that the new variable is rbind-ed to the end of my long table...without having to convert to wide format?

Thanks for your help!

3

3 Answers

1
votes

How about this?

bind_rows(
  DT_long,
  DT_long %>%
    filter(variable_name %>% str_detect("CPI")) %>%
    group_by(year, period, periodName, date) %>%
    summarize(value = mean(value)) %>%
    mutate(variable_name = "CPI - Average")
)

In this case the math can be accomplished by a mean across the group, but this assumes that both Workers / Consumer CPI are present, and each only once within each group, and you want them evenly weighted. It could get more convoluted, and in many cases you're absolutely right that many calculations that involve relationships between variables are much more straightforward in wide format.

(Especially in a case like this where it's a gray area about whether these different data points are truly different observations or different dimensions of the same "economic snapshot" observation, so arguably your wide version is already "tidy.")

0
votes

Here is my try to be more generic using a helper function and data.table.

helper <- function(name, value, formula) {
  # get the variable and value field name
  vn_name <- substitute(name)
  vn_value <- substitute(value)
  
  # new name is given by formula's LHS
  if(length(formula)==3) {
    new_name <- as.character(formula[[2]])
    formula <- formula[-2]
  } else
    stop("formula should be of the form new_name ~ ...")
  
  # build named list from variable names and values
  .x <- setNames(as.list(value), name)
  attr(formula,".Environment") <- list2env(.x)

  # build function from one sided formula
  f <- rlang::as_function(formula)
  
  # return result as a named list using provided variable names and new_name 
  setNames(
    list(new_name, f()),
    c(vn_name,vn_value)
  )
}

# test
rbind(
  DT_long,
  DT_long[, by="year,period,periodName,date", 
    helper(variable_name, value, 
      `CPI - Average` ~ (`CPI - Workers` + `CPI - Consumers`) / 2
    )
  ]
)

Or an alternative formulation using a helper function shaped after melt/dcast functions I called long_mutate. It could be easily vectorized over expr to allow multiple computations in a row.

long_mutate <- function(x, id.vars, variable.name="variable", value.name="value", result.name=NULL, expr) {
  # names can be provided as strings or identifiers
  variable.name <- as.character(substitute(variable.name))
  value.name <- as.character(substitute(value.name))
  result.name <- as.character(substitute(result.name))

  # if id.vars not provided, defaults to all variables but variable and value
  if(missing(id.vars)) {
    id.vars <- setdiff(names(x), c(variable.name, value.name))
  }
  
  # expression can be given as 
  #   a one sided formula (result.name must be provided)
  #   a two sided formula (left part becomes result.name)
  #   a function (with no or only ... arguments)
  if(rlang::is_formula(expr)) {
    if(length(expr)==3) {
      result.name <- as.character(expr[[2]])
      expr <- rlang::as_function(expr[-2])
    } else {
      expr <- rlang::as_function(expr)
      if(length(result.name)!=1)
        stop("Need a result.name in case of one sided formula!")
    }
  } else if(is.function(expr)) {
    if(length(result.name)!=1)
      stop("Need a result.name in case of function!")
    args <- formalArgs(expr)
    if(!(is.null(args) || ((length(args)==1)&&(args=="..."))))
      stop("Function must have no or only ... as arguments!")
  }

  # wrapper to inject variables and values in the environment
  # and return result name and value using variable.name and value.name
  f <- function(sd) {
    ev <- list2env(setNames(as.list(sd[[2]]), sd[[1]]))
    environment(expr) <- ev
    setNames(list(result.name, expr()), c(variable.name, value.name))
  }

  # keep input variable order
  x.vars <- intersect(names(x), c(id.vars, variable.name, value.name))
  rbind(
    x[, ..x.vars],
    x[, by=id.vars, f(.SD), .SDcols=c(variable.name, value.name)]
  )
}

# example with two sided formula expression
long_mutate(DT_long,
  variable.name = "variable_name",value.name = "value", 
  expr=`CPI - Average` ~ (`CPI - Workers` + `CPI - Consumers`) / 2
)

# example with function
long_mutate(DT_long,
  variable.name = "variable_name",value.name = "value", 
  result.name = "CPI - Average",
  expr=function() {(`CPI - Workers` + `CPI - Consumers`) / 2}
)
0
votes

Jon Spring's answer works perfectly for the case that I originally described, but what is really needed is a more generalized approach to allow arbitrary calculations. As mere mortals, we conceptualize data in rows and columns, so the trick is to take our wide table concept of a calculation and translate into a long table implementation.

Wide Table Calculations Across Columns, within each Row, Applied to Long Tables

Based on his solution, we can generalize this to the case of arbitrary calculations across columns, but within a row (think of the wide table framework or a spreadsheet formula that only refers to cells in the same row). The calculations are usually mathematical, but could be string manipulation.

First we need to dissect the calculation. Let's take an example of an arbitrary calculation that requires specific reference to each variable, unlike the example I gave previously that Jon responded to. ((VarA * 6) / VarB) / (VarB) / (VarA * 6)), which of course by definition is always 1 unless either VarA or VarB is 0. If we get all 1's with our test data, then we know our solution works since there are no 0's.

Second, we select our variables. In our test data we will use CPI - Consumers and CPI - Workers and not (Seas) Unemployment Level (thous). We do this through Jon's filter command or DT_long[variable_name %in% c("CPI - Workers", "CPI - Consumers") in data.table parlance. Note, I use a list to ensure unique selection of variables.

Third, we need to ensure that the calculation is restricted to the row (thinking in the wide table format). That is the group_by command which restricts the calculation to the date. That would be a unique row in the wide table.

Fourth we need a way to distinguish among the chosen variables. In the original example that was not necessary, but in the generalized case (and our new calculation) it is. This can be accomplished through keyby = .(variable_name) in data.table parlance, which puts the variables in alphabetical order. So now we can refer to CPI Consumers as value[1] and CPI Workers as value[2] because in long table our (wide table) columns of data became rows and by restricting our calculations to unique dates, we know that for each calculation there will only be two values, ordered by their respective variable_name. So our calculation becomes summarize( value = ((value[1] * 6) / value[2]) * (value[2] / (value[1] * 6)) ).

Fifth, we give our newly calculated value a variable name with Jon's mutate command.

Sixth, we append the new data to our long table with the bind_rows command.

Putting this all together, we have:

bind_rows(
    DT_long,
    DT_long[variable_name %in% c("CPI - Workers", "CPI - Consumers"), .SD, keyby = .(variable_name)] %>%
        group_by(year, period, periodName, date) %>%
        summarize( value = ((value[1] * 6) / value[2]) * (value[2] / (value[1] * 6)) )  %>%
        mutate(variable_name = "CPI - Average3")
)

This works perfectly with all 1's.

Now we have generalized steps to create arbitrary calculations across what would be the same row in wide table, but implemented on a long table.

Wide Table Calculations Across Rows, within each Column, Applied to Long Tables

Economists often think about change across time. How much have prices increased each year? Is the rate of inflation increasing over the years or diminishing? We cannot see that from the CPI (Consumer Price Index), but can calculate it. Thinking wide table, this problem is not a calculation across columns (variables) within the same row (within the same time period). It is a calculation on a single variable across time or single column across rows.

Here is an attempt:

bind_rows(
    DT_long,
    DT_long[variable_name %in% c("CPI - Workers"),] %>%
        summarize( for(i in 1:6) {value = (((value[i+1] - value[i]) / value[i]) * 100)})  %>%
        mutate(variable_name = "CPI_growth")
)

Alas this fails.

But here is a solution for a common calculation for economists involving one variable, with calculation over time. This is the Year over Year growth calculation or more generally period over period growth calculation. The CPI is an index of prices, which starts with a particular base year as 100. (Actually the base is 1982 to 1984, see the link below.) If during the next year prices increase 10 percent, the index for that year is 110. If it continues growing at 10 percent, the index goes to 121 in the second year. Looking at this number we immediately know that prices have grown 21 percent since the base year when the index was 100. But how much prices grew within this second year is not intuitive. What we need is to calculate the growth rate of prices for each year. If CPI were reported yearly, this would be ((CPIt - CPIt-1) / CPIt-1) * 100, but of course it is reported monthly, so the t-1 becomes t-12. however, Sometimes we want monthly inflation rates, so we would use t-1.

Gross Domestic Product (GDP) is reported quarterly, so for yearly growth we want to calculate the growth over the last 4 quarters, ((GDPt - GDPt-4) / GDPt-4) * 100.

How do we make this calculation, easily adjusting for periodicity when our data is stored in a long table?

We start with a growth rate function. Note that the growth.rate in the tis package is not very flexible and forces the calculation to always be yearly. Note that this solution assumes that you have your data in ascending date order.

gr.rate <- function(x, l=1){
  (x - lag(x, l)) / lag(x, l) * 100
}

x is the column of figures that we want to calculate growth on and l is the number of lags, i.e., 12 to to from monthly data to year over year growth.

Now we need to apply this to our example long data table DT_long. We do this with the following function.

gr.rate.long <- function(x, var_title, var_name, val_title, new_var_name, lag_periods){
  temp <-x
  names(temp)[grep(val_title, colnames(x))] <- "value"
  names(temp)[grep(var_title, colnames(x))] <- "variable_name"
  temp <- temp[variable_name == var_name]
  temp$value <- gr.rate(temp[, .(value)], lag_periods)
  temp$variable_name <- new_var_name
  names(temp)[grep("value", colnames(x))] <- val_title
  names(temp)[grep("variable_name", colnames(x))] <- var_title
  return(bind_rows(x,temp))
}

Next we call it with the following arguments:

  • x = the name of the long table that we are working with
  • var_title = the name of the column of variable names
  • var_name = the name of the specific variable that we want to work with
  • val_title = the name of the column of values
  • new_var_name = the name of the new variable we are creating
  • lag_periods = the number of lag periods, i.e., 12 for a year over year growth rate calculation of monthly data and 4 for quarterly.

Note that in our long table example, the column of variable names is called "variable_name" and the column of values is called "value," however, your long table may have other names for those columns. Specify those names with the respective arguments and the function will find and use those columns.

So, using our test long table, called "DT_long" we can calculate the monthly rate of inflation with the following call to this function:

gr.rate.long(DT_long, "variable_name", "CPI - Workers", "value", "CPI-W-growth rate", 1)

There are other reasons to calculate across time for a single variable. For example, if we knew the price of eggs over time in dollars and cents and wanted to convert these into an index like the CPI, we could call it the EPI. Or perhaps we want to change the base year of the CPI from the current 1982 to 1984* period to 2020.

To adjust the function for these, we would need to swap the temp$value <- gr.rate(temp[, .(value)], lag_periods) line to the appropriate calculation. That might take some experimentation. Better yet, if we could parameterize that line, it would be even better.

I tend to get a bit wordy, but I like to contextualize coding in real world scenarios. I hope this discussion has been useful to others. Please leave a comment if you found it useful.