I have a long table with multiple variables (CPI - Workers
, CPI - Consumers
, (Seas) Unemployment Level (thous)
and many more, but am truncating the data set to 3 variables and 6 time periods for brevity. I want to create a new variable, which is a combination of the first two. Let's call it CPI - Average
which of course is simply the average of the first two or (CPI - Workers
+ CPI - Consumers
) / 2. This is a simple calculation in a wide table, however, in order to satisfy ggplot, I have stored my data in long form.
Note, I am storing all of my variables in one long table. When I need to visualize the trend, I filter to the desired variable or variables in my ggplot command.
My question is how do I create the new variable without first converting the data to wide format?
First, here is my dataset:
DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date
1994 M01 January 143.8 "CPI - Workers" 1994-01-01
1994 M02 February 144.0 "CPI - Workers" 1994-02-01
1994 M03 March 144.3 "CPI - Workers" 1994-03-01
1994 M04 April 144.5 "CPI - Workers" 1994-04-01
1994 M05 May 144.8 "CPI - Workers" 1994-05-01
1994 M06 June 145.3 "CPI - Workers" 1994-06-01
1994 M01 January 146.3 "CPI - Consumers" 1994-01-01
1994 M02 February 146.7 "CPI - Consumers" 1994-02-01
1994 M03 March 147.1 "CPI - Consumers" 1994-03-01
1994 M04 April 147.2 "CPI - Consumers" 1994-04-01
1994 M05 May 147.5 "CPI - Consumers" 1994-05-01
1994 M06 June 147.9 "CPI - Consumers" 1994-06-01
1994 M01 January 8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994 M02 February 8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994 M03 March 8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994 M04 April 8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994 M05 May 7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994 M06 June 7927 "(Seas) Unemployment Level (thous)" 1994-06-01
'))
Second, here is what the output of the calculation should look like:
DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date
1994 M01 January 143.8 "CPI - Workers" 1994-01-01
1994 M02 February 144.0 "CPI - Workers" 1994-02-01
1994 M03 March 144.3 "CPI - Workers" 1994-03-01
1994 M04 April 144.5 "CPI - Workers" 1994-04-01
1994 M05 May 144.8 "CPI - Workers" 1994-05-01
1994 M06 June 145.3 "CPI - Workers" 1994-06-01
1994 M01 January 146.3 "CPI - Consumers" 1994-01-01
1994 M02 February 146.7 "CPI - Consumers" 1994-02-01
1994 M03 March 147.1 "CPI - Consumers" 1994-03-01
1994 M04 April 147.2 "CPI - Consumers" 1994-04-01
1994 M05 May 147.5 "CPI - Consumers" 1994-05-01
1994 M06 June 147.9 "CPI - Consumers" 1994-06-01
1994 M01 January 8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994 M02 February 8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994 M03 March 8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994 M04 April 8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994 M05 May 7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994 M06 June 7927 "(Seas) Unemployment Level (thous)" 1994-06-01
1994 M01 January 145.05 "CPI - Average" 1994-01-01
1994 M02 February 145.35 "CPI - Average" 1994-02-01
1994 M03 March 145.70 "CPI - Average" 1994-03-01
1994 M04 April 148.85 "CPI - Average" 1994-04-01
1994 M05 May 146.15 "CPI - Average" 1994-05-01
1994 M06 June 146.60 "CPI - Average" 1994-06-01
'))
The fourth variable (CPI - Average) takes the average of the first two for each date. Please ignore the fact that this average makes no sense economically, I just wanted a simple calculation for the example.
Such calculations are quite straight forward in wide format. So let's first transform the data to wide and then make the calculation.
DT_wide <- DT_long %>% pivot_wider(names_from = variable_name, values_from = value)
DT_wide_with_average <- DT_wide %>% mutate(`CPI - Average` = (`CPI - Workers` + `CPI - Consumers`) / 2)
This takes the wide table and adds a new column with the calculated results:
DT_wide_with_average <- as.data.table(read.table(header=TRUE, text='year period periodName date `CPI - Workers` `CPI - Consumers` `(Seas) Unemployment Level (thous)` `CPI - Average`
1994 M01 January 1994-01-01 144. 146. 8630 145.
1994 M02 February 1994-02-01 144 147. 8583 145.
1994 M03 March 1994-03-01 144. 147. 8470 146.
1994 M04 April 1994-04-01 144. 147. 8331 146.
1994 M05 May 1994-05-01 145. 148. 7915 146.
1994 M06 June 1994-06-01 145. 148. 7927 147.
'))
Please ignore the fact that the decimals have been truncated by pivot_wider.
Working in wide mode, creating variables, analyzing them, revising the calculations, reordering the column orders, deleting unneeded columns is the way we mere humans think when analyzing simple data tables.
Unfortunately, ggplot requires the long format, considered "tidy" by the gods of R, but is quite messy in the eyes of us mere mortals. Sorry for the dig, but if I were to stack my couch, table, chairs, lamp and rug in one corner of the room, it would be quite messy, while if I left them as I normally keep my room, they would be quite tidy. In the real world, I might stack the furniture in one corner in order to paint the room or sand the floors. This would be useful for the task at hand, but it would be considered messy and not useful for ordinary living. So, considering long tables as tidy and wide tables as messy is counterintuitive. It took me a long time to figure out this counterintuitive logic when I was first introduced to tidyverse. Sorry for the rant, but hopefully it is useful customer feedback to the gods of R. At a minimum, it would be helpful to R learners if the gods would admit to the counterintuitive nomenclature. If I am warned, before entering the bathroom, that the faucet handle with an "H" is the cold water and the one with a "C" is the hot water, I am less likely to scald my hands!
Data analysis is iterative. I don't want to have to take the following steps for each iteration:
- pivot_wider
- calculate new variable
- pivot_longer
- examine trend in ggplot
I would rather:
- calculate new variable
- examine trend in ggplot
In short, I want to focus on my economic analysis, rather than on unnecessary R programming.
So, how can I select a subset of variables from my long format table, use them in a calculation to create a new variable and ensure that the new variable is rbind
-ed to the end of my long table...without having to convert to wide format?
Thanks for your help!