1
votes

I have a panel data set with a large number of time periods and units, and also a large number of variables which I observe for each period and unit.

Since I want to apply a univariate time series operation for each unit and variable, I have to bring the panel data to wide format (using data.table::dcast) so that each column shows now a variable for a given unit across time.

After applying my time series observation I want to go back to the "long" panel format (using data.table::melt), however, here, I lose information on the unit names and variables. Since the data.table is quite large, I am afraid of mixing up data here which is why I would like to find a melt operation that preserves the variables and value names.

Consider the following example panel data set:

require(data.table)

dates <- seq(from = as.Date("2007-02-01"), to = as.Date("2012-01-01"), by = "month")
id <- paste0(c("A", "B", "C"), 1:10)


DT <- data.table(
  time = rep(dates, 10),
  idx = rep(id, each = 60),
  String1 = runif(600),
  String2 = runif(600),
  String3 = runif(600)
)       


           time idx   String1    String2   String3
  1: 2007-02-01  A1 0.5412122 0.23502234 0.3858354
  2: 2007-03-01  A1 0.3248168 0.32884580 0.7183147
  3: 2007-04-01  A1 0.4183034 0.40781723 0.7438458
  4: 2007-05-01  A1 0.3597997 0.51745402 0.1660566
  5: 2007-06-01  A1 0.6405351 0.96121729 0.7786483
 ---                                              
596: 2011-09-01 A10 0.7896711 0.64740298 0.8285408
597: 2011-10-01 A10 0.6582652 0.83986453 0.1292342
598: 2011-11-01 A10 0.1110465 0.41741672 0.7076345
599: 2011-12-01 A10 0.5108850 0.02940229 0.9038370
600: 2012-01-01 A10 0.2605052 0.10136480 0.3881788

I am bringing this panel data set to wide format. After applying some time series operations with it (not shown here), I will have to delete some columns if there's not enough data for them. Then, I bring the data back to long format

variable_names <- names(DT[,-c("time", "idx")])

DT_long <- dcast(DT, time ~ idx, value.var = variable_names)
DT_long <- DT_long[,-(5:10)]


DT_wide <- melt(DT_long, measure = patterns("^String1", "^String2", "^String3"), value.name = variable_names, variable.name = "idx)

           time idx   String1   String2    String3
  1: 2007-02-01   1 0.9794707 0.5290352 0.68009050
  2: 2007-03-01   1 0.4016173 0.9229200 0.38652407
  3: 2007-04-01   1 0.9475505 0.5956701 0.24686007
  4: 2007-05-01   1 0.6465847 0.8233340 0.08008369
  5: 2007-06-01   1 0.5704834 0.8232598 0.85790038
 ---                                              
596: 2011-09-01  10        NA 0.5525413 0.79994190
597: 2011-10-01  10        NA 0.3895864 0.41347910
598: 2011-11-01  10        NA 0.3123646 0.44461146
599: 2011-12-01  10        NA 0.2148686 0.37609448
600: 2012-01-01  10        NA 0.7314114 0.47138012

DT_wide now looks like this which means I have lost the information on variable (here: idx) names. One fix I imagine is numbering all idx through with numbers and then doing this operation. However, if possible I would like to preserve string names since they help me a lot to distinguish and understand the values. Can someone help me how I can rewrite melt to include this information?

1
Could you edit your post to include minimal sample data and matching expected output (think if it's really necessary to have a 600 row data.table). Also DT looks like it's in wide format already (a long table would have values from the String1, String2, String3 columns in a key & value column). So instead of "I am bringing this panel data set to wide format" did you mean to say "bringing this panel data set to long format"? - Maurits Evers
You are aware of the argument value.name in melt()? - sindri_baldur
Yes, value.name solves the "value" problem but not the idx problem. I am adjusting my question, thanks @sindri_baldur! - Florestan
@MauritsEvers For me the time - unit panel format is the right way to work with this data and I need to transform it as in my example to wide format. I see that it might not be the definition of long format most people use so I am replacing it by panel format. I do not think however that this really changes the way I framed my question. - Florestan
@Florestan The definition of long and wide data is pretty standardised now (but even Hadley refers to the distinction as being imprecise). The main issue is with your sample data being quite unwieldy. I don't think it's necessary to have a 600 row data.table to illustrate your point. You're more likely to get a quick & good response if you make it easy for people to help you. After all, you are asking others to help you with your problem. - Maurits Evers

1 Answers

1
votes

Reading through ?melt and the Efficient reshaping vignette, I cant see how to do this directly with melt.data.table. However, you could use pivot_longer() from the development version of tidyr:

library(data.table)

dates <- seq(from = as.Date("2007-02-01"), to = as.Date("2007-04-01"), by = "month")
id <- c("A1", "B2")

DT <- data.table(
  time = rep(dates, 2),
  idx = rep(id, each = 3),
  String1 = runif(6),
  String2 = runif(6)
)

DT
#>          time idx   String1   String2
#> 1: 2007-02-01  A1 0.6453802 0.4641508
#> 2: 2007-03-01  A1 0.1106000 0.3750282
#> 3: 2007-04-01  A1 0.6356700 0.9601759
#> 4: 2007-02-01  B2 0.9821609 0.1782534
#> 5: 2007-03-01  B2 0.4786173 0.1557481
#> 6: 2007-04-01  B2 0.7720111 0.7982246

variable_names <- names(DT[, -c("time", "idx")])
DT_long <- dcast(DT, time ~ idx, value.var = variable_names)

DT_long
#>          time String1_A1 String1_B2 String2_A1 String2_B2
#> 1: 2007-02-01  0.6453802  0.9821609  0.4641508  0.1782534
#> 2: 2007-03-01  0.1106000  0.4786173  0.3750282  0.1557481
#> 3: 2007-04-01  0.6356700  0.7720111  0.9601759  0.7982246
library(tidyr) # devtools::install_github("tidyverse/tidyr")

pivot_longer(
  data = DT_long,
  cols = starts_with("String"),
  names_sep = "_",
  names_to = c(".value", "idx")
)
#> # A tibble: 6 x 4
#>   time       idx   String1 String2
#>   <date>     <chr>   <dbl>   <dbl>
#> 1 2007-02-01 A1      0.645   0.464
#> 2 2007-02-01 B2      0.982   0.178
#> 3 2007-03-01 A1      0.111   0.375
#> 4 2007-03-01 B2      0.479   0.156
#> 5 2007-04-01 A1      0.636   0.960
#> 6 2007-04-01 B2      0.772   0.798

Created on 2019-09-09 by the reprex package (v0.3.0)