Reshape large dataset from wide to long with two ID variables

Question

I want to change my data from long to wide format using two ID variables.

I have the below code that works with the below example dataset. However, when I run this code with a much larger dataset that I am working with, the code runs for a very long time and doesn't seem to finish running. When I use one ID variable the code runs fine, but I need to include two.

Is there a more efficient way of changing from long to wide format?

(I've also thought about creating an ID variable based on ID1 and ID2 for the purposes of converting from long to wide. Perhaps this is the best solution?)

Wide.vars <- names(df[,c("Date","V1")])


### 1. Reshape from wide to long format with two ID variables
df_wide <- reshape(as.data.frame(df),                                  
                     idvar = c("ID1","ID2"), 
                     direction = "wide",
                     v.names = Wide.vars,
                     timevar = "Timepoint")

Example data below (note that the dimensions of the example dataset are 15 rows 5 columns, whereas the dataset I'm working with is 15658 rows by 99 columns).

df <- structure(list(ID1 = c(5643923L, 5643923L, 5643923L, 3914822L, 
3914822L, 3914822L, 3914822L, 1156115L, 1506426L, 7183921L, 4753447L, 
4606792L, 8492773L, 8492773L, 8492773L), ID2 = c("02179", 
"02179", "04101", "00819", "00819", "00819", "00819", 
"01904", "01127", "00475", "02084", "04118", "15553", 
"15553", "15553"), Date = structure(c(16731, 16731, 
16731, 16732, 16733, 16733, 16733, 16733, 16733, 16733, 16733, 
16733, 16734, 16734, 16734), class = "Date"), Timepoint = structure(c(1L, 
3L, 1L, 1L, 3L, 4L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 4L), .Label = c("baseline", 
"wave0.5", "wave1", "wave2", "wave3", "wave4"), class = "factor"), V1 = c(0, 8, 4, 9.5, 7, 7, 12, 9, 11, 8.4, 
    7.8, 6.6, 5, 5.5, 8.9)), row.names = c(NA, 
-15L), groups = structure(list(CP1_t_210 = structure(1L, .Label = c("baseline", 
"wave0.5", "wave1", "wave2", "wave3", "wave4"), class = "factor"), 
    .rows = structure(list(1:15), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

I think the problem is you have too much columns in wide format: I have done some fast tests with one id as you suggest, and also with data.table (1000 iterations), and differencies are minimal with your test data. — s__

Ronak Shah Ronak Shah · Accepted Answer · 2021-08-17T10:40:24

data.table is usually faster, you can try using dcast from it.

library(data.table)
dcast(setDT(df), ID1+ID2~Timepoint, value.var = c('Date', 'V1'))

As suggested by @Mark Davies pivot_wider can also help.

tidyr::pivot_wider(df, names_from = Timepoint, values_from = c(Date, V1))

Reshape large dataset from wide to long with two ID variables

1 Answers