0
votes

I have a series of columns in my data frame (299 of them) which have the same information type repeated every 3 columns. I need to stack them so each set of 3 retain their referential integrity with each other, and the preceding 22 columns. Normally I'd use melt() to go from wide to long form, but that would involve one or two categories with common names. Each of these columns has a unique name (e.g. "x1", "y1", "z1","x2","y2","z2").

For the sake of example, lets take example data mtcars. These data contain the 11 columns, each with unique names: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb. Lets say I want to keep columns 1:5 as identifiers, then stack the data columns 6:8 on top of 9:11. In this simple example I could use stack or melt and say which columns to use, but when we extrapolate out to hundreds of columns this gets to be a pretty horrid task very quickly... thoughts?

1
Is there any sort naming convention that tell which columns to stack? Like, do all the columns starting with x get stacked together? Or is it purely positional? Using the mtcars example, cols 1:5 as you say are identifiers, then for 6:ncol(mtcars) it's is sets of 3?Gregor Thomas
Because both data.table::melt and tidyr::pivot_longer are very powerful as to which columns get stacked and could handle a naming pattern. data.table::melt will work well with column numbers too.Gregor Thomas
They follow a standard convention. All X whatevers go together, all Y whatevers go together, etc. The columns were created in excel expanding from a single "|" deliminated column, then importing them to R add's the .1, .2,.3 etc to common column names.Jesse001
@GregorThomas I've used melt when I can tell it specifically which variables to stack, but I'm confused about how I would apply it here. could you please elaborate? I have another data frame with this same issue, but it's 1100 columns wide... automation is keyJesse001
You'd need to use data.table::melt. reshape2::melt doesn't have that feature, but data.table::melt lets you pass a list to measure.vars. For every 3 columns between 2 and 1000 you would use measure.vars = split(2:1000, (seq_along(2:1000) - 1) %/% 3). But with a names pattern tidyr::pivot_longer is also easy.Gregor Thomas

1 Answers

1
votes

Given that your column names all share a prefix, tidyr::pivot_longer makes this easy. First I'll create some sample data based on your description:

## semi-realistic sample data
nr = 3
n_group = 4
n_per_group = 3
nc = n_group * n_per_group
df = as.data.frame(matrix(rnorm(nr * nc), nrow = nr))
names(df) = outer(letters[1:n_group], 1:n_per_group, paste, sep = ".")
df = cbind(id = 1:nr, df)
#   id        a.1         b.1         c.1         d.1        a.2       b.2        c.2        d.2
# 1  1  0.7898394 -2.34853091 -0.15575050 -1.18205111 -0.6348410  2.545300  0.8265652  0.8817947
# 2  2  0.5717691 -1.46586503  0.01206218  0.83525523  1.6629318  1.602706 -1.5781837  1.1245715
# 3  3 -1.5727813  0.04925075 -0.82081511 -0.06905755 -0.4137177 -1.734805 -0.3742163 -1.0701552
#         a.3        b.3         c.3         d.3
# 1 0.3238154  2.2401514 -0.70958661 -0.09062723
# 2 0.7016303 -0.9547929  0.23150225  0.39295502
# 3 0.7229709 -1.7137527  0.09044422 -0.59431548

With tidyr:

## convert to long format
library(tidyr)
pivot_longer(df, cols = -id, names_sep = "\\.", names_to = c(".value", "iteration"))
# # A tibble: 9 x 6
#      id iteration      a       b       c       d
#   <int> <chr>      <dbl>   <dbl>   <dbl>   <dbl>
# 1     1 1          0.790 -2.35   -0.156  -1.18  
# 2     1 2         -0.635  2.55    0.827   0.882 
# 3     1 3          0.324  2.24   -0.710  -0.0906
# 4     2 1          0.572 -1.47    0.0121  0.835 
# 5     2 2          1.66   1.60   -1.58    1.12  
# 6     2 3          0.702 -0.955   0.232   0.393 
# 7     3 1         -1.57   0.0493 -0.821  -0.0691
# 8     3 2         -0.414 -1.73   -0.374  -1.07  
# 9     3 3          0.723 -1.71    0.0904 -0.594 

Using the same sample data, here's a data.table::melt approach based on column numbers:

## note that `nc` is the number of non-ID columns.
## The `1:nc + 1` is a way to get the column numbers of the stack columns
## (+ 1 because there is 1 id column in my example)
dt = as.data.table(df)
melt(dt, id.vars = "id", measure.vars = split(1:nc + 1, rep(1:n_group, times = n_per_group)))
#    id variable          1           2           3           4
# 1:  1        1  0.7898394 -2.34853091 -0.15575050 -1.18205111
# 2:  2        1  0.5717691 -1.46586503  0.01206218  0.83525523
# 3:  3        1 -1.5727813  0.04925075 -0.82081511 -0.06905755
# 4:  1        2 -0.6348410  2.54530008  0.82656523  0.88179474
# 5:  2        2  1.6629318  1.60270621 -1.57818365  1.12457150
# 6:  3        2 -0.4137177 -1.73480524 -0.37421630 -1.07015525
# 7:  1        3  0.3238154  2.24015140 -0.70958661 -0.09062723
# 8:  2        3  0.7016303 -0.95479285  0.23150225  0.39295502
# 9:  3        3  0.7229709 -1.71375273  0.09044422 -0.59431548

(The results look different at a glance, but it's just a different order of the same rows.)