2
votes

Point: if you are going to vote to close, it is poor form not to give a reason why. If the question can be improved without closing it, take the ten seconds needed to write a brief comment.

Question:
How do I do the following "partial melt" in a way that memory can support?

Details:
I have a few million rows and around 1000 columns. The names of the columns have 2 pieces of information in them.

Normally I would melt to a long data frame (or data.table) with a variable/value pair of columns, split the variable name to create two new columns, then cast using one of the new splits for the new column names and the other for the row labels.
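
For concreteness, here is a minimal sketch of that usual workflow on a toy frame (the toy column names and the reshape2 calls here are illustrative assumptions, not my real data):

#minimal sketch: melt -> split the name -> cast back, on a tiny toy frame
library(reshape2)

toy <- data.frame(A001 = 1:3, A002 = 4:6, B001 = 7:9, B002 = 10:12)
toy$row <- seq_len(nrow(toy))                               #keep an explicit row identifier

long <- melt(toy, id.vars = "row")                          #melt to (row, variable, value)
long$grp  <- substr(as.character(long$variable), 1, 1)      #first piece of the name: the letter
long$item <- substr(as.character(long$variable), 2, 4)      #second piece: the number
wide <- dcast(long, row + grp ~ item, value.var = "value")  #cast: one column per number

At full scale the intermediate "long" object has Nrow * Ncol rows, which is exactly where this falls over.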

This isn't working: melting produces a billion or so rows, and the additional columns overwhelm my memory.

Outside the "iterative force" (as opposed to brute force) of a for-loop, is there a clean and effective way to do this?

Thoughts:

  • this is a little like melt-colsplit-cast
  • libraries common for this seem to be "dplyr", "tidyr", "reshape2", and "data.table".
  • tidyr's gather+separate+spread looks good, but it fails without a unique row identifier (a sketch with an explicit row id follows this list)
  • reshape2's dcast (I'm looking for 2d output) wants to aggregate
  • brute force loses the labels. By brute force I mean df <- rbind(df[,block1],...) where block1 is the first 200 column indices, block2 the second, and so on.
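
For reference, here is roughly what that tidyr attempt looks like on a toy frame once an explicit row id is added (the toy names and the row_number() workaround are just illustrative, not my real data):

library(dplyr)
library(tidyr)

toy <- data.frame(A001 = 1:3, A002 = 4:6, B001 = 7:9, B002 = 10:12)

toy %>%
  mutate(row = row_number()) %>%                       #explicit row id so spread() has a unique key
  gather(key, value, -row) %>%                         #melt everything except the row id
  separate(key, into = c("grp", "item"), sep = 1) %>%  #split the name after the first character
  spread(item, value)                                  #one column per within-group index

This sidesteps the duplicate-identifier error, but it still materializes the fully melted frame, so the memory problem at scale remains.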

Update (dummy code):

#libraries
library(stringr)

#reproducibility
set.seed(56873504)

#geometry
Ncol <- 2e3
Nrow <- 1e6

#column names: a letter prefix (group) followed by a zero-padded index within the group
namelist <- character(length=Ncol)
for(i in 1:(Ncol/200)){
  col_idx <- 1:200 + 200*(i-1)
  if(i <= 26){
    #groups 1-26 get uppercase prefixes A-Z
    namelist[col_idx] <- paste0(intToUtf8(64+i), str_pad(string=1:200, width=3, pad="0"))
  } else {
    #groups 27-52 get lowercase prefixes a-z
    namelist[col_idx] <- paste0(intToUtf8(96+i-26), str_pad(string=1:200, width=3, pad="0"))
  }
}

#random data
df <- as.data.frame(matrix(runif(n=Nrow*Ncol,min=0, max=16384),nrow=Nrow,ncol=Ncol))
names(df) <- namelist

The output I am looking for would have one column holding the first character of the current name (a single letter) and column names running 1 to 200. It would be much less wide than "df", but not fully melted. It also should not kill my CPU or memory.

(Ugly/Manual) Brute force version:

(working on it... )

Have you tried the tidyverse solution, tidyr? - bob1
Hard to provide help without a more reproducible example (can you add a demonstrative slice of data with dput?). But it seems like you should be able to use lapply to perform an operation on each column, which would spare you the cost of reshaping a huge data set. - jdobres
@jdobres - you have code for dummy data. I will check out dput and lapply, but I'm dubious. The block structure makes lapply look hard. - EngrStudent
@bob1 - I get stuck on the spread part. I can gather and separate, but spread fails because I don't have a unique row identifier. - EngrStudent
What's the expected output? Can you post code with only 100 rows, along with the expected output? I'm not sure my computer can generate 2E9 data points that well. - Cole

1 Answer

1
votes

Here are two options, both using data.table.

If you know that each column prefix always has 200 (or n) fields associated with it (i.e., A001 - A200), you can use melt() with a list of measure variables: each list element collects the j-th column of every letter group, so the 26 groups are stacked into Nrow * 26 rows with one value column per within-group position.

melt(dt
     #each measure.vars element picks the j-th column of every letter group
     , measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
     , value.name = as.character(seq_len(Ncol_p_grp))
     #replace the integer group index with the letter prefix
)[, variable := rep(namelist_letters, each = Nrow)][]

#this data set used Ncol_p_grp <- 5 to help condense the data. 
        variable         1          2         3          4          5
     1:        A 0.2655087 0.06471249 0.2106027 0.41530902 0.59303088
     2:        A 0.3721239 0.67661240 0.1147864 0.14097138 0.55288322
     3:        A 0.5728534 0.73537169 0.1453641 0.45750426 0.59670404
     4:        A 0.9082078 0.11129967 0.3099322 0.80301300 0.39263068
     5:        A 0.2016819 0.04665462 0.1502421 0.32111280 0.26037592
    ---                                                              
259996:        Z 0.5215874 0.78318812 0.7857528 0.61409610 0.67813484
259997:        Z 0.6841282 0.99271480 0.7106837 0.82174887 0.92676493
259998:        Z 0.1698301 0.70759513 0.5345685 0.09007727 0.77255570
259999:        Z 0.2190295 0.14661878 0.1041779 0.96782695 0.99447460
260000:        Z 0.4364768 0.06679642 0.6148842 0.91976255 0.08949571

Alternatively, we can use rbindlist(lapply(...)) to go through the data set, subsetting the columns whose names contain each letter and stacking the results.

rbindlist(
  lapply(namelist_letters,
         #subset the columns whose names contain the letter, then rename them 1..Ncol_p_grp
         function(x) setnames(
           dt[, grep(x, names(dt), value = TRUE), with = FALSE]
           , as.character(seq_len(Ncol_p_grp)))
  )
  , idcol = 'ID'
  , use.names = FALSE
)[, ID := rep(namelist_letters, each = Nrow)][]

With 52 million elements in this dataset, the melt option takes around 0.14 seconds and the rbindlist option around 0.35 seconds. I tried to scale it up by 10x, but I just don't really have the RAM to generate the data that quickly in the first place.

#52 million elements - 10,000 rows * 26 grps * 200 cols_per_group
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
      melt_option 134.0395 135.5959 137.3480 137.1523 139.0022 140.8521     3
 rbindlist_option 290.2455 323.4414 350.1658 356.6373 380.1260 403.6147     3
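
For reference, timings like these could be reproduced with something along the lines of the following (a sketch only: it assumes the microbenchmark package, times = 3 to match the neval column, and the objects from the Data block below):

library(microbenchmark)

#timing harness sketch; dt, Ncol_p_grp, n_grp, Nrow, namelist_letters come from the Data block
microbenchmark(
  melt_option = melt(dt
       , measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
       , value.name = as.character(seq_len(Ncol_p_grp))
  )[, variable := rep(namelist_letters, each = Nrow)][],
  rbindlist_option = rbindlist(
    lapply(namelist_letters,
           function(x) setnames(
             dt[, grep(x, names(dt), value = TRUE), with = FALSE]
             , as.character(seq_len(Ncol_p_grp))))
    , idcol = 'ID'
    , use.names = FALSE
  )[, ID := rep(namelist_letters, each = Nrow)][],
  times = 3
)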

Data: Run this before everything above:

#packages ----
library(data.table)
library(stringr)

#data info
Nrow <- 10000
Ncol_p_grp <- 200
n_grp <- 26

#generate data
set.seed(1)
dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))

names(dt) <- paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp)
                    , str_pad(rep(seq_len(Ncol_p_grp), n_grp), width = 3, pad = '0'))

#first letter
namelist_letters <- unique(substr(names(dt), 1, 1))