1
votes

I have a huge data.table, and I need to create new columns based on conditions inside the existing columns.

lets say my data looks like this:

library(data.table)            
dt=data.table(ID=rep(1:3,1000000),LABEL=rep(c("A","A","B"),1000000),COND=rep(c("C","D","D"),1000000),VALUE=sample(letters,1000000,replace=T))

Now, I need to assign values to the new column WHATEVER depending on values on the other columns. Lets say I'M doing this on a loop:

dt$WHATEVER=as.numeric(NA)
for(id in dt[,unique(ID)]){
  for(label in dt[,unique(LABEL)]){
  n=dt[which(ID==id&LABEL==label&COND=="C"),cumsum(grepl("a",VALUE))]
    set(dt,
        i=dt[,which(ID==id&LABEL==label&COND=="C")],
        j="WHATEVER",
        value=n)
  }
}

If I system.time() this, I get:

   user  system elapsed 
  0.788   0.000   0.788 

However my dataset (and my code) is much more complex and it takes hours. So I've tried to setkey to the columns that I use to select the data both outside and inside the loop, but I got virtually no change.

setkey(dt,ID,LABEL,COND)
for(id in dt[,unique(ID)]){
  for(label in dt[,unique(LABEL)]){
  #setkey(dt,ID,LABEL,COND)
  n=dt[which(ID==id&LABEL==label&COND=="C"),cumsum(grepl("a",VALUE))]
    set(dt,
        i=dt[,which(ID==id&LABEL==label&COND=="C")],
        j="WHATEVER",
        value=n)
  }
}

... as you can see:

   user  system elapsed 
  0.801   0.020   0.820

Is there something I'm doing wrong or could be doing better? (I know I can change to apply functions. My question is data.table wise)

As asked by Henrik, I'm gonna show a sample of my data set and explain what I'm trying to do. My dataset looks like this:

       ID                                 NAME      PROGRAM
 1:    2056                                 CE      348
 2:    2056                                 CE      348
 3:    2056                                 AE      348
 4:    2056                                 CE      348
 5:    2056                                 AE      348
 6:    2056                                 AE      348
 7:    2056                                 CE      348
 8:    2056                                 AE      348
 9:    2056                                 BC      348
10:    2056                                 CB      348

I'm trying to count, for each ID, how many times each NAME appears, and assign a number to the NEWCOLUMN that tells me which time is that one (the order matters, and possibly setkey may mess this up, although I could work it around), but only for some PROGRAM.

Latter I use the value assign to the new column to create another one that tells me which NAME was done FIRST and LAST, again for each ID and only certain PROGRAM. (This one is even slower, since its done BY some other column), althouth PoGibas's answer might help to speed it up.

1
Only at the example, unfortunately.Ricardo Fernandes Campos
Done, @PoGibas.Ricardo Fernandes Campos
This kinda help, but it still doesn't. I'm updating the question again.Ricardo Fernandes Campos
Alright, @PoGibas , that's closer to what I need.Ricardo Fernandes Campos
@Henrik I put some further explanation and a toy to my data.Ricardo Fernandes Campos

1 Answers

1
votes

You can select rows with COND == "C" and cumsum TRUE values generated from condition VALUE == "a" (rows with COND != "C" will be filled with NA).

# Assign cumsum to new column WHATEVER2 by ID and LABEL
dt[COND == "C", WHATEVER2 := cumsum(VALUE == "a"), .(ID, LABEL)]

# All values are equal to the ones generate by OP
dt[, all(WHATEVER == WHATEVER2, na.rm = TRUE)]