I have a huge data.table
, and I need to create new columns based on conditions inside the existing columns.
lets say my data looks like this:
library(data.table)
dt=data.table(ID=rep(1:3,1000000),LABEL=rep(c("A","A","B"),1000000),COND=rep(c("C","D","D"),1000000),VALUE=sample(letters,1000000,replace=T))
Now, I need to assign values to the new column WHATEVER
depending on values on the other columns. Lets say I'M doing this on a loop:
dt$WHATEVER=as.numeric(NA)
for(id in dt[,unique(ID)]){
for(label in dt[,unique(LABEL)]){
n=dt[which(ID==id&LABEL==label&COND=="C"),cumsum(grepl("a",VALUE))]
set(dt,
i=dt[,which(ID==id&LABEL==label&COND=="C")],
j="WHATEVER",
value=n)
}
}
If I system.time()
this, I get:
user system elapsed
0.788 0.000 0.788
However my dataset (and my code) is much more complex and it takes hours. So I've tried to setkey
to the columns that I use to select the data both outside and inside the loop, but I got virtually no change.
setkey(dt,ID,LABEL,COND)
for(id in dt[,unique(ID)]){
for(label in dt[,unique(LABEL)]){
#setkey(dt,ID,LABEL,COND)
n=dt[which(ID==id&LABEL==label&COND=="C"),cumsum(grepl("a",VALUE))]
set(dt,
i=dt[,which(ID==id&LABEL==label&COND=="C")],
j="WHATEVER",
value=n)
}
}
... as you can see:
user system elapsed
0.801 0.020 0.820
Is there something I'm doing wrong or could be doing better? (I know I can change to apply functions. My question is data.table wise)
As asked by Henrik, I'm gonna show a sample of my data set and explain what I'm trying to do. My dataset looks like this:
ID NAME PROGRAM
1: 2056 CE 348
2: 2056 CE 348
3: 2056 AE 348
4: 2056 CE 348
5: 2056 AE 348
6: 2056 AE 348
7: 2056 CE 348
8: 2056 AE 348
9: 2056 BC 348
10: 2056 CB 348
I'm trying to count, for each ID, how many times each NAME appears, and assign a number to the NEWCOLUMN that tells me which time is that one (the order matters, and possibly setkey may mess this up, although I could work it around), but only for some PROGRAM.
Latter I use the value assign to the new column to create another one that tells me which NAME was done FIRST and LAST, again for each ID and only certain PROGRAM. (This one is even slower, since its done BY some other column), althouth PoGibas's answer might help to speed it up.