I'm trying to assign some new variables within a for loop (I'm trying to create some variables with common structure, but which are subsample-dependent).
I've tried for the life of me to reproduce this error on sample data and I can't. Here's code that works and gets the gist of what I want to do:
library(data.table)

DT <- data.table(
    id = rep(1:100, each = 20L),
    period = rep(-9:10, 100L),
    grp = rep(sample(4L, size = 100L, replace = TRUE), each = 20L),
    y = runif(2000, min = 0, max = 5), key = c("id", "period")
)
DT[ , x := cumsum(y), by = id]
DT2 <- DT[id %in% seq(1, 100, by = 2)]
DT3 <- DT[id %in% seq(1, 100, by = 3)]
# key by grp, join on the period-0 totals, then restore the original key
for (dd in list(DT, DT2, DT3)){
    setkey(setkey(dd, grp)[dd[period == 0, sum(x), by = grp], x_at_0_by_grp := V1], id, period)
}
This works fine. However, when I run the same thing on my own data, it generates the Invalid .internal.selfref warning (and doesn't create the variable I want):
In `[.data.table`(setkey(dt, treatment), dt[posting_rel == 0, sum(current_balance), : Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.
In fact, when I subset my data to only those columns needed within the merge, it also works fine on my data (though doesn't save to the original data sets).
This suggests to me it's a problem with keying, but I'm explicitly setting the keys every step of the way. I'm completely lost on how to debug this from here because I can't get the error to repeat except on my full data set.
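For reference, the warning's advice to prefer the set* functions can be illustrated with a small sketch. The table and its column names here are made up for illustration and are not my real data; the point is only that setnames, setattr, setkey and := all modify the table in place, which data.table's exported address() can confirm.

```r
library(data.table)

# Toy table (hypothetical, not the real data set)
DT <- data.table(a = 1:3, b = 6:4)
addr_before <- address(DT)    # memory address of the table object

setnames(DT, "a", "A")        # rename by reference (instead of names<-)
setattr(DT, "note", "demo")   # set an attribute by reference (instead of attr<-)
setkey(DT, b)                 # key by reference (instead of key<-)
DT[, total := A + b]          # add a column by reference

# The object was never copied, so its address is unchanged
identical(address(DT), addr_before)
```

If the final line returns TRUE, none of these steps copied the table, so none of them should be what invalidates the self-reference.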
If I break out the operation into steps, the error arises at the merge step:
for (dd in list(DT, DT2, DT3)){
    dummy <- dd[period==0, sum(x), by = grp]
    setkey(dd, grp)
    dd[dummy, x_at_0_by_grp := V1] #***ERROR HERE***
    setkey(dd, id, period)
}
Quick update: this also produces the error if I run it with lapply instead of a for loop.
Any ideas what on earth is going on here?
UPDATE: I've come up with a workaround by doing:
nnames <- c("dt", "dt2", "dt3")
dt_list <- list(DT, DT2, DT3)
for (ii in seq_along(dt_list)){
    # work on an explicit copy so := has a valid self-reference
    dummy <- copy(dt_list[[ii]])
    dummy[ , x_at_0_by_grp := sum(x[period == 0]), by = grp]
    assign(nnames[ii], dummy)
}
Would still like to understand what's going on, and perhaps a better way of assigning variables iteratively in situations like this.
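One possible alternative to assign(), sketched on the sample data from above: keep the tables in a named list and let lapply return the updated copies. The list name tables and the helper add_x_at_0 are hypothetical names introduced for this sketch; it assumes an explicit copy() per table is acceptable memory-wise.

```r
library(data.table)

set.seed(1)
DT <- data.table(
    id = rep(1:100, each = 20L),
    period = rep(-9:10, 100L),
    grp = rep(sample(4L, size = 100L, replace = TRUE), each = 20L),
    y = runif(2000, min = 0, max = 5), key = c("id", "period")
)
DT[ , x := cumsum(y), by = id]

# Named list of the full table and its subsamples
tables <- list(
    dt  = DT,
    dt2 = DT[id %in% seq(1, 100, by = 2)],
    dt3 = DT[id %in% seq(1, 100, by = 3)]
)

# Hypothetical helper: copy the table, then add the group-level
# period-0 total with a grouped := (no merge, no re-keying)
add_x_at_0 <- function(d) {
    d <- copy(d)
    d[ , x_at_0_by_grp := sum(x[period == 0]), by = grp]
    d
}

tables <- lapply(tables, add_x_at_0)
```

Each element is then available as tables$dt, tables$dt2 and tables$dt3 with the same column name, and the original DT is left without the new column because the helper works on a copy.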
DT <- rbindlist(list(dt[,src:=1],dt2[,src:=2],dt3[,src:=3])) Even if you don't want to stack the data, your second step does not require a merge, just use sum(x[period==0]) ... DT[,x_at_0_by_grp:=sum(x[period==0]),by="src,grp"] - Frank
list call of a data.table (which is mentioned as troublesome in other questions involving Invalid .internal.selfref) and using the := operator, rather than the merge itself. - MichaelChirico
src within my main data set as an indicator for subsample containment, but the number of subsample-specific variables means I'd have to keep 20-30 variables with subsample-specific names in the original dataset--feels more natural to loop over data sets with a consistent name for the variables. - MichaelChirico
copy did the trick; I think you could post that as an answer. Yeah, I can see why you wouldn't want to stack the data sets like that. Anyway, I've posted my next suggestion as an answer, since it is so long. - Frank
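The stacking idea from the first comment can be sketched as follows, under the assumption that rbindlist's idcol argument is used to label the source table instead of adding src to each table by hand. The name stacked is hypothetical, and the data is the sample data from the question, not the real data set.

```r
library(data.table)

set.seed(1)
DT <- data.table(
    id = rep(1:100, each = 20L),
    period = rep(-9:10, 100L),
    grp = rep(sample(4L, size = 100L, replace = TRUE), each = 20L),
    y = runif(2000, min = 0, max = 5), key = c("id", "period")
)
DT[ , x := cumsum(y), by = id]
DT2 <- DT[id %in% seq(1, 100, by = 2)]
DT3 <- DT[id %in% seq(1, 100, by = 3)]

# Stack the tables; idcol = "src" records which list element each row came from
stacked <- rbindlist(list(dt = DT, dt2 = DT2, dt3 = DT3), idcol = "src")

# One grouped := over (src, grp) replaces the per-table merge
stacked[ , x_at_0_by_grp := sum(x[period == 0]), by = .(src, grp)]
```

rbindlist builds a new table, so the := here operates on a freshly allocated object rather than on something pulled out of a list.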