5
votes

I'm trying to assign some new variables within a for loop (the goal is to create variables with a common structure, but whose values are subsample-dependent).

I've tried for the life of me to reproduce this error on sample data and I can't. Here's code that works and gets the gist of what I want to do:

library(data.table)

DT <- data.table(
  id = rep(1:100, each = 20L),
  period = rep(-9:10, 100L),
  grp = rep(sample(4L, size = 100L, replace = TRUE), each = 20L),
  y = runif(2000, min=0, max=5), key = c("id", "period")
)
DT[ , x := cumsum(y), by = id]
DT2 <- DT[id %in% seq(1, 100, by=2)]
DT3 <- DT[id %in% seq(1, 100, by=3)]

for (dd in list(DT, DT2, DT3)){
  # key by grp, join on the per-group sums of x at period 0, add the column by reference, then re-key
  setkey(setkey(dd, grp)[dd[period==0, sum(x), by = grp], x_at_0_by_grp := V1], id, period)
}

This works fine. However, when I apply the same approach to my own data, it generates the Invalid .internal.selfref warning (and doesn't create the variable I want):

In [.data.table(setkey(dt, treatment), dt[posting_rel == 0, sum(current_balance), : Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.

In fact, when I subset my data to only those columns needed for the merge, it also works fine on my data (though then nothing gets saved back to the original data sets).

This suggests to me it's a problem with keying, but I'm explicitly setting the keys every step of the way. I'm completely lost on how to debug this from here because I can't get the error to repeat except on my full data set.
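(For reference, a quick way to check whether a table's over-allocation/selfref is still intact is truelength(); a healthy data.table has truelength() of at least ncol(), and it drops to 0 once R has copied the table behind data.table's back. This is just a diagnostic sketch, it doesn't fix anything by itself.)

# diagnostic: truelength() >= ncol() means the over-allocation (and selfref) survived;
# 0 means R copied the table, so := will warn and take a copy again
sapply(list(DT, DT2, DT3), truelength)
ncol(DT)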

If I break out the operation into steps, the error arises at the merge step:

for (dd in list(DT, DT2, DT3)){
  dummy <- dd[period==0, sum(x), by = grp]
  setkey(dd, grp)
  dd[dummy, x_at_0_by_grp := V1] #***ERROR HERE***
  setkey(dd, id, period)
}

Quick update: the error also occurs if I rewrite this with lapply instead of a for loop.
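Roughly, the lapply version is just the body of the for loop above wrapped in a function:

lapply(list(DT, DT2, DT3), function(dd)
  setkey(setkey(dd, grp)[dd[period==0, sum(x), by = grp], x_at_0_by_grp := V1], id, period))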

Any ideas what on earth is going on here?


UPDATE: I've come up with a workaround by doing:

nnames <- c("dt", "dt2", "dt3")

dt_list <- list(DT, DT2, DT3)

for (ii in 1:3){
  dummy <- copy(dt_list[[ii]])  # the explicit copy() is what avoids the selfref warning
  dummy[ , x_at_0_by_grp := sum(x[period == 0]), by=grp]
  assign(nnames[ii], dummy)
}
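An equivalent formulation that keeps the results in a named list rather than using assign() (just a sketch, in case the list form is preferable):

dt_list <- list(dt = DT, dt2 = DT2, dt3 = DT3)
dt_list <- lapply(dt_list, function(dd) {
  dd <- copy(dd)  # copy() again side-steps the selfref warning
  dd[, x_at_0_by_grp := sum(x[period == 0]), by = grp]
  dd
})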

Would still like to understand what's going on, and perhaps a better way of assigning variables iteratively in situations like this.

I don't have an explanation for your bug, but you might consider pooling this data into a single data.table and working with that. DT <- rbindlist(list(dt[,src:=1],dt2[,src:=2],dt3[,src:=3])) Even if you don't want to stack the data, your second step does not require a merge, just use sum(x[period==0]) ... DT[,x_at_0_by_grp:=sum(x[period==0]),by="src,grp"] - Frank
Thanks for the tip for avoiding a merge, I hadn't thought of that. However, the error still occurs, which suggests to me the problem is the combination of looping over a list() of data.tables (mentioned as troublesome in other questions involving Invalid .internal.selfref) and using the := operator, rather than the merge itself. - MichaelChirico
Also, I'd prefer not to pool the data because 1) each subset has ~300,000 observations and 2) I could define your src within my main data set as an indicator for subsample membership, but given the number of subsample-specific variables I'd have to keep 20-30 variables with subsample-specific names in the original dataset; it feels more natural to loop over data sets with a consistent name for the variables. - MichaelChirico
Hm, weird that copy did the trick; I think you could post that as an answer. Yeah, I can see why you wouldn't want to stack the data sets like that. Anyway, I've posted my next suggestion as an answer, since it is so long. - Frank
The same thing happens to me when doing this, but this example does not warn: `` iris <- data.table(iris) cols <- c("Petal.Width", ~Petal.Length/Petal.Width, ~Sepal.Width-Petal.Width) names <- c("Petal.Width", "Width.Length.Relation", "Width.Length.Difference") if(any(unlist(lapply(cols, inherits, what = "formula")))){ for(col in 1:length(cols)){ if(inherits(cols[[col]], "formula")){ text_f <- tail(as.character(cols[[col]]), 1) name_f <- names[[col]] # a warning is produced text_f <- enquote(text_f) iris[, (name_f) := eval(parse(text = text_f)) ] } } } `` - Captain Tyler

1 Answer

2
votes

With 20-30 criteria, keeping them outside of a list (with manual names like dt2, etc.) is too clunky, so I'll just assume you have them all in dt_list.

I suggest making tables with just the stat you're computing, and then rbinding them:

xxt <- rbindlist(lapply(seq_along(dt_list), function(i)
         dt_list[[i]][, list(cond = i, xx = sum(x[period == 0])), by = grp]))

which creates

    grp cond       xx
 1:   1    1 623.3448
 2:   2    1 784.8438
 3:   4    1 699.2362
 4:   3    1 367.7196
 5:   1    2 323.6268
 6:   4    2 307.0374
 7:   2    2 447.0753
 8:   3    2 185.7377
 9:   1    3 275.4897
10:   4    3 243.0214
11:   2    3 149.6041
12:   3    3 166.3626

You can easily merge back if you really want those vars. For example, for dt2:

myi = 2
setkey(dt_list[[myi]],grp)[xxt[cond==myi,list(grp,xx)]]
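As an aside, in newer data.table versions (1.9.6+, assuming on= joins are available to you), the merge-back can be written as an update join without touching the key; i.xx below refers to the xx column of the joined table xxt:

myi = 2
dt_list[[myi]][xxt[cond == myi], x_at_0_by_grp := i.xx, on = "grp"]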

This doesn't resolve the bug you're running into, but I think it is a better approach.