split data.table

Question

I have a data.table which I want to split into two. I do this as follows:

dt <- data.table(a=c(1,2,3,3),b=c(1,1,2,2))
sdt <- split(dt,dt$b==2)

But if I then want to do something like this as a next step:

sdt[[1]][,c:=.N,by=a]

I get the following warning message.

Warning message:
In `[.data.table`(sdt[[1]], , `:=`(c, .N), by = a) :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole table, so that := can add this new column by reference. At an earlier point, this data.table has been copied by R. Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects), use reflist() instead if needed (to be implemented). If this message doesn't help, please report to datatable-help so the root cause can be fixed.

Is there a better way of splitting the table that is more efficient and does not trigger this warning?

Why do you want to split the data.table in the first place? Splitting creates a list, so the warning explains why the copy has taken place. — mnel
I'm creating two sets for my experiments, based on a time split. — jamborta
In 1.9.7, data.table has its own split method; your code will run just fine on it. — jangorecki

Matt Dowle · Accepted Answer · 2013-02-20T11:25:09

This works in v1.8.7 (and may work in v1.8.6 too):

> sdt = lapply(split(1:nrow(dt), dt$b==2), function(x)dt[x])
> sdt
$`FALSE`
   a b
1: 1 1
2: 2 1

$`TRUE`
   a b
1: 3 2
2: 3 2

> sdt[[1]][,c:=.N,by=a]     # now no warning
> sdt
$`FALSE`
   a b c
1: 1 1 1
2: 2 1 1

$`TRUE`
   a b
1: 3 2
2: 3 2

But, as @mnel said, that's inefficient. Please avoid splitting if possible.
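
As a rough sketch of what "avoid splitting" could look like for the example above (this is one possible approach, not necessarily what was meant): the subset can be expressed in i, or the split condition can be folded into by, so that := still works by reference on the single table. The column name c_all is made up for illustration.

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# Option 1: restrict the update to the rows that would have landed in
# sdt[[1]] (b != 2). Rows outside the subset get NA in the new column.
dt[b != 2, c := .N, by = a]

# Option 2: count within each would-be split for every row by grouping
# on the split condition together with a.
# c_all is a hypothetical column name used only for this sketch.
dt[, c_all := .N, by = .(b == 2, a)]

print(dt)
```

If two separate tables are still needed afterwards, plain subsets such as dt[b != 2] and dt[b == 2] each return a new data.table, so a later := on them should not trigger the selfref warning.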
