My group writes a lot of code using data.table and we occasionally get bitten by the 'Invalid .internal.selfref detected and fixed by taking a copy of the whole table ...' warning. This behaviour can break our code when a data table is passed by reference to a function and I am trying to figure out how to work around it.
Suppose I have a function which adds a column to a data.table as a side effect -- note the original data.table is not returned.
foo <- function(mydt){
  mydt[, c := c("a", "b")]  # adds column c by reference
  return(123)
}
> x<- data.table(a=c(1,2), b=c(3,4))
> foo(x)
[1] 123
> x
   a b c
1: 1 3 a
2: 2 4 b
x has been updated with the new column. This is the desired behavior.
Now suppose something happens that breaks the internal self-ref in x:
> x<- data.table(a=c(1,2), b=c(3,4))
> x[["a"]] <- c(7,8)
> foo(x)
[1] 123
Warning message:
In `[.data.table`(mydt, , `:=`(c, c("a", "b"))) :
Invalid .internal.selfref detected and fixed by taking a copy ...
> x
   a b
1: 7 3
2: 8 4
I understand what happened (mostly). The [["a"]] <- construction is not data.table friendly; as far as I can tell, x was converted to a data frame and then back to a data table, which left its internal self-reference pointing at the old object. Then inside foo(), when := tried to add a column by reference, the problem was detected and a copy of mydt was made; the new column 'c' was added to that copy. However, the copy severed the pass-by-reference relationship between x and mydt, so the new column never appears in x.
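To make the mechanism above concrete, here is a minimal sketch that compares the self-reference check before and after a base-R style replacement, and after the equivalent by-reference update with set(). It relies on data.table:::selfrefok, which is internal and unexported, so its behaviour may differ between versions. On a version where the warning above appears, I would expect only the check after [["a"]] <- to report a broken self-reference.

```r
library(data.table)

x <- data.table(a = c(1, 2), b = c(3, 4))
# selfrefok() is an internal data.table function; treat its use as an assumption
ok_before <- data.table:::selfrefok(x)

# Base-R replacement: R rebuilds x, so the stored self-reference goes stale
x[["a"]] <- c(7, 8)
ok_after_base <- data.table:::selfrefok(x)

# By-reference alternative: set() updates the column in place
y <- data.table(a = c(1, 2), b = c(3, 4))
set(y, j = "a", value = c(7, 8))
ok_after_set <- data.table:::selfrefok(y)

print(c(before = ok_before, after_base = ok_after_base, after_set = ok_after_set))
```

Writing the update as x[, a := c(7, 8)] should behave like the set() call here, since both modify the table by reference.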
The function foo() is going to be used by different people and it will be difficult to protect against invalid internal selfref situations. Someone out there might easily do something like x[["a"]] <- ..., which would lead to invalid input. I'm trying to figure out how to handle this from inside foo().
So far I have this idea, at the beginning of foo():
if(!data.table:::selfrefok(mydt)) stop("mydt is corrupt.")
That at least gives us a chance of spotting the problem, but it's not very friendly to the users of foo(), because the ways in which these inputs can get corrupted can be pretty opaque. Ideally I would like to be able to correct for corrupted input and maintain the desired functionality of foo(). But I can't see how, unless I restructure my code so that foo returns mydt and assigns it to x in the calling scope, which is possible but not ideal. Any ideas?
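For reference, this is roughly what a friendlier version of that guard could look like. It is only a sketch: check_selfref() is a hypothetical helper name of mine, not part of data.table, and it still depends on the internal data.table:::selfrefok. It does not repair the input; it just stops early with a message that tells the caller which argument is affected and what usually causes it.

```r
library(data.table)

# Hypothetical helper (not part of data.table): fail early with an actionable message
check_selfref <- function(dt, arg = deparse(substitute(dt))) {
  stopifnot(is.data.table(dt))
  # selfrefok() is internal and unexported; its behaviour may change between versions
  if (!data.table:::selfrefok(dt)) {
    stop(sprintf(
      paste0("'%s' has an invalid .internal.selfref, so it cannot be updated by ",
             "reference. It was probably modified with a base-R replacement such ",
             "as x[[\"a\"]] <- value; use := or set() instead."),
      arg), call. = FALSE)
  }
  invisible(dt)
}

foo <- function(mydt){
  check_selfref(mydt)
  mydt[, c := c("a", "b")]  # adds column c by reference
  return(123)
}
```

With this in place, foo(x) on an untouched data.table still adds the column, while a table altered via [["a"]] <- triggers the error instead of silently losing the update. I believe calling setDT(x) in the calling scope before foo(x) would restore the self-reference, but that is an assumption I have not confirmed across versions.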