6
votes

I get a warning when I use := right after converting all data.frames to data.tables:

library(data.table) #Win R-3.5.1 x64 data.table_1.12.2
df1 <- data.frame(A=1, B=2)
df2 <- data.frame(D=3)
lapply(mget(ls()), function(x) {
    if (is.data.frame(x)) {
        setDT(x)
    }
})
df1[, rn:=.I]

Warning message:
In `[.data.table`(df1, , `:=`(rn, .I)) :
  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

The following also generates the same warning:

df3 <- data.frame(E=3)
df4 <- data.frame(FF=4)
for (l in list(df3, df4)) setDT(l)
df3[, rn:=.I]

Converting them one at a time works, but it is tedious:

df5 <- data.frame(G=5)
setDT(df5)
df5[, rn := .I]    #no warning

What is the idiomatic way to convert all data.frames to data.tables?

Related:

  1. Using setDT inside a function
  2. Invalid .internal.selfref in data.table

3 Answers

3
votes

A little late, but this seems like a great (and rare) use case for eapply(), together with list2env(). This is just another option; I am not claiming it is the idiomatic way.

library(data.table)
df1 <- data.frame(A=1, B=2)
df2 <- data.frame(D=3)

list2env(eapply(.GlobalEnv, function(x) {if(is.data.frame(x)) {setDT(x)} else {x}}), .GlobalEnv)

df1[, rn:=.I]
df1
   A B rn
1: 1 2  1

Some timings and memory usage:

set.seed(0L)
sz <- 1e7
df1 <- data.frame(A=rnorm(sz))
df2 <- data.frame(B=rnorm(sz))
df3 <- copy(df1)
df4 <- copy(df2)

microbenchmark::microbenchmark(unit="ms", times=1L,
    assign_mtd = {
        for (x in ls()) {
            if (is.data.frame(get(x))) {
                assign(x, as.data.table(get(x)))
            }
        }
    },
    eval_sub_mtd = {
        for(x in ls()){
            if (is.data.frame(get(x))) {
                eval(substitute(setDT(x), list(x=as.name(x))))
            }
        }
    },
    eapply_mtd = {
        list2env(eapply(.GlobalEnv, function(x) {
                if (is.data.frame(x)) setDT(x) else x
            }), .GlobalEnv)
    }
)

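Timings (a single run each; because all three methods modify the same global objects, only the first one to run converts actual data.frames):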

Unit: milliseconds
         expr        min         lq       mean     median         uq        max neval
   assign_mtd 115.922802 115.922802 115.922802 115.922802 115.922802 115.922802     1
 eval_sub_mtd   3.293358   3.293358   3.293358   3.293358   3.293358   3.293358     1
   eapply_mtd   1.913802   1.913802   1.913802   1.913802   1.913802   1.913802     1
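
One plausible reason the assign/as.data.table() route is slower is that it copies the data, whereas setDT() converts in place. The following is a minimal sketch for checking this on your own installation, not a measurement from the benchmark above. It compares the memory address of a column before and after each conversion using data.table's address(). The names df_a, df_b and dt_b are made up for this illustration.

```r
library(data.table)

# Two small, independent data.frames (illustrative names)
df_a <- data.frame(A = rnorm(5))
df_b <- data.frame(A = rnorm(5))

# Record where each column vector lives before conversion
addr_a <- address(df_a$A)
addr_b <- address(df_b$A)

setDT(df_a)                   # converts df_a by reference
dt_b <- as.data.table(df_b)   # returns a new object, df_b is left as it is

# TRUE means the column vector was reused rather than copied
same_after_setDT <- identical(addr_a, address(df_a$A))
same_after_as_dt <- identical(addr_b, address(dt_b$A))

print(c(setDT = same_after_setDT, as.data.table = same_after_as_dt))
```

If the second value is FALSE on your version, as.data.table() allocated new column vectors, which matters for objects with 1e7 rows.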
5
votes

setDT() works on the name (symbol) it is called with, whereas get() returns only the value of the object. You can construct the setDT() call for each name and evaluate it:

library(data.table) 
df1 <- data.frame(A=1, B=2)
df2 <- data.frame(D=3)
for(x in ls()){
  if (is.data.frame(get(x))) {
    eval(substitute(setDT(x), list(x=as.name(x))))
  }
}
rm(x)
df1[, rn:=.I]

I would use a loop rather than lapply to avoid complications (e.g., with the evaluating environment).
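
The loop above can also be wrapped in a small function, so that the loop variable does not leak into the workspace and the target environment is explicit. This is only a sketch: setDT_all is a hypothetical helper name, not part of data.table, and it assumes that evaluating the constructed call in `envir` makes setDT() rebind the name there.

```r
library(data.table)

# Hypothetical helper (not a data.table function): convert every plain
# data.frame bound in `envir` to a data.table by reference.
setDT_all <- function(envir = parent.frame()) {
  for (nm in ls(envir)) {
    obj <- get(nm, envir = envir, inherits = FALSE)
    if (is.data.frame(obj) && !is.data.table(obj)) {
      # Build the call setDT(<symbol>) and evaluate it where the symbol lives
      eval(substitute(setDT(x), list(x = as.name(nm))), envir)
    }
  }
  invisible(NULL)
}

# Usage on a scratch environment instead of the global workspace
e <- new.env()
e$df1 <- data.frame(A = 1, B = 2)
e$df2 <- data.frame(D = 3)
e$v   <- 1:3                      # non-data.frame objects are left alone

setDT_all(e)
evalq(df1[, rn := .I], e)         # add a column by reference inside `e`
```

Calling setDT_all() with no argument from the top level would act on the global environment, which matches the loop in this answer.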

3
votes

This should do the trick:

library(data.table) #Win R-3.5.1 x64 data.table_1.12.2
df1 <- data.frame(A=1, B=2)
df2 <- data.frame(D=3)
for (x in ls()) {
    if (is.data.frame(get(x))) {
        assign(x, as.data.table(get(x)))
    }
}
df1[, rn:=.I]

I guess (I am not sure, though) that the for/lapply loop works in a kind of environment of its own, which interferes with the by-reference semantics of data.table.
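
To probe that guess, here is a minimal sketch, assuming (from my reading of setDT(), not from anything stated in this thread) that setDT() finishes by binding the over-allocated table to the name it was called with. Inside a function or loop that name is the local variable, so the caller's object would not receive the over-allocation that := needs. convert_local is a hypothetical name used only for this illustration.

```r
library(data.table)

df <- data.frame(A = 1)

# Inside a function, setDT() only sees the local name `x`
convert_local <- function(x) {
  setDT(x)
  truelength(x)   # allocated column slots of the local, converted object
}

tl_local  <- convert_local(df)
tl_global <- truelength(df)   # the object still bound to `df` in the caller

# If the local value is larger, only the local object was over-allocated
print(c(local = tl_local, global = tl_global))
```

If the two values differ, that would explain the "Invalid .internal.selfref" warning: the caller's object was touched but never properly set up for adding columns by reference.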