
In a user-defined function I would like to do some data.table transformations; in particular, I want to create a new column with the := operator.

Suppose I want to create a new column called Sex that capitalizes the first letter of the column sex in my example data.frame df.

The output of my prepare function should be a data.table with the same name as before but with the additional "capitalised" column.

I tried several ways of doing this, but I always get the following warning (and no correct output):

Warning message: In [.data.table(x, , :=(Sex, stringr::str_to_title(sex))) : Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

library(data.table)
library(magrittr)
library(stringr)


df <- data.frame("age" = c(17, 04),
                 sex = c("m", "f"))
df %>% setDT()
is.data.table(df)

This is the easiest way to write my function:

prepare1<-function(x){
x[,Sex:=stringr::str_to_title(sex)]
}
prepare1(df)
#--> WARNING. (as block quoted above)


prepare2<-function(x){
  x[, `:=`(Sex, stringr::str_to_title(sex))]
}
prepare2(df)
#--> WARNING. (as block quoted above)


prepare3<-function(x){
  require(data.table)
  y <-as.data.table(list(x))
  y <- y[,Sex:=stringr::str_to_title(sex)]
  x <<- y
}
prepare3(df)

The last version does NOT throw the warning, but it creates a new dataset called x. I want to overwrite the dataset I passed to the function instead (if I have to go that way at all).

From the := help file I also know that I can use set, but I have not been able to adapt the command appropriately. If that could solve my problem, I would be happy to receive help on that, too! set(x, i = NULL, Sex, str_to_title(sex)) is apparently wrong ...
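For reference, a minimal sketch of the set() form, assuming the table has already been converted with setDT(df) by name. set() takes the column name as a string in j and the new values in value. Unlike the j expression inside [ ], bare column names are not looked up in the table, so sex has to be extracted explicitly. The function name prepare_set is made up for this example. set() still relies on the table being over-allocated, so it should not be expected to get around the selfref problem by itself.

```r
library(data.table)
library(stringr)

df <- data.frame(age = c(17, 4), sex = c("m", "f"), stringsAsFactors = FALSE)
setDT(df)  # convert by name so that df itself is over-allocated

# Hypothetical helper: add the column by reference with set().
prepare_set <- function(x) {
  # j is the column name as a string; value must be a plain vector,
  # so the existing column is pulled out with [[ ]].
  set(x, j = "Sex", value = stringr::str_to_title(x[["sex"]]))
  invisible(x)
}

prepare_set(df)
str(df)
```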

Upon request, and to make the discussion in the comments clearer, here is how my code produces the problem:

library(data.table)
library(stringr)


df <- data.frame("age" = c(17, 04), 
                      sex = c("m", "f"))

GetLastAssigned <- function(match = "<- *data.frame",
                            remove = " *<-.*") {
  f <- tempfile()
  savehistory(f)
  history <- readLines(f)
  unlink(f)
  match <- grep(match, history, value = TRUE)
  get(sub(remove, "", match[length(match)]))
}

#ok, no need for magrittr
setDT(GetLastAssigned())

#check the last function worked
is.data.table(df)

prepare1<-function(x){
x[,Sex:=stringr::str_to_title(sex)]
}

prepare1(GetLastAssigned())
# I get a warning and it does not work.
prepare1(df)
# I get a warning and it does not work, either.


#If I manually type setDT(df) everything works fine but I cannot type the "right" dfs at all the places where I need to do this transformation. 
The culprit appears to be magrittr. If you just do setDT(df) this works as intended. - Roland
If you look at the source of `%>%` you see quite a few functions that are good candidates for this kind of issue. - Roland
Thank you, but I need magrittr because in my real application df is set through another function. That is, "myotherfunction" returns df, so it needs to be myotherfunction %>% setDT() or setDT(myotherfunction). - canIchangethis
But have an upvote. This is a very well written question with a nice reproducible example. - Roland
@Roland Not sure if it's the same issue, but I have run into related problems github.com/Rdatatable/data.table/issues/1628 which links to stackoverflow.com/a/26072152 where Arun in 2014 closed with "The idea so far is to use setDT to convert to data.tables before providing it to a function. But I'd like that these cases be resolved" - Frank

1 Answer

A workaround along the OP's lines:

library(data.table)
library(stringr)

GetLastAssigned2 <- function(match = "<- *data.frame", remove = " *<-.*") {
  f <- tempfile()
  savehistory(f)
  history <- readLines(f)
  unlink(f)
  match <- grep(match, history, value = TRUE)
  nm <- sub(remove, "", match[length(match)])
  list(nm = as.name(nm), addr = address(get(nm)))
}

prepit <- function(x){
  x[,Sex:=stringr::str_to_title(sex)]
}

# usage
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
z <- GetLastAssigned2()
eval(substitute(setDT(x), list(x=z$nm)))

str(df) # it seemingly works, since there is a selfref

# usage 2
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
setDT(df)
prepit(df)
str(df) # works

# usage 3
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
z <- GetLastAssigned2()
eval(substitute(setDT(x), list(x=z$nm)))
eval(substitute(prepit(x), list(x=z$nm)))
str(df) # works

Some big caveats:

  • savehistory is only effective in interactive use, based on my reading of the docs
  • using regex on human input (code typed in interactively) is complicated and risky
  • even this workaround will fail if the data.table x passed to prepit has not been "pre-allocated" enough space for extra columns

The data.table interface is based on passing the name/symbol of the data.frame or data.table, rather than the value (which is what get provides), as explained by Arun, one of the data.table authors. Note that the address cannot be passed around either: z$addr soon fails to match address(df) in all examples above.
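A small sketch of the name-versus-value difference, using truelength() to inspect how many column slots are allocated. The variable names df2 and by_name_ok are made up for the illustration, and the exact numbers printed depend on the data.table version and the datatable.alloccol option, so none are shown.

```r
library(data.table)

# By name: setDT() sees the symbol `df` and can rebind the over-allocated
# table to that name in the calling environment.
df <- data.frame(age = c(17, 4), sex = c("m", "f"), stringsAsFactors = FALSE)
setDT(df)
by_name_ok <- truelength(df) > length(df)  # spare column slots exist

# By value: get() hands setDT() an expression, not a symbol. The class
# attribute is changed in place, but there is no name to which the
# over-allocated copy could be rebound.
df2 <- data.frame(age = c(17, 4), sex = c("m", "f"), stringsAsFactors = FALSE)
setDT(get("df2"))
is.data.table(df2)  # the class change is visible on df2
truelength(df2)     # compare with truelength(df)
```

If truelength(df2) is not larger than length(df2), the next := on df2 has no pre-allocated slot to use, which is consistent with the warning in the question.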


"If I manually type setDT(df) everything works fine but I cannot type the "right" dfs at all the places where I need to do this transformation."

One idea:

# helper to compose expressions
subit = function(cmd, df_nm) 
  do.call("substitute", list(cmd, list(x=as.name(df_nm))))

# list of expressions with x where the df name belongs
my_cmds = list(
  setDT  = quote(setDT(x)),
  prepit = quote(x[,Sex:=stringr::str_to_title(sex)])
)

# usage 4
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
df_nm = "df" # somehow get this... hopefully not via regex of command history
eval(subit(my_cmds$setDT, df_nm))
eval(subit(my_cmds$prepit, df_nm))

# usage 5
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
df_nm = "df" 
for(ex in lapply(my_cmds, subit, df_nm = df_nm)) eval(ex)

I think this is more aligned with recommended programmatic usage of data.table.

There is probably some way to wrap this in a function by altering the envir= argument to eval() but I'm not knowledgeable about that.
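One possible way to do that, as a rough sketch rather than a tested recommendation: build each expression with the name substituted in, then evaluate it in the environment that holds the table. The wrapper name run_cmds and its envir argument are made up here; envir defaults to the caller's environment via parent.frame().

```r
library(data.table)
library(stringr)

# Expressions with the placeholder x where the table name belongs.
cmds <- list(
  setDT  = quote(setDT(x)),
  prepit = quote(x[, Sex := stringr::str_to_title(sex)])
)

# Hypothetical wrapper: substitute the table's name for x, then evaluate
# each expression in the environment where the table lives.
run_cmds <- function(df_nm, cmds, envir = parent.frame()) {
  for (cmd in cmds) {
    ex <- do.call("substitute", list(cmd, list(x = as.name(df_nm))))
    eval(ex, envir = envir)
  }
  invisible(get(df_nm, envir = envir))
}

df <- data.frame(age = c(17, 4), sex = c("m", "f"), stringsAsFactors = FALSE)
run_cmds("df", cmds)
str(df)
```

Because the evaluated calls are setDT(df) and df[, Sex := ...] with the real symbol, data.table sees a name rather than a value, which is the point of the substitute approach. The name still has to be supplied as a string by the caller.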

Regarding how to get the name of the assignment target in nm <- data.frame(...), it looks like there are no good options. Maybe see "How do I access the name of the variable assigned to the result of a function within the function?" or "Get name of x when defining `(<-` operator".