I would like to ask whether the following behavior of data.table is a feature or a bug.

Given the data.table

library(data.table)

dt = data.table(
  group = c(rep('group1',5),rep('group2',5)),
  x = as.numeric(c(1:5, 1:5)),
  y = as.numeric(c(5:1, 5:1)),
  z = as.numeric(c(1,2,3,2,1, 1,2,3,2,1))
)

and a vector of column names containing a duplicate,

cols = c('y','x','y','z') # contains a duplicate column name

data.table rightly prevents me from assigning values to the duplicate column names:

dt[,(cols) := lapply(.SD,identity), .SDcols=cols] # Error (OK)

This seems like appropriate behavior to me, because it can help avoid unintended consequences. However, if I do the same assignment by groups,

dt[,(cols) := lapply(.SD,identity), .SDcols=cols, by=group] # No error!

then data.table doesn't throw an error. The assignment goes through, and one can verify that columns y and z have been interchanged.

This occurred for me in a large application while demeaning variables by group, and it was difficult to trace the source of this behavior. The recommendation for the user, of course, is to avoid duplicate column names when assigning, and to avoid providing duplicate names to .SDcols. However, would it not be better for data.table to throw an error in this situation?

Sounds like a bug report. - r2evans

1 Answer

This is a bug, which was fixed in version 1.12.4 of data.table. Here is the bug report: https://github.com/Rdatatable/data.table/issues/4874.

Other users with this issue can simply update their package version, for example using install.packages('data.table'). To check the current package version, load data.table and then look at the output of sessionInfo().
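
If you would rather check the version programmatically than read through sessionInfo(), something like the following should work. This is only a sketch: has_min_version is a made-up helper name, not part of data.table, and the "1.12.4" threshold is simply the release named above.

```r
# Hypothetical helper (not part of data.table): TRUE if `pkg` is installed
# and its version is at least `min_version`, FALSE otherwise.
has_min_version <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)  # package is not installed at all
  }
  utils::packageVersion(pkg) >= package_version(min_version)
}

# "1.12.4" is the release mentioned in this answer as containing the fix.
if (!has_min_version("data.table", "1.12.4")) {
  message("data.table is missing or older than 1.12.4; ",
          "consider running install.packages('data.table')")
}
```

packageVersion() returns a version object, so the comparison is done component by component rather than as a string comparison.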

But it would be wise to avoid supplying duplicate column names to .SDcols.
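
One way to follow that advice is to validate the vector of names before it reaches .SDcols. The sketch below uses only base R; check_unique_cols is a hypothetical helper name, and whether you want to stop or silently deduplicate depends on your application.

```r
cols <- c('y', 'x', 'y', 'z')  # same vector as in the question

# Hypothetical guard: stop with a readable message if any name is repeated.
check_unique_cols <- function(cols) {
  dups <- unique(cols[duplicated(cols)])
  if (length(dups) > 0L) {
    stop("duplicate column names: ", paste(dups, collapse = ", "))
  }
  invisible(cols)
}

# Option 1: fail fast. The error is caught here only so the script keeps going.
result <- tryCatch(check_unique_cols(cols),
                   error = function(e) conditionMessage(e))
print(result)

# Option 2: deduplicate explicitly, keeping the first occurrence of each name.
cols_unique <- unique(cols)

# The grouped assignment from the question would then become:
# dt[, (cols_unique) := lapply(.SD, identity), .SDcols = cols_unique, by = group]
```

Failing fast is probably the safer default in a large application, since a duplicate usually means the list of names was built incorrectly somewhere upstream.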