I would like to ask whether the following behavior of data.table is a feature or a bug.
Given the data.table
library(data.table)
dt = data.table(
  group = c(rep('group1',5),rep('group2',5)),
  x = as.numeric(c(1:5, 1:5)),
  y = as.numeric(c(5:1, 5:1)),
  z = as.numeric(c(1,2,3,2,1, 1,2,3,2,1))
)
and a vector of column names containing a duplicate,
cols = c('y','x','y','z') # contains a duplicate column name
data.table rightly prevents me from assigning values to the duplicate column names:
dt[,(cols) := lapply(.SD,identity), .SDcols=cols] # Error (OK)
This seems like appropriate behavior to me, because it can help avoid unintended consequences. However, if I do the same assignment by groups,
dt[,(cols) := lapply(.SD,identity), .SDcols=cols, by=group] # No error!
then data.table doesn't throw an error. The assignment goes through, and one can verify that columns y and z have been interchanged.
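For anyone who wants to check this on their own installation, here is a minimal sketch that compares the columns before and after the grouped assignment. The names `before`, `res`, and `unchanged` are introduced only for illustration, and the outcome may well differ between data.table versions, so no expected output is shown.

```r
library(data.table)

dt <- data.table(
  group = c(rep('group1', 5), rep('group2', 5)),
  x = as.numeric(c(1:5, 1:5)),
  y = as.numeric(c(5:1, 5:1)),
  z = as.numeric(c(1, 2, 3, 2, 1, 1, 2, 3, 2, 1))
)
cols <- c('y', 'x', 'y', 'z')  # contains a duplicate column name

# Keep an independent copy, because := modifies dt by reference.
before <- copy(dt)

# Run the grouped assignment; record whether it errors instead of stopping.
res <- tryCatch({
  dt[, (cols) := lapply(.SD, identity), .SDcols = cols, by = group]
  "no error"
}, error = function(e) conditionMessage(e))
print(res)

# For each column, report whether its values are identical to the original.
unchanged <- sapply(c('x', 'y', 'z'),
                    function(nm) identical(before[[nm]], dt[[nm]]))
print(unchanged)
```

Since `identity` should leave every column untouched, any `FALSE` in `unchanged` points to the duplicate name having corrupted the assignment.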
I ran into this in a large application while demeaning variables by group, and the source of the problem was difficult to trace. The obvious workaround, of course, is for the user to avoid duplicate column names when assigning and to avoid passing duplicate names to .SDcols. However, would it not be better for data.table to throw an error in this situation as well?
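In the meantime, a user-side guard along these lines could catch the problem before the assignment runs. This is only a sketch: `check_unique_cols` is a hypothetical helper written for this post, not part of data.table, and the demeaning step is just an example of the kind of grouped assignment described above.

```r
library(data.table)

# Hypothetical helper (not a data.table function): stop early if a vector
# of column names contains duplicates, otherwise return it invisibly.
check_unique_cols <- function(cols) {
  dup <- unique(cols[duplicated(cols)])
  if (length(dup) > 0L) {
    stop("duplicate column names: ", paste(dup, collapse = ", "))
  }
  invisible(cols)
}

dt <- data.table(
  group = c(rep('group1', 5), rep('group2', 5)),
  x = as.numeric(c(1:5, 1:5)),
  y = as.numeric(c(5:1, 5:1)),
  z = as.numeric(c(1, 2, 3, 2, 1, 1, 2, 3, 2, 1))
)
cols <- c('y', 'x', 'y', 'z')  # contains a duplicate column name

# The guard raises an error for the duplicated vector; capture the message.
msg <- tryCatch(check_unique_cols(cols),
                error = function(e) conditionMessage(e))
print(msg)

# Deduplicate, re-check, then demean each column within its group.
cols <- unique(cols)
check_unique_cols(cols)
dt[, (cols) := lapply(.SD, function(v) v - mean(v)),
   .SDcols = cols, by = group]
```

Calling the guard immediately before every `:=` with a computed `cols` vector makes the failure loud in the grouped case too, at the cost of one extra line per assignment.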