Getting started with the data.table
package (author/maintainer: Matt Dowle). Great package. I love that I can write dt[, x1]
instead of, say, dt[, dt$x1]
or df["x1"]
, for a data.table dt
, a column name x1
, a data.frame df
. Being able to pass column names directly is an attractive feature of data.table. But dispensing with quotes around a column name, (writing x1
instead of "x1"
) is not always feasible. Why?
Programming Question: Are there any reasons why it is not always possible to pass a vector of column names directly to a data.table or to the helper functions provided by the package? For instance, the subset
, merge
, and melt
functions have been rewritten for the data.table package, but while subset
can handle column names directly, merge
and melt
cannot (see below).
To clarify, my question is not when or how but why. There are excellent related discussions with very useful tips, e.g. Select / assign to data.table variables which names are stored in a character vector and r - passing variables as data.table column names . With these answers and a bit of trial and error, I'm able to find my way around the quote/unquote distinctions. My question is why is it not currently possible to always dispense with quotes around column names: is there a design to it? is it a transitional situation? are there programming difficulties?
Below, I give some examples and number the examples for clarity.
# load the package
library("data.table") # because I cannot do install.packages(data.table)!!
(i)
# make a data.table
set.seed(1)
dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5, 1, 5), 0), key = "id")
I can define the data.table with either id = 1:10
or "id" = 1:10
, but I must define the key with key = "id"
as key = id
does not work:
dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5, 1, 5), 0), key = id)
##Error in data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5, :
## object 'id' not found
You'd think finding 'id' should be rather easy for a key if it were looking for it among the column names? Would it be programmatically sound to be allowed to drop the quotes on the RHS of key
?
(ii)
I can subset
with a vector of columns or with a vector of column names:
subset(dt, select = c(x1, x3))
## x1 x3
##1: 1 2
##2: 2 2
##3: 3 3
##4: 4 5
##5: 5 2
subset(dt, select = c("x1", "x3"))
## x1 x3
##1: 1 2
##2: 2 2
##3: 3 3
##4: 4 5
##5: 5 2
Nice and flexible.
(iii)
I can merge
with a vector of column names:
merge(dt, dt, by = c("x1", "x2"))
## id x1 x2 x3
##1: 1 1 5 2
##2: 2 2 4 2
##3: 3 3 3 3
##4: 4 4 2 5
##5: 5 5 1 2
(silly example that was!) but not with a vector of the columns:
merge(dt, dt, by = c(x1, x2))
##Error in merge.data.table(dt, dt, by = c(x1, x2)) : object 'x1' not found
Is there something about merge
that prevents it from accepting a vector of columns the way subset
does?
(iv)
Likewise, melt
must take quoted column names (or integers corresponding to the column numbers).
The help description is specific that melt
accepts "character vectors", while the help for merge
simply states "vectors of column names," but clearly with merge
as with melt
character vectors are expected.
(v)
In the case of the j
argument, quoting variable names is not usually the correct approach:
# Good:
dt[, .(x1, x2)]
## x1 x2
##1: 1 5
##2: 2 4
##3: 3 3
##4: 4 2
##5: 5 1
# Bad
dt[, .("x1", "x2")]
## V1 V2
##1: x1 x2
# This feature is well documented in the FAQs
# FAQ 2.3: "I'm using c() in the j and getting strange results."
Note to the reader not at all familiar with data.tables
that .()
is a shorthand for list()
and that dt[, c(x1, x2)]
is unlikely to be the desired command here -- the j
argument of dt[i, j]
very much expects a list.
(vi)
However, within the j
argument of dt[i, j]
, the LHS of the "assignment by reference" operator :=
has a confusing convention.
If the LHS is a single column, it may be passed without quotes. But if it has multiple columns, they must be passed as a vector of quoted column names. The manual only says "a vector of column names", but experimentation suggests they must be quoted:
# Good:
dt[, c("x1", "x2") := NULL][]
## id x3
##1: 1 2
##2: 2 2
##3: 3 3
##4: 4 5
##5: 5 2
# Bad:
dt[, c(x1, x2) := NULL]
##Error in eval(expr, envir, enclos) : object 'x1' not found
The error message is not particularly enlightening. But now I remember the FAQ's advice, "If 2 or more columns are required, use list() or .() instead." Silly me, c(x1, x2)
couldn't work because there is no way to tell where x1
ends and x2
starts. However, .(x1, x2)
could work, couldn't it?
# Bad:
dt[, .(x1, x2) := NULL]
##Error in eval(expr, envir, enclos) : object 'x1' not found
No, all things considered, the LHS of :=
expects a vector of quoted column names. The manual ought to be updated or, if feasible, data.table
extended to accept lists of unquoted columns on the LHS.
Oh wait. To delete multiple column names, can I pass a list of quoted names to LHS? No. Lists are usually desirable, but not on the LHS of :=
. The error message is clear:
# Bad:
dt[, .("x1", "x2") := NULL][]
##Error in `[.data.table`(dt, , `:=`(.("x1", "x2"), NULL)) :
## LHS of := must be a symbol, or an atomic vector (column names or positions).
(vii)
The i
argument of dt[i]
is also designed to accept unquoted columns, i.e. an "expression of column names"
dt[.(x1, x2)]
## id x1 x2 x3 V2
##1: 1 1 5 2 5
##2: 2 2 4 2 4
##3: 3 3 3 3 3
##4: 4 4 2 5 2
##5: 5 5 1 2 1
Note that if the idea was to subset the two columns x1
and x2
, that ought to be done inside the j
argument, i.e. dt[,.(x1, x2)]
dt[.("x1", "x2")]
##Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
## typeof x.id (integer) != typeof i.V1 (character)
dt[c(x1, x2)]
##id x1 x2 x3
## 1: 1 1 5 2
## 2: 2 2 4 2
## 3: 3 3 3 3
## 4: 4 4 2 5
## 5: 5 5 1 2
## 6: 5 5 1 2
## 7: 4 4 2 5
## 8: 3 3 3 3
## 9: 2 2 4 2
##10: 1 1 5 2
dt[c("x1", "x2")]
##Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
## typeof x.id (integer) != typeof i.V1 (character)
I have shown here several situations where columns must be passed as x1
or as "x1"
and situations where both can be done. These differences can cause confusion to new users like me. I suspect there is more than one reason for these two approaches to coexist. I'd appreciate if someone could clarify the matter, for some of my examples if not for all of them.