21
votes

Getting started with the data.table package (author/maintainer: Matt Dowle). Great package. I love that I can write dt[, x1] instead of, say, dt[, dt$x1] or df["x1"], for a data.table dt, a column name x1, a data.frame df. Being able to pass column names directly is an attractive feature of data.table. But dispensing with quotes around a column name, (writing x1 instead of "x1") is not always feasible. Why?

Programming Question: Are there any reasons why it is not always possible to pass a vector of column names directly to a data.table or to the helper functions provided by the package? For instance, the subset, merge, and melt functions have been rewritten for the data.table package, but while subset can handle column names directly, merge and melt cannot (see below).

To clarify, my question is not when or how but why. There are excellent related discussions with very useful tips, e.g. Select / assign to data.table variables which names are stored in a character vector and r - passing variables as data.table column names . With these answers and a bit of trial and error, I'm able to find my way around the quote/unquote distinctions. My question is why is it not currently possible to always dispense with quotes around column names: is there a design to it? is it a transitional situation? are there programming difficulties?

Below, I give some examples and number the examples for clarity.

# load the package
library("data.table") # because I cannot do install.packages(data.table)!!

(i)

# make a data.table
set.seed(1)
dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5, 1, 5), 0), key = "id")

I can define the data.table with either id = 1:10 or "id" = 1:10, but I must define the key with key = "id" as key = id does not work:

dt <- data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5, 1, 5), 0), key = id)
##Error in data.table(id = 1:5, x1 = 1:5, x2 = 5:1, x3 = round(runif(5,  : 
##  object 'id' not found

You'd think finding 'id' should be rather easy for a key if it were looking for it among the column names? Would it be programmatically sound to be allowed to drop the quotes on the RHS of key?

(ii)

I can subset with a vector of columns or with a vector of column names:

subset(dt, select = c(x1, x3))
##   x1 x3
##1:  1  2
##2:  2  2
##3:  3  3
##4:  4  5
##5:  5  2

subset(dt, select = c("x1", "x3"))
##   x1 x3
##1:  1  2
##2:  2  2
##3:  3  3
##4:  4  5
##5:  5  2

Nice and flexible.

(iii)

I can merge with a vector of column names:

merge(dt, dt, by = c("x1", "x2"))
##       id x1 x2 x3
##1:  1  1  5  2
##2:  2  2  4  2
##3:  3  3  3  3
##4:  4  4  2  5
##5:  5  5  1  2

(silly example that was!) but not with a vector of the columns:

merge(dt, dt, by = c(x1, x2))
##Error in merge.data.table(dt, dt, by = c(x1, x2)) : object 'x1' not found

Is there something about merge that prevents it from accepting a vector of columns the way subset does?

(iv)

Likewise, melt must take quoted column names (or integers corresponding to the column numbers).

The help description is specific that melt accepts "character vectors", while the help for merge simply states "vectors of column names," but clearly with merge as with melt character vectors are expected.

(v)

In the case of the j argument, quoting variable names is not usually the correct approach:

# Good:
dt[, .(x1, x2)]
##   x1 x2
##1:  1  5
##2:  2  4
##3:  3  3
##4:  4  2
##5:  5  1

# Bad 
dt[, .("x1", "x2")]
##   V1 V2
##1: x1 x2
# This feature is well documented in the FAQs
# FAQ 2.3: "I'm using c() in the j and getting strange results."

Note to the reader not at all familiar with data.tables that .() is a shorthand for list() and that dt[, c(x1, x2)] is unlikely to be the desired command here -- the j argument of dt[i, j] very much expects a list.

(vi)

However, within the j argument of dt[i, j], the LHS of the "assignment by reference" operator := has a confusing convention.

If the LHS is a single column, it may be passed without quotes. But if it has multiple columns, they must be passed as a vector of quoted column names. The manual only says "a vector of column names", but experimentation suggests they must be quoted:

# Good:
dt[, c("x1", "x2") := NULL][]
##   id x3
##1:  1  2
##2:  2  2
##3:  3  3
##4:  4  5
##5:  5  2

# Bad:
dt[, c(x1, x2) := NULL]
##Error in eval(expr, envir, enclos) : object 'x1' not found

The error message is not particularly enlightening. But now I remember the FAQ's advice, "If 2 or more columns are required, use list() or .() instead." Silly me, c(x1, x2) couldn't work because there is no way to tell where x1 ends and x2 starts. However, .(x1, x2) could work, couldn't it?

# Bad:
dt[, .(x1, x2) := NULL]
##Error in eval(expr, envir, enclos) : object 'x1' not found

No, all things considered, the LHS of := expects a vector of quoted column names. The manual ought to be updated or, if feasible, data.table extended to accept lists of unquoted columns on the LHS.

Oh wait. To delete multiple column names, can I pass a list of quoted names to LHS? No. Lists are usually desirable, but not on the LHS of :=. The error message is clear:

# Bad:
dt[, .("x1", "x2") := NULL][]
##Error in `[.data.table`(dt, , `:=`(.("x1", "x2"), NULL)) : 
##  LHS of := must be a symbol, or an atomic vector (column names or positions).

(vii)

The i argument of dt[i] is also designed to accept unquoted columns, i.e. an "expression of column names"

dt[.(x1, x2)]
##   id x1 x2 x3 V2
##1:  1  1  5  2  5
##2:  2  2  4  2  4
##3:  3  3  3  3  3
##4:  4  4  2  5  2
##5:  5  5  1  2  1

Note that if the idea was to subset the two columns x1 and x2, that ought to be done inside the j argument, i.e. dt[,.(x1, x2)]

dt[.("x1", "x2")]
##Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
##  typeof x.id (integer) != typeof i.V1 (character)

dt[c(x1, x2)]
##id x1 x2 x3
## 1:  1  1  5  2
## 2:  2  2  4  2
## 3:  3  3  3  3
## 4:  4  4  2  5
## 5:  5  5  1  2
## 6:  5  5  1  2
## 7:  4  4  2  5
## 8:  3  3  3  3
## 9:  2  2  4  2
##10:  1  1  5  2

dt[c("x1", "x2")]
##Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
##  typeof x.id (integer) != typeof i.V1 (character)

I have shown here several situations where columns must be passed as x1 or as "x1" and situations where both can be done. These differences can cause confusion to new users like me. I suspect there is more than one reason for these two approaches to coexist. I'd appreciate if someone could clarify the matter, for some of my examples if not for all of them.

1

1 Answers

7
votes

(i), (iii) and (iv) sound like Feature Requests (FRs); see here (so, yes, it's partly due to data.table not having reached full maturity).

As to (v) you said "dt[, c(x1, x2)] is unlikely to be the desired command here", but in fact I have seen situations where that sort of use of c within j is what I'm after. Situations like (v) are what the with argument of [.data.table are for.

On (vi) and elsewhere, you suggest "The manual only says 'a vector of column names', but experimentation suggests they must be quoted"; but I think this is unambiguous. A vector of column names means a character vector, which c(x1,x2) is not, unless x1 and x2 are somewhere defined as character vectors themselves. You can also add a FR for documentation on GitHub.

I'm not sure what you're after in (vii), but in i, vectors of names are used for joins or keyed subsets (also a form of join); see the vignette on fast subsetting.