22
votes

I am trying to write some wrapper functions to reduce code duplication with data.table.

Here is an example using mtcars. First, set up some data:

library(data.table)
data(mtcars)
mtcars$car <- factor(gsub("(.*?) .*", "\\1", rownames(mtcars)), ordered=TRUE)
mtcars <- data.table(mtcars)

Now, here is what I would usually write to get a summary of counts by group. In this case I am grouping by car:

mtcars[, list(Total=length(mpg)), by="car"][order(car)]

      car Total
      AMC     1
 Cadillac     1
   Camaro     1
...
   Toyota     2
  Valiant     1
    Volvo     1

The complication is that, since the arguments i and j are evaluated in the frame of the data.table, one has to use eval(...) if you want to pass in variables:

This works:

group <- "car"
mtcars[, list(Total=length(mpg)), by=eval(group)]

But now I want to order the results by the same grouping variable. I can't get any variant of the following to give me correct results. Notice how I always get a single row of results, rather than the ordered set.

mtcars[, list(Total=length(mpg)), by=eval(group)][order(group)]
   car Total
 Mazda     2

I know why: it's because group is evaluated in the parent.frame, not the frame of the data.table.

How can I evaluate group in the context of the data.table?

More generally, how can I use this inside a function? I need the following function to give me all the results, not just the first row of data:

tableOrder <- function(x, group){
  x[, list(Total=length(mpg)), by=eval(group)][order(group)]
}

tableOrder(mtcars, "car")
3

3 Answers

14
votes

Gavin and Josh are right. This answer is only to add more background. The idea is that not only can you pass variable column names into a function like that, but expressions of column names, using quote().

group = quote(car)
mtcars[, list(Total=length(mpg)), by=group][order(group)]
      group Total
        AMC     1
   Cadillac     1
     ...
     Toyota     2
    Valiant     1
      Volvo     1

Although, admitedly more difficult to start with, it can be more flexible. That's the idea, anyway. Inside functions you need substitute(), like this :

tableOrder = function(x,.expr) {
    .expr = substitute(.expr)
    ans = x[,list(Total=length(mpg)),by=.expr]
    setkeyv(ans, head(names(ans),-1))    # see below re feature request #1780
    ans
}

tableOrder(mtcars, car)
      .expr Total
        AMC     1
   Cadillac     1
     Camaro     1
      ...
     Toyota     2
    Valiant     1
      Volvo     1

tableOrder(mtcars, substring(car,1,1))  # an expression, not just a column name
      .expr Total
 [1,]     A     1
 [2,]     C     3
 [3,]     D     3
 ...
 [8,]     P     2
 [9,]     T     2
[10,]     V     2

tableOrder(mtcars, list(cyl,gear%%2))   # by two expressions, so head(,-1) above
     cyl gear Total
[1,]   4    0     8
[2,]   4    1     3
[3,]   6    0     4
[4,]   6    1     3
[5,]   8    1    14

A new argument keyby was added in v1.8.0 (July 2012) making it simpler :

tableOrder = function(x,.expr) {
    .expr = substitute(.expr)
    x[,list(Total=length(mpg)),keyby=.expr]
}

Comments and feedback in the area of i,j and by variable expressions are most welcome. The other thing you can do is have a table where a column contains expressions and then look up which expression to put in i, j or by from that table.

11
votes

Use get(group) to refer to the object named in group:

> mtcars[, list(Total=length(mpg)), by=eval(group)][order(get(group))]
        car Total
        AMC     1
   Cadillac     1
     Camaro     1
   Chrysler     1
     Datsun     1
      Dodge     1
     Duster     1
    Ferrari     1
       Fiat     2
       Ford     1
      Honda     1
     Hornet     2
    Lincoln     1
      Lotus     1
   Maserati     1
      Mazda     2
       Merc     7
    Pontiac     1
    Porsche     1
     Toyota     2
    Valiant     1
      Volvo     1
cn      car Total
> # vs
> mtcars[, list(Total=length(mpg)), by=eval(group)][order(group)]
       car Total
[1,] Mazda     2

The reason order(get(group)) works is that the expression is evaluated in the frame of the data.table. There, get(group) will look for an find variable car. If you evaluate that in the global environment it does't exist

> get(group)
Error in get(group) : object 'car' not found

but it does in the frame where evaluation takes place. group doesn't exist there but following usual rules it searches back up the parent frames until it finds something that matches group, which is the global env in this case. As such you need to be careful about the name of the object you use as group in your real function - you don't want to use something that might match in the data.table object for example. Using something like .group as the function arg would be pretty safe I guess.

Here is your function, modified:

tableOrder <- function(x, .group){
  x[, list(Total=length(mpg)), by=eval(.group)][order(get(.group))]
}

> tableOrder(mtcars, "car")
        car Total
        AMC     1
   Cadillac     1
     Camaro     1
   Chrysler     1
     Datsun     1
....
10
votes

For the general question of how to control scoping within data.table, Gavin's answer has got you well covered.

To really take full advantage of the data.table package's strengths, though, you should be setting the key for your data.table objects. A key causes your data to be presorted so that rows from the same level (or combinations of levels) of the grouping factor(s) are stored in contiguous blocks of memory. This can in turn greatly speed up grouping operations compared to 'ad hoc by's of the sort used in your example. (Search for 'ad hoc' in the datatable-faq (warning, pdf) for more details).

In many situations (your example included) using keys also has the happy side-effect of simplifying the code needed to manipulate a data.table. Plus, it automatically outputs the results in the order specified by the key, which is often what you want as well.

First, if you will only be needing to subset by the 'car' column, you could simply do:

## Create data.table with a key
group <- "car"
mtcars <- data.table(mtcars, key = group)

## Outputs results in correct order
mtcars[, list(Total=length(mpg)), by = key(mtcars)]
        car Total
        AMC     1
   Cadillac     1
     Camaro     1
   Chrysler     1
     Datsun     1

Even if your key contains several columns, using a key still makes for simpler code (and you gain the speed-up that's likely your real reason for using data.table in the first place!):

group <- "car"
mtcars <- data.table(mtcars, key = c("car", "gear"))
mtcars[, list(Total=length(mpg)), by = eval(group)]

EDIT: A picky note of caution

If the by argument is used to perform grouping based on a column that is part of the key but that is not the first element of the key the order of the results may still need post processing. So, in the second example above, if key = c("gear", "car"), then "Dodge" sorts before "Datsun". In a situation like that, I might still prefer to reorder the key beforehand, rather than reorder the results after the fact. Perhaps Matthew Dowle will weigh in which of those two is preferred/faster.