How to pass a parameter by variable into data.table[J()]

Question

I'm brand new to the (completely marvelous) data.table package, and seem to have gotten stuck on a very basic, somewhat bizarre problem. I can't post the exact data set I'm working with, for which I apologize -- but I think the problem is simple enough to articulate that hopefully this will still be very clear.

Let's say I have a data.table like so, with key x:

set1
   x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a

I want to return a subset of set1 containing all rows where x == 1. This is wonderfully simple in data.table: set1[J(1)]. Bam. Done. I can also assign z <- 1, and call set1[J(z)]. Again: works great.
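
A minimal sketch that rebuilds the toy table above, assuming x is an integer key and y a character column. The construction itself is an assumption; only the printed rows are given in the example.

```r
library(data.table)

# Assumed construction of the toy table shown above
set1 <- data.table(x = c(1L, 1L, 1L, 2L), y = c("a", "b", "c", "a"))
setkey(set1, x)

# Keyed lookup with a literal value
by_literal <- set1[J(1L)]

# Same lookup through a variable; z is not a column name, so it is
# found in the calling scope
z <- 1L
by_variable <- set1[J(z)]
```

Both calls return the three rows where x == 1, because no column of set1 is named z.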

...except when I try to scale it up to my actual data set, which contains ~6M rows. When I call set1[J(1674)], I get back a 78-row return that's exactly what I'm looking for. But I need to be able to look up (literally) 4M of these subsets. When I assign the value I'm searching for to a variable, id <- 1674, and call set1[J(id)]... R nearly takes down my desktop.

Clearly something I don't understand is going on under the data.table hood, but I haven't been able to figure out what. Googling and slogging through Stack Overflow suggest that this should work. Out of pure whimsy, I've tried:

id <- quote(1674)
set1[J(eval(id))]

...but that is far, far worse. What... what's going on?

Can you provide a bit more detail on what R is printing out? — Señor O
Yes and no. Yes, I can provide more detail -- but R isn't printing out anything. According to top, when I call set1[J(id)], rsession starts using up to 97% of system memory. The box becomes functionally unusable until I manage to kill the rsession process some time later. This is by contrast to set1[J(1674)], which returns 78 rows as soon as I press enter. — Gastove

Matt Dowle · Accepted Answer · 2012-12-17T21:52:48

[ @mnel beat me to it as I was writing ...]

Almost certainly, one column of set1 happens to be called "id"; i.e.,

isTRUE("id" %in% names(set1))

causing set1[J(id)] to self join set1$id to set1, ignoring the id in calling scope.
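
A minimal sketch of that collision, on a hypothetical four-row table whose key column is named id (the values are made up for illustration). Inside [...], J(id) sees the column before the variable, so the lookup turns into a self join:

```r
library(data.table)

# Hypothetical toy data: the key column happens to be called "id"
set1 <- data.table(id = c(1674L, 1674L, 2000L, 2001L),
                   y  = c("a", "b", "c", "d"))
setkey(set1, id)

id <- 1674L  # variable in calling scope with the same name as the column

# `id` resolves to the column set1$id here, so every key value is joined
# back to set1 and duplicated keys multiply the rows
self_join <- set1[J(id)]

# A literal cannot be confused with a column
literal <- set1[J(1674L)]

nrow(self_join) > nrow(literal)
```

On a table with millions of rows and many rows per key, the same self join would produce a far larger result, which would be consistent with the memory blow-up described in the question.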

If so, there are several approaches to avoid scoping issues such as this :

.id = <your 4M ids>
set1[J(.id)]

or use the fact that a single name i is evaluated in calling scope :

JDT = data.table(id); set1[JDT]   # J() outside DT[...] was later removed; data.table() does the same job here

or that eval is eval'd in calling scope, too :

set1[eval(J(id))]

or, we do want to make this clearer, more robust and easier, so one thought is to add .. :

set1[..(J(id))]     # .. alias for eval

or perhaps :

set1[J(..id)]

where .. borrows its meaning from the file system's .., meaning one-level-up. If the .. was a prefix to symbols, you could then do something like :

DT[colB==..id]

where == is used there for illustration. In that example colB is expected to be a column name and ..id will find id in calling scope (one level up). The thinking is that this would make the programmer's intent quite clear to the reader of the code.
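
The three approaches that already work (a non-colliding variable name, a join table built outside [...], and eval) can be sketched on a hypothetical toy table as below. The data is made up, and data.table(id) stands in for J(id) outside [...], on the assumption that J() is not available there in the installed version.

```r
library(data.table)

# Hypothetical toy data with a key column named "id"
set1 <- data.table(id = c(1674L, 1674L, 2000L, 2001L),
                   y  = c("a", "b", "c", "d"))
setkey(set1, id)

id <- 1674L  # collides with the column name

# 1. Use a variable name that is not a column of set1
.id <- id
r1 <- set1[J(.id)]

# 2. Build the join table first; a single name in i is evaluated
#    in calling scope
JDT <- data.table(id)
r2 <- set1[JDT]

# 3. Wrap the call in eval(), which is evaluated in calling scope
r3 <- set1[eval(J(id))]
```

Each of r1, r2 and r3 should hold only the rows whose key is 1674, rather than the self join that set1[J(id)] produces.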
