20
votes

The data.table package provides many of the same table handling methods as SQL. If a table has a key, that key consists of one or more columns. But a table can't have more than one key, because it can't be sorted in two different ways at the same time.

In this example, X and Y are data.tables with a single key column "id"; Y also has a non-key column "x_id".

   X <- data.table(id = 1:5, a=4:8,key="id")
   Y <- data.table(id = c(1,1, 3,5,7), x_id=c(1,4:1), key="id")

The following syntax would join the tables on their keys:

  X[Y]

How can I translate the following SQL syntax to data.table code?

  select * from X join Y on X.id = Y.x_id; 

The closest that I have gotten is:

Y[X,list(id, x_id),by = x_id,nomatch=0]

However, this does not do the same inner join as the SQL statement.


Here is a more clear example in which the foreign key is y_id, and we want the join to look up values of Y2 where X2$y_id = Y2$id.

    X2 <- data.table(id = 1:5, y_id = c(1,1,2,2,2), key="id")
    Y2 <- data.table(id = 1:5, b = letters[1:5], key="id")

I would like to produce the table:

   id  y_id  b
    1     1 "a"
    2     1 "a"
    3     2 "b"
    4     2 "b"
    5     2 "b"

similar to what is done by the following kludge:

> merge(data.frame(X2), data.frame(Y2), by.x = "y_id", by.y = "id")
  y_id id b
1    1  1 a
2    1  2 a
3    2  3 b
4    2  4 b
5    2  5 b

However, when I do this:

    X2[Y2, 1:2,by = y_id]

I do not get the desired result:

    y_id V1
[1,]    1  1
[2,]    1  2
[3,]    2  1
[4,]    2  2
1
You can temporarily change X2 keys and set them to "y_id"; then perform a normal join Y2[X2] (or X2[Y2] depending on the direction) and then restore the previous key of X2.digEmAll
@digEmAll that is useful, I thought that was what the by argument does... but can this be applied to a multiple (>2) table join where the target (left) table has >1 foreign key?David LeBauer
I agree with @digEmAll: setkey(X2, y_id) followed by X2[Y2, nomatch=0] is all you need for your example. And this should work with several keys as well. However, I'm not really familiar with the foreign key syntax in SQL, so if you are struggling with more keys, could you extend your example?Christoph_J
"The following syntax would join the tables on their keys: X[Y] -- I think the equivalent SQL syntax would be: SELECT * FROM A NATURAL JOIN B;onedaywhen
I edited the top of the question to clear up terminology. Hope ok. Will answer ...Matt Dowle

1 Answers

18
votes

Good question. Note the following (admittedly buried) in ?data.table :

When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out.

So, the key here is that i doesn't have to be keyed. Only x must be keyed.

X2 <- data.table(id = 11:15, y_id = c(14,14,11,12,12), key="id")
     id y_id
[1,] 11   14
[2,] 12   14
[3,] 13   11
[4,] 14   12
[5,] 15   12
Y2 <- data.table(id = 11:15, b = letters[1:5], key="id")
     id b
[1,] 11 a
[2,] 12 b
[3,] 13 c
[4,] 14 d
[5,] 15 e
Y2[J(X2$y_id)]  # binary search for each item of (unsorted and unkeyed) i
     id b
[1,] 14 d
[2,] 14 d
[3,] 11 a
[4,] 12 b
[5,] 12 b

or,

Y2[SJ(X2$y_id)]  # binary merge of keyed i, see ?SJ
     id b
[1,] 11 a
[2,] 12 b
[3,] 12 b
[4,] 14 d
[5,] 14 d

identical(Y2[J(X2$y_id)], Y2[X2$y_id])
[1] FALSE