4
votes

I am learning data.table. I have difficulty converting the dplyr join syntax. Can you please recommend the data.table equivalence for the following test cases?

library(data.table)
library(dplyr)

dtProduct <- data.table(
    ProductID  = c(6, 33, 17, 88, 44, 51),
    ProductName= c("Shirt", "Helmet", "Gloves", "Towel", "Chair", "Detergent"),
    Price= c(25, 60, 10, 7.5, 135, 16),
    key = 'ProductID'
)

set.seed(20141216)
dtOrder <- data.table(
    OrderID    = sample(1001:9999, 12),
    CustomerID = sample(271:279, 12, replace=TRUE),
    # NOTE: some non-existent ProductID intentionally introduced
    ProductID  = sample(c(dtProduct[, ProductID], 155, 439), 12, replace=TRUE),
    Qty = sample(1:3, 12, replace=TRUE),
    key = 'OrderID'
)

> tables()
     NAME      NROW NCOL MB COLS                             KEY      
[1,] dtOrder     12    4  1 OrderID,CustomerID,ProductID,Qty OrderID  
[2,] dtProduct    6    3  1 ProductID,ProductName,Price      ProductID

> dtProduct
   ProductID ProductName Price
1:         6       Shirt  25.0
2:        17      Gloves  10.0
3:        33      Helmet  60.0
4:        44       Chair 135.0
5:        51   Detergent  16.0
6:        88       Towel   7.5
> dtOrder
    OrderID CustomerID ProductID Qty
 1:    1651        275         6   3
 2:    2726        272        88   2
 3:    3079        275        88   2
 4:    3168        274        17   1
 5:    4816        277        88   1
 6:    4931        278        51   1
 7:    5134        274       439   2
 8:    5265        272        33   3
 9:    7702        275        33   2
10:    7727        279       155   2
11:    8412        273        88   2
12:    9130        271        17   3

Case1: Show Order Details, no-match ProductID are hidden

dtOrder %>%
    inner_join(dtProduct, by="ProductID") %>%
    transmute(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)

   OrderID ProductID ProductName Qty Price ExtPrice
1     1651         6       Shirt   3  25.0     75.0
2     3168        17      Gloves   1  10.0     10.0
3     9130        17      Gloves   3  10.0     30.0
4     5265        33      Helmet   3  60.0    180.0
5     7702        33      Helmet   2  60.0    120.0
6     4931        51   Detergent   1  16.0     16.0
7     2726        88       Towel   2   7.5     15.0
8     3079        88       Towel   2   7.5     15.0
9     4816        88       Towel   1   7.5      7.5
10    8412        88       Towel   2   7.5     15.0

Case2: Show Order Details, INCLUDING no-match ProductID

dtOrder %>%
    left_join(dtProduct, by="ProductID") %>%
    transmute(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)

   OrderID ProductID ProductName Qty Price ExtPrice
1     1651         6       Shirt   3  25.0     75.0
2     3168        17      Gloves   1  10.0     10.0
3     9130        17      Gloves   3  10.0     30.0
4     5265        33      Helmet   3  60.0    180.0
5     7702        33      Helmet   2  60.0    120.0
6     4931        51   Detergent   1  16.0     16.0
7     2726        88       Towel   2   7.5     15.0
8     3079        88       Towel   2   7.5     15.0
9     4816        88       Towel   1   7.5      7.5
10    8412        88       Towel   2   7.5     15.0
11    7727       155          NA   2    NA       NA
12    5134       439          NA   2    NA       NA

Case3: Show Order Errors (Only no-match ProductID)

dtOrder %>%
    left_join(dtProduct, by="ProductID") %>%
    filter(is.na(ProductName)) %>%
    select(OrderID, ProductID, ProductName, Qty)

  OrderID ProductID ProductName Qty
1    7727       155          NA   2
2    5134       439          NA   2

Case4: Various Aggregates by ProductID, sort result by TotalSales descending

dtOrder %>%
    inner_join(dtProduct, by="ProductID") %>%
    group_by(ProductID) %>%
    summarize(OrderCount=n(), TotalQty=sum(Qty), TotalSales=sum(Qty*Price)) %>%
    arrange(desc(TotalSales))

  ProductID OrderCount TotalQty TotalSales
1        33          2        5      300.0
2         6          1        3       75.0
3        88          4        7       52.5
4        17          2        4       40.0
5        51          1        1       16.0


Case5: Various Aggregates by ProductID, sort result by TotalSales descending

  • NOTE1: This time, ProductName is displayed along with ProductID
  • NOTE2: sort by descending TotalSales no longer working (BUG?)

    dtOrder %>%
       inner_join(dtProduct, by="ProductID") %>%
       group_by(ProductID, ProductName) %>%
       summarize(OrderCount=n(), TotalQty=sum(Qty), TotalSales=sum(Qty*Price)) %>%
       arrange(desc(TotalSales))
    
      ProductID ProductName OrderCount TotalQty TotalSales
    1         6       Shirt          1        3       75.0
    2        17      Gloves          2        4       40.0
    3        33      Helmet          2        5      300.0
    4        51   Detergent          1        1       16.0
    5        88       Towel          4        7       52.5
    

Thank you very much in advance for any help.

2
Hi David, thanks for the edit. I don't know why the code for Case5 cannot be rendered correctly. How could you fix it? Also, I added "Hi datatable experts" in the top, but somehow this text is never accepted. Is there a max length in the post?Polymerase
Hey, you just need to add enough spaces :). I don't know why your edit wasn't accepted, maybe because we were editing at the same time? Either way, there is no real need to add this sentence to the question IMO.David Arenburg
@Polymerase Weel you want to learn data.table , but I don't see any data.table code here ( except the one which create the table), What have you so far tried?agstudy
@agstudy, I don't understand such a harsh comment. The test cases I presented here were carefully prepared. If you really read my post, you would see that each test case came with a WORKING code using dplyr which is pretty well written. The dplyr query is made with an underlying datatable object. I am wondering if a "native" datatable syntax would be better. In case you have an alternative solution, other than mine or eddi's below. Would you be kind enough to share your experience?Polymerase
@Polymerase, agtudy's comment is not really harsh. Per SO standards this question can be considered as "no research effort" - you show code from one tool/package, and ask for code of another, which could be considered off-topic. Your intentions are honest, I understand, but so are his.Arun

2 Answers

9
votes
setkey(dtOrder, ProductID)

(1-2)

# this will be literally what you wrote

dtProduct[dtOrder,
          list(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price),
          nomatch = 0 # or omit this to get (2)
         ]

# but I think you'd be better off with this
dtProduct[dtOrder][, ExtPrice := Qty*Price][]

(3)

# you can again take the literal direction:
dtProduct[dtOrder][!is.na(ProductName)][,
          list(OrderID, ProductID, ProductName, Qty)]

# but again I think you'd be better off with
dtOrder[!dtProduct]

(4-5)

dtProduct[dtOrder, nomatch = 0][,
          list(OrderCount=.N, TotalQty=sum(Qty), TotalSales=sum(Qty*Price)),
          by = list(ProductID, ProductName)][
          order(-TotalSales)]
8
votes

You should look at ?data.table and go through the examples there. It's a very nice way of learning. We're in the process of writing more detailed vignettes FR #944, scheduled for 1.9.8. But until then:


- data.table form

data.table's syntax is of the form:

x[i, j, by, ...] # i = where, j = select|modify|update, by = group by

- subset operations

When i is an integer or logical expression, we call it a subset operation. For example:

x[a > 1]

What does this do? The column a from the data.table x is checked for the condition > 1, which results in a logical vector = length(a). And those rows where the condition evaluates to TRUE are identified, and all the columns corresponding to those rows are returned.

- joins as extension of subsets

concept

In data.table, joins can be seen as a natural extension of subsets. That is, we can think of a join as a subset operation, but using another data.table. This is what we mean by having a consistent syntax -- the form x[i, j, by] is intact.

The first step towards joining in data.tables is setting keys. This can be accomplished using the setkey() function whose purpose is two-fold:

  • reorder the rows of the data.table in increasing order (ascending) by the columns provided. This is done by reference to be memory efficient.

  • mark those columns provided as key columns on which a joins can be performed (if and when you perform a join).

Note that currently, for a join of the form x[i], x needs to have key column(s) set absolutely. i may or may not have it's key set.

  • If i also has it's key column set, then joins are performed by matching the first key column of i with the first key column of x, second with second, etc..

  • If i doesn't have key columns set, then the first column of i is matched to first key column of x, second column of i with second key column of x and so on..

Yes we are aware that it'd be nice to match by column names when i doesn't have key columns but we just didn't have the time to get to it yet.

The second and final step is to perform the join :-).

But how is a join operation an extension of subset? When i is a data.table, for each row in i, it finds the matching row indices in x by matching on x's key columns that we've set. This returns a set of row indices of x for each row in i (or NA if no match is found).

Now we have the matching row indices. All we have to return are the columns. But since i is also a data.table, it might have additional columns as well. So, we return the columns of both x and i for those matching row indices.

example

Here's a small example to help you internalise the concept before we move on. Consider the two data.tables X and Y as shown below:

X = data.table(a=c(1,1,1,2,2,5,6), b=1:7, key="a")
#    a b
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 4
# 5: 2 5
# 6: 5 6
# 7: 6 7

key(X)
# [1] "a"

Y = data.table(a=c(6,2), c=letters[1:2])
#    a c
# 1: 6 a
# 2: 2 b

key(Y)
# NULL

# join
X[Y]
#    a b c
# 1: 6 7 a
# 2: 2 4 b
# 3: 2 5 b

Note that we have used the key= argument in data.table() function to set the key columns directly. Alternatively we could have just created X without keys and then setkey(X, a).

The function key() returns the key columns if any. If no key is set, it returns NULL.

Y doesn't have key columns, and X has only one key column. So join is done using first column a of Y and first key column a of X. a=6 in Y matches with row 7 of X and a=2 on rows 4 and 5.

You can check this by using the argument which = TRUE:

X[as.data.table(6), which=TRUE] # [1] 7
X[as.data.table(2), which=TRUE] # [1] 4 5

This is also a handy (and fast) way to subset a data.table, but using data.table's fast binary search based subset. Since this operation is quite useful, data.table provides an easy way of doing this instead of having to write as.data.table() each time.

# faster way of doing X[a == 6] on data.table with 'a' as key column
X[J(6)] # J for Join
X[J(2)]

# (or)

X[.(6)] # . is an alias for J
X[.(2)]

I think this should further help towards understanding what we mean by subsets are extensions of joins.


back to your question

Now, let's forget for a moment about all these "left", "right", "inner", "outer" etc.. and look at the actual operation you want to perform. You've two data.tables - dtP and dtO (shortened for convenience).

case 1:

For each row of column ProductID in dtO, you want to find the matching rows in dtP, but you don't want to return NA. And you want to do also select the columns you want to output, along with some calculations.

That is, i = dtO and x = dtP. Key column for dtP is set correctly. But the key column for dtO is orderID. If we joine as such, it'll be joining orderID from dtO against productID from dtP, which is wrong.

Either we have to set key of dtO to productID or set key of dtO to NULL and move the column productID as the first column (until matching by names is implemented). Let's set key to productID here:

# set key
setkey(dtO, ProductID)
# join
dtP[dtO, .(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price), nomatch=0L]

It should be quite obvious what this does now. On only matching row indices, extract all these columns (including expression).

Why should we join first and select/aggregate after?

case 2:

Same as case 1, but you need even non matching rows. Key is already set properly from case 1.

# join
dtP[dtO, .(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)]

Return all rows of orderID even if there's no match, and all the columns specified (including expression).

case 3:

You want all the rows in dtO that has no match with dtP.

not-join or anti-join
dtO[!dtP]

Find all rows where dtP's key column matches with dtO. Return all other rows from dtO. If necessary, you can specify all necessary columns in j as well.

case 4:

Read about by=.EACHI from this post.

You are joining by productID and then aggregating by the same column. But why do we need that intermediate result? It's totally unnecessary and waste of memory and computational time! Instead we can make use of by=.EACHI which will evaluate the j-expression on matching rows for each row in i.

dtO[dtP, .(.N, sQty = sum(Qty), sSales = sum(Qty*Price)), by=.EACHI, nomatch=0L][order(-sSales)]

To test your understanding, try to find out why we did not do dtP[dtO, ...] here..

case 5:

Identical to @eddi's.


I personally find it more natural to think in terms of the actual task I want to perform instead of figuring out the type of join function associated with the task I want to perform (I can never remember which data.table is "left" and which one is "right"... and btw, what the heck is "inner", "outer" and "full outer" join anyways?).

HTH