convert dplyr join syntax into pure data.table syntax

Question

I am learning data.table. I have difficulty converting the dplyr join syntax. Can you please recommend the data.table equivalence for the following test cases?

library(data.table)
library(dplyr)

dtProduct <- data.table(
    ProductID  = c(6, 33, 17, 88, 44, 51),
    ProductName= c("Shirt", "Helmet", "Gloves", "Towel", "Chair", "Detergent"),
    Price= c(25, 60, 10, 7.5, 135, 16),
    key = 'ProductID'
)

set.seed(20141216)
dtOrder <- data.table(
    OrderID    = sample(1001:9999, 12),
    CustomerID = sample(271:279, 12, replace=TRUE),
    # NOTE: some non-existent ProductID intentionally introduced
    ProductID  = sample(c(dtProduct[, ProductID], 155, 439), 12, replace=TRUE),
    Qty = sample(1:3, 12, replace=TRUE),
    key = 'OrderID'
)

> tables()
     NAME      NROW NCOL MB COLS                             KEY      
[1,] dtOrder     12    4  1 OrderID,CustomerID,ProductID,Qty OrderID  
[2,] dtProduct    6    3  1 ProductID,ProductName,Price      ProductID

> dtProduct
   ProductID ProductName Price
1:         6       Shirt  25.0
2:        17      Gloves  10.0
3:        33      Helmet  60.0
4:        44       Chair 135.0
5:        51   Detergent  16.0
6:        88       Towel   7.5
> dtOrder
    OrderID CustomerID ProductID Qty
 1:    1651        275         6   3
 2:    2726        272        88   2
 3:    3079        275        88   2
 4:    3168        274        17   1
 5:    4816        277        88   1
 6:    4931        278        51   1
 7:    5134        274       439   2
 8:    5265        272        33   3
 9:    7702        275        33   2
10:    7727        279       155   2
11:    8412        273        88   2
12:    9130        271        17   3

Case1: Show Order Details, no-match ProductID are hidden

dtOrder %>%
    inner_join(dtProduct, by="ProductID") %>%
    transmute(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)

   OrderID ProductID ProductName Qty Price ExtPrice
1     1651         6       Shirt   3  25.0     75.0
2     3168        17      Gloves   1  10.0     10.0
3     9130        17      Gloves   3  10.0     30.0
4     5265        33      Helmet   3  60.0    180.0
5     7702        33      Helmet   2  60.0    120.0
6     4931        51   Detergent   1  16.0     16.0
7     2726        88       Towel   2   7.5     15.0
8     3079        88       Towel   2   7.5     15.0
9     4816        88       Towel   1   7.5      7.5
10    8412        88       Towel   2   7.5     15.0

Case2: Show Order Details, INCLUDING no-match ProductID

dtOrder %>%
    left_join(dtProduct, by="ProductID") %>%
    transmute(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)

   OrderID ProductID ProductName Qty Price ExtPrice
1     1651         6       Shirt   3  25.0     75.0
2     3168        17      Gloves   1  10.0     10.0
3     9130        17      Gloves   3  10.0     30.0
4     5265        33      Helmet   3  60.0    180.0
5     7702        33      Helmet   2  60.0    120.0
6     4931        51   Detergent   1  16.0     16.0
7     2726        88       Towel   2   7.5     15.0
8     3079        88       Towel   2   7.5     15.0
9     4816        88       Towel   1   7.5      7.5
10    8412        88       Towel   2   7.5     15.0
11    7727       155          NA   2    NA       NA
12    5134       439          NA   2    NA       NA

Case3: Show Order Errors (Only no-match ProductID)

dtOrder %>%
    left_join(dtProduct, by="ProductID") %>%
    filter(is.na(ProductName)) %>%
    select(OrderID, ProductID, ProductName, Qty)

  OrderID ProductID ProductName Qty
1    7727       155          NA   2
2    5134       439          NA   2

Case4: Various Aggregates by ProductID, sort result by TotalSales descending

dtOrder %>%
    inner_join(dtProduct, by="ProductID") %>%
    group_by(ProductID) %>%
    summarize(OrderCount=n(), TotalQty=sum(Qty), TotalSales=sum(Qty*Price)) %>%
    arrange(desc(TotalSales))

  ProductID OrderCount TotalQty TotalSales
1        33          2        5      300.0
2         6          1        3       75.0
3        88          4        7       52.5
4        17          2        4       40.0
5        51          1        1       16.0

Case5: Various Aggregates by ProductID, sort result by TotalSales descending

NOTE1: This time, ProductName is displayed along with ProductID

NOTE2: sort by descending TotalSales no longer working (BUG?)

dtOrder %>%
   inner_join(dtProduct, by="ProductID") %>%
   group_by(ProductID, ProductName) %>%
   summarize(OrderCount=n(), TotalQty=sum(Qty), TotalSales=sum(Qty*Price)) %>%
   arrange(desc(TotalSales))

  ProductID ProductName OrderCount TotalQty TotalSales
1         6       Shirt          1        3       75.0
2        17      Gloves          2        4       40.0
3        33      Helmet          2        5      300.0
4        51   Detergent          1        1       16.0
5        88       Towel          4        7       52.5

Thank you very much in advance for any help.

Hi David, thanks for the edit. I don't know why the code for Case5 cannot be rendered correctly. How could you fix it? Also, I added "Hi datatable experts" in the top, but somehow this text is never accepted. Is there a max length in the post? — Polymerase
Hey, you just need to add enough spaces :). I don't know why your edit wasn't accepted, maybe because we were editing at the same time? Either way, there is no real need to add this sentence to the question IMO. — David Arenburg
@Polymerase Weel you want to learn data.table , but I don't see any data.table code here ( except the one which create the table), What have you so far tried? — agstudy
@agstudy, I don't understand such a harsh comment. The test cases I presented here were carefully prepared. If you really read my post, you would see that each test case came with a WORKING code using dplyr which is pretty well written. The dplyr query is made with an underlying datatable object. I am wondering if a "native" datatable syntax would be better. In case you have an alternative solution, other than mine or eddi's below. Would you be kind enough to share your experience? — Polymerase
@Polymerase, agtudy's comment is not really harsh. Per SO standards this question can be considered as "no research effort" - you show code from one tool/package, and ask for code of another, which could be considered off-topic. Your intentions are honest, I understand, but so are his. — Arun

eddi eddi · Accepted Answer · 2014-12-18T00:00:59

setkey(dtOrder, ProductID)

(1-2)

# this will be literally what you wrote

dtProduct[dtOrder,
          list(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price),
          nomatch = 0 # or omit this to get (2)
         ]

# but I think you'd be better off with this
dtProduct[dtOrder][, ExtPrice := Qty*Price][]

(3)

# you can again take the literal direction:
dtProduct[dtOrder][!is.na(ProductName)][,
          list(OrderID, ProductID, ProductName, Qty)]

# but again I think you'd be better off with
dtOrder[!dtProduct]

(4-5)

dtProduct[dtOrder, nomatch = 0][,
          list(OrderCount=.N, TotalQty=sum(Qty), TotalSales=sum(Qty*Price)),
          by = list(ProductID, ProductName)][
          order(-TotalSales)]

convert dplyr join syntax into pure data.table syntax

2 Answers

- data.table form

- subset operations

- joins as extension of subsets

concept

example

back to your question

case 1:

case 2:

case 3:

case 4:

case 5: