This is a bit of a philosophical question about data.table join syntax. I am finding more and more uses for data.tables, but still learning...
The join format X[Y]
for data.tables is very concise, handy and efficient, but as far as I can tell, it only supports inner joins and right outer joins. To get a left or full outer join, I need to use merge
:
X[Y, nomatch = NA]
-- all rows in Y -- right outer join (default)X[Y, nomatch = 0]
-- only rows with matches in both X and Y -- inner joinmerge(X, Y, all = TRUE)
-- all rows from both X and Y -- full outer joinmerge(X, Y, all.x = TRUE)
-- all rows in X -- left outer join
It seems to me that it would be handy if the X[Y]
join format supported all 4 types of joins. Is there a reason only two types of joins are supported?
For me, the nomatch = 0
and nomatch = NA
parameter values are not very intuitive for the actions being performed. It is easier for me to understand and remember the merge
syntax: all = TRUE
, all.x = TRUE
and all.y = TRUE
. Since the X[Y]
operation resembles merge
much more than match
, why not use the merge
syntax for joins rather than the match
function's nomatch
parameter?
Here are code examples of the 4 join types:
# sample X and Y data.tables
library(data.table)
X <- data.table(t = 1:4, a = (1:4)^2)
setkey(X, t)
X
# t a
# 1: 1 1
# 2: 2 4
# 3: 3 9
# 4: 4 16
Y <- data.table(t = 3:6, b = (3:6)^2)
setkey(Y, t)
Y
# t b
# 1: 3 9
# 2: 4 16
# 3: 5 25
# 4: 6 36
# all rows from Y - right outer join
X[Y] # default
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
X[Y, nomatch = NA] # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
merge(X, Y, by = "t", all.y = TRUE) # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
identical(X[Y], merge(X, Y, by = "t", all.y = TRUE))
# [1] TRUE
# only rows in both X and Y - inner join
X[Y, nomatch = 0]
# t a b
# 1: 3 9 9
# 2: 4 16 16
merge(X, Y, by = "t") # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
merge(X, Y, by = "t", all = FALSE) # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
identical( X[Y, nomatch = 0], merge(X, Y, by = "t", all = FALSE) )
# [1] TRUE
# all rows from X - left outer join
merge(X, Y, by = "t", all.x = TRUE)
# t a b
# 1: 1 1 NA
# 2: 2 4 NA
# 3: 3 9 9
# 4: 4 16 16
# all rows from both X and Y - full outer join
merge(X, Y, by = "t", all = TRUE)
# t a b
# 1: 1 1 NA
# 2: 2 4 NA
# 3: 3 9 9
# 4: 4 16 16
# 5: 5 NA 25
# 6: 6 NA 36
Update: data.table v1.9.6 introduced the on=
syntax, which allows ad hoc joins on fields other than the primary key. jangorecki's answer to the question How to join (merge) data frames (inner, outer, left, right)? provides some examples of additional join types that data.table can handle.
Y[X]
if you want the left outer join ofX[Y]
andrbind(Y[X],X[Y])
if you want a full outer join – mnelunique()
approach below for the full join is preferable torbind(Y[X],X[Y])
, since the rbind would involve copying the table. Is that right? – Douglas Clarkunique(c(unique(X[,t]), unique(Y[,t]))
-- this should be more memory efficient as it is only combining two lists that are going to be less than or equal to the number of rows in X and Y. – mnel