36
votes

I have two data.table X and Y.

columns in X: area, id, value
columns in Y: ID, price, sales

Create the two data.tables:

X = data.table(area=c('US', 'UK', 'EU'),
               id=c('c001', 'c002', 'c003'),
               value=c(100, 200, 300)
              )

Y = data.table(ID=c('c001', 'c002', 'c003'),
               price=c(500, 200, 400),
               sales=c(20, 30, 15)
              )

And I set keys for X and Y:

setkey(X, id)
setkey(Y, ID)

Now I try to join X and Y by id in X and ID in Y:

merge(X, Y)
merge(X, Y, by=c('id', 'ID'))
merge(X, Y, by.x='id', by.y='ID')

All raised error saying that column names in the by argument invalid.

I referred to the manual of data.table and found the merge function not supporting by.x and by.y arguments.

How could I join two data.tables by different column names without changing the column names?

Append:
I managed to join the two tables by X[Y], but why merge function fails in data.table?

4
@akrun Thank you. I did checked the two posts. I managed to join X and Y by X[Y], but I still could not accomplish it by using merge function. I am quite confused about why merge fails in data.table.Zelong
by.x and by.y are not yet implemented for data.tables. A FR is filed. Check stackoverflow.com/questions/14069796/…Fabian Gehring
the by arguments are available in data.table v1.9.6 on CRAN as of Sep 2015. See my answer for use.tospig

4 Answers

17
votes

OUTDATED


Use this operation:

X[Y]
#    area   id value price sales
# 1:   US c001   100   500    20
# 2:   UK c002   200   200    30
# 3:   EU c003   300   400    15

or this operation:

Y[X]
#      ID price sales area value
# 1: c001   500    20   US   100
# 2: c002   200    30   UK   200
# 3: c003   400    15   EU   300

Edit after you edited your question, I read Section 1.12 of the FAQ: "What is the didifference between X[Y] and merge(X,Y)?", which led me to checkout ?merge and I discovered there are two different merge functions depending upon which package you are using. The default is merge.data.frame but data.table uses merge.data.table. Compare

merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") : 
# A non-empty vector of column names for `by` is required.

with

merge.data.frame(X, Y, by.x = "id", by.y = "ID")
#     id area value price sales
# 1 c001   US   100   500    20
# 2 c002   UK   200   200    30
# 3 c003   EU   300   400    15

Edit for completeness based upon a comment by @Michael Bernsteiner, it looks like the data.table team is planning on implementing by.x and by.y into the merge.data.table function, but hasn't done so yet.

36
votes

As of data.table version 1.9.6 (on CRAN on sep 2015) you can specify the by.x and by.y arguments in data.table::merge

merge(x=X, y=Y, by.x="id", by.y="ID")[]
#     id area value price sales
#1: c001   US   100   500    20
#2: c002   UK   200   200    30
#3: c003   EU   300   400    15

However, in data.table 1.9.6 you can also specfy the on argument in the X[Y] notation

X[Y] syntax can now join without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column "y" of DT2 with "x" of DT1. DT1[DT2, on="y"] would join column "y" of both data.tables.

X[Y, on=c(id = "ID")]
#   area   id value price sales
#1:   US c001   100   500    20
#2:   UK c002   200   200    30
#3:   EU c003   300   400    15

this answer by the data.table author has more details

5
votes

Merge fails when you use by.x and by.y with data.table. Taking your data:

> merge(X,Y, by.x='id', by.y='ID')
Error in merge.data.table(X, Y, by.x = "id", by.y = "ID")

You can use data.table with merge , but you need to use by argument for joining (so rename the columns to have the same colnames)

Y = setNames(Y,c('id','price','sales'))

This will still not work:

merge(X,Y, by.x='id', by.y='id')
Error in merge.data.table(X, Y, by.x = "id", by.y = "id") :

But this will work:

> merge(X,Y, by='id')
#     id area value price sales
#1: c001   US   100   500    20
#2: c002   UK   200   200    30
#3: c003   EU   300   400    15

Alternatively, you would need to convert data.table to data.frame in order to use merge with by.x and by.y arguments:

merge(data.frame(X), data.frame(Y), by.x='id', by.y='ID')
2
votes

You can also merge using multiple columns having different names. see example below

# create data frame authors
authors <- data.frame(
FirstName=c("Lorne", "Loren", "Robin",
              "Robin", "Billy"),
LastName=c("Green", "Jaye", "Green",
             "Howe", "Jaye"),
Age=c(82, 40, 45, 2, 40),
Income=c(1200000, 40000, 25000, 0, 27500),
Home=c("California", "Washington", "Washington",
    "Alberta", "Washington"))

# create data frame books Note First name in authors is same as AuthorFirstname same thing with lastname.
books <- data.frame(
        AuthorFirstName=c("Lorne", "Loren", "Loren",
            "Loren", "Robin", "Rich"),
        AuthorLastName=c("Green", "Jaye", "Jaye", "Jaye",
            "Green", "Calaway"),
        Book=c("Bonanza", "Midwifery", "Gardening",
        "Perennials", "Who_dun_it?", "Support"))

merge(authors, books, by.x=c("FirstName", "LastName"),
      by.y=c("AuthorFirstName", "AuthorLastName"),
      all.x=TRUE)