3
votes

I have two data tables that I want to merge/join based on values in two columns that can appear in reverse order in the two data tables. Here are two example data tables:

library(data.table)
# df1
col1 <- c("aa", "bb", "cc", "dd") 
col2 <- c("bb", "zz", "dd", "ff") 
x <- c(130, 29, 122, 85)
dt1 <- data.table(col1, col2, x)

   col1  col2  x
1:   aa   bb 130
2:   bb   zz  29
3:   cc   dd 122
4:   dd   ff  85

# df2
col1 <- c("zz", "bb", "cc", "ff") 
col2 <- c("bb", "aa", "dd", "dd") 
y <- c(34, 567, 56, 101)
dt2 <- data.table(col1, col2, y)

    col1 col2  y
1:   zz   bb  34
2:   bb   aa 567
3:   cc   dd  56
4:   ff   dd 101

So the values in col1 and col2 taken together are the same for both data tables, but the distribution is different. E.g. aa is in col1 in dt1, but in col2 in dt2. I want to merge/join the data tables based on pairs of col1 and col2, but they might be in reverse order in the other data table. (Note that simply sorting them doesn't work.)

This means that the merge/join etc. must be able to 'see' that the pair aa+bb in dt1 occurs as bb+aa in dt2 and assign the correct value of dt2, i.e. the desired output is either:

   col1 col2   x   y
1:   aa   bb 130 567
2:   bb   zz  29  34
3:   cc   dd 122  56
4:   dd   ff  85 101

or this (i.e. whether the order of dt1 or dt2 is retained doesn't matter):

   col1 col2   x   y
1:   zz   bb  29  34
2:   bb   aa 130 567
3:   cc   dd 122  56
4:   ff   dd  85 101

My original data tables have approx. 3 million rows (yes, they're huge), so doing anything by hand is out of the question. I've looked around here, but I couldn't find any solution that is applicable to my case. Does anyone know how to do this?

Any hints are greatly appreciated!

3
dt2[col1 > col2, c("col1", "col2") := .(col2, col1)]; dt1[dt2, on=.(col1, col2)] works. Or you could use := to add the column to dt1 as sirallen suggested.Frank
@Frank, this works! Thanks so much! Do you maybe want to post that as an answer?SandraA.
Np, glad it works :) Feel free to edit your answer with it or something similar.Frank

3 Answers

3
votes

You can do the following:

dt1[dt2, on=.(col1, col2), y:= y]

dt1[dt2, on=.(col1==col2, col2==col1), y:= i.y]

> dt1
#    col1 col2   x   y
# 1:   aa   bb 130 567
# 2:   bb   zz  29  34
# 3:   cc   dd 122  56
# 4:   dd   ff  85 101
2
votes

So, we have two solutions that work!

Version 1: Adapted from Frank's comment above:

 library(dplyr)
 final <- dt2[col1 > col2, c("col1", "col2") := .(col2, col1)]
 final <- dt1[dt2, on=.(col1, col2)]
 final <- select(final, col1, col2, x, y) # select relevant columns
 final
  col1 col2   x   y
1:   bb   zz  29  34
2:   aa   bb 130 567
3:   cc   dd 122  56
4:   dd   ff  85 101

Version2: This is just a tweak of PritamJ's answer that simplifies a few things and makes this solution more applicable for large data tables. Hope it helps other people as well!

library(dplyr)
dt1$pairs <- paste(dt1$col1, dt1$col2) # creates new column with col1 and col2 
merged into one
dt2$pairs <- paste(dt2$col1, dt2$col2) # same here
dt2$revpairs <- paste(dt2$col2, dt2$col1) # creates new column with reverse pairs

f1 <- merge(dt1, dt2, by="pairs") # merge by pairs as they are in dt1
f1 <- select(f1, col1.x, col2.x, x, y) # select by name (easier for big dt) 

f2 <- merge(dt1, dt2, by.x = "pairs", by.y = "revpairs") # merge by pairs and reverse pairs
colnames(f2)[ncol(f2)] <- "revpairs" # rename last column because it has the same name as the first, which can cause errors
f2 <- select(f2, col1.x, col2.x, x, y) 


final <- bind_rows(f2, f1) # bind the two together
colnames(final)[1:2] <- c("col1", "col2") # this is not necessary, just for clarity
final
   col1 col2   x   y
1:   aa   bb 130 567
2:   bb   zz  29  34
3:   dd   ff  85 101
4:   cc   dd 122  56
1
votes

Didn't able to find any straight answer , so tried the below code. Hope it will help

require(stringi)
require(data.table)
require(dplyr)
dt1$as <- paste(dt1$col1,dt1$col2)
dt2$as <- paste(dt2$col1,dt2$col2)
dt2$as1 <- stringi::stri_reverse(dt2$as)

f1 <- merge(dt1,dt2,by="as")
f1 <- subset(f1,select=c(2,3,4,7))
f1 <- setnames(f1,c("col1.x","col2.x"),c("Col1","Col2"))
f2 <- merge(dt1,dt2,by.x = "as",by.y = "as1")
f2 <- subset(f2,select=c(2,3,4,7))
f2 <- setnames(f2,c("col1.x","col2.x"),c("Col1","Col2"))
final <- bind_rows(f2,f1)

final
    Col1 Col2   x   y
1:   aa   bb 130 567
2:   bb   zz  29  34
3:   dd   ff  85 101
4:   cc   dd 122  56