I'm using R to identify paired values in two columns of a data frame that are identical to paired values in two columns of another data frame. I thought I found a solution, but the result is not quite what I expected, and am wondering if anyone might be able to explain why.
df1 looks like this:
Col1 Col2 Col3
A 1000 5
A 1500 10
A 2000 15
A 2500 20
B 900 5
B 1200 10
B 1500 15
C 1000 5
C 1200 10
C 1700 15
C 2200 20
C 2250 22
df2 contains the same content as df1, plus some additional non-matching rows:
Col1 Col2 Col3
A 1000 5
A 1500 10
A 2000 15
A 2500 20
A 3000 20
A 3200 22
A 3400 25
B 900 5
B 1200 10
B 1500 15
B 1700 20
B 1900 26
C 1000 5
C 1200 10
C 1700 15
C 2200 20
C 2250 22
C 2400 25
C 3000 30
C 3500 31
I would like to find paired values in Col1 and Col2 of df1 that are also found in Col1 and Col2 of df2. I can't compare entire rows between data frames because the values in Col3 may differ between data frames. I've used the following code to find values in Col1 and Col2 of df1 that are also found in Col1 and Col2 of df2:
x = subset(df1, Col1 %in% df2$Col1 & Col2 %in% df2$Col2)
Which returns the expected result:
Col1 Col2 Col3
A 1000 5
A 1500 10
A 2000 15
A 2500 20
B 900 5
B 1200 10
B 1500 15
C 1000 5
C 1200 10
C 1700 15
C 2200 20
C 2250 22
However, when I switch which data frame is compared to the other:
y = subset(df2, Col1 %in% df1$Col1 & Col2 %in% df1$Col2)
I get an identical result, except there is now an additional row of B 1700 20:
Col1 Col2 Col3
A 1000 5
A 1500 10
A 2000 15
A 2500 20
B 900 5
B 1200 10
B 1500 15
B 1700 20
C 1000 5
C 1200 10
C 1700 15
C 2200 20
C 2250 22
Note that B 1700 20 is not found in df1, so why was it returned in this result? I think my code is not doing what I think it is doing, perhaps the paired nature of Col1 and Col2 is not actually retained between data frames? I'll appreciate any explanation and suggestions to improve my code!
intersect
– akrun%in%
doesn't care about the pairs... it's checking if the value indf1$Col1
occurs anywhere indf2$Col1
and thedf1$Col2
value occurs anywhere indf2$Col2
, but they don't have to occur in the same place. This is why you need a joining operation likemerge
ordplyr::inner_join
. – Gregor Thomas