1
votes

I'm using R to identify paired values in two columns of a data frame that are identical to paired values in two columns of another data frame. I thought I found a solution, but the result is not quite what I expected, and am wondering if anyone might be able to explain why.

df1 looks like this:

Col1 Col2 Col3
A   1000    5
A   1500    10
A   2000    15
A   2500    20
B   900 5
B   1200    10
B   1500    15
C   1000    5
C   1200    10
C   1700    15
C   2200    20
C   2250    22  

df2 contains the same content as df1, plus some additional non-matching rows:

Col1 Col2 Col3
A   1000    5
A   1500    10
A   2000    15
A   2500    20
A   3000    20
A   3200    22
A   3400    25
B   900 5
B   1200    10
B   1500    15
B   1700    20
B   1900    26
C   1000    5
C   1200    10
C   1700    15
C   2200    20
C   2250    22
C   2400    25
C   3000    30
C   3500    31

I would like to find paired values in Col1 and Col2 of df1 that are also found in Col1 and Col2 of df2. I can't compare entire rows between data frames because the values in Col3 may differ between data frames. I've used the following code to find values in Col1 and Col2 of df1 that are also found in Col1 and Col2 of df2:

x = subset(df1, Col1 %in% df2$Col1 & Col2 %in% df2$Col2)

Which returns the expected result:

Col1 Col2 Col3
A   1000    5
A   1500    10
A   2000    15
A   2500    20
B   900 5
B   1200    10
B   1500    15
C   1000    5
C   1200    10
C   1700    15
C   2200    20
C   2250    22

However, when I switch which data frame is compared to the other:

y = subset(df2, Col1 %in% df1$Col1 & Col2 %in% df1$Col2)

I get an identical result, except there is now an additional row of B 1700 20:

Col1 Col2 Col3
A   1000    5
A   1500    10
A   2000    15
A   2500    20
B   900 5
B   1200    10
B   1500    15
B   1700 20
C   1000    5
C   1200    10
C   1700    15
C   2200    20
C   2250    22

Note that B 1700 20 is not found in df1, so why was it returned in this result? I think my code is not doing what I think it is doing, perhaps the paired nature of Col1 and Col2 is not actually retained between data frames? I'll appreciate any explanation and suggestions to improve my code!

1
You may need intersectakrun
It seems like what you want is an inner join. See stackoverflow.com/questions/1299871/…IceCreamToucan
%in% doesn't care about the pairs... it's checking if the value in df1$Col1 occurs anywhere in df2$Col1 and the df1$Col2 value occurs anywhere in df2$Col2, but they don't have to occur in the same place. This is why you need a joining operation like merge or dplyr::inner_join.Gregor Thomas

1 Answers

1
votes

It is a case of inner_join

library(dplyr)
out1 <- inner_join(df1, df2[1:2], by = c("Col1", "Col2"))
out2 <- inner_join(df1, df1[1:2], by = c("Col1", "Col2"))
identical(nrow(out1), nrow(out2))

Also, note that intersect from dplyr can take data.frame and return a data.frame

out1 <-  intersect(df1, df2)
out2 <- intersect(df2, df1)

In the example, both datasets have the same values. If we want to compare subset of columns, subset the datasets and do intersect. The difference is that we get only those columns as output when compared to inner_join