2
votes

I have two dataframes:

> df1:
  a b
1 1 2
2 2 3
3 2 4
4 3 4
5 4 4
> df2:
  a b
1 1 1
2 1 2
3 2 3
4 3 4
5 5 5

Then merge df1 and df2 to get df3, please notice merge by=c("a","b")

df3<-merge(df1,df2)
> df3
  a b
1 1 2
2 2 3
3 3 4

I would like to get the index of rows in df1 which are selected, and add a column call "label" in df1.

   > df1:
      a b label    
    1 1 2  TRUE
    2 2 3  TRUE
    3 2 4 FALSE
    4 3 4  TRUE
    5 4 4 FALSE

I tried this:

df1$label<-apply(df1,1,function (x) ifelse(nrow(merge(x,df3))>0,TRUE,FALSE))

got the wrong result and it's very slow since my df1 is very large. Is there any easy way? like is.element in vectors? Thank you.

3

3 Answers

1
votes

Merge on a and b by doing the equivalent of a LEFT OUTER JOIN in SQL, and then assign non-matching rows the value FALSE:

df1 <- data.frame(a=c(1,2,2,3,4), b=c(2,3,4,4,4))
df2 <- data.frame(a=c(1,1,3,5), b=c(1,2,4,5))
df2$label <- TRUE                                  # df1 matches to df2 is TRUE
df3 <- merge(df1, df2, by=c("a", "b"), all.x=TRUE) # merge on a AND b 
df3$label[is.na(df3$label)] <- FALSE               # non-match is FALSE

Output:

> df3
  a b label
1 1 2  TRUE
2 2 3  TRUE
3 2 4 FALSE
4 3 4  TRUE
5 4 4 FALSE
1
votes

Here's another way using data.table:

require(data.table)
setkey(setDT(df1), a)                     ## (1)
idx = df1[df2, which=TRUE, nomatch=0L]    ## (2)
df1[, label := FALSE][idx, label := TRUE] ## (3)
df1
#    a b label
# 1: 1 2  TRUE
# 2: 2 3 FALSE
# 3: 3 4  TRUE
# 4: 4 5 FALSE
  1. setDT converts data.frame to data.table by reference. And on the data.table, we setkey on column a - which basically reorders the data.table df1 by column a and marks that column as sorted. This is a prerequisite for join done in step 2.

  2. We perform a join of the form x[i] where x=df1 and i=df2. x has to be a keyed data.table and i can be a list, data.frame or data.table. The join is performed on the key column a. For each row of column a of df2, we find the matching rows, and return them by using the argument which=TRUE, and if there is a no match, we return a "0" for that row using nomatch=0L.

    At this step, we get idx = 1, 3, 0. Because, "1" and "3" matches the first and third rows, and "5" has no match.

  3. We set the entire label column to FALSE and update only those rows of idx to TRUE (those are the matching rows). This is done by reference once again (for speed and memory efficiency) using the := operator.

Have a look at the HTML vignettes to learn more.

0
votes

If the dataset if big, you can also try data.table

 library(data.table)
 setkey(setDT(df2),a)[df1][,c('label','b') := 
                       list(!is.na(b), i.b)][,i.b:= NULL][]
 #    a b label
 #1: 1 2  TRUE
 #2: 2 3 FALSE
 #3: 3 4  TRUE
 #4: 4 5 FALSE