0
votes

I have two data frames, say df1 and df2. Now I want to subset df2 based on the matching of multiple columns between df1 and df2.

e.g

df1                                   
   A   B  #column names, rows in df1 are unique, A1,B1 etc are characters                  
   A1  B1                                
   A2  B2 
   ......                               



  df2  
  C     D     E   F    G  
  A1    B1    E1  F1   G1
  A2    B2    E2  .......  
  A1    B2    E3  .......  
  A1    B1    E4  .......  
  A2    B1    E5  .......  

Here I want to match columns A and B in df1 to columns C and D in df2,and construct a new data frame df3, where each row of df3 store the row index of df2 where matching happens. For my example, it should be

df3
c(1,4)
c(2)

Originally I am thinking of paste characters and do string comparison to do the matching, but I suspect this is not the efficient way to do this, any better idea?

2
I would go with pasteing characters sapply(seq(nrow(df1)) , function(i) which( match(paste0(df2$C,df2$D) , paste0(df1$A,df1$B)) %in% i)) - user20650
Thanks, your solution is same as my current version. - Jin

2 Answers

0
votes

I don't think my answer is the most efficient way if you are doing large data on production. I'd merge them if I'm only writing a prototype to see a quick answer..

df1<-data.frame(A=c("A1","A2"),B=c("B1","B2"))
df2<-data.frame(C=c("A1","A2","A1","A1","A2"),D=c("B1","B2","B2","B1","B1"))
names(df1)<-c("C","D")
df1$is_df1<-"Y"
df2$rownumber<-c(1:nrow(df2))
z<-merge(df2,df1,all.x=TRUE)
do.call(rbind,lapply(split(z,paste(z$C,z$D)),function(x)paste(x$rownumber,collapse=",")))
0
votes

Does this answer your needs?

df1 <- data.frame(A = c("A1", "A2"),
                  B = c("B1","B2"))

df2 <- data.frame(C = c("A1", "A2", "A1", "A1", "A2"),
                  D = c("B1", "B2", "B2", "B1", "B1"),
                  E = rnorm(5))

df2$row <- 1:nrow(df2)
df2
m <- merge(df1, df2, by.x = c("A","B"),
           by.y = c("C","D"),
           all.x = T, sort = FALSE)
res <- aggregate(row ~ A +B , data=m, paste, sep ="", collapse = ",")
sapply(res, class)