3
votes

I am trying to match the values in a column of one data frame with the values in a column of a second data frame. The tricky part is that I would like to do the matching using subsets of the second data frame (designated by a distinct column in the second data frame from the one that is being matched). This is different from the commonly posted problem of trying to subset based on matching between data frames.

My problem is the opposite - I want to match data frames based on subsets. To be specific, I would like to match subsets of the column in the second data frame with the entire column of the first data frame, and then create new columns in the first data frame that show whether or not a match has been made for each subset.

These subsets can have varying number of rows. Using the two dummy data frames below...

DF1 <- data.frame(number=1:10)

DF2 <- data.frame(category = rep(c("A","B","C"), c(5,7,3)), 
                  number = sample(10, size=15, replace=T))

...the objective would be to create three new columns (DF1$A, DF1$B, and DF$C) that show whether the values in DF1$number match with the values in DF2$number for each of the respective subsets of DF2$category. Ideally the rows in these new columns would show a '1' if a match has been made and a '0' if a match has not. With the dummy data below I would end up with DF1 having 4 columns (DF1$number, DF1$A, DF1$B, and DF$C) of 10 rows each.

Note that in my actual second data frame I have a huge number of categories, so I don't want to have to type them out individually for whatever operation is needed to accomplish this objective. I hope that makes sense! Sorry if I'm missing something obvious and thanks very much for any help you might be able to provide.

1

1 Answers

2
votes

This should work:

sapply(split(DF2$number, DF2$category), function(x) DF1$number %in% x + 0)

      A B C
 [1,] 0 0 1
 [2,] 1 1 0
 [3,] 1 1 1
 [4,] 0 1 0
 [5,] 0 0 1
 [6,] 0 1 0
 [7,] 1 1 0
 [8,] 1 0 0
 [9,] 1 0 0
[10,] 0 1 0

You can add this back to DF1 like:

data.frame(
   DF1,
   sapply(split(DF2$number, DF2$category), function(x) DF1$number %in% x + 0)
          )

   number A B C
1       1 0 0 1
2       2 1 1 0
3       3 1 1 1
4       4 0 1 0
5       5 0 0 1
6       6 0 1 0
7       7 1 1 0
8       8 1 0 0
9       9 1 0 0
10     10 0 1 0