Is there a way to compare two data frames of a different length, based on matching column values?

Question

Solution as posted by Jon Spring, in the comments of the answer provided)

#Applied to fruits example
df2 %>%
    select(id, name, score.x, year) %>%
    left_join(df1 %>% select(id, name, score.x, year),
    by = c("id", "name", "score.x", "year")) %>%
    mutate(match = score.x.x == score.x.y)

#Applied to df being worked with
Votesfull %>%
    select(rcid, session.x, country, unres, vote) %>%
    left_join(OTHER_DATA %>% select(rcid, session.x, country, unres, vote),
    by = c("rcid", "session.x","country", "unres")) %>%
    mutate(match = vote.x == vote.y)

I'm comparing dataframes with different lengths, but with similar structure. Is there a way to compare parts of the longer df to the shorter df?

For clarity, I'll refer to the shorter df as df1 and the longer one as df2. df1 was subsetted from the longer one, and df2 is a collection of similar tables made into one large df. Each individual sub-part of df2 is about 6,000 observations long, and df1 is also a similar length.

I want to seek help whether it is possible to call upon this small part of df2 to compare to df1, and iterate it continuously until I reach the end of df2.

I've searched for and tried solutions that regard dataframes with equal or similar sizes, but I was not able to find a solution for dataframes that were different in height. Of the dataframes I am working with, the larger one is about 150 times longer than the shorter one with slight differences in the total observations, meaning the two dfs' lengths are not multiples of each other.

It may be a problem with the data structure itself. If so, I apologize for my lack of thought and skill.

a sample df1 and df2 to illustrate the conundrum is the following:

df1 <- data.frame(
    "id" = 1:3,
    "name" = c('apple', 'apple', 'apple'),
    "score.x" = c(1, 3, 2),
    "year" = c(2000, 2001, 2002)
)

df2 <- data.frame(
    "id" = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
    "name" = c('orange', 'orange', 'orange', 'melon', 'melon', 'melon', 'grapes', 'grapes', 'grapes', 'lemon', 'lemon', 'lemon'),
    "score.x" = c(2, 3, 1, 1, 1, 2, 3, 3, 2, 1, 1, 1),
    "year" = c(2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002)
)

df1
    id  name    score.x  year
1   1   apple   1        2000
2   2   apple   3        2001
3   3   apple   2        2002

df2
    id  name    score.x  year
1   1   orange  2        2000
2   2   orange  3        2001
3   3   orange  1        2002
4   1   melon   1        2000
5   2   melon   1        2001
6   3   melon   2        2002
7   1   grapes  3        2000
8   2   grapes  3        2001
9   3   grapes  2        2002
10  1   lemon   1        2000
11  2   lemon   1        2001
12  3   lemon   1        2002

df2 is just like df1, except it has more observations under it.

Is there a way to compare just a part of df2, perhaps the orange(df2[df2$name == 'orange']), to df1 and iterate it over melon, grapes, and lemons?

Lastly, I want to thank anyone who answers this question and apologize to anyone who thinks the question is poorly posed. I'm very new to R and Stack Overflow in general--which doesn't excuse me, I know. All in all, I'll try to catch on quickly and generate better content for the community.

Edit: a part of the actual df I want to apply this to is below:

Votesfull
        rcid  ccode  session.x  member  vote  Country  year  date        unres
1       3     2      1          1       1     USA      1946  1946-01-01  R/1/66
2       3     20     1          1       3     CAN      1946  1946-01-01  R/1/66
3       3     31     1          NA      NA     BHS      1946  1946-01-01  R/1/66
4       3     40     1          1       1     CUB      1946  1946-01-01  R/1/66
5       3     41     1          1       1     HTI      1946  1946-01-01  R/1/66
...
512792  2550  2      38         1       3     USA      1983  1983-12-07  R/38/183C
512793  2550  20     38         1       3     CAN      1983  1983-12-07  R/38/183C
512794  2550  31     38         1       2     BHS      1983  1983-12-07  R/38/183C
512795  2550  40     38         1       1     CUB      1983  1983-12-07  R/38/183C
512795  2550  41     38         1       2     HTI      1983  1983-12-07  R/38/183C
...
1041717 5338  2      69         1       3     USA      2014  2014-12-02  R/69/53
1041718 5338  20     69         1       2     CAN      2014  2014-12-02  R/69/53
1041719 5338  31     69         1       1     BHS      2014  2014-12-02  R/69/53
1041720 5338  40     69         1       1     CUB      2014  2014-12-02  R/69/53 
2014721 5338  41     69         1       1     HTI      2014  2014-12-02  R/69/53

I apologize for the confusion regarding the names seemingly being non-overlapping in the sample df presented above.

This data is from Georgetown Univ. Dr. Voeten's UN General Assembly Voting Data, accessed via Harvard Dataverse. The df has overlapping rcid, session.x, and unres(UN resolution code), which can be used to make parallels to another row of a different country.

Edit2: A sketch of the wanted result is as follows(note the match column):

Votesfull
        rcid  ccode  session.x  member  vote  Country  year  date        unres     match
1       3     2      1          1       1     USA      1946  1946-01-01  R/1/66    TRUE
2       3     20     1          1       3     CAN      1946  1946-01-01  R/1/66    FALSE
3       3     31     1          NA      NA    BHS      1946  1946-01-01  R/1/66    NA
4       3     40     1          1       1     CUB      1946  1946-01-01  R/1/66    TRUE
5       3     41     1          1       1     HTI      1946  1946-01-01  R/1/66    TRUE
...
512792  2550  2      38         1       3     USA      1983  1983-12-07  R/38/183C TRUE
512793  2550  20     38         1       3     CAN      1983  1983-12-07  R/38/183C TRUE
512794  2550  31     38         1       2     BHS      1983  1983-12-07  R/38/183C FALSE
512795  2550  40     38         1       1     CUB      1983  1983-12-07  R/38/183C FALSE
512795  2550  41     38         1       2     HTI      1983  1983-12-07  R/38/183C FALSE
...
1041717 5338  2      69         1       3     USA      2014  2014-12-02  R/69/53   TRUE
1041718 5338  20     69         1       2     CAN      2014  2014-12-02  R/69/53   FALSE
1041719 5338  31     69         1       1     BHS      2014  2014-12-02  R/69/53   FALSE
1041720 5338  40     69         1       1     CUB      2014  2014-12-02  R/69/53   FALSE 
2014721 5338  41     69         1       1     HTI      2014  2014-12-02  R/69/53   FALSE

What I should have clarified is that the dataframes are not exactly the same lengths.

So basically, what I'm struggling with is testing whether each vote entry in Votesfull is equal to another df (of similar structure), based on matching rcid data(each rcid denotes a separate voting session, meaning each Country will have 1 vote entry per rcid entry).

Edit3: A sketch of the wanted results using the original fruits example:

df1
    id  name    score.x  year
1   1   apple   1        2000
2   2   apple   3        2001
3   3   apple   2        2002

#todo: compare apples to orange, melon, grapes, etc., for each id match
#e.g.) apple(id=1) vs orange(id=1), apple(id=2) vs orange(id=2), so on..

df2
    id  name    score.x  year  match_apple
1   1   orange  2        2000  FALSE       #for id=1, score 2 != 1
2   2   orange  3        2001  TRUE        #for id=2, score 3 == 3
3   3   orange  1        2002  FALSE       #for id=3, score 1 != 2
4   1   melon   1        2000  TRUE
5   2   melon   1        2001  FALSE
6   3   melon   2        2002  TRUE
7   1   grapes  3        2000  FALSE
8   2   grapes  3        2001  TRUE
9   3   grapes  2        2002  TRUE
10  1   lemon   1        2000  FALSE
11  2   lemon   1        2001  FALSE
12  3   lemon   1        2002  FALSE
13  1   berry   1        2000  TRUE        #added new fruit to demo NA
14  2   berry   2        2001  FALSE
15  3   berry   NA       2002  NA          #some values of df are NA

This looks like a good place to use a join (eg merge in base R or left_join in dplyr). Can you give an example of a row that you'd like to compare, and how you'd want to compare it between the two tables? (in the given data, it looks like none of the names overlap, so that might mean no matches) — Jon Spring
Hi, thanks for your comments. I've added a sample df of the working data, to clarify the issue with names not overlapping. I guess it's safe to say I made an error when making the sample df. Very sorry for the confusion. As for "compare", I'm curious if it's possible to match score.x together (apples - orange, apples - melon, so on) and get a TRUE/FALSE logical. For the sample working df, the vote variable would be compared with each other. — Hyunbin Choi

Jon Spring Jon Spring · Accepted Answer · 2019-04-29T03:02:15

Here's an example of doing a join between two tables to see if they match in score.x where the other columns match.

The "left_join" takes each row from the first table and outputs a row for each match that is found in the second table, defined by the columns named in the by = c("id", "name", "year") part. Since you'll now have two versions of score.x, the original from df1 is renamed score.x.x and the one from df2 is renamed score.x.y.

library(dplyr)
df1 %>%          # (Note, I've modified df1 to be "oranges" so we'll have matches)
  left_join(     # Keep everything in df1 and connect to each matching row in...
    df2,                          # df2, defined by matching...
    by = c("id", "name", "year")  # id, name, and year
  ) %>%
  mutate(match = score.x.x == score.x.y)  # ...and say whether they match

# Here's the output
  id   name score.x.x year score.x.y match
1  1 orange         1 2000         2 FALSE
2  2 orange         3 2001         3  TRUE
3  3 orange         2 2002         1 FALSE

Sample data, slightly modified from OP

df1 <- data.frame(
  "id" = 1:3,
  "name" = c('orange', 'orange', 'orange'),  # Changed to make matches
  "score.x" = c(1, 3, 2),
  "year" = c(2000, 2001, 2002)
)

df2 <- data.frame(
  "id" = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  "name" = c('orange', 'orange', 'orange', 'melon', 'melon', 'melon', 'grapes', 'grapes', 'grapes', 'lemon', 'lemon', 'lemon'),
  "score.x" = c(2, 3, 1, 1, 1, 2, 3, 3, 2, 1, 1, 1),
  "year" = c(2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002)
)

Is there a way to compare two data frames of a different length, based on matching column values?

2 Answers