Solution as posted by Jon Spring, in the comments of the answer provided)
#Applied to fruits example
df2 %>%
select(id, name, score.x, year) %>%
left_join(df1 %>% select(id, name, score.x, year),
by = c("id", "name", "score.x", "year")) %>%
mutate(match = score.x.x == score.x.y)
#Applied to df being worked with
Votesfull %>%
select(rcid, session.x, country, unres, vote) %>%
left_join(OTHER_DATA %>% select(rcid, session.x, country, unres, vote),
by = c("rcid", "session.x","country", "unres")) %>%
mutate(match = vote.x == vote.y)
I'm comparing dataframes with different lengths, but with similar structure. Is there a way to compare parts of the longer df to the shorter df?
For clarity, I'll refer to the shorter df as df1 and the longer one as df2. df1 was subsetted from the longer one, and df2 is a collection of similar tables made into one large df. Each individual sub-part of df2 is about 6,000 observations long, and df1 is also a similar length.
I want to seek help whether it is possible to call upon this small part of df2 to compare to df1, and iterate it continuously until I reach the end of df2.
I've searched for and tried solutions that regard dataframes with equal or similar sizes, but I was not able to find a solution for dataframes that were different in height. Of the dataframes I am working with, the larger one is about 150 times longer than the shorter one with slight differences in the total observations, meaning the two dfs' lengths are not multiples of each other.
It may be a problem with the data structure itself. If so, I apologize for my lack of thought and skill.
a sample df1 and df2 to illustrate the conundrum is the following:
df1 <- data.frame(
"id" = 1:3,
"name" = c('apple', 'apple', 'apple'),
"score.x" = c(1, 3, 2),
"year" = c(2000, 2001, 2002)
)
df2 <- data.frame(
"id" = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
"name" = c('orange', 'orange', 'orange', 'melon', 'melon', 'melon', 'grapes', 'grapes', 'grapes', 'lemon', 'lemon', 'lemon'),
"score.x" = c(2, 3, 1, 1, 1, 2, 3, 3, 2, 1, 1, 1),
"year" = c(2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002)
)
df1
id name score.x year
1 1 apple 1 2000
2 2 apple 3 2001
3 3 apple 2 2002
df2
id name score.x year
1 1 orange 2 2000
2 2 orange 3 2001
3 3 orange 1 2002
4 1 melon 1 2000
5 2 melon 1 2001
6 3 melon 2 2002
7 1 grapes 3 2000
8 2 grapes 3 2001
9 3 grapes 2 2002
10 1 lemon 1 2000
11 2 lemon 1 2001
12 3 lemon 1 2002
df2 is just like df1, except it has more observations under it.
Is there a way to compare just a part of df2, perhaps the orange(df2[df2$name == 'orange']), to df1 and iterate it over melon, grapes, and lemons?
Lastly, I want to thank anyone who answers this question and apologize to anyone who thinks the question is poorly posed. I'm very new to R and Stack Overflow in general--which doesn't excuse me, I know. All in all, I'll try to catch on quickly and generate better content for the community.
Edit: a part of the actual df I want to apply this to is below:
Votesfull
rcid ccode session.x member vote Country year date unres
1 3 2 1 1 1 USA 1946 1946-01-01 R/1/66
2 3 20 1 1 3 CAN 1946 1946-01-01 R/1/66
3 3 31 1 NA NA BHS 1946 1946-01-01 R/1/66
4 3 40 1 1 1 CUB 1946 1946-01-01 R/1/66
5 3 41 1 1 1 HTI 1946 1946-01-01 R/1/66
...
512792 2550 2 38 1 3 USA 1983 1983-12-07 R/38/183C
512793 2550 20 38 1 3 CAN 1983 1983-12-07 R/38/183C
512794 2550 31 38 1 2 BHS 1983 1983-12-07 R/38/183C
512795 2550 40 38 1 1 CUB 1983 1983-12-07 R/38/183C
512795 2550 41 38 1 2 HTI 1983 1983-12-07 R/38/183C
...
1041717 5338 2 69 1 3 USA 2014 2014-12-02 R/69/53
1041718 5338 20 69 1 2 CAN 2014 2014-12-02 R/69/53
1041719 5338 31 69 1 1 BHS 2014 2014-12-02 R/69/53
1041720 5338 40 69 1 1 CUB 2014 2014-12-02 R/69/53
2014721 5338 41 69 1 1 HTI 2014 2014-12-02 R/69/53
I apologize for the confusion regarding the names seemingly being non-overlapping in the sample df presented above.
This data is from Georgetown Univ. Dr. Voeten's UN General Assembly Voting Data, accessed via Harvard Dataverse. The df has overlapping rcid, session.x, and unres(UN resolution code), which can be used to make parallels to another row of a different country.
Edit2: A sketch of the wanted result is as follows(note the match column):
Votesfull
rcid ccode session.x member vote Country year date unres match
1 3 2 1 1 1 USA 1946 1946-01-01 R/1/66 TRUE
2 3 20 1 1 3 CAN 1946 1946-01-01 R/1/66 FALSE
3 3 31 1 NA NA BHS 1946 1946-01-01 R/1/66 NA
4 3 40 1 1 1 CUB 1946 1946-01-01 R/1/66 TRUE
5 3 41 1 1 1 HTI 1946 1946-01-01 R/1/66 TRUE
...
512792 2550 2 38 1 3 USA 1983 1983-12-07 R/38/183C TRUE
512793 2550 20 38 1 3 CAN 1983 1983-12-07 R/38/183C TRUE
512794 2550 31 38 1 2 BHS 1983 1983-12-07 R/38/183C FALSE
512795 2550 40 38 1 1 CUB 1983 1983-12-07 R/38/183C FALSE
512795 2550 41 38 1 2 HTI 1983 1983-12-07 R/38/183C FALSE
...
1041717 5338 2 69 1 3 USA 2014 2014-12-02 R/69/53 TRUE
1041718 5338 20 69 1 2 CAN 2014 2014-12-02 R/69/53 FALSE
1041719 5338 31 69 1 1 BHS 2014 2014-12-02 R/69/53 FALSE
1041720 5338 40 69 1 1 CUB 2014 2014-12-02 R/69/53 FALSE
2014721 5338 41 69 1 1 HTI 2014 2014-12-02 R/69/53 FALSE
What I should have clarified is that the dataframes are not exactly the same lengths.
So basically, what I'm struggling with is testing whether each vote
entry in Votesfull
is equal to another df (of similar structure), based on matching rcid
data(each rcid
denotes a separate voting session, meaning each Country
will have 1 vote
entry per rcid
entry).
Edit3: A sketch of the wanted results using the original fruits example:
df1
id name score.x year
1 1 apple 1 2000
2 2 apple 3 2001
3 3 apple 2 2002
#todo: compare apples to orange, melon, grapes, etc., for each id match
#e.g.) apple(id=1) vs orange(id=1), apple(id=2) vs orange(id=2), so on..
df2
id name score.x year match_apple
1 1 orange 2 2000 FALSE #for id=1, score 2 != 1
2 2 orange 3 2001 TRUE #for id=2, score 3 == 3
3 3 orange 1 2002 FALSE #for id=3, score 1 != 2
4 1 melon 1 2000 TRUE
5 2 melon 1 2001 FALSE
6 3 melon 2 2002 TRUE
7 1 grapes 3 2000 FALSE
8 2 grapes 3 2001 TRUE
9 3 grapes 2 2002 TRUE
10 1 lemon 1 2000 FALSE
11 2 lemon 1 2001 FALSE
12 3 lemon 1 2002 FALSE
13 1 berry 1 2000 TRUE #added new fruit to demo NA
14 2 berry 2 2001 FALSE
15 3 berry NA 2002 NA #some values of df are NA
merge
in base R orleft_join
indplyr
). Can you give an example of a row that you'd like to compare, and how you'd want to compare it between the two tables? (in the given data, it looks like none of the names overlap, so that might mean no matches) – Jon Spring