is there any way to check if some duplicates rows are the same in 2 data frames?

Question

df1 is a subset of df2, I want to check that id number of duplicated rows in df1 is the same in df2 or not? so I want to make 2 new data frame from the bigger data frame df1, and in one of them keep rows with the same number of duplicated rows and otherwise in the other data set.

Example :

              SAMPN    PERNO       loop
                1        1          1
                1        1          1
                1        1          2
                1        2          2
                1        3          2
                2        1          1
                2        1          1
                2        2          2
                2        3          4


              SAMPN    PERNO       loop
                1        1          1
                1        1          1
                1        1          2
                1        2          2
                1        3          2
                1        3          2
                2        1          1
                2        1          1
                2        2          2
                2        2          2
                2        3          4
                2        3          4
                2        4          1

out put

data from df2 with the same number of duplicated rows in 2 dataset:

              SAMPN    PERNO       loop
                1        1          1
                1        1          1
                1        1          2
                1        2          2
                2        1          1
                2        1          1

data from df2 with not the same number of duplicated rows in 2 dataset:

              SAMPN    PERNO       loop

                1        3          2
                1        3          2
                2        2          2
                2        2          2
                2        3          4
                2        3          4
                2        4          1

data to check

structure(list(SAMPN = c(50, 50, 50, 50, 50, 50, 51, 53, 53, 
53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54), PERNO = c(4, 
4, 5, 5, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 5, 5, 5
), PLANO = c(4, 5, 2, 3, 2, 3, 3, 2, 3, 4, 5, 2, 3, 4, 5, 6, 
7, 2, 3, 2, 3, 4), loop = c(3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 3), TPURP = structure(c(16L, 2L, 
5L, 2L, 5L, 2L, 2L, 18L, 18L, 13L, 2L, 8L, 3L, 2L, 20L, 13L, 
2L, 5L, 2L, 5L, 2L, 3L), .Label = c("(1) Working at home (for pay)", 
"(2) All other home activities", "(3) Work/Job", "(4) All other activities at work", 
"(5) Attending class", "(6) All other activities at school", 
"(7) Change type of transportation/transfer", "(8) Dropped off passenger", 
"(9) Picked up passenger", "(10) Other, specify - transportation", 
"(11) Work/Business related", "(12) Service Private Vehicle", 
"(13) Routine Shopping", "(14) Shopping for major purchases", 
"(15) Household errands", "(16) Personal Business", "(17) Eat meal outside of home", 
"(18) Health care", "(19) Civic/Religious activities", "(20) Recreation/Entertainment", 
"(21) Visit friends/relative", "(24) Loop trip", "(97) Other, specify"
), class = "factor")), row.names = 431:452, class = "data.frame")


structure(list(SAMPN = c(48, 50, 50, 50, 50, 50, 56, 56, 58, 
58, 58, 58, 58, 58, 58, 58), PERNO = c(7, 1, 1, 2, 3, 6, 1, 3, 
1, 1, 1, 1, 2, 2, 2, 2), PLANO = c(3, 2, 4, 2, 4, 2, 6, 3, 2, 
3, 4, 5, 2, 3, 4, 5), loop = c(2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 
2, 2, 2, 2, 2, 2), TPURP = structure(c(2L, 8L, 22L, 8L, 22L, 
5L, 2L, 2L, 18L, 17L, 13L, 2L, 16L, 17L, 13L, 2L), .Label = c("(1) Working at home (for pay)", 
"(2) All other home activities", "(3) Work/Job", "(4) All other activities at work", 
"(5) Attending class", "(6) All other activities at school", 
"(7) Change type of transportation/transfer", "(8) Dropped off passenger", 
"(9) Picked up passenger", "(10) Other, specify - transportation", 
"(11) Work/Business related", "(12) Service Private Vehicle", 
"(13) Routine Shopping", "(14) Shopping for major purchases", 
"(15) Household errands", "(16) Personal Business", "(17) Eat meal outside of home", 
"(18) Health care", "(19) Civic/Religious activities", "(20) Recreation/Entertainment", 
"(21) Visit friends/relative", "(24) Loop trip", "(97) Other, specify"
), class = "factor")), row.names = c(412L, 420L, 422L, 423L, 
428L, 435L, 467L, 474L, 480L, 481L, 482L, 483L, 484L, 485L, 486L, 
487L), class = "data.frame")

Ronak Shah Ronak Shah · Accepted Answer · 2019-09-25T02:22:25

Maybe there is a simpler way but here is one approach using dplyr. We first count number of rows in each group in both the dataframe and do a left_join.

library(dplyr)

df3 <- left_join(df2 %>% count(SAMPN, PERNO, loop), 
                 df1 %>% count(SAMPN, PERNO, loop), by = c("SAMPN", "PERNO","loop"))

We select rows from df2 where the count matches in df3

df3 %>%
  filter(n.x == n.y) %>%
  select(names(df2)) %>%
  inner_join(df2)

 #  SAMPN PERNO  loop
 #  <int> <int> <int>
#1     1     1     1
#2     1     1     1
#3     1     1     2
#4     1     2     2
#5     2     1     1
#6     2     1     1

and another one where the count do not match.

df3 %>%
  filter(n.x != n.y | is.na(n.y)) %>%
  select(names(df2)) %>%
  inner_join(df2)

#  SAMPN PERNO  loop
#  <int> <int> <int>
#1     1     3     2
#2     1     3     2
#3     2     2     2
#4     2     2     2
#5     2     3     4
#6     2     3     4
#7     2     4     1

is there any way to check if some duplicates rows are the same in 2 data frames?

1 Answers