0
votes

I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.

Example:

  | A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1

Clearly, rows 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that is equal to 1 in rows 1, 2, 3 and 4, since rows 3 and 4 are also identical.

Moreover, I want to add another variable, F, that is equal to 1 if there is a duplicate differing in only one column. That is, F in rows 1, 2 and 5 would be equal to 1, since they differ only in the B column.

I hope it is clear what I want to do. This is of course possible in "base" R, but I believe (hope) that dplyr offers a smoother solution.

3
To identify duplicates with dplyr you can try distinct. – patL
For the E variable: as.integer(duplicated(d2) | duplicated(d2, fromLast = TRUE)). – Sotos
For column E, create a second tibble with the duplicated rows eliminated but counted, then join the count column back (as E) to the original tibble containing the duplicated rows. For column F, maybe do the same steps, but in between also add another field with mutate, using compound OR conditions on the right-hand side. – knb
Why is F not 1, 2, 3, 4, 5? Rows 3 and 4 also differ from row 5 in only one column, A. – IBrum
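A minimal sketch of the duplicated() idea from the comments for column E (this assumes the magrittr pipe, where `.` refers to the piped data frame):

```r
library(dplyr)

d <- data.frame(A = c(1, 1, 0, 0, 1),
                B = c(0, 0, 1, 1, 1),
                C = c(1, 1, 1, 1, 1),
                D = c(1, 1, 1, 1, 1))

# duplicated() alone skips the first occurrence; adding the
# fromLast = TRUE pass flags every copy, including the first
d <- d %>%
  mutate(E = as.integer(duplicated(.) | duplicated(., fromLast = TRUE)))

d$E  # 1 1 1 1 0
```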

3 Answers

1
votes

You can use dist() to compute pairwise row differences; a search in the resulting distance matrix then gives the needed answers (E, F, etc.). Here is example code, where X is the original data frame:

W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 1))

Just change D = to the number of differing columns needed. It's all base R, though; using plyr::laply instead of sapply has the same effect, and dplyr looks like overkill here.
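For reference, running this on the data from the question gives the following (note that F comes out as 1 for every row, matching IBrum's comment above):

```r
X <- data.frame(A = c(1, 1, 0, 0, 1),
                B = c(0, 0, 1, 1, 1),
                C = c(1, 1, 1, 1, 1),
                D = c(1, 1, 1, 1, 1))

# Pairwise Manhattan distances between rows: 0 means identical rows,
# 1 means the rows differ in exactly one 0/1 column
W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 1))

X$E  # 1 1 1 1 0
X$F  # 1 1 1 1 1  (every row has a neighbour one column away)
```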

0
votes

Here is a data.table solution that extends to the arbitrary case (1..n columns the same); perhaps someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column: in your example every row would get a 1, because rows 3 and 4 are one column different from row 5 as well.

library(data.table)

DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
  A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0

setDT(DT)
DT_ncols <- length(DT)

base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))

DT[, ind := .I]

DT_melt <- melt(DT, id.vars = "ind", variable.name = "column")

base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))

base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]

This gives us a data.table that looks like this:

base
    ind_x ind_y common_cols
 1:     1     2           5
 2:     1     3           2
 3:     2     3           2
 4:     1     4           2
 5:     2     4           2
 6:     3     4           5
 7:     1     5           3
 8:     2     5           3
 9:     3     5           4
10:     4     5           4

This says that rows 1 and 2 have all 5 columns in common (duplicates), as do rows 3 and 4. Rows 3 and 5, and rows 4 and 5, have 4 common columns. We can now use a fairly extendable format to flag any combination we want:

base <- melt(base, id.vars = "common_cols")
# Unique - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]

This gives:

   A B C D E ind F G H
1: 1 0 1 1 1   1 1 0 1
2: 1 0 1 1 1   2 1 0 1
3: 0 1 1 1 0   3 1 1 0
4: 0 1 1 1 0   4 1 1 0
5: 1 1 1 1 0   5 0 1 1

Instead of manually selecting, you can append all combinations like so:

# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
   ind A B C D E 1 2 3 4 5
1:   1 1 0 1 1 1 0 1 1 0 1
2:   2 1 0 1 1 1 0 1 1 0 1
3:   3 0 1 1 1 0 0 1 0 1 1
4:   4 0 1 1 1 0 0 1 0 1 1
5:   5 1 1 1 1 0 0 0 1 1 0
0
votes

Here is a dplyr solution:

test %>%
  mutate(flag = (A == lag(A) &
                 B == lag(B) &
                 C == lag(C) &
                 D == lag(D))) %>%
  mutate(twice = lead(flag) == TRUE) %>%
  mutate(E = ifelse(flag == TRUE | twice == TRUE, 1, 0)) %>%
  mutate(E = ifelse(is.na(E), 0, 1)) %>%
  mutate(FF = ifelse((A + lag(A)) + (B + lag(B)) + (C + lag(C)) + (D + lag(D)) == 7, 1, 0)) %>%
  mutate(FF = ifelse(is.na(FF) | FF == 0, 0, 1)) %>%
  select(A, B, C, D, E, FF)

Result:

  A B C D E FF
1 1 0 1 1 1  0
2 1 0 1 1 1  0
3 0 1 1 1 1  0
4 0 1 1 1 1  0
5 1 1 1 1 0  1
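Note that the lag()-based flag only catches duplicates that happen to sit on adjacent rows. A sketch of an order-independent alternative for E, assuming dplyr >= 1.0 for across():

```r
library(dplyr)

test <- data.frame(A = c(1, 1, 0, 0, 1),
                   B = c(0, 0, 1, 1, 1),
                   C = c(1, 1, 1, 1, 1),
                   D = c(1, 1, 1, 1, 1))

# Group by every column: any group with more than one row is a set of
# duplicates, regardless of where the rows appear in the data frame
test <- test %>%
  group_by(across(everything())) %>%
  mutate(E = as.integer(n() > 1)) %>%
  ungroup()

test$E  # 1 1 1 1 0
```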