0
votes

I want to identify (not eliminate) duplicates in a data frame and add a 0/1 variable accordingly (whether a row is a duplicate or not), using the R dplyr package.

Example:

  | A B C D
1 | 1 0 1 1
2 | 1 0 1 1
3 | 0 1 1 1
4 | 0 1 1 1
5 | 1 1 1 1

Clearly, rows 1 and 2 are duplicates, so I want to create a new variable (with mutate?), say E, that is equal to 1 in rows 1, 2, 3 and 4, since rows 3 and 4 are also identical.

Moreover, I want to add another variable, F, that is equal to 1 if there is a duplicate differing in only one column. That is, F in rows 1, 2 and 5 would be equal to 1, since they differ only in the B column.

I hope it is clear what I want to do. This is of course possible in "base" R, but I believe (hope) that dplyr offers a smoother solution.

3
To identify duplicates with dplyr you can try distinct. – patL
For the E variable: as.integer(duplicated(d2) | duplicated(d2, fromLast = TRUE)). – Sotos
For column E, create a second tibble with the duplicated rows eliminated but counted, then join the count column back (as E) to the original tibble containing the duplicated rows. For column F, maybe do the same steps, but in between also add another field with mutate, using compound OR conditions on the right-hand side. – knb
Why is F not 1, 2, 3, 4, 5? Rows 3 and 4 also differ from row 5 in only one column, A. – IBrum
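A minimal sketch of the duplicated() idea from the comments for column E (this assumes the magrittr pipe, where `.` refers to the piped data frame):

```r
library(dplyr)

d <- data.frame(A = c(1, 1, 0, 0, 1),
                B = c(0, 0, 1, 1, 1),
                C = c(1, 1, 1, 1, 1),
                D = c(1, 1, 1, 1, 1))

# duplicated() alone skips the first occurrence; adding the
# fromLast = TRUE pass flags every copy, including the first
d <- d %>%
  mutate(E = as.integer(duplicated(.) | duplicated(., fromLast = TRUE)))

d$E  # 1 1 1 1 0
```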

3 Answers

1
votes

You can use dist() to compute pairwise row differences; a search in the resulting distance matrix then gives the needed answers (E, F, etc.). Here is example code, where X is the original data frame:

W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 1))

Just change D = to the number of differing columns needed. It's all base R, though; using plyr::laply instead of sapply has the same effect, and dplyr looks like overkill here.
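For reference, running this on the data from the question gives the following (note that F comes out as 1 for every row, matching IBrum's comment above):

```r
X <- data.frame(A = c(1, 1, 0, 0, 1),
                B = c(0, 0, 1, 1, 1),
                C = c(1, 1, 1, 1, 1),
                D = c(1, 1, 1, 1, 1))

# Pairwise Manhattan distances between rows: 0 means identical rows,
# 1 means the rows differ in exactly one 0/1 column
W <- as.matrix(dist(X, method = "manhattan"))
X$E <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 0))
X$F <- as.integer(sapply(1:nrow(W), function(i, D) any(W[-i, i] == D), D = 1))

X$E  # 1 1 1 1 0
X$F  # 1 1 1 1 1  (every row has a neighbour one column away)
```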

0
votes

Here is a data.table solution that extends to the arbitrary case (1..n columns the same); perhaps someone can convert it to dplyr for you. I had to change your dataset a bit to show your desired F column: in your example every row would get a 1, because rows 3 and 4 are one column different from row 5 as well.

library(data.table)

DT <- data.frame(A = c(1,1,0,0,1), B = c(0,0,1,1,1), C = c(1,1,1,1,1), D = c(1,1,1,1,1), E = c(1,1,0,0,0))
DT
  A B C D E
1 1 0 1 1 1
2 1 0 1 1 1
3 0 1 1 1 0
4 0 1 1 1 0
5 1 1 1 1 0

setDT(DT)
DT_ncols <- length(DT)

base <- data.table(t(combn(1:nrow(DT), 2)))
setnames(base, c("V1","V2"),c("ind_x","ind_y"))

DT[, ind := .I]

DT_melt <- melt(DT, id.vars = "ind", variable.name = "column")

base <- merge(base, DT_melt, by.x = "ind_x", by.y = "ind", allow.cartesian = TRUE)
base <- merge(base, DT_melt, by.x = c("ind_y", "column"), by.y = c("ind", "column"))

base <- base[, .(common_cols = sum(value.x == value.y)), by = .(ind_x, ind_y)]

This gives us a data.table that looks like this:

base
    ind_x ind_y common_cols
 1:     1     2           5
 2:     1     3           2
 3:     2     3           2
 4:     1     4           2
 5:     2     4           2
 6:     3     4           5
 7:     1     5           3
 8:     2     5           3
 9:     3     5           4
10:     4     5           4

This says that rows 1 and 2 have all 5 columns in common (duplicates), as do rows 3 and 4. Rows 3 and 5, and rows 4 and 5, have 4 common columns. We can now use a fairly extendable format to flag any combination we want:

base <- melt(base, id.vars = "common_cols")
# Unique - common_cols == DT_ncols
DT[, F := ifelse(ind %in% unique(base[common_cols == DT_ncols, value]), 1, 0)]
# Same save 1 - common_cols == DT_ncols - 1
DT[, G := ifelse(ind %in% unique(base[common_cols == DT_ncols - 1, value]), 1, 0)]
# Same save 2 - common_cols == DT_ncols - 2
DT[, H := ifelse(ind %in% unique(base[common_cols == DT_ncols - 2, value]), 1, 0)]

This gives:

   A B C D E ind F G H
1: 1 0 1 1 1   1 1 0 1
2: 1 0 1 1 1   2 1 0 1
3: 0 1 1 1 0   3 1 1 0
4: 0 1 1 1 0   4 1 1 0
5: 1 1 1 1 0   5 0 1 1

Instead of manually selecting, you can append all combinations like so:

# run after base <- melt(base, id.vars = "common_cols")
base <- unique(base[,.(ind = value, common_cols)])
base[, common_cols := factor(common_cols, 1:DT_ncols)]
merge(DT, dcast(base, ind ~ common_cols, fun.aggregate = length, drop = FALSE), by = "ind")
   ind A B C D E 1 2 3 4 5
1:   1 1 0 1 1 1 0 1 1 0 1
2:   2 1 0 1 1 1 0 1 1 0 1
3:   3 0 1 1 1 0 0 1 0 1 1
4:   4 0 1 1 1 0 0 1 0 1 1
5:   5 1 1 1 1 0 0 0 1 1 0
0
votes

Here is a dplyr solution:

test %>%
  mutate(flag = (A == lag(A) &
                 B == lag(B) &
                 C == lag(C) &
                 D == lag(D))) %>%
  mutate(twice = lead(flag) == TRUE) %>%
  mutate(E = ifelse(flag == TRUE | twice == TRUE, 1, 0)) %>%
  mutate(E = ifelse(is.na(E), 0, 1)) %>%
  mutate(FF = ifelse((A + lag(A)) + (B + lag(B)) + (C + lag(C)) + (D + lag(D)) == 7, 1, 0)) %>%
  mutate(FF = ifelse(is.na(FF) | FF == 0, 0, 1)) %>%
  select(A, B, C, D, E, FF)

Result:

  A B C D E FF
1 1 0 1 1 1  0
2 1 0 1 1 1  0
3 0 1 1 1 1  0
4 0 1 1 1 1  0
5 1 1 1 1 0  1
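Note that the lag()-based flag only catches duplicates that happen to sit on adjacent rows. A sketch of an order-independent alternative for E, assuming dplyr >= 1.0 for across():

```r
library(dplyr)

test <- data.frame(A = c(1, 1, 0, 0, 1),
                   B = c(0, 0, 1, 1, 1),
                   C = c(1, 1, 1, 1, 1),
                   D = c(1, 1, 1, 1, 1))

# Group by every column: any group with more than one row is a set of
# duplicates, regardless of where the rows appear in the data frame
test <- test %>%
  group_by(across(everything())) %>%
  mutate(E = as.integer(n() > 1)) %>%
  ungroup()

test$E  # 1 1 1 1 0
```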