3 votes

I would like to remove duplicate rows based on more than one column using dplyr / tidyverse.

Example

library(dplyr)

df <- data.frame(a=c(1,1,1,2,2,2), b=c(1,2,1,2,1,2), stringsAsFactors = F)

I thought this would return rows 3 and 6, but it returns 0 rows.

df %>% filter(duplicated(a, b))
# [1] a b
# <0 rows> (or 0-length row.names)

Conversely, I thought this would return rows 1, 2, 4, and 5, but it returns all rows.

df %>% filter(!duplicated(a, b))
#   a b
# 1 1 1
# 2 1 2
# 3 1 1
# 4 2 2
# 5 2 1
# 6 2 2

What am I missing?


2 Answers

7 votes

duplicated expects to operate on "a vector or a data frame or an array", not on two separate vectors; it only looks for duplication in its first argument. In duplicated(a, b), the b is silently matched to the second parameter, incomparables, so every value of a is treated as incomparable and nothing is ever flagged as a duplicate, which is why you get 0 rows (and why the negated version returns every row).
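
A quick way to see this on the question's data (a small sketch, using the df defined above):

duplicated(df$a, df$b)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
# b is taken as incomparables, so nothing is marked as duplicated

duplicated(df$a)
# [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE
# duplication within a alone, ignoring b

Passing the whole data frame (or a matrix built from the columns of interest) does what you want: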

df %>%
  filter(duplicated(.))
#   a b
# 1 1 1
# 2 2 2

df %>%
  filter(!duplicated(.))
#   a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1

If you prefer to reference a specific subset of columns, wrap them in cbind:

df %>%
  filter(duplicated(cbind(a, b)))
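# since a and b are all of the columns here, this should pick out the same
# rows as duplicated(.) above:
#   a b
# 1 1 1
# 2 2 2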

As a side note, the dedicated dplyr verb for this is distinct:

df %>%
  distinct(a, b, .keep_all = TRUE)
#   a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1

though as far as I know it does not have a built-in inverse.
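
If you do want an inverse (every row whose a/b combination occurs more than once), one sketch that stays with duplicated() is:

df %>%
  filter(duplicated(.) | duplicated(., fromLast = TRUE))
#   a b
# 1 1 1
# 2 1 1
# 3 2 2
# 4 2 2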

2 votes

Using unique on the data frame gives the unique rows; it keeps the first occurrence of each duplicate, which is why the original row names 1, 2, 4 and 5 are retained.

unique(df)

#  a b
#1 1 1
#2 1 2
#4 2 2
#5 2 1

Alternatively, a tidyverse approach is to select the first row of each group.

library(dplyr)

df %>% group_by(a, b) %>% slice(1L)
# If a and b are the only 2 columns:
# df %>% group_by_all() %>% slice(1L)

#      a     b
#  <dbl> <dbl>
#1     1     1
#2     1     2
#3     2     1
#4     2     2
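
Note that slice(1L) keeps the grouping, so if you want a plain ungrouped result afterwards (an assumption about what you need), add ungroup():

df %>% group_by(a, b) %>% slice(1L) %>% ungroup()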

The inverse (keeping only the combinations that occur more than once) would be:

df %>% group_by(a, b) %>% filter(n() > 1) %>% distinct()
# If a and b are the only 2 columns:
# df %>% group_by_all() %>% filter(n() > 1) %>% distinct()

#     a     b
#  <dbl> <dbl>
#1     1     1
#2     2     2