3
votes

I'm trying to show duplicates by group using data.table. More specifically I'm trying to find out whether there are multiple observations for a country in a given year. Here's a sample dataset:

# load data.table package, assuming it's installed
library(data.table)

# create dataset
year    <- rep(2010:2012, 3)
country <- c(rep("A", 3), rep("B", 3), rep("C", 3))
value   <- sample(2:200, 9, replace = TRUE)
df <- cbind(year, country, value)
df <- rbind(df, c(2012, "C", 20))
df # show data

# put data frame in data.table format and set year as key
dt <- data.table(df)
setkey(dt, year)

Note how for 2012 the country "C" is two observations, whereas all other country observations are unique by year.

I'd expected that the following code would provide me the name of the duplicate country and year:

dt[duplicated(country), country, by = year]

Instead, it shows all data from 2011 onwards, which implies that the duplicated function is applied to the entire data.table and not by year. How would I go about to extract the single duplicate value of country "C" in 2012 using data.table?

1

1 Answers

6
votes

Just add country to your setkey and then just use duplicated(dt) within dt

setkey(dt, year, country)
dt[duplicated(dt)]
#    year country value
# 1: 2012       C    20

Or if you only want to display the country

dt[duplicated(dt), country]
## [1] "C"

If you don't want to key your data set (if you want to keep the order, for example), you can specify the columns you want to distinguish by in the by argument of duplicated

dt[duplicated(dt, by = c("year", "country")), country]
## [1] "C"

Edit:

Since v1.9.8, all columns are used in by (instead of the key), hence, one will need to explicitly pass by = key(dt) (if the data.table is keyed) as in

dt[duplicated(dt, by = key(dt))]