3
votes

Trying to fix a de-duplication problem using data.table in R.

Column A is a list of names, some of which appear multiple times. Column B is a list of dates. There are a bunch of other columns that I want to copy over as well (things that happened to Name on Date.)

However I only want to look at the most activity for each person in a new datatable which has 1 entry for each name that corresponds to the most recent date.

The example data

    name.last       date
 1:     Adams 2014-10-20
 2:     Adams 2014-07-07
 3:   Barnett 2014-11-06
 4:   Barnett 2014-09-22
 5:      Bell 2014-10-22
 6:      Bell 2014-07-29
 7:     Burns 2014-09-08
 8:     Burns 2014-09-03
 9:   Camacho 2014-08-12
10:   Camacho 2014-07-08
11:  Casillas 2014-10-07
12:  Casillas 2014-07-17
13:    Chavez 2014-09-23
14:    Chavez 2014-09-17
15:   Chavira 2014-07-15
16:   Chavira 2014-07-07
17:    Claren 2014-10-30
18:    Claren 2014-10-23
19:  Colleary 2014-11-11
20:  Colleary 2014-11-07

The answer would return only the first of each name (since here the rows are sorted with the most recent date for each first.) However if I set the dt key setkey(dt,name.last) in order to use unique() to remove duplicates, it reorders the table in key order (alphabetical on the names). The use of unique(dt) then returns the first appearance of each name which is not necessarily the most recent date.

If I set the key over both columns setkeyv(dt,c(name.last,date)) I cannot then remove duplicates using unique() as all keys are unique.

The problem is similar to the one post here: Collapsing data frame by selecting one row per group . However I cannot assume the data to be selected is first or last unless you can suggest a way to manipulate my data to make it so after setting the key.

3

3 Answers

3
votes

There are plenty of ways of doing this without ordering the data table (though ordering is preferred because duplicated is very efficient and you are also avoiding of using by - will get to that).

First of all, you have to make sure that date is of class Date in order to make things easier

dt[, date := as.Date(date)]

First simple method (though not the most efficient)

dt[, max(date), name.last]
#     name.last         V1
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11

Second (proffered) method is similar to yours but is using data.tables setorder (for data.table version >= 1.9.4) and should be the most efficient

setorder(dt, name.last, -date)[!duplicated(name.last)]
#     name.last       date
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11

You could achieve the same using setkey (as you already did) ans specifying from.last = TRUE in duplicated and removing !

setkey(dt, name.last, date)[duplicated(name.last, from.last = TRUE)]

#     name.last       date
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11

Third method is using data.tables unique function (which should be very efficient too)

unique(setorder(dt, name.last, -date), by = "name.last")
#     name.last       date
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11

Last method is using .SD. It is the least efficient, but is useful in some cases when you want to get all the column in return and you can't use functions such a sduplicated

setorder(dt, name.last, -date)[, .SD[1], name.last]
#     name.last       date
#  1:     Adams 2014-10-20
#  2:   Barnett 2014-11-06
#  3:      Bell 2014-10-22
#  4:     Burns 2014-09-08
#  5:   Camacho 2014-08-12
#  6:  Casillas 2014-10-07
#  7:    Chavez 2014-09-23
#  8:   Chavira 2014-07-15
#  9:    Claren 2014-10-30
# 10:  Colleary 2014-11-11
2
votes

If I am understanding your question, I think you can do this more cleanly with the sqldf package, but the downside is that you have to know sql.

install.packages("sqldf")
library("sqldf")
dt <-data.frame(read.table(header = TRUE, text = " name.last       date
1:     Adams 2014-10-20
2:     Adams 2014-07-07
3:   Barnett 2014-11-06
4:   Barnett 2014-09-22
5:      Bell 2014-10-22
6:      Bell 2014-07-29
7:     Burns 2014-09-08
8:     Burns 2014-09-03
9:   Camacho 2014-08-12
10:   Camacho 2014-07-08
11:  Casillas 2014-10-07
12:  Casillas 2014-07-17
13:    Chavez 2014-09-23
14:    Chavez 2014-09-17
15:   Chavira 2014-07-15
16:   Chavira 2014-07-07
17:    Claren 2014-10-30
18:    Claren 2014-10-23
19:  Colleary 2014-11-11
20:  Colleary 2014-11-07")
)
head(dt)
colnames(dt) <- c('names', 'date')
sqldf("select names, min(date), max(date) from dt group by names")

Hopefully this was helpful.

1
votes

In writing this up I figured it out. For posterity....

Order the table by name and date so that you can depend on the date you want being first or last in the group. For example: dt[order(names,-date)].

Then rather than setting a key and using unique(), just a simple:

dt[!duplicated(names)]

Where names is the duplicated column.

Should output the desired table. If there are more elegant / reliable ways to do this I'd be interested in hearing them.