Trying to fix a de-duplication problem using data.table in R.
Column A is a list of names, some of which appear multiple times. Column B is a list of dates. There are a bunch of other columns that I want to copy over as well (things that happened to Name on Date.)
However I only want to look at the most activity for each person in a new datatable which has 1 entry for each name that corresponds to the most recent date.
The example data
name.last date
1: Adams 2014-10-20
2: Adams 2014-07-07
3: Barnett 2014-11-06
4: Barnett 2014-09-22
5: Bell 2014-10-22
6: Bell 2014-07-29
7: Burns 2014-09-08
8: Burns 2014-09-03
9: Camacho 2014-08-12
10: Camacho 2014-07-08
11: Casillas 2014-10-07
12: Casillas 2014-07-17
13: Chavez 2014-09-23
14: Chavez 2014-09-17
15: Chavira 2014-07-15
16: Chavira 2014-07-07
17: Claren 2014-10-30
18: Claren 2014-10-23
19: Colleary 2014-11-11
20: Colleary 2014-11-07
The answer would return only the first of each name (since here the rows are sorted with the most recent date for each first.) However if I set the dt key setkey(dt,name.last)
in order to use unique()
to remove duplicates, it reorders the table in key order (alphabetical on the names). The use of unique(dt)
then returns the first appearance of each name which is not necessarily the most recent date.
If I set the key over both columns setkeyv(dt,c(name.last,date))
I cannot then remove duplicates using unique()
as all keys are unique.
The problem is similar to the one post here: Collapsing data frame by selecting one row per group . However I cannot assume the data to be selected is first or last unless you can suggest a way to manipulate my data to make it so after setting the key.