Finding the index of the maximum value by group for a data.table in R

Question

I have a data.table or data.frame that looks like the following:

Name     Person     Date
A        1          1/1/2004
A        2          1/1/2004
A        2          1/3/2004
A        3          1/1/2004
A        3          1/3/2004
A        3          1/9/2004
B        4          1/7/2004
B        5          1/7/2004
B        5          1/10/2004
B        6          1/7/2004
B        6          1/10/2004
B        6          1/17/2004

Here, I am trying to create a new data table that has "NA's" for the dates if they are not the maximum by person. Basically, I am trying to get the data table to look like:

    Name     Person     Date
A        1          1/1/2004
A        2          "NA"
A        2          1/3/2004
A        3          "NA"
A        3          "NA"
A        3          1/9/2004
B        4          1/7/2004
B        5          "NA"
B        5          1/10/2004
B        6          "NA"
B        6          "NA"
B        6          1/17/2004

Basically, the algorithm I thought of was to look at each grouping by person. If there is only one element for the grouping by person, then that one value is the maximum and hence we let it stay there. But, for example, in grouping by person 2, the maximum is 1/3/2004, so we let 1/1/2004 be "NA".

The only way I can think of doing this is to find the index of the data table corresponding to the maximal value by group (Person) and then to create a new vector of all NA's, then replace with the index of the maximal values.

The code would look like:

which.max(data$Date, by=data$Person)

This somehow doesn't work for me, but either way this code looks like it might be time intensive, especially if my data set is 100 million rows. Would there be a fast implementation for large data sets, especially working in the data.table package? Thanks!

thelatemail thelatemail · Accepted Answer · 2014-08-15T04:35:59

Using data.table:

#dat <- as.data.table(dat)
#dat$Date <- as.Date(dat$Date,format="%m/%d/%Y")
dat[dat[, Date != max(Date) , by=Person][,V1], Date := NA]
dat

 #   Name Person       Date
 #1:    A      1 2004-01-01
 #2:    A      2       <NA>
 #3:    A      2 2004-01-03
 #4:    A      3       <NA>
 #5:    A      3       <NA>
 #6:    A      3 2004-01-09
 #7:    B      4 2004-01-07
 #8:    B      5       <NA>
 #9:    B      5 2004-01-10
#10:    B      6       <NA>
#11:    B      6       <NA>
#12:    B      6 2004-01-17

Finding the index of the maximum value by group for a data.table in R

3 Answers