2
votes

I have a data.table or data.frame that looks like the following:

Name     Person     Date
A        1          1/1/2004
A        2          1/1/2004
A        2          1/3/2004
A        3          1/1/2004
A        3          1/3/2004
A        3          1/9/2004
B        4          1/7/2004
B        5          1/7/2004
B        5          1/10/2004
B        6          1/7/2004
B        6          1/10/2004
B        6          1/17/2004

Here, I am trying to create a new data table that has "NA's" for the dates if they are not the maximum by person. Basically, I am trying to get the data table to look like:

    Name     Person     Date
A        1          1/1/2004
A        2          "NA"
A        2          1/3/2004
A        3          "NA"
A        3          "NA"
A        3          1/9/2004
B        4          1/7/2004
B        5          "NA"
B        5          1/10/2004
B        6          "NA"
B        6          "NA"
B        6          1/17/2004

Basically, the algorithm I thought of was to look at each grouping by person. If there is only one element for the grouping by person, then that one value is the maximum and hence we let it stay there. But, for example, in grouping by person 2, the maximum is 1/3/2004, so we let 1/1/2004 be "NA".

The only way I can think of doing this is to find the index of the data table corresponding to the maximal value by group (Person) and then to create a new vector of all NA's, then replace with the index of the maximal values.

The code would look like:

which.max(data$Date, by=data$Person)

This somehow doesn't work for me, but either way this code looks like it might be time intensive, especially if my data set is 100 million rows. Would there be a fast implementation for large data sets, especially working in the data.table package? Thanks!

3

3 Answers

4
votes

Using data.table:

#dat <- as.data.table(dat)
#dat$Date <- as.Date(dat$Date,format="%m/%d/%Y")
dat[dat[, Date != max(Date) , by=Person][,V1], Date := NA]
dat

 #   Name Person       Date
 #1:    A      1 2004-01-01
 #2:    A      2       <NA>
 #3:    A      2 2004-01-03
 #4:    A      3       <NA>
 #5:    A      3       <NA>
 #6:    A      3 2004-01-09
 #7:    B      4 2004-01-07
 #8:    B      5       <NA>
 #9:    B      5 2004-01-10
#10:    B      6       <NA>
#11:    B      6       <NA>
#12:    B      6 2004-01-17
3
votes

This is a good use case for ave() which allows you to apply a function to the values for each person. If this is your sample data

dd<-structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
    2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
    Person = c(1L, 2L, 2L, 3L, 3L, 3L, 4L, 5L, 5L, 6L, 6L, 6L
    ), Date = structure(c(12418, 12418, 12420, 12418, 12420, 
    12426, 12424, 12424, 12427, 12424, 12427, 12434), class = "Date")), 
    .Names = c("Name", "Person", "Date"), 
    row.names = c(NA, -12L), class = "data.frame")

then use

with(dd, ave(Date, Person,FUN=function(x) {x[x!=max(x)]<-NA; x}))
#  [1] "2004-01-01" NA           "2004-01-03" NA           NA          
#  [6] "2004-01-09" "2004-01-07" NA           "2004-01-10" NA          
# [11] NA           "2004-01-17"
3
votes

Another way using data.table (assuming that you don't have multiple max values per group)

 dat[dat[, order(Date)!=.N, by=Person]$V1, Date:= NA]
 dat
 #   Name Person       Date
 #1:    A      1 2004-01-01
 #2:    A      2       <NA>
 #3:    A      2 2004-01-03
 #4:    A      3       <NA>
 #5:    A      3       <NA>
 #6:    A      3 2004-01-09
 #7:    B      4 2004-01-07
 #8:    B      5       <NA>
 #9:    B      5 2004-01-10
#10:    B      6       <NA>
#11:    B      6       <NA>
#12:    B      6 2004-01-17

If you have multiple max values:

dat[dat[, rank(Date,ties.method="max")!=.N, by=Person]$V1, Date:=NA]

To format the date

dat[dat[, order(Date)!=.N, by=Person]$V1, Date:= NA][,Date:=format(Date, "%m/%d/%Y")]