15
votes

I have a data frame in R where the rows represent events, and one column is the date of the event. The thing the event is happening to is described by an ID column. So for each ID there are multiple entries.

How do I filter the data frame so that I retain only the most recent event for each ID? The IDs are integers and the dates are in the form mm/dd/yyyy.

5

5 Answers

25
votes

You can try

library(dplyr)
df %>% 
  group_by(ID) %>%
  slice(which.max(as.Date(date, '%m/%d/%Y')))

data

df <- data.frame(ID= rep(1:3, each=3), date=c('02/20/1989',
'03/14/2001', '02/25/1990',  '04/20/2002', '02/04/2005', '02/01/2008',
'08/22/2011','08/20/2009', '08/25/2010' ), stringsAsFactors=FALSE)
9
votes

For any solution, you might as well correct your date variable first, as shown by @akrun:

df$date <- as.Date(df$date, '%m/%d/%Y')

Base R

df[
  tapply(1:nrow(df),df$ID,function(ii) ii[which.max(df$date[ii])])
,]

This uses a selection of row numbers to subset the data. You can see the selection by running the middle line (between the []s) on its own.

Data.table

Similar to @rawr's:

require(data.table)
DT <- data.table(df)

unique(DT[order(date)], by="ID", fromLast=TRUE)
# or
unique(DT[order(-date)], by="ID")
4
votes

Or you can order the dates and

df <- data.frame(ID= rep(1:3, each=3), date=c('02/20/1989',
                                              '03/14/2001', '02/25/1990',  '04/20/2002', '02/04/2005', '02/01/2008',
                                              '08/22/2011','08/20/2009', '08/25/2010' ), stringsAsFactors=FALSE)

df$date <- as.Date(df$date, '%m/%d/%Y')

## make sure to order by both `ID` and `date` as Frank mentions in comments
## since the dates may be overlapping among IDs

df <- df[with(df, order(ID, date)), ]

1) select the last

df[cumsum(table(df$ID)), ]

#   ID       date
# 2  1 2001-03-14
# 6  2 2008-02-01
# 7  3 2011-08-22

2) or remove the duplicates

df[!duplicated(df$ID, fromLast = TRUE), ]

#   ID       date
# 2  1 2001-03-14
# 6  2 2008-02-01
# 7  3 2011-08-22

this data brought to you by @akrun

2
votes

It's probably a character flaw but I sometimes resist picking up new packages. The "base R" functions can often do the job. In this case I think the alue of the dplyr package shows through since I stumbled in creating a good solution since the ave function returned a character value for a logical test, which I still don't understand. So I think dplyr is a real gem. And if I could I'd like to insist that any upvotes be preceded by an upvote to akrun's answer. (It's hard to believe this hasn't already been asked and answered on SO.)

Anyway:

> df[ as.logical(
        ave(df$date, df$ID, FUN=function(d) as.Date(d , '%m/%d/%Y') == 
                                             max(as.Date(d, '%m/%d/%Y'))))
      , ]
  ID       date
2  1 03/14/2001
6  2 02/01/2008
7  3 08/22/2011

I thought this should work (fail) :

> df[ ave(df$date, df$ID, FUN=function(d) as.Date(d , '%m/%d/%Y') ==max(as.Date(d, '%m/%d/%Y'))) , ]
     ID date
NA   NA <NA>
NA.1 NA <NA>
NA.2 NA <NA>
NA.3 NA <NA>
NA.4 NA <NA>
NA.5 NA <NA>
NA.6 NA <NA>
NA.7 NA <NA>
NA.8 NA <NA>

Here's another base R solution that worked the first time with no surprises:

> do.call( rbind, by(df, df$ID, function(d) d[ which.max(as.Date(d$date, '%m/%d/%Y')), ] ) )
  ID       date
1  1 03/14/2001
2  2 02/01/2008
3  3 08/22/2011

Here's one inspired by @rawr's notion of taking the last one from an ordered subset:

> do.call( rbind, by(df, df$ID, function(d) tail( d[ order(as.Date(d$date, '%m/%d/%Y')), ] ,1)) )
  ID       date
1  1 03/14/2001
2  2 02/01/2008
3  3 08/22/2011
0
votes

I've never processed any data in R without plyr!

library(plyr)
ddply(df, .(ID), summarize, most_recent = max(as.Date(date, '%m/%d/%Y')))

   ID most_recent
1  1  2001-03-14
2  2  2008-02-01
3  3  2011-08-22