0
votes

I have a dataframe that looks like this:

    id      date1 value1      date2 value2      date3 value3
1 1113 2012-01-14     29 2012-09-29     22 2013-10-28     21
2 1622 2012-12-05     93 2012-12-05     82 2013-01-22     26
3 1609 2014-08-30     30 2013-04-07     53 2013-03-20    100
4 1624 2014-01-20     84 2013-03-17     92 2014-01-10     81
5 1861 2014-10-08     29 2012-08-19     84 2012-09-21     56
6 1640 2014-03-05     27 2012-02-28      5 2015-01-11     65

I want to create a new column that contains whichever value of the three columns "value1", "value2", and "value3" that is the most recent. I don't need to know which date it was associated with.

    id      date1 value1      date2 value2      date3 value3 value_recent
1 1113 2012-01-14     29 2012-09-29     22 2013-10-28     21           21
2 1622 2012-12-05     93 2012-12-05     82 2013-01-22     26           26
3 1609 2014-08-30     30 2013-04-07     53 2013-03-20    100           30
4 1624 2014-01-20     84 2013-03-17     92 2014-01-10     81           84
5 1861 2014-10-08     29 2012-08-19     84 2012-09-21     56           29
6 1640 2014-03-05     27 2012-02-28      5 2015-01-11     65           65

Code to create working example:

set.seed(1234)
id <- sample(1000:2000, 6, replace=TRUE)
date1 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value1 <- sample(1:100, 6, replace=TRUE)
date2 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value2 <- sample(1:100, 6, replace=TRUE)
date3 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value3 <- sample(1:100, 6, replace=TRUE)

df <- data.frame(id, date1, value1, date2, value2, date3, value3)
4
Use set.seed so the random process can be reporducedPierre L
@PierreLafortune, thanks, I have edited it.epi_n00b

4 Answers

1
votes

Edit: Per @Pierre Lafortune's answer, you can actually collapse this into one statement.

Edit 2: Added in data with NAs, also changed code to handle NAs.

This should do the trick rather nicely. It does require a loop and I would be interested to see if someone could come up with a concise vecotrized solution.

date_cols <- colnames(df)[grep("date",colnames(df))]
df$value_recent<-df[cbind(1:nrow(df),grep("date",colnames(df))[apply(sapply(df[,date_cols],as.numeric),1,which.max)]+1)]
df

  id      date1   value1    date2   value2   date3    value3  value_recent
1 1113       <NA>     29 2012-09-29     22 2013-10-28     21           21
2 1622 2012-12-05     93 2012-12-05     82 2013-01-22     26           26
3 1609       <NA>     30 2013-04-07     53 2013-03-20    100           53
4 1624 2014-01-20     84 2013-03-17     92 2014-01-10     81           84
5 1861 2014-10-08     29 2012-08-19     84 2012-09-21     56           29
6 1640 2014-03-05     27 2012-02-28      5 2015-01-11     65           65

Data:

df<-structure(list(id = c(1113L, 1622L, 1609L, 1624L, 1861L, 1640L
), date1 = structure(c(NA, 15679, NA, 16090, 16351, 16134), class = "Date"), 
    value1 = c(29L, 93L, 30L, 84L, 29L, 27L), date2 = structure(c(15612, 
    15679, 15802, 15781, 15571, 15398), class = "Date"), value2 = c(22L, 
    82L, 53L, 92L, 84L, 5L), date3 = structure(c(16006, 15727, 
    15784, 16080, 15604, 16446), class = "Date"), value3 = c(21L, 
    26L, 100L, 81L, 56L, 65L)), .Names = c("id", "date1", "value1", 
"date2", "value2", "date3", "value3"), row.names = c(NA, -6L), class = "data.frame")
1
votes

I'm using apply to go over the rows looking for the most recent date. Then use that index to find the value that corresponds. We use a matrix subsetting method to keep it concise:

indx <- apply(df[grep("date", names(df))], 1, function(x) which(x == max(x))[1])
df$value_recent <- df[grep("val", names(df))][cbind(1:nrow(df), indx)]
#     id      date1 value1      date2 value2      date3 value3 value_recent
# 1 1113 2012-01-14     29 2012-09-29     22 2013-10-28     21           21
# 2 1622 2012-12-05     93 2012-12-05     82 2013-01-22     26           26
# 3 1609 2014-08-30     30 2013-04-07     53 2013-03-20    100           30
# 4 1624 2014-01-20     84 2013-03-17     92 2014-01-10     81           84
# 5 1861 2014-10-08     29 2012-08-19     84 2012-09-21     56           29
# 6 1640 2014-03-05     27 2012-02-28      5 2015-01-11     65           65

(Note: arranging your data this way will create more trouble than good.)

0
votes

There are probably less verbose ways to do this, but here's one option. First move it to a "long" format, then split it by id, sort, and extract the most recent record and merge that back in with the original data frame.

ld <- reshape(df, 
        idvar = "id", 
        varying = list(paste0("date", 1:3),
                       paste0("value", 1:3)),
        v.names = c("date", "value"),
        direction = "long")

recent <- split(ld, ld$id)
recent <- lapply(recent, function(x) {
    d <- x[order(x$date), ]
    d <- d[nrow(d), c(1, 4)]
    names(d)[2] <- "value_recent"
d
})

recent <- do.call(rbind, recent)
merge(df, recent, by = "id")

#     id      date1 value1      date2 value2      date3 value3 value_recent
# 1 1204 2014-10-25     73 2012-12-22     39 2015-07-18     62           62
# 2 1667 2012-01-16     97 2014-02-28     30 2014-12-31     83           83
# 3 1673 2015-01-16     96 2014-12-16     50 2014-08-05     31           96
# 4 1722 2015-02-07     10 2013-12-25      4 2012-08-18     93           10
# 5 1882 2012-10-20     91 2014-12-28     71 2015-09-03     18           18
# 6 1883 2012-03-30     73 2015-04-26      4 2014-12-23     74            4
0
votes

Here's a similar solution that also starts with reshape but then does the rest in a series of pipes:

library(dplyr)
library(reshape)

df2 <- reshape(df,
               varying = list(names(df)[grep("date", names(df))],
                              names(df)[grep("value", names(df))]),
               v.names = c("date", "value"),
               direction = "long") %>%
  # order data for step to come
  arrange(id, date) %>%
  # next two steps cut down to last (ordered) obs for each id
  group_by(id) %>%
  slice(n()) %>%
  # keep only the columns we need and rename the value column for merging
  select(id, most.recent = value) %>%
  # merge the values back into the original data frame, matching on id
  left_join(df, .)