2
votes

I wonder if there is a way to apply a function to each row of a data.frame such that the column classes are preserved? Let's look at an example to clarify what I mean:

test <- data.frame(startdate = as.Date(c("2010-03-07", "2013-09-13", "2011-11-12")),
                   enddate = as.Date(c("2010-03-23", "2013-12-01", "2012-01-05")),
                   nEvents = c(123, 456, 789))

Suppose I would like to expand the data.frame test by inserting all days between startdate and enddate and distribute the number of events over those days. My first try to do so was this:

eventsPerDay1 <- function(row) {
    n_days <- as.numeric(row$enddate - row$startdate) + 1
    data.frame(date = seq(row$startdate, row$enddate, by = "1 day"),
               nEvents = rmultinom(1, row$nEvents, rep(1/n_days, n_days)))
}

apply(test, 1, eventsPerDay1)

This, however, is not possible because apply calls as.matrix on test and thus it gets converted to a character matrix and all column classes are lost.

I already found two workarounds which you can find below, so my question is more of a philosphical nature.

library(magrittr)
############# Workaround 1
eventsPerDay2 <- function(startdate, enddate, nEvents) {
    n_days <- as.numeric(enddate - startdate) + 1
    data.frame(date = seq(startdate, enddate, by = "1 day"),
               nEvents = rmultinom(1, nEvents, rep(1/n_days, n_days)))
}

mapply(eventsPerDay2, test$startdate, test$enddate, test$nEvents, SIMPLIFY = F) %>%
    do.call(rbind, .)


############# Workaround 2
seq_along(test) %>%
    lapply(function(i) test[i, ]) %>%
    lapply(eventsPerDay1) %>%
    do.call(rbind, .)

My "problem" with the workarounds is the following:

  • Workaround 1: It may not be the best reason, but I simply do not like mapply. It has a different signature than the other *apply functions (as the the order of arguments differs) and I always feel that a for loop would just have been clearer.
  • Workaround 2: While being very flexible, I think it is not clear at first sight what is happening.

So does anyone know a function whose call would look like apply(test, 1, eventsPerDay1) and that will work?

3
If you want to preserve the class, use lapply looping over the sequence of rows and not applyakrun
@akrun thanks for the suggestion, but isn't that exactly what I did in "workaround 2"? If not please elaborate what you mean. Thanks!AEF
yes, you are right about that. I posted a solution using data.table. Please check if that makes it any betterakrun
Workaround 1 is best. apply() is meant to work with matrices (and if you pass in a data.frame, it's converted via as.matrix) and matrices can only have one atomic data table. Do not use apply() with data.frames.MrFlick

3 Answers

2
votes

Another idea:

library(dplyr)
library(tidyr)

test %>%
  mutate(id = row_number()) %>%
  group_by(startdate) %>%
  complete(startdate = seq.Date(startdate, enddate, 1), nesting(id)) %>%
  group_by(id) %>%
  mutate(nEvents = rmultinom(1, first(nEvents), rep(1/n(), n()))) %>%
  select(startdate, nEvents)

Which gives:

#Source: local data frame [152 x 3]
#Groups: id [3]
#
#      id  startdate nEvents
#   <int>     <date>   <int>
#1      1 2010-03-07       6
#2      1 2010-03-08       6
#3      1 2010-03-09       6
#4      1 2010-03-10       7
#5      1 2010-03-11      12
#6      1 2010-03-12       5
#7      1 2010-03-13       8
#8      1 2010-03-14       5
#9      1 2010-03-15       5
#10     1 2010-03-16       9
## ... with 142 more rows
2
votes

We can do this with data.table

library(data.table)
res <- setDT(test)[,n_days := as.numeric(enddate - startdate) + 1 
           ][, .(date = seq(startdate, enddate, by= "1 day"),
          nEvents = c(rmultinom(1, nEvents, rep(1/n_days, n_days)))),
        by =  1:nrow(test)][, nrow := NULL]
str(res)
#Classes ‘data.tableand 'data.frame':  152 obs. of  2 variables:
# $ date   : Date, format: "2010-03-07" "2010-03-08" "2010-03-09" "2010-03-10" ...
# $ nEvents: int  5 9 7 11 6 6 10 7 12 3 ...

The above can be wrapped in a function

eventsPerDay <- function(dat){  
      as.data.table(dat)[, n_days:= as.numeric(enddate - startdate) + 1
       ][, .(date = seq(startdate, enddate, by= "1 day"),
    nEvents = c(rmultinom(1, nEvents, rep(1/n_days, n_days)))) , 1:nrow(dat)
        ][, nrow := NULL][]
  }

eventsPerDay(test)
1
votes

I have asked myself the same question.

I either end up splitting the df into a list (the base way)

xy <- data.frame()
xy.list <- split(xy, 1:nrow(xy))
out <- lapply(xy.list, function(x) ...)
answer <- unlist(out)

or try the hadleyverse dplyr way using rowwise (the blackbox way)

xy %>%
rowwise() %>%
mutate(newcol = function(x) ....)

I agree that their should be a base implementation of apply(xy, 1, function(x)) that doesn't coerce into character, but I imagine the R ancients implemented the matrix conversion for an advanced reason my primitive mind can't understand.