16
votes

If one wants to fill in missing values of a variable based on previous/posterior non NA observation within a group, the data.table command is

setkey(DT,id,date)
DT[, value_filled_in := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]]

which is quite complex. It's a shame since roll is a very fast and powerful option (esp compared with applying a function such as zoo::na.locf within each group)

I can write a convenience function to fill in missing values

   fill_na <-  function(x , by = NULL, roll =TRUE , rollends= if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE)){
    id <- seq_along(x)
    if (is.null(by)){
      DT <- data.table("x" = x, "id" = id, key = "id") 
      return(DT[!is.na(x)][DT[, list(id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])

    } else{
      DT <- data.table("x" = x, "by" = by, "id" = id, key = c("by", "id")) 
      return(DT[!is.na(x)][DT[, list(by, id)], x, roll = roll, rollends = rollends, allow.cartesian = TRUE])
    }
  }

And then write

setkey(DT,id, date)
DT[, value_filled_in := fill_na(value, by = id)]

This is not really satisfying since one would like to write

setkey(DT,id, date)
DT[, value_filled_in := fill_na(value), by = id]

However, this takes a huge amount of time to run. And, for the end-user, it is cumbersome to learn that fill_na should be called with the by option, and should not be used with data.table by. Is there an elegant solution around this?

Some speed test

N <- 2e6
set.seed(1)
DT <- data.table(
         date = sample(10, N, TRUE),
           id = sample(1e5, N, TRUE),   
        value = sample(c(NA,1:5), N, TRUE),
       value2 = sample(c(NA,1:5), N, TRUE)                   
      )
setkey(DT,id,date)
DT<- unique(DT)

system.time(DT[, filled0 := DT[!is.na(value), list(id, date, value)][DT[, list(id, date)], value, roll = TRUE]])
#> user  system elapsed 
#>  0.086   0.006   0.105 
system.time(DT[, filled1 := zoo::na.locf.default(value, na.rm = FALSE), by = id])
#> user  system elapsed 
#> 5.235   0.016   5.274 
# (lower speed and no built in option like roll=integer or roll=nearest, rollend, etc)
system.time(DT[, filled2 := fill_na(value, by = id)])
#>   user  system elapsed 
#>  0.194   0.019   0.221 
system.time(DT[, filled3 := fill_na(value), by = id])
#>    user  system elapsed 
#> 237.256   0.913 238.405 

Why don't I just use na.locf.default ? Even though the speed difference is not really important, the same issue arises for other kinds of data.table commands (those that rely on a merge by the variable in "by") - it's a shame to systematically ignore them in order to get an easier syntax. I also really like all the roll options.

2
How does the na.locf solution compare to this solution in terms of speed?GSee
Is wrapping the entire thing (a la dplyr::mutate) not an option?shadowtalker
It would be helpful if you provided code to create a sample data.table that we could use to check our results and to help with benchmarking.GSee
If you get rid of the :: call in the zoo one, it's about 30% faster for me. i.e. call na.locf.default instead of zoo::na.locf.defaultGSee
@ssdecontrol Because :: is a function and there is overhead associated with that extra function call.GSee

2 Answers

14
votes

Here's a slightly faster and more compact way of doing it (version 1.9.3+):

DT[, filled4 := DT[!is.na(value)][DT, value, roll = T]]
7
votes

There is now a native data.table way of filling missing values (as of 1.12.4).

This question spawned a github issue which was recently closed with the creation of functions nafill and setnafill. You can now use

DT[, value_filled_in := nafill(value, type = "locf")]

It is also possible to fill NA with a constant value or next observation carried back.

One difference to the approach in the question is that these functions currently only work on NA not NaN whereas is.na is TRUE for NaN - this is planned to be fixed in the next release through an extra argument.

I have no involvement with the project but I saw that although the github issue links here, there was no link the other way so I'm answering on behalf of future visitors.

Update: By default NaN is now treated same as NA.