I have a large dataset in which I'm looking to use data.table to identify the first non-missing value per grouping id.
I've currently had a go at doing this by defining a function and applying it to the whole data frame using lapply(). I've also trying using mclapply() but that appears to be even slower.
### Libraries ###
library(microbenchmark)
library(ggplot2)
library(data.table)
### Dummy Data Table ###
dt <- data.table(
id = rep(1:4, each = 4),
var_int = c(rep(NA, 3), 1L, rep(NA, 2), 10L, rep(NA, 2), 100L, rep(NA, 2), 1000L, rep(NA, 3)),
var_dou = c(rep(NA, 2), 1, rep(NA, 2), 1.01, rep(NA, 2), 1.001, rep(NA, 3), rep(NA, 3), 1.0001),
var_cha = c(NA, "a", rep(NA, 2), "b", rep(NA, 6), "c", rep(NA, 2), "d", NA),
var_intmi = c(1L, rep(NA, 14), 4L)
)
dt
## id var_int var_dou var_cha var_intmi
## 1: 1 NA NA <NA> 1
## 2: 1 NA NA a NA
## 3: 1 NA 1.0000 <NA> NA
## 4: 1 1 NA <NA> NA
## 5: 2 NA NA b NA
## 6: 2 NA 1.0100 <NA> NA
## 7: 2 10 NA <NA> NA
## 8: 2 NA NA <NA> NA
## 9: 3 NA 1.0010 <NA> NA
## 10: 3 100 NA <NA> NA
## 11: 3 NA NA <NA> NA
## 12: 3 NA NA c NA
## 13: 4 1000 NA <NA> NA
## 14: 4 NA NA <NA> NA
## 15: 4 NA NA d NA
## 16: 4 NA 1.0001 <NA> 4
### Functions ###
firstnonmiss_1 <- function(x){x[which(complete.cases(x))][1]}
firstnonmiss_2 <- function(x){first(x[complete.cases(x)])}
firstnonmiss_3 <- function(x){x[complete.cases(x)][1]}
### Desired Output ###
dt[, lapply(.SD, firstnonmiss_3), by = id]
## id var_int var_dou var_cha var_intmi
## 1: 1 1 1.0000 a 1
## 2: 2 10 1.0100 b NA
## 3: 3 100 1.0010 c NA
## 4: 4 1000 1.0001 d 4
### Benchmarking ###
t <- microbenchmark(
"which()[1]" = dt[, lapply(.SD, firstnonmiss_1), by = id],
"first()" = dt[, lapply(.SD, firstnonmiss_2), by = id],
"[1]" = dt[, lapply(.SD, firstnonmiss_3), by = id],
times = 1e4
)
t
## Unit: microseconds
## expr min lq mean median uq max neval
## which()[1] 414.438 426.8485 516.7795 437.9710 460.8930 161388.83 10000
## first() 401.574 413.6190 483.2857 424.6860 446.6475 41523.61 10000
## [1] 388.845 401.4700 468.9951 411.3505 432.2035 33320.75 10000
### Plot Outputs ###
units <- attributes(print(t))[["unit"]]
autoplot(t) +
labs(x = "Function", y = paste0("Timings, (", units, ")")) +
scale_x_discrete() +
scale_y_log10() +
geom_violin(fill = "skyblue", alpha = 0.5) +
theme_light() +
theme(axis.text.y = element_text(family = "Monaco", angle = 90, hjust = 0.5))
The benchmark times on the dummy dataset aren't too bad, but when I run the function on a my actual dataset (1,019 columns, 1,506,451 rows, 502,540 group ids) it takes around 11 minutes to complete. Is there a better/faster way to get a collapsed data frame containing the first non-missing observations per group id for each column/variable?


idalways sorted? - CPaksetkey(dt, id)first, but the missing data values aren't always in the same positions for each of the other variables. - Feakster