I usually work with dplyr but face a rather large data set and my approach is very slow. I basically need to filter a df group it by dates and count the occurrence within
sample data (turned already everything into data.table)
library(data.table)
library(dplyr)
set.seed(123)
df <- data.table(startmonth = seq(as.Date("2014-07-01"),as.Date("2014-11-01"),by="months"),
endmonth = seq(as.Date("2014-08-01"),as.Date("2014-12-01"),by="months")-1)
df2 <- data.table(id = sample(1:10, 5, replace = T),
start = sample(seq(as.Date("2014-07-01"),as.Date("2014-10-01"),by="days"),5),
end = df$startmonth + sample(10:90,5, replace = T)
)
#cross joining
res <- setkey(df2[,c(k=1,.SD)],k)[df[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
My dplyr approach works but is slow
res %>% filter(start <=endmonth & end>= startmonth) %>%
group_by(startmonth,endmonth) %>%
summarise(countmonth=n())
My data.table knowledge is limited but I guess we would setkeys() on the date columns and something like res[ , :=( COUNT = .N , IDX = 1:.N ) , by = startmonth, endmonth] to get the counts by group but I'm not sure how the filter goes in there.
Appreciate your help!
res[start <= endmonth & end >= startmonth, .N, by = .(startmonth, endmonth)]- markus