2 votes

I frequently have to average time series data within given time intervals ('events'), basically as has been asked here.

As suggested in the answers, I use an SQL statement on my data in 'long' format. Here is an example:

#create dummy data frame (8917 = length of the 5-minute date sequence below)
set.seed(1)
data <- data.frame(
  date = seq(from = as.POSIXct("2014-01-01 00:00"),
             to = as.POSIXct("2014-01-31 23:00"),
             by = 300),
  A = runif(8917),
  B = runif(8917),
  C = runif(8917),
  D = runif(8917)
)

#convert to long format (gather() comes from tidyr; the pipe from dplyr)
require(dplyr)
require(tidyr)
data <- data %>%
  gather(class, value, A:D)

# create dummy events
events <- data.frame(
  id = c("blue","red","green","yellow"),
  start = as.POSIXct(c("2014-01-03 13:00",
                       "2014-01-12 08:00",
                       "2014-01-18 10:00",
                       "2014-01-27 23:00")),
  stop = as.POSIXct(c("2014-01-03 19:00",
                      "2014-01-13 17:00",
                      "2014-01-20 10:00",
                      "2014-01-28 20:00"))
)


#average value within events, grouped by class
require(sqldf)
results <- sqldf("
     SELECT x.id, y.class, avg(y.value) AS mean 
     FROM events as x, data as y 
     WHERE y.date between x.start and x.stop 
     GROUP BY x.id, y.class
")

which gives the desired output:

       id class      mean
1    blue     A 0.4879129
2    blue     B 0.4945888
3    blue     C 0.5312504
4    blue     D 0.4968260
5   green     A 0.5235671
6   green     B 0.5030602
7   green     C 0.5071219
8   green     D 0.5002010
9     red     A 0.5122966
10    red     B 0.4767966
11    red     C 0.5032387
12    red     D 0.5018389
13 yellow     A 0.4727868
14 yellow     B 0.4626688
15 yellow     C 0.4930207
16 yellow     D 0.5184966

However, as my real data is huge (the long format can reach several million rows), the SQL operation takes quite some time.

Are there more efficient ways to do this operation? I've stumbled across data.table::foverlaps, which performs an 'overlap join', but I don't fully understand whether this is what I need.

If there were an efficient way to add an 'event' column to the data, indicating for each row (date) which event it falls into, then I could do a grouped summary with dplyr instead of the SQL statement. But I don't know how to do this efficiently...
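
A naive version of that idea would be to loop over the events (the name tmp is just for illustration; I don't know whether this scales to my real data):

#tag each row with the event it falls into (NA if none)
tmp <- data
tmp$event <- NA_character_
for (i in seq_len(nrow(events))) {
  in_event <- tmp$date >= events$start[i] & tmp$date <= events$stop[i]
  tmp$event[in_event] <- as.character(events$id[i])
}

#grouped summary with dplyr
tmp %>%
  filter(!is.na(event)) %>%
  group_by(event, class) %>%
  summarise(mean = mean(value))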

Any suggestions from the experts would be appreciated.

Update

As suggested in the comments, I added the creation of an index to my SQL statement. Unfortunately, this did not speed things up for one of my large real-world problems; the calculation still took ~40 mins to run.
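
For reference, this is the indexed variant I tried (taken from G. Grothendieck's comment below):

#indexed variant suggested in the comments -- no speed-up in my case
results <- sqldf(c(
  "CREATE INDEX a1 ON data(class, value)",
  "SELECT x.id, y.class, avg(y.value) AS mean
   FROM events AS x, main.data AS y
   WHERE y.date BETWEEN x.start AND x.stop
   GROUP BY x.id, y.class"
))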

Then I copy-pasted the data.table solution, kindly offered by David, and was impressed to see that it runs in less than 1 second on exactly the same real-world dataset.

I still don't understand how and why it does what it does, but my motivation to spend some time on learning the data.table syntax has certainly increased a lot. Thanks again for that!

Comments

Did you try using data.table::foverlaps? You can check its documentation or search over SO. There are many similar Q/As illustrating how to use it in different situations. – David Arenburg

Have you looked at ?foverlaps? If so, what don't you understand? – Arun

I did look at it, but I haven't tried using it, because I was not sure if it is what I need. To be honest, I am a bit intimidated by the data.table syntax, as I am not very experienced in coding... But if this is the function I need, then I will try to understand it. Thanks. – user3460194

You should probably stick to dplyr / SQL if you're intimidated. Good luck! – Arun

Try adding an index: sqldf(c("CREATE INDEX a1 ON data(class, value)", "SELECT x.id, y.class, avg(y.value) AS mean FROM events as x, main.data as y WHERE y.date between x.start and x.stop GROUP BY x.id, y.class")) – G. Grothendieck

2 Answers

2 votes

Here's a possible data.table::foverlaps solution:

library(data.table)
setDT(data)[, `:=`(start = date, stop = date)]
setkey(setDT(events), start, stop)
foverlaps(data, events, nomatch = 0L)[, .(Mean = mean(value)), keyby = .(id, class)]
#         id class      Mean
#  1:   blue     A 0.4879129
#  2:   blue     B 0.4945888
#  3:   blue     C 0.5312504
#  4:   blue     D 0.4968260
#  5:  green     A 0.5235671
#  6:  green     B 0.5030602
#  7:  green     C 0.5071219
#  8:  green     D 0.5002010
#  9:    red     A 0.5122966
# 10:    red     B 0.4767966
# 11:    red     C 0.5032387
# 12:    red     D 0.5018389
# 13: yellow     A 0.4727868
# 14: yellow     B 0.4626688
# 15: yellow     C 0.4930207
# 16: yellow     D 0.5184966

The logic seems pretty straightforward to me.

  1. Set start and stop columns within data to overlap against.
  2. Key the events data set by the same columns.
  3. Run foverlaps and remove unmatched intervals (nomatch = 0L).
  4. Calculate mean(value) by id and class.
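
If it helps to see what step 3 returns before the aggregation in step 4, you can run the join on its own:

#the overlap join by itself: each row of data matched to its event interval
ov <- foverlaps(data, events, nomatch = 0L)
head(ov)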

-1 votes

You may be best off offloading the task completely to a database. Have a look at the RSQLite package. If your data are really large, store them in a database like SQLite and let the db do the subsetting and grouping; this should improve the speed of your task. I've written a couple of posts that may help: one on writing to SQLite and one that includes a section on reading from SQLite.
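
A minimal sketch of that idea, using the data and events from the question (the file name is illustrative; note that SQLite stores the POSIXct columns as numeric seconds, so the BETWEEN comparison still works):

library(RSQLite)

#write both tables to an on-disk database once
con <- dbConnect(SQLite(), "timeseries.db")
dbWriteTable(con, "data", data)
dbWriteTable(con, "events", events)

#let the database do the interval filtering and grouping
res <- dbGetQuery(con, "
  SELECT x.id, y.class, avg(y.value) AS mean
  FROM events AS x, data AS y
  WHERE y.date BETWEEN x.start AND x.stop
  GROUP BY x.id, y.class")

dbDisconnect(con)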

One reason you may not want to do this is if you are repeating the task on lots of datasets, as the speed improvement in your query will be lost to the time taken to write your data to a database.