I frequently have to average time series data within given time intervals ('events'), basically as has been asked here.
As suggested in the answers, I use an SQL statement on my data in 'long' format. Here is an example:
#create dummy data frame
set.seed(1)
data <- data.frame(
date = seq(from = as.POSIXct("2014-01-01 00:00"),
to = as.POSIXct("2014-01-31 23:00"),
by = 300),
A = runif(8917),
B = runif(8917),
C = runif(8917),
D = runif(8917)
)
#convert to long format
require(dplyr)
data <- data %>%
gather(class,value,A:D)
# create dummy events
events <- data.frame(
id = c("blue","red","green","yellow"),
start = as.POSIXct(c("2014-01-03 13:00",
"2014-01-12 08:00",
"2014-01-18 10:00",
"2014-01-27 23:00")),
stop = as.POSIXct(c("2014-01-03 19:00",
"2014-01-13 17:00",
"2014-01-20 10:00",
"2014-01-28 20:00"))
)
#average value within events, grouped by class
require(sqldf)
results <- sqldf("
SELECT x.id, y.class, avg(y.value) AS mean
FROM events as x, data as y
WHERE y.date between x.start and x.stop
GROUP BY x.id, y.class
")
which gives the desired output
id class mean
1 blue A 0.4879129
2 blue B 0.4945888
3 blue C 0.5312504
4 blue D 0.4968260
5 green A 0.5235671
6 green B 0.5030602
7 green C 0.5071219
8 green D 0.5002010
9 red A 0.5122966
10 red B 0.4767966
11 red C 0.5032387
12 red D 0.5018389
13 yellow A 0.4727868
14 yellow B 0.4626688
15 yellow C 0.4930207
16 yellow D 0.5184966
However, as my real data is huge (long format can reach several million rows), the SQL operation needs quite some time.
Are there more efficient ways to do this operation? I've stumpled across data.table::foverlaps
, which is called an 'overlap join', but I don't fully understand if this is what I need.
If there was an efficient way to add en 'event' column to the data, indicating for each row (date), which event it falls into, then I could do a grouped summary with dplyr compare to the SQL statement. But I don't know how to do this...
Any suggestions from the experts would be appreciated.
Update
As suggested in the comments, I have added the creation of an index to my SQL statement. Unfortunately this did not speed up things for one of my large real world problems. The calculation still took ~40 mins to run.
Then I copy-pasted the data.table solution, kindly offered by David, and was impressed to see that it runs in less than 1 second on exactly the same real-world dataset.
I still don't understand how and why it does what it does, but my motivation to spend some time on learning the data.table syntax has certainly increased a lot. Thanks again for that!
data.table::foverlaps
? You can check its documentation or search over SO. There are many similar Q/A illustrating how to use it in different situations. – David Arenburg?foverlaps
? If so, what don't you understand? – Arunsqldf(c("CREATE INDEX a1 ON data(class, value)", "SELECT x.id, y.class, avg(y.value) AS mean FROM events as x, main.data as y WHERE y.date between x.start and x.stop GROUP BY x.id, y.class "))
– G. Grothendieck