15 votes

I am trying to extract interesting statistics for an irregular time series data set, but am coming up short on finding the right tools for the job. Tools for manipulating regularly sampled time series, or index-based series of any type, are easy enough to find, but I'm not having much luck with the problems I'm trying to solve.

First, a reproducible data set:

library(zoo)
set.seed(0)
nSamples    <- 5000
vecDT       <- rexp(nSamples, 3)                                   # exponential inter-arrival times
vecTimes    <- cumsum(c(0,vecDT))                                  # irregular observation times
vecDrift    <- c(0, rnorm(nSamples, mean = 1/nSamples, sd = 0.01))
vecVals     <- cumsum(vecDrift)                                    # random walk with slight drift
vecZ        <- zoo(vecVals, order.by = vecTimes)
rm(vecDT, vecDrift)

Assume the times are in seconds. There are almost 1700 seconds (just shy of 30 minutes) in the vecZ series, and 5001 entries during that time. (NB: I'd try using xts, but xts seems to need date information, and I'd rather not use a particular date when it's not relevant.)

My goals are the following:

  • Identify the indices of the values 3 minutes before and 3 minutes after each point. As the times are continuous, I doubt that any two points are precisely 3 minutes apart. What I'd like to find are the points that are at most 3 minutes prior, and at least 3 minutes after, the given point, i.e. something like the following (in pseudocode):

    backIX(t, vecZ, tDelta)    = min{ix in 1:length(vecZ) : t - time(ix) < tDelta}
    forwardIX(t, vecZ, tDelta) = min{ix in 1:length(vecZ) : time(ix) - t > tDelta}

    So, for 3 minutes, tDelta = 180. If t=2500, then the result of forwardIX() would be 3012 (i.e. time(vecZ)[2500] is 860.1462, and time(vecZ)[3012] is 1040.403, or just over 180 seconds later), and the result of backIX() would be 2020 (corresponding to time 680.7162 seconds).

    Ideally, I would like a function that does not require t, as passing each t in turn requires length(vecZ) separate calls and ignores the fact that sliding windows of time can be calculated more efficiently (see the sketch after this list).

  • Apply a function to all values in a rolling window of time. I've seen rollapply, which takes a fixed window size (i.e. fixed number of indices, but not a fixed window of time). I can solve this the naive way, with a loop (or foreach ;-)) that is calculated per index t, but I wondered if there are some simple functions already implemented, e.g. a function to calculate the mean of all values in a given time frame. Since this can be done efficiently via simple summary statistics that slide over a window, it should be computationally cheaper than a function that accesses all of the data multiple times to calculate each statistic. Some fairly natural functions: mean, min, max, and median.

    Even if the window isn't specified by time, the ability to vary the window size per point would be adequate, and I can obtain those window sizes from the result of the question above. However, that still seems to involve excess calculation, so being able to specify time-based intervals directly seems more efficient (again, see the sketch below).

Are there packages in R that facilitate such manipulations of data in time windows, or am I out of luck and should write my own functions?


Note 1: This question seeks to do something similar, except over disjoint intervals, rather than rolling windows of time, e.g. I could adapt this to do my analysis on every successive 3 minute block, but I don't see a way to adapt this for rolling 3 minute intervals.

Note 2: I've found that switching from a zoo object to a plain numeric vector of the times significantly speeds up the range-finding / window-endpoint identification in the first goal. That's still a naive algorithm, but it's worth mentioning that working with zoo objects may not be optimal for the naive approach.

I think xts is probably the way to go. See ?endpoints, ?to.period, ?period.apply and ?split.xts. Coerce your object to xts like this: x <- .xts(vecVals, vecTimes) – GSee
@GSee Thanks, though it seems to me that those functions split the data into successive, disjoint intervals (as mentioned in the note I appended to the question). If there's a way to make sliding/rolling windows of time, I haven't yet figured out how to make xts do that. – Iterator
You can merge with a zero-width, strictly regular xts object and na.locf to get your data to be strictly regular. Then use rollapply. – GSee
@GSee You've stumped me. :) I don't yet see how that will work, but I'll give it more thought. I see now that rollapply supports width as a list - I just need to figure out how to get that list, I suppose. – Iterator
Hmm. If you want to keep the microsecond timestamps, my method would make your object much, much bigger. I should rethink. – GSee

2 Answers

1 vote

Here's what I was suggesting, but I'm not sure it exactly answers your question.

# Picking up where your code left off
library(xts)
library(TTR)
x <- .xts(vecZ, vecTimes)
# merge with a zero-width, strictly regular one-second index, then carry
# the last observation forward so there is a value for every second
xx <- na.locf(cbind(xts(, seq.POSIXt(from=start(x), to=end(x), by='sec')), x))
# a 180-observation mean on the one-second grid is a ~180-second mean;
# assigning it merges the regular index into x
x$means <- runMean(xx, n=180)
# keep only the rows at the original observation times
out <- x[!is.na(x[, 1]), ]
tail(out)

                                  x     means
1969-12-31 18:28:17.376141 0.2053531 0.1325938
1969-12-31 18:28:17.379140 0.2101565 0.1329065
1969-12-31 18:28:17.619840 0.2139770 0.1332403
1969-12-31 18:28:17.762765 0.2072574 0.1335843
1969-12-31 18:28:17.866473 0.2065790 0.1339608
1969-12-31 18:28:17.924270 0.2114755 0.1344264
1 vote

As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to aggregate in a non-equi join, which can be used to apply a rolling function over a sliding time window of an irregular time series.

For demonstration and verification, a smaller dataset is used.

library(data.table)   # development version 1.11.9 used

# create small dataset
set.seed(0)
nSamples    <- 10
vecDT       <- rexp(nSamples, 3)
vecTimes    <- cumsum(c(0,vecDT))
vecVals     <- 0:nSamples
vec         <- data.table(vecTimes, vecVals)
vec
      vecTimes vecVals
 1: 0.00000000       0
 2: 0.06134553       1
 3: 0.10991444       2
 4: 0.15651286       3
 5: 0.30186907       4
 6: 1.26685858       5
 7: 1.67671260       6
 8: 1.85660688       7
 9: 2.17546271       8
10: 2.22447804       9
11: 2.68805641      10
# define window size in seconds 
win_sec = 0.3

# aggregate in sliding window by a non-equi join
vec[.(t = vecTimes, upper = vecTimes + win_sec, lower = vecTimes - win_sec), 
    on = .(vecTimes < upper, vecTimes > lower), 
    .(t, .N, sliding_mean = mean(vecVals)), by = .EACHI]
     vecTimes     vecTimes          t N sliding_mean
 1: 0.3000000 -0.300000000 0.00000000 4          1.5
 2: 0.3613455 -0.238654473 0.06134553 5          2.0
 3: 0.4099144 -0.190085564 0.10991444 5          2.0
 4: 0.4565129 -0.143487143 0.15651286 5          2.0
 5: 0.6018691  0.001869065 0.30186907 4          2.5
 6: 1.5668586  0.966858578 1.26685858 1          5.0
 7: 1.9767126  1.376712596 1.67671260 2          6.5
 8: 2.1566069  1.556606875 1.85660688 2          6.5
 9: 2.4754627  1.875462707 2.17546271 2          8.5
10: 2.5244780  1.924478037 2.22447804 2          8.5
11: 2.9880564  2.388056413 2.68805641 1         10.0

The first two columns show the upper and lower bounds of the time interval, respectively (both inherit the name vecTimes from the join), t is the original vecTimes, and N denotes the number of data points included in the calculation of the sliding mean.