1
votes

I have a large xts object and want to subset the seconds in the time column, but only if there is a sequence of minimum 5 consecutive seconds. I have up to 8 data points per second (which shouldn't be counted as 5 consecutive points as they are measured within the same second).

And_sub_xts is my xts object

> str(And_sub_xts)
An ‘xts’ object on 2010-04-09 20:32:56/2010-04-26 06:56:57 containing:
 Data: chr [1:164421, 1:11] "0.255416" "0.168836" "0.212126" "0.229442" "0.238100" "0.212126" "0.168836" ...
- attr(*, "dimnames")=List of 2
 ..$ : NULL
 ..$ : chr [1:11] "CalSurge" "CalSway" "CalHeave" "Stat_Surge" ...
 Indexed by objects of class: [POSIXct,POSIXt] TZ: 
 xts Attributes:  
NULL

and the first 100 values for

abs(diff(.indexsec(And_sub_xts)) are

56 8 23 34 40 40 41 42 25 27 34 35 38 38 40 40 41 56 59 59 19 19 20 20 20 20 22 22 23 23 24 24 24 25 25 26 27 27 27 27 27 28 28 30 30 30 37 38 40 40 41 44 44 46 46 47 48 51 52 54 54 54 54 55 56 59 1 4 4 4 6 6 6 6 7 7 11 12 12 14 14 15 16 16 17 18 18 19 19 21 21 22 22 23 23 25 25 26 26 26

I marked the keeps in bold, so the subset should just consist of these data points.

I just realize that theorethically it could happen that there are some data points distributed like this

2010-04-09 20:32:20
2010-04-09 20:32:20
2010-04-09 20:32:21
2010-04-09 20:32:22
2010-04-09 20:32:22
2010-04-09 20:40:22
2010-04-09 22:52:23
2010-04-10 20:52:24

which wouldn't be 5 consecutive seconds, but you can't account for this with the .indexsec command - maybe anybody knows a way to go around this.

Thanks for your help!

1
Does the index of And_sub_xts contain sub-seconds, or is it just second resolution? - Joshua Ulrich
It's just second resolution. Thanks so much, your code works!!! - Pat

1 Answers

1
votes

Here's one way to do it. x is sample data that contains index values with seconds equal to your first 100 values.

require(xts)
# sample data
s <- c(56, 8, 23, 34, 40, 40, 41, 42, 25, 27, 34, 35, 38, 38, 40, 
40, 41, 56, 59, 59, 19, 19, 20, 20, 20, 20, 22, 22, 23, 23, 24, 
24, 24, 25, 25, 26, 27, 27, 27, 27, 27, 28, 28, 30, 30, 30, 37, 
38, 40, 40, 41, 44, 44, 46, 46, 47, 48, 51, 52, 54, 54, 54, 54, 
55, 56, 59, 1, 4, 4, 4, 6, 6, 6, 6, 7, 7, 11, 12, 12, 14, 14, 
15, 16, 16, 17, 18, 18, 19, 19, 21, 21, 22, 22, 23, 23, 25, 25, 
26, 26, 26)
S <- cumsum(ifelse(c(0, diff(s)) < 0, 1, 0)) * 60 + s
x <- .xts(seq_along(S), S, tzone="UTC")

The basic idea is to aggregate your data to 1-second resolution, so you can use rle (run-length encoding) to find the consecutive 5-second observations. Then find the first and last timestamps of the sets of 5-second observations in your aggregated data, and then find the locations of those timestamps in your original data. Finally, use the locations of the timestamps in your original data to create sets of sequences you can use to subset the consecutive 5-second groups of observations.

# aggregate data to 1-second resolution
oneSec <- period.apply(x, endpoints(x, 'seconds'), identity) 
# find the runs of 5 or more consecutive one-second increments
consec <- rle(diff(.index(oneSec)))
gte5s <- consec$lengths >= 5
# get the location of the first obs of the run in the 1-second data
begLoc <- cumsum(c(1,consec$lengths))[gte5s]
endLoc <- begLoc + consec$lengths[gte5s]
# get the timestamp of the first and last obs from the original data
beg <- lapply(index(oneSec)[begLoc], function(i) first(x[i, which.i=TRUE]))
end <- lapply(index(oneSec)[endLoc], function(i) last(x[i, which.i=TRUE]))
# create index vector between each value in 'beg' and 'end'
loc <- unlist(mapply(seq, beg, end))
# subset original object using index vector
X <- x[loc,]