
For instance, let's say you have ~10 years of daily 1-minute data for the volume of instrument x, as follows (in xts format), from 9:30am to 4:30pm:

    Date.Time               Volume        
    2001-01-01 09:30:00     1200
    2001-01-01 09:31:00     1110
    2001-01-01 09:32:00     1303

All the way through to:

    2010-12-20 16:28:00     3200
    2010-12-20 16:29:00     4210
    2010-12-20 16:30:00     8303

I would like to:

  • Get the average volume at each minute of the day for the entire series (i.e. the average volume over all 10 years at 9:30, 9:31, 9:32, ..., 16:28, 16:29, 16:30)

How should I best go about:

  • Aggregating the data into one minute buckets
  • Getting the average of those buckets
  • Reconstituting those "average" buckets back to a single xts/zoo time series?

I've had a good poke around with the aggregate, sapply, and period.apply functions, etc., but just cannot seem to "bin" the data correctly.

It's easy enough to solve this with a loop, but that is very slow. I'd prefer to avoid an explicit loop and instead use a vectorized function that takes advantage of compiled code (i.e. an xts-based solution).

Can anyone offer some advice / a solution?

Thanks so much in advance.


2 Answers


First let's create some test data:

library(xts) # also pulls in zoo
library(timeDate)
library(chron) # includes times class

# test data
x <- xts(1:3, timeDate(c("2001-01-01 09:30:00", "2001-01-01 09:31:00", 
    "2001-01-02 09:30:00")))

1) aggregate.zoo. Now try converting the index to the times class (from chron) and aggregating using this one-liner:

aggregate(as.zoo(x), times(format(time(x), "%H:%M:%S")), mean)
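
On the test data above this averages the two 09:30:00 observations (values 1 and 3) and leaves the lone 09:31:00 value as-is, so it should print something like:

09:30:00 09:31:00 
       2        2 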

1a) aggregate.zoo (variation). Or try this variation, which converts the shorter aggregated series to times, to avoid having to do it on the longer original series:

ag <- aggregate(as.zoo(x), format(time(x), "%H:%M:%S"), mean)
zoo(coredata(ag), times(time(ag)))

2) tapply. An alternative would be tapply, which is likely faster:

ta <- tapply(coredata(x), format(time(x), "%H:%M:%S"), mean)
zoo(unname(ta), times(names(ta)))
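
The question asked for the result as an xts series; since xts insists on a date-time index, one option (the same dummy-date trick used in the other answer) is to attach an arbitrary date to the time-of-day labels. A sketch building on the tapply result above, where the date 2001-01-01 is just a placeholder:

# reconstitute as xts: pair each time-of-day label with a placeholder date
xts(as.numeric(ta), as.POSIXct(paste("2001-01-01", names(ta))))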

EDIT: simplified (1) and added (1a) and (2)


Here is a solution with ddply, but you can probably also use sqldf, tapply, aggregate, by, etc.

# Sample data: 10 trading hours a day for 2500 trading days
minutes <- 10 * 60
days <- 250 * 10
d <- seq.POSIXt( 
  ISOdatetime( 2011,01,01,09,00,00, "UTC" ), 
  by="1 min", length.out=minutes 
)
# shift the one-day template across all days, then flatten and sort
d <- outer( d, (1:days) * 24*3600, `+` )
d <- sort(d)
library(xts)
d <- xts( round(100*rlnorm(length(d))), d )
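
A quick sanity check on the size of the simulated series; the expected length, 600 × 2500 = 1,500,000 points, follows from the parameters above:

# sanity check: 600 minutes x 2500 days = 1.5 million observations
length(d)            # should be 1500000
head(index(d), 3)    # first three timestamps, one minute apart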

# Aggregate: mean value for each minute of the day
library(plyr)
d <- data.frame( 
  minute=format(index(d), "%H:%M"), 
  value=coredata(d) 
)
d <- ddply( 
  d, "minute", 
  summarize, 
  value=mean(value, na.rm=TRUE) 
)

# Convert to zoo or xts
zoo(x=d$value, order.by=d$minute) # the index of a zoo series does not have to be a date or time
xts(x=d$value, order.by=as.POSIXct(sprintf("2012-01-01 %s:00", d$minute), format="%Y-%m-%d %H:%M:%S"))
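
As mentioned at the top, base R's aggregate can also replace the ddply step with no extra package. A minimal sketch, assuming the minute/value data.frame built above is kept under a separate name df (since d gets overwritten here):

# base-R alternative to the ddply step; df is the minute/value
# data.frame from above, saved before d is overwritten
agg <- aggregate(value ~ minute, data=df, FUN=mean, na.rm=TRUE)
zoo(x=agg$value, order.by=agg$minute)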