1
votes

I have an R zoo object. The zoo object (z) is indexed by date and has multiple columns:

  • V1 (aggregate value is the sum of all values in the 'selected' rows)
  • V2 (aggregate value is q1 [the first quartile] of all values in the 'selected' rows)
  • V3 (aggregate value is the minimum of all values in the 'selected' rows)
  • V4 (aggregate value is the first value in the 'selected' rows)
  • V5 (aggregate value is the last value in the 'selected' rows)

I want to aggregate the data in each 'column' differently (i.e. using different functions), but aggregating over the same number of rows.

I want to aggregate using a function that allows me to specify the number of rows over which to aggregate. For example:

my_aggregate <- function(data, agg_rowcount) {
  # aggregate data over [agg_rowcount] rows....
  return (aggregated_data)
}

I initially thought of implementing this function by using the aptly named aggregate() function - but I could not get it to do what I wanted.

A simple example showing the error I was getting with aggregate() is as follows:

> indices <- seq.Date(as.Date('2000-01-01'),as.Date('2000-01-30'),by="day")
> a <- zoo(rnorm(30), order.by=indices)
> b <- zoo(rnorm(30), order.by=indices)
> c <- zoo(rnorm(30), order.by=indices)
> d <- merge(a,b)
> e <- merge(d,c)
> head(e)
                     a          b           c
2000-01-01 -0.07924078  0.6208785 -1.79826472
2000-01-02  1.15956208  1.1867218 -0.02124817
2000-01-03  0.20427523  0.3164863 -0.20153631
2000-01-04  1.21583902 -1.3728278  1.75872854
2000-01-05 -0.32845708  0.3857658 -1.01082787
2000-01-06 -1.95312879 -0.3824591 -1.33220075
>
> aggregate(e,by=e[[1]], nfrequency=8)
Error: length(time(x)) == length(by[[1]]) is not TRUE

So I failed at the very first hurdle. I would appreciate any help in writing a function that lets me aggregate different columns differently, across the same number of rows.

Note: I am only a few days into 'messing around' with R, and aggregate() may not be the right way to solve this problem. I don't want the snippet above to be a red herring: if aggregate() is not the "best" (i.e. recommended R) approach, I would rather not receive answers that only fix the error I got when using it.

The only reasons why I included my attempt above are:

  1. Because I was asked to post a 'reproducible' error
  2. To show that I had tried to solve it myself first, before asking in here.
2
Please provide something reproducible. - G. Grothendieck

2 Answers

3
votes

Suppose we wish to aggregate e by week, w, aggregating column a using sum, b using mean and c using the last value in the week:

w <- as.numeric(format(time(e), "%W"))
e.w <- with(e, cbind(a = aggregate(a, w, sum), 
    b = aggregate(b, w, mean), 
    c = aggregate(c, w, tail, 1)
))
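
If the groups should be a fixed number of rows rather than calendar weeks, the same per-column aggregate() idea can probably be wrapped in a function like the one sketched below. This is only a sketch under some assumptions: my_aggregate takes an extra funs argument (a named list of column name to function) that is not in the question's signature, each block is labelled with the date of its last row, and a trailing incomplete block is simply aggregated over the rows it has.

```r
library(zoo)

# Hypothetical helper (not from the original posts): aggregate a multi-column
# zoo object over consecutive blocks of `agg_rowcount` rows.
# `funs` is an assumed extra argument: a named list mapping column names to
# the function used to aggregate that column.
my_aggregate <- function(data, agg_rowcount, funs) {
  # Block number of every row: 1, 1, ..., 2, 2, ...
  block <- (seq_len(nrow(data)) - 1) %/% agg_rowcount + 1
  # Label each block with the index (date) of its last row
  last_row <- as.integer(tapply(seq_along(block), block, max))
  by <- time(data)[last_row[block]]
  # Aggregate each column with its own function, then bind the results
  cols <- lapply(names(funs), function(nm) aggregate(data[, nm], by, funs[[nm]]))
  names(cols) <- names(funs)
  do.call(cbind, cols)
}

# Small deterministic example: 6 daily rows, aggregated in blocks of 3 rows
z <- zoo(cbind(a = 1:6, b = 7:12, c = c(5, 3, 4, 9, 8, 7)),
         as.Date("2000-01-01") + 0:5)
res <- my_aggregate(z, 3, list(
  a = sum,                                          # sum of each block
  b = function(x) quantile(x, 0.25, names = FALSE), # first quartile
  c = min                                           # minimum
))
```

For the first and last value in each block, function(x) head(x, 1) and function(x) tail(x, 1) can be passed in funs in the same way.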
0
votes

Wouldn't the ddply function in the plyr package help here?

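To aggregate several columns at once, each with its own function (grouping by week here is only an assumption, chosen for illustration):

library(plyr)
# ddply works on data frames, so convert the zoo object and add a grouping column
df <- data.frame(group = as.numeric(format(time(e), "%W")), coredata(e))
agg <- ddply(df, "group", function(d) {
    # one row per group: sum of a, mean of b, last value of c
    data.frame(a = sum(d$a), b = mean(d$b), c = tail(d$c, 1))
})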