4 votes

I have a data file containing readings sampled at 30-second intervals. The file is organized as:

> head(dframe)
            timestamp    power
1 2015-08-01 00:00:04 584.1379
2 2015-08-01 00:00:34 585.8087
3 2015-08-01 00:01:04 584.9335
4 2015-08-01 00:01:34 584.4366
5 2015-08-01 00:02:04 584.2829

To aggregate the 30-second data into hourly means, I use the following R command:

df = aggregate(list(power = dframe$power), by = list(timestamp = cut(as.POSIXct(dframe$timestamp), "hour")), mean)

This works perfectly, but the real issue is the time it takes on large files (a year of data). Can I somehow reduce the time the conversion takes? In other words, is there a faster alternative in R for aggregating seconds-level data to hourly means?

UPDATE: I tried four different methods for the same problem, as suggested by @akrun and @Joshua. For other Stack Overflow users, here is the usage of each method and the time it took.

library(data.table)
library(dplyr)
library(xts)

dframe <- read.csv(path, header = TRUE, sep = ",")
dframe$timestamp <- as.POSIXct(dframe$timestamp)
xframe <- dframe   # untouched copy for the dplyr run

# using aggregate
system.time(
  df1 <- aggregate(list(power = dframe$power),
                   by = list(timestamp = cut(dframe$timestamp, "hour")), mean)
)

# using data.table
system.time(
  dfx <- setDT(dframe)[, list(power = mean(power)),
                       by = list(timestamp = cut(timestamp, "hour"))]
)

# using dplyr
system.time(
  xframe %>% group_by(timestamp = cut(timestamp, "hour")) %>% summarise(power = mean(power))
)

# using xts
system.time({
  x <- xts(dframe$power, dframe$timestamp)
  h <- period.apply(x, endpoints(x, "hours"), mean)
  h <- data.frame(timestamp = trunc(index(h), "hours"), power = coredata(h))
})

The respective times taken on two different datasets (one month and three months of data) are given below. For the one-month dataset:

Method       user  system elapsed 
Aggregate    0.137   0.005   0.142
data.table   0.031   0.001   0.032 
dplyr        0.035   0.001   0.036  
xts          0.053   0.000   0.053  

For the three-month dataset:

Method       user  system elapsed
Aggregate    0.456   0.019   0.475
data.table   0.099   0.002   0.102
dplyr        0.099   0.004   0.103
xts          0.158   0.004   0.161

Caveat: All approaches except xts change the type of timestamp from POSIXct to factor. That means you have to convert the timestamp column back afterwards, which costs some additional CPU cycles. In short, if you ultimately need a POSIXct timestamp, xts is the best choice; otherwise go for data.table.
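
For example, the extra conversion step would look like this (a minimal sketch; it assumes the dfx result from the update code above and leaves the timezone at the session default):

# cut() returns a factor, so convert the grouped timestamp back to POSIXct
dfx$timestamp <- as.POSIXct(as.character(dfx$timestamp))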

DATASET: The dataset used can be found at link.

Can you try with dplyr: library(dplyr); dframe %>% group_by(timestamp = cut(as.POSIXct(timestamp), 'hour')) %>% summarise(power = mean(power)) – akrun
I'm guessing cut is causing a lot of the slow-down. – Rich Scriven
@akrun Thanks. data.table takes less time than all the other options I have used. Please post your first (data.table) comment as an answer. In the meantime I will add the statistics to the main question. I am still confused about why it gave bad results this morning! – Haroon Rashid

2 Answers

4 votes

You can do this aggregation in less than half the time using tools from the xts package.

library(xts)

# sample data: 2 million observations at 30-second intervals
set.seed(21)
N <- 2e6
dframe <- data.frame(timestamp=seq(Sys.time(), by="30 sec", length.out=N),
                     power=rnorm(N))
# aggregate
system.time(a <- aggregate(list(power=dframe$power),by=list(timestamp=cut(dframe$timestamp,"hour")), mean))
#    user  system elapsed 
#   2.456   0.000   2.457 

# xts
system.time({
  x <- xts(dframe$power, dframe$timestamp)
  h <- period.apply(x, endpoints(x, "hours"), mean)
  h <- data.frame(timestamp=trunc(index(h),'hours'), power=coredata(h))
})
#    user  system elapsed 
#   0.888   0.004   0.893 
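
As a side note (not part of the original answer), xts also provides the apply.hourly() convenience wrapper, which gives the same result as the period.apply()/endpoints() call above:

# equivalent to period.apply(x, endpoints(x, "hours"), mean)
h2 <- apply.hourly(x, mean)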
3 votes

In general, aggregate is slow; data.table can make this much faster. Convert the 'data.frame' to a 'data.table' with setDT(dframe), use cut to create a grouping variable from 'timestamp', and take the mean of 'power'.

library(data.table)
setDT(dframe)[, list(power = mean(power)),
              by = list(timestamp = cut(as.POSIXct(timestamp), 'hour'))]
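
If you prefer the grouped timestamp to remain POSIXct instead of the factor that cut() returns, one possible variant (my own sketch, not from the original answer, assuming 'timestamp' has already been converted to POSIXct as in the question's update) is to truncate each timestamp to its hour:

# group by the hour boundary as POSIXct rather than a factor level
setDT(dframe)[, list(power = mean(power)),
              by = list(timestamp = as.POSIXct(trunc(timestamp, "hours")))]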